WO2020232613A1 - Video processing method, system, mobile terminal, server, and storage medium - Google Patents
Video processing method, system, mobile terminal, server, and storage medium
- Publication number
- WO2020232613A1 (PCT/CN2019/087662)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image frame
- network
- encoded image
- layer
- encoded
- Prior art date
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
Definitions
- This application relates to the field of image processing, in particular to a video processing method, system, mobile terminal, server and storage medium.
- Digital image compression coding is a very important technology, which is of great significance to the transmission and storage of digital images.
- Traditional image coding algorithms are based on pixel values: whether transform coding, predictive coding, or another scheme, compression is performed on pixel values. Although compression ratios keep improving, pixel-value coding can hardly compress an image or video to a minimal size; moreover, security issues cannot be ignored.
- Traditional image coding algorithms require additional security mechanisms to guarantee the safe transmission of coded images.
- The main problem this application addresses is to provide a video processing method, system, mobile terminal, server, and storage medium that can decode floating-point data into images, achieve secure image transmission, and enrich the decoded images.
- To that end, one technical solution adopted in this application is a video processing method applied to a client. The method includes: receiving a first encoded image frame sent by a server; determining whether an image enrichment instruction is received; and, if the image enrichment instruction is received, adding random noise to the first encoded image frame to generate a second encoded image frame. The first encoded image frame is floating-point data, and the difference between the first and second encoded image frames is within a preset range.
- Another technical solution adopted in this application is a video processing method applied to a server. The method includes: receiving an input image; and encoding the input image with a neural-network-based coding network to obtain a first encoded image frame. The first encoded image frame is floating-point data; the coding network includes at least an input layer, each input layer includes at least two sub-input layers, and each sub-input layer receives the data of at least one channel of the input image.
- Another technical solution adopted in this application is a mobile terminal including a memory and a processor connected to each other, where the memory stores a computer program that, when executed by the processor, implements the above video processing method.
- Another technical solution adopted in this application is a server including a memory and a processor connected to each other, where the memory stores a computer program that, when executed by the processor, implements the above video processing method.
- Another technical solution adopted in this application is a video processing system including a server and a mobile terminal connected to each other. The server encodes input images to obtain encoded image frames, and the mobile terminal decodes the encoded image frames to obtain decoded image frames, where the mobile terminal is the above-mentioned mobile terminal and the server is the above-mentioned server.
- Another technical solution adopted in this application is a computer storage medium for storing a computer program that, when executed by a processor, implements the above video processing method.
- The beneficial effects of the present application are as follows. The client receives the first encoded image frame sent by the server, where the first encoded image frame is floating-point data; the client determines whether an image enrichment instruction is received and, if so, adds random noise to the first encoded image frame to generate a second encoded image frame whose difference from the first encoded image frame is within a preset range. Floating-point data can thus be decoded into an image, and because the floating-point data is encoded semantically, it cannot be decoded if intercepted by a third party.
- Secure image transmission is therefore realized, and the decoded images can be enriched, so that each time a user watches the video they see a slightly different picture for the same frame, which improves the freshness of viewing.
- FIG. 1 is a schematic flowchart of a first embodiment of a video processing method provided by this application;
- FIG. 2 is a schematic flowchart of a second embodiment of a video processing method provided by this application;
- FIG. 3 is a schematic flowchart of a third embodiment of a video processing method provided by this application;
- FIG. 4 is a schematic flowchart of a fourth embodiment of a video processing method provided by this application;
- FIG. 5 is a schematic diagram of a structure of the codec network provided by this application;
- FIG. 6 is a schematic flowchart of generating a first encoded image frame in the encoding network corresponding to FIG. 5;
- FIG. 7 is a schematic flowchart of generating decoded image frames in the decoding network corresponding to FIG. 5;
- FIG. 8 is a schematic diagram of another structure of the codec network provided by this application;
- FIG. 9 is a schematic flowchart of generating a first encoded image frame in the encoding network corresponding to FIG. 8;
- FIG. 10 is a schematic flowchart of generating decoded image frames in the decoding network corresponding to FIG. 8;
- FIG. 11 is a schematic structural diagram of an embodiment of a mobile terminal provided by this application;
- FIG. 12 is a schematic structural diagram of an embodiment of a server provided by this application;
- FIG. 13 is a schematic structural diagram of an embodiment of a video processing system provided by this application;
- FIG. 14 is a schematic structural diagram of an embodiment of a computer storage medium provided by this application.
- Fig. 1 is a schematic flowchart of a first embodiment of a video processing method provided by the present application.
- The video processing method is applied to a client and includes the following steps.
- Step 11 Receive the first encoded image frame sent by the server.
- The first encoded image frame is floating-point data.
- The floating-point data is obtained after the server encodes the input image.
- The encoding is semantic (image-content-based): the semantics of the input image are extracted and encoded to obtain the first encoded image frame. Since the first encoded image frame is not produced by a pixel-value-based coding algorithm, even if it is intercepted by a third party, the third party cannot decode it without the corresponding decoding network, which ensures the security of image transmission.
- Step 12 Determine whether an image enrichment instruction is received.
- After receiving the first encoded image frame, the client can determine whether it has received an image enrichment instruction input by the user or an image enrichment instruction set by default.
- The image enrichment instruction instructs the client to process the first encoded image frame so that, compared with the input image, the decoded image has some additional image details or some changed details.
- Step 13 If the image enrichment instruction is received, add random noise to the first encoded image frame to generate a second encoded image frame.
- The random noise is also floating-point data, with the same data length as the first encoded image frame. The client has two modes, no noise and random noise; the user can choose one of the two modes, or random noise is added by default.
- The difference between the first encoded image frame and the second encoded image frame is kept within the preset range to ensure that, when the two frames are decoded separately, the difference between the two resulting images stays within an allowable range: the content of the two images is roughly the same and only some details may differ, which avoids a large content difference between the decoded image and the original image.
- For example, suppose the input image shows grass and a child. After the random noise is superimposed on the first encoded image frame and the result is decoded, the decoded image still shows the grass and the child, but the child has an extra hairpin.
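- For illustration, a minimal NumPy sketch of this noise-injection step. It assumes the encoded frame is a 64-element floating-point vector (matching the 64-value compression example given later in this description) and uses a hypothetical noise amplitude; the clipping bound stands in for the preset range, which the application does not fix numerically:

```python
import numpy as np

def enrich_encoded_frame(first_frame, noise_scale=0.05, preset_range=0.1, rng=None):
    """Add bounded random noise to a floating-point encoded frame.

    The noise has the same length as the encoded frame, and the element-wise
    difference between the two frames is clipped to stay within preset_range.
    """
    rng = rng or np.random.default_rng()
    noise = rng.normal(0.0, noise_scale, size=first_frame.shape)
    noise = np.clip(noise, -preset_range, preset_range)  # enforce the preset range
    return first_frame + noise

first_encoded = np.random.default_rng(0).standard_normal(64).astype(np.float32)
second_encoded = enrich_encoded_frame(first_encoded)
assert np.abs(second_encoded - first_encoded).max() <= 0.1
```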
- This embodiment provides a video processing method.
- The client receives the first encoded image frame sent by the server and, after receiving an image enrichment instruction, processes the first encoded image frame to change some detailed features of the image.
- Floating-point data can thus be decoded into an image; because the floating-point data is encoded semantically, it cannot be decoded even if intercepted by a third party, which realizes secure image transmission. The decoded images can also be enriched, so that each time a user watches the video they see a different picture for the same frame, improving the freshness of viewing.
- FIG. 2 is a schematic flowchart of a second embodiment of a video processing method provided by the present application.
- The video processing method is applied to a client and includes the following steps.
- Step 201 Send a download request message to the server at a preset time interval or every preset number of frames.
- The client can send a download request message to the server to request that certain encoded image frames of the video be delivered to the client.
- The client can issue such requests every preset number of frames or at a preset time interval. In particular, the client needs to request from the server the encoded image frame corresponding to the first frame of the video, so that the next at least one image frame can be generated from it and the video plays smoothly.
- Step 202 Receive the first encoded image frame sent by the server.
- Step 203 Determine whether an image enrichment instruction is received.
- Steps 202-203 are similar to steps 11-12 in the foregoing embodiment and are not repeated here.
- Step 204 Use the scene change detection network to determine whether a scene change occurs.
- The scene change detection network is a convolutional neural network used to detect whether a scene change occurs. It can use three-dimensional or two-dimensional convolutions and is trained on a training set of manually labeled images.
- Its output layer is a single neuron that directly corresponds to whether a scene change occurs.
- Step 205 If a scene change occurs, generate new random noise, and add the new random noise to the first encoded image frame to generate a second encoded image frame.
- Step 206 If there is no scene change, continue to add current random noise to the first encoded image frame to generate a second encoded image frame.
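- A hypothetical Python sketch of steps 205-206, with the scene change detection network's decision passed in as a boolean; the noise length and scale are illustrative assumptions:

```python
import numpy as np

class NoiseManager:
    """Keep one noise vector per scene; regenerate it when the scene changes."""

    def __init__(self, length=64, scale=0.05):
        self.rng = np.random.default_rng()
        self.length, self.scale = length, scale
        self.current_noise = self._new_noise()

    def _new_noise(self):
        return self.rng.normal(0.0, self.scale, self.length)

    def second_frame(self, first_frame, scene_changed):
        if scene_changed:                # step 205: fresh noise on a scene change
            self.current_noise = self._new_noise()
        return first_frame + self.current_noise  # step 206: otherwise reuse it
```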
- Scene change detection is performed from the first frame of each video (i.e., the 0th frame) onward.
- The added random noise can change the costumes of characters in the video, details of the background scenery, environmental decorations, or the color style, but it does not affect the main plot.
- When a user replays the same TV series or movie, they can therefore see different details each time, which keeps the content fresh.
- Step 207 Use a neural network-based decoding network to decode the second encoded image frame to obtain a decoded image frame.
- After receiving the second encoded image frame, in order to restore the floating-point data to image data, the client uses a neural-network-based decoding network to decode the second encoded image frame.
- Step 208 Use the image degradation removal network to process the decoded image frame to obtain the first image frame.
- The input image may appear blurred after the encoding and decoding process, and the image degradation removal network can remove the blur and noise contained in the generated decoded image frame.
- The client can obtain multiple arbitrary images as original images, then apply Gaussian blur or noise to the original images to generate corresponding training images and build a training set. An image deblurring network or an image super-resolution network is then trained on the training images, using a loss function to measure the loss between each original image and the output of the image degradation removal network, and minimizing that loss until the required model is obtained.
- A test set can also be built to verify that the trained image degradation removal network model removes image degradation effectively.
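- An illustrative PyTorch sketch of this training setup, using additive Gaussian noise as the degradation, a toy CNN in place of the image degradation removal network, and MSE as the loss; all layer sizes and hyperparameters are assumptions rather than values from this application:

```python
import torch
import torch.nn as nn

def make_training_pair(original, sigma=0.1):
    """Degrade an original image with additive Gaussian noise to build a
    (degraded, original) training pair."""
    degraded = original + sigma * torch.randn_like(original)
    return degraded.clamp(0.0, 1.0), original

# A tiny restoration CNN standing in for the image degradation removal network.
restore_net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)
optimizer = torch.optim.Adam(restore_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # loss between the original and the network output

for _ in range(5):  # a few steps on synthetic data, for illustration only
    original = torch.rand(4, 3, 64, 64)       # stand-in "arbitrary images"
    degraded, target = make_training_pair(original)
    loss = loss_fn(restore_net(degraded), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```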
- Step 209 Use a motion estimation network to estimate the first image frame to generate at least one second image frame.
- The motion estimation network is a generative adversarial network (GAN).
- The generative adversarial network includes a generation network and a discrimination network.
- The generation network includes a two-dimensional convolutional layer and a three-dimensional deconvolutional layer: the two-dimensional convolutional layer extracts feature information from the first image frame, and the three-dimensional deconvolutional layer receives that feature information and generates at least one second image frame.
- The discrimination network includes a three-dimensional convolutional layer and a fully connected layer, which determine whether the generated second image frame meets preset requirements.
- An image that meets the preset requirements may be one with relatively high similarity to the image frames that follow the first image frame in the video.
- If the number of second image frames is defined as δ and the frame the client currently requests from the server is the i-th frame (i is a positive integer), then the next request can ask the server for the (i+δ+1)-th frame.
- The value of δ can be, for example, 5; when δ is 0, the client must request every frame of the video from the server. Using the motion estimation network reduces the amount of transmitted information and further increases the security of information transmission.
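- A minimal sketch of this request schedule, assuming frames are numbered from 0:

```python
def frames_to_request(total_frames, delta):
    """Yield the frame numbers the client fetches from the server.

    After receiving frame i, the motion estimation network synthesizes the
    next delta frames, so the next request is for frame i + delta + 1.
    """
    i = 0  # the first (0th) frame is always requested
    while i < total_frames:
        yield i
        i += delta + 1

print(list(frames_to_request(20, delta=5)))  # [0, 6, 12, 18]
```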
- Note that the operations of the server and the client are not performed at the same time.
- The server encodes all frames of all video resources in advance and stores each encoding result with its corresponding frame number.
- When a client request arrives, the server sends the encoded image frames for the frame numbers the client requires, and the client does not request every image frame: it can use the motion estimation network to generate the next few images after the current frame, so it only needs to fetch one frame from the server every few frames.
- Step 210 Send the first image frame and the second image frame to the video player for playback.
- After the client uses the first image frame to generate at least one second image frame, it can send the first image frame and the second image frame to the video player in order to play the video.
- This embodiment provides a video processing method.
- The client receives the first encoded image frame sent by the server and determines, by detecting whether the scene changes, whether the random noise added to the first encoded image frame should change; it then generates the second encoded image frame and decodes it with the decoding network to obtain the decoded image frame.
- The decoded image frame can be de-degraded to obtain the first image frame, and the motion estimation network then generates at least one second image frame from the first image frame. This avoids the client having to request every frame of the video from the server, reducing the number of data transmissions and further increasing security, while the decoded image can be both enriched and de-degraded to improve image quality.
- FIG. 3 is a schematic flowchart of a third embodiment of a video processing method provided by the present application.
- The video processing method is applied to a server and includes the following steps.
- Step 31 Receive the input image.
- The input image can be a color image, and its color format can be RGB or YCrCb, where Y, Cr, and Cb are the luminance, red-difference, and blue-difference components, respectively.
- Step 32 Use a neural network-based coding network to perform coding processing on the input image to obtain a first coded image frame.
- The first encoded image frame is floating-point data, and the floating-point data is unrelated to pixel values.
- The floating-point data can be regarded as a "style" of the image, while the real image content is learned into the layer parameters of the network as a distribution function, so a higher compression rate can be achieved. Specifically, a 1920*1080 image can be compressed into 64 floating-point numbers, which greatly improves the compression rate and reduces the bandwidth required for video transmission.
- The neural-network-based coding network includes at least an input layer; the number of input layers can be more than one so that multiple input images can be processed at the same time when training the coding network model, and each input layer includes at least two sub-input layers.
- Each sub-input layer receives the data of at least one channel of the input image. For example, for an input image in YCrCb format, one sub-input layer can receive the data of the Y channel and another sub-input layer can receive the data of the Cr and Cb channels.
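- For example, a minimal PyTorch sketch of this channel split, assuming a (channels, height, width) tensor layout:

```python
import torch

def split_channels(ycrcb_image):
    """Split a YCrCb tensor (3, H, W) into the two sub-input tensors:
    Y (1, H, W) for one sub-input layer, CrCb (2, H, W) for the other."""
    y = ycrcb_image[0:1]     # luminance channel
    crcb = ycrcb_image[1:3]  # red-difference and blue-difference channels
    return y, crcb

y, crcb = split_channels(torch.rand(3, 1080, 1920))
print(y.shape, crcb.shape)  # torch.Size([1, 1080, 1920]) torch.Size([2, 1080, 1920])
```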
- This embodiment provides a video processing method.
- The server receives the input image and encodes it with the encoding network to obtain the first encoded image frame, so a digital image can be encoded as floating-point data. Because the floating-point data is encoded semantically, it cannot be decoded even if intercepted by a third party, which realizes secure image transmission.
- FIG. 4 is a schematic flowchart of a fourth embodiment of a video processing method provided by the present application.
- The video processing method is applied to a server and includes the following steps.
- Step 41 Receive the input image.
- Step 42 Use a neural network-based coding network to perform coding processing on the input image to obtain a first coded image frame.
- The neural-network-based coding network includes at least an input layer, at least one convolutional hidden layer, an encoding fully connected hidden layer, and an encoding fully connected output layer; each input layer includes at least two sub-input layers, which receive the data of at least one channel of the input image.
- The server encodes multiple video resources with the encoding network and stores each encoding result with its frame number, so that when a client initiates a request the encoding result corresponding to the requested frame number can be found quickly.
- Step 43 Perform decoding processing on the first encoded image frame to obtain a decoded image frame.
- After the server encodes the input image to obtain the first encoded image frame, the first encoded image frame may be decoded to obtain the decoded image frame.
- Step 44 After receiving the video viewing request sent by the client, the neural network-based decoding network is sent to the client.
- The neural-network-based decoding network includes a decoding fully connected hidden layer, at least one deconvolution hidden layer, and an output layer.
- The server can train a neural-network-based decoding network on multiple first encoded image frames output by the encoding network and, when a client initiates a request, send the decoding network directly to the client. After the client sends a download request message for the first encoded image frame, it can decode the frame directly with the decoding network sent by the server to obtain the decoded image frame.
- Training the decoding network on the server suits special videos. Training for every special video on the client would occupy too many client resources, and users may rarely use a given decoding network, wasting those resources; the network is therefore trained on the server, which sends the decoding network to the client only when the client requests it, reducing the client's burden. For example, animation has a distribution function completely different from live-action drama, so the universal codec network for animation cannot be the same as the one for live-action dramas, and a universal codec network should be trained separately for animation.
- A neural-network-based coding and decoding network is shown in FIG. 5.
- This network is a variational autoencoder, and the YCrCb color space is used during training.
- Both the coding network and the decoding network have two branches.
- The input layer includes a first sub-input layer and a second sub-input layer.
- The steps for the server to obtain the first encoded image frame are shown in FIG. 6:
- Step 61 Use the first sub-input layer to receive the data of the first channel in the input image.
- When the color format of the input image is luminance-red difference-blue difference (YCrCb), the first channel is the luminance channel Y, and the second channel comprises the red-difference and blue-difference channels CrCb.
- Step 62 Perform down-sampling processing on the data of the second channel in the input image, and input the down-sampled data into the second sub-input layer.
- The image data of the red-difference and blue-difference channels CrCb in the input image is down-sampled by a factor of N, where N is a positive integer.
- Step 63 Use the convolutional hidden layers to perform convolution, activation, pooling, batch normalization, or dropout regularization on the output data of the first sub-input layer and the second sub-input layer, respectively, to obtain first encoded image data and second encoded image data.
- Each convolutional hidden layer can apply five operations: convolution, activation, pooling, batch normalization, or dropout regularization; the pooling and dropout operations are optional.
- The number of convolutional hidden layers and the number of convolution kernels per layer differ between the two branches. Unlike a Siamese network, the two branches of the coding network do not share weights, and the branch for the luminance channel Y has more convolutional hidden layers.
- The data output by the first and second sub-input layers are processed separately until the data generated on the two branches reach the same resolution, at which point the per-branch operations stop; that is, the first encoded image data and the second encoded image data have the same resolution.
- Step 64 Combine the first encoded image data and the second encoded image data to obtain third encoded image data.
- For example, if the first encoded image data is 320×180×3 and the second encoded image data is 320×180×5, the merged third encoded image data is 320×180×8.
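- A minimal PyTorch sketch of this merge, restating the example shapes in (batch, channels, height, width) layout, so 320×180×3 becomes (1, 3, 180, 320):

```python
import torch

first_encoded = torch.rand(1, 3, 180, 320)   # Y-branch output: 320x180x3
second_encoded = torch.rand(1, 5, 180, 320)  # CrCb-branch output: 320x180x5
third_encoded = torch.cat([first_encoded, second_encoded], dim=1)  # step 64
print(third_encoded.shape)  # torch.Size([1, 8, 180, 320]), i.e. 320x180x8
```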
- Step 65 Use the convolutional hidden layer to perform convolution, activation, pooling, batch normalization, or dropout regularization on the third encoded image data to obtain fourth encoded image data.
- After the first encoded image data and the second encoded image data generated by the two branches are merged, the convolutional hidden layer applies these operations to the merged data, finally yielding the fourth encoded image data.
- Step 66 Perform flattening processing on the fourth encoded image data output by the convolution hidden layer to obtain fifth encoded image data.
- The flattening process is used for dimensionality reduction, so that the dimension of the fifth encoded image data is smaller than the dimension of the fourth encoded image data.
- Step 67 Use the encoding fully connected hidden layer to perform activation, batch normalization, or dropout regularization on the fifth encoded image data to obtain sixth encoded image data.
- Each encoding fully connected hidden layer can apply three operations: activation, batch normalization, or dropout regularization; dropout is optional.
- Step 68 Use the encoding fully connected output layer to process the sixth encoded image data to obtain the first encoded image frame.
- The number of neurons in the encoding fully connected output layer is smaller than the number of neurons in the encoding fully connected hidden layer, and its storage space is much smaller than the size of the input image. The encoding fully connected output layer also serves as the input layer of the neural-network-based decoding network.
- The steps by which the server decodes the first encoded image frame to obtain the decoded image frame are shown in FIG. 7:
- Step 71 Receive the first encoded image frame output by the encoding fully connected output layer.
- Step 72 Use the decoding fully connected hidden layer to process the first encoded image frame to obtain first decoded image data.
- Step 73 Set up deconvolution hidden layers in two branches, and use the deconvolution hidden layers in each branch to perform deconvolution, activation, unpooling, batch normalization, or dropout regularization on the first decoded image data to obtain two pieces of second decoded image data.
- Each deconvolution hidden layer can apply five operations: deconvolution, activation, unpooling, batch normalization, or dropout regularization; the unpooling and dropout operations are optional.
- The number of deconvolution hidden layers and the number of deconvolution kernels per layer differ between the two branches, the branches do not share weights, and the branch for the luminance channel Y has more deconvolution hidden layers.
- Step 74 Use the output layer to process each second decoded image data to obtain the first decoded image frame and the second decoded image frame.
- The output layer is a deconvolution output layer; the number of deconvolution kernels for the luminance channel Y is 1, and the number for the red-difference and blue-difference channels CrCb is 2.
- Step 75 Perform up-sampling processing on the second decoded image frame to obtain a third decoded image frame.
- Because the second sub-input layer receives down-sampled data, during synthesis the image output by the output layer for the red-difference and blue-difference channels CrCb is up-sampled, so that the data sizes of the luminance channel Y and the CrCb channels match.
- Step 76 Combine the first decoded image frame and the third decoded image frame to obtain a decoded image frame.
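- A minimal PyTorch sketch of steps 75-76, assuming bilinear up-sampling and a down-sampling factor N of 4; the application fixes neither choice:

```python
import torch
import torch.nn.functional as F

def merge_decoded(y_frame, crcb_frame, n=4):
    """Up-sample the CrCb output by the same factor N used when down-sampling
    at the encoder input, then concatenate it with the Y output."""
    crcb_up = F.interpolate(crcb_frame, scale_factor=n, mode="bilinear",
                            align_corners=False)
    return torch.cat([y_frame, crcb_up], dim=1)  # (1, 3, H, W) YCrCb frame

decoded = merge_decoded(torch.rand(1, 1, 720, 1280), torch.rand(1, 2, 180, 320))
print(decoded.shape)  # torch.Size([1, 3, 720, 1280])
```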
- The number of convolution kernels, the number of deconvolution kernels, the activation functions, the pooling and unpooling parameters, and the number of neurons in the hidden layers of the whole codec network are not hard requirements and can be designed as needed.
- A training set composed of various TV shows or movies is used to train the codec network.
- The luminance-channel (Y) data of each frame is fed to the first branch of the coding network, while the red-difference channel Cr and the blue-difference channel Cb form a dual-channel image that is down-sampled by a factor of 4 and fed to the second branch. The two are used as the labels of the two branches of the decoding network, the loss of each branch is computed, and the two losses are added to form the final loss.
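- An illustrative PyTorch sketch of this two-branch loss; MSE is an assumed choice, since the application does not name a specific loss function:

```python
import torch
import torch.nn.functional as F

def two_branch_loss(y_out, crcb_out, frame_ycrcb):
    """Each decoder branch is supervised by its own label (Y at full size,
    CrCb down-sampled 4x), and the two losses are summed into the final loss."""
    y_label = frame_ycrcb[:, 0:1]                    # luminance label
    crcb_label = F.interpolate(frame_ycrcb[:, 1:3],  # CrCb label, 4x down-sampled
                               scale_factor=0.25, mode="bilinear",
                               align_corners=False)
    return F.mse_loss(y_out, y_label) + F.mse_loss(crcb_out, crcb_label)
```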
- The codec network in this embodiment is a special-purpose high-quality codec network suitable for processing particular videos: it can be trained on a specific TV show or movie and is then responsible for encoding and decoding that show or movie. For ordinary video there is no need to train on the server side; instead a general high-quality codec network is trained on the client side.
- The structure and training method of the general high-quality codec network are similar to those of the special-purpose network; the difference is that its number of hidden layers and number of convolution kernels are greater than or equal to those of the special-purpose high-quality codec network, so it can process most videos.
- A neural-network-based coding network and a neural-network-based decoding network are shown in FIG. 8; together they constitute a neural-network-based coding and decoding network. This network is a variational autoencoder that uses the RGB color space during training.
- The encoding network has one input, and the decoding network has multiple outputs to support multi-resolution output.
- The steps for the server to obtain the first encoded image frame are shown in FIG. 9:
- Step 91 Use the input layer to receive the input image.
- The color format of the input image is red-green-blue (RGB).
- Step 92 Use the convolutional hidden layers to perform convolution, activation, pooling, batch normalization, or dropout regularization on the input image to obtain seventh encoded image data.
- The number of convolutional hidden layers is at least 2. Each convolutional hidden layer can apply five operations: convolution, activation, pooling, batch normalization, or dropout regularization, with pooling and dropout being optional. That is, in a convolutional hidden layer, convolution kernels first convolve the output of the previous layer to extract feature information from the input image; the convolved data is then pooled to down-sample it; and an activation function is applied to the pooled data to increase the nonlinearity of the coding network model.
- Step 93 Perform flattening processing on the seventh encoded image data output by the convolutional hidden layer to obtain eighth encoded image data.
- The flattening process is used for dimensionality reduction: it expands the three-dimensional data into one dimension, so the dimension of the eighth encoded image data is smaller than that of the seventh encoded image data.
- For example, suppose the input image is 1280×720×3, the number of convolution kernels is 5, the pooling operation down-samples by a factor of 2, and there are 2 convolutional hidden layers. The data after the first convolution is 1280×720×5; after pooling it is 640×360×5; after the activation function it is 640×360×5. The data after the second convolution is 640×360×10; after pooling it is 320×180×10; after activation it is 320×180×10. After flattening, the output becomes one-dimensional data of length 320×180×10.
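- This shape walkthrough can be reproduced with a minimal PyTorch sketch (padding chosen so convolution preserves spatial size, and max pooling used as the 2x down-sampling; both are assumptions):

```python
import torch
import torch.nn as nn

# 5 kernels then 10, each followed by 2x pooling and activation, then flattening.
encoder = nn.Sequential(
    nn.Conv2d(3, 5, 3, padding=1), nn.MaxPool2d(2), nn.ReLU(),   # -> 5 x 360 x 640
    nn.Conv2d(5, 10, 3, padding=1), nn.MaxPool2d(2), nn.ReLU(),  # -> 10 x 180 x 320
    nn.Flatten(),                                                # -> 320*180*10
)
x = torch.rand(1, 3, 720, 1280)  # a 1280x720x3 input image
print(encoder(x).shape)          # torch.Size([1, 576000]), i.e. 320*180*10
```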
- Step 94 Use the encoding fully connected hidden layer to perform activation, batch normalization, or dropout regularization on the eighth encoded image data to obtain ninth encoded image data.
- Each encoding fully connected hidden layer can apply three operations: activation, batch normalization, or dropout regularization; dropout is optional.
- Step 95 Use the encoding fully connected output layer to process the ninth encoded image data to obtain the first encoded image frame.
- The number of neurons in the encoding fully connected output layer is smaller than the number of neurons in the encoding fully connected hidden layer, and its storage space is much smaller than the size of the input image. The encoding fully connected output layer also serves as the input layer of the neural-network-based decoding network.
- The steps by which the server decodes the first encoded image frame to obtain the decoded image frame are shown in FIG. 10:
- Step 101 Receive the first encoded image frame output by the encoding fully connected output layer.
- Step 102 Use the decoding fully connected hidden layer to process the first encoded image frame to obtain third decoded image data.
- Step 103 Set up deconvolution hidden layers in at least two branches, and use the deconvolution hidden layers in each branch to perform deconvolution, activation, unpooling, batch normalization, or dropout regularization on the third decoded image data to obtain at least two pieces of fourth decoded image data.
- Each deconvolution hidden layer can apply five operations: deconvolution, activation, unpooling, batch normalization, or dropout regularization; the unpooling and dropout operations are optional.
- The three output branches in FIG. 8 differ in their number of deconvolution hidden layers and in the number of deconvolution kernels per layer, and they do not share weights. The branch whose output layer produces the higher-resolution image has more deconvolution hidden layers. For example, the resolutions of the images produced by the output layers may be 1920*1080, 1280*720, and 640*360, respectively.
- A training set of various TV shows or movies can be used to train the codec network: each frame is used as input, and each frame is linearly interpolated to the three resolutions 1920*1080, 1280*720, and 640*360, which are compared with the three outputs of the decoding network for the loss calculation.
- Step 104 Use the output layer to process each fourth decoded image data to obtain a corresponding decoded image frame.
- The number of output layers equals the number of branches; the number of deconvolution hidden layers and deconvolution kernels differs between branches and no weights are shared; any two decoded image frames have different resolutions; and the higher the resolution, the greater the number of deconvolution hidden layers in the corresponding branch.
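- A structural PyTorch sketch of such a multi-resolution decoder; the latent size of 64, the channel counts, and the per-branch layer counts are illustrative assumptions that merely respect the "more deconvolution layers for higher resolution" rule:

```python
import torch
import torch.nn as nn

class MultiResDecoder(nn.Module):
    """Three deconvolution branches with unshared weights; the branch with the
    higher output resolution has more deconvolution hidden layers."""

    def __init__(self, latent_dim=64):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 128 * 45 * 80)  # decoding fully connected layer

        def branch(n_layers, out_size):
            layers, ch = [], 128
            for _ in range(n_layers):  # each step doubles the spatial resolution
                layers += [nn.ConvTranspose2d(ch, ch // 2, 4, stride=2, padding=1),
                           nn.ReLU()]
                ch //= 2
            layers += [nn.Conv2d(ch, 3, 3, padding=1),
                       nn.Upsample(size=out_size)]  # snap to the exact target size
            return nn.Sequential(*layers)

        self.branches = nn.ModuleList([
            branch(3, (360, 640)),    # 640x360 output
            branch(4, (720, 1280)),   # 1280x720 output
            branch(5, (1080, 1920)),  # 1920x1080 output
        ])

    def forward(self, z):
        x = self.fc(z).view(-1, 128, 45, 80)
        return [b(x) for b in self.branches]

outputs = MultiResDecoder()(torch.rand(1, 64))
print([tuple(o.shape) for o in outputs])
# [(1, 3, 360, 640), (1, 3, 720, 1280), (1, 3, 1080, 1920)]
```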
- Likewise, the number of convolution kernels, the number of deconvolution kernels, the activation functions, the pooling and unpooling parameters, and the number of neurons in the hidden layers of the whole codec network are not hard requirements and can be designed as needed.
- The codec network in this embodiment is a special-purpose multi-resolution codec network suitable for processing particular videos: it can be trained on a specific TV show or movie and is then responsible for encoding and decoding that show or movie. For ordinary video there is no need to train on the server side; instead a general multi-resolution codec network is trained on the client side. The structure and training method of the general multi-resolution codec network are similar to those of the special-purpose network; the difference is that its number of hidden layers and number of convolution kernels are greater than or equal to those of the special-purpose multi-resolution codec network, so it can process most videos.
- Special-purpose codec networks include the special-purpose multi-resolution codec network and the special-purpose high-quality codec network.
- Images decoded by special-purpose codec networks have higher definition, better results, and shorter decoding times.
- To use one, the user needs to click to download the special-purpose decoding network corresponding to a particular TV series or movie.
- A special-purpose codec network can also add special effects to the video.
- The special-effect function must be realized during training of the codec network: label images decorated with a given special effect are used as the new label images.
- A decoding network that generates the special-effect images can then be obtained, and more complex effects can be built. For example, it can convert live-action drama to animation (or vice versa) or give ordinary pictures a blockbuster style, such as a freeze-frame effect or a computer-animation effect.
- FIG. 11 is a schematic structural diagram of an embodiment of a mobile terminal provided by the present application.
- the mobile terminal 110 includes a memory 111 and a processor 112 connected to each other.
- The memory 111 is used to store a computer program which, when executed by the processor 112, implements the video processing method in the foregoing embodiments.
- The mobile terminal 110 can train a general decoding network, an image degradation removal network, a motion estimation network, or a scene change detection network.
- FIG. 12 is a schematic structural diagram of an embodiment of a server provided by the present application.
- the server 120 includes a memory 121 and a processor 122 that are connected to each other.
- The memory 121 is used to store a computer program which, when executed by the processor 122, implements the video processing method in the foregoing embodiments.
- The server 120 can train a general encoding network, a special-purpose encoding network, and a special-purpose decoding network.
- The server 120 stores the special-purpose decoding network so that, when a mobile terminal requests a special video, the decoding network can be sent to the mobile terminal, making it convenient for the mobile terminal to decode the special video so the user can watch it.
- FIG. 13 is a schematic structural diagram of an embodiment of a video processing system provided by the present application.
- The video processing system 130 includes a server 131 and a mobile terminal 132 connected to each other.
- The server 131 encodes input images to obtain encoded image frames, and the mobile terminal 132 decodes the encoded image frames to obtain decoded image frames, where the server 131 is the server of the foregoing embodiments and the mobile terminal 132 is the mobile terminal of the foregoing embodiments.
- The video processing system 130 is an image-content-based encoding and decoding system: it can compress an image to a handful of floating-point numbers, greatly improving the compression rate and reducing the bandwidth required for video transmission, and the encoded floating-point data is extremely secure, revealing no transmitted information even if intercepted.
- FIG. 14 is a schematic structural diagram of an embodiment of a computer storage medium provided by the present application.
- The computer storage medium 140 is used to store a computer program 141.
- When the computer program 141 is executed by a processor, it implements the video processing method in the foregoing embodiments.
- The storage medium 140 may be a server, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium that can store program code.
- The disclosed method and device may be implemented in other ways.
- The device implementation described above is only illustrative. For example, the division into modules or units is only a logical functional division; in an actual implementation there may be other divisions, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
- The functional units in the various embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
- The above-mentioned integrated unit can be implemented in the form of hardware or as a software functional unit.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
Description
Claims (17)
- 1. A video processing method, applied to a client, wherein the video processing method comprises: receiving a first encoded image frame sent by a server; determining whether an image enrichment instruction is received; and if so, adding random noise to the first encoded image frame to generate a second encoded image frame; wherein the first encoded image frame is floating-point data, and a difference between the first encoded image frame and the second encoded image frame is within a preset range.
- 2. The video processing method according to claim 1, wherein the step of adding random noise to the first encoded image frame to generate the second encoded image frame comprises: using a scene change detection network to determine whether a scene change occurs; if so, generating new random noise and adding the new random noise to the first encoded image frame to generate the second encoded image frame; and if not, continuing to add the current random noise to the first encoded image frame to generate the second encoded image frame.
- 3. The video processing method according to claim 1, further comprising: decoding the second encoded image frame with a neural-network-based decoding network to obtain a decoded image frame; processing the decoded image frame with an image degradation removal network to obtain a first image frame; estimating the first image frame with a motion estimation network to generate at least one second image frame; and sending the first image frame and the second image frame to a video player for playback.
- 4. The video processing method according to claim 3, wherein before the step of receiving the first encoded image frame sent by the server, the method comprises: sending a download request message to the server at a preset time interval or every preset number of frames.
- 5. The video processing method according to claim 3, wherein the step of processing the decoded image frame with the image degradation removal network to obtain the first image frame comprises: acquiring multiple images as original images; performing Gaussian blur processing or noise processing on the original images to generate corresponding training images and establish a training set; and training on the training images in the training set with an image deblurring network or an image super-resolution network.
- 6. The video processing method according to claim 3, wherein the motion estimation network is a generative adversarial network comprising a generation network and a discrimination network; the generation network comprises a two-dimensional convolutional layer and a three-dimensional deconvolutional layer, the two-dimensional convolutional layer being used to extract feature information from the first image frame and the three-dimensional deconvolutional layer being used to receive the feature information and generate the at least one second image frame; and the discrimination network comprises a three-dimensional convolutional layer and a fully connected layer, which are used to determine whether the generated second image frame is an image that meets preset requirements.
- 7. A video processing method, applied to a server, wherein the video processing method comprises: receiving an input image; and encoding the input image with a neural-network-based coding network to obtain a first encoded image frame; wherein the first encoded image frame is floating-point data, the neural-network-based coding network comprises at least an input layer, each input layer comprises at least two sub-input layers, and each sub-input layer is used to receive data of at least one channel of the input image.
- 8. The video processing method according to claim 7, wherein the neural-network-based coding network further comprises at least one convolutional hidden layer, an encoding fully connected hidden layer, and an encoding fully connected output layer.
- 9. The video processing method according to claim 8, further comprising: decoding the first encoded image frame to obtain a decoded image frame; and after receiving a video viewing request sent by a client, sending a neural-network-based decoding network to the client; wherein the neural-network-based decoding network comprises a decoding fully connected hidden layer, at least one deconvolution hidden layer, and an output layer.
- 10. The video processing method according to claim 9, wherein the input layer comprises a first sub-input layer and a second sub-input layer, and the step of encoding the input image with the neural-network-based coding network to obtain the first encoded image frame comprises: receiving data of a first channel of the input image with the first sub-input layer; down-sampling data of a second channel of the input image and feeding the down-sampled data into the second sub-input layer; performing convolution, activation, pooling, batch normalization, or dropout regularization on the data output by the first sub-input layer and the second sub-input layer with the convolutional hidden layers, respectively, to obtain first encoded image data and second encoded image data, wherein the first encoded image data and the second encoded image data have the same resolution; merging the first encoded image data and the second encoded image data to obtain third encoded image data; performing convolution, activation, pooling, batch normalization, or dropout regularization on the third encoded image data with the convolutional hidden layer to obtain fourth encoded image data; flattening the fourth encoded image data output by the convolutional hidden layer to obtain fifth encoded image data, wherein the dimension of the fifth encoded image data is smaller than the dimension of the fourth encoded image data; performing activation, batch normalization, or dropout regularization on the fifth encoded image data with the encoding fully connected hidden layer to obtain sixth encoded image data; and processing the sixth encoded image data with the encoding fully connected output layer to obtain the first encoded image frame.
- 11. The video processing method according to claim 10, wherein the step of decoding the first encoded image frame to obtain the decoded image frame comprises: receiving the first encoded image frame output by the encoding fully connected output layer; processing the first encoded image frame with the decoding fully connected hidden layer to obtain first decoded image data; setting deconvolution hidden layers in two branches, and performing deconvolution, activation, unpooling, batch normalization, or dropout regularization on the first decoded image data with the deconvolution hidden layers in each branch to obtain two pieces of second decoded image data; processing each piece of second decoded image data with the output layers to obtain a first decoded image frame and a second decoded image frame; up-sampling the second decoded image frame to obtain a third decoded image frame; and merging the first decoded image frame with the third decoded image frame to obtain the decoded image frame.
- 12. The video processing method according to claim 10, wherein the color format of the input image is luminance-red difference-blue difference, the first input layer is for the luminance channel, and the second input layer is for the red-difference and blue-difference channels.
- 13. The video processing method according to claim 9, wherein the step of decoding the first encoded image frame to obtain the decoded image frame comprises: receiving the first encoded image frame output by the encoding fully connected output layer; processing the first encoded image frame with the decoding fully connected hidden layer to obtain third decoded image data; setting deconvolution hidden layers in at least two branches, and performing deconvolution, activation, unpooling, batch normalization, or dropout regularization on the third decoded image data with the deconvolution hidden layers in each branch to obtain at least two pieces of fourth decoded image data; and processing each piece of fourth decoded image data with the output layers to obtain the corresponding decoded image frames; wherein the number of output layers is the same as the number of branches, the number of deconvolution hidden layers and the number of deconvolution kernels in each branch are different and share no weights, the resolutions of any two decoded image frames are different, and the higher the resolution, the greater the number of deconvolution hidden layers in the corresponding branch.
- 14. A mobile terminal, comprising a memory and a processor connected to each other, wherein the memory is used to store a computer program which, when executed by the processor, implements the video processing method according to any one of claims 1-6.
- 15. A server, comprising a memory and a processor connected to each other, wherein the memory is used to store a computer program which, when executed by the processor, implements the video processing method according to any one of claims 7-13.
- 16. A video processing system, comprising a server and a mobile terminal connected to each other, wherein the server is used to encode input images to obtain encoded image frames, and the mobile terminal is used to decode the encoded image frames to obtain decoded image frames, wherein the mobile terminal is the mobile terminal according to claim 14 and the server is the server according to claim 15.
- 17. A computer storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the video processing method according to any one of claims 1-13.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2019/087662 WO2020232613A1 (zh) | 2019-05-20 | 2019-05-20 | Video processing method, system, mobile terminal, server, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2019/087662 WO2020232613A1 (zh) | 2019-05-20 | 2019-05-20 | Video processing method, system, mobile terminal, server, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020232613A1 true WO2020232613A1 (zh) | 2020-11-26 |
Family
ID=73459256
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/087662 WO2020232613A1 (zh) | 2019-05-20 | 2019-05-20 | 一种视频处理方法、系统、移动终端、服务器及存储介质 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2020232613A1 (zh) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114066764A (zh) * | 2021-11-23 | 2022-02-18 | University of Electronic Science and Technology of China | Sand-dust degraded image enhancement method and device based on distance-weighted color cast estimation
CN114529746A (zh) * | 2022-04-02 | 2022-05-24 | Guangxi University of Science and Technology | Image clustering method based on low-rank subspace consistency
CN115984675A (zh) * | 2022-12-01 | 2023-04-18 | Yangzhou Wanfang Technology Co., Ltd. | System and method for multi-channel video decoding and AI intelligent analysis
CN117829312A (zh) * | 2023-12-29 | 2024-04-05 | Nanjing Silicon Intelligence Technology Co., Ltd. | Method, apparatus, and device for generating a video-driven digital human expression model
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101795344A (zh) * | 2010-03-02 | 2010-08-04 | Peking University | Digital holographic image compression and decoding method and system, and transmission method and system
CN105740916A (zh) * | 2016-03-15 | 2016-07-06 | Beijing Moshanghua Technology Co., Ltd. | Image feature encoding method and device
CN107018422A (zh) * | 2017-04-27 | 2017-08-04 | Sichuan University | Still image compression method based on a deep convolutional neural network
CN107396124A (zh) * | 2017-08-29 | 2017-11-24 | Nanjing University | Video compression method based on deep neural networks
US20190058489A1 (en) * | 2017-08-21 | 2019-02-21 | Kabushiki Kaisha Toshiba | Information processing apparatus, information processing method, and computer program product |
- 2019-05-20 WO PCT/CN2019/087662 patent/WO2020232613A1/zh active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101795344A (zh) * | 2010-03-02 | 2010-08-04 | Peking University | Digital holographic image compression and decoding method and system, and transmission method and system
CN105740916A (zh) * | 2016-03-15 | 2016-07-06 | Beijing Moshanghua Technology Co., Ltd. | Image feature encoding method and device
CN107018422A (zh) * | 2017-04-27 | 2017-08-04 | Sichuan University | Still image compression method based on a deep convolutional neural network
US20190058489A1 (en) * | 2017-08-21 | 2019-02-21 | Kabushiki Kaisha Toshiba | Information processing apparatus, information processing method, and computer program product
CN107396124A (zh) * | 2017-08-29 | 2017-11-24 | Nanjing University | Video compression method based on deep neural networks
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114066764A (zh) * | 2021-11-23 | 2022-02-18 | University of Electronic Science and Technology of China | Sand-dust degraded image enhancement method and device based on distance-weighted color cast estimation
CN114066764B (zh) * | 2021-11-23 | 2023-05-09 | University of Electronic Science and Technology of China | Sand-dust degraded image enhancement method and device based on distance-weighted color cast estimation
CN114529746A (zh) * | 2022-04-02 | 2022-05-24 | Guangxi University of Science and Technology | Image clustering method based on low-rank subspace consistency
CN114529746B (zh) * | 2022-04-02 | 2024-04-12 | Guangxi University of Science and Technology | Image clustering method based on low-rank subspace consistency
CN115984675A (zh) * | 2022-12-01 | 2023-04-18 | Yangzhou Wanfang Technology Co., Ltd. | System and method for multi-channel video decoding and AI intelligent analysis
CN115984675B (zh) * | 2022-12-01 | 2023-10-13 | Yangzhou Wanfang Technology Co., Ltd. | System and method for multi-channel video decoding and AI intelligent analysis
CN117829312A (zh) * | 2023-12-29 | 2024-04-05 | Nanjing Silicon Intelligence Technology Co., Ltd. | Method, apparatus, and device for generating a video-driven digital human expression model
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110139147B (zh) | Video processing method, system, mobile terminal, server, and storage medium | |
WO2020232613A1 (zh) | Video processing method, system, mobile terminal, server, and storage medium | |
CN112991203B (zh) | Image processing method and apparatus, electronic device, and storage medium | |
TWI826321B (zh) | Method for improving image quality | |
CN108337465B (zh) | Video processing method and apparatus | |
US10819994B2 (en) | Image encoding and decoding methods and devices thereof | |
US10560731B2 (en) | Server apparatus and method for content delivery based on content-aware neural network | |
CN111586412B (zh) | High-definition video processing method, master device, slave device, and chip system | |
CN110827380B (zh) | Image rendering method and apparatus, electronic device, and computer-readable medium | |
CN113747242B (zh) | Image processing method and apparatus, electronic device, and storage medium | |
WO2023005140A1 (zh) | Video data processing method, apparatus, and device, and storage medium | |
US20190171916A1 (en) | Increasing network transmission capacity and data resolution quality and computer systems and computer-implemented methods for implementing thereof | |
WO2022268181A1 (zh) | Video enhancement processing method and apparatus, electronic device, and storage medium | |
CN112330541A (zh) | Live video processing method and apparatus, electronic device, and storage medium | |
CN107396002B (zh) | Video image processing method and mobile terminal | |
Huang et al. | A cloud computing based deep compression framework for UHD video delivery | |
CN113436061B (zh) | Face image reconstruction method and system | |
CN113822803A (zh) | Image super-resolution processing method, apparatus, and device, and computer-readable storage medium | |
Kato et al. | Split rendering of the transparent channel for cloud ar | |
WO2023010981A1 (zh) | Encoding and decoding method and apparatus | |
CN115665427A (zh) | Live streaming data processing method and apparatus, and electronic device | |
CN114885178A (zh) | Extremely-low-bit-rate face video hybrid compression method and system based on bidirectional frame prediction | |
You et al. | CNN-Based Local Tone Mapping in the Perceptual Quantization Domain | |
Groth et al. | Wavelet-Based Fast Decoding of 360 Videos | |
CN113450293A (zh) | Video information processing method, apparatus, and system, electronic device, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19929924 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19929924 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 09-06-2022) |
|