CN117956178A - Video encoding method and device, and video decoding method and device

Publication number: CN117956178A
Application number: CN202410171043.XA
Applicant / Assignee: Industrial and Commercial Bank of China Ltd (ICBC)
Inventor: 王镜宇 (Wang Jingyu)
Original language: Chinese (zh)
Legal status: Pending
Prior art keywords: video, data, network model, key frame, image
Classification: Compression Or Coding Systems Of Tv Signals
Abstract

The specification provides a video encoding method and device and a video decoding method and device, applied to the field of artificial intelligence. Based on the method, after obtaining high-definition target video data, the encoding end can first perform compression encoding on the target video data to obtain first video encoded data; then process a first key frame encoding in the first video encoded data by using a preset style feature extraction network model to obtain corresponding style characteristics; and generate, from the first video encoded data and the style characteristics, second video encoded data which has a relatively small data amount and contains the style characteristics associated with the first key frame encoding. The decoding end decodes and recovers the high-definition target video data locally according to a preset super-resolution network model and the received second video encoded data. In this way, the performance and computing power of the decoding end can be effectively utilized, so that a user can smoothly watch high-definition video online in a weak network environment and obtain a better video service experience.

Description

Video encoding method and device, and video decoding method and device
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a video encoding method and apparatus, and a video decoding method and apparatus.
Background
In general, when video is encoded, in order to ensure efficient transmission of the video over a network, video coding standards such as MPEG, H.264, H.265/HEVC, and VVC are often used to perform compression encoding on the original video data. Taking H.264 encoding as an example, each video frame may first be divided into a plurality of small blocks, and the encoder then describes the displacement of objects between adjacent frames (motion estimation); meanwhile, the pixel data within each block is processed using a Discrete Cosine Transform (DCT) in accordance with the visual masking effect. The DCT converts the pixels into frequency-domain coefficients, allowing the high-frequency part to be discarded to reduce the amount of data. After the above steps are completed, entropy coding is used to further reduce the number of compressed bits.
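For illustration only, the following sketch (not part of the patent, and not an H.264 implementation) shows the block-DCT idea described above: a small pixel block is transformed into frequency-domain coefficients, the high-frequency part is discarded, and an approximation is reconstructed.

```python
# Toy demonstration of DCT-based block compression: transform an 8x8 block,
# keep only the low-frequency coefficients, and reconstruct. Illustrative only.
import numpy as np
from scipy.fft import dctn, idctn

block = np.random.default_rng(0).integers(0, 256, size=(8, 8)).astype(float)

coeffs = dctn(block, norm="ortho")          # pixels -> frequency-domain coefficients
mask = np.zeros_like(coeffs)
mask[:4, :4] = 1                            # keep only the low-frequency quarter
reconstructed = idctn(coeffs * mask, norm="ortho")

kept = int(mask.sum())
print(f"kept {kept}/64 coefficients, max abs error = {np.abs(block - reconstructed).max():.1f}")
```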
However, with existing video coding methods, the video compression effect is often not ideal. Taking H.264 as an example, smoothly playing 1080p video generally requires a bandwidth of 3 Mbps to 5 Mbps, so in a weak network environment it is still difficult for a user to play video smoothly, which affects the user's video service experience. In addition, with the development of technology, the performance and computing power of user terminals keep improving; however, existing video coding methods cannot fully and effectively utilize this high performance and high computing power.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The specification provides a video coding method and device, a video decoding method and device, which can effectively utilize the performance and computing power of a terminal, so that a user can smoothly watch high-definition videos on line under a weak network environment, and better video service experience is obtained.
The present specification provides a video coding method, applied to a coding end, including:
Acquiring first video encoding data regarding target video data; wherein the first video encoding data comprises a plurality of first image encodings;
processing a first key frame code in the first video coding data by using a preset style feature extraction network model to obtain corresponding style characteristics; wherein the first key frame code is used in the restoration processing of other first image codes;
generating second video coding data related to target video data according to the first video coding data and the style characteristics.
In one embodiment, generating second video encoded data for the target video data based on the first video encoded data and the style characteristics includes:
downsampling the first video coding data to obtain downsampled first video coding data;
Determining the position of the downsampled first key frame code in the downsampled first video coding data according to a preset adding rule;
And adding the style characteristics corresponding to the first key frame code at the position immediately following the downsampled first key frame code in the downsampled first video coding data, so as to obtain the second video coding data.
In one embodiment, processing a first key frame code in first video coding data by using a preset style feature extraction network model to obtain a corresponding style feature includes:
Identifying and determining a first key frame code in the first video coding data;
And processing the first key frame code by using a preset style characteristic extraction network model to obtain corresponding style characteristics.
In one embodiment, identifying and determining a first key frame code in first video encoding data includes:
Extracting identification information of a first image code in first video code data; identifying and determining a first key frame code in the first video coding data according to the identification information;
Or processing the first video coding data by using a preset classifier to obtain a corresponding coding classification result; and determining a first key frame code in the first video coding data according to the code classification result.
In one embodiment, acquiring first video encoded data regarding target video data includes:
Acquiring target video data;
and carrying out compression coding on the target video data according to a preset coding rule to obtain corresponding first video coding data.
In one embodiment, the method further comprises:
Constructing a first initial network model;
Acquiring and training the first initial network model by using a first sample high-definition image to obtain a first network model meeting the requirements; wherein the first network model comprises at least a first network structure and a second network structure; the first network structure is used for carrying out feature compression on the high-definition image of the first sample; the second network structure is used for carrying out feature recovery on the compressed image;
intercepting a first network structure from the first network model;
And constructing and obtaining a preset style characteristic extraction network model meeting the requirements according to the first network structure.
In one embodiment, after constructing the network model for extracting the style characteristics meeting the requirements, the method further comprises:
Constructing a second initial network model based on a dense residual structure; wherein the second initial network model comprises at least a first input interface and a second input interface;
connecting an output interface of a preset style characteristic extraction network model with a first input interface of a second initial network model;
obtaining a second sample high-definition image; correspondingly processing the second sample high-definition image to obtain a corresponding second sample low-resolution image;
and inputting a second sample high-definition image into a preset style feature extraction network model, and simultaneously inputting a second sample low-resolution image corresponding to the second sample high-definition image into a second initial network model through a second input interface, and training the second initial network model to obtain a preset super-resolution network model meeting the requirements.
In one embodiment, after obtaining the preset super-resolution network model meeting the requirements, the method further includes:
transmitting the preset super-resolution network model to a decoding end; and the decoding end receives and stores the preset super-resolution network model.
The present disclosure also provides a video decoding method, applied to a decoding end, including:
Receiving second video encoding data;
extracting a first key frame code after downsampling and corresponding style characteristics from the second video coding data;
Processing the first key frame codes after downsampling and corresponding style characteristics by using a preset super-resolution network model, and recovering to obtain the first key frame codes;
And performing corresponding decoding processing according to the first key frame coding and the second video coding data to obtain target video data meeting the requirements.
In one embodiment, the preset super-resolution network model at least includes: a shallow feature extraction layer, a plurality of improved information distillation modules connected in a nested manner, a first convolution layer, and an upsampling operator;
Correspondingly, the first key frame code after downsampling and the corresponding style characteristics are processed by utilizing a preset super-resolution network model, and the first key frame code is recovered and obtained, which comprises the following steps:
Processing the first key frame code after downsampling and corresponding style characteristics by using a shallow layer characteristic extraction layer in a preset super-resolution network model, and outputting to obtain initial characteristics;
Performing feature processing of multiple rounds of iteration on the initial features by utilizing a plurality of improved information distillation modules connected in a nested manner, and outputting to obtain corresponding intermediate features;
carrying out convolution processing on the intermediate features by using a first convolution layer to obtain a corresponding intermediate convolution result;
And recovering to obtain a corresponding first key frame code through upsampling based on the intermediate convolution result by using an upsampling operator.
In one embodiment, the improved information distillation module comprises at least: a first feature refinement module, a second feature refinement module, a connection layer, a second convolution layer and an ESA module;
The first feature refinement module is respectively connected with the second feature refinement module and the connecting layer through a channel separation structure; the second feature refinement module is connected with at least a connection layer, the connection layer is connected with a second convolution layer, and the second convolution layer is connected with the ESA module.
The present specification also provides a video encoding apparatus, applied to an encoding end, including:
An acquisition module for acquiring first video encoding data concerning target video data; wherein the first video encoding data comprises a plurality of first image encodings;
The processing module is used for processing a first key frame code in the first video coding data by utilizing a preset style feature extraction network model to obtain corresponding style characteristics; wherein the first key frame code is used in the restoration processing of other first image codes;
And the encoding module is used for generating and obtaining second video encoding data about target video data according to the first video encoding data and the style characteristics.
The present specification also provides a video decoding apparatus applied to a decoding end, including:
A receiving module for receiving second video coding data;
The extraction module is used for extracting the first key frame codes after downsampling and corresponding style characteristics from the second video coding data;
The processing module is used for processing the first key frame codes after downsampling and corresponding style characteristics by utilizing a preset super-resolution network model, and recovering to obtain the first key frame codes;
and the decoding module is used for carrying out corresponding decoding processing according to the first key frame coding and the second video coding data to obtain target video data meeting the requirements.
The present specification also provides a server comprising a processor and a memory for storing processor-executable instructions, which when executed by the processor implement the video encoding method, or related steps of the video decoding method.
The present specification also provides a computer readable storage medium having stored thereon computer instructions which when executed by a processor implement the video encoding method, or related steps of the video decoding method.
Before the video encoding method and device and the video decoding method and device provided in the present specification are implemented, a preset style feature extraction network model and a preset super-resolution network model that work well and are associated with each other can be obtained through joint training. In implementation, after obtaining high-definition target video data, the encoding end can perform compression encoding on the target video data to obtain corresponding first video coding data; then process a first key frame code in the first video coding data by using the preset style feature extraction network model to obtain corresponding style characteristics; and generate, according to the first video coding data and the style characteristics, second video coding data which has a relatively smaller data amount than the original target video data and contains the style characteristics associated with the first key frame code. The decoding end can then decode and recover the high-definition target video data locally according to the preset super-resolution network model and the received second video coding data. In this way, the performance and computing power of the encoding end and the decoding end can be effectively utilized, so that a user can smoothly watch high-definition video online in a weak network environment and obtain a better video service experience.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure, the drawings that are required for the embodiments will be briefly described below, and the drawings described below are only some embodiments described in the present disclosure, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 is a flow chart of a video encoding method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of one embodiment of a video encoding method provided by embodiments of the present disclosure, in one example of a scenario;
FIG. 3 is a schematic diagram of one embodiment of a video encoding method provided by embodiments of the present disclosure, in one example of a scenario;
FIG. 4 is a schematic diagram of an embodiment of a video encoding method provided by embodiments of the present disclosure, in one example of a scenario;
FIG. 5 is a schematic diagram of an embodiment of a video encoding method provided by embodiments of the present disclosure, in one example of a scenario;
FIG. 6 is a schematic diagram of one embodiment of a video encoding method provided by embodiments of the present disclosure, in one example of a scenario;
FIG. 7 is a schematic diagram of an embodiment of a video encoding method provided by embodiments of the present disclosure, in one example of a scenario;
Fig. 8 is a schematic diagram of an embodiment of a video encoding method to which the embodiments of the present specification are applied in one scene example;
Fig. 9 is a flowchart of a video decoding method according to an embodiment of the present disclosure;
Fig. 10 is a schematic diagram of an embodiment of a video decoding method to which the embodiments of the present specification are applied, in one scene example;
FIG. 11 is a schematic diagram of the structural composition of a server provided in one embodiment of the present disclosure;
fig. 12 is a schematic structural composition diagram of a video encoding apparatus provided in one embodiment of the present specification;
Fig. 13 is a schematic structural diagram of a video decoding apparatus according to an embodiment of the present disclosure;
FIG. 14 is a schematic diagram of one embodiment of a video encoding method provided by embodiments of the present disclosure, in one example of a scenario;
Fig. 15 is a schematic diagram of an embodiment of a video encoding method to which the embodiments of the present specification are applied in one scene example.
Detailed Description
In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
Referring to fig. 1, an embodiment of the present disclosure provides a video encoding method. The method is particularly applied to one side of the coding end. In particular implementations, the method may include the following:
s101: acquiring first video encoding data regarding target video data; wherein the first video encoding data comprises a plurality of first image encodings;
S102: extracting a network model by using preset style characteristics to process a first key frame code in first video coding data to obtain corresponding style characteristics; wherein the first key frame code is used for restoring and processing other first image codes;
S103: generating second video coding data related to target video data according to the first video coding data and the style characteristics.
The encoding terminal may be specifically understood as a terminal device capable of encoding video data. Such as a server of a platform, a desktop computer, a notebook computer, etc.
The above-mentioned target video data can be understood as high-definition video data having a high definition (or resolution). For example, 1080p resolution video data.
The first key frame coding is specifically understood as an image coding in the first video coding data, which can assist other first image coding in reconstructing and recovering. The above-mentioned other first image coding is specifically understood as image coding of the first video coding data other than the first key frame coding.
Specifically, in the process of performing compression encoding on the target video data to obtain corresponding first video encoded data, the first key frame encoding may be obtained by performing intra-frame encoding on a corresponding video image in the target video data. In contrast, the other first image encodings other than the first key frame encodings may be obtained by inter-encoding the corresponding video images in the target video data. Therefore, in the subsequent decoding recovery process, for the first key frame code, recovery processing can be directly performed on the first key frame code to recover the corresponding video image. In contrast, for other first image encodings, a restoration process is required to be performed by using the corresponding first key frame encodings (e.g., the first key frame encodings in the front-to-back vicinity) to restore the corresponding video image.
Generally, the higher the definition of video data, the larger its data amount; correspondingly, the transmission cost of the video data will be higher and the transmission time will be longer.
Based on the above embodiment, the server may perform compression encoding on the target video data, which has higher definition and a larger data amount, to obtain the corresponding first video coding data; then process a first key frame code in the first video coding data by using a preset style feature extraction network model to obtain corresponding style characteristics; generate, according to the first video coding data and the style characteristics, second video coding data which has a smaller data amount and contains the style characteristics associated with the first key frame code; and then transmit the second video coding data to the decoding end in place of the target video data with its larger original data amount. After obtaining the second video coding data, the decoding end can process it by using a super-resolution network model associated with the preset style feature extraction network model, and locally decode and recover the target video data with higher definition. In this way, the performance and computing power of the decoding end can be effectively utilized, so that a user can smoothly watch high-definition video online in a weak network environment and obtain a better video service experience.
In some embodiments, referring to fig. 2, the video encoding method described above may be applied to a server (may be regarded as an encoding end) side.
The server specifically may include a background server applied to a side of a video service platform and capable of implementing functions such as video data transmission and video data processing. Specifically, the server may be, for example, an electronic device having data operation, storage function and network interaction function. Or the server may be a software program running in the electronic device that provides support for data processing, storage, and network interactions. In the present embodiment, the number of servers is not particularly limited. The server may be one server, several servers, or a server cluster formed by several servers.
When a user wants to watch a high-definition target video on a video service platform online, a user terminal (which can be regarded as a decoding end) can be used to initiate a corresponding target video playing request to the server.
The user terminal specifically may include a front end applied to a user side and capable of implementing functions such as data acquisition and data transmission. Specifically, the user terminal may be, for example, an electronic device such as a desktop computer, a tablet computer, a notebook computer, a mobile phone, a network television, and the like. Or the user terminal may be a software application capable of running in the electronic device described above. For example, it may be a video APP running on a cell phone, etc.
After receiving a target video playing request initiated by a user terminal, the server can respond to the target video playing request to find corresponding high-definition target video data in a video resource library of the video service platform.
Since the data amount of the high-definition target video data is large, the server may first perform compression encoding on the target video data (for example, compression encoding based on H.264) as a first compression, to obtain corresponding first video coding data with a relatively small data amount.
Specifically, taking h.264 encoding as an example, the first video encoded data obtained by compression encoding may specifically be an image frame sequence group. The image frame sequence group may include a plurality of images arranged in order, and each image corresponds to one compressed and encoded image frame in the target video data and may be recorded as a corresponding first image code.
Then, the server can process a first key frame code in the first video coding data by utilizing a preset style characteristic extraction network model to obtain a corresponding style characteristic; and then according to the first video coding data and the style characteristics, carrying out second compression processing to obtain corresponding second image coding data with relatively smaller data volume. Wherein the second video encoding data may include a plurality of second image encodings in a sequential arrangement, and style characteristics associated with the first key frame encodings. The above-mentioned second image coding is understood to mean in particular a coded data with a smaller data volume which is further compressed than the first image coding.
Specifically, when performing the second compression, the server may first perform downsampling processing (for example, bicubic downsampling) on the first video coding data to obtain downsampled first video coding data; then determine, according to a preset adding rule, the position of each downsampled first key frame code in the downsampled first video coding data; and finally add the style characteristics corresponding to each first key frame code at the position immediately following that downsampled first key frame code in the downsampled first video coding data, so as to obtain the second video coding data.
Furthermore, the server can efficiently transmit the second image coding data, instead of the original high-definition target video data, to the user terminal at a low transmission cost and without requiring high bandwidth.
After receiving the second image coding data, the user terminal can process it by using a super-resolution network model associated with the preset style feature extraction network model and, based on a corresponding video decoding method, locally decode and reconstruct the high-definition target video data; the target video data can then be played and displayed to the user.
Therefore, the data processing performance and computing power of the user terminal can be fully utilized, and the user can smoothly watch the high-definition video of the video service platform with shorter loading time through decoding processing of the user terminal, so that better video service experience is obtained.
In some embodiments, the above-mentioned preset style feature extraction network model may be specifically understood as an algorithm model capable of extracting, from an input image, style features used for restoring the compressed image. Specifically, through deep learning, the network model can learn on its own to extract the valid features in the image as the style features. The above style features are specifically understood to be high-frequency image features.
The above-mentioned preset super-resolution network model may be specifically understood as an algorithm model capable of restoring the compressed image to the original image according to the input compressed image and the style feature associated with the compressed image, with the style feature as a guide. Specifically, the network model can self-learn the input low-resolution image through deep learning, and process and output the corresponding high-resolution image by combining the corresponding style characteristics.
The preset style feature extraction network model is associated with the preset super-resolution network model. Specifically, the two models are obtained through joint training in advance. The specific model structures and training modes of the preset style feature extraction network model and the preset super-resolution network model will be described later.
In some embodiments, the acquiring the first video encoded data related to the target video data may include the following when implemented:
S1: acquiring target video data;
s2: and carrying out compression coding on the target video data according to a preset coding rule to obtain corresponding first video coding data.
Based on the above embodiment, the first video encoded data having a relatively small data amount can be obtained by performing compression encoding processing on the target video data having a large original data amount.
In some embodiments, the above method for obtaining the target video data may include the following steps:
S1: receiving a target video playing request; the target video playing request at least carries a video identifier of the target video;
s2: and responding to the target video playing request, and acquiring corresponding target video data.
The video identifier of the target video may specifically include a video name, a video number, a video link, and the like of the target video.
In implementation, the encoding end can respond to the target video playing request, query a video resource library of the video service platform according to the video identification of the target video, and find video data corresponding to the video identification as the target video data.
In some embodiments, after the corresponding target video data is acquired, the encoding end may further detect whether the data size of the target video data is greater than a preset data size threshold, or detect whether the sharpness of the target video data is greater than a preset sharpness threshold; and triggering to execute the video coding method provided by the specification under the condition that the data volume of the target video data is determined to be larger than a preset data volume threshold value and/or the definition of the target video data is determined to be larger than a preset definition threshold value.
In some embodiments, the preset compression encoding rule may specifically include: compression coding rules based on h.264 coding, etc.
Of course, the above listed compression coding rules based on h.264 coding are only one illustrative example. In specific implementation, other types of compression coding rules can be used as preset compression coding rules according to specific situations and processing requirements. The present specification is not limited to this.
In some embodiments, taking a compression encoding rule based on H.264 encoding as the preset compression encoding rule as an example, when compression encoding is performed on the target video data according to the preset compression encoding rule, the target video data with its larger data size may be split into an ordered combination of a plurality of image frames, i.e., VCL (Video Coding Layer) data; the VCL data is then packaged into corresponding NAL (Network Abstraction Layer) data. In this way, the corresponding first image encoded data is obtained.
Reference may be made to fig. 3. The first image coding data may generally include a plurality of first image codes such as I frames, P frames, B frames, and IDR frames. Specifically, the I frame and the IDR frame are compression-encoded by using an intra-frame encoding method according to a preset compression encoding rule, so that no image encoding information based on other preceding and following frames is required when the original video image is restored. The P frame and the B frame are compression-encoded by adopting an inter-frame encoding method according to a preset compression encoding rule, so that image encoding information based on the previous and subsequent frames is also required when the original video image is restored.
In view of this, since the I frame and the IDR frame can be used for the recovery processing of other first image codes, the I frame and the IDR frame described above can be taken as the first key frame codes herein. In subsequent decoding and recovery, as long as the first key frame codes are recovered, the other first image codes can be further recovered based on the first key frame codes.
In some embodiments, the processing the first key frame code in the first video coding data by using the preset style feature extraction network model to obtain the corresponding style feature may include the following steps when implemented:
s1: identifying and determining a first key frame code in the first video coding data;
S2: and processing the first key frame code by using a preset style characteristic extraction network model to obtain corresponding style characteristics.
Based on this embodiment, the pre-trained preset style feature extraction network model can be used to efficiently and accurately extract the style characteristics of the required first key frame code.
In some embodiments, referring to fig. 4, the generating the second video encoded data related to the target video data according to the first video encoded data and the style characteristics may include the following when implemented:
s1: downsampling the first video coding data to obtain downsampled first video coding data;
s2: determining the position of the downsampled first key frame code in the downsampled first video coding data according to a preset adding rule;
s3: and adding the style characteristics corresponding to the first key frame code at the position immediately following the downsampled first key frame code in the downsampled first video coding data, so as to obtain the second video coding data.
The downsampled first video encoding data includes a plurality of downsampled first image encodings.
The above downsampled first image code can be specifically understood as an image code which, after the downsampling operation, has a relatively smaller data amount and lower resolution and has lost much of the high-frequency image information.
Based on the above embodiment, the second image encoded data having a relatively small data amount can be obtained by further downsampling compression processing according to the first image encoded data and style characteristics.
In some embodiments, the identifying and determining the first key frame code in the first video coding data may include, when implemented, the following:
Extracting identification information of a first image code in the first video code data; identifying and determining a first key frame code in the first video coding data according to the identification information;
Or processing the first video coding data by using a preset classifier to obtain a corresponding coding classification result; and determining a first key frame code in the first video coding data according to the code classification result.
In practice, for example, the header may be extracted from NAL data (a first image encoded data); then, corresponding identification information is extracted from the data head; and screening the I frame and the IDR frame image frames according to the identification information to obtain corresponding first key frame codes.
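A minimal sketch of the identification-information approach above, assuming an H.264 Annex-B byte stream: the NAL unit header byte is read after each start code and its nal_unit_type field is checked. Only IDR slices (type 5) are detected here; a full implementation would also parse slice headers to locate non-IDR I frames. The file name is hypothetical.

```python
# Simplified screening of key-frame encodings from an H.264 Annex-B byte stream
# by reading the NAL unit header, as in the identification-information approach
# described above. Real decoders also parse slice headers for non-IDR I frames.
def find_idr_nal_units(stream: bytes):
    """Return byte offsets of NAL units whose header marks an IDR slice."""
    offsets = []
    i = 0
    while True:
        i = stream.find(b"\x00\x00\x01", i)      # Annex-B start code
        if i == -1 or i + 3 >= len(stream):
            break
        header = stream[i + 3]                    # first byte after the start code
        nal_unit_type = header & 0x1F             # low 5 bits of the NAL header
        if nal_unit_type == 5:                    # 5 = coded slice of an IDR picture
            offsets.append(i)
        i += 3
    return offsets

# Hypothetical usage on raw encoded bytes read from a file:
# with open("first_video_encoding.h264", "rb") as f:
#     print(find_idr_nal_units(f.read()))
```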
For another example, the first video coding data may be input into a pre-trained preset classifier, and the preset classifier determines a coding type of a first image coding in the first video coding data relative to other first image coding by processing the first video coding data, so as to obtain and output a corresponding coding classification result. And determining a first image code with the coding type of intra-frame coding from the first video coding data according to the coding classification result to obtain a corresponding first key frame code.
Based on the above embodiments, the first key frame code may be accurately identified and determined from the first video encoding data in a variety of determination manners.
In some embodiments, in implementation, the downsampling process may be performed on each first image code in the first video code data by using bicubic downsampling (Bicubic) to obtain a corresponding downsampled first image code; and then the compressed downsampled first image codes are arranged and combined according to the time information to obtain corresponding downsampled first video coding data.
In some embodiments, referring to fig. 5, according to the preset adding rule, the style characteristics (for example, image style features) associated with a first key frame code may be added at the position immediately following the corresponding downsampled first key frame code in the downsampled first video coding data, so as to obtain second image coding data meeting the requirements.
Of course, in specific implementation, the matched style characteristics may instead be added, according to the preset adding rule, at the position immediately preceding the downsampled first key frame code in the downsampled first video coding data.
Based on the above embodiment, by using the style characteristic and the downsampled first video encoding data in combination, it is possible to construct second image encoding data that simultaneously includes the downsampled first image encoding and the style characteristic that matches the downsampled first key frame encoding.
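The following sketch, written at a frame level for readability rather than on the actual bitstream, illustrates the assembly described above: each first image code is bicubically downsampled, and the style characteristics of each key frame are inserted at the position immediately following its downsampled code. The dict-based entry layout and the `style_model` name are assumptions, not the patented data format.

```python
# Minimal sketch of assembling the second video encoded data: bicubic
# downsampling of each first image encoding and insertion of the style feature
# immediately after each downsampled key frame. `style_model` stands in for the
# preset style feature extraction network model.
import torch
import torch.nn.functional as F

def build_second_video_data(first_video_data, style_model, scale=0.5):
    second = []
    for entry in first_video_data:                    # entries ordered by time
        frame = entry["frame"]                        # tensor, shape (1, C, H, W)
        down = F.interpolate(frame, scale_factor=scale,
                             mode="bicubic", align_corners=False)
        second.append({"type": entry["type"], "frame": down})
        if entry["type"] in ("I", "IDR"):             # first key frame encoding
            with torch.no_grad():
                style = style_model(frame)            # style features of the HD key frame
            second.append({"type": "style", "features": style})
    return second
```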
In some embodiments, when the method is implemented, referring to fig. 6, the method may further include the following:
S1: constructing a first initial network model;
S2: acquiring and training the first initial network model by using a first sample high-definition image to obtain a first network model meeting the requirements; wherein the first network model comprises at least a first network structure and a second network structure; the first network structure is used for carrying out feature compression on the high-definition image of the first sample; the second network structure is used for carrying out feature recovery on the compressed image;
s3: intercepting a first network structure from the first network model;
S4: and constructing and obtaining a preset style characteristic extraction network model meeting the requirements according to the first network structure.
Specifically, a network model similar in structure to U-Net (Convolutional Networks for Biomedical Image Segmentation) can be constructed as the first initial network model.
The first network model may be configured to perform feature compression on an input image to reduce the intermediate feature dimensions and obtain a compressed image while retaining the valid features as far as possible; and then use the retained valid features to perform feature recovery on the compressed image so as to output an image that is as close as possible to the original input image.
For example, the image input into the first network model may be a 100×100 RGB image with a data size of 3×100×100. After the image is input into the first network model, it is compressed by the first network structure into a compressed image with a data size of 1×50×50 while the valid features in the original image are extracted; the second network structure then processes the compressed image using these valid features and outputs an image with the features restored.
During specific training, a first network model meeting the requirements can be obtained through deep learning by using the first sample high-definition image and the first initial network model.
Each time the model is trained once with a first sample high-definition image, the loss function value can be calculated from the image input this time and the image output this time; whether the loss function value is smaller than or equal to a preset tolerance threshold is then detected. When the loss function value is less than or equal to the preset tolerance threshold, training is determined to be finished and the current network model is taken as the first network model meeting the requirements; otherwise, training of the current model continues with a new first sample high-definition image. In this way, training forces the first network model to learn to extract valid features that work well when the image is restored.
Based on this embodiment, the first sample high-definition images and model training can be used to obtain a preset style feature extraction network model whose accuracy meets the requirements.
In implementation, the first network model obtained by training first performs feature compression on the input original image through the first network structure, and then performs feature recovery through the second network structure on the compressed low-resolution image together with the valid features, so as to restore the original image. When the first network structure performs feature compression on the input original image, the corresponding high-frequency image features (i.e., the valid features) are automatically extracted from the original image. In this embodiment, the first network structure may be intercepted from the first network model, and the network structure for extracting high-frequency image features within it may be used to generate the required preset style feature extraction network model.
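A minimal PyTorch sketch of this training-and-interception flow, assuming illustrative layer sizes that reproduce the 3×100×100 → 1×50×50 example above; the actual U-Net-like architecture of the patent is not specified in this text.

```python
# Sketch of the first network model: a first network structure that compresses a
# 3x100x100 image to a 1x50x50 feature map, and a second network structure that
# restores it. Layer choices are assumptions; only the train-then-intercept flow
# follows the description.
import torch
import torch.nn as nn

class FirstNetworkModel(nn.Module):
    def __init__(self):
        super().__init__()
        # first network structure: feature compression (3x100x100 -> 1x50x50)
        self.first_structure = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )
        # second network structure: feature recovery back to 3x100x100
        self.second_structure = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(16, 3, 3, padding=1),
        )

    def forward(self, x):
        return self.second_structure(self.first_structure(x))

model = FirstNetworkModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
tolerance = 1e-3                                   # preset tolerance threshold

for step in range(1000):                           # stand-in for a sample loader
    sample_hd = torch.rand(4, 3, 100, 100)         # hypothetical first sample HD images
    loss = loss_fn(model(sample_hd), sample_hd)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() <= tolerance:                   # training finished
        break

# Intercept the first network structure as the style feature extraction model.
style_feature_extractor = model.first_structure
```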
In some embodiments, referring to fig. 7, after the network model is extracted by constructing the preset style characteristics meeting the requirements, the method may further include the following when implemented:
S1: constructing a second initial network model based on a dense residual structure; wherein the second initial network model comprises at least a first input interface and a second input interface;
S2: connecting an output interface of a preset style characteristic extraction network model with a first input interface of a second initial network model;
S3: obtaining a second sample high-definition image; correspondingly processing the second sample high-definition image to obtain a corresponding second sample low-fraction image;
S4: and inputting a second sample high-definition image into a preset style feature extraction network model, and simultaneously inputting a second sample low-resolution image corresponding to the second sample high-definition image into a second initial network model through a second input interface, and training the second initial network model to obtain a preset super-resolution network model meeting the requirements.
Based on the embodiment, the trained preset style characteristic extraction network model can be used jointly, so that the preset super-resolution network model with a good effect and associated with the preset style characteristic extraction network model can be trained efficiently.
In some embodiments, when specifically constructing the second initial network model based on the dense residual structure, an initial shallow feature extraction layer may first be connected to an initial intermediate feature extraction structure, in which a plurality of initial improved information distillation modules (e.g., RRLFB) are nested. Specifically, each initial improved information distillation module at least comprises an initial first feature refinement module, an initial second feature refinement module, an initial connection layer, an initial second convolution layer and an initial ESA module; the initial first feature refinement module is connected with the initial second feature refinement module and the initial connection layer respectively through a channel separation structure; the initial second feature refinement module is connected with the initial connection layer, the initial connection layer is connected with the initial second convolution layer, and the initial second convolution layer is connected with the initial ESA module. The ESA (Enhanced Spatial Attention) module may also be referred to as an enhanced spatial attention module, and may perform channel-dimension-based feature processing on the features output by the second convolution layer.
Then, the initial intermediate feature extraction structure may be connected to an initial first convolution layer; and connecting the initial first convolution layer with an up-sampling operator to obtain a second initial network model.
Based on the mode, a second initial network model which is stable and has high operation efficiency can be constructed.
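A rough sketch of one such improved information distillation module, assuming illustrative channel counts and a simplified ESA gate; only the overall wiring (feature refinement, channel separation, connection layer, second convolution layer, ESA) follows the description above.

```python
# Rough sketch of an improved information distillation module: a channel-separation
# structure splits the refined features, one part going to the connection layer and
# one part into the second feature refinement module; the concatenated result passes
# through the second convolution layer and an ESA gate. All sizes are assumptions.
import torch
import torch.nn as nn

class SimplifiedESA(nn.Module):
    """Toy attention gate standing in for the ESA module."""
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels // 4, 1)
        self.expand = nn.Conv2d(channels // 4, channels, 1)

    def forward(self, x):
        attn = torch.sigmoid(self.expand(torch.relu(self.reduce(x))))
        return x * attn

class ImprovedDistillationModule(nn.Module):
    def __init__(self, channels=48):
        super().__init__()
        self.refine1 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.refine2 = nn.Sequential(nn.Conv2d(channels // 2, channels // 2, 3, padding=1), nn.ReLU())
        self.fuse_conv = nn.Conv2d(channels, channels, 1)   # second convolution layer
        self.esa = SimplifiedESA(channels)

    def forward(self, x):
        refined = self.refine1(x)
        # channel separation: half goes to the connection layer,
        # half continues into the second feature refinement module
        distilled, remaining = torch.chunk(refined, 2, dim=1)
        remaining = self.refine2(remaining)
        fused = torch.cat([distilled, remaining], dim=1)     # connection layer
        return self.esa(self.fuse_conv(fused)) + x           # residual connection (assumed)

features = torch.rand(1, 48, 64, 64)
print(ImprovedDistillationModule()(features).shape)          # torch.Size([1, 48, 64, 64])
```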
In specific training, referring to fig. 8, the second sample high-definition image may be input to a preset style feature extraction network model, and further, the second sample style feature obtained by processing the second sample high-definition image based on the preset style feature extraction network model may be input as a first set of sample input data to the second initial network model through the first input interface; and meanwhile, a second sample low-resolution image obtained through downsampling based on the second sample high-definition image is used as a second group of sample input data and is input into a second initial network model through a second input interface.
In this way, the second initial network model can obtain and output the corresponding second sample recovery image through recovery processing according to the second sample style characteristics and the second sample low-resolution image which are input simultaneously and correspond to the same second sample high-definition image.
A loss function value is then calculated, by using the corresponding loss function, according to the second sample high-definition image and the second sample recovery image; and the model parameters of the second initial network model are adjusted by using the optimizer and the loss function value, so as to complete one round of model training.
According to the mode, the second initial network model is subjected to multiple rounds of iterative model training by utilizing the plurality of second sample high-definition images and the corresponding second sample low-resolution images, so that a second network model meeting the requirements is obtained and is used as a required preset super-resolution network model.
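A condensed sketch of one training round as described above, where `style_feature_extractor` stands for the previously obtained preset style feature extraction network model and `super_resolution_model` for the second initial network model with its two input interfaces; both names, the 0.5× downsampling factor and the L1 reconstruction loss are assumptions for illustration.

```python
# One joint-training round: the second sample HD image goes through the (frozen)
# style feature extraction model into the first input interface, the matching
# low-resolution image goes into the second input interface, and the restored
# output is compared with the HD image.
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_one_round(super_resolution_model, style_feature_extractor,
                    sample_hd, optimizer, loss_fn=nn.L1Loss()):
    # second sample low-resolution image obtained by downsampling the HD sample
    sample_lr = F.interpolate(sample_hd, scale_factor=0.5,
                              mode="bicubic", align_corners=False)
    with torch.no_grad():                                  # extractor is not updated here
        style = style_feature_extractor(sample_hd)         # first input interface
    restored = super_resolution_model(sample_lr, style)    # second input interface
    loss = loss_fn(restored, sample_hd)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```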
In some embodiments, when implemented, a contrast loss function may be utilized in place of a conventional loss function.
Correspondingly, during specific training, a corresponding second sample positive anchor point image and a corresponding second sample negative anchor point image can be obtained according to the second sample high-definition image; then, calculating a corresponding loss function value according to a second sample recovery image, a second sample positive anchor point image and a second sample negative anchor point image which are output by the second initial network model by utilizing the contrast loss function; and adjusting model parameters of the second initial network model by using the optimizer and the loss function value to complete one round of model training.
Specifically, by using the contrast loss function, the corresponding loss function value may be calculated according to the following formula from the second sample recovery image output by the second initial network model, the second sample positive anchor point image, and the second sample negative anchor point image:
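The formula itself is not reproduced in this text; based on the variable definitions below, a contrastive loss of the following commonly used form would be consistent with the description (given here as an assumption rather than a verbatim reconstruction):

CL = \sum_i \lambda_i \cdot d\big(\phi_i(Y_{anchor}), \phi_i(Y_{pos})\big) \, / \, d\big(\phi_i(Y_{anchor}), \phi_i(Y_{neg})\big)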
Where CL denotes the loss function value, φ_i denotes the image feature extraction structure numbered i, λ_i denotes the weight coefficient corresponding to φ_i, Y_anchor denotes the second sample recovery image, Y_pos denotes the second sample positive anchor image, Y_neg denotes the second sample negative anchor image, and d(x, y) denotes the L1 distance (or image Euclidean distance) between image feature x and image feature y.
In specific implementation, the second sample high-definition image can be used as a second sample positive anchor image, and the corresponding second sample low-resolution image can be used as a second sample negative anchor image. Further, the conv_k3s1-Tanh-conv_k3s1 structure may also be used as an image feature extraction function in the image feature extraction structure.
Based on the above embodiment, by using a contrast loss function instead of a conventional loss function, the model can be guided, in the latent representation space, to push the positive target towards the anchor point and the negative target away from the anchor point. In this way, the model can focus on generating the image texture layer during training, so that the detail characteristics of the image texture can be recovered more finely by the trained model.
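A small sketch of such a contrast loss, assuming a single conv_k3s1-Tanh-conv_k3s1 feature extractor as mentioned above and equal spatial sizes for the three inputs (i.e., the low-resolution negative anchor is assumed to have been upsampled back to the restored image's size); channel counts and the weight are illustrative.

```python
# Contrast loss sketch: the restored image is pulled toward the positive anchor
# (the HD image) and pushed away from the negative anchor in the feature space of
# a conv_k3s1-Tanh-conv_k3s1 extractor, using L1 distances as in the text.
import torch
import torch.nn as nn

feature_extractor = nn.Sequential(               # conv_k3s1 - Tanh - conv_k3s1
    nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),
    nn.Tanh(),
    nn.Conv2d(16, 16, kernel_size=3, stride=1, padding=1),
)

def contrastive_loss(restored, positive_anchor, negative_anchor, weight=1.0):
    # all three images are assumed to share the same spatial size
    f_anchor = feature_extractor(restored)
    f_pos = feature_extractor(positive_anchor)
    f_neg = feature_extractor(negative_anchor)
    d_pos = torch.mean(torch.abs(f_anchor - f_pos))    # L1 distance to positive anchor
    d_neg = torch.mean(torch.abs(f_anchor - f_neg))    # L1 distance to negative anchor
    return weight * d_pos / (d_neg + 1e-8)             # pull toward pos, push from neg
```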
In some embodiments, when specifically building the second initial network model, the different feature refinement modules and the connection layer within the improved information distillation module structure may be connected through a channel separation structure. In this way, the computational complexity can be reduced while the stability of the feature extraction capability is ensured.
In some embodiments, when specifically building the second initial network model, a re-parameterized structure may be employed; during model training, the re-parameterized structures are merged at inference time according to the training process data, so as to improve the training speed of the model.
In some embodiments, when specifically training the preset super-resolution network model, the parameters of the optimizer currently in use may be re-initialized according to the corresponding training rule each time model training of a specified number of rounds is completed, so as to implement multi-stage warm-start training. In this way, the model can be guided to leave local optima and achieve better generalization.
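A toy sketch of the multi-stage warm-start idea, assuming a hypothetical per-round training function; the stage length and learning rate are illustrative.

```python
# Multi-stage warm start: the optimizer state is re-initialized at the start of
# every stage so the model can escape local optima.
import torch

def warm_start_training(model, train_one_round_fn, stages=3, rounds_per_stage=200, lr=2e-4):
    for stage in range(stages):
        # re-initialize the optimizer (and its state) at every stage boundary
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(rounds_per_stage):
            train_one_round_fn(model, optimizer)
```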
In some embodiments, after training the second initial network model to obtain a preset super-resolution network model meeting the requirements, the method may further include the following when implemented:
s1: acquiring a third sample high-definition image;
S2: extracting a network model and the third sample high-definition image by using a preset super-resolution network model and preset style characteristics to perform a repair test to obtain a corresponding repair test result;
s3: and adjusting the downsampling parameters according to the repair test result.
When the repair test is specifically performed, the preset style feature extraction network model is used to process the third sample high-definition image to obtain corresponding style characteristics; corresponding compression encoding processing is performed on the third sample high-definition image to obtain a corresponding third sample compressed image; then the preset super-resolution network model is used to process the third sample compressed image and the style characteristics to obtain a corresponding third sample recovery image; and a corresponding repair test result is determined according to the third sample high-definition image and the third sample recovery image.
The repair test result may specifically include: a peak signal-to-noise ratio (PSNR) and/or a structural similarity (SSIM) of the images, etc. The repair test result is used to characterize the degree of similarity in definition between the third sample high-definition image and the third sample recovery image.
In a specific implementation, when it is determined that the degree of similarity of sharpness between the third sample high-definition image and the third sample recovery image is low (for example, lower than a preset degree threshold) according to the repair test result, the downsampling parameter may be adjusted in a targeted manner according to the repair test result. Correspondingly, the subsequent up-sampling parameters can be correspondingly adjusted according to the adjusted down-sampling parameters.
The downsampling parameter may specifically include a downsampling ratio.
Based on the above embodiment, by introducing and using the repair test, the downsampling parameter can be adjusted in a targeted manner, so that the downsampling process can be performed more accurately later.
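A brief sketch of such a repair test using scikit-image for PSNR/SSIM, together with one possible (assumed) way of relaxing the downsampling ratio when the similarity is too low; the threshold and step size are illustrative.

```python
# Repair test metrics between the third sample HD image and the restored image,
# plus a simple, assumed adjustment rule for the downsampling ratio.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def repair_test(hd_image: np.ndarray, restored_image: np.ndarray):
    psnr = peak_signal_noise_ratio(hd_image, restored_image, data_range=255)
    ssim = structural_similarity(hd_image, restored_image, channel_axis=-1, data_range=255)
    return psnr, ssim

def adjust_downsampling_ratio(ratio, psnr, psnr_threshold=30.0, step=0.05):
    # if similarity is too low, downsample less aggressively next time
    return min(1.0, ratio + step) if psnr < psnr_threshold else ratio
```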
In some embodiments, after obtaining the preset super-resolution network model meeting the requirements, when the method is implemented, the method may further include:
transmitting the preset super-resolution network model to a decoding end; and the decoding end receives and stores the preset super-resolution network model.
Correspondingly, the encoding end can hold and store a preset style feature extraction network model associated with the preset super-resolution network model.
The decoding end may be specifically understood as a terminal device capable of performing decoding processing on the second video encoded data. For example, it may be a user terminal such as a smart phone. In addition, the decoding end also has certain data processing performance.
Based on the above embodiment, after receiving the second video coding data about the high-definition target video data sent by the encoding end, the decoding end can quickly and conveniently decode and recover the high-definition target video data locally by using the preset super-resolution network model.
From the above, before the video coding method provided by the embodiments of the present disclosure is implemented, the preset style feature extraction network model and the preset super-resolution network model, which work well and are associated with each other, may be obtained by means of joint training. In implementation, after obtaining high-definition target video data, the encoding end can perform compression encoding on the target video data to obtain corresponding first video coding data; then process a first key frame code in the first video coding data by using the preset style feature extraction network model to obtain corresponding style characteristics; and generate, according to the first video coding data and the style characteristics, second video coding data which has a smaller data amount than the original target video data and contains the style characteristics associated with the first key frame code. The decoding end can decode and recover the high-definition target video data locally according to the preset super-resolution network model and the received second video coding data. In this way, the performance and computing power of the encoding end and the decoding end can be effectively utilized, the waste of terminal performance resources at the decoding end is avoided, and a user can smoothly watch high-definition video online in a weak network environment, thereby obtaining a better video service experience.
Referring to fig. 9, the embodiment of the present disclosure further provides a video decoding method. The method can be applied to a decoding end. In particular implementations, the method may include the following:
s901: receiving second video encoding data;
S902: extracting a first key frame code after downsampling and corresponding style characteristics from the second video coding data;
S903: processing the first key frame codes after downsampling and corresponding style characteristics by using a preset super-resolution network model, and recovering to obtain the first key frame codes;
s904: and performing corresponding decoding processing according to the first key frame coding and the second video coding data to obtain target video data meeting the requirements.
The first key frame code may be, for example, a high definition image frame.
In a specific implementation, the downsampled first image codes in the second video encoding data other than the downsampled first key frame code (which may also be referred to as second image codes) can be decoded and recovered according to a preset compression coding rule by using the recovered first key frame code, so as to obtain the corresponding higher-definition first image codes. The first key frame code and the other first image codes are then combined in sequence according to the time information of the first image codes, so as to recover the high-definition target video data meeting the requirements.
Based on the above embodiment, the user terminal may decode and recover the received second video encoded data in a local efficient and convenient manner by using a preset super-resolution network model to obtain high-definition target video data meeting the requirements.
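To make the above decoding-end flow concrete, the following is a minimal Python sketch of steps S901 to S904. All names here (parse_stream, sr_model, decode_frame, and the tuple layout of the parsed codes) are illustrative assumptions rather than any actual codec or library API.

```python
from typing import Callable, List

def decode_target_video(
    second_video_data: bytes,
    parse_stream: Callable,   # splits the bitstream into (ds_key_frame, style_feature, other_codes)
    sr_model: Callable,       # preset super-resolution network model
    decode_frame: Callable,   # ordinary decoder (e.g. an H.264 decoder) for the non-key frames
) -> List:
    # S901/S902: receive the data and extract the downsampled first key frame code
    # together with the style feature stored immediately after it in the stream.
    ds_key_frame, style_feature, other_codes = parse_stream(second_video_data)

    # S903: recover the high-definition first key frame code, using the style
    # feature as guidance for the super-resolution network.
    key_frame = sr_model(ds_key_frame, style_feature)

    # S904: decode the remaining first image codes against the recovered key
    # frame, then reassemble all frames in time order.
    frames = [(0, key_frame)]
    for timestamp, code in other_codes:
        frames.append((timestamp, decode_frame(code, reference=key_frame)))
    frames.sort(key=lambda item: item[0])
    return [frame for _, frame in frames]
```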
In some embodiments, the extracting the downsampled first key frame code and the corresponding style feature from the second video encoding data may include: acquiring and determining the downsampled first key frame code according to the identification information of the downsampled first image codes in the second video encoding data; and acquiring, according to the position information of the downsampled first key frame code, the style feature associated with the first key frame code from the position immediately adjacent behind the downsampled first key frame code.
In some embodiments, after corresponding decoding processing is performed according to the first key frame code and the second video encoded data to obtain target video data meeting the requirements, the target video data can be played and displayed to the user. In this way, the video loading time can be reduced, and the user can smoothly watch the high-definition target video online through the decoding end, thereby obtaining a better video service experience.
In some embodiments, the preset super-resolution network model may at least include: a shallow feature extraction layer, a plurality of improved information distillation modules connected in a nested manner, a first convolution layer, an upsampling operator and other structures; wherein the plurality of improved information distillation modules connected in the nested manner form an intermediate feature extraction structure;
correspondingly, the above-mentioned processing the downsampled first key frame code and the corresponding style features by using the preset super-resolution network model, to recover the first key frame code, where the specific implementation may include the following:
s1: processing the first key frame code after downsampling and corresponding style characteristics by using a shallow layer characteristic extraction layer in a preset super-resolution network model, and outputting to obtain initial characteristics;
S2: performing feature processing of multiple rounds of iteration on the initial features by utilizing a plurality of improved information distillation modules connected in a nested manner, and outputting to obtain corresponding intermediate features;
S3: carrying out convolution processing on the intermediate features by using a first convolution layer to obtain a corresponding intermediate convolution result;
S4: and recovering to obtain a corresponding first key frame code through upsampling based on the intermediate convolution result by using an upsampling operator.
The upsampling operator may specifically be a PixelShuffle-based upsampling operator.
Based on the above embodiment, the first key frame code with higher definition can be quickly and accurately decoded and recovered by using the preset super-resolution network model according to the first key frame code after downsampling and the associated style characteristics.
In some embodiments, when the method is implemented, the downsampled first key frame code and the associated style feature are processed by using a preset super-resolution network model, and the first key frame code is recovered by the following formula:
F_in = h_ext(I_LR, F_i),
F_n = RRLFB_n(F_in) = RRLFB_{n-1}( ... RRLFB_0(F_in) ... ),
I_SR = Upsample(Conv3(F_n) + F_0)
Wherein, F_in represents the initial feature, h_ext represents the shallow feature extraction layer, I_LR represents the downsampled first key frame code, F_i represents the style feature, RRLFB_n represents the intermediate feature extraction structure, which is obtained by nesting n improved information distillation modules (RRLFB) numbered 0 to n-1, F_n represents the intermediate feature, I_SR represents the recovered high-definition first key frame code, Upsample represents the upsampling operator, Conv3 represents the convolution layer described above, and F_0 represents the initial low-frequency feature. Specifically, the initial low-frequency feature F_0 may be extracted from the initial feature F_in.
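As an illustration only, the forward process described by the above formulas could be organized as in the following PyTorch-style sketch. The channel counts, the way the style feature F_i is concatenated with I_LR, the use of F_in as the initial low-frequency feature F_0, and the simplified residual blocks standing in for the RRLFB modules are all assumptions, not the exact patented structure.

```python
import torch
import torch.nn as nn

class SuperResolutionSketch(nn.Module):
    """Hypothetical layout following F_in = h_ext(I_LR, F_i), F_n = RRLFB_{n-1}(...RRLFB_0(F_in)),
    I_SR = Upsample(Conv3(F_n) + F_0). Not the original network."""
    def __init__(self, channels: int = 48, num_rrlfb: int = 4, scale: int = 2):
        super().__init__()
        # h_ext: shallow feature extraction; I_LR (3 channels) and the style feature F_i
        # (assumed to be a feature map with `channels` channels and the same spatial size)
        # are concatenated along the channel dimension.
        self.shallow = nn.Conv2d(3 + channels, channels, kernel_size=3, padding=1)
        # Nested RRLFB_0 ... RRLFB_{n-1}, abbreviated here as plain residual blocks.
        self.rrlfbs = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 1),
            ) for _ in range(num_rrlfb)
        )
        # First convolution layer applied to the intermediate feature (Conv3 in the formulas).
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # PixelShuffle-based upsampling operator.
        self.upsample = nn.Sequential(
            nn.Conv2d(channels, 3 * scale * scale, kernel_size=3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, i_lr: torch.Tensor, f_i: torch.Tensor) -> torch.Tensor:
        f_in = self.shallow(torch.cat([i_lr, f_i], dim=1))  # F_in = h_ext(I_LR, F_i)
        f_0 = f_in                                           # F_0 assumed extracted directly from F_in
        f_n = f_in
        for block in self.rrlfbs:                            # nested RRLFB modules
            f_n = block(f_n) + f_n
        return self.upsample(self.conv(f_n) + f_0)           # I_SR = Upsample(Conv3(F_n) + F_0)
```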
In some embodiments, the improved information distillation module may include at least: a first feature refinement module (e.g., which may be denoted as RM1), a second feature refinement module (e.g., which may be denoted as RM2), a connection layer (e.g., Concat), a second convolution layer (e.g., Conv1), and an ESA module;
the first feature refinement module and the second feature refinement module may specifically be composed of 3×3 and 1×1 convolutions and a ReLU activation function, and are used for extracting local refinement features.
The first feature refinement module is respectively connected with the second feature refinement module and the connecting layer through a channel separation structure (for example, CHANNEL SPLIT); the second feature refinement module is connected with at least a connection layer, the connection layer is connected with a second convolution layer, and the second convolution layer is connected with the ESA module.
Wherein the first feature refinement module and the second feature refinement module may be specifically configured to extract different local refinement features (e.g., F_refined1 and F_refined2). The connection layer is used for connecting the different local refinement features output by the different feature refinement modules to obtain a joint local refinement feature (e.g., F_refined). The second convolution layer is used for carrying out a convolution operation on the joint local refinement feature to obtain a corresponding convolution result.
Based on the above embodiments, by introducing and using the improved information distillation module, the calculation amount of the model can be reduced, and at the same time, the stability of the feature extraction capability can still be ensured.
In particular, the improved information distillation module may further include: a third feature refinement module (e.g., which may be denoted as RM3). Specifically, referring to fig. 10, the first feature refinement module may be connected to the second feature refinement module and the connection layer through a channel separation structure, respectively; the second feature refinement module may be connected to the third feature refinement module and the connection layer through a channel separation structure, respectively; the third feature refinement module may also be connected to the connection layer and a convolution layer (Conv-3), respectively, via a channel separation structure. The connection layer is in turn connected to the second convolution layer. The second convolution layer is in turn connected to the ESA module.
Accordingly, in the specific implementation, the corresponding first intermediate feature can be obtained and output by using the first improved information distillation module in the intermediate feature extraction structure according to the following formula:
F_refined1 = RM1(F_in),
F_refined2 = RM2(F_refined1),
F_refined3 = RM3(F_refined2),
F_out = ESA(Conv1(F_refined))
Wherein, F_refined1, F_refined2 and F_refined3 are the first, second and third local refinement features extracted and output by the first, second and third feature refinement modules, respectively; F_refined is the joint local refinement feature output by the connection layer; and F_out is the first intermediate feature output by the improved information distillation module.
After the first intermediate feature is obtained, the second improved information distillation module in the intermediate feature extraction structure can receive the first intermediate feature, and further intermediate feature extraction is performed by using the first intermediate feature to replace the initial feature according to the mode, so that a second intermediate feature with better effect is obtained. And so on, the final intermediate feature may ultimately be output by the intermediate feature extraction structure described above.
Based on the above embodiment, by using the improved information distillation module and introducing and using the channel separation structure, the calculation amount can be reduced, the running efficiency of the model can be improved, and meanwhile, the intermediate features with better effects can be extracted more stably and reliably.
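For illustration, one possible arrangement of such an improved information distillation module with channel separation is sketched below in PyTorch. The channel counts, the split sizes, the residual connection, and the simplified stand-in for the ESA module are assumptions rather than the exact structure of fig. 10.

```python
import torch
import torch.nn as nn

class RefineModule(nn.Module):
    # RM: 3x3 and 1x1 convolutions with a ReLU activation, extracting local refinement features.
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),
        )
    def forward(self, x):
        return self.body(x)

class SimpleESA(nn.Module):
    # Simplified stand-in for the ESA (spatial attention) module; an assumption, not the original design.
    def __init__(self, channels: int):
        super().__init__()
        self.attn = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())
    def forward(self, x):
        return x * self.attn(x)

class RRLFBSketch(nn.Module):
    """Sketch of an improved information distillation block; assumes channels == 3 * distill."""
    def __init__(self, channels: int = 48, distill: int = 16):
        super().__init__()
        self.distill = distill
        self.rm1 = RefineModule(channels)                  # first feature refinement module
        self.rm2 = RefineModule(channels - distill)        # second feature refinement module
        self.rm3 = RefineModule(channels - 2 * distill)    # third feature refinement module
        self.conv1 = nn.Conv2d(3 * distill, channels, 1)   # second convolution layer (1x1)
        self.esa = SimpleESA(channels)

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:
        f1 = self.rm1(f_in)
        d1, r1 = torch.split(f1, [self.distill, f1.shape[1] - self.distill], dim=1)  # channel split
        f2 = self.rm2(r1)
        d2, r2 = torch.split(f2, [self.distill, f2.shape[1] - self.distill], dim=1)  # channel split
        d3 = self.rm3(r2)
        f_refined = torch.cat([d1, d2, d3], dim=1)         # connection layer (Concat)
        return self.esa(self.conv1(f_refined)) + f_in      # F_out = ESA(Conv1(F_refined)) + residual
```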
The embodiment of the present disclosure further provides a server, and in particular, may refer to fig. 11. The server includes a network communication port 1101, a processor 1102, and a memory 1103, where the foregoing structures are connected by internal cables, so that each structure may perform specific data interaction.
Wherein, the network communication port 1101 may be specifically configured to acquire target video data;
The processor 1102 may be specifically configured to perform compression encoding on the target video data to obtain first video encoded data related to the target video data; wherein the first video encoding data comprises a plurality of first image encodings; extracting a network model by using preset style characteristics to process a first key frame code in first video coding data to obtain corresponding style characteristics; wherein the first key frame code is used for restoring and processing other first image codes; generating second video coding data related to target video data according to the first video coding data and the style characteristics;
The network communication port 1101 may be further configured to send the second video encoding data to a user terminal; the user terminal reconstructs the target video data according to a preset super-resolution network model and the second video encoding data; and the preset super-resolution network model is associated with the preset style feature extraction network model.
The memory 1103 may be specifically configured to store a corresponding program of instructions.
In this embodiment, the network communication port 1101 may be a virtual port that binds with different communication protocols, so that different data may be sent or received. For example, the network communication port may be a port responsible for performing web data communication, a port responsible for performing FTP data communication, or a port responsible for performing mail data communication. The network communication port may also be a physical communication interface or a communication chip. For example, it may be a wireless mobile network communication chip, such as GSM, CDMA, etc.; it may also be a Wi-Fi chip; it may also be a Bluetooth chip.
In this embodiment, the processor 1102 may be implemented in any suitable manner. For example, a processor may take the form of a microprocessor or processor and a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, application-specific integrated circuits (ASICs), programmable logic controllers, embedded microcontrollers, and the like. The description is not intended to be limiting.
In this embodiment, the memory 1103 may include multiple levels, and in a digital system, the memory may be any memory as long as it can hold binary data; in an integrated circuit, a circuit with a memory function without a physical form is also called a memory, such as a RAM, a FIFO, etc.; in the system, the storage device in physical form is also called a memory, such as a memory bank, a TF card, and the like.
The embodiment of the specification also provides a user terminal, which comprises a processor and a memory for storing instructions executable by the processor, wherein the following steps are realized when the processor executes the instructions: receiving second video encoding data; extracting a first key frame code after downsampling and corresponding style characteristics from the second video coding data; processing the first key frame codes after downsampling and corresponding style characteristics by using a preset super-resolution network model, and recovering to obtain the first key frame codes; and performing corresponding decoding processing according to the first key frame coding and the second video coding data to obtain target video data meeting the requirements.
The embodiments of the present specification also provide a computer readable storage medium based on the video encoding method described above, the computer readable storage medium storing computer program instructions that when executed implement: acquiring first video encoding data regarding target video data; wherein the first video encoding data comprises a plurality of first image encodings; extracting a network model by using preset style characteristics to process a first key frame code in first video coding data to obtain corresponding style characteristics; wherein the first key frame code is used for restoring and processing other first image codes; generating second video coding data related to target video data according to the first video coding data and the style characteristics.
The embodiments of the present specification also provide a computer readable storage medium based on the video decoding method described above, the computer readable storage medium storing computer program instructions that when executed implement: receiving second video encoding data; extracting a first key frame code after downsampling and corresponding style characteristics from the second video coding data; processing the first key frame codes after downsampling and corresponding style characteristics by using a preset super-resolution network model, and recovering to obtain the first key frame codes; and performing corresponding decoding processing according to the first key frame coding and the second video coding data to obtain target video data meeting the requirements.
In the present embodiment, the storage medium includes, but is not limited to, a random access memory (Random Access Memory, RAM), a read-only memory (Read-Only Memory, ROM), a cache (Cache), a hard disk (Hard Disk Drive, HDD), or a memory card (Memory Card). The memory may be used to store computer program instructions. The network communication unit may be an interface for performing network connection communication, which is set in accordance with a standard prescribed by a communication protocol.
In this embodiment, the functions and effects of the program instructions stored in the computer readable storage medium may be explained in comparison with other embodiments, and are not described herein.
Referring to fig. 12, the embodiment of the present disclosure further provides a video encoding device, which may specifically include the following structural modules:
The acquiring module 1201 may be specifically configured to acquire first video encoding data related to target video data; wherein the first video encoding data comprises a plurality of first image encodings;
the processing module 1202 may be specifically configured to process a first key frame code in the first video coding data by using a preset style feature extraction network model to obtain a corresponding style feature; wherein the first key frame code is used for restoring and processing other first image codes;
The encoding module 1203 may be specifically configured to generate second video encoding data related to the target video data according to the first video encoding data and the style characteristics.
In some embodiments, when the encoding module 1203 is specifically implemented, the second video encoding data related to the target video data may be generated according to the first video encoding data and the style characteristics in the following manner: downsampling the first video coding data to obtain downsampled first video coding data; determining the position of the downsampled first key frame code in the downsampled first video coding data according to a preset adding rule; and adding style characteristics corresponding to the first key frame codes to the position, adjacent to the rear position, of the first key frame codes after downsampling in the first video coding data after downsampling, so as to obtain the second video coding data.
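As a rough illustration of this generation step, the following Python sketch downsamples every first image code and inserts the style feature immediately after each downsampled first key frame code; the container format (plain dictionaries) and the helper names are assumptions.

```python
from typing import List

def build_second_video_data(
    first_image_codes: List[dict],  # each item: {"id": ..., "is_key": bool, "data": ..., "t": ...}
    style_features: dict,           # style feature per key frame id, from the style network
    downsample,                     # callable applying the preset downsampling (e.g. bicubic) to one code
) -> List[dict]:
    """Hedged sketch: downsample each first image code, then place the style feature
    at the position adjacent to (right after) its downsampled first key frame code."""
    second_video_data = []
    for code in first_image_codes:
        ds_code = dict(code, data=downsample(code["data"]))
        second_video_data.append(ds_code)
        if code["is_key"]:
            # Preset adding rule assumed here: style feature follows its key frame code.
            second_video_data.append({
                "type": "style_feature",
                "key_id": code["id"],
                "data": style_features[code["id"]],
            })
    return second_video_data
```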
In some embodiments, when the processing module 1202 is specifically implemented, the first key frame code in the first video coding data may be processed by using the preset style feature extraction network model in the following manner, so as to obtain the corresponding style feature: identifying and determining a first key frame code in the first video coding data; and processing the first key frame code by using a preset style characteristic extraction network model to obtain corresponding style characteristics.
In some embodiments, when the processing module 1202 is implemented, the first key frame code in the first video coding data may be identified and determined as follows: extracting identification information of a first image code in first video code data; identifying and determining a first key frame code in the first video coding data according to the identification information; or processing the first video coding data by using a preset classifier to obtain a corresponding coding classification result; and determining a first key frame code in the first video coding data according to the code classification result.
In some embodiments, when the above-mentioned acquisition module 1201 is embodied, the first video encoding data about the target video data may be acquired as follows: acquiring target video data; and carrying out compression coding on the target video data according to a preset coding rule to obtain corresponding first video coding data.
In some embodiments, the apparatus, when embodied, may also be used to: constructing a first initial network model; acquiring and training the first initial network model by using a first sample high-definition image to obtain a first network model meeting the requirements; wherein the first network model comprises at least a first network structure and a second network structure; the first network structure is used for carrying out feature compression on the high-definition image of the first sample; the second network structure is used for carrying out feature recovery on the compressed image; intercepting a first network structure from the first network model; and constructing and obtaining a preset style characteristic extraction network model meeting the requirements according to the first network structure.
In some embodiments, after the preset style feature extraction network model meeting the requirements is constructed, the device may be further configured to: construct a second initial network model based on a dense residual structure, wherein the second initial network model comprises at least a first input interface and a second input interface; connect an output interface of the preset style feature extraction network model with the first input interface of the second initial network model; obtain a second sample high-definition image, and correspondingly process the second sample high-definition image to obtain a corresponding second sample low-resolution image; and input the second sample high-definition image into the preset style feature extraction network model while inputting the second sample low-resolution image corresponding to the second sample high-definition image into the second initial network model through the second input interface, and train the second initial network model to obtain a preset super-resolution network model meeting the requirements.
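A minimal training-loop sketch of this joint arrangement is given below, assuming PyTorch. The choice of an L1 reconstruction loss, the optimizer settings, and keeping the style feature extraction network frozen during this stage are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_super_resolution(style_net, sr_net, loader, epochs: int = 1, lr: float = 1e-4):
    """loader yields (hr_image, lr_image) pairs: the second sample high-definition image
    and its corresponding second sample low-resolution image."""
    style_net.eval()  # the preset style feature extraction network is kept fixed here (assumption)
    optimizer = torch.optim.Adam(sr_net.parameters(), lr=lr)
    l1 = nn.L1Loss()
    for _ in range(epochs):
        for hr_image, lr_image in loader:
            with torch.no_grad():
                # Output interface of the style network feeds the first input interface.
                style_feature = style_net(hr_image)
            # The second input interface receives the low-resolution sample.
            sr_image = sr_net(lr_image, style_feature)
            loss = l1(sr_image, hr_image)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return sr_net
```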
In some embodiments, after obtaining the preset super-resolution network model meeting the requirements, the apparatus may be further configured to: transmitting the preset super-resolution network model to a decoding end; and the decoding end receives and stores the preset super-resolution network model.
Referring to fig. 13, the embodiment of the present disclosure further provides a video decoding apparatus, which may specifically include the following structural modules:
the receiving module 1301 may be specifically configured to receive second video encoded data;
The extracting module 1302 may be specifically configured to extract the downsampled first key frame code and the corresponding style feature from the second video encoding data;
The processing module 1303 may be specifically configured to restore and obtain the first key frame code by processing the first key frame code after downsampling and the corresponding style feature by using a preset super-resolution network model;
the decoding module 1304 may be specifically configured to perform corresponding decoding processing according to the first key frame code and the second video encoded data, so as to obtain target video data meeting the requirements.
In some embodiments, the preset super-resolution network model may at least include: the device comprises a shallow characteristic extraction layer, a plurality of improved information distillation modules, a first convolution layer, an up-sampling operator and other structures, wherein the improved information distillation modules are connected in a nested manner;
Correspondingly, when the processing module 1303 is specifically implemented, the first key frame code after downsampling and the corresponding style feature may be processed by using a preset super-resolution network model in the following manner, so as to recover and obtain the first key frame code: processing the first key frame code after downsampling and corresponding style characteristics by using a shallow layer characteristic extraction layer in a preset super-resolution network model, and outputting to obtain initial characteristics; performing feature processing of multiple rounds of iteration on the initial features by utilizing a plurality of improved information distillation modules connected in a nested manner, and outputting to obtain corresponding intermediate features; carrying out convolution processing on the intermediate features by using a first convolution layer to obtain a corresponding intermediate convolution result; and recovering to obtain a corresponding first key frame code through upsampling based on the intermediate convolution result by using an upsampling operator.
In some embodiments, the improved information distillation module may include at least: the device comprises a first feature refinement module, a second feature refinement module, a connection layer, a second convolution layer, an ESA module and other structures;
The first feature refinement module can be connected with the second feature refinement module and the connecting layer through a channel separation structure; the second feature refinement module is connected with at least a connection layer, the connection layer is connected with a second convolution layer, and the second convolution layer is connected with the ESA module.
It should be noted that, the units, devices, or modules described in the above embodiments may be implemented by a computer chip or entity, or may be implemented by a product having a certain function. For convenience of description, the above devices are described as being functionally divided into various modules, respectively. Of course, when the present description is implemented, the functions of each module may be implemented in the same piece or pieces of software and/or hardware, or a module that implements the same function may be implemented by a plurality of sub-modules or a combination of sub-units, or the like. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
From the above, based on the video encoding device and the video decoding device provided in the embodiments of the present disclosure, the performance and the computing power of the encoding end and the decoding end can be effectively utilized, so that the user can smoothly watch the high-definition video on line in the weak network environment, and a better video service experience can be obtained.
In a specific scene example, the video coding method provided in the specification can be applied to realize high-definition video transmission based on deep learning.
In this scenario example, consider that a conventional video transmission method may cause a degradation or complete interruption of video quality under weak network conditions. In addition, in order to enable users to enjoy high-quality video experience in various network environments as the computing power of the user terminal is continuously improved, the video coding method provided by the specification can be applied, so that the server deployment cost is reduced, the data processing pressure of the server side is reduced, and the video transmission efficiency of the users under the condition of weak network is remarkably improved.
The core idea of this approach is, among other things, to combine video transmission with super resolution technology. And reconstructing the low-resolution video into a high-resolution video at the user side through a super-resolution algorithm, so that the definition and quality of the video are improved, and the watching experience of the user under the weak network condition is further improved.
In practice, referring to fig. 14, the following steps are included.
S1: for original video data (e.g., high definition target video data) in video, a conventional encoding compression process is first performed (resulting in first video encoded data). Taking h.264 coding as an example, four frames, I, P, B and IDR, exist in the coded video, in general, the IDR frame must be an I frame, but the I frame is not necessarily an IDR frame, but considering that the I frame and the IDR frame both adopt intra-frame coding, the recovery of P, B needs to be based on information of previous and next frames, so that the key frame has an important effect on the definition of the video, and for convenience of explanation, the I frame and the IDR frame are regarded as the same frame in the method, and are collectively referred to as a key frame (i.e., the first key frame coding). Reference may be made to fig. 3.
S2: for the compressed video, the compression mode of the key frame mark exists, the type of the frame can be judged directly by reading the mark type of the frame, and for the unknown video or the video without the key frame mark, the key frame and the common frame in the video can be classified (to identify and determine the first key frame code) by training a classifier. Taking h.264 coding as an example, by judging that data with NAL unit type 5 in NAL unit is a key frame, determining the position of the key frame, performing feature extraction on original video data through a style feature extraction network N style (for example, a preset style feature extraction network model), and marking the extracted style feature as F i. The style feature extraction network is obtained by training in combination with a super-resolution network N SR (for example, a preset super-resolution network model). When reconstructing the high-resolution image, the style characteristics can be used as input to provide style guidance for reconstructing the high-resolution image so as to help the super-resolution network to reproduce the original high-definition image.
S3: in order to reduce the bandwidth requirement of the video in the transmission process, downsampling operation is needed after feature extraction processing is performed on the video, the method adopts bicubic downsampling (Bicubic) to downsample a single frame image in the video, the downsampled video is denoted as V L, and the compressed video is denoted as V H (namely, downsampled first video coding data).
S4: to keep the video streaming supported without affecting the user's random access, an F i feature (resulting in second video encoded data) may be further added after each key frame of the video. The processed video inter-frame content may be as shown in fig. 5. The user may then locally reconstruct the high-resolution video via the user terminal using the F i feature described above, the reconstructed high-resolution video being denoted V sR.
When the implementation is carried out, after receiving the compressed video frame, a user inputs the compressed IDR frame and the image style characteristics into a super-resolution network together, and the high-definition IDR frame is recovered; and then the recovered data is sent to a corresponding decoding algorithm (H.264 is taken as an example here) for recovery, and high-definition video is obtained.
In order to ensure that videos can be played in real time at the user side and to enhance the efficiency of the super-resolution network, a high-efficiency super-resolution network structure based on dense residuals is provided. Based on this structure, the inference speed of the model is improved by improving the information distillation module and the residual module (corresponding to the improved information distillation module).
In addition, the performance loss caused by the improvement of the inference speed is compensated for by adopting a re-parameterized structure and a multi-stage warm-start training scheme when the model is trained. The re-parameterized structure, together with a contrastive loss introduced during network training, helps the model better capture the texture details of the image.
The whole operation process of the super-resolution network is shown in the following formula:
F_in = h_ext(I_LR, F_i),
F_n = RRLFB_n(F_in) = RRLFB_{n-1}( ... RRLFB_0(F_in) ... ),
I_SR = Upsample(Conv3(F_n) + F_0)
Where h_ext represents the shallow feature extraction layer, F_i represents the style feature, and RRLFB_i represents the i-th feature extraction module; each RRLFB module contains 3×3 and 1×1 convolutions and ReLU activation functions for local feature extraction.
For a given feature F_in, the overall process can be expressed in the following form:
F_refined1 = RM1(F_in),
F_refined2 = RM2(F_refined1),
F_refined3 = RM3(F_refined2),
where RM_i represents the i-th optimization module (i.e., the i-th feature refinement module) and F_refined,i represents the i-th refinement feature (i.e., the i-th local refinement feature). After multiple local feature refinement steps, the final refinement features are added through a residual connection and input to the 1×1 convolution layer and the ESA module to obtain the final output of the RRLFB module. Specifically, this can be expressed as follows:
F_out = ESA(Conv1(F_refined)).
Meanwhile, in order to meet the requirement of high-efficiency super resolution, an improved distillation residual structure is adopted: the dense residual block is improved by means of channel separation, which reduces the amount of computation while still ensuring the stability of the feature extraction capability. The network structure is shown in fig. 10.
In addition, in order to enhance the ability of the network to recover texture information in video, contrast loss functions are also employed to assist the network in capturing texture information in images during model training. The basic idea is to push a positive target towards the anchor point and a negative target away from the anchor point in the latent space representation. The expression of the contrast loss function is described as follows:
Where φ_j denotes the intermediate features of the j-th layer, d(x, y) denotes the L1 distance between x and y, and λ_i is the weighting factor. Compared with the L1 loss function, the contrastive loss allows the model to focus more on the generation of image texture details during training. With respect to the selection of φ, a randomly initialized Conv_k3s1-Tanh-Conv_k3s1 structure is chosen as the feature extraction function.
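The contrastive loss sketched below assumes the commonly used ratio-of-distances form (positive pulled toward the anchor, negative pushed away, measured with L1 distances on the features of a frozen, randomly initialized Conv_k3s1-Tanh-Conv_k3s1 extractor); the exact expression in the original may differ. In the super-resolution setting, the anchor would typically be the network output, the positive target the high-definition ground truth, and the negative target the low-resolution input, although this assignment is also an assumption here.

```python
import torch
import torch.nn as nn

class RandomFeatureExtractor(nn.Module):
    # Randomly initialized Conv_k3s1-Tanh-Conv_k3s1 structure used as phi (kept frozen).
    def __init__(self, channels: int = 3, hidden: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=3, stride=1, padding=1),
            nn.Tanh(),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=1, padding=1),
        )
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, x):
        return self.body(x)

def contrastive_loss(phi, anchor, positive, negative, weight: float = 1.0, eps: float = 1e-7):
    """Assumed form: weighted ratio of L1 distances in the feature space of phi,
    pulling the positive toward the anchor and pushing the negative away."""
    fa, fp, fn = phi(anchor), phi(positive), phi(negative)
    d_pos = torch.mean(torch.abs(fa - fp))   # L1 distance to the positive target
    d_neg = torch.mean(torch.abs(fa - fn))   # L1 distance to the negative target
    return weight * d_pos / (d_neg + eps)
```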
The feature map extracted based on the above manner is compared with the feature map extracted based on the VGG model. Referring to fig. 15, it can be seen that the randomly initialized network (corresponding to (a)) can extract better image texture details than the trained VGG network (corresponding to (b)).
In addition, in order to ensure the correctness of the restored image content, a restoration test can be performed on the compressed data before implementation, and the compression degree can be gradually reduced (i.e., the downsampling parameter adjusted) for frames whose PSNR and SSIM are lower than the threshold values, until the restored data meets the requirements.
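One way to run such a restoration test is sketched below: compute PSNR and SSIM between the original frame and its restored version, and relax the downsampling ratio whenever either metric falls below a threshold. The threshold values and the use of scikit-image (version 0.19 or later for channel_axis) are assumptions.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def adjust_downsampling(original, restored, scale: int,
                        psnr_thresh: float = 32.0, ssim_thresh: float = 0.92, min_scale: int = 1):
    """Restoration-test sketch: if the restored frame is not close enough to the original,
    reduce the compression degree by lowering the downsampling ratio for this frame."""
    psnr = peak_signal_noise_ratio(original, restored, data_range=255)
    ssim = structural_similarity(original, restored, channel_axis=-1, data_range=255)
    if psnr < psnr_thresh or ssim < ssim_thresh:
        scale = max(min_scale, scale - 1)   # gradually reduce the compression degree
    return scale, psnr, ssim
```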
Through the above scene example, it is verified that the video encoding method provided in the present specification does have higher compression efficiency: the key frames are subjected to feature extraction and advanced compression by using the traditional coding technology, so that higher compression efficiency can be better realized, and transmission and storage costs can be reduced; and, can also save the network bandwidth: by means of the more efficient compression and transmission method, bandwidth can be saved when video is transmitted through a network, and users are helped to better realize online video streaming, video on demand service, remote monitoring application and the like; furthermore, the user experience is improved: for the end user, better viewing experience can be provided through higher quality video reconstruction and lower loading time, so that the user can obtain better service experience based on streaming media, online video and the like.
Although the present description provides method operational steps as described in the examples or flowcharts, more or fewer operational steps may be included based on conventional or non-inventive means. The order of steps recited in the embodiments is merely one way of performing the order of steps and does not represent a unique order of execution. When implemented by an apparatus or client product in practice, the methods illustrated in the embodiments or figures may be performed sequentially or in parallel (e.g., in a parallel processor or multi-threaded processing environment, or even in a distributed data processing environment). The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, it is not excluded that additional identical or equivalent elements may be present in a process, method, article, or apparatus that comprises a described element. The terms first, second, etc. are used to denote a name, but not any particular order.
Those skilled in the art will also appreciate that, in addition to implementing the controller in a pure computer readable program code, it is well possible to implement the same functionality by logically programming the method steps such that the controller is in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc. Such a controller can be regarded as a hardware component, and means for implementing various functions included therein can also be regarded as a structure within the hardware component. Or even means for achieving the various functions may be regarded as either software modules implementing the methods or structures within hardware components.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer-readable storage media including memory storage devices.
From the above description of embodiments, it will be apparent to those skilled in the art that the present description may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solutions of the present specification may be embodied essentially in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and include several instructions to cause a computer device (which may be a personal computer, a mobile terminal, a server, or a network device, etc.) to perform the methods described in the various embodiments or portions of the embodiments of the present specification.
Various embodiments in this specification are described in a progressive manner, and identical or similar parts are all provided for each embodiment, each embodiment focusing on differences from other embodiments. The specification is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
Although the present specification has been described by way of example, it will be appreciated by those skilled in the art that there are many variations and modifications to the specification without departing from the spirit of the specification, and it is intended that the appended claims encompass such variations and modifications as do not depart from the spirit of the specification.

Claims (15)

1. A video encoding method, applied to an encoding end, comprising:
Acquiring first video encoding data regarding target video data; wherein the first video encoding data comprises a plurality of first image encodings;
extracting a network model by using preset style characteristics to process a first key frame code in first video coding data to obtain corresponding style characteristics; wherein the first key frame code is used for restoring and processing other first image codes;
generating second video coding data related to target video data according to the first video coding data and the style characteristics.
2. The method of claim 1, wherein generating second video encoded data for the target video data based on the first video encoded data and the style characteristics comprises:
downsampling the first video coding data to obtain downsampled first video coding data;
Determining the position of the downsampled first key frame code in the downsampled first video coding data according to a preset adding rule;
And adding style characteristics corresponding to the first key frame codes to the position, adjacent to the rear position, of the first key frame codes after downsampling in the first video coding data after downsampling, so as to obtain the second video coding data.
3. The method of claim 1, wherein processing the first key frame code in the first video coding data using the predetermined style feature extraction network model to obtain the corresponding style feature comprises:
Identifying and determining a first key frame code in the first video coding data;
And processing the first key frame code by using a preset style characteristic extraction network model to obtain corresponding style characteristics.
4. The method of claim 3, wherein identifying and determining a first key frame code in the first video encoded data comprises:
Extracting identification information of a first image code in first video code data; identifying and determining a first key frame code in the first video coding data according to the identification information;
Or processing the first video coding data by using a preset classifier to obtain a corresponding coding classification result; and determining a first key frame code in the first video coding data according to the code classification result.
5. The method of claim 1, wherein obtaining first video encoded data for the target video data comprises:
Acquiring target video data;
and carrying out compression coding on the target video data according to a preset coding rule to obtain corresponding first video coding data.
6. The method according to claim 1, wherein the method further comprises:
Constructing a first initial network model;
Acquiring and training the first initial network model by using a first sample high-definition image to obtain a first network model meeting the requirements; wherein the first network model comprises at least a first network structure and a second network structure; the first network structure is used for carrying out feature compression on the high-definition image of the first sample; the second network structure is used for carrying out feature recovery on the compressed image;
intercepting a first network structure from the first network model;
And constructing and obtaining a preset style characteristic extraction network model meeting the requirements according to the first network structure.
7. The method of claim 6, wherein after constructing the desired pre-set style feature extraction network model, the method further comprises:
Constructing a second initial network model based on a dense residual structure; wherein the second initial network model comprises at least a first input interface and a second input interface;
connecting an output interface of a preset style characteristic extraction network model with a first input interface of a second initial network model;
obtaining a second sample high-definition image; correspondingly processing the second sample high-definition image to obtain a corresponding second sample low-resolution image;
and inputting a second sample high-definition image into a preset style feature extraction network model, and simultaneously inputting a second sample low-resolution image corresponding to the second sample high-definition image into a second initial network model through a second input interface, and training the second initial network model to obtain a preset super-resolution network model meeting the requirements.
8. The method of claim 7, wherein after obtaining the satisfactory pre-set super-resolution network model, the method further comprises:
transmitting the preset super-resolution network model to a decoding end; and the decoding end receives and stores the preset super-resolution network model.
9. A video decoding method, applied to a decoding end, comprising:
Receiving second video encoding data;
extracting a first key frame code after downsampling and corresponding style characteristics from the second video coding data;
Processing the first key frame codes after downsampling and corresponding style characteristics by using a preset super-resolution network model, and recovering to obtain the first key frame codes;
And performing corresponding decoding processing according to the first key frame coding and the second video coding data to obtain target video data meeting the requirements.
10. The method according to claim 9, wherein the pre-set super-resolution network model comprises at least: a shallow feature extraction layer, a plurality of improved information distillation modules connected in a nested manner, a first convolution layer, and an upsampling operator;
Correspondingly, the first key frame code after downsampling and the corresponding style characteristics are processed by utilizing a preset super-resolution network model, and the first key frame code is recovered and obtained, which comprises the following steps:
Processing the first key frame code after downsampling and corresponding style characteristics by using a shallow layer characteristic extraction layer in a preset super-resolution network model, and outputting to obtain initial characteristics;
Performing feature processing of multiple rounds of iteration on the initial features by utilizing a plurality of improved information distillation modules connected in a nested manner, and outputting to obtain corresponding intermediate features;
carrying out convolution processing on the intermediate features by using a first convolution layer to obtain a corresponding intermediate convolution result;
And recovering to obtain a corresponding first key frame code through upsampling based on the intermediate convolution result by using an upsampling operator.
11. The method of claim 10, wherein the improved information distillation module comprises at least: the device comprises a first feature refinement module, a second feature refinement module, a connection layer, a second convolution layer and an ESA module;
The first feature refinement module is respectively connected with the second feature refinement module and the connecting layer through a channel separation structure; the second feature refinement module is connected with at least a connection layer, the connection layer is connected with a second convolution layer, and the second convolution layer is connected with the ESA module.
12. A video encoding apparatus, for use at an encoding end, comprising:
An acquisition module for acquiring first video encoding data concerning target video data; wherein the first video encoding data comprises a plurality of first image encodings;
The processing module is used for processing a first key frame code in the first video coding data by utilizing a preset style characteristic extraction network model to obtain corresponding style characteristics; wherein the first key frame code is used for restoring and processing other first image codes;
And the encoding module is used for generating and obtaining second video encoding data about target video data according to the first video encoding data and the style characteristics.
13. A video decoding device, applied to a decoding end, comprising:
A receiving module for receiving second video coding data;
The extraction module is used for extracting the first key frame codes after downsampling and corresponding style characteristics from the second video coding data;
The processing module is used for processing the first key frame codes after downsampling and corresponding style characteristics by utilizing a preset super-resolution network model, and recovering to obtain the first key frame codes;
and the decoding module is used for carrying out corresponding decoding processing according to the first key frame coding and the second video coding data to obtain target video data meeting the requirements.
14. A server comprising a processor and a memory for storing processor-executable instructions, which when executed by the processor implement the steps of the method of any one of claims 1 to 8, or 9 to 11.
15. A computer readable storage medium, having stored thereon computer instructions which, when executed by a processor, implement the steps of the method of any of claims 1 to 8, or 9 to 11.
CN202410171043.XA 2024-02-06 2024-02-06 Video encoding method and device, and video decoding method and device Pending CN117956178A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410171043.XA CN117956178A (en) 2024-02-06 2024-02-06 Video encoding method and device, and video decoding method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410171043.XA CN117956178A (en) 2024-02-06 2024-02-06 Video encoding method and device, and video decoding method and device

Publications (1)

Publication Number Publication Date
CN117956178A true CN117956178A (en) 2024-04-30

Family

ID=90792218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410171043.XA Pending CN117956178A (en) 2024-02-06 2024-02-06 Video encoding method and device, and video decoding method and device

Country Status (1)

Country Link
CN (1) CN117956178A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination