CN110913221A - Video code rate prediction method and device - Google Patents

Video code rate prediction method and device

Info

Publication number: CN110913221A
Authority: CN (China)
Prior art keywords: video, code rate, predicted, sample, GOP
Legal status: Withdrawn
Application number: CN201811086393.7A
Other languages: Chinese (zh)
Inventors: 徐威, 宣章洋, 张新峰, 杨超, 郭宗杰
Current Assignee: Huawei Technologies Co Ltd
Original Assignee: Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority to CN201811086393.7A
Publication of CN110913221A

Classifications

    • H04N19/149 Data rate or code amount at the encoder output, estimated by means of a model, e.g. a mathematical or statistical model (adaptive coding of digital video signals)
    • H04N19/154 Measured or subjectively estimated visual quality after decoding, e.g. measurement of distortion
    • H04N19/177 Adaptive coding characterised by the coding unit, the unit being a group of pictures [GOP]
    • H04N19/70 Characterised by syntax aspects related to video coding, e.g. related to compression standards

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Algebra (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A video code rate prediction method and device are provided to solve the problem of resource waste during video transmission in the prior art. The method comprises the following steps. A first parameter set of a group of pictures (GOP) to be predicted and a second parameter set of a reference video are determined. A feature set of the GOP to be predicted is then determined, where the feature set comprises feature values related to human-eye perceptual quality, determined based on the first parameter set and the second parameter set, together with an expected user satisfaction; the expected user satisfaction is the user satisfaction ratio, relative to the reference video, of the video obtained by compressing the GOP to be predicted at the predicted code rate. The feature set of the GOP to be predicted is then input into a pre-trained regression model to obtain the predicted code rate.

Description

Video code rate prediction method and device
Technical Field
The embodiment of the application relates to the technical field of image processing, in particular to a video code rate prediction method and device.
Background
Due to network bandwidth limitations, in traditional streaming media services the same video source must be compressed into video streams at several different code rates; in practical applications, the server then selects the stream whose code rate suits the current bandwidth for transmission. With this multi-rate compression and storage approach, however, the server must store the video streams compressed at multiple code rates, which wastes storage resources. Likewise, compressing the same video source multiple times wastes computational resources. In addition, in order to provide high-quality video to users, the server is often forced to select the highest-bitrate stream that the current bandwidth allows, which wastes transmission bandwidth resources.
Disclosure of Invention
The application provides a video code rate prediction method and device, which are used for solving the problem of resource waste during video transmission in the prior art.
In a first aspect, the present application provides a video bitrate prediction method, including: determining a first parameter set of a group of pictures (GOP) to be predicted and a second parameter set of a reference video, where the first parameter set includes one or more video parameters of the GOP to be predicted, the second parameter set includes one or more video parameters of the reference video, and the reference video is a video obtained by compressing the GOP to be predicted based on a preset compression standard; then determining a feature set of the GOP to be predicted, where the feature set includes feature values related to human-eye perceptual quality that are determined based on the first parameter set and the second parameter set, together with an expected user satisfaction, the expected user satisfaction being the user satisfaction ratio, relative to the reference video, of the video obtained by compressing the GOP to be predicted at the predicted code rate; and then inputting the feature set of the GOP to be predicted into a pre-trained regression model to obtain the predicted code rate. In the embodiments of the application, under this regression-model-based approach, by extracting feature values related to the human-eye perceptual quality of the original video and of the reference video, and by setting the user satisfaction of the video at the target code rate relative to the reference video, a code rate that meets the user's requirement can be predicted accurately by the pre-trained regression model, so the original video can be compressed and transmitted at the predicted code rate. In the prior art, the original video must be compressed into video streams at different code rates, after which the stream with a suitable code rate is selected for transmission according to the current bandwidth requirement; compressing the original video only once, at the predicted code rate, avoids this. And compared with the prior art in which the server selects the highest-bitrate stream under the current bandwidth condition for transmission, compressing the original video at a code rate that meets the user's requirement saves bandwidth resources.
In one possible design, when the feature set of the GOP to be predicted is input into the pre-trained regression model to obtain the predicted code rate, the feature set may first be mapped to a high-dimensional feature space through a nonlinear mapping to obtain a high-dimensional feature set. The high-dimensional feature set is input into the regression model to obtain the difference, in logarithmic space, between the code rate of the reference video and the predicted code rate, and the predicted code rate is then determined from that difference. In this design, mapping the feature set to a high-dimensional feature space through a nonlinear mapping improves the accuracy of the code rate prediction.
In one possible design, the regression model may conform to the following formula: f(x) = w^T φ(X) + b, where f(x) is the difference, in logarithmic space, between the code rate of the reference video and the predicted code rate, w^T is the weight, φ(X) is the high-dimensional feature set, and b is the bias. In this design, a support vector machine regression model can effectively predict a code rate that meets the expected user satisfaction.
In one possible design, the first parameter set may include: the temporal masking effect value of each video frame in the GOP to be predicted, the spatial masking effect value of each pixel in the GOP to be predicted, and the visual saliency value of each pixel in the GOP to be predicted. The second parameter set may include: the video objective quality value of the reference video and the code rate of the reference video. The feature set may include: the mean of the temporal masking effect values, the weighted average of the spatial masking effect values, the video objective quality value of the reference video, and the base-2 logarithm of the code rate of the reference video. The mean of the temporal masking effect values is determined based on the temporal masking effect value of each video frame in the GOP to be predicted; the weighted average of the spatial masking effect values is determined based on the weight and the spatial masking effect value of each pixel in the GOP to be predicted, where the weight of each pixel is determined based on its visual saliency value. In this design, video parameters such as the temporal masking effect value, the spatial masking effect value and the visual saliency value reflect the perceptual quality a user experiences for the GOP to be predicted, so by combining video quality with human-eye perceptual quality, a predicted code rate meeting the expected user satisfaction can be predicted accurately.
In a possible design, the first parameter set may further include N compression code rates, where the N compression code rates are the code rates of N compressed videos obtained by compression-coding the GOP to be predicted with N fixed quantization parameter (QP) points, respectively, and N is an integer greater than 0. The feature set may further include: the base-2 logarithms of the N compression code rates, and/or the proportion of at least one adjacent difference to the maximum difference, where an adjacent difference is the difference between the compression code rates corresponding to two adjacent fixed QP points, and the maximum difference is the difference between the compression code rates corresponding to the largest and the smallest of the N fixed QP points. In this design, compressing the GOP to be predicted with different QP points yields compressed videos at different compression code rates, which reflect the video quality of the GOP to be predicted at those rates; combining the video parameters of these compressed videos therefore improves the accuracy of the code rate prediction.
In one possible design, the first parameter set may further include M video objective quality values, which are the objective quality values of M compressed videos relative to the GOP to be predicted, where the M compressed videos are any M of the N compressed videos and M is an integer greater than 0 and not greater than N. The feature set may further include at least one video objective quality difference, where a video objective quality difference is the difference between two of the M video objective quality values. In this design, combining the objective quality values of videos compressed with different QP points improves the accuracy of the code rate prediction.
In one possible design, the second set of parameters may further include: a frame rate and a resolution of the reference video. The feature set may further include: a frame rate and a resolution of the reference video. In the design, the frame rate and the resolution of the reference video can reflect the video quality of the reference video, so that the accuracy of code rate prediction can be improved by combining the frame rate and the resolution of the reference video.
In one possible design, the regression model may be trained as follows. In the Kth training pass, the feature set of the Kth sample video in a training sample database is input into the regression model after K-1 adjustments to obtain the predicted code rate of the Kth sample video, where K is an integer greater than 0, the training sample database comprises the feature sets of several sample videos and the target code rate corresponding to each sample video, and the regression model comprises a weight and a bias. After the Kth training pass, the error between the predicted code rate and the target code rate of the Kth sample video is obtained. If this error does not meet a preset condition, the weight and bias used in the (K+1)th training pass are adjusted based on it. If the error meets the preset condition, a functional relationship between the feature set of a sample video and the target code rate is obtained, and this functional relationship is the regression model. In this design, a regression model of higher accuracy can be trained from several sample videos, so the regression model can be used to predict a more accurate code rate.
In one possible design, the training sample database may be constructed as follows. Several sample videos are acquired. For each sample video, one or more video parameters of the sample video and of its reference video are determined, where the reference video of the sample video is the video obtained by compressing the sample video based on the preset compression standard. The feature values are determined based on the video parameters of the sample video and of its reference video. The sample video is compressed at a preset code rate, lower than the code rate of its reference video, to obtain a target video. The user satisfaction with the reference video of the sample video and with the target video is counted, and the user satisfaction ratio of the target video relative to the reference video is determined. The feature values, together with this user satisfaction ratio, are taken as the feature set of the sample video in the training sample database, and the preset code rate is taken as the target code rate of the sample video in the training sample database. In this design, a sample video is compressed at a target code rate lower than the reference code rate, the user satisfaction of the resulting target video relative to the reference video is counted, and the regression model is trained on the target code rate and its associated user satisfaction; the model can therefore predict code rates below the reference code rate that still meet a given user satisfaction, so the GOP to be predicted can be compressed at such a rate. Compared with the prior art, in which the compressed video with the highest code rate meeting the bandwidth requirement is transmitted, transmitting video compressed at a predicted code rate below the reference code rate effectively reduces the waste of transmission resources.
In a second aspect, the present application provides a video bitrate prediction apparatus. The apparatus has the function of implementing the first aspect or any of its embodiments. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the function.
In a third aspect, the present application provides an electronic device, comprising: a processor and a memory. The memory is used for storing computer-executable instructions, and when the electronic device is operated, the processor executes the computer-executable instructions stored in the memory, so as to enable the apparatus to perform the video bitrate prediction method according to the first aspect or any one of the first aspects.
In a fourth aspect, the present application further provides a computer-readable storage medium, which stores instructions that, when executed on a computer, cause the computer to perform the video bitrate prediction method according to the first aspect or any one of the first aspects.
In a fifth aspect, the present application further provides a computer program product comprising instructions, which when run on a computer, cause the computer to perform the video bitrate prediction method according to the first aspect or any one of the first aspects.
Drawings
Fig. 1 is a schematic flowchart of a video bitrate prediction method provided in the present application;
fig. 2 is a schematic diagram of a video bitrate prediction process provided in the present application;
FIG. 3 is a schematic diagram of a regression model training process provided herein;
fig. 4 is a schematic structural diagram of a video bitrate prediction apparatus provided in the present application;
fig. 5 is a schematic diagram of a video bitrate prediction apparatus provided in the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings.
Due to network bandwidth limitations, in traditional streaming media services the same video source must be compressed into video streams at different code rates in order to improve user satisfaction with the video; in practical applications, the server selects, according to the current bandwidth requirement, the stream whose code rate meets that requirement for transmission. However, compressing the same video source multiple times to obtain streams at different code rates increases the consumption of computing resources and energy. After the multiple compressions, the server must also store the streams at multiple code rates, which requires a large amount of storage space and wastes storage resources. In addition, in order to provide high-quality video to users, the server is usually forced to select the highest-bitrate stream under the current bandwidth requirement, even though the user may not need a stream at that code rate, which wastes transmission bandwidth resources.
Based on this, the present application provides a video code rate prediction method and device to solve the problem of resource waste during video transmission in the prior art. Starting from the perceptual characteristics of the human eye, the embodiments of the application provide a feature set that effectively reflects the perceptual quality of a video, and establish a regression model that takes user satisfaction as its target; through this regression model, the target code rates at different user satisfaction levels relative to a reference video can be predicted effectively, so that coding parameters can be supplied for dynamically adapting video streaming to network bandwidth changes. The method and the device are based on the same inventive concept; since their problem-solving principles are similar, the implementations of the device and the method may refer to each other, and repeated parts are not described again. To make the embodiments of the present application easier to understand, some notions used in the embodiments are first explained below; these explanations should not be taken as limiting the scope of the claims of the present application.
Group of pictures (GOP): a GOP is a set of consecutive multiple frames of pictures, each GOP containing a certain length of video, e.g., 5 seconds, etc.
Temporal masking effect value: the human eye perceives distortion of moving objects in video with different sensitivity. For slowly or regularly moving objects it can perceive relatively fine distortion, while for violently or irregularly moving objects the distortion is hard to perceive. This sensitivity of the human eye to the distortion of different moving objects in video is called the temporal masking effect of the video. Videos with violent or irregular motion tend to have large temporal masking values, while videos with slow or regular motion have smaller temporal masking values.
Spatial masking effect value: human eyes have different distortion perceptibility for different areas in an image, and can perceive relatively fine distortion near a relatively smooth or regular image structure, while the human eyes are not easy to perceive the image distortion in a texture area with a disordered structure. The sensitivity of human eyes to the perception of different content distortions in the video image spatial domain is called the spatial masking effect of the video, the spatial masking effect value of the video is larger in a region with a complex video spatial structure, and the spatial masking effect value is smaller in a region with a regular video spatial structure.
Visual saliency: visual saliency describes the degree of attention the human eye pays to different areas in a video. When viewing a video or image, the human visual system (HVS) typically focuses most of its attention on a small area around the visual focus, which is perceived at high resolution, while areas away from the focus tend to be perceived at low resolution. This area of visual interest is called the video saliency area, and the process of predicting which areas the human eye will attend to is called saliency detection. The detection result is output as a saliency map, in which the visual saliency value of each pixel represents how much attention that pixel is likely to receive: the higher the visual saliency value, the more salient the pixel and the more likely it is to draw attention.
Video objective quality value: the video objective quality value is produced by a full-reference video quality evaluation method, which predicts subjective quality by combining several elementary quality metrics. A machine learning algorithm fuses multiple quality evaluation indexes of the image or video and computes a final video quality evaluation score, i.e., the video objective quality value.
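As an illustration of such metric fusion, the sketch below fits a simple regressor from a few hypothetical elementary metric scores to subjective scores; the metric names, values and the choice of a linear model are assumptions for illustration only, not details from this application.

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # stand-in fusion model

# Hypothetical elementary metric scores for three training clips (e.g.
# detail preservation, impairment, motion) with subjective quality labels.
metrics = np.array([[0.9, 0.8, 0.3],
                    [0.5, 0.4, 0.7],
                    [0.7, 0.6, 0.5]])
subjective = np.array([85.0, 42.0, 60.0])

fusion = LinearRegression().fit(metrics, subjective)
objective_quality = fusion.predict([[0.8, 0.7, 0.4]])[0]  # fused quality score
```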
The plural in the present application means two or more.
In addition, it is to be understood that the terms first, second, etc. in the description of the present application are used for distinguishing between the descriptions and not necessarily for describing a sequential or chronological order.
The video code rate prediction scheme provided by the present application is specifically described below with reference to the accompanying drawings.
Referring to fig. 1, a flowchart of a video bitrate prediction method provided by the present application is shown. The video bitrate prediction method provided by the application can be used in streaming media equipment, such as a streaming media server and the like. The method comprises the following steps:
s101, determining a first parameter set of a GOP (group of pictures) to be predicted and a second parameter set of a reference video, wherein the first parameter set comprises one or more video parameters of the GOP to be predicted, the second parameter set comprises one or more video parameters of the reference video, and the reference video is a video obtained by compressing the GOP to be predicted based on a preset compression standard. For example, the preset compression standard may be a correspondence between resolution and bitrate, so that the bitrate of the reference video may be determined according to the resolution of the GOP to be predicted.
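For instance, such a correspondence between resolution and code rate could be held as a simple lookup table; the specific rate values below are illustrative assumptions, not values from this application.

```python
# Hypothetical preset compression standard: reference code rate per resolution.
PRESET_RATE_BPS = {
    (640, 360): 800_000,
    (1280, 720): 2_500_000,
    (1920, 1080): 5_000_000,
}

def reference_code_rate(width, height):
    """Code rate of the reference video, chosen by the GOP's resolution."""
    return PRESET_RATE_BPS[(width, height)]
```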
S102, determining a feature set of the GOP to be predicted, wherein the feature set comprises feature values which are determined based on the first parameter set and the second parameter set and are related to human eye perception quality and expected user satisfaction, and the expected user satisfaction is the user satisfaction of a video which is obtained by compressing the GOP to be predicted by adopting a prediction code rate and is compared with the reference video.
In the field of encoding and decoding, user satisfaction may refer to the proportion of users who cannot see a quality difference between the reference video and the processed target video. For example, if 50% of users cannot see the quality difference between the reference video and the processed target video, the user satisfaction of the processed target video relative to the reference video is considered to be 50%.
S103, inputting the feature set of the GOP to be predicted into a pre-trained regression model to obtain the predicted code rate. The regression model may be, but is not limited to, a support vector regression (SVR) model, a Gaussian process regression (GPR) model, or another regression model from machine learning.
In a possible implementation manner, the feature set of the GOP to be predicted is input into a regression model trained in advance to obtain the predicted code rate, and the prediction method can be implemented by the following steps:
a1, mapping the feature set of the GOP to be predicted to a high-dimensional feature space based on a nonlinear mapping mode to obtain a high-dimensional feature set.
A2, inputting the high-dimensional feature set into the regression model to obtain the difference between the code rate of the reference video and the predicted code rate in a logarithmic space.
A3, determining the predicted bitrate based on the difference between the bitrate of the reference video and the predicted bitrate in logarithmic space.
For example, the regression model may conform to the following formula:
f(x) = w^T φ(X) + b;
where f(x) is the difference, in logarithmic space, between the code rate of the reference video and the predicted code rate, w^T is the weight, φ(X) is the high-dimensional feature set, b is the bias, and X is the feature set of the GOP to be predicted.
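As a concrete illustration, this step can be realized with an off-the-shelf kernel support vector regressor, whose decision function has exactly the form w^T φ(X) + b, with φ induced implicitly by the kernel. The following is a minimal sketch; the training data are random placeholders and the hyperparameter values are illustrative assumptions, not values from this application.

```python
import numpy as np
from sklearn.svm import SVR

# Placeholder training data: each row is a 15-dimensional feature set
# (f1..f15) of a sample GOP; each target is the log-space difference
# between the reference code rate and the target code rate.
X_train = np.random.rand(200, 15)
y_train = np.random.rand(200)

# An RBF-kernel SVR realizes f(x) = w^T phi(X) + b without forming the
# high-dimensional mapping phi explicitly.
model = SVR(kernel="rbf", C=10.0, epsilon=0.01)
model.fit(X_train, y_train)

f_x = model.predict(X_train[:1])[0]  # f(x): the log-space rate difference
```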
In the embodiments of the application, under this regression-model-based approach, by extracting feature values related to the human-eye perceptual quality of the original video and of the reference video, and by setting the user satisfaction ratio, relative to the reference video, of the video at the target code rate, a code rate that meets the user's requirement can be predicted accurately by the pre-trained regression model, so the original video can be compressed and transmitted at the predicted code rate. In the prior art, the original video must be compressed into video streams at different code rates, after which the stream with a suitable code rate is selected for transmission according to the current bandwidth requirement; compressing the original video only once, at the predicted code rate, avoids this. And compared with the prior art in which the server selects the highest-code-rate stream under the current bandwidth condition for transmission, compressing the original video at a code rate that meets the user's requirement saves bandwidth resources.
In one possible example, the first parameter set may include: the temporal masking effect value of each video frame in the GOP to be predicted, the spatial masking effect value of each pixel in the GOP to be predicted, and the visual saliency value of each pixel in the GOP to be predicted. The second parameter set comprises: the video objective quality value of the reference video and the code rate of the reference video. The video objective quality value of the reference video may be its objective quality value relative to the GOP to be predicted.
Thus, the feature set may include: a temporal masking effect value mean, wherein the temporal masking effect value mean is determined based on a temporal masking effect value of each video frame in the GOP to be predicted.
For example, the mean of the temporal masking effect values may satisfy the following formula:

f1 = (1/N) · Σ_{i=1}^{N} M_t(i);

where f1 is the mean of the temporal masking effect values, N is the number of video frames in the GOP to be predicted, and M_t(i) is the temporal masking effect value of the ith video frame in the GOP to be predicted.
The feature set may further include: and the weighted average value of the spatial masking effect value is determined based on the weight of each pixel point in the GOP to be predicted and the spatial masking effect value, and the weight of each pixel point in the GOP to be predicted is determined based on the visual saliency value of each pixel point.
For example, the weighted average of the spatial masking effect values may conform to the following formula:

f2 = (1/N) · Σ_{i=1}^{N} Σ_{x=1}^{W} Σ_{y=1}^{H} w(i,x,y) · M_s(i,x,y);

where f2 is the weighted average of the spatial masking effect values, W and H are the width and height of the video frames in the GOP to be predicted, and M_s(i,x,y) is the spatial masking effect value of the pixel at coordinates (x,y) in the ith video frame of the GOP to be predicted. w(i,x,y) is the weight of the pixel at coordinates (x,y) in the ith video frame, and may conform to the following formula:

w(i,x,y) = S(i,x,y) / Σ_{x=1}^{W} Σ_{y=1}^{H} S(i,x,y);

where S(i,x,y) is the visual saliency value of the pixel at coordinates (x,y) in the ith video frame of the GOP to be predicted.
The feature set may further include: a video objective quality value for the reference video.
The feature set may further include: logarithm of the code rate of the reference video with base 2, log2(Rref),RrefIs the bitrate of the reference video.
In one implementation, the first parameter set may further include N compression code rates, where the N compression code rates are code rates corresponding to N compressed videos, the N compressed videos are obtained by performing compression coding on the GOP to be predicted by using N fixed Quantization Parameter (QP) points, respectively, and N is an integer greater than 0.
Thus, the feature set may further include: the base-2 logarithms of the N compression code rates, and/or the proportion of at least one adjacent difference to the maximum difference, where an adjacent difference is the difference between the compression code rates corresponding to two adjacent fixed QP points, and the maximum difference is the difference between the compression code rates corresponding to the largest and the smallest of the N fixed QP points.
For example, taking N = 4 and the 4 QP points as 22, 27, 32 and 37, the first parameter set may include R_22, R_27, R_32 and R_37, where R_22 is the code rate of the compressed video obtained by compression-coding the GOP to be predicted with QP = 22, R_27 is that obtained with QP = 27, R_32 is that obtained with QP = 32, and R_37 is that obtained with QP = 37. The feature set then includes at least one of {log2(R_22), log2(R_27), log2(R_32), log2(R_37)} and/or {(R_22 - R_27)/(R_22 - R_37), (R_27 - R_32)/(R_22 - R_37), (R_32 - R_37)/(R_22 - R_37)}.
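A small sketch of these code-rate features, assuming the four QP-point code rates have already been measured (variable names are illustrative):

```python
import math

def rate_features(r22, r27, r32, r37):
    """Log-rate features and adjacent-difference ratios for QP 22/27/32/37."""
    logs = [math.log2(r) for r in (r22, r27, r32, r37)]  # f3..f6
    span = r22 - r37                                     # maximum difference
    ratios = [(r22 - r27) / span,                        # f7
              (r27 - r32) / span,                        # f8
              (r32 - r37) / span]                        # f9
    return logs + ratios
```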
In one implementation, the first parameter set may further include M video objective quality values, which are the objective quality values of M compressed videos relative to the GOP to be predicted, where the M compressed videos are any M of the N compressed videos and M is an integer greater than 0 and not greater than N. The feature set may further include at least one video objective quality difference, where a video objective quality difference is the difference between two of the M video objective quality values.
For example, taking the M compressed videos as the compressed video obtained by compression-coding the GOP to be predicted with QP = 22 and the compressed video obtained with QP = 27, the first parameter set may further include v_22 and v_27, where v_22 is the video objective quality value of the compressed video obtained with QP = 22 and v_27 is that of the compressed video obtained with QP = 27. The feature set then includes the difference between v_22 and v_27, i.e., v_22 - v_27.
In one implementation, the second parameter set may further include: a frame rate and a resolution of the reference video. The feature set may further include: a frame rate and a resolution of the reference video. The resolution included in the feature set may be a resolution obtained by performing normalization processing on the resolution of the reference video.
For example, the resolution obtained after normalization may be N_ref/(640 × 360), where N_ref is the resolution of the reference video. Alternatively, other parameter values may be used to normalize the resolution of the reference video; the normalization parameter is not specifically limited in this embodiment of the application.
Of course, the first parameter set and the second parameter set may also include other video parameters, and the feature set may also include other feature values related to the human eye perception quality, which are not listed here.
In order to better understand the video bitrate prediction method provided in the embodiment of the present application, a process of predicting a video bitrate is described in detail below with reference to specific embodiments. The process of predicting the video bitrate is shown in fig. 2. It should be understood that the embodiment shown in fig. 2 is only an exemplary illustration and does not specifically limit the number, types, etc. of video parameters included in the first parameter set, the second parameter set, or the number, types, etc. of feature values included in the feature set.
S201, a first parameter set of a GOP to be predicted and a second parameter set of a reference video are determined.
The first parameter set may include: the temporal masking effect value of each video frame in the GOP to be predicted; the spatial masking effect value and visual saliency value of each pixel in each video frame of the GOP to be predicted; the code rates R_22, R_27, R_32 and R_37 of the compressed videos obtained by compressing the GOP to be predicted with QP = 22, 27, 32 and 37, respectively; the video objective quality values v_22 and v_27, relative to the GOP to be predicted, of the compressed videos obtained with QP = 22 and QP = 27; and the video objective quality value v_ref of the reference video relative to the GOP to be predicted. The second parameter set may include: the frame rate F_ref, the resolution N_ref and the code rate R_ref of the reference video.
S202, the first feature value f1 in the feature set is determined by the following formula:

f1 = (1/N) · Σ_{i=1}^{N} M_t(i);

where f1 is the mean of the temporal masking effect values, N is the number of video frames in the GOP to be predicted, and M_t(i) is the temporal masking effect value of the ith video frame in the GOP to be predicted.
The second feature value f2 in the feature set is determined by the following formula:

f2 = (1/N) · Σ_{i=1}^{N} Σ_{x=1}^{W} Σ_{y=1}^{H} w(i,x,y) · M_s(i,x,y);

where f2 is the weighted average of the spatial masking effect values, W and H are the width and height of the video frames in the GOP to be predicted, and M_s(i,x,y) is the spatial masking effect value of the pixel at coordinates (x,y) in the ith video frame. The weight w(i,x,y) may conform to the following formula:

w(i,x,y) = S(i,x,y) / Σ_{x=1}^{W} Σ_{y=1}^{H} S(i,x,y);

where S(i,x,y) is the visual saliency value of the pixel at coordinates (x,y) in the ith video frame of the GOP to be predicted.
R_22, R_27, R_32 and R_37 are each transformed into logarithmic space to obtain the third to sixth feature values in the feature set: f3 = log2(R_22), f4 = log2(R_27), f5 = log2(R_32), f6 = log2(R_37).
The proportion of each adjacent QP-point code-rate change to the code-rate change across all QP points is determined to obtain the seventh to ninth feature values: f7 = (R_22 - R_27)/(R_22 - R_37), f8 = (R_27 - R_32)/(R_22 - R_37), f9 = (R_32 - R_37)/(R_22 - R_37).
The difference between the video objective quality value v_22 of the compressed video obtained with QP = 22 and the video objective quality value v_27 of the compressed video obtained with QP = 27 is determined to obtain the tenth feature value: f10 = v_22 - v_27.
The video objective quality value v_ref of the reference video is taken as the eleventh feature value: f11 = v_ref.
The frame rate F_ref of the reference video is taken as the twelfth feature value: f12 = F_ref.
The resolution N_ref of the reference video is normalized and taken as the thirteenth feature value: f13 = N_ref/(640 × 360).
The code rate of the reference video is transformed into logarithmic space to obtain the fourteenth feature value: f14 = log2(R_ref).
The expected user satisfaction ratio, relative to the reference video, of the video obtained by compressing the GOP to be predicted at the predicted code rate is set and taken as the fifteenth feature value f15 in the feature set.
S203, the 15 feature values f1 to f15 in the feature set are input into the pre-trained regression model f(x) = w^T φ(X) + b, giving f(x) = w^T φ(f1, ..., f15) + b, where f(x) is the difference, in logarithmic space, between the code rate of the reference video and the predicted code rate. The predicted code rate can then be determined from this difference: R_obj = log⁻¹(log2(R_ref) - f(x)) = R_ref · 2^(-f(x)), where R_obj is the predicted code rate and log⁻¹(·) is the inverse of the logarithmic function.
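Putting S201 to S203 together, the following hedged sketch assembles the feature vector and recovers R_obj; it reuses the perceptual_features and rate_features helpers sketched earlier, and svr_model stands for a pre-trained regression model such as the SVR fitted above:

```python
import math
import numpy as np

def build_feature_set(M_t, M_s, S, rates, v22, v27, v_ref, F_ref, N_ref, R_ref, f15):
    """Assemble f1..f15 as in S202; rates = (R22, R27, R32, R37)."""
    f1, f2 = perceptual_features(M_t, M_s, S)  # masking / saliency features
    f3_to_f9 = rate_features(*rates)           # log rates + difference ratios
    f10 = v22 - v27                            # objective quality difference
    f13 = N_ref / (640 * 360)                  # normalized resolution
    f14 = math.log2(R_ref)                     # log-space reference code rate
    return np.array([f1, f2, *f3_to_f9, f10, v_ref, F_ref, f13, f14, f15])

def predict_code_rate(features, R_ref, svr_model):
    """S203: evaluate f(x) and recover the predicted code rate R_obj."""
    f_x = svr_model.predict(features.reshape(1, -1))[0]
    return R_ref * 2.0 ** (-f_x)  # R_obj = R_ref * 2^(-f(x))
```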
In one possible embodiment, the regression model may be trained by:
and B1, in the K training process, inputting the feature set of the K sample video in the training sample database into a regression model which is adjusted for K-1 times to obtain the predicted code rate of the K sample video, wherein K is an integer greater than 0, the training sample database comprises the feature sets of a plurality of sample videos and the target code rate corresponding to each sample video, and the regression model comprises weight and bias.
The training sample database can be constructed in the following way:
c1, obtaining a plurality of sample videos.
And C2, determining, for each sample video, one or more video parameters of the sample video and one or more video parameters of a reference video of the sample video, where the reference video of the sample video is a video corresponding to the sample video compressed based on the preset compression standard.
C3, determining the feature value based on the video parameters of the sample video and the video parameters of the reference video of the sample video.
And C4, compressing the sample video by adopting a preset code rate to obtain a target video, wherein the preset code rate is smaller than the code rate of the reference video of the sample video.
And C5, counting the user satisfaction degrees of the reference videos of the sample videos and the user satisfaction degree of the target videos, and determining a user satisfaction degree ratio of the user satisfaction degree of the target videos and the user satisfaction degree of the reference videos of the sample videos.
C6, the feature values determined based on the video parameters of the sample video and of its reference video, together with the user satisfaction ratio of the target video relative to the reference video of the sample video, are taken as the feature set of the sample video in the training sample database, and the preset code rate is taken as the target code rate of the sample video in the training sample database.
And B2, obtaining an error value between the predicted code rate of the Kth sample video and the target code rate of the Kth sample video after the Kth training.
B3, if the error value between the predicted code rate of the kth sample video and the target code rate of the kth sample video does not satisfy the preset condition, adjusting the weight and bias used in the K +1 th training process based on the error value between the predicted code rate of the kth sample video and the target code rate of the kth sample video; and if the error value between the predicted code rate of the Kth sample video and the target code rate of the Kth sample video meets a preset condition, obtaining a functional relationship between the feature set of the sample video and the target code rate, wherein the functional relationship is the regression model.
To better understand the training method of the regression model provided in the embodiments of the present application, the training process is described in detail below, taking as an example the regression model f(x) = w^T φ(X) + b trained on N sample videos, where N is an integer greater than 0. It should be understood that the embodiment described below is only an exemplary illustration and does not specifically limit the number, types, etc. of the video parameters included in the first parameter set or the second parameter set, or of the feature values included in the feature set.
S301, constructing a training sample database based on the N sample videos. Specifically, D1 through D3 may be performed separately for each sample video.
D1, determining the first to fourteenth feature values, f1 to f14, of the sample video. For this process, reference may be made to the determination of the first to fourteenth feature values of the GOP to be predicted in step S202 in fig. 2, and details are not repeated here.
And D2, compressing the sample video by adopting a preset code rate smaller than the code rate of the reference video to obtain the target video.
And D3, counting the user satisfaction degree of the reference video and the user satisfaction degree of the target video, determining the user satisfaction degree ratio of the user satisfaction degree of the target video and the user satisfaction degree of the reference video, and taking the user satisfaction degree ratio as a fifteenth characteristic value.
The first to fourteenth feature values obtained in step D1 and the fifteenth feature value obtained in step D3 are taken as the feature set of the sample video; this feature set is the input data. The difference, in logarithmic space, between the preset code rate and the code rate of the reference video is taken as the expected output.
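A minimal sketch of assembling one training sample under these steps; the satisfaction figures and code rates below are purely illustrative assumptions:

```python
import math

def build_training_sample(f1_14, sat_target, sat_reference, preset_rate, ref_rate):
    """Build one (feature set, expected output) pair as in steps D1 to D3.

    f1_14:         the 14 feature values computed as in S202 (step D1).
    sat_target:    fraction of users satisfied with the video compressed
                   at the preset code rate (steps D2/D3).
    sat_reference: fraction of users satisfied with the reference video.
    """
    f15 = sat_target / sat_reference      # user satisfaction ratio
    features = list(f1_14) + [f15]        # input data for the model
    # Expected output: the two code rates' difference in logarithmic space.
    expected = math.log2(ref_rate) - math.log2(preset_rate)
    return features, expected

# Example: a 1.8 Mbps preset rate satisfying 90% as many users as the
# 4 Mbps reference video (all numbers illustrative).
x, y = build_training_sample([0.0] * 14, 0.45, 0.50, 1.8e6, 4.0e6)
```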
And S302, training a regression model based on the constructed training sample database. Specifically, during the ith training process, the following steps E1 to E4 may be respectively performed, where i is an integer no greater than N, as shown in fig. 3:
E1, in the ith training pass, the feature set of the ith sample video in the training sample database is input into the regression model after the (i-1)th adjustment to obtain an output result. Step E2 is then performed.
That is, f_i(x) = w^T φ(X_i) + b, where X_i is the feature set of the ith sample video, f_i(x) is the output result obtained by inputting the ith sample video into the current regression model, w^T is the weight, and b is the bias.
E2, determining the difference between the output result and the expected output of the ith sample video, and determining whether the absolute value of the difference is greater than the loss function parameter. If yes, go to step E3; if not, go to step E4.
That is, it is determined whether |y_i - f_i(x)| > ε, where y_i is the expected output of the ith sample video, f_i(x) is the output result obtained by inputting the ith sample video into the current regression model, and ε is the loss function parameter.
And E3, adjusting the weight and the bias of the initialized regression model, and performing the (i + 1) th training.
E4, training is completed.
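The loop E1 to E4 matches the ε-insensitive criterion used in support vector regression. The following is a minimal sketch of such a loop for the linear case (φ taken as the identity); the perceptron-style update rule and learning rate are illustrative assumptions, not details from this application:

```python
import numpy as np

def train_epsilon_insensitive(X, y, epsilon=0.01, lr=0.001, max_epochs=100):
    """E1-E4: adjust (w, b) whenever |y_i - f_i(x)| exceeds epsilon."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        adjusted = False
        for x_i, y_i in zip(X, y):
            f_i = w @ x_i + b            # E1: output of the current model
            err = y_i - f_i              # E2: compare with expected output
            if abs(err) > epsilon:       # E3: adjust the weight and the bias
                w += lr * np.sign(err) * x_i
                b += lr * np.sign(err)
                adjusted = True
        if not adjusted:                 # E4: every sample within epsilon
            break
    return w, b
```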
Based on the same inventive concept as the method embodiment, the embodiment of the present invention provides a video bitrate prediction apparatus 40, which is specifically used for implementing the method described in the embodiments of fig. 1 to 3, and the structure of the apparatus is shown in fig. 4, and includes a parameter determination module 41, a feature determination module 42, and a prediction module 43. The parameter determining module 41 is configured to determine a first parameter set of a GOP of a picture group to be predicted and a second parameter set of a reference video, where the first parameter set includes one or more video parameters of the GOP to be predicted, the second parameter set includes one or more video parameters of the reference video, and the reference video is a video obtained by compressing the GOP to be predicted based on a preset compression standard. A feature determining module 42, configured to determine a feature set of the GOP to be predicted, where the feature set includes a feature value related to human eye perceptual quality determined based on the first parameter set and the second parameter set determined by the parameter determining module 41, and an expected user satisfaction, where the expected user satisfaction is a user satisfaction ratio of a video obtained after compressing the GOP to be predicted by using a prediction code rate, compared with the reference video. And the prediction module 43 is configured to input the feature set of the GOP to be predicted determined by the feature determination module 42 into a pre-trained regression model to obtain the predicted code rate.
The parameter determining module 41, the feature determining module 42 and the prediction module 43 may also be configured to perform other steps corresponding to the above method embodiments; for details, reference may be made to the above method embodiments, which are not repeated here.
Illustratively, the apparatus further includes a training module 44. The training module 44 may be configured to train to obtain the regression model. The process of obtaining the regression model by training the training module 44 may specifically refer to the above method embodiment, and details are not repeated here.
Illustratively, the apparatus further includes a construction module 45. The constructing module 45 is configured to construct the training sample database. The process of constructing the training sample database by the constructing module 45 may specifically refer to the above method embodiment, and details are not repeated here.
The division of the modules in the embodiments of the present application is schematic, and only one logical function division is provided, and in actual implementation, there may be another division manner, and in addition, each functional module in each embodiment of the present application may be integrated in one processor, or may exist alone physically, or two or more modules are integrated in one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
When the integrated module is implemented in the form of hardware, as shown in fig. 5, the video bitrate prediction apparatus may include a processor 501. The hardware entity corresponding to the above modules may be the processor 501. The processor 501 may be a central processing unit (CPU), a digital processing module, or the like. The apparatus also includes a memory 502 for storing programs executed by the processor 501. The memory 502 may be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or a volatile memory, such as a random-access memory (RAM). The memory 502 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The apparatus may further include a communication interface 503, through which the processor 501 may obtain the GOP to be predicted acquired by other acquisition devices, or obtain the GOP to be predicted from a database.
The processor 501 is configured to execute the program code stored in the memory 502, and is specifically configured to execute the method according to the embodiment shown in fig. 1 to 3. Reference may be made to the methods described in the embodiments shown in fig. 1 to 3, which are not described herein again.
The specific connection medium between the processor 501, the memory 502 and the communication interface 503 is not limited in the embodiments of the present application. In the embodiment of the present application, the processor 501, the memory 502, and the communication interface 503 are connected by the bus 504 in fig. 5, the bus is represented by a thick line in fig. 5, and the connection manner between other components is merely illustrative and not limited. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 5, but this is not intended to represent only one bus or type of bus.
The embodiment of the present invention further provides a computer-readable storage medium for storing the computer software instructions to be executed by the above processor, which contain the program to be executed by the above processor.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (19)

1. A method for predicting video bitrate, comprising:
determining a first parameter set of a GOP (group of pictures) to be predicted and a second parameter set of a reference video, wherein the first parameter set comprises one or more video parameters of the GOP to be predicted, the second parameter set comprises one or more video parameters of the reference video, and the reference video is a video obtained by compressing the GOP to be predicted based on a preset compression standard;
determining a feature set of the GOP to be predicted, wherein the feature set comprises feature values related to human-eye perceptual quality that are determined based on the first parameter set and the second parameter set, and an expected user satisfaction, the expected user satisfaction being the expected ratio of the user satisfaction of a video obtained by compressing the GOP to be predicted at the predicted code rate to the user satisfaction of the reference video;
and inputting the feature set of the GOP to be predicted into a pre-trained regression model to obtain the predicted code rate.
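
By way of non-limiting illustration, the flow of claim 1 can be sketched in Python as follows; the feature values, the expected-satisfaction value, and the toy regression model are hypothetical placeholders, not values or APIs specified by the patent:

```python
# Illustrative sketch only; all numbers are hypothetical.
import numpy as np
from sklearn.svm import SVR

# Feature set of the GOP to be predicted: perceptual feature values
# (temporal masking mean, weighted spatial masking, reference quality,
# log2 of the reference code rate) plus the expected user satisfaction.
features = np.array([[0.42, 0.31, 38.5, np.log2(4000.0), 0.95]])

# Stand-in for the pre-trained regression model (fitted on random data
# here purely so the sketch runs end to end).
rng = np.random.default_rng(0)
model = SVR(kernel="rbf").fit(rng.random((50, 5)), rng.random(50))

predicted_rate = model.predict(features)[0]
```
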
2. The method of claim 1, wherein inputting the feature set of the GOP to be predicted into a regression model trained in advance to obtain the predicted code rate comprises:
mapping the feature set of the GOP to be predicted to a high-dimensional feature space based on a nonlinear mapping mode to obtain a high-dimensional feature set;
inputting the high-dimensional feature set into the regression model to obtain the difference between the code rate of the reference video and the predicted code rate in logarithmic space;
and determining the predicted code rate based on the difference between the code rate of the reference video and the predicted code rate in logarithmic space.
3. The method of claim 2, wherein the regression model conforms to the following formula:
f(x) = w^T φ(X) + b;
wherein f(x) is the difference between the code rate of the reference video and the predicted code rate in logarithmic space, w^T is the transpose of the weight vector, φ(X) is the high-dimensional feature set, and b is the bias.
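
A non-limiting sketch of claims 2 and 3: a support vector regressor realizes f(x) = w^T φ(X) + b implicitly through its kernel, and its output, the log-space difference, is then inverted to recover the predicted code rate. The base-2 logarithm is an assumption adopted here by analogy with claim 4, and the training data are random stand-ins:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X_train = rng.random((100, 5))       # feature sets of sample GOPs
y_train = rng.random(100)            # toy log-space differences f(x)

# The RBF kernel supplies the nonlinear mapping phi() of claim 2.
svr = SVR(kernel="rbf", C=1.0).fit(X_train, y_train)

reference_rate = 4000.0              # reference-video code rate, kbps
x = rng.random((1, 5))               # feature set of the GOP to be predicted
log_diff = svr.predict(x)[0]         # f(x): log2(R_ref) - log2(R_pred)

# Invert the log-space difference to obtain the predicted code rate.
predicted_rate = reference_rate / 2.0 ** log_diff
```
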
4. The method of any of claims 1 to 3, wherein the first set of parameters comprises: the time domain masking effect value of each video frame in the GOP to be predicted, the spatial masking effect value of each pixel point in the GOP to be predicted and the visual saliency value of each pixel point in the GOP to be predicted;
the second set of parameters comprises: the video objective quality value of the reference video and the code rate of the reference video;
the feature set comprises: the average value of the time domain masking effect values, the weighted average value of the spatial domain masking effect values, the video objective quality value of the reference video, and the base-2 logarithm of the code rate of the reference video;
the average value of the time domain masking effect values is determined based on the time domain masking effect value of each video frame in the GOP to be predicted; the weighted average value of the spatial masking effect value is determined based on the weight of each pixel point in the GOP to be predicted and the spatial masking effect value, and the weight of each pixel point in the GOP to be predicted is determined based on the visual saliency value of each pixel point.
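
For illustration only, the claim-4 feature set might be computed as below; the array shapes, masking values, and saliency map are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
temporal_masking = np.array([0.8, 0.7, 0.9, 0.85])  # one value per frame
spatial_masking = rng.random((64, 64))               # one value per pixel
saliency = rng.random((64, 64))                      # one value per pixel
reference_quality = 38.5   # video objective quality of the reference video
reference_rate = 4000.0    # code rate of the reference video, kbps

# Average of the temporal masking effect values over the GOP's frames.
avg_temporal = temporal_masking.mean()

# Saliency values act as per-pixel weights for the spatial masking term.
weights = saliency / saliency.sum()
weighted_spatial = (weights * spatial_masking).sum()

feature_set = np.array([avg_temporal, weighted_spatial,
                        reference_quality, np.log2(reference_rate)])
```
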
5. The method of claim 4, wherein the first parameter set further comprises N compression code rates, the N compression code rates being the code rates corresponding to N compressed videos respectively obtained by compression-encoding the GOP to be predicted using N fixed quantization parameter (QP) points, and N is an integer greater than 0;
the feature set further comprises: the base-2 logarithms of the N compression code rates and/or the proportion of at least one adjacent difference to a maximum difference, wherein an adjacent difference is the difference between the compression code rates corresponding to two adjacent fixed QP points, and the maximum difference is the difference between the compression code rate corresponding to the largest of the N fixed QP points and the compression code rate corresponding to the smallest of the N fixed QP points.
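
A sketch of the claim-5 features for hypothetical code rates at N = 4 fixed QP points, again assuming the base-2 logarithm of claim 4:

```python
import numpy as np

# Compression code rates (kbps) for QP points ordered from the smallest
# to the largest fixed QP; a larger QP yields a smaller code rate.
rates = np.array([6000.0, 4200.0, 3100.0, 2400.0])

log_rates = np.log2(rates)

# Adjacent differences, and their proportion of the maximum difference
# (rate at the largest QP point minus rate at the smallest QP point).
adjacent_diffs = np.diff(rates)
max_diff = rates[-1] - rates[0]
proportions = adjacent_diffs / max_diff
```
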
6. The method according to claim 5, wherein the first parameter set further comprises M video objective quality values, the M video objective quality values being the video objective quality values of M compressed videos relative to the GOP to be predicted, wherein the M compressed videos are any M of the N compressed videos, and M is an integer greater than 0 and not greater than N;
the feature set further comprises: at least one video objective quality difference, wherein a video objective quality difference is the difference between two of the M video objective quality values.
7. The method of any of claims 4 to 6, wherein the second set of parameters further comprises: a frame rate and a resolution of the reference video;
the feature set further comprises: a frame rate and a resolution of the reference video.
8. The method of any one of claims 1 to 7, wherein the regression model is trained by:
in the Kth training process, inputting the feature set of the Kth sample video in a training sample database into the regression model that has been adjusted K-1 times, to obtain the predicted code rate of the Kth sample video, wherein K is an integer greater than 0, the training sample database comprises the feature sets of a plurality of sample videos and the target code rate corresponding to each sample video, and the regression model comprises a weight and a bias;
after the Kth training, obtaining an error value between the predicted code rate of the Kth sample video and the target code rate of the Kth sample video;
if the error value between the predicted code rate of the Kth sample video and the target code rate of the Kth sample video does not meet a preset condition, adjusting the weight and the bias to be used in the (K+1)th training process based on the error value between the predicted code rate of the Kth sample video and the target code rate of the Kth sample video;
and if the error value between the predicted code rate of the Kth sample video and the target code rate of the Kth sample video meets a preset condition, obtaining a functional relationship between the feature set of the sample video and the target code rate, wherein the functional relationship is the regression model.
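
The claim-8 procedure resembles ordinary error-driven training. A minimal sketch with a linear model and plain gradient descent follows; the learning rate, stopping threshold, and random data are hypothetical, and the patent does not prescribe this particular update rule:

```python
import numpy as np

rng = np.random.default_rng(4)
features = rng.random((200, 5))   # feature sets of the sample videos
targets = rng.random(200)         # target code rates (normalized)

w = np.zeros(5)                   # weight of the regression model
b = 0.0                           # bias of the regression model
lr, threshold = 0.05, 1e-3        # hypothetical hyperparameters

for k in range(10_000):           # Kth training pass
    x, t = features[k % len(features)], targets[k % len(targets)]
    error = (w @ x + b) - t       # predicted minus target code rate
    if abs(error) < threshold:    # preset condition met: stop training
        break
    # Otherwise adjust the weight and bias for the (K+1)th pass.
    w -= lr * error * x
    b -= lr * error
```
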
9. The method of claim 8, wherein the training sample database is constructed by:
obtaining a plurality of sample videos;
for each sample video, determining one or more video parameters of the sample video and one or more video parameters of a reference video of the sample video, where the reference video of the sample video is a video obtained by compressing the sample video based on the preset compression standard;
determining the feature value based on video parameters of the sample video and video parameters of a reference video of the sample video;
compressing the sample video by adopting a preset code rate to obtain a target video, wherein the preset code rate is smaller than the code rate of a reference video of the sample video;
collecting statistics on the user satisfaction of the reference video of the sample video and the user satisfaction of the target video;
determining a user satisfaction ratio of the user satisfaction of the target video to the user satisfaction of the reference video of the sample video;
and taking the feature value determined based on the video parameters of the sample video and the video parameters of the reference video of the sample video, together with the ratio of the user satisfaction of the target video to the user satisfaction of the reference video of the sample video, as the feature set of the sample video in the training sample database, and taking the preset code rate as the target code rate of the sample video in the training sample database.
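
For illustration, the claim-9 database construction might be organized as below; compress, compute_features, and collect_satisfaction are hypothetical stand-ins for the encoder, the feature extraction of claims 4 to 7, and the user study:

```python
# Toy stand-ins so the sketch runs; real systems would replace these.
def compress(video, rate):
    return {"video": video, "rate": rate}          # "compressed video"

def compute_features(video, reference):
    return [0.42, 0.31, reference["rate"]]          # toy feature values

def collect_satisfaction(video):
    return 0.9 if video["rate"] >= 3000 else 0.8    # satisfied-user ratio

def build_training_entry(sample_video, reference_rate, preset_rate):
    # Claim 9: the preset code rate is below the reference code rate.
    assert preset_rate < reference_rate
    reference = compress(sample_video, reference_rate)
    target = compress(sample_video, preset_rate)
    features = compute_features(sample_video, reference)
    ratio = collect_satisfaction(target) / collect_satisfaction(reference)
    # Feature set = feature values + satisfaction ratio; the preset code
    # rate becomes the target code rate in the training sample database.
    return features + [ratio], preset_rate

entry = build_training_entry("video0", reference_rate=4000, preset_rate=2500)
```
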
10. An apparatus for predicting a video bitrate, comprising:
a parameter determination module, configured to determine a first parameter set of a GOP (group of pictures) to be predicted and a second parameter set of a reference video, wherein the first parameter set comprises one or more video parameters of the GOP to be predicted, the second parameter set comprises one or more video parameters of the reference video, and the reference video is a video obtained by compressing the GOP to be predicted based on a preset compression standard;
a feature determination module, configured to determine a feature set of the GOP to be predicted, wherein the feature set comprises feature values related to human-eye perceptual quality that are determined based on the first parameter set and the second parameter set determined by the parameter determination module, and an expected user satisfaction, the expected user satisfaction being the expected ratio of the user satisfaction of a video obtained by compressing the GOP to be predicted at the predicted code rate to the user satisfaction of the reference video;
and a prediction module, configured to input the feature set of the GOP to be predicted determined by the feature determination module into a pre-trained regression model to obtain the predicted code rate.
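
A non-limiting sketch of the claim-10 module decomposition; the method bodies are toy placeholders rather than the patent's implementation:

```python
import numpy as np
from sklearn.svm import SVR

class BitratePredictor:
    """Sketch of the claimed apparatus; all internals are hypothetical."""

    def __init__(self, model):
        self.model = model                 # pre-trained regression model

    def determine_parameters(self, gop, reference):
        # Parameter determination module: first and second parameter sets.
        return {"gop": gop}, {"reference": reference}

    def determine_features(self, p1, p2, expected_satisfaction):
        # Feature determination module: perceptual feature values plus
        # the expected user satisfaction (toy values here).
        return [0.42, 0.31, expected_satisfaction]

    def predict_rate(self, features):
        # Prediction module: query the pre-trained regression model.
        return self.model.predict([features])[0]

rng = np.random.default_rng(5)
toy_model = SVR().fit(rng.random((30, 3)), rng.random(30))
predictor = BitratePredictor(toy_model)
p1, p2 = predictor.determine_parameters("gop0", "ref0")
rate = predictor.predict_rate(predictor.determine_features(p1, p2, 0.95))
```
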
11. The apparatus of claim 10, wherein the prediction module is specifically configured to:
mapping the feature set of the GOP to be predicted to a high-dimensional feature space based on a nonlinear mapping mode to obtain a high-dimensional feature set;
inputting the high-dimensional feature set into the regression model to obtain the difference between the code rate of the reference video and the predicted code rate in logarithmic space;
and determining the predicted code rate based on the difference between the code rate of the reference video and the predicted code rate in logarithmic space.
12. The apparatus of claim 11, wherein the regression model conforms to the following formula:
f(x) = w^T φ(X) + b;
wherein f(x) is the difference between the code rate of the reference video and the predicted code rate in logarithmic space, w^T is the transpose of the weight vector, φ(X) is the high-dimensional feature set, and b is the bias.
13. The apparatus of any of claims 10 to 12, wherein the first set of parameters comprises: the time domain masking effect value of each video frame in the GOP to be predicted, the spatial masking effect value of each pixel point in the GOP to be predicted and the visual saliency value of each pixel point in the GOP to be predicted;
the second set of parameters comprises: the video objective quality value of the reference video and the code rate of the reference video;
the feature set comprises: the average value of the time domain masking effect values, the weighted average value of the spatial domain masking effect values, the video objective quality value of the reference video, and the base-2 logarithm of the code rate of the reference video;
the average value of the time domain masking effect values is determined based on the time domain masking effect value of each video frame in the GOP to be predicted; the weighted average value of the spatial masking effect value is determined based on the weight of each pixel point in the GOP to be predicted and the spatial masking effect value, and the weight of each pixel point in the GOP to be predicted is determined based on the visual saliency value of each pixel point.
14. The apparatus of claim 13, wherein the first parameter set further comprises N compression code rates, the N compression code rates being the code rates corresponding to N compressed videos respectively obtained by compression-encoding the GOP to be predicted using N fixed quantization parameter (QP) points, and N is an integer greater than 0;
the feature set further comprises: the base-2 logarithms of the N compression code rates and/or the proportion of at least one adjacent difference to a maximum difference, wherein an adjacent difference is the difference between the compression code rates corresponding to two adjacent fixed QP points, and the maximum difference is the difference between the compression code rate corresponding to the largest of the N fixed QP points and the compression code rate corresponding to the smallest of the N fixed QP points.
15. The apparatus according to claim 14, wherein the first parameter set further comprises M video objective quality values, the M video objective quality values being the video objective quality values of M compressed videos relative to the GOP to be predicted, wherein the M compressed videos are any M of the N compressed videos, and M is an integer greater than 0 and not greater than N;
the feature set further comprises: at least one video objective quality difference, wherein a video objective quality difference is the difference between two of the M video objective quality values.
16. The apparatus of any of claims 13 to 15, wherein the second set of parameters further comprises: a frame rate and a resolution of the reference video;
the feature set further comprises: a frame rate and a resolution of the reference video.
17. The apparatus of any one of claims 10 to 16, further comprising a training module;
the training module is used for obtaining the regression model through training in the following mode:
in the Kth training process, inputting the feature set of the Kth sample video in a training sample database into the regression model that has been adjusted K-1 times, to obtain the predicted code rate of the Kth sample video, wherein K is an integer greater than 0, the training sample database comprises the feature sets of a plurality of sample videos and the target code rate corresponding to each sample video, and the regression model comprises a weight and a bias;
after the Kth training, obtaining an error value between the predicted code rate of the Kth sample video and the target code rate of the Kth sample video;
if the error value between the predicted code rate of the Kth sample video and the target code rate of the Kth sample video does not meet a preset condition, adjusting the weight and the bias to be used in the (K+1)th training process based on the error value between the predicted code rate of the Kth sample video and the target code rate of the Kth sample video;
and if the error value between the predicted code rate of the Kth sample video and the target code rate of the Kth sample video meets a preset condition, obtaining a functional relationship between the feature set of the sample video and the target code rate, wherein the functional relationship is the regression model.
18. The apparatus of claim 17, wherein the apparatus further comprises a construction module;
the construction module is used for constructing the training sample database in the following way:
obtaining a plurality of sample videos;
for each sample video, determining one or more video parameters of the sample video and one or more video parameters of a reference video of the sample video, where the reference video of the sample video is a video obtained by compressing the sample video based on the preset compression standard;
determining the feature value based on video parameters of the sample video and video parameters of a reference video of the sample video;
compressing the sample video by adopting a preset code rate to obtain a target video, wherein the preset code rate is smaller than the code rate of a reference video of the sample video;
collecting statistics on the user satisfaction of the reference video of the sample video and the user satisfaction of the target video;
determining a user satisfaction ratio of the user satisfaction of the target video to the user satisfaction of the reference video of the sample video;
and taking the feature value determined based on the video parameters of the sample video and the video parameters of the reference video of the sample video, together with the ratio of the user satisfaction of the target video to the user satisfaction of the reference video of the sample video, as the feature set of the sample video in the training sample database, and taking the preset code rate as the target code rate of the sample video in the training sample database.
19. A computer storage medium storing program instructions that, when run on an electronic device, cause the electronic device to perform the method of any of claims 1 to 9.
CN201811086393.7A 2018-09-18 2018-09-18 Video code rate prediction method and device Withdrawn CN110913221A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811086393.7A CN110913221A (en) 2018-09-18 2018-09-18 Video code rate prediction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811086393.7A CN110913221A (en) 2018-09-18 2018-09-18 Video code rate prediction method and device

Publications (1)

Publication Number Publication Date
CN110913221A true CN110913221A (en) 2020-03-24

Family

ID=69812786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811086393.7A Withdrawn CN110913221A (en) 2018-09-18 2018-09-18 Video code rate prediction method and device

Country Status (1)

Country Link
CN (1) CN110913221A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112399176A (en) * 2020-11-17 2021-02-23 深圳大学 Video coding method and device, computer equipment and storage medium
CN112399177A (en) * 2020-11-17 2021-02-23 深圳大学 Video coding method and device, computer equipment and storage medium
CN112399176B (en) * 2020-11-17 2022-09-16 深圳市创智升科技有限公司 Video coding method and device, computer equipment and storage medium
CN113038129A (en) * 2021-03-12 2021-06-25 上海视龙软件有限公司 Method and equipment for acquiring data samples for machine learning

Similar Documents

Publication Publication Date Title
CN113196761A (en) Method and device for evaluating subjective quality of video
US11503304B2 (en) Source-consistent techniques for predicting absolute perceptual video quality
US9282330B1 (en) Method and apparatus for data compression using content-based features
JP2018522448A (en) Technology to predict perceptual video quality
KR102523149B1 (en) Quantification of Perceptual Quality Model Uncertainty via Bootstrapping
KR20180037593A (en) Using image analysis algorithms for providing training data to neural networks
AU2019218894B2 (en) Techniques for predicting perceptual video quality based on complementary perceptual quality models
CN110913221A (en) Video code rate prediction method and device
KR102472971B1 (en) Method, system, and computer program to optimize video encoding using artificial intelligence model
US11729396B2 (en) Techniques for modeling temporal distortions when predicting perceptual video quality
CN114900692A (en) Video stream frame rate adjusting method and device, equipment, medium and product thereof
CN114554211A (en) Content adaptive video coding method, device, equipment and storage medium
Jin et al. Quantifying the importance of cyclopean view and binocular rivalry-related features for objective quality assessment of mobile 3D video
Pascual et al. Adjustable compression method for still JPEG images
CN112437301A (en) Code rate control method and device for visual analysis, storage medium and terminal
CA3182110A1 (en) Reinforcement learning based rate control
Zhu et al. Just noticeable difference (JND) and satisfied user ratio (SUR) prediction for compressed video: research proposal
CN116760988B (en) Video coding method and device based on human visual system
Ma et al. Interactive gigapixel video streaming via multiscale acceleration
JP2017005343A (en) Video quality estimation device, video quality estimation method and program
KR20230143377A (en) Method and system for optimizing video encoding based on scene unit prediction
KR20230140276A (en) Method and system for optimizing video encoding based on sliding window unit prediction
JP6431449B2 (en) Video matching apparatus, video matching method, and program
Leszczuk et al. Large-scale research on quality of experience (QoE) algorithms
CN116800953A (en) Video quality assessment method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20200324)