
WO2019105535A1 - Device and method for providing an encoded version of a media signal

Info

Publication number: WO2019105535A1
Application number: PCT/EP2017/080714
Authority: WO
Inventors: Shahid Mahmood Satti; Matthias Obermann; Christian Schmidmer; Michael Keyhl
Original assignee: Opticom Dipl.-Ing. Michael Keyhl Gmbh
Priority date: 2017-11-28
Filing date: 2017-11-28
Publication date: 2019-06-06

Abstract

Described is a more effective provision of an encoded version of a media signal which is to meet a certain perceptual-distortion dependent optimization target, compared to concepts that seek to categorize the media signal and then deduce, in one go, the predetermined set of one or more control parameters to be used for the encoding. To this end, advantage is taken of the information gained from each trial: encoding the media signal using a certain set of one or more control parameters results in a certain encoded version of the media signal and allows for measuring the perceptual distortion associated with that encoded version to obtain a set of one or more quality indicators.

Description

DEVICE AND METHOD FOR PROVIDING AN ENCODED VERSION OF A MEDIA SIGNAL

The present application is concerned with a provision of an encoded version of a media signal such as, for instance, a video signal.

In adaptive streaming systems, videos are typically encoded/compressed once for different network connection speeds and then downloaded by millions of users. Suboptimal video compression leads either to unnecessary visual artifacts and a reduced user experience, or to more network traffic if the video compression was inefficient. Small savings with regard to the tradeoff between quality and compression rate can lead to major savings in storage and transmission costs, as well as an improved end-user experience. Significant efforts to improve the encoding process are therefore justified.

Modern video codecs like H.264, HEVC or VP9 are typically controlled by a multitude of parameters which need to be supplied by the user. These parameters may involve bitrate, resolution, frame rate, GOP length and structure, compression speed, quality targets, methods for prediction, slicing, motion estimation and others. The full list is codec and implementation dependent.

Selecting the best values for these parameters is essential in order to achieve the best tradeoff between visual compression artifacts and compression ratio. The optimal parameters depend on the codec implementation as well as the content and individual scenes of a video. Based on these parameters, the codec will try to optimize the result by estimating the visual impact using simple methods like PSNR for very few consecutive video frames. Typically, choosing the ideal set of parameters to achieve a targeted tradeoff between visual artifacts due to compression and bitrate savings is a very difficult manual task which requires expert knowledge and experience. Therefore, in most cases some predefined parameter sets are used, which provide sufficient, but clearly suboptimal compression performance for the sake of improved usability. This imperfection has been recognized by the industry and attempts exist to improve the parameter selection.

WO2016160295A1 [1] (Netflix) discloses a method in which a video quality metric is used to determine an optimized bitrate ladder for encoding one video at very different bitrates with well-defined subjective quality differences. In a first step, the complexity of the video is estimated. Based on this estimate, the rate-distortion curve of the codec for the given video is determined, on the assumption that the scene complexity is sufficient to do so. The estimated rate-distortion curve then allows choosing a bitrate which corresponds to a target distortion.

The concept disclosed in [1] cannot take into account individual codec parameters or different requirements of individual scenes. Furthermore, it is based on estimating the complexity of the video, which is fairly inaccurate. This very coarse estimate is then combined with the assumption that coding efficiency is a simple function of the video complexity. The method in [1] is therefore inherently sub-optimal.

A further publication [2] (https://qz.com/920857/netflix-nflx-uses-ai-in-its-new-codec-to-compress-video-scene-by-scene/, Netflix) describes a system which tries to categorize the content type of individual video scenes and to choose a predefined set of parameters which has been optimized for similar content. Since in this case the scene needs to be categorized, which a machine can do only very coarsely, the result will still be less than optimal.

It is the object of the present invention to provide a concept for providing an encoded version of a media signal which is to meet a certain perceptual-distortion dependent optimization target, the concept being more effective in terms of, for instance, accuracy in meeting the optimization target and/or computational overhead at comparable accuracy.

This object is achieved by the subject-matter of the independent claims of the present application.

The present invention is based on the finding that it is possible to provide an encoded version of a media signal which is to meet a certain perceptual-distortion dependent optimization target more effectively than with concepts seeking to categorize the media signal and then deducing, in one go, the predetermined set of one or more control parameters to be used for the encoding, if one takes advantage of the information gained from each trial: encoding the media signal using a certain set of one or more control parameters results in a certain encoded version of the media signal and allows for measuring the perceptual distortion associated with that encoded version to obtain a set of one or more quality indicators. The inventors of the present application found that this information pair, namely the certain set of one or more control parameters and the associated set of one or more quality indicators which results from this trial, is already a good starting point for a guess as to which set of one or more control parameters to try next. In particular, the inventors of the present application realized that the available tools for measuring perceptual distortion are quite accurate and do not require excessive computational power. Owing to that accuracy, it is feasible to derive a good candidate for the final or predetermined set of one or more control parameters by iterative determination: each iteration involves the encoding of the media signal using a certain set of one or more control parameters and the measurement of the associated perceptual distortion to obtain the associated set of one or more quality indicators; based thereon, a successor set of one or more control parameters is derived in order to come closer to the perceptual-distortion dependent optimization target, and this successor set is then used in the next iteration, i.e., the subsequent iteration is commenced using this successor set of one or more control parameters. The iterative determination may be aborted according to some predetermined abort criterion. As the encoded version of the media signal, the encoded version obtained in the last iteration may be used or provided. In accordance with an alternative embodiment, however, it is also feasible to use or provide a previously obtained encoded version of the media signal, i.e., an encoded version obtained from a previous trial or iteration, for which a set of one or more quality indicators has been obtained which is closer to the perceptual-distortion dependent optimization target than the currently obtained set of one or more quality indicators.
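The iterative determination described above may be illustrated by the following minimal sketch. It is not taken from the application: the function toy_encode_and_measure is a hypothetical stand-in for the chain of encoding, decoding and perceptual quality measurement, the single control parameter is assumed to be a CRF-like value between 0 and 51, and the linear quality model and the step factor of 0.5 are arbitrary assumptions chosen only so that the example runs.

```python
def toy_encode_and_measure(crf):
    """Hypothetical stand-in for 'encode, decode, measure perceptual quality'.

    Assumption: quality (0..100) falls linearly with a CRF-like parameter.
    """
    return 100.0 - 1.5 * crf


def iterate_to_target(target_quality, start_crf=30.0, tolerance=0.5,
                      max_iterations=20):
    """Run trials until the measured quality is within tolerance of the target."""
    crf = start_crf
    trials = []  # one (control parameter, quality indicator) pair per iteration
    for _ in range(max_iterations):
        quality = toy_encode_and_measure(crf)
        trials.append((crf, quality))
        deviation = quality - target_quality
        if abs(deviation) <= tolerance:  # abort criterion: target reached
            break
        # Successor parameter: quality too high -> raise CRF, too low -> lower it.
        crf = min(51.0, max(0.0, crf + 0.5 * deviation))
    return trials


trials = iterate_to_target(80.0)
final_crf, final_quality = trials[-1]
print(len(trials), final_crf, final_quality)
```

In this sketch the encoded version of the last iteration is taken as the result; a real device would replace the toy function by an actual encoder run followed by a quality measurement.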

In accordance with an embodiment, the successor set of one or more control parameters is derived depending on (a) the current set of one or more control parameters, i.e., the set tried out in the current iteration, (b) the set of one or more quality indicators obtained in the current iteration by measuring the perceptual distortion of the encoded version obtained using the current set, and (c) one or more sets of one or more control parameters and the associated one or more sets of one or more quality indicators obtained in any previous iteration. For instance, in the second iteration, not only the set of one or more control parameters and the associated set of one or more quality indicators obtained in the current trial, i.e., the second iteration, are used for the determination of the successor set of one or more control parameters, but also the corresponding information of the first iteration or trial, respectively. By exploiting the history of iterations or trials in this way, the approximation or convergence to the optimization target may improve significantly. The additional overhead associated with handling an increased amount of input data for determining the successor set for the succeeding iteration is comparatively low when compared to the reduction in computational complexity owing to the reduced number of iterations/trials needed to find the predetermined set of one or more control parameters, i.e., the set whose associated set of one or more quality indicators is sufficiently likely to be the one closest to the perceptual-distortion dependent optimization target among all possible control parameter sets.
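One way to picture the use of the trial history is a secant-style update, which derives the successor parameter from the two most recent trials rather than from the current one alone. The sketch below is an illustration under assumptions, not the method of the application: measure_quality is a hypothetical toy model, the single control parameter is a CRF-like value, and the secant rule is merely one numerical method that happens to use two history entries.

```python
def measure_quality(crf):
    """Hypothetical toy model: quality falls quadratically with a CRF-like value."""
    return 100.0 - 0.03 * crf * crf


def next_crf_from_history(history, target):
    """Secant step using the two most recent (crf, quality) trials."""
    (crf_a, q_a), (crf_b, q_b) = history[-2], history[-1]
    if crf_a == crf_b or q_a == q_b:
        return crf_b  # no usable slope, keep the current parameter
    slope = (q_b - q_a) / (crf_b - crf_a)  # local quality change per CRF step
    return crf_b + (target - q_b) / slope


def converge(target, first_crf=35.0, second_crf=30.0, tolerance=0.1,
             max_trials=12):
    """Collect trials until the latest one is within tolerance of the target."""
    history = [(first_crf, measure_quality(first_crf)),
               (second_crf, measure_quality(second_crf))]
    while len(history) < max_trials:
        if abs(history[-1][1] - target) <= tolerance:
            break
        crf = next_crf_from_history(history, target)
        history.append((crf, measure_quality(crf)))
    return history


history = converge(85.0)
print(len(history), history[-1])
```

Because each step uses the slope observed between two earlier trials, the toy example reaches the target in a handful of trials without any prior knowledge of the quality model.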

In accordance with an embodiment, the determination of the successor set of one or more control parameters is done using an artificial neural network. Alternatively, numerical methods or machine learning may be used. When using an artificial neural network, the same may use the history of a past iterative determination up to the final/predetermined set of one or more control parameters for a certain perceptual-distortion dependent optimization target in order to learn, i.e., to modify the internal weights of the neurons of the artificial neural network so that the convergence speed for subsequent iterative determinations increases.

Further, the fast convergence to the final/predetermined set of one or more control parameters which the individual trials allow for may be improved even further in accordance with an embodiment where even the starting set of one or more control parameters, namely the one to be used for encoding the media signal in the first iteration, is not just set to default values independently of the media signal, but is set dependent on a priori information. For instance, the media signal itself may already represent one segment of a greater video material such as, for instance, a scene of a video, and those descriptors having been used for the segmentation may be used in order to determine the starting set of one or more control parameters for the first iteration. That is, information may be used for this starting set determination which is available anyway, so that providing this information does not require any additional effort. The fact that the determination of the starting set of one or more control parameters is not that accurate does not matter since, as indicated above, the fast convergence property of the iterative determination resolves any remaining deviation.

In accordance with an embodiment, the measurement of the perceptual distortion associated with a certain tested encoded version is done using a full-reference or no-reference quality measurement, i.e., a quality measurement complying with known quality standards, for instance. By this measure, the quality measurement is quite "accurate", which further increases the above-mentioned convergence speed.

In accordance with a further embodiment, the concept for providing an encoded signal using iterative determination is used to provide, in fact, a set of encoded versions by using different perceptual-distortion dependent optimization targets for each individual encoded version from the set of encoded versions. The iterative determinations for the individual encoded versions may be done in parallel or in series. If done in series, the finally determined, i.e., predetermined, set of one or more control parameters determined for a first one of the set of encoded versions may be used in order to speed up the convergence of the iterative determination for the next encoded version of the set of encoded versions. If done in parallel, all trials and their results may be exploited in the subsequent iteration of all parallel running iterative determinations for all of the set of encoded versions, thereby significantly speeding up the convergence.

Advantageous aspects of the present application are the subject of dependent claims. Preferred embodiments of the present application are described below with respect to the figures, among which:

Fig. 1 shows a block diagram for a device for providing an encoded version of a media signal in accordance with an embodiment;

Fig. 2 shows a block diagram of a device for providing a plurality of encoded versions of a media signal in accordance with an embodiment;

Fig. 3 shows a flow diagram of the iterative determination of the encoded version and its underlying control parameter set as it may be executed by the device of Fig. 1 or Fig. 2;

Fig. 4 shows a schematic diagram illustrating a client and a server and a streaming scenario there between which may involve representations generated in accordance with embodiments of the present application such as the ones described with respect to Figs. 1-3.

Before describing certain embodiments of the present application, an attempt shall be made to explain the thoughts underlying the embodiments of the present application and the advantages resulting therefrom. Although, preliminarily, these considerations are exemplified using a video as an example for a media signal, it is clear that this focusing onto videos is merely made representatively and for ease of understanding and should not be treated as limiting the embodiments of the present application.

The present inventors recognized that characterizing video scenes is extremely difficult since the possible number of different scenes is infinite and the behavior of the codec for each scene strongly depends on its implementation.

The inventors further recognized that video quality measurement may be made far more accurate than scene characterization. Very advanced and standardized methods like ITU-T J.247 or ITU-T J.343 exist to measure the perceived video quality. These methods simulate the human visual system and are able to predict video quality as a human observer would score it. They are far more accurate than any existing scene characterization algorithm or simple image quality metrics like PSNR. Most video quality measurement algorithms are internally based on a distortion classification and quantification, and these can be used as an additional output of the video quality measurement algorithm, besides a single quality metric.

Thus, embodiments of the present application seek to characterize the degradations introduced by the video codec rather than characterizing the scenes. This knowledge of the distortion characteristics can then be exploited to optimize the codec parameters iteratively in multiple passes.

The embodiments further outlined below work without limitation for entire videos, or other media content as a whole, but to achieve best results it may be advisable to apply them to individual scenes or media segments, respectively. Such scene cuts can be detected automatically and methods to do so are available.

Thus, embodiments of the present application outlined in more detail below, involve applying a video quality measurement module to the decoded output of the video codec. The quality indicators resulting from the video quality measurement and distortion analysis can then be used as input to a Parameter Optimization Module, which proposes a better set of parameters. Video compression is then repeated with this optimized set. Finally, the entire process is repeated until an optimum solution has been found.

The ultimate optimization target will typically depend on the application and purpose of the encoded media and may comprise achieving certain thresholds for one or several of the quality indicators or finding a solution where one or all quality indicators fall into a certain range. Reasons for terminating the iterations may include reaching the optimization target, reaching a maximum number of iterations, achieving certain quality targets while specific control parameters fall into a certain range, or detecting that improvements between iterations become very small.
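The termination reasons listed above may be collected into a single check, as in the following sketch. The function name should_stop, the default thresholds and the reduction of the quality indicators to a single score per iteration are assumptions made for illustration only.

```python
def should_stop(quality_history, target, tolerance=0.5, max_iterations=10,
                min_gain=0.05):
    """Return a reason string if the iteration should end, otherwise None.

    quality_history holds one measured quality score per completed iteration.
    """
    if not quality_history:
        return None
    current_error = abs(quality_history[-1] - target)
    if current_error <= tolerance:
        return "target reached"
    if len(quality_history) >= max_iterations:
        return "iteration limit"
    if len(quality_history) >= 2:
        previous_error = abs(quality_history[-2] - target)
        # Also triggers when the latest trial moved away from the target.
        if previous_error - current_error < min_gain:
            return "improvement too small"
    return None


print(should_stop([70.0, 79.8], target=80.0))
print(should_stop([70.0, 75.0], target=80.0))
```

A check for control parameters falling into a certain range could be added in the same way by passing the parameter history alongside the quality history.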

In order to reduce the number of iterations and thus the required processing time, it is recommended to start the first iteration with carefully selected parameters. This first selection may be based on a coarse scene characterization. All subsequent iterations will ignore the scene characterization and rely on the distortion analysis only. The Parameter Optimization Module is preferably implemented using methods of artificial intelligence or numerical optimization algorithms in order to derive the optimal parameters after as few iterations as possible. Any solution incorporating a memory of any kind for previous optimization results is preferred, since this allows the optimization unit to learn from history and to converge more quickly to an optimal solution in the future.

Just for the sake of completeness, it must be mentioned that it is also possible to derive the optimal parameters by a brute force method, by simply trying a reasonably large number of parameter values and combinations. This will however require significantly more processing power than the described intelligent parameter optimization.

Additional benefit can be achieved if different encoded versions of the same video are determined which represent, e.g., the different quality renditions required for an adaptive streaming service. In this configuration, the parameter optimization unit can deduce near-optimal control parameters from previous iterations for already calculated optimal compressed versions of the same video and converge much faster in subsequent iterations aiming at a different optimization target, e.g., a different bitrate range. In this arrangement, it is also possible and often desirable to constrain the variation of the codec control parameters, i.e., the determination for the subsequent iteration, in such a way that one or several of the control parameters, e.g., GOP size, take the same value for all encoded versions. This may lead to suboptimal encoding of individual encoded versions, but may be advantageous overall.

As a side effect, since video quality indicators have already been calculated for each video frame, the quality indicators can be stored either frame by frame, or for reasonable sections of the video without additional computational cost. In adaptive streaming systems like DASH or HLS, this information can be passed on to the client-side player via e.g. a manifest file or other means. The player can then further exploit this information in order to improve the switching behavior between different representations of the same video.
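How a player might exploit stored per-segment quality indicators can be sketched as follows. The JSON layout, the field names and the selection rule are hypothetical and do not correspond to the DASH MPD or HLS playlist syntax; the sketch only shows that quality-aware switching can pick a cheaper representation whenever it already meets a quality floor for the segment at hand.

```python
import json


def build_quality_sidecar(renditions):
    """Serialize per-segment quality scores per representation (hypothetical layout)."""
    return json.dumps({"representations": [
        {"id": r["id"], "bandwidth": r["bandwidth"],
         "segment_quality": r["segment_quality"]}
        for r in renditions]}, indent=2)


def pick_representation(sidecar_json, segment_index, available_bandwidth,
                        quality_floor):
    """Pick the cheapest affordable representation that meets the quality floor."""
    reps = json.loads(sidecar_json)["representations"]
    affordable = [r for r in reps if r["bandwidth"] <= available_bandwidth]
    if not affordable:  # nothing fits: fall back to the lowest bitrate
        return min(reps, key=lambda r: r["bandwidth"])["id"]
    good = [r for r in affordable
            if r["segment_quality"][segment_index] >= quality_floor]
    if good:
        return min(good, key=lambda r: r["bandwidth"])["id"]
    # Floor not reachable: take the best quality that is still affordable.
    return max(affordable, key=lambda r: r["segment_quality"][segment_index])["id"]


sidecar = build_quality_sidecar([
    {"id": "low", "bandwidth": 1_000_000, "segment_quality": [72, 55]},
    {"id": "mid", "bandwidth": 3_000_000, "segment_quality": [85, 70]},
    {"id": "high", "bandwidth": 6_000_000, "segment_quality": [92, 84]},
])
print(pick_representation(sidecar, 0, 6_500_000, 70))
print(pick_representation(sidecar, 1, 6_500_000, 70))
```

In the example data, the first segment is easy to encode, so the lowest representation already meets the floor, whereas the second segment requires the middle representation.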

Other use cases for the recorded video quality indicators include reporting to network monitoring or analytics systems.

Figure 1 shows an apparatus for providing an encoded version of a media signal in accordance with a first embodiment of the present application. As previously indicated, it is assumed that the media signal is a video, although this restriction is merely made for the sake of easier understanding, and the subsequent description may easily be broadened to relate to media signals in general or be transferred to other sorts of media signals such as audio signals or the like.

The apparatus of Fig. 1 comprises an input 99 at which the media signal, i.e., video 101, is received, and an output 111 at which the apparatus of Fig. 1 outputs the encoded version 107, i.e., the compressed video, of the inbound video 101. Connected between input 99 and output 111, the apparatus of Fig. 1 comprises an encoder 102. Encoder 102 operates according to any compression codec, such as any video codec in case of a video as media signal 101, and encodes the inbound media signal 101 into an encoded version thereof. The video codec according to which the encoder 102 operates may be a standardized codec such as H.264, HEVC, VP9 or any other video codec or, in case of other media signal types, AAC or the like.

The encoder 102 is controllable by one or more control parameters 110, where the one or more parameters are individually variable within certain limits. These control parameters 110, as well as how often encoder 102 encodes the same inbound signal and which of these encodings or trials is finally used and output at output 111, are controlled by an optimization device 100. The aim of the optimization device 100 is to determine, in an iterative manner, namely by iteratively running encoder 102 using different sets of one or more control parameters 110, a final set of one or more control parameters which leads to an encoded version which is optimal in some perceptual-distortion dependent sense. That is, the optimization aim is, for instance, the minimization of a function which increases with increasing deviation from a perceptual-distortion target. The function may, however, also depend on other values such as values of one or more parameters within set 110, i.e., the set of one or more control parameters. For instance, the optimization function could be a sum having one addend measuring the deviation from the perceptual-distortion target and a further addend depending on one of the control parameters such as bitrate, computational complexity or the like, or a combination thereof. The device 100 operates iteratively in a manner described in more detail below and is connected, to this end, to the output of encoder 102 and the control input of encoder 102.
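The optimization function composed of two addends may be written down as in the following sketch. The weight rate_weight, the use of the absolute deviation and the example trial values are assumptions chosen for illustration; the application leaves the exact form of the function open.

```python
def cost(quality, bitrate_kbps, target_quality, rate_weight=0.001):
    """Sum of a deviation addend and a bitrate-dependent addend."""
    deviation_term = abs(quality - target_quality)  # grows with distance to target
    rate_term = rate_weight * bitrate_kbps          # penalizes expensive encodings
    return deviation_term + rate_term


# Hypothetical trial results: control parameter, measured quality, resulting bitrate.
example_trials = [
    {"crf": 18, "quality": 81.0, "bitrate_kbps": 6000},
    {"crf": 20, "quality": 79.5, "bitrate_kbps": 4500},
    {"crf": 24, "quality": 74.0, "bitrate_kbps": 3000},
]

best_trial = min(
    example_trials,
    key=lambda t: cost(t["quality"], t["bitrate_kbps"], target_quality=80.0))
print(best_trial["crf"])
```

With these numbers the trial slightly below the target wins, because its small deviation is outweighed by the bitrate saved relative to the trial slightly above the target.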

Internally, device 100 comprises a video quality characterization unit 103 which is configured to measure the results of encodings made by encoder 102, i.e., to measure the quality or perceptual distortion of encoded versions which encoder 102 generates using the set 110 of one or more control parameters as applied to its parameter input.

Further, device 100 comprises a Parameter Optimization Module 104 which receives from the video quality characterization unit 103 a set 109 of one or more quality indicators per quality or distortion measurement made by unit 103 and derives, based thereon, a set 110 of one or more control parameters which is applied to encoder 102 to be used for the next encoding.

Further, device 100 comprises a Control Logic 106 communicatively coupled to module 104. For instance, Control Logic 106 could direct module 104 with respect to the setting of the set 110 of one or more control parameters at times where a set 109 of one or more quality indicators is not yet available, although in the present embodiment this is done by an optionally present unit 105 described later on. Further, Control Logic 106 may assume tasks like storing a history of the sets 110 of one or more control parameters applied to encoder 102 and the associated sets 109 of one or more quality indicators obtained by unit 103 on the basis of the encoded versions which encoder 102 generated, in turn, using these control parameter sets 110. Control Logic 106 may also indicate which of the encoded versions generated by encoder 102 during the individual iterations shall be treated at the output 111 as the final/predetermined encoded version, i.e., the one finally chosen.

Optionally, Fig. 1 shows that device 100 may comprise a scene characterization unit 105 which is connected to input 99 in order to assign scene characterizing descriptors to the media signal applied at input 99. Again, it is noted that, according to one embodiment, the media signal or video 101 applied at input 99 may represent one portion/segment/scene out of several such portions/segments/scenes into which a greater media material or video has been coarsely sub-divided, so that all the tasks of modules 102, 103, 104 and 106 are then performed on a scene-by-scene basis. According to another embodiment, this subdivision, which is not shown in Fig. 1, is not used.

Finally, as also shown in Fig. 1, device 100 may optionally be configured to output, in addition to the finally chosen encoded version of encoder 102, the associated set 108 of one or more quality indicators which unit 103 obtained from that encoded version by measurement. Later on, it will be shown that this quality indicator set 108, or information derived therefrom, may, for instance, be stored in a media presentation description or other media description so as to be used when streaming the encoded representation of video 101 from a server.

The interaction between the modules shown in Fig. 1 which, together, results in the iterative determination of the final/predetermined control parameter set to which the final set 108 of one or more quality indicators belongs, will become clearer from the following description.

Thus, as described above, it might be that the inbound video 101 is already restricted to one scene and another device, which is not shown in Fig. 1, would have already split a longer video signal into individual, short scenes which then represent the video 101 in Fig. 1. However, this is merely optional. Whether additional scene segmentation is used or not, one task of the scene characterization unit 105 is to provide one or more descriptors which describe a current scene and which are then forwarded by unit 105 to the Parameter Optimization Module 104. The descriptors might be the same which were used for the scene segmentation. This set of one or more descriptors may be used by Parameter Optimization Module 104 in order to determine a starting set of control parameters to be used for the first iteration or first encoding trial to be performed by encoder 102.

Having said that, the iterative determination of the finally chosen control parameters set 110 could be performed as follows:

The original video sequence 101 is provided as input to the arrangement. In the first iteration, a coarse scene characterization is performed by unit 105 in order to derive suitable start parameters for the Parameter Optimization Module 104. The Parameter Optimization Module 104 is designed to derive a suitable set of control parameters from the scene characterization. After a first compression run using the codec 102 with the first set of control parameters, the compressed and again decoded video is analyzed by the Video Characterization Unit 103. The output of the Video Characterization Unit 103 is a set of quality indicators 109, which may contain at least one of a plurality of indicators which classify the overall video quality, the amount of temporal distortions, the amount of spatial distortions, the amount of chrominance distortions or the amount of motion in the video sequence. Based on the Quality Indicators 109, the Parameter Optimization Module 104 will determine a new, better set of Control Parameters 110, and the next iteration will start. The last iteration is reached, when either a predetermined number of iterations has been calculated, or the Parameter Optimization Module 104 signals via some Control Logic 106, that a reasonable solution has been found or no better solution can be found. As a result, the best coded version of the video will be provided as output 107.

Suitable example embodiments of the Parameter Optimization Module 104 implement this based on artificial intelligence (AI) or machine learning. Well-suited candidates include Artificial Neural Networks (ANN) or classification trees like random forest. Traditional numerical optimization strategies can be applied as well, but are less likely to converge fast. In the case of using an AI module, it is advisable to select a type which is recursive in order to include the history of deriving the current optimal set of indicators into the optimization process and thus speed up the convergence. An even bigger gain can be achieved by implementing an AI module with the capability of continuous learning. As a result, all historic determinations for different media lead to an improvement of the AI, and thus the optimization process becomes faster and more accurate with every run.

Preferred embodiments of the Video Quality Characterization Unit 103 are built around standardized full-reference video quality measurement algorithms like ITU-T J.247 or ITU-T J.343 or derivatives thereof. The best-known version so far is PEVQ, which is based on ITU-T J.343. Other metrics, including no-reference methods, will work as well, but with reduced or at least less well documented accuracy.

The Coarse Scene Complexity Estimation Unit 105 is preferably an implementation of the temporal and spatial complexity metrics as they are defined in ITU-T P.910.
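The spatial and temporal complexity metrics of ITU-T P.910 are commonly described as spatial perceptual information (SI), the maximum over time of the spatial standard deviation of the Sobel-filtered luminance frame, and temporal perceptual information (TI), the maximum over time of the spatial standard deviation of the difference between successive frames. The following sketch computes both on tiny frames in pure Python. It is a simplified illustration under assumptions: only interior pixels are filtered, the population standard deviation is used, and frames are plain lists of luminance rows; the Recommendation itself should be consulted for the normative definition.

```python
import math


def _std(values):
    """Population standard deviation of a flat list of numbers."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def _sobel_magnitudes(frame):
    """Sobel gradient magnitude for every interior pixel of a luminance frame."""
    height, width = len(frame), len(frame[0])
    magnitudes = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = ((frame[y - 1][x + 1] + 2 * frame[y][x + 1] + frame[y + 1][x + 1])
                  - (frame[y - 1][x - 1] + 2 * frame[y][x - 1] + frame[y + 1][x - 1]))
            gy = ((frame[y + 1][x - 1] + 2 * frame[y + 1][x] + frame[y + 1][x + 1])
                  - (frame[y - 1][x - 1] + 2 * frame[y - 1][x] + frame[y - 1][x + 1]))
            magnitudes.append(math.hypot(gx, gy))
    return magnitudes


def spatial_information(frames):
    """SI: maximum over frames of the std of the Sobel-filtered frame."""
    return max(_std(_sobel_magnitudes(frame)) for frame in frames)


def temporal_information(frames):
    """TI: maximum over frame pairs of the std of the pixel-wise difference."""
    values = [
        _std([cur - prev
              for prev_row, cur_row in zip(previous, current)
              for prev, cur in zip(prev_row, cur_row)])
        for previous, current in zip(frames, frames[1:])]
    return max(values) if values else 0.0


flat_frame = [[128] * 5 for _ in range(4)]
edge_frame = [[0, 0, 0, 255, 255] for _ in range(4)]
print(spatial_information([flat_frame, edge_frame]))
print(temporal_information([flat_frame, edge_frame]))
```

How such SI and TI values are mapped onto a starting set of control parameters is not specified here and would be a design choice of the Parameter Optimization Module.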

The preferred embodiment from Figure 1 assumes that the last encoded version is the one which yielded optimum results. Whether this is the case depends on the detailed implementation of the Control Logic 106 and the Parameter Optimization Module 104. If this assumption does not hold, the Device 100 must store the best-so-far solution, including the codec control parameters 110 and the resulting Quality Indicators 109, and return the stored best-found solution after the last iteration.
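Storing the best-so-far solution may be sketched as follows. The class name BestTrialStore, the dictionary keys and the use of a single "overall" indicator for comparison are hypothetical choices for this illustration; a real device would also keep the encoded bitstream belonging to the stored trial.

```python
class BestTrialStore:
    """Keeps the trial whose overall quality is closest to the target."""

    def __init__(self, target_quality):
        self.target_quality = target_quality
        self._best = None  # (deviation, control parameters, quality indicators)

    def offer(self, control_parameters, quality_indicators):
        """Record a trial and keep it only if it beats the stored one."""
        deviation = abs(quality_indicators["overall"] - self.target_quality)
        if self._best is None or deviation < self._best[0]:
            self._best = (deviation, dict(control_parameters),
                          dict(quality_indicators))

    def result(self):
        """Return (control parameters, quality indicators) of the best trial."""
        if self._best is None:
            raise ValueError("no trial recorded yet")
        return self._best[1], self._best[2]


store = BestTrialStore(target_quality=80.0)
store.offer({"crf": 30}, {"overall": 70.0})
store.offer({"crf": 22}, {"overall": 79.6})
store.offer({"crf": 20}, {"overall": 81.2})  # last trial is not the best one
best_parameters, best_indicators = store.result()
print(best_parameters, best_indicators)
```

In this example the last iteration overshoots the target, so the stored second trial is returned instead of the final one.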

In another preferred embodiment, the arrangement from Figure 1 is used, e.g., by a streaming service in order to efficiently encode the different quality renditions required for an adaptive streaming service based on, e.g., DASH or HLS. In this case, the quality indicators 109 for the resulting best encoding are an additional output of the device 100 and are stored together with the encoded videos and sent to the player upon request. This sending is best implemented by including the indicators in a media description file. The knowledge of the quality of video sections from individual quality renditions may help the client to better control its quality switching behavior under varying network conditions. The same embodiment can be used in order to monitor the video quality for the sequence of quality renditions seen by the end user. This can be accomplished since the player already knows the quality of individual sections of the video. The quality indicators for these individual sections can be suitably aggregated and reported back to an analytics system, the end user or both.

Fig. 2 shows an apparatus for providing a set of encoded versions of a media signal, wherein, again, a video is used as an example. As to the reference signs, those of Fig. 1 have been reused, but increased by 100. Accordingly, modules having mutually corresponding reference signs in Figs. 1 and 2 either coincide in functionality or correspond to each other in functionality, with the functionality in Fig. 2 being adapted to the provision of multiple encoded versions as outlined in more detail below.

From the structural point of view, the apparatus of Fig. 2 comprises one output 211_j per encoded version j to be provided, indicated by 207_j in Fig. 2. The apparatus of Fig. 2 is shown to comprise one instantiation of an encoder 202_j for each encoded version 207_j to be provided at the respective output 211_j, but it should be clear that, from the structural point of view, one encoder 202 could suffice in order to perform the trials related to a certain encoded version to be provided, j, sequentially. As depicted in Fig. 2, a control parameter set 220_j is separately used to control a respective encoder 202_j. The encoded versions generated by encoders 202_j are all fed into the quality characterization unit 203 which, in turn, generates a quality indicator set 209_j for each of these encoded versions.

Thus, all these blocks are essentially the same as their equivalents in Figure 1. The big difference is that, this time, multiple compressed versions 207 and their associated optional Final Quality Indicators 208 are generated for one single source video. The difference between the different compressed videos is the different target for the optimization, which could be, for example, a different perceptual target quality or a different bitrate range or resolution or combinations thereof. The codec parameter optimization can be implemented for all videos simultaneously in each iteration (parallel operation), or first for all iterations for the first video, then all iterations for the next video and so on (sequential operation), until all videos are compressed with their optimal parameters. In any case, the knowledge on the video and coding characteristics is shared for the different compressed versions. This allows for a potentially faster convergence to the optimum solution. This shared knowledge can also be used to constrain the variation of the codec control parameters in a way that one or several of the control parameters result in the same value for all encoded versions (e.g. GOP size), if this is required and results in an overall advantage, even if this may be suboptimal for some individual encoded versions.

The above description should have rendered clear how the modules shown in Figs. 1 and 2 act together in order to perform the iterative determination which finally yields the wanted encoded version, encoded using the finally chosen coding parameter set, which is associated with the quality indicator set 108/208, respectively. In the following, with respect to Fig. 3, this functionality is described again, this time in the form of a flow diagram, and it should be clear that the mode of operation explained with respect to Fig. 3 may be performed not only by apparatuses corresponding to Figs. 1 and 2. Rather, another structure may alternatively be used.

To make the following description easier to follow, let us assume that the number of wanted encoded versions of the media signal s is N. N might be 1 or any number greater than 1. In accordance with an embodiment, N is equal to or greater than 4. For each wanted encoded version having index j = 1 ... N, the iterative determination process is performed which involves, per iteration, an encoding of the media signal s using a respective control parameter set CP(i, j), where i indexes the iterations done for wanted encoded version index j. The encoding of the media signal using CP(i, j) leads to an encoded version of the media signal, let us say ev(j), and the latter suffers from some coding distortion which is measured by the quality indicator set QI(i, j).

Having said this, Fig. 3 shows the mode of operation of a device for providing an encoded version of a media signal in accordance with an embodiment, such as that of Fig. 1 or 2, in more detail. First, at 301, an encoding of the media signal is performed using CP(i, j). It is recalled that, for the first iteration, a default control parameter set CP(1, j) might be used, or such first-iteration set is determined based on a priori knowledge such as the aforementioned scene descriptors or, as described later on, on the basis of trial histories obtained from encodings and their measurements obtained in any preceding iterations relating to wanted encoded versions other than the current one, i.e., ones with different j, and/or relating to other media signals, namely previous scene sections for instance.

Then, at 302, a perceptual distortion measurement is performed on the outcome of the encoding at 301, thereby obtaining QI(i, j). Possible details have been set out above. Then, at 303, it is determined whether a certain abort criterion is fulfilled or not. The abort criterion check 303 might be performed by the Control Logic 106/206 in Figs. 1 and 2, for instance. Examples for such an abort criterion have also already been mentioned above. For instance, it might be checked whether a certain maximum number of iterations has been reached. Alternatively, the abort criterion 303 may involve checking whether QI(i, j) is within an acceptable distance from the perceptual-distortion dependent optimization target for wanted encoded version j. The check whether some QI(i, j) "is within" an acceptable distance from the perceptual-distortion dependent optimization target may involve determining a difference between QI(i, j), or some value determined therefrom, and a value associated with the perceptual-distortion dependent optimization target, and checking whether the difference is smaller than a predetermined threshold or not. If smaller, the abort criterion 303 might be fulfilled. Another abort criterion checked in 303 might be to check whether |QI(i, j) - QI(i-1, j)| is smaller than a predetermined threshold, i.e., whether the perceptual distortion change from the last iteration i-1 to the current iteration i is smaller than a predetermined threshold, suggesting that a good candidate for the final control parameter set has been found. The just-discussed abort criteria form alternatives to each other or may be used in combination.
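For illustration only, the abort criterion check 303 might be sketched as follows. The sketch assumes that the quality indicator set QI(i, j) has been reduced to a single scalar value; the function name `abort_criterion_met` and all default thresholds are hypothetical and are not taken from the description. The three criteria are combined here by a logical OR, which is merely one of the combinations mentioned above.

```python
from typing import Optional


def abort_criterion_met(
    iteration: int,
    qi_current: float,
    qi_previous: Optional[float],
    target: float,
    max_iterations: int = 8,         # hypothetical default
    target_tolerance: float = 0.1,   # hypothetical default
    change_tolerance: float = 0.02,  # hypothetical default
) -> bool:
    """Return True if any of the three abort criteria discussed for check 303 holds."""
    # Criterion 1: a maximum number of iterations has been reached.
    if iteration >= max_iterations:
        return True
    # Criterion 2: QI(i, j) is within an acceptable distance from the target.
    if abs(qi_current - target) < target_tolerance:
        return True
    # Criterion 3: |QI(i, j) - QI(i-1, j)| is below a threshold
    # (not applicable in the first iteration, where no predecessor exists).
    if qi_previous is not None and abs(qi_current - qi_previous) < change_tolerance:
        return True
    return False
```

In the first iteration, `qi_previous` would be passed as `None`, so that only the first two criteria can trigger.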

If the abort criterion is not fulfilled, then a successor control parameter set CP(i + 1, j) is determined at 304 and the process proceeds with the next iteration i + 1, so that the newly derived control parameter set is then used in the encoding step 301. If, however, the abort criterion is fulfilled, the encoded version ev(j) is finally provided at 305 so that, at least as far as wanted encoded version j is concerned, all is done and the task is finished.
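The loop of Fig. 3 might be sketched as follows. This is a minimal sketch under stated assumptions: the control parameter set is reduced to one scalar, the quality indicator set to one scalar, and the encoder, the distortion measurement and the successor derivation are passed in as callables. The names `iterative_determination`, `toy_encode`, `toy_measure` and `toy_successor` are hypothetical; the toy functions merely stand in for a real codec, a real perceptual measurement and a real parameter optimization module.

```python
from typing import Callable, List, Tuple

# One trial: (control parameter, encoded version, quality indicator).
Trial = Tuple[float, bytes, float]


def iterative_determination(
    signal: bytes,
    cp_start: float,
    target: float,
    encode: Callable[[bytes, float], bytes],
    measure: Callable[[bytes, bytes], float],
    derive_successor: Callable[[float, float, float], float],
    max_iterations: int = 10,   # hypothetical default
    tolerance: float = 0.05,    # hypothetical default
) -> List[Trial]:
    """Run steps 301 to 304 until an abort criterion holds; return all trials."""
    trials: List[Trial] = []
    cp = cp_start
    for _ in range(max_iterations):            # bound: maximum number of iterations
        encoded = encode(signal, cp)           # step 301: encode using CP(i, j)
        qi = measure(signal, encoded)          # step 302: obtain QI(i, j)
        trials.append((cp, encoded, qi))
        if abs(qi - target) < tolerance:       # step 303: close enough to the target
            break
        cp = derive_successor(cp, qi, target)  # step 304: derive CP(i + 1, j)
    return trials                              # step 305 picks from these trials


# Toy stand-ins (assumptions for demonstration only).
def toy_encode(signal: bytes, cp: float) -> bytes:
    return signal[: max(1, int(cp))]           # "compress" by truncation


def toy_measure(signal: bytes, encoded: bytes) -> float:
    return 5.0 * len(encoded) / len(signal)    # pseudo quality on a 0..5 scale


def toy_successor(cp: float, qi: float, target: float) -> float:
    return cp * target / qi                    # proportional correction


trials = iterative_determination(
    bytes(100), 20.0, 4.0, toy_encode, toy_measure, toy_successor
)
```

Because every trial is kept, the providing step 305 can output either the last trial or an earlier one without any additional encoding pass.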

When providing the encoded version ev(j), i.e., the wanted encoded version associated with the j-th perceptual-distortion dependent optimization target, generally two possibilities have been discussed above with respect to Figs. 1 and 2: here, simply e(i, j) may be output as the wanted encoded version ev(j), i.e., the encoded version most recently generated at step 301 using CP(i, j) in the last iteration i, or e(i_best, j) may be appointed the finally chosen encoded version ev(j), with i_best being the iteration for which QI(i_best, j) is nearest to the perceptual-distortion dependent quality target for encoded version j, with the minimum distance being determined, for instance, among all previous iterations, i.e., 1, 2, ... i, or merely a fraction thereof such as, for instance, i - m, i - m + 1, ... i, with m being some integer such as 1, 2 or some other advantageous value. The task of the actual providing 305 is, for instance, performed by the above-identified Control Logic 106 or 206, respectively.
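The selection of i_best might, as a minimal sketch, look as follows. It assumes scalar quality indicators and the absolute difference as distance measure; the function name `select_best_iteration` is hypothetical.

```python
from typing import List, Optional


def select_best_iteration(
    qis: List[float], target: float, m: Optional[int] = None
) -> int:
    """Return the 1-based iteration index i_best whose QI is nearest the target.

    qis[k] holds QI(k + 1, j). If m is given, only the iterations
    i - m, ..., i are considered, where i is the last iteration.
    """
    if not qis:
        raise ValueError("at least one iteration is required")
    i = len(qis)
    first = 1 if m is None else max(1, i - m)
    # min() returns the earliest candidate in case of equal distances.
    return min(range(first, i + 1), key=lambda k: abs(qis[k - 1] - target))


history = [3.0, 4.4, 4.9, 4.7]   # illustrative QI values, not measured data
best_overall = select_best_iteration(history, target=4.5)
best_recent = select_best_iteration(history, target=4.5, m=1)
```

With m given, an earlier iteration that happened to be closer to the target is deliberately ignored in favour of the most recent trials.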

In the derivation 304, the successor control parameter set CP(i + 1, j) is determined, at least, on the basis of CP(i, j) and QI(i, j) as well as the perceptual-distortion dependent optimization target. The derivation 304 may, for instance, be performed using an artificial neural network or a decision tree. The artificial neural network may be recursive in that it feeds back its output, or the output of some intermediate layer, to a respective feedback input so that this feedback is taken into account when the network is fed with CP(i + 1, j) and QI(i + 1, j) in the next iteration for the sake of deriving CP(i + 2, j). In this manner, the derivation in step 304 could depend on CP and QI of not only the current iteration i, but also of previous iterations i - 1, i - 2, and so forth. Another possibility could be to use different artificial neural networks at different iterations. For instance, at the first iteration, a neural network could be used which uses as an input, besides OT(j), i.e., the optimization target for the j-th encoded version, CP(1, j) and QI(1, j), i.e., merely the current pair of control parameter set and quality index set. For the next iteration, an artificial neural network could be used which uses, as an input, besides OT(j), CP(1, j), QI(1, j), CP(2, j) and QI(2, j), i.e., the control parameter sets and quality index sets of all previous iterations, here merely one, and of the current iteration i = 2. It is easy to see what the artificial neural network for the third iteration i = 3 could look like. It could be the same as the one for the second iteration and be fed with, in addition to OT(j), CP(2, j), QI(2, j), CP(3, j) and QI(3, j). Alternatively, it could be formed to use all previous control parameter and quality index sets in addition to the current ones.

The derivations of the successor control parameter set described so far merely relied on control parameter and quality index sets obtained from trials performed with respect to the same, i.e., current, encoded version j. If there is merely one encoded version to be provided, i.e., N = 1, then there is no alternative. However, if N > 1, then different possibilities exist. For instance, the iterative determination of EV(j) may be done for each encoded version 0 < j < N + 1 serially, i.e., the iterations for j prior to the iterations for j + 1, or in parallel, such as in a manner so that all trials of iteration i for all 0 < j < N + 1 are available for the successor control parameter set determination for all 0 < j < N + 1. In the parallel case, i.e., the latter case, many trials, i.e., pairs of control parameter and quality index sets, namely N in number, are available for the successor control parameter set derivation with respect to each encoded version j. Accordingly, for each encoded version j, in the first iteration i = 1, the successor control parameter set derivation in step 304 may use a neural network which processes as an input, in addition to OT(j), all pairs CP(1, j) and QI(1, j) for all j with 0 < j < N + 1. For the second iteration, the same neural network may be used and fed with, in addition to OT(j), CP(2, j) and QI(2, j) for all j, or a larger artificial neural network may be used which is additionally fed with all CP(1, j) and QI(1, j) for all j.
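How the input of such a network might be assembled is sketched below. The sketch only builds the flat input vector; the network itself is omitted. The function name `build_network_input`, the ordering of the entries and the example values are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# One trial of one encoded version: (CP(i, j), QI(i, j)), both as flat lists.
Trial = Tuple[List[float], List[float]]


def build_network_input(ot_j: float, trials_by_version: Dict[int, Trial]) -> List[float]:
    """Concatenate OT(j) with the CP/QI pairs of the given encoded versions.

    Serial case: pass only the entry for the current version j.
    Parallel case: pass the entries of iteration i for all versions 1..N.
    """
    features: List[float] = [ot_j]
    for j in sorted(trials_by_version):   # fixed order keeps the input layout stable
        cp, qi = trials_by_version[j]
        features.extend(cp)
        features.extend(qi)
    return features


# Illustrative values only: CP = [bitrate in kbit/s, frame rate], QI = [one score].
iteration_1 = {
    1: ([1000.0, 30.0], [3.2]),
    2: ([3000.0, 30.0], [4.1]),
}
parallel_input = build_network_input(4.5, iteration_1)
serial_input = build_network_input(4.5, {2: iteration_1[2]})
```

A larger network that also sees earlier iterations would simply be fed with the concatenation of several such vectors.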

The providing 305 of the encoded version j may be done in different manners as already outlined above, namely by simply providing the encoded version of the last iteration for encoded version j, or the encoding resulting from any other iteration for encoded version j using a control parameter set CP(i, j) for which QI(i, j) results in the lowest distance to the optimization target OT(j), measured using any appropriate distance measure. The provision 305 may additionally involve providing QI(i_prov, j), which measures the perceptual distortion of the encoded version EV(j) obtained by encoding using CP(i_prov, j). As indicated above, this QI(i_prov, j) may be entered in an MPD so that EV(j) may be offered at a server for download by a client which is provided, for adaptively controlling the download, with the MPD. In this manner, the client is able to control the download while also considering the quality index information. As a minor note, according to the above examples, i_prov may be i_best or may be i as manifesting itself at the last iteration.

As mentioned before, in case of N > 1, the iterative determination of the respective encoded version EV(j) may be done serially. In that case, the fact that i_prov turns out to be associated with the optimum control parameter set CP(i_prov, j) with respect to optimization target OT(j) may be exploited by adapting, improving or optimizing the weights of the aforementioned artificial neural networks. That is, the determination of i_prov for one encoded version j may be used for learning so as to perform the successor control parameter set derivation 304 with respect to the next encoded version j + 1 on the basis of improved or learned artificial neural networks. The learning, however, may also be done consecutively in time: as mentioned above, it might be that a greater media material has been subdivided into segments, such as video scenes, and that the iterative determination as depicted in Fig. 3 is done not only for the one or more encoded versions j for the current scene k, but also for a subsequent scene k + 1 and so forth. Accordingly, it might be that the learning takes place from one scene k to the next scene k + 1. Accordingly, even if the number of encoded versions N were 1, the just-mentioned learning, which could take place as one part of step 305, could be used to adjust the weights of the artificial neural network before it is used in the iterative determination of Fig. 3 with respect to the next segment or scene k + 1. This sort of learning reduces the number of necessary iterations even more and takes advantage of the fact that some segments or scenes in a greater media material, such as a video, repeat intermittently, so that such learning improves the guesses in the successor control parameter set derivation 304 for future performances of this derivation for segments and scenes to come.

Fig. 4 shows, for the sake of completeness, a client 300 which streams media content from a server 302. In order to control this streaming, such as adaptive streaming, the server 302 offers to the client 300 a media description 304 for controlling the download of the media content. The server 302 has access to the media content in the form of several representations 306, namely N representations. As mentioned before, N may be one or may alternatively be greater than 1, such as greater than 4. The representations are, for instance, stored in a form sub-divided into fragments which sub-divide the media content along time. At such fragment boundaries, the client 300 may be able to switch between different representations 306, which may differ in bitrate and quality. In fact, the N representations 306 are generated as described above, i.e., each of these representations 306 may be composed of one or more segments or scenes for which the N encoded versions have been generated as described above with respect to Figs. 1 to 3 in order to form the representations 306. The media description 304 provides information on each representation including, for instance, bitrate, resolution and quality index indications. The quality index information is provided in the above-outlined manner, namely by deriving information from the QIs provided in step 304 for each encoded version and segment/scene, respectively, and entering same into the media description 304. For instance, a mean value may be entered in the media description 304 for each representation. The client is, thus, able to use the bitrate and quality index information in order to control the adaptive streaming download.
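A client-side use of the bitrate and quality index information might, purely as a sketch, look as follows. The `Representation` structure is a hypothetical in-memory stand-in for the entries of the media description 304 and does not reproduce the syntax of any real MPD or playlist format; the selection rule (highest mean quality among the representations fitting the measured throughput) is likewise an assumption, as the description leaves the client strategy open.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Representation:
    rep_id: str
    bitrate_kbps: int
    segment_qi: List[float]   # one quality index per segment/scene

    @property
    def mean_qi(self) -> float:
        # Mean value as it might be entered per representation.
        return mean(self.segment_qi)


def choose_representation(
    reps: List[Representation], throughput_kbps: float
) -> Representation:
    """Pick the affordable representation with the highest mean quality index."""
    affordable = [r for r in reps if r.bitrate_kbps <= throughput_kbps]
    if not affordable:
        # Nothing fits: fall back to the lowest bitrate on offer.
        return min(reps, key=lambda r: r.bitrate_kbps)
    return max(affordable, key=lambda r: r.mean_qi)


# Illustrative values only, not measured data.
reps = [
    Representation("low", 800, [3.0, 3.2]),
    Representation("mid", 2000, [4.0, 4.2]),
    Representation("high", 5000, [4.6, 4.8]),
]
chosen = choose_representation(reps, throughput_kbps=2500)
fallback = choose_representation(reps, throughput_kbps=500)
```

Using per-segment values instead of the mean would allow the client to decide at each fragment boundary whether a switch is worth its bitrate cost.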

It should be clear that the individual successor control parameter set derivations 304, and the determination of the control parameter set for the first iteration for a respective wanted encoded version, might be subject to constraints which might differ among different wanted versions and which might even depend on the trials of the iterative determination done for other wanted encoded versions in order to avoid, for instance, that the finally selected control parameter sets for different wanted encoded versions become too close with respect to certain values such as bitrate or the like. For instance, the constraints may restrict the derivation/determination, for one or more control parameters such as bitrate, to continuous intervals excluding control parameters tested in any iteration for any other encoded version.
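One possible reading of such an interval constraint is sketched below for the bitrate. The sketch assumes that the allowed interval of the current encoded version is the continuous range between the nearest bitrates tested for other encoded versions below and above an anchor value (e.g. the current bitrate). The function names, the anchor concept and the safety margin are hypothetical and only one of several conceivable interpretations.

```python
from typing import List, Tuple


def allowed_interval(
    anchor: float, tested_by_others: List[float], lo: float, hi: float
) -> Tuple[float, float]:
    """Return the continuous interval around anchor free of other versions' trials."""
    lower = max((b for b in tested_by_others if b < anchor), default=lo)
    upper = min((b for b in tested_by_others if b > anchor), default=hi)
    return lower, upper


def constrain_bitrate(
    candidate: float,
    anchor: float,
    tested_by_others: List[float],
    lo: float,
    hi: float,
    margin: float = 50.0,   # hypothetical minimum spacing to the interval borders
) -> float:
    """Clamp a candidate successor bitrate into the allowed interval."""
    lower, upper = allowed_interval(anchor, tested_by_others, lo, hi)
    return min(upper - margin, max(lower + margin, candidate))


# Illustrative values in kbit/s.
others = [800.0, 1500.0, 3000.0, 5000.0]
interval = allowed_interval(2000.0, others, 100.0, 10000.0)
clamped_high = constrain_bitrate(3400.0, 2000.0, others, 100.0, 10000.0)
clamped_low = constrain_bitrate(1200.0, 2000.0, others, 100.0, 10000.0)
unchanged = constrain_bitrate(2500.0, 2000.0, others, 100.0, 10000.0)
```

The sketch does not handle the corner case of an interval narrower than twice the margin; a real implementation would have to decide how to treat it.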

Advantageously, the iterative determination outlined above with respect to Figs. 1 to 3 does not necessitate any additional step for providing the QI values and the encoded version, respectively, finally chosen. Rather, the iterative determination is done in a manner so that one of the trials, i.e., one of the encodings in step 301, is later on adopted and used for forming a part of the respective representation 306, and the same applies to the perceptual distortion measurement 302, which is finally used for providing the respective information and entering same into the media description 304.

Accordingly, the above embodiments allow for QoE (Quality of Experience) controlled encoding such as perceptually controlled video compression. Thus, the above embodiments reveal a concept or device to optimize coding efficiency, consisting of a video codec 102, a video quality characterization unit 103 and a parameter optimization module 104, where the device iteratively optimizes the video quality by using the parameter optimization module 104, which in each iteration derives a new set of optimized control parameters 110 for the video codec based on a set of quality indicators 109 which characterize the perceptual distortions caused by a previous encoding pass with a different set of control parameters, and where said quality indicators are determined by the video quality characterization unit. An AI module may be used as parameter optimization module. Alternatively, a machine learned classification module may be used as parameter optimization module, or a numerical method may be used as parameter optimization module. An FR (full-reference) method or an NR (no-reference) method may be used to determine quality indicators. Final quality indicators 108 may be stored, such as in an MPD. A reporting of final quality indicators to the client may be performed so that the client may use them to optimize its switching behavior. The client may report the quality indicators for the actually rendered frames back to the server. Thus, reporting of final quality indicators to the client may be done for the sake of further aggregation by the client and reporting of the aggregated value to a server or presenting it in a suitable form. As explained above, the optimization loop of Fig. 3 may be done for multiple optimization targets in parallel or serially.
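As one conceivable example of a numerical method used as parameter optimization module, a secant step over the two most recent trials is sketched below. The description does not prescribe any particular numerical method; the secant rule, the function name `secant_successor`, the fallback step factors and the assumption that quality grows with the control parameter (e.g. bitrate) are all choices made here for illustration.

```python
def secant_successor(
    cp_prev: float,
    qi_prev: float,
    cp_cur: float,
    qi_cur: float,
    target: float,
    cp_min: float,
    cp_max: float,
) -> float:
    """Derive CP(i + 1, j) from the trials of iterations i - 1 and i."""
    if cp_cur == cp_prev or qi_cur == qi_prev:
        # No usable slope: take a fixed multiplicative step towards the target,
        # assuming that a larger control parameter yields a higher quality.
        cp_next = cp_cur * (1.25 if qi_cur < target else 0.8)
    else:
        slope = (qi_cur - qi_prev) / (cp_cur - cp_prev)
        cp_next = cp_cur + (target - qi_cur) / slope
    # Keep the result inside the supported parameter range.
    return min(cp_max, max(cp_min, cp_next))


# Illustrative values: bitrate in kbit/s, quality on an arbitrary scale.
step = secant_successor(1000.0, 1.0, 2000.0, 2.0, 3.5, 100.0, 10000.0)
clamped = secant_successor(1000.0, 1.0, 2000.0, 2.0, 3.5, 100.0, 3000.0)
flat = secant_successor(1000.0, 2.0, 2000.0, 2.0, 3.5, 100.0, 10000.0)
```

Such a step needs two trials, so the first successor would have to come from a default rule or from a priori knowledge.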

Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.

Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.

Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.

Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.

Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.

In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.

A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory. A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may, for example, be configured to be transferred via a data communication connection, for example via the Internet.

A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.

A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.

A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.

In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are preferably performed by any hardware apparatus.

The apparatus described herein may be implemented using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.

The apparatus described herein, or any components of the apparatus described herein, may be implemented at least partially in hardware and/or in software.

The methods described herein may be performed using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.

The methods described herein, or any components of the apparatus described herein, may be performed at least partially by hardware and/or by software.

The above described embodiments are merely illustrative for the principles of the present invention. It is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the impending patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.

Claims

1. Device for providing an encoded version of a media signal (99), the device supporting a plurality of sets of one or more control parameters, and the device being configured to iteratively determine out of the plurality a predetermined set of one or more control parameters using which the media signal is encoded into the encoded version, wherein the device is configured to, in a current iteration (i), encode the media signal using a current set (CP(i,j)) of one or more control parameters to obtain a current encoded version of the media signal; measure a perceptual distortion associated with the current encoded version to obtain a set (QI(i,j)) of one or more quality indicators; and depending on an abort criterion, derive a successor set (CP(i+1,j)) of one or more control parameters out of the plurality of sets of one or more control parameters depending on the current set of one or more control parameters and the set of one or more quality indicators in order to come closer to a perceptual-distortion dependent optimization target, and commence a subsequent iteration using the successor set of one or more control parameters for the encoding of the media signal, or provide, as the encoded version of the media signal, the current encoded version or a previously obtained encoded version of the media signal obtained in any previous iteration for which a set of one or more quality indicators is obtained which is closer to the perceptual-distortion dependent optimization target than the set of one or more quality indicators obtained for the current encoded version.

2. Device of claim 1, wherein the sets of one or more control parameters relate to one or more of bitrate, resolution, frame rate, GOP length, GOP structure, compression speed, quality targets, prediction tool selection, slicing, deblocking filter settings, quantization parameters and motion estimation complexity.

3. Device of claim 1 or 2, wherein the abort criterion is such that commencing the subsequent iteration is refrained from if a maximum number of iterations is reached.

4. Device of any of claims 1 to 3, wherein the abort criterion is such that commencing the subsequent iteration is refrained from if one or more of the indicators of the set of one or more quality indicators is within an acceptable distance from the perceptual-distortion dependent optimization target.

5. Device of any of claims 1 to 4, wherein the abort criterion is such that commencing the subsequent iteration is refrained from if one or more of the indicators of the set of one or more quality indicators obtained in the current iteration and an immediately preceding iteration deviate less than a predetermined threshold.

6. Device of any of claims 1 to 4, configured to derive the successor set of one or more control parameters by feeding the current set of one or more control parameters and the set of one or more quality indicators into an artificial neural network or a decision tree so as to obtain the successor set of one or more control parameters at an output of the artificial neural network or a decision tree.

7. Device of claim 6, configured to use a recursive artificial neural network, so that the successor set of one or more control parameters also depends on sets of one or more control parameters and sets of one or more quality indicators of previous iterations.

8. Device of claim 6 or 7, wherein the artificial neural network supports continuous learning based on previous optimization results, so that the successor set of one or more control parameters also depends on sets of one or more control parameters and sets of one or more quality indicators of previous iterations for the same or previously encoded media.

9. Device of any of claims 1 to 8, configured to derive the successor set of one or more control parameters depending on the current set of one or more control parameters, the set of one or more quality indicators obtained in the current iteration and one or more sets of one or more control parameters and one or more sets of one or more quality indicators which are obtained in any previous iteration.

10. Device of any of claims 1 to 9, configured to derive the successor set of one or more control parameters using numerical methods, machine learning or artificial intelligence.

11. Device of any of claims 1 to 10, configured to perform media segmentation of a media material into media segments, and perform, for each of the media segments, the providing of the encoded version of the media signal by using the respective media segment as the media signal.

12. Device of any of claims 1 to 10, wherein the media material is a video and the device is configured to segment the video into scenes and perform, for each of the scenes, the providing of the encoded version of the media signal by using the respective scene as the media signal.

13. Device of claim 12, configured to perform the segmentation on the basis of video content descriptors derived from the video and use one or more of the video content descriptors for determining a starting set of one or more control parameters for encoding the media signal in a first iteration.

14. Device of any of claims 1 to 13, configured to measure the perceptual distortion associated with the current encoded version using a full-reference or no-reference quality measurement.

15. Device of any of claims 1 to 14, configured to, in a current iteration, store the set of one or more quality indicators for availability in the subsequent iteration or the subsequent iteration and further iterations to come.

16. Device of any of claims 1 to 15, configured to, in performing the providing as the encoded version of the media signal, select among the sets of one or more quality indicators of the current and previous iterations, the one closest to the perceptual-distortion dependent optimization target and provide the encoded version for which the selected set of one or more quality indicators is obtained as the encoded version of the media signal.

17. Device of any of claims 1 to 15, configured to, in performing the providing as the encoded version of the media signal, provide the current encoded version as the encoded version of the media signal.

18. Device of any of claims 1 to 17, configured to output the set of one or more quality indicators associated with the provided encoded version.

19. Device of any of claims 1 to 18, configured to enter the set of one or more quality indicators associated with the provided encoded version or information obtained therefrom, into an adaptive streaming protocol media description.

20. Device of any of claims 1 to 19, configured to provide a set of encoded versions with performing the iterative determination for each individual version of the set of encoded versions using different perceptual-distortion dependent optimization targets for different individual versions of the set of encoded versions.

21. Device of any of claims 1 to 20, configured to provide a set of encoded versions with performing the iterative determination for each of the set of encoded versions using different perceptual-distortion dependent optimization targets and using different constraints for the set of one or more control parameters for different individual versions of the set of encoded versions.

22. Device of claim 21, wherein constraints for the set of one or more control parameters for one encoded version depend on sets of one or more control parameters used for one or more other encoded versions of said set of encoded versions.

23. Device of any of claims 20 to 22, configured to perform the iterative determination for each of the individual versions of the set of encoded versions sequentially for the set of encoded versions, wherein the device is configured to determine a starting set of one or more control parameters for the encoding of the media signal in a first iteration for the providing of a current encoded version depending on one or more sets of one or more control parameters and one or more sets of one or more quality indicators obtained during one or more iterations for providing any preceding encoded version.

24. Device of any of claims 20 to 22, configured to perform the iterative determination for each of the set of encoded versions in parallel for the set of encoded versions.

25. Client for adaptive streaming of media content from a server which offers the media content in different bitrate representations, each comprising an encoded version as provided by the device of claim 19, the client configured to use the set of one or more quality indicators or the information derived therefrom as entered into the media description for control of the adaptive streaming in terms of switching between the representations.

26. Method for providing an encoded version of a media signal (99), the device supporting a plurality of sets of one or more control parameters, and the device being configured to iteratively determine out of the plurality a predetermined set of one or more control parameters using which the media signal is encoded into the encoded version, wherein the method comprises, in a current iteration (i), encoding the media signal using a current set (CP(i,j)) of one or more control parameters to obtain a current encoded version of the media signal; measuring a perceptual distortion associated with the current encoded version to obtain a set (QI(i,j)) of one or more quality indicators; and depending on an abort criterion, deriving a successor set (CP(i+1,j)) of one or more control parameters out of the plurality of sets of one or more control parameters depending on the current set of one or more control parameters and the set of one or more quality indicators in order to come closer to a perceptual-distortion dependent optimization target, and commencing a subsequent iteration using the successor set of one or more control parameters for the encoding of the media signal, or providing, as the encoded version of the media signal, the current encoded version or a previously obtained encoded version of the media signal obtained in any previous iteration for which a set of one or more quality indicators is obtained which is closer to the perceptual-distortion dependent optimization target than the set of one or more quality indicators obtained for the current encoded version.

27. Computer program having a program code for performing, when running on a computer, a method according to claim 26.