EP3777194A1 - Method, hardware device and software program for post-processing of transcoded digital signal - Google Patents

Method, hardware device and software program for post-processing of transcoded digital signal

Info

Publication number
EP3777194A1
EP3777194A1 (application EP18716589.9A)
Authority
EP
European Patent Office
Prior art keywords
transcoded
signal
representation
neural network
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP18716589.9A
Other languages
German (de)
French (fr)
Inventor
Ziyue ZHAO
Tim Fingscheidt
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Technische Universitaet Braunschweig
Original Assignee
Technische Universitaet Braunschweig
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Technische Universitaet Braunschweig filed Critical Technische Universitaet Braunschweig
Publication of EP3777194A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • H04N19/86 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression involving reduction of coding artifacts, e.g. of blockiness
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/117 Filters, e.g. for pre-processing or post-processing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/154 Measured or subjectively estimated visual quality after decoding, e.g. measurement of distortion
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/172 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field

Definitions

  • The invention relates to a method for post-processing of a transcoded digital signal including audio and/or video data to obtain an enhanced transcoded digital signal, whereby said transcoded signal was obtained by decoding of an encoded signal using a decoder and said encoded signal was obtained by encoding of a source signal including audio and/or video data using an encoder.
  • The invention also relates to a hardware device and a software program for executing said post-processing method.
  • The invention also relates to a method for training an artificial neural network.
  • Digital signals including audio and/or video data are often stored on a hardware device and accessed (read out) at a later point in time.
  • In other cases, digital signals including audio and/or video data are often transmitted from a first hardware device to a second hardware device.
  • Note that, without loss of generality, also the process of “storage” can be considered to be a “transmission”, which will be our terminology in the following.
  • During these transmissions, the digital signals must be transformed into a bit stream which is suitable for transmission of the data representing the digital signal over a transmission channel.
  • The transcoding process includes two steps. At first, the source signal including the audio and/or video data must be encoded into an encoded digital signal using an encoder.
  • This encoded digital signal is transmitted over the communication channel to a receiver, whereby the receiver must decode the encoded digital signal into a decoded digital signal.
  • The joint processing of encoder and decoder is sometimes abbreviated as codec.
  • The decoded digital signal is sometimes post-processed to enhance the quality of the data.
  • Such decoded digital signals are often called “transcoded” signals or simply “coded” signals.
  • Transcoded digital signals often suffer from far-end background noise, quantization noise, and potentially transmission errors.
  • To enhance the quality of these transcoded signals, post-processing methods operating just after decoding can be advantageously employed. Due to the transmission bandwidth (or storage) limitation, transcoding typically performs so-called lossy compression to achieve a relatively low bit rate during transmission, while still preserving a reasonable audio and/or video quality at the same time. As a result, however, the reconstructed audio and/or video signal is degraded in quality due to quantization errors during the lossy compression process.
  • A Wiener filter is derived by the estimation of the a priori signal-to-noise ratio (SNR) based on a two-step noise reduction approach (C. Plapous et al., “A two-step noise reduction technique,” in Proc. of ICASSP, Montreal, QC, Canada, May 2004, pp. I-289–292).
  • After the filtering process, a limitation of distortions is performed to control the waveform difference between the original signal and the post-processed coded signal.
  • Note that the Wiener filter only minimizes the mean squared error (MSE), but not perceptual distortion.
  • A method for post-processing of at least one transcoded digital signal including audio and/or video data to obtain at least one enhanced transcoded digital signal is proposed.
  • Audio data are typically data which include audible information like music, speech, sounds, or other noises. This audible information is coded into the digital signal as audio data.
  • Video data are data which include “moving pictures”. Video data can include audio data.
  • The transcoded digital signal which shall be processed by a post-processor was obtained by decoding of an encoded digital signal using a decoder.
  • In most cases, the decoded digital signal obtained by decoding of the encoded signal is the transcoded digital signal.
  • It is possible that a post-processing method well known from the state of the art was applied to the decoded digital signal in a previous step to enhance the quality of the transcoded digital signal.
  • Said encoded digital signal, furthermore, was obtained by encoding of a source signal using an encoder, whereby the source signal, advantageously, includes the raw data of the audio and/or video data.
  • According to the invention, the post-processing method uses a post-processor, whereby the post-processor can be a computer or any other electronic data processing unit.
  • The basic idea of the present invention is to use an artificial neural network to enhance the transcoded digital signal without modifying the decoder on the receiver side or the encoder on the transmitter side.
  • The artificial neural network has been trained to learn a mapping from parts of the transcoded signal to parts of the source signal, so that, based on the transcoded signal and using the trained artificial neural network, the source signal can be reconstructed or at least approximated with high quality.
  • In a first step, a plurality of transcoded signal frames is provided, whereby said transcoded signal frames were generated by separating one of said transcoded digital signals.
  • In an embodiment, the first step of providing said plurality of transcoded signal frames comprises the step of separating one of said transcoded digital signals into said plurality of transcoded signal frames.
  • The first step of providing said plurality of transcoded signal frames can furthermore comprise the step of building the plurality of transcoded signal frames from a plurality of transcoded digital signal segments provided by the decoder, whereby each transcoded digital signal segment can be regarded as a transcoded digital signal derived from a superior transcoded digital signal.
  • A transcoded signal frame in the meaning of the present invention is a part of a transcoded digital signal.
  • Furthermore, a transcoded digital signal can be a segment of a superior transcoded digital signal, which was segmented into a plurality of transcoded digital signals, often called transcoded digital signal segments.
  • The transcoded signal frames can be overlapping in time or non-overlapping. If a window function is used, the length of the transcoded signal frame is equal to the length of the window (see the framing sketch below).
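  • A minimal framing sketch in Python (function and parameter names are illustrative assumptions, not taken from the patent text), covering both the non-overlapping case and the overlapping, windowed case:

```python
import numpy as np

def make_frames(s_hat, frame_len, frame_shift, window=None):
    """Split the transcoded signal s_hat into frames of length frame_len.

    frame_shift == frame_len yields non-overlapping frames; a smaller
    shift yields overlapping frames. If a window function is given, its
    length must equal the frame length, as stated above.
    """
    assert len(s_hat) >= frame_len
    n_frames = (len(s_hat) - frame_len) // frame_shift + 1
    frames = np.stack([s_hat[l * frame_shift : l * frame_shift + frame_len]
                       for l in range(n_frames)])
    if window is not None:
        assert len(window) == frame_len
        frames = frames * window          # apply the window to every frame
    return frames

# Example: 20 ms frames with 50 % overlap at 16 kHz, Hann window
frames = make_frames(np.random.randn(16000), 320, 160, np.hanning(320))
```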
  • In the next step, called data preparation, a first representation within a processing domain is prepared for each transcoded signal frame.
  • The processing domain is a mathematical and/or physical description or specification to represent the transcoded signal frames in a mathematical and/or physical manner.
  • In the simplest form, the representation of the transcoded signal frames within a processing domain is a description of the waveform of the transcoded signal frame (so-called time domain).
  • Other processing domains are, for example, the frequency domain or the cepstral domain.
  • The first representations are designated for feeding an artificial neural network as described below.
  • In the broadest meaning of the present invention, the transcoded signal frames are provided such that at least one (or each) transcoded signal frame is provided in the first representation within a processing domain.
  • In this case, the transcoded signal frames are provided within said processing domain.
  • The data preparation step can furthermore include the step of processing each transcoded signal frame into said first representation within said processing domain.
  • Now, each first representation of the transcoded signal frames is inputted into an artificial neural network to obtain for each first representation a second representation of the respective transcoded signal frame.
  • The artificial neural network is provided such that it is trained to learn a mapping from a representation of a transcoded signal frame within said predefined processing domain to a representation of the source signal frame within said processing domain.
  • Based on the second representations obtained from the artificial neural network, an enhanced transcoded digital signal is generated by converting the second representations into the form of a digital signal including the audio and/or video data. After the generation of the enhanced transcoded digital signal, the enhanced transcoded digital signal is outputted.
  • With the proposed post-processing method of the present invention, it is possible to enhance audio and/or video data in a transcoded digital signal without modifying the encoder or decoder side.
  • Using an artificial neural network, the post-processing method can be executed in real time, for example in digital speech communication using digital speech codecs.
  • By means of the present invention, the problems of the prior-art post-processing filters can be overcome and the quality gap between the source signal data and the transcoded signal data due to the lossy compression can be reduced without increasing the transmission bitrate.
  • The loss of information caused by the lossy compression can be reduced or minimized by using the post-processing method of the present invention without modifying the encoder or decoder and without modifying the lossy compression method itself.
  • Given that in many communication systems the encoder and/or the decoder are standardized in a very specific fashion, this allows the use of the present invention in a standard-compatible manner.
  • Furthermore, the loss of information raised by the lossy compression can be reduced and/or partly healed with the artificial neural network of the present invention.
  • In embodiments, said processing domain is the time domain, the frequency domain, the cepstral domain, or the log-magnitude domain.
  • In a first preferred embodiment, the processing domain is the time domain, whereby a waveform representation for each transcoded signal frame is prepared.
  • In the broadest meaning of the present invention, each provided transcoded signal frame has a waveform representation within the time domain, so that no further processing steps for converting the transcoded frames into the waveform representation are necessary.
  • The separated frames then serve directly as input of the artificial neural network, whereby the input vector is a representation of the waveform of the transmitted digital signal frame.
  • Furthermore, it is also possible that the transcoded signal frames are processed into the waveform representation.
  • The artificial neural network is provided such that it is trained to learn a mapping from the waveform representation of the transcoded signal frame to the waveform representation of the source signal frame.
  • The enhanced transcoded signal is then generated based on the waveform representation obtained from the artificial neural network.
  • For this purpose, an overlap-add (OLA) technique can, but need not, be used.
  • In a preferred embodiment, the output of the artificial neural network has a frame structure, so that the enhanced digital signal can be generated directly from the output of the artificial neural network.
  • In this case, the second representation obtained from the artificial neural network has a frame structure.
  • Furthermore, it is also possible that the frames are reconstructed based on the waveform representation obtained from the artificial neural network and the enhanced transcoded signal is generated based on the reconstructed frames.
  • The time domain approach fits well into many contexts and is also very suitable for integration into the decoder processing, because if the time domain post-processor is embedded into the segmentation structure of the decoder, no additional algorithmic delay is incurred beyond the already provided segmentation.
  • The decoder segmentation can be used for providing the plurality of frames without any further segmentation.
  • In a further preferred embodiment, said processing domain is the frequency domain, whereby the transcoded signal frames are processed in the frequency domain by transforming each transcoded signal frame into a magnitude-phase representation or into a real and imaginary part representation by using, for example, the Fast Fourier Transformation (FFT).
  • This representation in the frequency domain (for example a spectrum vector or a part of it) is then inputted into the artificial neural network, whereby the artificial neural network is provided such that it is trained to learn a mapping from the magnitude-phase representation or from the real and imaginary part representation of a transcoded signal frame to the magnitude-phase representation or to the real and imaginary part representation of the source signal frame.
  • The enhanced transcoded signal is generated based on the magnitude-phase representation or the real and imaginary part representation obtained from the artificial neural network.
  • An overlap-add (OLA) technique or an overlap-save (OLS) technique can be used along with the inverse transformation. It is advantageous if the frames are reconstructed based on the magnitude-phase representation or the real and imaginary part representation obtained from the artificial neural network and the enhanced transcoded signal is generated based on the reconstructed frames.
  • In the frequency domain, advantageously, the magnitude spectrum is subject to a logarithm function, resulting in the so-called log-magnitude domain being used as representation domain at the input and/or output of the artificial neural network.
  • The log-magnitude representation of a source signal frame can be subject to an inverse logarithm function and appended with the phase as obtained above to obtain a magnitude-phase representation of a source signal frame.
  • In a further embodiment, said processing domain is the cepstral domain, whereby the transcoded signal frames are processed into the cepstral domain by transforming each transcoded signal frame into a cepstral coefficient representation.
  • This cepstral coefficient representation of each transcoded signal frame is, e.g., separated into two parts: the cepstral coefficient representation responsible for the spectral envelope and the residual cepstral coefficient representation.
  • The spectral envelope cepstral coefficient representation is inputted into the artificial neural network to obtain an enhanced spectral envelope cepstral coefficient representation, whereby the enhanced transcoded signal is generated based on the spectral envelope cepstral coefficient representation obtained from the artificial neural network and the residual cepstral coefficient representation. It is advantageous if the frames are reconstructed based on the spectral envelope cepstral coefficient representation obtained from the artificial neural network and the enhanced transcoded signal is generated based on the reconstructed frames.
  • The artificial neural network is provided such that it is trained to learn a mapping from the spectral envelope cepstral coefficient representation of a transcoded signal frame to the spectral envelope cepstral coefficient representation of a source signal frame.
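  • A minimal sketch of this cepstral-domain data preparation (choosing the lowest n_env coefficients as the envelope set Q_env is an assumption of this sketch; the patent only states that the coefficients are split into an envelope part and a residual part):

```python
import numpy as np
from scipy.fft import fft, dct

def cepstral_split(frame, K, n_env):
    """FFT of size K, log-magnitude, DCT-II, then split into envelope
    and residual cepstral coefficients; the phase is kept for the later
    frame reconstruction."""
    S = fft(frame, n=K)                       # complex spectrum of the frame
    phase = np.angle(S)                       # stored for reconstruction
    log_mag = np.log(np.abs(S) + 1e-12)       # log-magnitude spectrum
    c = dct(log_mag, type=2, norm='ortho')    # cepstral coefficients (DCT-II)
    return c[:n_env], c[n_env:], phase        # c_env, c_res, phase
```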
  • In a further advantageous embodiment, said artificial neural network is a convolutional neural network.
  • Advantageously, said convolutional neural network has a plurality of hidden layers, whereby the hidden layers comprise at least one convolutional layer, at least one max pooling layer, and at least one upsampling layer.
  • The convolutional layers are defined by a number F of feature maps (filter kernels) and the kernel size (a x b).
  • The number of trainable weights, including the bias, of a convolutional layer is F x (a x b) + F. It is worth noting that in each convolutional layer the stride is one and zero padding of the layer input is always performed, to guarantee that the first dimension of the layer output is the same as that of the layer input.
  • The upsampling layer simply copies each element of the layer input into a 2 x 1 vector and stacks these vectors following the original order, which actually doubles the first dimension of the layer input.
  • In a further advantageous embodiment, an input layer of the convolutional neural network is connected with the first convolutional layer, said first convolutional layer is connected with a max pooling layer, said max pooling layer is connected with the second convolutional layer, said second convolutional layer is connected with the upsampling layer, and said upsampling layer is connected with an output layer.
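  • A minimal Keras sketch of such a network (the number of feature maps F, the kernel size, the leaky ReLU slope, and the linear 1 x 1 output convolution are assumptions of this sketch, not values fixed by the text):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_postprocessor_cnn(frame_dim, F=32, kernel=3):
    """input -> conv -> max pooling -> conv -> upsampling -> output.

    Stride 1 and 'same' (zero) padding keep the first dimension of each
    convolutional layer's output equal to its input; max pooling halves
    it and upsampling doubles it again, as described above.
    """
    inp = layers.Input(shape=(frame_dim, 1))
    x = layers.Conv1D(F, kernel, strides=1, padding='same')(inp)
    x = layers.LeakyReLU()(x)                     # leaky ReLU activation
    x = layers.MaxPooling1D(pool_size=2)(x)       # halves the first dimension
    x = layers.Conv1D(F, kernel, strides=1, padding='same')(x)
    x = layers.LeakyReLU()(x)
    x = layers.UpSampling1D(size=2)(x)            # doubles it again
    out = layers.Conv1D(1, 1, padding='same')(x)  # linear output layer
    return tf.keras.Model(inp, out)

model = build_postprocessor_cnn(frame_dim=160)    # e.g., a 160-sample frame
```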
  • In a further embodiment, for each second representation an enhanced transcoded signal frame is generated based on the respective second representation obtained from the artificial neural network. Based on the enhanced transcoded signal frames, the enhanced digital signal is generated, e.g., by OLA or OLS.
  • In a further embodiment, the transcoded signal frames and/or the enhanced transcoded signal frames have a frame length between 1 ms and 100 ms.
  • Advantageously, for audio signals the frame length is between 5 ms and 35 ms, and for video signals between 1 ms and 100 ms.
  • In claim 14, a hardware device for post-processing of a transcoded digital signal including audio and/or video data to obtain an enhanced transcoded digital signal is proposed.
  • The hardware device is arranged to execute the method as described above.
  • Furthermore, a computer program according to claim 15 is arranged to execute the post-processing method as described above when the computer program is running on a computer device.
  • According to claim 16, a method for training an artificial neural network is proposed.
  • At first, a plurality of source signal frames and corresponding transcoded signal frames is provided.
  • Said source signal frames were generated by separating at least one source signal, and said transcoded signal frames were generated by separating at least one transcoded digital signal.
  • The separating step can be performed prior to the providing step. In other words, a plurality of sets of signal frames is provided, whereby each set of signal frames includes at least one source signal frame and at least one corresponding transcoded signal frame, which was obtained by encoding and decoding of the source signal.
  • Each transcoded digital signal was obtained by decoding of an encoded signal using a decoder and said encoded signal was obtained by encoding of the corresponding source signal using an encoder.
  • The decoding and encoding step can be performed prior to the separating step.
  • In a second step, a first representation within a processing domain for each transcoded signal frame and a second representation within said processing domain for each source signal frame are prepared. This can include that each transcoded signal frame is processed into the first representation within said processing domain and each source signal frame is processed into the second representation within said processing domain.
  • The source and the corresponding transcoded signal frames can be produced on the basis of the source and the corresponding transcoded signal segments.
  • The length and structure of the source and the corresponding transcoded signal frames are the same in training and also in further use of the artificial neural network. Then, a plurality of source signal frames and the corresponding transcoded signal frames is selected by comparing the power ratio of each source signal frame and the whole source signal to a threshold.
  • Then, each transcoded signal frame is processed into a first representation within a processing domain and each source signal frame is processed into a second representation within said processing domain.
  • The artificial neural network is trained by inputting the first and corresponding second representations such that a mapping from a first representation of a transcoded signal frame to a second representation of a source signal frame is learned.
  • The step of providing a plurality of source signal frames and corresponding transcoded signal frames comprises the step of selecting the source signal frames and the corresponding transcoded signal frames for training said artificial neural network by comparing the power ratio of each source signal frame and the whole source signal to a threshold.
  • The plurality of source signal frames is provided by using at least one source signal.
  • The at least one source signal is then separated into a plurality of source signal frames, e.g., by using a separating function.
  • The source signal frames can be overlapping in time or non-overlapping.
  • Then, at least one transcoded signal is generated by using an encoder and decoder.
  • The at least one transcoded signal transcoded from the at least one source signal is thus provided. Then, the at least one transcoded signal is separated into a plurality of transcoded signal frames.
  • Figure 1: General flowchart of post-processing for enhancement of transcoded signals.
  • Figure 4: Example of the structure of a convolutional neural network.
  • Figure 1 shows a general flowchart of post-processing for enhancement of transcoded signals.
  • A source signal s(n) is inputted to an encoder to obtain an encoded signal.
  • The encoded signal can be transmitted to the receiver side and then to a decoder for decoding the encoded signal.
  • The decoded signal ŝ(n), called in the present invention the transcoded signal ŝ(n), is then transferred to a post-processor for post-processing the transcoded signal ŝ(n).
  • The result of the post-processing is an enhanced transcoded signal s̃(n).
  • Figure 2 shows a high-level structure of the post-processor shown in figure 1.
  • First, the transcoded signal ŝ(n) is separated into a plurality of segments with signal vectors r(λ), with λ being the discrete segment index.
  • The signal vectors r(λ) typically represent 5 ms to 35 ms of audio, or 1 ms to 100 ms of video.
  • The length of the segment may depend on the decoder.
  • The segments r(λ) are delivered to the framing process, where each frame x(ℓ) is produced on the basis of one or a plurality of the segments r(λ).
  • After the framing, i.e., the production of the frames, each frame is transformed into the processing domain, for example the time domain, frequency domain, or cepstral domain.
  • The input vector of the neural network (the normalized frame x̄(ℓ) in the time domain, or the normalized envelope cepstra c̄_env(ℓ) in the cepstral domain) is obtained from the data preparation process with normalization, and may depend on one or a plurality of segments r(λ) from the past (λ-1, λ-2, ...), present (λ), or even future (λ+1, λ+2, ...).
  • The input vectors are processed by the neural network with the same structure as in the training stage.
  • Based on the output vectors of the neural network (the enhanced frame in the time domain, or the enhanced envelope cepstra in the cepstral domain), the signal is formed.
  • The output of this signal forming process is the enhanced transcoded signal s̃(n).
  • For the time domain solution, each transcoded segment r(λ) is provided after the segmentation process of the transcoded signal ŝ(n). Then, each frame x(ℓ) is produced after the framing process and is normalized with the mean and standard deviation values from the training stage.
  • The framing and data preparation step is shown in figure 3a, and the signal forming process including frame reconstruction in the cepstral domain is shown in figure 3b.
  • For the cepstral domain solution, each transcoded segment r(λ) is provided after the segmentation process of the transcoded signal ŝ(n). Then, each frame is produced after the framing process, and for each frame a Fast Fourier Transformation (FFT) of size K is performed, yielding the complex spectral coefficients with k being the frequency bin index. Then the Discrete Cosine Transform of type II (DCT-II) is performed on the log-magnitude values to obtain the cepstral coefficients.
  • (Equation 2) to obtain the vector c_env(ℓ) with elements c_ℓ(q), q ∈ Q_env, for the cepstral domain solution, with Q_env being the set of cepstral coefficient indices representing the spectral envelope.
  • Two vectors are stored for the following frame reconstruction step: first, the argument (phase) vector α(ℓ) of the ℓth frame complex FFT coefficients, and second, the residual cepstral coefficients vector c_res(ℓ) with elements c_ℓ(q), q ∈ Q_res, of the ℓth frame cepstral coefficients, with Q_res being the set of residual cepstral coefficient indices.
  • $\bar{c}_\ell(q) = \frac{c_\ell(q) - \mu_c(q)}{\sigma_c(q)}$ (Equation 3), where $\mu_c(q)$ and $\sigma_c(q)$ are the mean value and the standard deviation value from the training stage.
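  • As a one-line sketch of this normalization (mu_c and sigma_c are the training-stage statistics, assumed precomputed as numpy arrays):

```python
def normalize(c_env, mu_c, sigma_c):
    # Zero-mean, unit-variance normalization per coefficient (Equation 3)
    return (c_env - mu_c) / sigma_c
```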
  • The input vectors in the cepstral domain are processed by the neural network with the same structure as in the training stage. Based on the output vector of the neural network, the enhanced envelope cepstra ĉ_env(ℓ), the enhanced transcoded signal can be formed.
  • For this, a frame reconstruction process is performed first, as shown in figure 3b.
  • The output of the neural network and the residual cepstral coefficients c_res(ℓ), stored in the data preparation procedure, are concatenated to form the complete cepstral coefficients ĉ(ℓ).
  • Then, the inverse DCT-II (IDCT-II) is performed to go back to the logarithm domain of the amplitude spectrum.
  • Finally, the reconstructed frame in the time domain is obtained by taking the real part of the inverse FFT of the FFT coefficients vector.
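  • A minimal sketch of this frame reconstruction, inverting the data preparation sketched above (argument names are illustrative assumptions):

```python
import numpy as np
from scipy.fft import idct, ifft

def reconstruct_frame(c_env_hat, c_res, phase, K):
    """Concatenate enhanced envelope and stored residual cepstra,
    IDCT-II back to the log-magnitude spectrum, undo the logarithm,
    re-attach the stored phase, and take the real part of the IFFT."""
    c_hat = np.concatenate([c_env_hat, c_res])    # complete cepstrum
    log_mag = idct(c_hat, type=2, norm='ortho')   # log-magnitude spectrum
    S_hat = np.exp(log_mag) * np.exp(1j * phase)  # magnitude-phase spectrum
    return np.real(ifft(S_hat, n=K))              # reconstructed time frame
```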
  • The enhanced transcoded digital signal is then generated, respectively formed, from the output vectors (time domain) or the reconstructed frames (cepstral domain).
  • In the following, three different example fashions of signal forming methods along with corresponding framing methods will be introduced to finally obtain the enhanced transcoded signal s̃(n).
  • These signal forming methods can be used either for time domain processing or cepstral domain processing, and also for frequency domain processing with or without the logarithm.
  • (Equation 6), where N_w is the frame length and the frame shift is equal to the frame length N_w.
  • Signal forming now goes as follows: the processed frames are concatenated directly along the frame index to achieve the improved signal s̃(n), which could be expressed as
  • (Equation 7), where L is the number of frames for the speech to be formed.
  • The segmentation and framing procedure could be expressed as
  • (Equation 8), where N_w is the frame length and N_s is the frame shift. Note that a plurality of zeros is padded before the beginning of ŝ(n).
  • This approach also has no additional algorithmic latency beyond segmentation, but has longer frames to be processed compared to frame-wise direct forming.
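  • A minimal overlap-add sketch of the signal forming (with frame_shift equal to the frame length this degenerates to the frame-wise direct concatenation of Equation 7; with a smaller shift, the framing stage is assumed to use a window whose shifted copies add up to one):

```python
import numpy as np

def overlap_add(frames, frame_shift):
    """Form the enhanced signal from the processed frames."""
    n_frames, frame_len = frames.shape
    out = np.zeros((n_frames - 1) * frame_shift + frame_len)
    for l in range(n_frames):                      # sum the shifted frames
        out[l * frame_shift : l * frame_shift + frame_len] += frames[l]
    return out
```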
  • A neural network has to be trained. Independent of the chosen domain, a similar neural network topology will be used in this embodiment, with only different dimensions in the input and output layers.
  • An example of the convolutional neural network used in the present invention is shown in figure 4.
  • A plurality of source signal segments and the corresponding transcoded signal segments is provided. Then, the source and transcoded signal frames are produced on the basis of the source and the corresponding transcoded signal segments, respectively.
  • A simple frame-based voice activity detection (VAD) is performed to select the active frames for the training stage by comparing the power ratio of each source signal frame and the whole source signal to a threshold.
  • The threshold Θ_VAD is, e.g., fixed in advance.
  • The set N_ℓ contains all sample indices n belonging to frame ℓ, and |N_ℓ| denotes the number of elements in this set.
  • N contains all sample indices n belonging to the complete speech signal, and |N| denotes the number of elements in this set.
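  • A minimal sketch of this frame selection (the threshold value and the use of mean powers are assumptions of this sketch; src_frames and coded_frames are assumed numpy arrays of aligned frames):

```python
import numpy as np

def select_active_frames(src_frames, coded_frames, src_signal, threshold):
    """Keep the frame pairs whose source-frame power, relative to the
    power of the whole source signal, exceeds the VAD threshold."""
    total_power = np.mean(src_signal ** 2)         # power over all samples
    keep = [l for l in range(len(src_frames))
            if np.mean(src_frames[l] ** 2) / total_power > threshold]
    return src_frames[keep], coded_frames[keep]
```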
  • For training, the prepared inputs of the neural network will first go in forward direction through the neural network, yielding the network outputs y^(N), where N is the total number of layers. After that, the outputs are compared to the targets, guided by a cost function. The trainable weights of the neural network are then iteratively adjusted to minimize the cost function based on some learning rules (i.e., backpropagation training). When some preset stopping criteria are met, the training process is finished and the weights in the neural network stay unchanged.
  • Other kinds of neural networks could also be used, e.g., feed-forward neural networks, deep neural networks (DNNs), or recurrent neural networks (RNNs) such as long short-term memory (LSTM).
  • The input layer (first layer), in the time domain, is given by (Equation 12).
  • The convolutional layer 1 (second layer):
  • $\mathbf{y}_p^{(2)} = f^{(2)}(\mathbf{w}_p * \mathbf{i} + b_p)$ (Equation 13), where * denotes the convolution operation, $\mathbf{w}_p$ denotes the weight vector of the pth kernel, and $b_p$ denotes the pth bias; $M^{(1)}$ is the dimension of the input vector $\mathbf{i}$, and F is the number of kernels used in this layer. Please note that the frame index ℓ is omitted for convenience as soon as internal processing of the neural network is presented.
  • The convolution is computed as
  • $(\mathbf{w}_p * \mathbf{i})_m = w_{p,1}\, i_m + w_{p,2}\, i_{m+1}$ (Equation 14), with $i_m$ being zero when $m > M^{(1)}$; the kernel size here is two. Note that the stride of the kernel is one and the input vector $\mathbf{i}$ is zero-padded before the convolution is computed, to make sure that the output vector dimension is the same as the input vector dimension.
  • The activation function $f^{(2)}$ used here is the leaky rectified linear unit (ReLU) function, which can be denoted as $f^{(2)}(x) = x$ if $x > 0$ and $f^{(2)}(x) = \alpha x$ otherwise, with a small positive slope $\alpha$ (Equation 15).
  • The max-pooling layer (third layer):
  • $y_m^{(3)} = \max\big(y_{2m-1}^{(2)},\, y_{2m}^{(2)}\big)$ (Equation 16), with max() being the maximum function.
  • The first dimension of the matrix is decreased by half in the max-pooling layer.
  • The convolutional layer 2 (fourth layer):
  • (Equation 17), which is similar to the expressions in the second layer (convolutional layer 1).
  • The upsampling layer (fifth layer):
  • The cost function in terms of the mean squared error (MSE) between the outputs and targets can be written as $J = \frac{1}{|\mathcal{T}|} \sum_{\ell \in \mathcal{T}} \lVert \mathbf{t}(\ell) - \mathbf{y}(\ell) \rVert^2$, with $\mathcal{T}$ being the set of training frame indices, $\mathbf{t}(\ell)$ the target, and $\mathbf{y}(\ell)$ the network output.
  • The indices of the training set $\mathcal{T}$ are divided into D batches of the same size and with no repetition, which could be denoted as $\mathcal{T} = \mathcal{T}_1 \cup \mathcal{T}_2 \cup \ldots \cup \mathcal{T}_D$.
  • Accordingly, the corresponding training pairs are also divided into D batches and could be denoted as
  • (Equation 23), with O being the set of training pairs. Furthermore, the training pairs in each batch contribute to one weight update, and one epoch is finished when all training pairs in the training data have been processed.
  • The weights are then trained using batch backpropagation (BP), in which the weight matrix W is changed iteratively to minimize the cost function with the stochastic gradient descent (SGD) algorithm.
  • After each epoch, the MSE is calculated on the validation set, which could be denoted as $V(\mathbf{W}_g) = \frac{1}{|\mathcal{V}|} \sum_{\ell \in \mathcal{V}} \lVert \mathbf{t}(\ell) - \mathbf{y}^{(g)}(\ell) \rVert^2$ (Equation 24), where $V(\mathbf{W}_g)$ is the MSE on the validation set after the gth epoch, $\mathcal{V}$ is the set of frame indices of the validation set, and $\mathbf{y}^{(g)}(\ell)$ is the output of the neural network after the gth epoch.
  • The training process will end after the gth epoch if either of the following conditions is satisfied:
  • (Equation 25), where Θ_MSE is the MSE threshold.
  • The stop of the training process means that the neural network is assumed to have already achieved a state of proper generalization.
  • The structure of the neural network and the trained weight matrix set, together with the mean vector and the standard deviation vector, are stored for the further usage of the invention.
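  • A minimal Keras training sketch matching this procedure (batch size, learning rate, epoch limit, and the early-stopping parameters are assumptions; x_train/t_train and x_val/t_val are the prepared first/second representations of the training and validation sets, and model is the network sketched above):

```python
import numpy as np
import tensorflow as tf

# Mini-batch SGD with an MSE cost; min_delta plays the role of the MSE
# threshold in the stopping criterion, patience that of the epoch rule.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss='mse')

early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                              min_delta=1e-6, patience=3,
                                              restore_best_weights=True)

model.fit(x_train, t_train,
          validation_data=(x_val, t_val),
          batch_size=64, epochs=100,
          shuffle=True,                  # batches of equal size, no repetition
          callbacks=[early_stop])

# Store what the text says is kept for later use: the trained weights and
# the normalization statistics (mu, sigma assumed precomputed).
model.save_weights('postprocessor.weights.h5')
np.savez('normalization.npz', mean=mu, std=sigma)
```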

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Method for post-processing of a transcoded digital signal including audio and/or video data to obtain an enhanced transcoded digital signal, whereby said transcoded signal was obtained by decoding of an encoded signal using a decoder and said encoded signal was obtained by encoding of a source signal using an encoder, whereby the method comprises the following steps using a post-processor: providing a plurality of transcoded signal frames, whereby said transcoded signal frames were generated by separating one of said transcoded digital signals; processing each transcoded signal frame into a first representation within a processing domain; feeding each first representation of the transcoded signal frames into an artificial neural network to obtain for each first representation a second representation of the respective transcoded signal frame, whereby said artificial neural network is provided such that it is trained to learn a mapping from a representation of a transcoded signal frame within said processing domain to a representation of a source signal frame within said processing domain; generating an enhanced transcoded digital signal based on the second representations obtained from the artificial neural network; and outputting said enhanced transcoded digital signal.

Description

Method, hardware device and software program for post-processing of transcoded digital signal
The invention relates to a method for post-processing of a transcoded digital signal including audio and/or video data to get an enhanced transcoded digital signal, whereby said transcoded signal was obtained by decoding of an encoded signal using a decoder and said encoded signal was obtained by encoding of a source signal including audio and/or video data using an encoder. The invention relates also to a hardware device and a software program for executing said post-processing method. The invention relates also to a method for training an artificial neural network.
Digital signals including audio and/or video data are often stored on a hardware device and accessed (read out) at a later point in time. In other cases, digital signals including audio and/or video data are often transmitted from a first hardware device to a second hardware device. Note that, without loss of generality, also the process of “storage” can be considered to be a “transmission”, which will be our terminology in the following. During these transmissions, the digital signals must be transformed into a bit stream which is suitable for transmission of the data representing the digital signal over a transmission channel. The transcoding process includes two steps. At first, the source signal including the audio and/or video data must be encoded into an encoded digital signal using an encoder. This encoded digital signal is transmitted over the communication channel to a receiver, whereby the receiver must decode the encoded digital signal into a decoded digital signal. The joint processing of encoder and decoder is sometimes abbreviated as codec. To enhance the quality of the decoded digital signal, the decoded digital signal is sometimes post-processed.
Such decoded digital signals are often called “transcoded” signals or simply “coded” signals. Transcoded digital signals often suffer from far-end background noise, quantization noise, and potentially transmission errors. To enhance the quality of these transcoded signals, post-processing methods, operating just after decoding, can be advantageously employed. Due to the transmission bandwidth (or storage) limitation, transcoding typically performs so-called lossy compression to achieve a relatively low bit rate during transmission, while still preserving a reasonable audio and/or video quality at the same time. As a result, however, the reconstructed audio and/or video signal is degraded in quality due to quantization errors during the lossy compression process.
This kind of degradation cannot be effectively healed because during the lossy-compression process a part of the data and/or part of the information of the original digital signal is lost. To mitigate this problem, an extra post-processing process on the receiver side is well known from the state of the art. The basic idea of the post-processing method is to enhance the quality of a transcoded digital signal to reduce the signal distortion due to quantization, coding, and/or transmission errors.
To combat quantization errors at the receiver side, a kind of postfilter based on classical Wiener theory of optimal filtering has been standardized for the logarithmic pulse code modulation (PCM) G.711 codec (ITU, Rec. G.711: Pulse code modulation (PCM) of voice frequencies, International Telecommunication Union, Telecommunication Standardization Sector (ITU-T), November 1988). This postfilter uses a priori information on the A- or μ-law properties to estimate the quantization noise power spectral density (PSD) using a transcoded digital signal including speech data as audio data, assuming the quantization noise to be spectrally white. Then, a Wiener filter is derived by the estimation of the a priori signal-to-noise ratio (SNR) based on a two-step noise reduction approach (C. Plapous et al., “A two-step noise reduction technique,” in Proc. of ICASSP, Montreal, QC, Canada, May 2004, pp. I-289–292). After the filtering process, a limitation of distortions is performed to control the waveform difference between the original signal and the post-processed coded signal. However, as the bit rates go down for most of the modern codecs, it becomes more difficult for the classical Wiener filter to effectively suppress the quantization noise while maintaining the speech or the video perceptually undistorted, so the SNR drops. Note that the Wiener filter only minimizes the mean squared error (MSE), but not perceptual distortion.
Against this background, it is an aspect of the present invention to provide a better post-processing method to enhance the quality of a decoded digital signal containing audio and/or video data without modifying the encoder and decoder, respectively, i.e., without modifying the transmitter side and the receiver side. It is also an aspect of the present invention to provide a post-processing method to enhance the quality of a transcoded digital signal after the transcoded digital signal was fully decoded using a decoder.
The problem is solved by the post-processing method according to claim 1, the hardware device for post-processing according to claim 14, and the computer program according to claim 15.
According to claim 1, a method for post-processing of at least one transcoded digital signal including audio and/or video data to obtain at least one enhanced transcoded digital signal is proposed. In the sense of the present invention, audio data are typically data which include audible information like music, speech, sounds, or other noises. This audible information is coded into the digital signal as audio data. Video data are data which include “moving pictures”. Video data can include audio data.
The transcoded digital signal, which shall be processed by a post-processor, was obtained by decoding of an encoded digital signal using a decoder. In most cases, the decoded digital signal obtained by decoding of the encoded signal is the transcoded digital signal. It is possible that a post-processing method well known from the state of the art is applied to the decoded digital signal to enhance the quality of the transcoded digital signal in a previous step. Said encoded digital signal, furthermore, was obtained by encoding of a source signal using an encoder, whereby the source signal, advantageously, includes the raw data of the audio and/or video data.
According to the invention, the post-processing method uses a post-processor, whereby the post-processor can be a computer or any other electronic data processing unit. The basic idea of the present invention is to use an artificial neural network to enhance the transcoded digital signal without modifying the decoder on the receiver side or the encoder on the transmitter side. The artificial neural network has been trained to learn a mapping from parts of the transcoded signal to parts of the source signal, so that, based on the transcoded signal and using the trained artificial neural network, the source signal can be reconstructed or at least approximated with high quality. In a first step, a plurality of transcoded signal frames is provided, whereby said transcoded signal frames were generated by separating one of said transcoded digital signals. In an embodiment, the first step of providing said plurality of transcoded signal frames comprises the step of separating one of said transcoded digital signals into said plurality of transcoded signal frames. The first step of providing said plurality of transcoded signal frames can furthermore comprise the step of building the plurality of transcoded signal frames from a plurality of transcoded digital signal segments provided by the decoder, whereby each transcoded digital signal segment can be regarded as a transcoded digital signal derived from a superior transcoded digital signal.
A transcoded signal frame in the meaning of the present invention is a part of a transcoded digital signal. Furthermore, a transcoded digital signal can be a segment of a superior transcoded digital signal, which was segmented into a plurality of transcoded digital signals, often called transcoded digital signal segments.
In some cases, it is advantageous to merge the decoder and the post-processor into one single processing unit. This may particularly be advisable in order to save algorithmic delay of the decoder in conjunction with the post-processor, in case that both of these functions share the same structure.
The transcoded signal frames can be overlapping in time or non-overlapping. If a window function is used, the length of the transcoded signal frame is equal to the length of the window. In the next step, called data preparation, a first representation within a processing domain is prepared for each transcoded signal frame. The processing domain is a mathematical and/or physical description or specification to represent the transcoded signal frames in a mathematical and/or physical manner. In the simplest form, the representation of the transcoded signal frames within a processing domain is a description of the waveform of the transcoded signal frame (so-called time domain). Other processing domains are, for example, the frequency domain or the cepstral domain. The first representations are designated for feeding an artificial neural network as described below. In the broadest meaning of the present invention, the transcoded signal frames are provided such that at least one (or each) transcoded signal frame is provided in the first representation within a processing domain. In this case, the transcoded signal frames are provided within said processing domain. The data preparation step can furthermore include the step of processing each transcoded signal frame into said first representation within said processing domain.
Now, each first representation of the transcoded signal frames is inputted into an artificial neural network to obtain for each first representation a second representation of the respective transcoded signal frame. The artificial neural network is provided such that it is trained to learn a mapping from a representation of a transcoded signal frame within said predefined processing domain to a representation of the source signal frame within said processing domain. After this step, for each transcoded signal frame there advantageously exists an enhanced second representation of the respective transcoded signal frame obtained from the artificial neural network.
Based on the second representations obtained from the artificial neural network, an enhanced transcoded digital signal is generated by converting the second representations into the form of a digital signal including the audio and/or video data. After the generation of the enhanced transcoded digital signal, the enhanced transcoded digital signal is outputted. With the proposed method for post-processing in the present invention, it is possible to enhance audio and/or video data in a transcoded digital signal without modifying the encoder and decoder side. Using an artificial neural network, the post-processing method can be executed in real time, for example in a digital speech communication using digital speech codecs. By means of the present invention, the problems of the prior-art post-processing filters can be overcome and the quality gap between the source signal data and the transcoded signal data due to the lossy compression can be reduced without increase of the transmission bitrate. The loss of information by using the lossy compression can be reduced or minimized by using the post-processing method of the present invention without modifying the encoder or decoder and without modifying the lossy compression method itself. Given the fact that in many communication systems either the encoder and/or the decoder are standardized in a very specific fashion, this allows the use of the present invention in a standard-compatible manner. Furthermore, the loss of information raised by the lossy compression can be reduced and/or partly healed with the artificial neural network of the present invention. In an embodiment, said processing domain is the time domain, the frequency domain, the cepstral domain, or the log-magnitude domain.
In a first preferred embodiment, the processing domain is the time domain, whereby a waveform representation for each transcoded signal frame is prepared. In the broadest meaning of the present invention, each provided transcoded signal frame has a waveform representation within the time domain, so that no further processing steps for converting the transcoded frames into the waveform representation are necessary. For the time domain approach, a quite straightforward framework structure which fits most speech decoders can be used. The separated frames then serve directly as input of the artificial neural network, whereby the input vector is a representation of the waveform of the transmitted digital signal frame. Furthermore, it is also possible that the transcoded signal frames are processed into the waveform representation. The artificial neural network is provided such that it is trained to learn a mapping from the waveform representation of the transcoded signal frame to the waveform representation of the source signal frame. The enhanced transcoded signal is then generated based on the waveform representation obtained from the artificial neural network. For this purpose, an overlap-add (OLA) technique can, but need not, be used. In a preferred embodiment, the output of the artificial neural network has a frame structure, so that the enhanced digital signal can be generated directly from the output of the artificial neural network. In this case, the second representation obtained from the artificial neural network has a frame structure.
Furthermore, it is also possible that the frames are reconstructed based on the waveform representation obtained from the artificial neural network and the enhanced transcoded signal is generated based on the reconstructed frames.
The time domain approach fits well into many contexts and is also very suitable for integration into the decoder processing, because if the time domain post-processor is embedded into the segmentation structure of the decoder, no additional algorithmic delay is incurred beyond the already provided segmentation. The decoder segmentation can be used for providing the plurality of frames without any further segmentation.
In a further preferred embodiment, said processing domain is the frequency domain, whereby the transcoded signal frames are processed in the frequency domain by transforming each transcoded signal frame into a magnitude-phase representation or into a real and imaginary part representation by using, for example, the Fast Fourier Transformation (FFT). This representation in the frequency domain (for example a spectrum vector or a part of it) is then inputted into the artificial neural network, whereby the artificial neural network is provided such that it is trained to learn a mapping from the magnitude-phase representation or from the real and imaginary part representation of a transcoded signal frame to the magnitude-phase representation or to the real and imaginary part representation of the source signal frame. The enhanced transcoded signal is generated based on the magnitude-phase representation or the real and imaginary part representation obtained from the artificial neural network. An overlap-add (OLA) technique or an overlap-save (OLS) technique can be used along with the inverse transformation. It is advantageous if the frames are reconstructed based on the magnitude-phase representation or the real and imaginary part representation obtained from the artificial neural network and the enhanced transcoded signal is generated based on the reconstructed frames. In the frequency domain, advantageously, the magnitude spectrum is subject to a logarithm function, resulting in the so-called log-magnitude domain being used as representation domain at the input and/or output of the artificial neural network. The log-magnitude representation of a source signal frame can be subject to an inverse logarithm function and appended with the phase as obtained above to obtain a magnitude-phase representation of a source signal frame.
In a further embodiment, said processing domain is the cepstral domain, whereby the transcoded signal frames are processed into the cepstral domain by transforming each transcoded signal frame into a cepstral coefficient representation. This cepstral coefficient representation of each transcoded signal frame is, e.g., separated into two parts: the cepstral coefficient representation responsible for the spectral envelope and the residual cepstral coefficient representation. The spectral envelope cepstral coefficient representation is inputted into the artificial neural network to obtain an enhanced spectral envelope cepstral coefficient representation, whereby the enhanced transcoded signal is generated based on the spectral envelope cepstral coefficient representation obtained from the artificial neural network and the residual cepstral coefficient representation. It is advantageous if the frames are reconstructed based on the spectral envelope cepstral coefficient representation obtained from the artificial neural network and the enhanced transcoded signal is generated based on the reconstructed frames.
The artificial neural network is provided such that it is trained to learn a mapping from the spectral envelope cepstral coefficient representation of a transcoded signal frame to the spectral envelope cepstral coefficient representation of a source signal frame.
In a further advantageous embodiment, said artificial neural network is a convolutional neural network. Advantageously, said convolutional neural network has a plurality of hidden layers, whereby the hidden layers comprise at least one convolutional layer, at least one max pooling layer and at least one upsampling layer. The convolutional layers are defined by a number F of feature maps (filter kernels) and the kernel size (a x b). The number of trainable weights, including the bias, of a convolutional layer is denoted as F x (a x b) + F. It is worth noting that in each convolutional layer, the stride is one and zero padding of the layer input is always performed to guarantee that the first dimension of the layer output is the same as that of the layer input. In the max pooling layers, a 2 x 1 maximum filter is applied in a non-overlapping fashion, resulting in a 50 % reduction of the layer input along the first dimension. In contrast, the upsampling layer simply copies each element of the layer input into a 2 x 1 vector and stacks these vectors in the original order, which doubles the first dimension of the layer input.
In a further advantageous embodiment, an input layer of the convolutional neural network is connected with the first convolutional layer, said first convolutional layer is connected with a max pooling layer, said max pooling layer is connected with the second convolutional layer, said second convolutional layer is connected with the upsampling layer and said upsampling layer is connected with an output layer.
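To make this layer chain concrete, the following is a minimal PyTorch sketch of the topology just described (input - convolution - max pooling - convolution - upsampling - output). The filter counts F1 and F2, the leaky-ReLU slope of 0.01, and the use of a single-kernel convolutional output layer are illustrative assumptions not fixed by the text; the stride-1, kernel-size-2, zero-padded convolutions and the 2 x 1 pooling/upsampling follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SameLengthConv1d(nn.Module):
    """Stride-1 convolution with kernel size 2; the input is zero-padded
    so the output length equals the input length, as described above."""
    def __init__(self, c_in, c_out, k=2):
        super().__init__()
        self.conv = nn.Conv1d(c_in, c_out, k)
        self.k = k
    def forward(self, x):
        return self.conv(F.pad(x, (0, self.k - 1)))  # zero-pad at the end

class PostProcessorCNN(nn.Module):
    def __init__(self, F1=16, F2=8):
        super().__init__()
        self.conv1 = SameLengthConv1d(1, F1)
        self.pool = nn.MaxPool1d(2)            # 2x1 non-overlapping max: halves the first dimension
        self.conv2 = SameLengthConv1d(F1, F2)
        self.up = nn.Upsample(scale_factor=2)  # element duplication: doubles the first dimension
        self.out = SameLengthConv1d(F2, 1)     # assumed single-map output layer
        self.act = nn.LeakyReLU(0.01)          # leaky ReLU; slope 0.01 is an assumption
    def forward(self, x):                      # x: (batch, 1, M), M even
        h = self.act(self.conv1(x))
        h = self.pool(h)
        h = self.act(self.conv2(h))
        h = self.up(h)
        return self.out(h)                     # same length M as the input
```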
In a further embodiment, for each second representation an enhanced transcoded signal frame is generated based on the respective second representation obtained from the artificial neural network. Based on the enhanced transcoded signal frames, the enhanced digital signal is generated, e.g., by OLA or OLS.
In a further embodiment, the transcoded signal frames and/or the enhanced transcoded signal frames comprise a frame length between 1 ms and 100 ms. Advantageously, for audio signals the frame length is between 5 ms and 35 ms, and for video signals between 1 ms and 100 ms.
These short frame lengths ensure that the data within one frame is approximately stationary, i.e., that the statistics of the data do not change substantially within one frame.

In claim 14, a hardware device for post-processing of a transcoded digital signal including audio and/or video data to get an enhanced transcoded digital signal is proposed. The hardware device is arranged to execute the method as described above.
Furthermore, a computer program according to claim 15 is arranged to execute the post-processing method as described above when the computer program is run on a computer device.
According to claim 16, a method for training an artificial neural network is proposed. At first, a plurality of source signal frames and corresponding transcoded signal frames are provided. Said source signal frames were generated by separating at least one source signal, and said transcoded signal frames were generated by separating at least one transcoded digital signal. The separating step can be performed prior to the providing step. In other words, a plurality of sets of signal frames is provided, whereby each set of signal frames includes at least one source signal frame and at least one corresponding transcoded signal frame, which was obtained by encoding and decoding of the source signal. Each transcoded digital signal was obtained by decoding of an encoded signal using a decoder, and said encoded signal was obtained by encoding of the corresponding source signal using an encoder. The decoding and encoding step can be performed prior to the separating step.

In a second step, a first representation within a processing domain for each transcoded signal frame and a second representation within said processing domain for each source signal frame are prepared. This can include that each transcoded signal frame is processed into the first representation within said processing domain and each source signal frame is processed into the second representation within said processing domain. The source and the corresponding transcoded signal frames can be produced on the basis of the source and the corresponding transcoded signal segments. The length and structure of the source and the corresponding transcoded signal frames are the same in training and also in further use of the artificial neural network. Then, a plurality of source signal frames and the corresponding transcoded signal frames are selected by comparing the power ratio of each source signal frame and the whole source signal to a threshold.
In the next step, each transcoded signal frame is processed into a first representation within a processing domain and each source signal frame is processed into a second representation within said processing domain. Then, the artificial neural network is trained by inputting the first and corresponding second representations such that a mapping from a first representation of a transcoded signal frame to a second representation of a source signal frame is trained. In an embodiment, the step of providing a plurality of source signal frames and corresponding transcoded signal frames comprises the step of selecting the source signal frames and the corresponding transcoded signal frames for training said artificial neural network by comparing the power ratio of each source signal frame and the whole source signal to a threshold.
In a further embodiment, it is possible that the plurality of source signal frames is provided by using at least one source signal. The at least one source signal is then separated into a plurality of source signal frames, e.g., by using a separating function. The source signal frames can be overlapping in time or non-overlapping. It is further possible that for the at least one source signal at least one transcoded signal is generated by using an encoder and a decoder. It is also possible that the at least one transcoded signal transcoded from the at least one source signal is provided. Then, the at least one transcoded signal is separated into a plurality of transcoded signal frames.

The present invention will be described in more detail by reference to the following figures:
Figure 1 - General flowchart of post-processing for enhancement of transcoded signals;
Figure 2 - High-level structure of the post-processing method;
Figure 3a, 3b - Processing structure of the cepstral domain approach;
Figure 4 - Example of the structure of a convolutional neural network.
Figure 1 shows a general flowchart of post-processing for enhancement of transcoded signals. At first, a source signal s(n) is inputted to an encoder to obtain an encoded signal. The encoded signal can be transmitted to the receiver side and then to a decoder for decoding the encoded signal. The decoded signal ŝ(n), referred to in the present invention as the transcoded signal ŝ(n), is then transferred to a post-processor for post-processing the transcoded signal ŝ(n). The result of the post-processing is the enhanced transcoded signal s̃(n).
Figure 2 shows a high-level structure of the post-processor shown in figure 1. Firstly, the transcoded signal ŝ(n) is separated into a plurality of segments with signal vectors r(λ), with λ being the discrete segment index. The signal vectors r(λ) typically represent 5 ms to 35 ms of audio, or 1 ms to 100 ms of video. The length of the segment may depend on the decoder.

After the segmentation, the segments r(λ) are delivered to the framing process, where each frame x(ℓ) is produced on the basis of one or a plurality of the segments r(λ). The framing (i.e., production of the frames) can be done with or without overlap and with or without any windowing function.

Then, in the data preparation process, the frames are prepared for inputting into the artificial neural network. During the data preparation process, each frame is transformed into the processing domain, for example the time domain, the frequency domain, or the cepstral domain. The input vector of the neural network (x̄(ℓ) for the time domain and c̄_env(ℓ) for the cepstral domain) is obtained from the data preparation process with normalization, and may depend on one or a plurality of segments r(λ) from the past (λ − 1, λ − 2, ...), the present (λ), or even the future (λ + 1, λ + 2, ...).

Then, the input vectors are processed by the neural network with the same structure as in the training stage. As a result, the output of the neural network (x̃(ℓ) for the time domain and c̃_env(ℓ) for the cepstral domain) is obtained.
Based on the output of the neural network (x̃(ℓ) for the time domain and c̃_env(ℓ) for the cepstral domain), which is the enhanced second representation within the preferred processing domain, the signal is formed from these output vectors. The output of this signal forming process is the enhanced transcoded signal s̃(n).
In the time domain solution, each transcoded segment r(λ) is provided after the segmentation process of the transcoded signal ŝ(n). Then, each frame x(ℓ) is produced after the framing process and is normalized as

\bar{x}(\ell) = (x(\ell) - \mu_x) / \sigma_x   (Equation 1)

where \mu_x and \sigma_x are the mean vector and the standard deviation vector from the training stage, and the division is to be performed element-wise.
For the cepstral domain solution, the framing and data preparation step is shown in figure 3a and the signal forming process including frame reconstruction in the cepstral domain is shown in figure 3b.
For the transcoded signal, each transcoded segment r(λ) is provided after the segmentation process of the transcoded signal ŝ(n). Then, each frame x(ℓ) is produced after the framing process, and for each frame a Fast Fourier Transform (FFT) of size K is performed, yielding X(ℓ, k) with k being the frequency bin. Then the Discrete Cosine Transform of type II (DCT-II) is performed on the log-magnitude values of X(ℓ, k) to obtain the cepstral coefficients. The transform can be expressed as

c(\ell, q) = \sum_{k=0}^{K-1} \log(|X(\ell, k)|) \cdot \cos(\pi q (k + 0.5) / K)   (Equation 2)

to obtain the vector c_env(ℓ) with elements c(ℓ, q), q ∈ Q_env, for the cepstral domain solution, with Q_env being the set of cepstral coefficient indices representing the spectral envelope. Two vectors are stored for the following frame reconstruction step: first, the argument vector α(ℓ) of the ℓth frame complex FFT coefficients, and second, the residual cepstral coefficients vector c_res(ℓ) with elements c(ℓ, q), q ∈ Q_res, of the ℓth frame cepstral coefficients, with Q_res being the set of residual cepstral coefficient indices. As a result, the set Q = {Q_env, Q_res} contains all cepstral coefficient indices. For example, the first 32 coefficients are regarded as spectral envelope coefficients when K equals 512, and the remaining 480 coefficients are regarded as the residual cepstral coefficients. Finally, the input vector for the input of the neural network, c̄_env(ℓ), is obtained after the normalization of c_env(ℓ), which can be expressed by the element-wise normalization

\bar{c}(\ell, q) = (c(\ell, q) - \mu_c(q)) / \sigma_c(q), \quad q \in Q_\mathrm{env}   (Equation 3)

where \mu_c(q) and \sigma_c(q) are the mean value and the standard deviation value from the training stage.
This preparation process is shown in figure 3a.
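The preparation of figure 3a can be sketched as follows with numpy/scipy. K = 512 and the 32 envelope coefficients follow the example above; mu_c and sigma_c denote the training-stage statistics, and the helper name prepare_cepstral_input is illustrative.

```python
import numpy as np
from scipy.fft import dct

def prepare_cepstral_input(x_frame, mu_c, sigma_c, K=512, n_env=32):
    """Sketch of the figure-3a preparation: FFT -> log-magnitude -> DCT-II
    -> split into envelope/residual coefficients -> element-wise normalization."""
    X = np.fft.fft(x_frame, n=K)
    alpha = np.angle(X)                      # argument vector a(l), stored for reconstruction
    log_mag = np.log(np.abs(X) + 1e-12)
    c = dct(log_mag, type=2, norm=None) / 2  # cepstral coefficients c(l, q) as in Equation 2
    c_env, c_res = c[:n_env], c[n_env:]      # spectral envelope / residual split
    c_env_bar = (c_env - mu_c) / sigma_c     # normalization (Equation 3)
    return c_env_bar, c_res, alpha
```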
The input vectors in the cepstral domain are processed by the neural network with the same structure as in the training stage. Based on the output vector of the neural network, c̃_env(ℓ), the enhanced transcoded signal can be formed. In the cepstral domain, a frame reconstruction process is performed first, as shown in figure 3b. The output of the neural network c̃_env(ℓ) and the residual cepstral coefficients c_res(ℓ), stored in the data preparation procedure, are concatenated to form the complete cepstral coefficients c̃(ℓ). Then the inverse DCT-II (IDCT-II) is performed to go back to the logarithm domain of the amplitude spectrum, which is denoted as

\log(|\tilde{X}(\ell, k)|) = \frac{1}{K} \tilde{c}(\ell, 0) + \frac{2}{K} \sum_{q=1}^{K-1} \tilde{c}(\ell, q) \cdot \cos(\pi q (k + 0.5) / K)   (Equation 4)

Then, the exponential function is used to obtain |X̃(ℓ, k)|, and the complex FFT coefficients are computed along with the pre-stored argument vector as

\tilde{X}(\ell, k) = |\tilde{X}(\ell, k)| \cdot \exp(j \cdot \alpha(\ell, k))   (Equation 5)

with j being the imaginary unit.

Finally, the reconstructed frame in the time domain is obtained by taking the real part of the inverse FFT of the FFT coefficients vector X̃(ℓ).
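Correspondingly, a minimal sketch of the figure-3b reconstruction, assuming the network output has already been de-normalized with the training-stage mean and standard deviation:

```python
import numpy as np
from scipy.fft import idct

def reconstruct_frame(c_env_tilde, c_res, alpha, N_w=None):
    """Sketch of figure 3b: concatenate enhanced envelope and stored residual
    coefficients, IDCT-II back to the log-magnitude domain (Equation 4),
    re-attach the stored phase (Equation 5), inverse FFT, take the real part."""
    c_tilde = np.concatenate([c_env_tilde, c_res])    # complete cepstral coefficients
    log_mag = idct(2.0 * c_tilde, type=2, norm=None)  # inverse of Equation 2 (factor 2 undoes the scipy scaling)
    X_tilde = np.exp(log_mag) * np.exp(1j * alpha)    # magnitude times stored phase
    x_tilde = np.real(np.fft.ifft(X_tilde))           # reconstructed time-domain frame
    return x_tilde[:N_w] if N_w is not None else x_tilde
```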
Subsequently, the enhanced transcoded digital signal is generated, i.e., formed, from the output vectors (time domain) or the reconstructed frames (cepstral domain). In the following, three different example signal forming methods, along with their corresponding framing methods, will be introduced to finally obtain the enhanced transcoded signal s̃(n). These signal forming methods can be used for time domain or cepstral domain processing, and also for frequency domain processing with or without the logarithm.
Frame-wise direct signal forming

The segmentation and framing procedure can be expressed as

x(\ell) = [\hat{s}(\ell N_w - N_w + 1), \ldots, \hat{s}(\ell N_w)]^T   (Equation 6)

where N_w is the frame length and the frame shift is equal to the frame length N_w. In this case, the current frame is obtained by directly taking the current signal segment without overlapping and windowing. Therefore, the signal frames and signal segments are identical, denoted as x(ℓ) = r(λ).

Signal forming now goes as follows: the processed frames x̃(ℓ) are concatenated directly along the frame index to achieve the improved signal s̃(n), which can be expressed as

\tilde{s} = [\tilde{x}^T(1), \tilde{x}^T(2), \ldots, \tilde{x}^T(L)]^T   (Equation 7)

where L is the number of frames of the speech to be formed. The approach has no additional algorithmic latency beyond the segmentation.

Current segment and past signal forming
The segmentation and framing procedure can be expressed as

x(\ell) = [\hat{s}(\ell N_s - N_w + 1), \ldots, \hat{s}(\ell N_s)]^T   (Equation 8)

where N_w is the frame length and N_s is the frame shift. Note that a plurality of zeros are padded before the beginning of ŝ(n). In this case, the current frame x(ℓ) is obtained by taking the current signal segment r(λ) along with several past samples from the past transcoded signal (which can be obtained from the past transcoded segments r(λ − 1), r(λ − 2), ...), without windowing. Therefore, the signal frame can be denoted as x(ℓ) = [ŝ(n − N_w + 1), ..., ŝ(n − N_s), r(λ)], with the signal segment length being N_s.

Signal forming now goes as follows: the improved signal s̃(n) is achieved with the processed frames x̃(ℓ) as

\tilde{s}((\ell - 1) N_s + i) = \tilde{x}(\ell, N_w - N_s + i), \quad i = 1, \ldots, N_s   (Equation 9)

i.e., only the part of each processed frame corresponding to the current segment is used. This approach also has no additional algorithmic latency beyond segmentation, but has longer frames to be processed compared to frame-wise direct forming.
Overlap-add signal forming

The segmentation and framing procedure are performed first with overlap, within which the frames are also multiplied with a windowing function, which can be expressed as

x(\ell) = [\hat{s}((\ell - 1) N_s + 1), \ldots, \hat{s}((\ell - 1) N_s + N_w)]^T \circ f_w   (Equation 10)

where N_w is the frame length, N_s is the frame shift length, f_w is the window function (e.g., Hann window), [\,]^T is the vector transpose, and \circ is defined here as element-wise multiplication. Note that a plurality of zeros are padded before the beginning of ŝ(n). As an example, a 50 % overlap can be performed with N_s = N_w / 2. In this example case, the current frame is obtained by taking the current signal segment and one future segment, i.e., one segment lookahead. Therefore, the signal frame can be denoted as x(ℓ) = [r(λ), r(λ + 1)]^T ∘ f_w, with the signal segment length being N_s.

Signal forming now goes as follows: the improved speech s̃(n) is achieved with overlap-add of the processed frames x̃(ℓ), which can be expressed as

\tilde{s}((\ell - 1) N_s + i) = \tilde{x}(\ell, i) + \tilde{x}(\ell - 1, i + N_s), \quad i \in I_1   (Equation 11)

where I_1 = {1, ..., N_s} is the beginning part of the indices in a frame and I_2 = {N_s + 1, ..., N_w} is the end part of the indices in a frame. This method is expected to perform with the best quality, while it has an algorithmic latency of the shift length.
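A minimal sketch of overlap-add signal forming for already windowed, processed frames; it assumes the analysis windows of adjacent frames sum to one (as for a Hann window with 50 % overlap), so no extra synthesis window is applied.

```python
import numpy as np

def ola_signal_forming(frames_tilde, N_s):
    """Overlap-add forming in the spirit of Equation 11: `frames_tilde` is an
    (L, N_w) array of processed, windowed frames with frame shift N_s;
    overlapping parts of consecutive frames are summed."""
    L, N_w = frames_tilde.shape
    s_tilde = np.zeros((L - 1) * N_s + N_w)
    for l in range(L):
        s_tilde[l * N_s : l * N_s + N_w] += frames_tilde[l]  # add frame l at its shift position
    return s_tilde
```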
Depending on the chosen domain, a neural network has to be trained. Independent of the chosen domain, a similar neural network topology is used in this embodiment, with only the dimensions of the input and output layers differing. An example of the convolutional neural network used in the present invention is shown in figure 4.
A plurality of source signal segments and the corresponding transcoded signal segments are provided. Then, the source and transcoded signal frames are produced on the basis of the source and the corresponding transcoded signal segments, respectively.
A simple frame-based voice activity detection (VAD) is performed to select the active frames for the training stage by comparing the power ratio of each source signal frame and the whole source signal to a threshold. A source frame is regarded as a speech-active source frame with index ℓ′ if

\sigma^2(\ell) / \bar{\sigma}^2 > \Theta_\mathrm{VAD}, \quad \text{with} \quad \sigma^2(\ell) = \frac{1}{|\mathcal{N}_\ell|} \sum_{n \in \mathcal{N}_\ell} |s(n)|^2, \quad \bar{\sigma}^2 = \frac{1}{|\mathcal{N}|} \sum_{n \in \mathcal{N}} |s(n)|^2

and the threshold \Theta_\mathrm{VAD} is, e.g., 0.0001. The set N_ℓ contains all sample indices n belonging to frame ℓ, and |N_ℓ| denotes the number of elements in this set. Similarly, N contains all sample indices n belonging to the complete speech signal, and |N| denotes the number of elements in this set. By performing a VAD, only the active source signal frames and the corresponding transcoded signal frames will be used for the training stage, while the remaining parts (speech pauses) will not be used for the training stage.
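A minimal sketch of this selection step; `frame_indices` stands for the sets N_ℓ of sample indices per frame, and the function name is illustrative.

```python
import numpy as np

def select_active_frames(s, frame_indices, theta_vad=0.0001):
    """Frame-based VAD for training data selection: a frame is speech-active
    if its mean power exceeds theta_vad times the mean power of the whole
    source signal s."""
    mean_power_signal = np.mean(np.abs(s) ** 2)          # power over the whole signal
    active = []
    for l, n_l in enumerate(frame_indices):
        mean_power_frame = np.mean(np.abs(s[n_l]) ** 2)  # power of frame l
        if mean_power_frame / mean_power_signal > theta_vad:
            active.append(l)                             # keep only speech-active frames
    return active
```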
In the case of video, a similar selection procedure can be used by excluding long static scenes from the training process.
During the training of the neural network, the prepared inputs of the neural network first go in the forward direction through the neural network, yielding the network outputs y^{(N)}, where N is the total number of layers. After that, the outputs are compared to the targets, guided by a cost function. The trainable weights of the neural network are then iteratively adjusted to minimize the cost function based on a learning rule (i.e., backpropagation training). When some preset stopping criteria are met, the training process is finished and the weights in the neural network stay unchanged.
A kind of convolutional neural network with N = 6, as used in the invention, is depicted in figure 4; this is just an example of the structure and topology and can be adjusted as needed. Other kinds of neural networks could also be used, e.g., feedforward neural networks, deep neural networks (DNNs), or recurrent neural networks (RNNs) such as long short-term memory (LSTM) networks.
In the following, the training stage is presented in detail for the convolutional neural network depicted in figure 4.
Output for each layer
The input layer (first layer):

y_m^{(1)} = i_m = \begin{cases} \bar{x}(\ell', m) & \text{in the time domain} \\ \bar{c}(\ell', m), \; m \in Q_\mathrm{env} & \text{in the cepstral domain} \end{cases}   (Equation 12)

where m denotes the index of the input vector.
The convolutional layer 1 (second layer):

y_p^{(2)} = f^{(2)}(w_p^{(2)} * i + b_p^{(2)}), \quad p = 1, \ldots, F^{(2)}   (Equation 13)

where * denotes the convolution operation, w_p^{(2)} denotes the weight vector of the pth kernel, b_p^{(2)} denotes the pth bias, M^{(1)} is the dimension of the input vector i, and F^{(2)} is the number of kernels used in this layer. Please note that the frame index ℓ′ is omitted for convenience as soon as the internal processing of the neural network is presented. The convolution is computed as

(w_p^{(2)} * i)_m = w_{p,1}^{(2)} \, i_m + w_{p,2}^{(2)} \, i_{m+1}   (Equation 14)

with i_m being zero when m > M^{(1)}; the kernel size here is two. Note that the stride of the kernel is one and the input vector i is zero-padded before the convolution is computed, to make sure that the output vector dimension is the same as the input vector dimension. The activation function f^{(2)} used here is the leaky rectified linear unit (ReLU) function, which can be denoted as

f^{(2)}(x) = \begin{cases} x & \text{if } x > 0 \\ \beta x & \text{otherwise} \end{cases}   (Equation 15)

with a small positive slope \beta.
The max-pooling layer (third layer):

y_{p,m}^{(3)} = \max\left(y_{p,2m-1}^{(2)}, \; y_{p,2m}^{(2)}\right)   (Equation 16)

with max() being the maximum function. The first dimension of the matrix is decreased by half in the max-pooling layer.
The convolutional layer 2 (fourth layer):

y_p^{(4)} = f^{(4)}(w_p^{(4)} * y^{(3)} + b_p^{(4)}), \quad p = 1, \ldots, F^{(4)}   (Equation 17)

which is similar to the expressions in the second layer (convolutional layer 1).
The upsampling layer (fifth layer):

y_{p,2m-1}^{(5)} = y_{p,2m}^{(5)} = y_{p,m}^{(4)}   (Equation 18)

It can be seen that the first dimension of the matrix is doubled in this layer.
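A small numpy illustration of the 2 x 1 max pooling (Equation 16) and the element-copying upsampling (Equation 18) on a toy vector:

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])
pooled = x.reshape(-1, 2).max(axis=1)  # non-overlapping 2x1 max pooling (Equation 16)
upsampled = np.repeat(pooled, 2)       # each element copied into a 2x1 vector (Equation 18)
print(pooled)     # [3. 4. 9.]         -> first dimension halved
print(upsampled)  # [3. 3. 4. 4. 9. 9.] -> first dimension doubled again
```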
Finally, the output layer (sixth layer):

y^{(6)} = f^{(6)}(w^{(6)} * y^{(5)} + b^{(6)})   (Equation 19)

which combines the feature maps of the fifth layer into the output vector y^{(6)}.
Weights updating
After the output vector y^{(6)} is achieved, the cost function in terms of the mean squared error (MSE) between the outputs and targets can be written as

E_d(\mathcal{W}) = \frac{1}{|B_d|} \sum_{\ell' \in B_d} \left\| t(\ell') - y^{(6)}(\ell') \right\|^2   (Equation 20)

where \mathcal{W} is the set of weight matrices in all layers of the neural network and B_d is the index set for the dth batch. The term t(ℓ′) is the target vector, in which the elements can be denoted as

t(\ell', m) = \begin{cases} \bar{x}(\ell', m) & \text{in the time domain} \\ \bar{c}(\ell', m), \; m \in Q_\mathrm{env} & \text{in the cepstral domain} \end{cases}   (Equation 21)

taken from the respective source signal frame. Specifically, the indices of the training set \mathcal{T} are divided into D batches of the same size and with no repetition, which can be denoted as

\mathcal{T} = B_1 \cup B_2 \cup \ldots \cup B_D   (Equation 22)

Similarly, the corresponding training pairs are also divided into D batches and can be denoted as

\mathcal{O} = O_1 \cup O_2 \cup \ldots \cup O_D   (Equation 23)

with \mathcal{O} being the training pairs. Furthermore, the training pairs in each batch contribute to one weight update, and one epoch is finished when all training pairs in the training data have been processed.
The weights are then trained using batch backpropagation (BP), in which the weight matrix set \mathcal{W} is changed iteratively to minimize the cost function with the stochastic gradient descent (SGD) algorithm.
Stop criteria
After every epoch, the MSE is calculated on the validation set, which can be denoted as

V(\mathcal{W}_g) = \frac{1}{|\mathcal{V}|} \sum_{\ell' \in \mathcal{V}} \left\| t(\ell') - y_g^{(6)}(\ell') \right\|^2   (Equation 24)

where V(\mathcal{W}_g) is the MSE on the validation set after the gth epoch, \mathcal{V} is the set of frame indices of the validation set, and y_g^{(6)} is the output of the neural network after the gth epoch. The training process ends after the gth epoch if either of the following conditions is satisfied:

V(\mathcal{W}_{g-1}) - V(\mathcal{W}_g) < \Theta_\mathrm{MSE}   (Equation 25)

where \Theta_\mathrm{MSE} is the MSE threshold, or a preset maximum number of epochs has been reached. The stop of the training process means that the neural network is assumed to have achieved a state of proper generalization. Finally, the structure of the neural network and the trained weight matrix set, together with the mean vector and the standard deviation vector, are stored for the further usage of the invention.
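The complete training stage can be sketched as follows in PyTorch; the batch iterables, learning rate, and maximum epoch count are illustrative assumptions, and the stop criterion follows the threshold form of Equation 25.

```python
import torch
import torch.nn as nn

def train(model, train_batches, val_pairs, theta_mse=1e-6, max_epochs=100, lr=1e-3):
    """Sketch of batch backpropagation with SGD on the MSE cost (Equation 20),
    stopping when the validation MSE (Equation 24) no longer improves by more
    than theta_mse, or when max_epochs is reached."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    v_prev = float("inf")
    for g in range(1, max_epochs + 1):
        for inputs, targets in train_batches:  # one weight update per batch B_d
            opt.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()                    # backpropagation
            opt.step()                         # stochastic gradient descent step
        with torch.no_grad():                  # validation MSE after epoch g (Equation 24)
            v = sum(loss_fn(model(i), t).item() for i, t in val_pairs) / len(val_pairs)
        if v_prev - v < theta_mse:             # stop criterion (Equation 25)
            break
        v_prev = v
    return model
```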


Patent claims
1. Method for post-processing of at least one transcoded digital signal including audio and/or video data to obtain at least one enhanced transcoded digital signal, whereby one of said transcoded signal was obtained by decoding of an encoded signal using a decoder and said encoded signal was obtained by encoding of a source signal using an encoder, whereby the method comprising the following steps using a postprocessor:
providing a plurality of transcoded signal frames, whereby said transcoded signal frames were generated by separating one of said transcoded digital signal;
preparing a first representation within a processing domain for each transcoded signal frame;
feeding each first representation of the transcoded signal frames into an artificial neural network to obtain for each first representation a second representation of the respective transcoded signal frame, whereby said artificial neural network is provided such that the artificial neural network is trained a mapping from a representation of a transcoded signal frame within said processing domain to a representation of a source signal frame within said processing domain;
generating an enhanced transcoded digital signal based on the second representations obtained from the artificial neural network; and outputting said enhanced transcoded digital signal.
2. Method according to claim 1 , wherein said providing step comprises:
separating one of said transcoded digital signal into said plurality of transcoded signal frames, each transcoded signal frame being a part of said transcoded digital signal; or
building said plurality of transcoded signal frames from a plurality of transcoded digital signal segments provided from the decoder.
3. Method according to claim 1 or 2, wherein said preparing step comprises:
processing each transcoded signal frame into a first representation within said processing domain.
4. Method according to one of claim 1 to 3, wherein said processing domain is the time domain, the frequency domain, the cepstral domain or the log-magnitude domain.
5. Method according to one of claim 1 to 3, wherein said processing domain is the time domain, whereby
preparing a waveform representation for each transcoded signal frame; said artificial neural network is provided such that the artificial neural network is trained a mapping from the waveform representation of a transcoded signal frame to the waveform representation of a source signal frame;
the enhanced transcoded signal is generated based on the waveform representation obtained from the artificial neural network.
6. Method according to one of claim 1 to 3, wherein said processing domain is the frequency domain, whereby
the transcoded signal frames are processed into the frequency domain by transforming each transcoded signal frame in a magnitude-phase representation or in a real and imaginary part representation; said artificial neural network is provided such that the artificial neural network is trained a mapping from the magnitude-phase representation or from the real and imaginary part representation of a transcoded signal frame to the magnitude-phase representation or to the real and imaginary part representation of a source signal frame; and
the enhanced transcoded signal is generated based on the magnitude-phase representation or the real and imaginary part representation obtained from the artificial neural network.
7. Method according to claim 6, wherein said processing domain is the log-magnitude domain, whereby
the transcoded signal frames are processed into the frequency domain by transforming each transcoded signal frame in a magnitude-phase representation; said artificial neural network is provided such that the artificial neural network is trained a mapping from the log-magnitude representation of a transcoded signal frame to the log-magnitude representation of a source signal frame;
- whereby the log-magnitude representation of a source signal frame is subject to an inverse logarithm function and appended with the phase as obtained above to obtain a magnitude-phase representation of a source signal frame; and
the enhanced transcoded signal is generated based on the above obtained magnitude-phase representation obtained from the artificial neural network.
8. Method according to one of claim 1 to 3, wherein said processing domain is the cepstral domain, whereby
- the transcoded signal frames are processed into the cepstral domain by transforming each transcoded signal frame in a cepstral coefficients representation;
said artificial neural network is provided such that the artificial neural network is trained a mapping from the cepstral coefficients representation of a transcoded signal frame to the cepstral coefficients representation of a source signal frame; and
the enhanced transcoded signal is generated based on the cepstral coefficients representation obtained from the artificial neural network.
9. Method according to one of the foregoing claims, wherein said artificial neural network is a convolutional neural network.
10. Method according to claim 9, wherein said convolutional neural network has a plurality of hidden layers, whereby the hidden layers comprising at least one convolutional layer, at least one max pooling layer and at least one upsampling layer.
11. Method according to claim 10, wherein an input layer of the convolutional neural network is connected with a first convolutional layer, said first convolutional layer is connected with a max pooling layer, said max pooling layer is connected with a second convolutional layer, said second convolutional layer is connected with an upsampling layer and said upsampling layer is connected with an output layer.
12. Method according to one of the foregoing claims, wherein for each second representation an enhanced transcoded signal frame is generated based on the respective second representation obtained from the artificial neural network, whereby based on the enhanced transcoded signal frames said enhanced transcoded digital signal is generated.
13. Method according to one of the foregoing claims, wherein the transcoded signal frames comprising a frame length between 1 ms and 100 ms, advantageously for audio signals between 5 ms and 35 ms.
14. Hardware device for post-processing of a transcoded digital signal containing audio and/or video data to get an enhanced transcoded digital signal, said transcoded signal was obtained prior by decoding of an encoded signal using a decoder and said encoded signal was obtained prior by encoding of a source signal using an encoder, whereby the hardware device is arranged to execute the method according to one of the claims 1 to 13.
15. Computer program arranged to execute the post-processing method according to one of the claims 1 to 13, if the computer program is running on a computer device.
16. Method for training an artificial neural network, whereby the training method comprising the following steps:
providing a plurality of source signal frames and corresponding transcoded signal frames, whereby said source signal frames were generated by separating at least one source signal and said transcoded signal frames were generated by separating at least one transcoded digital signal, each transcoded digital signal was obtained by decoding of an encoded signal using a decoder and said encoded signal was obtained by encoding of the corresponding source signal using an encoder; preparing a first representation within a processing domain for each transcoded signal frame and a second representation within said processing domain for each source signal frame;
training said artificial neural network by inputting the first and corresponding second representations such that a mapping from a first representation of a transcoded signal frame to a second representation of a source signal frame is trained.
17. Method according to claim 16, wherein the step of providing a plurality of source signal frames and corresponding transcoded signal frames comprises:
selecting the source signal frames and the corresponding transcoded signal frames for training said artificial neural network by comparing the power ratio of each source signal frame and the whole source signal to a threshold.