EP3777194A1 - Method, hardware device and software program for post-processing of transcoded digital signal - Google Patents

Method, hardware device and software program for post-processing of transcoded digital signal

Info

Publication number
EP3777194A1
EP3777194A1 (application EP18716589.9A)
Authority
EP
European Patent Office
Prior art keywords
transcoded
signal
representation
neural network
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP18716589.9A
Other languages
German (de)
French (fr)
Inventor
Ziyue ZHAO
Tim Fingscheidt
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Technische Universitaet Braunschweig
Original Assignee
Technische Universitaet Braunschweig
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Technische Universitaet Braunschweig filed Critical Technische Universitaet Braunschweig
Publication of EP3777194A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • H04N19/86 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression involving reduction of coding artifacts, e.g. of blockiness
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/117 Filters, e.g. for pre-processing or post-processing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/154 Measured or subjectively estimated visual quality after decoding, e.g. measurement of distortion
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/172 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field

Definitions

  • The invention relates to a method for post-processing of a transcoded digital signal including audio and/or video data to obtain an enhanced transcoded digital signal, whereby said transcoded signal was obtained by decoding of an encoded signal using a decoder and said encoded signal was obtained by encoding of a source signal including audio and/or video data using an encoder.
  • The invention also relates to a hardware device and a software program for executing said post-processing method.
  • The invention also relates to a method for training an artificial neural network.
  • Digital signals including audio and/or video data are often stored on a hardware device and accessed (read out) at a later point in time.
  • In other cases, digital signals including audio and/or video data are often transmitted from a first hardware device to a second hardware device.
  • Note that, without loss of generality, also the process of “storage” can be considered to be a “transmission”, which will be our terminology in the following.
  • During these transmissions, the digital signals must be transformed into a bit stream which is suitable for transmission of the data representing the digital signal over a transmission channel.
  • The transcoding process includes two steps. At first, the source signal including the audio and/or video data must be encoded into an encoded digital signal using an encoder.
  • This encoded digital signal is transmitted over the communication channel to a receiver, whereby the receiver must decode the encoded digital signal into a decoded digital signal.
  • The joint processing of encoder and decoder is sometimes abbreviated as codec.
  • The decoded digital signal is sometimes post-processed to enhance the quality of the data.
  • Such decoded digital signals are often called “transcoded” signals or simply “coded” signals.
  • Transcoded digital signals often suffer from far-end background noise, quantization noise, and potentially transmission errors.
  • To enhance the quality of these transcoded signals, post-processing methods operating just after decoding can be advantageously employed. Due to the transmission bandwidth (or storage) limitation, transcoding typically performs so-called lossy compression to achieve a relatively low bit rate during transmission, while still preserving a reasonable audio and/or video quality at the same time. As a result, however, the reconstructed audio and/or video signal is degraded in quality due to quantization errors during the lossy compression process.
  • A Wiener filter is derived by the estimation of the a priori signal-to-noise ratio (SNR) based on a two-step noise reduction approach (C. Plapous et al., “A two-step noise reduction technique,” in Proc. of ICASSP, Montreal, QC, Canada, May 2004, pp. I-289–292).
  • After the filtering process, a limitation of distortions is performed to control the waveform difference between the original signal and the post-processed coded signal.
  • Note that the Wiener filter only minimizes the mean squared error (MSE), but not perceptual distortion.
  • A method for post-processing of at least one transcoded digital signal including audio and/or video data to obtain at least one enhanced transcoded digital signal is proposed.
  • Audio data are typically data which include audible information like music, speech, sounds, or other noises. This audible information is coded into the digital signal as audio data.
  • Video data are data which include “moving pictures”. Video data can include audio data.
  • The transcoded digital signal which shall be processed by a post-processor was obtained by decoding of an encoded digital signal using a decoder.
  • In most cases, the decoded digital signal obtained by decoding of the encoded signal is the transcoded digital signal.
  • It is possible that a post-processing method well known from the state of the art was applied to the decoded digital signal in a previous step to enhance the quality of the transcoded digital signal.
  • Said encoded digital signal, furthermore, was obtained by encoding of a source signal using an encoder, whereby the source signal, advantageously, includes the raw data of the audio and/or video data.
  • According to the invention, the post-processing method uses a post-processor, whereby the post-processor can be a computer or any other electronic data processing unit.
  • The basic idea of the present invention is to use an artificial neural network to enhance the transcoded digital signal without modifying the decoder on the receiver side or the encoder on the transmitter side.
  • The artificial neural network has been trained to learn a mapping from parts of the transcoded signal to parts of the source signal, so that, based on the transcoded signal and using the trained artificial neural network, the source signal can be reconstructed or at least approximated with high quality.
  • In a first step, a plurality of transcoded signal frames is provided, whereby said transcoded signal frames were generated by separating one of said transcoded digital signals.
  • In an embodiment, the first step of providing said plurality of transcoded signal frames comprises the step of separating one of said transcoded digital signals into said plurality of transcoded signal frames.
  • The first step of providing said plurality of transcoded signal frames can furthermore comprise the step of building the plurality of transcoded signal frames from a plurality of transcoded digital signal segments provided by the decoder, whereby each transcoded digital signal segment can be regarded as a transcoded digital signal derived from a superior transcoded digital signal.
  • A transcoded signal frame in the meaning of the present invention is a part of a transcoded digital signal.
  • Furthermore, a transcoded digital signal can be a segment of a superior transcoded digital signal, which was segmented into a plurality of transcoded digital signals, often called transcoded digital signal segments.
  • The transcoded signal frames can be overlapping in time or non-overlapping. If a window function is used, the length of the transcoded signal frame is equal to the length of the window (see the framing sketch below).
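  • A minimal framing sketch in Python (function and parameter names are illustrative assumptions, not taken from the patent text), covering both the non-overlapping case and the overlapping, windowed case:

```python
import numpy as np

def make_frames(s_hat, frame_len, frame_shift, window=None):
    """Split the transcoded signal s_hat into frames of length frame_len.

    frame_shift == frame_len yields non-overlapping frames; a smaller
    shift yields overlapping frames. If a window function is given, its
    length must equal the frame length, as stated above.
    """
    assert len(s_hat) >= frame_len
    n_frames = (len(s_hat) - frame_len) // frame_shift + 1
    frames = np.stack([s_hat[l * frame_shift : l * frame_shift + frame_len]
                       for l in range(n_frames)])
    if window is not None:
        assert len(window) == frame_len
        frames = frames * window          # apply the window to every frame
    return frames

# Example: 20 ms frames with 50 % overlap at 16 kHz, Hann window
frames = make_frames(np.random.randn(16000), 320, 160, np.hanning(320))
```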
  • In the next step, called data preparation, a first representation within a processing domain is prepared for each transcoded signal frame.
  • The processing domain is a mathematical and/or physical description or specification to represent the transcoded signal frames in a mathematical and/or physical manner.
  • In the simplest form, the representation of the transcoded signal frames within a processing domain is a description of the waveform of the transcoded signal frame (so-called time domain).
  • Other processing domains are, for example, the frequency domain or the cepstral domain.
  • The first representations are designated for feeding an artificial neural network as described below.
  • In the broadest meaning of the present invention, the transcoded signal frames are provided such that at least one (or each) transcoded signal frame is provided in the first representation within a processing domain.
  • In this case, the transcoded signal frames are provided within said processing domain.
  • The data preparation step can furthermore include the step of processing each transcoded signal frame into said first representation within said processing domain.
  • Now, each first representation of the transcoded signal frames is inputted into an artificial neural network to obtain for each first representation a second representation of the respective transcoded signal frame.
  • The artificial neural network is provided such that it is trained to learn a mapping from a representation of a transcoded signal frame within said predefined processing domain to a representation of the source signal frame within said processing domain.
  • Based on the second representations obtained from the artificial neural network, an enhanced transcoded digital signal is generated by converting the second representations into the form of a digital signal including the audio and/or video data. After the generation of the enhanced transcoded digital signal, the enhanced transcoded digital signal is outputted.
  • With the proposed post-processing method of the present invention, it is possible to enhance audio and/or video data in a transcoded digital signal without modifying the encoder or decoder side.
  • Using an artificial neural network, the post-processing method can be executed in real time, for example in digital speech communication using digital speech codecs.
  • By means of the present invention, the problems of the prior-art post-processing filters can be overcome and the quality gap between the source signal data and the transcoded signal data due to the lossy compression can be reduced without increasing the transmission bitrate.
  • The loss of information caused by the lossy compression can be reduced or minimized by using the post-processing method of the present invention without modifying the encoder or decoder and without modifying the lossy compression method itself.
  • Given that in many communication systems the encoder and/or the decoder are standardized in a very specific fashion, this allows the use of the present invention in a standard-compatible manner.
  • Furthermore, the loss of information raised by the lossy compression can be reduced and/or partly healed with the artificial neural network of the present invention.
  • In embodiments, said processing domain is the time domain, the frequency domain, the cepstral domain, or the log-magnitude domain.
  • In a first preferred embodiment, the processing domain is the time domain, whereby a waveform representation for each transcoded signal frame is prepared.
  • In the broadest meaning of the present invention, each provided transcoded signal frame has a waveform representation within the time domain, so that no further processing steps for converting the transcoded frames into the waveform representation are necessary.
  • The separated frames then serve directly as input of the artificial neural network, whereby the input vector is a representation of the waveform of the transmitted digital signal frame.
  • Furthermore, it is also possible that the transcoded signal frames are processed into the waveform representation.
  • The artificial neural network is provided such that it is trained to learn a mapping from the waveform representation of the transcoded signal frame to the waveform representation of the source signal frame.
  • The enhanced transcoded signal is then generated based on the waveform representation obtained from the artificial neural network.
  • For this purpose, an overlap-add (OLA) technique can, but need not, be used.
  • In a preferred embodiment, the output of the artificial neural network has a frame structure, so that the enhanced digital signal can be generated directly from the output of the artificial neural network.
  • In this case, the second representation obtained from the artificial neural network has a frame structure.
  • Furthermore, it is also possible that the frames are reconstructed based on the waveform representation obtained from the artificial neural network and the enhanced transcoded signal is generated based on the reconstructed frames.
  • The time domain approach fits well into many contexts and is also very suitable for integration into the decoder processing, because if the time domain post-processor is embedded into the segmentation structure of the decoder, no additional algorithmic delay is incurred beyond the already provided segmentation.
  • The decoder segmentation can be used for providing the plurality of frames without any further segmentation.
  • In a further preferred embodiment, said processing domain is the frequency domain, whereby the transcoded signal frames are processed in the frequency domain by transforming each transcoded signal frame into a magnitude-phase representation or into a real and imaginary part representation by using, for example, the Fast Fourier Transformation (FFT).
  • This representation in the frequency domain (for example a spectrum vector or a part of it) is then inputted into the artificial neural network, whereby the artificial neural network is provided such that it is trained to learn a mapping from the magnitude-phase representation or from the real and imaginary part representation of a transcoded signal frame to the magnitude-phase representation or to the real and imaginary part representation of the source signal frame.
  • The enhanced transcoded signal is generated based on the magnitude-phase representation or the real and imaginary part representation obtained from the artificial neural network.
  • An overlap-add (OLA) technique or an overlap-save (OLS) technique can be used along with the inverse transformation. It is advantageous if the frames are reconstructed based on the magnitude-phase representation or the real and imaginary part representation obtained from the artificial neural network and the enhanced transcoded signal is generated based on the reconstructed frames.
  • In the frequency domain, advantageously, the magnitude spectrum is subject to a logarithm function, resulting in the so-called log-magnitude domain being used as representation domain at the input and/or output of the artificial neural network.
  • The log-magnitude representation of a source signal frame can be subject to an inverse logarithm function and appended with the phase as obtained above to obtain a magnitude-phase representation of a source signal frame.
  • In a further embodiment, said processing domain is the cepstral domain, whereby the transcoded signal frames are processed into the cepstral domain by transforming each transcoded signal frame into a cepstral coefficient representation.
  • This cepstral coefficient representation of each transcoded signal frame is, e.g., separated into two parts: the cepstral coefficient representation responsible for the spectral envelope and the residual cepstral coefficient representation.
  • The spectral envelope cepstral coefficient representation is inputted into the artificial neural network to obtain an enhanced spectral envelope cepstral coefficient representation, whereby the enhanced transcoded signal is generated based on the spectral envelope cepstral coefficient representation obtained from the artificial neural network and the residual cepstral coefficient representation. It is advantageous if the frames are reconstructed based on the spectral envelope cepstral coefficient representation obtained from the artificial neural network and the enhanced transcoded signal is generated based on the reconstructed frames.
  • The artificial neural network is provided such that it is trained to learn a mapping from the spectral envelope cepstral coefficient representation of a transcoded signal frame to the spectral envelope cepstral coefficient representation of a source signal frame.
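  • A minimal sketch of this cepstral-domain data preparation (choosing the lowest n_env coefficients as the envelope set Q_env is an assumption of this sketch; the patent only states that the coefficients are split into an envelope part and a residual part):

```python
import numpy as np
from scipy.fft import fft, dct

def cepstral_split(frame, K, n_env):
    """FFT of size K, log-magnitude, DCT-II, then split into envelope
    and residual cepstral coefficients; the phase is kept for the later
    frame reconstruction."""
    S = fft(frame, n=K)                       # complex spectrum of the frame
    phase = np.angle(S)                       # stored for reconstruction
    log_mag = np.log(np.abs(S) + 1e-12)       # log-magnitude spectrum
    c = dct(log_mag, type=2, norm='ortho')    # cepstral coefficients (DCT-II)
    return c[:n_env], c[n_env:], phase        # c_env, c_res, phase
```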
  • In a further advantageous embodiment, said artificial neural network is a convolutional neural network.
  • Advantageously, said convolutional neural network has a plurality of hidden layers, whereby the hidden layers comprise at least one convolutional layer, at least one max pooling layer, and at least one upsampling layer.
  • The convolutional layers are defined by a number F of feature maps (filter kernels) and the kernel size (a x b).
  • The number of trainable weights, including the bias, of a convolutional layer is F x (a x b) + F. It is worth noting that in each convolutional layer the stride is one and zero padding of the layer input is always performed, to guarantee that the first dimension of the layer output is the same as that of the layer input.
  • The upsampling layer simply copies each element of the layer input into a 2 x 1 vector and stacks these vectors following the original order, which actually doubles the first dimension of the layer input.
  • In a further advantageous embodiment, an input layer of the convolutional neural network is connected with the first convolutional layer, said first convolutional layer is connected with a max pooling layer, said max pooling layer is connected with the second convolutional layer, said second convolutional layer is connected with the upsampling layer, and said upsampling layer is connected with an output layer.
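  • A minimal Keras sketch of such a network (the number of feature maps F, the kernel size, the leaky ReLU slope, and the linear 1 x 1 output convolution are assumptions of this sketch, not values fixed by the text):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_postprocessor_cnn(frame_dim, F=32, kernel=3):
    """input -> conv -> max pooling -> conv -> upsampling -> output.

    Stride 1 and 'same' (zero) padding keep the first dimension of each
    convolutional layer's output equal to its input; max pooling halves
    it and upsampling doubles it again, as described above.
    """
    inp = layers.Input(shape=(frame_dim, 1))
    x = layers.Conv1D(F, kernel, strides=1, padding='same')(inp)
    x = layers.LeakyReLU()(x)                     # leaky ReLU activation
    x = layers.MaxPooling1D(pool_size=2)(x)       # halves the first dimension
    x = layers.Conv1D(F, kernel, strides=1, padding='same')(x)
    x = layers.LeakyReLU()(x)
    x = layers.UpSampling1D(size=2)(x)            # doubles it again
    out = layers.Conv1D(1, 1, padding='same')(x)  # linear output layer
    return tf.keras.Model(inp, out)

model = build_postprocessor_cnn(frame_dim=160)    # e.g., a 160-sample frame
```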
  • In a further embodiment, for each second representation an enhanced transcoded signal frame is generated based on the respective second representation obtained from the artificial neural network. Based on the enhanced transcoded signal frames, the enhanced digital signal is generated, e.g., by OLA or OLS.
  • In a further embodiment, the transcoded signal frames and/or the enhanced transcoded signal frames have a frame length between 1 ms and 100 ms.
  • Advantageously, for audio signals the frame length is between 5 ms and 35 ms, and for video signals between 1 ms and 100 ms.
  • In claim 14, a hardware device for post-processing of a transcoded digital signal including audio and/or video data to obtain an enhanced transcoded digital signal is proposed.
  • The hardware device is arranged to execute the method as described above.
  • Furthermore, a computer program according to claim 15 is arranged to execute the post-processing method as described above when the computer program is running on a computer device.
  • According to claim 16, a method for training an artificial neural network is proposed.
  • At first, a plurality of source signal frames and corresponding transcoded signal frames is provided.
  • Said source signal frames were generated by separating at least one source signal, and said transcoded signal frames were generated by separating at least one transcoded digital signal.
  • The separating step can be performed prior to the providing step. In other words, a plurality of sets of signal frames is provided, whereby each set of signal frames includes at least one source signal frame and at least one corresponding transcoded signal frame, which was obtained by encoding and decoding of the source signal.
  • Each transcoded digital signal was obtained by decoding of an encoded signal using a decoder and said encoded signal was obtained by encoding of the corresponding source signal using an encoder.
  • The decoding and encoding step can be performed prior to the separating step.
  • In a second step, a first representation within a processing domain for each transcoded signal frame and a second representation within said processing domain for each source signal frame are prepared. This can include that each transcoded signal frame is processed into the first representation within said processing domain and each source signal frame is processed into the second representation within said processing domain.
  • The source and the corresponding transcoded signal frames can be produced on the basis of the source and the corresponding transcoded signal segments.
  • The length and structure of the source and the corresponding transcoded signal frames are the same in training and also in further use of the artificial neural network. Then, a plurality of source signal frames and the corresponding transcoded signal frames is selected by comparing the power ratio of each source signal frame and the whole source signal to a threshold.
  • Then, each transcoded signal frame is processed into a first representation within a processing domain and each source signal frame is processed into a second representation within said processing domain.
  • The artificial neural network is trained by inputting the first and corresponding second representations such that a mapping from a first representation of a transcoded signal frame to a second representation of a source signal frame is learned.
  • The step of providing a plurality of source signal frames and corresponding transcoded signal frames comprises the step of selecting the source signal frames and the corresponding transcoded signal frames for training said artificial neural network by comparing the power ratio of each source signal frame and the whole source signal to a threshold.
  • The plurality of source signal frames is provided by using at least one source signal.
  • The at least one source signal is then separated into a plurality of source signal frames, e.g., by using a separating function.
  • The source signal frames can be overlapping in time or non-overlapping.
  • Then, at least one transcoded signal is generated by using an encoder and decoder.
  • The at least one transcoded signal transcoded from the at least one source signal is thus provided. Then, the at least one transcoded signal is separated into a plurality of transcoded signal frames.
  • Figure 1: General flowchart of post-processing for enhancement of transcoded signals.
  • Figure 4: Example of the structure of a convolutional neural network.
  • Figure 1 shows a general flowchart of post-processing for enhancement of transcoded signals.
  • A source signal s(n) is inputted to an encoder to obtain an encoded signal.
  • The encoded signal can be transmitted to the receiver side and then to a decoder for decoding the encoded signal.
  • The decoded signal ŝ(n), called in the present invention the transcoded signal ŝ(n), is then transferred to a post-processor for post-processing the transcoded signal ŝ(n).
  • The result of the post-processing is an enhanced transcoded signal s̃(n).
  • Figure 2 shows a high-level structure of the post-processor shown in figure 1.
  • First, the transcoded signal ŝ(n) is separated into a plurality of segments with signal vectors r(λ), with λ being the discrete segment index.
  • The signal vectors r(λ) typically represent 5 ms to 35 ms of audio, or 1 ms to 100 ms of video.
  • The length of the segment may depend on the decoder.
  • The segments r(λ) are delivered to the framing process, where each frame x(ℓ) is produced on the basis of one or a plurality of the segments r(λ).
  • After the framing, i.e., the production of the frames, each frame is transformed into the processing domain, for example the time domain, frequency domain, or cepstral domain.
  • The input vector of the neural network (the normalized frame x̄(ℓ) in the time domain, or the normalized envelope cepstra c̄_env(ℓ) in the cepstral domain) is obtained from the data preparation process with normalization, and may depend on one or a plurality of segments r(λ) from the past (λ-1, λ-2, ...), present (λ), or even future (λ+1, λ+2, ...).
  • The input vectors are processed by the neural network with the same structure as in the training stage.
  • Based on the output vectors of the neural network (the enhanced frame in the time domain, or the enhanced envelope cepstra in the cepstral domain), the signal is formed.
  • The output of this signal forming process is the enhanced transcoded signal s̃(n).
  • For the time domain solution, each transcoded segment r(λ) is provided after the segmentation process of the transcoded signal ŝ(n). Then, each frame x(ℓ) is produced after the framing process and is normalized with the mean and standard deviation values from the training stage.
  • The framing and data preparation step is shown in figure 3a, and the signal forming process including frame reconstruction in the cepstral domain is shown in figure 3b.
  • For the cepstral domain solution, each transcoded segment r(λ) is provided after the segmentation process of the transcoded signal ŝ(n). Then, each frame is produced after the framing process, and for each frame a Fast Fourier Transformation (FFT) of size K is performed, yielding the complex spectral coefficients with k being the frequency bin index. Then the Discrete Cosine Transform of type II (DCT-II) is performed on the log-magnitude values to obtain the cepstral coefficients.
  • (Equation 2) to obtain the vector c_env(ℓ) with elements c_ℓ(q), q ∈ Q_env, for the cepstral domain solution, with Q_env being the set of cepstral coefficient indices representing the spectral envelope.
  • Two vectors are stored for the following frame reconstruction step: first, the argument (phase) vector α(ℓ) of the ℓth frame complex FFT coefficients, and second, the residual cepstral coefficients vector c_res(ℓ) with elements c_ℓ(q), q ∈ Q_res, of the ℓth frame cepstral coefficients, with Q_res being the set of residual cepstral coefficient indices.
  • $\bar{c}_\ell(q) = \frac{c_\ell(q) - \mu_c(q)}{\sigma_c(q)}$ (Equation 3), where $\mu_c(q)$ and $\sigma_c(q)$ are the mean value and the standard deviation value from the training stage.
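  • As a one-line sketch of this normalization (mu_c and sigma_c are the training-stage statistics, assumed precomputed as numpy arrays):

```python
def normalize(c_env, mu_c, sigma_c):
    # Zero-mean, unit-variance normalization per coefficient (Equation 3)
    return (c_env - mu_c) / sigma_c
```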
  • The input vectors in the cepstral domain are processed by the neural network with the same structure as in the training stage. Based on the output vector of the neural network, the enhanced envelope cepstra ĉ_env(ℓ), the enhanced transcoded signal can be formed.
  • For this, a frame reconstruction process is performed first, as shown in figure 3b.
  • The output of the neural network and the residual cepstral coefficients c_res(ℓ), stored in the data preparation procedure, are concatenated to form the complete cepstral coefficients ĉ(ℓ).
  • Then, the inverse DCT-II (IDCT-II) is performed to go back to the logarithm domain of the amplitude spectrum.
  • Finally, the reconstructed frame in the time domain is obtained by taking the real part of the inverse FFT of the FFT coefficients vector.
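  • A minimal sketch of this frame reconstruction, inverting the data preparation sketched above (argument names are illustrative assumptions):

```python
import numpy as np
from scipy.fft import idct, ifft

def reconstruct_frame(c_env_hat, c_res, phase, K):
    """Concatenate enhanced envelope and stored residual cepstra,
    IDCT-II back to the log-magnitude spectrum, undo the logarithm,
    re-attach the stored phase, and take the real part of the IFFT."""
    c_hat = np.concatenate([c_env_hat, c_res])    # complete cepstrum
    log_mag = idct(c_hat, type=2, norm='ortho')   # log-magnitude spectrum
    S_hat = np.exp(log_mag) * np.exp(1j * phase)  # magnitude-phase spectrum
    return np.real(ifft(S_hat, n=K))              # reconstructed time frame
```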
  • The enhanced transcoded digital signal is then generated, respectively formed, from the output vectors (time domain) or the reconstructed frames (cepstral domain).
  • In the following, three different example fashions of signal forming methods along with corresponding framing methods will be introduced to finally obtain the enhanced transcoded signal s̃(n).
  • These signal forming methods can be used either for time domain processing or cepstral domain processing, and also for frequency domain processing with or without the logarithm.
  • (Equation 6), where N_w is the frame length and the frame shift is equal to the frame length N_w.
  • Signal forming now goes as follows: the processed frames are concatenated directly along the frame index to achieve the improved signal s̃(n), which could be expressed as
  • (Equation 7), where L is the number of frames for the speech to be formed.
  • The segmentation and framing procedure could be expressed as
  • (Equation 8), where N_w is the frame length and N_s is the frame shift. Note that a plurality of zeros is padded before the beginning of ŝ(n).
  • This approach also has no additional algorithmic latency beyond segmentation, but has longer frames to be processed compared to frame-wise direct forming.
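  • A minimal overlap-add sketch of the signal forming (with frame_shift equal to the frame length this degenerates to the frame-wise direct concatenation of Equation 7; with a smaller shift, the framing stage is assumed to use a window whose shifted copies add up to one):

```python
import numpy as np

def overlap_add(frames, frame_shift):
    """Form the enhanced signal from the processed frames."""
    n_frames, frame_len = frames.shape
    out = np.zeros((n_frames - 1) * frame_shift + frame_len)
    for l in range(n_frames):                      # sum the shifted frames
        out[l * frame_shift : l * frame_shift + frame_len] += frames[l]
    return out
```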
  • A neural network has to be trained. Independent of the chosen domain, a similar neural network topology will be used in this embodiment, with only different dimensions in the input and output layers.
  • An example of the convolutional neural network used in the present invention is shown in figure 4.
  • A plurality of source signal segments and the corresponding transcoded signal segments is provided. Then, the source and transcoded signal frames are produced on the basis of the source and the corresponding transcoded signal segments, respectively.
  • A simple frame-based voice activity detection (VAD) is performed to select the active frames for the training stage by comparing the power ratio of each source signal frame and the whole source signal to a threshold.
  • The threshold Θ_VAD is, e.g., fixed in advance.
  • The set N_ℓ contains all sample indices n belonging to frame ℓ, and |N_ℓ| denotes the number of elements in this set.
  • N contains all sample indices n belonging to the complete speech signal, and |N| denotes the number of elements in this set.
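  • A minimal sketch of this frame selection (the threshold value and the use of mean powers are assumptions of this sketch; src_frames and coded_frames are assumed numpy arrays of aligned frames):

```python
import numpy as np

def select_active_frames(src_frames, coded_frames, src_signal, threshold):
    """Keep the frame pairs whose source-frame power, relative to the
    power of the whole source signal, exceeds the VAD threshold."""
    total_power = np.mean(src_signal ** 2)         # power over all samples
    keep = [l for l in range(len(src_frames))
            if np.mean(src_frames[l] ** 2) / total_power > threshold]
    return src_frames[keep], coded_frames[keep]
```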
  • For training, the prepared inputs of the neural network will first go in forward direction through the neural network, yielding the network outputs y^(N), where N is the total number of layers. After that, the outputs are compared to the targets, guided by a cost function. The trainable weights of the neural network are then iteratively adjusted to minimize the cost function based on some learning rules (i.e., backpropagation training). When some preset stopping criteria are met, the training process is finished and the weights in the neural network stay unchanged.
  • Other kinds of neural networks could also be used, e.g., feed-forward neural networks, deep neural networks (DNNs), or recurrent neural networks (RNNs) such as long short-term memory (LSTM).
  • The input layer (first layer), in the time domain, is given by (Equation 12).
  • The convolutional layer 1 (second layer):
  • $\mathbf{y}_p^{(2)} = f^{(2)}(\mathbf{w}_p * \mathbf{i} + b_p)$ (Equation 13), where * denotes the convolution operation, $\mathbf{w}_p$ denotes the weight vector of the pth kernel, and $b_p$ denotes the pth bias; $M^{(1)}$ is the dimension of the input vector $\mathbf{i}$, and F is the number of kernels used in this layer. Please note that the frame index ℓ is omitted for convenience as soon as internal processing of the neural network is presented.
  • The convolution is computed as
  • $(\mathbf{w}_p * \mathbf{i})_m = w_{p,1}\, i_m + w_{p,2}\, i_{m+1}$ (Equation 14), with $i_m$ being zero when $m > M^{(1)}$; the kernel size here is two. Note that the stride of the kernel is one and the input vector $\mathbf{i}$ is zero-padded before the convolution is computed, to make sure that the output vector dimension is the same as the input vector dimension.
  • The activation function $f^{(2)}$ used here is the leaky rectified linear unit (ReLU) function, which can be denoted as $f^{(2)}(x) = x$ if $x > 0$ and $f^{(2)}(x) = \alpha x$ otherwise, with a small positive slope $\alpha$ (Equation 15).
  • The max-pooling layer (third layer):
  • $y_m^{(3)} = \max\big(y_{2m-1}^{(2)},\, y_{2m}^{(2)}\big)$ (Equation 16), with max() being the maximum function.
  • The first dimension of the matrix is decreased by half in the max-pooling layer.
  • The convolutional layer 2 (fourth layer):
  • (Equation 17), which is similar to the expressions in the second layer (convolutional layer 1).
  • The upsampling layer (fifth layer):
  • The cost function in terms of the mean squared error (MSE) between the outputs and targets can be written as $J = \frac{1}{|\mathcal{T}|} \sum_{\ell \in \mathcal{T}} \lVert \mathbf{t}(\ell) - \mathbf{y}(\ell) \rVert^2$, with $\mathcal{T}$ being the set of training frame indices, $\mathbf{t}(\ell)$ the target, and $\mathbf{y}(\ell)$ the network output.
  • The indices of the training set $\mathcal{T}$ are divided into D batches of the same size and with no repetition, which could be denoted as $\mathcal{T} = \mathcal{T}_1 \cup \mathcal{T}_2 \cup \ldots \cup \mathcal{T}_D$.
  • Accordingly, the corresponding training pairs are also divided into D batches and could be denoted as
  • (Equation 23), with O being the set of training pairs. Furthermore, the training pairs in each batch contribute to one weight update, and one epoch is finished when all training pairs in the training data have been processed.
  • The weights are then trained using batch backpropagation (BP), in which the weight matrix W is changed iteratively to minimize the cost function with the stochastic gradient descent (SGD) algorithm.
  • After each epoch, the MSE is calculated on the validation set, which could be denoted as $V(\mathbf{W}_g) = \frac{1}{|\mathcal{V}|} \sum_{\ell \in \mathcal{V}} \lVert \mathbf{t}(\ell) - \mathbf{y}^{(g)}(\ell) \rVert^2$ (Equation 24), where $V(\mathbf{W}_g)$ is the MSE on the validation set after the gth epoch, $\mathcal{V}$ is the set of frame indices of the validation set, and $\mathbf{y}^{(g)}(\ell)$ is the output of the neural network after the gth epoch.
  • The training process will end after the gth epoch if either of the following conditions is satisfied:
  • (Equation 25), where Θ_MSE is the MSE threshold.
  • The stop of the training process means that the neural network is assumed to have already achieved a state of proper generalization.
  • The structure of the neural network and the trained weight matrix set, together with the mean vector and the standard deviation vector, are stored for the further usage of the invention.
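  • A minimal Keras training sketch matching this procedure (batch size, learning rate, epoch limit, and the early-stopping parameters are assumptions; x_train/t_train and x_val/t_val are the prepared first/second representations of the training and validation sets, and model is the network sketched above):

```python
import numpy as np
import tensorflow as tf

# Mini-batch SGD with an MSE cost; min_delta plays the role of the MSE
# threshold in the stopping criterion, patience that of the epoch rule.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss='mse')

early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                              min_delta=1e-6, patience=3,
                                              restore_best_weights=True)

model.fit(x_train, t_train,
          validation_data=(x_val, t_val),
          batch_size=64, epochs=100,
          shuffle=True,                  # batches of equal size, no repetition
          callbacks=[early_stop])

# Store what the text says is kept for later use: the trained weights and
# the normalization statistics (mu, sigma assumed precomputed).
model.save_weights('postprocessor.weights.h5')
np.savez('normalization.npz', mean=mu, std=sigma)
```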

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Method for post-processing of a transcoded digital signal including audio and/or video data to obtain an enhanced transcoded digital signal, whereby said transcoded signal was obtained by decoding of an encoded signal using a decoder and said encoded signal was obtained by encoding of a source signal using an encoder, whereby the method comprises the following steps using a post-processor: providing a plurality of transcoded signal frames, whereby said transcoded signal frames were generated by separating one of said transcoded digital signals; processing each transcoded signal frame into a first representation within a processing domain; feeding each first representation of the transcoded signal frames into an artificial neural network to obtain for each first representation a second representation of the respective transcoded signal frame, whereby said artificial neural network is provided such that it is trained to learn a mapping from a representation of a transcoded signal frame within said processing domain to a representation of a source signal frame within said processing domain; generating an enhanced transcoded digital signal based on the second representations obtained from the artificial neural network; and outputting said enhanced transcoded digital signal.

Description

Method, hardware device and software program for post-processing of transcoded digital signal
The invention relates to a method for post-processing of a transcoded digital signal including audio and/or video data to get an enhanced transcoded digital signal, whereby said transcoded signal was obtained by decoding of an encoded signal using a decoder and said encoded signal was obtained by encoding of a source signal including audio and/or video data using an encoder. The invention relates also to a hardware device and a software program for executing said post-processing method. The invention relates also to a method for training an artificial neural network.
Digital signals including audio and/or video data are often stored on a hardware device and accessed (read out) at a later point in time. In other cases, digital signals including audio and/or video data are often transmitted from a first hardware device to a second hardware device. Note that, without loss of generality, also the process of “storage” can be considered to be a “transmission”, which will be our terminology in the following. During these transmissions, the digital signals must be transformed into a bit stream which is suitable for transmission of the data representing the digital signal over a transmission channel. The transcoding process includes two steps. At first, the source signal including the audio and/or video data must be encoded into an encoded digital signal using an encoder. This encoded digital signal is transmitted over the communication channel to a receiver, whereby the receiver must decode the encoded digital signal into a decoded digital signal. The joint processing of encoder and decoder is sometimes abbreviated as codec. To enhance the quality of the decoded digital signal, the decoded digital signal is sometimes post-processed.
Such decoded digital signals are often called “transcoded” signals or simply “coded” signals. Transcoded digital signals often suffer from far-end background noise, quantization noise, and potentially transmission errors. To enhance the quality of these transcoded signals, post-processing methods, operating just after decoding, can be advantageously employed. Due to the transmission bandwidth (or storage) limitation, transcoding typically performs so-called lossy compression to achieve a relatively low bit rate during transmission, while still preserving a reasonable audio and/or video quality at the same time. As a result, however, the reconstructed audio and/or video signal is degraded in quality due to quantization errors during the lossy compression process.
This kind of degradation cannot be effectively healed because during the lossy-compression process a part of the data and/or part of the information of the original digital signal is lost. To mitigate this problem, an extra post-processing process on the receiver side is well known from the state of the art. The basic idea of the post-processing method is to enhance the quality of a transcoded digital signal to reduce the signal distortion due to quantization, coding, and/or transmission errors.
To combat quantization errors at the receiver side, a kind of postfilter based on classical Wiener theory of optimal filtering has been standardized for the logarithmic pulse code modulation (PCM) G.711 codec (ITU, Rec. G.711: Pulse code modulation (PCM) of voice frequencies, International Telecommunication Union, Telecommunication Standardization Sector (ITU-T), November 1988). This postfilter uses a priori information on the A- or μ-law properties to estimate the quantization noise power spectral density (PSD) using a transcoded digital signal including speech data as audio data, assuming the quantization noise to be spectrally white. Then, a Wiener filter is derived by the estimation of the a priori signal-to-noise ratio (SNR) based on a two-step noise reduction approach (C. Plapous et al., “A two-step noise reduction technique,” in Proc. of ICASSP, Montreal, QC, Canada, May 2004, pp. I-289–292). After the filtering process, a limitation of distortions is performed to control the waveform difference between the original signal and the post-processed coded signal. However, as the bit rates go down for most of the modern codecs, it becomes more difficult for the classical Wiener filter to effectively suppress the quantization noise while maintaining the speech or the video perceptually undistorted, so the SNR drops. Note that the Wiener filter only minimizes the mean squared error (MSE), but not perceptual distortion.
Against this background, it is an aspect of the present invention to provide a better post-processing method to enhance the quality of a decoded digital signal containing audio and/or video data without modifying the encoder and decoder, respectively, i.e., without modifying the transmitter side and the receiver side. It is also an aspect of the present invention to provide a post-processing method to enhance the quality of a transcoded digital signal after the transcoded digital signal was fully decoded using a decoder.
The problem is solved by the post-processing method according to claim 1, the hardware device for post-processing according to claim 14, and the computer program according to claim 15.
According to claim 1, a method for post-processing of at least one transcoded digital signal including audio and/or video data to obtain at least one enhanced transcoded digital signal is proposed. In the sense of the present invention, audio data are typically data which include audible information like music, speech, sounds, or other noises. This audible information is coded into the digital signal as audio data. Video data are data which include “moving pictures”. Video data can include audio data.
The transcoded digital signal, which shall be processed by a post-processor, was obtained by decoding of an encoded digital signal using a decoder. In most cases, the decoded digital signal obtained by decoding of the encoded signal is the transcoded digital signal. It is possible that a post-processing method well known from the state of the art is applied to the decoded digital signal to enhance the quality of the transcoded digital signal in a previous step. Said encoded digital signal, furthermore, was obtained by encoding of a source signal using an encoder, whereby the source signal, advantageously, includes the raw data of the audio and/or video data.
According to the invention, the post-processing method uses a post-processor, whereby the post-processor can be a computer or any other electronic data processing unit. The basic idea of the present invention is to use an artificial neural network to enhance the transcoded digital signal without modifying the decoder on the receiver side or the encoder on the transmitter side. The artificial neural network has been trained to learn a mapping from parts of the transcoded signal to parts of the source signal, so that, based on the transcoded signal and using the trained artificial neural network, the source signal can be reconstructed or at least approximated with high quality. In a first step, a plurality of transcoded signal frames is provided, whereby said transcoded signal frames were generated by separating one of said transcoded digital signals. In an embodiment, the first step of providing said plurality of transcoded signal frames comprises the step of separating one of said transcoded digital signals into said plurality of transcoded signal frames. The first step of providing said plurality of transcoded signal frames can furthermore comprise the step of building the plurality of transcoded signal frames from a plurality of transcoded digital signal segments provided by the decoder, whereby each transcoded digital signal segment can be regarded as a transcoded digital signal derived from a superior transcoded digital signal.
A transcoded signal frame in the meaning of the present invention is a part of a transcoded digital signal. Furthermore, a transcoded digital signal can be a segment of a superior transcoded digital signal, which was segmented into a plurality of transcoded digital signals, often called transcoded digital signal segments.
In some cases, it is advantageous to merge the decoder and the post-processor into one single processing unit. This may particularly be advisable in order to save algorithmic delay of the decoder in conjunction with the post-processor, in case that both of these functions share the same structure.
The transcoded signal frames can be overlapping in time or non-overlapping. If a window function is used, the length of the transcoded signal frame is equal to the length of the window. In the next step, called data preparation, a first representation within a processing domain is prepared for each transcoded signal frame. The processing domain is a mathematical and/or physical description or specification to represent the transcoded signal frames in a mathematical and/or physical manner. In the simplest form, the representation of the transcoded signal frames within a processing domain is a description of the waveform of the transcoded signal frame (so-called time domain). Other processing domains are, for example, the frequency domain or the cepstral domain. The first representations are designated for feeding an artificial neural network as described below. In the broadest meaning of the present invention, the transcoded signal frames are provided such that at least one (or each) transcoded signal frame is provided in the first representation within a processing domain. In this case, the transcoded signal frames are provided within said processing domain. The data preparation step can furthermore include the step of processing each transcoded signal frame into said first representation within said processing domain.
Now, each first representation of the transcoded signal frames is inputted into an artificial neural network to obtain for each first representation a second representation of the respective transcoded signal frame. The artificial neural network is provided such that it is trained to learn a mapping from a representation of a transcoded signal frame within said predefined processing domain to a representation of the source signal frame within said processing domain. After this step, for each transcoded signal frame there advantageously exists an enhanced second representation of the respective transcoded signal frame obtained from the artificial neural network.
Based on the second representations obtained from the artificial neural network, an enhanced transcoded digital signal is generated by converting the second representations into the form of a digital signal including the audio and/or video data. After the generation of the enhanced transcoded digital signal, the enhanced transcoded digital signal is outputted. With the proposed method for post-processing in the present invention, it is possible to enhance audio and/or video data in a transcoded digital signal without modifying the encoder and decoder side. Using an artificial neural network, the post-processing method can be executed in real time, for example in a digital speech communication using digital speech codecs. By means of the present invention, the problems of the prior-art post-processing filters can be overcome and the quality gap between the source signal data and the transcoded signal data due to the lossy compression can be reduced without increase of the transmission bitrate. The loss of information by using the lossy compression can be reduced or minimized by using the post-processing method of the present invention without modifying the encoder or decoder and without modifying the lossy compression method itself. Given the fact that in many communication systems either the encoder and/or the decoder are standardized in a very specific fashion, this allows the use of the present invention in a standard-compatible manner. Furthermore, the loss of information raised by the lossy compression can be reduced and/or partly healed with the artificial neural network of the present invention. In an embodiment, said processing domain is the time domain, the frequency domain, the cepstral domain, or the log-magnitude domain.
In a first preferred embodiment, the processing domain is the time domain, whereby a waveform representation for each transcoded signal frame is prepared. In the broadest meaning of the present invention, each provided transcoded signal frame has a waveform representation within the time domain, so that no further processing steps for converting the transcoded frames into the waveform representation are necessary. For the time domain approach, a quite straightforward framework structure which fits most speech decoders can be used. The separated frames then serve directly as input of the artificial neural network, whereby the input vector is a representation of the waveform of the transmitted digital signal frame. Furthermore, it is also possible that the transcoded signal frames are processed into the waveform representation. The artificial neural network is provided such that it is trained to learn a mapping from the waveform representation of the transcoded signal frame to the waveform representation of the source signal frame. The enhanced transcoded signal is then generated based on the waveform representation obtained from the artificial neural network. For this purpose, an overlap-add (OLA) technique can, but need not, be used. In a preferred embodiment, the output of the artificial neural network has a frame structure, so that the enhanced digital signal can be generated directly from the output of the artificial neural network. In this case, the second representation obtained from the artificial neural network has a frame structure.
Furthermore, it is also possible that the frames are reconstructed based on the waveform representation obtained from the artificial neural network and the enhanced transcoded signal is generated based on the reconstructed frames.
The time domain approach fits well into many contexts and is also very suitable for integration into the decoder processing, because if the time domain post-processor is embedded into the segmentation structure of the decoder, no additional algorithmic delay is incurred beyond the already provided segmentation. The decoder segmentation can be used for providing the plurality of frames without any further segmentation.
In a further preferred embodiment, said processing domain is the frequency domain, whereby the transcoded signal frames are processed in the frequency domain by transforming each transcoded signal frame into a magnitude-phase representation or into a real and imaginary part representation by using, for example, the Fast Fourier Transformation (FFT). This representation in the frequency domain (for example a spectrum vector or a part of it) is then inputted into the artificial neural network, whereby the artificial neural network is provided such that it is trained to learn a mapping from the magnitude-phase representation or from the real and imaginary part representation of a transcoded signal frame to the magnitude-phase representation or to the real and imaginary part representation of the source signal frame. The enhanced transcoded signal is generated based on the magnitude-phase representation or the real and imaginary part representation obtained from the artificial neural network. An overlap-add (OLA) technique or an overlap-save (OLS) technique can be used along with the inverse transformation. It is advantageous if the frames are reconstructed based on the magnitude-phase representation or the real and imaginary part representation obtained from the artificial neural network and the enhanced transcoded signal is generated based on the reconstructed frames. In the frequency domain, advantageously, the magnitude spectrum is subject to a logarithm function, resulting in the so-called log-magnitude domain being used as representation domain at the input and/or output of the artificial neural network. The log-magnitude representation of a source signal frame can be subject to an inverse logarithm function and appended with the phase as obtained above to obtain a magnitude-phase representation of a source signal frame.
In a further embodiment, said processing domain is the cepstral domain, whereby the transcoded signal frames are processed into the cepstral domain by transforming each transcoded signal frame into a cepstral coefficient representation. This cepstral coefficient representation of each transcoded signal frame is, e.g., separated into two parts: the cepstral coefficient representation responsible for the spectral envelope and the residual cepstral coefficient representation. The spectral envelope cepstral coefficient representation is inputted into the artificial neural network to obtain an enhanced spectral envelope cepstral coefficient representation, whereby the enhanced transcoded signal is generated based on the spectral envelope cepstral coefficient representation obtained from the artificial neural network and the residual cepstral coefficient representation. It is advantageous if the frames are reconstructed based on the spectral envelope cepstral coefficient representation obtained from the artificial neural network and the enhanced transcoded signal is generated based on the reconstructed frames.
The artificial neural network is provided such that it is trained to learn a mapping from the spectral envelope cepstral coefficient representation of a transcoded signal frame to the spectral envelope cepstral coefficient representation of a source signal frame.
In a further advantageous embodiment, said artificial neural network is a convolutional neural network. Advantageously, said convolutional neural network has a plurality of hidden layers, whereby the hidden layers comprise at least one convolutional layer, at least one max pooling layer and at least one upsampling layer. The convolutional layers are defined by a number F of feature maps (filter kernels) and the kernel size (a x b). The number of trainable weights, including the bias, of a convolutional layer is denoted as F x (a x b) + F. It is worth noting that in each convolutional layer, the stride is one and zero padding of the layer input is always performed to guarantee that the first dimension of the layer output is the same as that of the layer input. In the max pooling layers, a 2 x 1 maximum filter is applied in a non-overlapping fashion, resulting in a 50 % reduction of the layer input along the first dimension. In contrast, the upsampling layer simply copies each element of the layer input into a 2 x 1 vector and stacks these vectors in the original order, which doubles the first dimension of the layer input.
In a further advantageous embodiment, an input layer of the convolutional neural network is connected with the first convolutional layer, said first convolutional layer is connected with a max pooling layer, said max pooling layer is connected with the second convolutional layer, said second convolutional layer is connected with the upsampling layer and said upsampling layer is connected with an output layer.
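To make this layer chain concrete, the following is a minimal PyTorch sketch of the topology just described (input - convolution - max pooling - convolution - upsampling - output). The filter counts F1 and F2, the leaky-ReLU slope of 0.01, and the use of a single-kernel convolutional output layer are illustrative assumptions not fixed by the text; the stride-1, kernel-size-2, zero-padded convolutions and the 2 x 1 pooling/upsampling follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SameLengthConv1d(nn.Module):
    """Stride-1 convolution with kernel size 2; the input is zero-padded
    so the output length equals the input length, as described above."""
    def __init__(self, c_in, c_out, k=2):
        super().__init__()
        self.conv = nn.Conv1d(c_in, c_out, k)
        self.k = k
    def forward(self, x):
        return self.conv(F.pad(x, (0, self.k - 1)))  # zero-pad at the end

class PostProcessorCNN(nn.Module):
    def __init__(self, F1=16, F2=8):
        super().__init__()
        self.conv1 = SameLengthConv1d(1, F1)
        self.pool = nn.MaxPool1d(2)            # 2x1 non-overlapping max: halves the first dimension
        self.conv2 = SameLengthConv1d(F1, F2)
        self.up = nn.Upsample(scale_factor=2)  # element duplication: doubles the first dimension
        self.out = SameLengthConv1d(F2, 1)     # assumed single-map output layer
        self.act = nn.LeakyReLU(0.01)          # leaky ReLU; slope 0.01 is an assumption
    def forward(self, x):                      # x: (batch, 1, M), M even
        h = self.act(self.conv1(x))
        h = self.pool(h)
        h = self.act(self.conv2(h))
        h = self.up(h)
        return self.out(h)                     # same length M as the input
```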
In a further embodiment, for each second representation an enhanced transcoded signal frame is generated based on the respective second representation obtained from the artificial neural network. Based on the enhanced transcoded signal frames, the enhanced digital signal is generated, e.g., by OLA or OLS.
In a further embodiment, the transcoded signal frames and/or the enhanced transcoded signal frames comprise a frame length between 1 ms and 100 ms. Advantageously, for audio signals the frame length is between 5 ms and 35 ms, and for video signals between 1 ms and 100 ms.
These short frame lengths ensure that the data within one frame is approximately stationary, i.e., that the statistics of the data do not change substantially within one frame.

In claim 14, a hardware device for post-processing of a transcoded digital signal including audio and/or video data to get an enhanced transcoded digital signal is proposed. The hardware device is arranged to execute the method as described above.
Furthermore, a computer program according to claim 15 is arranged to execute the post-processing method as described above when the computer program is run on a computer device.
According to claim 16, a method for training an artificial neural network is proposed. At first, a plurality of source signal frames and corresponding transcoded signal frames are provided. Said source signal frames were generated by separating at least one source signal, and said transcoded signal frames were generated by separating at least one transcoded digital signal. The separating step can be performed prior to the providing step. In other words, a plurality of sets of signal frames is provided, whereby each set of signal frames includes at least one source signal frame and at least one corresponding transcoded signal frame, which was obtained by encoding and decoding of the source signal. Each transcoded digital signal was obtained by decoding of an encoded signal using a decoder, and said encoded signal was obtained by encoding of the corresponding source signal using an encoder. The decoding and encoding step can be performed prior to the separating step.

In a second step, a first representation within a processing domain for each transcoded signal frame and a second representation within said processing domain for each source signal frame are prepared. This can include that each transcoded signal frame is processed into the first representation within said processing domain and each source signal frame is processed into the second representation within said processing domain. The source and the corresponding transcoded signal frames can be produced on the basis of the source and the corresponding transcoded signal segments. The length and structure of the source and the corresponding transcoded signal frames are the same in training and also in further use of the artificial neural network. Then, a plurality of source signal frames and the corresponding transcoded signal frames are selected by comparing the power ratio of each source signal frame and the whole source signal to a threshold.
In the next step, each transcoded signal frame is processed into a first representation within a processing domain and each source signal frame is processed into a second representation within said processing domain. Then, the artificial neural network is trained by inputting the first and corresponding second representations such that a mapping from a first representation of a transcoded signal frame to a second representation of a source signal frame is trained. In an embodiment, the step of providing a plurality of source signal frames and corresponding transcoded signal frames comprises the step of selecting the source signal frames and the corresponding transcoded signal frames for training said artificial neural network by comparing the power ratio of each source signal frame and the whole source signal to a threshold.
In a further embodiment, it is possible that the plurality of source signal frames is provided by using at least one source signal. The at least one source signal is then separated into a plurality of source signal frames, e.g., by using a separating function. The source signal frames can be overlapping in time or non-overlapping. It is further possible that for the at least one source signal at least one transcoded signal is generated by using an encoder and a decoder. It is also possible that the at least one transcoded signal transcoded from the at least one source signal is provided. Then, the at least one transcoded signal is separated into a plurality of transcoded signal frames.

The present invention will be described in more detail by reference to the following figures:
Figure 1 - General flowchart of post-processing for enhancement of transcoded signals;
Figure 2 - High-level structure of the post-processing method;
Figure 3a, 3b - Processing structure of the cepstral domain approach;
Figure 4 - Example of the structure of a convolutional neural network.
Figure 1 shows a general flowchart of post-processing for enhancement of transcoded signals. At first, a source signal s(n) is inputted to an encoder to obtain an encoded signal. The encoded signal can be transmitted to the receiver side and then to a decoder for decoding the encoded signal. The decoded signal ŝ(n), referred to in the present invention as the transcoded signal ŝ(n), is then transferred to a post-processor for post-processing the transcoded signal ŝ(n). The result of the post-processing is the enhanced transcoded signal s̃(n).
Figure 2 shows a high-level structure of the post-processor shown in figure 1. Firstly, the transcoded signal ŝ(n) is separated into a plurality of segments with signal vectors r(λ), with λ being the discrete segment index. The signal vectors r(λ) typically represent 5 ms to 35 ms of audio, or 1 ms to 100 ms of video. The length of the segment may depend on the decoder.

After the segmentation, the segments r(λ) are delivered to the framing process, where each frame x(ℓ) is produced on the basis of one or a plurality of the segments r(λ). The framing (i.e., production of the frames) can be done with or without overlap and with or without any windowing function.

Then, in the data preparation process, the frames are prepared for inputting into the artificial neural network. During the data preparation process, each frame is transformed into the processing domain, for example the time domain, the frequency domain, or the cepstral domain. The input vector of the neural network (x̄(ℓ) for the time domain and c̄_env(ℓ) for the cepstral domain) is obtained from the data preparation process with normalization, and may depend on one or a plurality of segments r(λ) from the past (λ − 1, λ − 2, ...), the present (λ), or even the future (λ + 1, λ + 2, ...).

Then, the input vectors are processed by the neural network with the same structure as in the training stage. As a result, the output of the neural network (x̃(ℓ) for the time domain and c̃_env(ℓ) for the cepstral domain) is obtained.
Based on the output of the neural network (x̃(ℓ) for the time domain and c̃_env(ℓ) for the cepstral domain), which is the enhanced second representation within the preferred processing domain, the signal is formed from these output vectors. The output of this signal forming process is the enhanced transcoded signal s̃(n).
In the time domain solution, each transcoded segment r(λ) is provided after the segmentation process of the transcoded signal ŝ(n). Then, each frame x(ℓ) is produced after the framing process and is normalized as

\bar{x}(\ell) = (x(\ell) - \mu_x) / \sigma_x   (Equation 1)

where \mu_x and \sigma_x are the mean vector and the standard deviation vector from the training stage, and the division is to be performed element-wise.
For the cepstral domain solution, the framing and data preparation step is shown in figure 3a and the signal forming process including frame reconstruction in the cepstral domain is shown in figure 3b.
For the transcoded signal, each transcoded segment r(λ) is provided after the segmentation process of the transcoded signal ŝ(n). Then, each frame x(ℓ) is produced after the framing process, and for each frame a Fast Fourier Transform (FFT) of size K is performed, yielding X(ℓ, k) with k being the frequency bin. Then the Discrete Cosine Transform of type II (DCT-II) is performed on the log-magnitude values of X(ℓ, k) to obtain the cepstral coefficients. The transform can be expressed as

c(\ell, q) = \sum_{k=0}^{K-1} \log(|X(\ell, k)|) \cdot \cos(\pi q (k + 0.5) / K)   (Equation 2)

to obtain the vector c_env(ℓ) with elements c(ℓ, q), q ∈ Q_env, for the cepstral domain solution, with Q_env being the set of cepstral coefficient indices representing the spectral envelope. Two vectors are stored for the following frame reconstruction step: first, the argument vector α(ℓ) of the ℓth frame complex FFT coefficients, and second, the residual cepstral coefficients vector c_res(ℓ) with elements c(ℓ, q), q ∈ Q_res, of the ℓth frame cepstral coefficients, with Q_res being the set of residual cepstral coefficient indices. As a result, the set Q = {Q_env, Q_res} contains all cepstral coefficient indices. For example, the first 32 coefficients are regarded as spectral envelope coefficients when K equals 512, and the remaining 480 coefficients are regarded as the residual cepstral coefficients. Finally, the input vector for the input of the neural network, c̄_env(ℓ), is obtained after the normalization of c_env(ℓ), which can be expressed by the element-wise normalization

\bar{c}(\ell, q) = (c(\ell, q) - \mu_c(q)) / \sigma_c(q), \quad q \in Q_\mathrm{env}   (Equation 3)

where \mu_c(q) and \sigma_c(q) are the mean value and the standard deviation value from the training stage.
This preparation process is shown in figure 3a.
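The preparation of figure 3a can be sketched as follows with numpy/scipy. K = 512 and the 32 envelope coefficients follow the example above; mu_c and sigma_c denote the training-stage statistics, and the helper name prepare_cepstral_input is illustrative.

```python
import numpy as np
from scipy.fft import dct

def prepare_cepstral_input(x_frame, mu_c, sigma_c, K=512, n_env=32):
    """Sketch of the figure-3a preparation: FFT -> log-magnitude -> DCT-II
    -> split into envelope/residual coefficients -> element-wise normalization."""
    X = np.fft.fft(x_frame, n=K)
    alpha = np.angle(X)                      # argument vector a(l), stored for reconstruction
    log_mag = np.log(np.abs(X) + 1e-12)
    c = dct(log_mag, type=2, norm=None) / 2  # cepstral coefficients c(l, q) as in Equation 2
    c_env, c_res = c[:n_env], c[n_env:]      # spectral envelope / residual split
    c_env_bar = (c_env - mu_c) / sigma_c     # normalization (Equation 3)
    return c_env_bar, c_res, alpha
```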
The input vectors in the cepstral domain are processed by the neural network with the same structure as in the training stage. Based on the output vector of the neural network, c̃_env(ℓ), the enhanced transcoded signal can be formed. In the cepstral domain, a frame reconstruction process is performed first, as shown in figure 3b. The output of the neural network c̃_env(ℓ) and the residual cepstral coefficients c_res(ℓ), stored in the data preparation procedure, are concatenated to form the complete cepstral coefficients c̃(ℓ). Then the inverse DCT-II (IDCT-II) is performed to go back to the logarithm domain of the amplitude spectrum, which is denoted as

\log(|\tilde{X}(\ell, k)|) = \frac{1}{K} \tilde{c}(\ell, 0) + \frac{2}{K} \sum_{q=1}^{K-1} \tilde{c}(\ell, q) \cdot \cos(\pi q (k + 0.5) / K)   (Equation 4)

Then, the exponential function is used to obtain |X̃(ℓ, k)|, and the complex FFT coefficients are computed along with the pre-stored argument vector as

\tilde{X}(\ell, k) = |\tilde{X}(\ell, k)| \cdot \exp(j \cdot \alpha(\ell, k))   (Equation 5)

with j being the imaginary unit.

Finally, the reconstructed frame in the time domain is obtained by taking the real part of the inverse FFT of the FFT coefficients vector X̃(ℓ).
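Correspondingly, a minimal sketch of the figure-3b reconstruction, assuming the network output has already been de-normalized with the training-stage mean and standard deviation:

```python
import numpy as np
from scipy.fft import idct

def reconstruct_frame(c_env_tilde, c_res, alpha, N_w=None):
    """Sketch of figure 3b: concatenate enhanced envelope and stored residual
    coefficients, IDCT-II back to the log-magnitude domain (Equation 4),
    re-attach the stored phase (Equation 5), inverse FFT, take the real part."""
    c_tilde = np.concatenate([c_env_tilde, c_res])    # complete cepstral coefficients
    log_mag = idct(2.0 * c_tilde, type=2, norm=None)  # inverse of Equation 2 (factor 2 undoes the scipy scaling)
    X_tilde = np.exp(log_mag) * np.exp(1j * alpha)    # magnitude times stored phase
    x_tilde = np.real(np.fft.ifft(X_tilde))           # reconstructed time-domain frame
    return x_tilde[:N_w] if N_w is not None else x_tilde
```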
Subsequently, the enhanced transcoded digital signal is generated, i.e., formed, from the output vectors (time domain) or the reconstructed frames (cepstral domain). In the following, three different example signal forming methods, along with their corresponding framing methods, will be introduced to finally obtain the enhanced transcoded signal s̃(n). These signal forming methods can be used for time domain or cepstral domain processing, and also for frequency domain processing with or without the logarithm.
Frame-wise direct signal forming

The segmentation and framing procedure can be expressed as

x(\ell) = [\hat{s}(\ell N_w - N_w + 1), \ldots, \hat{s}(\ell N_w)]^T   (Equation 6)

where N_w is the frame length and the frame shift is equal to the frame length N_w. In this case, the current frame is obtained by directly taking the current signal segment without overlapping and windowing. Therefore, the signal frames and signal segments are identical, denoted as x(ℓ) = r(λ).

Signal forming now goes as follows: the processed frames x̃(ℓ) are concatenated directly along the frame index to achieve the improved signal s̃(n), which can be expressed as

\tilde{s} = [\tilde{x}^T(1), \tilde{x}^T(2), \ldots, \tilde{x}^T(L)]^T   (Equation 7)

where L is the number of frames of the speech to be formed. The approach has no additional algorithmic latency beyond the segmentation.

Current segment and past signal forming
The segmentation and framing procedure can be expressed as

x(\ell) = [\hat{s}(\ell N_s - N_w + 1), \ldots, \hat{s}(\ell N_s)]^T   (Equation 8)

where N_w is the frame length and N_s is the frame shift. Note that a plurality of zeros are padded before the beginning of ŝ(n). In this case, the current frame x(ℓ) is obtained by taking the current signal segment r(λ) along with several past samples from the past transcoded signal (which can be obtained from the past transcoded segments r(λ − 1), r(λ − 2), ...), without windowing. Therefore, the signal frame can be denoted as x(ℓ) = [ŝ(n − N_w + 1), ..., ŝ(n − N_s), r(λ)], with the signal segment length being N_s.

Signal forming now goes as follows: the improved signal s̃(n) is achieved with the processed frames x̃(ℓ) as

\tilde{s}((\ell - 1) N_s + i) = \tilde{x}(\ell, N_w - N_s + i), \quad i = 1, \ldots, N_s   (Equation 9)

i.e., only the part of each processed frame corresponding to the current segment is used. This approach also has no additional algorithmic latency beyond segmentation, but has longer frames to be processed compared to frame-wise direct forming.
Overlap-add signal forming

The segmentation and framing procedure are performed first with overlap, within which the frames are also multiplied with a windowing function, which can be expressed as

x(\ell) = [\hat{s}((\ell - 1) N_s + 1), \ldots, \hat{s}((\ell - 1) N_s + N_w)]^T \circ f_w   (Equation 10)

where N_w is the frame length, N_s is the frame shift length, f_w is the window function (e.g., Hann window), [\,]^T is the vector transpose, and \circ is defined here as element-wise multiplication. Note that a plurality of zeros are padded before the beginning of ŝ(n). As an example, a 50 % overlap can be performed with N_s = N_w / 2. In this example case, the current frame is obtained by taking the current signal segment and one future segment, i.e., one segment lookahead. Therefore, the signal frame can be denoted as x(ℓ) = [r(λ), r(λ + 1)]^T ∘ f_w, with the signal segment length being N_s.

Signal forming now goes as follows: the improved speech s̃(n) is achieved with overlap-add of the processed frames x̃(ℓ), which can be expressed as

\tilde{s}((\ell - 1) N_s + i) = \tilde{x}(\ell, i) + \tilde{x}(\ell - 1, i + N_s), \quad i \in I_1   (Equation 11)

where I_1 = {1, ..., N_s} is the beginning part of the indices in a frame and I_2 = {N_s + 1, ..., N_w} is the end part of the indices in a frame. This method is expected to perform with the best quality, while it has an algorithmic latency of the shift length.
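A minimal sketch of overlap-add signal forming for already windowed, processed frames; it assumes the analysis windows of adjacent frames sum to one (as for a Hann window with 50 % overlap), so no extra synthesis window is applied.

```python
import numpy as np

def ola_signal_forming(frames_tilde, N_s):
    """Overlap-add forming in the spirit of Equation 11: `frames_tilde` is an
    (L, N_w) array of processed, windowed frames with frame shift N_s;
    overlapping parts of consecutive frames are summed."""
    L, N_w = frames_tilde.shape
    s_tilde = np.zeros((L - 1) * N_s + N_w)
    for l in range(L):
        s_tilde[l * N_s : l * N_s + N_w] += frames_tilde[l]  # add frame l at its shift position
    return s_tilde
```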
Depending on the chosen domain, a neural network has to be trained. Independent of the chosen domain, a similar neural network topology is used in this embodiment, with only the dimensions of the input and output layers differing. An example of the convolutional neural network used in the present invention is shown in figure 4.
A plurality of source signal segments and the corresponding transcoded signal segments are provided. Then, the source and transcoded signal frames are produced on the basis of the source and the corresponding transcoded signal segments, respectively.
A simple frame-based voice activity detection (VAD) is performed to select the active frames for the training stage by comparing the power ratio of each source signal frame and the whole source signal to a threshold. A source frame is regarded as a speech-active source frame with index ℓ′ if

\sigma^2(\ell) / \bar{\sigma}^2 > \Theta_\mathrm{VAD}, \quad \text{with} \quad \sigma^2(\ell) = \frac{1}{|\mathcal{N}_\ell|} \sum_{n \in \mathcal{N}_\ell} |s(n)|^2, \quad \bar{\sigma}^2 = \frac{1}{|\mathcal{N}|} \sum_{n \in \mathcal{N}} |s(n)|^2

and the threshold \Theta_\mathrm{VAD} is, e.g., 0.0001. The set N_ℓ contains all sample indices n belonging to frame ℓ, and |N_ℓ| denotes the number of elements in this set. Similarly, N contains all sample indices n belonging to the complete speech signal, and |N| denotes the number of elements in this set. By performing a VAD, only the active source signal frames and the corresponding transcoded signal frames will be used for the training stage, while the remaining parts (speech pauses) will not be used for the training stage.
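A minimal sketch of this selection step; `frame_indices` stands for the sets N_ℓ of sample indices per frame, and the function name is illustrative.

```python
import numpy as np

def select_active_frames(s, frame_indices, theta_vad=0.0001):
    """Frame-based VAD for training data selection: a frame is speech-active
    if its mean power exceeds theta_vad times the mean power of the whole
    source signal s."""
    mean_power_signal = np.mean(np.abs(s) ** 2)          # power over the whole signal
    active = []
    for l, n_l in enumerate(frame_indices):
        mean_power_frame = np.mean(np.abs(s[n_l]) ** 2)  # power of frame l
        if mean_power_frame / mean_power_signal > theta_vad:
            active.append(l)                             # keep only speech-active frames
    return active
```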
In the case of video, a similar selection procedure can be used by excluding long static scenes from the training process.
During the training of the neural network, the prepared inputs of the neural network first go in the forward direction through the neural network, yielding the network outputs y^{(N)}, where N is the total number of layers. After that, the outputs are compared to the targets, guided by a cost function. The trainable weights of the neural network are then iteratively adjusted to minimize the cost function based on a learning rule (i.e., backpropagation training). When some preset stopping criteria are met, the training process is finished and the weights in the neural network stay unchanged.
A kind of convolutional neural network with N = 6, as used in the invention, is depicted in figure 4; this is just an example of the structure and topology and can be adjusted as needed. Other kinds of neural networks could also be used, e.g., feedforward neural networks, deep neural networks (DNNs), or recurrent neural networks (RNNs) such as long short-term memory (LSTM) networks.
In the following, the training stage is presented in detail for the convolutional neural network depicted in figure 4.
Output for each layer
The input layer (first layer):

y_m^{(1)} = i_m = \begin{cases} \bar{x}(\ell', m) & \text{in the time domain} \\ \bar{c}(\ell', m), \; m \in Q_\mathrm{env} & \text{in the cepstral domain} \end{cases}   (Equation 12)

where m denotes the index of the input vector.
The convolutional layer 1 (second layer):

y_p^{(2)} = f^{(2)}(w_p^{(2)} * i + b_p^{(2)}), \quad p = 1, \ldots, F^{(2)}   (Equation 13)

where * denotes the convolution operation, w_p^{(2)} denotes the weight vector of the pth kernel, b_p^{(2)} denotes the pth bias, M^{(1)} is the dimension of the input vector i, and F^{(2)} is the number of kernels used in this layer. Please note that the frame index ℓ′ is omitted for convenience as soon as the internal processing of the neural network is presented. The convolution is computed as

(w_p^{(2)} * i)_m = w_{p,1}^{(2)} \, i_m + w_{p,2}^{(2)} \, i_{m+1}   (Equation 14)

with i_m being zero when m > M^{(1)}; the kernel size here is two. Note that the stride of the kernel is one and the input vector i is zero-padded before the convolution is computed, to make sure that the output vector dimension is the same as the input vector dimension. The activation function f^{(2)} used here is the leaky rectified linear unit (ReLU) function, which can be denoted as

f^{(2)}(x) = \begin{cases} x & \text{if } x > 0 \\ \beta x & \text{otherwise} \end{cases}   (Equation 15)

with a small positive slope \beta.
The max-pooling layer (third layer):

y_{p,m}^{(3)} = \max\left(y_{p,2m-1}^{(2)}, \; y_{p,2m}^{(2)}\right)   (Equation 16)

with max() being the maximum function. The first dimension of the matrix is decreased by half in the max-pooling layer.
The convolutional layer 2 (fourth layer):

y_p^{(4)} = f^{(4)}(w_p^{(4)} * y^{(3)} + b_p^{(4)}), \quad p = 1, \ldots, F^{(4)}   (Equation 17)

which is similar to the expressions in the second layer (convolutional layer 1).
The upsampling layer (fifth layer):

y_{p,2m-1}^{(5)} = y_{p,2m}^{(5)} = y_{p,m}^{(4)}   (Equation 18)

It can be seen that the first dimension of the matrix is doubled in this layer.
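A small numpy illustration of the 2 x 1 max pooling (Equation 16) and the element-copying upsampling (Equation 18) on a toy vector:

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])
pooled = x.reshape(-1, 2).max(axis=1)  # non-overlapping 2x1 max pooling (Equation 16)
upsampled = np.repeat(pooled, 2)       # each element copied into a 2x1 vector (Equation 18)
print(pooled)     # [3. 4. 9.]         -> first dimension halved
print(upsampled)  # [3. 3. 4. 4. 9. 9.] -> first dimension doubled again
```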
Finally, the output layer (sixth layer):

y^{(6)} = f^{(6)}(w^{(6)} * y^{(5)} + b^{(6)})   (Equation 19)

which combines the feature maps of the fifth layer into the output vector y^{(6)}.
Weights updating
After the output vector y^{(6)} is achieved, the cost function in terms of the mean squared error (MSE) between the outputs and targets can be written as

E_d(\mathcal{W}) = \frac{1}{|B_d|} \sum_{\ell' \in B_d} \left\| t(\ell') - y^{(6)}(\ell') \right\|^2   (Equation 20)

where \mathcal{W} is the set of weight matrices in all layers of the neural network and B_d is the index set for the dth batch. The term t(ℓ′) is the target vector, in which the elements can be denoted as

t(\ell', m) = \begin{cases} \bar{x}(\ell', m) & \text{in the time domain} \\ \bar{c}(\ell', m), \; m \in Q_\mathrm{env} & \text{in the cepstral domain} \end{cases}   (Equation 21)

taken from the respective source signal frame. Specifically, the indices of the training set \mathcal{T} are divided into D batches of the same size and with no repetition, which can be denoted as

\mathcal{T} = B_1 \cup B_2 \cup \ldots \cup B_D   (Equation 22)

Similarly, the corresponding training pairs are also divided into D batches and can be denoted as

\mathcal{O} = O_1 \cup O_2 \cup \ldots \cup O_D   (Equation 23)

with \mathcal{O} being the training pairs. Furthermore, the training pairs in each batch contribute to one weight update, and one epoch is finished when all training pairs in the training data have been processed.
The weights are then trained using batch backpropagation (BP), in which the weight matrix set \mathcal{W} is changed iteratively to minimize the cost function with the stochastic gradient descent (SGD) algorithm.
Stop criteria
After every epoch, the MSE is calculated on the validation set, which can be denoted as

V(\mathcal{W}_g) = \frac{1}{|\mathcal{V}|} \sum_{\ell' \in \mathcal{V}} \left\| t(\ell') - y_g^{(6)}(\ell') \right\|^2   (Equation 24)

where V(\mathcal{W}_g) is the MSE on the validation set after the gth epoch, \mathcal{V} is the set of frame indices of the validation set, and y_g^{(6)} is the output of the neural network after the gth epoch. The training process ends after the gth epoch if either of the following conditions is satisfied:

V(\mathcal{W}_{g-1}) - V(\mathcal{W}_g) < \Theta_\mathrm{MSE}   (Equation 25)

where \Theta_\mathrm{MSE} is the MSE threshold, or a preset maximum number of epochs has been reached. The stop of the training process means that the neural network is assumed to have achieved a state of proper generalization. Finally, the structure of the neural network and the trained weight matrix set, together with the mean vector and the standard deviation vector, are stored for the further usage of the invention.
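The complete training stage can be sketched as follows in PyTorch; the batch iterables, learning rate, and maximum epoch count are illustrative assumptions, and the stop criterion follows the threshold form of Equation 25.

```python
import torch
import torch.nn as nn

def train(model, train_batches, val_pairs, theta_mse=1e-6, max_epochs=100, lr=1e-3):
    """Sketch of batch backpropagation with SGD on the MSE cost (Equation 20),
    stopping when the validation MSE (Equation 24) no longer improves by more
    than theta_mse, or when max_epochs is reached."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    v_prev = float("inf")
    for g in range(1, max_epochs + 1):
        for inputs, targets in train_batches:  # one weight update per batch B_d
            opt.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()                    # backpropagation
            opt.step()                         # stochastic gradient descent step
        with torch.no_grad():                  # validation MSE after epoch g (Equation 24)
            v = sum(loss_fn(model(i), t).item() for i, t in val_pairs) / len(val_pairs)
        if v_prev - v < theta_mse:             # stop criterion (Equation 25)
            break
        v_prev = v
    return model
```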


Patent claims
1. Method for post-processing of at least one transcoded digital signal including audio and/or video data to obtain at least one enhanced transcoded digital signal, whereby one of said transcoded signal was obtained by decoding of an encoded signal using a decoder and said encoded signal was obtained by encoding of a source signal using an encoder, whereby the method comprising the following steps using a postprocessor:
providing a plurality of transcoded signal frames, whereby said transcoded signal frames were generated by separating one of said transcoded digital signal;
preparing a first representation within a processing domain for each transcoded signal frame;
feeding each first representation of the transcoded signal frames into an artificial neural network to obtain for each first representation a second representation of the respective transcoded signal frame, whereby said artificial neural network is provided such that the artificial neural network is trained a mapping from a representation of a transcoded signal frame within said processing domain to a representation of a source signal frame within said processing domain;
generating an enhanced transcoded digital signal based on the second representations obtained from the artificial neural network; and outputting said enhanced transcoded digital signal.
2. Method according to claim 1 , wherein said providing step comprises:
separating one of said transcoded digital signal into said plurality of transcoded signal frames, each transcoded signal frame being a part of said transcoded digital signal; or
building said plurality of transcoded signal frames from a plurality of transcoded digital signal segments provided from the decoder.
3. Method according to claim 1 or 2, wherein said preparing step comprises:
processing each transcoded signal frame into a first representation within said processing domain.
4. Method according to one of claim 1 to 3, wherein said processing domain is the time domain, the frequency domain, the cepstral domain or the log-magnitude domain.
5. Method according to one of claim 1 to 3, wherein said processing domain is the time domain, whereby
preparing a waveform representation for each transcoded signal frame; said artificial neural network is provided such that the artificial neural network is trained a mapping from the waveform representation of a transcoded signal frame to the waveform representation of a source signal frame;
the enhanced transcoded signal is generated based on the waveform representation obtained from the artificial neural network.
6. Method according to one of claim 1 to 3, wherein said processing domain is the frequency domain, whereby
the transcoded signal frames are processed into the frequency domain by transforming each transcoded signal frame in a magnitude-phase representation or in a real and imaginary part representation; said artificial neural network is provided such that the artificial neural network is trained a mapping from the magnitude-phase representation or from the real and imaginary part representation of a transcoded signal frame to the magnitude-phase representation or to the real and imaginary part representation of a source signal frame; and
the enhanced transcoded signal is generated based on the magnitude-phase representation or the real and imaginary part representation obtained from the artificial neural network.
7. Method according to claim 6, wherein said processing domain is the log-magnitude domain, whereby
the transcoded signal frames are processed into the frequency domain by transforming each transcoded signal frame in a magnitude-phase representation; said artificial neural network is provided such that the artificial neural network is trained a mapping from the log-magnitude representation of a transcoded signal frame to the log-magnitude representation of a source signal frame;
- whereby the log-magnitude representation of a source signal frame is subject to an inverse logarithm function and appended with the phase as obtained above to obtain a magnitude-phase representation of a source signal frame; and
the enhanced transcoded signal is generated based on the above obtained magnitude-phase representation obtained from the artificial neural network.
8. Method according to one of claim 1 to 3, wherein said processing domain is the cepstral domain, whereby
- the transcoded signal frames are processed into the cepstral domain by transforming each transcoded signal frame in a cepstral coefficients representation;
said artificial neural network is provided such that the artificial neural network is trained a mapping from the cepstral coefficients representation of a transcoded signal frame to the cepstral coefficients representation of a source signal frame; and
the enhanced transcoded signal is generated based on the cepstral coefficients representation obtained from the artificial neural network.
9. Method according to one of the foregoing claims, wherein said artificial neural network is a convolutional neural network.
10. Method according to claim 9, wherein said convolutional neural network has a plurality of hidden layers, whereby the hidden layers comprising at least one convolutional layer, at least one max pooling layer and at least one upsampling layer.
11. Method according to claim 10, wherein an input layer of the convolutional neural network is connected with a first convolutional layer, said first convolutional layer is connected with a max pooling layer, said max pooling layer is connected with a second convolutional layer, said second convolutional layer is connected with an upsampling layer and said upsampling layer is connected with an output layer.
12. Method according to one of the foregoing claims, wherein for each second representation an enhanced transcoded signal frame is generated based on the respective second representation obtained from the artificial neural network, whereby based on the enhanced transcoded signal frames said enhanced transcoded digital signal is generated.
13. Method according to one of the foregoing claims, wherein the transcoded signal frames comprising a frame length between 1 ms and 100 ms, advantageously for audio signals between 5 ms and 35 ms.
14. Hardware device for post-processing of a transcoded digital signal containing audio and/or video data to get an enhanced transcoded digital signal, said transcoded signal was obtained prior by decoding of an encoded signal using a decoder and said encoded signal was obtained prior by encoding of a source signal using an encoder, whereby the hardware device is arranged to execute the method according to one of the claims 1 to 13.
15. Computer program arranged to execute the post-processing method according to one of the claims 1 to 13, if the computer program is running on a computer device.
16. Method for training an artificial neural network, whereby the training method comprising the following steps:
providing a plurality of source signal frames and corresponding transcoded signal frames, whereby said source signal frames were generated by separating at least one source signal and said transcoded signal frames were generated by separating at least one transcoded digital signal, each transcoded digital signal was obtained by decoding of an encoded signal using a decoder and said encoded signal was obtained by encoding of the corresponding source signal using an encoder; preparing a first representation within a processing domain for each transcoded signal frame and a second representation within said processing domain for each source signal frame;
training said artificial neural network by inputting the first and corresponding second representations such that a mapping from a first representation of a transcoded signal frame to a second representation of a source signal frame is trained.
17. Method according to claim 16, wherein the step of providing a plurality of source signal frames and corresponding transcoded signal frames comprises:
selecting the source signal frames and the corresponding transcoded signal frames for training said artificial neural network by comparing the power ratio of each source signal frame and the whole source signal to a threshold.