CN107112024B - Encoding and decoding of audio signals - Google Patents


Info

Publication number
CN107112024B
CN107112024B (application CN201580057771.7A)
Authority
CN
China
Prior art keywords
bitstream
audio
audio data
frames
audio signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201580057771.7A
Other languages
Chinese (zh)
Other versions
CN107112024A
Inventor
克里斯托弗·薛林
亚历山大·格罗舍尔
海科·普尔哈根
霍尔格·赫里希
库尔特·克劳斯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dolby International AB
Original Assignee
Dolby International AB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dolby International AB filed Critical Dolby International AB
Publication of CN107112024A
Application granted
Publication of CN107112024B


Classifications

    • G10L 19/167 — Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
    • G10L 19/022 — Blocking, i.e. grouping of samples in time; choice of analysis windows; overlap factoring
    • G10L 19/16 — Vocoder architecture
    • G10L 19/173 — Transcoding, i.e. converting between two coded representations avoiding cascaded coding-decoding
    • H04N 21/2335 — Processing of audio elementary streams involving reformatting operations of audio signals, e.g. by converting from one coding standard to another
    • H04N 21/23418 — Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics


Abstract

The audio signal (X) is represented by a bitstream (B) divided into frames. The audio processing system (500) includes a buffer (510) and a decoding section (520). The buffer combines the audio data sets (D1, D2, ..., DN) carried by N respective frames (F1, F2, ..., FN) into one decodable set of audio data (D) corresponding to a first frame rate and to a first number of samples of the audio signal per frame. The frames have a second frame rate corresponding to a second number of samples of the audio signal per frame. The first number of samples is N times the second number of samples. The decoding section decodes the decodable set of audio data into a segment of the audio signal by employing at least signal synthesis based on the decodable set of audio data and using a stride corresponding to the first number of samples of the audio signal.

Description

Encoding and decoding of audio signals
Cross Reference to Related Applications
This application claims priority to U.S. provisional patent application No. 62/068,187, filed on 24 October 2014, the entire contents of which are incorporated herein by reference.
Technical Field
The invention disclosed herein relates generally to encoding and decoding of audio signals, and in particular to audio bitstream formats with advantageous scaling behavior for high frame rates.
Background
Audio and video frame rates (or frame frequencies) used in most commercial applications available today follow separately established industry standards, in terms of recording and playback software products, hardware components, and agreed formats for transmitting audio and video between communicating parties. Audio frame rates are typically specific to different encoding algorithms and are associated with specific audio sampling frequencies (e.g., 44.1 kHz and 48 kHz), which are as well established as the video frame rates 29.97 fps (NTSC) and 25 fps (PAL) in their respective geographic regions; additional standard video frame rates include 23.98 fps, 24 fps, and 30 fps or, in a more generalized form, 24, 25, 30, and (24, 25, 30) × 1000/1001 fps. Despite the conversion from analog to digital distribution, attempts to unify or coordinate audio frame rates have not been successful, meaning that audio frames (e.g., packets or coding units suitable for transmission over a network) typically do not correspond to an integer number of video frames in an audiovisual data stream.
The need to synchronize streams of audiovisual data arises repeatedly, due to clock drift or when several streams are received from different sources for common processing, editing or splicing in a server, a situation often encountered in broadcasting stations. If the size of the audio frames does not match the size of the video frames, attempts to improve video-video synchronicity between two streams of audiovisual data by copying or discarding video frames in one stream (e.g., a stream prepared for splicing) typically result in an audio-video lag within the stream of audiovisual data. Typically, the lag persists for some non-zero duration even if the audio frame corresponding to the video edit is deleted or copied.
At the expense of more processing, a larger maneuvering space can be created by temporarily decoding the audio during synchronization into a low-level format that is independent of frame partitioning, e.g., a baseband format or Pulse Code Modulation (PCM) sampled at the original sampling frequency. However, such decoding obscures the exact anchoring of metadata to particular audio segments and causes irreparable loss of information on the way to the "perfect" intermediate format. As an example, Dynamic Range Control (DRC) is typically mode-dependent and device-dependent and can therefore only be applied at the time of actual playback; the data structures governing DRC characteristics throughout an audio packet are difficult to recover faithfully after synchronization has taken place. Thus, preserving this type of metadata through successive decoding, synchronization and encoding stages is not a simple task if subject to complexity constraints.
Even more serious difficulties may arise in connection with legacy infrastructure designed to carry two-channel PCM signals, thereby being able to process multi-channel content in encoded form only.
It would indeed be more convenient to encode audio and video data frame-synchronously, in the sense that the data in a given frame corresponds exactly to the same time period of the recorded and encoded audiovisual signal. This preserves audio-video synchronization under frame-wise manipulation of the audiovisual stream, i.e., copying or discarding one or more complete individual coding units in the stream. The available frame lengths of the Dolby E™ audio format match video frame lengths. However, with a typical bit rate of 448 kbps, this format is designed primarily for professional production purposes, with hard media such as digital video tape as its preferred storage form.
In the applicant's co-pending unpublished application PCT/EP2014/056848, systems and methods are proposed that are compatible with the following audio formats: the audio format is suitable for distribution purposes as part of a frame-synchronized audiovisual format.
There is a need for an alternative audio format that is suitable for distribution purposes as part of a frame-synchronized audiovisual format and that has improved scaling behavior for high frame rates. There is also a need for encoding and decoding devices suitable for their use.
Drawings
Example embodiments will be described in more detail below and with reference to the accompanying drawings, in which:
FIG. 1 is a general block diagram of an audio processing system for representing an audio signal as an audio bitstream, according to an example embodiment;
FIG. 2 is a flow diagram of a method of representing an audio signal as an audio bitstream, according to an example embodiment;
FIGS. 3 and 4 illustrate examples of audio bitstreams provided by the audio processing system shown in FIG. 1, according to an exemplary embodiment;
FIG. 5 is a general block diagram of an audio processing system for reconstructing an audio signal represented by a bitstream, according to an example embodiment;
fig. 6 is a flowchart of a method of reconstructing an audio signal represented by a bitstream according to an example embodiment; and
fig. 7 is a general block diagram of an audio processing system for transcoding an audio bitstream representing an audio signal, according to an example embodiment.
All the figures are schematic and generally only show parts which are necessary for elucidating the invention, while other parts may be omitted or merely suggested.
Detailed Description
As used herein, an audio signal may be any of a stand-alone audio signal, an audio portion of an audiovisual signal or a multimedia signal, or an audio signal combined with metadata.
I. Overview-encoder side
According to a first aspect, the exemplary embodiments propose an audio processing system, a method and a computer program product for representing an audio signal as an audio bitstream. According to a first aspect, the proposed system, method and computer program product may generally share the same features and advantages.
According to an example embodiment, a method of representing an audio signal as an audio bitstream is provided. The method comprises the following steps: a segment of an audio signal is encoded into a decodable set of audio data by performing at least a signal analysis on the segment of the audio signal using a stride corresponding to a first number of samples of the audio signal (referred to herein as a base stride). The decodable set of audio data corresponds to a first frame rate and a first number of samples of the audio signal per frame. The method comprises the following steps: dividing the decodable set of audio data into N portions, wherein N ≧ 2; and forming N bitstream frames carrying the respective portions. The bitstream frames have a second frame rate corresponding to a second number of samples of the audio signal per bitstream frame. The first number of samples is N times the second number of samples. The method comprises the following steps: outputting a bitstream, the bitstream being partitioned into bitstream frames including the formed N bitstream frames.
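As a minimal sketch of the divide-and-frame steps above (in Python, with illustrative names and a plain byte-split that are assumptions, not the patent's actual bitstream syntax):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BitstreamFrame:
    group_index: int   # position of this frame within its group of N
    group_size: int    # N
    payload: bytes     # one portion of the decodable set of audio data

def split_decodable_set(audio_data: bytes, n: int) -> List[bytes]:
    """Divide a decodable set of audio data into n near-equal portions."""
    q, r = divmod(len(audio_data), n)
    parts, pos = [], 0
    for i in range(n):
        size = q + (1 if i < r else 0)  # first r portions get one extra byte
        parts.append(audio_data[pos:pos + size])
        pos += size
    return parts

def form_bitstream_frames(audio_data: bytes, n: int) -> List[BitstreamFrame]:
    """Form N bitstream frames carrying the respective portions."""
    return [BitstreamFrame(i, n, p)
            for i, p in enumerate(split_decodable_set(audio_data, n))]
```

A real encoder would wrap each portion in the applicable frame format (headers, non-payload data) before output; here the portions are left bare.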
In the stream of audiovisual data the audio frames and the video frames may be synchronized and may have equal duration, for example to facilitate frame dropping or frame copying in connection with splicing or compensation of clock drift. The audio frame rate may also be increased in order to maintain audio-video synchronicity in the stream of audiovisual data for higher video frame rates. However, while predictive coding is typically used to reduce the bit rate cost of increasing the video frame rate, predictive coding may be less efficient for audio frames because audio content may vary over a shorter time scale and may be associated with a lower degree of correlation between successive frames as compared to video content. For purposes of this disclosure, unless otherwise noted, a video frame corresponds to one complete screen image (e.g., a still image in a sequence), while an audio frame may in principle carry audio data corresponding to a segment of an audio signal having any duration.
The ability of the present method to provide N bitstream frames at the second (higher) frame rate that together carry a decodable set of audio data associated with the first (lower) frame rate allows audio-video synchronicity to be maintained at higher video frame rates without a corresponding increase in bit rate consumption. More precisely, operating at an increased frame rate according to the present method generally results in a bit rate lower than that required when using conventional audio frames at such a higher frame rate. Thus, the method may, for example, facilitate splicing of streams of audiovisual data and/or facilitate compensation for clock drift.
Indeed, since the decodable set of audio data may correspond to the amount of data carried by a conventional audio frame of the first (lower) frame rate, the total amount of data transmitted from the encoder side to the decoder side may be reduced compared to using conventional audio frames of the second (higher) frame rate, even though the N bitstream frames may still need to contain additional non-payload data necessary for complying with the frame format (see below). In particular, performing the signal analysis with the base stride instead of a shorter stride (e.g., one corresponding to the second number of samples of the audio signal) reduces the amount of data required to synthesize the audio signal again on the decoder side, thereby reducing the bit rate required to transmit the data to the decoder side.
Splicing the audio bitstream with other bitstreams may, for example, be performed without regard to the audio data carried by the bitstream frames. In other words, the device or unit performing the splicing need not be aware of the fact that all N bitstream frames may be needed to reconstruct a segment of the audio signal, and may, for example, treat the bitstream frames as if they could be decoded independently. Bitstream frames possibly lost in the spliced bitstream can be handled on the decoder side, e.g., by concealing bitstream frames that do not allow successful decoding.
A decodable set of audio data refers to a set of audio data sufficient for decoding a segment of the audio signal. The decodable set of audio data may be complete in the sense that decoding of the segment of the audio signal may be performed without additional data related to that segment (although non-payload data such as overhead bits, headers or preambles may, for example, be used on the decoder side to identify the decodable set of audio data).
Performing the signal analysis using a base stride corresponding to the first number of samples of the audio signal means that the signal analysis is performed within an analysis window spanning a certain number of samples of the audio signal, and that the analysis window is advanced by the base stride when the next segment of the audio signal is to be encoded. The signal analysis may, for example, be performed with overlapping analysis windows, in which case the analysis window may be longer than the base stride. In another example, the length of the analysis window may coincide with the base stride.
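The window placement described above can be sketched as follows (a hedged illustration; the function name and the sample numbers in the usage line are assumptions):

```python
def analysis_windows(num_samples, base_stride, window_len):
    """Yield (start, end) sample-index pairs of successive analysis windows.
    The window may be longer than the base stride (overlapping windows),
    but each window starts exactly one base stride after the previous one."""
    start = 0
    while start + window_len <= num_samples:
        yield (start, start + window_len)
        start += base_stride

# Overlapping case: 2048-sample windows advancing by a 1024-sample stride,
# so consecutive windows share window_len - base_stride = 1024 samples.
windows = list(analysis_windows(num_samples=4096, base_stride=1024, window_len=2048))
```

With `window_len == base_stride`, the windows instead tile the signal without overlap, matching the second example in the text.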
It should be understood that if the audio signal is a multi-channel signal, the basic step may correspond to a first number of samples of the audio signal on a per-channel basis, rather than being the sum of the samples of the individual channels.
The step of encoding a segment of the audio signal may comprise, for example, a plurality of sub-steps, one or more of which may comprise performing signal analysis in basic steps.
The decodable set of audio data may represent a segment of the audio signal corresponding to a first number of samples of the audio signal. The decodable set of audio data can correspond to frames having a first frame rate.
Partitioning the decodable set of audio data may, for example, comprise dividing the decodable set of data into N at least substantially equally sized portions, e.g., comprising at least substantially the same number of bits.
Each of the N portions may be an incomplete audio data set in the sense that: without accessing other parts, one part may not be sufficient to decode a segment (or sub-segment) of the audio signal.
For each of the N bitstream frames, the N bitstream frames may, for example, be a minimum set of bitstream frames that includes the bitstream frame, and audio data from the minimum set may be combined to decode a segment of the audio signal represented by the data carried by the bitstream frame. In other words, the N bitstream frames may be those bitstream frames that carry data originally contained in the same decodable set of audio data.
Bitstream frames correspond to the second (higher) frame rate in the following sense: the N bitstream frames together represent the same audio signal segment as the decodable set of audio data corresponding to the first (lower) frame rate.
Similarly, a bitstream frame corresponds to the second (smaller) number of samples of each bitstream frame in the sense that: the N bitstream frames together represent a first (higher) number of samples that is also represented by the decodable set of audio data.
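A worked numerical example may clarify the sample-count relationship; all concrete numbers here (48 kHz audio, 120 bitstream frames per second, N = 4) are assumptions for illustration, not values prescribed by the text:

```python
# Assumed configuration: 48 kHz audio accompanying 120 fps video, N = 4.
sampling_rate = 48000        # Hz
second_frame_rate = 120      # bitstream frames per second (higher rate)
N = 4

# Second number of samples: samples of the audio signal per bitstream frame.
second_num_samples = sampling_rate // second_frame_rate
# First number of samples: N times as many, covered by one decodable set.
first_num_samples = N * second_num_samples
# First frame rate: the lower rate to which the decodable set corresponds.
first_frame_rate = second_frame_rate // N
```

Under these assumptions each bitstream frame corresponds to 400 samples, while the decodable set spread over N = 4 frames represents 1600 samples at an effective 30 fps.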
It is to be understood that the bitstream frames may for example carry respective portions of a spectral representation of a segment of the audio signal, and that there may be no correlation between one of the bitstream frames and a second (smaller) number of samples of the audio signal.
The N bitstream frames may for example conform to an audio format in the following sense: the bitstream frames may carry a payload and metadata that, at the primary stream level, conforms to an audio format, for example, as provided in a Moving Picture Experts Group (MPEG) primary stream. It should be understood that although conforming to an audio format in this sense, the payload and at least some of the metadata carried by the bitstream frames may, for example, be of a different type and/or format than those in audio frames known in the art.
The N bitstream frames carrying the N parts may for example be output as N consecutive bitstream frames in the bitstream.
In an example embodiment, performing the signal analysis may include performing, at the base stride: spectral analysis; energy analysis; and/or entropy analysis. A spectral analysis with the base stride may, for example, be performed to convert a segment of the audio signal from the time domain to the frequency domain. An energy analysis with the base stride may, for example, be performed to encode a segment of the audio signal with an energy-based encoding technique. An entropy analysis with the base stride may, for example, be performed to encode the audio signal with an entropy-analysis-based encoding technique.
In an example embodiment, encoding the segment of the audio signal may include: applying a windowed transform with the base stride as a transform stride; and/or calculating a downmix signal and parameters for a parametric reconstruction of an audio signal from the downmix signal, wherein the parameters are calculated based on a signal analysis.
The windowing transform may for example be a harmonic transform, such as for example a Modified Discrete Cosine Transform (MDCT) employing overlapping transform windows.
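For illustration, here is a naive (unoptimized) MDCT applied over 50%-overlapping sine-windowed blocks, with the base stride as the transform stride. This is a generic textbook MDCT sketch, not the patent's codec: a real encoder would use a fast algorithm and codec-specific window shapes.

```python
import math

def mdct(block):
    """Naive MDCT: 2M windowed time samples -> M frequency coefficients."""
    m = len(block) // 2
    return [sum(block[n] * math.cos(math.pi / m * (n + 0.5 + m / 2) * (k + 0.5))
                for n in range(2 * m))
            for k in range(m)]

def windowed_mdct(signal, base_stride):
    """Apply the windowed transform with the base stride as transform stride:
    sine-windowed blocks of 2*base_stride samples, advancing by base_stride."""
    two_m = 2 * base_stride
    win = [math.sin(math.pi / two_m * (n + 0.5)) for n in range(two_m)]
    return [mdct([signal[start + n] * win[n] for n in range(two_m)])
            for start in range(0, len(signal) - two_m + 1, base_stride)]
```

Each block of 2 × base_stride samples yields base_stride coefficients, so the coefficient rate equals the sample rate despite the overlap (the critical-sampling property of the MDCT).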
The audio signal may be, for example, a multi-channel audio signal, and the downmix signal may be a signal having fewer channels than the multi-channel signal, e.g. a signal obtained upon linear combination of the channels of the multi-channel signal. The downmix signal may be, for example, a mono or stereo downmix of a multi-channel audio signal.
In an example embodiment, the method may include: including metadata in at least one of the N bitstream frames carrying the portions. The metadata may indicate that a complete decodable set of audio data can be obtained from the portions carried by the N bitstream frames.
Each bitstream frame of the N bitstream frames from which a decodable set of audio data can be obtained may, for example, carry metadata identifying it as belonging to a group of N bitstream frames. In another example, one of the bitstream frames may carry metadata identifying all N bitstream frames, while the other N-1 bitstream frames in the group do not necessarily carry such metadata. The bitstream may, for example, comprise other bitstream frames that do not carry such metadata.
The metadata may allow the N bitstream frames to be located in non-predetermined positions relative to each other. The metadata may allow other bitstream frames between the N bitstream frames. The metadata may allow detecting when one or more of the N bitstream frames are lost in the bitstream, e.g. due to splicing or frame dropping.
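One conceivable metadata scheme realizing this (an assumption for illustration, not the patent's actual syntax) tags each bitstream frame with a group identifier, its index within the group, and the group size N, so a decoder can detect a lost member of a group:

```python
def group_complete(frames, group_id):
    """Return True if all N frames of the given group are present.
    Each frame is a dict with assumed keys "group", "index" and "N"."""
    members = [f for f in frames if f["group"] == group_id]
    if not members:
        return False
    n = members[0]["N"]
    # Complete iff indices 0..N-1 all occur exactly once.
    return sorted(f["index"] for f in members) == list(range(n))
```

Because membership is carried explicitly, the N frames need not sit in predetermined positions relative to each other, and a gap left by splicing or frame dropping is detectable as an incomplete group.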
In an example embodiment, the audio bitstream may be associated with a stream of video frames. The method may further comprise: in response to the stream of video frames comprising a video frame of a certain type, encoding a segment of the audio signal that is temporally related to that video frame into a second decodable set of audio data by performing at least signal analysis on that segment using a shortened stride corresponding to the second number of samples of the audio signal. The second decodable set of audio data may correspond to the second frame rate and to the second number of samples of the audio signal per frame. The method may comprise the following step: including, in the bitstream, a bitstream frame carrying the second decodable set of audio data.
A stream of video frames may be spliced, for example, at points adjacent to a frame of a certain type, such as an independently encoded video frame, in order to decode the sequence of spliced video frames on the decoder side. The method of encoding a segment of the audio signal temporally related to the certain type of video frame into a second decodable set of audio data corresponding to the second frame rate and the method of including a bitstream frame carrying the second decodable set of audio data in the bitstream allow for independent decoding of this segment of the audio signal on the decoder side. Thus, the present example embodiment may facilitate decoding of the segment of the audio signal in the following cases: previous or subsequent bitstream frames from the audio bitstream may be lost on the decoder side, e.g. due to splicing the audio-visual stream comprising data of the stream of audio bitstreams and video frames with one or more other streams of audio-visual data.
A segment of the audio signal that is temporally related to a certain type of video frame may for example correspond to a point in time when said certain type of video frame is intended to be reproduced on the display.
The stream of video frames may for example comprise independently coded frames and predictively coded frames (with unidirectional or bidirectional dependency on adjacent frames), and a certain type of video frame may for example be an independently coded video frame.
The method may for example comprise: the presence of a certain type of video frame in a stream of video frames is detected. The presence of a certain type of video frame may be detected, for example, via signaling from a video encoder.
Performing the signal analysis with a shortened stride may, for example, include performing, with the shortened stride: spectral analysis; energy analysis; and/or entropy analysis.
Encoding a segment of an audio signal that is temporally related to a certain type of video frame may for example comprise: applying a windowed transform having a shortened stride as a transform stride; and/or calculating a downmix signal and parameters for a parametric reconstruction of an audio signal from the downmix signal, wherein the parameters are calculated based on a signal analysis with a shortened stride.
In an example embodiment, the method may include: in response to a stream of video frames comprising a certain type of video frames, encoding N consecutive segments of the audio signal into respective decodable sets of audio data by applying at least a signal analysis with a shortened stride to each of the N consecutive segments. The segment temporally associated with the video frame may be one of N consecutive segments. The method can comprise the following steps: bitstream frames carrying respective decodable sets of audio data associated with the N consecutive slices are included in the bitstream.
The bitstream may for example comprise a set of N consecutive bitstream frames carrying respective portions of audio data that can be decoded together. Thus, on the decoder side, N bitstream frames of a bitstream can be decoded at a time. In this example embodiment, the structure of a group of N bitstream frames may be preserved when a video frame of the certain type is present in the associated stream of video frames, for example, regardless of the position of the video frame of the certain type in the stream of video frames relative to the position of the group of N consecutive bitstream frames in the bitstream.
According to an example embodiment, an audio processing system for representing an audio signal by an audio bitstream is provided. The audio processing system includes: an encoding section configured to encode a segment of the audio signal into a decodable set of audio data by performing at least a signal analysis on the segment of the audio signal using a basic stride corresponding to a first number of samples of the audio signal. The decodable set of audio data corresponds to a first frame rate and a first number of samples of the audio signal per frame. The audio processing system includes: a recombination section configured to: dividing the decodable set of audio data into N portions, wherein N ≧ 2; and form N bitstream frames carrying the respective portions. The bitstream frames have a second frame rate corresponding to a second number of samples of the audio signal per bitstream frame. The first number of samples is N times the second number of samples. The recombination section is configured to output a bitstream divided into bitstream frames including the formed N bitstream frames.
According to an example embodiment, there is provided a computer program product comprising a computer readable medium for performing any of the methods of the first aspect.
According to an example embodiment, N = 2 or N = 4 may be considered, i.e., the N bitstream frames may be two bitstream frames or four bitstream frames.
Overview-decoder side
According to a second aspect, the exemplary embodiments propose an audio processing system and a method and a computer program product for reconstructing an audio signal represented by a bitstream. According to a second aspect, the proposed system, method and computer program product may generally share the same features and advantages. Furthermore, the advantages presented above with respect to the features of the system, method and computer program product according to the first aspect are generally valid for the corresponding features of the system, method and computer program product according to the second aspect.
According to an example embodiment, a method of reconstructing an audio signal represented by a bitstream segmented into bitstream frames is provided. The method comprises the following steps: audio data sets carried by N respective bitstream frames are combined into a decodable set of audio data corresponding to a first frame rate and to a first number of samples of the audio signal per frame, where N ≥ 2. The bitstream frames have a second frame rate corresponding to a second number of samples of the audio signal per bitstream frame. The first number of samples is N times the second number of samples. The method comprises the following steps: the decodable set of audio data is decoded into a segment of the audio signal by employing at least signal synthesis based on the decodable set of audio data and using a stride (referred to herein as a base stride) corresponding to the first number of samples of the audio signal.
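The buffering step on the decoder side can be sketched as below. Simple in-order concatenation of the N payloads is an assumption for illustration; the actual format may arrange the portions differently, and the class name is hypothetical:

```python
class FrameBuffer:
    """Collects the portions carried by N consecutive bitstream frames and
    releases one decodable set of audio data once all N have arrived."""
    def __init__(self, n):
        self.n = n
        self.parts = []

    def push(self, payload):
        """Accept one bitstream frame's payload; return the combined
        decodable set when the group is complete, else None."""
        self.parts.append(payload)
        if len(self.parts) == self.n:
            decodable, self.parts = b"".join(self.parts), []
            return decodable
        return None
```

The decoding section would then run signal synthesis with the base stride on each returned decodable set to produce one segment of the audio signal.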
In the stream of audiovisual data, the audio frames and the video frames may be synchronized and may have the same duration, for example to facilitate frame dropping or frame copying in connection with splicing or compensation for clock drift. The audio frame rate may also be increased in order to maintain audio-video synchronicity in the stream of audiovisual data for higher video frame rates. However, while predictive coding is typically used to reduce the bit rate cost of increasing the video frame rate, predictive coding may be less efficient for audio frames because audio content may vary over a shorter time scale and may be associated with a lower degree of correlation between successive frames as compared to video content. Too short an audio frame length should also be avoided, since it may limit the transform step size and thus set a limit on the frequency resolution.
The ability of the method to combine multiple sets of audio data carried by N respective bitstream frames of the second (higher) frame rate into a decodable set of audio data associated with the first (lower) frame rate allows for maintaining audiovisual synchronicity of the higher video frame rate without a corresponding increase in bit rate consumption. More precisely, the bitrate when operating at an increased frame rate according to the present method may be lower than what is required when using conventional audio frames with such a higher frame rate. The method may for example facilitate splicing of streams of audio-visual data and/or facilitate compensation for clock drift.
In particular, using signal synthesis with basic steps instead of synthesis with shorter steps (e.g., corresponding to the second number of samples of the audio signal) reduces the amount of data required to synthesize the audio signal, thereby reducing the bit rate required for transmitting the data.
Each of the plurality of data sets that are combined into the decodable audio data set may be an incomplete audio data set in the sense that: each of the plurality of data sets may be insufficient to decode a segment (or sub-segment) of the audio signal without accessing other sets.
For each of the N bitstream frames, the N bitstream frames may, for example, constitute the minimal set of bitstream frames that includes that bitstream frame and from which audio data can be combined to decode the segment of the audio signal represented by the data carried by that bitstream frame.
By a decodable audio data set is meant an audio data set sufficient for decoding a segment of the audio signal. The decodable set of audio data may be complete in the sense that decoding of a segment of the audio signal may be performed without access to additional audio data.
The combination of the sets of audio data into a decodable set of audio data may for example comprise a concatenation of the sets of data, for example by arranging bits representing the respective sets of data after each other.
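On the decoder side, the combination step can be as simple as concatenating, in frame order, the payloads carried by the N frames; a sketch under the same hypothetical byte-payload frame structure as above:

```python
# Combine the audio data sets carried by N bitstream frames into one
# decodable set by concatenating their payloads in frame order.
# The frame representation (index/payload dicts) is hypothetical.

def combine(frames):
    ordered = sorted(frames, key=lambda f: f["index"])
    return b"".join(f["payload"] for f in ordered)

decodable = combine([{"index": 1, "payload": b"CD"},
                     {"index": 0, "payload": b"AB"}])
# decodable now holds the bits of the two partial sets arranged after
# each other, ready for decoding with the basic stride.
```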
By signal synthesis using a basic stride corresponding to the first number of samples of the audio signal is meant that signal synthesis is performed for a segment of the audio signal corresponding to a certain number of samples, and that, when the next segment of the audio signal is to be reconstructed, the signal synthesis process produces output for a range that has been shifted by a number of samples equal to the basic stride.
The signal synthesis with the basic stride may, for example, be applied directly to the decodable set of audio data, or indirectly, e.g. to audio data or signals obtained by processing the decodable set of audio data.
It should be understood that if the audio signal is a multi-channel signal, the basic step may correspond to a first number of samples of the audio signal on a per-channel basis, rather than being the sum of the samples of the individual channels.
The step of decoding the decodable set of audio data may for example comprise a plurality of sub-steps, one or more of which may comprise signal synthesis in basic steps.
The N bitstream frames may for example conform to an audio format in the following sense: the bitstream frames may carry a payload and metadata that, at the primary stream level, conforms to an audio format, for example, as provided in a Moving Picture Experts Group (MPEG) primary stream. It should be understood that although conforming to an audio format in this sense, the payload and at least some of the metadata carried by the bitstream frames may, for example, be of a different type and/or format than those in audio frames known in the art.
The bitstream provided by the encoder may for example already be spliced with another bitstream before reaching the decoder side. For example, one or more of the N bitstream frames may be lost, for example, in the bitstream received at the decoder side. In some example embodiments, the audio processing method may thus comprise detecting whether one or more of N bitstream frames from which the audio data is assembled into a complete decodable set are lost in the bitstream. The method may for example comprise: in response to detecting that one or more of the N bitstream frames are lost in the bitstream, error concealment is applied. Error concealment may for example comprise replacing audio data carried by one or more received bitstream frames with zeros and optionally applying a fade-out and/or fade-in.
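A minimal sketch of the error concealment described above, assuming PCM float segments: if any frame of a group is lost, the group cannot be decoded, so the output is replaced by silence while a linear fade-out is applied to the previously decoded audio. The function shape and names are illustrative only.

```python
# Simple error concealment: when one of the N frames of a group is
# missing (None), replace the group's output with zeros and fade out
# the last successfully decoded segment to avoid an audible click.

def conceal(group_payloads, segment_len, last_segment):
    if any(p is None for p in group_payloads):       # a frame was lost
        faded = [s * (1 - i / len(last_segment))     # linear fade-out
                 for i, s in enumerate(last_segment)]
        return faded + [0.0] * segment_len           # then output silence
    return None  # group complete: decode normally elsewhere

out = conceal([b"a", None, b"c", b"d"], segment_len=4,
              last_segment=[1.0, 1.0])
```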
In an example embodiment, decoding the decodable set of audio data may comprise: applying a windowed transform with the base stride as a transform stride; and/or performing a parametric reconstruction of segments of the audio signal in basic steps based on a downmix signal and associated parameters obtained from the decodable set of audio data.
The windowed transform may, for example, be a harmonic transform, such as an inverse modified discrete cosine transform (inverse MDCT).
The audio signal may for example be a multi-channel audio signal and the downmix signal may be a signal having fewer channels than the multi-channel signal, e.g. a signal obtained as a linear combination of the channels of the multi-channel signal. The downmix signal may for example be a mono or stereo downmix of the multi-channel audio signal. The decodable set of audio data may for example comprise the downmix signal and the associated parameters for the parametric reconstruction of the segment of the audio signal. Alternatively, the decodable set of audio data may comprise data representing the downmix signal and the associated parameters, e.g. in quantized form, from which the downmix signal and the associated parameters may be derived.
In an example embodiment, the N bitstream frames may be N consecutive bitstream frames from which audio data sets are combined into a decodable audio data set. Using consecutive frames to carry sets of audio data that can be combined into a decodable set of audio data can facilitate decoding of an audio signal and can reduce the need for metadata to identify bitstream frames for which data is to be combined into a decodable set of audio data. Using consecutive frames to carry sets of audio data that can be combined into a decodable set of audio data can reduce the need for buffered data for performing decoding.
In an example embodiment, the method may further comprise: a set of bitstream frames from which the incomplete set of audio data is to be combined into a decodable set of audio data is determined based on metadata carried by at least some of the bitstream frames in the bitstream. The metadata may be carried by all bitstream frames, or by one or more bitstream frames in terms of a set of N bitstream frames, for example, for identifying the set of N bitstream frames. Embodiments are also envisaged in which the bitstream comprises further frames carrying metadata identifying the set of N frames, and the N bitstream frames themselves may not carry such metadata.
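One way the metadata-based determination could work is sketched below: each frame's metadata carries its position within a group and the group size N, and the decoder scans for complete groups. The metadata field names are invented for illustration; the patent does not prescribe a specific format.

```python
# Determine, from per-frame metadata, which bitstream frames together
# carry the audio data sets to be combined into a decodable set.
# Field names ("pos_in_group", "group_size") are hypothetical.

def find_groups(frames):
    groups, current = [], []
    for f in frames:
        if f["pos_in_group"] == 0:              # start of a new group
            current = []
        current.append(f["id"])
        if f["pos_in_group"] == f["group_size"] - 1:
            groups.append(current)              # group is complete
    return groups

stream = [{"id": 10, "pos_in_group": 0, "group_size": 2},
          {"id": 11, "pos_in_group": 1, "group_size": 2},
          {"id": 12, "pos_in_group": 0, "group_size": 2},
          {"id": 13, "pos_in_group": 1, "group_size": 2}]
groups = find_groups(stream)
```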
In an example embodiment, the method may further comprise: detecting whether the bitstream frames carry a decodable set of audio data corresponding to a second frame rate; and decoding the decodable set of audio data corresponding to the second frame rate into a segment of the audio signal by employing at least signal synthesis based on the decodable set of audio data corresponding to the second frame rate and using shortened steps corresponding to the second number of samples.
Bitstream frames carrying independently decodable sets of audio data may be used, for example, to facilitate decoding of the bitstream after splicing and/or after frame dropping/copying. The ability of the method in this example embodiment to decode with shortened stride may make it compatible with bitstream formats that facilitate synchronization of audio and video frames.
Decoding the decodable set of audio data corresponding to the second frame rate may, for example, comprise: applying a windowed transform with the shortened stride as a transform stride; and/or performing a parametric reconstruction of the segments of the audio signal in a shortened step based on the downmix signal and the associated parameters obtained from the second decodable set of audio data.
The detection of whether a bitstream frame carries a decodable set of audio data corresponding to the second frame rate may be based, for example, on metadata carried by the bitstream frame, or on the absence of a particular type of metadata in the bitstream frame.
In an example embodiment, decoding the decodable set of audio data corresponding to the second frame rate may include: providing a delay such that decoding of a set of N consecutive bitstream frames of the second frame rate is completed at the same point in time as it would be if the bitstream frames in the set of N bitstream frames each carried an audio data set to be combined into a decodable audio data set. The present example embodiment facilitates smooth transitions between segments of an audio signal reconstructed using the basic stride and segments of an audio signal reconstructed using the shortened stride, and may improve the playback quality perceived by a listener.
In an example embodiment, the delay may be provided by buffering at least one decodable set of audio data corresponding to the second frame rate or buffering at least one segment of the audio signal. That is, the delay may be provided by: buffering one or more decodable sets of audio data corresponding to the second frame rate before performing the signal synthesis, or buffering one or more segments of the audio signal reconstructed from the one or more decodable sets of audio data corresponding to the second frame rate after performing the signal synthesis.
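The delay can be realized as a simple FIFO line; a sketch under the assumption that segments decoded with the shortened stride must be held back by N-1 frame periods so that they complete at the same frame as a full group of N frames would:

```python
# Sketch of the buffering delay: shortened-stride segments pass through
# an (N-1)-slot FIFO, aligning their output with the basic-stride case.
from collections import deque

class DelayLine:
    def __init__(self, n):
        self.fifo = deque([None] * (n - 1))  # N-1 frames of delay
    def push(self, segment):
        self.fifo.append(segment)
        return self.fifo.popleft()           # None until the line is primed

d = DelayLine(n=4)
outputs = [d.push(s) for s in ["s0", "s1", "s2", "s3"]]
# The first segment emerges only at the fourth frame period, exactly
# when a group of N = 4 combined frames would have finished decoding.
```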
In an example embodiment, the bitstream may be associated with a stream of video frames having a frame rate consistent with the second frame rate. In this example embodiment, the frame rate of the bitstream frames may coincide with the frame rate of the video frames, which may facilitate splicing and/or synchronization of the stream of audiovisual data comprising the bitstream and the stream of video frames with other streams of audiovisual data.
In an example embodiment, decoding a segment of an audio signal based on a decodable set of audio data corresponding to a first frame rate may include: receiving quantized spectral coefficients corresponding to a decodable set of audio data corresponding to a first frame rate; performing inverse quantization followed by a frequency-time conversion, thereby obtaining a representation of an intermediate audio signal; performing at least one processing step in the frequency domain on the intermediate audio signal; and changing the sampling rate of the processed audio signal to a target sampling frequency, thereby obtaining a time-domain representation of the reconstructed audio signal.
The target sampling frequency may be a predefined amount that can be configured by a user or system designer independent of the properties of the incoming bitstream (e.g., frame rate).
The inverse quantization may be performed with predetermined quantization levels (or reconstruction levels, or reconstruction points). The quantization levels may be selected on the encoder side based on psychoacoustic considerations, e.g. such that the quantization noise for a given frequency (or frequency band) does not exceed a masking threshold. Since the masking threshold is frequency-dependent, it is preferable, from a coding-efficiency point of view, for the encoder side to select quantization levels that are non-uniform with respect to frequency. As a consequence, quantization and inverse quantization are typically tied to the particular physical sampling frequency for which the best output is produced.
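A toy illustration of frequency-dependent (non-uniform) quantization: each band gets its own step size, e.g. coarser where the masking threshold tolerates more noise, and inverse quantization maps each index back to a reconstruction level. All values here are invented for illustration.

```python
# Non-uniform quantization per frequency band: band 2 has a coarser
# step than band 0, so the same input coefficient reconstructs with a
# larger error there. Step sizes are illustrative, not normative.

band_steps = [0.5, 1.0, 2.0]          # quantizer step size per band

def quantize(coeffs, steps):
    return [round(c / s) for c, s in zip(coeffs, steps)]

def dequantize(indices, steps):       # reconstruction level = index * step
    return [i * s for i, s in zip(indices, steps)]

idx = quantize([1.3, 1.3, 1.3], band_steps)
rec = dequantize(idx, band_steps)
# The reconstruction error grows with the band's step size.
```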
At least one processing step may be associated with Spectral Band Replication (SBR) and/or Dynamic Range Control (DRC), for example.
When performing at least one processing step in the frequency domain, the method may comprise: performing a time-frequency conversion, e.g. performed by a Quadrature Mirror Filter (QMF) analysis filterbank, to obtain a frequency representation of the intermediate audio signal; and performing an additional frequency-to-time conversion, e.g. performed by a QMF synthesis filter bank, for converting the processed audio signal back to the time domain.
In an example embodiment, the method may accept a bitstream associated with at least two different values for the second frame rate, but associated with a common value for the second number of samples per frame. The respective values of the second frame rate may differ by at most 5%. The frequency-time conversion may be performed in the following functional components: the functional component is configured to use a windowing transformation with a common predetermined value for the base stride as a transformation stride for at least two different values of the second frame rate.
In an audiovisual data stream, the audio frame rate may be adapted to the video frame rate (e.g., may coincide with the video frame rate), for example, to facilitate audio-video synchronization and/or splicing. Thus, the ability of the method in this example embodiment to accept audio bit streams with different frame rates may facilitate audio-video synchronization and/or splicing of streams of audiovisual data.
In a critically sampled system, the physical sampling frequency equals the number of spectral coefficients contained in an audio frame divided by the physical duration of that frame. The functional component performing the inverse quantization and the frequency-time conversion does not need to know the physical duration of the coefficients in the decodable set of audio data, but only that the coefficients belong to the same decodable set of audio data. Since the values of the second frame rate differ by at most 5%, the resulting internal sampling frequency will vary very little (in physical units) and the resampling factor used in the final sample rate conversion will be close to 1. Thus, non-constancy of the internal sampling frequency does not generally result in any perceptible degradation of the reconstructed audio signal. In other words, a slight up- or down-sampling of the intermediate audio signal (which is produced to be optimal at a sampling frequency slightly different from the target sampling frequency) is not psychoacoustically significant. In particular, as long as the deviation is limited, some amount of mismatch between the expected physical sampling frequency of the functional component performing the inverse quantization and/or the frequency-time conversion and the physical sampling frequency to which any component downstream thereof is tuned may be tolerable.
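A worked numerical example of this claim, with illustrative figures: assume 2048 spectral coefficients per decodable set and a target sampling frequency of 48 kHz. Two decodable-set frame rates a few percent apart then yield internal sampling frequencies whose ratio to the target stays close to 1.

```python
# Internal sampling frequency of a critically sampled system:
#   internal_fs = frame_rate_of_decodable_set * coefficients_per_set
# The resampling factor of the final sample rate conversion is the
# ratio internal_fs / target_fs. Numbers are illustrative only.

coeffs_per_set = 2048                 # spectral coefficients per decodable set
target_fs = 48000.0                   # target physical sampling frequency

def resample_ratio(first_frame_rate):
    internal_fs = first_frame_rate * coeffs_per_set
    return internal_fs / target_fs

r1 = resample_ratio(23.4375)          # 23.4375 Hz * 2048 = 48000 Hz
r2 = resample_ratio(24.0)             # 24 Hz * 2048 = 49152 Hz
# Both ratios stay within a few percent of 1, so the final resampling
# introduces no perceptible degradation.
```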
According to an example embodiment, an audio processing system is provided for reconstructing an audio signal represented by a bitstream segmented into bitstream frames. The audio processing system includes: a buffer configured to aggregate sets of audio data carried by N respective bitstream frames into one decodable set of audio data corresponding to a first frame rate and corresponding to a first number of samples of the audio signal per frame, where N ≧ 2. The bitstream frames have a second frame rate corresponding to a second number of samples of the audio signal per bitstream frame. The first number of samples is N times the second number of samples. The system comprises a decoding section configured to decode a decodable set of audio data into a segment of an audio signal by employing at least signal synthesis based on the decodable set of audio data and using a basic stride corresponding to a first number of samples of the audio signal.
According to an example embodiment, there is provided a computer program product comprising a computer readable medium for performing any of the methods of the second aspect.
According to an example embodiment, N = 2 or N = 4 may be used, i.e., the N bitstream frames may be two bitstream frames or four bitstream frames.
Overview-transcoding
According to a third aspect, example embodiments are directed to an audio processing system and a method and computer program product for transcoding an audio bitstream representing an audio signal. According to the third aspect, the proposed system, method and computer program product may generally share the same features and advantages. Furthermore, the advantages presented above for the features of the system, method and computer program product according to the first and/or second aspect are generally valid for the corresponding features of the system, method and computer program product according to the third aspect.
According to an example embodiment, a method of transcoding an audio bitstream representing an audio signal is provided. The bitstream comprises a sequence of decodable sets of audio data corresponding to a first frame rate and a first number of samples of the audio signal per frame. The method comprises the following steps: extracting a decodable set of audio data from the bitstream; dividing the decodable set of audio data into N portions, wherein N ≧ 2; and forming N bitstream frames carrying the respective portions. The bitstream frames have a second frame rate corresponding to a second number of samples of the audio signal per bitstream frame. The first number of samples is N times the second number of samples. Thereafter, a bitstream is output, which is divided into bitstream frames including the formed N bitstream frames. Optionally, a step of processing the decodable set of audio data is performed before the step of dividing it into N portions. Depending on the nature of the processing, this may require that the audio data first be decoded into a transform representation or a waveform representation.
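An end-to-end sketch of the transcoding method under the same byte-payload assumption as before: each decodable set extracted from the incoming bitstream is divided into N portions, and N output bitstream frames are emitted per set. The list-of-dicts bitstream model is illustrative only.

```python
# Transcoding sketch: extract each decodable set from the input
# bitstream, divide it into N portions, and emit N bitstream frames
# per set at N times the input frame rate. No real container format
# is implied by this representation.

def transcode(decodable_sets, n):
    out = []
    for set_index, data in enumerate(decodable_sets):
        part_len = -(-len(data) // n)             # ceil division
        for i in range(n):
            out.append({"group": set_index,
                        "part": i,
                        "payload": data[i * part_len:(i + 1) * part_len]})
    return out

frames = transcode([b"AABB", b"CCDD"], n=2)
# Two decodable sets become four bitstream frames; each pair of frames
# concatenates back into the original set.
```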
The ability of the present method to provide N bitstream frames of a second (higher) frame rate together with carrying a decodable set of audio data associated with a first (lower) frame rate allows for maintaining audiovisual synchronicity of the higher video frame rate without a corresponding increase in bit rate consumption. The bitrate when operating at an increased frame rate according to the present method may be lower than what is required when using conventional audio frames with such a higher frame rate. Thus, the method may for example facilitate splicing of streams of audio-visual data and/or facilitate compensation for clock drift.
The method may for example comprise dividing the processed version of the decodable set of audio data into N parts.
According to an example embodiment, an audio processing system for transcoding an audio bitstream representing an audio signal is provided, wherein the bitstream comprises a sequence of decodable sets of audio data corresponding to a first frame rate and a first number of samples of the audio signal per frame. The audio processing system includes: a receiving section configured to extract a decodable set of audio data from the bitstream; and, optionally, a processing section configured to process the decodable set of audio data. The audio processing system includes: a recombination section configured to: divide the decodable set of audio data into N portions, wherein N ≧ 2; and form N bitstream frames carrying the respective portions. The bitstream frames have a second frame rate corresponding to a second number of samples of the audio signal per bitstream frame. The first number of samples is N times the second number of samples. The recombination section is configured to output a bitstream that is divided into bitstream frames including the formed N bitstream frames.
According to an example embodiment, a computer program product comprising a computer readable medium for performing any of the methods of the third aspect is provided.
According to an example embodiment, N = 2 or N = 4 may be used, i.e., the N bitstream frames may be two bitstream frames or four bitstream frames.
Overview-computer-readable medium
According to a fourth aspect, example embodiments are directed to a computer-readable medium representing an audio signal. The advantages presented above with respect to the features of the system, the method and the computer program product according to the first, second and/or third aspect are generally valid for the corresponding features of the computer readable medium according to the fourth aspect.
According to an example embodiment, a computer-readable medium representing an audio signal and partitioned into bitstream frames is provided. In a computer readable medium, N bitstream frames carry respective sets of audio data that can be combined into one decodable set of audio data corresponding to a first frame rate and corresponding to a first number of samples of an audio signal per frame, where N ≧ 2. The decodable set of audio data can be decoded into a segment of the audio signal by employing at least signal synthesis based on the decodable set of audio data and using a basic stride corresponding to a first number of samples of the audio signal. The bitstream frames have a second frame rate corresponding to a second number of samples of the audio signal per bitstream frame. The first number of samples is N times the second number of samples.
Together with carrying a decodable set of audio data associated with the first (lower) frame rate, the N bitstream frames of the second (higher) frame rate allow for maintaining audiovisual synchronicity of the higher video frame rate without a corresponding increase in bit rate consumption. More specifically, the bitrate when operating at an increased frame rate according to the present computer-readable medium may be lower than what is required when using conventional audio frames with such a higher frame rate. Thus, the computer readable medium may for example facilitate splicing of streams of audio-visual data and/or facilitate compensation for clock drift.
The N bitstream frames carrying the respective audio data sets that can be combined into one decodable audio data set may for example be N consecutive bitstream frames.
In an example embodiment, at least one bitstream frame of the N bitstream frames may carry metadata indicative of a set of bitstream frames from which the sets of audio data are combined into a decodable set of audio data.
In an example embodiment, the computer readable medium may further comprise bitstream frames carrying a second set of audio data that can be decoded into the segments of the audio signal by employing at least signal synthesis based on the second set of audio data and using shortened steps corresponding to a second number of samples of the audio signal.
According to the present example embodiment, bitstream frames carrying independently decodable sets of audio data may be used, for example to facilitate decoding of the bitstream after splicing and/or after frame dropping/copying.
According to an example embodiment, N = 2 or N = 4 may be used, i.e., the N bitstream frames may be two bitstream frames or four bitstream frames.
V. example embodiments
Fig. 1 is a general block diagram of an audio processing system 100 for representing an audio signal X as an audio bitstream B according to an example embodiment.
The audio processing system 100 includes an encoding section 110 and a re-organizing section 120. The encoding part 110 encodes the segments of the audio signal X into one decodable audio data set D by performing at least a signal analysis on the segments of the audio signal X in basic steps (basic stride) corresponding to a first number of samples of the audio signal X.
By performing the signal analysis in basic steps corresponding to a first number of samples of the audio signal X is meant that the signal analysis is performed within an analysis window of a certain number of samples of the audio signal X and that the analysis window is moved by the same number of samples as the basic steps when the next segment of the audio signal X is to be encoded. The signal analysis may for example be performed with overlapping analysis windows, in which case the analysis windows may be longer than the basic stride. In another example, the length of the analysis window may coincide with the base stride.
In this context, the audio signal X is taken as an example of a multi-channel audio signal. In the present exemplary embodiment, the encoding section 110 applies a windowing transform (e.g., a Modified Discrete Cosine Transform (MDCT)) to a segment of the audio signal X with the basic steps as transform steps to provide a frequency domain representation of the segment of the audio signal X. In the frequency domain, the encoding section 110 then calculates a downmix signal (e.g., a mono or stereo downmix) as a linear combination of the respective channels of the audio signal X. The encoding section 110 also determines parameters for parametric reconstruction of the multi-channel audio signal X from the downmix signal. In the present exemplary embodiment, the decodable set of audio data D comprises the downmix signal and the parameters for the parametric reconstruction.
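The downmix computation described above can be illustrated as follows: a mono downmix formed as a linear combination of the channels of a multi-channel frequency-domain frame. The downmix gains and toy coefficient values are illustrative, not normative.

```python
# Mono downmix as a per-coefficient linear combination of the channels
# of a multi-channel signal in the frequency domain. Gains are
# illustrative; real encoders choose them per channel configuration.

def mono_downmix(channels, gains):
    # channels: list of per-channel coefficient lists of equal length
    return [sum(g * ch[k] for g, ch in zip(gains, channels))
            for k in range(len(channels[0]))]

left  = [1.0, 2.0]
right = [3.0, 4.0]
dmx = mono_downmix([left, right], gains=[0.5, 0.5])
```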
The parameter may be determined, for example, based on signal analysis of the frequency domain representation. The signal analysis may use a basic stride, i.e. it may use the same stride as the windowed transform. The signal analysis may for example comprise a calculation of the energy and/or covariance of the channels of the multi-channel audio signal X.
The following embodiments are also conceivable: wherein the parameters for the parameter reconstruction are determined based on an analysis of the signal with a different step size than the windowed transform. For example, the following embodiments may be envisaged: wherein the windowing transform uses a transform step shorter than the base step, and wherein the parameters for the parameter reconstruction are determined based on an analysis of the signal with the base step.
The decodable set of audio data D corresponds to a first frame rate (e.g., 30fps) and to a first number of samples of the audio signal per frame. That is, the decodable data set D represents a first number of samples of the audio signal and corresponds to frames that conform to the first frame rate.
The recombination section 120 divides the decodable audio data set D into N portions D1, D2, ..., DN, e.g. by dividing the decodable set of audio data D into N at least substantially equally sized portions D1, D2, ..., DN. N may, for example, be 2 or 4, or may be any integer greater than or equal to 2.
In the present example embodiment, the decodable set of audio data D is a frequency domain representation of the first number of samples. Thus, when the decodable set of audio data D is divided into equally sized portions D1, D2, ..., DN, these portions D1, D2, ..., DN may comprise respective subsets of the frequency domain representation, which do not necessarily correspond to any particular subset of the first number of samples of the audio signal. The portions D1, D2, ..., DN are therefore incomplete audio data sets in the following sense: none of the portions D1, D2, ..., DN can be decoded without access to all N portions D1, D2, ..., DN.
The recombination section 120 forms N bitstream frames F1, F2, ..., FN carrying the respective portions D1, D2, ..., DN. Since the N bitstream frames F1, F2, ..., FN together represent one decodable set of audio data D, the bitstream frames F1, F2, ..., FN have a second frame rate which is N times the frame rate of the decodable set of audio data D. Similarly, although a single bitstream frame F1, F2, ..., FN does not itself represent a particular set of samples of the audio signal X, the N bitstream frames F1, F2, ..., FN together represent the decodable set of audio data D and thus correspond to the second number of samples per frame, where the first number of samples per frame is N times the second number of samples per frame.
The recombination section 120 outputs a bitstream B divided into bitstream frames, including the formed N bitstream frames F1, F2, ..., FN as N consecutive bitstream frames.
In addition to the portions D1, D2, ..., DN of the audio data, the bitstream frames F1, F2, ..., FN also include respective metadata μ1, μ2, ..., μN indicating that the decodable set of audio data D can be obtained from the portions D1, D2, ..., DN carried by the bitstream frames F1, F2, ..., FN. The metadata μ1, μ2, ..., μN of each of the bitstream frames F1, F2, ..., FN may, for example, indicate which portion of the decodable set of audio data D is carried by that bitstream frame, and may optionally also indicate which bitstream frames carry the other N-1 portions of the decodable audio data set D.
Fig. 3 and 4 show examples of bitstreams provided by the audio processing system 100 according to an example embodiment described with reference to fig. 1.
The bitstream B output by the audio processing system 100 shown in fig. 1 may be associated with a stream of video frames. In fig. 3, the bitstream B is exemplified by a stream A1 of bitstream frames, shown together with an associated stream V1 of video frames, where time t increases to the right.
The stream V1 of video frames comprises predictively encoded video frames P (including frames that depend only on previous frames, and/or so-called bi-directional frames that depend on both previous and subsequent frames) and independently encoded video frames I. The stream A1 of bitstream frames comprises bitstream frames having the same frame rate and the same duration as the video frames, to facilitate splicing and/or synchronization with other streams of audiovisual data.
In the present example embodiment, N = 4, and the audio processing system 100 provides bitstream frames in groups 310 of four bitstream frames 311, 312, 313, 314 carrying respective portions of a decodable set of audio data. However, if the stream V1 of video frames is to be spliced with another stream of video frames, splicing may be performed at a point adjacent to an independently encoded video frame I so that the video frames can be decoded after splicing. To maintain audio-video synchronization, the stream A1 of bitstream frames may be spliced at the same splice point as the stream V1 of video frames.
To facilitate decoding of the bitstream after splicing with another bitstream, the audio processing system 100 encodes a segment of the audio signal X temporally related to an independently encoded video frame I into a decodable audio data set by applying the signal analysis with a shortened stride corresponding to a second number of samples of the audio signal X, which may, for example, correspond to the duration of the independently encoded video frame I.
Similar to encoding using signal analysis with basic steps, encoding using signal analysis with shortened steps may include: a windowing transform (e.g. MDCT) is applied with a shortened step as a transform step and parameters for a parametric reconstruction of a segment of the audio signal are determined from the downmix signal, wherein the parameters are determined based on a signal analysis with a shortened step. The decodable set of audio data associated with the shortened stride may include a downmix signal and a parameter.
The audio processing system 100 provides a bitstream frame 321 carrying a decodable set of audio data that can be decoded independently, without access to audio data carried by other bitstream frames. In the stream a1 of bitstream frames, the bitstream frame 321 is followed by another group 330 of four bitstream frames 331, 332, 333, 334 carrying respective portions of a decodable set of audio data.
The audio processing system 100 may for example comprise an additional encoding section (not shown in fig. 1) configured to encode a segment of the audio signal X by applying a signal analysis in shortened steps. Alternatively, the encoding section 110 may be operable to use the shortened stride, and the re-organizing section 120 may be operable to include in the bitstream B the bitstream frame 321 carrying the decodable set of audio data associated with the shortened stride.
In the example described with reference to fig. 3, the presence of independently coded video frames I at certain locations may be handled by: bitstream frames 321 carrying decodable sets of audio data associated with shortened transform steps are included between groups 310, 330 of four bitstream frames. However, in at least some example scenarios, the location of independently encoded video frame I may be a priori unknown, and/or independently encoded video frame I may appear at a location that does not match the location between the group of four bitstream frames. Such a scenario is illustrated in fig. 4.
In fig. 4, the bitstream B and the associated stream of video frames are illustrated by a further stream a2 of bitstream frames and a further stream V2 of video frames, with time t increasing to the right.
Similar to the example scenario described with reference to fig. 3, bitstream frames are provided by the audio processing system 100 in groups 410, 430 of four bitstream frames. However, once an independently encoded video frame I is detected in the stream V2 of video frames, the audio processing system 100 encodes each of four consecutive bitstream frames 421, 422, 423, 424 using the shortened stride. Depending on the position of the independently encoded video frame I in the stream V2 of video frames, it may correspond to any one of the four bitstream frames 421, 422, 423, 424 provided using the shortened stride. In this way, the independently encoded bitstream frame 423 can be provided at the location in bitstream a2 corresponding to the independently encoded video frame I, regardless of how the independently encoded video frame I is positioned relative to the groups of four bitstream frames encoded using the basic stride. In this scenario, the bitstream frames are organized into groups of four bitstream frames regardless of whether independently encoded video frames I are present in the stream V2 of video frames.
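The group-wise switching behaviour of fig. 4 can be sketched as follows. This is an illustrative model, not the patent's implementation: a group of N bitstream frames is encoded with the shortened stride whenever an independently encoded video frame "I" falls anywhere inside that group, and with the basic stride otherwise. The string encoding of the video stream is an assumption.

```python
def plan_group_strides(video_frames, n=4):
    """video_frames: string of 'P'/'I' labels, one per video frame.
    Returns one stride label ('basic' or 'short') per group of n frames."""
    plan = []
    for i in range(0, len(video_frames), n):
        group = video_frames[i:i + n]
        plan.append("short" if "I" in group else "basic")
    return plan

# An I-frame in the second group of four forces that whole group to the
# shortened stride, as in fig. 4:
plan = plan_group_strides("PPPPPPIPPPPP", 4)
```

With the shortened stride used for the whole group, an independently decodable bitstream frame is guaranteed to line up with the video I-frame wherever it falls inside the group.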
Fig. 2 is a flow diagram of a method 200 of representing an audio signal by an audio bitstream, according to an example embodiment. The method 200 is illustrated herein as performed by the audio processing system 100 described with reference to fig. 1.
The method 200 includes detecting 210 whether a current frame of the stream V1 of video frames is independently encoded. If the current frame is not independently encoded, indicated by N in the flow chart, the method 200 continues with the following operations: encoding 220 a segment of the audio signal X into a decodable set of audio data D by using at least a signal analysis with the basic stride; dividing 230 the decodable set of audio data D into N parts D1, D2, ..., DN; forming 240 N bitstream frames F1, F2, ..., FN carrying the respective portions D1, D2, ..., DN; and outputting 250 the formed bitstream frames F1, F2, ..., FN as part of the bitstream B. The method 200 then returns to encoding other segments of the audio signal X.
If, on the contrary, the current frame of the stream V1 of video frames is independently encoded, indicated by Y in the flow chart, the method 200 continues with the following operations: encoding 260 a segment of the audio signal X into a decodable set of audio data by using at least a signal analysis with the shortened stride; and including 270 a bitstream frame carrying this second decodable set of audio data in the bitstream B. The method 200 then returns to encoding other segments of the audio signal X.
Fig. 5 is a general block diagram of an audio processing system 500 for reconstructing an audio signal represented by a bitstream, according to an example embodiment.
In the present exemplary embodiment, the bitstream B is exemplified by the bitstream B output by the audio processing system 100 described with reference to fig. 1. Example embodiments are also described below in which the audio processing system 500 receives the following bitstreams: the bitstream has been modified, for example by frame dropping and/or frame copying, before being received by the audio processing system 500.
The audio processing system 500 includes a buffer 510 and a decoding section 520. The buffer 510 combines the audio data sets D1, D2, ..., DN carried by N respective bitstream frames F1, F2, ..., FN into one decodable set of audio data D corresponding to a first frame rate (e.g. 30 fps) and a first number of samples of the audio signal per frame. As described with reference to fig. 1, the bitstream frames F1, F2, ..., FN have a second frame rate corresponding to a second number of samples of the audio signal per bitstream frame, wherein the first number of samples is N times the second number of samples. The buffer 510 uses the metadata μ1, μ2, ..., μN carried by the bitstream frames to identify the frames F1, F2, ..., FN carrying the audio data sets D1, D2, ..., DN to be combined.
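The combining step performed by the buffer 510 can be sketched as below. Field names (`index`, `group_size`, `payload`) are illustrative assumptions standing in for the metadata μ1, μ2, ..., μN; the sketch also shows the error checks a buffer would need before handing a decodable set to the decoding section.

```python
def combine_group(frames):
    """Combine the audio data sets carried by one group of N bitstream frames
    into a single decodable set of audio data, using each frame's metadata to
    restore the original order and to detect incomplete groups."""
    n = frames[0]["meta"]["group_size"]
    if len(frames) != n:
        raise ValueError("incomplete group: %d of %d frames" % (len(frames), n))
    ordered = sorted(frames, key=lambda f: f["meta"]["index"])
    if [f["meta"]["index"] for f in ordered] != list(range(n)):
        raise ValueError("missing or duplicated frame in group")
    return b"".join(f["payload"] for f in ordered)
```

An incomplete group (e.g. after splicing or frame dropping) raises an error here; the description's concealment strategy would then take over instead of normal decoding.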
The decoding section 520 decodes the decodable set of audio data D into a segment of the audio signal X by employing, based on the decodable set of audio data D, signal synthesis with the basic stride described with reference to fig. 1, i.e. the basic stride corresponding to the first number of samples of the audio signal X. The audio processing system 500 outputs the reconstructed version X̂ of the audio signal X.
As described with reference to fig. 1, the audio signal X is a multi-channel audio signal and the decodable set of audio data D comprises a downmix signal and associated upmix parameters for a parametric reconstruction of the audio signal X. The decoding section 520 performs the parametric reconstruction of the frequency domain representation of the segment of the audio signal X using the basic stride. Then, the decoding section 520 applies a windowing transform (e.g., inverse MDCT) having the basic step as a transform step for obtaining a time-domain representation of the segment of the audio signal X.
Embodiments are also envisaged in which the parametric reconstruction is performed with a different stride than the windowed transform. For example, embodiments are conceivable in which the windowed transform uses a transform stride shorter than the basic stride, while the parametric reconstruction is performed with the basic stride.
As described with reference to figs. 3 and 4, the bitstream B may include bitstream frames carrying decodable sets of audio data associated with the shortened stride, i.e. sets of audio data that can be decoded on their own, without combining audio data from several bitstream frames. The audio processing system 500 may for example comprise an additional decoding section (not shown in fig. 5) configured to decode such decodable sets of audio data using the shortened stride. Alternatively, the decoding section 520 may be operable to decode such decodable sets of audio data using the shortened stride, and the buffer 510 may be operable to deliver such a decodable set of audio data to the decoding section 520 without combining it with audio data from other bitstream frames.
To allow smooth switching between segments of the audio signal X decoded using the shortened stride and segments decoded using the basic stride, the audio processing system 500 may, for example, provide a delay such that decoding of a set of N consecutive bitstream frames having the second frame rate (i.e., decoded using the shortened stride) is completed at the same time as if those bitstream frames each carried an audio data set that had to be combined into one decodable set of audio data before decoding. The buffer 510 may provide such a delay, for example, by buffering decodable sets of audio data before transmitting them to the decoding section 520. Alternatively, the decoding section 520 may provide the delay by buffering reconstructed segments of the audio signal X before providing them as output.
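The alignment the delay achieves can be expressed as a small scheduling rule. The sketch below is purely illustrative arithmetic (the function and its arguments are assumptions): every bitstream frame at the second frame rate is released only at the next basic-stride boundary, so a group of N shortened-stride frames finishes at the same instant as one basic-stride segment would.

```python
def release_time(frame_index, n, basic_frame_duration):
    """Earliest output time for the audio decoded from bitstream frame
    `frame_index` (0-based, counted at the second frame rate), aligned to
    basic-stride boundaries of duration `basic_frame_duration` seconds."""
    group = frame_index // n
    return (group + 1) * basic_frame_duration
```

All N frames of a group share one release time, which is exactly the property the description requires for seamless switching.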
As described with reference to fig. 1, the audio bitstream B output by the audio processing system 100 may have been modified, for example by splicing with other bitstreams or by frame dropping/frame copying before being received by the audio processing system 500 described with reference to fig. 5.
As described with reference to fig. 3, the bitstream frames may have the same duration as the corresponding video frames in the stream V1 of associated video frames. The use of such synchronized audio stream a1 and video stream V1 in a stream of audiovisual data facilitates splicing and/or synchronization of the audiovisual streams.
A device or component performing the splicing need not consider which types of bitstream frames are arranged before or after the splice point. Instead, the audio processing system 500 may be used to handle the situation in which, e.g. due to splicing and/or frame dropping/copying, a bitstream frame from a group of N bitstream frames F1, F2, ..., FN carrying respective portions D1, D2, ..., DN of a decodable set of audio data D is lost in the received bitstream B. The audio processing system 500 may detect the loss of a bitstream frame based, for example, on the metadata μ1, μ2, ..., μN carried by the respective bitstream frames F1, F2, ..., FN.
Once the loss of a bitstream frame required for decoding is detected, the audio processing system 500 may continue to decode the audio signal X using, for example, an error concealment strategy. The concealment strategy may for example include replacing the audio data carried by the bitstream frames of an incomplete group (i.e., a group from which one or more bitstream frames are lost in the received bitstream) with silence (e.g., by using zeros as frequency-domain coefficients of the audio signal X). The audio processing system 500 may, for example, use fade-out and/or fade-in to provide a smoother transition, as perceived by a listener, between decodable segments of the audio signal X and the silence substituted for non-decodable segments.
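A minimal time-domain sketch of this concealment idea follows. It assumes equal-length segments and a simple linear fade (the fade length and the sample-level details are assumptions, not the patent's specification): lost segments become silence, the segment before a loss fades out, and the segment after it fades in.

```python
def conceal(segments, lost, fade=4):
    """segments: list of equal-length sample lists (each longer than `fade`);
    lost: set of segment indices replaced by silence.
    Returns concealed segments with linear fades around each silent segment."""
    out = []
    for i, seg in enumerate(segments):
        if i in lost:
            out.append([0.0] * len(seg))  # substitute silence for a lost segment
            continue
        seg = list(seg)
        if (i + 1) in lost:  # fade out into the coming silence
            for k in range(fade):
                seg[-fade + k] *= 1.0 - (k + 1) / fade
        if (i - 1) in lost:  # fade in after the silence
            for k in range(fade):
                seg[k] *= (k + 1) / fade
        out.append(seg)
    return out
```

The fades remove the audible discontinuity a hard cut to zeros would otherwise produce at the group boundaries.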
In some example embodiments, the audio processing system 500 may be configured to: accepting a bitstream associated with at least two different predetermined values for the second frame rate, but associated with a common value for the second number of samples per frame. In table 1, this is exemplified by the values 59.940fps and 60.000fps for the second frame rate and the common value 768 for the second number of samples per frame. Such frame rates may be useful for audio streams associated with video streams having these frame rates.
Table 1 (reconstructed from the values cited in the surrounding text; the original table appears as an image in this rendering):

Video frame rate    Second frame rate    Second number of samples per frame    N    First frame rate
59.940 fps          59.940 fps           768                                   2    29.970 fps
60.000 fps          60.000 fps           768                                   2    30.000 fps
119.880 fps         119.880 fps          384                                   4    29.970 fps
120.000 fps         120.000 fps          384                                   4    30.000 fps
In this example, the two values of the second frame rate differ by less than 5%. The audio processing system 500 may be configured to use the same value of the basic stride when decoding the audio signal X for both of these values of the second frame rate. As described in the applicant's co-pending, not yet published patent application PCT/EP2014/056848 (see in particular the part of section "II. Example embodiments" where fig. 1 and table 1 are described), the variation in the internal sampling frequency of the decoding section 520 caused by the difference in the second frame rate is typically so small that the audio processing system 500 can still provide an acceptable playback quality of the reconstructed audio signal X as perceived by a listener. Another example in table 1 of second frame rate values differing by less than 5% is given by the values 119.880 fps and 120.000 fps, with the common value 384 for the second number of samples per frame.
As shown in table 1, if the video frame rate is 60.000 fps, one decodable set of audio data having a first frame rate of 30.000 fps can be represented using N = 2 bitstream frames with a second frame rate of 60.000 fps. Similarly, if the video frame rate is 59.940 fps, N = 2 bitstream frames with the second frame rate 59.940 fps may be used to represent one decodable set of audio data with the first frame rate 29.970 fps. Table 1 also shows that if the video frame rate is 120.000 fps, one decodable set of audio data having a first frame rate of 30.000 fps can be represented using N = 4 bitstream frames with a second frame rate of 120.000 fps. Similarly, if the video frame rate is 119.880 fps, N = 4 bitstream frames with the second frame rate 119.880 fps may be used to represent one decodable set of audio data with the first frame rate 29.970 fps.
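The arithmetic relating these rates can be checked numerically. The helper below is illustrative (not from the patent): it derives the first frame rate and the basic stride from the video frame rate, the second number of samples per bitstream frame, and N, following the relationships stated above.

```python
from fractions import Fraction

def derive_rates(video_fps, samples_per_bitstream_frame, n):
    """Second frame rate matches the video frame rate (one bitstream frame per
    video frame); the first frame rate is 1/N of it; the basic stride is N
    times the per-frame sample count."""
    second_rate = Fraction(video_fps)
    first_rate = second_rate / n
    basic_stride = n * samples_per_bitstream_frame
    return first_rate, second_rate, basic_stride
```

Note that both the 60 fps / N = 2 and the 120 fps / N = 4 rows of table 1 yield the same first frame rate (30 fps) and the same basic stride (1536 samples), which is why one basic-stride decoder configuration serves both cases.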
Fig. 6 is a flow diagram of an audio processing method 600 of reconstructing an audio signal represented by a bitstream, according to an example embodiment. The method 600 is illustrated herein by a method performed by the audio processing system 500 described with reference to fig. 5.
The method 600 includes detecting 610 whether a received bitstream frame carries a decodable set of audio data corresponding to the second frame rate.
If the received bitstream frame does not carry a decodable set of audio data corresponding to the second frame rate, indicated by N in the flow chart, the method 600 proceeds by: combining 620 the audio data sets D1, D2, ..., DN carried by N respective bitstream frames F1, F2, ..., FN into one decodable set of audio data D corresponding to the first frame rate and the first number of samples of the audio signal per frame; and decoding 630 the decodable set of audio data D into a segment of the audio signal X by employing, based on the decodable set of audio data D, at least a signal synthesis with the basic stride corresponding to the first number of samples of the audio signal X.
Conversely, if the received bitstream frame carries a decodable set of audio data corresponding to the second frame rate, indicated by Y in the flow chart, the method 600 proceeds by: decoding 640 the decodable set of audio data corresponding to the second frame rate into a segment of the audio signal X by employing at least a signal synthesis with the shortened stride corresponding to the second number of samples of the audio signal X. The method 600 then returns to detecting 610 whether the next received bitstream frame carries a decodable set of audio data.
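The per-frame dispatch of method 600 can be sketched as follows. The frame layout, metadata flag, and the two decoder callbacks are illustrative assumptions: a frame marked independently decodable is decoded at once with the shortened stride, while other frames are buffered until all N parts of a group have arrived and can be decoded with the basic stride.

```python
def process_frame(frame, pending, decode_short, decode_basic):
    """Handle one received bitstream frame. Returns a decoded segment, or
    None while a group is still incomplete."""
    meta = frame["meta"]
    if meta.get("independently_decodable"):
        return decode_short(frame["payload"])  # shortened-stride path (step 640)
    pending.append(frame)
    if len(pending) == meta["group_size"]:
        ordered = sorted(pending, key=lambda f: f["meta"]["index"])
        payload = b"".join(f["payload"] for f in ordered)  # step 620
        pending.clear()
        return decode_basic(payload)  # basic-stride path (step 630)
    return None
```

The `pending` list plays the role of the buffer 510; it is emptied each time a complete group has been combined and decoded.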
Fig. 7 is a general block diagram of an audio processing system 700 for transcoding an audio bitstream representing an audio signal, according to an example embodiment.
The audio processing system 700 comprises a receiving section 710, an optional processing section 720 and a re-organizing section 730. The receiving section 710 receives a bitstream B1 comprising a sequence of decodable sets of audio data D corresponding to a first frame rate and a first number of samples of the audio signal per frame as described with reference to fig. 1, for example. The receiving section 710 extracts a decodable audio data set D from the bitstream B1.
The (optional) processing section 720 processes the decodable set of audio data D. Depending on the nature of the processing, this may require first decoding the audio data into a transform-domain or waveform representation; the processing section 720 may then perform signal synthesis, processing, and signal analysis in sequence.
The re-organizing section 730 divides the processed decodable set of audio data D into N parts D1, D2, ..., DN and forms N bitstream frames F1, F2, ..., FN carrying the respective portions D1, D2, ..., DN. In the present example embodiment, the re-organizing section 730 performs the same operations as the re-organizing section 120 in the audio processing system 100 described with reference to fig. 1. Thus, the bitstream frames F1, F2, ..., FN have a second frame rate corresponding to a second number of samples of the audio signal per bitstream frame. The re-organizing section 730 outputs a bitstream B2, which is partitioned into bitstream frames including the N formed bitstream frames F1, F2, ..., FN.
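The pass-through transcoding path (with the optional processing section 720 omitted) can be sketched as below. This is an illustrative model under assumed data structures: each decodable set from the incoming bitstream is split into N parts and re-emitted as N bitstream frames at N times the frame rate, with payload bytes unchanged.

```python
def transcode(decodable_sets, n):
    """Re-frame a sequence of decodable sets of audio data (one per frame of
    the incoming bitstream B1) into a bitstream-B2-style sequence of groups
    of n bitstream frames at n times the frame rate."""
    out = []
    for d in decodable_sets:
        part_len = -(-len(d) // n)  # ceiling division
        for i in range(n):
            out.append({"meta": {"index": i, "group_size": n},
                        "payload": d[i * part_len:(i + 1) * part_len]})
    return out
```

Because only the framing changes, a receiver can recover each original decodable set by concatenating the N payloads of a group, exactly as the buffer 510 does.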
The bitstream B2 output by the audio processing system 700 may, for example, correspond to the bitstream B output by the audio processing system 100 described with reference to fig. 1. The bitstream B1 received by the audio processing system 700 may be, for example, a 30fps audio bitstream provided by an audio encoder as is known in the art.
It should be appreciated that the bitstream B described with reference to figs. 1 and 5, and the stream a1 of bitstream frames described with reference to fig. 3, are examples of computer-readable media representing the audio signal X and being partitioned into bitstream frames, according to example embodiments.
It should also be understood that N may be any integer greater than 1.
Equivalents, extensions, substitutions and others
Although this disclosure describes and depicts specific example embodiments, the present invention is not limited to these specific examples. Modifications and variations may be made to the above-described exemplary embodiments without departing from the scope of the present invention, which is limited only by the following claims.
In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. Any reference signs appearing in the claims shall not be construed as limiting the scope thereof.
The apparatus and methods disclosed above may be implemented as software, firmware, hardware, or a combination thereof. In a hardware implementation, the division of tasks between functional units mentioned in the above description does not necessarily correspond to a physical unit division; rather, one physical component may have multiple functions, and one task may be performed in a distributed manner by several physical components that cooperate. Some or all of the components may be implemented as software executed by a digital processor, signal processor, or microprocessor, or as hardware or application specific integrated circuits. Such software can be distributed on computer readable media including computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as is well known to those skilled in the art.

Claims (23)

1. A method (200) of representing an audio signal (X) as an audio bitstream (B), the method comprising:
encoding (220) a segment of the audio signal into one decodable set of audio data (D) by performing at least a signal analysis on the segment of the audio signal with a basic stride corresponding to a first number of samples of the audio signal, the decodable set of audio data corresponding to a first frame rate and to a first number of samples of the audio signal per frame;
dividing (230) the decodable set of audio data into N parts (D1, D2, ..., DN), wherein N ≥ 2;
forming (240) N bitstream frames (F1, F2, ..., FN) carrying the respective portions, wherein the N bitstream frames represent the decodable set of audio data and correspond, per frame (F1, F2, ..., FN), to a second number of samples of the audio signal, wherein the first number of samples per frame is N times the second number of samples per frame, and wherein the N bitstream frames have a second frame rate that is N times the first frame rate; and
outputting (250) a bitstream, the bitstream being partitioned into bitstream frames comprising the N bitstream frames previously formed,
wherein the method further comprises:
in response to a stream of video frames comprising video frames of a particular type (I), encoding (260) a segment of the audio signal temporally related to the video frames into a second decodable set of audio data by: performing at least a signal analysis on a segment of the audio signal that is time-correlated to the video frame with a shortened step corresponding to the second number of samples of the audio signal, the second decodable set of audio data corresponding to the second frame rate and a second number of samples of the audio signal per frame; and
including (270) in the bitstream a bitstream frame carrying the second decodable set of audio data, the bitstream frame being independently decodable into a segment or sub-segment of the audio signal.
2. The method of claim 1, wherein performing signal analysis comprises performing at least one of the following at the base stride:
spectral analysis,
energy analysis,
entropy analysis.
3. The method of claim 1 or 2, wherein encoding the segment of the audio signal comprises at least one of:
applying a windowing transformation, the windowing transformation having the base stride as a transformation stride;
calculating a downmix signal and parameters for a parametric reconstruction of the audio signal from the downmix signal, wherein the parameters are calculated based on the signal analysis.
4. The method of claim 1 or 2, further comprising:
including metadata (μ1, μ2, ..., μN) in at least one of the N bitstream frames carrying the portions, the metadata indicating that a complete decodable set of audio data can be obtained from the portions carried by the N bitstream frames.
5. The method according to claim 1 or 2, comprising:
in response to a stream of video frames comprising video frames of said type, encoding N consecutive segments of said audio signal into respective decodable sets of audio data by applying at least a signal analysis to each of said N consecutive segments in said shortened steps, wherein said segment that is time-dependent on said video frames is one of said N consecutive segments; and
including bitstream frames carrying respective decodable sets of audio data associated with the N consecutive segments in the bitstream.
6. An audio processing system (100) for representing an audio signal (X) by an audio bitstream (B), the audio processing system comprising:
an encoding section (110) configured to: encoding a segment of the audio signal into one decodable set of audio data (D) by performing at least a signal analysis on the segment of the audio signal with a basic stride corresponding to a first number of samples of the audio signal, the decodable set of audio data corresponding to a first frame rate and to a first number of samples of the audio signal per frame;
another encoding section configured to: in response to a stream of video frames comprising video frames of a particular type (I), encoding (260) a segment of the audio signal temporally related to the video frames into a second decodable set of audio data by: performing at least a signal analysis on the segment of the audio signal that is time-correlated to the video frame with a shortened step corresponding to a second number of samples of the audio signal, the second decodable set of audio data corresponding to a second frame rate and a second number of samples of the audio signal per frame;
a recombination section (120) configured to:
dividing the decodable set of audio data into N parts (D1, D2, ..., DN), wherein N ≥ 2;
forming N bitstream frames (F1, F2, ..., FN) carrying the respective portions, wherein the N bitstream frames represent the decodable set of audio data and correspond, per frame (F1, F2, ..., FN), to a second number of samples of the audio signal, wherein the first number of samples per frame is N times the second number of samples per frame, and wherein the bitstream frames have a second frame rate that is N times the first frame rate; and
outputting a bitstream segmented into bitstream frames comprising the previously formed N bitstream frames, and including (270) in the bitstream the bitstream frame carrying the second decodable set of audio data, which can be decoded independently into a segment or sub-segment of the audio signal.
7. A method (600) of reconstructing an audio signal (X) represented by a bitstream (B) segmented into bitstream frames, the method comprising:
combining audio data sets (D1, D2, ..., DN) carried by N respective bitstream frames (F1, F2, ..., FN) into one decodable audio data set (D) corresponding to a first frame rate and to a first number of samples of the audio signal per frame, where N ≥ 2, wherein the N bitstream frames represent the decodable audio data set and correspond, per frame (F1, F2, ..., FN), to a second number of samples of the audio signal, wherein the first number of samples per frame is N times the second number of samples per frame, and wherein the bitstream frames have a second frame rate that is N times the first frame rate; and
decoding (630) the decodable set of audio data into a segment of the audio signal by employing at least signal synthesis based on the decodable set of data and using a basic stride corresponding to a first number of samples of the audio signal,
wherein the method further comprises:
detecting (610) whether a bitstream frame carries a decodable set of audio data corresponding to the second frame rate; and
decoding (640) the decodable set of audio data corresponding to the second frame rate into a segment of the audio signal by employing at least signal synthesis based on the decodable set of audio data corresponding to the second frame rate and using shortened steps corresponding to the second number of samples, wherein the first number of samples is N times the second number of samples.
8. The method of claim 7, wherein decoding the decodable set of audio data comprises at least one of:
applying a windowing transformation, the windowing transformation having the base stride as a transformation stride;
performing a parametric reconstruction of a segment of the audio signal with the basic stride based on a downmix signal and associated parameters obtained from the decodable set of audio data.
9. The method according to claim 7 or 8, wherein the N bitstream frames from which the sets of audio data are combined into the decodable sets of audio data are N consecutive bitstream frames.
10. The method of claim 7 or 8, further comprising:
based on metadata (μ) carried by at least some of the bitstream frames in the bitstream1,μ2,...,μN) A set of bitstream frames is determined, wherein an incomplete set of audio data is combined into the decodable set of audio data according to the set of bitstream frames.
11. The method of claim 7 or 8, wherein decoding the decodable set of audio data corresponding to the second frame rate comprises providing a delay such that decoding of a set of N consecutive bitstream frames having the second frame rate is completed simultaneously as if the bitstream frames in the set of N bitstream frames each carried a set of audio data that is required to be combined into a decodable set of audio data.
12. The method of claim 11, wherein the delay is provided by buffering at least one decodable set of audio data corresponding to the second frame rate or buffering at least one segment of the audio signal.
13. The method of claim 7 or 8, wherein the bitstream is associated with a stream of video frames (V1, V2) having a frame rate coinciding with the second frame rate.
14. The method of claim 7 or 8, wherein decoding a segment of the audio signal based on a decodable set of audio data corresponding to the first frame rate comprises:
receiving quantized spectral coefficients corresponding to a decodable set of audio data corresponding to the first frame rate;
performing inverse quantization followed by a frequency-time conversion, thereby obtaining a representation of the intermediate audio signal;
performing at least one processing step in the frequency domain on the intermediate audio signal; and
the sampling rate of the processed audio signal is changed to a target sampling frequency, thereby obtaining a time-domain representation of the reconstructed audio signal.
15. The method of claim 14, accepting bitstreams associated with at least two different values for the second frame rate, but associated with a common value for a first number of samples per frame, the respective values of the second frame rate differing by at most 5%, wherein the frequency-time conversion is performed in a functional component configured to use a windowed transform with a common predetermined value for a base stride as a transform stride for the at least two different values of the second frame rate.
16. An audio processing system (500) for reconstructing an audio signal (X) represented by a bitstream (B) segmented into bitstream frames, the audio processing system comprising:
a buffer (510) configured to combine audio data sets (D1, D2, ..., DN) carried by N respective bitstream frames (F1, F2, ..., FN) into one decodable audio data set (D) corresponding to a first frame rate and to a first number of samples of the audio signal per frame, where N ≥ 2, wherein the N bitstream frames represent the decodable audio data set and correspond, per frame (F1, F2, ..., FN), to a second number of samples of the audio signal, wherein the first number of samples per frame is N times the second number of samples per frame, and wherein the bitstream frames have a second frame rate that is N times the first frame rate; and
a decoding section (520) configured to: decoding the decodable set of audio data into a segment of the audio signal by employing at least signal synthesis based on the decodable set of audio data and using a basic stride corresponding to the first number of samples of the audio signal,
decoding a decodable set of audio data corresponding to the second frame rate into a segment of the audio signal by employing at least signal synthesis with a shortened stride corresponding to a second number of samples, wherein the first number of samples is N times the second number of samples.
17. A computer readable medium storing program instructions for performing the method of any one of claims 1-5 and 7-15.
18. A computer-readable medium (B, a1, a2) representing an audio signal (X) and being partitioned into bitstream frames, wherein:
N bitstream frames (F1, F2, ..., FN) carry respective audio data sets (D1, D2, ..., DN), said respective sets of audio data being capable of being combined into one decodable set of audio data (D), said one decodable set of audio data corresponding to a first frame rate and to a first number of samples of the audio signal per frame, wherein N ≥ 2; wherein the N bitstream frames represent the decodable set of audio data and correspond, per frame (F1, F2, ..., FN), to a second number of samples of the audio signal, wherein the first number of samples per frame is N times the second number of samples per frame;
enabling decoding of the decodable set of audio data into a segment of the audio signal by employing at least signal synthesis based on the decodable set of audio data and using a basic stride corresponding to the first number of samples of the audio signal; and
the bitstream frames have a second frame rate that is N times the first frame rate, wherein the computer-readable medium further comprises bitstream frames carrying a second set of audio data that can be decoded into segments of the audio signal by employing at least signal synthesis based on the second set of audio data and using a shortened stride corresponding to a second number of samples of the audio signal, wherein the first number of samples is N times the second number of samples, and the bitstream frames can be independently decoded into segments or sub-segments of the audio signal.
19. The computer-readable medium of claim 18, wherein at least one of the N bitstream frames carries metadata (μ1, μ2, ..., μN) indicating the set of bitstream frames whose audio data sets are to be combined into the decodable audio data set.
20. The method of any one of claims 1, 2, 7, and 8, wherein N = 2 or N = 4.
21. The audio processing system according to claim 6 or 16, wherein N = 2 or N = 4.
22. The computer-readable medium of claim 17, wherein N = 2 or N = 4.
23. The computer-readable medium of claim 18 or 19, wherein N = 2 or N = 4.
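The buffering scheme recited in claims 16 and 18 can be illustrated with a minimal sketch. All names (`FrameBuffer`, `BitstreamFrame`) and the payload layout are hypothetical, chosen only for illustration; the claims do not prescribe any concrete format. The sketch collects the audio data sets D1, ..., DN carried by N consecutive bitstream frames into one decodable set D, and keeps the basic and shortened strides in the claimed ratio N.

```python
from dataclasses import dataclass
from typing import List, Optional

N = 4                                     # bitstream frames per decodable set (N >= 2)
SAMPLES_PER_SET = 2048                    # first number of samples (basic stride)
SAMPLES_PER_FRAME = SAMPLES_PER_SET // N  # second number of samples (shortened stride)


@dataclass
class BitstreamFrame:
    """One bitstream frame F_i, carrying a slice D_i of one decodable audio data set."""
    payload: bytes


class FrameBuffer:
    """Combines the payloads of N bitstream frames into one decodable set D."""

    def __init__(self, n: int):
        self.n = n
        self.pending: List[bytes] = []

    def push(self, frame: BitstreamFrame) -> None:
        self.pending.append(frame.payload)

    def pop_decodable_set(self) -> Optional[bytes]:
        # A decodable set is complete only once N frames have arrived;
        # until then the decoder cannot use the basic stride.
        if len(self.pending) < self.n:
            return None
        d = b"".join(self.pending[: self.n])
        del self.pending[: self.n]
        return d
```

A decoder built on this buffer would process one set of `SAMPLES_PER_SET` samples per pop (the basic stride), while a frame-by-frame decoder of the second-frame-rate stream would advance by `SAMPLES_PER_FRAME` samples per frame (the shortened stride), N times as often.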
CN201580057771.7A 2014-10-24 2015-10-23 Encoding and decoding of audio signals Active CN107112024B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201462068187P 2014-10-24 2014-10-24
US62/068,187 2014-10-24
PCT/EP2015/074623 WO2016062869A1 (en) 2014-10-24 2015-10-23 Encoding and decoding of audio signals

Publications (2)

Publication Number Publication Date
CN107112024A CN107112024A (en) 2017-08-29
CN107112024B true CN107112024B (en) 2020-07-14

Family

ID=54345511

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201580057771.7A Active CN107112024B (en) 2014-10-24 2015-10-23 Encoding and decoding of audio signals

Country Status (8)

Country Link
US (1) US10304471B2 (en)
EP (1) EP3210206B1 (en)
JP (1) JP6728154B2 (en)
KR (1) KR102474541B1 (en)
CN (1) CN107112024B (en)
ES (1) ES2709274T3 (en)
RU (1) RU2708942C2 (en)
WO (1) WO2016062869A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3107096A1 (en) 2015-06-16 2016-12-21 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Downscaled decoding
CN109215667B (en) * 2017-06-29 2020-12-22 华为技术有限公司 Time delay estimation method and device
EP3704863B1 (en) * 2017-11-02 2022-01-26 Bose Corporation Low latency audio distribution
US20200020342A1 (en) * 2018-07-12 2020-01-16 Qualcomm Incorporated Error concealment for audio data using reference pools
US11416208B2 (en) * 2019-09-23 2022-08-16 Netflix, Inc. Audio metadata smoothing
US11540030B2 (en) * 2019-12-12 2022-12-27 SquadCast, Inc. Simultaneous recording and uploading of multiple audio files of the same conversation and audio drift normalization systems and methods
WO2022179406A1 (en) * 2021-02-26 2022-09-01 腾讯科技(深圳)有限公司 Audio transcoding method and apparatus, audio transcoder, device, and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0801392A2 (en) * 1996-04-08 1997-10-15 Pioneer Electronic Corporation Information record medium, apparatus for recording the same and apparatus for reproducing the same
CN1484756A (en) * 2001-11-02 2004-03-24 ���µ�����ҵ��ʽ���� Coding device and decoding device
CN1902697A (en) * 2003-11-11 2007-01-24 科斯莫坦股份有限公司 Time-scale modification method for digital audio signal and digital audio/video signal, and variable speed reproducing method of digital television signal by using the same method
CN101548294A (en) * 2006-11-30 2009-09-30 杜比实验室特许公司 Extracting features of video & audio signal content to provide reliable identification of the signals
CN101652810A (en) * 2006-09-29 2010-02-17 Lg电子株式会社 Apparatus for processing mix signal and method thereof
CN103621101A (en) * 2011-07-01 2014-03-05 杜比实验室特许公司 Synchronization and switchover methods and systems for an adaptive audio system
WO2014161990A1 (en) * 2013-04-05 2014-10-09 Dolby International Ab Audio encoder and decoder

Family Cites Families (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6009236A (en) * 1994-09-26 1999-12-28 Mitsubishi Denki Kabushiki Kaisha Digital video signal record and playback device and method for giving priority to a center of an I frame
US6137834A (en) * 1996-05-29 2000-10-24 Sarnoff Corporation Method and apparatus for splicing compressed information streams
US6262776B1 (en) * 1996-12-13 2001-07-17 Microsoft Corporation System and method for maintaining synchronization between audio and video
US7031348B1 (en) * 1998-04-04 2006-04-18 Optibase, Ltd. Apparatus and method of splicing digital video streams
US7091968B1 (en) * 1998-07-23 2006-08-15 Sedna Patent Services, Llc Method and apparatus for encoding a user interface
US6754271B1 (en) * 1999-04-15 2004-06-22 Diva Systems Corporation Temporal slice persistence method and apparatus for delivery of interactive program guide
US7096487B1 (en) * 1999-10-27 2006-08-22 Sedna Patent Services, Llc Apparatus and method for combining realtime and non-realtime encoded content
US6651252B1 (en) * 1999-10-27 2003-11-18 Diva Systems Corporation Method and apparatus for transmitting video and graphics in a compressed form
US7254824B1 (en) * 1999-04-15 2007-08-07 Sedna Patent Services, Llc Encoding optimization techniques for encoding program grid section of server-centric interactive programming guide
US20060093045A1 (en) * 1999-06-29 2006-05-04 Roger Anderson Method and apparatus for splicing
US7464394B1 (en) * 1999-07-22 2008-12-09 Sedna Patent Services, Llc Music interface for media-rich interactive program guide
DE60034364D1 (en) * 1999-10-27 2007-05-24 Sedna Patent Services Llc MULTIPLE VIDEO DRIVES USING SLICE BASED CODING
US9094727B1 (en) * 1999-10-27 2015-07-28 Cox Communications, Inc. Multi-functional user interface using slice-based encoding
US6678332B1 (en) * 2000-01-04 2004-01-13 Emc Corporation Seamless splicing of encoded MPEG video and audio
US7668712B2 (en) * 2004-03-31 2010-02-23 Microsoft Corporation Audio encoding and decoding with intra frames and adaptive forward error correction
US7471337B2 (en) * 2004-06-09 2008-12-30 Lsi Corporation Method of audio-video synchronization
SE0402651D0 (en) * 2004-11-02 2004-11-02 Coding Tech Ab Advanced methods for interpolation and parameter signaling
US20070071091A1 (en) 2005-09-26 2007-03-29 Juh-Huei Lay Audio and video compression for wireless data stream transmission
US7809018B2 (en) * 2005-12-16 2010-10-05 Coding Technologies Ab Apparatus for generating and interpreting a data stream with segments having specified entry points
US7885819B2 (en) * 2007-06-29 2011-02-08 Microsoft Corporation Bitstream syntax for multi-process audio decoding
EP2144230A1 (en) * 2008-07-11 2010-01-13 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Low bitrate audio encoding/decoding scheme having cascaded switches
KR20100061908A (en) * 2008-12-01 2010-06-10 엘지전자 주식회사 Image display device, image transmitting device, method for transmitting image and recording medium
TW201032597A (en) * 2009-01-28 2010-09-01 Nokia Corp Method and apparatus for video coding and decoding
EP2476113B1 (en) * 2009-09-11 2014-08-13 Nokia Corporation Method, apparatus and computer program product for audio coding
US20110293021A1 (en) * 2010-05-28 2011-12-01 Jayant Kotalwar Prevent audio loss in the spliced content generated by the packet level video splicer
US20130141643A1 (en) * 2011-12-06 2013-06-06 Doug Carson & Associates, Inc. Audio-Video Frame Synchronization in a Multimedia Stream
ES2629195T3 (en) * 2013-01-21 2017-08-07 Dolby Laboratories Licensing Corporation Encoding and decoding of a bit sequence according to a confidence level
CN116665683A (en) * 2013-02-21 2023-08-29 杜比国际公司 Method for parametric multi-channel coding
US9959875B2 (en) * 2013-03-01 2018-05-01 Qualcomm Incorporated Specifying spherical harmonic and/or higher order ambisonics coefficients in bitstreams

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Kim J et al., "Frame Splitting Scheme for Error-robust Audio Streaming over Packet-Switching Networks", Communications Society, 2008-02-28, full text *

Also Published As

Publication number Publication date
US10304471B2 (en) 2019-05-28
BR112017007833A2 (en) 2017-12-26
JP2017532603A (en) 2017-11-02
RU2017117896A3 (en) 2019-08-13
EP3210206A1 (en) 2017-08-30
RU2708942C2 (en) 2019-12-12
EP3210206B1 (en) 2018-12-05
WO2016062869A1 (en) 2016-04-28
KR20170076671A (en) 2017-07-04
JP6728154B2 (en) 2020-07-22
RU2017117896A (en) 2018-11-26
KR102474541B1 (en) 2022-12-06
ES2709274T3 (en) 2019-04-15
US20170243595A1 (en) 2017-08-24
CN107112024A (en) 2017-08-29

Similar Documents

Publication Publication Date Title
CN107112024B (en) Encoding and decoding of audio signals
AU2006228821B2 (en) Device and method for producing a data flow and for producing a multi-channel representation
US8527282B2 (en) Method and an apparatus for processing a signal
US9756448B2 (en) Efficient coding of audio scenes comprising audio objects
EP1667110B1 (en) Error reconstruction of streaming audio information
EP1895511B1 (en) Audio encoding apparatus, audio decoding apparatus and audio encoding information transmitting apparatus
JP6190942B2 (en) Audio encoder and decoder
EP1895512A2 (en) Multi-channel encoder
EP3044877B1 (en) System aspects of an audio codec
US8571875B2 (en) Method, medium, and apparatus encoding and/or decoding multichannel audio signals
EP3499503A1 (en) Method and encoder and decoder for sample-accurate representation of an audio signal
WO2003061299A1 (en) Audio coding
KR20170087529A (en) Audio encoder and decoder
TWI631554B (en) Encoding device and method, decoding device and method, and program
US9111524B2 (en) Seamless playback of successive multimedia files
BR112017007833B1 (en) AUDIO PROCESSING METHOD AND SYSTEM FOR REPRESENTING AN AUDIO SIGNAL IN THE FORM OF AN AUDIO BIT STREAM, AUDIO PROCESSING METHOD AND SYSTEM FOR RECONSTRUCTING AN AUDIO SIGNAL REPRESENTED BY A BIT STREAM SEGMENTED INTO FRAMES OF BITS, AND COMPUTER READABLE MEDIA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant