EP4133477A1

EP4133477A1 - Method of processing an input audio signal for generating an output audio signal having the reverberation characteristics of a specific acoustic environment

Info

Publication number: EP4133477A1
Application number: EP20717647.0A
Authority: EP
Inventors: Eftychios PAPOULIS
Original assignee: Ask Industries GmbH
Current assignee: Ask Industries GmbH
Priority date: 2020-04-07
Filing date: 2020-04-07
Publication date: 2023-02-15
Also published as: WO2021204363A1

Abstract

A method of processing an input audio signal for generating an output audio signal having the reverberation characteristics of a specific acoustic environment represented by its pre-recorded Room-Impulse-Response.

Description

Method of processing an input audio signal for generating an output audio signal having the reverberation characteristics of a specific acoustic environment

The invention relates to a method of processing an input audio signal for generating an output audio signal having the reverberation characteristics of a specific acoustic environment represented by its pre-recorded Room-Impulse-Response (“RIR”).

Audio signal processing generally comprises processing input audio signals, i.e. audio signals which are input to a digital audio signal processing unit, having specific input audio signal properties so as to generate output audio signals, i.e. audio signals which are output by the audio signal processing unit, having specific output audio signal properties at least partly different from the input audio signal properties. Specifically, audio signal processing may comprise modifying one or more properties of an input audio signal so as to obtain an output audio signal having one or more properties which are modified relative to the respective properties of the input audio signal.

One specific aim in audio signal processing comprises processing an input audio signal for generating an output audio signal having the reverberation characteristics of a specific acoustic environment, e.g. a specific room or venue. A respective room or venue can form part of a specific building.

Specifically, audio signal processing comprising real-time convolution-based artificial reverberation using pre-recorded RIR data from a real acoustic environment, e.g. a real room, often requires more computing power, i.e. more computing operations, such as Floating-Point Operations Per Input Sample (“FLOPIS”), more memory than is actually available in a common digital audio signal processing unit, and more memory throughput, i.e. more memory access operations, than is feasible in real-time. This particularly applies to digital audio signal processing units of vehicle audio systems.

The required number of computing operations and the required memory size typically depend on the physical size of the respective acoustic environment and on the sampling rate used during the recording of the RIR data and the play-back of the audio signal to be reverberated. For the sampling rates typically used for audio, and for the reverberation times of large acoustic environments, such as large buildings or venues, e.g. cathedrals, the length of the RIR data typically turns out to be very large.

As an example, for a sampling rate of 48 kHz and a reverberation time of 4 seconds, i.e. a reverberation time that the interior of a large acoustic environment, such as a large building or venue, could typically exhibit, the monophonic RIR Finite Impulse Response (“FIR”) model may have 192 x 10³ samples (48 x 10³ samples per second times 4 seconds). A direct real convolution would thus require 384 x 10³ FLOPIS (one multiplication and one addition per RIR sample), and 384 x 10³ memory locations to store the 192 x 10³ RIR samples plus the 192 x 10³ most recent samples of the input signal. For a stereophonic configuration (stereo input signal and stereo RIR FIR model), which delivers more natural-sounding audio, twice as many computing operations and twice as much memory would be needed. These numbers are typically prohibitive, even for modern digital audio signal processing units.
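The figures in this example follow from simple arithmetic. The following is a minimal sketch, assuming one multiplication and one addition per RIR sample per input sample and one memory location per stored sample; the function name and its parameters are hypothetical and serve illustration only.

```python
def direct_convolution_cost(sample_rate_hz, reverb_time_s, channels=1):
    """Estimate the cost of direct FIR convolution with a full-length RIR.

    Returns (rir_samples_per_channel, flopis, memory_locations).
    """
    # Length of the RIR FIR model of one channel, in samples.
    rir_samples = int(round(sample_rate_hz * reverb_time_s))
    # Assumption: one multiply and one add per RIR sample per input sample.
    flopis = 2 * rir_samples * channels
    # RIR samples plus an equally long history of the input signal.
    memory_locations = 2 * rir_samples * channels
    return rir_samples, flopis, memory_locations


mono = direct_convolution_cost(48_000, 4.0)
stereo = direct_convolution_cost(48_000, 4.0, channels=2)
```

Under these assumptions the monophonic case gives 192 x 10³ RIR samples, 384 x 10³ FLOPIS and 384 x 10³ memory locations, and the stereophonic case doubles the last two figures.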

Hence, there exists a need for improved methods for processing an input audio signal for generating an output audio signal having the reverberation characteristics of a specific acoustic environment, particularly with respect to the computing power and memory requirements for the respective digital audio signal processing unit.

In fact, diverse approaches for real-time artificial reverberation are known from prior art. As a first example, the Uniform Partition Overlap-Save (“UPOLS”) method is a widely used uniform partition algorithm for real-time artificial reverberation. UPOLS may significantly reduce the computing operations compared with direct convolution; however, it doubles the required memory because UPOLS works with complex data. As a second example, the Non-Uniform Partition Overlap-Save (“NUPOLS”) method is known. NUPOLS is a non-uniform partition algorithm for real-time artificial reverberation. It has the same memory requirements as UPOLS but reduces the computing operations even further. However, NUPOLS is a multi-thread algorithm which can be very challenging to implement and cannot be used at all when the real-time processing needs to be done in a single thread.

It is thus the object of the present invention to provide an improved method of processing an input audio signal for generating an output audio signal having the reverberation characteristics of a specific acoustic environment, particularly with respect to the computing power and memory requirements for the respective digital audio signal processing unit and with respect to the ease of implementation.

The object is achieved by a method of processing an input audio signal for generating an output audio signal having the reverberation characteristics of a specific acoustic environment according to Claim 1. The Claims depending on Claim 1 refer to possible embodiments of the method of Claim 1.

A first aspect of the invention refers to a method of processing an input audio signal for generating an output audio signal having the reverberation characteristics of a specific acoustic environment represented by its pre-recorded Room-Impulse-Response (“RIR”). The method thus enables, by processing an input audio signal, generating an output audio signal having the reverberation characteristics of a specific acoustic environment, e.g. a specific room of a specific building, represented by its pre-recorded RIR.

The method can be implemented by a hardware- and/or software-embodied digital signal processing unit configured to perform the method. The digital signal processing unit may comprise at least one processing unit, such as a processor, and at least one memory unit. The digital signal processing unit may form part of an apparatus for processing an audio signal. A respective apparatus can form a vehicle audio system or a car audio system, i.e. an audio system that is to be installed or is installed in a vehicle or a car, respectively, or can form part of a respective vehicle audio system or car audio system.

The basic steps of the method of processing an input audio signal for generating an output audio signal having the reverberation characteristics of a specific acoustic environment will be specified in the following:

According to a first step of the method, a pre-recorded RIR of a specific acoustic environment, e.g. a specific room of a specific building, is provided. The pre-recorded RIR is or can be represented by its RIR samples. Notably, the pre-recorded RIR of a respective specific acoustic environment can be obtained through known methods for recording the RIR of acoustic environments. The actual recording of a respective acoustic environment is typically not a step of the method. The first step of the method can be implemented by a hardware- and/or software-embodied RIR provision unit which is configured to provide a pre-recorded RIR of a specific acoustic environment. The pre-recorded RIR provided by the RIR provision unit is or can be represented by its RIR samples.

According to a second step of the method, a discrete input audio signal, i.e. a signal representative of a specific audio content, e.g. a musical piece, is provided. The input audio signal is or can be represented by its incoming audio signal samples. The second step of the method can be implemented by a hardware- and/or software-embodied input audio signal provision unit which is configured to provide a discrete input audio signal from a physical or non-physical input audio signal source, such as data carrier source, a network source, etc. The discrete input audio signal provided by the input audio signal provision unit is or can be represented by its incoming audio signal samples.

According to a third step of the method, the incoming audio signal samples of the discrete input audio signal are divided in a number of input audio signal blocks, whereby each input audio signal block has the same size in audio signal samples and/or the same number of audio signal samples. As such, every input audio signal block can comprise the same number of audio signal samples. The third step of the method can be implemented by a hardware- and/or software-embodied sample dividing unit which is configured to divide the incoming audio signal samples of the discrete input audio signal in a number of input audio signal blocks, whereby each input audio signal block has the same number of audio signal samples.

According to a fourth step of the method, the samples of the RIR are divided in a number of RIR blocks, whereby each RIR block has the same number of RIR samples. As such, every RIR block can comprise the same number of RIR samples. Typically, the number of RIR samples of the RIR blocks is equal to the size in audio signal samples and/or the number of audio signal samples of the input audio signal blocks. The fourth step of the method can be implemented by a hardware- and/or software-embodied sample dividing unit which is configured to divide the RIR samples of the RIR in a number of RIR blocks, whereby each RIR block has the same number of RIR samples.

According to a fifth step of the method, it is determined if/when an input audio signal block becomes available, and, if an input audio signal block has become available, an output audio signal block is produced by processing the respective input audio signal block, whereby the output audio signal block has the same size and/or the same number of audio signal samples as the input audio signal block. As such, it is determined if/when a sufficient number of audio signal samples have been input to build an input audio signal block and, when the input audio signal block is built, an output audio signal block is produced by processing the respective input audio signal block. The fifth step of the method can be implemented by a hardware- and/or software-embodied sample determining unit which is configured to determine if/when an input audio signal block becomes available, and by a hardware- and/or software-embodied input block processing unit which is configured to process the respective input audio signal block so as to produce an output audio signal block, whereby the output audio signal block has the same size and/or the same number of audio signal samples as the input audio signal block.
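The division of incoming samples into equally sized blocks and the detection of a newly available block can be sketched as follows. This is only an illustration under stated assumptions: the class name BlockBuffer and its push method are hypothetical, and samples are assumed to arrive in chunks of arbitrary length.

```python
import numpy as np


class BlockBuffer:
    """Collects incoming samples and releases complete blocks of fixed size."""

    def __init__(self, block_size):
        self.block_size = block_size
        self._pending = np.zeros(0)  # samples not yet forming a full block

    def push(self, samples):
        """Add samples; return the list of blocks that became available."""
        self._pending = np.concatenate(
            [self._pending, np.asarray(samples, dtype=float)]
        )
        blocks = []
        # A block is available once block_size samples have accumulated.
        while self._pending.size >= self.block_size:
            blocks.append(self._pending[: self.block_size].copy())
            self._pending = self._pending[self.block_size :]
        return blocks


buffer = BlockBuffer(block_size=4)
first = buffer.push([1, 2, 3])            # not enough samples yet
second = buffer.push([4, 5, 6, 7, 8, 9])  # completes two blocks
```

Each block returned by push would then be handed to the block processing stage, which produces one output block of the same size.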

According to a sixth step of the method, a number of RIR operating coefficients, particularly transformation coefficients, more particularly Discrete-Fourier-Transform (“DFT”) coefficients, is determined for each RIR block, where this number is the same for all RIR blocks, on basis of a first processing rule. As such, a first processing rule is applied on basis of which a number of RIR operating coefficients, particularly transformation coefficients, more particularly DFT coefficients, is determined for each RIR block, where this number is the same for all RIR blocks. The sixth step of the method can be implemented by a hardware- and/or software-embodied operating coefficient determining unit which is configured to determine a number of RIR operating coefficients, particularly transformation coefficients, more particularly DFT coefficients, for each RIR block, where this number is the same for all RIR blocks, on basis of a first processing rule.

According to a seventh step of the method, a number of determined RIR operating coefficients is assigned to each RIR block. Typically, these coefficients are selected from those already determined for this RIR block. As such, each RIR block is assigned with at least one RIR operating coefficient which has been previously determined for this block. The seventh step of the method can be implemented by a hardware- and/or software-embodied operating coefficient assigning unit which is configured to assign a number of RIR operating coefficients to each RIR block, selected from those already determined for this RIR block.

According to an eighth step of the method, the RIR operating coefficients which have been assigned to the respective RIR blocks are stored as static values in at least one memory unit. The eighth step of the method can be implemented by a hardware- and/or software-embodied memory unit which is configured to store the RIR operating coefficients which have been assigned to the respective RIR blocks as static values.
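The fourth and the sixth to eighth steps can be illustrated with a short sketch. It assumes that each RIR block is zero-extended to twice its size before a real-input DFT is taken, and that the coefficients assigned to a block are its lowest-frequency ones. The function name and the counts parameter (number of coefficients kept per block) are hypothetical.

```python
import numpy as np


def static_rir_coefficients(rir, block_size, counts):
    """Partition the RIR into blocks and keep counts[b] DFT coefficients per block."""
    rir = np.asarray(rir, dtype=float)
    num_blocks = -(-rir.size // block_size)  # ceiling division
    # Extend the RIR with zeros so that its length is a multiple of the block size.
    padded = np.zeros(num_blocks * block_size)
    padded[: rir.size] = rir
    blocks = padded.reshape(num_blocks, block_size)
    # Zero-extend each block to 2B samples; the real-input DFT then yields
    # B + 1 non-redundant coefficients per block.
    spectra = np.fft.rfft(blocks, n=2 * block_size, axis=1)
    # Assumption: the assigned coefficients are the first (lowest-frequency) ones.
    return [spectra[b, : counts[b]].copy() for b in range(num_blocks)]


coeffs = static_rir_coefficients(np.arange(1, 11), block_size=4, counts=[5, 3, 1])
```

The returned list plays the role of the static values held in memory: it is computed once per RIR and is not modified while audio is being processed.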

According to a ninth step of the method, the stored RIR operating coefficients of the RIR are utilized together with corresponding time-varying operating coefficients of the input audio signal for determining and/or generating an output audio signal having the reverberation characteristics of the specific acoustic environment on basis of a second processing rule. As such, each RIR operating coefficient has its corresponding input audio signal operating coefficient and, based on this relation, the RIR operating coefficients and the corresponding time-varying operating coefficients of the input audio signal are utilized for determining and/or generating an output audio signal having the reverberation characteristics of the specific acoustic environment on basis of a second processing rule. The total number of RIR operating coefficients, i.e. operating coefficients for the RIR, is equal to the total number of (input) audio signal operating coefficients, i.e. operating coefficients for the input audio signal, such that every RIR operating coefficient has its corresponding (input) audio signal operating coefficient and vice versa. The ninth step of the method can be implemented by a hardware- and/or software-embodied processing unit which is configured to use the stored static RIR operating coefficients of the RIR together with corresponding time-varying operating coefficients of the input audio signal for determining and/or generating an output audio signal having the reverberation characteristics of the specific acoustic environment on basis of a second processing rule.

The method thus allows for implementing an Approximate Uniform Partition Overlap-Save (“AUPOLS”) method which is different from, and operates in between, the abovementioned UPOLS- and NUPOLS-methods. The AUPOLS-method typically has the same latency as the UPOLS- and NUPOLS-methods, is a single-thread approach which is simple to implement, and requires less memory than UPOLS and NUPOLS. The AUPOLS-method forms an approximate model of the pre-recorded RIR, in contrast to the error-free UPOLS- and NUPOLS-methods, which form an exact model of the pre-recorded RIR.

The AUPOLS-method provides an approximate model of the RIR, instead of providing an exact model of it. The AUPOLS-method comprises dividing the pre-recorded L samples of the RIR into K blocks, each of size B samples (L = KB). The AUPOLS-method then operates with the DFT transformed data of the K blocks. For a real-life RIR, the transformed data can have the following time-frequency properties:

(i) at a given time (block number), the power (squared magnitude) of the DFT coefficients decreases with increasing frequency (DFT coefficient index);

(ii) at a given frequency (DFT coefficient index), the power (squared magnitude) of the DFT coefficients decreases with increasing time (block number);

(iii) the rate at which the power (squared magnitude) of a DFT coefficient decreases with time increases with frequency, whereby high-frequency DFT coefficients typically fade faster than low-frequency coefficients.

The above properties (i) - (iii) need not be interpreted in a strict sense because they only describe the trends that the powers follow. Property (i), for example, does not preclude the possibility that the power of a DFT coefficient is smaller than the power of the next one (the power for a higher index), but only specifies that, for any given block, the power of the DFT coefficients generally decreases when moving from lower to higher frequencies (indices).

The above properties (i) - (iii) generally hold for the samples of all pre-recorded RIRs and suggest that, if an exact model of the RIRs is not required, i.e. an approximation is acceptable, it is possible to reduce the computing operations and memory required by considering only a subset of the transformed RIR data.

Hence, an improved method of processing an input audio signal for generating an output audio signal having the reverberation characteristics of a specific acoustic environment, particularly with respect to the computing power and memory required for the respective digital audio signal processing unit as well as ease of implementation, is given.

The first processing rule can be applied based on an energy-based time-frequency tiling-process (“EBTFT-process”). Thus, an EBTFT-process can be used for determining the parameters for implementing the AUPOLS method and the corresponding AUPOLS structure for a given pre-recorded RIR. The AUPOLS structure is particularly beneficial in view of the resources required, namely the number of computing operations and the memory size.

The EBTFT-process can comprise applying a time-domain window function to each RIR block to modify the first and last samples of each block so as to generate blocks, particularly gradually, increasing from a zero absolute value at a first sample and, particularly gradually, decreasing to a zero absolute value at a last sample.

The EBTFT-process can further comprise appending a number of zero samples after the last sample of each block so as to generate double-sized blocks.

The EBTFT-process can further comprise arranging the double-sized blocks as columns of a real matrix having a number of rows and a number of columns, whereby the number of rows corresponds to the number of samples of each double-sized block and the number of columns corresponds to the number of RIR blocks.

The EBTFT-process can further comprise applying a DFT transformation to each column of the real matrix, and applying a replacement rule to each of the columns so as to replace each column by the squared magnitude of its DFT transformation, resulting in a matrix of the same size having only real positive elements. The EBTFT-process can further comprise removing all last rows comprising redundant information and doubling the elements of all rows except of those of the first and the last row, so as to generate a matrix of real positive elements, whereby the elements of the matrix represent the energy distribution function of the particular RIR in the time-frequency domain.

The EBTFT-process can further comprise applying a smoothing function to the energy distribution function.

The EBTFT-process can further comprise applying an energy threshold rule to the elements of each column, such that only the first elements of each column that sum up to a threshold energy, e.g. 90%, of the total energy of the respective column, are kept, whereas the remaining elements of the respective column are set to zero, resulting in a modified matrix having zeros at the last locations of each column.

The EBTFT-process can further comprise generating a strictly-monotonically-decreasing sequence, indicating for each column of the matrix the remaining energy of the matrix starting from the particular column, and normalizing this sequence with the sum of all energies of all columns of the matrix, i.e. the sum of all elements of the matrix, thereby generating a strictly-monotonically-decreasing sequence in the interval between 0 and 1.

The EBTFT-process can further comprise modifying the decay rate of the strictly-monotonically-decreasing sequence by applying a transformation on the sequence which converts an arbitrary strictly-monotonically-decreasing sequence to another sequence with the same property that takes values in the same interval.

The EBTFT-process can further comprise determining a second sequence based on the modified matrix, that for each particular column of the modified matrix expresses the sum of all elements of the respective column of the modified matrix.

The EBTFT-process can further comprise determining a third sequence based on the modified matrix, the strictly-monotonically decreasing sequence, and the second sequence, whereby the third sequence is a monotonically decreasing sequence. It is possible that two or more consecutive values of this third sequence are equal to each other, which means that the third sequence is not a strictly-monotonically-decreasing sequence.

The EBTFT-process can further comprise applying a grouping rule to the samples of the third sequence, so as to group together P_g consecutive samples having the same value N_g. This value N_g represents the number of RIR operating coefficients for each of the P_g RIR blocks in the respective group of RIR blocks. The number of samples P_g grouped together represents the number of RIR blocks using the same number N_g of RIR operating coefficients. Typically, each sample of the third sequence that has a unique value N_g forms its own group with population P_g=1. Typically, this unique value N_g represents the number of RIR operating coefficients for the respective RIR block.

A respective EBTFT-process typically analyzes the provided pre-recorded RIR samples on the time-frequency plane and determines the numbers P_g and N_g that best match how the energy of the RIR samples is distributed in time and frequency. It therefore yields an AUPOLS structure that best matches the time-frequency properties of the RIR samples.

A concrete exemplary embodiment of an EBTFT-process comprises the following steps:

According to a first step of the exemplary EBTFT-process, the pre-recorded RIR having a length of L samples is segmented into K non-overlapping RIR blocks of size B samples each (where K = L/B). The pre-recorded RIR can be extended with zeros if its length is not a multiple of the block size B. Also, a time-domain window function can optionally be applied to each RIR block. The time-domain window function allows for modifying the RIR samples in the vicinity of the block boundaries. Specifically, the time-domain window function allows for modifying the first and the last RIR samples of each RIR block so that the power of the RIR samples decreases, particularly gradually, to zero when approaching the start or the end of the RIR block.

According to a second step of the exemplary EBTFT-process, zero samples are appended at the end of each of the RIR blocks, so as to generate extended RIR blocks having double the length B of each RIR block, such that the length of each extended RIR block is 2B samples. The extended RIR blocks can be processed so as to be arranged as columns of a real matrix, having 2B rows and K columns. Each of the K columns can then be replaced by the squared magnitude of its transform, typically its DFT transform. The result is a real matrix of the same size, having non-negative elements. The last (B-1) rows of this matrix can be removed so as to yield a real matrix E[l, b] of non-negative elements, having (B+1) rows and K columns. The elements of all (B-1) rows in between the first and the last row of the matrix can be doubled so as to compensate for the last (B-1) removed rows.
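The first two steps of the exemplary EBTFT-process can be sketched as follows. This is a minimal illustration, assuming an unnormalized DFT (as computed by NumPy) and an optional window supplied by the caller; the function name energy_matrix and its parameters are hypothetical.

```python
import numpy as np


def energy_matrix(rir, block_size, window=None):
    """Build the (B + 1) x K matrix E[l, b] of one-sided block energies."""
    rir = np.asarray(rir, dtype=float)
    num_blocks = -(-rir.size // block_size)  # K, rounded up
    # Extend the RIR with zeros to a multiple of the block size B.
    padded = np.zeros(num_blocks * block_size)
    padded[: rir.size] = rir
    blocks = padded.reshape(num_blocks, block_size)
    if window is not None:
        # Optional time-domain window applied to every block.
        blocks = blocks * np.asarray(window, dtype=float)
    # Arrange the zero-extended blocks (length 2B) as columns of a 2B x K matrix.
    extended = np.zeros((2 * block_size, num_blocks))
    extended[:block_size, :] = blocks.T
    # Replace each column by the squared magnitude of its DFT.
    power = np.abs(np.fft.fft(extended, axis=0)) ** 2
    # Remove the last B - 1 redundant rows, leaving B + 1 rows.
    energy = power[: block_size + 1, :].copy()
    # Double the B - 1 rows between the first and the last row to compensate.
    energy[1:block_size, :] *= 2.0
    return energy
```

With this construction the sum of each column equals the total power of the corresponding full 2B-point DFT, so no energy is lost by discarding the redundant rows.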

The non-negative elements of the real matrix E[l, b] describe how the energy of the RIR is distributed in the time-frequency plane. More specifically, they describe how the energy contained in a certain frequency region fades in time, and how the energy contained in a certain time region fades in frequency. The elements of the real matrix E[l, b], considered as the samples of a two-dimensional function, do not necessarily correspond to a smooth and well-behaved surface, but to a surface which may exhibit sudden minima and maxima along the time and the frequency direction. It is possible to smooth the two-dimensional surface with a suitable filter, such as a two-dimensional low-pass filter. The smoothing should be done in a way so as not to introduce artefacts (boundary effects) at the boundaries of the E[l, b] matrix, namely in the vicinity of the first and last column and in the vicinity of the first and last row.

According to a third step of the exemplary EBTFT-process, a thresholding rule is applied to the elements of each of the K columns of the real matrix E[l, b], whereby only the first elements (for the lower values of l) of the column, which together contain a percentage T_pow% of the total column energy, are maintained, T_pow% being a configuration or threshold parameter. The last elements (for the higher values of l) of the column can be set to zero. The configuration or threshold parameter T_pow% is between 0% and 100%. The utilization of the configuration or threshold parameter T_pow% is optional and has no effect when T_pow% = 100%. A typical value of this parameter could be T_pow% = 99%, meaning that the first elements of each column that, when added, yield a value that is 99% of the sum of all elements of the respective column are maintained, with the rest of the column elements set to zero.
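The thresholding rule can be sketched as below. The sketch assumes that the smallest number of leading elements whose sum reaches the threshold is kept; the function name and its parameters are hypothetical.

```python
import numpy as np


def apply_energy_threshold(energy, threshold_percent=99.0):
    """Zero the trailing elements of each column beyond the energy threshold."""
    energy = np.asarray(energy, dtype=float)
    result = energy.copy()
    # Energy that the kept leading elements of each column must reach.
    target = energy.sum(axis=0) * (threshold_percent / 100.0)
    cumulative = np.cumsum(energy, axis=0)
    for b in range(energy.shape[1]):
        # Index of the first element at which the running sum reaches the target.
        first_hit = int(np.searchsorted(cumulative[:, b], target[b], side="left"))
        keep = first_hit + 1
        # Everything after the kept elements is set to zero.
        result[keep:, b] = 0.0
    return result


E = np.array([[6.0, 1.0],
              [3.0, 1.0],
              [1.0, 8.0]])
half = apply_energy_threshold(E, threshold_percent=50.0)
```

In this small example the first column keeps only its first element, whereas the second column keeps all three because its energy is concentrated in the last element.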

According to a fourth step of the exemplary EBTFT-process, based on the respective final state of the real matrix E[l, b], a first sequence D[b] is constructed. This first sequence D[b] expresses the remaining total power of the real matrix E[l, b] from column b to the last column (K-1). Thereby, column 0 is the first column. The sequence D[b] is divided by the total power of the real matrix E[l, b] (the sum of all the elements of the real matrix E[l, b]). The sequence D[b] has length K (because the real matrix E[l, b] has K columns). The first sequence D[b] has the following properties: D[0] = 1, D[b] > D[b+1], and 1 ≫ D[K-1] > 0. The last property states that the last sample of the sequence D[b] is typically positive and (very) close to zero.
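A minimal sketch of the construction of D[b] is given below, assuming that every column of the matrix has non-zero energy so that the sequence is strictly decreasing. The function name is hypothetical.

```python
import numpy as np


def remaining_energy_sequence(energy):
    """Return D[b]: normalized energy remaining from column b to column K - 1."""
    # Energy of each column (sum over the frequency index l).
    column_energy = np.asarray(energy, dtype=float).sum(axis=0)
    # Reverse cumulative sum: entry b holds the energy of columns b .. K - 1.
    remaining = np.cumsum(column_energy[::-1])[::-1]
    # Normalize by the total energy so that D[0] = 1.
    return remaining / remaining[0]


E = np.array([[4.0, 2.0, 1.0],
              [2.0, 1.0, 0.0]])
D = remaining_energy_sequence(E)
```

Here the column energies are 6, 3 and 1, so the remaining energies are 10, 4 and 1 and the normalized sequence is 1, 0.4, 0.1.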

Typically, the sequence D[b] is a strictly-monotonically-decreasing sequence and takes values in the interval (0, 1]. The decay rate of the sequence D[b] affects the complexity and the memory requirements of the AUPOLS structure that results from implementing the EBTFT-process. It is possible to modify the decay rate of the sequence D[b] by applying a transformation on the sequence that maps the interval (0, 1] to the same interval and converts an arbitrary strictly-monotonically-decreasing sequence in this interval to another sequence with the same property.
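The text does not prescribe a particular transformation. One possible choice, shown here purely as an assumption, is a power law with a positive exponent: it maps the interval (0, 1] onto itself, leaves the value 1 unchanged and preserves strict monotonic decrease. The function name and the parameter gamma are hypothetical.

```python
import numpy as np


def reshape_decay(decay, gamma):
    """Apply D[b] -> D[b] ** gamma (assumed example of a decay-rate transform)."""
    if gamma <= 0:
        raise ValueError("gamma must be positive to preserve monotonic decrease")
    decay = np.asarray(decay, dtype=float)
    # gamma > 1 makes the sequence decay faster, gamma < 1 slower.
    return decay ** gamma


faster = reshape_decay([1.0, 0.4, 0.1], gamma=2.0)
slower = reshape_decay([1.0, 0.4, 0.1], gamma=0.5)
```

With such a transform, a faster decay would lead to fewer retained coefficients in later blocks, and a slower decay to more.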

According to a fifth step of the exemplary EBTFT-process, based on the final respective state of the real matrix E[l, b], a second sequence C[b] is constructed. The second sequence C[b] expresses the sum of all elements of column b of the real matrix E[l, b]. The second sequence C[b] has length K (because the real matrix E[l, b] has K columns).

According to a sixth step of the exemplary EBTFT-process, K integers N[b] are defined from the real matrix E[l, b], the first sequence D[b], and the second sequence C[b]. Thereby, N[0] is the number of first elements of the first column of the real matrix E[l, b] that, when added together, yield a value that is at least equal to the product D[0]C[0]. This is similar for all other columns of the real matrix E[l, b]. For example, N[K-1] is the number of first elements of the last column of the real matrix E[l, b] that, when added together, yield a value that is at least equal to the product D[K-1]C[K-1]. The selection of the elements starts from the first row of the real matrix E[l, b], whereby only the necessary number of elements is selected. A lower bound N_min can be imposed on the sequence N[b]. The lower bound N_min can also be set to the value of 1 such that, in fact, no lower bound is imposed on the K integers which are the samples of the sequence N[b]. The symbols N_b and N[b] have the same meaning and denote the samples of the same sequence.
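The fifth and sixth steps can be sketched together as follows. The sketch assumes that the sequence D[b] is supplied by the caller and that the count is additionally capped at the number of rows of the matrix; the function name and the parameter n_min are hypothetical.

```python
import numpy as np


def coefficient_counts(energy, decay, n_min=1):
    """Return N[b]: leading elements of column b needed to reach D[b] * C[b]."""
    energy = np.asarray(energy, dtype=float)
    num_rows, num_cols = energy.shape
    column_energy = energy.sum(axis=0)                  # second sequence C[b]
    targets = np.asarray(decay, dtype=float) * column_energy
    cumulative = np.cumsum(energy, axis=0)
    counts = np.empty(num_cols, dtype=int)
    for b in range(num_cols):
        # Smallest number of leading elements whose sum is at least the target.
        n = int(np.searchsorted(cumulative[:, b], targets[b], side="left")) + 1
        # Enforce the lower bound and never exceed the column length.
        counts[b] = min(max(n, n_min), num_rows)
    return counts


E = np.array([[5.0, 5.0, 5.0],
              [3.0, 3.0, 3.0],
              [2.0, 2.0, 2.0]])
N = coefficient_counts(E, decay=[1.0, 0.6, 0.1])
```

In this example every column has energy 10, so the targets are 10, 6 and 1, which are reached after three, two and one leading elements respectively.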

By way of its construction and due to the time-frequency properties of the pre-recorded RIRs, N_b is a monotonically decreasing sequence with only a few exceptions. In other words, the condition N[b+1] ≤ N[b] is violated for only a few pairs of consecutive samples. The sequence N_b can be processed so as to yield a monotonically decreasing sequence everywhere (i.e. with no exceptions this time). The way that the processing is done is not crucial, as long as the resulting sequence is monotonically decreasing everywhere, and its samples are close to the samples of the original sequence. In any case, the processing must yield values in the range N_min ≤ N_b ≤ (B+1) for all values of the index b.
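Since the text leaves the processing open, the following sketch uses a running minimum followed by clipping as one assumed way of obtaining a monotonically decreasing sequence within the required range. The function name and its parameters are hypothetical.

```python
import numpy as np


def make_monotone(counts, n_min, block_size):
    """Force N[b+1] <= N[b] everywhere and keep n_min <= N[b] <= B + 1."""
    counts = np.asarray(counts, dtype=int)
    # Running minimum: each sample is lowered to the smallest value seen so far.
    monotone = np.minimum.accumulate(counts)
    # Clipping preserves the ordering, so the result stays monotone.
    return np.clip(monotone, n_min, block_size + 1)


fixed = make_monotone([9, 7, 8, 5, 6, 2], n_min=1, block_size=8)
```

A running minimum only ever lowers samples; a running maximum taken from the end of the sequence would be an equally valid alternative that only raises them.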

The samples of the sequence N_b yield the sequences of numbers (or in other words the sequences) P_g and N_g as follows: A number of P_g consecutive samples of N_b which are all equal to N_g are grouped together to form the group g with population P_g. This is repeated for all consecutive samples that can be grouped together because they are equal. Samples of N_b with unique values each form their own group, with population one, since they cannot be grouped with any other sample due to having unique values. The total number of the groups formed in this way is the number G. For the group index g it then holds that 0 ≤ g < G. The symbols P_g and P[g] have the same meaning and denote the population of the group with index g. The symbols N_g and N[g] have the same meaning and denote the value of the elements, or the value of the element, of the sequence N[b] that formed the group with index g.
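The grouping described above amounts to a run-length encoding of the sequence N_b. A minimal sketch, with a hypothetical function name:

```python
from itertools import groupby


def group_counts(counts):
    """Run-length encode N[b] into populations P[g] and values N[g]."""
    populations, values = [], []
    # groupby yields one run per stretch of equal consecutive samples.
    for value, run in groupby(counts):
        values.append(int(value))
        populations.append(sum(1 for _ in run))
    return populations, values


P, N = group_counts([9, 7, 7, 5, 5, 5, 2])
G = len(P)
```

For this example the result is P = [1, 2, 3, 1] and N = [9, 7, 5, 2], i.e. G = 4 groups whose populations add up to the number of RIR blocks.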

The previous step may yield a group g’ with a population that is deemed to be too small. If respective groups g’ with a population smaller than P_min > 1 are not allowed, then this group g’ must be merged with another neighbouring group. Thus, group g’ can be merged either with its previous group (g’ - 1) or with its next group (g’ + 1). This increases the population of the previous group (g’ - 1) or of the next group (g’ + 1) by the population of group g’ and also eliminates the samples of the sequences P_g and N_g associated with group g’.
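The merging of undersized groups can be sketched as follows. The text allows merging with either neighbour; this sketch assumes, purely for illustration, that a group is merged into its previous group whenever one exists and into its next group otherwise. The function name and its parameters are hypothetical.

```python
def merge_small_groups(populations, values, p_min):
    """Merge every group with population below p_min into a neighbouring group."""
    populations, values = list(populations), list(values)
    g = 0
    while g < len(populations) and len(populations) > 1:
        if populations[g] >= p_min:
            g += 1
            continue
        # Assumed policy: previous group if it exists, otherwise the next group.
        target = g - 1 if g > 0 else g + 1
        populations[target] += populations[g]
        # The samples of P and N associated with the merged group are eliminated.
        del populations[g], values[g]
        # g now points at the following group (or at the merged first group),
        # which is examined in the next iteration.
    return populations, values


P, N = merge_small_groups([1, 2, 3, 1], [9, 7, 5, 2], p_min=2)
```

Merging into the previous group gives the affected blocks more coefficients than originally determined, while merging into the next group gives them fewer; which trade-off is preferable depends on the available resources.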

The second processing rule (see the ninth step of the method) can be applied for every incoming input audio signal block.

Specifically, the second processing rule can comprise the following steps:

- pairing two blocks of the input audio signal that are directly consecutive in time to generate a paired block having double the size of each of the original input blocks;

- applying a forward DFT transformation to each paired block, whereby the length of the forward DFT transformation corresponds to the size of the paired block;

- using a subset of the DFT coefficients resulting from the transformation of each paired block as the time-varying operating coefficients of the input audio signal;

- performing complex multiplications of the RIR static operating DFT coefficients with their corresponding time-varying operating DFT coefficients of the input audio signal;

- accumulating the complex multiplication results for all sets of RIR static operating DFT coefficients;

- applying an inverse DFT transformation, whereby the length of the inverse DFT transformation corresponds to the size of the paired block, resulting in an output time-domain block that has a size equal to the size of the respective input paired block; and

- discarding the first half of the output time-domain block and using the remaining second half of the output time-domain block as the block of the output audio signal, whereby the latter output audio signal block has the reverberation characteristics of the specific acoustic environment and is the response of the method to the input audio signal block.
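The steps listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `dft`, `idft_real`, `aupols` and the argument `counts` (the number of DFT coefficients kept for each RIR block, assumed to be non-increasing and at most B + 1) are hypothetical, a naive O(n²) DFT stands in for a fast transform, and only complete input blocks are processed.

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def idft_real(half, n):
    """Inverse DFT of a real signal given its first n // 2 + 1 coefficients."""
    full = list(half) + [half[n - f].conjugate() for f in range(n // 2 + 1, n)]
    return [sum(full[f] * cmath.exp(2j * cmath.pi * f * t / n)
                for f in range(n)).real / n for t in range(n)]

def aupols(x, rir, B, counts):
    """Filter x with rir, keeping counts[k] DFT coefficients for RIR block k."""
    K = len(counts)
    H = []                                   # static RIR operating coefficients
    for k in range(K):
        blk = list(rir[k * B:(k + 1) * B])
        blk += [0.0] * (2 * B - len(blk))    # RIR block followed by zeros
        H.append(dft(blk)[:counts[k]])
    delay = [[0j] * counts[k] for k in range(K)]   # time-varying coefficients
    prev, y = [0.0] * B, []                  # the input is zero before time 0
    for start in range(0, len(x) - B + 1, B):
        cur = list(x[start:start + B])
        X = dft(prev + cur)[:B + 1]          # paired block, non-redundant bins
        for k in range(K - 1, 0, -1):        # shift one buffer down, truncating
            delay[k] = delay[k - 1][:counts[k]]
        delay[0] = X[:counts[0]]
        Y = [0j] * (B + 1)                   # dropped bins contribute zero
        for k in range(K):
            for f in range(counts[k]):
                Y[f] += H[k][f] * delay[k][f]
        out = idft_real(Y, 2 * B)
        y.extend(out[B:])                    # discard the first half
        prev = cur
    return y
```

If every entry of `counts` equals B + 1, no coefficient is dropped and the sketch reduces to an exact uniformly partitioned overlap-save convolution; smaller entries yield the approximate response.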

A second aspect of the invention refers to a non-transitory computer readable medium comprising or storing computer-executable instructions, which when executed by a processing unit of a digital signal processing unit cause the digital signal processing unit to perform the method according to the first aspect of the invention.

A non-transitory computer-readable medium can refer to any tangible computer-based device implemented in any method or technology for short-term and long-term storage of information, such as computer-readable instructions, data structures, program modules and sub-modules, or other data in any device. Therefore, the method described herein may be encoded as executable instructions embodied in a tangible, non-transitory, computer-readable medium, including, without limitation, a storage device and/or a memory device. The term "non-transitory computer-readable medium" generally includes all tangible, computer-readable media, including, without limitation, non-transitory computer storage devices, including, without limitation, volatile and non-volatile media, and removable and non-removable media such as firmware, physical and virtual storage, CD-ROMs, DVDs, and any other digital source such as a local or global network, as well as yet to be developed digital means, with the sole exception being a transitory, propagating signal.

A third aspect of the invention refers to a processing unit comprising at least one processor having computer-executable instructions, which when executed by the processor cause the digital signal processing unit to perform the method according to the first aspect of the invention.

A fourth aspect of the invention refers to an apparatus for processing an audio signal, comprising a processing unit according to the third aspect of the invention.

A fifth aspect of the invention refers to a vehicle, particularly a car, comprising an apparatus for processing an audio signal according to the fourth aspect of the invention.

Exemplary embodiments of the invention are described in context with the following Figures, whereby:

Fig. 1, 2 each show a principle drawing of a, particularly software-embodied, structure allowing for implementing a method of processing an input audio signal for generating an output audio signal having the reverberation characteristics of a specific acoustic environment according to an exemplary embodiment.

The method is a method of processing an input audio signal for generating an output audio signal having the reverberation characteristics of a specific acoustic environment represented by its pre-recorded Room-Impulse-Response (“RIR”). The method thus enables, by processing an input audio signal, generating an output audio signal having the reverberation characteristics of a specific acoustic environment, e.g. a specific room of a specific building, such as an interior of a specific cathedral, represented by its pre-recorded RIR.

The method can be implemented by a hardware- and/or software-embodied digital signal processing unit configured to perform the method. The digital signal processing unit may comprise at least one processing unit (not shown), such as a processor, and at least one memory unit (not shown), such as a memory. The digital signal processing unit may form part of an apparatus for processing an audio signal (not shown). A respective apparatus can form a vehicle audio system or a car audio system, i.e. an audio system that is to be installed or is installed in a vehicle or a car, respectively, or can form part of a respective vehicle audio system or car audio system.

The basic steps of an exemplary embodiment of a respective method of processing an input audio signal for generating an output audio signal having the reverberation characteristics of a specific acoustic environment will be specified in the following:

According to a first step of the method, a pre-recorded RIR of a specific acoustic environment, e.g. a specific room, such as a specific room of a specific building, is provided. The pre-recorded RIR is or can be represented by its RIR samples. The first step of the method can be implemented by a hardware- and/or software-embodied RIR provision unit (not shown) which is configured to provide a pre-recorded RIR of a specific acoustic environment. The pre-recorded RIR provided by the RIR provision unit is or can be represented by its RIR samples.

According to a third step of the method, the incoming audio signal samples of the discrete input audio signal are divided in a number of input audio signal blocks, whereby each input audio signal block has the same size in audio signal samples and/or the same number of audio signal samples. As such, every input audio signal block can comprise the same number of audio signal samples. The third step of the method can be implemented by a hardware- and/or software-embodied sample dividing unit (not shown) which is configured to divide the incoming audio signal samples of the discrete input audio signal in a number of input audio signal blocks, whereby each input audio signal block has the same number of audio signal samples.

According to a fourth step of the method, the samples of the RIR are divided in a number of RIR blocks, whereby each RIR block has the same size in RIR samples and/or the same number of RIR samples. As such, every RIR block can comprise the same number of RIR samples. Typically, the size in RIR samples and/or the number of RIR samples of the RIR blocks is equal to the size in audio signal samples and/or the number of audio signal samples of the input audio signal blocks. The fourth step of the method can be implemented by a hardware- and/or software-embodied sample dividing unit (not shown) which is configured to divide the RIR samples of the RIR in a number of RIR blocks, whereby each RIR block has the same size in RIR samples and/or the same number of RIR samples.

According to a fifth step of the method, it is determined if/when an input audio signal block becomes available, and, if an input audio signal block has become available, an output audio signal block is produced by processing the respective input audio signal block, whereby the output audio signal block has the same size and/or the same number of audio signal samples as the input audio signal block. As such, it is determined if/when a sufficient number of audio signal samples have been input to build an input audio signal block and when the input audio signal block is built, an output audio signal block is produced by processing the respective input audio signal block. The fifth step of the method can be implemented by a hardware- and/or software-embodied sample determining unit (not shown) which is configured to determine if/when an input audio signal block becomes available, and by a hardware- and/or software-embodied input block processing unit (not shown) which is configured to process the respective input audio signal block so as to produce an output audio signal block, whereby the output audio signal block has the same size and/or the same number of audio signal samples as the input audio signal block.

According to a sixth step of the method, a number of RIR coefficients, particularly transformation coefficients, more particularly Discrete-Fourier-Transform (“DFT”) coefficients, is determined for each RIR block, where this number is the same for all RIR blocks, on basis of a first processing rule. As such, a first processing rule is applied on basis of which a number of RIR coefficients, particularly transformation coefficients, more particularly DFT coefficients, is determined for each RIR block, where this number is the same for all RIR blocks. The sixth step of the method can be implemented by a hardware- and/or software-embodied coefficient determining unit (not shown) which is configured to determine a number of RIR coefficients, particularly transformation coefficients, more particularly DFT coefficients, for each RIR block, where this number is the same for all RIR blocks, on basis of a first processing rule.

According to a seventh step of the method, a or the number of RIR operating coefficients is assigned to each RIR block, where these coefficients are selected from those already determined for this RIR block. As such, each RIR block is assigned with at least one RIR operating coefficient which has been previously determined for this RIR block. The seventh step of the method can be implemented by a hardware- and/or software-embodied operating coefficient assigning unit (not shown) which is configured to assign a or the number of RIR operating coefficients to each RIR block selected from those already determined for this block.

According to a ninth step of the method, the stored RIR operating coefficients of the RIR are utilized together with corresponding time-varying operating coefficients of the input audio signal for determining and/or generating an output audio signal having the reverberation characteristics of the specific acoustic environment on basis of a second processing rule. As such, each RIR operating coefficient has its corresponding input audio signal operating coefficient and, based on this relation, the RIR operating coefficients of the RIR and the corresponding time-varying operating coefficients of the input audio signal are utilized for determining and/or generating an output audio signal having the reverberation characteristics of the specific acoustic environment on basis of a second processing rule. The total number of RIR operating coefficients, i.e. operating coefficients for the RIR, is equal to the total number of input audio signal operating coefficients, i.e. operating coefficients for the input audio signal, such that every RIR operating coefficient has its corresponding input audio signal operating coefficient and vice versa. The ninth step of the method can be implemented by a hardware- and/or software-embodied processing unit (not shown) which is configured to use the stored static RIR operating coefficients of the RIR together with corresponding time-varying operating coefficients of the input audio signal for determining and/or generating an output audio signal having the reverberation characteristics of the specific acoustic environment on basis of a second processing rule.

The method thus allows for implementing an Approximate Uniform Partition Overlap Save (“AUPOLS”) method which is different from and operates in between the abovementioned UPOLS- and NUPOLS-methods. The AUPOLS-method has the same latency as the UPOLS- and NUPOLS-methods. It is a single-thread approach which is simple in its implementation and requires less memory than UPOLS and NUPOLS. In contrast to the error-free UPOLS- and NUPOLS-methods, which model the pre-recorded RIR exactly, the AUPOLS-method forms an approximate model of the pre-recorded RIR.

The first processing rule can advantageously be applied based on an energy-based time-frequency tiling-process (“EBTFT-process”). Thus, an EBTFT-process can be used for determining the parameters for implementing the AUPOLS-method and the corresponding AUPOLS structure for a given pre-recorded RIR.

The EBTFT-process can further comprise applying a DFT transformation to each column of the real matrix, and applying a replacement rule to each of the columns so as to replace each column by the squared magnitude of its DFT transformation, resulting in a matrix of the same size having only real positive elements.

The EBTFT-process can further comprise removing all last rows comprising redundant information and doubling the elements of all rows except those of the first and the last row, so as to generate a matrix of real positive elements, whereby the elements of the matrix represent the energy distribution (function) of the particular RIR in the time-frequency domain.
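The construction of the energy distribution matrix can be illustrated with the following Python sketch. It is a simplified example under assumptions: the names `dft` and `energy_matrix` are hypothetical, a naive DFT is used, each RIR block is zero-padded to double size before the transformation, and no time-domain window is applied to the blocks.

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def energy_matrix(rir, B):
    """Return E[row][column]: one column of B + 1 energies per RIR block."""
    columns = []
    for start in range(0, len(rir), B):
        blk = list(rir[start:start + B])
        padded = blk + [0.0] * (2 * B - len(blk))     # double-sized block
        power = [abs(c) ** 2 for c in dft(padded)]    # squared magnitude
        one_sided = power[:B + 1]                     # remove redundant rows
        for r in range(1, B):                         # double all rows except
            one_sided[r] *= 2.0                       # the first and the last
        columns.append(one_sided)
    return [list(row) for row in zip(*columns)]       # rows = frequency bins

rir = [1.0, -0.5, 0.25, 0.5, 0.1, 0.0, -0.2, 0.05]    # made-up RIR samples
E = energy_matrix(rir, B=4)                           # 5 rows, 2 columns
```

Doubling the inner rows compensates for the removed mirror-image rows, so each column still sums to the full DFT energy of its block, which by Parseval's relation is 2B times the sum of the squared block samples.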

The EBTFT-process can further comprise applying a filter function or operation, particularly a smoothing function or operation, to the energy distribution function.

The EBTFT-process can further comprise generating a strictly-monotonically decreasing sequence indicating, for each column of the matrix, the remaining energy of the matrix starting from the particular column, and normalizing this sequence with the sum of all energies of all columns (the sum of all elements of the matrix), thereby generating a strictly-monotonically decreasing sequence in the interval between 0 and 1.
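As an illustration, the normalized remaining-energy sequence can be computed as in the following Python sketch. The function name `remaining_energy` and the small example matrix are hypothetical; the matrix is stored as a list of rows, with one column per RIR block.

```python
def remaining_energy(E):
    """For each column, the fraction of total energy from that column onwards."""
    col_energy = [sum(col) for col in zip(*E)]   # energy of each column
    total = sum(col_energy)                      # sum of all matrix elements
    sequence, remaining = [], total
    for energy in col_energy:
        sequence.append(remaining / total)
        remaining -= energy
    return sequence

# Made-up 2 x 4 energy matrix (rows = frequency bins, columns = RIR blocks).
E = [[4.0, 2.0, 1.0, 1.0],
     [4.0, 2.0, 1.0, 1.0]]
seq = remaining_energy(E)
```

The sequence starts at 1 for the first column and is strictly decreasing as long as every column carries some energy.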

The EBTFT-process can further comprise determining a third sequence based on the modified matrix, the strictly-monotonically decreasing sequence, and the second sequence, whereby the third sequence is a monotonically decreasing sequence. Two or more consecutive values of this third sequence may be equal to each other, meaning that the third sequence is not necessarily a strictly-monotonically decreasing sequence.

The EBTFT-process can further comprise applying a grouping rule to the samples of the third sequence, so as to group together consecutive samples having the same value, whereby this value represents the number of RIR transformation operating coefficients in the respective group of RIR blocks, and whereby the number of samples grouped together represents the number of RIR blocks in the respective group of RIR blocks. Each value of the third sequence that has a unique value and therefore cannot be grouped with any other value, represents the number of RIR transformation operating coefficients for the respective RIR block.

Reference is now made to Fig. 1, which shows a structure allowing for implementing an AUPOLS-method according to an exemplary embodiment of the method.

The method processes incoming samples x_n frame by frame. Thereby, x_n represents the value of the input audio signal at time n, where n ≥ 0. The input audio signal is assumed to be zero at time n < 0. The frame size is B ≥ 1 samples. The k^th frame to be processed, whereby k ≥ 0 is the frame index and frame 0 is the first frame, is the vector of samples x_k = [x_{kB+0}, x_{kB+1}, ..., x_{kB+(B-1)}]_{1×B}. Bold letters typically indicate vectors in this document. Buffer 1 contains the samples of the vector x_k. The first sample x_{kB} of the vector x_k is located at the first (leftmost) location of buffer 1. This is the first buffer location. This convention is followed for all buffers and vectors shown in the illustration of Fig. 1.

For the current new incoming frame x_k of buffer 1, the samples of the previous frame x_{k-1} that were present in buffer 2 are shifted to the left by B samples and replace frame x_{k-2} that was present in buffer 3. In this way, frame x_{k-2} is discarded, buffer 3 is filled with the samples of frame x_{k-1} and buffer 2 is filled with the samples of the current new incoming frame x_k. These actions are performed during the k^th iteration of the method that determines the output of the method for the input frame x_k. For the first iteration (this is the 0^th iteration), we set x_{-1} = [0, ..., 0]_{1×B}. This means that for the first iteration B zeros are placed in buffer 3 and the B samples of the vector x_0 are placed in buffer 2.

The transformation indicated at 4 represents a size 2B Discrete-Fourier-Transform (size 2B R-C DFT transform) of the real time-domain vector [x_{k-1} | x_k] = [x_{kB-B}, ..., x_{kB+(B-1)}]_{1×2B}, which is the row vector formed by the samples of the vector x_{k-1} located in buffer 3 followed by the samples of the vector x_k located in buffer 2. This transform maps a real time-domain vector to a complex frequency-domain vector of the same size.
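The buffer handling that feeds the transformation indicated at 4 can be sketched in Python as follows. The generator name `paired_blocks` is hypothetical, and the sketch assumes that only complete frames of B samples are processed.

```python
def paired_blocks(x, B):
    """Yield [previous frame | current frame] for every complete frame of x."""
    previous = [0.0] * B                       # buffer 3 starts with B zeros
    for k in range(len(x) // B):
        current = list(x[k * B:(k + 1) * B])   # buffer 2: the new frame
        yield previous + current               # input of the size-2B DFT
        previous = current                     # shift left by B samples

pairs = list(paired_blocks([1, 2, 3, 4, 5, 6], B=2))
```

Each yielded vector has 2B samples, and consecutive vectors overlap by exactly one frame.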

The output of the transformation indicated at 4 is denoted as X_k = [X_{k2B}, ..., X_{k2B+(2B-1)}]_{1×2B}. The first transformation coefficient (DC term) is X_{k2B} (the first element of the vector) and the last transformation coefficient is X_{k2B+(2B-1)} (the last element of the vector).

The input of the transformation indicated at 6 is the frequency-domain vector Y_k = [Y_{k2B}, ..., Y_{k2B+(2B-1)}]_{1×2B} and its output is the time-domain vector [d_k | y_k] = [d_{kB}, ..., d_{kB+(B-1)} | y_{kB}, ..., y_{kB+(B-1)}]_{1×2B}.

The transformation indicated at 6 represents a size 2B Inverse-Discrete-Fourier-Transform (size 2B C-R IDFT transform). This transform maps a complex frequency-domain vector to a real time-domain vector of the same size. The elements of vector d_k are collected in buffer 8 and are all discarded. The elements of vector y_k = [y_{kB+0}, y_{kB+1}, ..., y_{kB+(B-1)}]_{1×B} are collected in buffer 7 and form the output of the AUPOLS-method to the input frame x_k = [x_{kB+0}, x_{kB+1}, ..., x_{kB+(B-1)}]_{1×B}. Hence, at time n ≥ 0, y_{n=kB+m} is the output of the AUPOLS-method to the input x_{n=kB+m} (where 0 ≤ m < B and k ≥ 0). This output has an inherent delay of (B-1) samples, since a total of B input samples need to be collected to build up a respective block for the processing to start. Only when an input block is complete and available can the output to this block be calculated. The latency of the AUPOLS-method is thus (B-1) samples.

From this point onwards buffer 18 contains the samples h_0 of block 1 (first block) of group 0 (first group), buffer 19 contains the samples h_1 of block 2 of group 0, and buffer 20 contains the samples h_{P[0]-1} of block P_0 (last block) of group 0. It is similar for buffers 21, 22, 23, but this time for the blocks within group g. Buffer 24 contains the samples of the vector 0_B = [0, ..., 0]_{1×B} (a vector of B zeros). The transformation indicated at 25 represents a Discrete-Fourier-Transform (size 2B R-C DFT transform) of the vector [h_0 | 0_B]_{1×2B}. This is the vector formed by the samples of h_0 followed by the samples of 0_B. The transformation indicated at 25 maps a real time-domain vector of size 2B to a complex frequency-domain vector of the same size. There are K transformations similar to the transformation indicated at 25, for converting the K time-domain vectors [h_k | 0_B]_{1×2B}, 0 ≤ k < K, into the frequency-domain vectors H_k = [H_{k2B+0}, H_{k2B+1}, ..., H_{k2B+(2B-1)}]_{1×2B}, 0 ≤ k < K. K represents the number of the RIR blocks.

The first DFT coefficient (DC term) that results from the transformation is H_{k2B} and the last DFT coefficient that results from the transformation is H_{k2B+(2B-1)}. Buffer 26 contains the first N_0 elements of the vector H_0 and buffer 27 contains the first N_g elements of the vector H_k, where k is equal to the number of buffers located above buffer 27, all of them marked with the sequence symbol N. For group 0, there are P_0 buffers having the same size N_0 as buffer 26 and for group g there are P_g buffers having the same size N_g as buffer 27.

Buffers and transformations not explicitly shown in Fig. 1 are indicated at 28 and 29. Groups not explicitly shown in Fig. 1 are indicated at 30 and 31.

Reference is now made to buffer 26 as an exemplary buffer and all the buffers underneath. A total of K buffers containing complex data comprise the RIR transformation operating coefficients. The values of the RIR transformation operating coefficients can be calculated off-line and stay constant throughout the streaming and the processing of the real-time data. The static RIR transformation operating coefficients are stored in a memory unit.

Reference is now made to all K buffers (starting from buffer 10) under the transformation indicated at 4. Further buffers not explicitly shown in Fig. 1 are indicated at 12 and 15. Further groups not explicitly shown in Fig. 1 are indicated at 16 and 17.

For group 0 there are P_0 buffers having the same size N_0 as buffer 10 and for group g there are P_g buffers having the same size N_g as buffer 13. All K buffers are initialised with zeros.

For the incoming frame x_k of buffer 1, the output X_k of the transformation indicated at 4 is calculated. The last (B-1) elements of X_k are implied by the complex-conjugate symmetry property of the transformation indicated at 4. These are all discarded immediately after the output of the transformation indicated at 4 has been calculated. A total of (B+1) elements (complex numbers) remain after discarding these last (B-1) elements (complex numbers) of X_k.
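The complex-conjugate symmetry mentioned above can be checked with a short Python sketch. The helper `dft` and the sample values in `frame` are illustrative assumptions; a naive DFT is used instead of a fast transform.

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

B = 4
frame = [0.5, -1.0, 2.0, 0.25, 1.5, 0.0, -0.75, 3.0]   # real vector, size 2B
X = dft(frame)

kept = X[:B + 1]                       # the (B + 1) non-redundant elements
# The last (B - 1) elements are mirror images: X[2B - f] = conj(X[f]).
rebuilt = kept + [kept[2 * B - f].conjugate() for f in range(B + 1, 2 * B)]
```

Because the discarded elements can always be rebuilt from the kept ones, dropping them loses no information for a real input vector.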

From the remaining (B+1) elements, the first N_0 are shifted into buffer 10 and the last (B+1-N_0) are discarded. The previous elements of buffer 10 are shifted to the next buffer, namely one buffer below. Every time that elements are shifted by moving downwards into any of the buffers, the elements of the buffer where the elements get shifted to are also shifted downwards one buffer further below.

Given that all K buffers under the transformation indicated at 4 are initialized with zeros and that the elements from one buffer to the next buffer (the buffer just below) are shifted in the way described above, the calculation of the output frame y_0 for the input frame x_0 is done using the initial zero values in all (K-1) buffers below buffer 10 and the non-zero values in buffer 10 resulting from the transformation indicated at 4.

When elements are shifted from one buffer to the next (the one below), at any group boundary between groups g and (g+1), the last (N_g - N_{g+1}) elements of the last buffer of group g are discarded, since N_g > N_{g+1}, meaning that the buffers of group (g+1) can each only accommodate N_{g+1} elements. The first N_{g+1} elements of the last buffer of group g are shifted into the first buffer of group (g+1) and the last (N_g - N_{g+1}) elements of the last buffer of group g are simply discarded.

The complex multiplier indicated at 11 forms (outputs) the complex vector [H_0 X_{k2B}, H_1 X_{k2B+1}, ..., H_{N[0]-1} X_{k2B+N[0]-1}]_{1×N[0]}. This is a complex vector with N[0] elements. Similarly, each of the multipliers under the multiplier indicated at 11 forms the element-by-element complex product between the complex contents of its corresponding pair of buffers. These are the buffers marked with the symbol N. There are K such vector products and they are all added by the accumulator indicated at 5 to form a single complex vector of size N_0. Since the vectors to be added do not have the same length (due to the condition N_{g+1} < N_g), a number of zeros are appended to the complex vectors to bring them all to the same length N_0 before adding them.

Finally, accumulator 5 appends (B+1-N_0) complex zero samples to the complex vector sum. This corresponds to the removal of the last (B+1-N_0) complex elements when feeding buffer 10 from the output of the transformation indicated at 4.
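The accumulation of vectors of unequal length, followed by the final zero-padding, can be illustrated with the following Python sketch. The function name `accumulate` and the example products are hypothetical. Instead of first padding every vector to length N_0, the sketch adds each vector into a zero-initialised vector of length B + 1, which gives the same result.

```python
def accumulate(products, B):
    """Sum complex vectors of different lengths into one vector of B + 1 bins."""
    total = [0j] * (B + 1)            # bins beyond N_0 simply stay zero
    for vec in products:
        for f, value in enumerate(vec):
            total[f] += value         # shorter vectors only touch low bins
    return total

# Two made-up product vectors with lengths 3 and 2, accumulated for B = 4.
Y = accumulate([[1 + 1j, 2, 3], [1, 1j]], B=4)
```

The resulting vector holds the (B+1) non-redundant coefficients that the inverse transformation needs.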

Fig. 2 shows a principle drawing of a structure allowing for implementing a method of processing an input audio signal for generating an output audio signal having the reverberation characteristics of a specific acoustic environment according to another exemplary embodiment.

The structure of Fig. 2 is a structure for a specific numerical scenario, whereby the numbers in Fig. 2 indicate exemplary buffer sizes and exemplary numbers of Discrete-Fourier-Transform (DFT) operating coefficients used in the respective buffers. For the example of Fig. 2, there are 7 groups. Group 0 has population P_0=1 and uses N_0=257 operating coefficients. For the other groups it is: P_1=1 and N_1=211, P_2=2 and N_2=161, P_3=2 and N_3=121, P_4=1 and N_4=67, P_5=1 and N_5=28, P_6=4 and N_6=21. For this example, the block size B needs to be no less than 256.
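The numbers of this example can be tallied with a few lines of Python. The comparison figure is an assumption made for illustration only: it supposes the smallest admissible block size B = 256 and an exact scheme that stores all (B+1) non-redundant coefficients for every RIR block. The lower bound on B presumably follows from N_0 = 257 not being allowed to exceed B+1.

```python
P = [1, 1, 2, 2, 1, 1, 4]               # group populations from Fig. 2
N = [257, 211, 161, 121, 67, 28, 21]    # coefficients per block in each group
B = 256                                 # assumed: smallest admissible block size

K = sum(P)                                         # number of RIR blocks
aupols_coeffs = sum(p * n for p, n in zip(P, N))   # stored RIR coefficients
exact_coeffs = K * (B + 1)                         # all bins kept (assumption)
ratio = aupols_coeffs / exact_coeffs               # roughly 0.39
```

Under these assumptions the example stores 1211 complex RIR coefficients for its 12 blocks instead of 3084.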

Claims

1. A method of processing an input audio signal for generating an output audio signal having the reverberation characteristics of a specific acoustic environment represented by its pre-recorded Room-Impulse-Response (“RIR”), the method comprising the following steps:
a) providing a pre-recorded RIR of a specific acoustic environment, the RIR being represented by its RIR samples;
b) providing a discrete input audio signal, the discrete input audio signal being represented by its audio signal samples;
c) dividing the audio signal samples of the discrete input audio signal in a number of input audio signal blocks, each input audio signal block having the same size in audio signal samples and/or the same number of audio signal samples;
d) dividing the samples of the RIR in a number of RIR blocks, each RIR block having the same size in RIR samples, whereby the size in RIR samples of the RIR blocks is equal to the size in audio signal samples of the input audio signal blocks;
e) determining if one input audio signal block becomes available, and, if one input audio signal block is available, producing an output audio signal block by processing the input block, with the output audio signal block having the same number of audio signal samples as the input audio signal block;
f) determining a number of RIR operating coefficients, particularly transformation coefficients, more particularly DFT coefficients, for each block of the RIR on basis of a first processing rule;
g) assigning the number of RIR operating coefficients to each RIR block of the RIR;
h) storing the RIR operating coefficients assigned to the respective RIR blocks of the RIR as static values in at least one memory unit;
i) using the RIR operating coefficients of the RIR together with corresponding time-varying operating coefficients of the input audio signal for determining an output audio signal having the reverberation characteristics of the acoustic environment on basis of a second processing rule.

2. The method according to Claim 1, wherein the second processing rule is applied for every incoming input audio signal block and comprises:

- pairing two directly consecutive in time input audio signal blocks of the input audio signal to generate a paired block having double the size of each of the original input audio signal blocks;

- applying a DFT transformation to each paired block;

- using the DFT coefficients as the time-varying operating coefficients of the input audio signal;

- performing the complex multiplications of the static RIR operating coefficients with their corresponding time-varying operating coefficients of the input audio signal, whereby the set of the operating coefficients is a subset of the DFT coefficients required for the exact modeling of the RIR;

- accumulating the complex multiplication results for all sets of RIR operating coefficients;

- applying an inverse DFT transformation, whereby the length of the inverse DFT transformation corresponds to the size of the input paired block, resulting in an output time-domain block that has size equal to the size of the input paired block; and

- discarding the first half of the output time-domain block and using the remaining second half of the output time-domain block as the block of the output audio signal.

3. The method according to Claim 1 or 2, further comprising applying the first processing rule based on an energy-based time-frequency tiling-process (“EBTFT-process”).

4. The method according to Claim 3, further comprising applying a time-domain window function to each RIR block to modify the first and last samples of each block so as to generate blocks, particularly gradually, increasing from a zero absolute value at a first sample and, particularly gradually, decreasing to a zero absolute value at a last sample.

5. The method according to Claim 3 or 4, further comprising appending a number of zero samples after the last sample of each block so as to generate double-sized blocks.

6. The method according to Claim 5, further comprising arranging the double-sized blocks as columns of a real matrix having a number of rows and a number of columns, whereby the number of rows corresponds to the number of samples of each double-sized block and the number of columns corresponds to the number of RIR blocks.

7. The method according to Claim 6, further comprising applying a DFT transformation to each column of the real matrix, and applying a replacement rule to each of the columns so as to replace each column by the squared magnitude of its DFT transformation, resulting in a matrix of the same size having only real positive elements.

8. The method according to Claim 7, further comprising removing all last rows comprising redundant information and doubling the elements of all rows except of those of the first and the last row, so as to generate a matrix of real positive elements, whereby the elements of the matrix represent the energy distribution of the particular RIR in the time-frequency domain.

9. The method according to Claim 8, further comprising applying a filter function or operation, particularly a smoothing filter function or operation, to the energy distribution function.

10. The method according to Claim 9, further comprising applying an energy threshold rule to the elements of each column, such that only the first elements of each column that sum up to a threshold energy, e.g. 90% of the total energy of the respective column, are kept, whereas the remaining elements of the respective column are set to zero, resulting in a modified matrix.

11. The method according to Claim 10, further comprising generating a strictly-monotonically decreasing sequence indicating for each column of the matrix the remaining energy of the matrix starting from the particular column and normalizing this sequence with the sum of all energies of all columns, thereby generating a strictly-monotonically decreasing sequence in the interval between 0 and 1.

12. The method according to Claim 11, further comprising modifying the decay rate of the sequence by applying a transformation on the sequence that converts an arbitrary strictly-monotonically-decreasing sequence to another sequence of the same property, that takes values in the same interval.

13. The method according to Claim 10, further comprising determining a second sequence based on the modified matrix that for each particular column of the modified matrix expresses the sum of all elements of the respective column of the modified matrix.

14. The method according to Claim 13, further comprising determining a third sequence on basis of the modified matrix, the strictly-monotonically decreasing sequence, and the second sequence, whereby the third sequence is a monotonically decreasing sequence.

15. The method according to Claim 14, further comprising applying a grouping rule to the samples of the third sequence so as to group together consecutive samples having the same value, whereby this value represents the number of RIR operating coefficients for each RIR block in the respective group of RIR blocks; wherein

16. The method according to Claim 15, wherein samples of the third sequence with unique values each form its own group, whereby these unique values represent the number of RIR operating coefficients for the respective RIR block.

17. A non-transitory computer readable medium comprising or storing computer-executable instructions, which when executed by a processor of a digital signal processing unit cause the digital signal processing unit to perform the method of any of the preceding Claims.

18. A digital signal processing unit comprising at least one processor having computer-executable instructions, which when executed by the at least one processor cause the digital signal processing unit to perform the method of any of Claims 1 - 16.

19. An apparatus for processing an input audio signal, comprising a digital signal processing unit according to Claim 18.

20. A vehicle comprising an apparatus for processing an input audio signal according to Claim 19.