EP3275213B1 - Method and apparatus for driving an array of loudspeakers with drive signals - Google Patents

Method and apparatus for driving an array of loudspeakers with drive signals

Info

Publication number
EP3275213B1
Authority
EP
European Patent Office
Prior art keywords
listener
pose
unit
uncertainty
audio zone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
EP15725269.3A
Other languages
German (de)
French (fr)
Other versions
EP3275213A1 (en)
Inventor
Michael BÜRGER
Thomas Richter
Mengqiu ZHANG
Heinrich LÖLLMANN
Walter Kellermann
André KAUP
Yue Lang
Peter GROSCHE
Karim Helwani
Giovanni Cordara
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Publication of EP3275213A1
Application granted
Publication of EP3275213B1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30Control circuits for electronic adaptation of the sound field
    • H04S7/302Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S7/303Tracking of listener position or orientation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/13Application of wave-field synthesis in stereophonic audio systems

Definitions

  • the wave field synthesis apparatus 200 is configured to carry out a method, wherein the listeners 250, 252 are detected and their poses estimated. In the illustrated example, this is done with the help of a camera system 214a, 214b, such as a stereo camera setup or dedicated devices providing the required depth information.
  • FIG. 3 is a flow chart of a method in accordance with the present invention.
  • a pose of a listener is identified, wherein the pose comprises a location and an orientation of the listener.
  • a head detection and tracking algorithm can be used, wherein the head detection is based on the so-called Viola-Jones approach, published in P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features", Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. I-511 - I-518, 2001.
  • the algorithm may not be able to estimate the listeners' positions for arbitrary head orientations.
  • the listeners' positions can be tracked over time using a listener tracking unit. For tracking, different features such as color, depth, and dominant characteristics within each facial region can be used, making the listener tracking unit more robust against potential illumination inconsistencies or complex background regions.
  • the depth information can be used for a rough background/foreground segmentation of the facial regions, to detect outliers in the optical flow and/or to infer a 3D position for detected faces.
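  • As an illustration of this detection step, the following Python sketch uses OpenCV's bundled Viola-Jones Haar cascade and assumes a depth map registered to the color frame; the library choice, function names and parameter values are illustrative and not prescribed by the patent:

```python
import cv2
import numpy as np

# Viola-Jones detector using the Haar cascade shipped with OpenCV.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_listener_heads(frame_bgr, depth_map):
    """Detect faces and infer a rough 3D position for each one; depth_map
    is assumed to be registered to frame_bgr (same resolution)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                          minNeighbors=5)
    poses = []
    for (x, y, w, h) in faces:
        # Median depth over the facial region suppresses background pixels
        # that fall inside the detection rectangle.
        z = float(np.median(depth_map[y:y + h, x:x + w]))
        poses.append({"rect": (x, y, w, h),
                      "center_px": (x + w / 2.0, y + h / 2.0),
                      "depth": z})
    return poses
```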
  • In step S20, sound field drive signals are generated for causing the array of loudspeakers to generate at least one sound field at at least one audio zone, and/or in step S30, binaural drive signals are generated for causing the array of loudspeakers to generate specified sound pressures at at least two locations.
  • one or more sound reproduction parameters are adapted based on the identified pose.
  • the one or more sound reproduction parameters relate to the generation of sound field drive signals and/or the generation of binaural drive signals.
  • FIG. 4 is a flow chart that illustrates step S10 of identifying a listener's pose in more detail.
  • In a first step S11, one or more image frames are acquired.
  • In a second step S12, a pose of the listener is detected in the one or more first image frames.
  • the facial regions can be transformed into the HSV color space and first histograms of the hue-values are determined in step S13. Dominant edges are defined as feature points and searched within each facial region. Using color information in the three color spaces RGB, HSV, and YCbCr, a segmentation mask can be created. Depth information can also be used in determining the segmentation mask.
  • In step S14, one or more subsequent image frames are acquired.
  • In step S15, the pose of the listener is tracked in the one or more subsequent image frames.
  • the feature sets from one or more previous image frames can be tracked into a current frame by using an optical flow approach, e.g. the approach presented in J.-Y. Bouguet, "Pyramidal implementation of the Lucas Kanade feature tracker", Intel Corporation, Microprocessor Research Labs, 2000.
  • the hue histograms, initially determined within the detected facial regions, are projected back onto the current frame, each resulting in a probability map.
  • the respective user position is shifted to the most likely image region in the current frame.
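  • One plausible realization of this back-projection step with OpenCV primitives is sketched below; mean shift stands in for the shift to the most likely image region, and the mask thresholds are illustrative assumptions:

```python
import cv2

def init_hue_histogram(frame_bgr, face_rect):
    """Compute the first hue histogram within the detected facial region."""
    x, y, w, h = face_rect
    hsv = cv2.cvtColor(frame_bgr[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    # Ignore weakly saturated/dark pixels, whose hue is unreliable.
    mask = cv2.inRange(hsv, (0, 60, 32), (180, 255, 255))
    hist = cv2.calcHist([hsv], [0], mask, [180], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    return hist

def track_by_backprojection(frame_bgr, hue_hist, track_window):
    """Project the histogram back onto the current frame (probability map)
    and shift the window to the most likely image region via mean shift."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    prob_map = cv2.calcBackProject([hsv], [0], hue_hist, [0, 180], scale=1)
    term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)
    _, track_window = cv2.meanShift(prob_map, track_window, term_crit)
    return track_window, prob_map
```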
  • Depth information can be acquired separately, e.g. using a 3D detection system that emits and detects structured light, and/or by using a time-of-flight detection.
  • the probability map can be used to detect the region to which the face of a listener will most likely move in the following frames.
  • the tracking can be stopped and face detection re-initialized every N frames (wherein N can range from 5 to 50).
  • RANSAC-based pose tracking can be used as a criterion for identifying when the tracking is lost.
  • In step S16, a subsequent histogram is computed based on the tracked pose of the listener in the one or more subsequent image frames.
  • In step S17, an uncertainty level of the tracked pose is determined.
  • the uncertainty level can be determined based on how far the subsequent histogram differs from the first histogram.
  • a sound reproduction parameter is adapted based on the determined and/or the tracked pose of the listener.
  • the pose information serves as input for an adaptation stage 230, which is an adaptation unit which controls the sound reproduction unit 220 accordingly, i.e., the individual acoustic scenes are adapted to the poses of the listeners.
  • This adaptation may comprise different steps, depending on the scenario and algorithm (binaural rendering vs. sound field synthesis) used for reproduction.
  • the local sound fields can be shifted according to the listeners' positions such that all listeners are provided with the desired, personalized virtual acoustic scenes at all times.
  • Local SFS techniques aim to reproduce a desired sound field in multiple spatially extended areas (e.g., audio zones). Such audio zones may be referred to as bright zones or dark zones. In bright zones, the sound field can be perceived by a listener, in dark zones (quiet zones), the sound field is attenuated (e.g., corresponds to silence or is otherwise not perceivable).
  • the size of this area can be adapted such that each listener within the area can be provided with the same desired hearing impression.
  • the sound reproduction unit can be adapted such that the positions at which the sound field can be controlled always coincide with the ear positions such that the desired ILDs and ITDs (provided by the binaural input signals) can be evoked at all times even for moving listeners.
  • the binaural input signals provided to the binaural rendering system can be adapted in order to avoid rotations of the virtual acoustic scenes.
  • the adaptation of the sound reproduction parameter can also be based on the determined uncertainty level. For example, if the uncertainty level increases compared to a previously determined uncertainty level, the size of an audio zone can be increased.
  • the loudspeaker prefilters which provide a certain listener i with personalized sound are also adapted if another listener j is moving. This is necessary since, in addition to the generation of the desired sound field in zone i, one or more quiet zones (spatial zeros) need to be generated and adapted to the positions of all other listeners j.
  • the sound reproduction unit can be triggered by the listener pose identifying unit, i.e., sound reproduction for a particular zone will only start if at least one listener is present in that zone, and sound reproduction for a particular zone will stop if no listener is present in that zone anymore.
  • In step S19a, the determined uncertainty level is compared with a predetermined threshold. If it is determined that the uncertainty level is higher than the predetermined threshold, in step S19b the pose of the listener is detected in the one or more subsequent image frames. If the uncertainty level is not too high, the method can proceed with acquiring further subsequent image frames.
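  • The interplay of steps S14 to S19b can be summarized as a supervision loop. The sketch below is a hypothetical harness around detector and tracker callables; N_REINIT and UNCERTAINTY_MAX are illustrative free parameters, not values taken from the patent:

```python
N_REINIT = 30          # re-run full detection every N frames (N in 5..50)
UNCERTAINTY_MAX = 0.4  # predetermined threshold on the uncertainty level

def pose_identification_loop(frames, detect, track, uncertainty_of):
    """Steps S11-S19b as a supervision loop: cheap tracking per frame,
    full detection at start, every N frames, and whenever the uncertainty
    level exceeds the threshold (step S19b)."""
    state = None
    for idx, frame in enumerate(frames):
        if state is None or idx % N_REINIT == 0:
            state = detect(frame)        # steps S12/S13: detection + histogram
        else:
            state = track(frame, state)  # step S15: histogram-based tracking
            if uncertainty_of(state) > UNCERTAINTY_MAX:
                state = detect(frame)    # step S19b: fall back to detection
        yield state                      # pose handed to the adaptation unit
```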
  • the present orientation estimation algorithm combines absolute orientation detection and relative orientation tracking. Based on the detection of well-known facial features, such as eyes, nose, or mouth, and their corresponding depth values, the three rotation angles roll, pitch, and yaw can be calculated for each listener. Again, the Viola-Jones approach can be used as detector. For the case that not all required features can be detected successfully, e.g. if nose, mouth and/or eyes cannot be detected properly, the approach switches from absolute orientation estimation to relative orientation tracking, trying to follow the change in orientation over time.
  • the approach of relative orientation tracking comprises three steps. First, features are detected within the facial region in the previous image frame.
  • the features are tracked into the current image frame.
  • The algorithms presented in J. Shi and C. Tomasi, "Good features to track", Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 593-600, 1994 and J.-Y. Bouguet, "Pyramidal implementation of the Lucas Kanade feature tracker", Intel Corporation, Microprocessor Research Labs, 2000 can be used for feature detection and tracking.
  • An iterative RANSAC algorithm can be used as a method to detect the geometric transformation among the features' positions. It can be assumed that, when RANSAC does not converge (because too few features are matched, for example), the tracking is lost.
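  • A minimal sketch of this lost-tracking criterion, using OpenCV's RANSAC-based similarity-transform estimator in place of a hand-rolled iterative RANSAC (the function choice and inlier threshold are assumptions):

```python
import cv2
import numpy as np

def tracking_lost(prev_pts, curr_pts, min_inliers=8):
    """Fit a similarity transform between matched feature positions with
    RANSAC; failure to converge, or too few inliers, flags lost tracking."""
    if len(prev_pts) < 3 or len(curr_pts) < 3:
        return True
    matrix, inliers = cv2.estimateAffinePartial2D(
        np.asarray(prev_pts, np.float32), np.asarray(curr_pts, np.float32),
        method=cv2.RANSAC, ransacReprojThreshold=3.0)
    return matrix is None or inliers is None or int(inliers.sum()) < min_inliers
```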
  • the Viola-Jones face detector is used to detect a face at an initial image frame and to initialize the listener tracking unit by computing the hue histogram in the detected region.
  • the histogram defines the target color distribution to be tracked.
  • features are searched within the initially detected region. Since the final listener tracking unit should also be able to track other objects than faces, no facial features like eyes, nose or mouth are chosen. Instead, a feature is defined to be a dominant edge within the detected region.
  • the minimal eigenvalue is computed at each pixel position of the input image frame, resulting in a map of eigenvalues M_eig(m, n, t).
  • the minimal eigenvalue is used as "corner quality measurement". After performing a non-maximum suppression in a 3x3 neighborhood, a point at a position (m, n) is rejected if M_eig(m, n, t) < q * max_(x,y) M_eig(x, y, t) holds, where q is a pre-defined quality level. Finally, the remaining corners are thinned out by rejecting all points for which there is a stronger corner within a distance of less than d_corner.
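  • This corner selection rule (minimal-eigenvalue map, quality level q, non-maximum suppression, minimum distance d_corner) corresponds closely to the Shi-Tomasi detector available in OpenCV; the parameter values in the sketch below are illustrative:

```python
import cv2

def detect_corner_features(frame_gray, face_rect, q=0.01, d_corner=7):
    """Select dominant edges/corners inside the detected region, using the
    minimal eigenvalue as corner quality measure with quality level q and
    minimum inter-corner distance d_corner."""
    x, y, w, h = face_rect
    roi = frame_gray[y:y + h, x:x + w]
    corners = cv2.goodFeaturesToTrack(
        roi, maxCorners=200, qualityLevel=q, minDistance=d_corner,
        useHarrisDetector=False)  # False -> minimal-eigenvalue criterion
    if corners is None:
        return []
    # Shift coordinates from ROI space back to full-frame coordinates.
    return [(float(cx) + x, float(cy) + y) for [[cx, cy]] in corners]
```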
  • the image to be tracked at a subsequent image frame is segmented first, using color information.
  • This step is similar to the color segmentation in the simple tracking mode.
  • the segmentation is done in three color spaces, namely RGB, HSV and YCbCr.
  • in the YCbCr color space, a color is represented with the luminance component Y, the blue-difference chroma component Cb and the red-difference chroma component Cr.
  • the segmentation in HSV is done, as in the simple mode, using fixed values for the upper and lower bounds of hue, saturation and value, respectively.
  • the segmentation in RGB and YCbCr is done adaptively according to the color distribution within the initially detected region of interest.
  • a histogram for each color channel in RGB and YCbCr is computed within the face rectangle, defining the upper and the lower bound of the corresponding color channel. Since the detection region is usually larger than the target object, the rectangular region can be shrunk by a factor p in height and width.
  • a pixel in the segmentation mask M(m, n, t) is then only marked if its color is within the computed ranges in RGB, HSV and YCbCr.
  • RGB and YCbCr can lead to a more accurate segmentation result and thus a better result for the initial mask M(m, n, t).
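  • A sketch of this combined segmentation is given below; per-channel minima and maxima over the shrunk rectangle stand in for the histogram-derived bounds, and the fixed HSV limits are illustrative values:

```python
import cv2

def segmentation_mask(frame_bgr, face_rect, p=0.8,
                      hsv_low=(0, 40, 40), hsv_high=(25, 255, 255)):
    """Build M(m, n, t): a pixel is marked only if its color lies within
    the computed ranges in RGB, HSV and YCbCr."""
    x, y, w, h = face_rect
    # Shrink the rectangle by factor p, since the detected region is
    # usually larger than the target object.
    dx, dy = int(w * (1 - p) / 2), int(h * (1 - p) / 2)
    roi = frame_bgr[y + dy:y + h - dy, x + dx:x + w - dx]

    def adaptive_range_mask(img, roi_img):
        # Per-channel bounds derived from the shrunk region of interest.
        lo = roi_img.reshape(-1, 3).min(axis=0)
        hi = roi_img.reshape(-1, 3).max(axis=0)
        return cv2.inRange(img, lo, hi)

    ycc = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    ycc_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2YCrCb)
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)

    mask = cv2.inRange(hsv, hsv_low, hsv_high)   # fixed HSV bounds
    mask &= adaptive_range_mask(frame_bgr, roi)  # adaptive RGB bounds
    mask &= adaptive_range_mask(ycc, ycc_roi)    # adaptive YCbCr bounds
    return mask
```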
  • the initially detected sparse feature set, which comprises the image coordinates of the calculated feature points, is tracked using an optical flow approach, namely a pyramidal implementation of the Lucas Kanade feature tracker.
  • the basic idea of this approach is to subdivide images into different resolution levels. Then, the motion of the input feature set is estimated, beginning on the lowest resolution level up to the original image resolution. Thereby, the result at a specific resolution level is used as initial guess for the next resolution level.
  • the motion is estimated by comparing the local neighborhood within a specific window of size w_m x w_n around the feature point to be tracked.
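  • In Python, this pyramidal coarse-to-fine estimation is available directly; the sketch below assumes grayscale frames and a sparse point set, with window size and pyramid depth as illustrative parameters:

```python
import cv2
import numpy as np

def track_features(prev_gray, curr_gray, prev_pts, w=(15, 15), levels=3):
    """Track a sparse feature set into the current frame with the pyramidal
    Lucas-Kanade tracker: motion is estimated from the coarsest pyramid
    level upwards, each level seeding the next finer one; the local
    neighborhood is compared within a window of size w_m x w_n."""
    p0 = np.float32(prev_pts).reshape(-1, 1, 2)
    p1, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, p0, None, winSize=w, maxLevel=levels)
    ok = status.reshape(-1) == 1  # keep only successfully tracked points
    return p0.reshape(-1, 2)[ok], p1.reshape(-1, 2)[ok]
```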
  • a suitable subset of loudspeakers is chosen such that the amount of cross-talk can be reduced for a particular listener at the potential cost of a slightly lower reproduction performance for the other listeners. This constitutes a tradeoff between the quality of the reproduced scene itself and the amount of cross-talk leaking into the other zones (listening areas). Discarding certain loudspeakers for reproduction is only done if at least one listener is present in a particular dark zone. Otherwise, all loudspeakers are used.
  • FIG. 5 illustrates the loudspeaker selection scheme for an exemplary scenario with a first listener 350 in a first audio zone 360 and a second listener 352 in a second audio zone 362.
  • the first audio zone 360 is a bright zone and the second audio zone 362 is a dark zone, i.e., a desired acoustic scene should be synthesized for the first listener 350, while the acoustic energy leaking to the position of the second listener 352 (cross-talk) should be minimized.
  • the angular direction of a particular loudspeaker l is denoted as α_l and defined with respect to the point x_tan,l, which denotes the point where the connection line 370 between loudspeaker l and the circular contour around the first audio zone 360 forms a tangent.
  • Those loudspeakers 344 of the array of loudspeakers 340 for which the angular direction α_l is smaller than a minimum angle α_min are deactivated.
  • This minimum angle α_min is chosen such that the connection lines between any point in the bright zone 360 and the loudspeaker l do not intersect with the dark zone 362. Since the connection line 370 does not intersect with the dark zone 362 and since there is also no other connection line between a point in the bright zone and the loudspeaker 342, the loudspeaker 342 is not deactivated.
  • For the further loudspeaker 344, on the other hand, there would be a connection line between a point in the bright zone 360 and the further loudspeaker 344 that intersects with the dark zone 362. Therefore, the further loudspeaker 344 is deactivated.
  • x_i = [x_i, y_i]^T and x_j = [x_j, y_j]^T denote the centers of the first zone 360 and the second zone 362, respectively. With Δx and Δy denoting the differences between the coordinates of the two zone centers (the quantities illustrated in FIG. 6), the minimum angle is given by α_min = arctan(Δy/Δx).
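  • A geometric sketch of the resulting selection rule is given below, modelling both zones as circles and sampling the bright-zone contour; the names, margin and sample count are illustrative assumptions:

```python
import numpy as np

def seg_point_dist(a, b, p):
    """Shortest distance from point p to the line segment from a to b."""
    ab, ap = b - a, p - a
    t = np.clip(np.dot(ap, ab) / np.dot(ab, ab), 0.0, 1.0)
    return np.linalg.norm(ap - t * ab)

def active_loudspeakers(speakers, bright_c, r_bright, dark_c, r_dark,
                        margin=0.1, n_samples=64):
    """Indices of loudspeakers whose connection lines to the bright zone
    stay clear of the dark zone plus a fixed surrounding perimeter."""
    bright_c = np.asarray(bright_c, float)
    dark_c = np.asarray(dark_c, float)
    phi = np.linspace(0.0, 2.0 * np.pi, n_samples, endpoint=False)
    contour = bright_c + r_bright * np.stack([np.cos(phi), np.sin(phi)], 1)
    active = []
    for l, spk in enumerate(np.asarray(speakers, float)):
        # Keep loudspeaker l only if no line from it to a bright-zone point
        # passes through the dark zone or its surrounding perimeter.
        if all(seg_point_dist(spk, pt, dark_c) > r_dark + margin
               for pt in contour):
            active.append(l)  # otherwise its drive-signal weight is set to 0
    return active
```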
  • sound reproduction can be done in a frame-wise manner, e.g., using a Short-Time Fourier Transform (STFT), where preferably frame sizes should not exceed a few milliseconds.
  • different sets of loudspeaker prefilters can be computed offline in advance for different scenarios. During reproduction, the respective set of loudspeaker filters corresponding to the current reproduction scenario can be selected.
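  • A minimal sketch of such frame-wise reproduction with precomputed prefilters, using SciPy's STFT; a frame length of 256 samples at 48 kHz corresponds to roughly 5 ms, and all names and shapes are assumptions:

```python
import numpy as np
from scipy.signal import stft, istft

def render_frame_wise(audio, prefilters, fs=48000, frame_len=256):
    """Apply precomputed frequency-domain loudspeaker prefilters frame-wise.

    audio:      mono source signal, shape (n_samples,)
    prefilters: complex per-bin weights, shape (n_speakers, frame_len//2 + 1),
                selected offline for the current reproduction scenario
    """
    _, _, spec = stft(audio, fs=fs, nperseg=frame_len)      # analysis
    drive_signals = []
    for weights in prefilters:
        filtered = weights[:, None] * spec                  # per-bin weighting
        _, sig = istft(filtered, fs=fs, nperseg=frame_len)  # synthesis
        drive_signals.append(sig)
    return np.stack(drive_signals)  # one drive signal per loudspeaker
```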
  • FIG. 7 shows a schematic illustration of a listener pose identifying unit 410 in accordance with the present invention.
  • the listener pose identifying unit 410 comprises an uncertainty determining unit 412, a listener detection unit 414, a listener histogram determining unit 416 and a listener tracking unit 418.
  • the listener pose identifying unit 410 can also serve as a distance detection unit, i.e., it can be configured to detect a distance of a listener to a reference point.
  • Untethered personalized information systems, e.g., in a museum:
  • the video-based pose detection and tracking system triggers the sound reproduction system, which acoustically provides information about the respective exhibit if a visitor is detected in a predefined area in front of it while keeping the acoustic energy in the other areas low.
  • Personalized TV sound for multiple viewers: Combining state-of-the-art 3D imaging systems and multi-zone sound reproduction allows two (or more) users to watch content with their individual 2D or 3D audio. For example, two listeners can watch different movies with a single system. Again, sound reproduction is adapted to the actual positions of the possibly moving users.
  • Dialogue multiplex in teleconferencing: The system described above allows for providing individual participants, e.g., with speech from a remote site in different languages, or with different speech signals originating from different conversation partners at a remote site.
  • an adaptive system for personalized, multi-zone sound reproduction where an unknown number of possibly moving users can be provided with individual audio content and cross-talk between the individual users can be reduced by choosing a suited subset of loudspeakers for reproduction.
  • the poses (positions and orientations) of the users' heads can be tracked, e.g., with the help of a video-based system, and the obtained information is exploited in order to adapt the sound reproduction algorithm accordingly such that the desired hearing impression is maintained even if listeners move or rotate their heads.
  • the number of loudspeakers utilized for reproducing sound in another zone i is adapted in order to reduce the cross-talk leaking into zone j.
  • Embodiments of the invention may be implemented in a computer program for running on a computer system, at least including code portions for performing steps of a method according to the invention when run on a programmable apparatus, such as a computer system, or enabling a programmable apparatus to perform functions of a device or system according to the invention.
  • a computer program is a list of instructions such as a particular application program and/or an operating system.
  • the computer program may for instance include one or more of: a subroutine, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, a source code, an object code, a shared library/dynamic load library and/or other sequence of instructions designed for execution on a computer system.
  • the computer program may be stored internally on a computer readable storage medium or transmitted to the computer system via a computer readable transmission medium. All or some of the computer program may be provided on transitory or non-transitory computer readable media permanently, removably or remotely coupled to an information processing system.
  • the computer readable media may include, for example and without limitation, any number of the following: magnetic storage media including disk and tape storage media; optical storage media such as compact disk media (e.g., CD-ROM, CD-R, etc.) and digital video disk storage media; non-volatile memory storage media including semiconductor-based memory units such as FLASH memory, EEPROM, EPROM, ROM; ferromagnetic digital memories; MRAM; volatile storage media including registers, buffers or caches, main memory, RAM, etc.; and data transmission media including computer networks, point-to-point telecommunication equipment, and carrier wave transmission media, just to name a few.
  • a computer process typically includes an executing (running) program or portion of a program, current program values and state information, and the resources used by the operating system to manage the execution of the process.
  • An operating system is the software that manages the sharing of the resources of a computer and provides programmers with an interface used to access those resources.
  • An operating system processes system data and user input, and responds by allocating and managing tasks and internal system resources as a service to users and programs of the system.
  • the computer system may for instance include at least one processing unit, associated memory and a number of input/output (I/O) devices.
  • the computer system processes information according to the computer program and produces resultant output information via I/O devices.
  • connections as discussed herein may be any type of connection suitable to transfer signals from or to the respective nodes, units or devices, for example via intermediate devices. Accordingly, unless implied or stated otherwise, the connections may for example be direct connections or indirect connections.
  • the connections may be illustrated or described in reference to being a single connection, a plurality of connections, unidirectional connections, or bidirectional connections. However, different embodiments may vary the implementation of the connections. For example, separate unidirectional connections may be used rather than bidirectional connections and vice versa.
  • plurality of connections may be replaced with a single connection that transfers multiple signals serially or in a time multiplexed manner. Likewise, single connections carrying multiple signals may be separated out into various different connections carrying subsets of these signals. Therefore, many options exist for transferring signals.
  • logic blocks are merely illustrative and that alternative embodiments may merge logic blocks or circuit elements or impose an alternate decomposition of functionality upon various logic blocks or circuit elements.
  • architectures depicted herein are merely exemplary, and that in fact many other architectures can be implemented which achieve the same functionality.
  • the examples, or portions thereof, may be implemented as software or code representations of physical circuitry or of logical representations convertible into physical circuitry, such as in a hardware description language of any appropriate type.
  • the invention is not limited to physical devices or units implemented in nonprogrammable hardware but can also be applied in programmable devices or units able to perform the desired device functions by operating in accordance with suitable program code, such as mainframes, minicomputers, servers, workstations, personal computers, notepads, personal digital assistants, electronic games, automotive and other embedded systems, cell phones and various other wireless devices, commonly denoted in this application as "computer systems".

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)
  • Image Analysis (AREA)

Description

    TECHNICAL FIELD
  • The present invention relates to a wave field synthesis apparatus and a method for driving an array of loudspeakers with drive signals. The present invention also relates to a computer-readable storage medium storing program code, the program code comprising instructions for carrying out a method for driving an array of loudspeakers with drive signals.
  • BACKGROUND
  • There are many different approaches to virtual acoustics for multiple listeners. They can be divided into two main groups. A first group comprises local sound field synthesis (SFS) approaches, such as (higher order) ambisonics, wave field synthesis and techniques related to it, and a multitude of least squares approaches (pressure matching, acoustic contrast maximization, ...). These techniques aim at reproducing a desired sound field in multiple spatially extended areas. A second group comprises binaural rendering (BR) or point-to-point rendering approaches, e.g., binaural beamforming or crosstalk cancellation. Their aim is to generate the desired hearing impression by evoking proper interaural time differences (ITDs) and interaural level differences (ILDs) at the ear positions of the listeners. Thereby, virtual sources are perceived at desired positions.
  • Both approaches require the listeners of such a reproduction system to be located at a certain position, the so-called sweet spot, at which the desired virtual acoustic scene is synthesized. If the listeners move, the hearing impression will deteriorate and may completely collapse. In case of BR, head rotations may in addition cause the virtual acoustic scene to rotate, since certain ILDs and ITDs correspond to a fixed point with respect to the listener rather than a fixed point in 3D space. Furthermore, cross-talk, i.e., sound energy intended for a first listener location leaking to the location of a second listener, is a major problem in sound reproduction for real environments.
  • US 2011/0103620 relates to apparatus and method for generating filter characteristics.
  • SUMMARY OF THE INVENTION
  • The objective of the present invention is to provide an apparatus and a method for driving an array of loudspeakers with drive signals, wherein the apparatus and the method provide high-quality personalized spatial sound to possibly moving listeners.
  • In particular, various aspects of the present disclosure have been defined in the independent claims. Further technical features of each of these aspects have been defined in the respective dependent claims.
  • A first aspect of the invention provides a wave field synthesis apparatus for driving an array of loudspeakers with drive signals, the apparatus comprising:
    • a listener pose identifying unit for identifying a pose of a listener, wherein the pose comprises a location and an orientation of a listener,
    • a sound reproduction unit for generating the drive signals, the sound reproduction unit comprising a sound field synthesizer for generating sound field drive signals for causing the array of loudspeakers to generate a sound field at at least one audio zone and/or a binaural renderer for generating binaural drive signals for causing the array of loudspeakers to generate specified sound pressures at at least two locations, and
    • an adaptation unit for adapting one or more parameters of the sound reproduction unit based on the identified pose of the listener.
  • The apparatus of the first aspect combines sound reproduction with listener pose identifying for identifying the pose (position and orientation) of one or more listeners, wherein the identified pose is used to adapt parameters of the sound reproduction unit. This can involve a modification of the loudspeaker prefilters such that the hearing impression can be maintained in case of a moving listener or a listener located off the sweet spot(s). Furthermore, depending on the listener's position, only a certain subset of loudspeakers is utilized for reproduction in order to reduce cross-talk. In case of binaural rendering, a modification of the binaural input signals can be performed in order to prevent the virtual acoustic scene from rotating.
  • The wave field synthesis apparatus of the first aspect can be part of a system for untethered personalized multi-zone sound reproduction, which adapts to the poses (positions and orientations) of multiple possibly moving listeners. With this system, the desired auditory impression can be preserved if listeners are not located precisely at a sweet spot or even move, and in embodiments cross-talk can be reduced by selecting a suitable set of active loudspeakers for reproduction depending on the listeners' positions. Information about the number and poses of the listeners can be obtained, for example, with the help of a video-based pose detection and tracking system.
  • Further, in the first aspect, the listener pose identifying unit comprises an uncertainty determining unit for determining an uncertainty level, wherein the uncertainty level comprises a location uncertainty level, which reflects an estimated uncertainty in an identified location, and/or an orientation uncertainty level, which reflects an estimated uncertainty in an identified orientation, and wherein the adaptation unit is configured to adapt a parameter of the sound reproduction unit based on the determined uncertainty level.
  • As outlined above, the adaptation unit is configured to adapt parameters based on the determined pose. However, if an uncertainty in the pose determination is high, it may be preferable to avoid any significant parameter adaptations based on the determined pose. For example, if the uncertainty in the determined location of a listener is high, it is preferable to avoid setting parameters of the sound reproduction unit such that a sharply delimited audio zone is generated with high sound volume only at the determined location. The determined location might be inaccurate and at the true location, the sound output might be insufficient.
  • Further, in the first aspect, the adaptation unit is configured to adapt a parameter indicating a size of the audio zone based on the determined uncertainty level, wherein a higher uncertainty corresponds to a larger size.
  • If the location of a listener can only be determined with high uncertainty, it is preferable to set a size parameter of a corresponding audio zone high in order to ensure that the listener is located within the audio zone. In embodiments of the invention, the size of the audio zone can be set as a linear function of the uncertainty of the location of the listener.
  • Further, in the first aspect, the adaptation unit is configured to adjust a weighting parameter, which indicates a weighting between the sound field drive signals and the binaural drive signals, based on the determined uncertainty level, wherein in particular the apparatus is configured such that the drive signals are generated using only sound field synthesis if the determined uncertainty level is higher than a predetermined threshold. Therefore, in cases of uncertain location determination, a higher emphasis can be placed on sound field synthesis, which compared to binaural rendering can be less reliant on precise knowledge of the location of a listener.
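  • The following Python sketch condenses the adaptation rules above into one hypothetical function; the base radius, gain and threshold are illustrative values, not values taken from the patent:

```python
def adapt_reproduction(uncertainty):
    """Map the pose uncertainty level (assumed normalized to [0, 1]) to
    adapted sound reproduction parameters."""
    BASE_RADIUS = 0.3  # audio zone radius in metres at zero uncertainty
    GAIN = 0.5         # linear growth of the zone size with uncertainty
    THRESHOLD = 0.8    # predetermined threshold for SFS-only operation

    zone_radius = BASE_RADIUS + GAIN * uncertainty  # linear size adaptation
    if uncertainty > THRESHOLD:
        sfs_weight = 1.0                      # only sound field synthesis
    else:
        sfs_weight = 0.5 + 0.5 * uncertainty  # shift weighting towards SFS
    return zone_radius, sfs_weight  # binaural weight is 1 - sfs_weight
```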
  • In a first implementation of the apparatus according to the first aspect, the listener pose identifying unit is configured to identify a number of listeners in the audio zone, and wherein the adaptation unit is configured to adapt a size parameter, which indicates a size of the audio zone, based on the identified number of listeners in the audio zone.
  • According to the first implementation, the one or more parameters of the sound reproduction unit can comprise one or more size parameters of one or more audio zones generated by the sound reproduction unit. Therefore, the size of an audio zone can be adapted if multiple persons or a group are present in a zone. This has the advantage that the wave field synthesis apparatus can adapt to different numbers of listeners who want to listen to the same audio content.
  • Updating of the size of the audio zone can be performed periodically, e.g. in fixed predetermined time intervals. In other embodiments, the size of the audio zone can be updated in irregular intervals, whenever new information about the number of listeners is available at the listener pose identifying unit.
  • In a second implementation of the apparatus according to the first aspect, the at least one audio zone comprises a dark audio zone and a bright audio zone and the adaptation unit is configured to set an output parameter, which corresponds to a strength of a specific drive signal for a specific loudspeaker, to zero if the adaptation unit determines that there is at least one connection line between a location of the specific loudspeaker and a point in the bright audio zone that intersects with the dark audio zone and/or with a surrounding of the dark audio zone, wherein in particular the surrounding is defined by a fixed perimeter around the dark audio zone.
  • Setting an output parameter, which corresponds to a strength of a specific drive signal for a specific loudspeaker, to zero, has the effect that this specific loudspeaker is discarded, i.e., excluded from the sound field generation.
  • Typically, all available loudspeakers are used for the task of generating one or more audio zones. Discarding loudspeakers for reducing cross-talk may seem counterintuitive, since more loudspeakers should theoretically provide a higher suppression of cross-talk as well as a smaller error in the bright zone. In real-world scenarios, however, loudspeaker imperfections, positioning errors, and reflections deteriorate the reproduction performance and especially introduce a significant amount of cross-talk. Therefore, discarding loudspeakers in the proximity of the dark-zone can reduce the amount of cross-talk and, thus, reduce the residual sound pressure level.
  • In a third implementation of the apparatus according to the first aspect, the adaptation unit is configured to control the sound reproduction unit to start and/or resume generating the drive signals if at least one listener is identified in the audio zone and/or the adaptation unit is configured to control the sound reproduction unit to stop and/or pause generating the drive signals for the audio zone if the listener pose identifying unit determines that there are no listeners in the audio zone.
  • Pausing to generate drive signals for an audio zone where no listeners are located has the advantage that cross talk to other audio zones can be avoided. For example, if there are three audio zones, and the wave field synthesis apparatus pauses generation of drive signals for a first of the three audio zones, cross talk to the second and third audio zone is avoided.
  • In a fourth implementation of the apparatus according to the first aspect, the apparatus comprises a camera input unit for obtaining image frames from one or more cameras and the listener pose identifying unit comprises:
    • a listener detection unit for detecting a location of a listener in one or more first image frames acquired by the one or more cameras,
    • a listener histogram determining unit for determining a first histogram of the listener in the one or more first image frames based on the detected location, and
    • a listener tracking unit for tracking the listener in one or more subsequent image frames that are acquired by the one or more cameras after the one or more first image frames, wherein the listener tracking unit is configured to track the listener based on the first histogram of the listener.
  • The one or more cameras can be two cameras, which are located at different positions, such that a 3D image can be derived from images acquired from the two cameras, and a 3D location of a listener can be determined.
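  • For a two-camera setup, the standard stereo relations suffice for such a 3D localization; a minimal sketch under the pinhole camera model, with all parameter names being assumptions:

```python
def stereo_depth(disparity_px, focal_px, baseline_m):
    """Classical stereo relation: depth Z = f * B / d, where d is the
    disparity of corresponding pixels between the two camera images."""
    return focal_px * baseline_m / disparity_px

def pixel_to_3d(u, v, z, fx, fy, cx, cy):
    """Back-project image point (u, v) at depth z with the pinhole model
    (fx, fy: focal lengths in pixels; cx, cy: principal point)."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)
```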
  • The listener tracking unit can be configured to use a tracking algorithm which requires a lower computational effort to track a location compared to the computational effort of the listener detection unit to detect a location.
  • Based on the location that has been detected in the first image frames, a first histogram of the listener can be determined with high accuracy. However, the location detection and the histogram determination involve significant computational effort and therefore, according to the fourth implementation, a tracking unit can be used for subsequent image frames. The tracking unit can assume that the first histogram of the listener does not change between the first image frame and the subsequent image frames. Furthermore, the tracking unit can be configured to assume that the location of the listener changes only within certain limits between an image frame and the next image frame. Based on one or more of these assumptions, the tracking unit can use simpler algorithms than the detection unit to determine a location of the listener in the subsequent image frames.
  • In a fifth implementation of the apparatus according to the first aspect, the uncertainty determining unit is configured to determine the uncertainty level based on a difference between a first histogram that is determined based on a detected location of the listener and a subsequent histogram that is determined based on a tracked location of the listener.
  • If the subsequent histogram that is determined based on the tracked location of the listener differs significantly from the first histogram that was determined based on a detected location of the listener, this can be because the detected location is inaccurate and the histogram that is determined based on the tracked location is not the histogram of the listener. Therefore, the difference between the first histogram and the subsequent histogram can be an indication for an error of the determined or tracked location of the listener.
  • In embodiments of the invention, the difference between the first histogram and the subsequent histogram can be adjusted to account for changes of a first global histogram and a subsequent global histogram, wherein the first global histogram is computed based on an entire first image frame and wherein the subsequent global histogram is computed based on an entire subsequent image frame. Global histograms of image frames can change e.g. because of changes in the lighting of the room. For example, if an artificial light is switched on in the room, all pixels in the image frames can be affected. Therefore, it can be preferable to adjust the difference computation based on a change of a global histogram.
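  • One plausible metric for this histogram difference, including the global-histogram adjustment, is sketched below using the Bhattacharyya distance; the choice of metric is an assumption, as the patent only requires some difference measure:

```python
import cv2

def uncertainty_level(first_hist, subseq_hist,
                      first_global=None, subseq_global=None):
    """Uncertainty from the divergence between the initial listener
    histogram and the histogram at the tracked location (histograms as
    produced by cv2.calcHist, float32). If global histograms are given,
    the part of the change explained by a frame-wide change, e.g. switched
    lighting, is discounted."""
    d = cv2.compareHist(first_hist, subseq_hist,
                        cv2.HISTCMP_BHATTACHARYYA)  # distance in [0, 1]
    if first_global is not None and subseq_global is not None:
        d_global = cv2.compareHist(first_global, subseq_global,
                                   cv2.HISTCMP_BHATTACHARYYA)
        d = max(0.0, d - d_global)  # adjust for the global change
    return d
```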
  • In a sixth implementation of the apparatus according to the first aspect, the apparatus further comprises a distance detection unit which is configured to determine a distance of the listener from a reference point based on a size of a face region of the listener in the one or more image frames. For example, the reference point can be located at the location of the one or more cameras. When a listener is closer to the cameras, his face appears larger in the acquired image frames. Therefore, a distance of the listener can be determined based on the size of the listener's face in the one or more image frames.
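  • A minimal sketch of this face-size-based distance estimate under the pinhole camera model; the assumed average face width is an illustrative constant:

```python
def distance_from_face_size(face_width_px, focal_px, real_face_width_m=0.16):
    """Pinhole-model estimate: the closer the listener, the larger the face
    appears, so distance = f * W_real / w_pixels. The average face width of
    0.16 m is an assumed constant, not a value from the patent."""
    return focal_px * real_face_width_m / face_width_px
```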
  • A second aspect of the invention refers to a method for driving an array of loudspeakers with drive signals according to claim 8.
  • The methods according to the second aspect of the invention can be performed by the system according to the first aspect of the invention. Further features or implementations of the method according to the second aspect of the invention can perform the functionality of the apparatus according to the first aspect of the invention and its different implementation forms.
  • The sound reproduction parameters can be parameters of a sound reproduction unit.
  • In a first implementation of the method of the second aspect, identifying the pose of the listener comprises the steps:
    • acquiring one or more first image frames,
    • detecting a pose of the listener in the one or more first image frames,
    • computing a first histogram of the listener in the one or more first images based on the detected pose,
    • acquiring one or more subsequent image frames, and
    • tracking a pose of the listener in the one or more subsequent image frames based on the first histogram.
  • In a second implementation of the method of the second aspect, the method of the first implementation further comprises the steps:
    • computing a subsequent histogram of the listener in the one or more subsequent image frames based on the tracked pose,
    • determining an uncertainty level based on a difference between the first histogram and the subsequent histogram, and
    • adapting a sound reproduction parameter based on the determined uncertainty level.
  • In a third implementation of the method of the second aspect, the method further comprises a step of detecting the location of the listener in the one or more subsequent image frames if the determined uncertainty level is higher than a predetermined threshold.
  • Detecting the location of a listener without prior knowledge about a histogram of the listener or an estimate of the location typically involves a higher computational effort compared to tracking a location of a listener where a previous location of the listener is known. Therefore, the method of the third implementation has the advantage that a detection of the listener location is performed only if the uncertainty level is so high that it is no longer sensible to rely on the result of the tracking unit.
  • A third aspect of the invention refers to a computer-readable storage medium storing program code, the program code comprising instructions for carrying out the method of the second aspect or one of the implementations of the second aspect.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • To illustrate the technical features of embodiments of the present invention more clearly, the accompanying drawings provided for describing the embodiments are introduced briefly in the following. The accompanying drawings in the following description show merely some embodiments of the present invention, but modifications of these embodiments are possible without departing from the scope of the present invention as defined in the claims.
  • FIG. 1
    shows a schematic illustration of a wave field synthesis apparatus according to an embodiment of the invention,
    FIG. 2
    shows a schematic illustration of a system comprising a wave field synthesis apparatus according to another embodiment of the invention,
    FIG. 3
    is a flow chart of a method in accordance with the present invention,
    FIG. 4
    is a flow chart which illustrates in more detail the step of identifying a pose of a listener,
    FIG. 5
    shows a schematic illustration of a loudspeaker selection scheme for an exemplary scenario with a first listener in a first audio zone and a second listener in a second audio zone,
    FIG. 6
    shows a schematic illustration of the definition of the quantities required to determine the minimum angle of the loudspeaker selection scheme of FIG. 5, and
    FIG. 7
    shows a schematic illustration of a listener pose identifying unit in accordance with the present invention.
    DETAILED DESCRIPTION OF THE EMBODIMENTS
  • FIG. 1 shows a schematic illustration of a wave field synthesis apparatus 100. The wave field synthesis apparatus 100 comprises a listener pose identifying unit 110, a sound reproduction unit 120, and an adaptation unit 130. The sound reproduction unit 120 comprises a sound field synthesizer 122 and a binaural renderer 124.
  • FIG. 2 shows an overview block diagram of a system 202 in accordance with the present invention. In the scenario shown in FIG. 2, a first listener 250 and a second listener 252 are provided with personalized sound. The system 202 comprises a wave field synthesis apparatus 200, a camera system 214a, 214b, and an array of loudspeakers 240. The array of loudspeakers 240 is driven by drive signals that are generated by the personalized sound reproduction system, which is the sound reproduction unit 220 of the wave field synthesis apparatus 200. The wave field synthesis apparatus 200 further comprises a first and a second camera input unit 212a, 212b for connecting external cameras 214a, 214b, and a first and a second video-based pose estimation system 210a, 210b, which are listener pose identifying units. The video-based pose estimation systems estimate the poses of the listeners and pass the estimated pose data (UDP) to the adaptation stage 230.
  • The drive signals generated by the sound reproduction unit 220 cause the array of loudspeakers to generate sound waves that generate a first audio zone 260 at a location of the first listener 250 and a second audio zone 262 at a location of the second listener 252. The location of the first audio zone 260 corresponds to an updated location of the first listener 250 that is different from a previous location 251 of the first listener. The change in location of the first listener corresponds also to a change in orientation, i.e., the pose of the first listener has changed.
  • The wave field synthesis apparatus 200 is configured to carry out a method wherein the listeners 250, 252 are detected and their poses estimated. In the illustrated example, this is done with the help of a camera system 214a, 214b, such as a stereo camera setup or dedicated devices providing the required depth information.
  • FIG. 3 is a flow chart of a method in accordance with the present invention. In a first step S10, a pose of a listener is identified, wherein the pose comprises a location and an orientation of the listener. For localizing the individual listeners, a head detection and tracking algorithm can be used, wherein the head detection is based on the so-called Viola-Jones approach, published by P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features", Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. I-511–I-518, 2001.
  • Since the detection is trained on a specific kind of input data, e.g., frontal or profile faces, the algorithm may not be able to estimate the listeners' positions for arbitrary head orientations. However, using the detection results as initialization, the listeners' positions can be tracked over time using a listener tracking unit. For tracking, different features such as color, depth, and dominant characteristics within each facial region can be used, making the listener tracking unit more robust against potential illumination inconsistencies or complex background regions. The depth information can be used for a rough background/foreground segmentation of the facial regions, to detect outliers in the optical flow and/or to infer a 3D position for detected faces.
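  • As a minimal sketch of how such a Viola-Jones detector could be instantiated, the snippet below uses the stock Haar cascade shipped with OpenCV; the parameter values are assumptions.

```python
import cv2

# Stock frontal-face Haar cascade distributed with OpenCV.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame_bgr):
    """Return (x, y, w, h) rectangles of detected frontal faces."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
```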
  • In a further step S20, sound field drive signals are generated for causing the array of loudspeakers to generate at least one sound field at at least one audio zone, and/or in step S30 binaural drive signals are generated for causing the array of loudspeakers to generate specified sound pressures at at least two locations.
  • In another step S40, one or more sound reproduction parameters are adapted based on the identified pose. In particular, the one or more sound reproduction parameters relate to the generation of sound field drive signals and/or the generation of binaural drive signals.
  • FIG. 4 is a flow chart that illustrates step S10 of identifying a listener's pose in more detail.
  • In a first step S11, one or more image frames are acquired. In a second step S12, a pose of the listener is detected in the one or more first image frames.
  • Subsequently, the facial regions can be transformed into the HSV color space, and first histograms of the hue values are determined in step S13. Dominant edges are defined as feature points and are searched for within each facial region. Using color information in the three color spaces RGB, HSV, and YCbCr, a segmentation mask can be created. Depth information can also be used in determining the segmentation mask.
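  • A minimal sketch of this histogram initialization, assuming OpenCV conventions (hue range 0–180) and assumed saturation/value bounds for masking out unreliable pixels:

```python
import cv2
import numpy as np

def init_hue_histogram(frame_bgr, face_rect):
    """Compute a normalized hue histogram of the detected face region."""
    x, y, w, h = face_rect
    roi = cv2.cvtColor(frame_bgr[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    # Ignore very dark or desaturated pixels (assumed bounds).
    mask = cv2.inRange(roi, np.array((0., 60., 32.)),
                       np.array((180., 255., 255.)))
    hist = cv2.calcHist([roi], [0], mask, [180], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    return hist
```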
  • In step S14, one or more subsequent image frames are acquired.
  • Then, in step S15, the pose of the listener is tracked in the one or more subsequent image frames. To this end, the feature sets from one or more previous image frames can be tracked into the current frame by using an optical flow approach, e.g. the approach presented in J.-Y. Bouguet, "Pyramidal implementation of the Lucas Kanade feature tracker", Intel Corporation, Microprocessor Research Labs, 2000. The hue histograms, initially determined within the detected facial regions, are projected back onto the current frame, each resulting in a probability map. After combining the segmentation mask and the probability map, e.g. by using a logical AND operation, the respective user position is shifted to the most likely image region in the current frame. By using the corresponding depth information, the determined 2D positions are converted into 3D coordinates. Depth information can be acquired separately, e.g. using a 3D detection system that emits and detects structured light, and/or by using time-of-flight detection.
  • The probability map can be used to detect the region into which the face of a listener will most likely move in the following frames.
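  • A sketch of this back-projection step under the same OpenCV assumptions; here the search window is shifted to the most likely region with mean shift, which is one possible realization of the described position update:

```python
import cv2

TERM_CRIT = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

def track_face(frame_bgr, hue_hist, window, seg_mask):
    """Back-project the hue histogram and shift the search window."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    prob = cv2.calcBackProject([hsv], [0], hue_hist, [0, 180], scale=1)
    # Combine probability map and segmentation mask (logical AND step).
    prob = cv2.bitwise_and(prob, prob, mask=seg_mask)
    _, window = cv2.meanShift(prob, window, TERM_CRIT)
    return window  # updated (x, y, w, h)
```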
  • In embodiments of the invention, there is no measure of confidence for the tracking of the pose. The tracking can be stopped and the face detection re-initialized every N frames (wherein N can range from 5 to 50). When the pose is evaluated, RANSAC-based pose tracking can be used as a criterion for identifying when the tracking is lost.
  • In step S16, a subsequent histogram is computed based on the tracked pose of the listener in the one or more subsequent image frames.
  • In step S17, an uncertainty level of the tracked pose is determined. For example, the uncertainty level can be determined based on how far the subsequent histogram differs from the first histogram.
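  • One possible realization of this histogram comparison, sketched below, is the Bhattacharyya distance (0 for identical, 1 for disjoint distributions); the choice of metric is an assumption and is not prescribed by the patent.

```python
import cv2

def histogram_uncertainty(first_hist, subsequent_hist) -> float:
    """Larger distance between the histograms -> higher uncertainty."""
    return cv2.compareHist(first_hist, subsequent_hist,
                           cv2.HISTCMP_BHATTACHARYYA)
```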
  • In step S18, a sound reproduction parameter is adapted based on the determined uncertainty level and/or the tracked pose of the listener.
  • After estimating the location and orientation of at least one listener, the pose information serves as input for an adaptation stage 230, an adaptation unit which controls the sound reproduction unit 220 accordingly, i.e., the individual acoustic scenes are adapted to the poses of the listeners. This adaptation may comprise different steps, depending on the scenario and on the algorithm (binaural rendering vs. sound field synthesis) used for reproduction.
  • In case of sound field synthesis (SFS) approaches and moving listeners, the local sound fields (bright and dark zones) can be shifted according to the listeners' positions such that all listeners are provided with the desired, personalized virtual acoustic scenes at all times. Local SFS techniques aim to reproduce a desired sound field in multiple spatially extended areas (e.g., audio zones). Such audio zones may be referred to as bright zones or dark zones. In bright zones, the sound field can be perceived by a listener, in dark zones (quiet zones), the sound field is attenuated (e.g., corresponds to silence or is otherwise not perceivable).
  • In case of sound field synthesis and a varying number of listeners within a single local listening area, the size of this area can be adapted such that each listener within the area can be provided with the same desired hearing impression.
  • In case of binaural rendering, the sound reproduction unit can be adapted such that the positions at which the sound field can be controlled always coincide with the ear positions such that the desired ILDs and ITDs (provided by the binaural input signals) can be evoked at all times even for moving listeners.
  • Moreover, the binaural input signals provided to the binaural rendering system can be adapted in order to avoid rotations of the virtual acoustic scenes.
  • The adaptation of the sound reproduction parameter can also be based on the determined uncertainty level. For example, if the uncertainty level increases compared to a previously determined uncertainty level, the size of an audio zone can be increased.
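  • As a sketch of such an adaptation rule, the zone radius below grows linearly with the (clipped) uncertainty level; the linear mapping and all constants are assumptions.

```python
BASE_RADIUS_M = 0.30  # nominal audio-zone radius (assumed)
GAIN_M = 0.20         # additional radius per unit of uncertainty (assumed)

def adapted_zone_radius(uncertainty: float) -> float:
    """Higher uncertainty maps to a larger audio zone."""
    return BASE_RADIUS_M + GAIN_M * max(0.0, min(1.0, uncertainty))
```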
  • It should be noted that, for both binaural rendering and sound field synthesis, the loudspeaker prefilters which provide a certain listener i with personalized sound are also adapted if another listener j is moving. This is necessary since, in addition to the generation of the desired sound field in zone i, one or more quiet zones (spatial zeros) need to be generated and adapted to the positions of all other listeners j.
  • For both binaural rendering and sound field synthesis, the sound reproduction unit can be triggered by the listener pose identifying unit, i.e., sound reproduction for a particular zone will only start if at least one listener is present in that zone, and sound reproduction for a particular zone will stop if no listener is present in that zone anymore.
  • In step S19a, the determined uncertainty level is compared with a predetermined threshold. If it is determined that the uncertainty level is higher than the predetermined threshold, the pose of the listener is detected in the one or more subsequent image frames in step S19b. If the uncertainty level does not exceed the threshold, the method can proceed with acquiring further subsequent image frames.
  • While knowing the listeners' positions is already sufficient for sound field synthesis approaches, binaural rendering additionally requires information about the user's head orientation. The present orientation estimation algorithm combines absolute orientation detection and relative orientation tracking. Based on the detection of well-known facial features, such as eyes, nose, or mouth, and their corresponding depth values, the three rotation angles roll, pitch, and yaw can be calculated for each listener. Again, the Viola-Jones approach can be used as detector. For the case that not all required features can be detected successfully, e.g. if nose, mouth and/or eyes cannot be detected properly, the approach switches from absolute orientation estimation to relative orientation tracking, trying to follow the change in orientation over time. The approach of relative orientation tracking comprises three steps. First, features are detected within the facial region in the previous image frame. Second, the features are tracked into the current image frame. The algorithms presented in J. Shi and C. Tomasi, "Good features to track", Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 593-600, 1994, and J.-Y. Bouguet, "Pyramidal implementation of the Lucas Kanade feature tracker", Intel Corporation, Microprocessor Research Labs, 2000, can be used for feature detection and tracking. An iterative RANSAC algorithm can be used to detect the geometric transformation between the feature positions. It can be assumed that, when RANSAC does not converge (because too few features are matched, for example), the tracking is lost.
  • Finally, a projection is estimated for mapping 3D feature points from the previous frame onto 2D feature points within the current frame. The relative change in orientation between the two frames is then given by an estimated rotation matrix.
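  • A minimal sketch of this step, assuming OpenCV: solvePnPRansac estimates, with an iterative RANSAC scheme, the pose that maps the 3D feature points of the previous frame onto their tracked 2D positions, and a failure to converge is treated as lost tracking; camera_matrix is an assumed intrinsic calibration.

```python
import cv2
import numpy as np

def relative_rotation(pts3d_prev, pts2d_curr, camera_matrix):
    """Estimate the rotation between two frames from matched features."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(pts3d_prev, dtype=np.float32),
        np.asarray(pts2d_curr, dtype=np.float32),
        camera_matrix, None)
    if not ok or inliers is None:  # RANSAC did not converge:
        return None                # treat the tracking as lost
    rotation_matrix, _ = cv2.Rodrigues(rvec)
    return rotation_matrix
```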
  • In an advanced tracking mode, the Viola-Jones face detector is used to detect a face in an initial image frame and initializes the listener tracking unit by computing the hue histogram in the detected region. The histogram defines the target color distribution to be tracked. In a second step of the advanced tracking mode, features are searched within the initially detected region. Since the final listener tracking unit should also be able to track objects other than faces, no facial features, such as eyes, nose or mouth, are chosen. Instead, a feature is defined to be a dominant edge within the detected region. At first, the minimal eigenvalue is computed at each pixel position of the input image frame, resulting in a map of eigenvalues $M_{\text{eig}}(m, n, t)$. The minimal eigenvalue is used as "corner quality measurement". After performing a non-maximum suppression in a 3x3 neighborhood, a point at a position $(m, n)$ is rejected if $$M_{\text{eig}}(m, n, t) < q \cdot \max_{x, y} M_{\text{eig}}(x, y, t)$$ holds, where $q$ is a pre-defined quality level. Finally, the remaining corners are thinned out by rejecting all points for which there is a stronger corner within a distance of less than $d_{\text{corner}}$.
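  • OpenCV's Shi-Tomasi detector implements exactly this minimal-eigenvalue criterion, so the selection step can be sketched as follows, with qualityLevel in the role of $q$ and minDistance in the role of $d_{\text{corner}}$ (values assumed):

```python
import cv2

def select_features(frame_gray, q=0.01, d_corner=7, max_corners=200):
    """Pick dominant-edge features via the minimal-eigenvalue measure."""
    return cv2.goodFeaturesToTrack(
        frame_gray, maxCorners=max_corners, qualityLevel=q,
        minDistance=d_corner, useHarrisDetector=False)
```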
  • Afterwards, the image to be tracked at a subsequent image frame is segmented first, using color information. This step is similar to the color segmentation in the simple tracking mode. However, the segmentation is done in three color spaces, namely RGB, HSV and YCbCr. In the YCbCr color space, a color is represented with the luminance component Y, the blue-difference chroma component Cb and the red-difference chroma component Cr. The segmentation in HSV is done, as in the simple mode, using fixed values for the upper and lower bounds of hue, saturation and value, respectively.
  • However, the segmentation in RGB and YCbCr is done adaptively according to the color distribution within the initially detected region of interest. Thus, a histogram is computed for each color channel in RGB and YCbCr within the face rectangle, defining the upper and the lower bound of the corresponding color channel. Since the detection region is usually larger than the target object, the rectangular region can be shrunk by a factor p in height and width. This is done in order to prevent background colors from unduly influencing the segmentation process. A pixel in the segmentation mask M(m, n, t) is then marked only if its color lies within the computed ranges in RGB, HSV and YCbCr.
  • The additional incorporation of RGB and YCbCr can lead to a more accurate segmentation result and thus a better result for the initial mask M(m, n, t).
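  • A simplified sketch of the three-color-space segmentation: fixed HSV bounds, plus adaptive RGB/YCbCr bounds taken here as the per-channel minima and maxima of the shrunk face rectangle (a stand-in for the histogram-derived bounds); all bounds and the shrink factor p are assumptions.

```python
import cv2

def segmentation_mask(frame_bgr, face_rect, p=0.2):
    """Mark a pixel only if it lies inside the RGB, HSV and YCbCr ranges."""
    x, y, w, h = face_rect
    dx, dy = int(p * w / 2), int(p * h / 2)  # shrink rectangle by factor p
    roi = frame_bgr[y + dy:y + h - dy, x + dx:x + w - dx]

    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    ycc = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    roi_ycc = cv2.cvtColor(roi, cv2.COLOR_BGR2YCrCb)

    # Fixed HSV bounds (assumed skin-tone range).
    m_hsv = cv2.inRange(hsv, (0, 40, 40), (25, 255, 255))
    # Adaptive per-channel bounds learned inside the shrunk face region.
    lo_rgb = tuple(int(v) for v in roi.min(axis=(0, 1)))
    hi_rgb = tuple(int(v) for v in roi.max(axis=(0, 1)))
    lo_ycc = tuple(int(v) for v in roi_ycc.min(axis=(0, 1)))
    hi_ycc = tuple(int(v) for v in roi_ycc.max(axis=(0, 1)))
    m_rgb = cv2.inRange(frame_bgr, lo_rgb, hi_rgb)
    m_ycc = cv2.inRange(ycc, lo_ycc, hi_ycc)
    return cv2.bitwise_and(m_hsv, cv2.bitwise_and(m_rgb, m_ycc))
```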
  • In a next step of the advanced tracking mode, the initially detected sparse feature set, which comprises the image coordinates of the calculated feature points, is tracked by means of an optical flow approach, namely a pyramidal implementation of the Lucas-Kanade feature tracker. The basic idea of this approach is to subdivide the images into different resolution levels. Then, the motion of the input feature set is estimated, beginning at the lowest resolution level and proceeding up to the original image resolution. Thereby, the result at a specific resolution level is used as the initial guess for the next resolution level.
  • The motion is estimated by comparing the local neighborhood within a specific window of size $w_m \times w_n$ around the feature point to be tracked. Let $(m_0, n_0)$ be the spatial coordinates of a feature point at time index $t = 0$ for which the corresponding position at $t = 1$ has to be estimated. The unknown position $(m_1, n_1)$ can therefore be described as $$\begin{pmatrix} m_1 \\ n_1 \end{pmatrix} = \begin{pmatrix} m_0 \\ n_0 \end{pmatrix} + \begin{pmatrix} v_m \\ v_n \end{pmatrix},$$ with $\mathbf{v}$ being the unknown image velocity, also known as the optical flow. The image velocity is then defined as the vector that minimizes the error $e(\mathbf{v})$, which is defined as follows: $$e(\mathbf{v}) = e(v_m, v_n) = \sum_{m = m_0 - w_m}^{m_0 + w_m} \; \sum_{n = n_0 - w_n}^{n_0 + w_n} \bigl( I(m, n, 0) - I(m - v_m,\, n - v_n,\, 1) \bigr)^2.$$
    The advanced tracking mode as described above can lead to improved tracking results.
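  • With OpenCV's pyramidal Lucas-Kanade implementation, the tracking step sketched above reads as follows; the window size (playing the role of $w_m \times w_n$) and the number of pyramid levels are assumed parameters.

```python
import cv2

LK_PARAMS = dict(winSize=(21, 21), maxLevel=3,
                 criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT,
                           30, 0.01))

def track_features(prev_gray, curr_gray, prev_pts):
    """Track an (N, 1, 2) float32 feature set from t = 0 into t = 1."""
    next_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, prev_pts, None, **LK_PARAMS)
    good = status.ravel() == 1  # keep only successfully tracked points
    return prev_pts[good], next_pts[good]
```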
  • For both binaural rendering and sound field synthesis, a suitable subset of loudspeakers is chosen such that the amount of cross-talk can be reduced for a particular listener at the potential cost of a slightly lower reproduction performance for the other listeners. This constitutes a tradeoff between the quality of the reproduced scene itself and the amount of cross-talk leaking into the other zones (listening areas). Discarding certain loudspeakers for reproduction is only done if at least one listener is present in a particular dark zone. Otherwise, all loudspeakers are used.
  • FIG. 5 illustrates the loudspeaker selection scheme for an exemplary scenario with a first listener 350 in a first audio zone 360 and a second listener 352 in a second audio zone 362. The first audio zone 360 is a bright zone and the second audio zone 362 is a dark zone, i.e., a desired acoustic scene should be synthesized for the first listener 350, while the acoustic energy leaking to the position of the second listener 352 (cross-talk) should be minimized. The angular direction of a particular loudspeaker l, indicated with reference number 342, is denoted as αl and is defined with respect to the point x_tan,l, which denotes the point at which the connection line 370 from loudspeaker l forms a tangent to the circular contour around the first audio zone 360.
  • Those loudspeakers 344 of the array of loudspeakers 340 for which the angular direction αl is smaller than a minimum angle α_min are deactivated. This minimum angle α_min is chosen such that the connection lines between any point in the bright zone 360 and the loudspeaker l do not intersect with the dark zone 362. Since the connection line 370 does not intersect with the dark zone 362, and since there is also no other connection line between a point in the bright zone and the loudspeaker 342 that does, the loudspeaker 342 is not deactivated. For the further loudspeaker 344, on the other hand, there is a connection line between a point in the bright zone 360 and the further loudspeaker 344 that intersects with the dark zone 362. Therefore, the further loudspeaker 344 is deactivated.
  • FIG. 6 illustrates the definition of the quantities required to determine the minimum angle $\alpha_{\min}$. $\mathbf{x}_{\tan} = [x_{\tan}, y_{\tan}]^T$ denotes the point in space at which the tangent passing through it also forms a tangent with respect to the second zone 362, i.e., it determines that connection line which touches, but does not intersect, the second zone 362. $\mathbf{x}_i = [x_i, y_i]^T$ and $\mathbf{x}_j = [x_j, y_j]^T$ denote the centers of the first zone 360 and the second zone 362, respectively. In order to compute $\alpha_{\min}$, the point $\mathbf{x}_{\tan}$ needs to be determined according to $\mathbf{x}_{\tan} = \mathbf{x}_i + R_i [\cos(\varepsilon), \sin(\varepsilon)]^T$, where $R_i$ is the zone radius of the first zone 360 and $\varepsilon = \beta + \varphi - 90°$, with $$\beta = \arctan\left(\frac{y_j - y_i}{x_j - x_i}\right)$$ and $$\varphi = \arcsin\left(\frac{R_i}{d}\right).$$ The distance $d$ is given by $$d = \frac{R_i}{2 R_j} \left[ (x_j - x_i)^2 + (y_j - y_i)^2 \right]^{1/2}.$$ Finally, $$\alpha_{\min} = \arctan\left(\frac{\Delta y}{\Delta x}\right).$$
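  • The following sketch evaluates these formulas numerically; the expression for $d$ mirrors the reconstruction above, and the interpretation of $\Delta y / \Delta x$ as the direction of the tangent line at $\mathbf{x}_{\tan}$ is an assumption, so the snippet should be read as illustrative only.

```python
import math

def minimum_angle(xi, yi, xj, yj, Ri, Rj):
    """Compute alpha_min for bright zone i (center xi, yi) vs dark zone j."""
    beta = math.atan2(yj - yi, xj - xi)
    d = (Ri / (2.0 * Rj)) * math.hypot(xj - xi, yj - yi)
    phi = math.asin(min(1.0, Ri / d))
    eps = beta + phi - math.pi / 2.0
    # Tangent point on the contour of the bright zone.
    x_tan = (xi + Ri * math.cos(eps), yi + Ri * math.sin(eps))
    # Direction of the tangent line at x_tan (assumed to supply dy/dx).
    alpha_min = math.atan2(math.cos(eps), -math.sin(eps))
    return alpha_min, x_tan
```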
  • In order to allow for a quick adaptation of the system with a low delay, sound reproduction can be done in a frame-wise manner, e.g., using a Short-Time Fourier Transform (STFT), where preferably the frame sizes should not exceed a few milliseconds. If redesigning the loudspeaker prefilters online is too time-consuming, different sets of loudspeaker prefilters can be computed offline in advance for different scenarios. During reproduction, the respective set of loudspeaker prefilters corresponding to the current reproduction scenario can be selected.
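  • A simplified sketch of this frame-wise processing with precomputed prefilter sets (windowing and overlap-add are omitted for brevity; all shapes and names are assumptions):

```python
import numpy as np

def render_frame(audio_frame, prefilter_sets, scenario_key):
    """Apply the precomputed prefilter set of the current scenario."""
    # prefilter_sets maps a scenario to per-loudspeaker frequency
    # responses of shape (num_loudspeakers, num_bins).
    H = prefilter_sets[scenario_key]
    X = np.fft.rfft(audio_frame)            # short frame, e.g. a few ms
    drive = np.fft.irfft(H * X[np.newaxis, :], n=len(audio_frame))
    return drive                            # one drive signal per loudspeaker
```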
  • FIG. 7 shows a schematic illustration of a listener pose identifying unit 410 in accordance with the present invention. The listener pose identifying unit 410 comprises an uncertainty determining unit 412, a listener detection unit 414, a listener histogram determining unit 416 and a listener tracking unit 418. In embodiments of the invention, the listener pose identifying unit 410 also serves as a distance detection unit, i.e., it can be configured to determine a distance of a listener from a reference point.
  • The methods, the wave field synthesis apparatuses and the systems described above can be used for example in the following applications:
    Untethered personalized information systems, e.g., in a museum: The video-based pose detection and tracking system triggers the sound reproduction system, which acoustically provides information about the respective exhibit if a visitor is detected in a predefined area in front of it while keeping the acoustic energy in the other areas low.
  • Personalized TV sound for multiple viewers: Combining state-of-the-art 3D imaging systems and multi-zone sound reproduction allows two (or more) users to watch content with individual 2D or 3D audio. For example, two listeners can watch different movies with a single system. Again, the sound reproduction is adapted to the actual positions of the possibly moving users.
  • Dialogue multiplex in teleconferencing: The system described above allows for providing individual participants, e.g., with speech from a remote site in different languages, or with different speech signals originating from different conversation partners at a remote site.
  • To summarize, an adaptive system for personalized, multi-zone sound reproduction has been presented, where an unknown number of possibly moving users can be provided with individual audio content and cross-talk between the individual users can be reduced by choosing a suitable subset of loudspeakers for reproduction. The poses (positions and orientations) of the users' heads can be tracked, e.g., with the help of a video-based system, and the obtained information is exploited in order to adapt the sound reproduction algorithm such that the desired hearing impression is maintained even if listeners move or rotate their heads. Furthermore, if a user is located in a certain local reproduction zone j, the number of loudspeakers utilized for reproducing sound in another zone i is adapted in order to reduce the cross-talk leaking into zone j. Finally, specific applications are proposed for which the proposed kind of adaptive sound reproduction system can be utilized.
  • The foregoing descriptions are only implementation manners of the present invention; the protection scope of the present invention is not limited to this. Any variations or replacements can be easily made by a person skilled in the art. Therefore, the protection scope of the present invention should be subject to the protection scope of the attached claims.
  • The invention has been described in conjunction with various embodiments herein. However, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single processor or other unit may fulfil the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
  • Embodiments of the invention may be implemented in a computer program for running on a computer system, at least including code portions for performing steps of a method according to the invention when run on a programmable apparatus, such as a computer system, or enabling a programmable apparatus to perform functions of a device or system according to the invention.
  • A computer program is a list of instructions such as a particular application program and/or an operating system. The computer program may for instance include one or more of: a subroutine, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, a source code, an object code, a shared library/dynamic load library and/or other sequence of instructions designed for execution on a computer system.
  • The computer program may be stored internally on a computer readable storage medium or transmitted to the computer system via a computer readable transmission medium. All or some of the computer program may be provided on transitory or non-transitory computer readable media permanently, removably or remotely coupled to an information processing system. The computer readable media may include, for example and without limitation, any number of the following: magnetic storage media including disk and tape storage media; optical storage media such as compact disk media (e.g., CD-ROM, CD-R, etc.) and digital video disk storage media; non-volatile memory storage media including semiconductor-based memory units such as FLASH memory, EEPROM, EPROM, ROM; ferromagnetic digital memories; MRAM; volatile storage media including registers, buffers or caches, main memory, RAM, etc.; and data transmission media including computer networks, point-to-point telecommunication equipment, and carrier wave transmission media, just to name a few.
  • A computer process typically includes an executing (running) program or portion of a program, current program values and state information, and the resources used by the operating system to manage the execution of the process. An operating system (OS) is the software that manages the sharing of the resources of a computer and provides programmers with an interface used to access those resources. An operating system processes system data and user input, and responds by allocating and managing tasks and internal system resources as a service to users and programs of the system.
  • The computer system may for instance include at least one processing unit, associated memory and a number of input/output (I/O) devices. When executing the computer program, the computer system processes information according to the computer program and produces resultant output information via I/O devices.
  • The connections as discussed herein may be any type of connection suitable to transfer signals from or to the respective nodes, units or devices, for example via intermediate devices. Accordingly, unless implied or stated otherwise, the connections may for example be direct connections or indirect connections. The connections may be illustrated or described in reference to being a single connection, a plurality of connections, unidirectional connections, or bidirectional connections. However, different embodiments may vary the implementation of the connections. For example, separate unidirectional connections may be used rather than bidirectional connections and vice versa. Also, plurality of connections may be replaced with a single connection that transfers multiple signals serially or in a time multiplexed manner. Likewise, single connections carrying multiple signals may be separated out into various different connections carrying subsets of these signals. Therefore, many options exist for transferring signals.
  • Those skilled in the art will recognize that the boundaries between logic blocks are merely illustrative and that alternative embodiments may merge logic blocks or circuit elements or impose an alternate decomposition of functionality upon various logic blocks or circuit elements. Thus, it is to be understood that the architectures depicted herein are merely exemplary, and that in fact many other architectures can be implemented which achieve the same functionality.
  • Furthermore, those skilled in the art will recognize that the boundaries between the above described operations are merely illustrative. Multiple operations may be combined into a single operation, a single operation may be distributed over additional operations, and operations may be executed at least partially overlapping in time. Moreover, alternative embodiments may include multiple instances of a particular operation, and the order of operations may be altered in various other embodiments.
  • Also, for example, the examples, or portions thereof, may be implemented as software or code representations of physical circuitry, or of logical representations convertible into physical circuitry, such as in a hardware description language of any appropriate type.
  • Also, the invention is not limited to physical devices or units implemented in nonprogrammable hardware but can also be applied in programmable devices or units able to perform the desired device functions by operating in accordance with suitable program code, such as mainframes, minicomputers, servers, workstations, personal computers, notepads, personal digital assistants, electronic games, automotive and other embedded systems, cell phones and various other wireless devices, commonly denoted in this application as "computer systems".

Claims (12)

  1. A wave field synthesis apparatus (100, 200) for driving an array of loudspeakers (240) with drive signals, the apparatus comprising:
    - a listener pose identifying unit (110, 210a, 210b, 410) for identifying a pose of a listener (250, 252, 350, 352), wherein the pose comprises a location and an orientation of a listener,
    - a sound reproduction unit (120, 220) for generating the drive signals, the sound reproduction unit comprising a sound field synthesizer (122) for generating sound field drive signals for causing the array of loudspeakers to generate a sound field at at least one audio zone (260, 262, 360, 362) and a binaural renderer (124) for generating binaural drive signals for causing the array of loudspeakers to generate specified sound pressures at at least two locations, and
    - an adaptation unit (130, 230) for adapting one or more parameters of the sound reproduction unit based on the identified pose of the listener;
    wherein the listener pose identifying unit (110, 210a, 210b, 410) comprises an uncertainty determining unit (412) for determining an uncertainty level, wherein the uncertainty level comprises a location uncertainty level, which reflects an estimated uncertainty in an identified location, and/or an orientation uncertainty level, which reflects an estimated uncertainty in an identified orientation, and wherein the adaptation unit (130, 230) is configured to adapt a parameter of the sound reproduction unit (120, 220) based on the determined uncertainty level;
    wherein the adaptation unit (130, 230) is configured to adapt a parameter indicating a size of the audio zone based on the determined uncertainty level, wherein a higher uncertainty corresponds to a larger size; and
    wherein the adaptation unit (130, 230) is configured to adjust a weighting parameter, which indicates a weighting between the sound field drive signals and the binaural drive signals, based on the determined uncertainty level.
  2. The apparatus (100, 200) of one of the previous claims, wherein the listener pose identifying unit (110, 210a, 210b, 410) is configured to identify a number of listeners (250, 252, 350, 352) in the audio zone (260, 262, 360, 362), and wherein the adaptation unit is configured to adapt a size parameter, which indicates a size of the audio zone, based on the identified number of listeners in the audio zone.
  3. The apparatus (100, 200) of one of the previous claims, wherein the at least one audio zone (260, 262, 360, 362) comprises a dark audio zone (362) and a bright audio zone (360) and wherein the adaptation unit (130, 230) is configured to set an output parameter, which corresponds to a strength of a specific drive signal for a specific loudspeaker (344), to zero if the adaptation unit determines that there is at least one connection line (370) between a location of the specific loudspeaker and a point in the bright audio zone that intersects with the dark audio zone and/or with a surrounding of the dark audio zone, wherein in particular the surrounding is defined by a fixed perimeter around the dark audio zone.
  4. The apparatus (100, 200) of one of the previous claims, wherein the adaptation unit (130, 230) is configured to control the sound reproduction unit (120, 220) to start and/or resume generating the drive signals if at least one listener (250, 252, 350, 352) is identified in the audio zone and/or wherein the adaptation unit is configured to control the sound reproduction unit to stop and/or pause generating the drive signals for the audio zone if the listener pose identifying unit determines that there are no listeners in the audio zone.
  5. The apparatus (100, 200) of one of the previous claims, wherein the apparatus comprises a camera input unit (212a, 212b) for obtaining image frames from one or more cameras (214a, 214b) and wherein the listener pose identifying unit (110, 210a, 210b, 410) comprises
    - a listener detection unit (414) for detecting a pose of a listener in one or more first image frames acquired by the one or more cameras,
    - a listener histogram determining unit (416) for determining a first histogram of the listener in the one or more first image frames based on the detected pose, and
    - a listener tracking unit (418) for tracking the listener in one or more subsequent image frames that are acquired by the one or more cameras after the one or more first image frames, wherein the listener tracking unit is configured to track the listener based on the first histogram of the listener.
  6. The apparatus (100, 200) of claim 5, wherein the uncertainty determining unit (412) is configured to determine the uncertainty level based on a difference between a first histogram that is determined based on a detected pose of the listener and a subsequent histogram that is determined based on a tracked pose of the listener.
  7. The apparatus (100, 200) of one of claims 5 or 6, wherein the apparatus further comprises a distance detection unit which is configured to determine a distance of the listener from a reference point based on a size of a face region of the listener in the one or more image frames.
  8. A method for driving an array of loudspeakers (240) with drive signals, comprising the steps:
    - identifying (S10) a pose of a listener (250, 252, 350, 352), wherein the pose comprises a location and an orientation of the listener, and
    - generating (S20) sound field drive signals for causing the array of loudspeakers to generate at least one sound field at at least one audio zone (260, 262, 360, 362), and
    - generating (S30) binaural drive signals for causing the array of loudspeakers to generate specified sound pressures at at least two locations, and
    - adapting (S40) one or more sound reproduction parameters based on the identified pose;
    wherein the method further comprises:
    determining an uncertainty level, wherein the uncertainty level comprises a location uncertainty level, which reflects an estimated uncertainty in an identified location, and/or an orientation uncertainty level, which reflects an estimated uncertainty in an identified orientation, and adapting a parameter based on the determined uncertainty level;
    wherein the adapting a parameter based on the determined uncertainty level comprises: adapting a parameter indicating a size of the audio zone based on the determined uncertainty level, wherein a higher uncertainty corresponds to a larger size; and
    adjusting a weighting parameter, which indicates a weighting between the sound field drive signals and the binaural drive signals, based on the determined uncertainty level.
  9. The method of claim 8, wherein identifying the pose of the listener comprises the steps:
    - acquiring (S11) one or more first image frames,
    - detecting (S12) a pose of the listener in the one or more first image frames,
    - computing (S13) a first histogram of the listener in the one or more first image frames based on the detected pose,
    - acquiring (S14) one or more subsequent image frames, and
    - tracking (S15) a pose of the listener in the one or more subsequent image frames based on the first histogram.
  10. The method of claim 9, further comprising the steps:
    - computing (S16) a subsequent histogram of the listener in the one or more subsequent image frames based on the tracked pose,
    - determining (S17) an uncertainty level based on a difference between the first histogram and the subsequent histogram, and
    - adapting (S18) a sound reproduction parameter based on the determined uncertainty level.
  11. The method of claim 10, further comprising a step of detecting (S19b) the pose of the listener in one or more subsequent image frames if the determined uncertainty level is higher than a predetermined threshold (S19a).
  12. A computer-readable storage medium storing program code, the program code comprising instructions for carrying out the method of one of claims 8 to 11.
EP15725269.3A 2015-05-13 2015-05-13 Method and apparatus for driving an array of loudspeakers with drive signals Active EP3275213B1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2015/060628 WO2016180493A1 (en) 2015-05-13 2015-05-13 Method and apparatus for driving an array of loudspeakers with drive signals

Publications (2)

Publication Number Publication Date
EP3275213A1 EP3275213A1 (en) 2018-01-31
EP3275213B1 true EP3275213B1 (en) 2019-12-04

Family

ID=53269455

Family Applications (1)

Application Number Title Priority Date Filing Date
EP15725269.3A Active EP3275213B1 (en) 2015-05-13 2015-05-13 Method and apparatus for driving an array of loudspeakers with drive signals

Country Status (2)

Country Link
EP (1) EP3275213B1 (en)
WO (1) WO2016180493A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109417678A (en) 2016-07-05 2019-03-01 索尼公司 Sound field forms device and method and program
US9980076B1 (en) 2017-02-21 2018-05-22 At&T Intellectual Property I, L.P. Audio adjustment and profile system
WO2019046706A1 (en) * 2017-09-01 2019-03-07 Dts, Inc. Sweet spot adaptation for virtualized audio
FR3081662A1 (en) * 2018-06-28 2019-11-29 Orange METHOD FOR SPATIALIZED SOUND RESTITUTION OF A SELECTIVELY AUDIBLE AUDIBLE FIELD IN A SUBZONE OF A ZONE
JP7234555B2 (en) * 2018-09-26 2023-03-08 ソニーグループ株式会社 Information processing device, information processing method, program, information processing system
KR20240005112A (en) * 2018-12-19 2024-01-11 프라운호퍼 게젤샤프트 쭈르 푀르데룽 데어 안겐반텐 포르슝 에. 베. Apparatus and method for reproducing a spatially extended sound source or apparatus and method for generating a bitstream from a spatially extended sound source
WO2022119988A1 (en) * 2020-12-03 2022-06-09 Dolby Laboratories Licensing Corporation Frequency domain multiplexing of spatial audio for multiple listener sweet spots
EP4256810A1 (en) * 2020-12-03 2023-10-11 Dolby Laboratories Licensing Corporation Frequency domain multiplexing of spatial audio for multiple listener sweet spots

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8483413B2 (en) * 2007-05-04 2013-07-09 Bose Corporation System and method for directionally radiating sound
EP2315458A3 (en) * 2008-04-09 2012-09-12 Fraunhofer-Gesellschaft zur Förderung der Angewandten Forschung e.V. Apparatus and method for generating filter characteristics
KR101702330B1 (en) * 2010-07-13 2017-02-03 삼성전자주식회사 Method and apparatus for simultaneous controlling near and far sound field


Also Published As

Publication number Publication date
EP3275213A1 (en) 2018-01-31
WO2016180493A1 (en) 2016-11-17


Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20171023

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: FRIEDRICH-ALEXANDER-UNIVERSITAET ERLANGEN-NUERNBER

Owner name: HUAWEI TECHNOLOGIES CO., LTD.

RIN1 Information on inventor provided before grant (corrected)

Inventor name: KAUP, ANDRE

Inventor name: KELLERMANN, WALTER

Inventor name: LANG, YUE

Inventor name: BUERGER, MICHAEL

Inventor name: GROSCHE, PETER

Inventor name: CORDARA, GIOVANNI

Inventor name: HELWANI, KARIM

Inventor name: LOELLMANN, HEINRICH

Inventor name: RICHTER, THOMAS

Inventor name: ZHANG, MENGQIU

RIN1 Information on inventor provided before grant (corrected)

Inventor name: KAUP, ANDRE

Inventor name: BUERGER, MICHAEL

Inventor name: KELLERMANN, WALTER

Inventor name: HELWANI, KARIM

Inventor name: ZHANG, MENGQIU

Inventor name: LANG, YUE

Inventor name: CORDARA, GIOVANNI

Inventor name: RICHTER, THOMAS

Inventor name: GROSCHE, PETER

Inventor name: LOELLMANN, HEINRICH

RIN1 Information on inventor provided before grant (corrected)

Inventor name: GROSCHE, PETER

Inventor name: RICHTER, THOMAS

Inventor name: CORDARA, GIOVANNI

Inventor name: KELLERMANN, WALTER

Inventor name: LANG, YUE

Inventor name: HELWANI, KARIM

Inventor name: BUERGER, MICHAEL

Inventor name: LOELLMANN, HEINRICH

Inventor name: ZHANG, MENGQIU

Inventor name: KAUP, ANDRE

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: HUAWEI TECHNOLOGIES CO., LTD.

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20181001

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: GRANT OF PATENT IS INTENDED

INTG Intention to grant announced

Effective date: 20190628

GRAS Grant fee paid

Free format text: ORIGINAL CODE: EPIDOSNIGR3

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE PATENT HAS BEEN GRANTED

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

REG Reference to a national code

Ref country code: GB

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: CH

Ref legal event code: EP

REG Reference to a national code

Ref country code: AT

Ref legal event code: REF

Ref document number: 1211031

Country of ref document: AT

Kind code of ref document: T

Effective date: 20191215

REG Reference to a national code

Ref country code: DE

Ref legal event code: R096

Ref document number: 602015042954

Country of ref document: DE

REG Reference to a national code

Ref country code: IE

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: NL

Ref legal event code: MP

Effective date: 20191204

REG Reference to a national code

Ref country code: LT

Ref legal event code: MG4D

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: BG

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200304

Ref country code: NO

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200304

Ref country code: GR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200305

Ref country code: LT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

Ref country code: SE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

Ref country code: LV

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

Ref country code: FI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: RS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

Ref country code: HR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: AL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: ES

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

Ref country code: PT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200429

Ref country code: NL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

Ref country code: CZ

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

Ref country code: RO

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

Ref country code: EE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: IS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200404

Ref country code: SK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

Ref country code: SM

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

REG Reference to a national code

Ref country code: DE

Ref legal event code: R097

Ref document number: 602015042954

Country of ref document: DE

REG Reference to a national code

Ref country code: AT

Ref legal event code: MK05

Ref document number: 1211031

Country of ref document: AT

Kind code of ref document: T

Effective date: 20191204

PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: DK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

26N No opposition filed

Effective date: 20200907

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: AT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

Ref country code: PL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

Ref country code: SI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: MC

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

Ref country code: LI

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200531

Ref country code: CH

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200531

Ref country code: IT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

REG Reference to a national code

Ref country code: BE

Ref legal event code: MM

Effective date: 20200531

GBPC Gb: european patent ceased through non-payment of renewal fee

Effective date: 20200513

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: LU

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200513

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: GB

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200513

Ref country code: IE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200513

Ref country code: FR

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200531

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: BE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200531

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: TR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

Ref country code: MT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

Ref country code: CY

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: MK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191204

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: DE

Payment date: 20230331

Year of fee payment: 9