EP3945729A1 - System and method for headphone equalization and spatial adaptation for binaural playback in augmented reality


Info

Publication number
EP3945729A1
Authority
EP
European Patent Office
Prior art keywords
audio
room
binaural
sound
impulse responses
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP20188945.8A
Other languages
German (de)
English (en)
Inventor
Thomas Sporer
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fraunhofer Gesellschaft zur Forderung der Angewandten Forschung eV
Original Assignee
Fraunhofer Gesellschaft zur Forderung der Angewandten Forschung eV
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fraunhofer Gesellschaft zur Forderung der Angewandten Forschung eV filed Critical Fraunhofer Gesellschaft zur Forderung der Angewandten Forschung eV
Priority to EP20188945.8A priority Critical patent/EP3945729A1/fr
Priority to PCT/EP2021/071151 priority patent/WO2022023417A2/fr
Priority to JP2023506248A priority patent/JP2023536270A/ja
Priority to EP21751796.0A priority patent/EP4189974A2/fr
Publication of EP3945729A1 publication Critical patent/EP3945729A1/fr
Priority to US18/158,724 priority patent/US20230164509A1/en
Legal status: Withdrawn

Classifications

    • H04S 7/306: Control circuits for electronic adaptation of the sound field; electronic adaptation of stereophonic audio signals to reverberation of the listening space; for headphones
    • H04S 3/008: Systems employing more than two channels, e.g. quadraphonic, in which the audio signals are in digital form, i.e. employing more than two discrete digital channels
    • H04R 1/1083: Earpieces; earphones; reduction of ambient noise
    • H04R 2225/41: Detection or adaptation of hearing aid parameters or programs to the listening situation, e.g. pub, forest
    • H04R 2225/43: Signal processing in hearing aids to enhance speech intelligibility
    • H04R 2460/01: Hearing devices using active noise cancellation
    • H04R 25/507: Customised settings for obtaining desired overall acoustical characteristics using digital signal processing implemented by neural network or fuzzy logic
    • H04R 5/027: Spatial or constructional arrangements of microphones, e.g. in dummy heads
    • H04R 5/033: Headphones for stereophonic communication
    • H04R 5/04: Circuit arrangements, e.g. for selective connection of amplifier inputs/outputs to loudspeakers, for loudspeaker detection, or for adaptation of settings to personal preferences or hearing impairments
    • H04S 2400/01: Multi-channel sound reproduction with two speakers wherein the multi-channel information is substantially preserved
    • H04S 2400/11: Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H04S 2400/15: Aspects of sound capture and related signal processing for recording or reproduction
    • H04S 2420/01: Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]
    • H04S 7/301: Automatic calibration of stereophonic sound system, e.g. with test microphone
    • H04S 7/303: Tracking of listener position or orientation
    • H04S 7/304: Electronic adaptation of stereophonic sound system to listener position or orientation; for headphones

Definitions

  • the present invention relates to headphone equalization and spatial adaptation of binaural playback in augmented reality (AR).
  • SH Selective Hearing
  • Time and level differences alone are not sufficient to determine the exact position of a sound source:
  • the locations with the same time and level difference are on a hyperboloid.
  • the resulting ambiguity in localization is called cone-of-confusion.
  • every sound source is reflected by the boundary surfaces.
  • Each of these so-called mirror sources lies on another hyperboloid.
  • the human sense of hearing combines the information about the direct sound and the associated reflections into one auditory event and thus resolves the ambiguity of the cone-of-confusion.
  • the reflections associated with a sound source increase the perceived loudness of the sound source.
  • Assisted hearing is an umbrella term that includes virtual, augmented, and SH applications.
  • Modern, so-called binaural hearing aids couple the correction factors of the two hearing aids. They often have several microphones, but usually only the microphone with the "most speech-like" signal is selected; no explicit beamforming is calculated. In complex listening situations, desired and undesired sound signals are therefore amplified in the same way, and concentration on the desired sound components is not supported.
  • Audio analysis has a number of specific challenges that need to be addressed. Due to their complexity, deep learning models are very data-hungry. Compared to the research areas of image processing and speech processing, only relatively small data sets are currently available for audio processing. The largest is the AudioSet data set from Google [83] with around 2 million sound samples and 632 different sound event classes, although most of the data sets used in research are much smaller. This small amount of training data can be addressed, for example, with transfer learning, in which a model pre-trained on a large data set is then fine-tuned to a smaller data set with new classes intended for the application [77]. Furthermore, methods from semi-supervised learning are used in order to include the generally available large amount of non-annotated audio data in the training.
  • Real-time capability of the sound source detection algorithms is of fundamental importance for the planned use within a headphone.
  • A trade-off must therefore be made between the complexity of the neural network and the maximum number of arithmetic operations possible on the underlying computing platform. Even if a sound event has a longer duration, it must still be recognized as quickly as possible in order to start an appropriate source separation.
  • Source separation algorithms usually leave behind artifacts such as distortion and crosstalk between the sources [5], which are generally perceived as annoying by the listener. However, such artefacts can be partially masked and thus reduced by mixing the tracks again (re-mixing) [10].
  • Headphones have a significant influence on the acoustic perception of the environment. Depending on the design of the headphones, the sound incidence on the way to the ears is attenuated to different degrees. In-ear headphones completely block the ear canals [85]. The closed headphones enclosing the auricle also acoustically cut the listener off from the outside environment. Open and semi-open headphones, on the other hand, still let sound through completely or partially [84]. In many everyday applications, it is desirable for headphones to seal off unwanted ambient noise more than their design allows.
  • ANC Active Noise Control
  • the first products allow the microphone signals to also be passed through to the headphones in order to reduce passive isolation.
  • Sennheiser offers this function with the AMBEO headset [88], and Bragi offers it in the product "The Dash Pro".
  • this option is just the beginning.
  • this function is to be greatly expanded so that not only can the full ambient noise be switched on or off, but individual signal components (such as only speech or alarm signals) can be made exclusively audible if required.
  • the French company Orosound allows the wearer of the "Tilde Earphones" [89] headset to adjust the strength of the ANC with a slider.
  • the voice of a conversation partner can also be passed through while ANC is activated. However, this only works if the interlocutor is within a 60° cone in front of the wearer. A direction-independent adjustment is not possible.
  • a method which is designed to generate a listening environment for a user.
  • the method includes receiving a signal representing an ambient listening environment of the user, further processing the signal using a microprocessor to identify at least one of a plurality of sound types in the ambient listening environment.
  • the method further includes receiving user preferences for each of the plurality of sound types, modifying the signal for each sound type in the ambient listening environment, and outputting the modified signal to at least one speaker to create a listening environment for the user.
  • a major problem is headphone equalization and room adaptation of binaural playback in augmented reality (AR):
  • a human listener wears acoustically (partially) transparent headphones and hears his surroundings through them.
  • additional sound sources are played back via the headphones, which are embedded in the real environment in such a way that it is not possible for the listener to distinguish between the real sound scene and the additional sound.
  • tracking is used to determine in which direction the head is turned and where the listener is in the room (six degrees of freedom (6DoF)). It is known from research that good results (i.e. externalization and correct localization) are achieved when the room acoustics of the recording and playback rooms match or when the recording is adapted to the playback room.
  • An exemplary solution can be implemented as follows: In a first step, the BRIR is measured without headphones, either individually or with an artificial head using a probe microphone.
  • the spatial properties of the recording room are analyzed based on the measured BRIR.
  • the headphone transfer function is then measured individually or with an artificial head using a probe microphone at the same location. This determines an equalization function.
  • the room properties of the playback room can be measured, the acoustic properties of the playback room can be analyzed and the BRIR can be adapted to the playback room.
  • a source to be augmented is convolved with the position-correct, optionally adjusted, BRIR in order to obtain two raw channels. Convolve the raw channels with the equalization function to get the headphone signals.
  • the headphone signals are reproduced via headphones.
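  • A minimal Python sketch of these two convolution steps (not part of the patent text; all array names are hypothetical and a mono source signal is assumed):

```python
import numpy as np
from scipy.signal import fftconvolve

def render_headphone_signals(source, brir_left, brir_right, eq_left, eq_right):
    """Convolve a mono source with the position-correct BRIR to obtain two raw
    channels, then convolve the raw channels with the headphone equalization
    function to obtain the headphone signals."""
    raw_left = fftconvolve(source, brir_left)
    raw_right = fftconvolve(source, brir_right)
    out_left = fftconvolve(raw_left, eq_left)
    out_right = fftconvolve(raw_right, eq_right)
    return np.stack([out_left, out_right])  # 2 x N array: left/right headphone signal
```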
  • a system comprises an analyzer for determining a plurality of binaural room impulse responses and a loudspeaker signal generator for generating at least two loudspeaker signals dependent on the plurality of binaural room impulse responses and dependent on the audio source signal from at least one audio source.
  • the analyzer is designed to determine the plurality of binaural room impulse responses in such a way that each of the plurality of binaural room impulse responses takes into account an effect resulting from the wearing of headphones by a user.
  • in a method according to an embodiment, the plurality of binaural room impulse responses are determined such that each of them takes into account an effect resulting from a user wearing headphones.
  • a computer program according to an embodiment of the invention is provided with a program code for carrying out the method described above.
  • Figure 1 shows a system according to one embodiment.
  • the system includes an analyzer 152 for determining a plurality of binaural spatial impulse responses.
  • the system comprises a loudspeaker signal generator 154 for generating at least two loudspeaker signals depending on the plurality of binaural room impulse responses and depending on the audio source signal from at least one audio source.
  • the analyzer 152 is configured to determine the plurality of binaural spatial impulse responses such that each of the plurality of binaural spatial impulse responses takes into account an effect resulting from a user wearing headphones.
  • the system can include the headphones, for example, it being possible for the headphones to be designed, for example, to output the at least two loudspeaker signals.
  • the headphones can comprise, e.g., two headphone capsules and, e.g., at least one microphone for measuring sound in each of the two headphone capsules.
  • the analyzer 152 can be designed, for example, to carry out the determination of the plurality of binaural room impulse responses using the measurement of the at least one microphone in each of the two headphone capsules.
  • Headphones intended for binaural playback always have at least two headphone capsules (one each for the left and right ear), although more than two capsules (e.g. for different frequency ranges) can also be provided.
  • the at least one microphone in each of the two headphone capsules can be configured, for example, to generate one or more recordings of the sound situation in a playback room before playback of the at least two loudspeaker signals through the headphones begins, in order to determine from the one or more recordings an estimate of the raw audio signal of at least one audio source and to determine a binaural room impulse response of the plurality of binaural room impulse responses for that audio source in the playback room.
  • the at least one microphone in each of the two headphone capsules can be designed, for example, to generate one or more additional recordings of the sound situation in the playback room during playback of the at least two loudspeaker signals through the headphones, to subtract the augmented signal from the one or more additional recordings, to determine the estimate of the raw audio signal of one or more audio sources, and to determine the binaural room impulse response of the plurality of binaural room impulse responses for the audio source in the playback room.
  • the analyzer 152 may be configured to determine acoustic space properties of the playback room and to adjust the plurality of binaural room impulse responses depending on the acoustic space properties.
  • the at least one microphone can be arranged, for example, in each of the two headphone capsules for measuring the sound close to the entrance of the ear canal.
  • the system can include, for example, one or more further microphones outside the two headphone capsules for measuring the sound situation in the reproduction room.
  • the headphones can, for example, comprise a headband, with at least one of the one or more further microphones being arranged, for example, on the headband.
  • the speaker signal generator 154 may be configured to generate the at least two speaker signals by convolving each of the plurality of binaural room impulse responses with an audio source signal of one or more audio source signals.
  • the analyzer 152 can be configured, for example, to determine at least one of the plurality of binaural spatial impulse responses (or several or all binaural spatial impulse responses) as a function of a movement of the headphones.
  • the system can include a sensor in order to determine a movement of the headphones.
  • the sensor may be a sensor, such as an accelerometer, having at least 3 DoF (three degrees of freedom) to detect head rotations.
  • a 6DoF sensor (six degrees of freedom) can also be used.
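  • As an illustration of how head tracking can enter the BRIR selection, the following sketch picks the stored BRIR whose measured azimuth is closest to the source direction relative to the tracked head yaw (a simplification of the 3DoF/6DoF tracking described above; all names are hypothetical):

```python
def select_brir(brir_by_azimuth, source_azimuth_deg, head_yaw_deg):
    """brir_by_azimuth: dict mapping measured azimuth in degrees -> (brir_left, brir_right).
    Returns the pair measured closest to the source direction relative to the head."""
    relative = (source_azimuth_deg - head_yaw_deg) % 360.0
    def angular_distance(az):
        d = abs(az - relative) % 360.0
        return min(d, 360.0 - d)
    best_azimuth = min(brir_by_azimuth, key=angular_distance)
    return brir_by_azimuth[best_azimuth]
```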
  • Certain embodiments of the invention address the technical challenge that it is often too loud in a listening environment, certain noises are annoying in the listening environment, and selective listening is desired.
  • the human brain itself is capable of selective hearing, but intelligent technical aids can significantly improve selective hearing. Just as glasses help many people in today's life to perceive their surroundings better, there are hearing aids for hearing, but in many situations people with normal hearing can also benefit from the support of intelligent systems.
  • For such hearing assistance, the (acoustic) environment must be analyzed by the technical system, and individual sound sources must be identified in order to be able to treat them separately.
  • the BRIR is measured with headphones worn, either individually or with an artificial head, using a probe microphone.
  • the spatial properties of the recording room are analyzed based on the measured BRIR.
  • At least one built-in microphone in each shell records the real sound situation in the playback room before playback begins. From these recordings, an estimate of the raw audio signal from one or more sources is determined and the respective BRIR of the sound source/audio source in the playback room is determined. The acoustic room properties of the playback room are determined from this estimate and the BRIR of the recording room is thus adjusted.
  • At least one built-in microphone in each shell records the real sound situation in the playback room during playback.
  • the augmented signal is first subtracted from these recordings, then an estimate of the raw audio signal from one or more sources is determined and the respective BRIR of the sound source/audio source in the playback room is determined.
  • the acoustic room properties of the playback room are determined from this estimate and the BRIR of the recording room is thus adjusted.
  • a source to be augmented is convolved with the position-correct, optionally adjusted, BRIR in order to obtain the headphone signals.
  • the headphone signals are reproduced via headphones.
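  • One possible way to subtract the augmented (played-back) signal from the in-ear microphone recordings, as described above, is an adaptive filter; the NLMS sketch below is only one conventional option, not necessarily the procedure intended by the patent:

```python
import numpy as np

def cancel_playback(mic, playback, filter_len=256, mu=0.5, eps=1e-6):
    """Estimate the leakage path of the known playback signal into the in-ear
    microphone with an NLMS adaptive filter and subtract it; the residual
    approximates the real sound situation in the playback room."""
    w = np.zeros(filter_len)
    residual = np.zeros(len(mic))
    for n in range(filter_len, len(mic)):
        x = playback[n - filter_len:n][::-1]        # most recent playback samples
        e = mic[n] - np.dot(w, x)                   # microphone minus estimated leakage
        w += mu * e * x / (np.dot(x, x) + eps)      # NLMS coefficient update
        residual[n] = e
    return residual
```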
  • At least one microphone is placed in each headphone capsule to measure sound near the entrance of the ear canal.
  • additional microphones are optionally arranged on the outside of the headphones, possibly also on top of the headband, for measuring and analyzing the sound situation in the playback room.
  • Embodiments achieve that sound from natural and augmented sources is perceived in the same way.
  • In embodiments, no separate measurement of headphone characteristics is required.
  • Embodiments thus provide concepts for measuring the spatial properties of the rendering space.
  • Some embodiments provide an initialization and a subsequent (post-)optimization of the spatial adaptation.
  • the concepts provided also work if the room acoustics of the playback room change, e.g. if the listener changes to another room.
  • embodiments are based on installing different techniques for hearing assistance in technical systems and combining them in such a way that an improvement in sound quality and quality of life (e.g. desired sound is louder, undesired sound is quieter, better speech intelligibility) is achieved both for people with normal hearing and for people with hearing impairments.
  • Figure 2 shows a system for supporting selective hearing according to an embodiment.
  • the system includes a detector 110 for detecting an audio source signal portion of one or more audio sources using at least two received microphone signals of a listening environment.
  • the system also includes a position determiner 120 for assigning position information to each of the one or more audio sources.
  • the system also includes an audio type classifier 130 for assigning an audio signal type to the audio source signal portion of each of the one or more audio sources.
  • the system also includes a signal component modifier 140 for changing the audio source signal component of at least one audio source of the one or more audio sources depending on the audio signal type of the audio source signal component of the at least one audio source in order to obtain a modified audio signal component of the at least one audio source.
  • a signal component modifier 140 for changing the audio source signal component of at least one audio source of the one or more audio sources depending on the audio signal type of the audio source signal component of the at least one audio source in order to obtain a modified audio signal component of the at least one audio source.
  • the analyzer 152 and the loudspeaker signal generator 154 of Figure 1 together form a signal generator 150.
  • the analyzer 152 of the signal generator 150 is designed to determine the plurality of binaural room impulse responses, wherein the plurality of binaural room impulse responses comprises, for each audio source of the one or more audio sources, binaural room impulse responses that depend on the position information of this audio source and on an orientation of the user's head.
  • the loudspeaker signal generator 154 of the signal generator 150 is designed to generate the at least two loudspeaker signals as a function of the plurality of binaural room impulse responses and as a function of the modified audio signal component of the at least one audio source.
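  • The following structural sketch mirrors the signal flow through the blocks described above (detector 110, position determiner 120, audio type classifier 130, signal component modifier 140, signal generator 150); the stand-in implementations are placeholders for illustration only, not the patented algorithms:

```python
import numpy as np

def detect_sources(mic_signals):
    """Detector 110 (placeholder): return one separated signal per detected source."""
    return [np.mean(mic_signals, axis=0)]

def determine_positions(sources):
    """Position determiner 120 (placeholder): assign an azimuth in degrees to each source."""
    return [0.0 for _ in sources]

def classify_sources(sources):
    """Audio type classifier 130 (placeholder): assign an audio signal type to each source."""
    return ["speech" for _ in sources]

def modify_sources(sources, types, gain_by_type):
    """Signal component modifier 140: change each source depending on its audio signal type."""
    return [gain_by_type.get(t, 1.0) * s for s, t in zip(sources, types)]

def generate_loudspeaker_signals(sources, positions, brir_lookup):
    """Signal generator 150: convolve each modified source with the BRIR for its position."""
    left = sum(np.convolve(s, brir_lookup(p)[0]) for s, p in zip(sources, positions))
    right = sum(np.convolve(s, brir_lookup(p)[1]) for s, p in zip(sources, positions))
    return left, right
```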
  • the detector 110 may be configured to detect the audio source signal portion of the one or more audio sources using deep learning models.
  • the position determiner 120 can be designed, for example, to determine the position information for each of the one or more audio sources depending on a recorded image or on a recorded video.
  • the position determiner 120 can be designed, for example, to determine the position information for each of the one or more audio sources as a function of the video, by detecting a lip movement of a person in the video and, depending on the lip movement, assigning the audio source signal component to one of the one or more audio sources.
  • the detector 110 may be configured to determine one or more acoustic properties of the listening environment as a function of the at least two received microphone signals.
  • the signal generator 150 can be configured, for example, to determine the plurality of binaural room impulse responses depending on the one or more acoustic properties of the listening environment.
  • the signal component modifier 140 can be configured, for example, to select the at least one audio source whose audio source signal component is modified depending on a previously learned user scenario and to modify it depending on the previously learned user scenario.
  • the system may include a user interface 160 for selecting the previously learned user scenario from a set of two or more previously learned user scenarios.
  • Figure 3 shows such a system according to an embodiment, which additionally comprises such a user interface 160.
  • the detector 110 and/or the position determiner 120 and/or the audio type classifier 130 and/or the signal component modifier 140 and/or the signal generator 150 can be implemented, for example, using a Hough transform, using parallel signal processing with a plurality of VLSI chips, or using a plurality of memristors.
  • the system can include a hearing aid 170, for example, which serves as a hearing aid for users with limited hearing ability and/or hearing impairment, the hearing aid including at least two loudspeakers 171, 172 for outputting the at least two loudspeaker signals.
  • Figure 4 shows such a system according to an embodiment, comprising such a hearing aid 170 with two corresponding loudspeakers 171, 172.
  • the system may include, for example, at least two speakers 181, 182 for outputting the at least two speaker signals and a housing structure 183 accommodating the at least two speakers, the housing structure 183 being adapted to be attached to a head 185 of a user or to any other part of the user's body.
  • Figure 5a shows a corresponding system, which includes such a housing structure 183 and two loudspeakers 181, 182.
  • the system can include a headphone 180, for example, which includes at least two loudspeakers 181, 182 for outputting the at least two loudspeaker signals.
  • Figure 5b shows a corresponding headphone 180 with two loudspeakers 181, 182 according to an embodiment.
  • the detector 110 and the position determiner 120 and the audio type classifier 130 and the signal component modifier 140 and the signal generator 150 can be integrated into the headphones 180.
  • the system may include a remote device 190 that includes detector 110 and position determiner 120 and audio type classifier 130 and signal component modifier 140 and signal generator 150 .
  • the remote device 190 can be spatially separated from the headphones 180, for example.
  • remote device 190 may be a smartphone.
  • Embodiments do not necessarily use a microprocessor, but use parallel signal processing steps, such as Hough transformation, VLSI chips or memristors for the power-saving implementation, including artificial neural networks.
  • the auditory environment is spatially recorded and reproduced, which on the one hand uses more than one signal to represent the input signal and on the other hand also uses a spatial reproduction.
  • the signal separation is performed using Deep Learning (DL) models (e.g. CNN, RCNN, LSTM, Siamese Network) and simultaneously processes the information from at least two microphone channels, with at least one microphone being in each hearable.
  • a number of output signals (corresponding to the individual sound sources) together with their respective spatial position are determined by the joint analysis. If the recording device (microphones) is connected to the head, then the positions of the objects change when the head moves. This enables a natural focusing on important/unimportant sound, eg by turning the listener towards the sound object.
  • the signal analysis algorithms are based on a deep learning architecture, for example.
  • Alternative variants with an analysis unit or variants with separate networks are used for the aspects of localization, detection and source separation.
  • the alternative use of generalized cross-correlation takes account of the frequency-dependent shadowing by the head and improves localization, detection and source separation.
  • the networks are trained for different source categories (e.g. speech, vehicles, male/female/child voices, warning tones, etc.); the source separation networks are additionally trained for high signal quality, and the localization networks are trained with targeted stimuli for high localization accuracy.
  • the training steps mentioned above use, for example, multi-channel audio data, with a first training run usually being carried out in the laboratory using simulated or recorded audio data. This is followed by training in different natural environments (e.g. living room, classroom, train station, (industrial) production environment, etc.), i.e. transfer learning and domain adaptation take place.
  • the position detector could be coupled to one or more cameras to also determine the visual position of sound/audio sources.
  • lip movement and the audio signals coming from the source separator are correlated and thus a more precise localization is achieved.
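  • A small sketch of how the correlation between lip movement and the separated audio signals could be evaluated (assumed inputs: a per-frame lip-openness curve from the video analysis and the separated source signals; not taken from the patent):

```python
import numpy as np

def assign_speaker_source(lip_openness, separated_sources, frame=1024):
    """Correlate the per-frame lip-openness curve with the short-time energy of
    every separated source and return the index of the best-matching source."""
    def frame_energy(x, n_frames):
        return np.array([np.sum(x[i * frame:(i + 1) * frame] ** 2) for i in range(n_frames)])
    n_frames = min(len(lip_openness), min(len(s) for s in separated_sources) // frame)
    lips = np.asarray(lip_openness[:n_frames], dtype=float)
    correlations = [np.corrcoef(lips, frame_energy(s, n_frames))[0, 1]
                    for s in separated_sources]
    return int(np.argmax(correlations))
```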
  • the auralization is performed using binaural synthesis.
  • the binaural synthesis offers the further advantage that it is possible not to delete unwanted components completely, but only to reduce them to the extent that they are perceptible but not disturbing. This has the further advantage that unexpected additional sources (warning signals, calls, ...) remain audible, which would be missed if such components were switched off completely.
  • the analysis of the auditory environment is not only used to separate the objects, but also to analyze the acoustic properties (e.g. reverberation time, initial time delay gap). These properties are then used in the binaural synthesis to adapt the pre-stored (possibly also individualized) binaural room impulse responses (BRIR) to the actual room. Due to the reduction in room divergence, the listener experiences significantly reduced listening effort when understanding the optimized signals. Minimizing room divergence affects the externalization of auditory events and thus the plausibility of spatial audio reproduction in the listening room. No solutions for this are known in the prior art, neither for speech understanding nor for the general understanding of optimized signals.
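  • One simple way to adapt a pre-stored BRIR to the reverberation time estimated for the listening room is to rescale its late part with an exponential gain; this sketch illustrates that idea and is not taken from the patent:

```python
import numpy as np

def adapt_brir_reverberation(brir, fs, t60_stored, t60_room, mixing_time_s=0.05):
    """Rescale the late reverberation of a stored BRIR so that its decay matches
    the reverberation time of the listening room. Amplitude decay for a given T60
    follows exp(-6.91 * t / T60), so the tail is multiplied by the ratio of the
    target and stored decay envelopes; direct sound and early reflections are kept."""
    t = np.arange(len(brir)) / fs
    gain = np.exp(-6.91 * t * (1.0 / t60_room - 1.0 / t60_stored))
    gain[t < mixing_time_s] = 1.0
    return brir * gain
```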
  • a user interface is used to determine which sound sources are selected. According to the invention, this is done by prior learning of different user scenarios, such as "amplify speech from straight ahead" (conversation with one person), "amplify speech in the range of +/-60 degrees" (conversation in a group), "suppress speech and amplify music" (I don't want to hear the concert audience), "make everything quiet" (I want my peace), "suppress all calls and warning tones", etc.
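  • Such learned user scenarios could, for example, be represented as a table mapping sound classes and directions to gains; the scenario names and rule format below are hypothetical illustrations, not part of the patent:

```python
# Each rule maps (sound class, azimuth sector in degrees or None) to a gain factor;
# "*" matches any class, gain 1.0 leaves a component unchanged, 0.0 suppresses it.
SCENARIOS = {
    "conversation_one_person": {("speech", (-10, 10)): 2.0, ("*", None): 0.3},
    "conversation_group":      {("speech", (-60, 60)): 2.0, ("*", None): 0.3},
    "quiet":                   {("*", None): 0.1},
    "no_calls_or_alarms":      {("phone_ring", None): 0.0, ("alarm", None): 0.0},
}

def gain_for(scenario, sound_class, azimuth_deg):
    """Return the gain that the selected scenario assigns to a classified source."""
    for (cls, sector), gain in SCENARIOS[scenario].items():
        in_sector = sector is None or sector[0] <= azimuth_deg <= sector[1]
        if cls in (sound_class, "*") and in_sector:
            return gain
    return 1.0
```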
  • Some embodiments are independent of the hardware used, i.e. both open and closed headphones can be used.
  • the signal processing can be integrated in the headphones, in an external device, or in a smartphone, in which case signals from the smartphone (e.g. music, telephony) can also be processed.
  • an ecosystem for "selective listening with AI support” is provided.
  • Exemplary embodiments relate to "Personalized Auditory Reality” (PARty).
  • the listener is able to amplify, attenuate, or modify defined acoustic objects.
  • the work of the envisaged implementation phase forms an essential building block for this.
  • Some embodiments implement the analysis of the real sound environment and the detection of the individual acoustic objects, the separation, tracking and editability of the existing objects and the reconstruction and playback of the modified acoustic scene.
  • a detection of sound events, a separation of the sound events, and a suppression of some of the sound events is implemented.
  • AI methods here mean in particular deep learning-based methods.
  • Embodiments of the invention contribute to the technological development for recording, signal processing and reproduction of spatial audio.
  • Embodiments create, for example, spatiality and three-dimensionality in multimedia systems during user interaction.
  • Exemplary embodiments are based on researched knowledge of perceptual and cognitive processes of spatial hearing.
  • Scene decomposition: This includes a room-acoustic recording of the real environment and parameter estimation and/or a position-dependent sound field analysis.
  • Scene representation: This includes representation and identification of the objects and the environment and/or efficient representation and storage.
  • Scene composition and rendering: This includes object and environment adjustment and manipulation and/or rendering and auralization.
  • Quality evaluation: This includes technical and/or auditory quality measurement.
  • Signal processing: This includes feature extraction and data set generation for ML (machine learning).
  • Estimation of room and environment acoustics: This includes in-situ measurement and estimation of room acoustic parameters and/or provision of room acoustic characteristics for source separation and ML.
  • Auralization: This includes spatial audio reproduction with an auditory fit to the environment and/or validation and evaluation and/or proof of function and quality assessment.
  • Embodiments combine concepts for detecting, classifying, separating, locating, and enhancing sound sources, highlighting recent advances in each area and showing relationships between them.
  • Unified concepts are provided that can combine detect/classify/locate and separate/enhance sound sources to provide both the flexibility and robustness required for real-life SH.
  • embodiments for real-time performance provide appropriate low-latency concepts when dealing with the dynamics of real-life auditory scenes.
  • Some of the embodiments utilize concepts of deep learning, machine hearing, and smart hearables that allow listeners to selectively modify their auditory scene.
  • Embodiments provide the possibility for a listener to selectively improve, dampen, suppress or modify sound sources in the auditory scene using a hearing device such as headphones, earphones, etc.
  • the user represents the center of the auditory scene.
  • four external sound sources (S1-S4) are active around the user.
  • a user interface allows the listener to influence the auditory scene.
  • Sources S1-S4 can be attenuated, enhanced or suppressed with their respective sliders.
  • the listener can define sound sources or events to be retained or suppressed in the auditory scene.
  • In the example shown, the system is designed to suppress the background noise of the city while preserving alarms or the ringing of phones.
  • the user always has the option of playing an additional audio stream such as music or radio via the hearing device.
  • the user is usually the center of the system and controls the auditory scene via a control unit.
  • the user can control the auditory scene with a user interface like the one shown in Figure 9, or modify it with any kind of interaction such as voice control, gestures, gaze direction, etc.
  • the next step is a capture/classification/localization stage. In some cases only the capture is necessary, e.g. when the user wants to keep every speech utterance occurring in the auditory scene. In other cases classification might be necessary, e.g. if the user wants to keep fire alarms in the auditory scene, but not phone rings or office noise. In some cases only the location of the source is relevant to the system. This is the case, for example, with the four sources in Figure 9: the user can choose to remove or attenuate a sound source coming from a certain direction, regardless of the type or characteristics of the source.
  • Figure 10 illustrates a processing workflow of an SH application according to one embodiment.
  • the auditory scene is first modified in the separation/enhancement stage shown in Figure 10. This is done either by suppressing, attenuating, or enhancing one specific sound source (or several specific sound sources). As shown in Figure 10, an additional processing alternative in SH is noise control, where the goal is to remove or minimize background noise from the auditory scene. Perhaps the most popular and widely used noise control technology today is Active Noise Control (ANC) [11].
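  • The suppression, attenuation or enhancement of individual sources after separation amounts to a re-mix with per-source gains; a minimal sketch (not from the patent):

```python
import numpy as np

def remix(separated_sources, gains):
    """separated_sources: list of 1-D arrays, one per source; gains: list of floats.
    A gain of 0.0 suppresses a source, values below 1.0 attenuate it, values above
    1.0 enhance it in the re-mixed auditory scene."""
    length = max(len(s) for s in separated_sources)
    mix = np.zeros(length)
    for source, gain in zip(separated_sources, gains):
        mix[:len(source)] += gain * source
    return mix
```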
  • a source location usually refers to the direction of arrival (DOA) of a given source, which can be given either as a 2D coordinate (azimuth) or, if it includes an elevation, as a 3D coordinate .
  • Some systems also estimate the distance from the source to the microphone as location information [3].
  • location often refers to the panning of the source in the final mix and is usually specified as an angle in degrees [4].
  • embodiments utilize sound source detection, which refers to the ability to determine whether any instance of a given type of sound source is present in the auditory scene.
  • An example of a detection process is to determine if any speaker is present in the scene. In this context, determining the number of speakers in the scene or the identity of the speakers goes beyond the scope of sound source detection. Detection can be thought of as a binary classification process where the classes correspond to source present and source absent.
  • embodiments use sound source classification, which assigns a class designation from a group of predefined classes to a given sound source or sound event.
  • An example of a classification process is to determine whether a given sound source corresponds to speech, music, or ambient noise.
  • Sound source classification and detection are closely related concepts.
  • some classification systems implicitly cover detection by considering "no class" as one of the possible designations. In these cases, the system implicitly learns to detect the presence or absence of a sound source and is not forced to assign a class designation if there is insufficient evidence that any of the sources is active.
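  • A minimal sketch of such a classifier output stage with a "no class" rejection (threshold value and label names are hypothetical):

```python
import numpy as np

def classify_with_reject(class_scores, labels, threshold=0.5):
    """class_scores: per-class scores (e.g. sigmoid outputs) of a sound event
    classifier. Returns the active labels, or an empty list ("no class") if no
    score exceeds the threshold, i.e. there is insufficient evidence that any
    of the sources is active."""
    scores = np.asarray(class_scores)
    return [labels[i] for i in np.flatnonzero(scores >= threshold)]
```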
  • embodiments utilize sound source separation, which refers to the extraction of a given sound source from an audio mix or an auditory scene.
  • sound source separation is the extraction of a singing voice from an audio mix in which other musical instruments are played simultaneously in addition to the singer [5].
  • Sound source separation becomes relevant in a selective listening scenario, as it allows for the suppression of sound sources that are of no interest to the listener.
  • Some sound separation systems implicitly perform a detection process before extracting the sound source from the mix. However, this is not necessarily the rule and so we emphasize the distinction between these operations.
  • the separation often serves as a pre-processing stage for other types of analysis such as source enhancement [6] or classification [7].
  • embodiments use sound source identification, which goes one step further and aims to identify specific instances of a sound source in an audio signal. Speaker identification is perhaps the most common use of source identification today. The goal of this process is to identify whether a specific speaker is present in the scene. In the example described above, the user has selected "speaker X" as one of the sources to keep in the auditory scene. This requires technologies that go beyond speech capture and classification, and requires speaker-specific models that enable this precise identification.
  • embodiments utilize sound source enhancement, which refers to the process of increasing the prominence of a given sound source in the auditory scene [8].
  • for speech signals, the goal is often to increase their perceived quality and intelligibility.
  • a common scenario for speech enhancement is denoising noise-tainted speech utterances [9].
  • source enhancement refers to the concept of making remixes and is often done to make a musical instrument (sound source) stand out more in the mix.
  • Remixing applications often use sound separation front-ends to gain access to the individual sound sources and to change the characteristics of the mix [10].
  • although sound enhancement may be preceded by a sound source separation stage, this is not always the case, and so we also emphasize the distinction between these two terms.
  • some of the embodiments use, for example, one of the following concepts, such as the detection and classification of acoustic scenes and events [18].
  • AED audio event detection
  • 10 sound event classes were considered, including cat, dog, speech, alarm, and running water.
  • Methods for detecting polyphonic sound events have also been proposed in the literature [21], [22].
  • a method for the detection of polyphonic sound events is proposed, in which a total of 61 sound events from real-life situations are detected using binary activity detectors based on a recurrent neural network (RNN) with bidirectional long short-term memory (BLSTM).
  • the handling of noisy labels in classification is particularly relevant for selective-listening applications, where the class designations can be so diverse that high-quality annotations are very costly [24].
  • Noisy labels in sound event classification have been addressed in [25], where noise-robust loss functions based on categorical cross-entropy are presented, as well as ways to evaluate data with noisy labels alongside manually labeled data.
  • [26] presents a system for audio event classification based on a convolutional neural network (CNN) that includes a verification step for sound labels based on a CNN prediction consensus on several segments of the test example.
  • some embodiments implement simultaneous detection and localization of sound events.
  • some embodiments, as in [27], perform the detection as a multi-label classification process and the location is given as the 3D coordinates of the direction of arrival (DOA) for each sound event.
  • Some embodiments use concepts of voice activity detection and speaker recognition/identification for SH.
  • Voice activity detection has been addressed in noisy environments using denoising autoencoders [28], recurrent neural networks [29] or as an end-to-end system using raw waveforms [30].
  • many schemes have been proposed in the literature [31], with the vast majority focusing on increasing robustness to different conditions, for example with data augmentation or with improved embeddings that facilitate recognition [32]-[34]. Some of the embodiments therefore make use of these concepts.
  • sound source localization is closely related to the problem of source counting, since the number of sound sources in the auditory scene is usually not known in real-life applications.
  • Some systems operate on the assumption that the number of sources in the scene is known. This is the case, for example, with the model presented in [39], which uses histograms of active intensity vectors to locate the sources.
  • [40] proposes, from a controlled perspective, a CNN-based algorithm to estimate the DOA of multiple speakers in the auditory scene using phase maps as input representations. In contrast, several works in the literature collectively estimate the number of sources in the scene and their location information.
  • Sound source localization algorithms can be computationally demanding as they often involve scanning a large space around the auditory scene [42].
  • some of the embodiments use concepts that reduce the search space, for example by using clustering algorithms [43] or by performing multi-resolution searches [42], compared to standard approaches such as those based on the steered-response power phase transform (SRP-PHAT).
  • Other methods place requirements on the sparsity of the matrix and assume that only one sound source is dominant in a given time-frequency range [44].
  • [45] proposed an end-to-end system for azimuth detection directly from the raw waveforms.
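  • For reference, a classical GCC-PHAT estimate of the direction of arrival for a single microphone pair is sketched below; it is a conventional baseline related to the SRP-PHAT family, not the method claimed in the patent (microphone spacing and sampling rate are assumed inputs):

```python
import numpy as np

def gcc_phat_azimuth(x_left, x_right, fs, mic_distance_m, speed_of_sound=343.0):
    """Estimate the time difference of arrival between two microphone signals with
    GCC-PHAT and convert it to an azimuth angle (far-field, free-field assumption)."""
    n = len(x_left) + len(x_right)
    spectrum = np.fft.rfft(x_left, n=n) * np.conj(np.fft.rfft(x_right, n=n))
    spectrum /= np.abs(spectrum) + 1e-12                 # PHAT weighting
    cc = np.fft.irfft(spectrum, n=n)
    max_shift = int(fs * mic_distance_m / speed_of_sound) + 1
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / fs       # delay in seconds
    sin_theta = np.clip(tau * speed_of_sound / mic_distance_m, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))
```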
  • SSS Sound Source Separation
  • some embodiments employ concepts of speaker independent separation. There, a separation occurs without any prior information about the speakers in the scene [46]. Some embodiments also evaluate the speaker's spatial location to perform a separation [47].
  • Some embodiments employ music sound separation (MSS) concepts to extract a music source from an audio mix [5], such as main instrument and accompaniment separation concepts [52]. These algorithms take the most prominent sound source in the mix, regardless of its class designation, and attempt to separate it from the rest of the accompaniment.
  • Some embodiments use concepts for singing voice separation [53]. In most cases, either specific source models [54] or data-driven models [55] are used to capture the characteristics of the singing voice.
  • although systems like the one proposed in [55] do not explicitly include a classification or detection stage to achieve separation, the data-driven nature of these approaches allows them to implicitly learn to detect the singing voice with some accuracy before separation.
  • ANC systems mainly aim to reduce background noise for headphone users by using an anti-noise signal to cancel it [11].
  • ANC can be viewed as a special case of SH and faces an equally stringent requirement [14].
  • Some work has focused on antinoise in specific environments such as automotive interiors [56] or operational scenarios [57].
  • the work in [56] analyzes the cancellation of different types of noise, such as road noise and engine noise, and requires unified systems capable of dealing with different types of noise.
  • Some work has focused on developing ANC systems for canceling noise over specific spatial regions.
  • for example, ANC over a spatial region is discussed, using spherical harmonics as basis functions to represent the noise field.
  • Some of the embodiments use sound source enhancement concepts.
  • Source enhancement in the context of music mostly refers to applications for making music remixes. In contrast to speech enhancement, where the assumption is often that the speech is affected only by noise sources, music applications mostly assume that other sound sources (musical instruments) are playing simultaneously with the source to be enhanced. Therefore, music remix applications are always preceded by a source separation stage.
  • early jazz recordings were remixed using techniques to separate lead and accompaniment, harmonic and percussion instruments to achieve better tonal balance in the mix.
  • [63] investigated the use of different vocal separation algorithms to change the relative loudness of the vocal and backing track, showing that an increase of 6 dB is possible by introducing slight but audible distortions into the final mix.
  • the authors explore ways to improve music perception for cochlear implant users by applying sound source separation techniques to achieve new mixes. The concepts described there are used by some of the embodiments.
  • Some embodiments employ concepts to improve the robustness of current machine hearing methods as described in [25], [26], [32], [34], as well as newly emerging directions such as domain adaptation [67] and learning based on data sets recorded with multiple devices [68].
  • Some of the embodiments employ concepts for improving the computational efficiency of machine hearing as described in [48], or concepts described in [30], [45], [50], [61] that are able to deal with unprocessed waveforms.
  • Some embodiments implement a unified optimization scheme that combines detection/classification/localization and separation/enhancement to selectively modify sound sources in the scene, with the independent detection, separation, localization, classification, and enhancement methods being reliable enough to provide the robustness and flexibility required for SH.
  • Some embodiments are suitable for real-time processing, with a good trade-off between algorithmic complexity and performance.
  • Some embodiments combine ANC and machine hearing. For example, the auditory scene is first classified and then ANC is selectively applied.
  • the transfer functions capture the properties of the sound sources, the direct sound between the objects and the user, and all reflections that occur in the room. In order to ensure correct spatial audio reproduction for the room acoustics of a real room in which the listener is currently located, the transfer functions must also represent the room-acoustic properties of the listening room with sufficient accuracy.
  • the challenge lies in the appropriate recognition and separation of the individual audio objects when a large number of audio objects are present. Furthermore, the audio signals of the objects in the recording position or in the listening position of the room overlap. Both the room acoustics and the superimposition of the audio signals change when the objects and/or the listening positions in the room change.
  • Room acoustics parameters must be estimated quickly enough in the case of relative movement. A low latency of the estimation is more important than a high accuracy. On the other hand, if the position of the source and receiver do not change (static case), a high degree of accuracy is required.
  • room acoustics parameters, as well as room geometry and listener position, are estimated or extracted from a stream of audio signals. The audio signals are recorded in a real environment in which the source(s) and receiver(s) can move in any direction, and in which the source(s) and/or receiver(s) can change their orientation in any way.
  • the audio signal stream can be the result of any microphone setup that includes one or more microphones.
  • the streams are fed to a signal processing stage for pre-processing and/or further analysis. Thereafter, the output is fed to a feature extraction stage. This stage estimates the room acoustics parameters, e.g. T60 (reverberation time), DRR (direct-to-reverberation ratio) and others.
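  • The patent aims at blind, running estimation of these parameters from the microphone streams; purely to illustrate the quantities involved, the following sketch computes T60 (via Schroeder backward integration) and DRR from an already measured room impulse response:

```python
import numpy as np

def t60_from_rir(h, fs):
    """Reverberation time from an impulse response: Schroeder backward integration,
    line fit between -5 dB and -25 dB, extrapolated to a 60 dB decay."""
    edc = np.cumsum(h[::-1] ** 2)[::-1]
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)
    t = np.arange(len(h)) / fs
    i5, i25 = np.argmax(edc_db <= -5.0), np.argmax(edc_db <= -25.0)
    slope = (edc_db[i25] - edc_db[i5]) / (t[i25] - t[i5])   # dB per second
    return -60.0 / slope

def drr_from_rir(h, fs, direct_window_s=0.0025):
    """Direct-to-reverberation ratio in dB: energy in a short window around the
    direct-sound peak versus the energy of the remaining reverberant tail."""
    peak = int(np.argmax(np.abs(h)))
    n = int(direct_window_s * fs)
    direct = np.sum(h[max(0, peak - n):peak + n] ** 2)
    reverberant = np.sum(h[peak + n:] ** 2)
    return 10.0 * np.log10(direct / (reverberant + 1e-12))
```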
  • a second data stream is generated by a 6DoF sensor ("six degrees of freedom": three coordinates for position in space and three for orientation) that records the orientation and position of the microphone setup.
  • the position data stream is fed into a 6DoF signal processing stage for pre-processing or further analysis.
  • the output of the 6DoF signal processing, the audio feature extraction stage and the pre-processed microphone streams is fed into a machine learning block that estimates the listening room (size, geometry, reflective surfaces) and the position of the microphone array in the room.
  • a user behavior model is applied to enable a more robust estimation. This model takes into account limitations of human movements (e.g. continuous movement, speed, etc.), as well as the probability distribution of different types of movements.
  • Some of the embodiments realize a blind estimation of room acoustics parameters by using arbitrary microphone arrays and by adding position and pose information of the user, and by analyzing the data with machine learning methods.
  • Systems according to embodiments may be used for acoustic augmented reality (AAR), for example.
  • Some embodiments involve removing reverberations from the recorded signals.
  • Examples of such embodiments are hearing aids for people with normal hearing and those who are hard of hearing.
  • the reverberation can be removed from the input signal of the microphone setup with the help of the estimated parameters.
  • Another application is the spatial synthesis of audio scenes created in a room other than the current listening room.
  • the room-acoustic parameters which are part of the audio scenes, are adapted to the room-acoustic parameters of the listening room.
  • the available BRIRs are adapted to the acoustic parameters of the listening room.
  • an apparatus for determining one or more room acoustics parameters is provided.
  • the device is designed to receive microphone data that includes one or more microphone signals.
  • the device is designed to receive tracking data relating to a position and/or an orientation of a user.
  • the device is designed to determine the one or more room acoustics parameters as a function of the microphone data and as a function of the tracking data.
  • the device may be configured to use machine learning to determine the one or more room acoustic parameters based on the microphone data and based on the tracking data.
  • the device may be configured to employ machine learning in that the device may be configured to employ a neural network.
  • the device may be configured to use cloud-based processing for machine learning.
  • the one or more room acoustic parameters may include reverberation time.
  • the one or more room acoustic parameters may include a direct-to-reverberation ratio.
  • the tracking data indicating the user's position may include, for example, an x-coordinate, a y-coordinate, and a z-coordinate.
  • the tracking data indicating the user's orientation may include, for example, a pitch coordinate, a yaw coordinate, and a roll coordinate.
  • the device can be designed, for example, to transform the one or more microphone signals from a time domain into a frequency domain, to extract one or more features of the one or more microphone signals in the frequency domain, and to determine the one or more room acoustics parameters depending on the one or more features.
  • the device may be configured to use cloud-based processing to extract the one or more features.
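For illustration, such a time-to-frequency transform with a few simple frame-wise features could look like the following sketch; the chosen features are assumptions for illustration, not a prescribed feature set.

```python
import numpy as np
from scipy.signal import stft

def extract_features(x, fs, n_fft=1024):
    """Transform one microphone signal into the frequency domain and derive a
    few simple per-frame features (assumed feature set, for illustration)."""
    f, t, X = stft(x, fs, nperseg=n_fft)
    mag = np.abs(X)
    log_energy = 10.0 * np.log10(np.sum(mag ** 2, axis=0) + 1e-12)          # per frame
    centroid = np.sum(f[:, None] * mag, axis=0) / (np.sum(mag, axis=0) + 1e-12)
    flatness = np.exp(np.mean(np.log(mag + 1e-12), axis=0)) / (np.mean(mag, axis=0) + 1e-12)
    return {"log_energy": log_energy, "centroid": centroid, "flatness": flatness}
```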
  • the device may include a microphone array of multiple microphones to pick up the multiple microphone signals.
  • the microphone arrangement can be designed, for example, to be worn on the body by a user.
  • the system described above may further comprise, for example, a device as described above for determining one or more room acoustics parameters.
  • the signal portion modifier 140 can be configured, for example, to change the audio source signal portion of the at least one audio source of the one or more audio sources as a function of at least one of the one or more room acoustics parameters; and/or the signal generator 150 can be designed, for example, to generate at least one of the plurality of binaural room impulse responses for each audio source of the one or more audio sources depending on the at least one of the one or more room acoustics parameters.
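As a hedged illustration of adapting available binaural room impulse responses to an estimated room acoustics parameter, the sketch below re-weights the reverberant tail of a BRIR so that its decay matches a target T60; the mixing-time split and the function name are assumptions, not the claimed procedure.

```python
import numpy as np

def adapt_brir_t60(brir, fs, t60_orig, t60_target, mixing_time_ms=20.0):
    """Re-weight the reverberant tail of a (channels, samples) BRIR so that its
    decay rate matches the target T60; direct sound and early reflections
    before the assumed mixing time are left untouched."""
    brir = np.atleast_2d(np.asarray(brir, dtype=float))
    n_mix = int(mixing_time_ms * 1e-3 * fs)
    t = np.arange(brir.shape[1]) / fs
    # amplitude decays by 60 dB (factor 10**3) over T60 -> constant 6.91 = ln(1000)
    k = 6.91 * (1.0 / t60_target - 1.0 / t60_orig)
    weight = np.ones(brir.shape[1])
    weight[n_mix:] = np.exp(-k * (t[n_mix:] - t[n_mix]))
    return brir * weight
```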
  • Figure 12 shows a system according to an embodiment comprising five sub-systems (sub-systems 1-5).
  • Sub-system 1 includes a microphone setup of one, two or more individual microphones that can be combined into a microphone array if more than one microphone is available.
  • the positioning and relative arrangement of the microphone(s) to one another can be arbitrary.
  • the microphone assembly can be part of a device worn by the user or may be a separate device positioned in the space of interest.
  • sub-system 1 comprises a tracking device to measure the user's translational positions and the user's head pose in space. Up to 6-DOF (x-coordinate, y-coordinate, z-coordinate, pitch angle, yaw angle, roll angle) can be measured.
  • the tracking device can be positioned on a user's head, or it can be split into different sub-devices to measure the required DOFs and placed on the user or not on the user.
  • Subsystem 1 thus represents an input interface that includes a microphone signal input interface 101 and a position information input interface 102.
  • Sub-system 2 includes signal processing for the captured microphone signal(s). This includes frequency transformations and/or time-domain based processing. Furthermore, this includes methods for combining different microphone signals in order to realize array (sound-field) processing, e.g. beamforming; an illustrative sketch follows below. It is possible to feed back from sub-system 4 in order to adapt parameters of the signal processing in sub-system 2.
  • the signal processing block of the microphone signal(s) can be part of the device in which the microphone(s) are built or it can be part of a separate device. It can also be part of cloud-based processing.
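A simple example of combining several microphone signals into one array-processed signal is a delay-and-sum beamformer; the following sketch is illustrative only and not a specification of the processing in sub-system 2.

```python
import numpy as np

def delay_and_sum(signals, mic_pos, look_dir, fs, c=343.0):
    """Steer a delay-and-sum beamformer towards the unit vector `look_dir`.
    signals: (n_mics, n_samples), mic_pos: (n_mics, 3) in metres."""
    delays = mic_pos @ look_dir / c            # alignment delays in seconds
    delays -= delays.min()                     # keep all delays non-negative
    n = signals.shape[1]
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    out = np.zeros(n)
    for sig, tau in zip(signals, delays):
        # fractional delay applied as a phase ramp in the frequency domain
        out += np.fft.irfft(np.fft.rfft(sig) * np.exp(-2j * np.pi * freqs * tau), n)
    return out / len(signals)
```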
  • sub-system 2 includes signal processing for the recorded tracking data. This includes frequency transforms and/or time-domain based processing. It also includes methods to improve the technical quality of the signals using noise reduction, smoothing, interpolation and extrapolation. It also includes procedures to derive higher-level information, such as speeds, accelerations, travel directions, rest times, movement areas and movement paths. Further, this includes predicting a near-future trajectory and a near-future velocity (see the sketch below).
  • the signal processing block of the tracking signals can be part of the tracking device or it can be part of a separate device. It can also be part of cloud-based processing.
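The tracking-data processing described above (smoothing, derived velocities, near-future prediction) could, for example, look like this minimal sketch with constant-velocity extrapolation; parameter values are illustrative.

```python
import numpy as np

def process_tracking(positions, fs_track, alpha=0.2, horizon_s=0.5):
    """Smooth a stream of (N, 3) user positions, derive velocity and speed, and
    predict a near-future position by constant-velocity extrapolation."""
    positions = np.asarray(positions, dtype=float)
    smoothed = np.empty_like(positions)
    smoothed[0] = positions[0]
    for i in range(1, len(positions)):                     # exponential smoothing
        smoothed[i] = alpha * positions[i] + (1.0 - alpha) * smoothed[i - 1]
    velocity = np.gradient(smoothed, 1.0 / fs_track, axis=0)   # m/s per axis
    speed = np.linalg.norm(velocity, axis=1)
    predicted = smoothed[-1] + velocity[-1] * horizon_s        # near-future position
    return smoothed, velocity, speed, predicted
```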
  • Sub-system 3 involves the extraction of features from the processed microphone signal(s).
  • the feature extraction block can be part of the user's handheld device, or it can be part of a separate device. It can also be part of cloud-based processing.
  • the result of an audio type classification in sub-system 3, module 121 can be passed on to sub-system 2, module 111 (feedback).
  • Sub-system 2, module 112 implements a position determiner 120, for example.
  • sub-systems 2 and 3 can also implement the signal generator 150, for example by sub-system 2, module 111 generating the binaural room impulse responses and generating the loudspeaker signals.
  • Sub-system 4 includes methods and algorithms to estimate room acoustic parameters using the processed microphone signal(s), the extracted features of the microphone signal(s), and the processed tracking data.
  • the outputs of this block are the room acoustics parameters as static data ("data at rest"), as well as a control and modification of the parameters of the microphone signal processing in sub-system 2.
  • the machine learning block 131 can be part of the user's device or it can be part of a separate device. It can also be part of cloud-based processing.
  • sub-system 4 includes post-processing of the static room-acoustics parameters (e.g. in block 132). This includes detection of outliers, combination of single parameters into a new parameter, smoothing, extrapolation, interpolation and a plausibility check (see the sketch below). This block also receives information from sub-system 2, including near-future positions of the user in the room, in order to estimate near-future acoustic parameters. This block can be part of the user's device or it can be part of a separate device. It can also be part of cloud-based processing.
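The post-processing of the parameter stream (outlier detection, smoothing, plausibility check) could be sketched as follows; the plausible T60 range, filter length and smoothing factor are assumptions.

```python
import numpy as np
from scipy.signal import medfilt

def postprocess_t60(raw_t60, t60_min=0.1, t60_max=3.0, kernel=5, alpha=0.3):
    """Post-process a stream of raw T60 estimates: median filter against
    outliers, clamp to a plausible range, then smooth exponentially."""
    x = medfilt(np.asarray(raw_t60, dtype=float), kernel_size=kernel)
    x = np.clip(x, t60_min, t60_max)          # plausibility check
    out = np.empty_like(x)
    out[0] = x[0]
    for i in range(1, len(x)):
        out[i] = alpha * x[i] + (1.0 - alpha) * out[i - 1]
    return out
```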
  • Sub-system 5 includes the storage and allocation of the room acoustic parameters for downstream systems (e.g. in memory 141).
  • the allocation of the parameters can be realized just-in-time, and/or the time course can be stored.
  • Storage can be done on the device that is on or near the user, or can be done on a cloud-based system.
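Storage and just-in-time allocation of the parameters to downstream systems could be as simple as a time-stamped store with a "latest value before t" lookup, as in this illustrative sketch:

```python
import bisect

class RoomParamStore:
    """Keeps the time course of room acoustics parameters and hands the set
    valid at a requested time to downstream renderers (just-in-time allocation)."""
    def __init__(self):
        self._times, self._params = [], []

    def push(self, timestamp, params):
        self._times.append(timestamp)          # assumed monotonically increasing
        self._params.append(dict(params))

    def query(self, timestamp):
        """Return the most recent parameter set stored at or before `timestamp`."""
        i = bisect.bisect_right(self._times, timestamp) - 1
        return self._params[i] if i >= 0 else None

# usage
store = RoomParamStore()
store.push(0.0, {"t60": 0.45, "drr": 8.2})
store.push(1.0, {"t60": 0.50, "drr": 7.9})
current = store.query(1.2)    # -> {"t60": 0.50, "drr": 7.9}
```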
  • One use case of an embodiment is home entertainment and relates to users in a home environment.
  • a user would like to concentrate on certain playback devices such as TV, radio, PC, tablet and block out other sources of interference (from other users' devices or children, construction noise, street noise).
  • the user is in the vicinity of the preferred playback device and selects the device or its position. Regardless of the user's position, the selected device or sound source positions are acoustically highlighted until the user cancels their selection.
  • the user goes near the target sound source.
  • the user selects the target sound source via a suitable interface, and the hearable adjusts the audio playback based on the user's position, user's line of sight and the target sound source, so that the target sound source can be clearly understood even in the presence of background noise.
  • the user moves close to a particularly disruptive sound source.
  • the user selects this noise source via a suitable interface, and the hearable (hearing aid) adjusts the audio playback based on the user's position, user's line of sight and the source of the noise in order to explicitly suppress the source of the noise.
  • Another use case of another embodiment is a cocktail party where a user is between multiple speakers. For example, when many speakers are present, a user would like to concentrate on one (or more) speakers and block out or attenuate other sources of interference. In this application, the control of the hearable may only require little active interaction from the user. Optionally, the strength of the selectivity could be controlled using biosignals or recognizable indicators of conversational difficulty (frequent queries, foreign languages, strong dialects).
  • the speakers are randomly distributed and move relative to the listener.
  • there are regular pauses in speaking; new speakers join, other speakers move away.
  • Noise such as music can be comparatively loud under certain circumstances.
  • the selected speaker is highlighted acoustically and recognized again even after pauses in speaking, changes in position or pose.
  • a hearable recognizes a speaker in the user's environment.
  • the user can use a suitable control option (e.g. line of sight, attention control) to select preferred speakers.
  • the hearable adapts the audio playback according to the user's line of sight and the selected target sound source in order to be able to understand the target sound source even with background noise.
  • the user is addressed directly by a (previously) non-preferred speaker, who must at least be audible to ensure natural communication.
  • Another use case of another embodiment is in the automobile, where a user is in his own or another car. While driving, the user would like to actively direct their acoustic attention to certain playback devices such as navigation devices, radio or conversation partners in order to be able to understand them better against the background noise (wind, engine, passengers).
  • the user and the target sound sources are in fixed positions inside the vehicle.
  • the user is static in relation to the reference system, but the vehicle itself moves.
  • An adapted tracking solution is therefore necessary.
  • the selected sound source position is acoustically highlighted until the user cancels the selection or until warning signals temporarily suspend this function.
  • a user gets into the car and the device recognizes the surroundings.
  • the user can switch between the target sound sources using a suitable control option (e.g. speech recognition), and the hearable adjusts the audio playback according to the user's viewing direction and the selected target sound source in order to be able to understand the target sound source well even with background noise.
  • traffic-related warning signals interrupt the normal process and cancel the user's selection. The normal process is then restarted.
  • Another application of a further exemplary embodiment is live music and relates to a visitor to a live music event.
  • the visitor to a concert or live music performance would like to use the hearable to increase the focus on the performance and block out distracting listeners.
  • the audio signal itself can be optimized, for example to compensate for an unfavorable listening position or room acoustics.
  • the visitor is between many sources of interference, but the performances are usually relatively loud.
  • the target sound sources are in fixed positions or at least in a defined area, but the user can be very mobile (e.g. dancing).
  • the selected sound source position is acoustically highlighted until the user cancels the selection or until warning signals temporarily suspend this function.
  • the user selects the stage area or the musician(s) as the target sound source(s).
  • the user can use a suitable control option to define the position of the stage/musicians, and the hearable adapts the audio playback according to the user's viewing direction and the selected target sound source, so that the target sound source can be understood well even with background noise.
  • warning information (e.g. evacuation, imminent thunderstorm at outdoor events) and warning signals can interrupt the normal process and cancel the user's selection. The normal process then restarts.
  • a further application of another exemplary embodiment is for large events and concerns visitors at large events.
  • major events, e.g. football stadium, ice hockey stadium, large concert hall, etc.
  • a hearable can be used to emphasize the voices of family members and friends who would otherwise be lost in the noise of the crowds.
  • a major event takes place in a stadium or a large concert hall where a large number of visitors go.
  • a group (family, friends, school class) visits the event and is in front of or in the event area, where a large crowd of visitors is walking around.
  • One or more children lose eye contact with the group and call out to the group, but despite the high volume of their calls they cannot be heard over the surrounding noise.
  • The hearable initially does not amplify the voice(s). For example, one person from the group selects the voice of the missing child on the hearable. The hearable localizes the voice, then amplifies it, and the user can find the missing child again (more quickly) using the amplified voice.
  • the missing child also wears a hearable, for example, and selects the voice of their parents.
  • the hearable amplifies the parents' voice(s). The amplification then allows the child to locate its parents, so the child can walk back to them.
  • the missing child also wears a hearable and selects the voice of their parents. The hearable locates the parent's voice(s) and the hearable announces the distance to the voices. The child can find its parents more easily. An optional playback of an artificial voice from the hearable for the distance announcement is provided.
  • the hearables are coupled for a targeted amplification of the voice(s) and voice profiles are stored.
  • a further application of a further exemplary embodiment is leisure sports and relates to leisure athletes. Listening to music while exercising is popular, but it also poses risks. Warning signals or other road users may not be heard. In addition to music playback, the hearable can react to warning signals or shouts and temporarily interrupt music playback.
  • Another use case in this context is sport in small groups. The sports group's hearables can be connected to ensure good communication with each other during sports while other noise is suppressed.
  • the user is mobile and any warning signals are overlaid by numerous sources of interference.
  • the problem is that not all warning signals may affect the user (far away sirens in the city, horns on the street).
  • the Hearable automatically suspends music playback and acoustically highlights the warning signal or the communication partner until the user cancels his selection. The music will then continue to play normally.
  • a user does sports and listens to music through the hearable. Warning signals or shouts affecting the user are automatically recognized and the hearable interrupts the music playback. The hearable adjusts the audio playback in order to be able to clearly understand the target sound source within the acoustic environment. The hearable then continues playing music automatically (e.g. after the end of the warning signal) or at the request of the user.
  • athletes in a group can connect their hearables, for example.
  • the speech intelligibility between the group members is optimized and at the same time other disturbing noises are suppressed.
  • Another application of another embodiment is snoring suppression and affects all sleep seekers disturbed by snoring. People whose partners snore, for example, are disturbed in their nightly rest and have problems sleeping. The Hearable provides relief by suppressing the snoring noises, thus ensuring night-time rest and domestic peace. At the same time, the hearable allows other noises (crying babies, alarm sirens, etc.) to pass through so that the user is not completely acoustically isolated from the outside world.
  • a snoring detection is provided, for example.
  • the user has trouble sleeping due to snoring noises.
  • with the hearable, the user can then sleep better again, which has a stress-reducing effect.
  • the user wears the hearable while sleeping. He switches the hearable to sleep mode, which suppresses all snoring noises. After sleeping, he turns the hearable off again.
  • noises such as construction noise, lawn mower noise, etc. can be suppressed while sleeping.
  • Another application of a further exemplary embodiment is a diagnostic device for users in everyday life.
  • the hearable records the preferences (e.g. which sound sources and which amplification/attenuation are selected) and creates a profile with tendencies over the period of use. This data can be used to draw conclusions about changes in hearing ability.
  • the goal is the early detection of hearing loss.
  • the user wears the device in everyday life or in the use cases mentioned for several months or years.
  • the hearable creates analyses based on the selected settings and gives warnings and recommendations to the user.
  • the user wears the hearable over a long period of time (months to years).
  • the device automatically creates analyses based on hearing preferences, and the device provides recommendations and warnings when hearing loss begins.
  • a further application of another exemplary embodiment is a therapy device and affects users with hearing impairments in everyday life.
  • in its role as a transitional device towards a hearing aid, the therapy device treats potential patients at an early stage, and dementia is thus addressed preventively.
  • Other possibilities are use as a concentration trainer (e.g. for ADHD), treatment of tinnitus and stress reduction.
  • the user has hearing or attention problems and uses the hearable temporarily/transitionally as a hearing aid.
  • these problems are alleviated by the hearable, for example by: amplification of all signals (hearing impairment), high selectivity for preferred sound sources (attention deficits), reproduction of therapy noises (tinnitus treatment).
  • the user selects a form of therapy independently or on the advice of a doctor and makes the preferred settings, and the hearable carries out the selected therapy.
  • the Hearable detects hearing problems from UC-PRO1, and the Hearable automatically adjusts playback based on the problems detected and notifies the user.
  • Another use case of another embodiment is public sector work and relates to public sector workers.
  • Employees in the public sector (hospitals, paediatricians, airport counters, educators, gastronomy, service counters, etc.) who are exposed to a high level of noise during work wear a hearable to improve speech communication with one or just a few people and for better occupational safety, e.g. through stress reduction.
  • a person switches on the attached hearable.
  • the user sets the hearable to select nearby voices, and the hearable amplifies the closest voice or a few nearby voices while suppressing background noise.
  • the user understands the relevant voice(s) better.
  • a person puts the hearable on permanent noise suppression.
  • the user turns on the function of recognizing occurring voices and then amplifying them. This allows the user to continue working with less noise.
  • when the user is addressed directly from within a radius of x meters, the hearable amplifies the voice(s). The user can thus converse with the other person(s) at a low noise level. After the conversation, the hearable switches back to noise-cancelling-only mode, and after work, the user turns the hearable off again.
  • Another application of another exemplary embodiment is passenger transport and relates to users in a motor vehicle for passenger transport.
  • a user and driver of a passenger transporter would like to be distracted as little as possible by the people being transported while driving.
  • the passengers are the main source of interference; communication with them is also necessary at times.
  • the Hearable suppresses background noise from the occupants by default.
  • the user can manually override the suppression using a suitable control option (e.g. voice recognition, button in the vehicle).
  • the Hearable adjusts the audio playback according to the selection.
  • the hearable detects that a passenger is actively addressing the driver and temporarily disables noise cancellation.
  • Another application of a further embodiment is school and training and relates to teachers and students in the classroom.
  • the hearable has two roles, with the functions of the devices being partially coupled.
  • the teacher's/presenter's device suppresses background noise and amplifies speech/questions from the ranks of the students.
  • the hearables of the listeners can be controlled via the teacher's device. In this way, particularly important content can be highlighted without having to speak louder. Students can adjust their Hearable to better understand the teacher and block out disruptive classmates.
  • a teacher or lecturer presents content and the device suppresses background noise.
  • the teacher wants to hear a student's question and changes the focus of the hearable to the questioner (automatically or through a suitable control option). After communication, all noises are suppressed again.
  • it can be provided that, for example, a student who feels disturbed by classmates hides them acoustically.
  • a student sitting far away from the teacher can amplify his voice.
  • teacher and student devices can be paired, for example.
  • the selectivity of the student devices can be temporarily controlled by the teacher device.
  • the teacher changes the selectivity of the student devices to amplify their voice.
  • Another use case of another embodiment is in the military and pertains to soldiers.
  • Verbal communication between soldiers on deployment takes place on the one hand via radios and on the other hand via shouts and direct addressing.
  • Radio is mostly used when greater distances have to be bridged and when communication between different units and subgroups is to be carried out.
  • a fixed radio etiquette is often applied.
  • Shouting and direct addressing is mostly used for communication within a squad or group. Difficult acoustic conditions can arise during the deployment of soldiers (e.g. screaming people, noise from weapons, storms), which can impair both communication channels.
  • a soldier's equipment often includes a radio set with earphones. In addition to the purpose of audio reproduction, these also protect against excessive sound pressure levels.
  • shouting out and direct addressing between soldiers in action can be made more difficult by background noise.
  • This problem is currently being addressed by radio solutions in the short range and for longer distances.
  • the new system enables calling out and direct addressing at close range through intelligent and spatial emphasis of the respective speaker with simultaneous attenuation of ambient noise.
  • the soldier is on duty. Shouts and speech are automatically recognized and the system amplifies them while simultaneously dampening background noise.
  • the system adjusts the spatial audio reproduction in order to be able to clearly understand the target sound source.
  • the soldiers in a group can be known to the system. Only audio from those group members will pass through.
  • the hearable can be used at confusing large events (celebrations, protests) for preventive crime detection.
  • the selectivity of the hearable is controlled by keywords, e.g. calls for help or calls for violence. This requires an analysis of the content of the audio signal (e.g. speech recognition).
  • the security officer is surrounded by many loud sound sources, and the officer and all of the sound sources may be in motion.
  • a person calling for help is not audible or only faintly audible under normal hearing conditions (poor SNR).
  • the manually or automatically selected sound source is acoustically highlighted until the user cancels the selection.
  • a virtual sound object is placed at the position/direction of the interesting sound source in order to be able to easily find the location (e.g. in the event of a one-time call for help).
  • the hearable recognizes sound sources with potential sources of danger.
  • a security officer chooses which sound source or event he would like to investigate (e.g. by selecting it on a tablet).
  • the hearable then adjusts the audio playback in order to be able to understand and locate the target sound source even with background noise.
  • a locating signal can be placed in the direction/distance of the source.
  • Another use case of another embodiment is stage communication and relates to musicians.
  • stages at rehearsals or concerts (e.g. band, orchestra, choir, musical)
  • individual instruments or instrument groups that could still be heard in other surroundings cannot be heard here.
  • the hearable can emphasize these voices and make them audible again and thus improve or ensure the interaction of the individual musicians.
  • the use of this could also reduce the noise exposure of individual musicians and thus prevent hearing loss, for example by muting the drums, and at the same time the musicians could still hear everything important.
  • a musician without Hearable can no longer hear at least one other voice on stage.
  • the hearable can then be used here.
  • the user puts the hearable back down after switching it off.
  • the user turns on the hearable. He selects one or more desired musical instruments to be amplified. When playing music together, the hearable now amplifies the selected musical instrument and thus makes it audible again. After making music, the user switches the hearable off again. In an alternate example, the user turns on the hearable. He selects the desired musical instrument whose volume is to be reduced. When making music together, the hearable now reduces the volume of the selected musical instrument so that the user only hears it at a moderate volume.
  • Another application of a further exemplary embodiment is source separation as a software module for hearing aids in terms of the ecosystem and relates to hearing aid manufacturers and hearing aid users.
  • Hearing aid manufacturers can use source separation as an additional tool for their hearing aids and offer it to customers.
  • Hearing aids could also benefit from the development.
  • a license model for other markets/devices (headphones, mobile phones, etc.) is also conceivable.
  • hearing aid users find it difficult to separate different sources from each other in a complex auditory situation, for example to focus on a specific speaker.
  • additional systems exist, e.g. transmission of signals from mobile phone systems via Bluetooth, targeted signal transmission in classrooms via an FM system, or inductive hearing systems.
  • the user uses a hearing aid with the additional function for selective hearing.
  • the user turns off the additional function and continues to hear normally with the hearing aid.
  • a hearing device user buys a new hearing device with an integrated additional function for selective hearing.
  • the user sets the selective hearing function on the hearing aid.
  • the user selects a profile (e.g. amplify the loudest/nearest source, or amplify specific recognized voices from the personal environment, as with UC-CE5 at major events).
  • the hearing aid amplifies the respective source(s) according to the set profile and at the same time suppresses background noise if necessary, and the hearing aid user hears individual sources from the complex auditory scene instead of just a "noise mush"/muddle of acoustic sources.
  • the hearing device user buys the additional function for selective listening as software or the like for his own hearing device.
  • the user installs the add-on feature for their hearing aid.
  • the user sets the selective listening function on the hearing aid.
  • the user selects a profile (e.g. amplify the loudest/closest source, or amplify specific recognized voices from their personal environment, as with UC-CE5 at major events), and the hearing aid amplifies the source(s) according to the set profile, while suppressing background noise if necessary.
  • the hearing aid user hears individual sources from the complex auditory scene instead of just a "noise mush"/muddle from acoustic sources.
  • the hearable can provide voice profiles that can be stored.
  • Another use case of another embodiment is professional sports and relates to athletes in competition.
  • sports such as biathlon, triathlon, cycling, marathon, etc.
  • professional athletes rely on information from their coaches or communication with teammates.
  • they want to protect themselves from loud noises (shooting at a biathlon, loud cheering, party horns, etc.).
  • the hearable could be adjusted for the respective sport/athlete to enable a fully automatic selection of relevant sound sources (recognition of specific voices, loudness limitation for typical background noise).
  • the user may be very mobile and the nature of the noise depends on the sport. Due to the intense sporting activity, the athlete is not able to control the device actively or only to a limited extent. However, in most sports there is a fixed procedure (biathlon: running, shooting) and the important discussion partners (coaches, team members) can be defined in advance. Noise is suppressed in general or in certain phases of the sport. Communication between athletes and team members and coaches is always emphasized.
  • the athlete uses a hearable specially adapted to the sport.
  • the Hearable suppresses background noise fully automatically (preset), especially in situations where a high degree of attention is required for the sport in question.
  • the Hearable automatically highlights coaches and team members when they are within hearing range.
  • a further application of a further exemplary embodiment is ear training and relates to music students, professional musicians, amateur musicians.
  • a hearable is used in a targeted manner in order to be able to filter out and follow individual voices.
  • the voices in the background can't be clearly heard because one only hears the voices in the foreground. With the hearable, one could then emphasize a voice of one's choice by selecting its instrument or similar, in order to be able to practise it more specifically.
  • karaoke, for example if there is no karaoke ("singing star") system or similar in the vicinity. The vocal part(s) of a piece of music can then be suppressed at will in order to hear only the instrumental version for karaoke singing.
  • a musician begins to relearn a voice from a piece of music. He listens to the recording of the piece of music on a CD system or another playback medium. When the user is done practicing, they turn the hearable back off. In one example, the user turns on the hearable. He selects the desired musical instrument to be amplified. When listening to the piece of music, the hearable amplifies the voice(s) of the musical instrument and turns down the volume of the other musical instruments, allowing the user to hear their own voice better.
  • the user turns on the hearable. He selects the desired musical instrument to be suppressed. When listening to the piece, the voice(s) of the selected instrument will be muted so that only the remaining voices can be heard. The user can then practice the voice on their own instrument together with the other voices without being distracted by that voice from the recording.
  • the hearable may provide stored musical instrument profiles.
  • Another use case of another embodiment is occupational safety and concerns workers in noisy environments. Workers in noisy environments, for example in machine halls or on construction sites, must protect themselves from noise, but also be able to perceive warning signals and communicate with employees.
  • the user is in a very noisy environment and the target sound sources (warning signals, employees) may be significantly quieter than the interfering signals.
  • the user may be mobile, but the noise interference is mostly stationary.
  • noise is permanently reduced and the hearable automatically highlights a warning signal.
  • Communication with employees is ensured by amplification of speaker sources.
  • the user goes about his work and uses Hearable as hearing protection.
  • Warning signals, e.g. fire alarm.
  • the user goes about his work, for example, and uses Hearable as hearing protection.
  • the communication partner is selected with the help of suitable interfaces (here, for example: gaze control) and highlighted acoustically
  • Another use case of another embodiment is source separation as a software module for live translators and concerns users of a live translator. Live translators translate spoken foreign languages in real time and can benefit from an upstream source separation software module. Especially when multiple speakers are present, the software module can extract the target speaker and potentially improve the translation.
  • the software module is part of a live translator (dedicated device or smartphone app).
  • the user can select the target speaker via the device display. It is advantageous that the translator and the target sound source usually do not move or move very little during the translation. The selected sound source position is acoustically emphasized and thus potentially improves the translation.
  • a user wants to have a conversation in a foreign language or listen to a foreign speaker.
  • the user selects the target speaker through a suitable interface (e.g. GUI on the display) and the software module optimizes the audio recording for further use in the translator.
  • a further application of another exemplary embodiment is occupational safety for emergency services and relates to the fire brigade, THW, possibly the police, rescue services.
  • For emergency services, good communication is essential for successful operation management. It is often not possible for the emergency services to wear hearing protection despite loud ambient noise, since then no communication with each other is possible. For example, firefighters must be able to communicate and understand commands precisely despite loud engine noise, some of which takes place over radios. For this reason, emergency services are exposed to a high level of noise pollution, and the Hearing Protection Ordinance cannot be implemented. On the one hand, a hearable would offer hearing protection for the emergency services and, on the other hand, would continue to enable communication between the emergency services.
  • the user is exposed to high ambient noise and therefore cannot wear hearing protection and still needs to be able to communicate with others. He uses the hearable. After the operation or the dangerous situation is over, the user can put the hearable back down.
  • the user wears the hearable during an operation. He turns on the hearable.
  • the hearable suppresses ambient noise and amplifies the speech of colleagues and other nearby speakers (e.g. fire victims).
  • the user wears the hearable during an operation. He turns on the Hearable, and the Hearable blocks out ambient noise and amplifies co-workers' speech over the radio.
  • the hearable is specially designed to be structurally suitable for use in accordance with the applicable regulations.
  • the hearable may have an interface to a radio device.
  • Although some aspects have been described in the context of a device or a system, it is understood that these aspects also represent a description of the corresponding method, so that a block or a component of a device or a system is also to be understood as a corresponding method step or as a feature of a method step.
  • aspects described in connection with or as a method step also constitute a description of a corresponding block or detail or feature of a corresponding apparatus or system.
  • Some or all of the method steps may be performed by a hardware apparatus (or using a hardware apparatus), such as a microprocessor, a programmable computer or an electronic circuit. In some embodiments, some or several of the most important method steps can be performed by such an apparatus.
  • embodiments of the invention may be implemented in hardware or in software, or at least partially in hardware or at least partially in software.
  • Implementation can be carried out using a digital storage medium, for example a floppy disk, a DVD, a Blu-ray disc, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, a hard disk or another magnetic or optical memory, on which electronically readable control signals are stored which can interact (or do interact) with a programmable computer system in such a way that the respective method is carried out. Therefore, the digital storage medium can be computer-readable.
  • some embodiments according to the invention comprise a data carrier having electronically readable control signals capable of interacting with a programmable computer system in such a way that one of the methods described herein is carried out.
  • embodiments of the present invention can be implemented as a computer program product with a program code, wherein the program code is effective to perform one of the methods when the computer program product runs on a computer.
  • the program code can also be stored on a machine-readable carrier, for example.
  • exemplary embodiments include the computer program for performing one of the methods described herein, the computer program being stored on a machine-readable carrier.
  • an exemplary embodiment of the method according to the invention is therefore a computer program that has a program code for performing one of the methods described herein when the computer program runs on a computer.
  • a further exemplary embodiment of the method according to the invention is therefore a data carrier (or a digital storage medium or a computer-readable medium) on which the computer program for carrying out one of the methods described herein is recorded.
  • the data carrier or digital storage medium or computer-readable medium is typically tangible and/or non-transitory.
  • a further exemplary embodiment of the method according to the invention is therefore a data stream or a sequence of signals which represents the computer program for carrying out one of the methods described herein.
  • the data stream or the sequence of signals can, for example, be configured to be transferred over a data communication link, for example over the Internet.
  • Another embodiment includes a processing device, such as a computer or programmable logic device, configured or adapted to perform any of the methods described herein.
  • Another embodiment includes a computer on which the computer program for performing one of the methods described herein is installed.
  • a further exemplary embodiment according to the invention comprises a device or a system which is designed to transmit a computer program for carrying out at least one of the methods described herein to a recipient.
  • the transmission can take place electronically or optically, for example.
  • the recipient may be a computer, mobile device, storage device, or similar device.
  • the device or the system can, for example, comprise a file server for transmission of the computer program to the recipient.
  • a programmable logic device (e.g., a field programmable gate array, an FPGA) may cooperate with a microprocessor to perform any of the methods described herein.
  • In some embodiments, the methods are performed by some hardware device. This can be universally usable hardware, such as a computer processor (CPU), or hardware specific to the method, such as an ASIC.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)
  • Headphones And Earphones (AREA)
  • Stereophonic Arrangements (AREA)
EP20188945.8A 2020-07-31 2020-07-31 Système et procédé d'égalisation de casque d'écoute et d'adaptation spatiale pour la représentation binaurale en réalité augmentée Withdrawn EP3945729A1 (fr)

Priority Applications (5)

Application Number Priority Date Filing Date Title
EP20188945.8A EP3945729A1 (fr) 2020-07-31 2020-07-31 Système et procédé d'égalisation de casque d'écoute et d'adaptation spatiale pour la représentation binaurale en réalité augmentée
PCT/EP2021/071151 WO2022023417A2 (fr) 2020-07-31 2021-07-28 Système et procédé d'égalisation de casque d'écoute et d'adaptation à la salle pour une restitution binaurale en réalité augmentée
JP2023506248A JP2023536270A (ja) 2020-07-31 2021-07-28 拡張現実におけるバイノーラル再生のためのヘッドホン等化および室内適応のためのシステムおよび方法
EP21751796.0A EP4189974A2 (fr) 2020-07-31 2021-07-28 Système et procédé d'égalisation de casque d'écoute et d'adaptation à la salle pour une restitution binaurale en réalité augmentée
US18/158,724 US20230164509A1 (en) 2020-07-31 2023-01-24 System and method for headphone equalization and room adjustment for binaural playback in augmented reality

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
EP20188945.8A EP3945729A1 (fr) 2020-07-31 2020-07-31 Système et procédé d'égalisation de casque d'écoute et d'adaptation spatiale pour la représentation binaurale en réalité augmentée

Publications (1)

Publication Number Publication Date
EP3945729A1 true EP3945729A1 (fr) 2022-02-02

Family

ID=71899608

Family Applications (2)

Application Number Title Priority Date Filing Date
EP20188945.8A Withdrawn EP3945729A1 (fr) 2020-07-31 2020-07-31 Système et procédé d'égalisation de casque d'écoute et d'adaptation spatiale pour la représentation binaurale en réalité augmentée
EP21751796.0A Pending EP4189974A2 (fr) 2020-07-31 2021-07-28 Système et procédé d'égalisation de casque d'écoute et d'adaptation à la salle pour une restitution binaurale en réalité augmentée

Family Applications After (1)

Application Number Title Priority Date Filing Date
EP21751796.0A Pending EP4189974A2 (fr) 2020-07-31 2021-07-28 Système et procédé d'égalisation de casque d'écoute et d'adaptation à la salle pour une restitution binaurale en réalité augmentée

Country Status (4)

Country Link
US (1) US20230164509A1 (fr)
EP (2) EP3945729A1 (fr)
JP (1) JP2023536270A (fr)
WO (1) WO2022023417A2 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023208333A1 (fr) * 2022-04-27 2023-11-02 Huawei Technologies Co., Ltd. Dispositifs et procédés de rendu audio binauriculaire

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230199420A1 (en) * 2021-12-20 2023-06-22 Sony Interactive Entertainment Inc. Real-world room acoustics, and rendering virtual objects into a room that produce virtual acoustics based on real world objects in the room

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150195641A1 (en) 2014-01-06 2015-07-09 Harman International Industries, Inc. System and method for user controllable auditory environment customization
DE102014210215A1 (de) * 2014-05-28 2015-12-03 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Ermittlung und Nutzung hörraumoptimierter Übertragungsfunktionen
US20190354343A1 (en) * 2016-09-27 2019-11-21 Grabango Co. System and method for differentially locating and modifying audio sources

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150195641A1 (en) 2014-01-06 2015-07-09 Harman International Industries, Inc. System and method for user controllable auditory environment customization
DE102014210215A1 (de) * 2014-05-28 2015-12-03 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Ermittlung und Nutzung hörraumoptimierter Übertragungsfunktionen
US20190354343A1 (en) * 2016-09-27 2019-11-21 Grabango Co. System and method for differentially locating and modifying audio sources

Non-Patent Citations (88)

* Cited by examiner, † Cited by third party
Title
A. AVNIJ. AHRENSM. GEIERCS. SPORSH. WIERSTORFB. RAFAELY: "Spatial perception of sound fields recorded by spherical microphone arrays with varying spatial resolution", JOURNAL OF THE ACOUSTIC SOCIETY OF AMERICA, vol. 133, no. 5, 2013, pages 2711 - 2721, XP012173358, DOI: 10.1121/1.4795780
A. MCPHERSONR. JACKG. MORO: "Action-sound latency: Are our tools fast enough?", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON NEW INTERFACES FOR MUSICAL EXPRESSION, July 2016 (2016-07-01)
A. MESAROST. HEITTOLAT. VIRTANEN: "A multi-device dataset for urban acoustic scene classification", PROCEEDINGS OF THE DETECTION AND CLASSIFICATION OF ACOUSTIC SCENES AND EVENTS WORKSHOP, 2018
B. FRENAYM. VERLEYSEN: "Classification in the presence of label noise: A survey", IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, vol. 25, no. 5, May 2014 (2014-05-01), pages 845 - 869, XP011545535, DOI: 10.1109/TNNLS.2013.2292894
BRANDENBURG, K.CANO CERON, E.KLEIN, F.KÖLLMER, T.LUKASHEVICH, H.NEIDHARDT, A.NOWAK, J.SLOMA, U.WERNER, S.: "Personalized auditory reality", JAHRESTAGUNG FÜR AKUSTIK (DAGA), GARCHING BEI MÜNCHEN, DEUTSCHE GESELLSCHAFT FÜR AKUSTIK (DEGA, vol. 44, 2018
C. H. TAALR. C. HENDRIKSR. HEUSDENSJ. JENSEN: "An algorithm for intelligibility prediction of time-frequency weighted noisy speech", IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, vol. 19, no. 7, September 2011 (2011-09-01), pages 2125 - 2136, XP011335558, DOI: 10.1109/TASL.2011.2114881
C. ROTTONDIC. CHAFEC. ALLOCCHIOA. SARTI: "An overview on networked music performance technologies", IEEE ACCESS, vol. 4, 2016, pages 8823 - 8843
C.-R. NAGARJ. ABESSERS. GROLLMISCH: "Towards CNN-based acoustic modeling of seventh chords for recognition chord recognition", PROCEEDINGS OF THE 16TH SOUND & MUSIC COMPUTING CONFERENCE (SMC) (EINGEREICHT, 2019
CANO ESTEFANIA ET AL: "Selective Hearing: A Machine Listening Perspective", 2019 IEEE 21ST INTERNATIONAL WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING (MMSP), IEEE, 27 September 2019 (2019-09-27), pages 1 - 6, XP033660032, DOI: 10.1109/MMSP.2019.8901720 *
D. FITZGERALDA. LIUTKUSR. BADEAU: "Projection-based demixing of spatial audio", IEEE/ACM TRANS. ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, vol. 24, no. 9, 2016, pages 1560 - 1572
D. MATZE. CANOJ. ABESSER: "Proc. of the 16th International Society for Music Information Retrieval Conference", October 2015, ISMIR, article "New sonorities for early jazz recordings using sound source separation and automatic mixing tools", pages: 749 - 755
D. PAVLIDIA. GRIFFINM. PUIGTA. MOUCHTARIS: "Real-time multiple sound source localization and counting using a circular microphone array", IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, vol. 21, no. 10, October 2013 (2013-10-01), pages 2193 - 2206, XP011521588, DOI: 10.1109/TASL.2013.2272524
D. SNYDERD. GARCIA-ROMEROG. SEILD. POVEYS. KHUDANPUR: "X-vectors: Robust DNN embeddings for speaker recognition", PROC. OF IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP, April 2018 (2018-04-01), pages 5329 - 5333, XP033403941, DOI: 10.1109/ICASSP.2018.8461375
D. YOOKT. LEEY. CHO: "Fast sound source localization using two-level search space clustering", IEEE TRANSACTIONS ON CYBERNETICS, vol. 46, no. 1, January 2016 (2016-01-01), pages 20 - 26, XP011594358, DOI: 10.1109/TCYB.2015.2391252
E. C, CAKIRT. VIRTANEN: "End-to-end polyphonic sound event detection using convolutional recurrent neural networks with learned time-frequency representation input", PROC. OF INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN, July 2018 (2018-07-01), pages 1 - 7
E. CANOD. FITZGERALDA. LIUTKUSM. D. PLUMBLEYF. STÖTER: "Musical source separation: An introduction", IEEE SIGNAL PROCESSING MAGAZINE, vol. 36, no. 1, January 2019 (2019-01-01), pages 31 - 40, XP011694891, DOI: 10.1109/MSP.2018.2874719
E. CANOD. FITZGERALDK. BRANDENBURG: "Evaluation of quality of sound source separation algorithms: Human perception vs quantitative metrics", PROCEEDINGS OF THE 24TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO, 2016, pages 1758 - 1762, XP033011238, DOI: 10.1109/EUSIPCO.2016.7760550
E. CANOG. SCHULLERC. DITTMAR: "Pitch-informed solo and accompaniment separation towards its use in music education applications", EURASIP JOURNAL ON ADVANCES IN SIGNAL PROCESSING, vol. 23, 2014, pages 1 - 19
E. CANOJ. LIEBETRAUD. FITZGERALDK. BRANDENBURG: "The dimensions of perceptual quality of sound source separation", PROC. OF IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP, April 2018 (2018-04-01), pages 601 - 605, XP033401636, DOI: 10.1109/ICASSP.2018.8462325
E. CANOJ. NOWAKS. GROLLMISCH: "Exploring sound source separation for acoustic condition monitoring in industrial scenarios", PROC. OF 25TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO, August 2017 (2017-08-01), pages 2264 - 2268, XP033236389, DOI: 10.23919/EUSIPCO.2017.8081613
E. FONSECAM. PLAKALD. P. W. ELLISF. FONTX. FAVORYX. SERRA: "Learning sound event classifiers from web audio with noisy labels", PROCEEDINGS OF IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP, 2019
F. EYBENF. WENINGERS. SQUARTINIB. SCHULLER: "Real-life voice activity detection with LSTM recurrent neural networks and an application to hollywood movies", PROC. OF IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, May 2013 (2013-05-01), pages 483 - 487, XP032509188, DOI: 10.1109/ICASSP.2013.6637694
F. GRONDINF. MICHAUD: "Lightweight and optimized sound source localization and tracking methods for open and closed microphone array configurations", ROBOTICS AND AUTONOMOUS SYSTEMS, vol. 113, 2019, pages 63 - 80
F. MÜLLERM. KARAU: "Transparant hearing", CHI ,02 EXTENDED ABSTRACTS ON HUMAN FACTORS IN COMPUTING SYSTEMS (CHI EA '02, April 2002 (2002-04-01), pages 730 - 731
F. WENINGERH. ERDOGANS. WATANABEE. VINCENTJ. LE ROUXJ. R. HERSHEYB. SCHULLER: "Latent Variable Analysis and Signal Separation", 2015, SPRINGER INTERNATIONAL PUBLISHING, article "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR", pages: 293 - 305
G. NAITHANIT. BARKERG. PARASCANDOLOL. BRAMSLTWN. H. PONTOPPIDANT. VIRTANEN: "Low latency sound source separation using convolutional recurrent neural networks", PROC. OF IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS (WASPAA, October 2017 (2017-10-01), pages 71 - 75, XP033264904, DOI: 10.1109/WASPAA.2017.8169997
G. PARASCANDOLOH. HUTTUNENT. VIRTANEN: "Recurrent neural networks for polyphonic sound event detection in real life recordings", PROC. OF IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP, March 2016 (2016-03-01), pages 6440 - 6444
G. S. PAPINIR. L. PINTOE. B. MEDEIROSF. B. COELHO: "Hybrid approach to noise control of industrial exhaust systems", APPLIED ACOUSTICS, vol. 125, 2017, pages 102 - 112, XP085026079, DOI: 10.1016/j.apacoust.2017.03.017
H. WIERSTORFD. WARDR. MASONE. M. GRAISC. HUMMERSONEM. D. PLUMBLEY: "Perceptual evaluation of source separation for remixing music", PROC. OF AUDIO ENGINEERING SOCIETY CONVENTION, vol. 143, October 2017 (2017-10-01)
J. ABESSERM. GÖTZES. KÜHNLENZR. GRÄFEC. KÜHNT. CLAUSSH. LUKASHEVICH: "A Distributed Sensor Network for Monitoring Noise Level and Noise Sources in Urban Environments", PROCEEDINGS OF THE 6TH IEEE INTERNATIONAL CONFERENCE ON FUTURE INTERNET OF THINGS AND CLOUD (FICLOUD), BARCELONA, SPAIN, 2018, pages 318 - 324, XP033399745, DOI: 10.1109/FiCloud.2018.00053
J. ABESSERM. MÜLLER: "Fundamental frequency contour classification: A comparison between hand-crafted and CNN-based features", PROCEEDINGS OF THE 44TH IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP, 2019
J. ABESSERS. BALKEM. MÜLLER: "Improving bass saliency estimation using label propagation and transfer learning", PROCEEDINGS OF THE 19TH INTERNATIONAL SOCIETY FOR MUSIC INFORMATION RETRIEVAL CONFERENCE (ISMIR, 2018, pages 306 - 312
J. ABESSERS. LOANNIS MIMILAKISR. GRÄFEH. LUKASHEVICH: "Acoustic scene classification by combining autoencoder-based dimensionality reduction and convolutional neural net-works", PROCEEDINGS OF THE 2ND DCASE WORKSHOP ON DETECTION AND CLASSIFICATION OF ACOUSTIC SCENES AND EVENTS, 2017
J. CHUAG. WANGW. B. KLEIJN: "Convolutive blind source separation with low latency", PROC. OF IEEE INTERNATIONAL WORKSHOP ON ACOUSTIC SIGNAL ENHANCEMENT (IWAENC, September 2016 (2016-09-01), pages 1 - 5, XP032983095, DOI: 10.1109/IWAENC.2016.7602895
J. F. GEMMEKED. P. W. ELLISD. FREEDMANA. JANSENW. LAWRENCER. C. MOOREM. PLAKALM. RITTER: "Audio Set: An ontology and human-Iabeled dataset for audio events", PROCEEDINGS OF THE IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP, 2017

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023208333A1 (fr) * 2022-04-27 2023-11-02 Huawei Technologies Co., Ltd. Devices and methods for binaural audio rendering

Also Published As

Publication number Publication date
JP2023536270A (ja) 2023-08-24
WO2022023417A3 (fr) 2022-03-24
US20230164509A1 (en) 2023-05-25
EP4189974A2 (fr) 2023-06-07
WO2022023417A2 (fr) 2022-02-03

Similar Documents

Publication Publication Date Title
EP4011099A1 System and method for assisting selective hearing
Gabbay et al. Visual speech enhancement
Wang Time-frequency masking for speech separation and its potential for hearing aid design
Arons A review of the cocktail party effect
Darwin Listening to speech in the presence of other sounds
CN110517705B Binaural sound source localization method and system based on deep neural networks and convolutional neural networks
US10825353B2 (en) Device for enhancement of language processing in autism spectrum disorders through modifying the auditory stream including an acoustic stimulus to reduce an acoustic detail characteristic while preserving a lexicality of the acoustics stimulus
CN112352441B Enhanced environmental awareness system
Marxer et al. The impact of the Lombard effect on audio and visual speech recognition systems
Best et al. Spatial unmasking of birdsong in human listeners: Energetic and informational factors
US20230164509A1 (en) System and method for headphone equalization and room adjustment for binaural playback in augmented reality
EP2405673B1 Method for localizing an audio source and multi-channel hearing system
CN103325383A Audio processing method and audio processing device
Hummersone A psychoacoustic engineering approach to machine sound source separation in reverberant environments
Kohlrausch et al. An introduction to binaural processing
JP2021511755A Voice recognition audio system and method
EP3216235B1 (fr) Appareil et procédé de traitement de signal audio
Gabbay et al. Seeing through noise: Speaker separation and enhancement using visually-derived speech
Keshavarzi et al. Use of a deep recurrent neural network to reduce wind noise: Effects on judged speech intelligibility and sound quality
Josupeit et al. Modeling speech localization, talker identification, and word recognition in a multi-talker setting
Schoenmaker et al. The multiple contributions of interaural differences to improved speech intelligibility in multitalker scenarios
Abel et al. Novel two-stage audiovisual speech filtering in noisy environments
Luo et al. Audio-visual speech separation using i-vectors
CN113347551B Method and apparatus for processing a monaural audio signal, and readable storage medium
Gul et al. Preserving the beamforming effect for spatial cue-based pseudo-binaural dereverberation of a single source

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

RAP3 Party data changed (applicant data changed or rights of an application transferred)

Owner name: FRAUNHOFER-GESELLSCHAFT ZUR FOERDERUNG DER ANGEWANDTEN FORSCHUNG E.V.

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20220803