WO2021002862A1 - Acoustic echo cancellation - Google Patents

Acoustic echo cancellation

Info

Publication number
WO2021002862A1
WO2021002862A1 (PCT/US2019/040535)
Authority
WO
WIPO (PCT)
Prior art keywords
acoustic echo
room
person
echo cancellation
audio signal
Prior art date
Application number
PCT/US2019/040535
Other languages
French (fr)
Inventor
Srikanth KUTHURU
Sunil Bharitkar
Madhu Sudan ATHREYA
Original Assignee
Hewlett-Packard Development Company, L.P.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett-Packard Development Company, L.P. filed Critical Hewlett-Packard Development Company, L.P.
Priority to PCT/US2019/040535, published as WO2021002862A1
Priority to CN201980098110.7A, published as CN114008999A
Priority to US17/419,460, published as US11937076B2
Priority to EP19935921.7A, published as EP3994874A4
Publication of WO2021002862A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M9/00: Arrangements for interconnection not involving centralised switching
    • H04M9/08: Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
    • H04M9/082: Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic using echo cancellers
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04S: STEREOPHONIC SYSTEMS
    • H04S7/00: Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30: Control circuits for electronic adaptation of the sound field
    • H04S7/307: Frequency adjustment, e.g. tone control
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M9/00: Arrangements for interconnection not involving centralised switching
    • H04M9/08: Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/14: Systems for two-way working
    • H04N7/15: Conference systems
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00: Circuits for transducers, loudspeakers or microphones
    • H04R3/04: Circuits for transducers, loudspeakers or microphones for correcting frequency response
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R5/00: Stereophonic arrangements
    • H04R5/04: Circuit arrangements, e.g. for selective connection of amplifier inputs/outputs to loudspeakers, for loudspeaker detection, or for adaptation of settings to personal preferences or hearing impairments
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04S: STEREOPHONIC SYSTEMS
    • H04S7/00: Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30: Control circuits for electronic adaptation of the sound field
    • H04S7/302: Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S7/303: Tracking of listener position or orientation
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L2021/02082: Noise filtering the noise being echo, reverberation of the speech
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161: Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166: Microphone arrays; Beamforming

Definitions

  • Video conferencing systems can be used for communication between parties in different locations.
  • a video conferencing system at a near-end can capture audio-video information at the near-end and transmit the audio-video information to a far-end.
  • a video conferencing system at the far-end can capture audio-visual information at the far-end and transmit the audio-visual information to the near-end.
  • FIG. 1 illustrates an example of a video conference system in a near-end room that includes a plurality of persons in accordance with the present disclosure
  • FIG. 2 illustrates an example of a technique for performing acoustic echo cancellation for an audio signal in accordance with the present disclosure
  • FIG. 3 illustrates an example of a video conferencing system for performing acoustic echo cancellation in accordance with the present disclosure
  • FIG. 4 is a flowchart illustrating an example method of performing acoustic echo cancellation in a video conference system in accordance with the present disclosure
  • FIG. 5 is a flowchart illustrating another example method of performing acoustic echo cancellation in a video conference system in accordance with the present disclosure
  • FIG. 6 is a block diagram that provides an example illustration of a computing device that can be employed in the present disclosure.
  • the present disclosure describes a machine readable storage medium as well as a method and a system for acoustic echo cancellation, such as in a video conference system.
  • An example of the present disclosure can include a machine readable storage medium comprising instructions that, when executed by a processor, cause the processor to determine a location of a person in a room. The instructions, when executed by the processor, can cause the processor to capture an audio signal received from the location of the person using beamforming. The instructions, when executed by the processor, can cause the processor to determine an acoustic echo cancellation parameter based in part on the audio signal captured from the location of the person.
  • the instructions, when executed by the processor, can cause the processor to perform acoustic echo cancellation on the audio signal using the acoustic echo cancellation parameter.
  • the instructions cause the processor to transmit the audio signal having the canceled acoustic echo to a far-end system.
  • the acoustic echo cancellation parameter includes a room impulse response.
  • for example, an output of a beamformer that performs beamforming to capture the audio signal can be an input to an echo canceller that performs the acoustic echo cancellation on the audio signal. Beamforming can be performed with a microphone array using a fixed delay-sum beamformer and a set of beamforming parameters.
  • the instructions can cause the processor to determine the location of the person in the room using camera information, pressure sensor information, signal power information, or a combination thereof.
  • the instructions can cause the processor to perform the acoustic echo cancellation on a number of channels that are outputted from a beamformer, wherein the number of channels corresponds to a number of persons detected in the room.
  • the instructions can cause the processor to determine to update the acoustic echo cancellation parameter when the location of the person in the room changes, as well as determine to not update the acoustic echo cancellation parameter when the location of the person in the room does not change.
  • Another example of the present disclosure can include a method for acoustic echo cancellation.
  • the method can include determining a location of a person in a room based in part on camera information.
  • the method can include capturing an audio signal received from the location of the person using a beamformer.
  • the method can include determining a room impulse response based in part on the audio signal captured from the location of the person.
  • the method can include providing an output of the beamformer as an input to an echo canceler that performs acoustic echo cancellation on the audio signal received from the location of the person based in part on the room impulse response.
  • the method can include transmitting the audio signal having the canceled acoustic echo.
  • the acoustic echo cancellation can be performed on a number of channels that are outputted from the beamformer, wherein the number of channels corresponds to a number of persons detected in the room based in part on the camera information.
  • beamforming can be performed with a microphone array using the beamformer and a set of beamforming parameters.
  • the system can include a camera to capture camera information for a room.
  • the system can include a microphone array to capture an audio signal received from a location of a person in the room.
  • the system can include a processor.
  • the processor can determine the location of the person in the room based in part on the camera information.
  • the processor can perform beamforming to capture the audio signal received from the location of the person using the microphone array.
  • the processor can determine an acoustic echo cancellation parameter based in part on the audio signal captured from the location of the person.
  • the processor can perform acoustic echo cancellation on the audio signal using the acoustic echo cancellation parameter.
  • the processor can transmit the audio signal having the canceled acoustic echo.
  • the processor can perform the acoustic echo cancellation on a number of channels that are outputted from a beamformer that is used to perform the beamforming, wherein the number of channels corresponds to a number of persons detected in the room based in part on the camera information.
  • the camera can be a stereo camera, a structured light sensor camera, a time-of-flight camera, or a combination thereof.
  • the system can be a video conferencing system.
  • FIG. 1 illustrates an example of a video conference system 100 in a near-end room 120 that includes a plurality of persons 110.
  • the video conferencing system 100 can include a camera 102 to capture camera information for the near-end room 120.
  • the camera 102 can capture video of the persons 110 in the near-end room 120.
  • the video captured in the near-end room 120 can be converted to a video signal, and the video signal can be transmitted to a far-end room 150.
  • the video conference system 100 can include a speaker (or loudspeaker) 104.
  • the speaker 104 can receive an audio signal from the far-end room 150 and produce a sound based on the audio signal.
  • the video conference system 100 can include a microphone 106 to capture audio in the near-end room 120.
  • the microphone 106 can capture audio spoken by a person 110 in the near-end room 120.
  • the audio captured in the near-end room 120 can be converted to an audio signal, and the audio signal can be transmitted to the far-end room 150.
  • the video conference system 100 can include a display 108 to display a video signal received from the far-end room 150.
  • the far-end room 150 can include a video conferencing system 130.
  • the video conferencing system 130 can include a camera 132 to capture camera information for the far-end room 150.
  • the camera 132 can capture video of the persons 140 in the far-end room 150.
  • the video captured in the far-end room 150 can be converted to a video signal, and the video signal can be transmitted to the near-end room 120.
  • the video conferencing system 130 can include a speaker 134, which can receive the audio signal from the near-end room 120 and produce a sound based on the audio signal.
  • the video conferencing system 130 can include a microphone 136 to capture audio in the far-end room 150.
  • the microphone 136 can capture audio spoken by a person 140 in the far-end room 150.
  • the audio captured in the far-end room 150 can be converted to an audio signal, and the audio signal can be transmitted to the near-end room 120.
  • the video conferencing system 130 can include a display 138 to display the video signal received from the near-end room 120.
  • the video conference system 100 in the near-end room 120 and the video conference system 130 in the far-end room 150 can enable the persons 110 in the near-end room 120 to communicate with the persons 140 in the far-end room 150.
  • the persons 110 in the near-end room 120 may be able to see and hear the persons 140 in the far-end room 150, based on audio-video information that is communicated between the video conference system 100 in the near-end room 120 and the video conference system 130 in the far-end room 150.
  • the near-end room 120 can include four persons and the far-end room 150 can include two persons, but other numbers of persons can be present in the near-end room 120 and the far-end room 150.
  • the microphone 106 that captures the audio spoken by the person 110 in the near-end room 120 can be a microphone array.
  • the microphone array can include a plurality of microphones placed at different spatial locations.
  • the microphone array can capture the audio spoken by the person 110 in the near-end room 120 using beamforming.
  • the different spatial locations of the microphones in the microphone array that capture the audio spoken by the person 110 can produce beamforming parameters.
  • a signal strength of signals emanating from particular directions in the near-end room 120, such as a location of the person 110 in the near-end room 120, can be increased based on the beamforming parameters.
  • a signal strength of signals (e.g., due to noise) emanating from other directions in the near-end room 120, such as a location that is different than the location of the person 110 in the near-end room 120, can be combined in a destructive manner based on the beamforming parameters, resulting in degradation of the signals to/from the location that is different than the location of the person 110 in the near-end room 120.
  • the microphone array can provide an ability to augment signals emanating from a particular direction in the near-end room 120 based on knowledge of the particular direction.
  • beamforming techniques using a microphone array can adaptively track active persons and listen to sound in direction(s) of the active persons, and suppress sound (or noise) coming from other directions.
  • Beamforming using a microphone array can augment a sound quality of received speech by increasing a gain of an audio signal in the active person’s direction and reducing a number of far-end speaker echoes received at microphone(s) of the microphone array.
  • by adjusting a gain and a phase delay for a given microphone output in the microphone array, a sound signal from a specific direction can be amplified by constructive interference.
  • the gain(s) and phase delay(s) for microphone(s) in the microphone array can be considered to be the beamforming parameters. Further, since the gain and the phase delay for the given microphone output can vary based on the location of the person 110, the beamforming parameters can also depend on the location of the person 110.
  • beamforming techniques using a microphone array can be classified as data-independent or fixed, or data-dependent or adaptive.
  • beamforming parameters can be fixed during operation.
  • beamforming parameters can be continuously updated based on received signals.
  • Examples of fixed beamforming techniques can include delay-sum beamforming, sub-array delay sum beamforming, super-directivity beamforming or near-field super-directivity beamforming (a delay-sum sketch follows below).
  • adaptive beamforming techniques can include generalized side-lobe canceler beamforming, adaptive microphone-array system for noise reduction (AMNOR) beamforming or post-filtering beamforming.
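As a rough illustration of the fixed delay-sum technique named above, the sketch below steers a microphone array toward a given direction by delaying and averaging the microphone channels. It is a minimal sketch, not the patent's implementation: the function name, the far-field plane-wave assumption, and the frequency-domain delay are all choices made here for brevity.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room temperature

def delay_sum_beamform(mic_signals, mic_positions, direction, fs):
    """Fixed delay-sum beamformer (hypothetical helper, not from the patent).

    mic_signals:   (num_mics, num_samples) array of microphone captures
    mic_positions: (num_mics, 3) microphone coordinates in meters
    direction:     unit vector pointing from the array toward the person
    fs:            sampling rate in Hz
    """
    num_mics, num_samples = mic_signals.shape
    # Under a far-field plane-wave assumption, a microphone at position p
    # hears the wavefront (p . direction) / c seconds earlier than the origin.
    delays = -(mic_positions @ direction) / SPEED_OF_SOUND
    delays -= delays.min()  # express as non-negative relative delays

    # Advance each channel by its delay via a linear phase shift (circular
    # at the block edges, which a sketch can tolerate), then average.
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    spectra = np.fft.rfft(mic_signals, axis=1)
    aligned = spectra * np.exp(2j * np.pi * freqs * delays[:, None])
    return np.fft.irfft(aligned.mean(axis=0), n=num_samples)
```

The per-channel gains are fixed at 1/N here; those gains and the phase delays are the "beamforming parameters" the surrounding passage refers to.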
  • the audio captured using the microphone 106 of the video conferencing system 100 in the near-end room 120 can be transmitted as the audio signal to the video conferencing system 130 in the far-end room 150.
  • the audio signal can be used to produce the sound at the speaker 134 of the video conferencing system 130 in the far-end room 150. That sound can bounce around the far-end room 150 for a fraction of a second and can be detected by the microphone 136 of the video conferencing system 130 in the far-end room 150, and then the sound can be sent back to the video conference system 100 in the near-end room 120.
  • the sound that bounces around the far-end room 150 can create a distracting and undesired echo that is heard in the near-end room 120.
  • the person 110 in the near-end room 120 can speak, and when this sound bounces around the far-end room 150, the person 110 may hear an echo of their own voice.
  • acoustic echo cancellation can be used to cancel or reduce acoustic echo in the audio signal being transmitted from the video conferencing system 100 in the near-end room 120 to the video conferencing system 130 in the far-end room 150.
  • the audio signal transmitted from the video conferencing system 100 in the near-end room 120 can include a near-end speech signal and a far-end echoed speech signal.
  • the near-end speech signal can derive from the audio signal that is captured at the near-end room 120 with the microphone array using beamforming, and the far-end echoed speech signal can derive from the audio signal that is received from the far-end room 150.
  • the acoustic echo cancellation can be applied on both the near-end speech signal and the far-end echoed speech signal, such that the far-end echoed speech signal is removed from the audio signal.
  • An audio signal that comprises the near-end speech signal (i.e., an audio signal in which the acoustic echo has been cancelled or reduced) can be transmitted to the far-end room 150.
  • FIG. 2 illustrates an example of a technique for performing acoustic echo cancellation for an audio signal in accordance with the present disclosure.
  • the acoustic echo cancellation can be performed using a computing device 216 in a near-end room 220.
  • the computing device 216 can be part of a video conferencing system that captures audio-video at the near-end room and transmits the audio-video to a far-end room 230.
  • the computing device 216 may include, or be coupled to, a speaker 204 (or loudspeaker), a camera 206 such as a stereo camera, a structured light sensor camera or a time-of-flight camera, and a microphone array 212.
  • the speaker 204, the camera 206 and the microphone array 212 can be integrated with the computing device 216, or can be separate units that are coupled to the computing device 216.
  • the camera 206 can capture camera information for the near-end room 220.
  • the camera information can be digital images and/or digital video of the near-end room 220.
  • the camera information can be provided to a person detector and tracker unit 208 that operates on the computing device 216.
  • the person detector and tracker unit 208 can analyze the camera information using object detection, which can include facial detection. Based on the camera information, the person detector and tracker unit 208 can determine a number of persons in the near-end room 220, as well as a location of a person in the near-end room 220.
  • the person(s) that are detected in the near-end room 220 based on the camera information can include a person that is currently speaking or a person that is not currently speaking (e.g., a person in the near-end room 220 that is listening to another person who is speaking).
  • the location of the person can be a relative location with respect to the number of persons in the near-end room 220.
  • the relative location of the person can imply a relative position of the person or persons with respect to the microphones in the microphone array 212.
  • the relative location can be determined based upon determining a camera position relative to the microphones in the microphone array 212.
  • the camera position relative to the microphones in the microphone array 212 can be determined manually or using object detection.
  • the camera position can be determined once or periodically, as the camera 206 and the microphones in the microphone array 212 can be stationary or semi-stationary.
  • the person detector and tracker unit 208 can detect that there are four persons in the near-end room 220. Further, based on the camera information, the person detector and tracker unit 208 can determine that a first person is at a first location in the near-end room 220, a second person is at a second location in the near-end room 220, a third person is at a third location in the near-end room 220, and a fourth person is at a fourth location in the near-end room 220.
  • the person detector and tracker unit 208 can track persons in the near-end room 220 over a period of time.
  • the person detector and tracker unit 208 can run when a level of variation in incoming video frames is above a defined threshold.
  • the person detector and tracker unit 208 can run during a beginning of a videoconference call when persons enter the near-end room 220 and settle down in the near-end room 220, and the person detector and tracker unit 208 can run at a reduced mode when persons are less likely to move in the near-end room 220 and therefore maintain a direction with respect to the microphone array 212.
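To make the person detector and tracker unit concrete, here is one hedged way such a component could be built from the camera information. The Haar-cascade face detector and the field-of-view mapping are illustrative choices, not the patent's method, and the function name is hypothetical.

```python
import cv2  # OpenCV; uses the frontal-face Haar cascade bundled with it

def detect_person_directions(frame, horizontal_fov_deg=90.0):
    """Return an approximate azimuth in degrees for each face in the frame.

    Hypothetical stand-in for the person detector and tracker unit 208:
    the horizontal pixel offset of each detected face is mapped linearly
    onto the camera's (assumed) horizontal field of view.
    """
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    half_width = frame.shape[1] / 2.0
    directions = []
    for (x, y, w, h) in faces:
        offset = (x + w / 2.0 - half_width) / half_width  # -1 (left) .. 1 (right)
        directions.append(offset * horizontal_fov_deg / 2.0)
    return directions
```

A frame-difference check against a defined threshold could gate calls to this function, matching the reduced-mode behavior described above.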
  • the person detector and tracker unit 208 can provide person location information to a beamformer 210 that operates on the computing device 216.
  • the person location information can indicate the location of the person in the near-end room 220.
  • the beamformer 210 can be a fixed beamformer (e.g. , a beamformer that performs delay-sum beamforming) or an adaptive beamformer.
  • the beamformer 210 can be coupled to the microphone array 212.
  • the beamformer 210 and the microphone array 212 can work together to perform beamforming.
  • the beamformer 210 and the microphone array 212 can capture an audio signal received from the location of the person in the near-end room 220.
  • the audio signal can be captured using beamforming parameters, where the beamforming parameters can be set based on the location of the person in the near-end room.
  • the beamformer 210 can provide the audio signal received from the location of the person in the near-end room 220 using the beamforming parameters to a multi-direction acoustic echo canceler 214.
  • an output of the beamformer 210 can be an input to the acoustic echo canceler 214.
  • the acoustic echo canceler 214 can operate on the computing device 216.
  • the acoustic echo canceler 214 can also receive a far-end signal 202 from the far-end room 230.
  • the far-end signal 202 can be provided to the speaker 204 in the near-end room 220 and cause an acoustic echo in the near-end room 220, which can be detected by the microphone array 212.
  • the acoustic echo canceler 214 can determine an acoustic echo cancellation parameter based on the beamforming parameters associated with the audio signal received from the location of the person in the near-end room 220 using the beamformer 210.
  • One example of the acoustic echo cancellation parameter can be a room impulse response.
  • the room impulse response can correspond to the beamforming parameters associated with the audio signal received from the location of the person in the near-end room 220 using the beamformer 210, as well as the acoustic echo caused by the far-end signal 202.
  • the acoustic echo canceler 214 can model the room impulse response using a finite impulse response (FIR) filter. More specifically, the acoustic echo canceler 214 can model the room impulse response using the FIR filter based on a speaker signal from the speaker 204 and a microphone signal from the microphone array 212. Depending on the speaker signal and the microphone signal, the room impulse response can be estimated using the FIR filter. Thus, FIR parameters can correspond with the acoustic echo cancellation parameters.
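The passage above says the room impulse response is modelled with an FIR filter driven by the speaker and microphone signals, but it does not name an adaptation rule. The sketch below uses normalized least mean squares (NLMS), a common choice for this kind of adaptive FIR estimation; the function and parameter names are assumptions made here, not the patent's.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, filter_len=1024, mu=0.5, eps=1e-8):
    """Adaptive FIR echo canceller using NLMS (illustrative, not the patent's).

    far_end: loudspeaker reference samples (the far-end signal 202);
             assumed to be the same length as mic
    mic:     beamformer-output samples containing near-end speech plus echo
    Returns the echo-reduced signal and the learned FIR coefficients,
    which play the role of the acoustic echo cancellation parameters.
    """
    w = np.zeros(filter_len)      # FIR estimate of the room impulse response
    x_buf = np.zeros(filter_len)  # most recent far-end samples, newest first
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = far_end[n]
        echo_hat = w @ x_buf                 # predicted echo at the microphone
        e = mic[n] - echo_hat                # near-end speech + residual echo
        out[n] = e
        w += mu * e * x_buf / (x_buf @ x_buf + eps)  # NLMS coefficient update
    return out, w
```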
  • the acoustic echo cancellation parameter can be applied to the audio signal received from the location of the person in the near-end room 220, thereby producing an audio signal with a cancelled (or reduced) acoustic echo.
  • the acoustic echo cancellation parameter can be applied to cancel or reduce the acoustic echo caused by the far-end signal 202 that is detected at the microphone array 212, which can produce a resulting audio signal that is not affected by the acoustic echo caused by the far-end signal 202.
  • the resulting audio signal can be a near-end signal 218 that is transmitted to the far-end room 230. Since the acoustic echo cancellation has been applied to the near-end signal 218 to remove or reduce the acoustic echo, the near-end signal 218 can be of increased sound quality.
  • the beamformer 210 can operate with N beams or N channels, wherein N is a positive integer.
  • One channel or one beam can correspond with a person detected using the person detector and tracker unit 208.
  • the acoustic echo cancellation can be performed with respect to the N beams or the N channels.
  • the person detector and tracker unit 208 can detect three persons in the near-end room 220.
  • the beamformer 210 can receive an audio signal from a first person in the near-end room 220 using a first beam or channel, an audio signal from a second person in the near-end room 220 using a second beam or channel, and an audio signal from a third person in the near-end room 220 using a third beam or channel.
  • a first acoustic echo canceler can perform acoustic echo cancellation on the first beam or channel
  • a second acoustic echo canceler can perform acoustic echo cancellation on the second beam or channel
  • a third acoustic echo canceler can perform acoustic echo cancellation on the third beam or channel.
  • a number of acoustic echo cancellers could correspond to a number of channels of a microphone array, even when a number of persons in the room were less than the number of channels in the microphone array.
  • channel wise echo cancellation could be performed, where one microphone signal would correspond to one channel.
  • This solution would become more computationally intensive as the number of microphones in the microphone array increased. For example, a 16-microphone array with four persons in the room would result in 16 acoustic echo cancellers being used to perform acoustic echo cancellation. As a result, an increased number of computations would be performed when the number of persons in the room was less than the number of microphones in the microphone array.
  • beamforming would be performed after the acoustic echo cancellation to capture audio from a defined location in the room.
  • 16 acoustic echo cancellers would be used to perform acoustic echo cancellation for a 16-microphone array with four persons in the room, and then beamforming would be performed for the four persons in the room.
  • the camera information can be used to determine a number of persons in a room, and a number of beams or channels used by a beamformer can correspond to the number of persons in the room. Further, the number of echo cancelers used to perform acoustic echo cancellation can correspond to the number of beams or channels used by the beamformer. Thus, in the present disclosure, the acoustic echo cancellation can be performed after the beamforming.
  • an increased number of microphones can be used in the microphone array while maintaining increased computational efficiency, even when a reduced number of persons are in the room.
  • microphones in the microphone array can provide increased directivity and increased gain or signal-to-noise ratio (SNR) in a direction of interest.
  • the present disclosure provides an acoustic echo cancellation setup with reduced complexity while maintaining an increased number of microphones in a microphone array.
  • a 16-microphone array with four persons can result in four beams or channels, and can result in four acoustic echo cancellers being used to perform acoustic echo cancellation.
  • computational efficiency can be increased because the acoustic echo cancellation can be performed based on the number of persons in the room (and the corresponding number of beams or channels), and not based on a number of channels in the microphone array.
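Composing the two hypothetical helpers sketched above shows the structural point of this passage: the canceller count follows the person count, not the microphone count.

```python
def cancel_per_person(mic_signals, mic_positions, person_directions,
                      far_end, fs):
    """One beam and one echo canceller per detected person (illustrative).

    person_directions holds one unit vector per detected person; with a
    16-microphone array and four persons this runs four cancellers rather
    than sixteen. Reuses the delay_sum_beamform and nlms_echo_canceller
    sketches defined earlier.
    """
    beams = [delay_sum_beamform(mic_signals, mic_positions, d, fs)
             for d in person_directions]
    return [nlms_echo_canceller(far_end, beam)[0] for beam in beams]
```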
  • FIG. 3 illustrates an example of a video conferencing system 300 for performing acoustic echo cancellation.
  • the video conferencing system 300 can be a near-end video conferencing system or a far-end video conferencing system.
  • the video conferencing system 300 can include a camera 310 such as a stereo camera, a structured light sensor camera or a time-of-flight camera, a microphone array 320, pressure sensor(s) 330, a speaker 335 (or loudspeaker), and a processor 340 that performs the acoustic echo cancellation on an audio signal 322.
  • the processor 340 can be a digital signal processor (DSP).
  • the camera 310 can capture camera information 312 for a room.
  • the camera information 312 can include video information of the room, which can include a plurality of video frames.
  • the camera 310 can operate continuously or intermittently to capture the camera information 312 for the room.
  • the camera 310 can operate continuously during the videoconference session, or can operate intermittently during the videoconferencing session (e.g., at a beginning of the videoconferencing session and at defined periods during the videoconferencing session).
  • the microphone array 320 can capture the audio signal 322 received from a location of a person in the room.
  • the microphone array 320 can include a plurality of microphones at different spatial locations.
  • the microphones in the microphone array 320 can be omnidirectional microphones, directional microphones, or a combination thereof.
  • the speaker 335 can produce a sound, which can be detected by the microphone array 320.
  • the sound can correspond to an audio signal received at the video conferencing system 300 from a far-end.
  • the processor 340 can include a person location determination module 342.
  • the person location determination module 342 can determine the location of the person in the room based on the camera information 312.
  • the person location determination module 342 can analyze the camera information 312 using object detection, facial recognition, or like techniques to determine a number of persons in the room and a location of a person in the number of persons in the room.
  • the location of the person can be a relative location with respect to locations of other persons in the room.
  • the person location determination module 342 can determine the location of the person in the room using pressure sensor information from the pressure sensor(s) 330.
  • the pressure sensor(s) 330 can be installed on chairs or seats in the room, and can be used to detect the presence of persons in the room. For example, a pressure sensor 330 installed on a certain chair can detect whether a person is sitting on that chair based on pressure sensor information produced by the pressure sensor 330.
  • the pressure sensor(s) 330 can send the pressure sensor information, which can enable the person location determination module 342 to determine the number of persons in the room.
  • the person location determination module 342 can determine the location of the person in the room using signal power information, as determined at the microphone array 320.
  • the signal power information can indicate a signal power associated with the audio signal 322 detected using the microphone array 320.
  • the signal power associated with the audio signal 322 can be used to determine a distance and/or location of the person in the room in relation to the microphone array 320.
  • the signal power information can be provided to enable the person location determination module 342 to determine the location of the person in the room.
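The patent says signal power can help locate a person but does not say how. One hedged reading is a steered-response-power search, sketched below using the earlier hypothetical delay-sum helper; the candidate grid and horizontal-plane restriction are choices made here.

```python
import numpy as np

def estimate_direction_by_power(mic_signals, mic_positions, fs,
                                num_candidates=36):
    """Return the azimuth unit vector whose steered beam is loudest.

    Illustrative only: scans candidate azimuths in the horizontal plane
    and keeps the direction with the highest beam output power.
    """
    best_dir, best_power = None, -np.inf
    for az in np.linspace(0.0, 2.0 * np.pi, num_candidates, endpoint=False):
        direction = np.array([np.cos(az), np.sin(az), 0.0])
        beam = delay_sum_beamform(mic_signals, mic_positions, direction, fs)
        power = float(np.mean(beam ** 2))
        if power > best_power:
            best_dir, best_power = direction, power
    return best_dir
```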
  • the processor can include a beamforming module 344.
  • the beamforming module 344 can perform beamforming to capture the audio signal 322 received from the location of the person using the microphone array 320.
  • the beamforming module 344 can use a fixed beamforming technique, such as delay-sum beamforming, sub-array delay sum beamforming, super-directivity beamforming or near-field super-directivity beamforming.
  • the beamforming module 344 can use an adaptive beamforming technique, such as generalized side-lobe canceler beamforming, AMNOR beamforming or post-filtering beamforming.
  • the beamforming module 344 can capture the audio signal 322 received from the location of the person using beamforming parameters 346, where the beamforming parameters 346 can be based on the location of the person in the room.
  • the location of the person in the room can be determined using the camera information 312, and that location can be used to set or adjust the beamforming parameters 346.
  • the audio signal can be captured from the location of the person.
  • the processor 340 can include an acoustic echo cancellation module 348.
  • the acoustic echo cancellation module 348 can determine an acoustic echo cancellation parameter 350 based on the audio signal 322 captured from the location of the person. More specifically, the acoustic echo cancellation module 348 can determine the acoustic echo cancellation parameter 350 based on the beamforming parameters 346, which can be set based on the detected location of the person in the room. Thus, the acoustic echo cancellation module 348 can receive the audio signal 322 from the beamforming module 344. In this case, an output of the beamforming module 344 can be an input to the acoustic echo cancellation module 348.
  • the acoustic echo cancellation parameter 350 can be a room impulse response.
  • the room impulse response can correspond to the beamforming parameters 346 associated with the audio signal 322 received from the location of the person in the room, as well as an acoustic echo detected by the microphone array 320.
  • the acoustic echo can result from sound produced by the speaker 335, as detected by the microphone array 320.
  • the sound can be associated with the audio signal received at the video conferencing system 300 from the far-end.
  • the room impulse response can be specific to one microphone in the microphone array 320. In other words, one microphone in the microphone array 320 can be associated with one room impulse response, while another microphone in the microphone array 320 can be associated with another room impulse response.
  • the room impulse response can be modelled using a FIR filter. More specifically, the room impulse response can be modelled using the FIR filter based on a speaker signal from the speaker 335 and the audio signal 322 detected at the microphone array 320. Depending on the speaker signal and the audio signal 322, the room impulse response can be estimated using the FIR filter.
  • FIR parameters can correspond with the acoustic echo cancellation parameter 350.
  • the acoustic echo cancellation module 348 can perform acoustic echo cancellation on the audio signal 322 using the acoustic echo cancellation parameter 350, such as the room impulse response.
  • the acoustic echo cancellation module 348 can apply the acoustic echo cancellation parameter to cancel or reduce an acoustic echo in the audio signal 322.
  • the acoustic echo cancellation module 348 can converge to an acoustic echo cancellation solution in a reduced amount of time when the room impulse response is relatively sparse, as compared to when the room impulse response is relatively dense.
  • echoes can be formed when sound from the speaker 335 is produced, reflects through the room and then reaches the microphone array 320.
  • the microphone array 320 may be able to receive sound from multiple directions. By using the beamforming, sound from a particular direction in the room can be captured. A number of reflected sounds coming from this particular direction can be reduced, in which case the room impulse response can be relatively sparse.
  • the acoustic echo cancellation module 348 can learn reduced reflections due to the sparse room impulse response, so the acoustic echo cancellation module 348 can converge to the acoustic echo cancellation solution in a reduced amount of time.
  • the processor 340 can include an audio signal transmission module 352.
  • the audio signal transmission module 352 can receive the audio signal 322 having the cancelled acoustic echo from the acoustic echo cancellation module 348.
  • the audio signal transmission module 352 can transmit the audio signal having the cancelled acoustic echo to, for example, a remote video conferencing system.
  • the beamforming module 344 can operate with N beams or N channels, wherein N is a positive integer.
  • One channel or one beam can correspond with a person detected in the room.
  • the acoustic echo cancellation module 348 can perform acoustic echo cancellation with the N beams or the N channels that are outputted from the beamforming module 344.
  • the N beams or the N channels can correspond to a number of persons detected in the room.
  • the acoustic echo cancellation module 348 can operate parallel acoustic echo canceler(s) equal in number to the persons detected in the room, which can result in increased computational efficiency.
  • the acoustic echo cancellation module 348 can determine the acoustic echo cancellation parameter 350 based on the beamforming parameters 346, which can be set based on the detected location of the person in the room. In one example, the acoustic echo cancellation module 348 can update the acoustic echo cancellation parameter 350 when the location of the person in the room changes. In other words, the changed location of the person in the room can change the beamforming parameters 346, which in turn can cause the acoustic echo cancellation parameter 350 to be updated. On the other hand, the acoustic echo cancellation module 348 can determine to not update the acoustic echo cancellation parameter 350 when the location of the person in the room does not change. By updating the acoustic echo cancellation parameter 350 when the location of the person in the room changes and not updating the acoustic echo cancellation parameter 350 when the location of the person in the room does not change, compute resources can be saved at the processor 340.
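The update-on-movement policy described above might look like the following. The angular threshold is a made-up tuning value, and resetting the FIR estimate is just one way to trigger re-adaptation; the patent specifies neither.

```python
import numpy as np

def maybe_update_aec(prev_direction, new_direction, aec_filter,
                     threshold_deg=5.0):
    """Reset the FIR echo-path estimate only if the person moved.

    prev_direction, new_direction: unit vectors toward the person
    aec_filter: the FIR coefficient array maintained by the canceller
    Returns True when an update was triggered.
    """
    cos_angle = np.clip(np.dot(prev_direction, new_direction), -1.0, 1.0)
    moved = np.degrees(np.arccos(cos_angle)) > threshold_deg
    if moved:
        aec_filter[:] = 0.0  # force re-convergence for the new beam direction
    return moved
```

Skipping the update when nothing moved is what saves compute at the processor, per the passage above.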
  • spatial audio techniques can be used to create a directional sound at a far-end video conferencing system by collecting information from a near-end.
  • a far-end device can be a sound bar or a headset, for which directional sounds can be created.
  • beamforming can be used to create the directional sounds.
  • for headsets, head related transfer functions (HRTF) can be used to create the directional sounds.
  • a person direction at the near-end can be estimated by using the camera information 312, and an average position of the person can be selected to accommodate for minor movements of the person at the near-end.
  • Information about the person direction and the average position of the person can be sent from the video conferencing system 300 at the near-end to the far-end video conferencing system to enable the directional sound to be created.
  • a loudspeaker beamformer or HRTF spatial audio renderer at the far-end video conferencing system may not continuously change parameters, thereby saving computations at the far-end video conferencing system.
  • FIG. 4 is a flowchart illustrating one example method 400 of performing acoustic echo cancellation in a video conference system.
  • the method can be executed as instructions on a machine, where the instructions can be included on a non-transitory machine readable storage medium.
  • the method can include determining a location of a person in a room, as in block 410.
  • the method can include capturing an audio signal received from the location of the person using beamforming, as in block 420.
  • the method can include determining an acoustic echo cancellation parameter based in part on the audio signal captured from the location of the person, as in block 430.
  • the method can include performing acoustic echo cancellation on the audio signal using the acoustic echo cancellation parameter, as in block 440.
  • the method 400 can be performed using the video conferencing system 300, but the method 400 is not limited to being performed using the video conferencing system 300.
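Read end to end, blocks 410 through 440 compose into something like the following sketch, again built from the hypothetical helpers above (face detection, delay-sum beamforming, NLMS cancellation); none of these names come from the patent.

```python
import numpy as np

def process_block(frame, mic_signals, far_end, mic_positions, fs):
    """One pass of the FIG. 4 method over a block of audio (illustrative)."""
    azimuths_deg = detect_person_directions(frame)            # block 410
    cleaned_signals = []
    for az_deg in azimuths_deg:
        az = np.radians(az_deg)
        direction = np.array([np.cos(az), np.sin(az), 0.0])
        beam = delay_sum_beamform(mic_signals, mic_positions,
                                  direction, fs)              # block 420
        cleaned, _rir = nlms_echo_canceller(far_end, beam)    # blocks 430, 440
        cleaned_signals.append(cleaned)
    return cleaned_signals  # one echo-reduced channel per detected person
```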
  • FIG. 5 is a flowchart illustrating one example method 500 of performing acoustic echo cancellation in a video conference system.
  • the method can be executed as instructions on a machine, where the instructions can be included on a non-transitory machine readable storage medium.
  • the method can include determining a location of a person in a room based in part on camera information, as in block 510.
  • the method can include capturing an audio signal received from the location of the person using a beamformer, as in block 520.
  • the method can include determining a room impulse response based in part on the audio signal captured from the location of the person, as in block 530.
  • the method can include providing an output of the beamformer as an input to an echo canceler that performs acoustic echo cancellation on the audio signal received from the location of the person based in part on the room impulse response, as in block 540.
  • the method can include transmitting the audio signal having the canceled acoustic echo, as in block 550.
  • the method 500 can be performed using the video conferencing system 300, but the method 500 is not limited to being performed using the video conferencing system 300.
  • FIG. 6 illustrates a computing device 610 on which modules of this disclosure can execute.
  • a computing device 610 is illustrated on which a high level example of the disclosure can be executed.
  • the computing device 610 can include processor(s) 612 that are in communication with memory devices 620.
  • the computing device can include a local communication interface 618 for the components in the computing device.
  • the local communication interface can be a local data bus and/or a related address or control busses as can be desired.
  • the memory device 620 can contain modules 624 that are executable by the processor(s) 612 and data for the modules 624.
  • the modules 624 can execute the functions described earlier, such as: determining a location of a person in a room based in part on camera information; capturing an audio signal received from the location of the person using a beamformer; determining a room impulse response based in part on the audio signal captured from the location of the person; providing an output of the beamformer as an input to an echo canceler that performs acoustic echo cancellation on the audio signal received from the location of the person based in part on the room impulse response; and transmitting the audio signal having the canceled acoustic echo.
  • a data store 622 can also be located in the memory device 620 for storing data related to the modules 624 and other applications along with an operating system that is executable by the processor(s) 612.
  • the computing device can also have access to I/O (input/output) devices 614 that are usable by the computing devices.
  • An example of an I/O device 614 is a display screen that is available to display output from the computing device.
  • Networking devices 616 and similar communication devices can be included in the computing device.
  • the networking devices 616 can be wired or wireless networking devices that connect to the internet, a local area network (LAN), wide area network (WAN), or other computing network.
  • the components or modules that are shown as being stored in the memory device 620 can be executed by the processor 612.
  • the term "executable" can mean a program file that is in a form that can be executed by a processor 612.
  • a program in a higher level language can be compiled into machine code in a format that can be loaded into a random access portion of the memory device 620 and executed by the processor 612, or source code can be loaded by another executable program and interpreted to generate instructions in a random access portion of the memory to be executed by a processor.
  • the executable program can be stored in a portion or component of the memory device 620.
  • the memory device 620 can be random access memory (RAM), read-only memory (ROM), flash memory, a solid state drive, memory card, a hard drive, optical disk, floppy disk, magnetic tape, or other memory components.
  • the processor 612 can represent multiple processors and the memory 620 can represent multiple memory units that operate in parallel to the processing circuits. This can provide parallel processing channels for the processes and data in the system.
  • the local interface 618 can be used as a network to facilitate communication between the multiple processors and multiple memories. The local interface 618 can use additional systems designed for coordinating communication such as load balancing, bulk data transfer, and similar systems.
  • block(s) shown in the flow chart can be omitted or skipped.
  • a number of counters, state variables, warning semaphores, or messages can be added to the logical flow for purposes of enhanced utility, accounting, performance, measurement, troubleshooting or for similar reasons.
  • Some of the functional units described in this specification have been labeled as modules in order to emphasize their implementation independence.
  • a module can be implemented as a hardware circuit comprising custom very-large-scale integration (VLSI) circuits or gate arrays, off-the-shelf semiconductors such as logic chips, transistors, or other discrete components.
  • a module can also be implemented in programmable hardware devices such as field programmable gate arrays, programmable array logic, programmable logic devices or the like.
  • Modules can also be implemented in machine-readable software for execution by various types of processors.
  • An identified module of executable code can, for instance, comprise block(s) of computer instructions, which can be organized as an object, procedure, or function. Nevertheless, the executables of an identified module need not be physically located together, but can comprise disparate instructions stored in different locations which comprise the module and achieve the stated purpose for the module when joined logically together.
  • a module of executable code can be a single instruction, or many instructions, and can even be distributed over several different code segments, among different programs, and across several memory devices.
  • operational data can be identified and illustrated herein within modules, and can be embodied in a suitable form and organized within a suitable type of data structure. The operational data can be collected as a single data set, or can be distributed over different locations including over different storage devices.
  • the modules can be passive or active, including agents operable to perform desired functions.
  • the disclosure described here can also be stored on a computer readable storage medium that includes volatile and non-volatile, removable and non-removable media implemented with any technology for the storage of information such as computer readable instructions, data structures, program modules, or other data.
  • Computer readable storage media can include, but is not limited to, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or any other computer storage medium which can be used to store the desired information.
  • the devices described herein can also contain communication connections or networking apparatus and networking connections that allow the devices to communicate with other devices.
  • Communication connections can be an example of communication media.
  • Communication media can embody computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and can include any information delivery media.
  • communication media can include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency, infrared, and other wireless media.
  • computer readable media as used herein can include communication media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Acoustic echo cancellation for a video conference system is described. A location of a person in a room can be determined. An audio signal received from the location of the person can be captured using beamforming. An acoustic echo cancellation parameter can be determined based in part on the audio signal captured from the location of the person. Acoustic echo cancellation can be performed on the audio signal using the acoustic echo cancellation parameter.

Description

ACOUSTIC ECHO CANCELLATION
BACKGROUND
[0001]Video conferencing systems can be used for communication between parties in different locations. A video conferencing system at a near-end can capture audio-video information at the near-end and transmit the audio-video information to a far-end. Similarly, a video conferencing system at the far-end can capture audio-visual information at the far-end and transmit the audio-visual information to the near-end.
BRIEF DESCRIPTION OF THE DRAWINGS
[0002] FIG. 1 illustrates an example of a video conference system in a near-end room that includes a plurality of persons in accordance with the present disclosure;
[0003] FIG. 2 illustrates an example of a technique for performing acoustic echo cancellation for an audio signal in accordance with the present disclosure;
[0004] FIG. 3 illustrates an example of a video conferencing system for performing acoustic echo cancellation in accordance with the present disclosure;
[0005] FIG. 4 is a flowchart illustrating an example method of performing acoustic echo cancellation in a video conference system in accordance with the present disclosure;
[0006] FIG. 5 is a flowchart illustrating another example method of performing acoustic echo cancellation in a video conference system in accordance with the present disclosure;
[0007] FIG. 6 is a block diagram that provides an example illustration of a computing device that can be employed in the present disclosure.
DETAILED DESCRIPTION
[0008] The present disclosure describes a machine readable storage medium as well as a method and a system for acoustic echo cancellation, such as in a video conference system. An example of the present disclosure can include a machine readable storage medium comprising instructions that, when executed by a processor, cause the processor to determine a location of a person in a room. The instructions, when executed by the processor, can cause the processor to capture an audio signal received from the location of the person using beamforming. The instructions, when executed by the processor, can cause the processor to determine an acoustic echo cancellation parameter based in part on the audio signal captured from the location of the person. The instructions, when executed by the processor, can cause the processor to perform acoustic echo cancellation on the audio signal using the acoustic echo cancellation parameter. In one example, the instructions cause the processor to transmit the audio signal having the canceled acoustic echo to a far-end system. In another example, the acoustic echo cancellation parameter includes a room impulse response. In still another example, an output of a beamformer that performs beamforming to capture the audio signal, for example, can be an input to an echo canceller that performs the acoustic echo cancellation on the audio signal. Beamforming can be performed with a microphone array using a fixed delay-sum beamformer and a set of beamforming parameters. The instructions can cause the processor to determine the location of the person in the room using camera information, pressure sensor information, signal power information, or a combination thereof. In another example, the instructions can cause the processor to perform the acoustic echo cancellation on a number of channels that are outputted from a beamformer, wherein the number of channels corresponds to a number of persons detected in the room. In further detail, the instructions can cause the processor to determine to update the acoustic echo cancellation parameter when the location of the person in the room changes, as well as determine to not update the acoustic echo cancellation parameter when the location of the person in the room does not change.
[0009]Another example of the present disclosure can include a method for acoustic echo cancellation. The method can include determining a location of a person in a room based in part on camera information. The method can include capturing an audio signal received from the location of the person using a beamformer. The method can include determining a room impulse response based in part on the audio signal captured from the location of the person. The method can include providing an output of the beamformer as an input to an echo canceler that performs acoustic echo cancellation on the audio signal received from the location of the person based in part on the room impulse response. The method can include transmitting the audio signal having the canceled acoustic echo. In one example, the acoustic echo cancellation can be on a number of channels that are outputted from the beamformer, wherein the number of channels corresponds to a number of persons detected in the room based in part on the camera information. In another example, performing beamforming can occur using a microphone array using the beamformer and a set of beamforming parameters.
[0010] Another example of the present disclosure can include a system for acoustic echo cancellation. The system can include a camera to capture camera information for a room. The system can include a microphone array to capture an audio signal received from a location of a person in the room. The system can include a processor. The processor can determine the location of the person in the room based in part on the camera information. The processor can perform beamforming to capture the audio signal received from the location of the person using the microphone array. The processor can determine an acoustic echo cancellation parameter based in part on the audio signal captured from the location of the person. The processor can perform acoustic echo cancellation on the audio signal using the acoustic echo cancellation parameter. The processor can transmit the audio signal having the canceled acoustic echo. In one example, the processor can perform the acoustic echo cancellation on a number of channels that are outputted from a beamformer that is used to perform the beamforming, wherein the number of channels corresponds to a number of persons detected in the room based in part on the camera information. In another example, the camera can be a stereo camera, a structured light sensor camera, a time-of-flight camera, or a combination thereof. In one specific example, the system can be a video
conferencing system.

[0011] In these examples, it is noted that when discussing the storage medium, the method, or the system, any of such discussions can be considered applicable to the other examples, whether or not they are explicitly discussed in the context of that example. Thus, for example, in discussing details about an audio signal in the context of the storage medium, such discussion also refers to the methods and systems described herein, and vice versa.
[0012] Turning now to the FIGS., FIG. 1 illustrates an example of a video conference system 100 in a near-end room 120 that includes a plurality of persons 110. The video conference system 100 can include a camera 102 to capture camera information for the near-end room 120. For example, the camera 102 can capture video of the persons 110 in the near-end room 120. The video captured in the near-end room 120 can be converted to a video signal, and the video signal can be transmitted to a far-end room 150. The video conference system 100 can include a speaker (or loudspeaker) 104. The speaker 104 can receive an audio signal from the far-end room 150 and produce a sound based on the audio signal. The video conference system 100 can include a microphone 106 to capture audio in the near-end room 120. For example, the microphone 106 can capture audio spoken by a person 110 in the near-end room 120. The audio captured in the near-end room 120 can be converted to an audio signal, and the audio signal can be transmitted to the far-end room 150. In addition, the video conference system 100 can include a display 108 to display a video signal received from the far-end room 150.
[0013] In one example, the far-end room 150 can include a video conferencing system 130. The video conferencing system 130 can include a camera 132 to capture camera information for the far-end room 150. For example, the camera 132 can capture video of the persons 140 in the far-end room 150. The video captured in the far-end room 150 can be converted to a video signal, and the video signal can be transmitted to the near-end room 120. The video conferencing system 130 can include a speaker 134, which can receive the audio signal from the near-end room 120 and produce a sound based on the audio signal. The video conferencing system 130 can include a microphone 136 to capture audio in the far-end room 150. For example, the microphone 136 can capture audio spoken by a person 140 in the far-end room 150. The audio captured in the far-end room 150 can be converted to an audio signal, and the audio signal can be transmitted to the near-end room 120. In addition, the video conferencing system 130 can include a display 138 to display the video signal received from the near-end room 120.
[0014] In the example shown in FIG. 1, the video conference system 100 in the near-end room 120 and the video conference system 130 in the far-end room 150 can enable the persons 110 in the near-end room 120 to communicate with the persons 140 in the far-end room 150. For example, the persons 110 in the near-end room 120 may be able to see and hear the persons 140 in the far-end room 150, based on audio-video information that is communicated between the video conference system 100 in the near-end room 120 and the video conference system 130 in the far-end room 150. In this non-limiting example, the near-end room 120 can include four persons and the far-end room 150 can include two persons, but other numbers of persons can be present in the near-end room 120 and the far-end room 150.
[0015] In one example, the microphone 106 that captures the audio spoken by the person 110 in the near-end room 120 can be a microphone array. The microphone array can include a plurality of microphones placed at different spatial locations. The microphone array can capture the audio spoken by the person 110 in the near-end room 120 using beamforming. The different spatial locations of the microphones in the microphone array that capture the audio spoken by the person 110 can produce beamforming parameters. A signal strength of signals emanating from particular directions in the near-end room 120, such as a location of the person 110 in the near-end room 120, can be increased based on the beamforming parameters. A signal strength of signals (e.g., due to noise) emanating from other directions in the near-end room 120, such as a location that is different than the location of the person 110 in the near-end room 120, can be combined in a benign or destructive manner based on the beamforming parameters, resulting in degradation of the signals to/from the location that is different than the location of the person 110 in the near-end room 120. As a result, by using sound propagation principles, the microphone array can provide an ability to augment signals emanating from a particular direction in the near-end room 120 based on knowledge of the particular direction.

[0016] In one example, beamforming techniques using a microphone array can adaptively track active persons and listen to sound in direction(s) of the active persons, and suppress sound (or noise) coming from other directions. Beamforming using a microphone array can augment a sound quality of received speech by increasing a gain of an audio signal in the active person's direction and reducing a number of far-end speaker echoes received at microphone(s) of the microphone array. In other words, by changing a gain and a phase delay for a given microphone output in the microphone array, a sound signal from a specific direction can be amplified by constructive
interference and sound signals in other directions can be attenuated by destructive interference. The gain(s) and phase delay(s) for microphone(s) in the microphone array can be considered to be the beamforming parameters. Further, since the gain and the phase delay for the given microphone output can vary based on the location of the person 110, the beamforming parameters can also depend on the location of the person 110.
[0017] Further, beamforming techniques using a microphone array can be classified as data-independent or fixed, or data-dependent or adaptive. For
data-independent or fixed beamforming techniques, beamforming parameters can be fixed during operation. For data-dependent or adaptive beamforming techniques, beamforming parameters can be continuously updated based on received signals.
Examples of fixed beamforming techniques can include delay-sum beamforming, sub-array delay sum beamforming, super-directivity beamforming or near-field
super-directivity beamforming. Examples of adaptive beamforming techniques can include generalized side-lobe canceler beamforming, adaptive microphone-array system for noise reduction (AMNOR) beamforming or post-filtering beamforming.
[0018] In one example, the audio captured using the microphone 106 of the video conferencing system 100 in the near-end room 120 can be transmitted as the audio signal to the video conferencing system 130 in the far-end room 150. The audio signal can be used to produce the sound at the speaker 134 of the video conferencing system 130 in the far-end room 150. That sound can bounce around the far-end room 150 for a fraction of a second and can be detected by the microphone 136 of the video conferencing system 130 in the far-end room 150, and then the sound can be sent back to the video conference system 100 in the near-end room 120. In some cases, the sound that bounces around the far-end room 150 can create a distracting and undesired echo that is heard in the near-end room 120. For example, the person 110 in the near-end room 120 can speak and when this sound bounces around the far-end room 150, the person 110 may hear an echo of their own voice.
[0019] In one example, acoustic echo cancellation can be used to cancel or reduce acoustic echo in the audio signal being transmitted from the video conferencing system 100 in the near-end room 120 to the video conferencing system 130 in the far-end room 150. The audio signal transmitted from the video conferencing system 100 in the near-end room 120 can include a near-end speech signal and a far-end echoed speech signal. The near-end speech signal can derive from the audio signal that is captured at the near-end room 120 with the microphone array using beamforming, and the far-end echoed speech signal can derive from the audio signal that is received from the far-end room 150. The acoustic echo cancellation can be applied to both the near-end speech signal and the far-end echoed speech signal, such that the far-end echoed speech signal is removed from the audio signal. An audio signal that comprises the near-end speech signal (i.e., an audio signal in which the acoustic echo has been cancelled or reduced) can be transmitted to the video conferencing system 130 in the far-end room 150.
[0020] FIG. 2 illustrates an example of a technique for performing acoustic echo cancellation for an audio signal in accordance with the present disclosure. The acoustic echo cancellation can be performed using a computing device 216 in a near-end room 220. The computing device 216 can be part of a video conferencing system that captures audio-video at the near-end room and transmits the audio-video to a far-end room 230. The computing device 216 may include, or be coupled to, a speaker 204 (or
loudspeaker), a camera 206 such as a stereo camera, a structured light sensor camera or a time-of-flight camera, and a microphone array 212. In other words, the speaker 204, the camera 206 and the microphone array 212 can be integrated with the computing device 216, or can be separate units that are coupled to the computing device 216.
[0021] In one example, the camera 206 can capture camera information for the near-end room 220. The camera information can be digital images and/or digital video of the near-end room 220. The camera information can be provided to a person detector and tracker unit 208 that operates on the computing device 216. The person detector and tracker unit 208 can analyze the camera information using object detection, which can include facial detection. Based on the camera information, the person detector and tracker unit 208 can determine a number of persons in the near-end room 220, as well as a location of a person in the near-end room 220. The person(s) that are detected in the near-end room 220 based on the camera information can include a person that is currently speaking or a person that is not currently speaking (e.g., a person in the near-end room 220 that is listening to another person who is speaking).
[0022] In one example, the location of the person can be a relative location with respect to the number of persons in the near-end room 220. The relative location of the person can imply a relative position of the person or persons with respect to the microphones in the microphone array 212. The relative location can be determined based upon determining a camera position relative to the microphones in the microphone array 212. The camera position relative to the microphones in the microphone array 212 can be determined manually or using object detection. The camera position can be determined once or periodically, as the camera 206 and the microphones in the microphone array 212 can be stationary or semi-stationary.
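To make the coordinate handling concrete, the sketch below converts a person position reported by a depth camera into a steering azimuth in the microphone-array frame. The extrinsic rotation and translation are assumed to come from the one-time (or periodic) calibration just described; the variable names and axis conventions are illustrative assumptions, not details from this disclosure.

```python
import numpy as np

def person_azimuth(point_camera, R_cam_to_array, t_cam_to_array):
    """Map a 3-D person position from camera coordinates into a
    steering azimuth (degrees) in the microphone-array frame.

    point_camera:   (3,) person position from the depth camera, meters
    R_cam_to_array: (3, 3) rotation from the camera frame to the array frame
    t_cam_to_array: (3,) translation from the camera frame to the array frame
    """
    p = R_cam_to_array @ np.asarray(point_camera) + np.asarray(t_cam_to_array)
    # Azimuth in the array's horizontal plane (x to the right, z forward).
    return float(np.degrees(np.arctan2(p[0], p[2])))
```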
[0023] As a non-limiting example, based on camera information captured using the camera 206, the person detector and tracker unit 208 can detect that there are four persons in the near-end room 220. Further, based on the camera information, the person detector and tracker unit 208 can determine that a first person is at a first location in the near-end room 220, a second person is at a second location in the near-end room 220, a third person is at a third location in the near-end room 220, and a fourth person is at a fourth location in the near-end room 220.
[0024] In one example, the person detector and tracker unit 208 can track persons in the near-end room 220 over a period of time. The person detector and tracker unit 208 can run when a level of variation in incoming video frames is above a defined threshold. For example, the person detector and tracker unit 208 can run during a beginning of a videoconference call when persons enter the near-end room 220 and settle down in the near-end room 220, and the person detector and tracker unit 208 can run in a reduced mode when persons are less likely to move in the near-end room 220 and therefore maintain a direction with respect to the microphone array 212. A minimal sketch of this gating is shown below.

[0025] In one example, the person detector and tracker unit 208 can provide person location information to a beamformer 210 that operates on the computing device 216. The person location information can indicate the location of the person in the near-end room 220. The beamformer 210 can be a fixed beamformer (e.g., a beamformer that performs delay-sum beamforming) or an adaptive beamformer. The beamformer 210 can be coupled to the microphone array 212. The beamformer 210 and the microphone array 212 can work together to perform beamforming. The beamformer 210 and the microphone array 212 can capture an audio signal received from the location of the person in the near-end room 220. For example, when the person in the near-end room 220 speaks, and the location of that person is established based on the person location information, the beamformer 210 and the microphone array 212 can capture the audio signal received from the location of the person in the near-end room 220. The audio signal can be captured using beamforming parameters, where the beamforming parameters can be set based on the location of the person in the near-end room 220.
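The gating described in paragraph [0024] can be sketched as follows, assuming grayscale frames and a threshold value of our own choosing:

```python
import numpy as np

MOTION_THRESHOLD = 12.0  # mean absolute pixel difference; an assumed tuning value

def should_run_detector(prev_frame, curr_frame):
    """Run the (comparatively costly) person detector only when the
    level of variation between consecutive video frames is above a
    defined threshold."""
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return float(diff.mean()) > MOTION_THRESHOLD
```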
[0026] In one example, the beamformer 210 can provide the audio signal, captured from the location of the person in the near-end room 220 using the beamforming parameters, to a multi-direction acoustic echo canceler 214. In other words, an output of the beamformer 210 can be an input to the acoustic echo canceler 214. The acoustic echo canceler 214 can operate on the computing device 216. The acoustic echo canceler 214 can also receive a far-end signal 202 from the far-end room 230. The far-end signal 202 can be provided to the speaker 204 in the near-end room 220 and cause an acoustic echo in the near-end room 220, which can be detected by the microphone array 212. The acoustic echo canceler 214 can determine an acoustic echo cancellation parameter based on the beamforming parameters associated with the audio signal received from the location of the person in the near-end room 220 using the beamformer 210. One example of the acoustic echo cancellation parameter can be a room impulse response. The room impulse response can correspond to the beamforming parameters associated with the audio signal received from the location of the person in the near-end room 220 using the beamformer 210, as well as the acoustic echo caused by the far-end signal 202.
[0027] In one example, the acoustic echo canceler 214 can model the room impulse response using a finite impulse response (FIR) filter. More specifically, the acoustic echo canceler 214 can model the room impulse response using the FIR filter based on a speaker signal from the speaker 204 and a microphone signal from the microphone array 212. Depending on the speaker signal and the microphone signal, the room impulse response can be estimated using the FIR filter. Thus, FIR parameters can correspond with the acoustic echo cancellation parameters.
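The disclosure names FIR modeling of the room impulse response but not a particular adaptation rule; a normalized least-mean-squares (NLMS) filter is one common choice, sketched below under that assumption. The filter length and step size are illustrative values.

```python
import numpy as np

def nlms_echo_cancel(far_end, mic, filter_len=512, mu=0.5, eps=1e-8):
    """Adaptively fit an FIR room-impulse-response estimate between the
    loudspeaker (far-end) signal and the microphone/beamformer signal,
    and subtract the predicted echo.

    Returns the residual, i.e. the near-end speech estimate.
    """
    h = np.zeros(filter_len)      # FIR estimate of the room impulse response
    x_buf = np.zeros(filter_len)  # most recent far-end samples, newest first
    out = np.zeros(len(mic))

    for n in range(len(mic)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = far_end[n]
        echo_hat = h @ x_buf      # predicted echo at the microphone
        e = mic[n] - echo_hat     # near-end speech + filter misadjustment
        # NLMS update, normalized by the far-end signal power.
        h += mu * e * x_buf / (x_buf @ x_buf + eps)
        out[n] = e
    return out
```

In this sketch the converged taps of h play the role of the FIR parameters, and hence of the acoustic echo cancellation parameters, discussed above.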
[0028] In one example, the acoustic echo cancellation parameter can be applied to the audio signal received from the location of the person in the near-end room 220, thereby producing an audio signal with a cancelled (or reduced) acoustic echo. In other words, the acoustic echo cancellation parameter can be applied to cancel or reduce the acoustic echo caused by the far-end signal 202 that is detected at the microphone array 212, which can produce a resulting audio signal that is not affected by the acoustic echo caused by the far-end signal 202. The resulting audio signal can be a near-end signal 218 that is transmitted to the far-end room 230. Since the acoustic echo cancellation has been applied to the near-end signal 218 to remove or reduce the acoustic echo, the near-end signal 218 can be of increased sound quality.
[0029] In one example, the beamformer 210 can operate with N beams or N channels, wherein N is a positive integer. One channel or one beam can correspond with a person detected using the person detector and tracker unit 208. Similarly, the acoustic echo cancellation can be performed with respect to the N beams or the N channels.
[0030] As a non-limiting example, the person detector and tracker unit 208 can detect three persons in the near-end room 220. In this example, the beamformer 210 can receive an audio signal from a first person in the near-end room 220 using a first beam or channel, an audio signal from a second person in the near-end room 220 using a second beam or channel, and an audio signal from a third person in the near-end room 220 using a third beam or channel. Then, a first acoustic echo canceler can perform acoustic echo cancellation on the first beam or channel, a second acoustic echo canceler can perform acoustic echo cancellation on the second beam or channel, and a third acoustic echo canceler can perform acoustic echo cancellation on the third beam or channel. Thus, a person identified in the near-end room 220 can correspond with a beam or channel, and acoustic echo cancellation can be applied to that beam or channel. This technique can have increased computational efficiency since it depends on a number of persons in the near-end room 220, as opposed to a number of channels in the microphone array 212.
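Putting the pieces together, the sketch below runs one beam and one echo canceler per detected person, reusing the delay_sum_beamform and nlms_echo_cancel sketches given earlier; person_directions is assumed to come from the person detector and tracker unit 208.

```python
def per_person_aec(mic_signals, mic_positions, person_directions, far_end, fs):
    """One beam and one echo canceler per detected person,
    rather than one per microphone."""
    near_end_channels = []
    for direction in person_directions:  # N persons -> N beams/channels
        beam = delay_sum_beamform(mic_signals, mic_positions, direction, fs)
        near_end_channels.append(nlms_echo_cancel(far_end, beam))
    return near_end_channels  # N echo-canceled channels
```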
[0031]A number of acoustic echo cancellers could correspond to a number of channels of a microphone array, even when a number of persons in the room were less than the number of channels in the microphone array. In other words, channel wise echo cancellation could be performed, where one microphone signal would correspond to one channel. This solution would become more computationally intensive when the number of microphones in the microphone array would increase. For example, a 16-microphone array with four persons in the room would result in 16 acoustic echo cancellers being used to perform acoustic echo cancellation. As a result, an increased number of computations would be performed when a number of persons in the room were less than the number of microphones in the microphone array.
[0032] In addition, beamforming would be performed after the acoustic echo cancellation to capture audio from a defined location in the room. For example, 16 acoustic echo cancellers would be used to perform acoustic echo cancellation for a 16-microphone array with four persons in the room, and then beamforming would be performed for the four persons in the room.
[0033] In the present disclosure, the camera information can be used to determine a number of persons in a room, and a number of beams or channels used by a beamformer can correspond to the number of persons in the room. Further, the number of echo cancelers used to perform acoustic echo cancellation can correspond to the number of beams or channels used by the beamformer. Thus, in the present disclosure, the acoustic echo cancellation can be performed after the beamforming.
[0034] In the present disclosure, an increased number of microphones can be used in the microphone array while maintaining increased computational efficiency, even when a reduced number of persons are in the room. An increased number of
microphones in the microphone array can provide increased directivity and increased gain or signal-to-noise ratio (SNR) in a direction of interest. Thus, the present disclosure provides an acoustic echo cancellation setup with reduced complexity while maintaining an increased number of microphones in a microphone array.

[0035] As a non-limiting example, a 16-microphone array with four persons can result in four beams or channels, and can result in four acoustic echo cancellers being used to perform acoustic echo cancellation. Thus, in the present disclosure, a
computational efficiency can be increased because the acoustic echo cancellation can be performed based on the number of persons in the room (and the corresponding number of beams or channels), and not based on a number of channels in the microphone array.
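As a rough back-of-the-envelope check of this claim, assume each adaptive FIR canceler costs about 2L multiply-accumulates per sample; the filter length and sampling rate below are illustrative values of our own choosing, not figures from this disclosure:

```python
L = 512        # assumed adaptive filter length, taps
fs = 16_000    # assumed sampling rate, Hz
mics, persons = 16, 4

per_canceler = 2 * L * fs  # multiply-accumulates per second per canceler

print(mics * per_canceler)     # per-microphone AEC: ~262 million MAC/s
print(persons * per_canceler)  # per-person AEC:     ~66 million MAC/s
```

Under these assumptions, per-person cancellation is roughly a fourfold saving over per-microphone cancellation in this scenario.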
[0036] FIG. 3 illustrates an example of a video conferencing system 300 for performing acoustic echo cancellation. The video conferencing system 300 can be a near-end video conferencing system or a far-end video conferencing system. The video conferencing system 300 can include a camera 310 such as a stereo camera, a structured light sensor camera or a time-of-flight camera, a microphone array 320, pressure sensor(s) 330, a speaker 335 (or loudspeaker), and a processor 340 that performs the acoustic echo cancellation on an audio signal 322. One non-limiting example of the processor 340 can be a digital signal processor (DSP).
[0037] In one example, the camera 310 can capture camera information 312 for a room. The camera information 312 can include video information of the room, which can include a plurality of video frames. The camera 310 can operate continuously or intermittently to capture the camera information 312 for the room. For example, the camera 310 can operate continuously during the videoconferencing session, or can operate intermittently during the videoconferencing session (e.g., at a beginning of the videoconferencing session and at defined periods during the videoconferencing session).
[0038] In one example, the microphone array 320 can capture the audio signal 322 received from a location of a person in the room. The microphone array 320 can include a plurality of microphones at different spatial locations. The microphones in the microphone array 320 can be omnidirectional microphones, directional microphones, or a
combination of omnidirectional and directional microphones.
[0039] In one example, the speaker 335 can produce a sound, which can be detected by the microphone array 320. For example, the sound can correspond to an audio signal received at the video conferencing system 300 from a far-end.
[0040] In one example, the processor 340 can include a person location determination module 342. The person location determination module 342 can determine the location of the person in the room based on the camera information 312. For example, the person location determination module 342 can analyze the camera information 312 using object detection, facial recognition, or like techniques to determine a number of persons in the room and a location of a person in the number of persons in the room. The location of the person can be a relative location with respect to locations of other persons in the room.
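As one concrete stand-in for the object detection the person location determination module 342 performs, the sketch below uses OpenCV's bundled Haar-cascade face detector. The actual module could use any detection or facial-recognition model, so treat this choice and the function name as assumptions.

```python
import cv2

# OpenCV's stock frontal-face Haar cascade stands in for whatever
# detection model the person location determination module uses.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_person_boxes(frame_bgr):
    """Return (x, y, w, h) bounding boxes for faces in one video frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    boxes = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return list(boxes)
```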
[0041] Additionally, the person location determination module 342 can determine the location of the person in the room using pressure sensor information from the pressure sensor(s) 330. The pressure sensor(s) 330 can be installed on chairs or seats in the room, and can be used to detect the presence of persons in the room. For example, a pressure sensor 330 installed on a certain chair can detect whether a person is sitting on that chair based on pressure sensor information produced by the pressure sensor 330. The pressure sensor(s) 330 can send the pressure sensor information, which can enable the person location determination module 342 to determine the number of persons in the room.
[0042] Additionally, the person location determination module 342 can determine the location of the person in the room using signal power information, as determined at the microphone array 320. The signal power information can indicate a signal power associated with the audio signal 322 detected using the microphone array 320. The signal power associated with the audio signal 322 can be used to determine a distance and/or location of the person in the room in relation to the microphone array 320. The signal power information can be provided to enable the person location determination module 342 to determine the location of the person in the room.
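A sketch of the signal-power cue, assuming the module reduces each microphone channel to an RMS level that can then be fused with the camera and pressure-sensor information:

```python
import numpy as np

def rms_power_db(mic_signals):
    """Per-microphone RMS power in dB for a (num_mics, num_samples)
    block; louder channels suggest microphones nearer to, or better
    aimed at, the active talker."""
    rms = np.sqrt(np.mean(np.square(mic_signals), axis=1))
    return 20.0 * np.log10(rms + 1e-12)
```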
[0043] In one example, the processor 340 can include a beamforming module 344. The beamforming module 344 can perform beamforming to capture the audio signal 322 received from the location of the person using the microphone array 320. In one example, the beamforming module 344 can use a fixed beamforming technique, such as delay-sum beamforming, sub-array delay sum beamforming, super-directivity beamforming or near-field super-directivity beamforming. In another example, the beamforming module 344 can use an adaptive beamforming technique, such as generalized side-lobe canceler beamforming, AMNOR beamforming or post-filtering beamforming.

[0044] In one example, the beamforming module 344 can capture the audio signal 322 received from the location of the person using beamforming parameters 346, where the beamforming parameters 346 can be based on the location of the person in the room. In other words, the location of the person in the room can be determined using the camera information 312, and that location can be used to set or adjust the beamforming parameters 346. Based on the beamforming parameters 346, the audio signal can be captured from the location of the person.
[0045] In one example, the processor 340 can include an acoustic echo
cancellation module 348. The acoustic echo cancellation module 348 can determine an acoustic echo cancellation parameter 350 based on the audio signal 322 captured from the location of the person. More specifically, the acoustic echo cancellation module 348 can determine the acoustic echo cancellation parameter 350 based on the beamforming parameters 346, which can be set based on the detected location of the person in the room. Thus, the acoustic echo cancellation module 348 can receive the audio signal 322 from the beamforming module 344. In this case, an output of the beamforming module 344 can be an input to the acoustic echo cancellation module 348.
[0046] In one example, the acoustic echo cancellation parameter 350 can be a room impulse response. The room impulse response can correspond to the beamforming parameters 346 associated with the audio signal 322 received from the location of the person in the room, as well as an acoustic echo detected by the microphone array 320. The acoustic echo can result from sound produced by the speaker 335, as detected by the microphone array 320. The sound can be associated with the audio signal received at the video conferencing system 300 from the far-end. The room impulse response can be specific to one microphone in the microphone array 320. In other words, one microphone in the microphone array 320 can be associated with one room impulse response, while another microphone in the microphone array 320 can be associated with another room impulse response.
[0047] In one example, the room impulse response can be modelled using an FIR filter. More specifically, the room impulse response can be modelled using the FIR filter based on a speaker signal from the speaker 335 and the audio signal 322 detected at the microphone array 320. Depending on the speaker signal and the audio signal 322, the room impulse response can be estimated using the FIR filter. Thus, FIR parameters can correspond with the acoustic echo cancellation parameter 350.
[0048] In one example, the acoustic echo cancellation module 348 can perform acoustic echo cancellation on the audio signal 322 using the acoustic echo cancellation parameter 350, such as the room impulse response. The acoustic echo cancellation module 348 can apply the acoustic echo cancellation parameter to cancel or reduce an acoustic echo in the audio signal 322.
[0049] In one example, the acoustic echo cancellation module 348 can converge to an acoustic echo cancellation solution in a reduced amount of time when the room impulse response is relatively sparse, as compared to when the room impulse response is relatively dense. In one example, echoes can be formed when sound from the speaker 335 is produced, reflects through the room and then reaches the microphone array 320. The microphone array 320 may be able to receive sound from multiple directions. By using the beamforming, sound from a particular direction in the room can be captured. A number of reflected sounds coming from this particular direction can be reduced, in which case the room impulse response can be relatively sparse. The acoustic echo cancellation module 348 has fewer reflections to learn when the room impulse response is sparse, so the acoustic echo cancellation module 348 can converge to the acoustic echo cancellation solution in a reduced amount of time.
[0050] In one example, the processor 340 can include an audio signal
transmission module 352. The audio signal transmission module 352 can receive the audio signal 322 having the cancelled acoustic echo from the acoustic echo cancellation module 348. The audio signal transmission module 352 can transmit the audio signal having the cancelled acoustic echo to, for example, a remote video conferencing system.
[0051] In one configuration, the beamforming module 344 can operate with N beams or N channels, wherein N is a positive integer. One channel or one beam can correspond with a person detected in the room. Similarly, the acoustic echo cancellation module 348 can perform acoustic echo cancellation on the N beams or the N channels that are outputted from the beamforming module 344. In this example, the N beams or the N channels can correspond to a number of persons detected in the room. Thus, the acoustic echo cancellation module 348 can operate parallel acoustic echo cancelers equal in number to the persons detected in the room, which can result in increased computational efficiency.
[0052] In one configuration, the acoustic echo cancellation module 348 can determine the acoustic echo cancellation parameter 350 based on the beamforming parameters 346, which can be set based on the detected location of the person in the room. In one example, the acoustic echo cancellation module 348 can update the acoustic echo cancellation parameter 350 when the location of the person in the room changes. In other words, the changed location of the person in the room can change the beamforming parameters 346, which in turn can cause the acoustic echo cancellation parameter 350 to be updated. On the other hand, the acoustic echo cancellation module 348 can determine to not update the acoustic echo cancellation parameter 350 when the location of the person in the room does not change. By updating the acoustic echo cancellation parameter 350 when the location of the person in the room changes and not updating the acoustic echo cancellation parameter 350 when the location of the person in the room does not change, compute resources can be saved at the processor 340.
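A minimal sketch of that update gate, with an assumed movement tolerance; the class and field names are illustrative:

```python
import numpy as np

LOCATION_TOLERANCE_M = 0.25  # assumed movement before re-estimation is worthwhile

class EchoCancelerState:
    """Cache the converged canceler parameters while the person stays put."""

    def __init__(self):
        self.location = None
        self.rir_estimate = None  # e.g. the FIR taps from the NLMS sketch

    def needs_update(self, new_location):
        if self.location is None:
            return True
        moved = np.linalg.norm(np.asarray(new_location) - self.location)
        return moved > LOCATION_TOLERANCE_M

    def update(self, new_location, rir_estimate):
        self.location = np.asarray(new_location, dtype=float)
        self.rir_estimate = rir_estimate
```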
[0053] In one configuration, spatial audio techniques can be used to create a directional sound at a far-end video conferencing system by collecting information from a near-end. A far-end device can be a sound bar or a headset, for which directional sounds can be created. For sound bars, beamforming can be used to create the directional sounds. For headsets, head-related transfer functions (HRTF) can be used to create the directional sounds. A person direction at the near-end can be estimated by using the camera information 312, and an average position of the person can be selected to accommodate for minor movements of the person at the near-end. Information about the person direction and the average position of the person can be sent from the video conferencing system 300 at the near-end to the far-end video conferencing system to enable the directional sound to be created. By selecting the average position of the person, a loudspeaker beamformer or HRTF spatial audio renderer at the far-end video conferencing system may not continuously change parameters, thereby saving computations at the far-end video conferencing system.
[0054] FIG. 4 is a flowchart illustrating one example method 400 of performing acoustic echo cancellation in a video conference system. The method can be executed as instructions on a machine, where the instructions can be included on a non-transitory machine readable storage medium. The method can include determining a location of a person in a room, as in block 410. The method can include capturing an audio signal received from the location of the person using beamforming, as in block 420. The method can include determining an acoustic echo cancellation parameter based in part on the audio signal captured from the location of the person, as in block 430. The method can include performing acoustic echo cancellation on the audio signal using the acoustic echo cancellation parameter, as in block 440. In one example, the method 400 can be performed using the video conferencing system 300, but the method 400 is not limited to being performed using the video conferencing system 300.
[0055] FIG. 5 is a flowchart illustrating one example method 500 of performing acoustic echo cancellation in a video conference system. The method can be executed as instructions on a machine, where the instructions can be included on a non-transitory machine readable storage medium. The method can include determining a location of a person in a room based in part on camera information, as in block 510. The method can include capturing an audio signal received from the location of the person using a beamformer, as in block 520. The method can include determining a room impulse response based in part on the audio signal captured from the location of the person, as in block 530. The method can include providing an output of the beamformer as an input to an echo canceler that performs acoustic echo cancellation on the audio signal received from the location of the person based in part on the room impulse response, as in block 540. The method can include transmitting the audio signal having the canceled acoustic echo, as in block 550. In one example, the method 500 can be performed using the video conferencing system 300, but the method 500 is not limited to being performed using the video conferencing system 300.
[0056] FIG. 6 illustrates a high level example of a computing device 610 on which modules of this disclosure can execute. The computing device 610 can include processor(s) 612 that are in communication with memory devices 620. The computing device can include a local communication interface 618 for the components in the computing device. For example, the local communication interface can be a local data bus and/or a related address or control busses as can be desired.
[0057] The memory device 620 can contain modules 624 that are executable by the processor(s) 612 and data for the modules 624. The modules 624 can execute the functions described earlier, such as: determining a location of a person in a room based in part on camera information; capturing an audio signal received from the location of the person using a beamformer; determining a room impulse response based in part on the audio signal captured from the location of the person; providing an output of the beamformer as an input to an echo canceler that performs acoustic echo cancellation on the audio signal received from the location of the person based in part on the room impulse response; and transmitting the audio signal having the canceled acoustic echo.
[0058] A data store 622 can also be located in the memory device 620 for storing data related to the modules 624 and other applications along with an operating system that is executable by the processor(s) 612.
[0059] Other applications can also be stored in the memory device 620 and can be executable by the processor(s) 612. Components or modules discussed in this description can be implemented in the form of machine-readable software using high-level programming languages that are compiled, interpreted or executed using a hybrid of these methods.
[0060] The computing device can also have access to I/O (input/output) devices 614 that are usable by the computing devices. An example of an I/O device is a display screen that is available to display output from the computing devices. Networking devices 616 and similar communication devices can be included in the computing device. The networking devices 616 can be wired or wireless networking devices that connect to the internet, a local area network (LAN), wide area network (WAN), or other computing network.
[0061] The components or modules that are shown as being stored in the memory device 620 can be executed by the processor 612. The term "executable" can mean a program file that is in a form that can be executed by a processor 612. For example, a program in a higher level language can be compiled into machine code in a format that can be loaded into a random access portion of the memory device 620 and executed by the processor 612, or source code can be loaded by another executable program and interpreted to generate instructions in a random access portion of the memory to be executed by a processor. The executable program can be stored in a portion or component of the memory device 620. For example, the memory device 620 can be random access memory (RAM), read-only memory (ROM), flash memory, a solid state drive, memory card, a hard drive, optical disk, floppy disk, magnetic tape, or other memory components.
[0062] The processor 612 can represent multiple processors and the memory device 620 can represent multiple memory units that operate in parallel with the processing circuits. This can provide parallel processing channels for the processes and data in the system. The local interface 618 can be used as a network to facilitate communication between the multiple processors and multiple memories. The local interface 618 can use additional systems designed for coordinating communication such as load balancing, bulk data transfer, and similar systems.
[0063] While the flowcharts presented for this disclosure can imply a specific order of execution, the order of execution can differ from what is illustrated. For example, the order of two or more blocks can be rearranged relative to the order shown. Further, two or more blocks shown in succession can be executed in parallel or with partial
parallelization. In some configurations, block(s) shown in the flow chart can be omitted or skipped. A number of counters, state variables, warning semaphores, or messages can be added to the logical flow for purposes of enhanced utility, accounting, performance, measurement, troubleshooting or for similar reasons.
[0064] Some of the functional units described in this specification have been labeled as modules, in order to more particularly emphasize their implementation independence. For example, a module can be implemented as a hardware circuit comprising custom very-large-scale integration (VLSI) circuits or gate arrays, off-the-shelf semiconductors such as logic chips, transistors, or other discrete components. A module can also be implemented in programmable hardware devices such as field programmable gate arrays, programmable array logic, programmable logic devices or the like.
[0065] Modules can also be implemented in machine-readable software for execution by various types of processors. An identified module of executable code can, for instance, comprise block(s) of computer instructions, which can be organized as an object, procedure, or function. Nevertheless, the executables of an identified module need not be physically located together, but can comprise disparate instructions stored in different locations which comprise the module and achieve the stated purpose for the module when joined logically together.
[0066] Indeed, a module of executable code can be a single instruction, or many instructions, and can even be distributed over several different code segments, among different programs, and across several memory devices. Similarly, operational data can be identified and illustrated herein within modules, and can be embodied in a suitable form and organized within a suitable type of data structure. The operational data can be collected as a single data set, or can be distributed over different locations including over different storage devices. The modules can be passive or active, including agents operable to perform desired functions.
[0067] The disclosure described here can also be stored on a computer readable storage medium that includes volatile and non-volatile, removable and non-removable media implemented with any technology for the storage of information such as computer readable instructions, data structures, program modules, or other data. Computer readable storage media can include, but is not limited to, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or any other computer storage medium which can be used to store the desired information.
[0068] The devices described herein can also contain communication connections or networking apparatus and networking connections that allow the devices to
communicate with other devices. Communication connections can be an example of communication media. Communication media can embody computer readable
instructions, data structures, program modules and other data in a modulated data signal such as a carrier wave or other transport mechanism and can include information delivery media. By way of example, and not limitation, communication media can include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency, infrared, and other wireless media. The term computer readable media as used herein can include communication media.
[0069] Reference was made to the examples illustrated in the drawings, and specific language was used herein to describe the same. It will nevertheless be understood that no limitation of the scope of the disclosure is thereby intended.
Alterations and further modifications of the features illustrated herein, and additional applications of the examples as illustrated herein, are to be considered within the scope of the description.
[0070] Furthermore, the described features, structures, or characteristics can be combined in a suitable manner. In the preceding description, numerous specific details were provided, such as examples of various configurations to provide a thorough understanding of examples of the described disclosure. The disclosure may be practiced without some of the specific details, or with other methods, components, devices, etc. In other instances, some structures or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
[0071] Although the subject matter has been described in language specific to structural features and/or operations, it is to be understood that the subject matter defined in the appended claims is not limited to the specific features and operations described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. Numerous modifications and alternative arrangements can be devised without departing from the scope of the described disclosure.

CLAIMS

What is Claimed Is:
1. A machine readable storage medium comprising instructions that, when executed by a processor, cause the processor to:
determine a location of a person in a room;
capture an audio signal received from the location of the person using beamforming;
determine an acoustic echo cancellation parameter based in part on the audio signal captured from the location of the person; and
perform acoustic echo cancellation on the audio signal using the acoustic echo cancellation parameter.
2. The machine readable storage medium of claim 1, wherein the instructions cause the processor to: transmit the audio signal having the canceled acoustic echo to a far-end system.
3. The machine readable storage medium of claim 1, wherein the acoustic echo cancellation parameter includes a room impulse response.
4. The machine readable storage medium of claim 1, wherein an output of a beamformer that performs beamforming to capture the audio signal is an input to an echo canceller that performs the acoustic echo cancellation on the audio signal.
5. The machine readable storage medium of claim 1, wherein the
beamforming is performed with a microphone array using a fixed delay-sum beamformer and a set of beamforming parameters.
6. The machine readable storage medium of claim 1, wherein the instructions cause the processor to determine the location of the person in the room using camera information, pressure sensor information, signal power information, or a combination thereof.
7. The machine readable storage medium of claim 1, wherein the instructions cause the processor to: perform the acoustic echo cancellation on a number of channels that are outputted from a beamformer, wherein the number of channels corresponds to a number of persons detected in the room.
8. The machine readable storage medium of claim 1, wherein the instructions cause the processor to:
determine to update the acoustic echo cancellation parameter when the location of the person in the room changes; and
determine to not update the acoustic echo cancellation parameter when the location of the person in the room does not change.
9. A method for acoustic echo cancellation, comprising:
determining a location of a person in a room based in part on camera information;
capturing an audio signal received from the location of the person using a beamformer;
determining a room impulse response based in part on the audio signal captured from the location of the person;
providing an output of the beamformer as an input to an echo canceler that performs acoustic echo cancellation on the audio signal received from the location of the person based in part on the room impulse response; and
transmitting the audio signal having the canceled acoustic echo.
10. The method of claim 9, comprising performing the acoustic echo
cancellation on a number of channels that are outputted from the beamformer, wherein the number of channels corresponds to a number of persons detected in the room based in part on the camera information.
11. The method of claim 9, comprising performing beamforming with a microphone array using the beamformer and a set of beamforming parameters.
12. A system for acoustic echo cancellation, comprising:
a camera to capture camera information for a room;
a microphone array to capture an audio signal received from a location of a person in the room; and
a processor to:
determine the location of the person in the room based in part on the
camera information;
perform beamforming to capture the audio signal received from the
location of the person using the microphone array;
determine an acoustic echo cancellation parameter based in part on the audio signal captured from the location of the person;
perform acoustic echo cancellation on the audio signal using the acoustic echo cancellation parameter; and
transmit the audio signal having the canceled acoustic echo.
13. The system of claim 12, wherein the processor is to: perform the acoustic echo cancellation on a number of channels that are outputted from a beamformer that is used to perform the beamforming, wherein the number of channels corresponds to a number of persons detected in the room based in part on the camera information.
14. The system of claim 12, wherein the camera is a stereo camera, a structured light sensor camera, a time-of-flight camera, or a combination thereof.
15. The system of claim 12, wherein the system is a video conferencing system.
PCT/US2019/040535 2019-07-03 2019-07-03 Acoustic echo cancellation WO2021002862A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
PCT/US2019/040535 WO2021002862A1 (en) 2019-07-03 2019-07-03 Acoustic echo cancellation
CN201980098110.7A CN114008999A (en) 2019-07-03 2019-07-03 Acoustic echo cancellation
US17/419,460 US11937076B2 (en) 2019-07-03 2019-07-03 Acoustic echo cancellation
EP19935921.7A EP3994874A4 (en) 2019-07-03 2019-07-03 Acoustic echo cancellation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2019/040535 WO2021002862A1 (en) 2019-07-03 2019-07-03 Acoustic echo cancellation

Publications (1)

Publication Number Publication Date
WO2021002862A1 true WO2021002862A1 (en) 2021-01-07

Family ID: 74100911

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2019/040535 WO2021002862A1 (en) 2019-07-03 2019-07-03 Acoustic echo cancellation

Country Status (4)

Country Link
US (1) US11937076B2 (en)
EP (1) EP3994874A4 (en)
CN (1) CN114008999A (en)
WO (1) WO2021002862A1 (en)

Family Cites Families (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5737485A (en) 1995-03-07 1998-04-07 Rutgers The State University Of New Jersey Method and apparatus including microphone arrays and neural networks for speech/speaker recognition systems
JP2007228070A (en) * 2006-02-21 2007-09-06 Yamaha Corp Video conference apparatus
US8036767B2 (en) 2006-09-20 2011-10-11 Harman International Industries, Incorporated System for extracting and changing the reverberant content of an audio input signal
US8229134B2 (en) 2007-05-24 2012-07-24 University Of Maryland Audio camera using microphone arrays for real time capture of audio images and method for jointly processing the audio images with video images
US8325909B2 (en) * 2008-06-25 2012-12-04 Microsoft Corporation Acoustic echo suppression
US8988970B2 (en) * 2010-03-12 2015-03-24 University Of Maryland Method and system for dereverberation of signals propagating in reverberative environments
TR201815799T4 (en) 2011-01-05 2018-11-21 Anheuser Busch Inbev Sa An audio system and its method of operation.
WO2013058728A1 (en) 2011-10-17 2013-04-25 Nuance Communications, Inc. Speech signal enhancement using visual information
EP2713593B1 (en) 2012-09-28 2015-08-19 Alcatel Lucent, S.A. Immersive videoconference method and system
US9449613B2 (en) 2012-12-06 2016-09-20 Audeme Llc Room identification using acoustic features in a recording
US9595997B1 (en) 2013-01-02 2017-03-14 Amazon Technologies, Inc. Adaption-based reduction of echo and noise
JP6169910B2 (en) 2013-07-08 2017-07-26 本田技研工業株式会社 Audio processing device
US9565497B2 (en) 2013-08-01 2017-02-07 Caavo Inc. Enhancing audio using a mobile device
US9426300B2 (en) 2013-09-27 2016-08-23 Dolby Laboratories Licensing Corporation Matching reverberation in teleconferencing environments
US9602923B2 (en) * 2013-12-05 2017-03-21 Microsoft Technology Licensing, Llc Estimating a room impulse response
US9972315B2 (en) 2015-01-14 2018-05-15 Honda Motor Co., Ltd. Speech processing device, speech processing method, and speech processing system
WO2017147325A1 (en) 2016-02-25 2017-08-31 Dolby Laboratories Licensing Corporation Multitalker optimised beamforming system and method
CN105976827B (en) 2016-05-26 2019-09-13 南京邮电大学 A kind of indoor sound localization method based on integrated study
GB2556058A (en) 2016-11-16 2018-05-23 Nokia Technologies Oy Distributed audio capture and mixing controlling
CN106898348B (en) 2016-12-29 2020-02-07 北京小鸟听听科技有限公司 Dereverberation control method and device for sound production equipment
US10170134B2 (en) 2017-02-21 2019-01-01 Intel IP Corporation Method and system of acoustic dereverberation factoring the actual non-ideal acoustic environment
US10560661B2 (en) 2017-03-16 2020-02-11 Dolby Laboratories Licensing Corporation Detecting and mitigating audio-visual incongruence
US10229698B1 (en) * 2017-06-21 2019-03-12 Amazon Technologies, Inc. Playback reference signal-assisted multi-microphone interference canceler
US9928847B1 (en) * 2017-08-04 2018-03-27 Revolabs, Inc. System and method for acoustic echo cancellation
US10440497B2 (en) 2017-11-17 2019-10-08 Intel Corporation Multi-modal dereverbaration in far-field audio systems

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6266408B1 (en) 1996-10-28 2001-07-24 Samsung Electronics Co., Ltd. Echo controlling apparatus of video conferencing system and control method using the same
US20100272274A1 (en) 2009-04-28 2010-10-28 Majid Fozunbal Methods and systems for robust approximations of impulse reponses in multichannel audio-communication systems
US20110063405A1 (en) 2009-09-17 2011-03-17 Sony Corporation Method and apparatus for minimizing acoustic echo in video conferencing
US20170134849A1 (en) 2011-06-11 2017-05-11 Clearone, Inc. Conferencing Apparatus that combines a Beamforming Microphone Array with an Acoustic Echo Canceller
US20130121498A1 (en) 2011-11-11 2013-05-16 Qsound Labs, Inc. Noise reduction using microphone array orientation information
US9659576B1 (en) 2016-06-13 2017-05-23 Biamp Systems Corporation Beam forming and acoustic echo cancellation with mutual adaptation control

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3994874A4

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4333459A1 (en) * 2022-08-31 2024-03-06 GN Audio A/S Speakerphone with beamformer-based conference characterization and related methods

Also Published As

Publication number Publication date
US11937076B2 (en) 2024-03-19
EP3994874A1 (en) 2022-05-11
US20220116733A1 (en) 2022-04-14
CN114008999A (en) 2022-02-01
EP3994874A4 (en) 2023-01-18

Similar Documents

Publication Publication Date Title
US11297178B2 (en) Method, apparatus, and computer-readable media utilizing residual echo estimate information to derive secondary echo reduction parameters
US9659576B1 (en) Beam forming and acoustic echo cancellation with mutual adaptation control
US9589556B2 (en) Energy adjustment of acoustic echo replica signal for speech enhancement
US8385557B2 (en) Multichannel acoustic echo reduction
US7970123B2 (en) Adaptive coupling equalization in beamforming-based communication systems
KR100853018B1 (en) A method for generating noise references for generalized sidelobe canceling
US10136217B2 (en) Soundfield decomposition, reverberation reduction, and audio mixing of sub-soundfields at a video conference endpoint
US10129409B2 (en) Joint acoustic echo control and adaptive array processing
EP3994691B1 (en) Audio signal dereverberation
US20130322655A1 (en) Method and device for microphone selection
US9508359B2 (en) Acoustic echo preprocessing for speech enhancement
US9412354B1 (en) Method and apparatus to use beams at one end-point to support multi-channel linear echo control at another end-point
US11937076B2 (en) Acoustic echo cancellation
US9729967B2 (en) Feedback canceling system and method
US11523215B2 (en) Method and system for using single adaptive filter for echo and point noise cancellation
Ruiz et al. Distributed combined acoustic echo cancellation and noise reduction using GEVD-based distributed adaptive node specific signal estimation with prior knowledge
EP3884683B1 (en) Automatic microphone equalization
WO2023149254A1 (en) Voice signal processing device, voice signal processing method, and voice signal processing program
Comminiello et al. Advanced intelligent acoustic interfaces for multichannel audio reproduction

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19935921

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2019935921

Country of ref document: EP

Effective date: 20220203