WO2022178852A1 - Method and apparatus for assisting listening - Google Patents

Method and apparatus for assisting listening

Info

Publication number
WO2022178852A1
WO2022178852A1 (PCT/CN2021/078222)
Authority
WO
WIPO (PCT)
Prior art keywords
sound source
user
audio signal
coordinates
microphone
Prior art date
Application number
PCT/CN2021/078222
Other languages
English (en)
Chinese (zh)
Inventor
张立斌
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to CN202180004382.3A (CN115250646A)
Priority to PCT/CN2021/078222
Publication of WO2022178852A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10K: SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K11/00: Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/16: Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/175: Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00: Details of transducers, loudspeakers or microphones
    • H04R1/10: Earpieces; Attachments therefor; Earphones; Monophonic headphones
    • H04S: STEREOPHONIC SYSTEMS
    • H04S7/00: Indicating arrangements; Control arrangements, e.g. balance control

Definitions

  • the present application relates to the technical field of intelligent terminals, and in particular, to a method and device for assisting listening.
  • the sound sources include the sound source that the user wants to listen to, such as the sound source S1 shown in FIG. 1, and the sound source that the user does not want to listen to, such as the sound source S2 shown in FIG. 1.
  • the sound source S2 is noise to the user. How to make the user listen more attentively to the audio content of the sound source S1 is a technical problem to be solved by the embodiments of the present application.
  • the present application provides an assisted listening method and device, so that the user's attention can be more focused on the audio content of the sound source of interest, assisting the user in selective listening.
  • a method for assisting listening is provided, comprising: determining coordinates of a first sound source according to an external audio signal collected by a microphone and a first direction, where the first direction is a direction determined by detecting the user, and the first sound source is a sound source in a direction other than the first direction; determining, according to the coordinates of the first sound source and a preset head-related transfer function (HRTF), a first HRTF corresponding to the first sound source; and obtaining a noise signal corresponding to the first sound source according to the first HRTF and a preset virtual noise, and playing the noise signal.
  • the sound source in the first direction, that is, the sound source in the user's attention direction, is S1, and the first sound source is, for example, the sound source S2 in a direction the user does not attend to.
  • noise N can be superimposed on the audio content of the sound source S2 that the user is not concerned about or not interested in, thereby reducing the clarity of the audio content of the sound source S2, so that the user cannot clearly hear the audio content of the sound source S2. The user's attention is thus more focused on listening to the audio content of the sound source S1, assisting selective listening.
  • the determining the coordinates of the first sound source according to the external audio signal collected by the microphone and the first direction includes: determining the coordinates of at least one sound source near the user according to the external audio signal collected by the microphone; detecting the user to determine the first direction; and determining, according to the first direction, the coordinates of the first sound source among the coordinates of the at least one sound source near the user.
  • the determining the coordinates of at least one sound source near the user according to the external audio signal collected by the microphone includes: the number of microphones is multiple, and each microphone separately collects the external audio signal, There is a time delay between the external audio signals collected by different microphones; the coordinates of at least one sound source near the user are determined according to the time delays of the external audio signals collected by different microphones.
  • the detecting the user to determine the first direction includes: detecting the user's gaze direction; or detecting the difference between the currents at the user's two ears and determining the user's gaze direction according to the correspondence between the binaural current difference and the gaze direction, the user's gaze direction being the first direction.
  • the determining, according to the first direction, the coordinates of the first sound source from the coordinates of at least one sound source near the user includes: analyzing the coordinates of the at least one sound source near the user to determine the directional relationship between each sound source and the user; determining, according to the first direction and the directional relationship between each sound source and the user, the deviation between each sound source and the first direction; and selecting, from the at least one sound source near the user, a sound source whose deviation from the first direction is greater than a threshold, the coordinates of that sound source being the coordinates of the first sound source.
  • the external audio signal collected by the microphone is a mixed audio signal, and the mixed audio signal includes audio signals output by multiple sound sources; the external audio signal collected by the microphone is separated to obtain the first audio signal output by the first sound source.
  • the method further includes: analyzing the separated first audio signal to determine the content of the first audio signal; and determining, according to the content of the first audio signal, the type of virtual noise to be added.
  • the content of the first audio signal is human conversation sound
  • the type of virtual noise to be added is multi-person conversation babble noise
  • the method further includes: determining the energy of the separated first audio signal; and determining the energy of the virtual noise to be added according to the energy of the first audio signal.
  • virtual noise with corresponding energy can be added according to the energy of the first audio signal output by the first sound source, which avoids adding virtual noise with excessive energy and reduces the power consumption of the electronic device.
  • an auxiliary listening device including corresponding functional modules or units for implementing the functions of the first aspect or any one of the designs of the first aspect.
  • This function can be implemented by hardware, or by executing corresponding software by hardware, and the hardware or software includes one or more modules or units corresponding to the above functions.
  • an auxiliary listening device including a processor and a memory.
  • the memory is used to store computer programs or instructions, and the processor is coupled to the memory; when the processor executes the computer program or instructions, the apparatus is made to execute the method in the first aspect or any design of the first aspect.
  • an electronic device is provided, and the electronic device is used to execute the method in the first aspect or any one of the designs of the first aspect.
  • the electronic device may be an earphone (including a wired earphone or a wireless earphone, etc.), a smart phone, a vehicle-mounted device, a wearable device, or the like.
  • the wireless earphones include but are not limited to Bluetooth earphones, etc.
  • the wearable devices can be smart glasses, smart watches, or smart bracelets.
  • a fifth aspect provides a computer-readable storage medium, in which a computer program or instruction is stored; when the computer program or instruction is executed by a device, the device is made to perform the method in the above first aspect or any one of the designs of the first aspect.
  • a sixth aspect provides a computer program product, which includes a computer program or an instruction; when the computer program or instruction is executed by a device, the device is made to execute the method in the first aspect or any one of the designs of the first aspect.
  • FIG. 1 is a schematic diagram of a user selecting listening to a sound provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of the principle of assisted listening provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a coordinate system provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a model of a microphone and a sound source provided by an embodiment of the present application
  • FIG. 7 is a functional schematic diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of HRTF rendering provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of an apparatus provided by an embodiment of the present application.
  • The cocktail party effect refers to a human auditory selection ability: one's attention is focused on the conversation of one individual while other conversations or noises in the background are ignored. The effect reveals the remarkable ability of the human auditory system to follow a conversation in noise.
  • the cocktail party effect is an auditory version of the figure-ground phenomenon: the "figure" is the sound that we notice or that draws our attention, and the "ground" is all other sounds.
  • the cocktail party effect in acoustics refers to the masking effect of the human ear. In the noisy crowd at a cocktail party, two people can converse smoothly: although there is a lot of noise around, each hears the other's voice and seems unable to hear the various noises other than the content of the conversation, because each has placed their focus (this is the selectivity of attention) on the topic of conversation.
  • in a known solution based on deep neural networks (DNNs), the neural activity in the listener's brain can be decoded into a spectrogram that is used to reconstruct the attended speech source.
  • the solution is used for hearing aids, and can enhance playback of audio objects that the user pays attention to or is interested in, so that the audio objects that the user pays attention to or are interested in can be heard more clearly.
  • since this method is based on brainwave technology, it identifies the external sound source signal that the current user is interested in or pays attention to (reconstructed from the brainwave) and enhances and plays the corresponding sound source signal. From a technical point of view, this technology requires speech separation first, followed by matching against the brainwave-decoded signal; the effect depends on the accuracy of the front-end speech signal separation, which is a huge challenge and difficult to achieve.
  • the second solution includes: Step 1, detecting the user's direction of attention or interest, for example determining it based on the user's gaze or the orientation of the user's head.
  • Step 2, focused sound pickup: pick up the audio signal s(n) in the direction of the user's attention; optionally, sound pickup refers to the specific action of collecting audio signals.
  • the method for picking up the audio signals that the user pays attention to may include: picking up all the audio signals around the user and then separating out the audio signals that the user pays attention to; alternatively, only the audio signals that the user pays attention to are picked up, for example using a beamforming method, so that only those signals are acquired.
  • Step 3, binaural playback: the sound in the pickup direction is rendered and played binaurally, combined with the head-related transfer function (HRTF), to improve the sense of presence of s(n).
  • the essence of the above second solution is to make the audio content of the sound source S1 clearer through focused pickup of S1, realizing more effective attention perception of S1 by the user. In essence, however, the signal-to-noise ratio of S2 relative to N is not changed, so S2 may still be perceived, affecting the user's effective attention perception of the sound source S1. For example, the user is talking with object A, but object B is also talking near the user, and object B is noise to the user.
  • the volume or clarity of object A's conversation content is improved by focused sound pickup, but no processing is performed on the content of object B's conversation.
  • if the conversation content of object B involves words that the user pays attention to, it may still attract the user's attention, so that the user cannot concentrate on listening to the conversation content of object A. For example, if the user pays particular attention to "salary increase" and, during the conversation between the user and object A, the content of object B's conversation involves words such as "salary increase", it will attract the user's attention and prevent the user from focusing on object A's conversation.
  • the present application provides an auxiliary listening method and device.
  • the principle of the solution is to superimpose noise (noise, N) on the audio content of the sound source S2 that the user is not concerned about or not interested in, thereby reducing the clarity of the audio content of the sound source S2 and preventing the user from clearly listening to it, so that the user pays more attention to listening to the audio content of the sound source S1, assisting selective listening.
  • noise N can be superimposed on the conversation content of the object B, thereby reducing the signal-to-noise ratio of the conversation content of the object B.
  • the assisted listening method provided in the embodiments of the present application can be applied to electronic devices, including but not limited to earphones, smart phones, in-vehicle devices, or wearable devices (such as smart glasses, smart watches, or wristbands, etc.).
  • the application scenario may be that a button for assisting listening is set in the earphone. When the user activates the button, the earphone can detect the user to determine the sound source of the user's attention.
  • Noise N is added to the audio signal of the sound source that the user does not care about, thereby reducing the signal-to-noise ratio of the audio signal of the unconcerned sound source, so that the user can concentrate more on listening to the content of the sound source that he cares about and assist in selecting listening.
  • an embodiment of the present application provides an electronic device, which can be used to implement the assisted listening method provided by the embodiments of the present application. The electronic device at least includes: a processor 301, a memory 302, at least one speaker 304, at least one microphone 305, a power supply 306, and the like.
  • the memory 302 may store program codes, and the program codes may include program codes for implementing assisted listening.
  • the processor 301 may execute the above program codes to implement the function of assisting listening in the embodiments of the present application.
  • the processor 301 can execute the program code in the memory 302 to realize the following functions: determining, according to the external audio signal collected by the microphone and the detected first direction of the user's attention, the coordinates of the first sound source in a direction the user is not concerned about; determining the first HRTF corresponding to the first sound source according to the coordinates of the first sound source and predetermined HRTFs at different positions; and obtaining the noise signal corresponding to the first sound source according to the first HRTF and the preset virtual noise.
  • the speaker 304 can be used to convert audio electrical signals into sounds and play them.
  • the speaker 304 is used to play the noise signal corresponding to the first sound source and so on.
  • the microphone 305, which may also be referred to as a mic or sound transducer, is used to convert a sound signal into an audio electrical signal.
  • the microphone 305 can collect sound signals in the vicinity of the user and convert them into audio electrical signals.
  • the audio electrical signal is the audio signal in this embodiment of the present application.
  • a power supply 306 can be used to supply power to various components included in the electronic device.
  • the power source 306 may be a battery, such as a rechargeable battery or the like.
  • the electronic device 300 may further include: a sensor 303, a wireless communication module 307, and the like.
  • Sensor 303 may be a proximity light sensor.
  • the processor 301 can determine whether the earphone is worn by the user through the sensor 303 .
  • the processor 301 can use the proximity light sensor to detect whether there is an object near the earphone, so as to determine whether the earphone is worn or the like.
  • the processor 301 may use the proximity light sensor to determine whether the charging box of the earphone is opened, so as to determine whether to control the earphone to be in a pairing state.
  • the earphone may further include a bone conduction sensor, and the bone conduction sensor may acquire the vibration signal of the bones that vibrate when the user speaks; the processor 301 parses the voice signal to realize the control function corresponding to the voice signal.
  • the earphone may further include a touch sensor or a pressure sensor, etc., which are respectively used to detect the user's touch operation and pressing operation on the earphone.
  • the headset may further include a fingerprint sensor for detecting the user's fingerprint, identifying the user's identity, and the like.
  • the wireless communication module 307 is used to establish a wireless connection with other electronic devices, so that the headset can perform data interaction with other electronic devices.
  • the wireless communication module 307 may be a near field communication (near field communication, NFC) module, so that the headset can perform near field communication with other electronic devices having an NFC module.
  • the NFC module can store relevant information about the headset, such as the name, address information or unique identifier of the headset, so that other electronic devices with an NFC module can establish an NFC connection with the headset according to this information and transfer data over that NFC connection.
  • the wireless communication module 307 may also be a Bluetooth module, and the Bluetooth module stores the Bluetooth address of the headset, so that other electronic devices can establish a Bluetooth connection with the headset according to the Bluetooth address and transmit audio data and the like through the Bluetooth connection.
  • the Bluetooth module can support multiple Bluetooth connection types at the same time, such as the serial port profile (SPP) of classic Bluetooth or the generic attribute profile (GATT) of Bluetooth low energy (BLE); there are no restrictions here.
  • the wireless communication module 307 may also be an infrared module or a wireless fidelity (Wi-Fi) module, etc.; the specific implementation of the wireless communication module 307 is not limited herein.
  • only one wireless communication module 307 may be provided, or a plurality of wireless communication modules 307 may be provided as required.
  • two wireless communication modules may be provided in the headset, wherein one wireless communication module is a Bluetooth module, and the other wireless communication module is an NFC module.
  • the earphone can perform data communication through the two wireless communication modules respectively, and the number of the wireless communication modules 307 is not limited here.
  • the structure illustrated in this embodiment does not constitute a specific limitation on the electronic device 300 . It may have more or fewer components than those shown in Figure 3, may combine two or more components, or may have a different configuration of components, and the like.
  • the electronic device 300 may also include components such as an indicator light (which can indicate states such as power), a dust filter (which can be used in conjunction with an earpiece), and the like.
  • the various components shown in Figure 3 may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing or application specific integrated circuits.
  • a flow of an auxiliary listening method is provided, and the flow includes at least:
  • Step 401: The electronic device determines the coordinates of the first sound source according to the external audio signal collected by the microphone and the first direction; the first direction is determined by detecting the user, and the first sound source is a sound source in a direction other than the first direction.
  • the first direction may be a direction in which the user pays attention
  • the first sound source may be a sound source in a direction that the user does not pay attention to.
  • S1 may be the sound source in the direction of the user's attention, and the sound sources S2 and S3 are sound sources in directions the user does not pay attention to.
  • the implementation process of the above step 401 may include: the electronic device determines the coordinates of at least one sound source near the user according to the external audio signal collected by the microphone.
  • the electronic device is provided with a microphone array; the microphone array includes multiple microphones, each microphone separately collects external audio signals, and there is a time delay between the external audio signals collected by different microphones.
  • the electronic device may determine the coordinates of at least one sound source near the user according to the time delay between the external audio signals collected by different microphones.
  • the above-mentioned microphone may be a vector microphone or the like. For example, a sound source S1 and a sound source S2 exist near the user.
  • the microphone array of the electronic device can collect the external sound signal of the user and convert the sound signal into an audio signal, which includes the audio signal corresponding to the sound source S1 and the audio signal corresponding to the sound source S2.
  • the electronic device can determine the coordinates of the sound source S1 and the coordinates of the sound source S2 according to the time delays of the audio signals collected by different microphones in the microphone array.
  • suppose the microphone array is composed of N+1 microphones, denoted $M_0, M_1, \ldots, M_N$.
  • the distance between the sound source S and the i-th microphone is $d_i = \sqrt{(x - x_i)^2 + (y - y_i)^2 + (z - z_i)^2}$, where $d_i$ represents the distance from the sound source S to the i-th microphone, $(x, y, z)$ represents the coordinates of the sound source S, and $(x_i, y_i, z_i)$ represents the coordinates of the i-th microphone.
  • $d_{ij} = d_i - d_j$ represents the difference between the distance $d_i$ from the sound source to the i-th microphone and the distance $d_j$ from the sound source to the j-th microphone; it can be obtained from the measured time delay $\tau_{ij}$ between the two microphones' signals as $d_{ij} = c\,\tau_{ij}$, where c is the speed of sound.
  • the distance between the various microphones is known, and the speed of sound is also known.
  • the specific algorithm may be the maximum likelihood estimation method, or the least square method, etc., which is not limited.
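As an illustration of the localization step above, the following is a minimal sketch of time-difference-of-arrival (TDOA) localization in Python, assuming the pairwise delays have already been measured (for example by cross-correlation) and the microphone coordinates are known; the function names and the least-squares formulation are illustrative, not the patent's prescribed algorithm.

```python
import numpy as np
from scipy.optimize import least_squares

SPEED_OF_SOUND = 343.0  # m/s, approximate value at room temperature

def locate_source(mic_coords, delays_to_ref, ref_index=0):
    """Estimate source coordinates (x, y, z) from time delays (TDOA).

    mic_coords    : (K, 3) array of known microphone positions.
    delays_to_ref : (K,) array of each microphone's delay relative to
                    the reference microphone, in seconds.
    """
    mic_coords = np.asarray(mic_coords, dtype=float)
    # Range differences d_i - d_ref = c * tau_i
    range_diffs = SPEED_OF_SOUND * np.asarray(delays_to_ref, dtype=float)

    def residuals(p):
        d = np.linalg.norm(mic_coords - p, axis=1)  # d_i for candidate p
        return (d - d[ref_index]) - range_diffs

    # Least-squares fit (the patent also mentions maximum likelihood),
    # initialised at the centroid of the array.
    result = least_squares(residuals, x0=mic_coords.mean(axis=0))
    return result.x
```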
  • through the above process, the coordinates of at least one sound source near the user can be obtained. The following describes how to detect the first direction that the user pays attention to, and how to screen out, according to that first direction, the first sound source that the user does not pay attention to from the at least one sound source.
  • a user may wear an electronic device.
  • the electronic device is an earphone
  • the user can wear the electronic device at the ear position.
  • the electronic device can determine the first direction that the user pays attention to.
  • the electronic device detects the gaze direction of the user's eyes, and uses the gaze direction of the user's eyes as the first direction that the user pays attention to.
  • for example, an inertial measurement unit (IMU) may be deployed in the electronic device; the electronic device may use the IMU to determine the orientation of the user's head, and determine the user's eye gaze direction according to the head orientation.
  • for example, if the head orientation is straight ahead, the gaze direction of the user's eyes, that is, the first direction the user pays attention to, is straight ahead, and so on.
  • a camera may be deployed in the electronic device, and the camera may collect an image of the user's head, and determine the gaze direction of the user's eyes according to the image of the user's head collected by the camera.
  • an electroencephalogram detection sensor may be deployed in the electronic device, and the electroencephalogram detection sensor may detect the difference in the current between the two ears of the user, and determine the direction of the user's gaze according to the corresponding relationship between the difference in the current between the two ears and the gaze direction.
  • the electronic device may determine the coordinates of the first sound source among the coordinates of at least one sound source near the user according to the first direction. This process can be implemented in two specific ways:
  • in the first way, the electronic device determines, according to the above first direction, the sound source that the user pays attention to among the at least one sound source near the user; the remaining sound sources are the first sound sources that the user does not pay attention to. For example, there are 5 sound sources near the user; according to the first direction the user pays attention to, it is determined that sound source A among the 5 sound sources is the sound source that the user pays attention to, and the remaining 4 sound sources are the first sound sources that the user does not pay attention to.
  • the first sound source that the user does not pay attention to may include one sound source, or may include multiple sound sources, etc., which is not limited.
  • in the second way, the electronic device directly determines, according to the above first direction, a first sound source that the user does not pay attention to among the at least one sound source near the user.
  • the electronic device may analyze the coordinates of at least one sound source near the user, and determine the positional relationship between each sound source and the user.
  • the coordinates of the sound source may be coordinates relative to the user.
  • a coordinate system is established.
  • the coordinate system may be a three-dimensional coordinate system. The origin of the coordinate system is the user's head position, the X axis represents the user's left/right direction, the Y axis represents the user's front/rear direction, and the Z axis represents the user's up/down direction.
  • the positive direction of the X axis is directly to the user's right, and the negative direction of the X axis is directly to the user's left; the positive direction of the Y axis is directly in front of the user, and the negative direction of the Y axis is directly behind the user; the positive direction of the Z axis is directly above the user, and the negative direction of the Z axis is directly below the user. It can be understood that the positive directions of the X, Y and Z axes all represent coordinate values greater than 0, and the negative directions all represent coordinate values less than 0.
  • the electronic device can determine the positional relationship between each sound source and the user by analyzing the above-mentioned coordinates of each sound source relative to the user.
  • for example, the coordinates of a sound source relative to the user are (1, 2, 0), which means that the position of the sound source relative to the user is: 1 m directly to the user's right, 2 m directly in front of the user, and on the same plane (at the same height) as the user.
  • the position of the sound source can be accurately located, so as to determine the azimuth relationship between the position and the user.
  • the deviation between each sound source and the first direction is determined.
  • the detected first direction that the user pays attention to is 15 degrees away from the front of the user. If the direction relationship between a certain sound source and the user is 20 degrees away from the front of the user, the deviation between the sound source and the first direction is 5 degrees.
  • a sound source whose deviation from the first direction is greater than a threshold is selected, and the sound source is the first sound source that the user does not pay attention to.
  • alternatively, a sound source whose deviation from the first direction is less than or equal to the threshold is selected, and that sound source is the sound source that the user is concerned about; after that, among all sound sources near the user, the sound sources other than those the user pays attention to are the sound sources that the user does not pay attention to.
  • after the deviation between each sound source and the first direction is determined, if the deviation of a sound source from the first direction is greater than the threshold, that sound source is a sound source that the user does not pay attention to; if the deviation of the sound source from the first direction is less than or equal to the threshold, the sound source is considered to be a sound source that the user pays attention to.
  • the sound source S2 is a sound source that the user does not pay attention to, or the like.
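The screening rule described above (deviation from the first direction compared against a threshold) can be sketched as follows; the coordinate convention matches the one used in this document (user at the origin), while the 10-degree threshold and the helper names are illustrative assumptions.

```python
import numpy as np

def angular_deviation(source_coords, first_direction):
    """Angle in degrees between the direction from the user (origin)
    to a sound source and the first (attention) direction."""
    v = np.asarray(source_coords, dtype=float)
    g = np.asarray(first_direction, dtype=float)
    cos_a = np.dot(v, g) / (np.linalg.norm(v) * np.linalg.norm(g))
    return np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))

def split_sources(sources, first_direction, threshold_deg=10.0):
    """Deviation > threshold: unattended; otherwise attended."""
    attended, unattended = [], []
    for coords in sources:
        if angular_deviation(coords, first_direction) > threshold_deg:
            unattended.append(coords)
        else:
            attended.append(coords)
    return attended, unattended

# Example: gaze straight ahead (+Y); S1 dead ahead stays attended,
# S2 off to the front-right (about 27 degrees) is marked unattended.
attended, unattended = split_sources([(0, 2, 0), (1, 2, 0)], (0, 1, 0))
```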
  • the above threshold may be set when the electronic device leaves the factory, or may be synchronized or notified to the electronic device through the server of the electronic device later, etc., which is not limited. It should be noted that, in the description of the embodiments of the present application, “concern” and “interested” are not distinguished, and can be replaced with each other.
  • the above-mentioned sound source that the user pays attention to can also be replaced with a sound source that the user is interested in; the sound source that the user does not pay attention to can also be replaced with a sound source that the user is not interested in, and so on.
  • the sound source that the user pays attention to or the sound source that the user is interested in may be the sound source related to the activity that the user is currently engaged in. For example, if the user is currently engaged in watching TV, and the TV plays an audio signal as a sound source, the TV can be a sound source that the user is interested in or concerned about.
  • the user's attention or the sound source that the user is interested in may be a sound source whose deviation from the user's eye gaze direction is less than or equal to a threshold, or the like.
  • the sound source that the user does not pay attention to or is not interested in may be a sound source that is not related to the activity currently engaged by the user, and the sound source is noise to the user.
  • the user is watching TV, and watching TV is the current activity of the user, and listening to music is not the current activity of the user. If there is a music player playing music near the TV, the music player may be a sound source that the user does not pay attention to.
  • the sound source that the user does not pay attention to may be a sound source whose deviation from the user's eye gaze direction is greater than a threshold, or the like.
  • Step 402 The electronic device determines a first HRTF corresponding to the first sound source according to the coordinates of the first sound source and a preset HRTF.
  • the above-mentioned first HRTF may include two HRTFs, as shown in FIG. 7 , which may correspond to the HRTF of the left ear and the HRTF of the right ear respectively.
  • Step 403 The electronic device obtains a noise signal corresponding to the first sound source according to the first HRTF and the preset virtual noise, and plays the noise signal.
  • the electronic device may multiply the first HRTF by the preset virtual noise in the frequency domain to obtain the noise signal corresponding to the first sound source.
  • for example, the first HRTF includes the HRTF of the left ear and the HRTF of the right ear. The HRTF of the left ear is multiplied by the preset virtual noise in the frequency domain to obtain the noise signal of the left ear, and the HRTF of the right ear is multiplied by the preset virtual noise in the frequency domain to obtain the noise signal of the right ear.
  • the noise signal of the left ear and the noise signal of the right ear can be called the noise signal of both ears.
  • HRTF is an algorithm for sound localization, which corresponds to the head-related impulse response (HRIR) in the time domain.
  • the above-mentioned process of multiplying the HRTF and the preset virtual noise in the frequency domain is embodied in the time domain as a process of convolving the above-mentioned HRIR and the preset virtual noise.
  • to facilitate understanding, the HRTF is first introduced.
  • the concept and meaning of HRTF are as follows: humans have only two ears, yet they can localize sounds in three-dimensional space, thanks to the human ear's analysis system for sound signals. The HRTF can simulate this analysis system. The HRTF essentially contains the spatial orientation information of the sound source, and sound sources in different spatial orientations correspond to different HRTFs. For any monaural audio signal, after multiplying it by the HRTF of the left ear and the HRTF of the right ear respectively in the frequency domain, the corresponding binaural audio signals can be obtained, and 3-dimensional audio can be experienced by playing them through headphones.
  • in this embodiment of the present application, the HRTFs of multiple positions may be pre-stored, and the first HRTF corresponding to the first sound source can be obtained using the pre-stored HRTFs of multiple positions (for example, by performing an interpolation operation on them). After that, the first HRTF is multiplied by the preset virtual noise in the frequency domain to obtain a noise signal.
  • the HRTF is related to the position, and the above-mentioned first HRTF is obtained according to the coordinates of the first sound source.
  • the above virtual noise signal can be superimposed on the audio signal of the first sound source, reducing the signal-to-noise ratio of the audio signal of the first sound source, which can make the user focus more on listening to the audio signal of the sound source of interest, assisting the user's listening.
  • for example, the first HRTF is obtained according to the coordinates of the sound source S2, and then the first HRTF is multiplied by the preset virtual noise in the frequency domain to obtain a noise signal.
  • by playing the noise signal, the above noise signal is superimposed on the sound source S2, thereby reducing the signal-to-noise ratio of the audio signal of the sound source S2, so that the user can listen to the audio signal of the sound source S1 more attentively.
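The example above can be sketched in code. Since frequency-domain multiplication by the HRTF equals time-domain convolution with the HRIR, the sketch below renders the virtual noise binaurally by convolving it with left- and right-ear HRIRs; the HRIR arrays here are random placeholders standing in for a measured HRTF set.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural_noise(noise, hrir_left, hrir_right):
    """Render mono virtual noise at the unattended source's position:
    HRTF multiplication in the frequency domain is realised here as
    convolution with the corresponding HRIRs in the time domain."""
    left = fftconvolve(noise, hrir_left)    # left-ear noise signal
    right = fftconvolve(noise, hrir_right)  # right-ear noise signal
    return np.stack([left, right])          # (2, samples) for playback

# Placeholder data: 1 s of white noise at 48 kHz and 256-tap "HRIRs".
fs = 48_000
noise = np.random.randn(fs)
hrir_l, hrir_r = np.random.randn(256) * 0.01, np.random.randn(256) * 0.01
binaural_noise = render_binaural_noise(noise, hrir_l, hrir_r)
```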
  • the process of determining the first HRTF corresponding to the first sound source according to the coordinates of the first sound source and the preset HRTF in the above step 402 includes but is not limited to the following two implementations:
  • Method 1: accurately locate the coordinates of each sound source relative to the user, and store a large number of HRTFs in advance.
  • the electronic device precisely locates the position of each sound source relative to the user.
  • for example, the coordinates of a sound source relative to the user are (1, 2, 0), which means the position of the sound source is: 1 meter to the right of the user, 2 meters in front of the user, and on the same horizontal plane as the user. Since the HRTF is related to position, in this case the electronic device may need to pre-store a larger number of HRTFs, so as to calculate the HRTF corresponding to the position of the sound source.
  • the advantage of this method is that a more accurate HRTF can be determined, and the noise signal determined according to the HRTF can be superimposed on the unconcerned sound source more accurately.
  • Method 2: roughly locate the coordinates of each sound source relative to the user, and store a small number of HRTFs in advance.
  • the electronic device can roughly locate the direction of the sound source relative to the user, instead of accurately locating the specific position of each sound source.
  • the determined coordinates of a certain sound source relative to the user may be (1, 1, 0), which means that the sound source is located in the right front direction of the user, and the specific position of the right front direction is no longer calculated.
  • for example, the electronic device can store HRTFs for four directions, such as front, rear, left, and right, and calculate the HRTF for the front-right direction from the HRTFs of these four directions (see the sketch below). The advantage of this is that the storage space of the electronic device can be reduced, the calculation process can be simplified, and the power consumption of the electronic device can be saved.
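One plausible way to realise this coarse variant is to blend the four stored HRIR pairs with weights based on how well each stored direction aligns with the source direction; this weighting scheme is an assumption for illustration, not the interpolation the patent mandates.

```python
import numpy as np

# Pre-stored binaural HRIR pairs for four coarse directions
# (random placeholders; a real device would ship measured data).
# Unit vectors: +X is the user's right, +Y is straight ahead.
STORED_HRIRS = {
    "front": ((0.0, 1.0, 0.0), np.random.randn(2, 256)),
    "rear":  ((0.0, -1.0, 0.0), np.random.randn(2, 256)),
    "left":  ((-1.0, 0.0, 0.0), np.random.randn(2, 256)),
    "right": ((1.0, 0.0, 0.0), np.random.randn(2, 256)),
}

def interpolate_hrir(source_coords):
    """Blend stored HRIRs, weighting each stored direction by its
    (positive) alignment with the coarse source direction."""
    v = np.asarray(source_coords, dtype=float)
    v /= np.linalg.norm(v)
    weights, hrirs = [], []
    for direction, hrir in STORED_HRIRS.values():
        w = max(float(np.dot(v, direction)), 0.0)
        if w > 0.0:
            weights.append(w)
            hrirs.append(hrir)
    weights = np.asarray(weights) / np.sum(weights)
    return sum(w * h for w, h in zip(weights, hrirs))  # (2, taps)

# A source at (1, 1, 0), i.e. front-right, blends "front" and "right".
hrir_front_right = interpolate_hrir((1, 1, 0))
```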
  • the virtual noise added in the foregoing step 403 may be white noise.
  • the virtual noise added in the above step 403 may also be noise that matches the content of the first audio signal of the first sound source.
  • the electronic device may analyze the first audio signal of the first sound source, determine the content of the first audio signal, and determine the type of virtual noise to be added according to the content of the first audio signal. For example, if the content of the first audio signal is human speech, the electronic device determines that the type of virtual noise to be added is multi-person conversation babble noise or the like. In a possible implementation manner, the electronic device may detect whether the content of the first audio signal contains a human voice.
  • the virtual noise added to the first audio signal may be multi-person conversation babble noise or the like.
  • a voice activity detection (VAD) technology can be used, with detection methods such as short-time energy (STE) and zero-crossing count (ZCC).
  • the STE and ZCC of the first audio signal of the first sound source may be detected. Since the STE of the speech segment is relatively large and the ZCC is relatively small, the STE of the non-speech segment is relatively small and the ZCC is relatively large. This is mainly due to the fact that most of the energy of the speech signal is contained in the low frequency band, while the noise signal usually has less energy and contains information in the higher frequency band. Therefore, a certain threshold can be set. When the STE of the first audio signal of the first sound source is greater than or equal to the first threshold, and the ZCC is less than or equal to the second threshold, it can be considered that the first audio signal of the first sound source includes human voice.
  • the STE of the first audio signal of the first sound source is less than the first threshold and the ZCC is greater than the second threshold, it may be considered that the first audio signal of the first sound source does not include human speech.
  • ZCC refers to the rate of sign changes of a signal, that is, the number of times a frame of the speech time-domain signal crosses the time axis. The calculation method is to shift all the samples in the frame by one, multiply corresponding points, and note that a negative product indicates a zero crossing at that point; counting the negative products in the frame gives the zero-crossing count of the frame.
  • STE refers to the energy of a frame of speech signal.
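A minimal frame-wise sketch of the STE/ZCC voice detection just described follows; the frame length and the two thresholds are illustrative and would need tuning against real signal levels.

```python
import numpy as np

def frame_ste_zcc(frame):
    """Short-time energy and zero-crossing count of one frame."""
    frame = frame.astype(float)
    ste = np.sum(frame ** 2)
    # A negative product of adjacent samples marks a time-axis crossing.
    zcc = int(np.sum(frame[:-1] * frame[1:] < 0))
    return ste, zcc

def contains_speech(signal, frame_len=512, ste_thresh=1.0, zcc_thresh=60):
    """Flag speech when some frame has STE >= first threshold and
    ZCC <= second threshold (speech energy sits in the low band)."""
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        ste, zcc = frame_ste_zcc(signal[start:start + frame_len])
        if ste >= ste_thresh and zcc <= zcc_thresh:
            return True
    return False
```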
  • the electronic device may also determine the energy of the first audio signal; determine the energy of the virtual noise to be added according to the energy of the first audio signal, and the like.
  • the essence of the method is to control the signal-to-noise ratio of the first audio signal after adding virtual noise.
  • for example, the ratio of the energy of the added virtual noise to the energy of the first audio signal is preset to 50%.
  • if the energy of the first audio signal is W1,
  • the energy of the virtual noise to be added may be half of the energy of the first audio signal, that is, 0.5 × W1.
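Matching the virtual-noise energy to a preset fraction of the first audio signal's energy, as in the 0.5 × W1 example above, amounts to a single gain computation; this helper is an illustrative sketch.

```python
import numpy as np

def scale_noise_to_ratio(first_audio, noise, energy_ratio=0.5):
    """Scale `noise` so its energy equals `energy_ratio` times the
    energy W1 of the separated first audio signal (0.5 reproduces
    the 0.5 * W1 example above)."""
    w1 = np.sum(np.asarray(first_audio, dtype=float) ** 2)  # energy W1
    wn = np.sum(np.asarray(noise, dtype=float) ** 2)
    return np.asarray(noise, dtype=float) * np.sqrt(energy_ratio * w1 / wn)
```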
  • the external audio signal collected by the microphone may be a mixed audio signal, and the mixed audio signal includes audio signals output by multiple sound sources.
  • the electronic device can separate the mixed audio signal collected by the microphone to obtain the first audio signal corresponding to the first sound source. Afterwards, the above method is used to determine the audio content in the first audio signal, and/or the energy of the first audio signal, and the like.
  • a frequency-domain speech separation algorithm can separate the mixed audio signal collected by the microphone into a plurality of independent audio signals, and the plurality of independent audio signals include the first audio signal corresponding to the first sound source.
  • suppose there are N sound sources and M microphones, let $s(n)$ and $x(n)$ denote the source and microphone signal vectors, and let the length of the mixing filter be P. The convolutive mixing process can be expressed as $x(n) = H(n) * s(n) = \sum_{p=0}^{P-1} H(p)\, s(n-p)$, where the mixing network $H(n)$ is an $M \times N$ matrix sequence composed of the impulse responses of the mixing filters.
  • let the length of the separation filter be L; the separation process is $y(n) = W(n) * x(n)$, where the separation network $W(n)$ is an $N \times M$ matrix sequence composed of the impulse responses of the separation filters, and * represents the matrix convolution operation.
  • the separation network $W(n)$ can be obtained by a frequency-domain blind source separation algorithm.
  • applying an L-point short-time Fourier transform (STFT), the time-domain convolutions become per-frequency multiplications: $X(m, f) = H(f)\, S(m, f)$ and $Y(m, f) = W(f)\, X(m, f)$, where m is the frame index obtained by down-sampling the time index n by L points, $X(m, f)$ and $Y(m, f)$ are the STFTs of $x(n)$ and $y(n)$ respectively, $H(f)$ and $W(f)$ are the Fourier-transform forms of $H(n)$ and $W(n)$ respectively, and $f \in [f_0, \ldots, f_{L/2}]$ is the frequency.
  • the $Y(m, f)$ obtained after blind source separation is inverse-transformed back to the time domain, and the estimated sound source signals $y_1(n), \ldots, y_N(n)$ are obtained.
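The STFT-domain model above can be shown in code. The sketch applies a given per-frequency separation network W(f) to the microphone STFTs and inverse-transforms the result; estimating W(f) itself (the blind source separation step, including permutation alignment across bins) is outside this sketch, so W is assumed to be already available.

```python
import numpy as np
from scipy.signal import stft, istft

def apply_separation(mic_signals, W, fs=16_000, nperseg=512):
    """Apply Y(m, f) = W(f) X(m, f) per frequency bin.

    mic_signals : (M, samples) mixed time-domain signals x(n)
    W           : (F, N, M) separation matrices, one per bin, assumed
                  already estimated and permutation-aligned.
    Returns       (N, samples) estimated sources y_1(n), ..., y_N(n).
    """
    _, _, X = stft(mic_signals, fs=fs, nperseg=nperseg)  # (M, F, frames)
    X = X.transpose(1, 0, 2)                             # (F, M, frames)
    Y = np.einsum('fnm,fmt->fnt', W, X)                  # per-bin multiply
    Y = Y.transpose(1, 0, 2)                             # (N, F, frames)
    _, y = istft(Y, fs=fs, nperseg=nperseg)
    return y
```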
  • one or more sound sources may be included in the first sound source that the user does not pay attention to.
  • each sound source is processed according to the above steps 402 and 403, that is, the HRTF corresponding to each sound source is obtained, and each HRTF is multiplied by the preset virtual noise in the frequency domain to obtain the noise signal corresponding to each sound source.
  • the above noise signal can be superimposed on the corresponding sound source, thereby reducing the signal-to-noise ratio of the sound source that the user does not pay attention to, allowing the user to focus more on listening to the audio signal of the sound source of interest, assisting selection Listen.
  • an embodiment of the present application provides an electronic device, the electronic device is used to implement the method for assisting listening provided by the embodiment of the present application, and the electronic device implements at least the following two functions:
  • the modules implementing detection and identification functions in the electronic device may include an environmental information acquisition module, an environmental information processing module and an attention detection module.
  • the environmental information collection module is used to collect audio signals in the environment, and may be environmental information collection sensors such as microphones and cameras deployed in electronic devices.
  • the environment information processing module can determine the position of the sound source corresponding to the audio signal based on the audio signal in the environment collected above.
  • the attention detection module is used to determine the direction of the user's attention. For example, the orientation of the user's head, or the gaze direction of the user's eyes, etc., may be an IMU, a camera, or the like deployed on the electronic device.
  • the modules that implement the functional processing in the electronic device may include an audio processing module and an audio playing module.
  • the audio processing module for adding noise to the audio signal that the user does not pay attention to, may be a processor or the like deployed in the electronic device.
  • the audio playback module is used to play the above-mentioned added noise signal, which can be a speaker or the like deployed in the electronic device.
  • the embodiment of the present application provides a specific example of a method for assisting listening, including at least the following steps:
  • Step 1: Orientation recognition.
  • Detect the user's focus direction. How to detect the user's attention direction includes but is not limited to the following two methods:
  • in the first way, a camera is deployed on the electronic device, and the camera is used to detect the gaze direction of the user wearing the electronic device; the detected gaze direction of the user is the user's attention direction.
  • in the second way, an electroencephalogram detection sensor is deployed on the electronic device; the electroencephalogram detection sensor can be used to detect the current difference between the user's ears, and the user's gaze direction is determined based on the current differences measured for different gaze directions. Likewise, the user's gaze direction is taken as the user's focus direction.
  • Step 2: Detect and determine sound sources in directions other than the user's attention direction. For example, a microphone array or the like is deployed on the electronic device, and the coordinates of all sound sources near the user can be detected using the microphone array. According to the identified direction of the user's attention, the sound sources that the user does not pay attention to are selected from all the sound sources near the user. Optionally, there may be one or more such sound sources, which is not limited. For example, the number of sound sources that the user does not pay attention to may be n, and the coordinates of the n unattended sound sources may be p1(x1, y1, z1), ..., pn(xn, yn, zn), where n is a positive integer greater than or equal to 1.
  • Step 3 Binaural rendering, as shown in Figure 8.
  • based on the coordinates of the sound sources that the user does not pay attention to, obtain the HRTFs of both ears.
  • HRTFs of multiple locations may be pre-stored in the electronic device.
  • the HRTF corresponding to the coordinates of the sound source that the user does not pay attention to can be obtained by performing an interpolation operation on the above-mentioned multiple HRTFs.
  • for example, if the coordinates of the n unattended sound sources are respectively p1(x1, y1, z1), ..., pn(xn, yn, zn), then in the embodiment of the present application, the binaural HRTFs of each of the above n unattended sound sources can be obtained, and so on.
  • the playback mode may be an acoustic mode, a bone conduction mode, or the like, which is not limited.
  • the above-mentioned virtual noise may be noise audio files, which may be stored in the electronic device or on the cloud.
  • the electronic device downloads these noise audio files from the cloud to the electronic device and renders them when needed.
  • the above n non-attention sound sources correspond to n binaural HRTFs
  • n binaural HRTFs can correspond to n virtual noises
  • the n binaural HRTFs and the corresponding n virtual noises can be processed to obtain: n binaural noise signals, and play the n binaural noise signals.
  • the above n virtual noises may be respectively n1(n), . . . , nn(n), and the n virtual noises may be the same or different.
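Tying step 3 together, the following sketch renders one binaural noise signal per unattended source and overlays the n results into a single two-channel stream; it reuses the illustrative helpers interpolate_hrir() and render_binaural_noise() from the sketches above, which are assumptions rather than patent-defined APIs.

```python
import numpy as np

def render_all_noise(unattended_coords, noises):
    """For each of the n unattended sources p1..pn, render its virtual
    noise n1(n)..nn(n) with that source's binaural HRIRs, then sum the
    n binaural noise signals into one playback stream."""
    rendered = []
    for coords, noise in zip(unattended_coords, noises):
        hrir = interpolate_hrir(coords)          # (2, taps) left/right
        rendered.append(render_binaural_noise(noise, hrir[0], hrir[1]))
    length = max(r.shape[1] for r in rendered)
    mix = np.zeros((2, length))
    for r in rendered:
        mix[:, :r.shape[1]] += r                 # overlay the n signals
    return mix
```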
  • the methods provided by the embodiments of the present application are introduced from the perspective of an electronic device as an execution subject.
  • the electronic device may include a hardware structure and/or software modules, and implement the above functions in the form of a hardware structure, a software module, or a hardware structure plus a software module. Whether one of the above functions is performed in the form of a hardware structure, a software module, or a hardware structure plus a software module depends on the specific application and design constraints of the technical solution.
  • an embodiment of the present application provides an auxiliary listening device, which at least includes a processing unit 901 and a playback unit 902 .
  • the processing unit 901 is configured to determine the coordinates of the first sound source according to the external audio signal collected by the microphone and the first direction, where the first direction is the direction determined by detecting the user, and the first sound source is the sound source in other directions outside the first direction; the processing unit 901 is further configured to determine the first HRTF corresponding to the coordinates of the first sound source according to the coordinates of the first sound source and a preset HRTF; The processing unit 901 is further configured to obtain a noise signal corresponding to the first sound source according to the first HRTF and a preset virtual noise; the playing unit 902 is configured to play the noise signal.
  • the determining the coordinates of the first sound source according to the external audio signal collected by the microphone and the first direction includes: determining at least one near the user according to the external audio signal collected by the microphone The coordinates of a sound source; the user is detected to determine the first direction; according to the first direction, the coordinates of the first sound source are determined among the coordinates of at least one sound source near the user .
  • the determining the coordinates of at least one sound source near the user according to the external audio signal collected by the microphone includes: the number of microphones is multiple, and each microphone separately collects the external audio signal , there is a delay between the external audio signals collected by different microphones; the coordinates of at least one sound source near the user are determined according to the time delays of the external audio signals collected by different microphones.
  • the detecting the user and determining the first direction includes: detecting the user's gaze direction; or detecting the binaural current difference of the user and determining the user's gaze direction according to the correspondence between the binaural current difference and the gaze direction, the user's gaze direction being the first direction.
  • the determining, according to the first direction, the coordinates of the first sound source among the coordinates of at least one sound source near the user includes: analyzing the coordinates of the at least one sound source near the user to determine the directional relationship between each sound source and the user; determining the deviation between each sound source and the first direction according to the first direction and the directional relationship between each sound source and the user; and selecting, among the at least one sound source near the user, a sound source whose deviation from the first direction is greater than a threshold, the coordinates of that sound source being the coordinates of the first sound source.
  • the external audio signal collected by the microphone may be a mixed audio signal, and the mixed audio signal includes audio signals output by multiple sound sources; the processing unit 901 is further configured to separate the collected external audio signal to obtain the first audio signal output by the first sound source.
  • the processing unit 901 is further configured to: analyze the separated first audio signal to determine the content of the first audio signal; and determine, according to the content of the first audio signal, the type of virtual noise that needs to be added.
  • the content of the first audio signal is human conversation sound
  • the type of virtual noise that needs to be added is multi-person conversation babble noise
  • the processing unit 901 is further configured to: determine the energy of the separated first audio signal; and determine the energy of the virtual noise to be added according to the energy of the first audio signal.
  • the embodiments of the present application further provide a computer-readable storage medium, including a program, and when the program is executed by a processor, the methods in the above method embodiments are executed.
  • a computer program product comprising computer program code, when the computer program code is run on a computer, causes the computer to implement the methods in the above method embodiments.
  • a chip is provided, comprising: a processor coupled with a memory, the memory being used for storing a program or an instruction; when the program or instruction is executed by the processor, the apparatus is caused to perform the methods in the above method embodiments.
  • At least one (item) of a, b, or c can represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple.
  • words such as “first” and “second” are used to distinguish the same or similar items with basically the same function and effect. Those skilled in the art can understand that the words “first”, “second” and the like do not limit the quantity and execution order, and the words “first”, “second” and the like are not necessarily different.
  • the processor may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application.
  • a general purpose processor may be a microprocessor or any conventional processor or the like.
  • the steps of the methods disclosed in conjunction with the embodiments of the present application may be directly embodied as being executed by a hardware processor, or as being executed by a combination of hardware and software modules in the processor.
  • the memory may be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or a volatile memory, such as a random-access memory (RAM).
  • the memory may be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • the memory in this embodiment of the present application may also be a circuit or any other device capable of implementing a storage function, for storing program instructions and/or data.
  • the methods provided in the embodiments of the present application may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • when implemented in software, the methods can be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, network equipment, user equipment, or other programmable apparatus.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, or microwave).
  • the computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media.
  • the usable media may be magnetic media (e.g., floppy disks, hard disks, or magnetic tapes), optical media (e.g., digital video discs (DVD)), or semiconductor media (e.g., SSDs), and the like.
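
The time-delay localization in the implementations above can be illustrated with a short sketch. The following Python fragment is a minimal illustration only, not the claimed implementation: it assumes a simple two-microphone far-field model, and the names (estimate_delay, direction_of_arrival, mic_distance) are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C

def estimate_delay(sig_a, sig_b, fs):
    """Delay (seconds) of sig_a relative to sig_b, from the cross-correlation peak.

    A positive result means sig_a lags sig_b."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = int(np.argmax(corr)) - (len(sig_b) - 1)  # peak position -> lag in samples
    return lag / fs

def direction_of_arrival(sig_a, sig_b, fs, mic_distance):
    """Far-field direction of arrival for one microphone pair: theta = arcsin(c*tau/d)."""
    tau = estimate_delay(sig_a, sig_b, fs)
    sin_theta = np.clip(SPEED_OF_SOUND * tau / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))  # degrees off the pair's broadside
```

With two or more microphone pairs, the per-pair delays or angles can be combined (for example by a least-squares intersection) to obtain the two- or three-dimensional coordinates of each nearby sound source.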
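The gaze-based selection of the first sound source can be sketched in the same spirit. The calibration table mapping binaural current difference to gaze azimuth is a hypothetical placeholder (the text only states that such a correspondence exists), and the 30° threshold is an assumed value.

```python
import numpy as np

# Hypothetical calibration: binaural current difference (µA) -> gaze azimuth (degrees).
CAL_CURRENT_UA = np.array([-40.0, -20.0, 0.0, 20.0, 40.0])
CAL_AZIMUTH_DEG = np.array([-60.0, -30.0, 0.0, 30.0, 60.0])

def gaze_azimuth(current_diff_ua):
    """Map a measured binaural current difference to a gaze azimuth by interpolation."""
    return float(np.interp(current_diff_ua, CAL_CURRENT_UA, CAL_AZIMUTH_DEG))

def off_gaze_sources(source_coords, gaze_deg, threshold_deg=30.0):
    """Keep sources whose deviation from the gaze direction exceeds the threshold.

    source_coords: (x, y) positions in a head-centred frame, user at the origin,
    +y pointing straight ahead. Returned sources are candidate 'first sound sources'."""
    kept = []
    for x, y in source_coords:
        azimuth = np.degrees(np.arctan2(x, y))  # source direction as seen by the user
        deviation = abs((azimuth - gaze_deg + 180.0) % 360.0 - 180.0)  # wrapped to [0, 180]
        if deviation > threshold_deg:
            kept.append((x, y))
    return kept
```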
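The text does not fix a separation algorithm. As one simple stand-in, the sketch below applies a delay-and-sum beamformer steered by the per-microphone delays of the target source (known from the localization step); this emphasizes the target and only attenuates, rather than fully removes, the other sources.

```python
import numpy as np

def delay_and_sum(mic_signals, delays_s, fs):
    """Crude extraction of one source from a multi-microphone mixture.

    mic_signals: array of shape (n_mics, n_samples), synchronized recordings.
    delays_s:    per-microphone arrival delays (s) of the target source.
    """
    aligned = np.empty_like(mic_signals, dtype=float)
    for m in range(mic_signals.shape[0]):
        shift = int(round(delays_s[m] * fs))
        # Advance each channel to cancel its delay; np.roll wraps at the edges,
        # which is acceptable for a sketch but would be zero-padded in practice.
        aligned[m] = np.roll(mic_signals[m], -shift)
    return aligned.mean(axis=0)  # in-phase target adds up, other sources partially cancel
```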
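Finally, the content-dependent choice of virtual noise and the energy matching can be combined in one fragment. The content classifier itself is left abstract (the text does not specify one), the noise_library dict is hypothetical, and RMS is used here as the energy measure.

```python
import numpy as np

# Mapping from detected content type to virtual-noise class. The speech -> babble
# pair follows the example in the text; the other entries are illustrative assumptions.
NOISE_FOR_CONTENT = {"speech": "babble", "music": "pink", "other": "white"}

def rms(x):
    """Root-mean-square value, used here as the energy measure."""
    return float(np.sqrt(np.mean(np.square(x))))

def masking_noise(first_audio, content_type, noise_library):
    """Choose a virtual noise by content type and scale it to the target signal's energy.

    noise_library: hypothetical dict mapping noise class -> template noise array."""
    template = noise_library[NOISE_FOR_CONTENT.get(content_type, "other")]
    noise = np.array(template[: len(first_audio)], dtype=float)
    noise_energy = rms(noise)
    if noise_energy > 0.0:
        noise *= rms(first_audio) / noise_energy  # match the separated signal's energy
    return noise
```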

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present invention relates to a listening assistance method and apparatus, capable of helping a user select a sound to listen to. The method comprises: determining the coordinates of a first sound source according to an external audio signal collected by a microphone and according to a first direction, the first direction being a direction determined by performing detection on a user, and the first sound source being a sound source in a direction other than the first direction; determining, according to the coordinates of the first sound source and a preset head-related transfer function (HRTF), a first HRTF corresponding to the first sound source; and obtaining, according to the first HRTF and a preset virtual noise, a noise signal corresponding to the first sound source, and then playing the noise signal.
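
As a rough sketch of the rendering step in this abstract: a mono virtual noise can be convolved with the left and right head-related impulse responses selected for the coordinates of the first sound source. The HRIR lookup is assumed here and stands in for the preset HRTF of the method.

```python
import numpy as np

def render_noise_at(noise, hrir_left, hrir_right):
    """Binauralize a mono virtual noise so it is perceived at the first source's position.

    hrir_left / hrir_right: head-related impulse responses for that position, assumed
    to be looked up from a preset HRTF database by (azimuth, elevation)."""
    left = np.convolve(noise, hrir_left)
    right = np.convolve(noise, hrir_right)
    return np.stack([left, right])  # 2-channel signal for earphone playback
```
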
PCT/CN2021/078222 2021-02-26 2021-02-26 Listening assistance method and apparatus WO2022178852A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180004382.3A CN115250646A (zh) 2021-02-26 2021-02-26 Listening assistance method and apparatus
PCT/CN2021/078222 WO2022178852A1 (fr) 2021-02-26 2021-02-26 Listening assistance method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/078222 WO2022178852A1 (fr) 2021-02-26 2021-02-26 Listening assistance method and apparatus

Publications (1)

Publication Number Publication Date
WO2022178852A1 true WO2022178852A1 (fr) 2022-09-01

Family

ID=83047666

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/078222 WO2022178852A1 (fr) 2021-02-26 2021-02-26 Procédé et appareil d'aide à l'écoute

Country Status (2)

Country Link
CN (1) CN115250646A (fr)
WO (1) WO2022178852A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160162254A1 (en) * 2014-12-05 2016-06-09 Stages Pcs, Llc Communication system for establishing and providing preferred audio
CN108601519A (zh) * 2016-02-02 2018-09-28 eBay Inc. Personalized real-time audio processing
CN108810719A (zh) * 2018-08-29 2018-11-13 Goertek Technology Co., Ltd. Noise reduction method, neckband earphone and storage medium
WO2020159557A1 (fr) * 2019-01-29 2020-08-06 Facebook Technologies, Llc Generating a modified audio experience for an audio system

Also Published As

Publication number Publication date
CN115250646A (zh) 2022-10-28

Similar Documents

Publication Publication Date Title
US10187740B2 (en) Producing headphone driver signals in a digital audio signal processing binaural rendering environment
US10659908B2 (en) System and method to capture image of pinna and characterize human auditory anatomy using image of pinna
US10585486B2 (en) Gesture interactive wearable spatial audio system
JP6665379B2 (ja) Hearing support system and hearing support device
TW201820315A (zh) Improved audio headphone device, sound playback method thereof, and computer program
JP2017521902A (ja) Circuit device system for acquired acoustic signals and related computer-executable code
US11184723B2 (en) Methods and apparatus for auditory attention tracking through source modification
US11221820B2 (en) System and method for processing audio between multiple audio spaces
US10979236B1 (en) Systems and methods for smoothly transitioning conversations between communication channels
WO2019109420A1 (fr) Method for determining left and right channels, and earphone device
CN104168534A (zh) Holographic audio device and control method
US20200413190A1 (en) Processing device, processing method, and program
CN114339582B (zh) Dual-channel audio processing and directional filter generation method, apparatus, and medium
WO2022178852A1 (fr) Listening assistance method and apparatus
US11217268B2 (en) Real-time augmented hearing platform
JP6587047B2 (ja) Realistic sensation transmission system and realistic sensation reproduction device
US20220021998A1 (en) Method for generating sound and devices for performing same
Amin et al. Impact of microphone orientation and distance on BSS quality within interaction devices
EP4052487A1 (fr) Systèmes et procédés permettant de classifier des signaux formés en faisceaux pour une lecture audio binaurale

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21927288

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21927288

Country of ref document: EP

Kind code of ref document: A1