CN115250646A - Auxiliary listening method and device

Auxiliary listening method and device

Info

Publication number
CN115250646A
Authority
CN
China
Prior art keywords
sound source
user
coordinates
determining
audio signal
Legal status
Pending
Application number
CN202180004382.3A
Other languages
Chinese (zh)
Inventor
张立斌
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Publication of CN115250646A


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10K - SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K11/00 - Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/16 - Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/175 - Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 - Details of transducers, loudspeakers or microphones
    • H04R1/10 - Earpieces; Attachments therefor; Earphones; Monophonic headphones
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04S - STEREOPHONIC SYSTEMS
    • H04S7/00 - Indicating arrangements; Control arrangements, e.g. balance control

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

An auxiliary listening method and apparatus for assisting a user in selective listening. The method comprises: determining coordinates of a first sound source according to an external audio signal collected by a microphone and a first direction, wherein the first direction is determined by detecting a user, and the first sound source is a sound source in a direction other than the first direction; determining a first HRTF corresponding to the first sound source according to the coordinates of the first sound source and a preset head related transfer function (HRTF); and obtaining a noise signal corresponding to the first sound source according to the first HRTF and preset virtual noise, and playing the noise signal.

Description

Auxiliary listening method and device
Technical Field
The application relates to the technical field of intelligent terminals, in particular to an auxiliary listening method and device.
Background
When there are a plurality of sound sources around a user, they include sound sources the user wants to listen to, such as the sound source S1 shown in fig. 1, and sound sources the user does not want to listen to, such as the sound source S2 shown in fig. 1; the sound source S2 is noise to the user. How to help the user concentrate on listening to the audio content of the sound source S1 is the technical problem to be solved by the embodiments of the present application.
Disclosure of Invention
The application provides an auxiliary listening method and an auxiliary listening device, so that the user's attention is more concentrated on listening to the audio content of the sound source of interest, thereby assisting the user in selective listening.
In a first aspect, there is provided an auxiliary listening method, comprising: determining coordinates of a first sound source according to an external audio signal acquired by a microphone and a first direction, wherein the first direction is determined by detecting a user, and the first sound source is a sound source in a direction other than the first direction; determining a first HRTF corresponding to the first sound source according to the coordinates of the first sound source and a preset head related transfer function (HRTF); and obtaining a noise signal corresponding to the first sound source according to the first HRTF and preset virtual noise, and playing the noise signal.
Take as an example the case where the first direction is the user's direction of attention, the sound source in the first direction (i.e., in the user's direction of attention) is S1, and the first sound source is the sound source S2 in a direction the user is not attending to. Through the design of the first aspect, noise N can be superimposed on the audio content of the sound source S2 that the user is not paying attention to or interested in, reducing the clarity of the audio content of S2 so that the user cannot clearly hear it. The user can therefore concentrate more on listening to the audio content of the sound source S1, and selective listening is assisted.
In one possible design, the determining coordinates of the first sound source according to the external audio signal collected by the microphone and the first direction includes: determining coordinates of at least one sound source near the user according to an external audio signal collected by a microphone; detecting the user and determining the first direction; determining coordinates of the first sound source among coordinates of at least one sound source near the user according to the first direction.
In one possible design, the determining coordinates of at least one sound source near the user from the external audio signals collected by the microphones includes: the number of the microphones is multiple, each microphone respectively collects external audio signals, and there is a time delay between the external audio signals collected by different microphones; and determining the coordinates of at least one sound source near the user according to the time delay between the external audio signals collected by different microphones.
In one possible design, the detecting the user, determining the first direction, includes: detecting a gaze direction of the user; or detecting the binaural current difference of the user, and determining the gaze direction of the user according to the corresponding relation between the binaural current difference and the gaze direction, wherein the gaze direction of the user is the first direction.
In one possible design, the determining coordinates of the first sound source in coordinates of at least one sound source near the user according to the first direction includes: analyzing the coordinates of at least one sound source near the user, and determining the directional relationship between each sound source and the user; determining the deviation between each sound source and the first direction according to the first direction and the directional relationship between the sound source and the user; and selecting, from the at least one sound source near the user, a sound source whose deviation from the first direction is greater than a threshold, the coordinates of that sound source being the coordinates of the first sound source.
In one possible design, further comprising: the external audio signals collected by the microphone are mixed audio signals, and the mixed audio signals comprise audio signals output by a plurality of sound sources; and separating the external audio signals collected by the microphone to obtain a first audio signal output by the first sound source.
In one possible design, further comprising: analyzing the separated first audio signal to determine the content of the first audio signal; and determining the type of virtual noise to be added according to the content of the first audio signal.
Through this design, different types of noise are added for different non-attended sound sources (i.e., the first sound source) according to the different contents of the audio signals they output, which is favorable for masking the content of the audio signal of the first sound source, and thus for assisting the user in listening to the audio signal of the sound source of interest.
In a possible design, the content of the first audio signal is human speech, and the type of virtual noise to be added is multi-talker babble noise.
In one possible design, further comprising: determining an energy of the separated first audio signal; and determining the energy of the virtual noise to be added according to the energy of the first audio signal.
Through this design, virtual noise with corresponding energy can be added according to the energy of the first audio signal output by the first sound source, avoiding the addition of virtual noise with excessive energy and reducing the power consumption of the electronic device.
In a second aspect, an auxiliary listening device is provided, which comprises corresponding functional modules or units for implementing the functions of the first aspect or any design of the first aspect. The functions may be implemented by hardware, or by hardware executing corresponding software; the hardware or software comprises one or more modules or units corresponding to the functions.
In a third aspect, an auxiliary listening device is provided that includes a processor and a memory. The memory is used for storing a computer program or instructions, and the processor is coupled with the memory; the computer program or instructions, when executed by the processor, cause the apparatus to perform the method of the first aspect or any design of the first aspect.
In a fourth aspect, an electronic device is provided, wherein the electronic device is configured to perform the method of the first aspect or any design of the first aspect. Optionally, the electronic device may be a headset (wired or wireless), a smart phone, an in-vehicle device, or a wearable device. The wireless headset includes but is not limited to a Bluetooth headset and the like, and the wearable device may be smart glasses, a smart watch, a smart bracelet, and the like.
In a fifth aspect, there is provided a computer readable storage medium having stored thereon a computer program or instructions which, when executed by an apparatus, cause the apparatus to perform the method of the first aspect or any design of the first aspect.
In a sixth aspect, there is provided a computer program product comprising a computer program or instructions which, when executed by an apparatus, causes the apparatus to perform the method of the first aspect or any of the designs of the first aspect.
Drawings
FIG. 1 is a schematic diagram of selective listening by a user provided by an embodiment of the present application;
fig. 2 is a schematic diagram of auxiliary listening provided in an embodiment of the present application;
fig. 3 is a schematic diagram of an electronic device provided in an embodiment of the present application;
fig. 4 is a flowchart of an auxiliary listening method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a coordinate system provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a microphone and a sound source model provided in an embodiment of the present application;
fig. 7 is a functional schematic diagram of an electronic device provided in an embodiment of the present application;
fig. 8 is a schematic HRTF rendering provided in an embodiment of the present application;
fig. 9 is a schematic view of an apparatus provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
Various sound sources exist in our environment; as shown in fig. 1, a sound source S1, a sound source S2, and the like exist near a user. A person with normal hearing can still consciously select the desired audio content, e.g. the audio content of the sound source S1, in a scene where multiple sound sources are present. This capability is called the cocktail party effect.
The cocktail party effect refers to a person's capacity for auditory selection. In such a situation, a person's attention is focused on the conversation of a certain speaker while other conversations or background noises are ignored. This effect reveals a surprising ability of the human auditory system: we can converse in noise. The cocktail party effect is an auditory version of the figure-ground phenomenon: here, the "figure" is the sound that we notice or that draws our attention, and the "ground" is every other sound.
In noisy indoor environments, such as a cocktail party, many different sound sources are present simultaneously, for example: the sounds of multiple people speaking at the same time, the clinking of tableware, music, and the reflections of all these sounds off the walls and objects in the room. As these sounds propagate, the waves emitted by the different sources, together with the direct sound and the reflected sound, superpose in the propagation medium (usually air) to form a complex mixed sound wave. Therefore, the sound waves of the individual sources no longer exist separately in the mixed sound that reaches the listener's ear canal. Yet in such an acoustic environment, the listener can understand the target sentence of interest to a considerable extent. How does the listener separate the speech signals of different speakers from the received mixed sound wave and understand the target sentence? This is the well-known "cocktail party problem," posed in 1953. The industry generally uses the following principles of auditory attention and acoustics to explain the cocktail party effect.
The auditory attention principle: when a person's auditory attention is focused on something, consciousness excludes extraneous sound stimuli, while the unconscious continuously monitors external stimuli; once some special stimulus related to the person appears, it immediately draws attention. This effect is in fact an adaptation of the auditory system. In short, our brain evaluates the sounds to some extent and then decides whether or not to listen to them.
The acoustic principle: acoustically, the cocktail party effect refers to the masking effect of the human ear. In the noisy crowd of a cocktail party, two people can converse smoothly; although the surrounding noise is loud, each hears the other's voice, and the two seem not to hear any of the noise outside their conversation, because each has placed his or her focus on the topic of conversation (this is the selectivity of attention).
By exploiting the cocktail party effect, a person's selective listening can be influenced by controlling the clarity of the audio objects in the environment. In general, the clearer the content of an audio object, the more easily it is perceived and the more attention it attracts. Schemes for assisting a user in making listening selections currently fall into the following two main categories:
the first scheme, the basic process is as follows:
1) The spectrogram of the mixed audio is fed into a plurality of Deep Neural Networks (DNNs), each of which is trained to separate a particular speech source.
2) When a user is listening to one of the speech sources, the neural activity of their brain encodes a spectrogram that can be used to reconstruct that speech source.
3) The reconstructed spectrogram is compared with the output of each DNN and if consistent, the source is amplified.
This scheme is used in hearing aids and can play back, with enhancement, the audio objects the user is paying attention to or interested in, so that they can be heard more clearly. However, the method identifies the external sound source signal that the current user is interested in or attending to based on brain wave technology (restoring and recovering it from brain waves), and then enhances and plays the corresponding source signal. Technically, this requires performing speech separation first and then matching against the decoded brain wave signal; the effect depends on the accuracy of the front-end speech signal separation, which is hugely challenging, so the scheme is difficult to realize.
A second scheme, comprising: step one, detecting the user's direction of attention or interest, e.g. determining it based on the user's gaze or head orientation. Step two, focused pickup, to pick up an audio signal s(n) in the direction the user is attending to. Optionally, pickup refers to the specific action of acquiring an audio signal. Focused pickup, a method of picking up the audio signal of interest to the user, may include: picking up all the audio signals around the user and then separating out the audio signal of interest to the user; or picking up only the audio signal of interest to the user, for example acquiring only that signal by a beamforming method or the like. Step three, binaural playback: performing binaural rendering and playback of the sound in the pickup direction, improving the presence of s(n) by means of a head related transfer function (HRTF).
The essence of the second scheme is that by mainly picking up the sound of the source of interest S1, the audio content of S1 is made clearer, so that the user can perceive S1 more effectively. However, the signal-to-noise ratio of S2 relative to N is essentially unchanged, so S2 may still be perceived, which affects the user's effective attention to the sound source S1. For example, a user is talking to object A, but there is also an object B near the user who is also talking; object B is noise to the user. In the second scheme, focused pickup improves the volume, clarity, and so on of the content of object A's conversation, but performs no processing on the conversational content of object B. If words of interest to the user occur in object B's speech, they may still attract the user's attention, so that the user cannot concentrate well on listening to object A's speech. For example, if the user is concerned with "salary" and words such as "salary" occur in object B's conversation while the user is talking with object A, the user's attention will be drawn away, and the user cannot concentrate on listening to object A's conversation.
As shown in fig. 2, the principle of the present scheme is to superimpose noise (N) on the audio content of a sound source S2 that the user is not attending to or interested in, so as to reduce the clarity of the audio content of S2 and prevent the user from clearly hearing it; the user's attention is thereby focused on listening to the audio content of the sound source S1, and selective listening is assisted. Following the above example, with the scheme in the embodiment of the present application, noise N may be superimposed on the content of object B's conversation, thereby reducing the signal-to-noise ratio of that content. Then, even if object B's speech contains words such as "salary" that the user is concerned with, the superimposed noise N prevents the user from clearly hearing object B's speech; object B's speech no longer attracts the user's attention, and the user is more focused on listening to object A's speech.
The auxiliary listening method provided by the embodiment of the application can be applied to electronic equipment, including but not limited to headsets, smart phones, vehicle-mounted devices, or wearable devices (such as smart glasses, smart watches, or bracelets). Taking a headset as an example, an auxiliary listening switch may be provided in the headset; when the user turns the switch on, the headset may determine the sound source of interest to the user by detecting the user. Noise N is added to the audio signals of the sound sources the user is not attending to, reducing the signal-to-noise ratio of those audio signals, so that the user can concentrate on listening to the content of the source of interest, and selective listening is assisted.
As shown in fig. 3, an embodiment of the present application provides an electronic device, where the electronic device may be used to implement the auxiliary listening method provided in the embodiment of the present application, and the electronic device at least includes: a processor 301, a memory 302, at least one speaker 304, at least one microphone 305, and a power supply 306, among other things.
Memory 302 may store, among other things, program code that may include program code to implement assisted listening. The processor 301 may execute the above program codes to realize the function of assisting listening in the embodiment of the present application. For example, the processor 301 may execute the program code of the memory 302, implementing the following functions: determining the coordinate of a first sound source in a non-attention direction of a user according to an external audio signal collected by a microphone and the detected first direction concerned by the user; determining a first HRTF corresponding to a first sound source according to the coordinates of the first sound source and the preset HRTFs at different positions; and obtaining a noise signal and the like corresponding to the first sound source according to the first HRTF and the preset virtual noise.
The speaker 304 may be used to convert audio electrical signals into sound and play them. For example, the speaker 304 is used to play a noise signal corresponding to the first sound source.
The microphone 305, which may also be referred to as a mic, a sound pickup, or the like, is used to convert sound signals into audio electrical signals. For example, the microphone 305 may capture a sound signal near the user and convert it into an audio electrical signal. It should be understood that the audio electrical signal is an audio signal in the embodiment of the present application.
A power supply 306 operable to supply power to various components included in the electronic device. In some embodiments, the power source 306 may be a battery, such as a rechargeable battery or the like.
Optionally, if the electronic device 300 is a wireless headset, the electronic device 300 may further include: a sensor 303, a wireless communication module 307, etc.
The sensor 303 may be a proximity light sensor. The processor 301 may determine whether the headset is worn by the user via the sensor 303. For example, the processor 301 may use the proximity light sensor to detect whether an object is near the headset, thereby determining whether the headset is worn, and so on. Alternatively, if the headset is charged in a charging box, the processor 301 may use the proximity light sensor to determine whether the charging box has been opened, and thereby determine whether to put the headset into the pairing state. In addition, in some embodiments, the headset may further include a bone conduction sensor, which can acquire the vibration signal of the bone vibrated by the vocal part; the processor 301 may parse out the voice signal and implement the control function corresponding to it. In other embodiments, the headset may further include a touch sensor or a pressure sensor for detecting, respectively, a touch operation or a press operation of the user on the headset. In other embodiments, the headset may further include a fingerprint sensor for detecting a user's fingerprint, identifying the user's identity, and the like.
And a wireless communication module 307, configured to establish wireless connections with other electronic devices, so that the headset can exchange data with them. In some embodiments, the wireless communication module 307 may be a near field communication (NFC) module, so that the headset may perform NFC communication with other electronic devices having an NFC module. The NFC module may store information related to the headset, for example the name, address information, or unique identifier of the headset, so that other electronic devices having an NFC module may establish an NFC connection with the headset according to this information and transmit data over that connection. In other embodiments, the wireless communication module 307 may be a Bluetooth module storing the Bluetooth address of the headset, so that other electronic devices can establish a Bluetooth connection with the headset according to the Bluetooth address and transmit audio data over it. In this embodiment of the present application, the Bluetooth module may simultaneously support multiple Bluetooth connection types, for example the Serial Port Profile (SPP) of conventional Bluetooth or the Bluetooth Low Energy (BLE) generic attribute profile (GAP), and the like, which is not limited herein.
In other embodiments, the wireless communication module 307 may also be an infrared module or a wireless fidelity (Wi-Fi) module; the specific implementation of the wireless communication module 307 is not limited herein.
In addition, in the embodiment of the present application, only one wireless communication module 307 may be provided, or a plurality of wireless communication modules may be provided as needed. For example, two wireless communication modules may be provided in the headset, wherein one wireless communication module is a bluetooth module, and the other wireless communication module is an NFC module. Thus, the headset can perform data communication through the two wireless communication modules, respectively, and the number of the wireless communication modules 307 is not limited herein.
It is to be understood that the illustrated structure of the present embodiment does not constitute a specific limitation to the electronic device 300. It may have more or fewer components than shown in fig. 3, may combine two or more components, or may have a different configuration of components, etc. For example, the electronic device 300 may further include an indicator light (which may indicate a status such as power), a dust screen (which may be used with an earpiece), and the like. The various components shown in fig. 3 may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing or application specific integrated circuits.
As shown in fig. 4, a flow of an auxiliary listening method is provided, the flow at least includes:
step 401: the electronic device determines coordinates of a first sound source according to an external audio signal collected by a microphone and a first direction, wherein the first direction is determined by detecting a user, and the first sound source is a sound source in other directions except the first direction. For example, the first direction may be a direction in which the user is interested, and the first sound source may be a sound source in a direction in which the user is not interested. For example, sound sources S1, S2, and S3 are present near the user. If the user is interested in sound source S1, for example, listening to the audio output of sound source S1, S1 can be the sound source in the direction of interest of the user, and sound sources S2 and S3 can be the sound sources in the directions of non-interest of the user.
Optionally, the implementation process of step 401 may include: the electronic device determines coordinates of at least one sound source in the vicinity of the user from the external audio signal collected by the microphone. In an example, a microphone array is disposed in the electronic device, where the microphone array includes at least one microphone, each microphone respectively collects an external audio signal, and there is a time delay between the external audio signals collected by different microphones. The electronic device may determine coordinates of at least one sound source in the vicinity of the user based on a time delay between external audio signals collected by different microphones. Alternatively, the microphone may be a vector microphone or the like. For example, a sound source S1 and a sound source S2 are present near the user. When a user wears the electronic device, a microphone array of the electronic device can collect a sound signal outside the user and convert the sound signal into an audio signal, where the audio signal includes an audio signal corresponding to the sound source S1 and an audio signal corresponding to the sound source S2. The electronic device may determine the coordinates of the sound source S1 and the coordinates of the sound source S2 according to the time delays of the different microphones of the microphone array for acquiring the audio signals.
In a possible implementation manner, the model described in "Research on sound source localization algorithms based on microphone arrays" (Nanjing University, Liu Chao) may be used to determine the coordinates of each sound source from the time delays of different microphones in the embodiment of the present application.
Specifically, as shown in fig. 6, the microphone array is composed of $N+1$ microphones $M_0, M_1, \ldots, M_N$. Let the spatial coordinates of the $i$-th microphone be $r_i = (x_i, y_i, z_i)$, $i = 0, 1, \ldots, N$. Microphone $M_0$ is the origin of the spatial coordinate system, i.e. $r_0 = (0, 0, 0)$, and the spatial coordinates of the sound source $S$ are $r_s = (x, y, z)$. Then:

a. The distance between the sound source $S$ and the $i$-th microphone is

$$d_i = \sqrt{(x - x_i)^2 + (y - y_i)^2 + (z - z_i)^2}$$

where $d_i$ denotes the distance of the sound source $S$ from the $i$-th microphone, $(x, y, z)$ denotes the coordinates of the sound source $S$, and $(x_i, y_i, z_i)$ denotes the coordinates of the $i$-th microphone.

b. The distance difference between the sound source $S$ and each pair of microphones is

$$d_{ij} = d_i - d_j, \qquad i, j = 0, 1, \ldots, N$$

where $d_{ij}$ denotes the difference between the distance $d_i$ from the sound source to the $i$-th microphone and the distance $d_j$ from the sound source to the $j$-th microphone.

c. Expressed through the measured time delays, the relative distance between the $i$-th and $j$-th microphones is

$$d_{ij} = c\,\tau_{ij}$$

where $c$ denotes the speed of sound and $\tau_{ij}$ denotes the time delay between the $i$-th and $j$-th microphones.

In the above expressions, the distances between the microphones are known, and the speed of sound is also known. Solving the above expressions jointly yields the spatial position of the sound source, $r_s = (x_s, y_s, z_s)$. The specific algorithm may be a maximum likelihood estimation method, a least squares method, or the like, and is not limited.
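As a concrete illustration, the sketch below solves a standard linearization of the above system by least squares, with $M_0$ at the origin as in fig. 6. This is a minimal sketch under stated assumptions, not the patent's implementation: the function name is hypothetical, the delays are assumed to be measured relative to $M_0$, and at least four non-reference microphones are assumed so the linear system is determined.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees Celsius

def localize_source(mic_pos, tdoa, c=SPEED_OF_SOUND):
    """Estimate source coordinates from time delays (hypothetical helper).

    mic_pos: (N+1, 3) array of microphone coordinates; row 0 is M0 at the origin.
    tdoa:    (N,) array; tdoa[i-1] is the arrival time at M_i minus that at M0.
    """
    r = np.asarray(mic_pos, dtype=float)[1:]        # positions of M1..MN
    tau = np.asarray(tdoa, dtype=float)
    # Squaring d_i = d_0 + c*tau_i and substituting d_i^2 = d_0^2 - 2 r_i.r_s + |r_i|^2
    # gives the linear system  2 r_i . r_s + 2 c tau_i d_0 = |r_i|^2 - (c tau_i)^2
    # in the unknowns theta = (x_s, y_s, z_s, d_0).
    A = np.hstack([2.0 * r, (2.0 * c * tau)[:, None]])
    b = np.sum(r ** 2, axis=1) - (c * tau) ** 2
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return theta[:3]                                # estimated (x_s, y_s, z_s)
```

With the delays estimated (for example by cross-correlating microphone pairs), this returns the source coordinates used in the following steps.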
Through the above description, the coordinates of at least one sound source near the user can be obtained. The following describes the process of detecting the first direction the user is attending to, and the process of screening out, from the at least one sound source according to that first direction, the first sound source that the user is not attending to.
For example, the user may wear the electronic device. For example, if the electronic device is a headset, the user may wear the electronic device at the ear position. The electronic device can determine a first direction in which the user focuses by detecting the user. For example, the electronic device detects a gaze direction of a user's eyes, treats the gaze direction of the user's eyes as a first direction of user attention, and so on. In one possible implementation, an Inertial Measurement Unit (IMU) may be deployed in the electronic device, and the electronic device may determine a head orientation of the user using the IMU, determine an eye gaze direction of the user according to the head orientation of the user, and the like. For example, if the head of the user is oriented straight ahead as detected by the IMU, then the user's eye gaze direction, i.e., the first direction of user attention, is straight ahead, etc. Or, a camera may be deployed in the electronic device, and the camera may acquire a head image of the user, and determine a gaze direction of both eyes of the user according to the head image of the user acquired by the camera. Or, a brain wave detection sensor may be disposed in the electronic device, the brain wave detection sensor may detect a binaural current difference of the user, and the gaze direction of the user may be determined according to a correspondence between the binaural current difference and the gaze direction.
Thereafter, the electronic device may determine coordinates of a first sound source among coordinates of at least one sound source near the user according to the first direction. The process can be implemented in two ways:
first, the electronic device determines a sound source of interest of a user among at least one sound source near the user according to the first direction; and then, in at least one sound source nearby the user, excluding the sound source concerned by the user, wherein the rest sound sources are the first sound sources which are not concerned by the user. For example, there are 5 sound sources near the user, and according to the first direction of user interest, it is determined that sound source a of the 5 sound sources is the sound source of user interest, and the remaining 4 sound sources of the 5 sound sources except sound source a are the first sound sources of user non-interest. It should be understood that, in the embodiment of the present application, the first sound source that is not focused by the user may include one sound source, may also include a plurality of sound sources, and the like, without limitation.
Second, the electronic device directly determines a first sound source that is not of interest to the user among at least one sound source in the vicinity of the user according to the first direction.
For example, the electronic device may analyze the coordinates of at least one sound source in the vicinity of the user to determine the positional relationship of each sound source to the user. Optionally, the coordinates of a sound source may be coordinates relative to the user. A coordinate system is established as shown in fig. 5. Optionally, the coordinate system may be three-dimensional, with its origin at the position of the user's head; the X-axis represents the user's left/right direction, the Y-axis the user's front/rear direction, and the Z-axis the user's up/down direction. For example, in one possible implementation, the positive X direction is directly to the user's right and the negative X direction directly to the left; the positive Y direction is directly in front of the user and the negative Y direction directly behind; the positive Z direction is directly above the user and the negative Z direction directly below. It is to be understood that the positive direction of each axis corresponds to coordinate values greater than 0, and the negative direction to values less than 0. By analyzing the coordinates of each sound source relative to the user, the electronic device can determine the positional relationship between each sound source and the user. For example, a sound source whose coordinates relative to the user are (1, 2, 0) is located 1 m to the user's right and 2 m in front of the user, in the same horizontal plane as the user. Through the coordinates of the sound source, its position can be located accurately, and hence its directional relationship to the user determined.
The deviation between each sound source and the first direction is then determined from the detected first direction and the directional relationship between the sound source and the user. For example, suppose the detected first direction of the user's attention deviates 15 degrees from straight ahead, and a sound source's direction relative to the user deviates 20 degrees from straight ahead; the deviation between that sound source and the first direction is then 5 degrees. From the at least one sound source near the user, a sound source whose deviation from the first direction is greater than a threshold is selected; that is a first sound source the user is not attending to. Alternatively, a sound source whose deviation from the first direction is less than or equal to the threshold may be selected as a sound source of interest to the user; excluding the sources of interest from all sound sources near the user then leaves the sources the user is not attending to.
In this way, the deviation of each sound source from the first direction can be determined. When a sound source's deviation from the first direction is greater than the threshold, it is considered a sound source the user is not attending to; when its deviation is less than or equal to the threshold, it is considered a sound source of interest to the user. Following the example shown in fig. 1 above, a sound source S1 and a sound source S2 are present near the user. If the deviation of S1 from the first direction is less than or equal to the threshold, S1 is considered the sound source of interest to the user; if the deviation of S2 from the first direction is greater than the threshold, S2 is considered a sound source the user is not attending to. The threshold may be set when the electronic device leaves the factory, or may subsequently be synchronized or notified to the electronic device by a server of the electronic device, without limitation. It should be noted that, in the description of the embodiments of the present application, "attention" and "interest" are not distinguished and may be used interchangeably: the sound source the user is attending to may be replaced by the sound source the user is interested in, the sound source the user is not attending to by the sound source the user is not interested in, and so on. In one understanding, the sound source of interest to the user may be a sound source associated with the activity the user is currently engaged in. For example, if the user is currently watching television and the television is playing an audio signal as a sound source, the television may be the source of interest to the user. Technically, the sound source of interest may be a sound source whose deviation from the user's eye gaze direction is less than or equal to the threshold, or the like. Similarly, a sound source the user is not interested in may be a sound source unrelated to the user's current activity, which is noise to the user. Following the above example, the user is watching television, which is the current activity, and listening to music is not. If a music player near the television is playing music, the music player may be a source the user is not attending to. Technically, such a source may be one whose deviation from the user's eye gaze direction is greater than the threshold, or the like.
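The screening described above reduces to an angular comparison between each source direction and the first direction. The following sketch is one hypothetical way to perform it; the function name, the vector representation of the first direction, and the 20-degree threshold are illustrative assumptions, not values from the patent.

```python
import numpy as np

def split_sources(source_coords, first_direction, threshold_deg=20.0):
    """Split sources into attended / non-attended by angular deviation
    from the first direction (all coordinates are relative to the head)."""
    g = np.asarray(first_direction, dtype=float)
    g /= np.linalg.norm(g)
    attended, non_attended = [], []
    for p in source_coords:
        v = np.asarray(p, dtype=float)
        v /= np.linalg.norm(v)
        # angle between the source direction and the first direction, in degrees
        dev = np.degrees(np.arccos(np.clip(np.dot(v, g), -1.0, 1.0)))
        (attended if dev <= threshold_deg else non_attended).append(tuple(p))
    return attended, non_attended

# Example: gaze straight ahead (+Y); the source at (1, 2, 0) deviates ~26.6 degrees
attended, non_attended = split_sources([(1, 2, 0), (0, 3, 0)], (0, 1, 0))
```

With the 20-degree threshold assumed here, (0, 3, 0) would be classified as attended and (1, 2, 0) as non-attended.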
Step 402: the electronic equipment determines a first HRTF corresponding to the first sound source according to the coordinates of the first sound source and a preset HRTF. Alternatively, the first HRTF may include two HRTFs, as shown in fig. 7, which may correspond to an HRTF of a left ear and an HRTF of a right ear, respectively.
Step 403: the electronic equipment obtains a noise signal corresponding to the first sound source according to the first HRTF and the preset virtual noise, and plays the noise signal.
Optionally, in the frequency domain, the electronic device may multiply the first HRTF by the spectrum of the preset virtual noise to obtain the noise signal corresponding to the first sound source. For example, following the example above, the first HRTF includes an HRTF for the left ear and an HRTF for the right ear. Multiplying the left-ear HRTF by the spectrum of the preset virtual noise gives the left-ear noise signal, and multiplying the right-ear HRTF by that spectrum gives the right-ear noise signal; together they may be referred to as a binaural noise signal. The HRTF is a transfer function used for sound localization, corresponding in the time domain to the head related impulse response (HRIR). Multiplication of the HRTF and the preset virtual noise in the frequency domain corresponds to convolution of the HRIR with the preset virtual noise in the time domain.
For ease of understanding, HRTFs are first introduced; their concept and meaning are as follows. A person has only two ears yet can localize sound in three dimensions, thanks to the human auditory system's analysis of sound signals. An HRTF can simulate that analysis. HRTFs essentially contain the spatial orientation information of sound sources: sources at different spatial orientations correspond to different HRTFs. For any monaural audio signal, multiplying it by the left-ear HRTF and by the right-ear HRTF yields the audio signals for the two ears, and playing them through earphones lets the listener experience 3-dimensional audio.
In a possible implementation manner, HRTFs for a plurality of positions may be stored in advance, and the first HRTF corresponding to the first sound source may be obtained from them (for example, by interpolating among the HRTFs of the stored positions). The first HRTF is then multiplied by the spectrum of the preset virtual noise to obtain the noise signal. It should be noted that the HRTF is position-dependent, and the first HRTF is obtained from the coordinates of the first sound source. By multiplying the first HRTF with the spectrum of the preset virtual noise, the virtual noise can be perceptually superimposed on the audio signal of the first sound source, reducing the signal-to-noise ratio of that signal, so that the user can concentrate more on listening to the audio signal of the source of interest; the user's listening is thus assisted. For example, following the example of fig. 1, the first HRTF is obtained from the coordinates of the sound source S2 and multiplied by the spectrum of the preset virtual noise to obtain a noise signal; playing that noise signal superimposes it on S2, reducing the signal-to-noise ratio of S2's audio signal, so that the user can listen to the audio signal of S1 more attentively.
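A minimal sketch of this rendering step, assuming the binaural HRIRs for the first sound source's coordinates have already been retrieved from the pre-stored set: convolving the preset virtual noise with the left- and right-ear HRIRs in the time domain is equivalent to the frequency-domain product described above. The names are illustrative.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural_noise(noise, hrir_left, hrir_right):
    """Place mono virtual noise at the first sound source's position.

    noise:      1-D array, the preset virtual noise
    hrir_left:  1-D array, left-ear HRIR for the source coordinates
    hrir_right: 1-D array, right-ear HRIR for the source coordinates
    """
    # Time-domain convolution == frequency-domain product with the HRTF
    left = fftconvolve(noise, hrir_left, mode="full")
    right = fftconvolve(noise, hrir_right, mode="full")
    return left, right   # played on the left and right earpieces
```

The two returned channels form the binaural noise signal that the speaker 304 would play.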
In the embodiment of the present application, regarding the process of determining the first HRTF corresponding to the first sound source according to the coordinates of the first sound source and the preset HRTF in the step 402, the process includes, but is not limited to, the following two implementation manners:
mode 1: the coordinates of each sound source relative to the user are accurately located, and a large number of HRTFs are stored in advance.
In this manner, the electronic device pinpoints the position of each sound source relative to the user. Following the above example, a sound source whose coordinates relative to the user are (1, 2, 0) is located 1 meter to the user's right and 2 meters in front of the user, in the same horizontal plane as the user. Since HRTFs are position-dependent, the electronic device may in this case need to store a larger number of HRTFs in advance in order to compute the HRTF corresponding to the source's position. The advantage of this manner is that an accurate HRTF can be determined, so the noise signal determined from it can be superimposed precisely on the non-attended sound source.
Mode 2: the coordinates of each sound source with respect to the user are roughly localized, and a small number of HRTFs are stored in advance.
In this manner, the electronic device only roughly estimates the direction of each sound source relative to the user and no longer pinpoints its specific position. For example, the determined coordinates of a sound source relative to the user may be (1, 1, 0), which indicates that the source lies in the user's front-right direction without estimating its exact position there. The electronic device may store HRTFs for only four directions, i.e. front, rear, left, and right, and calculate the HRTF for the front-right direction from them. The advantages of this manner are a smaller storage footprint, a simplified calculation process, and reduced power consumption of the electronic device.
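One hedged illustration of mode 2: only HRIRs for the four stored directions are kept, and the HRIR for an intermediate azimuth such as front-right is approximated between the two nearest stored directions. The patent does not specify the interpolation operation; plain linear interpolation of time-domain HRIRs, as below, is a deliberate simplification (practical systems often interpolate more carefully).

```python
import numpy as np

def interpolate_hrir(azimuth_deg, stored):
    """Approximate an HRIR by linear interpolation between the two nearest
    stored directions (e.g. stored = {0: front, 90: right, 180: rear, 270: left},
    azimuths in degrees clockwise from straight ahead; mapping is illustrative)."""
    angles = sorted(stored)
    az = azimuth_deg % 360.0
    lo = max((a for a in angles if a <= az), default=angles[-1])
    hi = min((a for a in angles if a > az), default=angles[0])
    span = (hi - lo) % 360.0 or 360.0          # wrap around the circle
    w = ((az - lo) % 360.0) / span             # weight of the upper neighbour
    return (1.0 - w) * stored[lo] + w * stored[hi]
```

For the front-right source above (azimuth 45 degrees), this returns the average of the stored front and right HRIRs.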
Optionally, the virtual noise added in step 403 may be white noise. Alternatively, it may be noise that matches the content of the first audio signal of the first sound source. The electronic device may analyze the first audio signal, determine its content, and determine the type of virtual noise to add according to that content. For example, if the content of the first audio signal is human speech, the electronic device determines that the type of virtual noise to add is multi-talker babble noise or the like. In one possible implementation, the electronic device may detect whether the content of the first audio signal contains human voice; if it does, the virtual noise added may be multi-talker babble noise or the like. Detecting whether the first audio signal contains human voice may use voice activity detection (VAD) techniques, such as short-time energy (STE) and zero-crossing count (ZCC) detection methods.
In one possible implementation, the STE and ZCC of the first audio signal of the first sound source may be detected. The STE of speech segments is relatively large while their ZCC is relatively small, whereas the STE of non-speech segments is relatively small and their ZCC relatively large. This is mainly because most of the energy of a speech signal is contained in the low frequency band, while a noise signal usually has less energy and carries information in higher frequency bands. Therefore, thresholds may be set: when the STE of the first audio signal is greater than or equal to a first threshold and its ZCC is less than or equal to a second threshold, the first audio signal may be considered to contain human voice; when the STE is smaller than the first threshold and the ZCC is greater than the second threshold, it may be considered not to contain human voice. Here, ZCC refers to the rate of sign changes of a signal, i.e. the number of times one frame of the time-domain speech signal crosses the time axis. It is computed by shifting all samples of the frame by one, multiplying the shifted and original samples pointwise, noting that a negative product indicates a zero crossing, and counting the negative products in the frame. STE refers to the energy of one frame of the speech signal.
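The decision above can be written in a few lines. This is a generic sketch of the described thresholding, not the patent's implementation; the threshold values are deployment-specific assumptions.

```python
import numpy as np

def frame_features(frame):
    """Short-time energy and zero-crossing count of one audio frame."""
    frame = np.asarray(frame, dtype=float)
    ste = np.sum(frame ** 2)                              # frame energy
    zcc = np.count_nonzero(frame[:-1] * frame[1:] < 0)    # sign changes
    return ste, zcc

def contains_speech(frame, ste_threshold=1e-3, zcc_threshold=50):
    """Speech if the energy is high and the zero-crossing count is low,
    per the two-threshold rule described above (values are illustrative)."""
    ste, zcc = frame_features(frame)
    return ste >= ste_threshold and zcc <= zcc_threshold
```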
Optionally, the electronic device may also determine the energy of the first audio signal, and determine the energy of the virtual noise to add according to it. This essentially controls the signal-to-noise ratio of the first audio signal after the virtual noise is added. For example, the ratio of added virtual noise energy to first-audio-signal energy may be preset to 50%: if the energy of the first audio signal is W1, the energy of the virtual noise to add may be half of it, i.e. 0.5 W1.
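A one-function sketch of this energy-matching rule, following the 0.5 W1 example above (the ratio and the names are illustrative):

```python
import numpy as np

def scale_virtual_noise(noise, signal_energy, ratio=0.5):
    """Scale virtual noise so its energy equals `ratio` times the energy
    of the first audio signal (ratio=0.5 reproduces the 0.5*W1 example)."""
    noise = np.asarray(noise, dtype=float)
    current = np.sum(noise ** 2)
    return noise * np.sqrt(ratio * signal_energy / current)
```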
Optionally, in this embodiment, the external audio signal collected by the microphone may be a mixed audio signal, where the mixed audio signal includes audio signals output by multiple sound sources. The electronic device can separate the mixed audio signal collected by the microphone to obtain a first audio signal corresponding to the first sound source. Then, the audio content in the first audio signal and/or the energy of the first audio signal are determined by adopting the above mode.
In a possible implementation, the simple frequency-domain speech separation algorithm introduced in "Sound field reconstruction and speech separation in multichannel speech signal processing" (doctoral dissertation, Wang Lin) may be used to separate the mixed audio signal collected by the microphone into a plurality of independent audio signals, among which is the first audio signal corresponding to the first sound source.
Suppose there are $N$ independent sound sources and $M$ microphones, the source vector is $s(n) = [s_1(n), \ldots, s_N(n)]^T$, the observation vector is $x(n) = [x_1(n), \ldots, x_M(n)]^T$, and the mixing filter length is $P$. The convolutive mixing process can then be expressed as

$$x(n) = \sum_{p=0}^{P-1} H(p)\, s(n-p) = H(n) * s(n)$$

The mixing network $H(n)$ is an $M \times N$ matrix sequence formed by the impulse responses of the mixing filters. Assuming the separation filter length is $L$, the estimated source vector $y(n) = [y_1(n), \ldots, y_N(n)]^T$ is

$$y(n) = \sum_{l=0}^{L-1} W(l)\, x(n-l) = W(n) * x(n)$$

where the separation network $W(n)$ is an $N \times M$ matrix sequence formed by the impulse responses of the separation filters, and $*$ denotes matrix convolution.

The separation network $W(n)$ may be obtained by a frequency-domain blind source separation algorithm. A short-time Fourier transform (STFT) turns the time-domain convolution into a frequency-domain product:

$$X(m, f) = H(f)\, S(m, f)$$
$$Y(m, f) = W(f)\, X(m, f)$$

where $m$ is obtained by down-sampling the time index $n$ at $L$ points, $X(m, f)$ and $Y(m, f)$ are the STFTs of $x(n)$ and $y(n)$ respectively, $H(f)$ and $W(f)$ are the Fourier transforms of $H(n)$ and $W(n)$ respectively, and $f \in \{f_0, \ldots, f_{L/2}\}$ is the frequency.

Transforming the $Y(m, f)$ obtained after blind source separation back to the time domain yields the estimated source signals $y_1(n), \ldots, y_N(n)$.
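To make the relation $Y(m,f) = W(f)X(m,f)$ concrete, the sketch below applies already-estimated per-bin separation matrices to the microphone signals via an STFT and transforms the result back to the time domain. Estimating $W(f)$ itself (the blind source separation step) is outside this sketch, and the array shapes are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def apply_separation(x, W, fs, nperseg=1024):
    """Apply per-frequency separation matrices W(f) to M microphone signals.

    x:  (M, samples) mixed microphone signals
    W:  (bins, N, M) complex separation matrices, bins = nperseg // 2 + 1
    Returns (N, samples) estimated source signals y_1(n)..y_N(n).
    """
    f, t, X = stft(x, fs=fs, nperseg=nperseg)   # X: (M, bins, frames)
    X = np.transpose(X, (1, 0, 2))              # -> (bins, M, frames)
    Y = np.einsum('fnm,fmt->fnt', W, X)         # Y(m, f) = W(f) X(m, f)
    Y = np.transpose(Y, (1, 0, 2))              # -> (N, bins, frames)
    _, y = istft(Y, fs=fs, nperseg=nperseg)     # back to the time domain
    return y
```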
It should be noted that in the present embodiment, the first sound source that the user is not attending to may include one or more sound sources. When a plurality of sound sources are included, each sound source is processed once according to steps 402 and 403: the HRTF corresponding to each sound source is obtained, and each HRTF is multiplied by the spectrum of the preset virtual noise to obtain the noise signal corresponding to that sound source. Playing the noise signal of each sound source superimposes the noise on the corresponding source, reducing the signal-to-noise ratio of the sources the user is not attending to, so that the user can concentrate on listening to the audio signal of the source of interest; selective listening is thus assisted.
For example, an embodiment of the present application provides an electronic device, where the electronic device is configured to implement the method for assisting listening provided in the embodiment of the present application, and the electronic device at least implements the following two functions:
a first part: detection and identification
The modules implementing the detection and identification functions in the electronic equipment may comprise an environmental information acquisition module, an environmental information processing module, and an attention detection module. The environmental information acquisition module is used to collect audio signals in the environment and may be an environmental information collection sensor deployed in the electronic device, such as a microphone or a camera. The environmental information processing module may determine the position of the sound source corresponding to a collected audio signal. The attention detection module is used to determine the direction the user is attending to, for example the orientation of the user's head or the gaze direction of the user's eyes; it may be an IMU, a camera, or the like deployed on the electronic device.
A second part: function processing
The modules implementing function processing in the electronic equipment may comprise an audio processing module and an audio playing module. The audio processing module is used to add noise for the audio signals the user is not attending to, and may be a processor disposed in the electronic device. The audio playing module is configured to play the resulting noise signal, and may be a speaker disposed in the electronic device.
An embodiment of this application provides a specific example of an auxiliary listening method, which includes at least the following steps:
Step one: direction recognition
Detect the user's direction of attention. Ways of detecting it include, but are not limited to, the following two:
The first mode is as follows: the electronic device is provided with a camera that detects the gaze direction of the user wearing the device, and the detected gaze direction is taken as the user's direction of attention.
The second mode is as follows: the electronic device is provided with a brain wave detection sensor that detects the user's binaural current difference; because the binaural current difference varies with gaze direction, the user's gaze direction can be determined from it. Here too, the user's gaze direction is taken as the direction of attention.
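The correspondence between binaural current difference and gaze direction suggests a simple calibrated lookup. The following sketch assumes a calibration table of (current difference, gaze angle) pairs and linear interpolation between them; both the calibration data and the interpolation scheme are illustrative assumptions, as the patent only states that gaze direction is determined from this correspondence:

```python
import numpy as np

def gaze_from_current_diff(diff, calib_diffs, calib_angles):
    """Map a measured binaural current difference to a gaze angle
    using a calibrated correspondence table (assumed available).

    diff         : measured binaural current difference
    calib_diffs  : calibration current differences
    calib_angles : corresponding calibration gaze angles (degrees)
    """
    order = np.argsort(calib_diffs)            # np.interp needs increasing x
    return np.interp(diff,
                     np.asarray(calib_diffs)[order],
                     np.asarray(calib_angles)[order])
```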
Step two: location identification
Detect and determine the sound sources in directions the user is not attending to. For example, a microphone array is deployed on the electronic device, and the coordinates of all sound sources near the user can be detected with it. Based on the recognized direction of attention, the sound sources the user is not attending to are selected from among all sound sources near the user. Optionally, there may be one or more such non-attended sound sources, without limitation. For example, the number of non-attended sound sources may be n, with coordinates p1(x1, y1, z1), …, pn(xn, yn, zn), where n is a positive integer greater than 1.
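The patent does not prescribe a particular localization algorithm. One common way to obtain the inter-microphone time delays from which such coordinates are computed is generalized cross-correlation with phase transform (GCC-PHAT); the following sketch, under that assumption, estimates the delay for one microphone pair:

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs, max_tau=None):
    """Estimate the time delay (seconds) of `sig` relative to `ref`
    with GCC-PHAT. Pairwise delays across an array constrain the
    source position from which the coordinates above are derived.
    """
    n = len(sig) + len(ref)
    S = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
    S /= np.abs(S) + 1e-12                      # PHAT weighting
    cc = np.fft.irfft(S, n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```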
Step three: binaural rendering, as shown in fig. 8.
1. Acquire the binaural HRTFs based on the coordinates of the user's non-attended sound sources. Optionally, HRTFs for a number of positions may be stored in advance in the electronic device, and the HRTF corresponding to each non-attended source's coordinates is obtained by interpolation among them (a sketch of one possible interpolation scheme follows this example). Following the above example, with the n non-attended sources at coordinates p1(x1, y1, z1), …, pn(xn, yn, zn), binaural HRTFs can be obtained for each of the n sources.
2. Process the virtual noise with the binaural HRTF, for example by time-domain convolution or frequency-domain product, to obtain a binaural audio signal, and play it in real time. The playback mode may be acoustic, bone conduction, or the like, without limitation.
The virtual noise may be a noise audio file, stored either in the electronic device or in the cloud; in the latter case the electronic device downloads it when needed and renders it for playback. Following the above example, the n non-attended sound sources correspond to n binaural HRTFs, which may in turn correspond to n virtual noises; processing each binaural HRTF with its corresponding virtual noise yields n binaural noise signals, which are played. Optionally, the n virtual noises may be n1(n), …, nn(n), and they may be identical or different.
In this way, superimposing virtual noise on the audio signals of the non-attended sound sources reduces their clarity, thereby improving the perceived clarity of the audio signal from the sound source the user is attending to.
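As noted in item 1 above, the HRTF for an arbitrary source position can be interpolated from HRTFs stored for a grid of measured positions. The patent does not specify the interpolation; a minimal sketch using inverse-distance weighting over the nearest stored positions (an assumed scheme) is:

```python
import numpy as np

def interpolate_hrtf(target, grid_pos, grid_hrir, k=3):
    """Interpolate a binaural HRIR at `target` from the k nearest
    stored measurement positions using inverse-distance weights.

    target    : (3,) coordinates of a non-attended sound source
    grid_pos  : (P, 3) stored measurement positions
    grid_hrir : (P, 2, taps) stored left/right impulse responses
    """
    d = np.linalg.norm(grid_pos - target, axis=1)
    idx = np.argsort(d)[:k]                          # k nearest stored positions
    w = 1.0 / (d[idx] + 1e-9)
    w /= w.sum()                                     # normalized inverse-distance weights
    return np.tensordot(w, grid_hrir[idx], axes=1)   # weighted sum -> (2, taps)
```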
In the embodiments provided in this application, the methods are described from the perspective of the electronic device as the executing entity. To implement the functions in the methods provided by these embodiments, the electronic device may comprise a hardware structure and/or software modules, implementing each function as a hardware structure, a software module, or a combination of the two. Whether a given function is implemented as a hardware structure, a software module, or a combination of both depends on the particular application and the design constraints of the technical solution.
As shown in fig. 9, an auxiliary listening device according to an embodiment of the present application includes at least a processing unit 901 and a playing unit 902.
The processing unit 901 is configured to determine, according to an external audio signal collected by a microphone and a first direction, coordinates of a first sound source, where the first direction is determined by detecting a user, and the first sound source is a sound source in a direction other than the first direction; the processing unit 901 is further configured to determine, according to the coordinates of the first sound source and a preset HRTF, a first HRTF corresponding to the coordinates of the first sound source; the processing unit 901 is further configured to obtain a noise signal corresponding to the first sound source according to the first HRTF and preset virtual noise; the playing unit 902 is configured to play the noise signal.
In one possible implementation manner, the determining coordinates of the first sound source according to the external audio signal collected by the microphone and the first direction includes: determining coordinates of at least one sound source near the user according to an external audio signal collected by a microphone; detecting the user and determining the first direction; determining coordinates of the first sound source among coordinates of at least one sound source near the user according to the first direction.
In one possible implementation, the determining coordinates of at least one sound source near the user according to an external audio signal collected by a microphone includes: there are a plurality of microphones, each of which collects an external audio signal, with time delays between the external audio signals collected by different microphones; and determining the coordinates of at least one sound source near the user according to the time delays between the external audio signals collected by the different microphones.
In a possible implementation manner, the detecting the user and determining the first direction include: detecting a gaze direction of the user; or detecting the binaural current difference of the user, and determining the gaze direction of the user according to the corresponding relation between the binaural current difference and the gaze direction, wherein the gaze direction of the user is the first direction.
In a possible implementation manner, the determining, according to the first direction, coordinates of the first sound source in coordinates of at least one sound source near the user includes: analyzing the coordinates of at least one sound source near the user, and determining the direction relation between each sound source and the user; determining the deviation between each sound source and the first direction according to the first direction and the direction relation between the sound source and a user; selecting a sound source having a deviation from the first direction greater than a threshold value among at least one sound source near the user, the coordinates of the sound source being the coordinates of the first sound source.
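A minimal sketch of this selection, assuming the user's position and a unit gaze-direction vector are known, and using an angular deviation measure with a fixed threshold (both the measure and the threshold value are assumptions):

```python
import numpy as np

def select_non_attended(sources, user_pos, gaze_dir, thresh_deg=30.0):
    """Select the sources whose direction from the user deviates from
    the first (attended) direction by more than a threshold angle.

    sources  : (n, 3) candidate source coordinates
    user_pos : (3,) user position
    gaze_dir : (3,) unit vector of the first direction
    Returns the coordinates of the first sound source(s).
    """
    v = sources - user_pos
    v /= np.linalg.norm(v, axis=1, keepdims=True)            # direction to each source
    ang = np.degrees(np.arccos(np.clip(v @ gaze_dir, -1.0, 1.0)))
    return sources[ang > thresh_deg]                         # deviation above threshold
```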
In one possible implementation, the processing unit 901 is further configured to: the external audio signals collected by the microphone are mixed audio signals, and the mixed audio signals comprise audio signals output by a plurality of sound sources; and separating the external audio signals collected by the microphone to obtain a first audio signal output by a first sound source.
In one possible implementation, the processing unit 901 is further configured to: analyzing the separated first audio signal to determine the content of the first audio signal; and determining the type of virtual noise to be added according to the content of the first audio signal.
In a possible implementation manner, the content of the first audio signal is human speech, and the type of virtual noise to be added is multi-talker babble noise.
In one possible implementation, the processing unit 901 is further configured to: determining an energy of the separated first audio signal; and determining the energy of the virtual noise to be added according to the energy of the first audio signal.
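A minimal sketch of this energy matching, assuming the noise is scaled to reach a target signal-to-noise ratio relative to the separated signal (the target-SNR parameter is an assumption; the application only requires the noise energy to be determined from the energy of the first audio signal):

```python
import numpy as np

def match_noise_energy(separated, noise, snr_db=0.0):
    """Scale the virtual noise so its energy tracks the energy of the
    separated first audio signal, targeting the given SNR in dB."""
    sig_energy = np.mean(separated ** 2)
    noise_energy = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(sig_energy / noise_energy * 10 ** (-snr_db / 10))
    return gain * noise
```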
Embodiments of this application also provide a computer-readable storage medium that includes a program which, when executed by a processor, performs the method in the above method embodiments.
A computer program product comprising computer program code which, when run on a computer, causes the computer to implement the method in the above method embodiments.
A chip, comprising: a processor coupled with a memory for storing a program or instructions which, when executed by the processor, cause an apparatus to perform the method in the above method embodiments.
In the description of this application, "/" indicates an "or" relationship between the associated objects; for example, A/B may indicate A or B. "And/or" merely describes an association between associated objects, indicating that three relationships are possible; for example, "A and/or B" may indicate: A alone, both A and B, or B alone, where A and B may be singular or plural. Also, in the description of this application, "a plurality" means two or more unless otherwise specified. "At least one of the following" and similar expressions refer to any combination of the listed items, including any combination of single or plural items. For example, "at least one of a, b, or c" may indicate: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple. In addition, to describe the technical solutions of the embodiments clearly, terms such as "first" and "second" are used to distinguish between identical or similar items with substantially the same functions and effects. Those skilled in the art will appreciate that such terms do not limit number or execution order, and do not indicate a difference in importance.
In the embodiments of this application, the processor may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logical blocks disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application may be performed directly by a hardware processor, or by a combination of hardware and software modules within a processor.
In the embodiments of this application, the memory may be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or a volatile memory, such as a random-access memory (RAM). The memory may also be any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without limitation. The memory in the embodiments of this application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
The methods provided in the embodiments of this application may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network appliance, a user device, or another programmable apparatus. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (e.g., infrared or microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., an SSD).
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
It is noted that a portion of this patent application contains material which is subject to copyright protection. The copyright owner has no objection to the reproduction of the patent document or the patent disclosure as it appears in the patent office files or records, but otherwise reserves all copyright rights whatsoever.

Claims (20)

  1. A method for assisted listening, comprising:
    determining coordinates of a first sound source according to an external audio signal collected by a microphone and a first direction, wherein the first direction is determined by detecting a user, and the first sound source is a sound source in a direction other than the first direction;
    determining a first HRTF corresponding to the first sound source according to the coordinates of the first sound source and a preset head-related transfer function (HRTF);
    obtaining a noise signal corresponding to the first sound source according to the first HRTF and preset virtual noise, and playing the noise signal.
  2. The method of claim 1, wherein determining coordinates of a first sound source from the external audio signal captured by the microphone and the first direction comprises:
    determining coordinates of at least one sound source near the user according to an external audio signal collected by a microphone;
    detecting the user and determining the first direction;
    determining coordinates of the first sound source among coordinates of at least one sound source near the user according to the first direction.
  3. The method of claim 2, wherein determining coordinates of at least one sound source in the vicinity of the user from the external audio signals collected by the microphones comprises:
    there are a plurality of microphones, each of which collects an external audio signal, with time delays between the external audio signals collected by different microphones;
    and determining the coordinates of at least one sound source near the user according to the time delays between the external audio signals collected by the different microphones.
  4. The method of claim 2 or 3, wherein said detecting the user and determining the first direction comprises:
    detecting a gaze direction of the user; or,
    detecting the binaural current difference of the user, and determining the gaze direction of the user according to the corresponding relation between the binaural current difference and the gaze direction, wherein the gaze direction of the user is the first direction.
  5. The method according to any of claims 2 to 4, wherein said determining coordinates of said first sound source in coordinates of at least one sound source in the vicinity of the user according to said first direction comprises:
    analyzing the coordinates of at least one sound source near the user, and determining the direction relation between each sound source and the user;
    determining the deviation between each sound source and the first direction according to the first direction and the direction relation between the sound source and a user;
    selecting a sound source whose deviation from the first direction is greater than a threshold value, of at least one sound source in the vicinity of the user, the coordinates of which are the coordinates of the first sound source.
  6. The method of any of claims 1 to 5, further comprising:
    the external audio signals collected by the microphone are mixed audio signals, and the mixed audio signals comprise audio signals output by a plurality of sound sources;
    and separating the external audio signals collected by the microphone to obtain a first audio signal output by the first sound source.
  7. The method of claim 6, further comprising:
    analyzing the separated first audio signal to determine the content of the first audio signal;
    and determining the type of virtual noise to be added according to the content of the first audio signal.
  8. The method of claim 7, wherein the content of the first audio signal is human speech, and the type of virtual noise to be added is multi-talker babble noise.
  9. The method of any of claims 6 to 8, further comprising:
    determining an energy of the separated first audio signal;
    and determining the energy of the virtual noise to be added according to the energy of the first audio signal.
  10. An assistive listening device, comprising:
    a processing unit, configured to determine coordinates of a first sound source according to an external audio signal acquired by a microphone and a first direction, where the first direction is determined by detecting a user, and the first sound source is a sound source in a direction other than the first direction;
    the processing unit is further configured to determine a first HRTF corresponding to the coordinates of the first sound source according to the coordinates of the first sound source and a preset HRTF;
    the processing unit is further configured to obtain a noise signal corresponding to the first sound source according to the first HRTF and predetermined virtual noise;
    and the playing unit is used for playing the noise signal.
  11. The apparatus of claim 10, wherein determining coordinates of the first sound source based on the external audio signal collected by the microphone and the first direction comprises:
    determining coordinates of at least one sound source near the user according to an external audio signal collected by a microphone;
    detecting the user and determining the first direction;
    determining coordinates of the first sound source among coordinates of at least one sound source near the user according to the first direction.
  12. The apparatus of claim 11, wherein said determining coordinates of at least one sound source in the vicinity of the user from the external audio signals collected by the microphones comprises:
    there are a plurality of microphones, each of which collects an external audio signal, with time delays between the external audio signals collected by different microphones;
    and determining the coordinates of at least one sound source near the user according to the time delays between the external audio signals collected by the different microphones.
  13. The apparatus of claim 11 or 12, wherein said detecting the user to determine the first direction comprises:
    detecting a gaze direction of the user; or,
    detecting the binaural current difference of the user, and determining the gaze direction of the user according to the corresponding relation between the binaural current difference and the gaze direction, wherein the gaze direction of the user is the first direction.
  14. The apparatus according to any of claims 11 to 13, wherein said determining coordinates of said first sound source among coordinates of at least one sound source in the vicinity of the user according to said first direction comprises:
    analyzing the coordinates of at least one sound source near the user, and determining the direction relation between each sound source and the user;
    determining the deviation between each sound source and the first direction according to the first direction and the direction relation between the sound source and a user;
    selecting a sound source having a deviation from the first direction greater than a threshold value among at least one sound source near the user, the coordinates of which are the coordinates of the first sound source.
  15. The apparatus of any of claims 10 to 14, wherein the processing unit is further to:
    the external audio signals collected by the microphone are mixed audio signals, and the mixed audio signals comprise audio signals output by a plurality of sound sources;
    and separating the external audio signals collected by the microphone to obtain a first audio signal output by the first sound source.
  16. The apparatus of claim 15, wherein the processing unit is further configured to:
    analyzing the separated first audio signal to determine the content of the first audio signal;
    and determining the type of virtual noise to be added according to the content of the first audio signal.
  17. The apparatus of claim 16, wherein the content of the first audio signal is human speech, and the type of virtual noise to be added is multi-talker babble noise.
  18. The apparatus of any of claims 15 to 17, wherein the processing unit is further to:
    determining an energy of the separated first audio signal;
    and determining the energy of the virtual noise to be added according to the energy of the first audio signal.
  19. An electronic device, comprising memory and one or more processors, wherein the memory is configured to store computer program code comprising computer instructions; the computer instructions, when executed by the processor, cause the electronic device to perform the method of any of claims 1-9.
  20. A computer-readable storage medium comprising a program or instructions which, when run on a computer, causes the method of any of claims 1-9 to be performed.
CN202180004382.3A 2021-02-26 2021-02-26 Auxiliary listening method and device Pending CN115250646A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/078222 WO2022178852A1 (en) 2021-02-26 2021-02-26 Listening assisting method and apparatus

Publications (1)

Publication Number Publication Date
CN115250646A true CN115250646A (en) 2022-10-28

Family

ID=83047666

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180004382.3A Pending CN115250646A (en) 2021-02-26 2021-02-26 Auxiliary listening method and device

Country Status (2)

Country Link
CN (1) CN115250646A (en)
WO (1) WO2022178852A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9747367B2 (en) * 2014-12-05 2017-08-29 Stages Llc Communication system for establishing and providing preferred audio
US9905244B2 (en) * 2016-02-02 2018-02-27 Ebay Inc. Personalized, real-time audio processing
CN108810719A (en) * 2018-08-29 2018-11-13 歌尔科技有限公司 A kind of noise-reduction method, neckstrap formula earphone and storage medium
US10638248B1 (en) * 2019-01-29 2020-04-28 Facebook Technologies, Llc Generating a modified audio experience for an audio system

Also Published As

Publication number Publication date
WO2022178852A1 (en) 2022-09-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination