WO2020060206A1 - Methods for audio processing, apparatus, electronic device and computer readable storage medium - Google Patents

Methods for audio processing, apparatus, electronic device and computer readable storage medium

Info

Publication number
WO2020060206A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
voice
audio
signal
collecting device
Application number
PCT/KR2019/012099
Other languages
French (fr)
Inventor
Lei Yang
Weiqin Wang
Bingxiao FANG
Yunchuan LI
Lizhong Wang
Heng Zhu
Zhenchang MA
Original Assignee
Samsung Electronics Co., Ltd.
Application filed by Samsung Electronics Co., Ltd.
Publication of WO2020060206A1

Classifications

    • G10L21/0202
    • H04R3/005 Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Noise filtering with processing in the frequency domain
    • G10L2021/02165 Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal
    • G10L25/06 Speech or voice analysis techniques characterised by the extracted parameters being correlation coefficients
    • G10L25/78 Detection of presence or absence of voice signals
    • H04R1/1016 Earpieces of the intra-aural type
    • H04R2460/13 Hearing devices using bone conduction transducers
    • H04R5/027 Spatial or constructional arrangements of microphones, e.g. in dummy heads

Definitions

  • the present application relates to the field of voice enhancement technologies, and in particular, to a method for audio processing, an apparatus, an electronic device, and a computer readable storage medium.
  • Earphones may be provided with two audio collecting devices, i.e., an air conduction audio collecting device and a body conduction audio collecting device.
  • sound collected by an air conduction audio collecting device is easily interfered with by the surrounding environment, such that the collected sound may contain a lot of noise.
  • sound collected by a body conduction audio collecting device is obtained through body tissue conduction (such as bone conduction). Therefore, the body conduction audio collecting device collects little noise, or even no noise at all.
  • although the sound collected by the air conduction audio collecting device is susceptible to ambient noise, the sound collected through air conduction is full-band.
  • the sound collected by the body conduction audio collecting device is conducted through body tissues, such that the high-frequency part of this sound is lost. Therefore, it remains a crucial issue how to design an earphone with two audio collecting devices so as to obtain better sound signals by utilizing the different characteristics of the two audio collecting devices, and to perform applications, such as voice transmission, voice recognition, etc.
  • the present application provides a method for audio processing, an apparatus, an electronic device and a computer readable storage medium, which utilize the different characteristics of the two audio collecting devices of an earphone to obtain an audio signal with better effect for performing applications, such as voice transmission, voice recognition, etc.
  • the specific technical solutions are shown as follows:
  • a method for audio processing including: acquiring a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device; and performing voice enhancement processing on the first audio signal and the second audio signal to obtain a voice-enhancement-processed audio signal to be output, based on a signal correlation between the first audio signal and the second audio signal.
  • an apparatus for audio processing including: a first acquiring module, configured to acquire a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device; and a voice enhancement processing module, configured to perform voice enhancement processing on the first audio signal and the second audio signal acquired by the first acquiring module, to obtain a voice-enhancement-processed audio signal to be output, based on a signal correlation between the first audio signal and the second audio signal.
  • an electronic device including: an air conduction audio collecting device, a body conduction audio collecting device, an audio signal playing device, a processor, and a memory; wherein, the air conduction audio collecting device, configured to collect a first audio signal conducted via air; the body conduction audio collecting device, configured to collect a second audio signal conducted via body tissues; the audio signal playing device, configured to play an audio signal; and the memory, configured to store machine readable instructions that, when executed by the processor, cause the processor to perform the method for audio processing shown in the first aspect.
  • a computer readable storage medium wherein the computer readable storage medium stores a computer program that, when executed by a processor, implements the method for audio processing shown in the first aspect.
  • a fifth aspect there is provided another method for audio processing, including: acquiring a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device; performing ambient sound cancellation processing on the second audio signal; and determining an audio signal to be output based on the first audio signal and the ambient-sound-cancellation-processed second audio signal.
  • another apparatus for audio processing including: a second acquiring module, configured to acquire a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device; an ambient sound cancellation processing module, configured to perform ambient sound cancellation processing on the second audio signal acquired by the second acquiring module; and a determining module, configured to determine an audio signal to be output based on the first audio signal acquired by the second acquiring module and the second audio signal subjected to ambient sound cancellation processing by the ambient sound cancellation processing module.
  • an electronic device including: an air conduction audio collecting device, a body conduction audio collecting device, an audio signal playing device, a processor, and a memory; wherein, the air conduction audio collecting device, configured to collect a first audio signal conducted via air; the body conduction audio collecting device, configured to collect a second audio signal conducted via body tissues; the audio signal playing device, configured to play an audio signal; and the memory, configured to store machine readable instructions that, when executed by the processor, cause the processor to perform the method for audio processing shown in the fifth aspect.
  • a computer readable storage medium wherein the computer readable storage medium stores a computer program that, when executed by a processor, implements the method for audio processing shown in the fifth aspect.
  • the embodiments of the present application provide a method for audio processing, an apparatus, an electronic device and a computer readable storage medium.
  • the present application acquires a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device, and is capable of performing voice enhancement processing on the first audio signal and the second audio signal to obtain the voice-enhancement-processed audio signal to be output, based on a signal correlation between the first audio signal and the second audio signal; that is, voice enhancement processing is performed on the audio signal collected by the air conduction audio collecting device and the audio signal collected by the body conduction audio collecting device based on the correlation between the two, to obtain voice signals with better effect for performing applications, such as voice transmission, voice recognition, etc.
  • the embodiments of the present application provide a method for audio processing, an apparatus, an electronic device and a computer readable storage medium.
  • the present application acquires a first audio signal collected by the air conduction audio collecting device and a second audio signal collected by the body conduction audio collecting device, then performs ambient sound cancellation processing on the second audio signal, and determines the audio signal to be output based on the first audio signal and the ambient-sound-cancellation-processed second audio signal.
  • ambient sound cancellation processing is performed on the audio signal collected by the body conduction audio collecting device first to obtain a voice signal that does not contain the ambient sound, and a signal to be output is obtained based on the audio signal collected by the air conduction audio collecting device and the ambient-sound-cancellation-processed audio signal collected by body conduction audio collecting device, to obtain audio signals with better effect for performing applications, such as voice transmission, voice recognition, etc.
  • Fig. 1 is a schematic diagram showing that a call peer is unable to hear voice or accurately recognize a voice instruction when a conventional earphone is used;
  • Fig. 2 is a schematic diagram showing that a call peer is capable of hearing voice or accurately recognizing a voice instruction when an earphone with a body conduction audio collecting device is used;
  • Fig. 3 is a schematic flowchart of performing voice enhancement processing in the prior art;
  • Fig. 4 is a schematic structural diagram of an earphone provided with an air conduction audio collecting device and a body conduction audio collecting device;
  • Fig. 5 is a schematic method flowchart for audio processing according to an embodiment of the present application;
  • Fig. 6 is another schematic method flowchart for audio processing according to an embodiment of the present application;
  • Fig. 7a is a schematic flowchart of a method for audio processing in a first specific example of Embodiment I;
  • Fig. 7b is a schematic diagram of a general flow for audio processing according to an embodiment of the present application;
  • Fig. 7c is a schematic flowchart of a specific implementation for audio processing in Embodiment I of the present application;
  • Fig. 7d is a schematic diagram of calculating the final voice spectrum amplitude by joint voice estimation;
  • Fig. 7e is a schematic flowchart of a method of a second specific example in Embodiment I;
  • Fig. 7f is a schematic flowchart of a method of a third specific example in Embodiment I;
  • Fig. 8a is a schematic flowchart of implementing audio enhancement by ambient sound cancellation processing and voice enhancement processing;
  • Fig. 8b is a schematic method flowchart for audio processing according to Embodiment II of the present application;
  • Fig. 8c is a schematic diagram of filtering and updating filter parameters based on a set filter according to an embodiment of the present application;
  • Fig. 9a is a schematic diagram of voice activation detection in Embodiment II of the present application;
  • Fig. 9b is a schematic method flowchart for voice activation detection according to Embodiment II of the present application;
  • Fig. 9c is a schematic diagram of determining whether it is currently in an activation state based on a sequence of correlation coefficients;
  • Fig. 9d is a schematic diagram of a sequence of correlation coefficients;
  • Fig. 10a is a schematic method flowchart for audio processing in Embodiment III of the present application;
  • Fig. 10b is a schematic diagram of a first specific example in Embodiment IV of the present application;
  • Fig. 10c is a schematic diagram of a second specific example in Embodiment IV of the present application;
  • Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
  • Fig. 12 is a block diagram of a computing system of an electronic device according to an embodiment of the present application;
  • Fig. 13 is a schematic structural diagram of an apparatus for audio processing according to an embodiment of the present application;
  • Fig. 14 is a schematic structural diagram of another apparatus for audio processing according to an embodiment of the present application.
  • the audio signal includes: a voice signal and/or a noise signal, etc.
  • for example, during a call, the audio signal sending user (the user wearing the earphone) sends audio to the call peer via a communication link, and the call peer receives the audio signal, wherein there may be cases of unclear call voice; for another example, in an environment where the ambient noise is high or there is an interfering human voice, a voice recognition application receiving a voice instruction often cannot accurately recognize the user's voice instruction due to the interference of the noise and the interfering voice.
  • the body conduction audio collecting device (which may be located in the ear or outside the ear; if located in the ear, it may be referred to as an in-ear audio collecting device) may be physically isolated from out-ear noise.
  • the audio signal is collected by the body conduction audio collecting device via body tissue conduction (such as bone conduction) when a person is making voices. Therefore, little noise is collected, or even no noise at all, so that when the audio signal is sent to the call peer during a call, the voice sent to the call peer is a clean voice, which is easily understood by the peer.
  • when the voice is used for voice recognition, the voice is sent to a voice recognition application, and since the voice received by the voice recognition application has no noise or voice interference, the recognition rate is higher.
  • the air conduction audio collecting device (which may be an out-ear audio collecting device) is susceptible to interference from ambient noise, and the collected audio signals contain many noise signals. But the signal collected by the air conduction audio collecting device is full-band compared to the voice collected by the body conduction audio collecting device (which may be an in-ear audio collecting device). This is because the voice signal picked up by the body conduction audio collecting device is conducted via body tissues, and the conducted signal undergoes a process similar to low-pass filtering.
  • the out-ear noise is physically isolated from the body conduction audio collecting device (for example, by the earplug being closely attached to the external auditory canal) while the user is using the earphone, so the collected audio signal is a clean voice signal and does not contain noise.
  • however, the high-frequency part is lost, and the spectrum of the audio signal collected by the air conduction audio collecting device is different from the spectrum of the audio signal collected by the body conduction audio collecting device.
  • an earphone having two audio collecting devices, i.e., an air conduction audio collecting device and a body conduction audio collecting device, obtains voice signals with better effect by utilizing the different characteristics of the two audio collecting devices, thereby performing applications, such as voice transmission, voice recognition, etc.
  • the general process is as follows: the audio signal picked up by the body conduction audio collecting device and the audio signal picked up by the air conduction audio collecting device are processed separately as two independent signals, for example, through a filter or the like for noise cancellation processing; the processed results are then superimposed into a final audio signal, and the obtained audio signal is transmitted to a terminal device connected to the earphone (the terminal device may be a mobile phone connected to the earphone through Bluetooth or a wired connection, etc.).
  • the terminal device connected to the earphone may transmit the final superimposed audio signal to the call peer; if in the voice recognition scenario, the terminal device connected to the earphone may recognize the user instruction according to the finally superimposed audio signal.
  • the first problem of the prior art: for signal processing of an earphone having two audio collecting devices, in the conventional method, before the audio signal is transmitted to the connected terminal device, the audio signals collected by the two audio collecting devices are subjected to noise cancellation and voice enhancement processing respectively, and the audio signals collected by the two audio collecting devices are then superimposed.
  • the audio signals collected by the body conduction audio collecting device and the audio signals collected by the air conduction audio collecting device are respectively processed by Fast Fourier Transformation (FFT), signal noise estimation processing, signal voice estimation processing and Inverse Fast Fourier Transformation (IFFT).
  • the IFFT-processed audio signal corresponding to the body conduction audio collecting device is processed by low-pass filtering, the IFFT-processed audio signal corresponding to the air conduction audio collecting device is processed by high-pass filtering, and the two filtered signals are superimposed to obtain an output signal.
  • the output signal is output to a terminal device, such as a mobile phone connected to the earphone, and then transmitted to the call peer by the terminal device or used to perform a corresponding application, such as voice recognition, recording, etc.
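  • To make the conventional crossover-and-sum stage above concrete, the following is a minimal sketch assuming Butterworth filters, a 16 kHz sample rate and a 1 kHz crossover frequency (all illustrative choices not given in the text):

```python
# Sketch of the prior-art crossover stage: low-pass the noise-cancelled
# body-conduction signal, high-pass the air-conduction signal, then sum.
import numpy as np
from scipy.signal import butter, lfilter

FS = 16_000       # assumed sample rate (Hz)
F_CROSS = 1_000   # assumed crossover frequency (Hz)

def crossover_sum(body_sig: np.ndarray, air_sig: np.ndarray) -> np.ndarray:
    b_lo, a_lo = butter(4, F_CROSS / (FS / 2), btype="low")
    b_hi, a_hi = butter(4, F_CROSS / (FS / 2), btype="high")
    low = lfilter(b_lo, a_lo, body_sig)   # keep the body-conduction low band
    high = lfilter(b_hi, a_hi, air_sig)   # keep the air-conduction high band
    return low + high                     # superimposed output signal
```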
  • This approach does not consider the correlation between the signals collected by the air conduction audio collecting device and the body conduction audio collecting device.
  • this correlation mainly comes from the fact that, whether for the audio signal collected by the body conduction audio collecting device or for that collected by the air conduction audio collecting device, the sound source is the speaker, but the two signals have passed through different propagation paths.
  • the audio signal collected by the air conduction audio collecting device is transmitted directly via the air; as the environment contains ambient noise, the air conduction audio collecting device also picks up the ambient noise while collecting the speaker's voice.
  • the body conduction audio collecting device collects the speaker's voice transmitted directly to it through body tissue conduction. Therefore, the voice audio collected by the air conduction audio collecting device and that collected by the body conduction audio collecting device actually have a high correlation. This correlation may help to perform voice detection and voice noise cancellation; if the correlation between the two signals can be used, a better voice enhancement effect may be achieved. However, in the prior art, voice enhancement is performed without using this correlation, so the voice enhancement effect of the prior art is poor.
  • the second problem of the prior art: when existing earphones with two audio collecting devices play local audio to the user, the main purpose is to eliminate the local environment noise and obtain clean voice; such a noise cancellation manner often uses the ambient noise collected by the air conduction audio collecting device, playing opposite-phase noise through an audio signal playing device (for example, an earphone speaker) to achieve the purpose of ambient noise cancellation.
  • this conventional method for ambient noise cancellation effectively eliminates local ambient noise and improves the user's listening experience. But it also leads to another problem: if there is a car next to the user or someone is talking, the noise cancellation algorithm will suppress the surrounding sound as noise, resulting in security issues or communication issues.
  • to address this, an earphone with two audio collecting devices may be designed with an ambient sound (AS) mode, that is, an environmental sound mode; when this mode is activated, the air conduction audio collecting device is capable of collecting the ambient sound outside the ear, which is then played out through the earphone speaker, so that the user may hear the sound of the surrounding environment, such as someone saying hello or a car approaching.
  • since the body conduction audio collecting device is located in the ear, the audio signal collected by the body conduction audio collecting device includes the voice conducted via body tissues and the audio signal played by the earphone speaker.
  • this AS mode may avoid security issues or communication issues. As shown by the schematic structural diagram of the earphone in Fig. 4, the body conduction audio collecting device and the audio signal playing device are both located within the ear, and the air conduction audio collecting device is located outside the ear; the earphone speaker plays the audio signal collected by the air conduction audio collecting device, the body conduction audio collecting device collects the voice conducted via body tissues and the audio signal played by the earphone speaker, and the air conduction audio collecting device collects the external audio signal.
  • the earphone designed with the AS mode is also problematic.
  • the audio signal collected by the body conduction audio collecting device is composed of two parts, wherein one part corresponds to the sound (including human voice and ambient noise) recorded by the air conduction audio collecting device and played by the audio signal playing device (e.g., an in-ear speaker), and the other part corresponds to the sound which is made by the user and collected by the body conduction audio collecting device through body tissue conduction.
  • since the audio collected by the body conduction audio collecting device then contains the ambient sound, the user voice collected by the air conduction audio collecting device, and the user voice conducted through body tissues (which may be referred to as body conduction voice), it is no longer a clean call voice. This may cause the voice call peer to fail to hear the user's voice clearly, or the terminal device to fail to accurately recognize the user's voice instruction; conventional noise cancellation algorithms are therefore unsuitable or unsatisfactory for an earphone with an air conduction audio collecting device and a body conduction audio collecting device.
  • the noise outside the ear is isolated.
  • the sound collected by the body conduction audio collecting device and made by the earphone user does not contain noise, and the high-frequency part of the audio signal conducted through bone is lost;
  • the audio collected by the air conduction audio collecting device is full-band, but, since it is propagated via air, the audio made by the earphone user and collected by the air conduction audio collecting device contains noise;
  • the audio signal collected by the air conduction audio collecting device is played by the audio signal playing device (e.g., an earphone speaker); therefore, in the case that the AS mode is activated, the noise signal contained in the audio signal collected by the body conduction audio collecting device is required to be eliminated.
  • the embodiments of the present application suppress noise and enhance voice quality by using the correlation between the signal collected by the body conduction audio collecting device and the signal collected by the air conduction audio collecting device, so as to achieve a clearer voice call and improve the performance of the uplink voice signal during the call.
  • in addition, the terminal device may accurately recognize the user instruction after the voice quality is enhanced, improving the accuracy of voice recognition. For the problems that the un-activated AS mode easily causes a security accident, while the activated AS mode degrades the voice call quality or prevents voice instructions from being accurately recognized, the present application recovers the signal collected by the body conduction audio collecting device by adding an adaptive filter in the AS mode, and eliminates the ambient noise in the audio sent to the peer while the speaker can still hear the ambient sound clearly, such that the receiver does not clearly hear the ambient noise of the transmitting end, thereby achieving a clearer voice call and improving the performance of the uplink voice signal during the call.
  • since the ambient noise is eliminated, the terminal device may accurately recognize the user's instruction, improving the accuracy of voice recognition.
  • the embodiments of the present application provide a method for audio processing, which may be applied to an earphone having an air conduction audio collecting device and a body conduction audio collecting device, as shown in Fig. 5, wherein,
  • Step S801 acquiring a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device.
  • Step S802 performing voice enhancement processing on the first audio signal and the second audio signal to obtain a voice-enhancement-processed audio signal to be output, based on a signal correlation between the first audio signal and the second audio signal.
  • the terminal device connected to the earphone may acquire the voice-enhancement-processed audio signal to be output, and output the audio signal to the call peer, or output the audio signal to the voice recognition application for voice recognition; or output the audio signal to the instant messaging application, and send it as voice information to the communication peer; or record the audio signal.
  • the specific process of the terminal device receiving the audio signal is not limited in the embodiment of the present application.
  • Step S802 the step of performing voice enhancement processing on the first audio signal and the second audio signal based on a signal correlation between the first audio signal and the second audio signal, includes: Step S8021 (not shown in the figure), Step S8022 (not shown in the figure) and Step S8023 (not shown in the figure), wherein,
  • Step S8021 performing noise estimation on the first audio signal and the second audio signal, respectively.
  • Step S8022 performing voice spectrum estimation on the first audio signal and the second audio signal respectively, according to the noise estimation result corresponding to the first audio signal and the second audio signal.
  • Step S8023 performing voice enhancement processing on the first audio signal and the second audio signal according to the voice spectrum estimation results corresponding to the first audio signal and the second audio signal.
  • Step S8021 the step of performing noise estimation on the first audio signal includes: Step S8021a (not shown in the figure) to Step S8021b (not shown in the figure), wherein,
  • Step S8021a determining a voice presence prior probability corresponding to the first audio signal.
  • Step S8021b performing noise estimation on the first audio signal based on the voice presence prior probability.
  • Step S8021a includes Step S8021a1 (not shown in the figure) and Step S8021a2 (not shown in the figure), wherein,
  • Step S8021a1 determining a signal outer inner ratio (OIR) between the first audio signal and the second audio signal.
  • Step S8021a2 determining a voice presence prior probability corresponding to the first audio signal according to the signal OIR.
  • Step S8021b includes Step S8021b1 (not shown in the figure) and Step S8021b2 (not shown in the figure), wherein,
  • Step S8021b1 determining a corresponding voice presence posterior probability based on the voice presence prior probability.
  • Step S8021b2 performing noise estimation on the first audio signal based on the voice presence posterior probability.
  • Step S8023 the step of performing voice enhancement processing on the first audio signal and the second audio signal according to the voice spectrum estimation results corresponding to the first audio signal and the second audio signal, includes: Step S8023a (not shown in the figure), wherein,
  • Step S8023a performing voice enhancement processing on the first audio signal and the second audio signal, according to the noise estimation results corresponding to the first audio signal and the second audio signal, and the voice spectrum estimation results corresponding to the first audio signal and the second audio signal.
  • Step S8023a includes Step S8023a1 (not shown in the figure) and Step S8023a2 (not shown in the figure), wherein,
  • Step S8023a1 performing joint voice enhancement processing on the first audio signal and the second audio signal, according to the noise estimation results corresponding to the first audio signal and the second audio signal, and the voice spectrum estimation results corresponding to the first audio signal and the second audio signal.
  • Step S8023a2 obtaining a voice-enhancement-processed audio signal to be output according to the obtained joint voice spectrum estimation result.
  • Step S8023a1 includes Step S8023a11 (not shown in the figure) to Step S8023a12 (not shown in the figure), wherein,
  • Step S8023a11 determining a mean value of a third Gaussian distribution model, according to a first Gaussian distribution model whose mean value is the voice spectrum estimation result of the first audio signal and variance is the noise estimation result of the first audio signal, and a second Gaussian distribution model whose mean value is the voice spectrum estimation result of the second audio signal and variance is the noise estimation result of the second audio signal.
  • Step S8023a12 determining joint voice spectrum estimation results for joint voice spectrum estimation on the first audio signal and the second audio signal according to the mean value of the third Gaussian distribution model.
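  • The text does not spell out the fusion arithmetic in Steps S8023a11 and S8023a12. Under the common reading that the third Gaussian is the normalized product of the first two, its mean is the precision-weighted average of the two voice spectrum estimates; the following is a minimal sketch under that assumption (function and variable names are illustrative):

```python
import numpy as np

def joint_voice_spectrum(mu1, var1, mu2, var2):
    """Mean of the third Gaussian obtained by multiplying
    N(mu1, var1) (air-conduction estimate) with N(mu2, var2)
    (body-conduction estimate); all arguments are per-frequency arrays."""
    w1 = 1.0 / var1   # precision of the air-conduction estimate
    w2 = 1.0 / var2   # precision of the body-conduction estimate
    return (w1 * mu1 + w2 * mu2) / (w1 + w2)  # joint spectrum estimate
```

  • Under this reading, frequency points where one channel is noisy (large noise variance) are automatically dominated by the other channel, e.g., the body-conduction estimate at low frequencies and the air-conduction estimate at high frequencies.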
  • Step Sa (not shown in the figure) is included, wherein,
  • Step Sa performing ambient sound cancellation processing on the second audio signal to obtain the ambient-sound-cancellation-processed second audio signal.
  • the step of performing voice enhancement processing on the first audio signal and the second audio signal includes Step Sb (not shown in the figure), wherein,
  • Step Sb performing voice enhancement processing on the first audio signal and the ambient-sound-cancellation-processed second audio signal.
  • Step Sa includes Step Sa1 (not shown in the figure) and Step Sa2 (not shown in the figure), wherein,
  • Step Sa1 acquiring a third audio signal to be played by an audio signal playing device.
  • Step Sa2 performing ambient sound cancellation processing on the second audio signal through the third audio signal, and obtaining the ambient-sound-cancellation-processed second audio signal.
  • the step of performing ambient sound cancellation processing on the second audio signal through the third audio signal includes: detecting whether it is currently in a voice activation state, wherein the voice activation state indicates that the user is making voices; if detected that it is currently in the voice activation state, performing the step of performing ambient sound cancellation processing on the second audio signal through the third audio signal.
  • the step of detecting whether it is currently in the voice activation state includes: determining whether an audio signal playing device channel and/or a body conduction audio collecting device channel is in a voice activation state according to the second audio signal and/or the third audio signal; if at least one channel is in the voice activation state, then determining whether it is currently in the voice activation state according to a signal correlation between the second audio signal and the third audio signal.
  • the embodiments of the present application provide a method for audio processing.
  • the present application acquires a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by the body conduction audio collecting device, is capable of performing voice enhancement processing on the first audio signal and the second audio signal to obtain the audio signal after voice enhancement processing, based on a signal correlation between the first audio signal and the second audio signal, that is, performing voice enhancement processing on the audio signal collected by the air conduction audio collecting device and the audio signal collected by the body conduction audio collecting device based on a correlation between the audio signal collected by the air conduction audio collecting device and the audio signal collected by the body conduction audio collecting device, to obtain voice signals with better effect for performing applications, such as voice transmission, voice recognition, etc.
  • the present application provides another method for audio processing, which may be applied to an electronic device having two audio collecting devices, as shown in Fig. 6, wherein,
  • Step S901 acquiring a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device.
  • Step S902 performing ambient sound cancellation processing on the second audio signal.
  • Step S903 determining an audio signal to be output based on the first audio signal and the ambient-sound-cancellation-processed second audio signal.
  • Step S902 includes Step S9021 (not shown in the figure) and Step S9022 (not shown in the figure), wherein,
  • Step S9021 acquiring a third audio signal to be played by an audio signal playing device.
  • Step S9022 performing ambient sound cancellation processing on the second audio signal through the third audio signal, and obtaining the ambient-sound-cancellation-processed second audio signal.
  • Step S9022 the step of performing ambient sound cancellation processing on the second audio signal through the third audio signal includes: detecting whether it is currently in a voice activation state, wherein the voice activation state indicates that the user is making voices; and, if it is detected that it is currently in the voice activation state, performing the step of performing ambient sound cancellation processing on the second audio signal through the third audio signal.
  • the method further includes: if detected that it is currently in the voice inactivation state, then updating parameter information of the ambient sound cancellation filter processing.
  • the step of updating parameter information of the ambient sound cancellation filter processing includes: determining a prediction signal for the second audio signal based on the third audio signal; and, updating the parameter information of the ambient sound cancellation filter processing according to the second audio signal and a prediction signal for the second audio signal.
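  • The patent does not name a specific adaptation rule for the filter update above; a normalized LMS (NLMS) update driven by the prediction error is one common realization, sketched below under that assumption (step size and filter length are illustrative):

```python
import numpy as np

def nlms_update(w, x_buf, d, mu=0.1, eps=1e-8):
    """One NLMS step: predict the current sample d of the second (in-ear)
    signal from the last len(w) samples of the third (to-be-played) signal
    in x_buf, then update the filter from the prediction error."""
    d_hat = np.dot(w, x_buf)   # prediction signal for the second audio signal
    e = d - d_hat              # prediction error
    w = w + mu * e * x_buf / (eps + np.dot(x_buf, x_buf))
    return w, e
```

  • Updating only in the voice-inactive state, as required above, prevents the filter from adapting to, and later cancelling, the user's own body-conducted voice.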
  • the step of detecting whether it is currently in the voice activation state includes: determining whether an audio signal playing device channel and/or a body conduction audio collecting device channel is in a voice activation state according to the second audio signal and/or the third audio signal; if at least one channel is in the voice activation state, then determining whether it is currently in the voice activation state according to a signal correlation between the second audio signal and the third audio signal.
  • the step of determining whether it is currently in the voice activation state according to a signal correlation between the second audio signal and the third audio signal includes: determining a sequence of correlation coefficients between the second audio signal and the third audio signal; and detecting whether it is currently in the voice activation state based on the sequence of correlation coefficients.
  • the step of detecting whether it is currently in the voice activation state, based on the sequence of correlation coefficients includes: determining a main peak in the sequence of correlation coefficients; if there is another peak in a predefined delay range before the main peak in the sequence of correlation coefficients, determining that it is currently in the voice activation state.
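  • A sketch of that peak test follows, assuming the sequence of correlation coefficients is indexed by lag and that the "predefined delay range" is a small number of lags (both illustrative assumptions):

```python
import numpy as np
from scipy.signal import find_peaks

def is_voice_active(corr_seq, delay_range=32):
    """Find the main peak of the correlation-coefficient sequence, then
    look for another peak within `delay_range` lags before it; such an
    earlier peak indicates body-conducted voice, i.e., voice activation."""
    peaks, _ = find_peaks(corr_seq)
    if peaks.size == 0:
        return False
    main = peaks[np.argmax(corr_seq[peaks])]  # main peak position
    earlier = peaks[(peaks < main) & (peaks >= main - delay_range)]
    return earlier.size > 0
```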
  • the embodiments of the present application provide a method for audio processing.
  • the present application acquires a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device, then performs ambient sound cancellation processing on the second audio signal, and determines the audio signal to be output based on the first audio signal and the ambient-sound-cancellation-processed second audio signal.
  • ambient sound cancellation processing is performed on the audio signal collected by the body conduction audio collecting device first to obtain a voice signal that does not contain ambient sound, and a signal to be output is obtained based on the audio signal collected by the air conduction audio collecting device and the ambient-sound-cancellation-processed audio signal collected by body conduction audio collecting device, to obtain audio signals with better effect for performing applications, such as voice transmission, voice recognition, etc.
  • Embodiment I is used to solve the first problem of the prior art: the conventional approach does not consider the correlation between the signals collected by the body conduction audio collecting device and the signals collected by the air conduction audio collecting device when performing voice enhancement processing, resulting in a poor voice enhancement effect;
  • Embodiment II is used to solve the second problem of the prior art: the un-activated AS mode easily causes a security accident, while with the AS mode activated, the voice call quality is poor or voice instructions may not be accurately recognized;
  • Embodiment III is used to solve the above-mentioned first and second problems of the prior art.
  • Embodiment IV describes the manners of audio signal processing in two different application scenarios on the basis of Embodiment III; for a detailed description, please refer to the following embodiments. In the present application, the air conduction audio collecting device may be located outside the ear, and the body conduction audio collecting device, which is a device that collects audio conducted via body tissues such as bone tissue, may be worn outside the ear or inside the ear, which is not limited in the present application.
  • the embodiment of the present application provides a method for audio processing, including: acquiring a first audio signal and a second audio signal, wherein the first audio signal is an audio signal collected by an air conduction audio collecting device of the earphone, and the second audio signal is an audio signal conducted via body tissues (e.g., bone tissue) and collected by a body conduction audio collecting device of the earphone; performing voice enhancement processing on the first audio signal and the second audio signal to obtain the voice-enhancement-processed audio signal, based on a signal correlation between the first audio signal and the second audio signal.
  • the signal correlation between the first audio signal and the second audio signal may be embodied in the joint voice estimation processing (the voice estimation processing may also be referred to as voice spectrum estimation processing), as described in detail in the first specific example; it may also be embodied in the calculation processing for the voice presence prior probability, as described in detail in the second specific example; and the correlation between the first audio signal and the second audio signal may be embodied both in the joint voice estimation processing and in the calculation processing for the voice presence prior probability, as described in detail in the third specific example.
  • This specific example provides a method for audio processing, as shown in Fig. 7a, including:
  • Step S1001 acquiring a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device.
  • the first audio signal may also contain ambient noise signals. It is not limited in the embodiment of the present application.
  • the second audio signal is an audio signal that is conducted via body tissues and collected by the body conduction audio collecting device, and the second audio signal contains a voice signal of the user.
  • it is possible for the body conduction audio collecting device to also collect the music or call voice played by the audio signal playing device.
  • the audio played by the audio signal playing device may be eliminated by echo cancellation processing to obtain the second audio signal.
  • Step S1002 performing joint voice estimation processing based on the following information:
  • the estimation value of the noise variance corresponding to the first audio signal is an estimation value of the noise variance corresponding to each frequency point in the frequency-domain signal of the first audio signal;
  • the estimation value of the pure voice spectrum amplitude corresponding to the first audio signal is an estimation value of the pure voice spectrum amplitude corresponding to each frequency point in the frequency-domain signal of the first audio signal;
  • the estimation value of the noise variance corresponding to the second audio signal is an estimation value of the noise variance corresponding to each frequency point in the frequency-domain signal of the second audio signal;
  • the estimation value of the pure voice spectrum amplitude corresponding to the second audio signal is an estimation value of the pure voice spectrum amplitude corresponding to each frequency point in the frequency-domain signal of the second audio signal.
  • before Step S1002, the method further includes: calculating the following information:
  • the estimation value of the noise variance corresponding to the first audio signal and the estimation value of the noise variance corresponding to the second audio signal are the noise estimation results obtained by noise estimation on the first audio signal and the second audio signal, respectively;
  • the estimation value of the pure voice spectrum amplitude corresponding to the first audio signal and the estimation value of the pure voice spectrum amplitude corresponding to the second audio signal are the voice spectrum estimation results for the voice spectrum estimation on the first audio signal and the second audio signal, respectively.
  • a signal noise estimation algorithm and a voice spectrum estimation algorithm in the prior art may be applied to calculate the estimation value of the noise variance corresponding to the first audio signal, the estimation value of the pure voice spectrum amplitude corresponding to the first audio signal, the estimation value of the noise variance corresponding to the second audio signal and the estimation value of the pure voice spectrum amplitude corresponding to the second audio signal;
  • the first audio signal and the second audio signal may be separately subjected to noise estimation and voice spectrum estimation by using the processing manners in the present application, specifically including: firstly calculating the voice presence prior probability corresponding to each frequency point in the frequency-domain signal of the first audio signal (i.e., the voice presence prior probability corresponding to the first audio signal), then calculating the estimation value of the noise variance corresponding to the first audio signal as well as the estimation value of the pure voice spectrum amplitude corresponding to the first audio signal by a signal noise estimation algorithm and a voice spectrum estimation algorithm based on the voice presence prior probability corresponding to each frequency point in the frequency-domain signal of the first audio signal; the corresponding estimation values for the second audio signal may be calculated similarly, based on a predefined voice presence prior probability.
  • Step S1003 obtaining a voice-enhancement-processed audio signal to be output according to the obtained joint voice estimation result.
  • the obtained joint voice estimation result is the final voice spectrum amplitude value corresponding to each frequency point.
  • the final voice spectrum amplitude value corresponding to each frequency point is a voice spectrum amplitude value corresponding to each frequency point in the frequency-domain signal corresponding to the voice-enhancement-processed time-domain signal.
  • Step S1003 includes: performing IFFT transformation on the final voice spectrum amplitude value corresponding to each frequency point, applying the sine window, and performing interframe overlap-add to obtain the voice-enhancement-processed time-domain audio signal to be output.
  • Fig. 7b illustrates that the audio signal is subjected to the voice presence prior probability processing, the signal noise estimation, the voice spectrum estimation and the joint voice estimation processing by using the processing manner of the present application, specifically including: performing FFT on the first audio signal and the second audio signal to obtain a frequency-domain signal corresponding to the first audio signal and a frequency-domain signal corresponding to the second audio signal respectively; performing the voice presence prior probability processing to obtain the voice presence prior probability corresponding to the first audio signal based on the frequency-domain signal corresponding to the first audio signal and the frequency-domain signal corresponding to the second audio signal; then performing noise estimation on the first audio signal based on the voice presence prior probability corresponding to the first audio signal, to obtain an estimation value of the noise variance corresponding to the first audio signal and the first voice presence posterior probability; performing noise estimation processing on the second audio signal based on a predefined voice presence prior probability, to obtain an estimation value of the noise variance corresponding to the second audio signal and the second voice presence posterior probability; performing the voice spectrum estimation on the first audio signal and the second audio signal respectively, based on the corresponding noise estimation results; and performing the joint voice estimation processing to obtain the final voice spectrum amplitude corresponding to each frequency point.
  • Step S701 performing FFT on the first audio signal and the second audio signal, respectively, to obtain a frequency-domain signal corresponding to the first audio signal and a frequency-domain signal corresponding to the second audio signal.
  • before noise estimation processing is performed on the first audio signal and the second audio signal, the method further includes: performing Fourier transformation on the first audio signal and the second audio signal, respectively, to obtain a frequency-domain signal corresponding to the first audio signal and a frequency-domain signal corresponding to the second audio signal.
  • the first audio signal and the second audio signal are respectively calculated by windowed short-time Fourier transformation to obtain a frequency-domain signal corresponding to the first audio signal and a frequency-domain signal corresponding to the second audio signal, which may also be referred to as a first frequency-domain signal and a second frequency-domain signal.
  • the formula of the windowed short-time Fourier transformation may be:

    $$f(k) = \sum_{n=0}^{N-1} w(n)\, x(n)\, e^{-j 2 \pi k n / N}, \qquad k = 0, \dots, N-1$$

  • wherein x is the first audio signal x_0 or the second audio signal x_i; w represents a window function, the window function w in the embodiment of the present application being selected as the sine window; N is the frame length; the output frequency-domain signal f(k) is the frequency-domain signal corresponding to the first audio signal x_0 or the frequency-domain signal corresponding to the second audio signal x_i, and is represented as a vector Y; k is 0 to N-1 in the following.
  • the frame length N may correspond to 10 ms.
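  • A sketch of this analysis transform, together with the matching IFFT and overlap-add synthesis used in Step S1003, is given below; it assumes a 16 kHz sample rate and 50% frame overlap (under which the sine window applied at both analysis and synthesis yields perfect reconstruction); both values are illustrative assumptions:

```python
import numpy as np

FS = 16_000                    # assumed sample rate (Hz)
N = int(0.010 * FS)            # 10 ms frame length, per the text
HOP = N // 2                   # assumed 50% frame overlap
WIN = np.sin(np.pi * (np.arange(N) + 0.5) / N)  # sine window w(n)

def stft_frames(x):
    """Windowed short-time Fourier transform: one spectrum Y per frame."""
    starts = range(0, len(x) - N + 1, HOP)
    return np.array([np.fft.fft(x[s:s + N] * WIN) for s in starts])

def istft_frames(Y, length):
    """IFFT each frame, apply the sine window again, and overlap-add."""
    out = np.zeros(length)
    for m, spec in enumerate(Y):
        out[m * HOP:m * HOP + N] += np.fft.ifft(spec).real * WIN
    return out
```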
  • Step S702 determining the voice presence prior probability corresponding to the first audio signal. That is, the voice presence prior probability corresponding to each frequency point in the frequency-domain signal of the first audio signal is calculated by voice presence prior probability processing.
  • an Outer Inner Ratio (OIR) of the first frequency-domain signal and the second frequency-domain signal may be calculated, which may also be referred to as a signal OIR between the first audio signal and the second audio signal; and the voice presence prior probability corresponding to the first audio signal is determined by the Outer Inner Ratio.
  • specifically, the voice presence prior probability corresponding to each frequency point in the frequency-domain signal of the first audio signal (also referred to as the first voice presence prior probability) is calculated by a Cauchy distribution model based on the calculated OIR of the first frequency-domain signal and the second frequency-domain signal.
  • the amplitude value at a pure voice frequency point roughly conforms to a Gaussian distribution with a mean value of 0, and the ratio of two zero-mean Gaussian variables conforms to the Cauchy distribution. Therefore, the voice presence prior probability corresponding to each frequency point in the frequency-domain signal of the first audio signal is calculated based on the OIR and by the Cauchy distribution model.
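  • Stated precisely (a standard result underlying the Cauchy model):

```latex
% Ratio of two independent zero-mean Gaussians is Cauchy distributed.
\text{If } A \sim \mathcal{N}(0, \sigma_0^2) \text{ and }
B \sim \mathcal{N}(0, \sigma_i^2) \text{ are independent, then }
Z = A / B \text{ has density }
p(z) = \frac{1}{\pi} \cdot \frac{\gamma}{z^2 + \gamma^2},
\qquad \gamma = \sigma_0 / \sigma_i .
```

  • That is, the ratio follows a Cauchy distribution whose scale is the ratio of the two standard deviations, which is what justifies modeling the OIR with a Cauchy distribution.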
  • the OIR of the first frequency-domain signal to the second frequency-domain signal is calculated by the following formula:

    $$\mathrm{OIR}(k) = \frac{|Y_0(k)|}{|Y_i(k)|}$$

  • wherein OIR is the Outer Inner Ratio of the first frequency-domain signal and the second frequency-domain signal; Y_0 is the first frequency-domain signal output after the first audio signal is subjected to time-frequency conversion; and Y_i is the second frequency-domain signal output after the second audio signal is subjected to time-frequency conversion.
  • the voice presence prior probability corresponding to each frequency point in the frequency-domain signal of the first audio signal may be calculated by a formula of the Cauchy-distribution form, for example:

P(k) = 1 / (1 + ((OIR(k) - priOIR(k)) / g)^2)

  • P is the initial value vector of the voice presence prior probability, and each vector element corresponds to one frequency point;
  • the general rule is that the voice presence probability in the second audio signal decreases rapidly as the frequency increases, while that in the first audio signal decreases relatively slowly;
  • g is an empirical coefficient (which may be a fixed value);
  • priOIR is the outer inner ratio of the second audio signal and the first audio signal when the signal is pure voice, and may be obtained by pre-statistics.
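The sketch below computes the per-frequency-point OIR and turns it into a Cauchy-shaped prior. Since the exact form of the Cauchy model is not legible in this text, the weighting function is an assumed example consistent with the definitions above (priOIR as the pure-voice ratio, g as the scale).

```python
import numpy as np

def voice_presence_prior(Y0, Yi, priOIR, g):
    # OIR(k) = |Y0(k)|^2 / |Yi(k)|^2 per frequency point
    oir = np.abs(Y0) ** 2 / (np.abs(Yi) ** 2 + 1e-12)
    # Cauchy-shaped weighting centered on the pure-voice ratio priOIR;
    # this functional form is an assumption, not quoted from the text
    return 1.0 / (1.0 + ((oir - priOIR) / g) ** 2)
```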
  • Step S703 performing noise estimation on the first audio signal based on the voice presence prior probability corresponding to the first audio signal.
  • the estimation value of the noise variance corresponding to the first audio signal is calculated by a signal noise estimation algorithm based on the voice presence prior probability corresponding to each frequency point in the frequency-domain signal of the first audio signal obtained by calculation.
  • the voice presence posterior probability corresponding to each frequency point in the frequency-domain signal of the first audio signal (which may also be referred to as: the first voice presence posterior probability) is calculated by a signal noise estimation algorithm, based on the voice presence prior probability corresponding to each frequency point in the frequency-domain signal of the first audio signal obtained by calculation; the estimation value of the noise variance corresponding to the first audio signal is calculated based on the first voice presence posterior probability.
  • Step S704 performing noise estimation on the second audio signal.
  • the voice presence posterior probability corresponding to each frequency point in the frequency-domain signal of the second audio signal (which may also be referred to as: the second voice presence posterior probability) is calculated by a signal noise estimation algorithm, based on the predetermined voice presence prior probability; the estimation value of the noise variance corresponding to the second audio signal is calculated based on the second voice presence posterior probability.
  • the first voice presence posterior probability or the second voice presence posterior probability is calculated by the following formula (1), and the estimation value of the noise variance corresponding to the first audio signal or the estimation value of the noise variance corresponding to the second audio signal is calculated by the following formula (2):

p(H_1|y) = 1 / (1 + (P(H_0) / P(H_1)) (1 + \xi) e^{-v}),  with v = \gamma \xi / (1 + \xi) and \gamma = |Y|^2 / \lambda    (1)

\hat{\lambda} = \alpha \lambda + (1 - \alpha) [ p(H_1|y) \lambda + p(H_0|y) |Y|^2 ]    (2)

  • p(H_1|y) is the voice presence posterior probability, which may be characterized as the first voice presence posterior probability or the second voice presence posterior probability;
  • P(H_0) is the voice absence prior probability, and P(H_1) = 1 - P(H_0) is the voice presence prior probability;
  • \xi is the prior signal-to-noise ratio and may be a fixed value, which may be 12 dB in the embodiment of the present application;
  • Y is a frequency-domain signal, which may be characterized as a first frequency-domain signal or a second frequency-domain signal;
  • \hat{\lambda} represents an estimation value of the noise variance of the current frame, which may also be referred to as an updated estimation value of the noise variance, and \lambda is the estimation value of the noise variance of the previous frame;
  • \alpha is an updating coefficient, and may be a fixed value between 0 and 1, for example, 0.8;
  • p(H_0|y) = 1 - p(H_1|y) represents the voice absence posterior probability, which may be the first voice absence posterior probability corresponding to the first audio signal, or the second voice absence posterior probability corresponding to the second audio signal.
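A one-frame sketch of formulas (1) and (2) as reconstructed above, in standard OM-LSA form; the 12 dB prior SNR and alpha = 0.8 are the fixed values named in the text, while the small epsilon guards against division by zero are implementation choices.

```python
import numpy as np

def posterior_and_noise_update(Y, lam, p_h0_prior, xi_db=12.0, alpha=0.8):
    xi = 10.0 ** (xi_db / 10.0)                # prior SNR (12 dB)
    gamma = np.abs(Y) ** 2 / (lam + 1e-12)     # posterior SNR per frequency point
    v = gamma * xi / (1.0 + xi)
    p_h1_prior = 1.0 - p_h0_prior
    # formula (1): voice presence posterior probability p(H1|y)
    post = 1.0 / (1.0 + (p_h0_prior / (p_h1_prior + 1e-12))
                  * (1.0 + xi) * np.exp(-v))
    # formula (2): recursive noise-variance update using both posteriors
    p_h0_post = 1.0 - post
    lam_new = alpha * lam + (1.0 - alpha) * (
        (1.0 - p_h0_post) * lam + p_h0_post * np.abs(Y) ** 2)
    return post, lam_new
```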
  • Step S705 performing voice spectrum estimation on the first audio signal.
  • the estimation value of the pure voice audio frequency-domain amplitude corresponding to the first audio signal is calculated based on the estimation value of the noise variance corresponding to the first audio signal and the first voice presence posterior probability.
  • Step S706 performing voice spectrum estimation on the second audio signal.
  • the estimation value of the pure voice spectrum amplitude corresponding to the second audio signal is calculated based on the estimation value of the noise variance corresponding to the second audio signal and the second voice presence posterior probability.
  • after the first voice presence posterior probability and the estimation value of the noise variance corresponding to the first audio signal are obtained, the OM-LSA algorithm is used to calculate the gain ratio G1 to be applied to the collected original signal (the first audio signal), and the estimation value of the pure voice frequency-domain amplitude corresponding to the first audio signal is then calculated based on G1; after the second voice presence posterior probability and the estimation value of the noise variance corresponding to the second audio signal are obtained, the OM-LSA algorithm is used to calculate the gain ratio G2 to be applied to the collected original signal (the second audio signal), and the estimation value of the pure voice frequency-domain amplitude corresponding to the second audio signal is then calculated based on G2.
  • the estimation value S of the pure voice frequency-domain amplitude is calculated by the following formulas (3) and (4), and S may be the estimation value S1 of the pure voice frequency-domain amplitude corresponding to the first audio signal, or may be the estimation value S2 of the pure voice frequency-domain amplitude corresponding to the second audio signal:

G = G_{H_1}^{p(H_1|y)} \cdot G_{min}^{p(H_0|y)},  with G_{H_1} = (\xi / (1 + \xi)) exp((1/2) \int_v^\infty (e^{-t} / t) dt)    (3)

S = G \cdot |Y|    (4)

  • when S is the estimation value S1 corresponding to the first audio signal:
  • G is G1;
  • Y is the first frequency-domain signal;
  • p(H_1|y) is the first voice presence posterior probability;
  • p(H_0|y) is the first voice absence posterior probability;
  • |Y| is the amplitude value of the frequency-domain signal corresponding to the first audio signal, and \lambda is the estimation value of the noise variance corresponding to the first audio signal.
  • when S is the estimation value S2 corresponding to the second audio signal:
  • G is G2;
  • Y is the second frequency-domain signal;
  • p(H_1|y) is the second voice presence posterior probability;
  • p(H_0|y) is the second voice absence posterior probability;
  • |Y| is the amplitude value of the frequency-domain signal corresponding to the second audio signal, and \lambda is the estimation value of the noise variance corresponding to the second audio signal.
  • G_min is an empirical coefficient with a fixed value, which is the lower limit of G, and may be selected as one value between -18 dB and -30 dB.
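The gain computation may be sketched as follows. G_{H1} uses the standard LSA exponential-integral expression via scipy.special.exp1, which is an assumption where the original expression is not legible, and the -24 dB default for G_min is simply one value inside the stated -18 dB to -30 dB range.

```python
import numpy as np
from scipy.special import exp1  # exponential integral E1 for the LSA gain

def omlsa_amplitude(Y, lam, post, xi_db=12.0, gmin_db=-24.0):
    # post is p(H1|y) per frequency point; lam is the noise variance
    xi = 10.0 ** (xi_db / 10.0)
    gamma = np.abs(Y) ** 2 / (lam + 1e-12)        # posterior SNR
    v = gamma * xi / (1.0 + xi)
    g_h1 = (xi / (1.0 + xi)) * np.exp(0.5 * exp1(np.maximum(v, 1e-12)))
    g_h1 = np.minimum(g_h1, 1.0)                   # cap the gain at unity
    g_min = 10.0 ** (gmin_db / 20.0)               # amplitude-domain floor
    G = g_h1 ** post * g_min ** (1.0 - post)       # formula (3)
    return G * np.abs(Y)                           # formula (4): S = G * |Y|
```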
  • Step S707 performing joint voice spectrum estimation on the first audio signal and the second audio signal, according to the noise estimation results corresponding to the first audio signal and the second audio signal and the voice spectrum estimation results corresponding to the first audio signal and the second audio signal.
  • a mean value of a third Gaussian distribution model is determined according to a first Gaussian distribution model whose mean value is the voice spectrum estimation result of the first audio signal and variance is the noise estimation result of the first audio signal, and a second Gaussian distribution model whose mean value is the voice spectrum estimation result of the second audio signal and variance is the noise estimation result of the second audio signal; the joint voice spectrum estimation result for joint voice spectrum estimation on the first audio signal and the second audio signal is determined according to the mean value of the third Gaussian distribution model.
  • based on the above-calculated estimation value of the noise variance corresponding to the first audio signal, the estimation value of the pure voice spectrum amplitude corresponding to the first audio signal, the estimation value of the noise variance corresponding to the second audio signal, and the estimation value of the pure voice spectrum amplitude corresponding to the second audio signal, the voice spectrum amplitude of the first frequency point may be regarded as a Gaussian distribution, of which the mean value is the voice spectrum amplitude corresponding to the first frequency point and the variance is the estimation value of the noise variance corresponding to the first frequency point;
  • the voice spectrum amplitude of the second frequency point may be regarded as a Gaussian distribution, of which the mean value is the voice spectrum amplitude corresponding to the second frequency point and the variance is the estimation value of the noise variance corresponding to the second frequency point.
  • the final voice spectrum amplitude value corresponding to any frequency point is the mean value of the new Gaussian distribution, as shown in Fig. 7d, wherein the common probability distribution in the figure refers to the final voice spectrum amplitude probability distribution.
  • the first frequency point is any frequency point of the frequency-domain signal of the first audio signal, and the voice spectrum amplitude of the first frequency point is the voice spectrum amplitude corresponding to the first frequency point;
  • the second frequency point is any frequency point in the frequency-domain signal of the second audio signal, and the voice spectrum amplitude of the second frequency point is the voice spectrum amplitude corresponding to the second frequency point.
  • the final voice spectrum amplitude value corresponding to any frequency point is calculated by the following formula (5), which takes the mean of the product of the two Gaussian distributions:

S_{io} = (\lambda_i S_o + \lambda_o S_i) / (\lambda_o + \lambda_i)    (5)

  • S_{io} is the final voice spectrum amplitude value corresponding to any frequency point;
  • S_o is the estimation value of the pure voice frequency-domain amplitude corresponding to the first audio signal, and \lambda_o is the corresponding estimation value of the noise variance;
  • S_i is the estimation value of the pure voice frequency-domain amplitude corresponding to the second audio signal, and \lambda_i is the corresponding estimation value of the noise variance.
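Under the reconstruction of formula (5) above, the fusion is a one-line inverse-variance combination; this sketch assumes all four inputs are arrays over the frequency points.

```python
import numpy as np

def joint_spectrum(S_o, lam_o, S_i, lam_i):
    # Mean of the product of N(S_o, lam_o) and N(S_i, lam_i), i.e. the
    # third Gaussian's mean: each branch is weighted by the other's noise
    return (lam_i * S_o + lam_o * S_i) / (lam_o + lam_i + 1e-12)
```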
  • Step S708 performing IFFT transformation on the joint voice spectrum estimation result to obtain a voice-enhanced time-domain audio signal to be output, that is, an output signal x.
  • the IFFT transformation is performed on the final voice spectrum amplitude value corresponding to each frequency point, then by windowing of sine window and overlap-add process, the voice-enhanced time-domain audio signal to be output is obtained.
  • the voice-enhanced time-domain audio signal may be calculated according to formula (6):

x(n) = w(n) \cdot (1/N) \sum_{k=0}^{N-1} S_{io}(k) e^{j 2\pi n k / N}    (6)

  • x(n) is the voice-enhanced time-domain audio signal;
  • w represents the window function (the sine window used for synthesis windowing before the overlap-add process);
  • S_{io}(k) is the frequency-domain signal corresponding to the voice-enhanced time-domain audio signal.
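A sketch of this synthesis stage; the 50% overlap (hop of N/2) is an assumed choice, as the hop size is not stated in the text.

```python
import numpy as np

def istft_overlap_add(frames_Sio, N):
    """IFFT each frame, apply the sine window again for synthesis and
    overlap-add the frames; 50% overlap (hop = N // 2) is assumed."""
    w = np.sin(np.pi * (np.arange(N) + 0.5) / N)
    hop = N // 2
    x = np.zeros(hop * (len(frames_Sio) - 1) + N)
    for m, S in enumerate(frames_Sio):
        frame = np.real(np.fft.ifft(S))     # back to the time domain
        x[m * hop : m * hop + N] += w * frame
    return x
```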
  • This specific example provides another method for audio processing, as shown in Fig. 7e, including:
  • Step S1004 acquiring a first audio signal collected by the air conduction audio collecting device and a second audio signal collected by the body conduction audio collecting device.
  • Step S1005 obtaining a voice-enhancement-processed audio signal by voice presence prior probability calculation processing, based on the first audio signal and the second audio signal.
  • Step S1005 further includes: performing Fourier transformation on the first audio signal and the second audio signal, respectively, to obtain a frequency-domain signal corresponding to the first audio signal (which may also be referred to as a first frequency-domain signal) and a frequency-domain signal corresponding to the second audio signal (which may also be referred to as a second frequency-domain signal).
  • Step S1005 may specifically include: Step S10051 (not shown in the figure), Step S10052 (not shown in the figure), Step S10053 (not shown in the figure), Step S10054 (not shown in the figure) and Step S10055 (not shown in the figure), wherein,
  • Step S10051 determining the voice presence prior probability corresponding to the first audio signal.
  • Step S10052 performing noise estimation on the first audio signal based on the determined voice presence prior probability.
  • Step S10053 performing noise estimation on the second audio signal.
  • Step S10054 performing voice spectrum estimation on the first audio signal and the second audio signal respectively according to the noise estimation results corresponding to the first audio signal and the second audio signal.
  • Step S10055 performing voice enhancement processing on the first audio signal and the second audio signal respectively to obtain the voice-enhancement-processed audio signal according to the voice spectrum estimation results corresponding to the first audio signal and the second audio signal.
  • the first audio signal and the second audio signal may be subjected to voice spectrum estimation and voice enhancement processing by the manner of the voice spectrum estimation and the manner of the voice enhancement processing in the prior art;
  • the voice-enhancement-processed time-domain signal may be determined by the signal noise estimation, the voice spectrum estimation, the joint voice estimation and IFFT, based on the first voice presence prior probability in the present application.
  • the detailed calculation manner of determining the voice-enhancement-processed time-domain signal by the signal noise estimation, the voice spectrum estimation, the joint voice estimation and IFFT, based on the first voice presence prior probability, is described in the first specific example and will not be repeated herein.
  • This specific example provides another method for audio processing, as shown in Fig. 7f, including:
  • Step S1006 acquiring a first audio signal collected by the air conduction audio collecting device and a second audio signal collected by the body conduction audio collecting device.
  • Step S1007 obtaining the voice presence prior probability corresponding to the first audio signal by the voice presence prior probability calculation processing based on the first audio signal and the second audio signal.
  • Step S1007 further includes: performing Fourier transformation on the first audio signal and the second audio signal, respectively, to obtain a frequency-domain signal corresponding to the first audio signal (which may also be referred to as a first frequency-domain signal) and a frequency-domain signal corresponding to the second audio signal (which may also be referred to as a second frequency-domain signal).
  • Step S1007 specifically includes: obtaining the voice presence prior probability corresponding to the first audio signal by the voice presence prior probability calculation processing based on the first audio signal and the second audio signal.
  • Step S1008 obtaining a voice-enhancement-processed audio signal by joint voice estimation processing and based on the following information:
  • Step S1008 further includes: determining the estimation value of the noise variance corresponding to the first audio signal, the estimation value of the pure voice spectrum amplitude corresponding to the first audio signal, the estimation value of the noise variance corresponding to the second audio signal, and the estimation value of the pure voice spectrum amplitude corresponding to the second audio signal according to the voice presence prior probability corresponding to the first audio signal obtained by calculation by Step S1007.
  • the specific implementation of Step S1008, namely obtaining the voice-enhancement-processed audio signal by the joint voice estimation processing based on the estimation value of the noise variance corresponding to the first audio signal, the estimation value of the pure voice spectrum amplitude corresponding to the first audio signal, the estimation value of the noise variance corresponding to the second audio signal, and the estimation value of the pure voice spectrum amplitude corresponding to the second audio signal, is described in the first specific example and will not be repeated herein.
  • the embodiment of the present application provides another method for audio processing.
  • voice activation detection is performed on the audio signal collected by the body conduction audio collecting device of the earphone and on the signal to be played by the audio signal playing device (i.e., the earphone speaker), to detect whether it is currently in the voice activation state, that is, to determine whether the user is making voices. If it is detected that at least one of the body conduction audio collecting device channel and the earphone speaker channel is in the voice activation state, the ambient sound cancellation processing is performed by a set filter, and the voice enhancement processing is performed according to the ambient-sound-cancellation-processed audio signal and the audio signal collected by the air conduction audio collecting device of the earphone, to obtain the voice-enhancement-processed signal, which is used as an output signal. If it is detected that both the body conduction audio collecting device channel and the earphone speaker channel are in the voice inactivation state, the parameter information (i.e., the parameter information of the ambient sound cancellation filter processing) is updated.
  • Step S1101 acquiring a first audio signal collected by the air conduction audio collecting device and a second audio signal collected by the body conduction audio collecting device.
  • when the user is speaking, in addition to the voice signal of the user, the first audio signal may also contain an ambient noise signal; the second audio signal includes the voice signal that is conducted via body tissues and collected by the body conduction audio collecting device of the earphone, as well as the audio signal played by the earphone speaker and picked up by the body conduction audio collecting device.
  • Step S1102 acquiring a third audio signal to be played by the earphone speaker.
  • Step S1101 and Step S1102 may be performed simultaneously.
  • Step S1103a performing ambient sound cancellation processing on the second audio signal through the third audio signal to obtain the ambient-sound-cancellation-processed second audio signal.
  • before Step S1103a, it is detected whether it is currently in the voice activation state, and if it is detected that it is in the voice activation state, it is determined to perform Step S1103a.
  • the step of performing ambient sound cancellation processing on the second audio signal through the third audio signal in Step S1103a includes: performing ambient sound cancellation filter processing on the third audio signal, and obtaining a filter-processed signal; and, removing the filter-processed signal from the second audio signal, and obtaining an ambient-sound-cancellation-processed second audio signal.
  • being currently in the voice activation state indicates that the user is currently making voices.
  • in this way, the ambient-sound-cancellation-processed second audio signal does not contain the ambient noise, and contains only the voice signal conducted via body tissues and collected by the body conduction audio collecting device.
  • the ambient-sound-cancellation-processed second audio signal is calculated by formula (7):

e(k) = d(k) - y(k),  with y(k) = \sum_{i=1}^{M} w_i X(k - i + 1)    (7)

  • e(k) is the ambient-sound-cancellation-processed second audio signal;
  • d is the expected signal (that is, the second audio signal) collected by the body conduction audio collecting device of the earphone; when it is currently in the voice activation state, y is the above filter-processed signal;
  • X is the third audio signal;
  • k is the k-th point among the time-domain sampling points, which may be referred to as time k, and the value is an index value;
  • M is the order of the set filter;
  • w_i is the i-th order coefficient of the filter.
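A time-domain sketch of formula (7); passing the first M-1 samples through unmodified as a warm-up is an implementation choice rather than something stated in the text.

```python
import numpy as np

def cancel_ambient(d, X, w):
    """e(k) = d(k) - sum_i w_i * X(k - i + 1): subtract the filtered
    speaker signal X from the body-conduction signal d."""
    d, X, w = map(np.asarray, (d, X, w))
    M = len(w)
    e = d.astype(float).copy()      # first M - 1 samples pass through
    for k in range(M - 1, len(d)):
        y_k = np.dot(w, X[k - M + 1 : k + 1][::-1])  # filter output y(k)
        e[k] = d[k] - y_k
    return e
```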
  • Step S1103b if it is detected that it is currently in the voice inactivation state, updating the parameter information of the ambient sound cancellation filter processing.
  • Step S1103a may be performed before the Step S1103b, or may be performed after Step S1103b, which is not limited in the embodiment of the present application.
  • the step of updating the parameter information of the ambient sound cancellation filter processing in Step S1103b includes: determining a prediction signal for the second audio signal based on the third audio signal; updating parameter information of the ambient sound cancellation filter processing according to the second audio signal and the prediction signal for the second audio signal.
  • updating the parameter information of the ambient sound cancellation filter processing means updating the parameter information of the set filter.
  • specifically, the signal to be played by the earphone speaker (i.e., the third audio signal) X(k) is used to predict the signal collected by the body conduction audio collecting device, to obtain a prediction signal y(k) for the second audio signal;
  • the parameter information of the set filter is updated by the expected signal collected by the body conduction audio collecting device in the inactivation state, to obtain the updated parameter information W of the set filter, wherein the calculation formula of the updated parameter information of the set filter is shown in formula (8):

W(k+1) = W(k) + \mu \epsilon(k) X(k)    (8)

  • W(k) is the filter coefficient at time k;
  • W(k+1) represents the coefficient at the next time k+1 of time k, that is, the updated coefficient;
  • \mu is a fixed empirical value (the updating step size);
  • \epsilon(k) is the difference between the expected signal d(k) collected by the body conduction audio collecting device of the earphone and the prediction signal y(k) when in the inactivation state;
  • W = {w_1, w_2, w_3, w_4, ..., w_M}.
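The update may be sketched as plain LMS; the step size mu = 0.01 stands in for the fixed empirical value, which is not given in the text.

```python
import numpy as np

def lms_update(w, X, d, k, mu=0.01):
    """W(k+1) = W(k) + mu * eps(k) * X(k); valid for k >= M - 1 so that
    a full vector of M recent speaker samples is available."""
    M = len(w)
    x_vec = np.asarray(X, dtype=float)[k - M + 1 : k + 1][::-1]
    y_k = np.dot(w, x_vec)      # prediction y(k) of the sensor signal
    eps = d[k] - y_k            # error against the expected signal d(k)
    return w + mu * eps * x_vec
```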
  • Step S1104 determining an audio signal to be output based on the first audio signal and the ambient-sound-cancellation-processed second audio signal.
  • Step S1104 is performed after Step S1103a.
  • the ambient-sound-cancellation-processed second audio signal in the embodiment of the present application may be equivalent to the second audio signal collected by the body conduction audio collecting device in Embodiment I.
  • the manner for performing the voice enhancement processing on the first audio signal and the ambient-sound-cancellation-processed second audio signal is described in detail in Embodiment I, and details will not be described in this embodiment.
  • the specific detecting manner for detecting whether it is currently in the voice activation state is shown in Fig. 9a, and includes: performing voice activation detection on the third audio signal to be played by the earphone speaker and on the second audio signal collected by the body conduction audio collecting device of the earphone, respectively. If at least one channel is in the activation state, the correlation between the third audio signal and the second audio signal is determined, that is, correlation detection is performed to obtain a sequence of correlation coefficients. It is then detected whether there is another peak value within the predefined range before the main peak of the sequence of correlation coefficients; if such a peak exists, it is determined that it is currently in the voice activation state, otherwise it is in the inactivation state.
  • the voice activation detection is described in detail below with reference to Fig. 9b, wherein,
  • Step S1201 for the third audio signal and/or the second audio signal, determining whether the earphone speaker channel and/or the body conduction audio collecting device channel is/are in the voice activation state.
  • the step includes: for the third audio signal, calculating whether the earphone speaker channel is in a voice activation state by the short-time energy algorithm or the zero-crossing rate algorithm; and/or for the second audio signal, calculating whether the body conduction audio collecting device channel is in a voice activation state by using the short-time energy algorithm or the zero-crossing rate algorithm.
  • the short-time energy calculation formula is:

E = \sum_{n=0}^{N-1} s(n)^2

in which s(n) is the amplitude value of the frequency point n of the frequency-domain signal corresponding to the third audio signal, or the amplitude value of the frequency point n of the frequency-domain signal corresponding to the second audio signal, and N is the frame length.
  • the zero-crossing rate calculation formula is:

Z = (1/2) \sum_{n=1}^{N-1} |sgn(s(n)) - sgn(s(n-1))|

in which s(n) and N are defined as above.
  • if the short-time energy value is greater than the predefined threshold or the zero-crossing rate value is greater than the predefined threshold, it is determined that the channel is in the voice activation state.
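A per-frame sketch of both activation tests; the thresholds are application-tuned values that the text leaves as predefined.

```python
import numpy as np

def channel_is_active(s, energy_thresh, zcr_thresh):
    """Short-time energy E = sum s(n)^2 and zero-crossing rate
    Z = 0.5 * sum |sgn(s(n)) - sgn(s(n-1))|, each against its threshold."""
    s = np.asarray(s, dtype=float)
    energy = np.sum(s ** 2)
    zcr = 0.5 * np.sum(np.abs(np.diff(np.sign(s))))
    return energy > energy_thresh or zcr > zcr_thresh
```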
  • Step S1202 if at least one channel is in the voice activation state, then determining it is currently in the voice activation state, according to a correlation between the third audio signal and the second audio signal.
  • the step of determining that it is currently in the voice activation state according to a correlation between the third audio signal and the second audio signal in Step S1202 includes: calculating a correlation between the third audio signal and the second audio signal to obtain a sequence of correlation coefficients; and determining whether it is currently in the activation state based on the sequence of correlation coefficients.
  • the correlation coefficient may take the form \rho = Cov(X, Y) / \sqrt{Var[X] Var[Y]}, evaluated over a range of lags to form the sequence, in which Var[X] and Var[Y] are the signal variance values of the third audio signal and the second audio signal, respectively.
  • the step of determining whether it is currently in the activation state, based on the sequence of correlation coefficients includes: determining a main peak in the sequence of correlation coefficients; if there is another peak in the predefined delay range before the main peak in the sequence of correlation coefficients, determining that the voice is currently in the activation state.
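The two-stage check may be sketched as below; the lag range, the width of the pre-peak window, and the simple local-maximum test are illustrative choices, as the text specifies only "another peak in the predefined delay range before the main peak". Equal-length inputs with max_lag much smaller than the signal length are assumed.

```python
import numpy as np

def correlation_sequence(x2, x3, max_lag):
    """Normalized correlation coefficients between the speaker signal x3
    and the body-conduction signal x2 over lags 0..max_lag."""
    x2 = np.asarray(x2, dtype=float)
    x3 = np.asarray(x3, dtype=float)
    denom = np.sqrt(np.var(x3) * np.var(x2)) + 1e-12
    rho = []
    for tau in range(max_lag + 1):
        a = x3[: len(x3) - tau]     # x3(t)
        b = x2[tau:]                # x2(t + tau)
        rho.append(np.mean((a - a.mean()) * (b - b.mean())) / denom)
    return np.array(rho)

def is_voice_active(rho, pre_range):
    """True if another peak exists within pre_range lags before the
    main peak of the correlation-coefficient sequence."""
    main = int(np.argmax(rho))
    before = rho[max(0, main - pre_range): main]
    if len(before) < 3:
        return False
    interior = (before[1:-1] > before[:-2]) & (before[1:-1] > before[2:])
    return bool(np.any(interior))
```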
  • the audio signal collected by the in-ear voice device is composed of two parts: one is the signal conducted via body tissues and collected by the body conduction audio collecting device, and the other is the part collected by the air conduction audio collecting device, then played by the in-ear speaker and finally picked up by the body conduction audio collecting device. The correlation of the audio signal at this time will therefore have two peaks, and the second peak is the autocorrelation of the audio signal collected by the air conduction audio collecting device.
  • Part (1) in Fig. 9d shows the audio signal collected by the air conduction audio collecting device; Part (2) in Fig. 9d shows the audio signal to be played by the earphone speaker; Part (3) in Fig. 9d shows the audio signal collected by the body conduction audio collecting device when the earphone is in the non-AS mode; Part (4) in Fig. 9d shows the audio signal collected by the body conduction audio collecting device when the earphone is in the AS mode.
  • the embodiment of the present application provides another method for audio processing, as shown in Fig. 10a, including:
  • Step S1301 acquiring a first audio signal collected by the air conduction audio collecting device and a second audio signal collected by the body conduction audio collecting device.
  • Step S1302 performing ambient sound cancellation processing on the second audio signal to obtain the ambient-sound-cancellation-processed second audio signal.
  • if the current mode of the earphone is not the AS mode, the ambient sound cancellation processing may not be performed on the second audio signal; if the current mode of the earphone is the AS mode, the ambient sound cancellation processing may be performed on the second audio signal.
  • Step S1302 includes: Step S1302a (not shown in the figure) to Step S1302b (not shown in the figure), wherein,
  • Step S1302a acquiring a third audio signal to be played by the earphone speaker.
  • Step S1302b performing ambient sound cancellation processing on the second audio signal through the third audio signal to obtain the ambient-sound-cancellation-processed second audio signal.
  • Step S1302b includes: Step S1302b1 (not shown in the figure) to Step S1302b2 (not shown in the figure), wherein,
  • Step S1302b1 performing ambient sound cancellation filter processing on the third audio signal, and obtaining a filter-processed signal.
  • Step S1302b2 removing the filter-processed signal from the second audio signal, and obtaining the ambient-sound-cancellation-processed second audio signal.
  • Step S1302b includes: Step S1302b3 (not shown in the figure) to Step S1302b4 (not shown in the figure), wherein,
  • Step S1302b3 detecting whether it is currently in a voice activation state, wherein the voice activation state indicates that the user is making voices.
  • Step S1302b3 may specifically include: Step S1302b31 (not shown in the figure) and Step S1302b32 (not shown in the figure), wherein,
  • Step S1302b31 determining whether the earphone speaker channel and/or the body conduction audio collecting device channel is/are in the voice activation state according to the second audio signal and/or the third audio signal.
  • Step S1302b32 if at least one channel is in the voice activation state, then determining whether it is currently in the voice activation state according to a signal correlation between the second audio signal and the third audio signal.
  • Step S1302b32 the step of determining whether it is currently in the voice activation state according to a signal correlation between the second audio signal and the third audio signal may include Step Sd (not shown in the figure) to Step Se (not shown in the figure), wherein,
  • Step Sd determining a sequence of correlation coefficients between the second audio signal and the third audio signal.
  • Step Se determining whether it is currently in the voice activation state based on the sequence of correlation coefficients.
  • Step Se may specifically include: Step Se1 (not shown in the figure) and Step Se2 (not shown in the figure), wherein,
  • Step Se1 in the sequence of correlation coefficients, determining the main peak.
  • Step Se2 if there is another peak in the predefined delay range before the main peak in the sequence of correlation coefficients, determining that it is currently in the voice activation state.
  • Step S1302b4 if detecting that it is in the voice activation state, performing the step of performing ambient sound cancellation processing on the second audio signal through the third audio signal.
  • the method further includes: Step Sc (not shown in the figure), wherein,
  • Step Sc if detecting that it is in the voice inactivation state, updating the parameter information of the ambient sound cancellation filter processing.
  • Step Sc may be performed after Step S1302b3.
  • Step S1303 performing voice enhancement processing on the first audio signal and the ambient-sound-cancellation-processed second audio signal based on the signal correlation between the first audio signal and the second audio signal, to obtain the voice-enhancement-processed audio signal to be output.
  • Step S1303 the step of performing voice enhancement processing on the first audio signal and the ambient-sound-cancellation-processed second audio signal based on the signal correlation between the first audio signal and the second audio signal, may specifically include Step S13031 (not shown in the figure), Step S13032 (not shown in the figure) and Step S13033 (not shown in the figure), wherein,
  • Step S13031 performing noise estimation on the first audio signal and the ambient-sound-cancellation-processed second audio signal, respectively.
  • the step of performing noise estimation on the first audio signal in Step S13031 may include Step Sf (not shown in the figure) to Step Sg (not shown in the figure), wherein,
  • Step Sf determining the voice presence prior probabilities corresponding to the first audio signal and the ambient-sound-cancellation-processed second audio signal.
  • Step Sf may include: Step Sf1 (not shown in the figure) to Step Sf2 (not shown in the figure), wherein,
  • Step Sf1 determining a signal OIR between the first audio signal and the ambient-sound-cancellation-processed second audio signal.
  • Step Sf2 determining the voice presence prior probabilities corresponding to the first audio signal and the ambient-sound-cancellation-processed second audio signal based on the signal OIR.
  • Step Sg performing noise estimation on the first audio signal based on the voice presence prior probability.
  • Step Sg may include Step Sg1 (not shown in the figure) and Step Sg2 (not shown in the figure), wherein,
  • Step Sg1 determining the corresponding voice presence posterior probability based on the voice presence prior probability.
  • Step Sg2 performing noise estimation on the first audio signal based on the voice presence posterior probability.
  • Step S13032 performing a voice spectrum estimation on the first audio signal and the ambient-sound-cancellation-processed second audio signal according to the noise estimation result corresponding to the first audio signal and the ambient-sound-cancellation-processed second audio signal.
  • Step S13033 performing voice enhancement processing on the first audio signal and the ambient-sound-cancellation-processed second audio signal according to the voice spectrum estimation result corresponding to the first audio signal and the ambient-sound-cancellation-processed second audio signal.
  • Step S13033 may include Step S13033a (not shown in the figure), wherein,
  • Step S13033a performing voice enhancement processing on the first audio signal and the ambient-sound-cancellation-processed second audio signal according to the noise estimation result corresponding to the first audio signal and the ambient-sound-cancellation-processed second audio signal, and the voice spectrum estimation results corresponding to the first audio signal and the ambient-sound-cancellation-processed second audio signal.
  • Step S13033a may include Step Sh (not shown in the figure) to Step Si (not shown in the figure), wherein,
  • Step Sh performing joint voice spectrum estimation on the first audio signal and the ambient-sound-cancellation-processed second audio signal according to the noise estimation results corresponding to the first audio signal and the ambient-sound-cancellation-processed second audio signal, and the voice spectrum estimation results corresponding to the first audio signal and the ambient-sound-cancellation-processed second audio signal.
  • the Step Sh may include a Step Sh1 (not shown in the figure) to a Step Sh2 (not shown in the figure), wherein,
  • Step Sh1 determining a mean value of a third Gaussian distribution model according to a first Gaussian distribution model whose mean value is the voice spectrum estimation result of the first audio signal and variance is the noise estimation result of the first audio signal, and a second Gaussian distribution model whose mean value is the voice spectrum estimation result of the ambient-sound-cancellation-processed second audio signal and variance is the noise estimation result of the ambient-sound-cancellation-processed second audio signal.
  • Step Sh2 determining joint voice spectrum estimation results for joint voice spectrum estimation on the first audio signal and the ambient-sound-cancellation-processed second audio signal according to a mean value of the third Gaussian distribution model.
  • Step Si obtaining a voice-enhancement-processed audio signal to be output according to the obtained joint voice spectrum estimation results.
  • the technical solutions of Embodiment I and Embodiment II are contained in Embodiment III, and the specific implementations of the steps in Embodiment III are described in detail in Embodiment I and Embodiment II, and thus will not be specifically described in this embodiment.
  • the method for audio processing allows the earphone user to activate the AS mode when using the earphone to make a call, so that the user who is talking with the earphone may clearly hear the surrounding ambient sound. This avoids the real danger of ignoring the ambient sound while wearing the earphone during a call, keeps the user sensitive to the ambient sound while wearing the earphone during a call, and thereby lets users make calls with the earphone easily and naturally.
  • joint enhancement is performed according to the characteristics of the audio signals obtained by air conduction and body conduction (the audio collected by body conduction contains less noise but has insufficient bandwidth, while the audio collected by air conduction has high bandwidth but contains a lot of ambient noise), so that each signal's strong points compensate for the other's weaknesses. This makes the voice heard by the call peer clean and natural during the call while preserving high intelligibility of the voice, such that even if the user is in a noisy environment, the sound transmitted by the earphone user to the far end remains highly intelligible.
  • the embodiment of the present application contains two specific examples, which respectively introduce the method for performing voice enhancement on the collected audio signal in two different application scenarios. The first specific example describes the application scenario in which the device user communicates with a remote call user, and the collected audio signal is processed and sent to the remote call peer with which the communication connection is established. The second specific example introduces the process of sending a voice instruction and controlling the execution of the voice instruction after collecting and processing the audio signals of the device user in the voice-based instruction recognition application scenario. The device user in the embodiment is a user using an earphone provided with a body conduction audio collecting device and an air conduction audio collecting device.
  • Step I establishing a call connection between the device user and the remote call user.
  • Step II the device user making a call voice, for example, "Hello?";
  • Step III when the earphone is in the AS mode, performing voice activation detection on the collected audio signal, and performing ambient sound cancellation processing in an activation state; and updating the parameter information of the set filter in an inactivation state;
  • Step IV performing voice enhancement processing on the ambient-sound-cancellation-processed audio signal (including: time-frequency conversion, noise signal estimation, voice spectrum estimation, joint enhancement, and frequency-time conversion);
  • Step V sending the voice-enhancement-processed audio signal to the remote call user.
  • Step VI receiving the voice of the remote call user.
  • this specific example introduces the process of sending a voice instruction and controlling the execution of the voice instruction after processing the collected audio signal of the device user in the voice-based instruction recognition application scenario, as shown in Fig. 10c, wherein,
  • Step I the device user sending a voice instruction, for example "open a map";
  • Step II when the earphone is in the AS mode, performing voice activation detection on the collected audio signal, and performing ambient sound cancellation processing in an activation state; and updating the parameter information of the set filter in an inactivation state;
  • Step III performing voice enhancement processing on the ambient-sound-cancellation-processed audio signal (including: time-frequency conversion, noise signal estimation, voice spectrum estimation, joint enhancement, and frequency-time conversion);
  • Step IV recognizing the voice-enhancement-processed voice instruction, and executing the instruction, for example, "Open a map APP".
  • the embodiment of the present application provides an electronic device, which is applicable to the foregoing method embodiments.
  • the electronic device may be an earphone device.
  • the electronic device 1400 includes: an air conduction audio collecting device 1401, a body conduction audio collecting device 1402, an audio signal playing device 1403, a processor 1404, and a memory 1405; wherein,
  • the air conduction audio collecting device 1401 is configured to collect a first audio signal conducted via air;
  • the body conduction audio collecting device 1402 is configured to collect a second audio signal conducted via body tissues;
  • the audio signal playing device 1403 is configured to play an audio signal;
  • the memory 1405, configured to store machine readable instructions that, when executed by the processor 1404, cause the processor 1404 to perform the methods described above.
  • Fig. 12 schematically illustrates a block diagram of a computing system that may be used to implement an electronic device of the embodiment of the present disclosure.
  • the computing system 1500 includes a processor 1510, a computer readable storage medium 1520, an output interface 1530, and an input interface 1540.
  • the computing system 1500 may perform the methods described above with reference to Fig. 5, Fig. 6, Fig. 7a, Fig. 7c, Fig. 7e, Fig. 7f, Fig. 8b, Fig. 9b, and Fig. 10a to implement voice enhancement processing on the signal collected by the air conduction audio collecting device and the signal collected by the body conduction audio collecting device to obtain audio signals with better effect for voice transmission or voice recognition.
  • the processor 1510 may include, for example, a general-purpose microprocessor, an instruction set processor, and/or a related chipset and/or a special purpose microprocessor (e.g., an application specific integrated circuit (ASIC)), and the like.
  • the processor 1510 may also include onboard memory for caching purposes.
  • the processor 1510 may be a single processing unit or multiple processing units for performing different actions of the method flow described with reference to Fig. 5, Fig. 6, Fig. 7a, Fig. 7c, Fig. 7e, Fig. 7f, Fig. 8b, Fig. 9b, and Fig. 10a.
  • the computer readable storage medium 1520 may be any medium that may contain, store, communicate, propagate or transport the instructions.
  • the readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium.
  • Specific examples of the readable storage medium include: a magnetic storage device such as a magnetic tape or a hard disk (HDD); an optical storage device such as a compact disk (CD-ROM); a memory such as a random access memory (RAM) or a flash memory; and/or a wired/wireless communication link.
  • the computer readable storage medium 1520 may include a computer program 1521 that may include code/computer executable instructions that, when executed by the processor 1510, cause the processor 1510 to perform, for example, the method flow with reference to Fig. 5, Fig. 6, Fig. 7a, Fig. 7c, Fig. 7e, Fig. 7f, Fig. 8b, Fig. 9b, and Fig. 10a, or any variations thereof.
  • the computer program 1521 may be configured to have, for example, computer program codes including computer program modules.
  • the codes in computer program 1521 may include one or more program modules, including, for example, module 1521A, module 1521B, and so on.
  • the processor 1510 may use the output interface 1530 and the input interface 1540 to perform the above-described method flow with reference to Fig. 5, Fig. 6, Fig. 7a, Fig. 7c, Fig. 7e, Fig. 7f, Fig. 8b, Fig. 9b, and Fig. 10a, and any variations thereof.
  • the embodiments of the present application provide an electronic device.
  • the present application acquires a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by the body conduction audio collecting device, and is capable of performing voice enhancement processing on the first audio signal and the second audio signal, based on a signal correlation between the first audio signal and the second audio signal, to obtain the voice-enhancement-processed audio signal. That is, voice enhancement processing is performed on the audio signal collected by the air conduction audio collecting device and the audio signal collected by the body conduction audio collecting device, based on the correlation between the two signals, to obtain voice signals with better effect for performing voice transmission or voice recognition.
  • the embodiments of the present application provide another electronic device.
  • the present application acquires a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device, then performs ambient sound cancellation processing on the second audio signal, and determines the audio signal to be output based on the first audio signal and the ambient-sound-cancellation-processed second audio signal. That is, ambient sound cancellation processing is performed on the audio signal collected by the body conduction audio collecting device first to obtain a voice signal that does not contain ambient sound, and a signal to be output is obtained based on the audio signal collected by the air conduction audio collecting device and the ambient-sound-cancellation-processed audio signal collected by body conduction audio collecting device, thus to obtain better audio signals for performing voice transmission or voice recognition.
  • the embodiment of the present application provides an apparatus for audio processing, as shown in Fig. 13, wherein the apparatus 1600 for audio processing includes: a first acquiring module 1601 and a voice enhancement processing module 1602, wherein,
  • the first acquiring module 1601 is configured to acquire a first audio signal collected by the air conduction audio collecting device and a second audio signal collected by the body conduction audio collecting device.
  • the voice enhancement processing module 1602 is configured to perform voice enhancement processing on the first audio signal and the second audio signal acquired by the first acquiring module 1601 to obtain the voice-enhancement-processed audio signal to be output based on a signal correlation between the first audio signal and the second audio signal.
  • the embodiments of the present application provide an apparatus for audio processing.
  • the present application acquires a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by the body conduction audio collecting device, and is capable of performing voice enhancement processing on the first audio signal and the second audio signal, based on a signal correlation between the first audio signal and the second audio signal, to obtain the voice-enhancement-processed audio signal to be output. That is, voice enhancement processing is performed on the audio signal collected by the air conduction audio collecting device and the audio signal collected by the body conduction audio collecting device, based on the correlation between the two signals, to obtain voice signals with better effect for performing voice transmission or voice recognition.
  • the apparatus 1700 for audio processing includes: a second acquiring module 1701, an ambient sound cancellation processing module 1702, and a determining module 1703, wherein,
  • the second acquiring module 1701 is configured to acquire a first audio signal collected by the air conduction audio collecting device and a second audio signal collected by the body conduction audio collecting device.
  • the ambient sound cancellation processing module 1702 is configured to perform ambient sound cancellation processing on the second audio signal acquired by the second acquiring module 1701.
  • the determining module 1703 is configured to determine the audio signal to be output based on the first audio signal acquired by the second acquiring module 1701 and the second audio signal after the ambient sound cancellation processing module 1702 performs the ambient sound cancellation processing.
  • the embodiment of the present application provides an apparatus for audio processing.
  • the present application acquires a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device, then performs ambient sound cancellation processing on the second audio signal, and determines the audio signal to be output based on the first audio signal and the ambient-sound-cancellation-processed second audio signal.
  • ambient sound cancellation processing is performed on the audio signal collected by the body conduction audio collecting device first to obtain a voice signal that does not contain ambient sound, and a signal to be output is obtained based on the audio signal collected by the air conduction audio collecting device and the ambient-sound-cancellation-processed audio signal collected by body conduction audio collecting device, to obtain audio signals with better effect for performing voice transmission or voice recognition.
  • the present invention involves apparatuses for performing one or more of the operations described in the present invention.
  • Those apparatuses may be specially designed and manufactured as intended, or may include well known apparatuses in a general-purpose computer.
  • Those apparatuses have computer programs stored therein, which are selectively activated or reconstructed.
  • Such computer programs may be stored in device (such as computer) readable media or in any type of media suitable for storing electronic instructions and respectively coupled to a bus.
  • the computer readable media include but are not limited to any type of disks (including floppy disks, hard disks, optical disks, CD-ROM and magneto optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memories, magnetic cards or optical line cards.
  • readable media include any media storing or transmitting information in a device (for example, computer) readable form.
  • computer program instructions may be used to realize each block in structure diagrams and/or block diagrams and/or flowcharts as well as a combination of blocks in the structure diagrams and/or block diagrams and/or flowcharts. It may be understood by those skilled in the art that these computer program instructions may be provided to general purpose computers, special purpose computers or other processors of programmable data processing means to be implemented, so that solutions designated in a block or blocks of the structure diagrams and/or block diagrams and/or flow diagrams are performed by computers or other processors of programmable data processing means.


Abstract

The embodiments of the present application provide a method for audio processing, an apparatus, an electronic device and a computer readable storage medium, relating to the field of voice enhancement technologies. The method includes: acquiring a first audio signal using an air conduction audio collecting device and a second audio signal using a body conduction audio collecting device, and then performing voice enhancement processing on at least one of the first audio signal and the second audio signal, based on a signal correlation between the first audio signal and the second audio signal, to obtain the voice-enhancement-processed audio signal to be output. The embodiments of the present application achieve audio enhancement of a signal collected by an audio collecting device of an earphone, and may obtain an audio signal with better effect for performing applications such as voice transmission or voice recognition.

Description

METHODS FOR AUDIO PROCESSING, APPARATUS, ELECTRONIC DEVICE AND COMPUTER READABLE STORAGE MEDIUM
The present application relates to the field of voice enhancement technologies, and in particular, to a method for audio processing, an apparatus, an electronic device, and a computer readable storage medium.
Along with the development of information technologies, earphone technology has also developed. Earphones with two audio collecting devices (i.e., an air conduction audio collecting device and a body conduction audio collecting device) have emerged. Sound collected by an air conduction audio collecting device is easily interfered with by the surrounding environment, such that the collected sound may contain a lot of noise, while sound collected by a body conduction audio collecting device is obtained through body tissue conduction (such as bone conduction). Therefore, the body conduction audio collecting device collects less noise, or even no noise at all.
Although the sound collected by the air conduction audio collecting device is susceptible to ambient noise, the sound collected through air conduction is full-band. The sound collected by the body conduction audio collecting device is conducted through body tissues, such that the high-frequency part of that sound is lost. Therefore, it remains a crucial issue how an earphone with two audio collecting devices can obtain better sound signals by utilizing the different characteristics of the two audio collecting devices, for applications such as voice transmission, voice recognition, etc.
The present application provides a method for audio processing, an apparatus, an electronic device and a computer readable storage medium, by utilizing different characteristics of two audio collecting device of an earphone to obtain an audio signal with better effect for performing applications, such as voice transmission, voice recognition, etc. The specific technical solutions are shown as follows:
In a first aspect, there is provided a method for audio processing, the method including: acquiring a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device; and performing voice enhancement processing on the first audio signal and the second audio signal to obtain a voice-enhancement-processed audio signal to be output, based on a signal correlation between the first audio signal and the second audio signal.
In a second aspect, there is provided an apparatus for audio processing, the apparatus including: a first acquiring module, configured to acquire a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device; and a voice enhancement processing module, configured to perform voice enhancement processing on the first audio signal and the second audio signal acquired by the first acquiring module, to obtain a voice-enhancement-processed audio signal to be output, based on a signal correlation between the first audio signal and the second audio signal.
In a third aspect, there is provided an electronic device, the electronic device including: an air conduction audio collecting device, a body conduction audio collecting device, an audio signal playing device, a processor, and a memory; wherein, the air conduction audio collecting device, configured to collect a first audio signal conducted via air; the body conduction audio collecting device, configured to collect a second audio signal conducted via body tissues; the audio signal playing device, configured to play an audio signal; and the memory, configured to store machine readable instructions that, when executed by the processor, cause the processor to perform the method for audio processing shown in the first aspect.
In a fourth aspect, there is provided a computer readable storage medium, wherein the computer readable storage medium stores a computer program that, when executed by a processor, implements the method for audio processing shown in the first aspect.
In a fifth aspect, there is provided another method for audio processing, including: acquiring a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device; performing ambient sound cancellation processing on the second audio signal; and determining an audio signal to be output based on the first audio signal and the ambient-sound-cancellation-processed second audio signal.
In a sixth aspect, there is provided another apparatus for audio processing, including: a second acquiring module, configured to acquire a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device; an ambient sound cancellation processing module, configured to perform ambient sound cancellation processing on the second audio signal acquired by the second acquiring module; and a determining module, configured to determine an audio signal to be output based on the first audio signal acquired by the second acquiring module and the second audio signal subjected to ambient sound cancellation processing by the ambient sound cancellation processing module.
In a seventh aspect, there is provided an electronic device, the electronic device including: an air conduction audio collecting device, a body conduction audio collecting device, an audio signal playing device, a processor, and a memory; wherein, the air conduction audio collecting device, configured to collect a first audio signal conducted via air; the body conduction audio collecting device, configured to collect a second audio signal conducted via body tissues; the audio signal playing device, configured to play an audio signal; and the memory, configured to store machine readable instructions that, when executed by the processor, cause the processor to perform the method for audio processing shown in the fifth aspect.
In an eighth aspect, there is provided a computer readable storage medium, wherein the computer readable storage medium stores a computer program that, when executed by a processor, implements the method for audio processing shown in the fifth aspect.
The technical solution provided by the embodiments of the present application is advantageous in the following aspects:
The embodiments of the present application provide a method for audio processing, an apparatus, an electronic device and a computer readable storage medium. The present application acquires a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device, and is capable of performing voice enhancement processing on the first audio signal and the second audio signal to obtain the voice-enhancement-processed audio signal to be output, based on a signal correlation between the first audio signal and the second audio signal. That is, voice enhancement processing is performed on the audio signal collected by the air conduction audio collecting device and the audio signal collected by the body conduction audio collecting device, based on the correlation between the two collected audio signals, to obtain voice signals with better effect for applications such as voice transmission, voice recognition, etc.
The embodiments of the present application provide a method for audio processing, an apparatus, an electronic device and a computer readable storage medium. The present application acquires a first audio signal collected by the air conduction audio collecting device and a second audio signal collected by the body conduction audio collecting device, then performs ambient sound cancellation processing on the second audio signal, and determines the audio signal to be output based on the first audio signal and the ambient-sound-cancellation-processed second audio signal. That is, ambient sound cancellation processing is first performed on the audio signal collected by the body conduction audio collecting device to obtain a voice signal that does not contain the ambient sound, and a signal to be output is then obtained based on the audio signal collected by the air conduction audio collecting device and the ambient-sound-cancellation-processed audio signal collected by the body conduction audio collecting device, to obtain audio signals with better effect for applications such as voice transmission, voice recognition, etc.
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic diagram illustrating that a call peer is unable to hear voice or accurately recognize a voice instruction when a conventional earphone is used;
Fig. 2 is a schematic diagram illustrating that a call peer is capable of hearing voice or accurately recognizing a voice instruction when an earphone with a body conduction audio collecting device is used;
Fig. 3 is a schematic flowchart of performing voice enhancement processing in the prior art;
Fig. 4 is a schematic structural diagram of an earphone provided with an air conduction audio collecting device and a body conduction audio collecting device;
Fig. 5 is a schematic method flowchart for audio processing according to an embodiment of the present application;
Fig. 6 is another schematic method flowchart for audio processing according to an embodiment of the present application;
Fig. 7a is a schematic flowchart of a method for audio processing in a first specific example of Embodiment I;
Fig. 7b is a schematic diagram of a general flow for audio processing according to an embodiment of the present application;
Fig. 7c is a schematic flowchart of a specific implementation for audio processing in Embodiment I of the present application;
Fig. 7d is a schematic diagram of calculating the final voice spectrum amplitude by joint voice estimation;
Fig. 7e is a schematic flowchart of a method of a second specific example in Embodiment I;
Fig. 7f is a schematic flowchart of a method of a third specific example in Embodiment I;
Fig. 8a is a schematic flowchart of implementing audio enhancement by ambient sound cancellation processing and voice enhancement processing;
Fig. 8b is a schematic method flowchart for audio processing according to Embodiment II of the present application;
Fig. 8c is a schematic diagram of filtering and updating filter parameters based on a set filter according to an embodiment of the present application;
Fig. 9a is a schematic diagram for voice activation detection in Embodiment II of the present application;
Fig. 9b is a schematic method flowchart for voice activation detection according to Embodiment II of the present application;
Fig. 9c is a schematic diagram of determining whether it is currently in an activation state based on a sequence of correlation coefficients;
Fig. 9d is a schematic diagram of a sequence of correlation coefficients;
Fig. 10a is a schematic method flowchart for audio processing in Embodiment III of the present application;
Fig. 10b is a schematic diagram of a first specific example in Embodiment IV of the present application;
Fig. 10c is a schematic diagram of a second specific example in Embodiment IV of the present application;
Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
Fig. 12 is a block diagram of a computing system of an electronic device according to an embodiment of the present application;
Fig. 13 is a schematic structural diagram of an apparatus for audio processing according to an embodiment of the present application; and
Fig. 14 is a schematic structural diagram of another apparatus for audio processing according to an embodiment of the present application.
Embodiments of the present invention will be described in detail hereafter. The examples of these embodiments have been illustrated in the drawings throughout which same or similar reference numerals refer to same or similar elements or elements having same or similar functions. The embodiments described hereafter with reference to the drawings are illustrative, merely used for explaining the present invention and should not be regarded as any limitations thereto.
It should be understood by those skilled in the art that singular forms "a", "an", "the", and "said" may be intended to include plural forms as well, unless otherwise stated. It should be further understood that the terms "include/including" used in this specification specify the presence of the stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or combinations thereof. It should be understood that when a component is referred to as being "connected to" or "coupled to" another component, it may be directly connected or coupled to the other component, or intervening components may be present. In addition, "connected to" or "coupled to" as used herein may include wireless connection or coupling. As used herein, the term "and/or" includes all or any of one or more associated listed items or combinations thereof.
In order to make the purposes, technical solutions and advantages of the present application clearer, the implementation manners of the present application are further described below with reference to the figures.
With the development of earphone technology, current earphones fall into two kinds: earphones having one air conduction audio collecting device, and earphones having two audio collecting devices (an air conduction audio collecting device and a body conduction audio collecting device). In the present application, an audio signal includes a voice signal and/or a noise signal, etc.
For a conventional earphone with one air conduction audio collecting device, when there is no or only low ambient noise, the earphone performs without problems during a voice call or voice recognition. However, when there is a large amount of voice interference or noise in the environment, the picked-up voice becomes unclear, especially in the case of a low signal-to-noise ratio with voice interference. For example, in the scenario of Fig. 1, during a call between two people, if the environment around the user wearing the earphone has high ambient noise (for example, there is train noise, or the surrounding human voices are very noisy), the audio signal sending user (the user wearing the earphone) sends audio to the call peer via a communication link, and the call voice received by the call peer may be unclear. For another example, in an environment where the ambient noise is high or there are human voices, a voice recognition application receiving a voice instruction often cannot accurately recognize the user's voice instruction, due to the interference of noise and interfering voice.
For an earphone with two audio collecting devices (an air conduction audio collecting device and a body conduction audio collecting device), as shown in Fig. 2, the body conduction audio collecting device (which may be located in the ear or outside the ear; if located in the ear, it may be referred to as an in-ear audio collecting device) may be physically isolated from the environment. In addition, when a person speaks, the audio signal is collected by the body conduction audio collecting device via body tissue conduction (such as bone conduction). Therefore, little or no noise signal is collected, so that when the audio signal is sent to the call peer during a call, the voice received by the peer is a clean voice that is easily understood. Similarly, when the voice is used for voice recognition, it is sent to a voice recognition application, and since the voice received by the voice recognition application has no noise or voice interference, the recognition rate is higher.
For earphones provided with two audio collecting devices (i.e., an air conduction audio collecting device and a body conduction audio collecting device), the air conduction audio collecting device (which may be an out-ear audio collecting device) is susceptible to interference from ambient noise, and the collected audio signals contain many noise signals. However, the signal collected by the air conduction audio collecting device is full-band compared to the voice collected by the body conduction audio collecting device (which may be an in-ear audio collecting device). This is because the voice signal picked up by the body conduction audio collecting device is conducted via body tissues, and the conducted signal undergoes a process similar to low-pass filtering. Most of the out-ear noise is physically isolated from the body conduction audio collecting device while the user is using the earphone (for example, the earplug is closely attached to the external auditory canal), so the collected audio signal is a clean voice signal that does not contain noise. However, after the "low-pass filtering" of body tissues, the high-frequency part is lost, and the spectrum of the audio signal collected by the air conduction audio collecting device is different from the spectrum of the audio signal collected by the body conduction audio collecting device.
In the prior art, an earphone having two audio collecting devices (an air conduction audio collecting device and a body conduction audio collecting device) obtains voice signals with better effect by utilizing the different characteristics of the two audio collecting devices, for applications such as voice transmission, voice recognition, etc. The general process is as follows: the audio signal picked up by the body conduction audio collecting device and the audio signal picked up by the air conduction audio collecting device are processed as two separate signals, for example, each being processed through a filter or the like for noise cancellation; the processed results are then superimposed into a final audio signal, and the obtained audio signal is transmitted to a terminal device connected to the earphone (the terminal device may be a mobile phone connected to the earphone through Bluetooth or a wired connection, etc.). In a call scenario, the terminal device connected to the earphone may transmit the final superimposed audio signal to the call peer; in a voice recognition scenario, the terminal device connected to the earphone may recognize the user instruction according to the finally superimposed audio signal. However, there are still many problems with such processing, as specified below.
The first problem of the prior art: in the conventional signal processing of an earphone having two audio collecting devices, before the audio signal is transmitted to the connected terminal device, the audio signals collected by the two audio collecting devices are subjected to noise cancellation and voice enhancement processing separately, and are then superimposed. As shown in Fig. 3, the audio signals collected by the body conduction audio collecting device and the air conduction audio collecting device are respectively processed by Fast Fourier Transformation (FFT), signal noise estimation, signal voice estimation and Inverse Fast Fourier Transformation (IFFT). The IFFT-processed audio signal corresponding to the body conduction audio collecting device is low-pass filtered, the IFFT-processed audio signal corresponding to the air conduction audio collecting device is high-pass filtered, and the two filtered signals are superimposed to obtain an output signal. The output signal is output to a terminal device, such as a mobile phone connected to the earphone, and is then transmitted to the call peer by the terminal device or used for a corresponding application, such as voice recognition, recording, etc. This approach does not consider the correlation of the signals collected by the air conduction audio collecting device and the body conduction audio collecting device. This correlation mainly arises because, whether for the signal collected by the body conduction audio collecting device or for that collected by the air conduction audio collecting device, the sound source is the speaker, while the two signals have passed through different propagation paths. The audio signal collected by the air conduction audio collecting device is transmitted directly via the air; since the environment contains ambient noise, the air conduction audio collecting device also picks up the ambient noise while collecting the speaker's voice. The body conduction audio collecting device collects the speaker's voice transmitted directly to it through body tissue conduction. Therefore, the voice audio collected by the air conduction audio collecting device and that collected by the body conduction audio collecting device actually have a high correlation. This correlation may help to perform voice detection and voice noise cancellation: if the correlation between the two signals can be used, a better voice enhancement effect may be achieved. However, in the prior art, voice enhancement is performed without using this correlation, so the voice enhancement effect of the prior art is poor.
The second problem of the prior art: when existing earphones with two audio collecting devices play local audio to the user, the main purpose is to eliminate the local environmental noise and obtain clean voice. Such a noise cancellation manner often uses the ambient noise collected by the air conduction audio collecting device and plays opposite-phase noise through an audio signal playing device (for example, an earphone speaker) to achieve ambient noise cancellation. This conventional method effectively eliminates local ambient noise and improves the user's listening experience. But it also leads to another problem: if there is a car next to the user or someone is talking, the noise cancellation algorithm will suppress the surrounding sound as noise, resulting in safety issues or communication issues. For example, when the user is using the earphone and a car is approaching, since the earphone eliminates the ambient noise, the sound of the car is also eliminated as ambient sound; accordingly, the user cannot hear the car, which may cause an accident.
In order to solve the second problem of the prior art, an earphone with two audio collecting devices may be designed with an ambient sound (AS) mode, that is, an environmental sound mode. When this mode is activated, the air conduction audio collecting device collects the ambient sound outside the ear, which is then played out through the earphone speaker, so that the user may hear the sound of the surrounding environment, such as someone saying hello or a car approaching. If the body conduction audio collecting device is located in the ear, the audio signal collected by the body conduction audio collecting device includes the voice conducted via body tissues and the audio signal played by the earphone speaker. This AS mode may avoid safety issues or communication issues. As shown in the schematic structural diagram of the earphone in Fig. 4, the body conduction audio collecting device and the audio signal playing device (an earphone speaker) are both located within the ear, and the air conduction audio collecting device is located outside the ear; the earphone speaker plays the audio signal collected by the air conduction audio collecting device, the body conduction audio collecting device collects the voice conducted via body tissues and the audio signal played by the earphone speaker, and the air conduction audio collecting device collects the external audio signal.
However, if the body conduction audio collecting device is located in the ear, the earphone designed with the AS mode is also problematic. Specifically, when the user activates the AS mode, the audio signal collected by the body conduction audio collecting device is composed of two parts: one part corresponds to the sound (including human voice and ambient noise) recorded by the air conduction audio collecting device and played by the audio signal playing device (e.g., an in-ear speaker), and the other part corresponds to the sound made by the user and collected by the body conduction audio collecting device through body tissue conduction. In this way, since the audio collected by the body conduction audio collecting device contains the ambient sound, the user voice collected by the air conduction audio collecting device, and the user voice conducted through body tissues (which may be referred to as body conduction voice), the audio collected by the body conduction audio collecting device is no longer a clean call voice (body conduction voice). This may cause the voice call peer to fail to hear the user's voice, or the terminal device to fail to accurately recognize the user's voice instruction, so that conventional noise cancellation algorithms are unsuitable or unsatisfactory for an earphone with an air conduction audio collecting device and a body conduction audio collecting device.
Since the physical structure of the earphone is attached to the ear canal, the noise outside the ear is isolated. When the AS mode is not activated, the sound made by the earphone user and collected by the body conduction audio collecting device does not contain noise, but the high-frequency part of the audio signal conducted through bone is lost; the audio collected by the air conduction audio collecting device is full-band, but, being made by the earphone user and propagated via air, it contains noise. When the AS mode is activated, the audio signal collected by the air conduction audio collecting device is played by the audio signal playing device (e.g., an earphone speaker); therefore, in the case that the AS mode is activated, the noise signal contained in the audio signal collected by the body conduction audio collecting device needs to be eliminated.
Regarding the problem that the voice enhancement effect in the prior art is not ideal, the embodiments of the present application suppress noise and enhance voice quality by using the correlation between the signal collected by the body conduction audio collecting device and the signal collected by the air conduction audio collecting device, thereby achieving a clearer voice call and improving the performance of the uplink voice signal during the call. In addition, in a voice recognition application, the terminal device may accurately recognize the user instruction after the voice quality is enhanced, improving the accuracy of voice recognition. Regarding the problems that a deactivated AS mode easily causes a safety accident, while an activated AS mode leads to poor voice call quality or inaccurate recognition of voice instructions, the present application recovers the signal collected by the body conduction audio collecting device by adding an adaptive filter in the AS mode, and eliminates the ambient noise in the audio sent to the peer while the speaker can still hear the ambient sound clearly, such that the receiver does not hear the ambient noise of the transmitting end, thereby achieving a clearer voice call and improving the performance of the uplink voice signal during the call. In addition, in a voice recognition application, since the ambient noise is eliminated, the terminal device may accurately recognize the user's instruction, improving the accuracy of voice recognition.
Specifically, in order to solve the problem that the voice enhancement effect of the prior art is not ideal enough, the embodiments of the present application provide a method for audio processing, which may be applied to an earphone having an air conduction audio collecting device and a body conduction audio collecting device, as shown in Fig. 5, wherein,
Step S801: acquiring a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device.
Step S802: performing voice enhancement processing on the first audio signal and the second audio signal to obtain a voice-enhancement-processed audio signal to be output, based on a signal correlation between the first audio signal and the second audio signal.
For the embodiment of the present application, the terminal device connected to the earphone may acquire the voice-enhancement-processed audio signal to be output, and output the audio signal to the call peer, or output the audio signal to the voice recognition application for voice recognition; or output the audio signal to the instant messaging application, and send it as voice information to the communication peer; or record the audio signal. The specific process of the terminal device receiving the audio signal is not limited in the embodiment of the present application.
Specifically, in Step S802, the step of performing voice enhancement processing on the first audio signal and the second audio signal based on a signal correlation between the first audio signal and the second audio signal, includes: Step S8021 (not shown in the figure), Step S8022 (not shown in the figure) and Step S8023 (not shown in the figure), wherein,
Step S8021: performing noise estimation on the first audio signal and the second audio signal, respectively.
Step S8022: performing voice spectrum estimation on the first audio signal and the second audio signal respectively, according to the noise estimation result corresponding to the first audio signal and the second audio signal.
Step S8023: performing voice enhancement processing on the first audio signal and the second audio signal according to the voice spectrum estimation results corresponding to the first audio signal and the second audio signal.
Specifically, in Step S8021, the step of performing noise estimation on the first audio signal includes: Step S8021a (not shown in the figure) to Step S8021b (not shown in the figure), wherein,
Step S8021a: determining a voice presence prior probability corresponding to the first audio signal.
Step S8021b: performing noise estimation on the first audio signal based on the voice presence prior probability.
Specifically, Step S8021a includes Step S8021a1 (not shown in the figure) and Step S8021a2 (not shown in the figure), wherein,
Step S8021a1: determining a signal outer inner ratio (OIR) between the first audio signal and the second audio signal.
Step S8021a2: determining a voice presence prior probability corresponding to the first audio signal according to the signal OIR.
Specifically, Step S8021b includes Step S8021b1 (not shown in the figure) and Step S8021b2 (not shown in the figure), wherein,
Step S8021b1: determining a corresponding voice presence posterior probability based on the voice presence prior probability.
Step S8021b2: performing noise estimation on the first audio signal based on the voice presence posterior probability.
Specifically, in Step S8023, the step of performing voice enhancement processing on the first audio signal and the second audio signal according to the voice spectrum estimation results corresponding to the first audio signal and the second audio signal, includes: Step S8023a (not shown in the figure), wherein,
Step S8023a: performing voice enhancement processing on the first audio signal and the second audio signal, according to the noise estimation results corresponding to the first audio signal and the second audio signal, and the voice spectrum estimation results corresponding to the first audio signal and the second audio signal.
Specifically, Step S8023a includes Step S8023a1 (not shown in the figure) and Step S8023a2 (not shown in the figure), wherein,
Step S8023a1: performing joint voice enhancement processing on the first audio signal and the second audio signal, according to the noise estimation results corresponding to the first audio signal and the second audio signal, and the voice spectrum estimation results corresponding to the first audio signal and the second audio signal.
Step S8023a2: obtaining a voice-enhancement-processed audio signal to be output according to the obtained joint voice spectrum estimation result.
Specifically, Step S8023a1 includes Step S8023a11 (not shown in the figure) to Step S8023a12 (not shown in the figure), wherein,
Step S8023a11: determining a mean value of a third Gaussian distribution model, according to a first Gaussian distribution model whose mean value is the voice spectrum estimation result of the first audio signal and variance is the noise estimation result of the first audio signal, and a second Gaussian distribution model whose mean value is the voice spectrum estimation result of the second audio signal and variance is the noise estimation result of the second audio signal.
Step S8023a12: determining joint voice spectrum estimation results for joint voice spectrum estimation on the first audio signal and the second audio signal according to the mean value of the third Gaussian distribution model.
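For illustration, the following is a minimal sketch of Steps S8023a11 and S8023a12, assuming the mean of the third Gaussian distribution model is obtained as the precision-weighted (product-of-Gaussians) combination of the first and second models; this fusion rule is one natural reading of the step rather than a formula stated here, and all function and variable names are illustrative.

```python
import numpy as np

def joint_voice_spectrum_estimate(s1, n1, s2, n2):
    """Fuse two per-frequency voice spectrum estimates into one.

    s1, s2: voice spectrum estimation results (means of the first and
            second Gaussian distribution models), arrays over frequency bins.
    n1, n2: noise estimation results (variances of the two models).

    Returns the mean value of the third Gaussian distribution model,
    computed here (by assumption) as the precision-weighted combination
    that results from multiplying the two Gaussian densities.
    """
    w1 = 1.0 / np.maximum(n1, 1e-12)  # precision of the air-conduction estimate
    w2 = 1.0 / np.maximum(n2, 1e-12)  # precision of the body-conduction estimate
    return (w1 * s1 + w2 * s2) / (w1 + w2)

# Usage: bins where one device is noisy (large variance) defer to the other.
s1 = np.array([0.9, 0.8, 0.7]); n1 = np.array([0.01, 0.5, 2.0])
s2 = np.array([1.0, 0.6, 0.1]); n2 = np.array([0.02, 0.05, 0.01])
print(joint_voice_spectrum_estimate(s1, n1, s2, n2))
```

With this choice, each frequency point is dominated by whichever audio collecting device currently has the smaller estimated noise variance, which matches the intent of exploiting the complementary characteristics of the two devices.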
In a possible implementation, before the voice enhancement processing is performed on the first audio signal and the second audio signal, Step Sa (not shown in the figure) is included, wherein,
Step Sa: performing ambient sound cancellation processing on the second audio signal to obtain the ambient-sound-cancellation-processed second audio signal.
Specifically, the step of performing voice enhancement processing on the first audio signal and the second audio signal, includes Step Sb (not shown in the figure), wherein,
Step Sb: performing voice enhancement processing on the first audio signal and the ambient-sound-cancellation-processed second audio signal.
Specifically, the step of performing ambient sound cancellation processing on the second audio signal in Step Sa includes: Step Sa1 (not shown in the figure) and Step Sa2 (not shown in the figure), wherein,
Step Sa1: acquiring a third audio signal to be played by an audio signal playing device.
Step Sa2: performing ambient sound cancellation processing on the second audio signal through the third audio signal, and obtaining the ambient-sound-cancellation-processed second audio signal.
Specifically, in Step Sa2, the step of performing ambient sound cancellation processing on the second audio signal through the third audio signal includes: detecting whether it is currently in a voice activation state, wherein the voice activation state indicates that the user is making voices; if detected that it is currently in the voice activation state, performing the step of performing ambient sound cancellation processing on the second audio signal through the third audio signal.
Specifically, the step of detecting whether it is currently in the voice activation state, includes: determining whether an audio signal playing device channel and/or a body conduction audio collecting device channel is in a voice activation state according to the second audio signal and/or the third audio signal; if at least one channel is in the voice activation state, then determining whether it is currently in the voice activation state according to a signal correlation between the second audio signal and the third audio signal.
The embodiments of the present application provide a method for audio processing. The present application acquires a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device, and is capable of performing voice enhancement processing on the first audio signal and the second audio signal to obtain the voice-enhancement-processed audio signal, based on a signal correlation between the first audio signal and the second audio signal. That is, voice enhancement processing is performed on the audio signal collected by the air conduction audio collecting device and the audio signal collected by the body conduction audio collecting device, based on the correlation between the two collected audio signals, to obtain voice signals with better effect for applications such as voice transmission, voice recognition, etc.
For the problems that the deactivated ambient sound mode easily causes a safety accident, and that the voice call quality is poor or the voice recognition may not be accurate when the AS mode is activated, the present application provides another method for audio processing, which may be applied to an electronic device having two audio collecting devices, as shown in Fig. 6, wherein,
Step S901: acquiring a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device.
Step S902: performing ambient sound cancellation processing on the second audio signal.
Step S903: determining an audio signal to be output based on the first audio signal and the ambient-sound-cancellation-processed second audio signal.
Specifically, Step S902 includes Step S9021 (not shown in the figure) and Step S9022 (not shown in the figure), wherein,
Step S9021: acquiring a third audio signal to be played by an audio signal playing device.
Step S9022: performing ambient sound cancellation processing on the second audio signal through the third audio signal, and obtaining the ambient-sound-cancellation-processed second audio signal.
Specifically, in Step S9022, the step of performing ambient sound cancellation processing on the second audio signal through the third audio signal, includes:
performing ambient sound cancellation filter processing on the third audio signal, and obtaining a filter-processed signal; and
removing the filter-processed signal from the second audio signal, and obtaining the ambient-sound-cancellation-processed second audio signal.
Specifically, the step of performing ambient sound cancellation processing on the second audio signal through the third audio signal, includes: detecting whether it is currently in a voice activation state, wherein the voice activation state indicates that the user is making voices; and, if detected that it is currently in the voice activation state, then performing the step of performing ambient sound cancellation processing on the second audio signal through the third audio signal.
In a possible implementation, the method further includes: if detected that it is currently in the voice inactivation state, then updating parameter information of the ambient sound cancellation filter processing.
Specifically, the step of updating parameter information of the ambient sound cancellation filter processing, includes: determining a prediction signal for the second audio signal based on the third audio signal; and, updating the parameter information of the ambient sound cancellation filter processing according to the second audio signal and a prediction signal for the second audio signal.
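For illustration, the following is a minimal sketch of the ambient sound cancellation of Steps S9021-S9022 together with the parameter update just described, assuming a normalized least-mean-squares (NLMS) rule as the concrete adaptive filter — the text specifies an adaptive ambient sound cancellation filter but not a particular adaptation rule — and with all names illustrative.

```python
import numpy as np

def ambient_sound_cancellation(second, third, taps=64, mu=0.5, eps=1e-8,
                               voice_active=None):
    """Remove the played-back ambient sound (third signal) from the
    body-conduction signal (second signal) with an adaptive FIR filter.

    The filter produces a prediction signal for the second audio signal
    from the third audio signal; the prediction is subtracted to obtain
    the ambient-sound-cancellation-processed second signal. The filter
    parameter information is updated only in voice-inactive samples,
    mirroring the update condition described above.
    """
    w = np.zeros(taps)                        # filter parameter information
    buf = np.zeros(taps)                      # recent samples of the third signal
    out = np.zeros(len(second), dtype=float)
    if voice_active is None:
        voice_active = np.zeros(len(second), dtype=bool)
    for n in range(len(second)):
        buf = np.roll(buf, 1); buf[0] = third[n]
        pred = w @ buf                        # prediction of the speaker-path component
        e = second[n] - pred                  # cleaned (cancellation-processed) sample
        out[n] = e
        if not voice_active[n]:               # update only in the voice inactivation state
            w += (mu / (eps + buf @ buf)) * e * buf
    return out
```

Freezing the update while voice is active prevents the filter from learning to cancel the user's own body-conducted voice along with the played-back ambient sound.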
Specifically, the step of detecting whether it is currently in the voice activation state, includes: determining whether an audio signal playing device channel and/or a body conduction audio collecting device channel is in a voice activation state according to the second audio signal and/or the third audio signal; if at least one channel is in the voice activation state, then determining whether it is currently in the voice activation state according to a signal correlation between the second audio signal and the third audio signal.
Specifically, the step of determining whether it is currently in the voice activation state according to a signal correlation between the second audio signal and the third audio signal includes: determining a sequence of correlation coefficients between the second audio signal and the third audio signal; and detecting whether it is currently in the voice activation state based on the sequence of correlation coefficients.
Specifically, the step of detecting whether it is currently in the voice activation state, based on the sequence of correlation coefficients, includes: determining a main peak in the sequence of correlation coefficients; if there is another peak in a predefined delay range before the main peak in the sequence of correlation coefficients, determining that it is currently in the voice activation state.
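For illustration, the following is a minimal sketch of this peak test, assuming a normalized cross-correlation over candidate delays and an illustrative relative-height criterion for "another peak"; the delay range and threshold values are assumptions, as is the interpretation that the user's own voice arrives via the faster body-conduction path and therefore appears before the main (speaker playback) peak.

```python
import numpy as np

def correlation_sequence(second, third, max_lag=128):
    """Sequence of correlation coefficients between the second (body
    conduction) and third (to-be-played) signals over lags 0..max_lag-1."""
    s = second - second.mean()
    t = third - third.mean()
    denom = np.sqrt(np.sum(s * s) * np.sum(t * t)) + 1e-12
    return np.array([np.sum(s[k:] * t[:len(t) - k])
                     for k in range(max_lag)]) / denom

def is_voice_active(corr, pre_range=(4, 32), rel_height=0.5):
    """Main peak = speaker playback path picked up in the ear. A second
    peak within a predefined delay range *before* the main peak suggests
    the user's own (body-conducted) voice is also present."""
    main = int(np.argmax(np.abs(corr)))
    lo = max(0, main - pre_range[1])
    hi = max(0, main - pre_range[0])
    if hi <= lo:
        return False                      # no room for an earlier peak
    early = np.abs(corr[lo:hi])
    return early.max() >= rel_height * np.abs(corr[main])
```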
The embodiments of the present application provide a method for audio processing. The present application acquires a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device, then performs ambient sound cancellation processing on the second audio signal, and determines the audio signal to be output based on the first audio signal and the ambient-sound-cancellation-processed second audio signal. That is, ambient sound cancellation processing is performed on the audio signal collected by the body conduction audio collecting device first to obtain a voice signal that does not contain ambient sound, and a signal to be output is obtained based on the audio signal collected by the air conduction audio collecting device and the ambient-sound-cancellation-processed audio signal collected by body conduction audio collecting device, to obtain audio signals with better effect for performing applications, such as voice transmission, voice recognition, etc.
The method for audio processing is introduced below with reference to the specific embodiments, including Embodiment I, Embodiment II, Embodiment III, and Embodiment IV. Embodiment I is used to solve the first problem of the prior art, namely that voice enhancement processing which does not consider the correlation between the signals collected by the body conduction audio collecting device and the signals collected by the air conduction audio collecting device results in a poor voice enhancement effect; Embodiment II is used to solve the second problem of the prior art, namely that a deactivated AS mode easily causes a safety accident, while an activated AS mode leads to poor voice call quality or inaccurate voice recognition; Embodiment III is used to solve both the above-mentioned first and second problems of the prior art; and Embodiment IV describes the manners of audio signal processing in two different application scenarios on the basis of Embodiment III. For a detailed description, please refer to the following embodiments. In the present application, the air conduction audio collecting device may be located outside the ear, and the body conduction audio collecting device, which collects audio via body tissues such as bone tissue, may be worn outside the ear or inside the ear, which is not limited in the present application.
Embodiment I
The embodiment of the present application provides a method for audio processing, including: acquiring a first audio signal and a second audio signal, wherein the first audio signal is an audio signal collected by an air conduction audio collecting device of the earphone, and the second audio signal is an audio signal conducted via body tissues (e.g., bone tissue) and collected by a body conduction audio collecting device of the earphone; and performing voice enhancement processing on the first audio signal and the second audio signal, based on a signal correlation between the first audio signal and the second audio signal, to obtain the voice-enhancement-processed audio signal. The signal correlation between the first audio signal and the second audio signal may be embodied in the joint voice estimation processing (the voice estimation processing may also be referred to as voice spectrum estimation processing), as described in detail in the first specific example; it may also be embodied in the calculation of the voice presence prior probability, as described in detail in the second specific example; or it may be embodied both in the joint voice estimation processing and in the calculation of the voice presence prior probability, as described in detail in the third specific example.
First specific example
This specific example provides a method for audio processing, as shown in Fig. 7a, including:
Step S1001: acquiring a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device.
For the embodiment of the present application, when the user wearing the earphone (hereinafter also referred to as the user) speaks, the first audio signal may contain ambient noise signals in addition to the voice signal of the user. This is not limited in the embodiment of the present application.
For the embodiment of the present application, the second audio signal is an audio signal that is conducted via body tissues and collected by the body conduction audio collecting device, and the second audio signal contains a voice signal of the user.
For the embodiment of the present application, if the audio signal playing device (such as an earphone speaker) of the earphone plays music, or plays the call voice of the peer user during a call, the body conduction audio collecting device may collect the music or call voice played by the audio signal playing device. After the body conduction audio collecting device collects the audio signal, the audio played by the audio signal playing device may be eliminated by echo cancellation processing to obtain the second audio signal.
Step S1002: performing joint voice estimation processing based on the following information:
an estimation value of the noise variance corresponding to the first audio signal;
an estimation value of the pure voice spectrum amplitude corresponding to the first audio signal;
an estimation value of the noise variance corresponding to the second audio signal; and
an estimation value of the pure voice spectrum amplitude corresponding to the second audio signal;
For the embodiment of the present application, the estimation value of the noise variance corresponding to the first audio signal is an estimation value of the noise variance corresponding to each frequency point in the frequency-domain signal of the first audio signal; the estimation value of the pure voice spectrum amplitude corresponding to the first audio signal is an estimation value of the pure voice spectrum amplitude corresponding to each frequency point in the frequency-domain signal of the first audio signal; the estimation value of the noise variance corresponding to the second audio signal is an estimation value of the noise variance corresponding to each frequency point in the frequency-domain signal of the second audio signal; and the estimation value of the pure voice spectrum amplitude corresponding to the second audio signal is an estimation value of the pure voice spectrum amplitude corresponding to each frequency point in the frequency-domain signal of the second audio signal.
Before Step S1002, the method further includes: calculating the following information:
an estimation value of the noise variance corresponding to the first audio signal;
an estimation value of the pure voice spectrum amplitude corresponding to the first audio signal;
an estimation value of the noise variance corresponding to the second audio signal; and
an estimation value of the pure voice spectrum amplitude corresponding to the second audio signal.
Wherein, the estimation value of the noise variance corresponding to the first audio signal and the estimation value of the noise variance corresponding to the second audio signal are the noise estimation results obtained by noise estimation on the first audio signal and the second audio signal, respectively; the estimation value of the pure voice spectrum amplitude corresponding to the first audio signal and the estimation value of the pure voice spectrum amplitude corresponding to the second audio signal are the voice spectrum estimation results for the voice spectrum estimation on the first audio signal and the second audio signal, respectively.
For the embodiments of the present application, a signal noise estimation algorithm and a voice spectrum estimation algorithm in the prior art may be applied to calculate the estimation value of the noise variance corresponding to the first audio signal, the estimation value of the pure voice spectrum amplitude corresponding to the first audio signal, the estimation value of the noise variance corresponding to the second audio signal and the estimation value of the pure voice spectrum amplitude corresponding to the second audio signal. Alternatively, the first audio signal and the second audio signal may be separately subjected to noise estimation and voice spectrum estimation by using the processing manners of the present application, specifically including: firstly calculating the voice presence prior probability corresponding to each frequency point in the frequency-domain signal of the first audio signal (i.e., the voice presence prior probability corresponding to the first audio signal); then calculating the estimation value of the noise variance and the estimation value of the pure voice spectrum amplitude corresponding to the first audio signal by a signal noise estimation algorithm and a voice spectrum estimation algorithm, based on that voice presence prior probability; and calculating the estimation value of the noise variance and the estimation value of the pure voice spectrum amplitude corresponding to the second audio signal by a signal noise estimation algorithm and a voice spectrum estimation algorithm, based on a predetermined voice presence prior probability. Alternatively, the voice presence prior probability corresponding to the second audio signal may be calculated in real time in the same manner as that corresponding to the first audio signal, and the estimation value of the noise variance and the estimation value of the pure voice spectrum amplitude corresponding to the second audio signal may then be calculated by a signal noise estimation algorithm and a voice spectrum estimation algorithm, based on the voice presence prior probability corresponding to the second audio signal.
Step S1003: obtaining a voice-enhancement-processed audio signal to be output according to the obtained joint voice estimation result.
Wherein, the obtained joint voice estimation result is the final voice spectrum amplitude value corresponding to each frequency point. In the embodiment of the present application, the final voice spectrum amplitude value corresponding to each frequency point is a voice spectrum amplitude value corresponding to each frequency point in the frequency-domain signal corresponding to the voice-enhancement-processed time-domain signal.
Therefore, Step S1003 includes: performing IFFT transformation on the final voice spectrum amplitude value corresponding to each frequency point, applying the sine window, and performing interframe overlap-add, to obtain the voice-enhancement-processed time-domain audio signal to be output.
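For illustration, the following is a minimal sketch of Step S1003, assuming 50%-overlap frames with the sine window used for synthesis and assuming the phase of the noisy first frequency-domain signal is reused for reconstruction (a common choice that the text does not spell out); names are illustrative.

```python
import numpy as np

def synthesize(final_spectra, phases, frame_len):
    """Reconstruct the time-domain output from the final voice spectrum
    amplitude values (joint estimation result) frame by frame.

    final_spectra: (num_frames, frame_len) amplitude per frequency point
    phases:        (num_frames, frame_len) phases, e.g. taken from the
                   noisy first frequency-domain signal (assumption)
    """
    hop = frame_len // 2                                  # 50% overlap
    win = np.sin(np.pi * (np.arange(frame_len) + 0.5) / frame_len)  # sine window
    out = np.zeros(hop * (len(final_spectra) + 1))
    for m, (amp, ph) in enumerate(zip(final_spectra, phases)):
        frame = np.fft.ifft(amp * np.exp(1j * ph)).real   # IFFT per frame
        out[m * hop:m * hop + frame_len] += win * frame   # window + overlap-add
    return out
```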
Specifically, Fig. 7b illustrates how the audio signal is subjected to the voice presence prior probability processing, the signal noise estimation, the voice spectrum estimation and the joint voice estimation processing by using the processing manner of the present application. The processing specifically includes: performing FFT on the first audio signal and the second audio signal to obtain a frequency-domain signal corresponding to the first audio signal and a frequency-domain signal corresponding to the second audio signal, respectively; performing the voice presence prior probability processing based on the two frequency-domain signals, to obtain the voice presence prior probability corresponding to the first audio signal; performing noise estimation on the first audio signal based on that prior probability, to obtain the estimation value of the noise variance corresponding to the first audio signal and the first voice presence posterior probability; performing noise estimation on the second audio signal based on a predefined voice presence prior probability, to obtain the estimation value of the noise variance corresponding to the second audio signal and the second voice presence posterior probability; performing voice spectrum estimation on the first audio signal based on the estimation value of the noise variance corresponding to the first audio signal and the first voice presence posterior probability, to obtain the estimation value of the pure voice spectrum amplitude corresponding to the first audio signal; performing voice spectrum estimation on the second audio signal based on the estimation value of the noise variance corresponding to the second audio signal and the second voice presence posterior probability, to obtain the estimation value of the pure voice spectrum amplitude corresponding to the second audio signal; performing joint voice spectrum estimation on the first audio signal and the second audio signal according to the noise estimation results and the voice spectrum estimation results corresponding to the two signals; and then performing IFFT on the joint estimation result to obtain the voice-enhancement-processed time-domain audio signal to be output, that is, the output signal x. The specific implementing process is shown in Fig. 7c, which illustrates a specific implementing process of the method for audio processing provided by this specific example, including:
Step S701: performing FFT on the first audio signal and the second audio signal, respectively, to obtain a frequency-domain signal corresponding to the first audio signal and a frequency-domain signal corresponding to the second audio signal.
Before the noise estimation processing is performed on the first audio signal and the second audio signal, the method further includes: performing Fourier transformation on the first audio signal and the second audio signal, respectively, to obtain a frequency-domain signal corresponding to the first audio signal and a frequency-domain signal corresponding to the second audio signal.
For the embodiment of the present application, the first audio signal and the second audio signal are respectively calculated by windowed short-time Fourier transformation to obtain a frequency-domain signal corresponding to the first audio signal and a frequency-domain signal corresponding to the second audio signal, which may also be referred to as a first frequency-domain signal and a second frequency-domain signal.
Wherein, the formula of windowed short-time Fourier transformation may be:
$$f(k) = \sum_{n=0}^{N-1} w(n)\, x(n)\, e^{-j 2\pi n k / N}, \qquad k = 0, 1, \ldots, N-1$$

wherein x is the first audio signal x0 or the second audio signal xi, w represents a window function (the window function w in the embodiment of the present application is selected as the sine window), and N is the frame length. The output frequency-domain signal f(k) is the frequency-domain signal corresponding to the first audio signal x0 or the frequency-domain signal corresponding to the second audio signal xi, and is represented as the vector Y in the following.
For example, the frame length N may be 10 ms.
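For illustration, a minimal sketch of the windowed short-time Fourier transformation above, assuming a 16 kHz sampling rate so that a 10 ms frame gives N = 160 (the sampling rate is not stated in the text); names are illustrative.

```python
import numpy as np

def windowed_stft_frame(x_frame):
    """Windowed FFT of one frame: f(k) = sum_n w(n) x(n) e^{-j2*pi*n*k/N}."""
    N = len(x_frame)
    w = np.sin(np.pi * (np.arange(N) + 0.5) / N)   # sine window
    return np.fft.fft(w * x_frame)                  # vector Y, k = 0..N-1

# Usage: Y0 and Yi for the first (air) and second (body) audio signals.
fs, N = 16000, 160                                  # 10 ms frame at an assumed 16 kHz
x0 = np.random.randn(N); xi = np.random.randn(N)
Y0, Yi = windowed_stft_frame(x0), windowed_stft_frame(xi)
```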
Step S702: determining the voice presence prior probability corresponding to the first audio signal. That is, the voice presence prior probability corresponding to each frequency point in the frequency-domain signal of the first audio signal is calculated by voice presence prior probability processing.
First, an Outer Inner Ratio (OIR) of the first frequency-domain signal and the second frequency-domain signal may be calculated, which may also be referred to as a signal OIR between the first audio signal and the second audio signal; and the voice presence prior probability corresponding to the first audio signal is determined by the Outer Inner Ratio.
Specifically, the voice presence prior probability corresponding to each frequency point in the frequency-domain signal of the first audio signal (also referred to as the first voice presence prior probability) is calculated by a Cauchy distribution model, based on the calculated OIR of the first frequency-domain signal and the second frequency-domain signal.
For the embodiment of the present application, according to empirical information, the spectrum amplitude value of a pure voice frequency point roughly conforms to a Gaussian distribution with a mean value of 0, and the ratio of two Gaussian-distributed variables with a mean value of 0 conforms to the Cauchy distribution. Therefore, the voice presence prior probability corresponding to each frequency point in the frequency-domain signal of the first audio signal is calculated based on the OIR by means of the Cauchy distribution model.
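For reference, the standard fact invoked here: if $X \sim \mathcal{N}(0, \sigma_X^2)$ and $Y \sim \mathcal{N}(0, \sigma_Y^2)$ are independent, their ratio follows a Cauchy distribution,

$$\frac{X}{Y} \sim \mathrm{Cauchy}\!\left(0,\ \frac{\sigma_X}{\sigma_Y}\right), \qquad p(r) = \frac{1}{\pi} \cdot \frac{\gamma}{r^2 + \gamma^2}, \quad \gamma = \frac{\sigma_X}{\sigma_Y},$$

which is why a Cauchy model is appropriate for the ratio of the two spectrum amplitudes.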
Specifically, the OIR of the first frequency-domain signal to the second frequency-domain signal is calculated by the following formula:
$$\mathrm{OIR}(k) = \frac{|Y_0(k)|}{|Y_i(k)|}$$

wherein OIR is the Outer Inner Ratio of the first frequency-domain signal to the second frequency-domain signal, Y0 is the first frequency-domain signal output after the first audio signal is subjected to time-frequency conversion, and Yi is the second frequency-domain signal output after the second audio signal is subjected to time-frequency conversion.
The voice presence prior probability corresponding to each frequency point in the frequency-domain signal of the first audio signal is calculated by the following formula:

$$P(H_1)(k) = P(k) \cdot \frac{g^2}{g^2 + \big(\mathrm{OIR}(k) - \mathrm{priOIR}(k)\big)^2}$$

wherein P is the initial value vector of the voice presence prior probability, each vector element corresponding to a frequency point. The initial value of the voice presence prior probability may be obtained by empirical statistics (for example, in a 4-hour experimental signal sequence, if voice is present for 2 hours at a certain frequency point, then the initial value of the voice presence probability at that point is 2 hours / 4 hours = 0.5); the statistics are different for different hardware devices. However, the general rule is that the voice presence probability in the second audio signal decreases rapidly as the frequency increases, while that in the first audio signal decreases relatively slowly. g is an empirical coefficient (which may be a fixed value), and priOIR is the Outer Inner Ratio of the second audio signal and the first audio signal when the signal is pure voice; priOIR may be obtained by pre-statistics.
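For illustration, the following is a minimal sketch combining the OIR computation with the Cauchy-model prior as reconstructed above. Note that the exact ratio definition and the Lorentzian weighting are assumptions recovered from the surrounding description rather than the patent's literal formulas; all names are illustrative.

```python
import numpy as np

def voice_presence_prior(Y0, Yi, P_init, g=1.0, priOIR=None):
    """Per-frequency voice presence prior for the first (air) signal.

    Y0, Yi:  frequency-domain signals of the first and second audio signals
    P_init:  empirically obtained initial prior per frequency point
    g:       empirical coefficient (fixed value)
    priOIR:  pure-voice outer/inner ratio per frequency point (pre-statistics)
    """
    oir = np.abs(Y0) / (np.abs(Yi) + 1e-12)     # outer inner ratio
    if priOIR is None:
        priOIR = np.ones_like(oir)              # hypothetical default
    # Cauchy (Lorentzian) weighting: the prior is largest where the
    # observed OIR matches the pure-voice ratio priOIR (assumed form).
    return P_init * g**2 / (g**2 + (oir - priOIR)**2)
```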
Step S703: performing noise estimation on the first audio signal based on the voice presence prior probability corresponding to the first audio signal.
Further, the estimation value of the noise variance corresponding to the first audio signal is calculated by a signal noise estimation algorithm, based on the calculated voice presence prior probability corresponding to each frequency point in the frequency-domain signal of the first audio signal.
Specifically, the voice presence posterior probability corresponding to each frequency point in the frequency-domain signal of the first audio signal (which may also be referred to as the first voice presence posterior probability) is calculated by a signal noise estimation algorithm, based on the calculated voice presence prior probability corresponding to each frequency point in the frequency-domain signal of the first audio signal; the estimation value of the noise variance corresponding to the first audio signal is then calculated based on the first voice presence posterior probability.
Step S704: performing noise estimation on the second audio signal.
The voice presence posterior probability corresponding to each frequency point in the frequency-domain signal of the second audio signal (which may also be referred to as the second voice presence posterior probability) is calculated by a signal noise estimation algorithm, based on the predetermined voice presence prior probability; the estimation value of the noise variance corresponding to the second audio signal is then calculated based on the second voice presence posterior probability.
Specifically, the first voice presence posterior probability or the second voice presence posterior probability is calculated by the following formula (1), and the estimation value of the noise variance corresponding to the first audio signal or the estimation value of the noise variance corresponding to the second audio signal is calculated by the formula (2).
P(H1|y) = { 1 + [P(H0)/P(H1)] · (1 + ξ) · exp(−v) }^(−1),  with v = (|Y|² / λd) · ξ/(1 + ξ)    (1)
wherein, P(H1|y) is the voice presence posterior probability, which may be the first voice presence posterior probability or the second voice presence posterior probability; P(H0) is the voice absence prior probability, P(H0) = 1 − P(H1); if P(H1|y) is the first voice presence posterior probability, then P(H1) is the first voice presence prior probability and P(H0) is the first voice absence prior probability; if P(H1|y) is the second voice presence posterior probability, then P(H1) is the predefined voice presence prior probability and P(H0) is the second voice absence prior probability; ξ is the prior signal-to-noise ratio, which may be a fixed value, for example 12 dB in the embodiment of the present application; Y is a frequency-domain signal, which may be the first frequency-domain signal or the second frequency-domain signal;
λd is the estimation value of the noise variance estimated in the previous frame, and may represent the estimation value of the noise variance corresponding to the first audio signal or the estimation value of the noise variance corresponding to the second audio signal.
λd(k) = [α + (1 − α) · P(H1|y)] · λd(k−1) + (1 − α) · P(H0|y) · |Y|²    (2)
wherein, λd(k) represents the estimation value of the noise variance of the current frame, which may also be referred to as the updated estimation value of the noise variance, and may represent the estimation value of the noise variance corresponding to the first audio signal or the estimation value of the noise variance corresponding to the second audio signal; λd(k−1) is the estimation value of the noise variance estimated in the previous frame; α is an updating coefficient, which may be a fixed value between 0 and 1, for example 0.8; P(H0|y) represents the voice absence posterior probability, which may be the first voice absence posterior probability corresponding to the first audio signal or the second voice absence posterior probability corresponding to the second audio signal; and |Y| is the amplitude value of the frequency-domain signal.
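To make formulas (1) and (2) concrete, a minimal Python sketch is given below, assuming the standard soft-decision forms reconstructed above (the original equations appear as images); the function names and default values are illustrative only.

    import numpy as np

    def voice_presence_posterior(Y, lambda_d, p_h1, xi_db=12.0):
        # Formula (1): posterior probability of voice presence per frequency point.
        xi = 10.0 ** (xi_db / 10.0)                      # prior SNR, fixed at 12 dB
        v = (np.abs(Y) ** 2 / np.maximum(lambda_d, 1e-12)) * xi / (1.0 + xi)
        p_h0 = 1.0 - p_h1
        return 1.0 / (1.0 + (p_h0 / np.maximum(p_h1, 1e-12)) * (1.0 + xi) * np.exp(-v))

    def update_noise_variance(Y, lambda_d_prev, p_h1_post, alpha=0.8):
        # Formula (2): recursive noise variance update weighted by the
        # voice absence posterior probability P(H0|y) = 1 - P(H1|y).
        p_h0_post = 1.0 - p_h1_post
        return (alpha + (1.0 - alpha) * p_h1_post) * lambda_d_prev \
               + (1.0 - alpha) * p_h0_post * np.abs(Y) ** 2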
Step S705: performing voice spectrum estimation on the first audio signal.
The estimation value of the pure voice frequency-domain amplitude corresponding to the first audio signal is calculated based on the estimation value of the noise variance corresponding to the first audio signal and the first voice presence posterior probability.
Step S706: performing voice spectrum estimation on the second audio signal.
The estimation value of the pure voice spectrum amplitude corresponding to the second audio signal is calculated based on the estimation value of the noise variance corresponding to the second audio signal and the second voice presence posterior probability.
Specifically, after the first voice presence posterior probability and the estimation value of the noise variance corresponding to the first audio signal are obtained, the OM-LSA algorithm is used to calculate the gain G1 applied to the collected original signal (the first audio signal), and the estimation value of the pure voice frequency-domain amplitude corresponding to the first audio signal is then calculated based on the gain G1; likewise, after the second voice presence posterior probability and the estimation value of the noise variance corresponding to the second audio signal are obtained, the OM-LSA algorithm is used to calculate the gain G2 applied to the collected original signal (the second audio signal), and the estimation value of the pure voice frequency-domain amplitude corresponding to the second audio signal is then calculated based on the gain G2.
Specifically, the estimation value S of the pure voice frequency-domain amplitude is calculated by the formulas (3) and (4), and S may be the estimation value S1 of the pure voice frequency-domain amplitude corresponding to the first audio signal, or may be the estimation value S2 of the pure voice frequency-domain amplitude corresponding to the second audio signal, wherein,
S = G · Y    (3)
G = (G_H1)^(P(H1|y)) · (Gmin)^(P(H0|y))    (4)
wherein, G_H1 = [ξ/(1 + ξ)] · exp( (1/2) · ∫_v^∞ (e^(−t)/t) dt ) and v = (|Y|² / λd) · ξ/(1 + ξ).
wherein, when calculating the estimation value S1 of the pure voice frequency-domain amplitude corresponding to the first audio signal: in the formula (3), G is G1 and Y is the first frequency-domain signal; in the formula (4), P(H1|y) is the first voice presence posterior probability, P(H0|y) is the first voice absence posterior probability, |Y| is the amplitude value of the frequency-domain signal corresponding to the first audio signal, and λd is the estimation value of the noise variance corresponding to the first audio signal.
wherein, when calculating the estimation value S2 of the pure voice frequency-domain amplitude corresponding to the second audio signal: in the formula (3), G is G2 and Y is the second frequency-domain signal; in the formula (4), P(H1|y) is the second voice presence posterior probability, P(H0|y) is the second voice absence posterior probability, |Y| is the amplitude value of the frequency-domain signal corresponding to the second audio signal, and λd is the estimation value of the noise variance corresponding to the second audio signal.
Gmin is an empirical coefficient with a fixed value, which is the lower limit of G, and may be selected as a value between -18 dB and -30 dB.
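The OM-LSA gain of formulas (3) and (4) can be sketched in Python as follows, under the assumption (stated above) that G_H1 takes the standard LSA form; scipy.special.exp1 evaluates the exponential integral ∫_v^∞ (e^(−t)/t) dt.

    import numpy as np
    from scipy.special import exp1  # exponential integral E1(v)

    def omlsa_spectrum_estimate(Y, lambda_d, p_h1_post, xi_db=12.0, g_min_db=-24.0):
        xi = 10.0 ** (xi_db / 10.0)
        v = (np.abs(Y) ** 2 / np.maximum(lambda_d, 1e-12)) * xi / (1.0 + xi)
        g_h1 = (xi / (1.0 + xi)) * np.exp(0.5 * exp1(np.maximum(v, 1e-12)))
        g_min = 10.0 ** (g_min_db / 20.0)                    # lower limit Gmin
        G = g_h1 ** p_h1_post * g_min ** (1.0 - p_h1_post)   # formula (4)
        return G * Y                                         # formula (3): S = G * Y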
Step S707: performing joint voice spectrum estimation on the first audio signal and the second audio signal, according to the noise estimation results corresponding to the first audio signal and the second audio signal and the voice spectrum estimation results corresponding to the first audio signal and the second audio signal.
After determining noise estimation results corresponding to the first audio signal and the second audio signal (the estimation value of the noise variance corresponding to the first audio signal and the estimation value of the noise variance corresponding to the second audio signal), and voice spectrum estimation results corresponding to the first audio signal and the second audio signal respectively (the estimation value of the pure voice spectrum amplitude corresponding to the first audio signal and the estimation value of the pure voice spectrum amplitude corresponding to the second audio signal), a mean value of a third Gaussian distribution model is determined according to a first Gaussian distribution model whose mean value is the voice spectrum estimation result of the first audio signal and variance is the noise estimation result of the first audio signal, and a second Gaussian distribution model whose mean value is the voice spectrum estimation result of the second audio signal and variance is the noise estimation result of the second audio signal; the joint voice spectrum estimation result for joint voice spectrum estimation on the first audio signal and the second audio signal is determined according to the mean value of the third Gaussian distribution model.
For the embodiment of the present application, using the above-calculated estimation value of the noise variance corresponding to the first audio signal, estimation value of the pure voice spectrum amplitude corresponding to the first audio signal, estimation value of the noise variance corresponding to the second audio signal, and estimation value of the pure voice spectrum amplitude corresponding to the second audio signal, the voice spectrum amplitude of the first frequency point may be regarded as a Gaussian distribution whose mean value is the voice spectrum amplitude corresponding to the first frequency point and whose variance is the estimation value of the noise variance corresponding to the first frequency point; the voice spectrum amplitude of the second frequency point may likewise be regarded as a Gaussian distribution whose mean value is the voice spectrum amplitude corresponding to the second frequency point and whose variance is the estimation value of the noise variance corresponding to the second frequency point. Based on the fact that the product of two Gaussian distributions is itself a Gaussian distribution, together with the above information, the final voice spectrum amplitude value corresponding to any frequency point, that is, the mean value of the new Gaussian distribution, is calculated, as shown in Fig. 7d, wherein the common probability distribution in the figure refers to the final voice spectrum amplitude probability distribution.
Wherein, the first frequency point is any frequency point of the frequency-domain signal of the first audio signal, and the voice spectrum amplitude of the first frequency point is the voice spectrum amplitude corresponding to the first frequency point; the second frequency point is any frequency point in the frequency-domain signal of the second audio signal, and the voice spectrum amplitude of the second frequency point is the voice spectrum amplitude corresponding to the second frequency point.
Specifically, the final voice spectrum amplitude value corresponding to any frequency point, that is, the joint voice spectrum estimation result is calculated by the formula (5).
Sio = (λi · So + λo · Si) / (λo + λi)    (5)
wherein, λo and λi are the estimation values of the noise variance corresponding to the first audio signal and the second audio signal respectively; Sio is the final voice spectrum amplitude value corresponding to any frequency point, So is the estimation value of the pure voice frequency-domain amplitude corresponding to the first audio signal, and Si is the estimation value of the pure voice frequency-domain amplitude corresponding to the second audio signal.
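A one-line Python sketch of formula (5), assuming the product-of-Gaussians fusion reconstructed above: the estimate with the smaller noise variance receives the larger weight.

    import numpy as np

    def joint_spectrum(S_o, S_i, lambda_o, lambda_i):
        # Mean of the product of N(S_o, lambda_o) and N(S_i, lambda_i).
        return (lambda_i * S_o + lambda_o * S_i) / np.maximum(lambda_o + lambda_i, 1e-12)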
Step S708: performing IFFT transformation on the joint voice spectrum estimation result to obtain a voice-enhanced time-domain audio signal to be output, that is, an output signal x.
Specifically, the IFFT transformation is performed on the final voice spectrum amplitude value corresponding to each frequency point, and then, by windowing with a sine window and an overlap-add process, the voice-enhanced time-domain audio signal to be output is obtained.
The voice-enhanced time-domain audio signal may be calculated according to formula (6).
x(n) = w(n) · (1/N) · Σ_{k=0}^{N−1} Sio(k) · e^(j·2πkn/N)    (6)
wherein, x(n) is the voice-enhanced time-domain audio signal, w represents the window function, and Sio(k) is the frequency-domain signal corresponding to the voice-enhanced time-domain audio signal.
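A minimal Python sketch of this synthesis step (the sine window and hop size are illustrative choices; the original fixes neither):

    import numpy as np

    def synthesize(S_io_frames, hop):
        # Formula (6): IFFT of each enhanced spectrum, sine windowing, overlap-add.
        n_frames, n_fft = S_io_frames.shape
        win = np.sin(np.pi * (np.arange(n_fft) + 0.5) / n_fft)
        x = np.zeros(hop * (n_frames - 1) + n_fft)
        for m in range(n_frames):
            frame = np.real(np.fft.ifft(S_io_frames[m])) * win
            x[m * hop:m * hop + n_fft] += frame
        return x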
Second specific example
This specific example provides another method for audio processing, as shown in Fig. 7e, including:
Step S1004: acquiring a first audio signal collected by the air conduction audio collecting device and a second audio signal collected by the body conduction audio collecting device.
Step S1005: obtaining a voice-enhancement-processed audio signal by voice presence prior probability calculation processing, based on the first audio signal and the second audio signal.
For the embodiment of the present application, before Step S1005, the method further includes: performing Fourier transformation on the first audio signal and the second audio signal, respectively, to obtain a frequency-domain signal corresponding to the first audio signal (which may also be referred to as a first frequency-domain signal) and a frequency-domain signal corresponding to the second audio signal (which may also be referred to as a second frequency-domain signal).
Specifically, the manner of performing Fourier transformation on the first audio signal and the second audio signal is described in detail in the first specific example, and will not be repeated in this example.
For the embodiment of the present application, Step S1005 may specifically include: Step S10051 (not shown in the figure), Step S10052 (not shown in the figure), Step S10053 (not shown in the figure), Step S10054 (not shown in the figure) and Step S10055 (not shown in the figure), wherein,
Step S10051: determining the voice presence prior probability corresponding to the first audio signal.
Step S10052: performing noise estimation on the first audio signal based on the determined voice presence prior probability.
Step S10053: performing noise estimation on the second audio signal.
Step S10054: performing voice spectrum estimation on the first audio signal and the second audio signal respectively according to the noise estimation results corresponding to the first audio signal and the second audio signal.
Step S10055: performing voice enhancement processing on the first audio signal and the second audio signal respectively to obtain the voice-enhancement-processed audio signal according to the voice spectrum estimation results corresponding to the first audio signal and the second audio signal.
For the embodiments of the present application, the detailed process of obtaining the first voice presence prior probability by the voice presence prior probability calculation processing based on the first audio signal and the second audio signal is described in the first specific example, and will not be repeated herein.
For the embodiments of the present application, the first audio signal and the second audio signal may be subjected to voice spectrum estimation and voice enhancement processing in the manner of voice spectrum estimation and voice enhancement processing in the prior art; alternatively, the voice-enhancement-processed time-domain signal may be determined by the signal noise estimation, the voice spectrum estimation, the joint voice estimation and the IFFT of the present application, based on the first voice presence prior probability.
Specifically, the detailed calculation manner of determining the voice-enhancement-processed time-domain signal by the signal noise estimation, the voice spectrum estimation, the joint voice estimation and the IFFT, based on the first voice presence prior probability, is described in the first specific example, and will not be repeated herein.
Third specific example
This specific example provides another method for audio processing, as shown in Fig. 7f, including:
Step S1006: acquiring a first audio signal collected by the air conduction audio collecting device and a second audio signal collected by the body conduction audio collecting device.
Step S1007: obtaining the voice presence prior probability corresponding to the first audio signal by the voice presence prior probability calculation processing based on the first audio signal and the second audio signal.
Before Step S1007, the method further includes: performing Fourier transformation on the first audio signal and the second audio signal, respectively, to obtain a frequency-domain signal corresponding to the first audio signal (which may also be referred to as a first frequency-domain signal) and a frequency-domain signal corresponding to the second audio signal (which may also be referred to as a second frequency-domain signal).
Step S1008: obtaining a voice-enhancement-processed audio signal by joint voice estimation processing and based on the following information:
the estimation value of the noise variance corresponding to the first audio signal;
the estimation value of the pure voice spectrum amplitude corresponding to the first audio signal;
the estimation value of the noise variance corresponding to the second audio signal; and
the estimation value of the pure voice spectrum amplitude corresponding to the second audio signal.
For the embodiment of the present application, before Step S1008, the method further includes: determining the estimation value of the noise variance corresponding to the first audio signal, the estimation value of the pure voice spectrum amplitude corresponding to the first audio signal, the estimation value of the noise variance corresponding to the second audio signal, and the estimation value of the pure voice spectrum amplitude corresponding to the second audio signal, according to the voice presence prior probability corresponding to the first audio signal calculated in Step S1007.
The detailed calculation manner is described in the first specific example, and will not be repeated herein.
The detailed calculation manner in Step S1008 for obtaining the voice-enhancement-processed audio signal by the joint voice estimation processing, based on the estimation value of the noise variance corresponding to the first audio signal, the estimation value of the pure voice spectrum amplitude corresponding to the first audio signal, the estimation value of the noise variance corresponding to the second audio signal, and the estimation value of the pure voice spectrum amplitude corresponding to the second audio signal, is described in the first specific example, and will not be repeated herein.
Embodiment Ⅱ
The embodiment of the present application provides another method for audio processing. As shown in Fig. 8a, the detected audio signal collected by the body conduction audio collecting device of the earphone and the signal to be played by the audio signal playing device (i.e., the earphone speaker) are subjected to voice activation detection, to detect whether it is currently in the voice activation state, that is, to determine whether the user is making voices. If it is detected that at least one of the body conduction audio collecting device channel and the earphone speaker channel is in the voice activation state, the ambient sound cancellation processing is performed by a set filter, and the voice enhancement processing is performed according to the ambient-sound-cancellation-processed audio signal and the audio signal collected by the air conduction audio collecting device of the earphone, to obtain the voice-enhancement-processed signal, which is used as an output signal. If it is detected that both the body conduction audio collecting device channel and the earphone speaker channel are in the voice inactivation state, the parameter information of the set filter (i.e., the parameter information of the ambient sound cancellation processing) is updated according to the audio signal collected in the inactivation state, which corresponds to the filter updating in the figure. The above content is described in detail below, as shown in Fig. 8b, wherein,
Step S1101: acquiring a first audio signal collected by the air conduction audio collecting device and a second audio signal collected by the body conduction audio collecting device.
For the embodiment of the present application, when the user is speaking, in addition to the voice signal of the user, the first audio signal may also contain an ambient noise signal; the second audio signal includes the voice signal that is conducted via body tissues and collected by the body conduction audio collecting device of the earphone, as well as the audio signal played by the earphone speaker and picked up by the body conduction audio collecting device.
Step S1102: acquiring a third audio signal to be played by the earphone speaker.
For the embodiment of the present application, Step S1101 and Step S1102 may be performed simultaneously.
Step S1103a: performing ambient sound cancellation processing on the second audio signal through the third audio signal to obtain the ambient-sound-cancellation-processed second audio signal.
Before Step S1103a, it is detected whether it is currently in the voice activation state, and if detecting that it is in the voice activation state, it is determined to perform Step S1103a.
Specifically, the step of performing ambient sound cancellation processing on the second audio signal through the third audio signal in Step S1103a includes: performing ambient sound cancellation filter processing on the third audio signal, and obtaining a filter-processed signal; and, removing the filter-processed signal from the second audio signal, and obtaining an ambient-sound-cancellation-processed second audio signal.
Wherein, being currently in the voice activation state indicates that the user is currently making voices.
For the embodiment of the present application, the ambient noise is not contained in the ambient-sound-cancellation-processed second audio signal, and only the voice signal conducted via body tissues and collected by the body conduction audio collecting device is contained.
Specifically, the ambient-sound-cancellation-processed second audio signal is calculated by the formula (7):
ε(k) = d(k) − y(k),  where y(k) = Σ_{i=1}^{M} wi · X(k − i + 1)    (7)
wherein, ε is the ambient-sound-cancellation-processed second audio signal; d is the expected signal (that is, the second audio signal) collected by the body conduction audio collecting device of the earphone when it is currently in the voice activation state; y is the above filter-processed signal when it is currently in the voice activation state; X is the third audio signal; k is the kth point among the time-domain sampling points, which may be referred to as time k, and its value is an index value; M is the order of the set filter; and wi is the ith order coefficient of the filter.
Step S1103b: if it is detected that it is currently in the voice inactivation state, updating the parameter information of the ambient sound cancellation filter processing.
For the embodiment of the present application, Step S1103a may be performed before the Step S1103b, or may be performed after Step S1103b, which is not limited in the embodiment of the present application.
Specifically, the step of updating the parameter information of the ambient sound cancellation filter processing in Step S1103b includes: determining a prediction signal for the second audio signal based on the third audio signal; updating parameter information of the ambient sound cancellation filter processing according to the second audio signal and the prediction signal for the second audio signal.
Updating the parameter information of the ambient sound cancellation filter processing means updating the parameter information of the set filter. When it is currently in the voice inactivation state, as shown in Fig. 8c, the signal to be played by the earphone speaker (i.e., the third audio signal) X(k) is used to predict the signal collected by the body conduction audio collecting device, to obtain a prediction signal y(k) for the second audio signal; the parameter information of the set filter is then updated according to the expected signal collected by the body conduction audio collecting device in the inactivation state, to obtain the updated parameter information W of the set filter, wherein the calculation formula of the updated parameter information of the set filter is shown in formula (8), in which,
W(k+1) = W(k) + μ · ε(k) · X(k)    (8)
wherein, W(k) is the filter coefficient vector at time k; W(k+1) represents the coefficient vector at the next time k+1, that is, the updated coefficients; μ is a fixed empirical value; ε(k) is the difference between the expected signal d(k) collected by the body conduction audio collecting device of the earphone and the prediction signal y(k) in the inactivation state; and W = {w1, w2, w3, w4 ... wM}.
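Formulas (7) and (8) together describe an adaptive FIR canceller. A minimal Python sketch follows; the class name, buffer convention and step size mu are illustrative assumptions, and (as in the text) adaptation runs only in voice-inactive frames while cancellation runs in voice-active frames.

    import numpy as np

    class AmbientSoundCanceller:
        def __init__(self, order=64, mu=0.01):
            self.w = np.zeros(order)   # filter coefficients w1..wM
            self.mu = mu               # fixed empirical step size

        def cancel(self, x_buf, d_k):
            # Formula (7): eps(k) = d(k) - sum_i w_i * X(k-i+1), where x_buf
            # holds the last M samples of the speaker signal X.
            return d_k - np.dot(self.w, x_buf)

        def adapt(self, x_buf, d_k):
            # Formula (8): LMS update W(k+1) = W(k) + mu * eps(k) * X(k).
            eps = self.cancel(x_buf, d_k)
            self.w += self.mu * eps * x_buf
            return eps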
Step S1104: determining an audio signal to be output based on the first audio signal and the ambient-sound-cancellation-processed second audio signal.
For the embodiment of the present application, Step S1104 is performed after Step S1103a.
When the voice is currently in the activation state, the ambient-sound-cancellation-processed second audio signal in the embodiment of the present application may be equivalent to the second audio signal collected by the body conduction audio collecting device in Embodiment I. Specifically, the manner for performing the voice enhancement processing on the first audio signal and the ambient-sound-cancellation-processed second audio signal is described in detail in Embodiment I, and details will not be described in this embodiment.
Further, the specific manner of detecting whether it is currently in the voice activation state is shown in Fig. 9a, and includes: performing voice activation detection on the third audio signal to be played by the earphone speaker and on the second audio signal collected by the body conduction audio collecting device of the earphone, respectively; if at least one is in the activation state, determining the correlation between the third audio signal and the second audio signal, that is, performing correlation detection, to obtain a sequence of correlation coefficients; and then detecting whether there is another peak within the predefined range before the main peak of the sequence of correlation coefficients; if such a peak exists, determining that it is currently in the voice activation state; otherwise, it is in the inactivation state. The voice activation detection is described in detail below with reference to Fig. 9b, wherein,
Step S1201: for the third audio signal and/or the second audio signal, determining whether the earphone speaker channel and/or the body conduction audio collecting device channel is/are in the voice activation state.
Specifically, the step includes: for the third audio signal, calculating whether the earphone speaker channel is in the voice activation state by using a short-time energy algorithm or a zero-crossing rate algorithm; and/or, for the second audio signal, calculating whether the body conduction audio collecting device channel is in the voice activation state by using a short-time energy algorithm or a zero-crossing rate algorithm.
Wherein, the short-time energy calculation formula is:
E = Σ_{n=0}^{N−1} s²(n), in which s(n) is the amplitude value of the frequency point n of the frequency-domain signal corresponding to the third audio signal, or the amplitude value of the frequency point n of the frequency-domain signal corresponding to the second audio signal, and N is the frame length.
Wherein, the zero-crossing rate algorithm formula is:
Z = (1/2) · Σ_{n=1}^{N−1} |sgn[s(n)] − sgn[s(n−1)]|, in which sgn[x] = 1 when x ≥ 0 and sgn[x] = −1 when x < 0; s(n) is the amplitude value of the frequency point n of the frequency-domain signal corresponding to the third audio signal, or the amplitude value of the frequency point n of the frequency-domain signal corresponding to the second audio signal, and N is the frame length.
For the embodiment of the present application, when the short-time energy value is greater than the predefined threshold or the zero-crossing rate value is greater than the predefined threshold, it is determined that the channel is in a voice activation state.
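A minimal Python sketch of this per-channel decision (the threshold values are illustrative assumptions):

    import numpy as np

    def channel_active(s, energy_thr, zcr_thr):
        # Short-time energy E = sum s(n)^2 and zero-crossing rate
        # Z = 0.5 * sum |sgn(s(n)) - sgn(s(n-1))| over one frame.
        energy = np.sum(s ** 2)
        sgn = np.where(s >= 0, 1, -1)
        zcr = 0.5 * np.sum(np.abs(np.diff(sgn)))
        return energy > energy_thr or zcr > zcr_thr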
Step S1202: if at least one channel is in the voice activation state, determining whether it is currently in the voice activation state according to a correlation between the third audio signal and the second audio signal.
Specifically, the step of determining whether it is currently in the voice activation state according to a correlation between the third audio signal and the second audio signal in Step S1202 includes: calculating a correlation between the third audio signal and the second audio signal to obtain a sequence of correlation coefficients; and determining whether it is currently in the activation state based on the sequence of correlation coefficients.
Specifically, the correlation between the third audio signal and the second audio signal is calculated by formula (9):
ρ(X, Y) = Cov(X, Y) / √(Var[X] · Var[Y])    (9)
wherein, Cov(X, Y) is the cross-correlation value between the third audio signal and the second audio signal, and Var[X] and Var[Y] are the signal variance values of the third audio signal and the second audio signal respectively.
Specifically, the step of determining whether it is currently in the activation state, based on the sequence of correlation coefficients, includes: determining a main peak in the sequence of correlation coefficients; if there is another peak in the predefined delay range before the main peak in the sequence of correlation coefficients, determining that the voice is currently in the activation state.
For the embodiment of the present application, as shown in Fig. 9c, when there is another peak (corresponding to the correlated peak in the figure) within the predefined delay range before the main peak in the correlation coefficient sequence, it is determined that it is currently in the voice activation state.
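The correlation test of formula (9) and the double-peak check can be sketched in Python as follows; the normalization and the search_range parameter are illustrative assumptions.

    import numpy as np

    def double_peak_voice_active(x3, x2, search_range=32):
        # Normalized cross-correlation sequence between the speaker signal
        # x3 (X) and the body-conduction signal x2 (Y), one value per lag.
        x3 = x3 - x3.mean()
        x2 = x2 - x2.mean()
        corr = np.correlate(x2, x3, mode="full")
        corr = corr / (np.sqrt(np.sum(x3 ** 2) * np.sum(x2 ** 2)) + 1e-12)
        main = int(np.argmax(corr))
        head = corr[max(0, main - search_range):main]
        if head.size < 3:
            return False
        # Another local peak shortly before the main peak indicates the
        # user's own voice arriving via body conduction ahead of playback.
        is_peak = (head[1:-1] > head[:-2]) & (head[1:-1] > head[2:])
        return bool(np.any(is_peak))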
Since in the AS mode the user needs to hear the ambient sound, the noise outside the ear is recorded and then played by the in-ear speaker, and there is a delay between the audio signal outside the ear being recorded and being played in the in-ear speaker. If the user is currently in the speaking state, the speaking voice is collected by the air conduction audio collecting device and by the body conduction audio collecting device simultaneously, and there is a delay because the sound audio signal collected by the air conduction audio collecting device must be recorded and then played by the in-ear speaker. That is, the audio signal collected in the ear is composed of two parts: one is the signal collected by the body conduction audio collecting device via body tissue conduction, and the other is the part collected by the air conduction audio collecting device, then played by the in-ear speaker, and finally picked up by the body conduction audio collecting device. The audio signal at this time will therefore have two peaks in the correlation. The second peak is the autocorrelation of the audio signal collected by the air conduction audio collecting device, and may be greater than the cross-correlation peak value between the signal collected by the body conduction audio collecting device via body tissue conduction (excluding the high-frequency part of the out-ear signal) and the audio signal collected by the air conduction audio collecting device, as specifically shown in Fig. 9d. Part (1) in Fig. 9d shows the audio signal collected by the air conduction audio collecting device; Part (2) in Fig. 9d shows the audio signal to be played by the earphone speaker; Part (3) in Fig. 9d shows the audio signal collected by the body conduction audio collecting device when the earphone is in the non-AS mode; Part (4) in Fig. 9d shows the audio signal collected by the body conduction audio collecting device when the earphone is in the AS mode.
Embodiment Ⅲ
The embodiment of the present application provides another method for audio processing, as shown in Fig. 10a, including:
Step S1301: acquiring a first audio signal collected by the air conduction audio collecting device and a second audio signal collected by the body conduction audio collecting device.
Step S1302: performing ambient sound cancellation processing on the second audio signal to obtain the ambient-sound-cancellation-processed second audio signal.
For the embodiment of the present application, if the current mode of the earphone is the non-AS mode, the ambient sound cancellation processing may not be performed on the second audio signal; if the current mode of the earphone is the AS mode, the ambient sound cancellation processing may be performed on the second audio signal.
Step S1302 includes: Step S1302a (not shown in the figure) to Step S1302b (not shown in the figure), wherein,
Step S1302a: acquiring a third audio signal to be played by the earphone speaker.
Step S1302b: performing ambient sound cancellation processing on the second audio signal through the third audio signal to obtain the ambient-sound-cancellation-processed second audio signal.
In a possible implementation, Step S1302b includes: Step S1302b1 (not shown in the figure) to Step S1302b2 (not shown in the figure), wherein,
Step S1302b1: performing ambient sound cancellation filter processing on the third audio signal, and obtaining a filter-processed signal.
Step S1302b2: removing the filter-processed signal from the second audio signal, and obtaining the ambient-sound-cancellation-processed second audio signal.
Specifically, Step S1302b includes: Step S1302b3 (not shown in the figure) to Step S1302b4 (not shown in the figure), wherein,
Step S1302b3: detecting whether it is currently in a voice activation state, wherein the voice activation state indicates that the user is making voices.
Specifically, Step S1302b3 may specifically include: Step S1302b31 (not shown in the figure) and Step S1302b32 (not shown in the figure), wherein,
Step S1302b31: determining whether the earphone speaker channel and/or the body conduction audio collecting device channel is/are in the voice activation state according to the second audio signal and/or the third audio signal.
Step S1302b32: if at least one channel is in the voice activation state, then determining whether it is currently in the voice activation state according to a signal correlation between the second audio signal and the third audio signal.
Specifically, in Step S1302b32, the step of determining whether it is currently in the voice activation state according to a signal correlation between the second audio signal and the third audio signal may include Step Sd (not shown in the figure) to Step Se (not shown in the figure), wherein,
Step Sd: determining a sequence of correlation coefficients between the second audio signal and the third audio signal.
Step Se: determining whether it is currently in the voice activation state based on the sequence of correlation coefficients.
Specifically, the Step Se may specifically include: Step Se1 (not shown in the figure) and Step Se2 (not shown in the figure), wherein,
Step Se1: in the sequence of correlation coefficients, determining the main peak.
Step Se2: if there is another peak in the predefined delay range before the main peak in the sequence of correlation coefficients, determining that it is currently in the voice activation state.
Step S1302b4: if detecting that it is in the voice activation state, performing the step of performing ambient sound cancellation processing on the second audio signal through the third audio signal.
In a possible implementation, the method further includes: Step Sc (not shown in the figure), wherein,
Step Sc: if detecting that it is in the voice inactivation state, updating the parameter information of the ambient sound cancellation filter processing.
Step Sc may be performed after Step S1302b3.
Step S1303: performing voice enhancement processing on the first audio signal and the ambient-sound-cancellation-processed second audio signal based on the signal correlation between the first audio signal and the second audio signal, to obtain the voice-enhancement-processed audio signal to be output.
Specifically, in Step S1303, the step of performing voice enhancement processing on the first audio signal and the ambient-sound-cancellation-processed second audio signal based on the signal correlation between the first audio signal and the second audio signal, may specifically include Step S13031 (not shown in the figure), Step S13032 (not shown in the figure) and Step S13033 (not shown in the figure), wherein,
Step S13031: performing noise estimation on the first audio signal and the ambient-sound-cancellation-processed second audio signal, respectively.
Specifically, the step of performing noise estimation on the first audio signal in Step S13031 may include Step Sf (not shown in the figure) to Step Sg (not shown in the figure), wherein,
Step Sf: determining the voice presence prior probabilities corresponding to the first audio signal and the ambient-sound-cancellation-processed second audio signal.
Specifically, Step Sf may include: Step Sf1 (not shown in the figure) to Step Sf2 (not shown in the figure), wherein,
Step Sf1: determining a signal OIR between the first audio signal and the ambient-sound-cancellation-processed second audio signal.
Step Sf2: determining the voice presence prior probabilities corresponding to the first audio signal and the ambient-sound-cancellation-processed second audio signal based on the signal OIR.
Step Sg: performing noise estimation on the first audio signal based on the voice presence prior probability.
Specifically, Step Sg may include Step Sg1 (not shown in the figure) and Step Sg2 (not shown in the figure), wherein,
Step Sg1: determining the corresponding voice presence posterior probability based on the voice presence prior probability.
Step Sg2: performing noise estimation on the first audio signal based on the voice presence posterior probability.
Step S13032: performing a voice spectrum estimation on the first audio signal and the ambient-sound-cancellation-processed second audio signal according to the noise estimation result corresponding to the first audio signal and the ambient-sound-cancellation-processed second audio signal.
Step S13033: performing voice enhancement processing on the first audio signal and the ambient-sound-cancellation-processed second audio signal according to the voice spectrum estimation result corresponding to the first audio signal and the ambient-sound-cancellation-processed second audio signal.
Specifically, Step S13033 may include Step S13033a (not shown in the figure), wherein,
Step S13033a: performing voice enhancement processing on the first audio signal and the ambient-sound-cancellation-processed second audio signal according to the noise estimation result corresponding to the first audio signal and the ambient-sound-cancellation-processed second audio signal, and the voice spectrum estimation results corresponding to the first audio signal and the ambient-sound-cancellation-processed second audio signal.
Specifically, Step S13033a may include Step Sh (not shown in the figure) to Step Si (not shown in the figure), wherein,
Step Sh: performing joint voice spectrum estimation on the first audio signal and the ambient-sound-cancellation-processed second audio signal according to the noise estimation results corresponding to the first audio signal and the ambient-sound-cancellation-processed second audio signal, and the voice spectrum estimation results corresponding to the first audio signal and the ambient-sound-cancellation-processed second audio signal.
Specifically, the Step Sh may include a Step Sh1 (not shown in the figure) to a Step Sh2 (not shown in the figure), wherein,
Step Sh1: determining a mean value of a third Gaussian distribution model according to a first Gaussian distribution model whose mean value is the voice spectrum estimation result of the first audio signal and variance is the noise estimation result of the first audio signal, and a second Gaussian distribution model whose mean value is the voice spectrum estimation result of the ambient-sound-cancellation-processed second audio signal and variance is the noise estimation result of the ambient-sound-cancellation-processed second audio signal.
Step Sh2: determining joint voice spectrum estimation results for joint voice spectrum estimation on the first audio signal and the ambient-sound-cancellation-processed second audio signal according to a mean value of the third Gaussian distribution model.
Step Si: obtaining a voice-enhancement-processed audio signal to be output according to the obtained joint voice spectrum estimation results.
For the embodiment of the present application, the technical solutions of Embodiment Ⅰ and Embodiment Ⅱ are contained in Embodiment Ⅲ, and the specific implementations of the steps in Embodiment Ⅲ are described in detail in Embodiment Ⅰ and Embodiment Ⅱ, and will not be described again in this embodiment.
The method for audio processing provided in the embodiment of the present application allows the earphone user to activate the AS mode when using the earphone to make a call, so that the user who is talking with the earphone may clearly hear the surrounding ambient sound and remain sensitive to it, avoiding the real dangers of ignoring the ambient sound while wearing the earphone during a call, and thereby making calling with the earphone easy and natural. In addition, joint enhancement is performed based on the correlation between the audio signals obtained by air conduction and body conduction, exploiting the complementary characteristics of the two signals (the audio collected by body conduction contains less noise but has insufficient bandwidth, while the audio collected by air conduction has high bandwidth but contains a lot of ambient noise) so that each compensates for the other's weakness. This makes the voice heard by the call peer clean and natural during the call while preserving high intelligibility of the voice, such that even if the user is in a noisy environment, the sound transmitted by the earphone user to the far end remains highly intelligible.
Embodiment Ⅳ
In order to further explain the technical solution in Embodiment III, the embodiment of the present application contains two specific examples, which respectively introduce the method for performing voice enhancement on the collected audio signal in two different application scenarios. The first specific example describes the application scenario in which the device user communicates with a remote call user, and the collected audio signal is processed and sent to the remote call peer with which the communication connection is established. The second specific example introduces the process of sending a voice instruction and controlling the execution of the voice instruction after collecting and processing the audio signals of the device user in the voice-based instruction recognition application scenario. The device user in this embodiment is a user using an earphone provided with a body conduction audio collecting device and an air conduction audio collecting device.
First specific example
This specific example describes, in the application scenario where the device user communicates with the remote call user, how the collected audio signal is processed and sent to the remote call user with which the communication connection is established, as shown in Fig. 10b, wherein,
Step I: establishing a call connection between the device user and the remote call user.
Step II: the device user making a call voice, for example, "Hello?";
Step III: when the earphone is in the AS mode, performing voice activation detection on the collected audio signal, and performing ambient sound cancellation processing in an activation state; and updating the parameter information of the set filter in an inactivation state;
Step IV: performing voice enhancement processing on the ambient-sound-cancellation-processed audio signal (including: time-frequency conversion, noise signal estimation, voice spectrum estimation, joint enhancement, and frequency-time conversion);
Step V: sending the voice-enhancement-processed audio signal to the remote call user; and
Step VI: receiving the voice of the remote call user.
Second specific example
This specific example introduces the process of sending a voice instruction and controlling the execution of the voice instruction after processing the collected audio signal of the device user in the voice-based instruction recognition application scenario, as shown in Fig. 10c, wherein,
Step I: the device user sending a voice instruction, for example "open a map";
Step II: when the earphone is in the AS mode, performing voice activation detection on the collected audio signal, and performing ambient sound cancellation processing in an activation state; and updating the parameter information of the set filter in an inactivation state;
Step III: performing voice enhancement processing on the ambient-sound-cancellation-processed audio signal (including: time-frequency conversion, noise signal estimation, voice spectrum estimation, joint enhancement, and frequency-time conversion);
Step IV: recognizing the voice-enhancement-processed voice instruction, and executing the instruction, for example, "Open a map APP".
Embodiment Ⅴ
The embodiment of the present application provides an electronic device, which is applicable to the foregoing method embodiments. The electronic device may be an earphone device. As shown in Fig. 11, the electronic device 1400 includes: an air conduction audio collecting device 1401, a body conduction audio collecting device 1402, an audio signal playing device 1403, a processor 1404, and a memory 1405; wherein,
the air conduction audio collecting device 1401, configured to collect a first audio signal conducted via air;
the body conduction audio collecting device 1402, configured to collect a second audio signal conducted via the body tissue;
the audio signal playing device 1403, configured to play an audio signal;
the memory 1405, configured to store machine readable instructions that, when executed by the processor 1404, cause the processor 1404 to perform the methods described above.
Fig. 12 schematically illustrates a block diagram of a computing system that may be used to implement an electronic device of the embodiment of the present disclosure. As shown in Fig. 12, the computing system 1500 includes a processor 1510, a computer readable storage medium 1520, an output interface 1530, and an input interface 1540. The computing system 1500 may perform the methods described above with reference to Fig. 5, Fig. 6, Fig. 7a, Fig. 7c, Fig. 7e, Fig. 7f, Fig. 8b, Fig. 9b, and Fig. 10a to implement voice enhancement processing on the signal collected by the air conduction audio collecting device and the signal collected by the body conduction audio collecting device to obtain audio signals with better effect for voice transmission or voice recognition.
Specifically, the processor 1510 may include, for example, a general-purpose microprocessor, an instruction set processor, a related chipset, and/or a special-purpose microprocessor (e.g., an application specific integrated circuit (ASIC)), and the like. The processor 1510 may also include onboard memory for caching purposes. The processor 1510 may be a single processing unit or multiple processing units for performing different actions of the method flow described with reference to Fig. 5, Fig. 6, Fig. 7a, Fig. 7c, Fig. 7e, Fig. 7f, Fig. 8b, Fig. 9b, and Fig. 10a.
The computer readable storage medium 1520, for example, may be any medium that may contain, store, communicate, propagate or transport the instructions. For example, the readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. Specific examples of the readable storage medium include: a magnetic storage device such as a magnetic tape or a hard disk (HDD); an optical storage device such as a compact disk (CD-ROM); a memory such as a random access memory (RAM) or a flash memory; and/or a wired/wireless communication link.
The computer readable storage medium 1520 may include a computer program 1521 that may include code/computer executable instructions that, when executed by the processor 1510, cause the processor 1510 to perform, for example, the method flow described above with reference to Fig. 5, Fig. 6, Fig. 7a, Fig. 7c, Fig. 7e, Fig. 7f, Fig. 8b, Fig. 9b, and Fig. 10a, or any variations thereof. The computer program 1521 may be configured with, for example, computer program codes including computer program modules. For example, in an example embodiment, the codes in the computer program 1521 may include one or more program modules, for example, module 1521A, module 1521B, and so on. It should be noted that the division manner and the number of modules are not fixed, and those skilled in the art may use suitable program modules or program module combinations according to actual situations. When these program module combinations are executed by the processor 1510, they cause the processor 1510 to implement the method flow described above with reference to Fig. 5, Fig. 6, Fig. 7a, Fig. 7c, Fig. 7e, Fig. 7f, Fig. 8b, Fig. 9b, and Fig. 10a, and any variations thereof.
According to an embodiment of the present disclosure, the processor 1510 may use the output interface 1530 and the input interface 1540 to perform the method flow described above with reference to Fig. 5, Fig. 6, Fig. 7a, Fig. 7c, Fig. 7e, Fig. 7f, Fig. 8b, Fig. 9b, and Fig. 10a, and any variation thereof.
The embodiments of the present application provide an electronic device. The electronic device acquires a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device, and is capable of performing voice enhancement processing on the first audio signal and the second audio signal, based on a signal correlation between the first audio signal and the second audio signal, to obtain the voice-enhancement-processed audio signal. That is, voice enhancement processing is performed on the audio signal collected by the air conduction audio collecting device and the audio signal collected by the body conduction audio collecting device, based on the correlation between the two signals, to obtain voice signals with better effect for performing voice transmission or voice recognition.
The embodiments of the present application provide another electronic device. The electronic device acquires a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device, then performs ambient sound cancellation processing on the second audio signal, and determines the audio signal to be output based on the first audio signal and the ambient-sound-cancellation-processed second audio signal. That is, ambient sound cancellation processing is first performed on the audio signal collected by the body conduction audio collecting device to obtain a voice signal that does not contain ambient sound, and a signal to be output is obtained based on the audio signal collected by the air conduction audio collecting device and the ambient-sound-cancellation-processed audio signal collected by the body conduction audio collecting device, thereby obtaining better audio signals for performing voice transmission or voice recognition.
Embodiment Ⅵ
The embodiment of the present application provides an apparatus for audio processing, as shown in Fig. 13, wherein the apparatus 1600 for audio processing includes: a first acquiring module 1601 and a voice enhancement processing module 1602, wherein,
the first acquiring module 1601 is configured to acquire a first audio signal collected by the air conduction audio collecting device and a second audio signal collected by the body conduction audio collecting device.
the voice enhancement processing module 1602 is configured to perform voice enhancement processing on the first audio signal and the second audio signal acquired by the first acquiring module 1601 to obtain the voice-enhancement-processed audio signal to be output based on a signal correlation between the first audio signal and the second audio signal.
The embodiments of the present application provide an apparatus for audio processing. The apparatus acquires a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device, and is capable of performing voice enhancement processing on the first audio signal and the second audio signal, based on a signal correlation between the first audio signal and the second audio signal, to obtain the voice-enhancement-processed audio signal to be output. That is, voice enhancement processing is performed on the audio signal collected by the air conduction audio collecting device and the audio signal collected by the body conduction audio collecting device, based on the correlation between the two signals, to obtain voice signals with better effect for performing voice transmission or voice recognition.
The embodiments of the present application are applicable to the foregoing method embodiments, and are not described herein.
Embodiment Ⅶ
The embodiment of the present application provides another apparatus for audio processing. As shown in Fig. 14, the apparatus 1700 for audio processing includes: a second acquiring module 1701, an ambient sound cancellation processing module 1702, and a determining module 1703, wherein,
the second acquiring module 1701 is configured to acquire a first audio signal collected by the air conduction audio collecting device and a second audio signal collected by the body conduction audio collecting device.
The ambient sound cancellation processing module 1702 is configured to perform ambient sound cancellation processing on the second audio signal acquired by the second acquiring module 1701.
The determining module 1703 is configured to determine the audio signal to be output based on the first audio signal acquired by the second acquiring module 1701 and the second audio signal after the ambient sound cancellation processing module 1702 performs the ambient sound cancellation processing.
The embodiment of the present application provides an apparatus for audio processing. The apparatus acquires a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device, then performs ambient sound cancellation processing on the second audio signal, and determines the audio signal to be output based on the first audio signal and the ambient-sound-cancellation-processed second audio signal. That is, ambient sound cancellation processing is first performed on the audio signal collected by the body conduction audio collecting device to obtain a voice signal that does not contain ambient sound, and a signal to be output is obtained based on the audio signal collected by the air conduction audio collecting device and the ambient-sound-cancellation-processed audio signal collected by the body conduction audio collecting device, to obtain audio signals with better effect for performing voice transmission or voice recognition.
The embodiments of the present application are applicable to the foregoing method embodiments, and details will not be described herein again.
It should be understood by those skilled in the art that the present invention involves apparatuses for performing one or more of the operations described in the present invention. Those apparatuses may be specially designed and manufactured as intended, or may include well-known apparatuses in a general-purpose computer. Those apparatuses have computer programs stored therein, which are selectively activated or reconstructed. Such computer programs may be stored in device (such as computer) readable media or in any type of media suitable for storing electronic instructions and respectively coupled to a bus; the computer readable media include but are not limited to any type of disk (including floppy disks, hard disks, optical disks, CD-ROMs and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memories, magnetic cards or optical cards. That is, readable media include any media that store or transmit information in a device (for example, computer) readable form.
It may be understood by those skilled in the art that computer program instructions may be used to realize each block in the structure diagrams and/or block diagrams and/or flowcharts, as well as combinations of blocks therein. These computer program instructions may be provided to a general-purpose computer, a special-purpose computer, or another processor of a programmable data processing means, so that the solutions designated in a block or blocks of the structure diagrams and/or block diagrams and/or flow diagrams are executed by the computer or other processor of the programmable data processing means.
It may be understood by those skilled in the art that the operations, methods, steps in the flows, measures and solutions discussed in the present invention may be alternated, changed, combined or deleted. Further, other steps, measures and solutions in the flows discussed in the present invention may also be alternated, changed, rearranged, decomposed, combined or deleted. Further, steps, measures and solutions in the prior art that correspond to the operations, methods and flows disclosed in the present invention may also be alternated, changed, rearranged, decomposed, combined or deleted.
The foregoing descriptions are merely preferred embodiments of the present invention. It should be noted that, for a person of ordinary skill in the art, various modifications and embellishments can be made without departing from the principle of the present invention. Such modifications and embellishments shall be regarded as falling into the protection scope of the present invention.

Claims (15)

  1. A method for audio processing, comprising:
    acquiring a first audio signal using an air conduction audio collecting device and a second audio signal using a body conduction audio collecting device; and
    performing voice enhancement processing on at least one of the first audio signal and the second audio signal to obtain a voice-enhancement-processed audio signal to be output, based on a signal correlation between the first audio signal and the second audio signal.
  2. The method according to claim 1, wherein, performing voice enhancement processing, comprises:
    performing noise estimation on the first audio signal and the second audio signal, respectively;
    performing voice spectrum estimation on the first audio signal and the second audio signal respectively according to noise estimation results corresponding to the first audio signal and the second audio signal; and
    performing voice enhancement processing on the first audio signal and the second audio signal according to voice spectrum estimation results corresponding to the first audio signal and the second audio signal.
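A minimal single-channel Python sketch of the chain recited in claim 2 (noise estimation, then voice spectrum estimation, then enhancement) may help fix ideas; the recursive noise tracker, the Wiener-style gain, and every threshold below are illustrative assumptions rather than the application's method.

```python
import numpy as np

def enhance_frames(mag_frames, alpha=0.95):
    """mag_frames: array of shape (frames, bins) holding STFT magnitudes."""
    noise = mag_frames[0] ** 2                # initial noise power estimate
    enhanced = np.empty_like(mag_frames)
    for t, mag in enumerate(mag_frames):
        power = mag ** 2
        # Noise estimation: update only in bins that look noise-like
        # (power close to the running noise estimate).
        noise = np.where(power < 2.0 * noise,
                         alpha * noise + (1.0 - alpha) * power,
                         noise)
        # Voice spectrum estimation and enhancement via a Wiener gain.
        snr = np.maximum(power / (noise + 1e-12) - 1.0, 0.0)
        enhanced[t] = (snr / (snr + 1.0)) * mag
    return enhanced
```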
  3. The method according to claim 2, wherein, performing noise estimation on the first audio signal, comprises:
    determining a voice presence prior probability corresponding to the first audio signal; and
    performing noise estimation on the first audio signal based on the voice presence prior probability.
  4. The method according to claim 3, wherein, determining a voice presence prior probability corresponding to the first audio signal, comprises:
    determining a signal outer inner ratio between the first audio signal and the second audio signal; and
    determining the voice presence prior probability corresponding to the first audio signal, based on the signal outer inner ratio.
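The application does not fix how the signal outer inner ratio (the ratio between the air-conducted, outer signal and the body-conducted, inner signal) maps to a prior probability, so the following Python sketch assumes a smooth logistic mapping purely for illustration: when the outer energy strongly dominates the inner energy, the sound is probably ambient rather than the wearer's voice, and the prior drops.

```python
import numpy as np

def voice_presence_prior(outer_power, inner_power):
    """Map the outer/inner power ratio (in dB) to a prior in (0, 1)."""
    ratio_db = 10.0 * np.log10((outer_power + 1e-12) / (inner_power + 1e-12))
    # Hypothetical logistic squashing: the slope and 6 dB offset are assumptions.
    return float(1.0 / (1.0 + np.exp(0.5 * (ratio_db - 6.0))))
```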
  5. The method according to claim 3 or 4, wherein, performing noise estimation on the first audio signal, comprises:
    determining a corresponding voice presence posterior probability based on the voice presence prior probability; and
    performing noise estimation on the first audio signal based on the voice presence posterior probability.
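For the posterior in claim 5, one standard Bayesian construction from the speech enhancement literature (offered as a plausible reading, not as the application's own formula) computes, for frequency bin $k$, the voice presence posterior from the prior $q_k$, the a priori SNR $\xi_k$ and the a posteriori SNR $\gamma_k$:

$$p(H_1 \mid Y_k) = \left[\, 1 + \frac{1 - q_k}{q_k}\,(1 + \xi_k)\, e^{-v_k} \right]^{-1}, \qquad v_k = \frac{\gamma_k\, \xi_k}{1 + \xi_k}.$$

The noise estimate is then updated mainly in bins where this posterior is low, which matches the claimed use of the posterior for noise estimation.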
  6. The method according to any one of claims 2 to 5, wherein, performing voice enhancement processing on the first audio signal and the second audio signal, comprises:
    performing voice enhancement processing on the first audio signal and the second audio signal, according to the noise estimation results corresponding to the first audio signal and the second audio signal, and the voice spectrum estimation results corresponding to the first audio signal and the second audio signal.
  7. The method according to claim 6, wherein, performing voice enhancement processing on the first audio signal and the second audio signal, further comprises:
    performing joint voice spectrum estimation on the first audio signal and the second audio signal, according to the noise estimation results corresponding to the first audio signal and the second audio signal, and the voice spectrum estimation results corresponding to the first audio signal and the second audio signal; and
    obtaining the voice-enhancement-processed audio signal to be output according to the obtained joint voice spectrum estimation result.
  8. The method according to claim 7, wherein, performing joint voice spectrum estimation on the first audio signal and the second audio signal, comprises:
    determining a mean value of a third Gaussian distribution model, according to a first Gaussian distribution model whose mean value is the voice spectrum estimation result of the first audio signal and variance is the noise estimation result of the first audio signal, and a second Gaussian distribution model whose mean value is the voice spectrum estimation result of the second audio signal and variance is the noise estimation result of the second audio signal; and
    determining joint voice spectrum estimation results for joint voice spectrum estimation on the first audio signal and the second audio signal according to the mean value of the third Gaussian distribution model.
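Claim 8 admits a closed form under the stated models. If the first Gaussian is $\mathcal{N}(\mu_1, \sigma_1^2)$ (mean: the voice spectrum estimate of the first signal; variance: its noise estimate) and the second is $\mathcal{N}(\mu_2, \sigma_2^2)$, the product of the two densities is proportional to a third Gaussian whose parameters are

$$\mu_3 = \frac{\sigma_2^2\,\mu_1 + \sigma_1^2\,\mu_2}{\sigma_1^2 + \sigma_2^2}, \qquad \frac{1}{\sigma_3^2} = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}.$$

This precision-weighted mean trusts whichever channel currently has the smaller noise estimate. The claim only requires that the third model's mean be determined from the first two, so this fusion rule is one natural instance rather than the only possible one.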
  9. The method according to any one of claims 1 to 8, wherein, before performing voice enhancement processing on the first audio signal and the second audio signal, the method further comprises:
    performing ambient sound cancellation processing on the second audio signal to obtain an ambient-sound-cancellation-processed second audio signal;
    and wherein, performing voice enhancement processing on the first audio signal and the second audio signal, comprises:
    performing voice enhancement processing on the first audio signal and the ambient-sound-cancellation-processed second audio signal.
  10. The method according to claim 9, wherein, performing ambient sound cancellation processing on the second audio signal, comprises:
    acquiring a third audio signal to be played by an audio signal playing device; and
    performing ambient sound cancellation processing on the second audio signal through the third audio signal, and obtaining an ambient-sound-cancellation-processed second audio signal.
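One conventional realization of cancelling the played-back (third) signal from the body-conducted (second) signal is an adaptive echo canceller; the NLMS sketch below is such a realization under assumed tap count and step size, not necessarily the filter the application uses.

```python
import numpy as np

def nlms_cancel(second, third, taps=64, mu=0.1):
    """Subtract an adaptively estimated copy of the playback signal
    (third) from the body-conducted signal (second)."""
    w = np.zeros(taps)                        # adaptive filter weights
    out = np.zeros_like(second)
    for n in range(taps, len(second)):
        x = third[n - taps:n][::-1]           # most recent reference samples
        e = second[n] - w @ x                 # cleaned output sample
        w += mu * e * x / (x @ x + 1e-8)      # NLMS weight update
        out[n] = e
    return out
```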
  11. The method according to claim 10, wherein, performing ambient sound cancellation processing on the second audio signal through the third audio signal, comprises:
    detecting whether it is currently in a voice activation state, wherein the voice activation state indicates that the user is speaking; and
    if it is detected that it is currently in the voice activation state, performing the ambient sound cancellation processing on the second audio signal through the third audio signal.
  12. The method according to claim 11, wherein, detecting whether it is currently in the voice activation state, comprises:
    determining whether an audio signal playing device channel and/or a body conduction audio collecting device channel is in the voice activation state, according to the second audio signal and/or the third audio signal; and
    if at least one channel is in the voice activation state, determining whether it is currently in the voice activation state, according to a signal correlation between the second audio signal and the third audio signal.
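A compact Python sketch of the two-stage check in claim 12: a per-channel energy test first, then a correlation test between the second and third signals. If the body-conducted signal merely mirrors the playback signal, the activity is attributed to playback leakage rather than the user's voice. The thresholds are illustrative assumptions.

```python
import numpy as np

def is_voice_active(second, third, energy_thresh=1e-4, corr_thresh=0.5):
    # Stage 1: is either channel energetic enough to be active at all?
    if max(np.mean(second ** 2), np.mean(third ** 2)) <= energy_thresh:
        return False
    # Stage 2: high correlation with the playback signal suggests the
    # energy comes from the loudspeaker, not from the user speaking.
    c = np.corrcoef(second, third)[0, 1]
    return bool(abs(c) < corr_thresh)
```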
  13. An apparatus for audio processing, comprising:
    a first acquiring module, configured to acquire a first audio signal collected by an air conduction audio collecting device and a second audio signal collected by a body conduction audio collecting device; and
    a voice enhancement processing module, configured to perform voice enhancement processing on at least one of the first audio signal and the second audio signal acquired by the first acquiring module, to obtain a voice-enhancement-processed audio signal to be output, based on a signal correlation between the first audio signal and the second audio signal.
  14. An electronic device, comprising: an air conduction audio collecting device, a body conduction audio collecting device, an audio signal playing device, a processor, and a memory; wherein,
    the air conduction audio collecting device, configured to collect a first audio signal conducted via air;
    the body conduction audio collecting device, configured to collect a second audio signal conducted via body tissues;
    the audio signal playing device, configured to play an audio signal; and
    the memory, configured to store machine readable instructions that, when executed by the processor, cause the processor to perform the method of any one of claims 1 to 12.
  15. A computer readable storage medium, wherein the computer readable storage medium stores a computer program that, when executed by a processor, implements the method of any one of claims 1 to 12.
PCT/KR2019/012099 2018-09-18 2019-09-18 Methods for audio processing, apparatus, electronic device and computer readable storage medium WO2020060206A1 (en)

Applications Claiming Priority (2)

CN201811090353.X, priority date 2018-09-18
CN201811090353.XA, publication CN110931027A (en): Audio processing method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2020060206A1 (en) 2020-03-26

Family

ID=69855801

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2019/012099 WO2020060206A1 (en) 2018-09-18 2019-09-18 Methods for audio processing, apparatus, electronic device and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN110931027A (en)
WO (1) WO2020060206A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022193327A1 (en) * 2021-03-19 2022-09-22 深圳市韶音科技有限公司 Signal processing system, method and apparatus, and storage medium

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113038318B (en) * 2019-12-25 2022-06-07 荣耀终端有限公司 Voice signal processing method and device
CN111883117B (en) * 2020-07-03 2024-04-16 北京声智科技有限公司 Voice wake-up method and device
CN111935573B (en) * 2020-08-11 2022-06-14 Oppo广东移动通信有限公司 Audio enhancement method and device, storage medium and wearable device
CN111988702B (en) * 2020-08-25 2022-02-25 歌尔科技有限公司 Audio signal processing method, electronic device and storage medium
CN113223561B (en) * 2021-05-08 2023-03-24 紫光展锐(重庆)科技有限公司 Voice activity detection method, electronic equipment and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7447630B2 (en) * 2003-11-26 2008-11-04 Microsoft Corporation Method and apparatus for multi-sensory speech enhancement
US20120278070A1 (en) * 2011-04-26 2012-11-01 Parrot Combined microphone and earphone audio headset having means for denoising a near speech signal, in particular for a " hands-free" telephony system
CN105533986A (en) * 2016-01-26 2016-05-04 王泽玲 Bone conduction hair clasp
US20160217781A1 (en) * 2013-10-23 2016-07-28 Google Inc. Methods And Systems For Implementing Bone Conduction-Based Noise Cancellation For Air-Conducted Sound
US20160379661A1 (en) * 2015-06-26 2016-12-29 Intel IP Corporation Noise reduction for electronic devices

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2448669A1 (en) * 2001-05-30 2002-12-05 Aliphcom Detecting voiced and unvoiced speech using both acoustic and nonacoustic sensors
CA2354755A1 (en) * 2001-08-07 2003-02-07 Dspfactory Ltd. Sound intelligibilty enhancement using a psychoacoustic model and an oversampled filterbank
US7406303B2 (en) * 2005-07-05 2008-07-29 Microsoft Corporation Multi-sensory speech enhancement using synthesized sensor signal
KR20180019752A (en) * 2008-11-10 2018-02-26 구글 엘엘씨 Multisensory speech detection
CN101853667B (en) * 2010-05-25 2012-08-29 无锡中星微电子有限公司 Voice noise reduction device
CN104616662A (en) * 2015-01-27 2015-05-13 中国科学院理化技术研究所 Active noise reduction method and device
US10204637B2 (en) * 2016-05-21 2019-02-12 Stephen P Forte Noise reduction methodology for wearable devices employing multitude of sensors
CN106251878A (en) * 2016-08-26 2016-12-21 彭胜 Meeting affairs voice recording device



Also Published As

Publication number Publication date
CN110931027A (en) 2020-03-27


Legal Events

121 (EP): The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 19862521; Country of ref document: EP; Kind code of ref document: A1.

NENP: Non-entry into the national phase. Ref country code: DE.

122 (EP): PCT application non-entry in European phase. Ref document number: 19862521; Country of ref document: EP; Kind code of ref document: A1.