WO2017075979A1 - Method and apparatus for processing a voice signal - Google Patents

Method and apparatus for processing a voice signal

Info

Publication number
WO2017075979A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
power spectrum
power
calculating
echo
Prior art date
Application number
PCT/CN2016/083622
Other languages
English (en)
French (fr)
Inventor
袁豪磊
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to JP2017553962A (patent JP6505252B2)
Priority to EP16861250.5A (patent EP3373300B1)
Priority to KR1020177029724A (patent KR101981879B1)
Publication of WO2017075979A1
Priority to US15/691,300 (patent US10586551B2)
Priority to US16/774,854 (patent US10924614B2)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0364 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M9/00 Arrangements for interconnection not involving centralised switching
    • H04M9/08 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/06 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being correlation coefficients
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M9/00 Arrangements for interconnection not involving centralised switching
    • H04M9/08 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
    • H04M9/082 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic using echo cancellers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers, loudspeakers or microphones
    • H04R3/02 Circuits for transducers, loudspeakers or microphones for preventing acoustic reaction, i.e. acoustic oscillatory feedback

Definitions

  • The present invention relates to the field of terminal technologies, and in particular, to a method and an apparatus for processing a voice signal.
  • Voice intelligibility is the percentage of the voice signal transmitted by a sound system that the user understands. For example, if the sound system transmits 100 words and the user understands only 50 of them, the system's voice intelligibility is 50%. As portable mobile terminals become smaller, the maximum sound power that a mobile terminal can output decreases, which in turn reduces voice intelligibility when the user communicates through the terminal. Because voice intelligibility is an important indicator of mobile terminal performance, how a mobile terminal processes voice signals to improve intelligibility is key to its development.
  • In one existing approach, an automatic gain control algorithm detects the broadcast signal to be played and amplifies its small-amplitude portions; the amplified broadcast signal is then converted into an electrical signal, which is transmitted to the speaker.
  • Because the average fluctuation amplitude of the broadcast signal is much smaller than its peak fluctuation amplitude, a speaker with a maximum rated output power of 1 watt that is driven by a normal speech signal generally reaches an average output power of only about 10% of the maximum rated output power (that is, 0.1 W).
  • If the amplitude of the electrical signal input to the speaker is increased continuously, the larger-amplitude portions of the broadcast signal overload the speaker and cause saturation distortion, which reduces the intelligibility and clarity of the speech. If only the small-amplitude portions of the broadcast signal are amplified, the effective dynamic range of the broadcast signal is reduced, and speech intelligibility is still not significantly improved.
  • an embodiment of the present invention provides a method and an apparatus for processing a voice signal.
  • the technical solution is as follows:
  • a method of processing a voice signal comprising:
  • the adjusted speech signal is output.
  • a processing apparatus for a voice signal comprising:
  • at least one processor; and
  • a memory wherein the memory stores program instructions that, when executed by the processor, configure the apparatus to perform the operations of:
  • the adjusted speech signal is output.
  • the frequency amplitude of the broadcast signal is automatically adjusted according to the frequency distribution of the noise signal and the broadcast signal, thereby significantly improving the speech intelligibility.
  • FIG. 1 is a schematic diagram of an implementation environment involved in a method for processing a voice signal according to an embodiment of the present invention
  • FIG. 2 is a system architecture diagram of a method for processing a voice signal according to another embodiment of the present invention.
  • FIG. 3 is a flowchart of a method for processing a voice signal according to another embodiment of the present invention.
  • FIG. 4 is a flowchart of a method for processing a voice signal according to another embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a signal flow corresponding to a method for processing a voice signal according to another embodiment of the present invention.
  • FIG. 6 is a flowchart of a method for processing a voice signal according to another embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a device for processing a voice signal according to another embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a processing terminal of a voice signal according to another embodiment of the present invention.
  • The voice instant messaging application is an application that can place VoIP calls or hold network audio conferences, and it is widely installed on mobile terminals such as smartphones, tablets, notebook computers, and wearable electronic products. As the physical dimensions of these mobile terminals shrink, the maximum sound power that the micro-speakers in the mobile terminal device can output also hits a bottleneck.
  • the existing electroacoustic sound amplification technology mainly relies on three parts of a power amplifier, a speaker and a sound chamber to realize the generation of sound waves.
  • When the physical size of the speaker and the sound chamber is proportional to the wavelength of the sound wave, the speaker of the mobile terminal device can achieve electro-acoustic conversion with maximum efficiency.
  • As portable mobile devices have become smaller, the size of a mobile terminal tends to be smaller than the wavelength of sound waves.
  • Matching some of these wavelengths would require the mobile terminal to be at least 1 meter in size, so miniaturizing the speaker reduces the maximum sound power that the mobile terminal can output.
  • The moving-coil speakers in current use must reach a certain size and thickness so that the diaphragm has sufficient room to move.
  • As the external dimensions of the mobile terminal shrink and the terminal becomes thinner, its overall acoustic design is constrained by physical size, which limits the maximum sound power that the mobile terminal can output.
  • the voice instant messaging application installed in the mobile terminal generally runs on the operating system, and the volume control of the hardware can be implemented through an application program interface provided by the operating system.
  • The current mainstream implementation is that the voice instant messaging application declares an audio configuration mode to the operating system, and the operating system configures the relevant hardware. After the configuration is complete, the application only needs to periodically write the data corresponding to the broadcast signal into the playback API of the operating system and read the recorded data from the recording API of the operating system.
  • The audio configuration modes supported by the operating system are limited in number. They are implemented by the mobile terminal manufacturer in the hardware firmware, and this constrains the application's control over the hardware output volume.
  • Hardware vendors often optimize the underlying audio only for normal usage scenarios. For usage scenarios in extreme environments (such as loud ambient noise), mobile terminal manufacturers generally provide no optimization (for example, they generally do not provide a dedicated software interface that can increase the hardware output volume).
  • Ordered by output volume from largest to smallest, the devices are: laptop, tablet, smartphone (hands-free mode), wearable device, and the like.
  • The environmental noise faced by these kinds of mobile terminals follows the opposite trend. Laptops are mostly used indoors, where the noise is mainly low-decibel indoor noise. Tablets and smartphones are used more often outdoors and in public places, where the noise is mainly high-decibel noise. Wearable devices stay on the human body for long periods, so they encounter the most numerous and most complex noise scenes. As mobile terminals become smaller, the environmental noise problem they face becomes more prominent and seriously affects the user's experience when communicating through a mobile terminal.
  • The embodiment of the present invention provides a method for improving the speech intelligibility of the mobile terminal by processing the voice signal, without changing the hardware of the mobile terminal. With the method provided by the embodiment of the present invention, the user of the mobile terminal can hear the voice content from the opposite end of the call even in a noisy scene.
  • FIG. 1 is a schematic diagram of the implementation environment involved in the method and apparatus for processing a voice signal according to an embodiment of the present invention.
  • the implementation environment includes three acoustic bodies of a mobile terminal P, a user U, and a noise source N, and further includes a sound output and input device speaker S and a microphone M.
  • the mobile terminal P can be a mobile phone, a tablet computer, a notebook computer, a wearable device, etc., in which one or more voice instant messaging applications (Apps) are installed, and based on these voice instant messaging applications, the user can communicate with other users anytime and anywhere.
  • the speaker S and the microphone M can be built in the mobile terminal, or can be connected to the mobile terminal in the form of an external device such as an external audio, an external speaker, a Bluetooth speaker, or a Bluetooth headset.
  • the microphone M can pick up the sound in the entire scene, including: the noise emitted by the noise source N, the voice emitted by the user U when speaking, and the sound broadcast by the speaker S.
  • The mobile terminal receives the voice signal to be played from the opposite end (hereinafter referred to as the broadcast signal, to distinguish it); after the broadcast signal is processed, the speaker converts it into a sound wave.
  • The sound wave emitted by the noise source N is also transmitted through the air to the user U and perceived by the user U. It may interfere with what the user U hears, reducing the voice intelligibility of the mobile terminal.
  • The present invention uses the psychoacoustic masking effect to solve the problem of the noise signal interfering with the broadcast signal.
  • Neither the broadcast signal nor the noise signal is a single-frequency signal; each occupies its own frequency bands, and its energy is not uniformly distributed across frequency points.
  • The frequency point at which the noise signal has the lowest energy can be found; it is denoted f_weak.
  • Without exceeding the output power of the speaker, the energy of the broadcast signal is concentrated near f_weak, and at the same time the energy of the broadcast signal at frequencies away from f_weak is attenuated to avoid speaker overload. In this way, at frequency points near f_weak, the noise signal is masked by the broadcast signal, and the user perceives the content of the broadcast signal.
  • At frequencies away from f_weak, the broadcast signal is still masked by the noise signal.
  • Overall, the enhanced broadcast signal masks the noise signal at some frequencies, so the noise no longer masks the broadcast signal as a whole, and the user can hear the content of the broadcast signal.
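  • The masking idea above can be illustrated with a minimal sketch. It is not the patented algorithm: the function name `emphasize_near_weakest`, the parameters `width` and `boost`, and the choice to model "not exceeding the output power of the speaker" as keeping the total broadcast power constant are all assumptions made for illustration.

```python
def emphasize_near_weakest(noise_power, broadcast_power, width=1, boost=1.5):
    """Return (index of f_weak, per-bin power gains).

    Hypothetical illustration: bins within `width` of the quietest noise bin
    are boosted, and the remaining bins are attenuated so that the total
    broadcast power stays unchanged (an assumed stand-in for the speaker limit).
    """
    # f_weak: the bin where the noise power spectrum is smallest.
    f_weak = min(range(len(noise_power)), key=noise_power.__getitem__)
    near = [abs(k - f_weak) <= width for k in range(len(noise_power))]

    p_near = sum(p for p, m in zip(broadcast_power, near) if m)
    p_far = sum(p for p, m in zip(broadcast_power, near) if not m)
    total = p_near + p_far

    # Never boost the near band beyond the total available power.
    if p_near > 0:
        boost = min(boost, total / p_near)
    # Choose the far-band gain so that boost*p_near + far_gain*p_far == total.
    far_gain = (total - boost * p_near) / p_far if p_far > 0 else 1.0

    gains = [boost if m else far_gain for m in near]
    return f_weak, gains
```

  • With a noise spectrum whose minimum lies in the middle bin, the sketch boosts the three bins around it and attenuates the outer bins, so the broadcast energy moves toward the region where the noise is weakest.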
  • the system architecture includes a user U, a speaker S, a microphone M, and various functional modules.
  • The functional modules include the signal detection and classification module, the spectrum estimation module, the loop transfer function calculation module, the speech intelligibility estimation module, and the like.
  • the spectrum estimation module may specifically include a voice activation detection module, a noise power spectrum module, and an echo power spectrum module.
  • the microphone M is used to pick up the ambient sound.
  • the ambient sound is referred to as a recording signal (denoted as x), and the recording signal x is sent to the signal detection and classification module.
  • The signal detection and classification module detects and classifies the recording signal and outputs three types of signals: the voice signal produced when the user U speaks (denoted as the near-end signal v), the noise signal generated by the noise source N (denoted as the noise signal n), and the signal produced when the sound played by the speaker S is picked up again by the microphone M (denoted as the echo signal e).
  • The spectrum estimation module is configured to calculate the power spectrum of the noise signal, the power spectrum of the echo signal, and the power characteristic value of the near-end signal. The power spectrum of the noise signal can be represented by P_n, the power spectrum of the echo signal by P_e, and the power characteristic value of the near-end signal by VAD_v.
  • The loop transfer function calculation module is configured to calculate, from the broadcast signal y and the recording signal x picked up by the microphone, the transfer function of the path "emphasis filter - speaker - sound field - microphone", denoted H_loop.
  • The speech intelligibility estimation module is configured to determine the speech intelligibility (denoted as SII) from H_loop, VAD_v, P_n, and P_e; the speech intelligibility is also used to calculate the frequency emphasis coefficient of the emphasis filter W.
  • The purpose of processing the broadcast signal and the recording signal is to maximize the SII at the position of the ear of the user U, not at the position of the microphone M.
  • the method provided by this embodiment employs an approximation process.
  • the length of the propagation path of the sound between the speaker S and the ear of the user U is represented by h1
  • the length of the propagation path of the sound between the noise source N and the user's ear is h2.
  • the length of the propagation path of the sound between the noise source N and the microphone M is represented by h3
  • the length of the propagation path of the sound between the mouth of the user U and the microphone M is represented by h4
  • the length of the propagation path of the sound between the microphone M and the speaker S is denoted by h5.
  • the problem of calculating the maximum speech intelligibility of the location of the user U can be converted into the maximum speech intelligibility problem of calculating the position of the microphone M.
  • FIG. 3 is a flow chart showing a method of processing a voice signal according to an embodiment of the present invention. Referring to FIG. 3, the method provided in this embodiment includes:
  • Acquire a recording signal and a voice signal, for example by collecting a recording signal from the near end and receiving a voice signal (that is, a broadcast signal) sent by the opposite end.
  • the recording signal includes at least a noise signal and an echo signal.
  • Without destroying the dynamic amplitude of the original broadcast signal, and while ensuring that the speaker is not overloaded, the method provided by the embodiment of the invention automatically adjusts the frequency amplitude of the broadcast signal according to the frequency distribution of the noise signal and the broadcast signal, thereby significantly improving voice intelligibility.
  • the loop transfer function is calculated based on the recorded signal and the broadcast signal, including:
  • the loop transfer function is calculated based on the frequency domain cross-correlation function between the recording signal and the broadcast signal and the frequency domain autocorrelation function of the broadcast signal.
  • the power spectrum of the recorded signal is calculated using the following formula:
  • X(n) is the vector obtained by Fourier transforming the recorded signal acquired at the nth time, and .^2 squares each vector element in X(n).
  • the power spectrum of the noise signal is obtained by subtracting the power spectrum of the echo signal from the power spectrum of the recorded signal.
  • Before calculating the square of the spectrum estimation value of the echo signal to obtain the power spectrum of the echo signal, the method further includes determining whether:
  • the power characteristic value of the recording signal is greater than the first threshold
  • the power value of the broadcast signal is greater than the second threshold
  • the power characteristic value of the echo signal is greater than the third threshold
  • Before subtracting the power spectrum of the echo signal from the power spectrum of the recorded signal to obtain the power spectrum of the noise signal, the method further includes:
  • the power spectrum of the echo signal is subtracted from the power spectrum of the recorded signal to obtain a power spectrum of the noise signal.
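  • The subtraction described above can be sketched as follows. The function name `noise_power_spectrum` is hypothetical, and flooring the result at zero is an assumption of this sketch (a power value cannot be negative, but the source does not say how negative differences are handled).

```python
def noise_power_spectrum(p_x, p_e):
    """Subtract the echo power spectrum p_e from the recording power spectrum p_x, bin by bin."""
    if len(p_x) != len(p_e):
        raise ValueError("power spectra must have the same length")
    # Assumption: clamp negative differences to zero, since a power spectrum
    # cannot be negative; the source text does not specify this step.
    return [max(px - pe, 0.0) for px, pe in zip(p_x, p_e)]
```

  • For example, a recording power spectrum of [4.0, 1.0, 2.5] and an echo power spectrum of [1.0, 2.0, 0.5] give a noise power spectrum of [3.0, 0.0, 2.0] under this clamping assumption.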
  • the frequency emphasis coefficient is calculated based on the power spectrum of the echo signal and the power spectrum of the noise signal, including:
  • the frequency emphasis coefficient is obtained according to the maximum value of the speech intelligibility function.
  • FIG. 4 is a flow chart showing a method of processing a voice signal according to another embodiment of the present invention. Referring to FIG. 4, the method provided in this embodiment includes:
  • the mobile terminal collects the recording signal from the near end and receives the broadcast signal sent by the opposite end.
  • The near end is the current environment of the mobile terminal. The ways in which the mobile terminal collects the recording signal from the near end include but are not limited to: turning on the microphone, collecting the sound signal in the current environment through the microphone, and using the collected sound signal as the recording signal. The recording signal includes a noise signal, an echo signal, and a near-end signal.
  • the recording signal can be represented by x
  • the noise signal can be represented by n
  • the echo signal can be represented by e
  • the near-end signal can be represented by v.
  • the peer end collects the voice signal of the peer user through the microphone, processes the collected voice signal, and sends it to the mobile terminal through the network.
  • The instant messaging application on the mobile terminal receives the voice signal sent by the opposite end and uses it as the broadcast signal.
  • the peer end may be other mobile terminals that communicate with the mobile terminal through a voice instant messaging application.
  • the broadcast signal can be represented by y.
  • The microphone on the mobile terminal side collects the recording signal once every preset duration; the microphone on the opposite side likewise collects the broadcast signal once every preset duration, and the opposite end sends the collected broadcast signal to the mobile terminal.
  • the preset duration may be 10 ms (milliseconds), 20 ms, 50 ms, and the like.
  • The recording signal collected by the mobile terminal from the near end and the broadcast signal sent by the opposite end are both essentially time-domain signals. To facilitate subsequent calculation, the method provided in this embodiment applies a Fourier transform or a similar method to the collected recording signal and the received broadcast signal separately, converting each from time-domain form into frequency-domain form.
  • The recording signal in frequency-domain form is a column vector whose length equals the number of points of the Fourier transform used; it can be represented by X.
  • The broadcast signal in frequency-domain form is also a column vector whose length equals the number of points of the Fourier transform used; it can be represented by Y.
  • the obtained recording signal in the frequency domain form and the broadcast signal in the frequency domain form have the same dimension.
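  • As a minimal sketch of this conversion, the following uses a plain discrete Fourier transform from the Python standard library. The function name `to_frequency_domain` is hypothetical, and a real implementation would more likely use an FFT with windowing; the sketch only shows that a frame of N time-domain samples becomes a vector of N frequency-domain values, so X and Y have the same dimension when the same transform length is used.

```python
import cmath

def to_frequency_domain(frame):
    """Naive N-point DFT of one time-domain frame (illustrative, O(N^2))."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# Hypothetical 4-sample frames of the recording signal x and the broadcast signal y.
X = to_frequency_domain([1.0, 0.0, 0.0, 0.0])
Y = to_frequency_domain([1.0, 1.0, 1.0, 1.0])
```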
  • the mobile terminal calculates a loop transfer function according to the recording signal and the broadcast signal.
  • the mobile terminal acquires a frequency domain cross-correlation function between the recording signal and the broadcast signal.
  • the cross-correlation function is used to indicate the degree of correlation between the two signals.
  • When the mobile terminal acquires the frequency domain cross-correlation function between the recording signal and the broadcast signal, the following formula <1> can be used:
  • r_xy is the cross-correlation function between the recording signal and the broadcast signal
  • E[.] is the expectation operator
  • the mobile terminal acquires a frequency domain autocorrelation function of the broadcast signal.
  • the autocorrelation function is used to indicate the degree of correlation between the signal and the delayed signal of the signal.
  • When the mobile terminal acquires the frequency domain autocorrelation function of the broadcast signal, the following formula <2> can be used:
  • R_yy is the frequency domain autocorrelation function of the broadcast signal
  • the symbol * indicates the matrix product operation
  • the symbol ' indicates the conjugate transpose operation
  • Y(n) is the vector obtained by Fourier transforming the broadcast signal acquired at the nth time.
  • Based on the frequency domain cross-correlation function between the recording signal and the broadcast signal acquired in step 4021 above and the frequency domain autocorrelation function of the broadcast signal acquired in step 4022, the mobile terminal may apply the following formula <3> to calculate the loop transfer function:
  • H_loop is a loop transfer function
  • the superscript ^(-1) represents a matrix inversion operation
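  • The formulas themselves did not survive in this text, so the following is only a simplified sketch of the idea of estimating a loop transfer function from a cross-correlation and an autocorrelation. It is not formula <3>: it assumes the frequency bins are independent, so the matrix inversion reduces to a per-bin division, and it approximates the expectation E[.] by an average over frames. The function name `estimate_loop_transfer` and the regularization constant `eps` are hypothetical.

```python
def estimate_loop_transfer(x_frames, y_frames, eps=1e-12):
    """Per-bin estimate H[k] = E[X_k * conj(Y_k)] / E[|Y_k|^2].

    x_frames, y_frames: lists of frequency-domain frames (lists of complex
    numbers) of the recording signal and the broadcast signal.
    Assumption: bins are treated independently (diagonal simplification).
    """
    count = len(x_frames)
    bins = len(x_frames[0])
    h_loop = []
    for k in range(bins):
        # Cross-correlation between recording and broadcast in bin k.
        r_xy = sum(x[k] * y[k].conjugate() for x, y in zip(x_frames, y_frames)) / count
        # Autocorrelation (average power) of the broadcast signal in bin k.
        r_yy = sum(abs(y[k]) ** 2 for y in y_frames) / count
        h_loop.append(r_xy / (r_yy + eps))  # eps avoids division by zero
    return h_loop
```

  • If the recording frames are exactly a per-bin scaled copy of the broadcast frames, this estimate recovers the scaling factors, which is the behavior one would expect from a transfer function estimate on a noise-free loop.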
  • the mobile terminal acquires a power spectrum of the recorded signal and a power spectrum of the broadcast signal.
  • The mobile terminal can calculate the power spectrum of the recorded signal by applying the following formula <4>:
  • X(n) is the vector obtained by Fourier transforming the recorded signal acquired at the nth time, and .^2 squares each vector element in X(n).
  • $P_x = \{a_1^2, a_2^2, a_3^2, \ldots, a_n^2\}$.
  • The mobile terminal can calculate the power spectrum of the broadcast signal by applying the following formula <5>:
  • Y(n) is the vector obtained by Fourier transforming the broadcast signal collected at the nth time, and .^2 squares each vector element in Y(n).
  • $P_y = \{b_1^2, b_2^2, b_3^2, \ldots, b_n^2\}$.
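  • The element-wise squaring can be sketched as below. The function name `power_spectrum` is hypothetical, and taking the squared magnitude of each complex Fourier coefficient is an assumption of this sketch; the text only states that each vector element is squared and does not show whether an averaging step is applied as well.

```python
def power_spectrum(spectrum):
    """Square each element of a frequency-domain vector (squared magnitude for complex values)."""
    return [abs(value) ** 2 for value in spectrum]

# Hypothetical frequency-domain frame X(n) with elements a_1, a_2, a_3.
p_x = power_spectrum([3 + 4j, 1j, 2.0])
```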
  • the mobile terminal calculates an estimated value of the echo signal according to the loop transfer function and the broadcast signal.
  • the mobile terminal can calculate the estimated value of the echo signal by applying the following formula <6>:
  • E(n) is an estimated value of the echo signal.
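  • Formula <6> is not reproduced in this text. One plausible reading, shown here only as a sketch under that assumption, is that the echo estimate is the broadcast spectrum passed through the loop transfer function, applied bin by bin. The function name `estimate_echo` and the per-bin (diagonal) form of H_loop are assumptions.

```python
def estimate_echo(h_loop, y_spectrum):
    """Apply a per-bin loop transfer function to the broadcast spectrum Y(n).

    Assumption: H_loop acts independently on each frequency bin.
    """
    if len(h_loop) != len(y_spectrum):
        raise ValueError("H_loop and Y(n) must have the same length")
    return [h * y for h, y in zip(h_loop, y_spectrum)]

# Hypothetical two-bin example.
e_n = estimate_echo([0.5, 2j], [2 + 0j, 1 + 1j])
```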
  • the mobile terminal acquires a power feature value of the recorded signal, a power feature value of the broadcast signal, and a power feature value of the echo signal.
  • the power characteristic value of the recorded signal is a measure of the power spectrum of the recorded signal, and can be obtained by processing the power spectrum of the recorded signal.
  • the power characteristic value of the recorded signal can be represented by VAD_x.
  • the power characteristic value of the broadcast signal is a measure of the power spectrum of the broadcast signal, and can be obtained by processing the power spectrum of the broadcast signal.
  • the power feature value of the broadcast signal can be represented by VAD_y.
  • the power characteristic value of the echo signal is a measure of the power spectrum of the echo signal.
  • the power characteristic value of the echo signal can be represented by VAD_e.
  • The power spectrum of the echo signal can be calculated from the spectrum estimation value of the echo signal, and the power spectrum of the echo signal can then be processed to obtain the power characteristic value of the echo signal.
  • The power spectrum calculated here is only an estimate of the power spectrum of the echo signal. Whether this estimate is adopted as the power spectrum of the echo signal is determined in step 406 below.
  • the mobile terminal determines whether the power feature value of the recorded signal is greater than the first threshold, whether the power feature value of the broadcast signal is greater than the second threshold, and whether the power feature value of the echo signal is greater than a third threshold. If yes, step 407 is performed.
  • The present embodiment applies the signal detection and classification module and a voice activation detection mechanism: according to the power characteristic values of the recording signal, the echo signal, and the broadcast signal, it distinguishes in time between the near-end signal (with background noise superimposed) and the non-near-end signal, in order to obtain the power spectrum of the noise signal.
  • the mobile terminal needs to determine whether the power feature value of the recorded signal is greater than the first threshold, whether the power feature value of the broadcast signal is greater than the second threshold, and whether the power feature value of the echo signal is greater than the third threshold.
  • the first threshold, the second threshold, and the third threshold are preset thresholds.
  • the first threshold may be represented by Tx
  • the second threshold may be represented by Ty
  • the third threshold may be represented by Te.
  • the recording signal collected by the microphone may not contain a near-end signal.
  • whether it does can be determined with the following formula <8>:
  • VAD_v VAD_x ⁇ 8>
  • if the recording signal collected by the microphone is determined to be the near-end signal, the user is talking; otherwise, the user is not talking.
  • in the judging process, if it is determined that the power feature value of the recording signal is greater than the first threshold, the power feature value of the broadcast signal is greater than the second threshold, and the power feature value of the echo signal is greater than the third threshold, the following step 407 is performed;
  • if the power feature value of the recorded signal is greater than the first threshold and the power feature value of the broadcast signal is greater than the second threshold, but the power feature value of the echo signal is less than or equal to the third threshold, or if the power feature value of the recorded signal is greater than the first threshold but the power feature value of the broadcast signal is less than or equal to the second threshold, the acquired recording signal and broadcast signal are ignored.
  • the mobile terminal calculates a square of a spectrum estimation value of the echo signal as a power spectrum of the echo signal.
  • the mobile terminal takes the square of the spectrum estimation value of the echo signal as the power spectrum of the echo signal.
  • in the specific calculation, the following formula <9> can be applied:
  • P_e is the power spectrum of the echo signal.
  • the mobile terminal determines whether the power feature value of the recorded signal is less than the first threshold, and whether the power feature value of the echo signal is less than a third threshold. If yes, step 409 is performed.
  • the mobile terminal further determines whether the power feature value of the recorded signal is less than the first threshold, and whether the power feature value of the echo signal is less than a third threshold to obtain a power spectrum of the noise signal.
  • if both conditions hold, step 409 is performed; if the power feature value of the recorded signal is less than the first threshold but the power feature value of the echo signal is greater than or equal to the third threshold, the acquired recording signal and broadcast signal are ignored.
  • the mobile terminal subtracts the power spectrum of the echo signal from the power spectrum of the recorded signal as a power spectrum of the noise signal.
  • the power spectrum of the noise signal is obtained by subtracting the power spectrum of the echo signal from the power spectrum of the recorded signal. For the specific implementation, see the following formula <10>:
  • P_n is the power spectrum of the noise signal.
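The threshold tests of steps 406 and 408 can be sketched as a small per-frame classifier. This is a minimal sketch, assuming scalar power feature values and assuming the two tests act as mutually exclusive branches for a given frame; the function name `classify_frame` and the returned labels are hypothetical and do not come from the source.

```python
def classify_frame(vad_x, vad_y, vad_e, t_x, t_y, t_e):
    """Decide which spectrum may be updated for the current frame.

    vad_x, vad_y, vad_e: power feature values of the recorded, broadcast
    and echo signals; t_x, t_y, t_e: first, second and third thresholds.
    The string labels are illustrative only.
    """
    # Step 406: all three feature values exceed their thresholds, so the
    # echo power spectrum is taken from the squared echo estimate.
    if vad_x > t_x and vad_y > t_y and vad_e > t_e:
        return "update_echo"
    # Step 408: recorded and echo feature values are both below their
    # thresholds, so the noise power spectrum is computed.
    if vad_x < t_x and vad_e < t_e:
        return "update_noise"
    # Every other combination: the acquired frame is ignored.
    return "ignore"
```

Keeping the decision separate from the spectral arithmetic makes it straightforward to tune the three thresholds without touching the rest of the processing chain.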
  • the mobile terminal calculates a frequency emphasis coefficient according to a power spectrum of the echo signal and a power spectrum of the noise signal.
  • the mobile terminal constructs a speech intelligibility function according to a power spectrum of the echo signal and a power spectrum of the noise signal.
  • the speech intelligibility function (SII) has multiple sets of standards.
  • the ANSI S3.5 standard [4] is used for the calculation.
  • the speech intelligibility function can be expressed as a function whose independent variables are the power spectrum of the echo signal and the power spectrum of the noise signal. Therefore, after the mobile terminal calculates the power spectrum of the echo signal and the power spectrum of the noise signal, a speech intelligibility function can be constructed.
  • the constructed speech intelligibility function is given by the following formula <11>:
  • i_max is the total number of frequency bands, i is any band index up to i_max, SII is the speech intelligibility function, Pe_i is the power spectrum of the echo signal in the i-th band, Pn_i is the power spectrum of the noise signal in the i-th band, Pu_i is the power spectrum of the standard speech intensity in the i-th band, I_i is the sub-band weight, and Pd_i is an intermediate variable, which can be expressed by the following formula <12>:
  • f_k represents the k-th frequency point in the i-th frequency band
  • C_k is an intermediate variable, which can be expressed by the following formula <13>:
  • Pe_k is the power spectrum of the echo signal at the k-th frequency point
  • Pn_k is the power spectrum of the noise signal at the k-th frequency point
  • the mobile terminal calculates a maximum value of the speech intelligibility function, thereby obtaining a frequency emphasis coefficient.
  • the frequency emphasis coefficient is a coefficient of the emphasis filter in the mobile terminal, and is used to adjust the frequency point amplitude of the broadcast signal output by the mobile terminal. The frequency emphasis coefficients calculated by the mobile terminal differ from one time to the next.
  • the speech intelligibility function takes both the power spectrum of the echo signal and the power spectrum of the noise signal as independent variables.
  • the method provided in this embodiment makes an approximation: it sets the power spectrum of the noise signal at time n to be approximately equal to the power spectrum of the noise signal at time n-1, so that when calculating the frequency emphasis coefficient at time n, the mobile terminal can directly use the power spectrum of the noise signal calculated at time n-1.
  • the mobile terminal converts the speech intelligibility function into a function of the power spectrum of the echo signal as an independent variable.
  • before the broadcast signal is played through the speaker, the mobile terminal also processes it with the emphasis filter so as to increase its amplitude at the specified frequency points and thereby increase the energy of the broadcast signal.
  • however, the sound power that the speaker can play has an upper limit.
  • maximizing the speech intelligibility function under this limit is therefore a constrained extremum problem, which can be expressed by the following formula <14>:
  • Pe i is the power spectrum of the echo signal before the enhancement at the i-th frequency point
  • Pe' i is the power spectrum of the enhanced echo signal at the i-th frequency point
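To illustrate the kind of constrained extremum problem that formula 14 poses, the sketch below redistributes a fixed total power across bands. It does not use the ANSI S3.5 function of formula 11: as a stand-in it maximizes the assumed objective sum_i I_i * log(1 + Pe'_i / Pn_i) subject to sum_i Pe'_i = sum_i Pe_i and Pe'_i >= 0, whose solution has the classic water-filling form. The name `allocate_band_power`, the parameter `band_weights`, and the bisection settings are all hypothetical.

```python
import numpy as np

def allocate_band_power(pe, pn, band_weights, iters=100):
    """Water-filling allocation under a fixed total power budget.

    pe: echo power per band before emphasis (defines the budget).
    pn: noise power per band.
    band_weights: positive importance weight per band (assumed stand-in
    for the sub-band weights I_i).
    Returns the emphasized power per band.
    """
    pe = np.asarray(pe, dtype=float)
    pn = np.asarray(pn, dtype=float)
    w = np.asarray(band_weights, dtype=float)  # must all be > 0
    total = pe.sum()

    def allocated(mu):
        # Stationary point of the Lagrangian, clipped at zero power.
        return np.maximum(mu * w - pn, 0.0)

    # At mu = hi every band alone already receives at least `total`,
    # so the root of allocated(mu).sum() == total lies in [lo, hi].
    lo, hi = 0.0, (total + pn.max()) / w.min()
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if allocated(mid).sum() < total:
            lo = mid
        else:
            hi = mid
    return allocated(hi)
```

Under this stand-in objective, a band whose noise power exceeds the "water level" receives no power at all; the real intelligibility function may behave differently, so the result should be read only as an example of solving a power-constrained maximization.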
  • the signal processed by the emphasis filter is an electrical signal, which the speaker must convert into sound waves. Because the output frequency responses of the speakers of different types of mobile terminals differ, obtaining them would require measuring the speaker of each mobile terminal separately and applying correction compensation during operation, which leads to a hardware fragmentation problem. To avoid this problem, the method provided by this embodiment avoids direct measurement of the speaker's frequency response as follows.
  • the mobile terminal adjusts a frequency amplitude of the broadcast signal based on the frequency emphasis coefficient.
  • based on the determined frequency emphasis coefficient, the mobile terminal dynamically tracks and adjusts the speech intelligibility function, so that it adapts automatically to changes in the power spectrum of the noise signal P_n and the power spectrum of the echo signal P_e.
  • the mobile terminal outputs the adjusted broadcast signal.
  • in order to improve the accuracy of the broadcast signal output at the current time, the mobile terminal combines the broadcast signal output during the period before the current time with the corresponding frequency emphasis coefficient, and determines the broadcast signal to be output at the current time according to the following formula <17>:
  • z(n) is the output broadcast signal
  • w(k) is the time-domain value corresponding to the frequency emphasis coefficient calculated at time n
  • K_max is equal to the order of the emphasis filter W
  • y(n-k) is the value of the broadcast signal at time n-k.
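The definitions above describe a finite impulse response filter: each output sample is a weighted sum of the current and previous broadcast samples. The following is a minimal sketch, assuming formula 17 has the standard FIR form z(n) = sum over k = 0..K_max of w(k) * y(n-k) (the formula itself is not reproduced in this text) and assuming that samples before the start of the signal are zero. The name `apply_emphasis_filter` is hypothetical.

```python
import numpy as np

def apply_emphasis_filter(y, w):
    """Filter the broadcast signal y with time-domain coefficients w.

    y: broadcast samples y(0), y(1), ...
    w: coefficients w(0) .. w(K_max).
    Samples with a negative index are treated as zero (an assumption).
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    z = np.zeros_like(y)
    for n in range(len(y)):
        for k in range(len(w)):
            if n - k >= 0:
                z[n] += w[k] * y[n - k]
    return z
```

The explicit loops mirror the summation term by term; for real workloads the same result is obtained with `np.convolve(y, w)[:len(y)]`.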
  • because the adjusted broadcast signal output by the mobile terminal in this step can mask the noise signal, the user can make out the content of the broadcast signal when listening to the adjusted signal.
  • FIG. 5 is a diagram showing a signal flow corresponding to a method for processing a voice signal according to an embodiment of the present invention.
  • starting from the acquired recording signal X and broadcast signal Y, the mobile terminal uses the power feature value of the recorded signal, the power feature value of the broadcast signal, and the power feature value of the echo signal, together with the voice activity detection mechanism, to calculate the power spectrum of the echo signal and the power spectrum of the noise signal; it then obtains the frequency emphasis coefficient by calculating the maximum value of the speech intelligibility function; finally, based on the frequency emphasis coefficient, it uses the emphasis filter to adjust the frequency amplitude of the broadcast signal and outputs the adjusted broadcast signal.
  • FIG. 6 is a flowchart of a method for processing a voice signal according to another embodiment of the present invention.
  • This method can be implemented by software.
  • after the voice instant messaging application is started, the mobile terminal periodically acquires the recording signal x collected by the microphone from the near end and the broadcast signal y sent by the opposite end, calculates the power spectrum P_x of the recorded signal and the power spectrum P_y of the broadcast signal, and then calculates the loop transfer function H_loop based on formula <3>. After determining the loop transfer function, the mobile terminal can calculate the estimated value E(n) of the echo signal according to formula <6>. In addition, since the echo signal, the near-end speech signal, and the noise signal are picked up by the same microphone, they overlap in time.
  • the noise power spectrum P_n is calculated by formula <10>. Then, based on the power spectrum of the echo signal and the power spectrum of the noise signal, a speech intelligibility function SII is constructed, and the spectral emphasis coefficient W is obtained by calculating the maximum value of SII. Finally, the enhanced audio signal is computed according to formula <17> and sent to the speaker, which converts it into sound for playing.
  • the foregoing method may be implemented in a voice instant messaging application layer, or may be implemented at an operating system level, or may be implemented in firmware of a hardware chip.
  • in each of these cases the voice data processing method provided by the embodiment of the present invention applies; the only difference is the level of the mobile terminal system at which the same method is implemented.
  • the present invention has been described above by taking a mobile terminal as an example, and those skilled in the art can understand that the present invention can also be applied to other terminal devices, such as a desktop computer and the like.
  • the above broadcast signal may be received from the opposite end, for example a voice signal that the terminal device receives from another terminal device (i.e., the peer device) through a wired or wireless network; the above broadcast signal may also be a voice signal stored locally on the terminal device.
  • the voice instant messaging application above is only an example, and those skilled in the art can understand that it can be replaced with any other voice playing application.
  • the above method can be used not only to improve speech intelligibility, but also to improve audio signals of other content.
  • the tone of the ringtone and the alarm clock can be automatically enhanced according to different environmental noises, so that the enhanced prompt sound can be heard more clearly by the user, so as to overcome the interference of environmental noise.
  • in addition to noisy scenes, the above method can be used in scenes where the interference is not ordinary environmental noise.
  • for example, two people, A and B, make calls at the same time while standing close to each other: A is on a call with a, and B is on a call with b. Since A and B are very close, the voice of A interferes with B's listening, and the voice of B interferes with A's listening.
  • the method provided by the implementation of the present invention can also be applied to the voice competition scenario.
  • the mobile terminal on the A side will use the voice of B as a noise signal and the voice of a as a signal to be enhanced.
  • the mobile terminal on the B side will use the voice of A as the noise signal and the voice of b as the signal to be enhanced.
  • the method provided by the embodiment of the invention automatically adjusts the frequency amplitude of the broadcast signal according to the frequency distribution of the noise signal and the broadcast signal, while ensuring that the speaker is not overloaded and that the dynamic amplitude of the original broadcast signal is not destroyed, thereby significantly improving speech intelligibility.
  • an embodiment of the present invention provides a schematic structural diagram of a processing apparatus for a voice signal, where the apparatus includes:
  • the collecting module 701 is configured to collect a recording signal from the near end, where the recording signal includes at least a noise signal and an echo signal;
  • the receiving module 702 is configured to receive a broadcast signal sent by the opposite end;
  • the first calculating module 703 is configured to calculate a loop transfer function according to the recording signal and the broadcast signal;
  • a second calculation module 704 configured to calculate a power spectrum of the recorded signal
  • a third calculating module 705, configured to calculate a power spectrum of the echo signal and a power spectrum of the noise signal according to the power spectrum of the recorded signal, the broadcast signal, and the loop transfer function;
  • a fourth calculating module 706, configured to calculate a frequency emphasis coefficient according to a power spectrum of the echo signal and a power spectrum of the noise signal;
  • the adjusting module 707 is configured to adjust a frequency amplitude of the broadcast signal based on the frequency emphasis coefficient
  • the output module 708 is configured to output the adjusted broadcast signal.
  • the first calculating module 703 is configured to calculate a frequency domain cross-correlation function between the recording signal and the broadcast signal; calculate a frequency domain autocorrelation function of the broadcast signal; and calculate the loop transfer function according to the frequency domain cross-correlation function between the recording signal and the broadcast signal and the frequency domain autocorrelation function of the broadcast signal.
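One common way to turn these two correlation functions into a transfer function is a per-bin ratio of recursively averaged spectra. The sketch below assumes that form, H_loop = R_xy / R_yy; the source's own formula for the loop transfer function is not reproduced in this text, so the smoothing factor `alpha`, the regularizer `eps`, and the class name `LoopTransferEstimator` are assumptions.

```python
import numpy as np

class LoopTransferEstimator:
    """Running estimate of a per-bin loop transfer function (assumed form)."""

    def __init__(self, n_bins, alpha=0.9, eps=1e-12):
        self.alpha = alpha  # smoothing factor (assumption)
        self.eps = eps      # avoids division by zero (assumption)
        self.r_xy = np.zeros(n_bins, dtype=complex)  # cross-correlation
        self.r_yy = np.zeros(n_bins, dtype=float)    # autocorrelation

    def update(self, X, Y):
        """X, Y: complex spectra of the recording and broadcast frames."""
        a = self.alpha
        self.r_xy = a * self.r_xy + (1 - a) * X * np.conj(Y)
        self.r_yy = a * self.r_yy + (1 - a) * np.abs(Y) ** 2
        return self.r_xy / (self.r_yy + self.eps)
```

When the recording is exactly a filtered copy of the broadcast signal, X = H * Y, the ratio returns H in every bin where Y is non-zero, which is the sanity check such an estimator should pass.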
  • the second calculating module 704 is configured to calculate a power spectrum of the recorded signal by applying the following formula to the recorded signal:
  • X(n) is the vector obtained by Fourier transforming the recorded signal acquired at the nth time, and .^2 denotes squaring each vector element of X(n).
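The element-wise square described above maps directly onto array operations. This is a minimal sketch, assuming a real-valued time-domain frame, NumPy's `rfft` as the Fourier transform, and the squared magnitude as the square of each complex element; `power_spectrum` is a hypothetical helper name.

```python
import numpy as np

def power_spectrum(frame):
    """Return the per-bin power of one real-valued signal frame."""
    X = np.fft.rfft(np.asarray(frame, dtype=float))  # spectrum X(n)
    return np.abs(X) ** 2                             # element-wise square
```

For example, a unit impulse has a flat spectrum, so every bin carries the same power, while a constant frame concentrates all of its power in the zero-frequency bin.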
  • the third calculating module 705 is configured to calculate a spectrum estimation value of the echo signal according to the loop transfer function and the broadcast signal; calculate the square of the spectrum estimation value of the echo signal to obtain the power spectrum of the echo signal; and subtract the power spectrum of the echo signal from the power spectrum of the recorded signal to obtain the power spectrum of the noise signal.
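The three operations of this module can be sketched in a few lines. The sketch assumes that the echo spectrum estimate is the per-bin product of the loop transfer function and the broadcast spectrum (the source's formula for the estimate is not reproduced in this text), and it clamps negative differences to zero, which is an added safeguard rather than something the source states. The name `echo_and_noise_power` is hypothetical.

```python
import numpy as np

def echo_and_noise_power(h_loop, Y, p_x):
    """Estimate echo and noise power spectra for one frame.

    h_loop: loop transfer function per bin (complex).
    Y: spectrum of the broadcast signal per bin (complex).
    p_x: power spectrum of the recorded signal per bin.
    """
    E = np.asarray(h_loop) * np.asarray(Y)  # echo spectrum estimate (assumed form)
    p_e = np.abs(E) ** 2                    # echo power spectrum
    # Noise power = recorded power minus echo power, floored at zero
    # (the floor is an assumption, not part of the source).
    p_n = np.maximum(np.asarray(p_x, dtype=float) - p_e, 0.0)
    return p_e, p_n
```

The floor matters in practice because estimation error can make the echo power momentarily exceed the recorded power, and a negative power spectrum would be meaningless downstream.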
  • the apparatus further includes:
  • a fifth calculating module configured to calculate a power feature value of the recorded signal, a power feature value of the broadcast signal, and a power feature value of the echo signal;
  • a first determining module configured to determine whether a power feature value of the recorded signal is greater than a first threshold, whether a power feature value of the broadcast signal is greater than a second threshold, and whether a power feature value of the echo signal is greater than a third threshold;
  • the third calculating module 705 is configured to calculate the square of the spectrum estimation value of the echo signal to obtain the power spectrum of the echo signal when the power feature value of the recording signal is greater than the first threshold, the power feature value of the broadcast signal is greater than the second threshold, and the power feature value of the echo signal is greater than the third threshold.
  • the apparatus further includes:
  • a second determining module configured to determine whether a power feature value of the recorded signal is less than a first threshold, and whether a power feature value of the echo signal is less than a third threshold;
  • the third calculating module 705 is configured to: when the power feature value of the recorded signal is less than the first threshold and the power feature value of the echo signal is less than the third threshold, subtract the power spectrum of the echo signal from the power spectrum of the recorded signal to obtain the power spectrum of the noise signal.
  • the fourth calculating module 706 is configured to construct a speech intelligibility function according to the power spectrum of the echo signal and the power spectrum of the noise signal, and to obtain the frequency emphasis coefficient according to the maximum value of the speech intelligibility function under the condition that the power spectrum of the echo signal remains unchanged.
  • the device provided by the embodiment of the present invention automatically adjusts the frequency amplitude of the broadcast signal according to the frequency distribution of the noise signal and the broadcast signal, while ensuring that the speaker is not overloaded and does not damage the dynamic amplitude of the original broadcast signal.
  • FIG. 8 is a schematic structural diagram of a processing terminal of a voice signal according to an embodiment of the present invention.
  • the terminal may be used to implement a method for processing a voice signal provided in the foregoing embodiment. Specifically:
  • the terminal 800 may include an RF (Radio Frequency) circuit 110, a memory 120 including one or more computer readable storage media, an input unit 130, a display unit 140, a sensor 150, an audio circuit 160, a WiFi (Wireless Fidelity) module 170, a processor 180 including one or more processing cores, a power supply 190, and other components. It will be understood by those skilled in the art that the terminal structure shown in FIG. 8 does not constitute a limitation on the terminal, which may include more or fewer components than those illustrated, a combination of certain components, or a different arrangement of components. Specifically:
  • the RF circuit 110 can be used for receiving and transmitting signals while sending and receiving information or during a call; in particular, after receiving downlink information from the base station, it passes the information to one or more processors 180 for processing; in addition, it sends uplink data to the base station.
  • the RF circuit 110 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, an LNA (Low Noise Amplifier), a duplexer, and the like.
  • RF circuitry 110 can also communicate with the network and other devices via wireless communication.
  • the wireless communication may use any communication standard or protocol, including but not limited to GSM (Global System of Mobile communication), GPRS (General Packet Radio Service), CDMA (Code Division Multiple Access), WCDMA (Wideband Code Division Multiple Access), LTE (Long Term Evolution), e-mail, SMS (Short Messaging Service), and the like.
  • the memory 120 can be used to store software programs and modules, and the processor 180 executes various functional applications and data processing by running software programs and modules stored in the memory 120.
  • the memory 120 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may be stored according to The data created by the use of the terminal 800 (such as audio data, phone book, etc.) and the like.
  • memory 120 can include high speed random access memory, and can also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, memory 120 may also include a memory controller to provide access to memory 120 by processor 180 and input unit 130.
  • the input unit 130 can be configured to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function controls.
  • input unit 130 can include touch-sensitive surface 131 as well as other input devices 132.
  • the touch-sensitive surface 131, also referred to as a touch display or trackpad, can collect touch operations performed by the user on or near it (such as operations performed on or near the touch-sensitive surface 131 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connecting device according to a preset program.
  • the touch-sensitive surface 131 can include two portions of a touch detection device and a touch controller.
  • the touch detection device detects the touch orientation of the user, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 180, and can receive and execute commands from the processor 180.
  • the touch-sensitive surface 131 can be implemented in various types such as resistive, capacitive, infrared, and surface acoustic waves.
  • the input unit 130 can also include other input devices 132.
  • other input devices 132 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control buttons and switch buttons), a trackball, a mouse, a joystick, and the like.
  • Display unit 140 can be used to display information entered by the user or information provided to the user and various graphical user interfaces of terminal 800, which can be constructed from graphics, text, icons, video, and any combination thereof.
  • the display unit 140 may include a display panel 141.
  • the display panel 141 may be configured in the form of an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode), or the like.
  • the touch-sensitive surface 131 may cover the display panel 141. When the touch-sensitive surface 131 detects a touch operation on or near it, the operation is transmitted to the processor 180 to determine the type of the touch event, and the processor 180 then provides a corresponding visual output on the display panel 141 according to the type of the touch event.
  • although the touch-sensitive surface 131 and the display panel 141 are implemented as two separate components to implement the input and output functions, in some embodiments the touch-sensitive surface 131 can be integrated with the display panel 141 to implement the input and output functions.
  • Terminal 800 can also include at least one type of sensor 150, such as a light sensor, motion sensor, and other sensors.
  • the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 141 according to the brightness of the ambient light, and the proximity sensor may turn off the display panel 141 and/or the backlight when the terminal 800 is moved to the ear.
  • the gravity acceleration sensor can detect the magnitude of acceleration in all directions (usually three axes) and, when stationary, the magnitude and direction of gravity. It can be used in applications that recognize the attitude of the mobile phone (such as switching between landscape and portrait, related games, and magnetometer attitude calibration) and in vibration-recognition functions (such as a pedometer and tap detection); other sensors that can also be configured in the terminal 800, such as gyroscopes, barometers, hygrometers, thermometers, and infrared sensors, are not described here again.
  • the audio circuit 160, the speaker 161, and the microphone 162 can provide an audio interface between the user and the terminal 800.
  • on the one hand, the audio circuit 160 can transmit the electrical signal converted from the received audio data to the speaker 161, which converts it into a sound signal for output; on the other hand, the microphone 162 converts the collected sound signal into an electrical signal, which the audio circuit 160 receives and converts into audio data; the audio data is then output to the processor 180 for processing and, for example, sent to another terminal via the RF circuit 110, or output to the memory 120 for further processing.
  • the audio circuit 160 may also include an earbud jack to provide communication of the peripheral earphones with the terminal 800.
  • WiFi is a short-range wireless transmission technology
  • the terminal 800 can help users to send and receive emails, browse web pages, and access streaming media through the WiFi module 170, which provides wireless broadband Internet access for users.
  • although FIG. 8 shows the WiFi module 170, it can be understood that it is not an essential part of the terminal 800 and may be omitted as needed without changing the essence of the invention.
  • the processor 180 is the control center of the terminal 800. It connects the various parts of the entire mobile phone using various interfaces and lines, and performs the various functions of the terminal 800 and processes data by running or executing the software programs and/or modules stored in the memory 120 and calling the data stored in the memory 120, thereby monitoring the mobile phone as a whole.
  • the processor 180 may include one or more processing cores; optionally, the processor 180 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, applications, and the like, and the modem processor mainly handles wireless communications. It can be understood that the above modem processor may also not be integrated into the processor 180.
  • the terminal 800 also includes a power source 190 (such as a battery) for powering various components.
  • the power source can be logically coupled to the processor 180 through a power management system to manage functions such as charging, discharging, and power management through the power management system.
  • Power supply 190 may also include any one or more of a DC or AC power source, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
  • the terminal 800 may further include a camera, a Bluetooth module, and the like, and details are not described herein again.
  • the display unit of the terminal 800 is a touch screen display
  • the terminal 800 further includes a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors.
  • the one or more programs include instructions for performing the following operations:
  • the adjusted speech signal is output.
  • the recorded signal is a sound signal collected using a microphone of the terminal device.
  • the outputting of the adjusted voice signal comprises playing the adjusted voice signal through a speaker, wherein the voice signal is a broadcast signal to be played through the speaker that is either received by the terminal device through the network or stored locally.
  • the memory of the terminal also contains instructions for performing the following operations:
  • Calculate the loop transfer function based on the recorded signal and the speech signal including:
  • the terminal's memory also contains instructions for performing the following operations:
  • Calculate the power spectrum of the recorded signal including:
  • X(n) is the vector obtained by Fourier transforming the recorded signal acquired at the nth time, and .^2 denotes squaring each vector element of X(n).
  • the memory of the terminal further includes an instruction for performing the following operations:
  • Calculating the power spectrum of the echo signal and the power spectrum of the noise signal according to the recording signal, the voice signal, and the loop transfer function including:
  • the power spectrum of the noise signal is obtained by subtracting the power spectrum of the echo signal from the power spectrum of the recorded signal.
  • the memory of the terminal further includes an instruction for performing the following operations:
  • Before calculating the square of the spectral estimate of the echo signal to obtain the power spectrum of the echo signal, the method further includes:
  • the power characteristic value of the recording signal is greater than the first threshold
  • the power feature value of the broadcast signal is greater than the second threshold
  • the power characteristic value of the echo signal is greater than the third threshold
  • the memory of the terminal also contains instructions for performing the following operations:
  • the method further includes:
  • the step of subtracting the power spectrum of the echo signal from the power spectrum of the recorded signal to obtain a power spectrum of the noise signal is performed.
  • the memory of the terminal further includes an instruction for performing the following operations:
  • Calculating the frequency emphasis coefficient according to the power spectrum of the echo signal and the power spectrum of the noise signal including:
  • the frequency emphasis coefficient is obtained according to the maximum value of the speech intelligibility function.
  • the terminal provided by the embodiment of the invention automatically adjusts the frequency-bin amplitudes of the playback signal according to the frequency distributions of the noise signal and the playback signal, without overloading the speaker or destroying the dynamic amplitude of the original playback signal, thereby significantly improving speech intelligibility.
  • the embodiment of the present invention further provides a computer readable storage medium, which may be a computer readable storage medium included in the memory in the above embodiment; or may exist separately and not assembled into the terminal.
  • Computer readable storage medium stores one or more programs that are used by one or more processors to perform a method of processing a speech signal, the method comprising:
  • the adjusted speech signal is output.
  • the first possible implementation is used as the basis.
  • the recorded signal is a sound signal collected using a microphone of the terminal device.
  • outputting the adjusted voice signal comprises playing the adjusted voice signal through a speaker, wherein the voice signal is a playback signal, either received by the terminal device over the network or stored locally, that is to be played through the speaker.
  • the memory of the terminal further includes an instruction for performing the following operations:
  • Calculate the loop transfer function based on the recorded signal and the speech signal including:
  • the terminal's memory also contains instructions for performing the following operations:
  • Calculate the power spectrum of the recorded signal including:
  • X(n) is the vector obtained by Fourier transforming the recorded signal acquired at the nth time, and .^2 squares each vector element of X(n).
  • the memory of the terminal further includes an instruction for performing the following operations:
  • Calculating the power spectrum of the echo signal and the power spectrum of the noise signal according to the recording signal, the voice signal, and the loop transfer function including:
  • the power spectrum of the noise signal is obtained by subtracting the power spectrum of the echo signal from the power spectrum of the recorded signal.
  • the memory of the terminal further includes an instruction for performing the following operations:
  • Calculating the square of the spectral estimate of the echo signal, before obtaining the power spectrum of the echo signal also includes:
  • the power characteristic value of the recording signal is greater than the first threshold
  • the power value of the playback signal is greater than the second threshold
  • the power characteristic value of the echo signal is greater than the third threshold
  • the memory of the terminal further includes an instruction for performing the following operations:
  • the method further includes:
  • the step of subtracting the power spectrum of the echo signal from the power spectrum of the recorded signal to obtain a power spectrum of the noise signal is performed.
  • the memory of the terminal further includes an instruction for performing the following operations:
  • Calculating the frequency emphasis coefficient according to the power spectrum of the echo signal and the power spectrum of the noise signal including:
  • the frequency emphasis coefficient is obtained according to the maximum value of the speech intelligibility function.
  • the computer readable storage medium provided by the embodiment of the invention automatically adjusts the frequency-bin amplitudes of the playback signal according to the frequency distributions of the noise signal and the playback signal, without overloading the speaker or destroying the dynamic amplitude of the original playback signal, thereby significantly improving speech intelligibility.
  • a graphic user interface is provided.
  • the graphic user interface is used on a voice signal processing terminal, which includes a touch screen display, a memory, and one or more processors for executing one or more programs; the graphic user interface includes:
  • the recording signal includes at least a noise signal and an echo signal
  • the adjusted speech signal is output.
  • the graphic user interface provided by the embodiment of the invention automatically adjusts the frequency-bin amplitudes of the playback signal according to the frequency distributions of the noise signal and the playback signal, without overloading the speaker or destroying the dynamic amplitude of the original playback signal, thereby significantly improving speech intelligibility.
  • the division into functional modules used by the voice signal processing apparatus in the foregoing embodiment is only an example. In actual applications, the functions may be assigned to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above.
  • the processing device of the voice signal provided by the foregoing embodiment is the same as the embodiment of the method for processing the voice signal, and the specific implementation process is described in detail in the method embodiment, and details are not described herein again.
  • a person skilled in the art may understand that all or part of the steps of implementing the above embodiments may be completed by hardware, or may be instructed by a program to execute related hardware, and the program may be stored in a computer readable storage medium.
  • the storage medium mentioned may be a read only memory, a magnetic disk or an optical disk or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Telephone Function (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

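A method and apparatus for processing a voice signal, belonging to the field of terminal technologies. The method includes: obtaining a recorded signal and a voice signal, the recorded signal including at least a noise signal and an echo signal (301); calculating a loop transfer function from the recorded signal and the voice signal (302); calculating the power spectrum of the echo signal and the power spectrum of the noise signal from the recorded signal, the voice signal, and the loop transfer function (303); calculating a frequency emphasis coefficient from the power spectrum of the echo signal and the power spectrum of the noise signal (304); adjusting the frequency-bin amplitudes of the voice signal based on the frequency emphasis coefficient (305); and outputting the adjusted voice signal (306). Without overloading the speaker or destroying the dynamic amplitude of the original playback signal, the method and apparatus automatically adjust the frequency-bin amplitudes of the playback signal according to the frequency distributions of the noise signal and the playback signal, which markedly improves speech intelligibility.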

Description

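Method and Apparatus for Processing a Voice Signal
This application claims priority to Chinese Patent Application No. 201510741057.1, entitled "Method and Apparatus for Processing a Voice Signal", filed with the Chinese Patent Office on November 4, 2015, which is incorporated herein by reference in its entirety.
Technical Field
The present invention relates to the field of terminal technologies, and in particular to a method and apparatus for processing a voice signal.
Background
Speech intelligibility is the percentage of the speech conveyed by a sound system that a user understands. For example, if a user hears 100 words from a sound system but understands only 50 of them, the speech intelligibility of the system is 50%. As portable mobile terminals become smaller, the maximum sound power they can output decreases, and the speech intelligibility a user experiences when communicating through the terminal suffers accordingly. Because speech intelligibility is an important measure of mobile terminal performance, how a mobile terminal processes voice signals to improve it has become key to the terminal's development.
Currently, in a typical acoustic scenario made up of a mobile terminal, a user, and a noise source, an automatic gain control algorithm examines the playback signal to be played and amplifies its low-level portions; the amplified playback signal is converted into an electrical signal and sent to the speaker. This amplification drives the electrical signal up to the maximum value the speaker allows, so the speaker operates at its maximum output power and outputs the voice signal at its maximum sound pressure level.
In the course of implementing the present invention, the inventor found that the related art has at least the following problems:
The average amplitude of a playback signal is usually far smaller than its peak amplitude. For a speaker with a maximum rated output power of 1 W driven by a normal voice signal, the average output power in normal operation typically reaches only about 10% of the maximum rated output power (that is, 0.1 W). In this normal operating state, if the amplitude of the electrical signal fed to the speaker is increased further, the high-amplitude portions of the playback signal overload the speaker and cause saturation distortion, which lowers speech intelligibility and clarity instead of raising them. Conversely, if only the low-level portions of the playback signal are amplified, the effective dynamic range of the playback signal shrinks, and speech intelligibility still does not improve appreciably.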
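Summary
To solve the problems of the related art, embodiments of the present invention provide a method and apparatus for processing a voice signal. The technical solutions are as follows:
In one aspect, a method for processing a voice signal is provided, the method including:
obtaining a recorded signal and a voice signal, the recorded signal including at least a noise signal and an echo signal;
calculating a loop transfer function from the recorded signal and the voice signal;
calculating the power spectrum of the echo signal and the power spectrum of the noise signal from the recorded signal, the voice signal, and the loop transfer function;
calculating a frequency emphasis coefficient from the power spectrum of the echo signal and the power spectrum of the noise signal;
adjusting the frequency-bin amplitudes of the voice signal based on the frequency emphasis coefficient; and
outputting the adjusted voice signal.
In another aspect, an apparatus for processing a voice signal is provided, the apparatus including:
at least one processor; and
a memory storing program instructions that, when executed by the processor, configure the apparatus to perform the following operations:
obtaining a recorded signal and a voice signal, the recorded signal including at least a noise signal and an echo signal;
calculating a loop transfer function from the recorded signal and the voice signal;
calculating the power spectrum of the echo signal and the power spectrum of the noise signal from the recorded signal, the voice signal, and the loop transfer function;
calculating a frequency emphasis coefficient from the power spectrum of the echo signal and the power spectrum of the noise signal;
adjusting the frequency-bin amplitudes of the voice signal based on the frequency emphasis coefficient; and
outputting the adjusted voice signal.
The technical solutions provided by the embodiments of the present invention have the following beneficial effect:
Without overloading the speaker or destroying the dynamic amplitude of the original playback signal, the frequency-bin amplitudes of the playback signal are adjusted automatically according to the frequency distributions of the noise signal and the playback signal, which markedly improves speech intelligibility.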
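Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an implementation environment involved in a voice signal processing method according to an embodiment of the present invention;
FIG. 2 is a system architecture diagram of a voice signal processing method according to another embodiment of the present invention;
FIG. 3 is a flowchart of a voice signal processing method according to another embodiment of the present invention;
FIG. 4 is a flowchart of a voice signal processing method according to another embodiment of the present invention;
FIG. 5 is a schematic diagram of the signal flow corresponding to a voice signal processing method according to another embodiment of the present invention;
FIG. 6 is a flowchart of a voice signal processing method according to another embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a voice signal processing apparatus according to another embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a voice signal processing terminal according to another embodiment of the present invention.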
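Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
A voice instant messaging application is an application for making Internet phone calls or holding Internet audio conferences, and is widely installed on mobile terminal devices such as smartphones, tablet computers, notebook computers, and wearable electronics. As these terminals become smaller, the maximum sound power that the micro speaker in a mobile terminal can output has hit a bottleneck.
There are two main reasons for this bottleneck:
First, existing electro-acoustic sound reinforcement relies on an amplifier, a speaker, and an acoustic cavity working together to generate sound waves. The speaker in a mobile terminal achieves its highest electro-acoustic conversion efficiency only when the physical dimensions of the speaker and the cavity are comparable to the wavelength of the sound wave. As portable devices shrink, however, their dimensions are often much smaller than that wavelength. Taking a 340 Hz sound wave as an example (a wavelength of about 1 m at a sound speed of about 340 m/s), the terminal would need to be at least 1 meter in size to achieve maximum electro-acoustic conversion efficiency, so miniaturizing the speaker reduces the maximum sound power the terminal can output. In addition, the commonly used moving-coil speaker needs a certain thickness so that its diaphragm has enough room to move; as terminals become smaller and thinner, the overall acoustic design inside the terminal is constrained by physical size, which limits the maximum sound power the terminal can output.
Second, a voice instant messaging application installed on a mobile terminal generally runs on top of the operating system and can control the hardware volume only through the application programming interfaces (APIs) the operating system provides. For audio input and output, the mainstream approach is for the application to declare to the operating system the audio configuration mode it requires, and the operating system then configures the relevant hardware. Once configuration is complete, the application only needs to periodically write the data of the playback signal to the operating system's playback API and read data from the operating system's recording API. However, the operating system supports only a limited set of audio configuration modes, which the terminal manufacturer implements in the low-level hardware layer (firmware), and this constrains how far an application can control the hardware output volume. Moreover, hardware vendors usually tune low-level audio only for normal usage scenarios; for extreme environments (for example, very loud ambient noise), terminal manufacturers generally provide no targeted optimization (for example, they generally do not provide a dedicated software interface for raising the hardware output volume).
Among common mobile terminals, the output volume ranks from high to low as follows: notebook computers, tablet computers, smartphones (hands-free mode), and wearable devices. The ambient noise these terminals face during communication follows the opposite trend. Notebook computers are mostly used indoors, where the noise is mainly quiet, low-decibel noise. Tablets and smartphones are used more often outdoors and in public places, where the noise is mainly loud, high-decibel noise. Wearable devices are worn on the body for long periods and encounter the most numerous and most complex noise scenarios. As mobile terminals become smaller, the ambient noise problem grows more prominent and seriously degrades the user's experience when communicating through the terminal.
To overcome this bottleneck in maximum output sound power, the embodiments of the present invention provide a method that improves the speech intelligibility of a mobile terminal by processing the voice signal, without changing the terminal's hardware. With this method, a user of the mobile terminal can hear the peer's speech clearly even in a noisy scene.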
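FIG. 1 is a schematic diagram of the implementation environment involved in the voice signal processing method and apparatus provided by the embodiments of the present invention. Referring to FIG. 1, the environment includes three acoustic entities, namely a mobile terminal P, a user U, and a noise source N, as well as the sound output and input devices, a speaker S and a microphone M. The mobile terminal P may be a mobile phone, a tablet computer, a notebook computer, a wearable device, or the like, with one or more voice instant messaging applications (apps) installed, through which the user can communicate with other users anytime and anywhere. The speaker S and the microphone M may be built into the mobile terminal or connected to it as external devices such as an external sound system, an external speaker, a Bluetooth speaker, or a Bluetooth headset. The microphone M can pick up all sound in the scene, including the noise emitted by the noise source N, the speech of the user U, and the sound played by the speaker S. When the user communicates with a peer user through the voice instant messaging software, the mobile terminal receives the voice signal to be played that the peer sends (hereinafter called the playback signal, for distinction). After the playback signal is processed, the speaker converts it into sound waves, which travel through the air to the user U and are perceived by the user U. At the same time, the sound waves from the noise source N also travel through the air to the user U and are perceived; they interfere with the user U and reduce the speech intelligibility of the mobile terminal.
In acoustics, according to the masking effect of psychoacoustics, when two signals with similar frequencies and very different amplitudes occur at the same time, the signal with the larger amplitude masks the one with the smaller amplitude. That is, when the noise from the noise source N is very strong, the user U cannot make out the speech being played by the speaker S. Raising the output power of the speaker S would require enlarging its physical size, which conflicts with the small, thin design of mobile terminals. In view of this, the present invention uses the psychoacoustic masking effect to address the interference of the noise signal with the playback signal.
Usually neither the playback signal nor the noise signal is a single-frequency signal. Each occupies its own frequency band, and its energy is not evenly distributed across frequency bins. By comparing the power spectrum distributions of the playback signal and the noise signal, the frequency bins where the noise signal has the least energy can be found; these are denoted f_weak. Without exceeding the output power of the speaker, this embodiment concentrates the energy of the playback signal near f_weak when playing it and, at the same time, attenuates the energy of the playback signal at bins far from f_weak to avoid overloading the speaker. With this processing, at bins near f_weak the noise signal is masked by the playback signal, and what the user perceives is the content of the playback signal. At bins far from f_weak the playback signal is still masked by the noise signal. Overall, the enhanced playback signal masks the noise signal at some bins, so the noise no longer masks the playback signal as a whole and the user can make out its content.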
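The search for f_weak described above can be illustrated with a minimal sketch. It assumes the noise power spectrum is already available as a list of per-bin values; the function name `weakest_noise_bins` and the number of bins kept are hypothetical choices, not something the source specifies.

```python
def weakest_noise_bins(noise_power, count):
    """Return the indices of the `count` bins with the lowest noise power.

    These indices play the role of f_weak: the bins where the playback
    signal can mask the noise with the least extra energy.
    """
    # Sort bin indices by their noise power, lowest first.
    order = sorted(range(len(noise_power)), key=lambda i: noise_power[i])
    # Keep the weakest bins and report them in ascending frequency order.
    return sorted(order[:count])


# Hypothetical per-bin noise powers (arbitrary units).
noise_power = [9.0, 0.5, 7.0, 0.2, 8.0, 6.0]
print(weakest_noise_bins(noise_power, 2))
```

In the embodiment the choice of bins is not made by a fixed count; it falls out of maximizing the speech intelligibility function, so this sketch only conveys the intuition.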
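FIG. 2 is a system architecture diagram of the voice signal processing method provided by the present invention. Referring to FIG. 2, the architecture includes the user U, the speaker S, the microphone M, and several functional modules: a signal detection and classification module, a spectrum estimation module, a loop transfer function calculation module, a speech intelligibility estimation module, and so on. The spectrum estimation module may include a voice activity detection module, a noise power spectrum module, and an echo power spectrum module. The role of each module and the relationships among them are as follows:
The microphone M picks up ambient sound, which this embodiment calls the recorded signal (denoted x), and feeds the recorded signal x to the signal detection and classification module.
The signal detection and classification module detects and separates the recorded signal into three kinds of signal: the speech of the user U (the near-end signal v), the noise from the noise source N (the noise signal n), and the sound played by the speaker S and picked up again by M (the echo signal e).
The spectrum estimation module calculates the power spectrum of the noise signal, the power spectrum of the echo signal, and the power characteristic value of the near-end signal, denoted Pn, Pe, and VAD_v respectively. VAD_v takes one of two states, true or false. VAD_v=true means a near-end signal is present at the current moment, that is, the user U is speaking. VAD_v=false means no near-end signal is present, that is, the user U is not speaking, or is speaking at a volume clearly below that of the noise signal or the echo signal.
The loop transfer function calculation module uses the playback signal y and the recorded signal x picked up by the microphone to calculate the transfer function of the path "emphasis filter -> speaker -> sound field -> microphone", denoted H_loop.
The speech intelligibility estimation module determines the speech intelligibility (denoted SII) from H_loop, VAD_v, Pn, and Pe; the speech intelligibility is in turn used to calculate the frequency emphasis coefficient of the emphasis filter W.
Referring to FIG. 2, in practice the spatial positions of the user, the mobile terminal, and the noise source cannot be determined, and the purpose of processing the playback and recorded signals is to maximize the SII at the position of the user U's ear, not at the position of the microphone M. To resolve this, the method of this embodiment uses an approximation. For the description below, the length of the sound propagation path between the speaker S and the ear of the user U is denoted h1, between the noise source N and the user's ear h2, between the noise source N and the microphone M h3, between the mouth of the user U and the microphone M h4, and between the microphone M and the speaker S h5. The approximations made in the embodiments of the present invention are as follows:
(1) The noise picked up by the microphone is assumed to be approximately the same as the noise perceived by the user, that is, h2≈h3.
(2) The echo from the speaker picked up by the microphone is assumed to be approximately the same as the speaker sound perceived by the user, that is, h1≈h5.
When these approximations hold, the problem of maximizing speech intelligibility at the position of the user U can be converted into the problem of maximizing speech intelligibility at the position of the microphone M.
All of the optional technical solutions above may be combined in any manner to form optional embodiments of the present invention, which are not described one by one here.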
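FIG. 3 is a flowchart of a voice signal processing method according to an embodiment of the present invention. Referring to FIG. 3, the method provided in this embodiment includes:
301. Obtain a recorded signal and a voice signal, for example, by collecting the recorded signal at the near end and receiving the voice signal (that is, the playback signal) sent by the peer. The recorded signal includes at least a noise signal and an echo signal.
302. Calculate a loop transfer function from the recorded signal and the playback signal.
303. Calculate the power spectrum of the echo signal and the power spectrum of the noise signal from the recorded signal, the playback signal, and the loop transfer function.
304. Calculate a frequency emphasis coefficient from the power spectrum of the echo signal and the power spectrum of the noise signal.
305. Adjust the frequency-bin amplitudes of the playback signal based on the frequency emphasis coefficient.
306. Output the adjusted playback signal.
Without overloading the speaker or destroying the dynamic amplitude of the original playback signal, the method provided by this embodiment automatically adjusts the frequency-bin amplitudes of the playback signal according to the frequency distributions of the noise signal and the playback signal, which markedly improves speech intelligibility.
In another embodiment of the present invention, calculating the loop transfer function from the recorded signal and the playback signal includes:
calculating the frequency-domain cross-correlation function between the recorded signal and the playback signal;
calculating the frequency-domain autocorrelation function of the playback signal; and
calculating the loop transfer function from the frequency-domain cross-correlation function between the recorded signal and the playback signal and the frequency-domain autocorrelation function of the playback signal.
In another embodiment of the present invention, the power spectrum of the recorded signal is calculated with the following formula:
Px=X(n).^2
where Px is the power spectrum of the recorded signal, X(n) is the vector obtained by applying a Fourier transform to the recorded signal collected at time n, and .^2 squares each element of X(n).
In another embodiment of the present invention, calculating the power spectrum of the echo signal and the power spectrum of the noise signal from the recorded signal, the playback signal, and the loop transfer function includes:
calculating the power spectrum of the recorded signal;
calculating a spectral estimate of the echo signal from the loop transfer function and the playback signal;
squaring the spectral estimate of the echo signal to obtain the power spectrum of the echo signal; and
subtracting the power spectrum of the echo signal from the power spectrum of the recorded signal to obtain the power spectrum of the noise signal.
In another embodiment of the present invention, before squaring the spectral estimate of the echo signal to obtain the power spectrum of the echo signal, the method further includes:
calculating the power characteristic value of the recorded signal, the power characteristic value of the playback signal, and the power characteristic value of the echo signal;
determining whether the power characteristic value of the recorded signal is greater than a first threshold, whether the power characteristic value of the playback signal is greater than a second threshold, and whether the power characteristic value of the echo signal is greater than a third threshold; and
when the power characteristic value of the recorded signal is greater than the first threshold, the power characteristic value of the playback signal is greater than the second threshold, and the power characteristic value of the echo signal is greater than the third threshold, squaring the spectral estimate of the echo signal to obtain the power spectrum of the echo signal.
In another embodiment of the present invention, before subtracting the power spectrum of the echo signal from the power spectrum of the recorded signal to obtain the power spectrum of the noise signal, the method further includes:
determining whether the power characteristic value of the recorded signal is less than the first threshold and whether the power characteristic value of the echo signal is less than the third threshold; and
when the power characteristic value of the recorded signal is less than the first threshold and the power characteristic value of the echo signal is less than the third threshold, subtracting the power spectrum of the echo signal from the power spectrum of the recorded signal to obtain the power spectrum of the noise signal.
In another embodiment of the present invention, calculating the frequency emphasis coefficient from the power spectrum of the echo signal and the power spectrum of the noise signal includes:
constructing a speech intelligibility function from the power spectrum of the echo signal and the power spectrum of the noise signal; and
obtaining the frequency emphasis coefficient from the maximum of the speech intelligibility function, under the condition that the power spectrum of the echo signal remains unchanged.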
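FIG. 4 is a flowchart of a voice signal processing method according to another embodiment of the present invention. Referring to FIG. 4, the method provided in this embodiment includes:
401. The mobile terminal collects a recorded signal at the near end and receives the playback signal sent by the peer.
The near end is the environment in which the mobile terminal is currently located. Ways in which the mobile terminal collects the recorded signal at the near end include, but are not limited to, turning on the microphone, collecting the sound signal in the current environment through it, and using the collected sound signal as the recorded signal. The recorded signal includes a noise signal, an echo signal, a near-end signal, and so on. In this embodiment, the recorded signal is denoted x, the noise signal n, the echo signal e, and the near-end signal v.
The peer collects the peer user's voice signal through its microphone, processes it, and sends it to the mobile terminal over the network. The instant messaging application on the mobile terminal receives the voice signal sent by the peer and uses it as the playback signal. The peer may be another mobile terminal that communicates with the mobile terminal through the voice instant messaging application. In this embodiment, the playback signal is denoted y.
Optionally, to keep the voice instant messaging application timely, the microphone on the mobile terminal side collects the recorded signal once every preset duration, and the microphone on the peer side likewise collects the playback signal once every preset duration and sends it to the mobile terminal. The preset duration may be 10 ms (milliseconds), 20 ms, 50 ms, and so on.
In this embodiment, the recorded signal collected at the near end and the playback signal sent by the peer are time-domain signals. To simplify later calculations, the method also applies a Fourier transform or a similar method to the collected recorded signal and the received playback signal, converting each from its time-domain form to its frequency-domain form for use in subsequent calculations. In this embodiment, the frequency-domain recorded signal is a column vector whose length equals the number of points of the Fourier transform, denoted X; the frequency-domain playback signal is likewise a column vector of the same length, denoted Y.
Optionally, after the Fourier transform, the frequency-domain recorded signal and the frequency-domain playback signal have the same dimension.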
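As a small worked example of the framing described in step 401, the sketch below converts a preset frame duration into a number of samples and cuts a time-domain signal into frames of that length. The 16 kHz sampling rate is an assumption for illustration only (the source gives frame durations but no sampling rate), and the helper names are hypothetical.

```python
def frame_length(sample_rate_hz, frame_ms):
    """Number of samples in one frame of `frame_ms` milliseconds."""
    return sample_rate_hz * frame_ms // 1000


def split_frames(samples, length):
    """Cut `samples` into consecutive, non-overlapping frames of `length`.

    A trailing partial frame is dropped so every frame has the same size,
    which keeps the recorded and playback spectra the same dimension.
    """
    usable = len(samples) - len(samples) % length
    return [samples[i:i + length] for i in range(0, usable, length)]


# Assumed 16 kHz sampling rate with the 20 ms frame mentioned in the text.
print(frame_length(16000, 20))
print(split_frames([1, 2, 3, 4, 5, 6, 7], 3))
```

Using the same frame length for the recorded signal and the playback signal is what makes X and Y vectors of equal dimension after the transform.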
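402. The mobile terminal calculates the loop transfer function from the recorded signal and the playback signal.
In this embodiment, the mobile terminal may calculate the loop transfer function from the recorded signal and the playback signal through the following steps 4021 to 4023:
4021. The mobile terminal obtains the frequency-domain cross-correlation function between the recorded signal and the playback signal.
A cross-correlation function expresses the degree of correlation between two signals. The mobile terminal may obtain the frequency-domain cross-correlation function between the recorded signal and the playback signal with the following formula <1>:
r_xy=E[X.*Y]           <1>
where r_xy is the cross-correlation function between the recorded signal and the playback signal, E[.] is the expectation operator, and .* multiplies two vectors element by element. For example, if X={a1,a2,a3,a4} and Y={b1,b2,b3,b4}, then X.*Y={a1b1,a2b2,a3b3,a4b4}.
4022. The mobile terminal obtains the frequency-domain autocorrelation function of the playback signal.
An autocorrelation function expresses the degree of correlation between a signal and a delayed copy of itself. The mobile terminal may obtain the frequency-domain autocorrelation function of the playback signal with the following formula <2>:
R_yy=E[Y(n)*Y’(n-k)]       <2>
where R_yy is the frequency-domain autocorrelation function of the playback signal, the symbol * denotes matrix multiplication, the symbol ’ denotes the conjugate transpose, Y(n) is the vector obtained by applying a Fourier transform to the playback signal collected at time n, Y(n-k) is the vector obtained by applying a Fourier transform to the playback signal collected at time n-k, k=[0,Kmax], k∈Z (that is, k is an integer), and the value of Kmax determines the order of the system.
4023. Based on the frequency-domain cross-correlation function obtained in step 4021 and the frequency-domain autocorrelation function obtained in step 4022, the mobile terminal may calculate the loop transfer function with the following formula <3>:
H_loop=R_yy^-1*r_xy          <3>
where H_loop is the loop transfer function and the symbol ^-1 denotes matrix inversion.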
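Steps 4021 to 4023 can be sketched in a simplified per-bin form. This is a sketch under stated assumptions, not the matrix computation of formula <3>: it assumes Kmax=0 (one complex tap per frequency bin), replaces the expectation E[.] with a sum over a few frames, and conjugates Y in the cross term, which is the usual convention for complex spectra but is not written out in formula <1>. The function name and the `eps` regularizer are hypothetical.

```python
def estimate_loop_transfer(x_frames, y_frames, eps=1e-12):
    """Per-bin estimate H[k] = sum(X[k] * conj(Y[k])) / sum(|Y[k]|^2).

    x_frames, y_frames: lists of frames, each a list of complex bin values.
    eps guards against division by zero in bins where Y carries no energy.
    """
    bins = len(x_frames[0])
    h_loop = []
    for k in range(bins):
        # Cross-correlation term (role of r_xy), accumulated over frames.
        cross = sum(x[k] * y[k].conjugate() for x, y in zip(x_frames, y_frames))
        # Autocorrelation term (role of R_yy with Kmax = 0).
        auto = sum(abs(y[k]) ** 2 for y in y_frames)
        h_loop.append(cross / (auto + eps))
    return h_loop


# Synthetic check: build X from a known response and recover it.
true_h = [0.5 + 0j, 1j]
y_frames = [[1 + 0j, 2j], [2 + 0j, 1 + 1j]]
x_frames = [[h * y for h, y in zip(true_h, frame)] for frame in y_frames]
print(estimate_loop_transfer(x_frames, y_frames))
```

With noise and near-end speech present in X, more frames are needed before the average approaches the true response, which is why the source writes the quantities as expectations.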
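403. The mobile terminal obtains the power spectrum of the recorded signal and the power spectrum of the playback signal.
For the recorded signal, the mobile terminal may calculate the power spectrum with the following formula <4>:
Px=X(n).^2           <4>
where Px is the power spectrum of the recorded signal, X(n) is the vector obtained by applying a Fourier transform to the recorded signal collected at time n, and .^2 squares each element of X(n).
For example, if the recorded signal collected at time n is X(n)={a1,a2,a3,…,an}, applying the formula Px=X(n).^2 gives Px={a1^2,a2^2,a3^2,…,an^2}.
For the playback signal, the mobile terminal may calculate the power spectrum with the following formula <5>:
Py=Y(n).^2           <5>
where Py is the power spectrum of the playback signal, Y(n) is the vector obtained by applying a Fourier transform to the playback signal collected at time n, and .^2 squares each element of Y(n).
For example, if the playback signal collected at time n is Y(n)={b1,b2,b3,…,bn}, applying the formula Py=Y(n).^2 gives Py={b1^2,b2^2,b3^2,…,bn^2}.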
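Formulas <4> and <5> can be tried out with a short standard-library sketch. Two assumptions are made here: a plain O(N²) discrete Fourier transform stands in for whatever transform an implementation would use, and "squaring each element" is read as the squared magnitude |X[k]|², since the spectrum is complex. The function names are hypothetical.

```python
import cmath


def dft(frame):
    """Naive discrete Fourier transform of a real or complex frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def power_spectrum(spectrum):
    """Element-wise squared magnitude, the role of X(n).^2 in formula <4>."""
    return [abs(value) ** 2 for value in spectrum]


# One four-sample frame; its energy sits in bins 1 and 3.
frame = [1.0, 0.0, -1.0, 0.0]
print([round(p, 6) for p in power_spectrum(dft(frame))])
```

The same two functions applied to a playback frame give Py of formula <5>.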
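404. The mobile terminal calculates an estimate of the echo signal from the loop transfer function and the playback signal.
From the loop transfer function and the playback signal, the mobile terminal may calculate the estimate of the echo signal with the following formula <6>:
E(n)=H_loop·Y(n)           <6>
where E(n) is the estimate of the echo signal.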
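A minimal sketch of the echo estimate follows, assuming the per-bin reading of the relation E(n)=H_loop·Y(n) that the description of FIG. 5 states, with one complex loop coefficient per frequency bin. The function name is hypothetical.

```python
def estimate_echo(h_loop, y_spectrum):
    """Echo spectrum estimate: each playback bin scaled by its loop response."""
    return [h * y for h, y in zip(h_loop, y_spectrum)]


# Hypothetical two-bin loop response and playback spectrum.
h_loop = [0.5 + 0j, 1j]
y_spectrum = [2 + 0j, 1 + 1j]
print(estimate_echo(h_loop, y_spectrum))
```

Squaring the magnitude of each returned value gives the echo power spectrum used in the later steps.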
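405. The mobile terminal obtains the power characteristic value of the recorded signal, the power characteristic value of the playback signal, and the power characteristic value of the echo signal.
The power characteristic value of the recorded signal is a measure of the power spectrum of the recorded signal and can be obtained by processing that power spectrum. In this embodiment it is denoted VAD_x. VAD_x is a binary state, either true or false: VAD_x=true indicates a strong recorded signal, and VAD_x=false indicates a weak recorded signal.
The power characteristic value of the playback signal is a measure of the power spectrum of the playback signal and can be obtained by processing that power spectrum. In this embodiment it is denoted VAD_y. VAD_y is a binary state, either true or false: VAD_y=true indicates a strong playback signal, and VAD_y=false indicates a weak playback signal.
The power characteristic value of the echo signal is a measure of the power spectrum of the echo signal. In this embodiment it is denoted VAD_e. VAD_e is a binary state, either true or false: VAD_e=true indicates a strong echo signal, and VAD_e=false indicates a weak echo signal. Note that, to obtain the power characteristic value of the echo signal, a power spectrum of the echo signal may first be calculated from the spectral estimate of the echo signal, and this power spectrum is then processed to obtain the power characteristic value. The power spectrum calculated here is only an estimate; whether it is adopted as the power spectrum of the echo signal is decided in step 406 below.
406. The mobile terminal determines whether the power characteristic value of the recorded signal is greater than the first threshold, whether the power characteristic value of the playback signal is greater than the second threshold, and whether the power characteristic value of the echo signal is greater than the third threshold; if so, step 407 is performed.
To distinguish the noise signal from the near-end signal, this embodiment uses the signal detection and classification module together with a voice activity detection mechanism and, based on the power characteristic values of the recorded signal, the echo signal, and the playback signal, separates near-end segments (with background noise superimposed) from non-near-end segments over time, in order to obtain the power spectrum of the noise signal. Specifically, the mobile terminal determines whether the power characteristic value of the recorded signal is greater than the first threshold, whether that of the playback signal is greater than the second threshold, and whether that of the echo signal is greater than the third threshold. The first, second, and third thresholds are preset values, denoted Tx, Ty, and Te respectively in this embodiment. The smaller these thresholds, the more sensitive the mobile terminal is to noise; with larger thresholds, the mobile terminal responds only when the noise energy is very high.
This determination can be expressed by the following formula <7>:
[Formula <7>: image PCTCN2016083622-appb-000002]
In general, the recorded signal collected by the microphone may not contain a near-end signal. To determine whether it does, the following formula <8> may be used:
when VAD_y=false and VAD_e=false, VAD_v=VAD_x        <8>
That is, when the speaker of the mobile terminal is not playing sound (VAD_y=false) and no echo signal is detected (VAD_e=false), the recorded signal collected by the microphone is the near-end signal and the user is speaking; otherwise the user is not speaking.
During the determination, if the power characteristic value of the recorded signal is greater than the first threshold, that of the playback signal is greater than the second threshold, and that of the echo signal is greater than the third threshold, step 407 below is performed. If the power characteristic value of the recorded signal is greater than the first threshold and that of the playback signal is greater than the second threshold but that of the echo signal is less than or equal to the third threshold, or if the power characteristic value of the recorded signal is greater than the first threshold but that of the playback signal is less than or equal to the second threshold, the recorded signal and playback signal obtained this time are ignored.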
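The flags of steps 405 and 406 and the rule of formula <8> can be sketched as follows. The source does not say how a power spectrum is reduced to a power characteristic value, so the mean bin power used here is an assumption, as are the function names; the branch for the case where playback or echo is active follows the text's "otherwise the user is not speaking".

```python
def power_flag(power_spectrum, threshold):
    """Binary power characteristic value: True when the mean bin power
    exceeds the threshold (the reduction to a mean is an assumption)."""
    return sum(power_spectrum) / len(power_spectrum) > threshold


def near_end_active(vad_x, vad_y, vad_e):
    """Formula <8>: with no playback and no echo, VAD_v follows VAD_x.

    In every other case the text treats the user as not speaking.
    """
    if not vad_y and not vad_e:
        return vad_x
    return False


def should_update_echo_spectrum(vad_x, vad_y, vad_e):
    """Step 406: proceed to step 407 only when all three flags are set."""
    return vad_x and vad_y and vad_e


# Hypothetical frame: strong recording, silent playback, no echo.
vad_x = power_flag([1.0, 3.0], 1.5)
print(vad_x, near_end_active(vad_x, False, False))
```

Frames for which `should_update_echo_spectrum` is false while the recording is strong correspond to the "ignored" cases listed at the end of step 406.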
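407. The mobile terminal squares the spectral estimate of the echo signal and uses the result as the power spectrum of the echo signal.
When the power characteristic value of the recorded signal is greater than the first threshold, that of the playback signal is greater than the second threshold, and that of the echo signal is greater than the third threshold, the mobile terminal obtains the power spectrum of the echo signal by squaring the spectral estimate of the echo signal, using the following formula <9>:
Pe=E(n).^2       <9>
where Pe is the power spectrum of the echo signal.
408. The mobile terminal determines whether the power characteristic value of the recorded signal is less than the first threshold and whether the power characteristic value of the echo signal is less than the third threshold; if so, step 409 is performed.
Following step 407, the mobile terminal continues to determine whether the power characteristic value of the recorded signal is less than the first threshold and whether that of the echo signal is less than the third threshold, in order to obtain the power spectrum of the noise signal.
During the determination, if the power characteristic value of the recorded signal is less than the first threshold and that of the echo signal is less than the third threshold, step 409 below is performed. If the power characteristic value of the recorded signal is less than the first threshold but that of the echo signal is greater than or equal to the third threshold, the recorded signal and playback signal obtained this time are ignored.
409. The mobile terminal subtracts the power spectrum of the echo signal from the power spectrum of the recorded signal and uses the result as the power spectrum of the noise signal.
When the power characteristic value of the recorded signal is less than the first threshold and that of the echo signal is less than the third threshold, no near-end signal is considered to be detected, that is, the user is not speaking. The mobile terminal then subtracts the power spectrum of the echo signal from the power spectrum of the recorded signal and uses the result as the power spectrum of the noise signal, as in the following formula <10>:
Pn=Px–Pe           <10>
where Pn is the power spectrum of the noise signal.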
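Formula <10> is a per-bin subtraction, sketched below. Flooring the result at zero is an added safeguard that the source does not mention: an estimated echo power can exceed the recorded power in some bin, and a negative power would be meaningless. The function name is hypothetical.

```python
def noise_power_spectrum(px, pe):
    """Pn = Px - Pe per bin, floored at zero (the floor is an assumption)."""
    return [max(x - e, 0.0) for x, e in zip(px, pe)]


# Hypothetical three-bin recorded and echo power spectra.
px = [4.0, 1.0, 2.5]
pe = [1.5, 2.0, 0.5]
print(noise_power_spectrum(px, pe))
```

Per step 408, this update would run only for frames in which neither a strong recording nor a strong echo is detected.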
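410. The mobile terminal calculates the frequency emphasis coefficient from the power spectrum of the echo signal and the power spectrum of the noise signal.
The mobile terminal may calculate the frequency emphasis coefficient from the power spectrum of the echo signal and the power spectrum of the noise signal through the following steps 4101 and 4102:
4101. The mobile terminal constructs a speech intelligibility function from the power spectrum of the echo signal and the power spectrum of the noise signal.
In acoustics there are several standards for the speech intelligibility function (SII). This embodiment uses the ANSI S3.5 standard [4], in which the speech intelligibility function can be expressed as a function of the power spectrum of the echo signal and the power spectrum of the noise signal. Once the mobile terminal has calculated these two power spectra, it can therefore construct the speech intelligibility function, which is given by the following formula <11>:
[Formula <11>: image PCTCN2016083622-appb-000003]
where imax is the total number of frequency bands, i is any band within imax, SII is the speech intelligibility function, Pei is the power spectrum of the echo signal in the i-th band, Pni is the power spectrum of the noise signal in the i-th band, Pui is the power spectrum of the standard speech level in the i-th band, Ii is the band weighting, and Pdi is an intermediate variable given by the following formula <12>:
[Formula <12>: image PCTCN2016083622-appb-000004]
where fk is the k-th frequency bin in the i-th band and Ck is an intermediate variable given by the following formula <13>:
Ck=0.6(max{Pnk,Pek-24}+10·log10(fk)-6.353)-80             <13>
where Pek is the power spectrum of the echo signal at the k-th frequency bin and Pnk is the power spectrum of the noise signal at the k-th frequency bin.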
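Formula <13> can be evaluated directly. The sketch below assumes, as in ANSI S3.5, that Pnk and Pek are spectrum levels in dB and fk is a frequency in Hz; the source does not restate the units, and the function name is hypothetical.

```python
import math


def slope_term(p_nk, p_ek, f_k):
    """Intermediate variable Ck of formula <13>.

    p_nk: noise level at bin k (assumed dB)
    p_ek: echo level at bin k (assumed dB)
    f_k:  frequency of bin k (assumed Hz, must be positive)
    """
    return 0.6 * (max(p_nk, p_ek - 24.0) + 10.0 * math.log10(f_k) - 6.353) - 80.0


# Hypothetical bin at 1 kHz with 40 dB noise and 50 dB echo:
# max(40, 26) = 40, so Ck = 0.6 * (40 + 30 - 6.353) - 80.
print(round(slope_term(40.0, 50.0, 1000.0), 4))
```

The `max` term means the echo only influences Ck once it exceeds the noise by more than 24 dB at that bin.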
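Note that the specific values of Pui and Ii may be taken from the ANSI S3.5 standard [4] or chosen by the designer as needed.
4102. Under the condition that the power spectrum of the echo signal remains unchanged, the mobile terminal calculates the maximum of the speech intelligibility function to obtain the frequency emphasis coefficient.
In this embodiment, the frequency emphasis coefficient is the coefficient of the emphasis filter in the mobile terminal and is used to adjust the frequency-bin amplitudes of the playback signal output by the terminal. The frequency emphasis coefficient calculated by the mobile terminal differs from moment to moment.
The speech intelligibility function constructed in step 4101 is a function of both the power spectrum of the echo signal and the power spectrum of the noise signal; with two variables, its maximum is hard to calculate. The method of this embodiment therefore makes an approximation: the power spectrum of the noise signal at time n is assumed to be approximately equal to that at time n-1, so that when calculating the frequency emphasis coefficient at time n the mobile terminal can directly use the noise power spectrum calculated at time n-1. In this way the speech intelligibility function becomes a function of the power spectrum of the echo signal alone.
To improve the speech intelligibility of the voice signal played by the speaker, the mobile terminal processes the playback signal with the emphasis filter before playing it through the speaker, raising the amplitude of the playback signal at specified frequency bins and increasing its energy there. Because of the terminal's size, the sound power the speaker can play has an upper limit. To keep the speaker from overloading, when calculating the frequency emphasis coefficient from the constructed speech intelligibility function, this embodiment assumes that the echo signal power spectrum is unchanged before and after enhancement by the emphasis filter, and then calculates the maximum of the speech intelligibility function. Mathematically, this is a constrained extremum problem, which can be expressed by the following formula <14>:
[Formula <14>: image PCTCN2016083622-appb-000005]
where Pei is the power spectrum of the echo signal at the i-th frequency bin before enhancement, Pe’i is the power spectrum of the echo signal at the i-th frequency bin after enhancement, and the formula
[Formula: image PCTCN2016083622-appb-000006]
ensures that the echo signal power spectrum is unchanged before and after enhancement, so that the speaker is not overloaded.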
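The constraint can be made concrete with a small sketch. Since the constraint formula survives only as an image, this sketch assumes it means that the total echo power summed over all bins is the same before and after enhancement; it does not solve the optimization of formula <14>, it only rescales a candidate set of per-bin power gains so that this assumed constraint holds. The function name is hypothetical.

```python
def normalize_gains(gains, pe):
    """Rescale per-bin power gains so that sum(g_i * Pe_i) == sum(Pe_i).

    gains: candidate power gains (the role of |W|^2 per bin)
    pe:    echo power spectrum before enhancement
    """
    total_before = sum(pe)
    total_after = sum(g * p for g, p in zip(gains, pe))
    if total_after == 0.0:
        # Nothing to scale against; leave the gains untouched.
        return list(gains)
    scale = total_before / total_after
    return [g * scale for g in gains]


# Hypothetical candidate gains that boost bin 0 and cut bin 2.
pe = [1.0, 2.0, 4.0]
gains = normalize_gains([2.0, 1.0, 0.5], pe)
print(sum(g * p for g, p in zip(gains, pe)))
```

Any boost in the bins near f_weak is then paid for by attenuation elsewhere, which is the trade the text describes for keeping the speaker within its limit.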
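Note that the signal produced by the emphasis filter is an electrical signal, which becomes a sound wave only after conversion by the speaker. Speakers in different models of mobile terminal have different output frequency responses. Obtaining those responses would require measuring the speaker of each terminal separately and applying correction and compensation at run time, which leads to a hardware fragmentation problem. To avoid this, the method of this embodiment proceeds as follows, without measuring the speaker frequency response directly.
Formula <6> shows that E(n) and Y(n) are related through the loop transfer function H_loop. In this embodiment the frequency response of the speaker is denoted Hspk and the frequency response of the microphone is denoted Hmic; from formula <6> it follows that:
[Formula <15>: image PCTCN2016083622-appb-000007]
With formula <15>, the extremum problem of formula <14> can be converted into a partial-derivative problem: taking the partial derivative yields the stationary point of the speech intelligibility function, as shown in the following formula <16>:
[Formula <16>: image PCTCN2016083622-appb-000008]
where |W|^2 is the frequency emphasis coefficient, |H_loop|^2 can be obtained from formula <3>, Pyi can be obtained from formula <5>, and SII can be obtained from formula <11>.
Evaluating formula <16> gives |W|^2 for the current moment.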
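411. Based on the frequency emphasis coefficient, the mobile terminal adjusts the frequency-bin amplitudes of the playback signal.
Based on the determined frequency emphasis coefficient, the mobile terminal dynamically tracks and adjusts the speech intelligibility function, so that it adapts automatically to changes in the noise power spectrum Pn and the echo power spectrum Pe.
412. The mobile terminal outputs the adjusted playback signal.
To make the playback signal output at the current moment more accurate, the mobile terminal combines the playback signal over a period before the current moment with the corresponding frequency emphasis coefficients and determines the playback signal to output at the current moment according to the following formula <17>:
[Formula <17>: image PCTCN2016083622-appb-000009]
where z(n) is the output playback signal, w(k) is the time-domain counterpart of the frequency emphasis coefficient calculated at time n, Kmax equals the order of the emphasis filter W, and y(n-k) is the value of the playback signal before emphasis at time n-k.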
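The symbol definitions for formula <17> match an ordinary finite impulse response (FIR) filter, so the sketch below assumes the formula is the convolution z(n) = Σ w(k)·y(n-k) for k from 0 to Kmax. The formula itself is only available as an image, so this reading is an assumption, as is treating samples before the start of the signal as zero. The function name is hypothetical.

```python
def apply_emphasis(w, y):
    """Filter the playback samples y with time-domain emphasis taps w.

    z[n] = sum over k of w[k] * y[n - k], with y treated as zero for
    indices before the start of the signal.
    """
    out = []
    for n in range(len(y)):
        acc = 0.0
        for k, tap in enumerate(w):
            if n - k >= 0:
                acc += tap * y[n - k]
        out.append(acc)
    return out


# Hypothetical two-tap filter applied to three playback samples.
print(apply_emphasis([0.5, 0.25], [4.0, 8.0, 0.0]))
```

In a running system the taps would be refreshed each time a new frequency emphasis coefficient is computed, so the filter tracks changes in the noise and echo spectra.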
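Because the adjusted playback signal output by the mobile terminal in this step can mask the noise signal, the user can make out the content of the playback signal when listening to it.
FIG. 5 shows the signal flow corresponding to the voice signal processing method provided by the embodiment of the present invention. As FIG. 5 shows, from the obtained recorded signal X and playback signal Y, the mobile terminal calculates the loop transfer function H_loop=R_yy^-1*r_xy using the frequency-domain cross-correlation function r_xy between the recorded signal and the playback signal and the frequency-domain autocorrelation function R_yy of the playback signal. From the playback signal and the loop transfer function, the mobile terminal calculates the echo signal estimate E(n)=H_loop·Y(n). It then uses the power characteristic values of the recorded signal, the playback signal, and the echo signal, together with the voice activity detection mechanism, to calculate the power spectrum of the echo signal and the power spectrum of the noise signal; obtains the frequency emphasis coefficient by calculating the maximum of the speech intelligibility function; and finally uses the emphasis filter to adjust the frequency-bin amplitudes of the playback signal based on the frequency emphasis coefficient and outputs the adjusted playback signal.
FIG. 6 is a flowchart of a voice signal processing method according to another embodiment of the present invention. The method can be implemented in software. After the voice instant messaging application starts, the mobile terminal periodically obtains the recorded signal x collected by the microphone at the near end and the playback signal y sent by the peer, calculates the power spectrum Px of the recorded signal and the power spectrum Py of the playback signal, and then calculates the loop transfer function H_loop with formula <3>. Once the loop transfer function is determined, the mobile terminal calculates the echo signal estimate E(n) with formula <6>. Because the echo signal, the near-end voice signal, and the noise signal are picked up by the same microphone and overlap in time, the recorded signal must be classified; the echo power spectrum Pe is then calculated with formula <9> and the noise power spectrum Pn with formula <10>. Next, the speech intelligibility function SII is constructed from the power spectra of the echo signal and the noise signal, and the frequency emphasis coefficient W is obtained by calculating the maximum of SII. Finally, the enhanced playback signal is calculated with formula <17> and sent to the speaker, which converts it into sound for playback.
Note that the method above may be implemented at the level of the voice instant messaging application, at the operating system level, or in the firmware of a hardware chip. The voice data processing method provided by the embodiments of the present invention applies at any of these three levels; the only difference is the level of the mobile terminal system at which the same method runs.
Note that although the present invention is described above using a mobile terminal as an example, a person skilled in the art will understand that it can also be applied to other terminal devices, such as desktop computers. In addition, the playback signal may be received from a peer, for example a voice signal that the terminal device receives from another terminal device (the peer device) over a wired or wireless network, or it may be a voice signal stored locally on the terminal device. Furthermore, although a voice instant messaging application is used as the example above, a person skilled in the art will understand that it may be replaced by any other voice playback application.
Note that the method above can be used not only to improve speech intelligibility but also to enhance audio signals with other content. For example, ringtones and alarm tones can be enhanced automatically according to the ambient noise, so that the enhanced alert sound is heard more clearly by the user and the interference of ambient noise is overcome.
Note that besides countering noise, the method above can also be used in environments where the interference is not noise. For example, two people A and B make phone calls at the same time at close range, with A talking to a and B talking to b. Because A and B are close together, A's speech interferes with B's listening and B's speech interferes with A's listening. The method provided by the embodiments of the present invention also applies to this competing-speech scenario: the mobile terminal on A's side treats B's speech as the noise signal and a's speech as the signal to be enhanced; likewise, the mobile terminal on B's side treats A's speech as the noise signal and b's speech as the signal to be enhanced.
Without overloading the speaker or destroying the dynamic amplitude of the original playback signal, the method provided by the embodiments of the present invention automatically adjusts the frequency-bin amplitudes of the playback signal according to the frequency distributions of the noise signal and the playback signal, which markedly improves speech intelligibility.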
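Referring to FIG. 7, an embodiment of the present invention provides a schematic structural diagram of a voice signal processing apparatus. The apparatus includes:
a collection module 701, configured to collect a recorded signal at the near end, the recorded signal including at least a noise signal and an echo signal;
a receiving module 702, configured to receive the playback signal sent by the peer;
a first calculation module 703, configured to calculate a loop transfer function from the recorded signal and the playback signal;
a second calculation module 704, configured to calculate the power spectrum of the recorded signal;
a third calculation module 705, configured to calculate the power spectrum of the echo signal and the power spectrum of the noise signal from the power spectrum of the recorded signal, the playback signal, and the loop transfer function;
a fourth calculation module 706, configured to calculate a frequency emphasis coefficient from the power spectrum of the echo signal and the power spectrum of the noise signal;
an adjustment module 707, configured to adjust the frequency-bin amplitudes of the playback signal based on the frequency emphasis coefficient; and
an output module 708, configured to output the adjusted playback signal.
In another embodiment of the present invention, the first calculation module 703 is configured to calculate the frequency-domain cross-correlation function between the recorded signal and the playback signal; calculate the frequency-domain autocorrelation function of the playback signal; and calculate the loop transfer function from the frequency-domain cross-correlation function between the recorded signal and the playback signal and the frequency-domain autocorrelation function of the playback signal.
In another embodiment of the present invention, the second calculation module 704 is configured to calculate the power spectrum of the recorded signal with the following formula:
Px=X(n).^2
where Px is the power spectrum of the recorded signal, X(n) is the vector obtained by applying a Fourier transform to the recorded signal collected at time n, and .^2 squares each element of X(n).
In another embodiment of the present invention, the third calculation module 705 is configured to calculate a spectral estimate of the echo signal from the loop transfer function and the playback signal; square the spectral estimate of the echo signal to obtain the power spectrum of the echo signal; and subtract the power spectrum of the echo signal from the power spectrum of the recorded signal to obtain the power spectrum of the noise signal.
In another embodiment of the present invention, the apparatus further includes:
a fifth calculation module, configured to calculate the power characteristic value of the recorded signal, the power characteristic value of the playback signal, and the power characteristic value of the echo signal; and
a first determining module, configured to determine whether the power characteristic value of the recorded signal is greater than the first threshold, whether the power characteristic value of the playback signal is greater than the second threshold, and whether the power characteristic value of the echo signal is greater than the third threshold;
the third calculation module 705 being configured to square the spectral estimate of the echo signal to obtain the power spectrum of the echo signal when the power characteristic value of the recorded signal is greater than the first threshold, the power characteristic value of the playback signal is greater than the second threshold, and the power characteristic value of the echo signal is greater than the third threshold.
In another embodiment of the present invention, the apparatus further includes:
a second determining module, configured to determine whether the power characteristic value of the recorded signal is less than the first threshold and whether the power characteristic value of the echo signal is less than the third threshold;
the third calculation module 705 being configured to subtract the power spectrum of the echo signal from the power spectrum of the recorded signal to obtain the power spectrum of the noise signal when the power characteristic value of the recorded signal is less than the first threshold and the power characteristic value of the echo signal is less than the third threshold.
In another embodiment of the present invention, the fourth calculation module 706 is configured to construct a speech intelligibility function from the power spectrum of the echo signal and the power spectrum of the noise signal, and to obtain the frequency emphasis coefficient from the maximum of the speech intelligibility function under the condition that the power spectrum of the echo signal remains unchanged.
In summary, without overloading the speaker or destroying the dynamic amplitude of the original playback signal, the apparatus provided by the embodiment of the present invention automatically adjusts the frequency-bin amplitudes of the playback signal according to the frequency distributions of the noise signal and the playback signal, which markedly improves speech intelligibility.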
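Referring to FIG. 8, which shows a schematic structural diagram of a voice signal processing terminal involved in the embodiments of the present invention, the terminal can be used to carry out the voice signal processing method provided in the foregoing embodiments. Specifically:
The terminal 800 may include an RF (Radio Frequency) circuit 110, a memory 120 including one or more computer-readable storage media, an input unit 130, a display unit 140, a sensor 150, an audio circuit 160, a WiFi (Wireless Fidelity) module 170, a processor 180 including one or more processing cores, a power supply 190, and other components. A person skilled in the art will understand that the terminal structure shown in FIG. 8 does not limit the terminal, which may include more or fewer components than shown, combine certain components, or arrange the components differently. In particular:
The RF circuit 110 can be used to receive and send signals while sending and receiving information or during a call. In particular, it receives downlink information from a base station and hands it to one or more processors 180 for processing, and it sends uplink data to the base station. The RF circuit 110 typically includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a subscriber identity module (SIM) card, a transceiver, a coupler, an LNA (Low Noise Amplifier), a duplexer, and so on. The RF circuit 110 can also communicate with networks and other devices by wireless communication, which may use any communication standard or protocol, including but not limited to GSM (Global System of Mobile communication), GPRS (General Packet Radio Service), CDMA (Code Division Multiple Access), WCDMA (Wideband Code Division Multiple Access), LTE (Long Term Evolution), email, and SMS (Short Messaging Service).
The memory 120 can be used to store software programs and modules, and the processor 180 performs various functional applications and data processing by running the software programs and modules stored in the memory 120. The memory 120 may mainly include a program storage area and a data storage area. The program storage area may store the operating system and the applications required by at least one function (such as a sound playback function or an image playback function); the data storage area may store data created through use of the terminal 800 (such as audio data or a phone book). In addition, the memory 120 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device. Accordingly, the memory 120 may also include a memory controller to provide the processor 180 and the input unit 130 with access to the memory 120.
The input unit 130 can be used to receive entered digit or character information and to generate keyboard, mouse, joystick, optical, or trackball signal input related to user settings and function control. Specifically, the input unit 130 may include a touch-sensitive surface 131 and other input devices 132. The touch-sensitive surface 131, also called a touch display or touchpad, collects touch operations by the user on or near it (for example, operations performed on or near the touch-sensitive surface 131 with a finger, a stylus, or any other suitable object or accessory) and drives the corresponding connected device according to a preset program. Optionally, the touch-sensitive surface 131 may include a touch detection device and a touch controller. The touch detection device detects the position of the user's touch and the signal produced by the touch operation and passes the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 180, and receives and executes commands from the processor 180. The touch-sensitive surface 131 may be implemented in resistive, capacitive, infrared, surface acoustic wave, and other types. Besides the touch-sensitive surface 131, the input unit 130 may include other input devices 132, which may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume keys or a power key), a trackball, a mouse, and a joystick.
The display unit 140 can be used to display information entered by the user or provided to the user, as well as the various graphical user interfaces of the terminal 800, which may be composed of graphics, text, icons, video, and any combination of them. The display unit 140 may include a display panel 141, which may optionally be configured as an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode) display, or the like. Further, the touch-sensitive surface 131 may cover the display panel 141. When the touch-sensitive surface 131 detects a touch operation on or near it, it passes the operation to the processor 180 to determine the type of touch event, and the processor 180 then provides corresponding visual output on the display panel 141 according to that type. Although in FIG. 8 the touch-sensitive surface 131 and the display panel 141 implement the input and output functions as two separate components, in some embodiments they may be integrated to implement the input and output functions.
The terminal 800 may also include at least one sensor 150, such as a light sensor, a motion sensor, or other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor: the ambient light sensor can adjust the brightness of the display panel 141 according to the ambient light, and the proximity sensor can turn off the display panel 141 and/or the backlight when the terminal 800 is moved to the ear. As one kind of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally three axes) and, when at rest, the magnitude and direction of gravity; it can be used in applications that recognize the phone's posture (such as switching between landscape and portrait, related games, and magnetometer posture calibration) and in vibration-recognition functions (such as a pedometer or tap detection). Other sensors with which the terminal 800 may also be equipped, such as a gyroscope, barometer, hygrometer, thermometer, or infrared sensor, are not described here.
The audio circuit 160, a speaker 161, and a microphone 162 provide an audio interface between the user and the terminal 800. The audio circuit 160 can convert received audio data into an electrical signal and transmit it to the speaker 161, which converts it into a sound signal for output. Conversely, the microphone 162 converts collected sound signals into electrical signals, which the audio circuit 160 receives and converts into audio data; the audio data is then output to the processor 180 for processing and sent through the RF circuit 110 to, for example, another terminal, or output to the memory 120 for further processing. The audio circuit 160 may also include an earphone jack for communication between a peripheral headset and the terminal 800.
WiFi is a short-range wireless transmission technology. Through the WiFi module 170, the terminal 800 can help the user send and receive email, browse web pages, access streaming media, and so on; it provides the user with wireless broadband Internet access. Although FIG. 8 shows the WiFi module 170, it is not an essential part of the terminal 800 and may be omitted as needed without changing the essence of the invention.
The processor 180 is the control center of the terminal 800. It connects the parts of the whole phone through various interfaces and lines, and performs the functions of the terminal 800 and processes data by running or executing the software programs and/or modules stored in the memory 120 and calling the data stored in the memory 120, thereby monitoring the phone as a whole. Optionally, the processor 180 may include one or more processing cores; optionally, the processor 180 may integrate an application processor, which mainly handles the operating system, user interface, and applications, and a modem processor, which mainly handles wireless communication. The modem processor may also not be integrated into the processor 180.
The terminal 800 also includes a power supply 190 (such as a battery) that powers the components. Preferably, the power supply is logically connected to the processor 180 through a power management system, which manages charging, discharging, and power consumption. The power supply 190 may also include any of one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and similar components.
Although not shown, the terminal 800 may also include a camera, a Bluetooth module, and so on, which are not described here. Specifically, in this embodiment the display unit of the terminal 800 is a touch screen display, and the terminal 800 also includes a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors.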
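The one or more programs contain instructions for performing the following operations:
obtaining a recorded signal and a voice signal, the recorded signal including at least a noise signal and an echo signal;
calculating a loop transfer function from the recorded signal and the voice signal;
calculating the power spectrum of the echo signal and the power spectrum of the noise signal from the recorded signal, the voice signal, and the loop transfer function;
calculating a frequency emphasis coefficient from the power spectrum of the echo signal and the power spectrum of the noise signal;
adjusting the frequency-bin amplitudes of the voice signal based on the frequency emphasis coefficient; and
outputting the adjusted voice signal.
Taking the above as a first possible implementation, in a second possible implementation based on the first, the recorded signal is a sound signal collected with the microphone of the terminal device.
In a third possible implementation based on the first, outputting the adjusted voice signal includes playing the adjusted voice signal through a speaker, where the voice signal is a playback signal to be played through the speaker that the terminal device receives over a network or stores locally.
In a fourth possible implementation based on the third, the memory of the terminal further contains instructions for performing the following operations:
calculating the loop transfer function from the recorded signal and the voice signal includes:
calculating the frequency-domain cross-correlation function between the recorded signal and the playback signal;
calculating the frequency-domain autocorrelation function of the playback signal; and
calculating the loop transfer function from the frequency-domain cross-correlation function between the recorded signal and the playback signal and the frequency-domain autocorrelation function of the playback signal.
Alternatively, the memory of the terminal further contains instructions for performing the following operations:
calculating the power spectrum of the recorded signal includes:
for the recorded signal, applying the following formula to calculate its power spectrum:
Px=X(n).^2
where Px is the power spectrum of the recorded signal, X(n) is the vector obtained by applying a Fourier transform to the recorded signal collected at time n, and .^2 squares each element of X(n).
In a fifth possible implementation based on the third, the memory of the terminal further contains instructions for performing the following operations:
calculating the power spectrum of the echo signal and the power spectrum of the noise signal from the recorded signal, the voice signal, and the loop transfer function includes:
calculating the power spectrum of the recorded signal;
calculating a spectral estimate of the echo signal from the loop transfer function and the playback signal;
squaring the spectral estimate of the echo signal to obtain the power spectrum of the echo signal; and
subtracting the power spectrum of the echo signal from the power spectrum of the recorded signal to obtain the power spectrum of the noise signal.
In a sixth possible implementation based on the fifth, the memory of the terminal further contains instructions for performing the following operations:
before squaring the spectral estimate of the echo signal to obtain the power spectrum of the echo signal, the method further includes:
calculating the power characteristic value of the recorded signal, the power characteristic value of the playback signal, and the power characteristic value of the echo signal;
determining whether the power characteristic value of the recorded signal is greater than the first threshold, whether the power characteristic value of the playback signal is greater than the second threshold, and whether the power characteristic value of the echo signal is greater than the third threshold; and
when the power characteristic value of the recorded signal is greater than the first threshold, the power characteristic value of the playback signal is greater than the second threshold, and the power characteristic value of the echo signal is greater than the third threshold, performing the step of squaring the spectral estimate of the echo signal to obtain the power spectrum of the echo signal.
In a seventh possible implementation based on the sixth, the memory of the terminal further contains instructions for performing the following operations:
before subtracting the power spectrum of the echo signal from the power spectrum of the recorded signal to obtain the power spectrum of the noise signal, the method further includes:
determining whether the power characteristic value of the recorded signal is less than the first threshold and whether the power characteristic value of the echo signal is less than the third threshold; and
when the power characteristic value of the recorded signal is less than the first threshold and the power characteristic value of the echo signal is less than the third threshold, performing the step of subtracting the power spectrum of the echo signal from the power spectrum of the recorded signal to obtain the power spectrum of the noise signal.
In an eighth possible implementation based on the third, the memory of the terminal further contains instructions for performing the following operations:
calculating the frequency emphasis coefficient from the power spectrum of the echo signal and the power spectrum of the noise signal includes:
constructing a speech intelligibility function from the power spectrum of the echo signal and the power spectrum of the noise signal; and
obtaining the frequency emphasis coefficient from the maximum of the speech intelligibility function, under the condition that the power spectrum of the echo signal remains unchanged.
Without overloading the speaker or destroying the dynamic amplitude of the original playback signal, the terminal provided by the embodiment of the present invention automatically adjusts the frequency-bin amplitudes of the playback signal according to the frequency distributions of the noise signal and the playback signal, which markedly improves speech intelligibility.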
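An embodiment of the present invention further provides a computer-readable storage medium, which may be the computer-readable storage medium included in the memory of the foregoing embodiment, or a standalone computer-readable storage medium not assembled into a terminal. The computer-readable storage medium stores one or more programs that are used by one or more processors to perform a voice signal processing method, the method including:
obtaining a recorded signal and a voice signal, the recorded signal including at least a noise signal and an echo signal;
calculating a loop transfer function from the recorded signal and the voice signal;
calculating the power spectrum of the echo signal and the power spectrum of the noise signal from the recorded signal, the voice signal, and the loop transfer function;
calculating a frequency emphasis coefficient from the power spectrum of the echo signal and the power spectrum of the noise signal;
adjusting the frequency-bin amplitudes of the voice signal based on the frequency emphasis coefficient; and
outputting the adjusted voice signal.
Taking the above as a first possible implementation, in a second possible implementation based on the first, the recorded signal is a sound signal collected with the microphone of the terminal device.
In a third possible implementation based on the first, outputting the adjusted voice signal includes playing the adjusted voice signal through a speaker, where the voice signal is a playback signal to be played through the speaker that the terminal device receives over a network or stores locally.
In a fourth possible implementation based on the third, the memory of the terminal further contains instructions for performing the following operations:
calculating the loop transfer function from the recorded signal and the voice signal includes:
calculating the frequency-domain cross-correlation function between the recorded signal and the playback signal;
calculating the frequency-domain autocorrelation function of the playback signal; and
calculating the loop transfer function from the frequency-domain cross-correlation function between the recorded signal and the playback signal and the frequency-domain autocorrelation function of the playback signal.
Alternatively, the memory of the terminal further contains instructions for performing the following operations:
calculating the power spectrum of the recorded signal includes:
for the recorded signal, applying the following formula to calculate its power spectrum:
Px=X(n).^2
where Px is the power spectrum of the recorded signal, X(n) is the vector obtained by applying a Fourier transform to the recorded signal collected at time n, and .^2 squares each element of X(n).
In a fifth possible implementation based on the third, the memory of the terminal further contains instructions for performing the following operations:
calculating the power spectrum of the echo signal and the power spectrum of the noise signal from the recorded signal, the voice signal, and the loop transfer function includes:
calculating the power spectrum of the recorded signal;
calculating a spectral estimate of the echo signal from the loop transfer function and the playback signal;
squaring the spectral estimate of the echo signal to obtain the power spectrum of the echo signal; and
subtracting the power spectrum of the echo signal from the power spectrum of the recorded signal to obtain the power spectrum of the noise signal.
In a sixth possible implementation based on the fifth, the memory of the terminal further contains instructions for performing the following operations:
before squaring the spectral estimate of the echo signal to obtain the power spectrum of the echo signal, the method further includes:
obtaining the power characteristic value of the recorded signal, the power characteristic value of the playback signal, and the power characteristic value of the echo signal;
determining whether the power characteristic value of the recorded signal is greater than the first threshold, whether the power characteristic value of the playback signal is greater than the second threshold, and whether the power characteristic value of the echo signal is greater than the third threshold; and
when the power characteristic value of the recorded signal is greater than the first threshold, the power characteristic value of the playback signal is greater than the second threshold, and the power characteristic value of the echo signal is greater than the third threshold, performing the step of squaring the spectral estimate of the echo signal to obtain the power spectrum of the echo signal.
In a seventh possible implementation based on the sixth, the memory of the terminal further contains instructions for performing the following operations:
before subtracting the power spectrum of the echo signal from the power spectrum of the recorded signal to obtain the power spectrum of the noise signal, the method further includes:
determining whether the power characteristic value of the recorded signal is less than the first threshold and whether the power characteristic value of the echo signal is less than the third threshold; and
when the power characteristic value of the recorded signal is less than the first threshold and the power characteristic value of the echo signal is less than the third threshold, performing the step of subtracting the power spectrum of the echo signal from the power spectrum of the recorded signal to obtain the power spectrum of the noise signal.
In an eighth possible implementation based on the third, the memory of the terminal further contains instructions for performing the following operations:
calculating the frequency emphasis coefficient from the power spectrum of the echo signal and the power spectrum of the noise signal includes:
constructing a speech intelligibility function from the power spectrum of the echo signal and the power spectrum of the noise signal; and
obtaining the frequency emphasis coefficient from the maximum of the speech intelligibility function, under the condition that the power spectrum of the echo signal remains unchanged.
Without overloading the speaker or destroying the dynamic amplitude of the original playback signal, the computer-readable storage medium provided by the embodiment of the present invention automatically adjusts the frequency-bin amplitudes of the playback signal according to the frequency distributions of the noise signal and the playback signal, which markedly improves speech intelligibility.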
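An embodiment of the present invention provides a graphical user interface used on a voice signal processing terminal. The voice signal processing terminal includes a touch screen display, a memory, and one or more processors for executing one or more programs. The graphical user interface includes:
obtaining a recorded signal and a voice signal, the recorded signal including at least a noise signal and an echo signal;
calculating a loop transfer function from the recorded signal and the voice signal;
calculating the power spectrum of the echo signal and the power spectrum of the noise signal from the recorded signal, the voice signal, and the loop transfer function;
calculating a frequency emphasis coefficient from the power spectrum of the echo signal and the power spectrum of the noise signal;
adjusting the frequency-bin amplitudes of the voice signal based on the frequency emphasis coefficient; and
outputting the adjusted voice signal.
Without overloading the speaker or destroying the dynamic amplitude of the original playback signal, the graphical user interface provided by the embodiment of the present invention automatically adjusts the frequency-bin amplitudes of the playback signal according to the frequency distributions of the noise signal and the playback signal, which markedly improves speech intelligibility.
Note that when the voice signal processing apparatus provided by the foregoing embodiment processes a voice signal, the division into the functional modules described above is only an example. In practice, the functions may be assigned to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the voice signal processing apparatus provided by the foregoing embodiment is based on the same concept as the voice signal processing method embodiments; its specific implementation is detailed in the method embodiments and is not repeated here.
A person of ordinary skill in the art will understand that all or some of the steps of the foregoing embodiments may be carried out by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.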

Claims (18)

  1. 一种语音信号的处理方法,包括:
    获取录音信号和要输出的语音信号,所述录音信号中至少包括噪声信号及回声信号;
    根据所述录音信号和所述语音信号,计算环路传递函数;
    根据所述录音信号、所述语音信号及所述环路传递函数,计算所述回声信号的功率谱和所述噪声信号的功率谱;
    根据所述回声信号的功率谱和所述噪声信号的功率谱,计算频率加重系数;
    基于所述频率加重系数,对所述语音信号的频点幅值进行调节;
    输出调节后的语音信号。
  2. 根据权利要求1所述的方法,其中,所述录音信号是使用终端设备的麦克风采集的声音信号。
  3. 根据权利要求1所述的方法,其中,输出调节后的语音信号包括通过终端设备的扬声器播放调节后的语音信号,其中所述语音信号是终端设备通过网络接收的或本地存储的要通过扬声器播放的播音信号。
  4. 根据权利要求3所述的方法,其中,所述根据所述录音信号和所述语音信号,计算环路传递函数,包括:
    计算所述录音信号与所述播音信号之间的频域互相关函数;
    计算所述播音信号的频域自相关函数;
    根据所述录音信号与所述播音信号之间的频域互相关函数以及所述播音信号的频域自相关函数计算所述环路传递函数。
  5. 根据权利要求3所述的方法,其中,所述根据所述录音信号、所述语音信号及所述环路传递函数,计算所述回声信号的功率谱和所述噪声信号的功率谱,包括:
    计算所述录音信号的功率谱;
    根据所述环路传递函数及所述播音信号,计算所述回声信号的频谱估计值;
    计算所述回声信号的频谱估计值的平方,得到所述回声信号的功率谱;
    将所述录音信号的功率谱减去所述回声信号的功率谱,得到所述噪声信号的功率谱。
  6. 根据权利要求5所述的方法,还包括:
    计算所述录音信号的功率特征值、所述播音信号的功率特征值及所述回声信号的功率特征值;和
    判断所述录音信号的功率特征值是否大于第一阈值、所述播音信号的功率特征值是否大于第二阈值、所述回声信号的功率特征值是否大于第三阈值,
    其中,所述计算所述回声信号的频谱估计值的平方,得到所述回声信号的功率谱包括:
    当所述录音信号的功率特征值大于所述第一阈值、所述播音信号的功率值大于所述第二阈值且所述回声信号的功率特征值大于所述第三阈值时,计算所述回声信号的频谱估计值的平方,得到所述回声信号的功率谱。
  7. 根据权利要求6所述的方法,还包括:
    判断所述录音信号的功率特征值是否小于所述第一阈值、所述回声信号的功率特征值是否小于所述第三阈值,
    其中,所述将所述录音信号的功率谱减去所述回声信号的功率谱,得到所述噪声信号的功率谱包括:
    当所述录音信号的功率特征值小于所述第一阈值且所述回声信号的功率特征值小于所述第三阈值时,将所述录音信号的功率谱减去所述回声信号的功率谱,得到所述噪声信号的功率谱。
  8. 根据权利要求3所述的方法,其中,所述根据所述回声信号的功率谱和所述噪声信号的功率谱,计算频率加重系数,包括:
    根据所述回声信号的功率谱及所述噪声信号的功率谱,构建语音可懂度函数;
    在所述回声信号的功率谱保持不变的条件下,根据所述语音可懂度函数的极大值,得到所述频率加重系数。
  9. 根据权利要求1所述的方法,其中所述终端设备包括加重滤波器、扬声器和麦克风,所述频率加重系数表示语音信号经过加重滤波器和扬声器后被麦克风拾取的比例。
  10. 一种语音信号的处理装置,包括:
    至少一个处理器;和
    存储器,其中所述存储器存储有程序指令,所述指令当由所述处理器执行时,配置所述装置执行下述操作:
    获取录音信号和语音信号,所述录音信号中至少包括噪声信号及回声信号;
    根据所述录音信号和所述语音信号,计算环路传递函数;
    根据所述录音信号、所述语音信号及所述环路传递函数,计算所述回声信号的功率谱和所述噪声信号的功率谱;
    根据所述回声信号的功率谱和所述噪声信号的功率谱,计算频率加重系数;
    基于所述频率加重系数,对所述语音信号的频点幅值进行调节;
    输出调节后的语音信号。
  11. 根据权利要求10所述的装置,其中,所述录音信号是使用终端设备的麦克风采集的声音信号。
  12. 根据权利要求10所述的装置,其中,输出调节后的语音信号包括通过扬声器播放调节后的语音信号,其中所述语音信号是终端设备通过网络接收的或本地存储的要通过扬声器播放的播音信号。
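  13. The apparatus according to claim 12, wherein calculating the loop transfer function according to the recorded signal and the speech signal comprises:
    calculating a frequency-domain cross-correlation function between the recorded signal and the playback signal;
    calculating a frequency-domain autocorrelation function of the playback signal; and
    calculating the loop transfer function according to the frequency-domain cross-correlation function between the recorded signal and the playback signal and the frequency-domain autocorrelation function of the playback signal.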
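  14. The apparatus according to claim 12, wherein calculating the power spectrum of the echo signal and the power spectrum of the noise signal according to the recorded signal, the speech signal, and the loop transfer function comprises:
    calculating a power spectrum of the recorded signal;
    calculating a spectrum estimate of the echo signal according to the loop transfer function and the playback signal;
    calculating the square of the spectrum estimate of the echo signal to obtain the power spectrum of the echo signal; and
    subtracting the power spectrum of the echo signal from the power spectrum of the recorded signal to obtain the power spectrum of the noise signal.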
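  15. The apparatus according to claim 12, wherein the apparatus is further configured to:
    calculate a power characteristic value of the recorded signal, a power characteristic value of the playback signal, and a power characteristic value of the echo signal;
    determine whether the power characteristic value of the recorded signal is greater than a first threshold, whether the power characteristic value of the playback signal is greater than a second threshold, and whether the power characteristic value of the echo signal is greater than a third threshold; and
    when the power characteristic value of the recorded signal is greater than the first threshold, the power value of the playback signal is greater than the second threshold, and the power characteristic value of the echo signal is greater than the third threshold, calculate the square of the spectrum estimate of the echo signal to obtain the power spectrum of the echo signal.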
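  16. The apparatus according to claim 12, wherein the apparatus is further configured to:
    determine whether the power characteristic value of the recorded signal is less than the first threshold and whether the power characteristic value of the echo signal is less than the third threshold; and
    when the power characteristic value of the recorded signal is less than the first threshold and the power characteristic value of the echo signal is less than the third threshold, subtract the power spectrum of the echo signal from the power spectrum of the recorded signal to obtain the power spectrum of the noise signal.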
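  17. The apparatus according to claim 12, wherein calculating the frequency emphasis coefficient according to the power spectrum of the echo signal and the power spectrum of the noise signal comprises:
    constructing a speech intelligibility function according to the power spectrum of the echo signal and the power spectrum of the noise signal; and
    obtaining the frequency emphasis coefficient according to a maximum of the speech intelligibility function, under the condition that the power spectrum of the echo signal remains unchanged.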
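  18. A computer-readable storage medium storing program instructions that, when executed by a processor of a computing device, configure the device to perform the method according to any one of claims 1-9.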
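
The steps recited in claims 4 and 5, together with the adjustment step of claim 1, can be illustrated with a minimal Python sketch. This is not the applicant's implementation. It assumes the recorded and playback signals have already been converted to per-frame complex spectra. The function names, the small regularizer `eps`, and the flooring of the noise power at zero are assumptions made for illustration. The frequency emphasis coefficients are taken as a given input, because the claims do not state the form of the speech intelligibility function.

```python
from typing import List, Sequence

Spectrum = Sequence[complex]


def loop_transfer_function(rec_frames: Sequence[Spectrum],
                           play_frames: Sequence[Spectrum],
                           eps: float = 1e-12) -> List[complex]:
    """Estimate H(k) as cross-correlation / autocorrelation per frequency bin.

    rec_frames[t][k]  : spectrum of the recorded (microphone) signal
    play_frames[t][k] : spectrum of the playback (loudspeaker) signal
    eps is a hypothetical regularizer that avoids division by zero.
    """
    n_bins = len(play_frames[0])
    h = []
    for k in range(n_bins):
        # Frequency-domain cross-correlation between recording and playback.
        cross = sum(y[k] * x[k].conjugate()
                    for y, x in zip(rec_frames, play_frames))
        # Frequency-domain autocorrelation of the playback signal.
        auto = sum(abs(x[k]) ** 2 for x in play_frames)
        h.append(cross / (auto + eps))
    return h


def echo_and_noise_power(rec: Spectrum, play: Spectrum, h: Spectrum):
    """Return (echo power spectrum, noise power spectrum) for one frame."""
    rec_power = [abs(y) ** 2 for y in rec]
    # Echo spectrum estimate is H(k) * X(k); its squared magnitude is the power.
    echo_power = [abs(hk * xk) ** 2 for hk, xk in zip(h, play)]
    # Noise power = recorded power - echo power (floored at 0: an assumption).
    noise_power = [max(p - e, 0.0) for p, e in zip(rec_power, echo_power)]
    return echo_power, noise_power


def apply_emphasis(play: Spectrum, gains: Sequence[float]) -> List[complex]:
    """Scale each frequency-bin amplitude of the playback spectrum."""
    return [g * x for g, x in zip(gains, play)]
```

In a noise-free toy case where every recorded frame equals `H(k) * X(k)`, `loop_transfer_function` recovers `H(k)` up to the regularizer, and the noise power returned by `echo_and_noise_power` is approximately zero.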
PCT/CN2016/083622 2015-11-04 2016-05-27 Speech signal processing method and apparatus WO2017075979A1 (zh)

Priority Applications (5)

Application Number Priority Date Filing Date Title
JP2017553962A JP6505252B2 (ja) 2015-11-04 2016-05-27 Method and apparatus for processing a speech signal
EP16861250.5A EP3373300B1 (en) 2015-11-04 2016-05-27 Method and apparatus for processing voice signal
KR1020177029724A KR101981879B1 (ko) 2015-11-04 2016-05-27 Method and apparatus for processing a speech signal
US15/691,300 US10586551B2 (en) 2015-11-04 2017-08-30 Speech signal processing method and apparatus
US16/774,854 US10924614B2 (en) 2015-11-04 2020-01-28 Speech signal processing method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510741057.1 2015-11-04
CN201510741057.1A CN105280195B (zh) 2015-11-04 2015-11-04 Speech signal processing method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/691,300 Continuation-In-Part US10586551B2 (en) 2015-11-04 2017-08-30 Speech signal processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2017075979A1 true WO2017075979A1 (zh) 2017-05-11

Family

ID=55149085

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/083622 WO2017075979A1 (zh) 2015-11-04 2016-05-27 Speech signal processing method and apparatus

Country Status (7)

Country Link
US (2) US10586551B2 (zh)
EP (1) EP3373300B1 (zh)
JP (1) JP6505252B2 (zh)
KR (1) KR101981879B1 (zh)
CN (1) CN105280195B (zh)
MY (1) MY179978A (zh)
WO (1) WO2017075979A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390947A (zh) * 2018-04-23 2019-10-29 北京京东尚科信息技术有限公司 Method, system, device, and storage medium for determining a sound source position

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105280195B (zh) 2015-11-04 2018-12-28 腾讯科技(深圳)有限公司 Speech signal processing method and apparatus
US20170330564A1 (en) * 2016-05-13 2017-11-16 Bose Corporation Processing Simultaneous Speech from Distributed Microphones
CN106506872B (zh) * 2016-11-02 2019-05-24 腾讯科技(深圳)有限公司 Call state detection method and apparatus
WO2018054171A1 (zh) 2016-09-22 2018-03-29 腾讯科技(深圳)有限公司 Call method and apparatus, computer storage medium, and terminal
CN108447472B (zh) * 2017-02-16 2022-04-05 腾讯科技(深圳)有限公司 Voice wake-up method and apparatus
CN106878575B (zh) * 2017-02-24 2019-11-05 成都喜元网络科技有限公司 Residual echo estimation method and apparatus
CN107833579B (zh) * 2017-10-30 2021-06-11 广州酷狗计算机科技有限公司 Noise cancellation method and apparatus, and computer-readable storage medium
CN108200526B (zh) * 2017-12-29 2020-09-22 广州励丰文化科技股份有限公司 Sound system tuning method and apparatus based on a confidence curve
US11335357B2 (en) * 2018-08-14 2022-05-17 Bose Corporation Playback enhancement in audio systems
CN109727605B (zh) * 2018-12-29 2020-06-12 苏州思必驰信息科技有限公司 Method and system for processing sound signals
KR20210072384A (ko) 2019-12-09 2021-06-17 삼성전자주식회사 Electronic device and control method thereof
CN111048096B (zh) * 2019-12-24 2022-07-26 大众问问(北京)信息科技有限公司 Speech signal processing method, apparatus, and terminal
CN111048118B (zh) * 2019-12-24 2022-07-26 大众问问(北京)信息科技有限公司 Speech signal processing method, apparatus, and terminal
CN111128194A (zh) * 2019-12-31 2020-05-08 云知声智能科技股份有限公司 System and method for improving online speech recognition performance
CN112203188B (zh) * 2020-07-24 2021-10-01 北京工业大学 Automatic volume adjustment method
KR102424795B1 (ko) * 2020-08-25 2022-07-25 서울과학기술대학교 산학협력단 Speech segment detection method
CN111986688B (zh) * 2020-09-09 2024-07-23 北京小米松果电子有限公司 Method, apparatus, and medium for improving speech clarity
CN112259125B (zh) * 2020-10-23 2023-06-16 江苏理工学院 Noise-based comfort evaluation method, system, device, and storage medium
US11610598B2 (en) * 2021-04-14 2023-03-21 Harris Global Communications, Inc. Voice enhancement in presence of noise
CN112820311A (zh) * 2021-04-16 2021-05-18 成都启英泰伦科技有限公司 Echo cancellation method and apparatus based on spatial prediction
CN114822571A (zh) * 2021-04-25 2022-07-29 美的集团(上海)有限公司 Echo cancellation method and apparatus, electronic device, and storage medium
CN113178192B (zh) * 2021-04-30 2024-05-24 平安科技(深圳)有限公司 Training method, apparatus, device, and storage medium for a speech recognition model
CN115665642B (zh) * 2022-12-12 2023-03-17 杭州兆华电子股份有限公司 Noise cancellation method and system
DE202023103428U1 (de) 2023-06-21 2023-06-28 Richik Kashyap A speech quality estimation system for real signals based on non-negative frequency-weighted energy

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
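CN101763858A (zh) * 2009-10-19 2010-06-30 瑞声声学科技(深圳)有限公司 Dual-microphone signal processing method
CN102893331A (zh) * 2010-05-20 2013-01-23 高通股份有限公司 Method, apparatus, and computer-readable medium for processing speech signals using a head-mounted microphone pair
CN103606374A (zh) * 2013-11-26 2014-02-26 国家电网公司 Noise cancellation and echo suppression method and apparatus for a thin terminal
CN104050971A (zh) * 2013-03-15 2014-09-17 杜比实验室特许公司 Acoustic echo mitigation apparatus and method, audio processing apparatus, and voice communication terminal
CN105280195A (zh) * 2015-11-04 2016-01-27 腾讯科技(深圳)有限公司 Speech signal processing method and apparatus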

Family Cites Families (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH04100460A (ja) * 1990-08-20 1992-04-02 Nippon Telegr & Teleph Corp <Ntt> Method for measuring distortion of a telephone set
JP3397269B2 (ja) * 1994-10-26 2003-04-14 日本電信電話株式会社 Multi-channel echo cancellation method
IL115892A (en) * 1994-11-10 1999-05-09 British Telecomm Interference detection system for telecommunications
JP3420705B2 (ja) * 1998-03-16 2003-06-30 日本電信電話株式会社 Echo suppression method and apparatus, and computer-readable storage medium storing an echo suppression program
EP0980064A1 (de) * 1998-06-26 2000-02-16 Ascom AG Method for performing a machine-assisted assessment of the transmission quality of audio signals
KR100723283B1 (ko) * 1999-06-24 2007-05-30 코닌클리케 필립스 일렉트로닉스 엔.브이. Adaptive filter for acoustic echo and noise cancellation
WO2002013572A2 (en) * 2000-08-07 2002-02-14 Audia Technology, Inc. Method and apparatus for filtering and compressing sound signals
US7117145B1 (en) * 2000-10-19 2006-10-03 Lear Corporation Adaptive filter for speech enhancement in a noisy environment
US7171003B1 (en) * 2000-10-19 2007-01-30 Lear Corporation Robust and reliable acoustic echo and noise cancellation system for cabin communication
DE10157535B4 (de) * 2000-12-13 2015-05-13 Jörg Houpert Method and device for reducing random, continuous, non-stationary interference in audio signals
WO2003083828A1 (en) * 2002-03-27 2003-10-09 Aliphcom Nicrophone and voice activity detection (vad) configurations for use with communication systems
JP3864914B2 (ja) * 2003-01-20 2007-01-10 ソニー株式会社 Echo suppression device
EP1591995B1 (en) * 2004-04-29 2019-06-19 Harman Becker Automotive Systems GmbH Indoor communication system for a vehicular cabin
US7454332B2 (en) * 2004-06-15 2008-11-18 Microsoft Corporation Gain constrained noise suppression
CN1321400C (zh) * 2005-01-18 2007-06-13 中国电子科技集团公司第三十研究所 Bark spectral distortion measure method based on a noise masking threshold algorithm in objective sound quality evaluation
US8594320B2 (en) * 2005-04-19 2013-11-26 (Epfl) Ecole Polytechnique Federale De Lausanne Hybrid echo and noise suppression method and device in a multi-channel audio signal
CN101233561B (zh) * 2005-08-02 2011-07-13 皇家飞利浦电子股份有限公司 Enhancing speech intelligibility in a mobile communication device by controlling the operation of a vibrator according to background noise
JP4671303B2 (ja) * 2005-09-02 2011-04-13 国立大学法人北陸先端科学技術大学院大学 Post-filter for a microphone array
ATE492979T1 (de) * 2005-09-20 2011-01-15 Ericsson Telefon Ab L M Method for measuring speech intelligibility
US8046218B2 (en) * 2006-09-19 2011-10-25 The Board Of Trustees Of The University Of Illinois Speech and method for identifying perceptual features
JP4509126B2 (ja) * 2007-01-24 2010-07-21 沖電気工業株式会社 Echo canceller and echo cancellation method
US8954324B2 (en) * 2007-09-28 2015-02-10 Qualcomm Incorporated Multiple microphone voice activity detector
ATE521064T1 (de) * 2007-10-08 2011-09-15 Harman Becker Automotive Sys Gain and spectral shape adjustment in audio signal processing
DE602007007090D1 (de) * 2007-10-11 2010-07-22 Koninkl Kpn Nv Method and system for measuring the speech intelligibility of an audio transmission system
US8412525B2 (en) * 2009-04-30 2013-04-02 Microsoft Corporation Noise robust speech classifier ensemble
CN101582264A (zh) * 2009-06-12 2009-11-18 瑞声声学科技(深圳)有限公司 Speech enhancement method and speech-enhancing sound acquisition system
GB2493327B (en) * 2011-07-05 2018-06-06 Skype Processing audio signals
DK2563045T3 (da) * 2011-08-23 2014-10-27 Oticon As Method and a binaural listening system for maximizing a better-ear effect
CN102306496B (zh) * 2011-09-05 2014-07-09 歌尔声学股份有限公司 Multi-microphone array noise cancellation method, apparatus, and system
CN102510418B (zh) * 2011-10-28 2015-11-25 声科科技(南京)有限公司 Method and apparatus for measuring speech intelligibility in a noisy environment
CN103578479B (zh) * 2013-09-18 2016-05-25 中国人民解放军电子工程学院 Speech intelligibility measurement method based on the auditory masking effect
US10262677B2 (en) * 2015-09-02 2019-04-16 The University Of Rochester Systems and methods for removing reverberation from audio signals
US10403299B2 (en) * 2017-06-02 2019-09-03 Apple Inc. Multi-channel speech signal enhancement for robust voice trigger detection and automatic speech recognition
US20180358032A1 (en) * 2017-06-12 2018-12-13 Ryo Tanaka System for collecting and processing audio signals

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
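CN110390947A (zh) * 2018-04-23 2019-10-29 北京京东尚科信息技术有限公司 Method, system, device, and storage medium for determining a sound source position
CN110390947B (zh) * 2018-04-23 2024-04-05 北京京东尚科信息技术有限公司 Method, system, device, and storage medium for determining a sound source position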

Also Published As

Publication number Publication date
CN105280195B (zh) 2018-12-28
EP3373300A1 (en) 2018-09-12
JP2018517167A (ja) 2018-06-28
US20200168237A1 (en) 2020-05-28
EP3373300A4 (en) 2019-07-31
CN105280195A (zh) 2016-01-27
US10586551B2 (en) 2020-03-10
EP3373300B1 (en) 2020-09-16
MY179978A (en) 2020-11-19
US20170365270A1 (en) 2017-12-21
KR20170129211A (ko) 2017-11-24
US10924614B2 (en) 2021-02-16
KR101981879B1 (ko) 2019-05-23
JP6505252B2 (ja) 2019-04-24

Similar Documents

Publication Publication Date Title
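WO2017075979A1 (zh) Speech signal processing method and apparatus
US10609483B2 (en) Method for sound effect compensation, non-transitory computer-readable storage medium, and terminal device
EP3547659B1 (en) Method for processing audio signal and related products
JP5876154B2 (ja) Electronic device for controlling noise
US20230008818A1 (en) Sound masking method and apparatus, and terminal device
WO2017143805A1 (zh) Echo cancellation method and apparatus, and computer storage medium
CN108540900B (zh) Volume adjustment method and related products
US10687142B2 (en) Method for input operation control and related products
US10878833B2 (en) Speech processing method and terminal
JP2016541222A (ja) System and method for feedback detection
US20140341386A1 (en) Noise reduction
CN111083297A (zh) Echo cancellation method and electronic device
CN110995909B (zh) Sound compensation method and apparatus
CN111182118A (zh) Volume adjustment method and electronic device
CN111541975B (zh) Audio signal adjustment method and electronic device
CN116994596A (zh) Howling suppression method and apparatus, storage medium, and electronic device
WO2023284406A1 (zh) Call method and electronic device
CN115691524A (zh) Audio signal processing method, apparatus, device, and storage medium
CN106210951A (zh) Bluetooth headset adaptation method, apparatus, and terminal
WO2022254834A1 (ja) Signal processing device, signal processing method, and program
CN116246645A (zh) Speech processing method and apparatus, storage medium, and electronic device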

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16861250

Country of ref document: EP

Kind code of ref document: A1

REEP Request for entry into the european phase

Ref document number: 2016861250

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2017553962

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 20177029724

Country of ref document: KR

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE