WO2020107455A1 - Voice processing method and apparatus, storage medium, and electronic device - Google Patents


Info

Publication number
WO2020107455A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
speech signal
signal
windowed
voice
Application number
PCT/CN2018/118713
Other languages
French (fr)
Chinese (zh)
Inventor
陈岩 (CHEN Yan)
Original Assignee
深圳市欢太科技有限公司 (Shenzhen Heytap Technology Corp., Ltd.)
Oppo广东移动通信有限公司 (Guangdong OPPO Mobile Telecommunications Corp., Ltd.)
Application filed by 深圳市欢太科技有限公司 (Shenzhen Heytap Technology Corp., Ltd.) and Oppo广东移动通信有限公司 (Guangdong OPPO Mobile Telecommunications Corp., Ltd.)
Priority to CN201880098277.9A (granted as patent CN112997249B)
Priority to PCT/CN2018/118713
Publication of WO2020107455A1


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation

Definitions

  • the present application belongs to the technical field of electronic equipment, and particularly relates to a voice processing method, device, storage medium, and electronic equipment.
  • Embodiments of the present application provide a voice processing method, device, storage medium, and electronic equipment, which can construct a clearer, cleaner voice signal.
  • an embodiment of the present application provides a voice processing method, including: acquiring a reverberation speech signal; performing signal fidelity processing on the reverberation speech signal to obtain a first speech signal; performing Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal; calculating a second power spectrum according to the first power spectrum; and constructing a target clean speech signal according to the phase spectrum and the second power spectrum.
  • an embodiment of the present application provides a voice processing device, including:
  • an acquisition module for acquiring a reverberation voice signal;
  • a processing module configured to perform signal fidelity processing on the reverberation speech signal to obtain a first speech signal
  • a transformation module configured to perform Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal
  • a calculation module configured to calculate a second power spectrum according to the first power spectrum
  • the construction module is configured to construct a target clean speech signal according to the phase spectrum corresponding to the first speech signal and the second power spectrum.
  • an embodiment of the present application provides a storage medium on which a computer program is stored, wherein, when the computer program is executed on a computer, the computer is caused to execute the voice processing method provided in this embodiment.
  • an embodiment of the present application provides an electronic device, including a memory and a processor, where a computer program is stored in the memory, and the processor, by calling the computer program stored in the memory, is used to execute: acquiring a reverberation speech signal; performing signal fidelity processing on the reverberation speech signal to obtain a first speech signal; performing Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal; calculating a second power spectrum according to the first power spectrum; and constructing a target clean speech signal according to the phase spectrum and the second power spectrum.
  • the clean speech signal is not obtained directly from the reverberation speech signal; instead, signal fidelity processing is first performed on the reverberation speech signal to obtain the first speech signal, then the second power spectrum is calculated from the first power spectrum corresponding to the first speech signal, and finally a clean speech signal is constructed from the phase spectrum corresponding to the first speech signal and the second power spectrum.
  • the first speech signal is obtained by performing fidelity processing on the reverberation speech signal, and then the first speech signal is processed, so that a clean speech signal with higher definition can be constructed.
  • FIG. 1 is a first schematic flowchart of a voice processing method provided by an embodiment of the present application.
  • FIG. 2 is a second schematic flowchart of a voice processing method provided by an embodiment of the present application.
  • FIG. 3 is a third schematic flowchart of a voice processing method provided by an embodiment of the present application.
  • FIG. 4 is a fourth schematic flowchart of a voice processing method provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a voice processing device provided by an embodiment of the present application.
  • FIG. 6 is a first schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 7 is a second schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 1 is a first schematic flowchart of a voice processing method provided by an embodiment of the present application.
  • the flow of the voice processing method may include:
  • in addition to the direct sound that travels straight from the sound source, the microphone also receives reflections of the same source arriving along other paths, as well as unwanted sound waves (i.e. background noise) generated by other sound sources in the environment. Acoustically, a reflected wave delayed by more than about 50 ms is called an echo, and the combined effect of the remaining reflected waves is called reverberation.
  • the phenomenon of reverberation greatly reduces the intelligibility of the sound, thereby affecting call quality and the speech and voiceprint recognition rates. In this case, how to reduce the reverberation of the voice signal collected by the microphone is particularly important.
  • the signal received by the microphone end is easily affected by environmental reverberation.
  • voice is reflected multiple times by walls, floors, and furniture.
  • the signal received at the microphone is a mixture of direct sound and reflected sound, i.e. a reverberated voice signal. The reflected sound constitutes the reverb signal, and the direct sound is the clean voice signal.
  • the reverb signal will be delayed relative to the clean voice signal.
  • when the speaker is far from the microphone and the call environment is a relatively closed space, reverberation is easily produced.
  • as a result, the voice becomes unclear, affecting call quality.
  • the interference caused by the reverberation phenomenon also degrades the performance of the acoustic receiving system, and the performance of speech recognition and voiceprint recognition systems is significantly reduced.
  • the electronic device acquires the reverberation speech signal.
  • signal fidelity processing is performed on the reverberation speech signal to obtain a first speech signal.
  • the electronic device performs signal fidelity processing on the reverberation speech signal to obtain the first speech signal, so as to reduce the distortion rate of the signal.
  • the electronic device performs Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal.
  • a second power spectrum is calculated based on the first power spectrum.
  • the electronic device may calculate the second power spectrum according to the first power spectrum.
  • the target clean speech signal is constructed according to the phase spectrum and the second power spectrum corresponding to the first speech signal.
  • the electronic device constructs a target clean speech signal according to the phase spectrum and the second power spectrum corresponding to the first speech signal.
  • the clean speech signal is not obtained directly from the reverberation speech signal; instead, signal fidelity processing is first performed on the reverberation speech signal to obtain the first speech signal, then the second power spectrum is calculated from the first power spectrum corresponding to the first speech signal, and finally a clean voice signal is constructed from the phase spectrum corresponding to the first speech signal and the second power spectrum.
  • the first speech signal is obtained by performing fidelity processing on the reverberation speech signal, and then the first speech signal is processed, so that a clean speech signal with higher definition can be constructed.
  • FIG. 2 is a second schematic flowchart of a voice processing method provided by an embodiment of the present application.
  • the voice processing method may include:
  • the electronic device acquires a reverberation speech signal.
  • the electronic device may use a microphone to collect the reverberation voice signal to obtain the reverberation voice signal.
  • the electronic device performs windowing and framing processing on the reverberation speech signal to obtain a multi-frame windowed speech signal.
  • the electronic device may perform windowing and framing processing on the reverberation speech signal to obtain a multi-frame windowing speech signal.
  • the electronic device may use a frame length of 20 ms and a frame shift of 10 ms to frame the reverberation voice signal.
  • the multi-frame windowed signal obtained by the electronic device excludes at least the first frame of the windowed signal.
  • for example, suppose the electronic device performs windowing and framing on the reverberant speech signal to obtain 8 frames of windowed speech signal. The electronic device may take the last 7 frames, i.e. the windowed speech signals of frames 2 through 8, as the multi-frame windowed speech signal; the electronic device may also take the last 6 frames, i.e. the windowed speech signals of frames 3 through 8, as the multi-frame windowed speech signal.
  • windowing and framing of the reverberant speech signal is not limited to the above-mentioned methods, and may also be other methods, which are not specifically limited here.
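The windowing-and-framing step above can be sketched as follows. The 16 kHz sample rate and the Hamming window are illustrative assumptions; the text specifies only the 20 ms frame length and 10 ms frame shift.

```python
import numpy as np

def frame_and_window(signal, sr=16000, frame_ms=20, shift_ms=10):
    """Split a 1-D signal into overlapping frames and apply a Hamming window."""
    frame_len = int(sr * frame_ms / 1000)   # 320 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)       # 160 samples at 16 kHz
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // shift
    return np.stack([signal[i * shift : i * shift + frame_len] * window
                     for i in range(n_frames)])

x = np.random.randn(1600)   # 100 ms of noise at 16 kHz (stand-in input)
y = frame_and_window(x)     # overlapping windowed frames y(1), y(2), ...
```

With a 10 ms shift, consecutive frames overlap by half a frame, which is what later allows the clean frames to be overlap-added back into one signal.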
  • the electronic device performs signal fidelity processing on the windowed voice signal of each frame to obtain a multi-frame fidelity voice signal.
  • the electronic device may perform signal fidelity processing on each frame of the windowing signals to obtain multiple frames of fidelity voice signals.
  • the multi-frame fidelity speech signal constitutes the first speech signal.
  • the fidelity speech signal may be expressed as z(i), and i represents the number of frames.
  • for example, the electronic device obtains 5 frames of windowed speech signals, namely the 4th frame windowed speech signal y(4), the 5th frame windowed speech signal y(5), the 6th frame windowed speech signal y(6), the 7th frame windowed speech signal y(7), and the 8th frame windowed speech signal y(8). The electronic device then performs signal fidelity processing on these 5 frames of windowed speech signals to obtain 5 frames of fidelity speech signals, namely the 4th frame fidelity speech signal z(4), the 5th frame fidelity speech signal z(5), the 6th frame fidelity speech signal z(6), the 7th frame fidelity speech signal z(7), and the 8th frame fidelity speech signal z(8).
  • the fidelity processing of the signal can reduce the distortion rate of the signal.
  • the electronic device performs Fourier transform on each frame of the fidelity speech signal to obtain a phase spectrum and a power spectrum corresponding to the multi-frame fidelity speech signal, respectively.
  • the electronic device may perform Fourier transform on each frame of fidelity signals, thereby obtaining phase spectra and power spectra corresponding to the multi-frame fidelity speech signals, respectively.
  • the phase spectrum and the power spectrum corresponding to the multi-frame fidelity speech signal respectively constitute the phase spectrum and the first power spectrum corresponding to the first speech signal.
  • FFT: fast Fourier transform; DFT: discrete Fourier transform.
  • the phase spectrum corresponding to the first speech signal includes: arg[Z(4)], arg[Z(5)], arg[Z(6)], arg[Z(7)], arg[Z(8)].
  • the first power spectrum corresponding to the first speech signal includes: |Z(4)|², |Z(5)|², |Z(6)|², |Z(7)|², |Z(8)|².
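The Fourier-transform step can be sketched as follows; random frames stand in for the fidelity signals z(4) through z(8), and the frame length is an illustrative assumption.

```python
import numpy as np

frame_len = 320
frames = np.random.randn(5, frame_len)   # stand-ins for z(4)..z(8)
Z = np.fft.rfft(frames, axis=1)          # Z(i), one spectrum per frame
phase = np.angle(Z)                      # phase spectrum arg[Z(i)]
power = np.abs(Z) ** 2                   # first power spectrum |Z(i)|^2
```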
  • the electronic device calculates a third power spectrum according to the power spectrum corresponding to the fidelity speech signal of each frame to obtain a plurality of third power spectrums.
  • the electronic device may calculate a third power spectrum according to the power spectrum corresponding to each frame of fidelity speech signals to obtain multiple third power spectra.
  • a plurality of third power spectrums constitute a second power spectrum.
  • the electronic device can use the following formula to calculate the third power spectrum:
  • |S(i)|² = |Z(i)|² - γ·(λ * |Z|²)(i - ρ)
  • where |Z(i)|² represents the power spectrum corresponding to the i-th frame fidelity speech signal;
  • |S(i)|² represents the i-th third power spectrum, which is calculated from the power spectrum corresponding to the i-th frame fidelity speech signal;
  • ρ is the number of reverberation time shift frames;
  • γ is a gain value, representing the attenuation of the direct sound signal by a certain number of decibels;
  • λ(i) is a smoothing function, non-zero only for i > -a;
  • a is used to control the width of the smoothing function;
  • i represents the frame index, and * denotes convolution along the frame index.
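This spectral-subtraction computation can be sketched as follows. The Rayleigh-shaped smoothing function λ(i) = ((i+a)/a²)·exp(-(i+a)²/(2a²)) for i > -a (zero otherwise) is an assumption; it is a common choice in this style of dereverberation, but the text does not pin down λ. The values of γ, ρ, and a are illustrative, and the floor that keeps the result positive is an added safeguard.

```python
import numpy as np

def smoothing(i, a):
    # lambda(i): Rayleigh-shaped window, non-zero only for i > -a (assumed form)
    i = np.asarray(i, dtype=float)
    out = (i + a) / a**2 * np.exp(-((i + a) ** 2) / (2 * a**2))
    return np.where(i > -a, out, 0.0)

def third_power_spectrum(P, gamma=0.3, rho=7, a=5):
    """|S(i)|^2 = |Z(i)|^2 - gamma * (lambda * |Z|^2)(i - rho).

    P: per-frame power spectra, shape (n_frames, n_bins).
    gamma, rho, a are illustrative values; the text does not fix them.
    """
    n_frames, n_bins = P.shape
    offsets = np.arange(-a + 1, 3 * a)   # finite support of lambda
    taps = smoothing(offsets, a)
    S = np.empty_like(P)
    for i in range(n_frames):
        acc = np.zeros(n_bins)           # (lambda * P)(i - rho)
        for k, w in zip(offsets, taps):
            j = i - rho - k
            if 0 <= j < n_frames:
                acc += w * P[j]
        # subtract the late-reverberation estimate, flooring to stay positive
        S[i] = np.maximum(P[i] - gamma * acc, 1e-10 * P[i])
    return S

P = np.random.rand(20, 8) + 0.1          # stand-in power spectra |Z(i)|^2
S = third_power_spectrum(P)              # third power spectra |S(i)|^2
```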
  • for example, the electronic device obtains the power spectra corresponding to the 5 frames of fidelity speech signals, namely |Z(4)|², |Z(5)|², |Z(6)|², |Z(7)|², and |Z(8)|².
  • the electronic device can substitute these power spectra into the above formula to obtain the 5 third power spectra |S(4)|², |S(5)|², |S(6)|², |S(7)|², and |S(8)|².
  • the second power spectrum includes: |S(4)|², |S(5)|², |S(6)|², |S(7)|², |S(8)|².
  • the electronic device constructs a clean speech signal for each frame according to the phase spectrum corresponding to each frame of the fidelity speech signal and a plurality of third power spectra, to obtain multiple frames of clean speech signals.
  • for example, the electronic device obtains the phase spectra corresponding to 5 frames of fidelity speech signals and 5 third power spectra. The phase spectra corresponding to the 5 frames of fidelity speech signals are: arg[Z(4)], arg[Z(5)], arg[Z(6)], arg[Z(7)], arg[Z(8)].
  • the five third power spectra are: |S(4)|², |S(5)|², |S(6)|², |S(7)|², |S(8)|².
  • the electronic device can construct the first frame of the clean voice signal according to arg[Z(4)] and |S(4)|²;
  • the electronic device can construct the second frame of the clean speech signal according to arg[Z(5)] and |S(5)|²;
  • the electronic device can construct the third frame of the clean speech signal according to arg[Z(6)] and |S(6)|²;
  • the electronic device can construct the fourth frame of the clean speech signal according to arg[Z(7)] and |S(7)|²;
  • the electronic device can construct the fifth frame of the clean speech signal according to arg[Z(8)] and |S(8)|².
  • the electronic device performs windowing and framing processing on multiple frames of clean voice signals to obtain target clean voice signals.
  • the electronic device may perform windowing and framing processing on these 5 frames of clean voice signals in chronological order to obtain a target clean voice signal.
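The construction and synthesis steps above can be sketched as follows: each clean frame's spectrum is rebuilt from the magnitude sqrt(|S(i)|²) and the phase arg[Z(i)], inverse-transformed, and the frames are overlap-added in chronological order. The random spectra, frame length, and shift are illustrative stand-ins.

```python
import numpy as np

frame_len, shift, n_frames = 320, 160, 5
rng = np.random.default_rng(0)
phase = rng.uniform(-np.pi, np.pi, (n_frames, frame_len // 2 + 1))  # arg[Z(i)]
power = rng.random((n_frames, frame_len // 2 + 1))                  # |S(i)|^2

# S(i) = sqrt(|S(i)|^2) * exp(j * arg[Z(i)]), then back to the time domain
spectrum = np.sqrt(power) * np.exp(1j * phase)
clean_frames = np.fft.irfft(spectrum, n=frame_len, axis=1)

# overlap-add the clean frames into the target clean voice signal
out = np.zeros(shift * (n_frames - 1) + frame_len)
for i, frame in enumerate(clean_frames):
    out[i * shift : i * shift + frame_len] += frame
```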
  • the process 203 may include the following process:
  • the electronic device performs linear prediction and analysis on the windowed speech signal of the i-th frame and each frame after the i-th frame to obtain a linear prediction residual signal of the i-th frame and each frame after the i-th frame.
  • since linear prediction analysis of the current frame requires data from the frames preceding it, the electronic device does not perform linear prediction analysis starting from the first frame. Therefore, in this embodiment, the multi-frame windowed signal obtained by the electronic device excludes at least the first frame of the windowed signal.
  • the specific acquisition of windowed speech signals in the next few frames is determined according to the actual situation, and no specific restrictions are made here.
  • for example, the electronic device obtains 5 frames of windowed speech signals, namely the 4th frame windowed speech signal y(4), the 5th frame windowed speech signal y(5), the 6th frame windowed speech signal y(6), the 7th frame windowed speech signal y(7), and the 8th frame windowed speech signal y(8). Starting from the 4th frame, the electronic device performs linear prediction analysis on the windowed speech signals of frames 4 through 8 to obtain the 4th frame linear prediction residual signal w(4), the 5th frame linear prediction residual signal w(5), the 6th frame linear prediction residual signal w(6), the 7th frame linear prediction residual signal w(7), and the 8th frame linear prediction residual signal w(8).
  • the linear prediction residual signal can be expressed as: w(i), i is the number of frames.
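Linear prediction analysis of one windowed frame can be sketched as follows, using the autocorrelation normal equations; the prediction order of 12 and the small ridge term that stabilizes the solve are illustrative assumptions, not values from the text.

```python
import numpy as np

def lpc_residual(frame, order=12):
    """Linear prediction analysis of one windowed frame via the
    autocorrelation normal equations; returns the prediction residual."""
    n = len(frame)
    # autocorrelation r(0)..r(order)
    r = np.array([frame[: n - k] @ frame[k:] for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-6 * np.eye(order), r[1:])  # predictor coefficients
    pred = np.zeros(n)
    for t in range(order, n):
        pred[t] = a @ frame[t - order : t][::-1]          # predict from past samples
    return frame - pred                                   # residual w(i)

y4 = np.hamming(320) * np.random.randn(320)  # stand-in windowed frame y(4)
w4 = lpc_residual(y4)                        # linear prediction residual w(4)
```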
  • the electronic device uses an inverse filter to perform inverse filtering on the linear prediction residual signal of the i-th frame to obtain a filtered speech signal of the i-th frame.
  • the electronic device may use an inverse filter of length L to perform inverse filtering processing on the linear prediction residual signal of the i-th frame to obtain a filtered speech signal of the i-th frame.
  • the value of L can be determined according to the actual situation, and no specific restrictions are made here.
  • when the electronic device detects that the kurtosis of the i-th frame filtered speech signal is at its minimum, the electronic device obtains the inverse filter that minimizes the kurtosis of the i-th frame filtered speech signal, i.e. the inverse filter corresponding to the i-th frame windowed speech signal.
  • the electronic device may use an inverse filter of length L to perform inverse filtering on the fourth-frame linear prediction residual signal to obtain the fourth-frame filtered speech signal.
  • the electronic device continuously changes the parameters of the inverse filter, so that the filtered speech signal of the fourth frame continuously changes.
  • the electronic device continuously detects the changing kurtosis of the filtered speech signal in the fourth frame.
  • an inverse filter that minimizes the kurtosis of the filtered speech signal in the fourth frame is obtained to obtain the inverse filter g(4) corresponding to the windowed speech signal in the fourth frame .
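The kurtosis criterion can be sketched as follows. The text describes continuously varying the filter parameters until the output kurtosis is minimal; as a simplification, this sketch replaces that adaptation with a search over a small random candidate set, and the filter length L = 8 is an illustrative assumption.

```python
import numpy as np

def kurtosis(x):
    # Sample kurtosis E[x^4] / E[x^2]^2 of a (roughly zero-mean) signal.
    m2 = np.mean(x ** 2)
    return np.mean(x ** 4) / (m2 ** 2 + 1e-12)

rng = np.random.default_rng(1)
w4 = rng.standard_normal(320)              # stand-in for the residual w(4)
L = 8                                      # inverse-filter length (assumed)
candidates = rng.standard_normal((16, L))  # candidate inverse filters (assumed search)
scores = [kurtosis(np.convolve(g, w4)[:320]) for g in candidates]
g4 = candidates[int(np.argmin(scores))]    # filter whose output has minimal kurtosis
s4 = np.convolve(g4, w4)[:320]             # filtered speech signal s(4)
```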
  • the electronic device uses an inverse filter corresponding to the windowed i-frame speech signal to perform inverse filtering on the windowed i-frame speech signal to obtain the i-th frame fidelity speech signal.
  • z(i) = g(i) * y(i), where * denotes convolution.
  • z(i) represents the fidelity speech signal of the i-th frame
  • g(i) represents the inverse filter corresponding to the windowed speech signal of the i-th frame
  • y(i) represents the windowed speech signal of the i-th frame.
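The inverse-filtering step z(i) = g(i) * y(i) amounts to a convolution. In this sketch the filter and frame are random stand-ins, and the output is truncated to the frame length.

```python
import numpy as np

rng = np.random.default_rng(3)
y4 = rng.standard_normal(320)        # windowed speech signal y(4) (stand-in)
g4 = rng.standard_normal(8)          # inverse filter g(4) from the kurtosis search
z4 = np.convolve(g4, y4)[:len(y4)]   # fidelity speech signal z(4) = g(4) * y(4)
```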
  • for each frame after the i-th frame, the electronic device obtains the inverse filter corresponding to that frame's windowed speech signal according to the linear prediction residual signal of the preceding frame and the windowed speech signal of the preceding frame.
  • the determination method of the inverse filter corresponding to the windowed speech signal of each frame after the i-th frame is different from the determination method of the inverse filter corresponding to the windowed speech signal of the i-th frame.
  • for example, for each of frames 5 through 8, the electronic device can obtain the inverse filter corresponding to that frame's windowed speech signal according to the linear prediction residual signal of the preceding frame and the inverse filter corresponding to the windowed speech signal of the preceding frame.
  • the electronic device may obtain the inverse filter corresponding to the windowed speech signal in the fifth frame according to the linear prediction residual signal in the fourth frame and the windowed speech signal in the fourth frame.
  • the electronic device can obtain the inverse filter corresponding to the windowed voice signal in the sixth frame, the inverse filter corresponding to the windowed voice signal in the seventh frame, and the inverse filter corresponding to the windowed voice signal in the eighth frame.
  • the electronic device obtains the multi-frame fidelity speech signal after the i-th frame according to the inverse filter corresponding to the windowed speech signal for each frame after the i-th frame and the windowed speech signal for each frame after the i-th frame.
  • for example, according to the inverse filter g(5) corresponding to the windowed speech signal of the 5th frame and the windowed speech signal y(5) of the 5th frame, the fidelity speech signal z(5) = g(5) * y(5) of the 5th frame is obtained.
  • similarly, the electronic device can obtain the 6th frame fidelity speech signal z(6), the 7th frame fidelity speech signal z(7), and the 8th frame fidelity speech signal z(8), that is, the multi-frame fidelity speech signal after the 4th frame.
  • the electronic device combines the fidelity speech signal of the i-th frame and the multi-frame fidelity speech signal after the i-th frame to obtain a multi-frame fidelity speech signal.
  • the electronic device combines the 4th frame fidelity speech signal z(4) with the 5th through 8th frame fidelity speech signals z(5), z(6), z(7), and z(8) to obtain 5 frames of fidelity speech signals.
  • the electronic device when the electronic device detects that the i-th frame filtered speech signal has the smallest kurtosis, the electronic device may acquire the i-th frame filtered speech signal with the smallest kurtosis.
  • the electronic device may use an inverse filter to perform inverse filtering on the linear prediction residual signal of the fourth frame to obtain the filtered speech signal of the fourth frame.
  • the electronic device continuously changes the parameters of the inverse filter, so that the filtered speech signal of the fourth frame continuously changes.
  • the electronic device continuously detects the changing kurtosis of the filtered speech signal in the fourth frame.
  • when the electronic device detects that the kurtosis of the 4th frame filtered speech signal is at its minimum, it acquires the 4th frame filtered speech signal s(4) with the smallest kurtosis.
  • the process 2035 may include the following processes:
  • the electronic device obtains the inverse filter corresponding to the windowed speech signal of the (i+1)-th frame according to the linear prediction residual signal of the i-th frame, the inverse filter corresponding to the windowed speech signal of the i-th frame, and the i-th frame filtered speech signal with the smallest kurtosis.
  • for each frame after the (i+1)-th frame, the electronic device applies the inverse filter corresponding to the windowed speech signal of the preceding frame to the linear prediction residual signal of the preceding frame, to obtain the filtered speech signal of the preceding frame.
  • for each frame after the (i+1)-th frame, the electronic device obtains the inverse filter corresponding to that frame's windowed speech signal according to the linear prediction residual signal of the preceding frame, the inverse filter corresponding to the windowed speech signal of the preceding frame, and the filtered speech signal of the preceding frame.
  • the calculation formula of the inverse filter corresponding to the windowed speech signal of each frame after the i+1th frame can be:
  • g(i+j+1) = g(i+j) + μ·e(i+j)·w(i+j), where s(i+j) represents the filtered speech signal of frame i+j, g(i+j) represents the inverse filter corresponding to the windowed speech signal of frame i+j, g(i+j+1) represents the inverse filter corresponding to the windowed speech signal of frame i+j+1, and μ is a step size.
  • from g(i+j+1) = g(i+j) + μ·e(i+j)·w(i+j) and s(i+j) = g(i+j) * w(i+j), it can be determined that the inverse filter corresponding to the windowed speech signal of the current frame can be determined from the linear prediction residual signal of the frame preceding the current frame, the inverse filter corresponding to the windowed speech signal of the preceding frame, and the filtered speech signal of the preceding frame.
  • for example, the inverse filter corresponding to the windowed speech signal of frame i+2 can be determined from the linear prediction residual signal of frame i+1, the inverse filter corresponding to the windowed speech signal of frame i+1, and the filtered speech signal of frame i+1.
  • before performing the process 20353, the electronic device has already obtained the windowed speech signal and the linear prediction residual signal of the (i+1)-th frame. Therefore, in order to determine the inverse filter corresponding to the windowed speech signal of frame i+2, the filtered speech signal of frame i+1 needs to be determined, that is, the filtered speech signal of the frame preceding the frame-(i+2) filtered speech signal.
  • s(i+j) = g(i+j) * w(i+j), where s(i+j) represents the filtered speech signal of frame i+j, g(i+j) represents the inverse filter corresponding to the windowed speech signal of frame i+j, w(i+j) represents the linear prediction residual signal of frame i+j, and j ≥ 1.
  • in this way, the inverse filter corresponding to the windowed speech signal of frame i+2, that is, the inverse filter g(6) corresponding to the windowed speech signal of the 6th frame, can be obtained.
  • the electronic device obtains the inverse filter corresponding to the windowed speech signal of each frame after the i-th frame according to the inverse filter corresponding to the windowed speech signal of the (i+1)-th frame and the inverse filters corresponding to the windowed speech signals of each frame after the (i+1)-th frame.
  • for example, from the inverse filter g(5) corresponding to the windowed speech signal of the 5th frame, the inverse filter g(6) corresponding to the windowed speech signal of the 6th frame, the inverse filter g(7) corresponding to the windowed speech signal of the 7th frame, and the inverse filter g(8) corresponding to the windowed speech signal of the 8th frame, the electronic device obtains the inverse filter corresponding to the windowed speech signal of each frame after the 4th frame.
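The frame-to-frame update above can be sketched with an LMS-style rule. The error term e(i+j) is not defined in the text; as an assumption it is taken to be the previous frame's filtered output s(i+j), the per-tap product e(i+j)·w(i+j) is computed as a cross-correlation, and the step size μ and all lengths are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
L, frame_len, mu = 8, 320, 1e-4          # filter length, frame length, step size (assumed)
g = rng.standard_normal(L)               # inverse filter g(i+1) from the previous step
w = rng.standard_normal((4, frame_len))  # residuals w(i+1)..w(i+4) (stand-ins)

filters = [g]
for j in range(3):
    # s(i+j) = g(i+j) * w(i+j): filtered speech signal of the previous frame
    s = np.convolve(filters[-1], w[j])[:frame_len]
    # g(i+j+1) = g(i+j) + mu * e(i+j) w(i+j); e is taken to be s here (assumption)
    update = np.array([s[L - 1:] @ w[j][L - 1 - k : frame_len - k]
                       for k in range(L)])
    filters.append(filters[-1] + mu * update)
```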
  • the process 201 may include the following process:
  • the electronic device detects whether the user's voice information includes preset keywords
  • the electronic device acquires the reverberation voice signal.
  • the voice information of the user may be the voice information of the callee.
  • the electronic device obtains the voice information of the callee who is in a call with the current user, and then detects whether the callee's voice information includes preset keywords.
  • the preset keywords may be "inaudible", "say again” and so on.
  • when the voice information of the callee includes a preset keyword such as "inaudible", the current user may be too far from the electronic device, so that the signal collected by the electronic device is a mixture of direct sound and reflected sound, i.e. the reverberation speech signal in this embodiment. Therefore, when the electronic device detects that the user's voice information includes a preset keyword, the electronic device can acquire the reverberation voice signal through the microphone.
  • when the electronic device performs the process of acquiring the reverberation voice signal if the user's voice information includes preset keywords, the following processes may be performed:
  • if the user's voice information includes a preset keyword, the electronic device generates a record and saves the record;
  • when the number of saved records exceeds a preset number threshold, the electronic device acquires the reverberation voice signal.
  • the electronic device's current network signal may be poor, which can cause the electronic device to detect that the voice information of the call counterpart includes a preset keyword such as "inaudible". Therefore, each time the electronic device detects that the voice information of the call counterpart includes such a preset keyword, it can generate a record and save the record.
  • when the number of saved records exceeds the preset number threshold, the electronic device acquires the reverberation voice signal.
  • the preset number threshold can be set by the user or determined by the electronic device, and is not specifically limited here. Assuming the preset number threshold is set to 10, the electronic device will acquire the reverberation speech signal when the number of saved records reaches 11.
  • the electronic device may delete the saved record and stop performing the process of detecting whether the user's voice information includes the preset keyword.
  • the electronic device can stop the process of acquiring the reverberation voice signal.
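The record-counting trigger described above can be sketched as follows; the function and variable names are hypothetical, and the threshold of 10 follows the example in the text.

```python
# Each detected preset keyword adds one record; dereverberation starts only
# once the number of saved records exceeds the preset threshold.
PRESET_KEYWORDS = ("inaudible", "say again")
THRESHOLD = 10

records = []

def on_voice_info(text):
    """Return True when the reverberation signal should be acquired."""
    if any(k in text for k in PRESET_KEYWORDS):
        records.append(text)            # generate and save a record
    return len(records) > THRESHOLD     # e.g. triggers at the 11th record

triggered = [on_voice_info("inaudible, say that again?") for _ in range(11)]
```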
  • the process 201 may include the following process:
  • the electronic device detects whether the distance between the sound source and the electronic device is greater than a preset distance threshold
  • the electronic device acquires the reverberation voice signal.
  • when the sound source is far from the electronic device, the electronic device collects a mixed signal of direct sound and reflected sound, that is, a reverberated speech signal. The reverberated speech signal exhibits reverberation, and this reverberation can interfere with the voiceprint recognition and voice recognition results of the electronic device.
  • therefore, the electronic device can detect whether the distance between the sound source and the electronic device is greater than a preset distance threshold, and if it is, the electronic device acquires the reverberation voice signal.
  • the preset distance threshold can be set according to the actual situation, and no specific limitation is made here.
  • FIG. 5 is a schematic structural diagram of a voice processing device 300 according to an embodiment of the present application.
  • the voice processing apparatus 300 may include: an acquisition module 301, a processing module 302, a transformation module 303, a calculation module 304, and a construction module 305.
  • the obtaining module 301 is used to obtain a reverberation voice signal
  • the processing module 302 is configured to perform signal fidelity processing on the reverberation speech signal to obtain a first speech signal.
  • the transformation module 303 is configured to perform Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal.
  • the calculation module 304 is configured to calculate a second power spectrum according to the first power spectrum.
  • the construction module 305 is configured to construct a target clean speech signal according to the phase spectrum corresponding to the first speech signal and the second power spectrum.
  • the acquisition module 301 may be used to: perform windowing and framing processing on the reverberation speech signal to obtain a multi-frame windowed speech signal;
  • the processing module 302 may be used to: perform signal fidelity processing on the windowed voice signal of each frame to obtain a multi-frame fidelity voice signal, and the multi-frame fidelity voice signal constitutes the first voice signal;
  • the transformation module 303 may be used to: perform a Fourier transform on each frame of the fidelity speech signal to obtain a phase spectrum and a power spectrum corresponding to each frame of the multi-frame fidelity speech signal; the phase spectra and power spectra corresponding to the multi-frame fidelity speech signal constitute the phase spectrum and the first power spectrum corresponding to the first speech signal;
  • the calculation module 304 may be used to: calculate a third power spectrum according to the power spectrum corresponding to each frame of the fidelity speech signal, so as to obtain a plurality of third power spectra; the plurality of third power spectra constitute the second power spectrum;
  • the construction module 305 may be used to: construct a clean speech signal for each frame according to the phase spectrum corresponding to each frame of the fidelity speech signal and the plurality of third power spectra, so as to obtain multiple frames of clean speech signal; and perform windowing and framing processing on the multiple frames of clean speech signal to obtain the target clean speech signal.
  • the processing module 302 may be used to: perform linear prediction analysis on the i-th frame and each frame of windowed speech signal after the i-th frame, to obtain a linear prediction residual signal for the i-th frame and each frame after it; use an inverse filter to inverse-filter the i-th frame linear prediction residual signal to obtain an i-th frame filtered speech signal; when the i-th frame filtered speech signal has the smallest kurtosis, take the inverse filter that minimizes the kurtosis of the i-th frame filtered speech signal as the inverse filter corresponding to the i-th frame windowed speech signal; use the inverse filter corresponding to the i-th frame windowed speech signal to inverse-filter the i-th frame windowed speech signal to obtain an i-th frame fidelity speech signal; and, for each frame after the i-th frame, obtain the inverse filter corresponding to that frame's windowed speech signal according to the previous frame's linear prediction residual signal and the inverse filter corresponding to the previous frame's windowed speech signal, and inverse-filter that frame's windowed speech signal accordingly to obtain the corresponding fidelity speech signal;
  • the processing module 302 may further be used to: when the i-th frame filtered speech signal has the smallest kurtosis, obtain the i-th frame filtered speech signal with the smallest kurtosis; obtain the inverse filter corresponding to the (i+1)-th frame windowed speech signal according to the i-th frame linear prediction residual signal, the inverse filter corresponding to the i-th frame windowed speech signal, and the i-th frame filtered speech signal with the smallest kurtosis; and, for each frame after the (i+1)-th frame, obtain the previous frame's filtered speech signal according to the inverse filter corresponding to the previous frame's windowed speech signal and the previous frame's linear prediction residual signal, and then obtain the inverse filter corresponding to that frame's windowed speech signal according to the previous frame's linear prediction residual signal, the inverse filter corresponding to the previous frame's windowed speech signal, and the previous frame's filtered speech signal;
  • the obtaining module 301 may be used to: obtain the user's voice information when the electronic device is in a call state; detect whether the user's voice information includes a preset keyword; and, if the user's voice information includes the preset keyword, obtain the reverberation voice signal.
  • the acquisition module 301 may be used to: if the user's voice information includes the preset keyword, generate a record and save the record; and, when the number of saved records is greater than the preset number threshold, obtain the reverberation speech signal.
  • the acquisition module 301 may be used to: when the electronic device is about to perform voiceprint recognition or voice recognition, detect whether the distance between the sound source and the electronic device is greater than a preset distance threshold; and, if the distance between the sound source and the electronic device is greater than the preset distance threshold, acquire the reverberation voice signal.
  • An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored.
  • when the computer program is executed on a computer, the computer is caused to execute the processes in the voice processing method provided in this embodiment.
  • An embodiment of the present application also provides an electronic device, including a memory and a processor, where a computer program is stored in the memory, and the processor, by calling the computer program stored in the memory, is used to execute the processes in the voice processing method.
  • the aforementioned electronic device may be a mobile terminal such as a tablet computer or a smart phone.
  • FIG. 6 is a first schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • the electronic device 400 may include components such as a microphone 401, a memory 402, and a processor 403. Those skilled in the art can understand that the structure of the electronic device shown in FIG. 6 does not constitute a limitation on the electronic device, which may include more or fewer components than those illustrated, combine certain components, or arrange the components differently.
  • the microphone 401 can be used to pick up the voice uttered by the user and the like.
  • the memory 402 may be used to store application programs and data.
  • the application program stored in the memory 402 contains executable code.
  • the application program can form various functional modules.
  • the processor 403 executes application programs stored in the memory 402 to execute various functional applications and data processing.
  • the processor 403 is the control center of the electronic device. It uses various interfaces and lines to connect the various parts of the entire electronic device, and performs the various functions of the electronic device and processes data by running or executing the application programs stored in the memory 402 and calling the data stored in the memory 402, so as to monitor the electronic device as a whole.
  • the processor 403 in the electronic device loads the executable code corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 403 runs the application programs stored in the memory 402, thereby implementing the following processes:
  • acquiring a reverberation speech signal; performing signal fidelity processing on the reverberation speech signal to obtain a first speech signal; performing a Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal; calculating a second power spectrum according to the first power spectrum; and constructing a target clean speech signal according to the phase spectrum corresponding to the first speech signal and the second power spectrum.
  • FIG. 7 is a second schematic structural diagram of an electronic device according to an embodiment of the present application.
  • the electronic device 500 may include components such as a microphone 501, a memory 502, a processor 503, an input unit 504, an output unit 505, a speaker 506, and the like.
  • the microphone 501 can be used to pick up the voice uttered by the user and the like.
  • the memory 502 may be used to store application programs and data.
  • the application program stored in the memory 502 contains executable code.
  • the application program can form various functional modules.
  • the processor 503 executes application programs stored in the memory 502 to execute various functional applications and data processing.
  • the processor 503 is the control center of the electronic device. It uses various interfaces and lines to connect the various parts of the entire electronic device, and performs the various functions of the electronic device and processes data by running or executing the application programs stored in the memory 502 and calling the data stored in the memory 502, so as to monitor the electronic device as a whole.
  • the input unit 504 may be used to receive input numbers, character information, or user characteristic information (such as fingerprints), and generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
  • the output unit 505 may be used to display information input by the user or provided to the user and various graphical user interfaces of the electronic device. These graphical user interfaces may be composed of graphics, text, icons, video, and any combination thereof.
  • the output unit may include a display panel.
  • the speaker 506 may be used to convert electrical signals into sound.
  • the processor 503 in the electronic device loads the executable code corresponding to the processes of one or more application programs into the memory 502 according to the following instructions, and the processor 503 runs the application programs stored in the memory 502, thereby implementing the following processes:
  • acquiring a reverberation speech signal; performing signal fidelity processing on the reverberation speech signal to obtain a first speech signal; performing a Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal; calculating a second power spectrum according to the first power spectrum; and constructing a target clean speech signal according to the phase spectrum corresponding to the first speech signal and the second power spectrum.
  • the processor 503 may further perform: performing windowing and framing processing on the reverberation speech signal to obtain multiple frames of windowed speech signal. When executing the process of performing signal fidelity processing on the reverberation speech signal to obtain the first speech signal, the processor 503 may perform: performing signal fidelity processing on each frame of the windowed speech signal to obtain multiple frames of fidelity speech signal, the multiple frames of fidelity speech signal constituting the first speech signal.
  • when the processor 503 executes the process of performing a Fourier transform on the first speech signal to obtain the phase spectrum and the first power spectrum corresponding to the first speech signal, it may perform: performing a Fourier transform on each frame of the fidelity speech signal to obtain a phase spectrum and a power spectrum corresponding to each frame; the phase spectra and power spectra corresponding to the multiple frames of fidelity speech signal constitute the phase spectrum and the first power spectrum corresponding to the first speech signal.
  • when the processor 503 executes the process of calculating the second power spectrum according to the first power spectrum, it may perform: calculating a third power spectrum according to the power spectrum corresponding to each frame of the fidelity speech signal, so as to obtain a plurality of third power spectra; the plurality of third power spectra constitute the second power spectrum.
  • when the processor 503 executes the process of constructing the target clean speech signal according to the phase spectrum corresponding to the first speech signal and the second power spectrum, it may perform: constructing a clean speech signal for each frame according to the phase spectrum corresponding to each frame of the fidelity speech signal and the plurality of third power spectra, so as to obtain multiple frames of clean speech signal; and performing windowing and framing processing on the multiple frames of clean speech signal to obtain the target clean speech signal.
  • when the processor 503 executes the process of performing signal fidelity processing on each frame of the windowed speech signal to obtain multiple frames of fidelity speech signal, it may perform: performing linear prediction analysis on the i-th frame and each frame of windowed speech signal after the i-th frame to obtain a linear prediction residual signal for the i-th frame and each frame after it; using an inverse filter to inverse-filter the i-th frame linear prediction residual signal to obtain an i-th frame filtered speech signal; when the kurtosis of the i-th frame filtered speech signal is detected to be the smallest, obtaining the inverse filter that minimizes the kurtosis of the i-th frame filtered speech signal as the inverse filter corresponding to the i-th frame windowed speech signal; using the inverse filter corresponding to the i-th frame windowed speech signal to inverse-filter the i-th frame windowed speech signal to obtain an i-th frame fidelity speech signal; and, for each frame after the i-th frame, obtaining the inverse filter corresponding to that frame's windowed speech signal according to the previous frame's linear prediction residual signal and the inverse filter corresponding to the previous frame's windowed speech signal, and inverse-filtering that frame's windowed speech signal accordingly to obtain the corresponding fidelity speech signal.
  • the processor 503 may further execute: when it is detected that the i-th frame filtered speech signal has the smallest kurtosis, obtaining the i-th frame filtered speech signal with the smallest kurtosis. When the processor 503 executes the process of obtaining, for each frame after the i-th frame, the inverse filter corresponding to that frame's windowed speech signal according to the previous frame's linear prediction residual signal and the inverse filter corresponding to the previous frame's windowed speech signal, it may perform: obtaining the inverse filter corresponding to the (i+1)-th frame windowed speech signal according to the i-th frame linear prediction residual signal, the inverse filter corresponding to the i-th frame windowed speech signal, and the i-th frame filtered speech signal with the smallest kurtosis; and, for each frame after the (i+1)-th frame, obtaining the inverse filter corresponding to that frame's windowed speech signal according to the previous frame's linear prediction residual signal, the inverse filter corresponding to the previous frame's windowed speech signal, and the previous frame's filtered speech signal.
  • when the processor 503 executes the process of acquiring the reverberation voice signal, it may perform: acquiring the user's voice information when the electronic device is in a call state; detecting whether the user's voice information includes a preset keyword; and, if the user's voice information includes the preset keyword, acquiring the reverberation voice signal.
  • when the processor 503 executes the process of acquiring the reverberation voice signal if the user's voice information includes the preset keyword, it may perform: if the user's voice information includes the preset keyword, generating a record and saving the record; and, when the number of saved records is greater than a preset number threshold, acquiring the reverberation speech signal.
  • when the processor 503 executes the process of acquiring the reverberation speech signal, it may perform: when the electronic device is about to perform voiceprint recognition or voice recognition, detecting whether the distance between the sound source and the electronic device is greater than a preset distance threshold; and, if the distance between the sound source and the electronic device is greater than the preset distance threshold, acquiring the reverberation speech signal.
  • the voice processing device provided by the embodiments of the present application and the voice processing method in the above embodiments belong to the same concept, and any method provided in the voice processing method embodiments can be run on the voice processing device. For the specific implementation process, please refer to the embodiments of the voice processing method, which will not be repeated here.
  • A person of ordinary skill in the art can understand that all or part of the process of implementing the voice processing method described in the embodiments of the present application can be completed by a computer program controlling the related hardware.
  • the computer program may be stored in a computer-readable storage medium, such as in a memory, and executed by at least one processor; the execution process may include the processes of the embodiments of the voice processing method as described.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM, Read Only Memory), a random access memory (RAM, Random Access Memory), and so on.
  • each functional module may be integrated into one processing chip, or each module may exist alone physically, or two or more modules are integrated into one module.
  • the above integrated modules may be implemented in the form of hardware or in the form of software function modules. If an integrated module is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disk.
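The construction module described above builds a clean speech signal for each frame and then combines the frames into the target clean speech signal. The patent text does not spell out how the per-frame signals are recombined; overlap-add with the frame shift is one common way to do it and is an assumption in this illustrative sketch, as are the frame length and shift used in the example.

```python
import numpy as np

def overlap_add(frames, shift):
    """Reassemble per-frame clean signals into one signal by summing
    overlapping regions and normalizing by the overlap count.
    Overlap-add is an assumed reconstruction, not stated in the patent."""
    frame_len = frames.shape[1]
    out = np.zeros(shift * (len(frames) - 1) + frame_len)
    norm = np.zeros_like(out)
    for k, f in enumerate(frames):
        out[k * shift : k * shift + frame_len] += f
        norm[k * shift : k * shift + frame_len] += 1.0
    return out / np.maximum(norm, 1.0)

# Frames cut from a ramp at a 50% shift reassemble to the ramp itself.
x = np.arange(480, dtype=float)
frames = np.stack([x[k * 160 : k * 160 + 320] for k in range(2)])
y = overlap_add(frames, 160)
print(np.allclose(y, x))   # True
```

With a rectangular analysis window, summing and dividing by the overlap count recovers the original samples exactly wherever frames overlap; a tapered window would instead require window-compensated normalization.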


Abstract

A voice processing method and apparatus, a storage medium, and an electronic device. The method comprises: obtaining a reverberation voice signal (101); performing signal fidelity processing on the reverberation voice signal to obtain a first voice signal (102); performing a Fourier transform on the first voice signal to obtain a phase spectrum and a first power spectrum corresponding to the first voice signal (103); calculating a second power spectrum according to the first power spectrum (104); and constructing a target clean voice signal according to the phase spectrum and the second power spectrum (105).

Description

Voice processing method, device, storage medium, and electronic device
Technical field
The present application belongs to the technical field of electronic devices, and particularly relates to a voice processing method, device, storage medium, and electronic device.
Background art
When a microphone is used to collect speech signals indoors, reverberation occurs if the sound source is far from the microphone. Excessive reverberation seriously degrades the clarity and intelligibility of speech, thereby affecting call quality and the recognition rates of speech recognition and voiceprint wake-up. At present, most commonly used reverberation cancellation algorithms process the reverberation speech signal directly to obtain a dereverberated speech signal. However, the dereverberated speech signal obtained by such algorithms has low clarity.
Summary of the invention
Embodiments of the present application provide a voice processing method, device, storage medium, and electronic device, which can construct a clearer clean speech signal.
In a first aspect, an embodiment of the present application provides a voice processing method, including:
acquiring a reverberation speech signal;
performing signal fidelity processing on the reverberation speech signal to obtain a first speech signal;
performing a Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal;
calculating a second power spectrum according to the first power spectrum; and
constructing a target clean speech signal according to the phase spectrum corresponding to the first speech signal and the second power spectrum.
In a second aspect, an embodiment of the present application provides a voice processing device, including:
an acquisition module, configured to acquire a reverberation speech signal;
a processing module, configured to perform signal fidelity processing on the reverberation speech signal to obtain a first speech signal;
a transformation module, configured to perform a Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal;
a calculation module, configured to calculate a second power spectrum according to the first power spectrum; and
a construction module, configured to construct a target clean speech signal according to the phase spectrum corresponding to the first speech signal and the second power spectrum.
In a third aspect, an embodiment of the present application provides a storage medium on which a computer program is stored, wherein, when the computer program is executed on a computer, the computer is caused to execute the voice processing method provided in this embodiment.
In a fourth aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, where a computer program is stored in the memory, and the processor, by calling the computer program stored in the memory, is configured to execute:
acquiring a reverberation speech signal;
performing signal fidelity processing on the reverberation speech signal to obtain a first speech signal;
performing a Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal;
calculating a second power spectrum according to the first power spectrum; and
constructing a target clean speech signal according to the phase spectrum corresponding to the first speech signal and the second power spectrum.
In the embodiments of the present application, the clean speech signal is not obtained directly from the reverberation speech signal. Instead, fidelity processing is first performed on the reverberation speech signal to obtain a first speech signal; a second power spectrum is then calculated from the first power spectrum corresponding to the first speech signal; and a clean speech signal is constructed from the phase spectrum corresponding to the first speech signal and the second power spectrum. By performing fidelity processing on the reverberation speech signal to obtain the first speech signal and then processing the first speech signal, a clean speech signal with higher clarity can be constructed.
Brief description of the drawings
The technical solutions and beneficial effects of the present application will become apparent through the following detailed description of specific implementations of the present application in conjunction with the accompanying drawings.
FIG. 1 is a first schematic flowchart of a voice processing method provided by an embodiment of the present application.
FIG. 2 is a second schematic flowchart of a voice processing method provided by an embodiment of the present application.
FIG. 3 is a third schematic flowchart of a voice processing method provided by an embodiment of the present application.
FIG. 4 is a fourth schematic flowchart of a voice processing method provided by an embodiment of the present application.
FIG. 5 is a schematic structural diagram of a voice processing device provided by an embodiment of the present application.
FIG. 6 is a first schematic structural diagram of an electronic device provided by an embodiment of the present application.
FIG. 7 is a second schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed description
Please refer to the drawings, where identical reference numerals represent identical components. The principles of the present application are illustrated by implementation in an appropriate computing environment. The following description is based on the illustrated specific embodiments of the present application and should not be regarded as limiting other specific embodiments not detailed herein.
Please refer to FIG. 1, which is a first schematic flowchart of a voice processing method provided by an embodiment of the present application. The flow of the voice processing method may include:
During sound collection or recording, in addition to the direct sound signal that travels straight from the desired sound source, a microphone also receives sound signals from the source that arrive via other paths, as well as unwanted sound waves (i.e., background noise) generated by other sources in the environment. Acoustically, reflected waves with a delay of more than about 50 ms are called echoes, and the effect produced by the remaining reflected waves is called reverberation. Reverberation greatly reduces the clarity of sound, affecting call quality and the recognition rates of speech and voiceprint recognition. In this case, how to reduce the reverberation in the speech signal collected by the microphone is particularly important.
In 101, a reverberation speech signal is acquired.
It can be understood that the signal received at the microphone is easily affected by environmental reverberation. For example, in a room, speech is reflected many times by walls, floors, and furniture, so the signal received at the microphone is a mixture of direct sound and reflected sound, i.e., a reverberation speech signal. The reflected sound is the reverberation signal, the direct sound is the clean speech signal, and the reverberation signal is delayed relative to the clean speech signal. When the speaker is far from the microphone and the call environment is a relatively closed space, reverberation occurs easily. Severe reverberation makes speech unclear and degrades call quality. In addition, the interference caused by reverberation degrades the performance of acoustic receiving systems and significantly reduces the performance of speech recognition and voiceprint recognition systems.
In this embodiment, the electronic device acquires the reverberation speech signal.
In 102, signal fidelity processing is performed on the reverberation speech signal to obtain a first speech signal.
For example, a certain amount of distortion occurs when the microphone collects the reverberation speech signal. If dereverberation is performed directly on the reverberation speech signal, the resulting clean speech signal (the speech signal after dereverberation) may not be sufficiently clear. Therefore, in this embodiment, the electronic device performs signal fidelity processing on the reverberation speech signal to obtain the first speech signal, so as to reduce the distortion of the signal.
In 103, a Fourier transform is performed on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal.
The electronic device performs a Fourier transform on the first speech signal to obtain the phase spectrum and the first power spectrum corresponding to the first speech signal.
In 104, a second power spectrum is calculated according to the first power spectrum.
For example, the electronic device may calculate the second power spectrum according to the first power spectrum.
In 105, a target clean speech signal is constructed according to the phase spectrum corresponding to the first speech signal and the second power spectrum.
For example, the electronic device constructs the target clean speech signal according to the phase spectrum and the second power spectrum corresponding to the first speech signal.
It can be understood that, in this embodiment, the clean speech signal is not obtained directly from the reverberation speech signal. Instead, fidelity processing is first performed on the reverberation speech signal to obtain the first speech signal; the second power spectrum is then calculated from the first power spectrum corresponding to the first speech signal; and the clean speech signal is constructed from the phase spectrum corresponding to the first speech signal and the second power spectrum. By performing fidelity processing on the reverberation speech signal and then processing the resulting first speech signal, a clean speech signal with higher clarity can be constructed.
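As an illustration only, the flow of steps 101 to 105 can be sketched in Python. The fidelity processing of step 102 and the second-power-spectrum calculation of step 104 are represented by placeholder functions, since their details are specified elsewhere in the embodiments; the rest follows the stated flow of Fourier transform, splitting into phase and power spectra, and reconstruction from the phase spectrum and the new power spectrum.

```python
import numpy as np

def process(reverb, fidelity_fn, power_fn):
    """Sketch of steps 101-105. fidelity_fn stands in for the patent's
    signal fidelity processing (102) and power_fn for the second-power-
    spectrum calculation (104); both are placeholders, not the method."""
    first = fidelity_fn(reverb)                     # 102: first speech signal
    spec = np.fft.rfft(first)                       # 103: Fourier transform
    phase = np.angle(spec)                          # phase spectrum
    p1 = np.abs(spec) ** 2                          # first power spectrum
    p2 = power_fn(p1)                               # 104: second power spectrum
    clean_spec = np.sqrt(p2) * np.exp(1j * phase)   # 105: recombine
    return np.fft.irfft(clean_spec, n=len(first))

# With identity placeholders the pipeline returns the input unchanged,
# which checks that the transform/reconstruction plumbing is consistent.
x = np.random.default_rng(0).standard_normal(512)
y = process(x, lambda s: s, lambda p: p)
print(np.allclose(x, y))   # True
```

The identity check works because sqrt(|S|^2) * exp(j * angle(S)) reproduces the complex spectrum S exactly; any actual dereverberation effect comes entirely from the two placeholder stages.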
Please refer to FIG. 2, which is a second schematic flowchart of a voice processing method provided by an embodiment of the present application. The voice processing method may include:
In 201, the electronic device acquires a reverberation speech signal.
In this embodiment, the electronic device may use a microphone to collect the reverberation speech signal.
在202中,电子设备对混响语音信号进行加窗分帧处理,得到多帧加窗语音信号。In 202, the electronic device performs windowing and framing processing on the reverberation speech signal to obtain a multi-frame windowed speech signal.
例如,电子设备在获取到混响语音信号之后,可以对混响语音信号进行加窗分帧处理,得到多帧加窗语音信号。其中,电子设备可以取一帧长度为20ms,取帧移为10ms对混响语音信号进行分帧。电子设备对混响语音信号加窗时,优先而不局限地,窗函数可以选取矩形窗,即w(n)=1。For example, after acquiring the reverberation speech signal, the electronic device may perform windowing and framing processing on the reverberation speech signal to obtain a multi-frame windowing speech signal. Among them, the electronic device may take a frame length of 20 ms and a frame shift of 10 ms to frame the reverberation voice signal. When the electronic device adds a window to the reverberant speech signal, it has priority and is not limited, and the window function can select a rectangular window, that is, w(n)=1.
例如,长度为L的加窗语音信号y(i)可以表示为:y(i)=[y(i-L+1),…y(i-1),y(i)],其中,i表示帧数。For example, a windowed speech signal y(i) of length L can be expressed as: y(i)=[y(i-L+1),…y(i-1),y(i)], where i denotes the frame index.
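As an illustrative sketch only (not part of the patent embodiment), the framing of step 202 can be expressed as follows, assuming a 16 kHz sampling rate so that the 20 ms frame length is 320 samples and the 10 ms frame shift is 160 samples, with the rectangular window w(n)=1:

```python
# Sketch of step 202 under stated assumptions: 16 kHz sampling,
# 20 ms frames (320 samples), 10 ms shift (160 samples),
# rectangular window w(n) = 1 as in the embodiment.
import numpy as np

def frame_signal(x, frame_len=320, frame_shift=160):
    """Split a 1-D signal into overlapping rectangular-windowed frames."""
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    window = np.ones(frame_len)          # rectangular window: w(n) = 1
    frames = np.stack([
        window * x[i * frame_shift : i * frame_shift + frame_len]
        for i in range(n_frames)
    ])
    return frames

x = np.random.randn(16000)               # 1 s of a (simulated) reverberant signal
frames = frame_signal(x)
```

With one second of input at the assumed rate this yields 99 frames; the rectangular window leaves each frame's samples unchanged.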
需要说明的是,在本实施例中,电子设备得到的多帧加窗信号至少不包括第1帧加窗信号。例如,电子设备对混响语音信号进行加窗处理,得到8帧加窗语音信号,电子设备可以获取后7帧加窗语音信号,即第2帧至第8帧加窗语音信号,第2帧至第8帧加窗语音信号即为电子设备得到的多帧加窗语音信号;电子设备也可以获取后6帧加窗语音信号,即第3帧至第8帧加窗语音信号,第3帧至第8帧加窗语音信号即为电子设备得到的多帧加窗语音信号。具体获取后几帧加窗语音信号根据实际情况确定,此处不做具体限制。另外,如何对混响语音信号进行加窗分帧并不限于上述方式,还可以是其他方式,此处不做具体限制。It should be noted that, in this embodiment, the multi-frame windowed signal obtained by the electronic device at least excludes the windowed signal of the first frame. For example, if the electronic device performs windowing on the reverberant speech signal to obtain 8 frames of windowed speech signals, it may take the last 7 frames, that is, the windowed speech signals of frames 2 to 8, which then constitute the multi-frame windowed speech signal obtained by the electronic device; it may also take the last 6 frames, that is, the windowed speech signals of frames 3 to 8, which then constitute the multi-frame windowed speech signal obtained by the electronic device. How many of the later frames are taken is determined according to the actual situation, and no specific limitation is made here. In addition, the manner of windowing and framing the reverberant speech signal is not limited to the above; other manners are also possible, and no specific limitation is made here.
在203中,电子设备对每帧加窗语音信号进行信号保真处理,得到多帧保真语音信号。In 203, the electronic device performs signal fidelity processing on the windowed voice signal of each frame to obtain a multi-frame fidelity voice signal.
比如,电子设备在得到多帧加窗信号之后,可以对每帧加窗信号进行信号保真处理,得到多帧保真语音信号。其中,多帧保真语音信号构成第一语音信号,保真语音信号可以表示为z(i),i表示帧数。For example, after obtaining multiple frames of windowing signals, the electronic device may perform signal fidelity processing on each frame of the windowing signals to obtain multiple frames of fidelity voice signals. Wherein, the multi-frame fidelity speech signal constitutes the first voice signal, and the fidelity speech signal may be expressed as z(i), and i represents the number of frames.
例如,电子设备得到5帧加窗语音信号,分别为第4帧加窗语音信号y(4)、第5帧加窗语音信号y(5)、第6帧加窗语音信号y(6)、第7帧加窗语音信号y(7)和第8帧加窗语音信号y(8)。然后,电子设备对这5帧加窗语音信号进行信号保真处理,得到5帧保真语音信号,分别为第4帧保真语音信号z(4)、第5帧保真语音信号z(5)、第6帧保真语音信号z(6)、第7帧保真语音信号z(7)和第8帧保真语音信号z(8)。For example, the electronic device obtains 5 frames of windowed speech signals, namely the fourth-frame windowed speech signal y(4), the fifth-frame windowed speech signal y(5), the sixth-frame windowed speech signal y(6), the seventh-frame windowed speech signal y(7), and the eighth-frame windowed speech signal y(8). The electronic device then performs signal fidelity processing on these 5 frames of windowed speech signals to obtain 5 frames of fidelity speech signals, namely the fourth-frame fidelity speech signal z(4), the fifth-frame fidelity speech signal z(5), the sixth-frame fidelity speech signal z(6), the seventh-frame fidelity speech signal z(7), and the eighth-frame fidelity speech signal z(8).
可以理解,对信号进行保真处理,可以减少信号的失真率。It can be understood that the fidelity processing of the signal can reduce the distortion rate of the signal.
在204中,电子设备对每帧保真语音信号进行傅里叶变换,得到多帧保真语音信号分别对应的相位谱和功率谱。In 204, the electronic device performs Fourier transform on each frame of the fidelity speech signal to obtain a phase spectrum and a power spectrum corresponding to the multi-frame fidelity speech signal, respectively.
比如,电子设备在得到多帧保真信号之后,可以对每帧保真信号进行傅里叶变换,进而得到多帧保真语音信号分别对应的相位谱和功率谱。其中,多帧保真语音信号分别对应的相位谱和功率谱构成第一语音信号对应的相位谱和第一功率谱。For example, after obtaining multiple frames of fidelity signals, the electronic device may perform Fourier transform on each frame of fidelity signals, thereby obtaining phase spectra and power spectra corresponding to the multi-frame fidelity speech signals, respectively. Wherein, the phase spectrum and the power spectrum corresponding to the multi-frame fidelity speech signal respectively constitute the phase spectrum and the first power spectrum corresponding to the first speech signal.
例如,假设电子设备得到5帧保真语音信号,分别为第4帧保真语音信号z(4)、第5帧保真语音信号z(5)、第6帧保真语音信号z(6)、第7帧保真语音信号z(7)和第8帧保真语音信号z(8)。那么电子设备对z(4)进行傅里叶变换,即FFT[z(4)]=Z(4),可以得到z(4)对应的相位谱arg[Z(4)]和z(4)对应的功率谱|Z(4)| 2。电子设备对z(5)进行傅里叶变换,即FFT[z(5)]=Z(5),可以得到z(5)对应的相位谱arg[Z(5)]和z(5)对应的功率谱|Z(5)| 2。电子设备对z(6)进行傅里叶变换,即FFT[z(6)]=Z(6),可以得到z(6)对应的相位谱arg[Z(6)]和z(6)对应的功率谱|Z(6)| 2。电子设备对z(7)进行傅里叶变换,即FFT[z(7)]=Z(7),可以得到z(7)对应的相位谱arg[Z(7)]和z(7)对应的功率谱|Z(7)| 2。电子设备对z(8)进行傅里叶变换,即FFT[z(8)]=Z(8),可以得到z(8)对应的相位谱arg[Z(8)]和z(8)对应的功率谱|Z(8)| 2For example, suppose the electronic device obtains 5 frames of fidelity speech signals, namely the fourth-frame fidelity speech signal z(4), the fifth-frame fidelity speech signal z(5), the sixth-frame fidelity speech signal z(6), the seventh-frame fidelity speech signal z(7), and the eighth-frame fidelity speech signal z(8). The electronic device then performs a Fourier transform on z(4), that is, FFT[z(4)]=Z(4), to obtain the phase spectrum arg[Z(4)] and the power spectrum |Z(4)| 2 corresponding to z(4). Likewise, it performs a Fourier transform on z(5), that is, FFT[z(5)]=Z(5), to obtain the phase spectrum arg[Z(5)] and the power spectrum |Z(5)| 2 corresponding to z(5); on z(6), that is, FFT[z(6)]=Z(6), to obtain arg[Z(6)] and |Z(6)| 2 ; on z(7), that is, FFT[z(7)]=Z(7), to obtain arg[Z(7)] and |Z(7)| 2 ; and on z(8), that is, FFT[z(8)]=Z(8), to obtain arg[Z(8)] and |Z(8)| 2 .
其中,FFT(Fast Fourier Transformation)是离散傅氏变换(DFT)的快速算法,即为快速傅氏变换。它是根据离散傅氏变换的奇、偶、虚、实等特性,对离散傅立叶变换的算法进行改进获得的。Among them, FFT (Fast Fourier Transform) is a fast algorithm for the discrete Fourier transform (DFT), namely the fast Fourier transform. It is obtained by improving the DFT algorithm according to properties of the discrete Fourier transform such as its odd, even, imaginary, and real symmetries.
可以理解,第一语音信号对应的相位谱包括:arg[Z(4)]、arg[Z(5)]、arg[Z(6)]、arg[Z(7)]、arg[Z(8)]。第一语音信号对应的第一功率谱包括:|Z(4)| 2、|Z(5)| 2、|Z(6)| 2、|Z(7)| 2、|Z(8)| 2It can be understood that the phase spectrum corresponding to the first speech signal includes: arg[Z(4)], arg[Z(5)], arg[Z(6)], arg[Z(7)], and arg[Z(8)]. The first power spectrum corresponding to the first speech signal includes: |Z(4)| 2 , |Z(5)| 2 , |Z(6)| 2 , |Z(7)| 2 , and |Z(8)| 2 .
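A minimal sketch of step 204, assuming a 320-sample frame: the FFT of each fidelity frame yields its phase spectrum arg[Z(i)] and power spectrum |Z(i)|^2, and the pair losslessly represents the frame:

```python
# Sketch of step 204: per-frame FFT giving the phase spectrum arg[Z(i)]
# and the power spectrum |Z(i)|^2. The 320-sample frame length is an
# illustrative assumption.
import numpy as np

def phase_and_power(frame):
    Z = np.fft.fft(frame)                # FFT[z(i)] = Z(i)
    return np.angle(Z), np.abs(Z) ** 2   # arg[Z(i)], |Z(i)|^2

frame = np.random.randn(320)             # a fidelity frame z(i)
phase, power = phase_and_power(frame)
```

Since Z(i) = sqrt(|Z(i)|^2) * exp(j*arg[Z(i)]), the original frame can be recovered exactly by an inverse FFT of that product.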
在205中,电子设备根据每帧保真语音信号对应的功率谱,计算第三功率谱,得到多个第三功率谱。In 205, the electronic device calculates a third power spectrum according to the power spectrum corresponding to the fidelity speech signal of each frame to obtain a plurality of third power spectrums.
比如,电子设备在得到多帧保真语音信号分别对应的功率谱之后,可以根据每帧保真语音信号对应的功率谱,计算第三功率谱,得到多个第三功率谱。其中,多个第三功率谱构成第二功率谱。For example, after obtaining power spectra corresponding to multiple frames of fidelity speech signals, the electronic device may calculate a third power spectrum according to the power spectrum corresponding to each frame of fidelity speech signals to obtain multiple third power spectra. Among them, a plurality of third power spectrums constitute a second power spectrum.
比如,电子设备可以采用以下公式,计算第三功率谱:For example, the electronic device can use the following formula to calculate the third power spectrum:
Figure PCTCN2018118713-appb-000001
其中,|Z(i)| 2表示第i帧保真语音信号对应的功率谱,|X(i)| 2表示第i个第三功率谱,其根据第i帧保真语音信号对应的功率谱确定,ρ是混响时间平移帧数,γ是一个增益值,ε表示直达声信号衰减一定分贝的值
Figure PCTCN2018118713-appb-000002
i>-a,ω(i)是一个平滑函数,a用来控制平滑函数的宽度,i表示帧数。
Figure PCTCN2018118713-appb-000001
Among them, |Z(i)| 2 represents the power spectrum corresponding to the i-th frame fidelity speech signal, and |X(i)| 2 represents the i-th third power spectrum, which is based on the power corresponding to the i-th frame fidelity speech signal The spectrum is determined, ρ is the number of reverberation time translation frames, γ is a gain value, and ε represents the value of the direct sound signal attenuation by a certain decibel
Figure PCTCN2018118713-appb-000002
i>-a, ω(i) is a smoothing function, a is used to control the width of the smoothing function, and i represents the frame index.
其中,ρ、γ、ε、a的取值可以为:ρ=7,γ=0.32,ε=0.01,a=5。其中,ρ=7表示帧移7帧,即假设混响时间在50ms左右,窗移8ms,需要移7帧。γ=0.32表示增益值为0.32。ε=0.01表示直达声信号衰减30dB的值。a=5表示平滑函数的宽度为5。需要说明的是,ρ=7,γ=0.32,ε=0.01,a=5只是本实施例的一种示例,并不用于限制本申请,在实际应用过程中,ρ、γ、ε、a的取值并不局限于本实施例中的示例,可以根据实际情况确定ρ、γ、ε、a的取值,此处不做具体限制。The values of ρ, γ, ε, and a may be: ρ=7, γ=0.32, ε=0.01, a=5. Here ρ=7 means a shift of 7 frames; that is, assuming the reverberation time is around 50 ms and the window shift is 8 ms, a shift of 7 frames is needed. γ=0.32 means the gain value is 0.32. ε=0.01 means the value by which the direct sound signal is attenuated by 30 dB. a=5 means the width of the smoothing function is 5. It should be noted that ρ=7, γ=0.32, ε=0.01, a=5 are only an example of this embodiment and are not intended to limit this application; in actual application, the values of ρ, γ, ε, and a are not limited to the examples in this embodiment and may be determined according to the actual situation, without specific limitation here.
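The exact expression for the third power spectrum |X(i)|^2 appears in the patent figures, which are not reproduced in this text. The sketch below is therefore a hypothetical spectral-subtraction form that is merely consistent with the quantities named above: a late-reverberation estimate built from the power spectra delayed by ρ frames, scaled by the gain γ and smoothed by ω(i) (here assumed to be a normalized Hann window of half-width a), subtracted from |Z(i)|^2 with a floor of ε·|Z(i)|^2:

```python
# HYPOTHETICAL spectral-subtraction sketch: the true formula is in the
# unreproduced patent figures. Only the named quantities and the example
# values rho=7, gamma=0.32, eps=0.01, a=5 are taken from the text.
import numpy as np

rho, gamma, eps, a = 7, 0.32, 0.01, 5    # example values from the embodiment

def third_power_spectra(Z_power):
    """Z_power: (n_frames, n_bins) array of |Z(i)|^2; returns |X(i)|^2."""
    n_frames, n_bins = Z_power.shape
    omega = np.hanning(2 * a + 1)        # assumed smoothing function omega(i)
    omega /= omega.sum()
    X_power = np.empty_like(Z_power)
    for i in range(n_frames):
        # smoothed, rho-frame-delayed late-reverberation estimate
        acc = np.zeros(n_bins)
        for k, w in enumerate(omega, start=-a):
            j = i - rho + k
            if 0 <= j < n_frames:
                acc += w * Z_power[j]
        X_power[i] = np.maximum(Z_power[i] - gamma * acc, eps * Z_power[i])
    return X_power

Z_power = np.abs(np.fft.fft(np.random.randn(20, 320), axis=1)) ** 2
X_power = third_power_spectra(Z_power)
```

The floor ε·|Z(i)|^2 keeps every output bin between ε·|Z(i)|^2 and |Z(i)|^2, whatever the smoothing weights are.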
例如,假设电子设备得到5帧保真语音信号分别对应的功率谱为:第4帧保真语音信号对应的功率谱|Z(4)| 2、第5帧保真语音信号对应的功率谱|Z(5)| 2、第6帧保真语音信号对应的功率谱|Z(6)| 2、第7帧保真语音信号对应的功率谱|Z(7)| 2、第8帧保真语音信号对应的功率谱|Z(8)| 2。那么电子设备可以将|Z(4)| 2代入公式
Figure PCTCN2018118713-appb-000003
中,计算得到第四个第三功率谱|X(4)| 2,即
Figure PCTCN2018118713-appb-000004
同理,电子设备可以通过计算得到第五个第三功率谱|X(5)| 2、第六个第三功率谱|X(6)| 2、第七个第三功率谱|X(7)| 2、第八个第三功率谱|X(8)| 2
For example, suppose the electronic device obtains the power spectrum corresponding to the 5 frames of fidelity speech signals as follows: the power spectrum corresponding to the 4th frame fidelity speech signal |Z(4)| 2 , the power spectrum corresponding to the 5th frame fidelity speech signal| Z(5)| 2 , the power spectrum corresponding to the 6th frame fidelity speech signal|Z(6)| 2 , the power spectrum corresponding to the 7th frame fidelity speech signal|Z(7)| 2 , the 8th frame fidelity The power spectrum corresponding to the voice signal |Z(8)| 2 . Then electronic devices can substitute |Z(4)| 2 into the formula
Figure PCTCN2018118713-appb-000003
In the calculation, the fourth third power spectrum |X(4)| 2 is obtained , that is
Figure PCTCN2018118713-appb-000004
Similarly, the electronic device can obtain by calculation the fifth third power spectrum |X(5)| 2 , the sixth third power spectrum |X(6)| 2 , the seventh third power spectrum |X(7)| 2 , and the eighth third power spectrum |X(8)| 2 .
可以理解,第二功率谱包括:|X(4)| 2、|X(5)| 2、|X(6)| 2、|X(7)| 2、|X(8)| 2这5个第三功率谱。It can be understood that the second power spectrum includes these five third power spectra: |X(4)| 2 , |X(5)| 2 , |X(6)| 2 , |X(7)| 2 , and |X(8)| 2 .
在206中,电子设备根据每帧保真语音信号对应的相位谱和多个第三功率谱,构建每帧干净语音信号,得到多帧干净语音信号。In 206, the electronic device constructs a clean speech signal for each frame according to the phase spectrum corresponding to each frame of the fidelity speech signal and a plurality of third power spectra, to obtain multiple frames of clean speech signals.
例如,假设电子设备得到5帧保真语音信号对应的相位谱和5个第三功率谱,5帧保真语音信号分别对应的相位谱为:arg[Z(4)]、arg[Z(5)]、arg[Z(6)]、arg[Z(7)]、arg[Z(8)]。5个第三功率谱分别为:|X(4)| 2、|X(5)| 2、|X(6)| 2、|X(7)| 2、|X(8)| 2。那么,电子设备可以根据arg[Z(4)]和|X(4)| 2,构建第1帧干净语音信号。电子设备可以根据arg[Z(5)]和|X(5)| 2,构建第2帧干净语音信号。电子设备可以根据arg[Z(6)]和|X(6)| 2,构建第3帧干净语音信号。电子设备可以根据arg[Z(7)]和|X(7)| 2,构建第4帧干净语音信号。电子设备可以根据arg[Z(8)]和|X(8)| 2,构建第5帧干净语音信号。从而电子设备一共可以构建5帧干净语音信号。 For example, suppose the electronic device obtains the phase spectra corresponding to 5 frames of fidelity speech signals and 5 third power spectra. The phase spectra are: arg[Z(4)], arg[Z(5)], arg[Z(6)], arg[Z(7)], and arg[Z(8)]. The five third power spectra are: |X(4)| 2 , |X(5)| 2 , |X(6)| 2 , |X(7)| 2 , and |X(8)| 2 . The electronic device can then construct the first frame of the clean speech signal from arg[Z(4)] and |X(4)| 2 , the second frame from arg[Z(5)] and |X(5)| 2 , the third frame from arg[Z(6)] and |X(6)| 2 , the fourth frame from arg[Z(7)] and |X(7)| 2 , and the fifth frame from arg[Z(8)] and |X(8)| 2 . The electronic device can thus construct a total of 5 frames of clean speech signals.
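A sketch of step 206 under the assumption of a 320-sample frame: each clean frame is rebuilt by an inverse FFT of the spectrum whose magnitude comes from the third power spectrum |X(i)|^2 and whose phase comes from arg[Z(i)]. Feeding back the unmodified power spectrum recovers the original frame, which makes the construction easy to check:

```python
# Sketch of step 206: rebuild a clean time-domain frame from a phase
# spectrum and a (possibly modified) power spectrum. Frame length assumed.
import numpy as np

def rebuild_frame(phase, X_power):
    spectrum = np.sqrt(X_power) * np.exp(1j * phase)
    return np.fft.ifft(spectrum).real    # clean time-domain frame

z = np.random.randn(320)                 # a fidelity frame z(i)
Z = np.fft.fft(z)
clean = rebuild_frame(np.angle(Z), np.abs(Z) ** 2)
```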
在207中,电子设备对多帧干净语音信号进行加窗合帧处理,得到目标干净语音信号。In 207, the electronic device performs windowing and framing processing on multiple frames of clean voice signals to obtain target clean voice signals.
比如,假设电子设备构建了5帧干净语音信号,那么电子设备可以按照时间顺序对这5帧干净语音信号进行加窗合帧处理,得到目标干净语音信号。其中,各相邻帧干净语音信号之间可以存在一定重叠,以构建出更清晰的目标干净语音信号。For example, assuming that the electronic device constructs 5 frames of clean voice signals, the electronic device may perform windowing and framing processing on these 5 frames of clean voice signals in chronological order to obtain a target clean voice signal. Among them, there may be a certain overlap between clean speech signals in adjacent frames to construct a clearer target clean speech signal.
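The windowing-and-merging of step 207 can be sketched as an overlap-add, assuming the same 320-sample frames and 160-sample shift used above, so that adjacent clean frames overlap by half a frame as the embodiment notes:

```python
# Sketch of step 207: overlap-add of clean frames into the target signal.
# Frame length 320 and shift 160 are illustrative assumptions.
import numpy as np

def overlap_add(frames, frame_shift=160):
    n_frames, frame_len = frames.shape
    out = np.zeros((n_frames - 1) * frame_shift + frame_len)
    for i, f in enumerate(frames):
        out[i * frame_shift : i * frame_shift + frame_len] += f
    return out

frames = np.ones((5, 320))               # 5 toy "clean" frames
y = overlap_add(frames)
```

With all-ones frames, samples in the overlapped interior sum to 2 while the un-overlapped edges stay at 1, which shows where adjacent frames contribute jointly.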
如图3所示,在一些实施例中,流程203可以包括以下流程:As shown in FIG. 3, in some embodiments, the process 203 may include the following process:
2031,电子设备对第i帧及第i帧之后的每帧加窗语音信号进行线性预测分析,得到第i帧及第i帧之后的每帧线性预测残差信号。2031. The electronic device performs linear prediction analysis on the windowed speech signal of the i-th frame and of each frame after the i-th frame, to obtain the linear prediction residual signal of the i-th frame and of each frame after the i-th frame.
可以理解,由于对当前帧进行线性预测分析需要用到当前帧的前几帧的数据,因此,电子设备并不是从第1帧开始进行线性预测分析,因此,在本实施例中,电子设备得到的多帧加窗信号至少不包括第1帧加窗信号。例如,电子设备对混响语音信号进行加窗处理,得到8帧加窗语音信号,电子设备可以获取后7帧加窗语音信号,即第2帧至第8帧加窗语音信号,第2帧至第8帧加窗语音信号即为电子设备得到的多帧加窗语音信号;电子设备也可以获取后6帧加窗语音信号,即第3帧至第8帧加窗语音信号,第3帧至第8帧加窗语音信号即为电子设备得到的多帧加窗语音信号。具体获取后几帧加窗语音信号根据实际情况确定,此处不做具体限制。It can be understood that, since linear prediction analysis of the current frame requires data from the frames preceding it, the electronic device does not start the linear prediction analysis from the first frame. Therefore, in this embodiment, the multi-frame windowed signal obtained by the electronic device at least excludes the windowed signal of the first frame. For example, if the electronic device performs windowing on the reverberant speech signal to obtain 8 frames of windowed speech signals, it may take the last 7 frames, that is, the windowed speech signals of frames 2 to 8, which constitute the multi-frame windowed speech signal obtained by the electronic device; it may also take the last 6 frames, that is, the windowed speech signals of frames 3 to 8, which constitute the multi-frame windowed speech signal obtained by the electronic device. How many of the later frames are taken is determined according to the actual situation, and no specific limitation is made here.
例如,电子设备得到5帧加窗语音信号,分别为第4帧加窗语音信号y(4)、第5帧加窗语音信号y(5)、第6帧加窗语音信号y(6)、第7帧加窗语音信号y(7)和第8帧加窗语音信号y(8)。因此,电子设备从第4帧开始,对第4帧至第8帧加窗语音信号分别进行线性预测分析,得到第4帧线性预测残差信号w(4)、第5帧线性预测残差信号w(5)、第6帧线性预测残差信号w(6)、第7帧线性预测残差信号w(7)、第8帧线性预测残差信号w(8)。其中,线性预测残差信号可以表示为:w(i),i为帧数。For example, the electronic device obtains 5 frames of windowed speech signals, namely the fourth-frame windowed speech signal y(4), the fifth-frame windowed speech signal y(5), the sixth-frame windowed speech signal y(6), the seventh-frame windowed speech signal y(7), and the eighth-frame windowed speech signal y(8). Therefore, starting from the fourth frame, the electronic device performs linear prediction analysis on the windowed speech signals of frames 4 to 8 respectively, to obtain the fourth-frame linear prediction residual signal w(4), the fifth-frame linear prediction residual signal w(5), the sixth-frame linear prediction residual signal w(6), the seventh-frame linear prediction residual signal w(7), and the eighth-frame linear prediction residual signal w(8). The linear prediction residual signal can be expressed as w(i), where i is the frame index.
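A sketch of the per-frame linear prediction analysis of 2031. The prediction order p=12 and the least-squares solution are illustrative assumptions, not values taken from the patent; the residual w(i) is the frame minus its linear prediction from the p preceding samples:

```python
# Sketch of step 2031: forward linear prediction residual of one frame.
# Prediction order p=12 and the least-squares fit are assumptions.
import numpy as np

def lpc_residual(frame, p=12):
    """Residual of a p-th order forward linear predictor (least squares)."""
    # row n of X holds the p samples preceding frame[p + n]
    X = np.stack([frame[p - k - 1 : len(frame) - k - 1] for k in range(p)],
                 axis=1)
    target = frame[p:]
    a, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = np.zeros_like(frame)
    resid[:p] = frame[:p]          # no prediction available for first p samples
    resid[p:] = target - X @ a
    return resid

frame = np.sin(0.3 * np.arange(320.0))   # a toy windowed frame
resid = lpc_residual(frame)
```

For a pure sinusoid a short linear predictor is essentially exact, so the residual beyond the first p samples is negligible.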
2032,电子设备采用逆滤波器对第i帧线性预测残差信号进行逆滤波处理,得到第i帧滤波语音信号。2032. The electronic device uses an inverse filter to perform inverse filtering on the linear prediction residual signal of the i-th frame to obtain a filtered speech signal of the i-th frame.
比如,电子设备可以采用长度为L的逆滤波器对第i帧线性预测残差信号进行逆滤波处理,得到第i帧滤波语音信号。For example, the electronic device may use an inverse filter of length L to perform inverse filtering processing on the linear prediction residual signal of the i-th frame to obtain a filtered speech signal of the i-th frame.
其中,长度为L的逆滤波器可以表示为:g(i)=[g(1),g(2),…g(L)]。L的取值可以根据实际情况确定,此处不做具体限制。Among them, the inverse filter of length L can be expressed as: g(i)=[g(1), g(2),...g(L)]. The value of L can be determined according to the actual situation, and no specific restrictions are made here.
2033,当电子设备检测到第i帧滤波语音信号峰度最小时,电子设备获取使得第i帧滤波语音信号峰度最小的逆滤波器,得到第i帧加窗语音信号对应的逆滤波器。2033. When the electronic device detects that the kurtosis of the i-th frame filtered speech signal is minimum, the electronic device obtains an inverse filter that minimizes the kurtosis of the i-th frame filtered speech signal to obtain an inverse filter corresponding to the i-th windowed voice signal.
比如,电子设备可以采用长度为L的逆滤波器对第4帧线性预测残差信号进行逆滤波处理,得到第4帧滤波语音信号。电子设备不断更改逆滤波器的参数,使得第4帧滤波语音信号不断变化。同时,电子设备持续检测不断变化的第4帧滤波语音信号的峰度。当电子设备检测到第4帧滤波语音信号的峰度最小时,获取使得第4帧滤波语音信号峰度最小的逆滤波器,得到第4帧加窗语音信号对应的逆滤波器g(4)。For example, the electronic device may use an inverse filter of length L to perform inverse filtering on the fourth-frame linear prediction residual signal to obtain the fourth-frame filtered speech signal. The electronic device continuously changes the parameters of the inverse filter, so that the filtered speech signal of the fourth frame continuously changes. At the same time, the electronic device continuously detects the changing kurtosis of the filtered speech signal in the fourth frame. When the electronic device detects that the kurtosis of the filtered speech signal in the fourth frame is the smallest, an inverse filter that minimizes the kurtosis of the filtered speech signal in the fourth frame is obtained to obtain the inverse filter g(4) corresponding to the windowed speech signal in the fourth frame .
2034,电子设备采用第i帧加窗语音信号对应的逆滤波器对第i帧加窗语音信号进行逆滤波处理,得到第i帧保真语音信号。2034. The electronic device uses an inverse filter corresponding to the windowed i-frame speech signal to perform inverse filtering on the windowed i-frame speech signal to obtain the i-th frame fidelity speech signal.
第i帧保真语音信号的计算公式为:The formula for the fidelity speech signal of frame i is:
z(i)=g(i)y(i)。其中,z(i)表示第i帧保真语音信号,g(i)表示第i帧加窗语音信号对应的逆滤波器,y(i)表示第i帧加窗语音信号。z(i)=g(i)y(i). Among them, z(i) represents the fidelity speech signal of the i-th frame, g(i) represents the inverse filter corresponding to the windowed speech signal of the i-th frame, and y(i) represents the windowed speech signal of the i-th frame.
例如,电子设备可以采用第4帧加窗语音信号对应的逆滤波器g(4)对第4帧加窗语音信号y(4)进行逆滤波处理,得到第4帧保真语音信号z(4),即z(4)=g(4)y(4)。For example, the electronic device may use the inverse filter g(4) corresponding to the fourth-frame windowed speech signal to perform inverse filtering on the fourth-frame windowed speech signal y(4), to obtain the fourth-frame fidelity speech signal z(4), that is, z(4)=g(4)y(4).
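Reading y(i)=[y(i-L+1),…,y(i)] as the vector of the L most recent samples, z(i)=g(i)y(i) is an inner product per output sample, i.e. an FIR filtering; the kurtosis the embodiment monitors is E[z^4]/E^2[z^2]. The filter length L=64 and the test signals below are illustrative assumptions:

```python
# Sketch of z(i) = g(i) y(i): inner product of the length-L inverse filter
# with the L most recent samples, plus the kurtosis statistic the
# embodiment monitors. L=64 and the signals are assumptions.
import numpy as np

def inverse_filter(g, y):
    """FIR-filter y with g (zero-padded history for the first samples)."""
    L = len(g)
    ypad = np.concatenate([np.zeros(L - 1), y])
    return np.array([g[::-1] @ ypad[n : n + L] for n in range(len(y))])

def kurtosis(x):
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2

g = np.zeros(64)
g[0] = 1.0                               # identity inverse filter
y = np.random.randn(1000)
z = inverse_filter(g, y)
# kurtosis of a sinusoid sampled over whole periods is exactly 1.5
k_sin = kurtosis(np.sin(2 * np.pi * np.arange(10000) / 100))
```

With the identity filter the output equals the input, which gives a simple correctness check before the filter is adapted.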
2035,电子设备根据第i帧之后的每帧线性预测残差信号的前一帧线性预测残差信号和第i帧之后的每帧加窗语音信号的前一帧加窗语音信号对应的逆滤波器,得到第i帧之后的每帧加窗语音信号对应的逆滤波器。2035. The electronic device obtains the inverse filter corresponding to the windowed speech signal of each frame after the i-th frame, according to the linear prediction residual signal of the frame preceding each frame after the i-th frame and the inverse filter corresponding to the windowed speech signal of the frame preceding each frame after the i-th frame.
需要说明的是,第i帧之后的每帧加窗语音信号对应的逆滤波器的确定方式与第i帧加窗语音信号对应的逆滤波器的确定方式不同。It should be noted that the determination method of the inverse filter corresponding to the windowed speech signal of each frame after the i-th frame is different from the determination method of the inverse filter corresponding to the windowed speech signal of the i-th frame.
比如,对于第4帧之后的每帧加窗语音信号,即对于第5帧至第8帧的每帧加窗语音信号,电子设备可以根据第5帧至第8帧的每帧线性预测残差信号的前一帧线性预测残差信号和第5帧至第8帧的每帧加窗语音信号的前一帧加窗语音信号对应的逆滤波器,得到第5帧至第8帧的每帧加窗语音信号对应的逆滤波器。For example, for each frame of windowed speech signal after the fourth frame, that is, for each of frames 5 to 8, the electronic device may obtain the inverse filter corresponding to that frame's windowed speech signal according to the linear prediction residual signal of the frame preceding it and the inverse filter corresponding to the windowed speech signal of the frame preceding it.
例如,对于第5帧加窗语音信号,电子设备可以根据第4帧线性预测残差信号和第4帧加窗语音信号对应的逆滤波器,得到第5帧加窗语音信号对应的逆滤波器,同理,电子设备可以得到第6帧加窗语音信号对应的逆滤波器、第7帧加窗语音信号对应的逆滤波器、第8帧加窗语音信号对应的逆滤波器。For example, for the fifth-frame windowed speech signal, the electronic device may obtain the inverse filter corresponding to the fifth-frame windowed speech signal according to the fourth-frame linear prediction residual signal and the inverse filter corresponding to the fourth-frame windowed speech signal. In the same way, the electronic device can obtain the inverse filters corresponding to the windowed speech signals of the sixth, seventh, and eighth frames.
2036,电子设备根据第i帧之后的每帧加窗语音信号对应的逆滤波器和第i帧之后的每帧加窗语音信号,得到第i帧之后的多帧保真语音信号。2036, the electronic device obtains the multi-frame fidelity speech signal after the i-th frame according to the inverse filter corresponding to the windowed speech signal for each frame after the i-th frame and the windowed speech signal for each frame after the i-th frame.
比如,电子设备可以将第5帧加窗语音信号对应的逆滤波器g(5)和第5帧加窗语音信号y(5)代入公式z(i)=g(i)y(i),得到第5帧保真语音信号z(5)。同理,电子设备可以得到第6帧保真信号z(6)、第7帧保真信号z(7)、第8帧保真信号z(8),即得到第4帧之后的多帧保真语音信号。For example, the electronic device may substitute the inverse filter g(5) corresponding to the fifth-frame windowed speech signal and the fifth-frame windowed speech signal y(5) into the formula z(i)=g(i)y(i) to obtain the fifth-frame fidelity speech signal z(5). Similarly, the electronic device can obtain the sixth-frame fidelity signal z(6), the seventh-frame fidelity signal z(7), and the eighth-frame fidelity signal z(8), that is, the multi-frame fidelity speech signal after the fourth frame.
2037,电子设备结合第i帧保真语音信号与第i帧之后的多帧保真语音信号,得到多帧保真语音信号。2037. The electronic device combines the fidelity speech signal of the i-th frame and the multi-frame fidelity speech signal after the i-th frame to obtain a multi-frame fidelity speech signal.
比如,电子设备结合第4帧保真语音信号z(4)与第5帧至第8帧保真语音信号z(5)、z(6)、z(7)、z(8),得到5帧保真语音信号。For example, the electronic device combines the fourth-frame fidelity speech signal z(4) with the fifth- to eighth-frame fidelity speech signals z(5), z(6), z(7), and z(8), to obtain 5 frames of fidelity speech signals.
在一些实施方式中,当电子设备检测到第i帧滤波语音信号峰度最小时,电子设备可以获取峰度最小的第i帧滤波语音信号。In some embodiments, when the electronic device detects that the i-th frame filtered speech signal has the smallest kurtosis, the electronic device may acquire the i-th frame filtered speech signal with the smallest kurtosis.
比如,电子设备可以采用逆滤波器对第4帧线性预测残差信号进行逆滤波处理,得到第4帧滤波语音信号。电子设备不断更改逆滤波器的参数,使得第4帧滤波语音信号不断变化。同时,电子设备持续检测不断变化的第4帧滤波语音信号的峰度。当电子设备检测到第4帧逆滤波语音信号的峰度最小时,获取峰度最小的第4帧滤波语音信号s(4)。For example, the electronic device may use an inverse filter to perform inverse filtering on the linear prediction residual signal of the fourth frame to obtain the filtered speech signal of the fourth frame. The electronic device continuously changes the parameters of the inverse filter, so that the filtered speech signal of the fourth frame continuously changes. At the same time, the electronic device continuously detects the changing kurtosis of the filtered speech signal in the fourth frame. When the electronic device detects that the kurtosis of the inverse filtered speech signal in the fourth frame is the smallest, the fourth frame filtered speech signal s(4) with the smallest kurtosis is acquired.
那么,如图4所示,流程2035可以包括以下流程:Then, as shown in FIG. 4, the process 2035 may include the following processes:
20351,电子设备根据第i帧线性预测残差信号、第i帧加窗语音信号对应的逆滤波器和峰度最小的第i帧滤波语音信号,得到第i+1帧加窗语音信号对应的逆滤波器。20351. The electronic device obtains the inverse filter corresponding to the windowed speech signal of frame i+1, according to the linear prediction residual signal of frame i, the inverse filter corresponding to the windowed speech signal of frame i, and the frame-i filtered speech signal with the smallest kurtosis.
第i+1帧加窗语音信号对应的逆滤波器的计算公式为:The calculation formula of the inverse filter corresponding to the windowed speech signal in frame i+1 is:
g(i+1)=g(i)+μe(i)w(i),其中,
Figure PCTCN2018118713-appb-000005
s(i)表示第i帧滤波语音信号,g(i+1)表示第i+1帧加窗语音信号对应的逆滤波器,g(i)表示第i帧加窗语音信号对应的逆滤波器,w(i) 表示第i帧线性预测残差信号,μ=3*10 -9为收敛步长,E(x)表示期望。
g(i+1)=g(i)+μe(i)w(i), where,
Figure PCTCN2018118713-appb-000005
s(i) represents the filtered speech signal of frame i, g(i+1) represents the inverse filter corresponding to the windowed speech signal of frame i+1, g(i) represents the inverse filter corresponding to the windowed speech signal of frame i, w(i) represents the linear prediction residual signal of frame i, μ=3*10 -9 is the convergence step size, and E(x) represents the expectation.
比如,电子设备可以根据第4帧线性预测残差信号w(4)、第4帧加窗语音信号对应的逆滤波器g(4)和峰度最小的第4帧滤波语音信号s(4),得到第5帧加窗语音信号对应的逆滤波器g(5)。即g(5)=g(4)+μe(4)w(4),其中,
Figure PCTCN2018118713-appb-000006
For example, the electronic device may obtain the inverse filter g(5) corresponding to the fifth-frame windowed speech signal according to the fourth-frame linear prediction residual signal w(4), the inverse filter g(4) corresponding to the fourth-frame windowed speech signal, and the fourth-frame filtered speech signal s(4) with the smallest kurtosis. That is, g(5)=g(4)+μe(4)w(4), where,
Figure PCTCN2018118713-appb-000006
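The error term e(i) is defined in a patent figure that is not reproduced in this text, so the sketch below treats e(4) as a given scalar (a hypothetical value) and shows only the stated update rule g(i+1)=g(i)+μe(i)w(i) with the stated step size μ=3*10^-9:

```python
# Sketch of the update g(i+1) = g(i) + mu * e(i) * w(i). Only the rule and
# mu = 3e-9 come from the text; e(4) and the signals are hypothetical.
import numpy as np

mu = 3e-9                                # convergence step size from the text

def update_inverse_filter(g, e_i, w_i):
    """One adaptive step: g(i+1) = g(i) + mu * e(i) * w(i)."""
    return g + mu * e_i * w_i

g4 = np.zeros(64)
g4[0] = 1.0                              # g(4): an identity starting filter
w4 = np.random.randn(64)                 # w(4): residual segment of length L
e4 = 2.5                                 # hypothetical error value e(4)
g5 = update_inverse_filter(g4, e4, w4)
```

The tiny step size means each frame nudges the filter only slightly, so the kurtosis-based criterion drives a gradual adaptation across frames.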
20352,电子设备根据第i+1帧之后的每帧加窗语音信号的前一帧加窗语音信号对应的逆滤波器和第i+1帧之后的每帧线性预测残差信号的前一帧线性预测残差信号,得到第i+1帧之后的每帧滤波语音信号的前一帧滤波语音信号。20352. The electronic device obtains the filtered speech signal of the frame preceding each frame of filtered speech signal after frame i+1, according to the inverse filter corresponding to the windowed speech signal of the frame preceding each frame after frame i+1 and the linear prediction residual signal of that preceding frame.
20353,电子设备根据第i+1帧之后的每帧线性预测残差信号的前一帧线性预测残差信号、第i+1帧之后的每帧加窗语音信号的前一帧加窗语音信号对应的逆滤波器和第i+1帧之后的每帧滤波语音信号的前一帧滤波语音信号,得到第i+1帧之后的每帧加窗语音信号对应的逆滤波器。20353. The electronic device obtains the inverse filter corresponding to the windowed speech signal of each frame after frame i+1, according to the linear prediction residual signal of the frame preceding each such frame, the inverse filter corresponding to the windowed speech signal of the preceding frame, and the filtered speech signal of the preceding frame.
其中,第i+1帧之后的每帧加窗语音信号对应的逆滤波器的计算公式可以为:The calculation formula of the inverse filter corresponding to the windowed speech signal of each frame after the i+1th frame can be:
g(i+j+1)=g(i+j)+μe(i+j)w(i+j),其中,
Figure PCTCN2018118713-appb-000007
s(i+j)表示第i+j帧滤波语音信号,g(i+j)表示第i+j帧加窗语音信号对应的逆滤波器,g(i+j+1)表示第i+j+1帧加窗语音信号对应的逆滤波器,w(i+j)表示第i+j帧线性预测残差信号,μ=3*10 -9为收敛步长,E(x)表示期望,j≥1。
g(i+j+1)=g(i+j)+μe(i+j)w(i+j), where,
Figure PCTCN2018118713-appb-000007
s(i+j) represents the filtered speech signal of frame i+j, g(i+j) represents the inverse filter corresponding to the windowed speech signal of frame i+j, g(i+j+1) represents the inverse filter corresponding to the windowed speech signal of frame i+j+1, w(i+j) represents the linear prediction residual signal of frame i+j, μ=3*10 -9 is the convergence step size, E(x) represents the expectation, and j≥1.
根据g(i+j+1)=g(i+j)+μe(i+j)w(i+j)和
Figure PCTCN2018118713-appb-000008
可以确定:当前帧加窗语音信号对应的逆滤波器可以根据当前帧的前一帧线性预测残差信号、当前帧的前一帧加窗语音信号对应的逆滤波器和当前帧的前一帧滤波语音信号确定。
According to g(i+j+1)=g(i+j)+μe(i+j)w(i+j) and
Figure PCTCN2018118713-appb-000008
It can be determined that the inverse filter corresponding to the windowed speech signal of the current frame can be determined from the linear prediction residual signal of the frame preceding the current frame, the inverse filter corresponding to the windowed speech signal of the preceding frame, and the filtered speech signal of the preceding frame.
比如,第i+2帧加窗语音信号对应的逆滤波器可以根据第i+1帧线性预测残差信号、第i+1帧加窗语音信号对应的逆滤波器和第i+1帧滤波语音信号确定。在本实施例中,电子设备在执行流程20353之前,便已经得到第i+1帧加窗语音信号和线性预测残差信号。因此,为了确定第i+2帧加窗语音信号对应的逆滤波器,需要确定第i+1帧滤波语音信号,即需要确定第i+2帧滤波语音信号的前一帧滤波语音信号。For example, the inverse filter corresponding to the windowed speech signal of frame i+2 can be determined from the linear prediction residual signal of frame i+1, the inverse filter corresponding to the windowed speech signal of frame i+1, and the filtered speech signal of frame i+1. In this embodiment, before performing the process 20353, the electronic device has already obtained the windowed speech signal and the linear prediction residual signal of frame i+1. Therefore, in order to determine the inverse filter corresponding to the windowed speech signal of frame i+2, the filtered speech signal of frame i+1 needs to be determined, that is, the filtered speech signal of the frame preceding the filtered speech signal of frame i+2.
其中,第i+1帧之后的每帧滤波语音信号的前一帧滤波语音信号的计算公式为:The calculation formula of the filtered speech signal of the previous frame of the filtered speech signal of each frame after the i+1th frame is:
s(i+j)=g(i+j)w(i+j),其中,s(i+j)表示第i+j帧滤波语音信号,g(i+j)表示第i+j帧加窗语音信号对应的逆滤波器,w(i+j)表示第i+j帧线性预测残差信号,j≥1。s(i+j)=g(i+j)w(i+j), where s(i+j) represents the filtered speech signal of frame i+j, g(i+j) represents the inverse filter corresponding to the windowed speech signal of frame i+j, and w(i+j) represents the linear prediction residual signal of frame i+j, j≥1.
例如,若i=4,那么第i+2帧滤波语音信号的前一帧滤波语音信号(此时j=1),即第i+1帧滤波语音信号,即第5帧滤波语音信号s(5)=g(5)w(5)。从而第i+2帧加窗语音信号对应的逆滤波器,即第6帧加窗语音信号对应的逆滤波器g(6)可以根据第5帧线性预测残差信号、第5帧加窗语音信号对应的逆滤波器和第5帧滤波语音信号确定,即g(6)=g(5)+μe(5)w(5),其中,
Figure PCTCN2018118713-appb-000009
当得到第6帧加窗语音信号对应的逆滤波器g(6)时,电子设备可以确定第6帧滤波语音信号s(6)=g(6)w(6)。从而,第7帧加窗语音信号对应的逆滤波器g(7)=g(6)+μe(6)w(6), 其中,
Figure PCTCN2018118713-appb-000010
同理,电子设备可以得到第8帧加窗语音信号对应的逆滤波器g(8)。
For example, if i=4, the filtered speech signal of the frame preceding the filtered speech signal of frame i+2 (here j=1) is the filtered speech signal of frame i+1, that is, the fifth-frame filtered speech signal s(5)=g(5)w(5). Therefore, the inverse filter corresponding to the windowed speech signal of frame i+2, that is, the inverse filter g(6) corresponding to the sixth-frame windowed speech signal, can be determined from the fifth-frame linear prediction residual signal, the inverse filter corresponding to the fifth-frame windowed speech signal, and the fifth-frame filtered speech signal, that is, g(6)=g(5)+μe(5)w(5), where,
Figure PCTCN2018118713-appb-000009
When the inverse filter g(6) corresponding to the windowed speech signal in the sixth frame is obtained, the electronic device may determine the filtered speech signal s(6)=g(6)w(6) in the sixth frame. Therefore, the inverse filter g(7)=g(6)+μe(6)w(6) corresponding to the windowed speech signal in the seventh frame, where,
Figure PCTCN2018118713-appb-000010
Similarly, the electronic device can obtain the inverse filter g(8) corresponding to the windowed speech signal of the eighth frame.
20354: The electronic device obtains the inverse filters corresponding to the frames of windowed speech signal after the i-th frame according to the inverse filter corresponding to the (i+1)-th frame of windowed speech signal and the inverse filters corresponding to each frame of windowed speech signal after the (i+1)-th frame.
For example, from the inverse filter g(5) corresponding to the 5th frame of windowed speech signal, the inverse filter g(6) corresponding to the 6th frame, the inverse filter g(7) corresponding to the 7th frame, and the inverse filter g(8) corresponding to the 8th frame, the electronic device obtains the inverse filter corresponding to each frame of windowed speech signal after the 4th frame.
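As an illustrative sketch (not part of the filing), the per-frame recursion s(i+j) = g(i+j)w(i+j), g(i+j+1) = g(i+j) + μe(i+j)w(i+j) can be written as follows. The exact form of the kurtosis-gradient term e(·) is given only by the formula images PCTCN2018118713-appb-000009/000010, so `kurtosis_gradient` below is an assumed stand-in modeled on common kurtosis-driven blind-deconvolution updates; the element-wise product mirrors the filing's notation.

```python
import numpy as np

def kurtosis_gradient(s):
    # Assumed stand-in for e(.) from the filing's formula images: a
    # normalized kurtosis-gradient term common in kurtosis-based
    # blind deconvolution. The actual definition is in the images.
    m2 = np.mean(s ** 2)
    m4 = np.mean(s ** 4)
    return (m2 * s ** 3 - m4 * s) / (m2 ** 3 + 1e-12)

def update_inverse_filters(g, residuals, mu=0.01):
    """Given g(i+1) and the residuals w(i+1), w(i+2), ..., return the
    inverse filters g(i+2), g(i+3), ... via the recursion
    s = g * w (product, per the filing's notation) and
    g_next = g + mu * e(s) * w."""
    filters = []
    for w in residuals:
        s = g * w                                 # filtered speech signal of this frame
        g = g + mu * kurtosis_gradient(s) * w     # inverse filter for the next frame
        filters.append(g)
    return filters
```

For instance, starting from g(5) and feeding in w(5), w(6), w(7) yields g(6), g(7), g(8) in turn, matching the example in the text.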
In some embodiments, process 201 may include the following:
when the electronic device is in a call state, acquiring the voice information of a user;
detecting, by the electronic device, whether the voice information of the user includes a preset keyword;
if the voice information of the user includes the preset keyword, acquiring, by the electronic device, the reverberation speech signal.
The voice information of the user may be the voice information of the other party on the call. For example, when the electronic device is in a call state, the electronic device acquires the voice information of the party with whom the current user is on a call, and then detects whether that voice information includes a preset keyword. The preset keyword may be, for example, "can't hear clearly" or "say that again". When the voice information of the other party includes a preset keyword such as "can't hear clearly", the current user may be too far from the electronic device, so that the signal picked up by the electronic device is a mixture of direct sound and reflected sound, that is, the reverberation speech signal in this embodiment. Therefore, when the electronic device detects that the voice information of the user includes a preset keyword, the electronic device may acquire the reverberation speech signal, that is, the mixed signal picked up by the microphone.
In some embodiments, when performing the process of acquiring the reverberation speech signal if the voice information of the user includes the preset keyword, the electronic device may perform the following:
if the voice information of the user includes the preset keyword, generating, by the electronic device, a record and saving the record;
when the number of saved records is greater than a preset number threshold, acquiring, by the electronic device, the reverberation speech signal.
To reduce the processing load of the processor, and considering that a poor current signal of the electronic device may itself cause the electronic device to detect a preset keyword such as "can't hear clearly" in the voice information of the other party, the electronic device may generate and save a record each time such a preset keyword is detected. When the number of saved records is greater than the preset number threshold, the electronic device acquires the reverberation speech signal. The preset number threshold may be set by the user or determined by the electronic device, and is not specifically limited here. Assuming the preset number threshold is set to 10, the electronic device acquires the reverberation speech signal when the number of saved records reaches 11.
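The keyword-record trigger described above might be sketched as follows; the keyword list and the threshold of 10 follow the examples in the text, while the class and method names are hypothetical.

```python
class ReverbTrigger:
    """Sketch of the record-counting trigger: each far-end utterance
    containing a preset keyword adds one saved record; once the count
    exceeds the threshold, dereverberation starts and the records are
    deleted."""
    KEYWORDS = ("听不清楚", "再说一遍")  # "can't hear clearly", "say that again"

    def __init__(self, threshold=10):
        self.threshold = threshold
        self.records = 0

    def on_far_end_utterance(self, text):
        """Returns True when the reverberation signal should be acquired."""
        if any(k in text for k in self.KEYWORDS):
            self.records += 1            # generate and save one record
        if self.records > self.threshold:
            self.records = 0             # delete the saved records
            return True                  # start acquiring the reverberation signal
        return False
```

With the threshold at 10, the trigger fires on the 11th matching utterance, as in the example above.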
When the electronic device performs the process of acquiring the reverberation speech signal, the electronic device may delete the saved records and stop performing the process of detecting whether the voice information of the user includes the preset keyword.
Likewise, to reduce the processing load of the processor, if no preset keyword is detected in the voice information of the other party for a period of time after the electronic device starts acquiring the reverberation speech signal, this may indicate that the user is now close to the electronic device and the microphone no longer picks up a mixed signal; at this point, the electronic device may stop the process of acquiring the reverberation speech signal.
For example, if no preset keyword is detected in the voice information of the other party in the 20 minutes after the electronic device starts acquiring the reverberation speech signal, this may indicate that the user is now close to the electronic device and the microphone no longer picks up a mixed signal; at this point, the electronic device may stop the process of acquiring the reverberation speech signal.
In some embodiments, process 201 may include the following:
when the electronic device is about to perform voiceprint recognition or speech recognition, detecting, by the electronic device, whether the distance between the sound source and the electronic device is greater than a preset distance threshold;
if the distance between the sound source and the electronic device is greater than the preset distance threshold, acquiring, by the electronic device, the reverberation speech signal.
It can be understood that if the sound source is too far from the electronic device, the signal picked up by the electronic device is a mixture of direct sound and reflected sound, that is, a reverberation speech signal. A reverberation speech signal exhibits reverberation, which can interfere with the results of voiceprint recognition and speech recognition performed by the electronic device.
Therefore, when the electronic device is about to perform voiceprint recognition or speech recognition, the electronic device may detect whether the distance between the sound source and the electronic device is greater than a preset distance threshold; if the distance between the sound source and the electronic device is greater than the preset distance threshold, the electronic device acquires the reverberation speech signal.
The preset distance threshold may be set according to the actual situation and is not specifically limited here.
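A minimal sketch of this distance gate, assuming a 1.5 m threshold purely for illustration (the filing leaves the threshold implementation-defined):

```python
def should_dereverberate(distance_m, threshold_m=1.5):
    """Returns True when the sound source is far enough from the
    device that the picked-up signal is treated as a reverberation
    speech signal. The 1.5 m default is an assumed example; the
    filing does not fix a value."""
    return distance_m > threshold_m
```

The recognition front end would call this before voiceprint or speech recognition and run dereverberation only when it returns True.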
Referring to FIG. 5, FIG. 5 is a schematic structural diagram of a voice processing apparatus 300 provided by an embodiment of the present application. The voice processing apparatus 300 may include: an acquisition module 301, a processing module 302, a transformation module 303, a calculation module 304, and a construction module 305.
The acquisition module 301 is configured to acquire a reverberation speech signal.
The processing module 302 is configured to perform signal fidelity processing on the reverberation speech signal to obtain a first speech signal.
The transformation module 303 is configured to perform a Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal.
The calculation module 304 is configured to calculate a second power spectrum according to the first power spectrum.
The construction module 305 is configured to construct a target clean speech signal according to the phase spectrum corresponding to the first speech signal and the second power spectrum.
In some embodiments, the acquisition module 301 may be configured to: perform windowing and framing on the reverberation speech signal to obtain multiple frames of windowed speech signal.
The processing module 302 may be configured to: perform signal fidelity processing on each frame of windowed speech signal to obtain multiple frames of fidelity speech signal, the multiple frames of fidelity speech signal constituting the first speech signal.
The transformation module 303 may be configured to: perform a Fourier transform on each frame of fidelity speech signal to obtain the phase spectrum and power spectrum corresponding to each of the multiple frames of fidelity speech signal, these phase spectra and power spectra constituting the phase spectrum and the first power spectrum corresponding to the first speech signal.
The calculation module 304 may be configured to: calculate a third power spectrum according to the power spectrum corresponding to each frame of fidelity speech signal, obtaining multiple third power spectra, the multiple third power spectra constituting the second power spectrum.
The construction module 305 may be configured to: construct each frame of clean speech signal according to the phase spectrum corresponding to each frame of fidelity speech signal and the multiple third power spectra, obtaining multiple frames of clean speech signal; and perform windowed frame-synthesis processing on the multiple frames of clean speech signal to obtain the target clean speech signal.
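A rough sketch of the per-frame flow these modules implement: windowing, Fourier transform into phase and power spectra, power-spectrum modification, per-frame reconstruction, and windowed overlap-add. The `power2` line stands in for computing the "second power spectrum" from the first, whose details the filing gives elsewhere; the simple spectral floor shown here is an assumption.

```python
import numpy as np

def dereverberate(x, frame_len=512, hop=256):
    """Window -> FFT (phase + power) -> modified power -> per-frame
    inverse FFT -> overlap-add. The power modification is a placeholder
    for the filing's second-power-spectrum computation."""
    win = np.hanning(frame_len)
    out = np.zeros(len(x))
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * win           # windowed frame
        spec = np.fft.rfft(frame)
        phase = np.angle(spec)                             # phase spectrum
        power = np.abs(spec) ** 2                          # first power spectrum
        power2 = np.maximum(power - 0.1 * power.mean(), 0) # assumed modification
        clean = np.sqrt(power2) * np.exp(1j * phase)       # rebuild complex spectrum
        out[start:start + frame_len] += np.fft.irfft(clean, frame_len) * win
    return out
```

The overlap-add with a synthesis window mirrors the "windowed frame-synthesis" step; a production implementation would also normalize for the window overlap gain.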
In some embodiments, the processing module 302 may be configured to: perform linear prediction analysis on the i-th frame and each frame of windowed speech signal after the i-th frame to obtain the linear prediction residual signal of the i-th frame and of each frame after the i-th frame; perform inverse filtering on the i-th frame of linear prediction residual signal using an inverse filter to obtain the i-th frame of filtered speech signal; when it is detected that the kurtosis of the i-th frame of filtered speech signal is minimal, obtain the inverse filter that minimizes the kurtosis of the i-th frame of filtered speech signal as the inverse filter corresponding to the i-th frame of windowed speech signal; perform inverse filtering on the i-th frame of windowed speech signal using the inverse filter corresponding to the i-th frame of windowed speech signal to obtain the i-th frame of fidelity speech signal; obtain the inverse filter corresponding to each frame of windowed speech signal after the i-th frame according to the linear prediction residual signal of the frame preceding each frame of linear prediction residual signal after the i-th frame and the inverse filter corresponding to the frame of windowed speech signal preceding each frame of windowed speech signal after the i-th frame; obtain the multiple frames of fidelity speech signal after the i-th frame according to the inverse filters corresponding to each frame of windowed speech signal after the i-th frame and each frame of windowed speech signal after the i-th frame; and combine the i-th frame of fidelity speech signal with the multiple frames of fidelity speech signal after the i-th frame to obtain the multiple frames of fidelity speech signal.
In some embodiments, the processing module 302 may be configured to: when it is detected that the kurtosis of the i-th frame of filtered speech signal is minimal, obtain the i-th frame of filtered speech signal with minimal kurtosis; obtain the inverse filter corresponding to the (i+1)-th frame of windowed speech signal according to the i-th frame of linear prediction residual signal, the inverse filter corresponding to the i-th frame of windowed speech signal, and the i-th frame of filtered speech signal with minimal kurtosis; obtain the filtered speech signal of the frame preceding each frame of filtered speech signal after the (i+1)-th frame according to the inverse filter corresponding to the frame of windowed speech signal preceding each frame of windowed speech signal after the (i+1)-th frame and the linear prediction residual signal of the frame preceding each frame of linear prediction residual signal after the (i+1)-th frame; obtain the inverse filter corresponding to each frame of windowed speech signal after the (i+1)-th frame according to the linear prediction residual signal of the preceding frame, the inverse filter corresponding to the preceding frame of windowed speech signal, and the filtered speech signal of the preceding frame; and obtain the inverse filters corresponding to the frames of windowed speech signal after the i-th frame according to the inverse filter corresponding to the (i+1)-th frame of windowed speech signal and the inverse filters corresponding to each frame of windowed speech signal after the (i+1)-th frame.
In some embodiments, the acquisition module 301 may be configured to: acquire the voice information of a user when the electronic device is in a call state; detect whether the voice information of the user includes a preset keyword; and, if the voice information of the user includes the preset keyword, acquire the reverberation speech signal.
In some embodiments, the acquisition module 301 may be configured to: if the voice information of the user includes the preset keyword, generate a record and save the record; and, when the number of saved records is greater than a preset number threshold, acquire the reverberation speech signal.
In some embodiments, the acquisition module 301 may be configured to: when the electronic device is about to perform voiceprint recognition or speech recognition, detect whether the distance between the sound source and the electronic device is greater than a preset distance threshold; and, if the distance between the sound source and the electronic device is greater than the preset distance threshold, acquire the reverberation speech signal.
An embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed on a computer, causes the computer to perform the processes in the voice processing method provided by this embodiment.
An embodiment of the present application further provides an electronic device including a memory and a processor, the memory storing a computer program, the processor being configured to perform the processes in the voice processing method provided by this embodiment by invoking the computer program stored in the memory.
For example, the electronic device may be a mobile terminal such as a tablet computer or a smartphone.
Referring to FIG. 6, FIG. 6 is a first schematic structural diagram of an electronic device provided by an embodiment of the present application.
The electronic device 400 may include components such as a microphone 401, a memory 402, and a processor 403. Those skilled in the art will understand that the structure of the electronic device shown in FIG. 6 does not constitute a limitation on the electronic device; it may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
The microphone 401 may be used to pick up speech uttered by the user, and the like.
The memory 402 may be used to store application programs and data. The application programs stored in the memory 402 contain executable code and may form various functional modules. The processor 403 runs the application programs stored in the memory 402 to perform various functional applications and data processing.
The processor 403 is the control center of the electronic device. It connects the various parts of the entire electronic device through various interfaces and lines, and performs the various functions of the electronic device and processes its data by running or executing the application programs stored in the memory 402 and invoking the data stored in the memory 402, thereby monitoring the electronic device as a whole.
In this embodiment, the processor 403 in the electronic device loads the executable code corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 403 runs the application programs stored in the memory 402 to implement the following processes:
acquiring a reverberation speech signal;
performing signal fidelity processing on the reverberation speech signal to obtain a first speech signal;
performing a Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal;
calculating a second power spectrum according to the first power spectrum;
constructing a target clean speech signal according to the phase spectrum corresponding to the first speech signal and the second power spectrum.
Referring to FIG. 7, FIG. 7 is a second schematic structural diagram of an electronic device provided by an embodiment of the present application.
The electronic device 500 may include components such as a microphone 501, a memory 502, a processor 503, an input unit 504, an output unit 505, and a speaker 506.
The microphone 501 may be used to pick up speech uttered by the user, and the like.
The memory 502 may be used to store application programs and data. The application programs stored in the memory 502 contain executable code and may form various functional modules. The processor 503 runs the application programs stored in the memory 502 to perform various functional applications and data processing.
The processor 503 is the control center of the electronic device. It connects the various parts of the entire electronic device through various interfaces and lines, and performs the various functions of the electronic device and processes its data by running or executing the application programs stored in the memory 502 and invoking the data stored in the memory 502, thereby monitoring the electronic device as a whole.
The input unit 504 may be used to receive input numbers, character information, or user characteristic information (such as fingerprints), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The output unit 505 may be used to display information input by the user or provided to the user, as well as the various graphical user interfaces of the electronic device, which may be composed of graphics, text, icons, video, and any combination thereof. The output unit may include a display panel.
The speaker 506 may be used to convert electrical signals into sound.
In this embodiment, the processor 503 in the electronic device loads the executable code corresponding to the processes of one or more application programs into the memory 502 according to the following instructions, and the processor 503 runs the application programs stored in the memory 502 to implement the following processes:
acquiring a reverberation speech signal;
performing signal fidelity processing on the reverberation speech signal to obtain a first speech signal;
performing a Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal;
calculating a second power spectrum according to the first power spectrum;
constructing a target clean speech signal according to the phase spectrum corresponding to the first speech signal and the second power spectrum.
In some embodiments, after performing the process of acquiring the reverberation speech signal, the processor 503 may further perform: windowing and framing the reverberation speech signal to obtain multiple frames of windowed speech signal. When performing the process of performing signal fidelity processing on the reverberation speech signal to obtain the first speech signal, the processor 503 may perform: performing signal fidelity processing on each frame of windowed speech signal to obtain multiple frames of fidelity speech signal, the multiple frames of fidelity speech signal constituting the first speech signal. When performing the process of performing a Fourier transform on the first speech signal to obtain the phase spectrum and the first power spectrum corresponding to the first speech signal, the processor 503 may perform: performing a Fourier transform on each frame of fidelity speech signal to obtain the phase spectrum and power spectrum corresponding to each of the multiple frames of fidelity speech signal, these phase spectra and power spectra constituting the phase spectrum and the first power spectrum corresponding to the first speech signal. When performing the process of calculating the second power spectrum according to the first power spectrum, the processor 503 may perform: calculating a third power spectrum according to the power spectrum corresponding to each frame of fidelity speech signal, obtaining multiple third power spectra, the multiple third power spectra constituting the second power spectrum. When performing the process of constructing the target clean speech signal according to the phase spectrum corresponding to the first speech signal and the second power spectrum, the processor 503 may perform: constructing each frame of clean speech signal according to the phase spectrum corresponding to each frame of fidelity speech signal and the multiple third power spectra, obtaining multiple frames of clean speech signal; and performing windowed frame-synthesis processing on the multiple frames of clean speech signal to obtain the target clean speech signal.
In some embodiments, when performing the process of performing signal fidelity processing on each frame of windowed speech signal to obtain multiple frames of fidelity speech signal, the processor 503 may perform: performing linear prediction analysis on the i-th frame and each frame of windowed speech signal after the i-th frame to obtain the linear prediction residual signal of the i-th frame and of each frame after the i-th frame; performing inverse filtering on the i-th frame of linear prediction residual signal using an inverse filter to obtain the i-th frame of filtered speech signal; when it is detected that the kurtosis of the i-th frame of filtered speech signal is minimal, obtaining the inverse filter that minimizes the kurtosis of the i-th frame of filtered speech signal as the inverse filter corresponding to the i-th frame of windowed speech signal; performing inverse filtering on the i-th frame of windowed speech signal using the inverse filter corresponding to the i-th frame of windowed speech signal to obtain the i-th frame of fidelity speech signal; obtaining the inverse filter corresponding to each frame of windowed speech signal after the i-th frame according to the linear prediction residual signal of the frame preceding each frame of linear prediction residual signal after the i-th frame and the inverse filter corresponding to the frame of windowed speech signal preceding each frame of windowed speech signal after the i-th frame; obtaining the multiple frames of fidelity speech signal after the i-th frame according to the inverse filters corresponding to each frame of windowed speech signal after the i-th frame and each frame of windowed speech signal after the i-th frame; and combining the i-th frame of fidelity speech signal with the multiple frames of fidelity speech signal after the i-th frame to obtain the multiple frames of fidelity speech signal.
In some embodiments, the processor 503 may further perform: when it is detected that the kurtosis of the i-th frame of filtered speech signal is minimal, obtaining the i-th frame of filtered speech signal with minimal kurtosis. When performing the process of obtaining the inverse filter corresponding to each frame of windowed speech signal after the i-th frame according to the linear prediction residual signal of the preceding frame and the inverse filter corresponding to the preceding frame of windowed speech signal, the processor 503 may perform: obtaining the inverse filter corresponding to the (i+1)-th frame of windowed speech signal according to the i-th frame of linear prediction residual signal, the inverse filter corresponding to the i-th frame of windowed speech signal, and the i-th frame of filtered speech signal with minimal kurtosis; obtaining the filtered speech signal of the frame preceding each frame of filtered speech signal after the (i+1)-th frame according to the inverse filter corresponding to the frame of windowed speech signal preceding each frame of windowed speech signal after the (i+1)-th frame and the linear prediction residual signal of the frame preceding each frame of linear prediction residual signal after the (i+1)-th frame; obtaining the inverse filter corresponding to each frame of windowed speech signal after the (i+1)-th frame according to the linear prediction residual signal of the preceding frame, the inverse filter corresponding to the preceding frame of windowed speech signal, and the filtered speech signal of the preceding frame; and obtaining the inverse filters corresponding to the frames of windowed speech signal after the i-th frame according to the inverse filter corresponding to the (i+1)-th frame of windowed speech signal and the inverse filters corresponding to each frame of windowed speech signal after the (i+1)-th frame.
在一些实施方式中,处理器503执行所述获取混响语音信号的流程时,可以执行:在电子设备处于通话状态时,获取用户的语音信息;检测所述用户的语音信息中是否包括预设关键词;若所述用户的语音信息中包括预设关键词,则获取混响语音信号。In some embodiments, when the processor 503 executes the process of acquiring the reverberation voice signal, it may perform: acquiring the user's voice information when the electronic device is in a call state; detecting whether the user's voice information includes a preset Keywords; if the user's voice information includes preset keywords, the reverb voice signal is obtained.
在一些实施方式中，处理器503执行所述若所述用户的语音信息中包括预设关键词，则获取混响语音信号的流程时，可以执行：若所述用户的语音信息中包括预设关键词，则生成一次记录并保存所述记录；当保存的记录的数量大于预设数量阈值时，获取混响语音信号。In some embodiments, when executing the process of acquiring the reverberation speech signal if the user's voice information includes a preset keyword, the processor 503 may execute: if the user's voice information includes the preset keyword, generating a record and saving the record; and, when the number of saved records is greater than a preset number threshold, acquiring the reverberation speech signal.
在一些实施方式中，处理器503执行所述获取混响语音信号的流程时，可以执行：当电子设备要进行声纹识别或者语音识别时，检测声源与电子设备之间的距离是否大于预设距离阈值；若所述声源与电子设备之间的距离大于预设距离阈值，则获取混响语音信号。In some embodiments, when executing the process of acquiring the reverberation speech signal, the processor 503 may execute: when the electronic device is to perform voiceprint recognition or speech recognition, detecting whether the distance between the sound source and the electronic device is greater than a preset distance threshold; and, if the distance between the sound source and the electronic device is greater than the preset distance threshold, acquiring the reverberation speech signal.
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见上文针对语音处理方法的详细描述,此处不再赘述。In the above embodiments, the description of each embodiment has its own emphasis. For a part that is not detailed in an embodiment, you can refer to the detailed description of the voice processing method above, and it will not be repeated here.
本申请实施例提供的所述语音处理装置与上文实施例中的语音处理方法属于同一构思，在所述语音处理装置上可以运行所述语音处理方法实施例中提供的任一方法，其具体实现过程详见所述语音处理方法实施例，此处不再赘述。The voice processing device provided by the embodiments of the present application and the voice processing method in the above embodiments belong to the same concept; any method provided in the voice processing method embodiments can run on the voice processing device. For the specific implementation process, refer to the voice processing method embodiments, which will not be repeated here.
需要说明的是，对本申请实施例所述语音处理方法而言，本领域普通技术人员可以理解实现本申请实施例所述语音处理方法的全部或部分流程，是可以通过计算机程序来控制相关的硬件来完成，所述计算机程序可存储于一计算机可读取存储介质中，如存储在存储器中，并被至少一个处理器执行，在执行过程中可包括如所述语音处理方法的实施例的流程。其中，所述的存储介质可为磁碟、光盘、只读存储器（ROM，Read Only Memory）、随机存取记忆体（RAM，Random Access Memory）等。It should be noted that, for the voice processing method described in the embodiments of the present application, those of ordinary skill in the art can understand that all or part of the process of implementing the method may be completed by a computer program controlling the relevant hardware. The computer program may be stored in a computer-readable storage medium, for example in a memory, and executed by at least one processor; its execution may include the process of the embodiments of the voice processing method. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
对本申请实施例的所述语音处理装置而言，其各功能模块可以集成在一个处理芯片中，也可以是各个模块单独物理存在，也可以两个或两个以上模块集成在一个模块中。上述集成的模块既可以采用硬件的形式实现，也可以采用软件功能模块的形式实现。所述集成的模块如果以软件功能模块的形式实现并作为独立的产品销售或使用时，也可以存储在一个计算机可读取存储介质中，所述存储介质譬如为只读存储器，磁盘或光盘等。For the voice processing device of the embodiments of the present application, its functional modules may be integrated into one processing chip, each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in the form of hardware or of software function modules. If an integrated module is implemented as a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
以上对本申请实施例所提供的一种语音处理方法、装置、存储介质以及电子设备进行了详细介绍，本文中应用了具体个例对本申请的原理及实施方式进行了阐述，以上实施例的说明只是用于帮助理解本申请的方法及其核心思想；同时，对于本领域的技术人员，依据本申请的思想，在具体实施方式及应用范围上均会有改变之处，综上所述，本说明书内容不应理解为对本申请的限制。The voice processing method, device, storage medium, and electronic device provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application; the descriptions of the above embodiments are only intended to help understand the method of the present application and its core ideas. Meanwhile, those skilled in the art may make changes to the specific implementations and application scope according to the ideas of the present application. In summary, the content of this specification should not be construed as a limitation of the present application.

Claims (20)

  1. 一种语音处理方法,包括:A voice processing method, including:
    获取混响语音信号;Obtain the reverberation voice signal;
    对所述混响语音信号进行信号保真处理,得到第一语音信号;Performing signal fidelity processing on the reverberation speech signal to obtain a first speech signal;
    对所述第一语音信号进行傅里叶变换,得到所述第一语音信号对应的相位谱和第一功率谱;Performing a Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal;
    根据所述第一功率谱,计算第二功率谱;Calculating a second power spectrum according to the first power spectrum;
    根据所述第一语音信号对应的相位谱和所述第二功率谱,构建目标干净语音信号。According to the phase spectrum corresponding to the first speech signal and the second power spectrum, a target clean speech signal is constructed.
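The pipeline of claim 1 — Fourier transform into a phase spectrum and a first power spectrum, computation of a second power spectrum, and reconstruction of a clean signal from the phase plus the modified power — can be sketched in pure Python. The DFT helpers and the 8-sample signal are illustrative assumptions; the claim does not specify how the second power spectrum is computed, so the sketch uses the identity case, where reconstruction returns the input signal:

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse DFT, returning the real part."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

x = [0.0, 1.0, 0.5, -0.3, 0.2, -0.7, 0.4, 0.1]
X = dft(x)
power = [abs(c) ** 2 for c in X]        # first power spectrum
phase = [cmath.phase(c) for c in X]     # phase spectrum
# here the method would replace `power` with an enhanced (second) power spectrum;
# with the power spectrum unchanged, reconstruction recovers the input
X_rec = [math.sqrt(p) * cmath.exp(1j * ph) for p, ph in zip(power, phase)]
y = idft(X_rec)
```

The key design point is that only the magnitude (via the power spectrum) is modified; the phase of the reverberant signal is reused unchanged when constructing the clean signal.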
  2. 根据权利要求1所述的语音处理方法,其中,在所述获取混响语音信号之后,还包括:The speech processing method according to claim 1, wherein after the reverberation speech signal is acquired, further comprising:
    对所述混响语音信号进行加窗分帧处理,得到多帧加窗语音信号;Windowing and framing the reverberation speech signal to obtain a multi-frame windowing speech signal;
    所述对所述混响语音信号进行信号保真处理,得到第一语音信号,包括:The signal fidelity processing of the reverberation speech signal to obtain the first speech signal includes:
    对每帧加窗语音信号进行信号保真处理,得到多帧保真语音信号,所述多帧保真语音信号构成所述第一语音信号;Performing signal fidelity processing on the windowed speech signal of each frame to obtain a multi-frame fidelity speech signal, the multi-frame fidelity speech signal constituting the first speech signal;
    所述对所述第一语音信号进行傅里叶变换,得到所述第一语音信号对应的相位谱和第一功率谱,包括:The performing Fourier transform on the first speech signal to obtain the phase spectrum and the first power spectrum corresponding to the first speech signal includes:
    对每帧保真语音信号进行傅里叶变换，得到多帧保真语音信号分别对应的相位谱和功率谱，所述多帧保真语音信号分别对应的相位谱和功率谱构成第一语音信号对应的相位谱和第一功率谱；Performing Fourier transform on each frame of fidelity speech signal, to obtain the phase spectra and power spectra respectively corresponding to the multiple frames of fidelity speech signals, the phase spectra and power spectra respectively corresponding to the multiple frames of fidelity speech signals constituting the phase spectrum and the first power spectrum corresponding to the first speech signal;
    所述根据所述第一功率谱,计算第二功率谱,包括:The calculating the second power spectrum according to the first power spectrum includes:
    根据每帧保真语音信号对应的功率谱,计算第三功率谱,得到多个第三功率谱,所述多个第三功率谱构成所述第二功率谱;Calculating a third power spectrum according to the power spectrum corresponding to the fidelity speech signal of each frame to obtain a plurality of third power spectrums, the plurality of third power spectrums constituting the second power spectrum;
    所述根据所述第一语音信号对应的相位谱和所述第二功率谱,构建目标干净语音信号,包括:The constructing the target clean speech signal according to the phase spectrum corresponding to the first speech signal and the second power spectrum includes:
    根据每帧保真语音信号对应的相位谱和所述多个第三功率谱,构建每帧干净语音信号,得到多帧干净语音信号;Construct a clean voice signal for each frame according to the phase spectrum corresponding to each frame of the fidelity voice signal and the multiple third power spectrums to obtain multiple frames of clean voice signals;
    对所述多帧干净语音信号进行加窗合帧处理,得到目标干净语音信号。Windowing and framing the multi-frame clean speech signal to obtain the target clean speech signal.
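The windowing/framing and frame-synthesis steps of this claim can be sketched as windowed framing followed by overlap-add. The sketch assumes a periodic Hann window at 50% overlap, which satisfies the constant-overlap-add property so that interior samples are reconstructed exactly; the helper names and sizes are illustrative, not from the patent:

```python
import math

def hann(N):
    # periodic Hann window: w[n] + w[n + N//2] == 1, so 50% overlap sums to one
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]

def split_frames(x, N, hop):
    """Window-and-frame: multiply each hop-spaced segment by the window."""
    w = hann(N)
    return [[x[i + n] * w[n] for n in range(N)]
            for i in range(0, len(x) - N + 1, hop)]

def overlap_add(frames, N, hop):
    """Frame synthesis: sum windowed frames back at their original offsets."""
    y = [0.0] * ((len(frames) - 1) * hop + N)
    for f, fr in enumerate(frames):
        for n in range(N):
            y[f * hop + n] += fr[n]
    return y

x = [math.sin(0.3 * n) for n in range(32)]
y = overlap_add(split_frames(x, 8, 4), 8, 4)
# interior samples (each covered by two overlapping frames) are reconstructed exactly
```

Only the first and last half-frames are attenuated, because they are covered by a single window; every interior sample is recovered bit-for-bit up to floating-point error.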
  3. 根据权利要求2所述的语音处理方法,其中,所述对每帧加窗语音信号进行信号保真处理,得到多帧保真语音信号,包括:The speech processing method according to claim 2, wherein said performing signal fidelity processing on the windowed speech signal of each frame to obtain a multi-frame fidelity speech signal includes:
    对第i帧及第i帧之后的每帧加窗语音信号进行线性预测分析，得到第i帧及第i帧之后的每帧线性预测残差信号；Performing linear prediction analysis on the windowed speech signal of the i-th frame and of each frame after the i-th frame, to obtain the linear prediction residual signal of the i-th frame and of each frame after the i-th frame;
    采用逆滤波器对第i帧线性预测残差信号进行逆滤波处理,得到第i帧滤波语音信号;The inverse filter is used to inversely filter the linear prediction residual signal of the ith frame to obtain the ith frame filtered speech signal;
    当检测到第i帧滤波语音信号峰度最小时,获取使得第i帧滤波语音信号峰度最小的逆滤波器,得到第i帧加窗语音信号对应的逆滤波器;When the kurtosis of the i-th frame filtered speech signal is detected to be the smallest, an inverse filter that minimizes the kurtosis of the i-th frame filtered speech signal is obtained to obtain an inverse filter corresponding to the i-th windowed speech signal
    采用所述第i帧加窗语音信号对应的逆滤波器对第i帧加窗语音信号进行逆滤波处理,得到第i帧保真语音信号;Adopting an inverse filter corresponding to the windowed speech signal of the i-th frame to perform inverse filtering processing on the windowed speech signal of the i-th frame to obtain a fidelity speech signal of the i-th frame;
    根据第i帧之后的每帧线性预测残差信号的前一帧线性预测残差信号和第i帧之后的每帧加窗语音信号的前一帧加窗语音信号对应的逆滤波器，得到第i帧之后的每帧加窗语音信号对应的逆滤波器；Obtaining the inverse filter corresponding to each frame of windowed speech signal after the i-th frame according to the linear prediction residual signal of the frame preceding each frame of linear prediction residual signal after the i-th frame and the inverse filter corresponding to the windowed speech signal of the frame preceding each frame of windowed speech signal after the i-th frame;
    根据所述第i帧之后的每帧加窗语音信号对应的逆滤波器和第i帧之后的每帧加窗语音信号,得到第i帧之后的多帧保真语音信号;Obtaining a multi-frame fidelity speech signal after the i-th frame according to the inverse filter corresponding to the windowed speech signal for each frame after the i-th frame and the windowed speech signal for each frame after the i-th frame;
    结合第i帧保真语音信号与第i帧之后的多帧保真语音信号,得到多帧保真语音信号。Combining the i-th frame fidelity speech signal and the multi-frame fidelity speech signal after the i-th frame, a multi-frame fidelity speech signal is obtained.
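The linear prediction analysis and residual computation on which this claim builds can be sketched with the autocorrelation method and the Levinson-Durbin recursion. The AR(2) test signal, the tiny LCG excitation, and the helper names are illustrative assumptions; the point demonstrated is that the residual of a well-fitted predictor carries less energy than the signal itself:

```python
def autocorr(x, maxlag):
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(maxlag + 1)]

def lpc(x, order):
    """Levinson-Durbin recursion; returns filter taps [1, a1, ..., a_order]."""
    r = autocorr(x, order)
    a = [1.0] + [0.0] * order
    e = r[0]
    for m in range(1, order + 1):
        k = -sum(a[j] * r[m - j] for j in range(m)) / e  # reflection coefficient
        new_a = a[:]
        for j in range(1, m):
            new_a[j] = a[j] + k * a[m - j]
        new_a[m] = k
        a = new_a
        e *= 1 - k * k  # updated prediction-error energy
    return a

def residual(x, a):
    """Linear prediction residual: x filtered by A(z) = 1 + a1*z^-1 + ..."""
    p = len(a) - 1
    return [sum(a[j] * x[n - j] for j in range(p + 1))
            for n in range(p, len(x))]

# deterministic AR(2) test signal driven by a tiny LCG (illustrative)
seed, u, x = 1, [], []
for _ in range(400):
    seed = (1103515245 * seed + 12345) % (1 << 31)
    u.append(seed / (1 << 31) - 0.5)
for n in range(400):
    x.append(u[n] + (0.6 * x[n - 1] if n >= 1 else 0.0)
                  - (0.2 * x[n - 2] if n >= 2 else 0.0))

e = residual(x, lpc(x, 2))
```

In the patent's scheme, it is this residual signal, not the speech itself, that is passed through the candidate inverse filters, since the residual exposes the room impulse response more directly than the vocal-tract-shaped waveform.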
  4. 根据权利要求3所述的语音处理方法,其中,所述方法还包括:The voice processing method according to claim 3, wherein the method further comprises:
    当检测到第i帧滤波语音信号峰度最小时,获取峰度最小的第i帧滤波语音信号;When it is detected that the i-th frame filtered speech signal has the smallest kurtosis, the i-th frame filtered speech signal with the smallest kurtosis is acquired;
    所述根据第i帧之后的每帧线性预测残差信号的前一帧线性预测残差信号和第i帧之后的每帧加窗语音信号的前一帧加窗语音信号对应的逆滤波器,得到第i帧之后的每帧加窗语音信号对应的逆滤波器,包括:The inverse filter corresponding to the linear prediction residual signal of the previous frame of the linear prediction residual signal of each frame after the i-th frame and the previous frame of the windowed speech signal of the windowed voice signal of each frame after the i-th frame, Obtain the inverse filter corresponding to the windowed speech signal of each frame after the i-th frame, including:
    根据第i帧线性预测残差信号、第i帧加窗语音信号对应的逆滤波器和峰度最小的第i帧滤波语音信号,得到第i+1帧加窗语音信号对应的逆滤波器;Obtain the inverse filter corresponding to the windowed voice signal in frame i+1 according to the linear prediction residual signal in the i frame, the inverse filter corresponding to the windowed voice signal in the i frame and the i-frame filtered voice signal with the least kurtosis;
    根据第i+1帧之后的每帧加窗语音信号的前一帧加窗语音信号对应的逆滤波器和第i+1帧之后的每帧线性预测残差信号的前一帧线性预测残差信号，得到第i+1帧之后的每帧滤波语音信号的前一帧滤波语音信号；Obtaining the filtered speech signal of the frame preceding each frame of filtered speech signal after the (i+1)-th frame according to the inverse filter corresponding to the windowed speech signal of the frame preceding each frame of windowed speech signal after the (i+1)-th frame and the linear prediction residual signal of the frame preceding each frame of linear prediction residual signal after the (i+1)-th frame;
    根据第i+1帧之后的每帧线性预测残差信号的前一帧线性预测残差信号、第i+1帧之后的每帧加窗语音信号的前一帧加窗语音信号对应的逆滤波器和第i+1帧之后的每帧滤波语音信号的前一帧滤波语音信号，得到第i+1帧之后的每帧加窗语音信号对应的逆滤波器；Obtaining the inverse filter corresponding to each frame of windowed speech signal after the (i+1)-th frame according to the linear prediction residual signal of the frame preceding each frame of linear prediction residual signal after the (i+1)-th frame, the inverse filter corresponding to the windowed speech signal of the frame preceding each frame of windowed speech signal after the (i+1)-th frame, and the filtered speech signal of the frame preceding each frame of filtered speech signal after the (i+1)-th frame;
    根据第i+1帧加窗语音信号对应的逆滤波器和第i+1帧之后的每帧加窗语音信号对应的逆滤波器，得到第i帧之后的加窗语音信号对应的逆滤波器。Obtaining the inverse filters corresponding to the windowed speech signals after the i-th frame according to the inverse filter corresponding to the (i+1)-th frame of windowed speech signal and the inverse filter corresponding to each frame of windowed speech signal after the (i+1)-th frame.
  5. 根据权利要求1所述的语音处理方法,其中,所述获取混响语音信号,包括:The speech processing method according to claim 1, wherein the acquiring the reverberation speech signal comprises:
    在电子设备处于通话状态时,获取用户的语音信息;Obtain the user's voice information when the electronic device is in a call state;
    检测所述用户的语音信息中是否包括预设关键词;Detecting whether the user's voice information includes preset keywords;
    若所述用户的语音信息中包括预设关键词,则获取混响语音信号。If the user's voice information includes preset keywords, a reverb voice signal is obtained.
  6. 根据权利要求5所述的语音处理方法,其中,所述若所述用户的语音信息中包括预设关键词,则获取混响语音信号,包括:The voice processing method according to claim 5, wherein, if the user's voice information includes a preset keyword, acquiring the reverberation voice signal includes:
    若所述用户的语音信息中包括预设关键词,则生成一次记录并保存所述记录;If the user's voice information includes preset keywords, a record is generated and the record is saved;
    当保存的记录的数量大于预设数量阈值时,获取混响语音信号。When the number of saved records is greater than the preset number threshold, the reverberation voice signal is acquired.
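Claims 5 and 6 gate reverberation capture on repeated keyword hits during a call. A minimal sketch of that trigger logic (the class name, keyword set, and threshold are illustrative assumptions, not from the patent):

```python
class ReverbTrigger:
    """Saves a record per utterance containing a preset keyword; once the
    number of saved records exceeds the threshold, capture is triggered."""

    def __init__(self, keywords, threshold):
        self.keywords = set(keywords)
        self.threshold = threshold
        self.records = []

    def on_utterance(self, text):
        if any(k in text for k in self.keywords):
            self.records.append(text)  # "generate a record and save it"
        return len(self.records) > self.threshold  # acquire reverb signal?

trigger = ReverbTrigger(keywords={"can't hear", "say again"}, threshold=2)
results = [trigger.on_utterance(t) for t in
           ["hello", "say again please", "I can't hear you", "say again?"]]
# results -> [False, False, False, True]
```

Waiting for the count to exceed a threshold, rather than firing on the first keyword, avoids running the relatively expensive dereverberation pipeline on a one-off mishearing.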
  7. 根据权利要求1所述的语音处理方法,其中,所述获取混响语音信号,包括:The speech processing method according to claim 1, wherein the acquiring the reverberation speech signal comprises:
    当电子设备要进行声纹识别或者语音识别时,检测声源与电子设备之间的距离是否大于预设距离阈值;When the electronic device is to perform voiceprint recognition or voice recognition, detect whether the distance between the sound source and the electronic device is greater than a preset distance threshold;
    若所述声源与电子设备之间的距离大于预设距离阈值,则获取混响语音信号。If the distance between the sound source and the electronic device is greater than a preset distance threshold, a reverberation voice signal is acquired.
  8. 一种语音处理装置,包括:A voice processing device, including:
    获取模块,用于获取混响语音信号;Acquisition module for acquiring reverberation voice signals;
    处理模块,用于对所述混响语音信号进行信号保真处理,得到第一语音信号;A processing module, configured to perform signal fidelity processing on the reverberation speech signal to obtain a first speech signal;
    变换模块,用于对所述第一语音信号进行傅里叶变换,得到所述第一语音信号对应的相位谱和第一功率谱;A transformation module, configured to perform Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal;
    计算模块,用于根据所述第一功率谱,计算第二功率谱;A calculation module, configured to calculate a second power spectrum according to the first power spectrum;
    构建模块,用于根据所述第一语音信号对应的相位谱和所述第二功率谱,构建目标干净语音信号。The construction module is configured to construct a target clean speech signal according to the phase spectrum corresponding to the first speech signal and the second power spectrum.
  9. 根据权利要求8所述的语音处理装置,其中,所述获取模块,用于:The voice processing device according to claim 8, wherein the acquisition module is configured to:
    对所述混响语音信号进行加窗分帧处理,得到多帧加窗语音信号;Windowing and framing the reverberation speech signal to obtain a multi-frame windowing speech signal;
    所述处理模块,用于:对每帧加窗语音信号进行信号保真处理,得到多帧保真语音信号,所述多帧保真语音信号构成所述第一语音信号;The processing module is configured to: perform signal fidelity processing on the windowed voice signal of each frame to obtain a multi-frame fidelity voice signal, and the multi-frame fidelity voice signal constitutes the first voice signal;
    所述变换模块，用于：对每帧保真语音信号进行傅里叶变换，得到多帧保真语音信号分别对应的相位谱和功率谱，所述多帧保真语音信号分别对应的相位谱和功率谱构成第一语音信号对应的相位谱和第一功率谱；The transform module is configured to: perform Fourier transform on each frame of fidelity speech signal, to obtain the phase spectra and power spectra respectively corresponding to the multiple frames of fidelity speech signals, wherein these phase spectra and power spectra constitute the phase spectrum and the first power spectrum corresponding to the first speech signal;
    所述计算模块,用于:根据每帧保真语音信号对应的功率谱,计算第三功率谱,得到多个第三功率谱,所述多个第三功率谱构成所述第二功率谱;The calculation module is configured to calculate a third power spectrum according to the power spectrum corresponding to the fidelity speech signal of each frame to obtain a plurality of third power spectrums, and the plurality of third power spectrums constitute the second power spectrum;
    所述构建模块，用于：根据每帧保真语音信号对应的相位谱和所述多个第三功率谱，构建每帧干净语音信号，得到多帧干净语音信号；对所述多帧干净语音信号进行加窗合帧处理，得到目标干净语音信号。The construction module is configured to: construct each frame of clean speech signal according to the phase spectrum corresponding to each frame of fidelity speech signal and the multiple third power spectra, to obtain multiple frames of clean speech signals; and perform windowing and frame-synthesis processing on the multiple frames of clean speech signals, to obtain the target clean speech signal.
  10. 根据权利要求9所述的语音处理装置,其中,所述处理模块,用于:The voice processing device according to claim 9, wherein the processing module is configured to:
    对第i帧及第i帧之后的每帧加窗语音信号进行线性预测分析，得到第i帧及第i帧之后的每帧线性预测残差信号；Performing linear prediction analysis on the windowed speech signal of the i-th frame and of each frame after the i-th frame, to obtain the linear prediction residual signal of the i-th frame and of each frame after the i-th frame;
    采用逆滤波器对第i帧线性预测残差信号进行逆滤波处理,得到第i帧滤波语音信号;The inverse filter is used to inversely filter the linear prediction residual signal of the ith frame to obtain the ith frame filtered speech signal;
    当检测到第i帧滤波语音信号峰度最小时,获取使得第i帧滤波语音信号峰度最小的逆滤波器,得到第i帧加窗语音信号对应的逆滤波器;When the kurtosis of the i-th frame filtered speech signal is detected to be the smallest, an inverse filter that minimizes the kurtosis of the i-th frame filtered speech signal is obtained to obtain an inverse filter corresponding to the i-th windowed speech signal;
    采用所述第i帧加窗语音信号对应的逆滤波器对第i帧加窗语音信号进行逆滤波处理,得到第i帧保真语音信号;Adopting an inverse filter corresponding to the windowed speech signal of the i-th frame to perform inverse filtering processing on the windowed speech signal of the i-th frame to obtain a fidelity speech signal of the i-th frame;
    根据第i帧之后的每帧线性预测残差信号的前一帧线性预测残差信号和第i帧之后的每帧加窗语音信号的前一帧加窗语音信号对应的逆滤波器，得到第i帧之后的每帧加窗语音信号对应的逆滤波器；Obtaining the inverse filter corresponding to each frame of windowed speech signal after the i-th frame according to the linear prediction residual signal of the frame preceding each frame of linear prediction residual signal after the i-th frame and the inverse filter corresponding to the windowed speech signal of the frame preceding each frame of windowed speech signal after the i-th frame;
    根据所述第i帧之后的每帧加窗语音信号对应的逆滤波器和第i帧之后的每帧加窗语音信号,得到第i帧之后的多帧保真语音信号;Obtaining a multi-frame fidelity speech signal after the i-th frame according to the inverse filter corresponding to the windowed speech signal for each frame after the i-th frame and the windowed speech signal for each frame after the i-th frame;
    结合第i帧保真语音信号与第i帧之后的多帧保真语音信号,得到多帧保真语音信号。Combining the i-th frame fidelity speech signal and the multi-frame fidelity speech signal after the i-th frame, a multi-frame fidelity speech signal is obtained.
  11. 根据权利要求10所述的语音处理装置,其中,所述处理模块,用于:The voice processing device according to claim 10, wherein the processing module is configured to:
    当检测到第i帧滤波语音信号峰度最小时,获取峰度最小的第i帧滤波语音信号;When it is detected that the i-th frame filtered speech signal has the smallest kurtosis, the i-th frame filtered speech signal with the smallest kurtosis is acquired;
    根据第i帧线性预测残差信号、第i帧加窗语音信号对应的逆滤波器和峰度最小的第i帧滤波语音信号,得到第i+1帧加窗语音信号对应的逆滤波器;Obtain the inverse filter corresponding to the windowed voice signal in frame i+1 according to the linear prediction residual signal in the i frame, the inverse filter corresponding to the windowed voice signal in the i frame and the i-frame filtered voice signal with the least kurtosis;
    根据第i+1帧之后的每帧加窗语音信号的前一帧加窗语音信号对应的逆滤波器和第i+1帧之后的每帧线性预测残差信号的前一帧线性预测残差信号，得到第i+1帧之后的每帧滤波语音信号的前一帧滤波语音信号；Obtaining the filtered speech signal of the frame preceding each frame of filtered speech signal after the (i+1)-th frame according to the inverse filter corresponding to the windowed speech signal of the frame preceding each frame of windowed speech signal after the (i+1)-th frame and the linear prediction residual signal of the frame preceding each frame of linear prediction residual signal after the (i+1)-th frame;
    根据第i+1帧之后的每帧线性预测残差信号的前一帧线性预测残差信号、第i+1帧之后的每帧加窗语音信号的前一帧加窗语音信号对应的逆滤波器和第i+1帧之后的每帧滤波语音信号的前一帧滤波语音信号，得到第i+1帧之后的每帧加窗语音信号对应的逆滤波器；Obtaining the inverse filter corresponding to each frame of windowed speech signal after the (i+1)-th frame according to the linear prediction residual signal of the frame preceding each frame of linear prediction residual signal after the (i+1)-th frame, the inverse filter corresponding to the windowed speech signal of the frame preceding each frame of windowed speech signal after the (i+1)-th frame, and the filtered speech signal of the frame preceding each frame of filtered speech signal after the (i+1)-th frame;
    根据第i+1帧加窗语音信号对应的逆滤波器和第i+1帧之后的每帧加窗语音信号对应的逆滤波器，得到第i帧之后的加窗语音信号对应的逆滤波器。Obtaining the inverse filters corresponding to the windowed speech signals after the i-th frame according to the inverse filter corresponding to the (i+1)-th frame of windowed speech signal and the inverse filter corresponding to each frame of windowed speech signal after the (i+1)-th frame.
  12. 根据权利要求8所述的语音处理装置,其中,所述获取模块,用于:The voice processing device according to claim 8, wherein the acquisition module is configured to:
    在电子设备处于通话状态时,获取用户的语音信息;Obtain the user's voice information when the electronic device is in a call state;
    检测所述用户的语音信息中是否包括预设关键词;Detecting whether the user's voice information includes preset keywords;
    若所述用户的语音信息中包括预设关键词,则获取混响语音信号。If the user's voice information includes preset keywords, a reverb voice signal is obtained.
  13. 一种存储介质,其中,所述存储介质中存储有计算机程序,当所述计算机程序在计算机上运行时,使得所述计算机执行权利要求1至7任一项所述的语音处理方法。A storage medium, wherein a computer program is stored in the storage medium, and when the computer program runs on a computer, the computer is caused to execute the voice processing method according to any one of claims 1 to 7.
  14. 一种电子设备,其中,所述电子设备包括处理器和存储器,所述存储器中存储有计算机程序,所述处理器通过调用所述存储器中存储的所述计算机程序,用于执行:An electronic device, wherein the electronic device includes a processor and a memory, a computer program is stored in the memory, and the processor is used to execute the computer program by calling the computer program stored in the memory:
    获取混响语音信号;Obtain the reverberation voice signal;
    对所述混响语音信号进行信号保真处理,得到第一语音信号;Performing signal fidelity processing on the reverberation speech signal to obtain a first speech signal;
    对所述第一语音信号进行傅里叶变换,得到所述第一语音信号对应的相位谱和第一功率谱;Performing a Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal;
    根据所述第一功率谱,计算第二功率谱;Calculating a second power spectrum according to the first power spectrum;
    根据所述第一语音信号对应的相位谱和所述第二功率谱,构建目标干净语音信号。According to the phase spectrum corresponding to the first speech signal and the second power spectrum, a target clean speech signal is constructed.
  15. 根据权利要求14所述的电子设备,其中,所述处理器用于执行:The electronic device according to claim 14, wherein the processor is configured to execute:
    对所述混响语音信号进行加窗分帧处理,得到多帧加窗语音信号;Windowing and framing the reverberation speech signal to obtain a multi-frame windowing speech signal;
    对每帧加窗语音信号进行信号保真处理，得到多帧保真语音信号，所述多帧保真语音信号构成所述第一语音信号；Performing signal fidelity processing on the windowed speech signal of each frame to obtain a multi-frame fidelity speech signal, the multi-frame fidelity speech signal constituting the first speech signal;
    对每帧保真语音信号进行傅里叶变换，得到多帧保真语音信号分别对应的相位谱和功率谱，所述多帧保真语音信号分别对应的相位谱和功率谱构成第一语音信号对应的相位谱和第一功率谱；Performing Fourier transform on each frame of fidelity speech signal, to obtain the phase spectra and power spectra respectively corresponding to the multiple frames of fidelity speech signals, the phase spectra and power spectra respectively corresponding to the multiple frames of fidelity speech signals constituting the phase spectrum and the first power spectrum corresponding to the first speech signal;
    根据每帧保真语音信号对应的功率谱,计算第三功率谱,得到多个第三功率谱,所述多个第三功率谱构成所述第二功率谱;Calculating a third power spectrum according to the power spectrum corresponding to the fidelity speech signal of each frame to obtain a plurality of third power spectrums, the plurality of third power spectrums constituting the second power spectrum;
    根据每帧保真语音信号对应的相位谱和所述多个第三功率谱,构建每帧干净语音信号,得到多帧干净语音信号;Construct a clean voice signal for each frame according to the phase spectrum corresponding to each frame of the fidelity voice signal and the multiple third power spectrums to obtain multiple frames of clean voice signals;
    对所述多帧干净语音信号进行加窗合帧处理,得到目标干净语音信号。Windowing and framing the multi-frame clean speech signal to obtain the target clean speech signal.
  16. 根据权利要求15所述的电子设备,其中,所述处理器用于执行:The electronic device according to claim 15, wherein the processor is configured to execute:
    对第i帧及第i帧之后的每帧加窗语音信号进行线性预测分析，得到第i帧及第i帧之后的每帧线性预测残差信号；Performing linear prediction analysis on the windowed speech signal of the i-th frame and of each frame after the i-th frame, to obtain the linear prediction residual signal of the i-th frame and of each frame after the i-th frame;
    采用逆滤波器对第i帧线性预测残差信号进行逆滤波处理,得到第i帧滤波语音信号;The inverse filter is used to inversely filter the linear prediction residual signal of the ith frame to obtain the ith frame filtered speech signal;
    当检测到第i帧滤波语音信号峰度最小时,获取使得第i帧滤波语音信号峰度最小的逆滤波器,得到第i帧加窗语音信号对应的逆滤波器;When the kurtosis of the i-th frame filtered speech signal is detected to be the smallest, an inverse filter that minimizes the kurtosis of the i-th frame filtered speech signal is obtained to obtain an inverse filter corresponding to the i-th windowed speech signal;
    采用所述第i帧加窗语音信号对应的逆滤波器对第i帧加窗语音信号进行逆滤波处理,得到第i帧保真语音信号;Adopting an inverse filter corresponding to the windowed speech signal of the i-th frame to perform inverse filtering processing on the windowed speech signal of the i-th frame to obtain a fidelity speech signal of the i-th frame;
    根据第i帧之后的每帧线性预测残差信号的前一帧线性预测残差信号和第i帧之后的每帧加窗语音信号的前一帧加窗语音信号对应的逆滤波器，得到第i帧之后的每帧加窗语音信号对应的逆滤波器；Obtaining the inverse filter corresponding to each frame of windowed speech signal after the i-th frame according to the linear prediction residual signal of the frame preceding each frame of linear prediction residual signal after the i-th frame and the inverse filter corresponding to the windowed speech signal of the frame preceding each frame of windowed speech signal after the i-th frame;
    根据所述第i帧之后的每帧加窗语音信号对应的逆滤波器和第i帧之后的每帧加窗语音信号,得到第i帧之后的多帧保真语音信号;Obtaining a multi-frame fidelity speech signal after the i-th frame according to the inverse filter corresponding to the windowed speech signal for each frame after the i-th frame and the windowed speech signal for each frame after the i-th frame;
    结合第i帧保真语音信号与第i帧之后的多帧保真语音信号,得到多帧保真语音信号。Combining the i-th frame fidelity speech signal and the multi-frame fidelity speech signal after the i-th frame, a multi-frame fidelity speech signal is obtained.
  17. 根据权利要求16所述的电子设备,其中,所述处理器用于执行:The electronic device according to claim 16, wherein the processor is configured to execute:
    当检测到第i帧滤波语音信号峰度最小时,获取峰度最小的第i帧滤波语音信号;When it is detected that the i-th frame filtered speech signal has the smallest kurtosis, the i-th frame filtered speech signal with the smallest kurtosis is acquired;
    根据第i帧线性预测残差信号、第i帧加窗语音信号对应的逆滤波器和峰度最小的第i帧滤波语音信号,得到第i+1帧加窗语音信号对应的逆滤波器;Obtain the inverse filter corresponding to the windowed voice signal in frame i+1 according to the linear prediction residual signal in the i frame, the inverse filter corresponding to the windowed voice signal in the i frame and the i-frame filtered voice signal with the least kurtosis;
    根据第i+1帧之后的每帧加窗语音信号的前一帧加窗语音信号对应的逆滤波器和第i+1帧之后的每帧线性预测残差信号的前一帧线性预测残差信号，得到第i+1帧之后的每帧滤波语音信号的前一帧滤波语音信号；Obtaining the filtered speech signal of the frame preceding each frame of filtered speech signal after the (i+1)-th frame according to the inverse filter corresponding to the windowed speech signal of the frame preceding each frame of windowed speech signal after the (i+1)-th frame and the linear prediction residual signal of the frame preceding each frame of linear prediction residual signal after the (i+1)-th frame;
    根据第i+1帧之后的每帧线性预测残差信号的前一帧线性预测残差信号、第i+1帧之后的每帧加窗语音信号的前一帧加窗语音信号对应的逆滤波器和第i+1帧之后的每帧滤波语音信号的前一帧滤波语音信号，得到第i+1帧之后的每帧加窗语音信号对应的逆滤波器；Obtaining the inverse filter corresponding to each frame of windowed speech signal after the (i+1)-th frame according to the linear prediction residual signal of the frame preceding each frame of linear prediction residual signal after the (i+1)-th frame, the inverse filter corresponding to the windowed speech signal of the frame preceding each frame of windowed speech signal after the (i+1)-th frame, and the filtered speech signal of the frame preceding each frame of filtered speech signal after the (i+1)-th frame;
    根据第i+1帧加窗语音信号对应的逆滤波器和第i+1帧之后的每帧加窗语音信号对应的逆滤波器，得到第i帧之后的加窗语音信号对应的逆滤波器。Obtaining the inverse filters corresponding to the windowed speech signals after the i-th frame according to the inverse filter corresponding to the (i+1)-th frame of windowed speech signal and the inverse filter corresponding to each frame of windowed speech signal after the (i+1)-th frame.
  18. 根据权利要求14所述的电子设备,其中,所述处理器用于执行:The electronic device according to claim 14, wherein the processor is configured to execute:
    在电子设备处于通话状态时,获取用户的语音信息;Obtain the user's voice information when the electronic device is in a call state;
    检测所述用户的语音信息中是否包括预设关键词;Detecting whether the user's voice information includes preset keywords;
    若所述用户的语音信息中包括预设关键词,则获取混响语音信号。If the user's voice information includes preset keywords, a reverb voice signal is obtained.
  19. 根据权利要求18所述的电子设备,其中,所述处理器用于执行:The electronic device according to claim 18, wherein the processor is configured to execute:
    若所述用户的语音信息中包括预设关键词,则生成一次记录并保存所述记录;If the user's voice information includes preset keywords, a record is generated and the record is saved;
    当保存的记录的数量大于预设数量阈值时,获取混响语音信号。When the number of saved records is greater than the preset number threshold, the reverberation voice signal is acquired.
  20. 根据权利要求14所述的电子设备,其中,所述处理器用于执行:The electronic device according to claim 14, wherein the processor is configured to execute:
    当电子设备要进行声纹识别或者语音识别时，检测声源与电子设备之间的距离是否大于预设距离阈值；When the electronic device is to perform voiceprint recognition or speech recognition, detecting whether the distance between the sound source and the electronic device is greater than a preset distance threshold;
    若所述声源与电子设备之间的距离大于预设距离阈值,则获取混响语音信号。If the distance between the sound source and the electronic device is greater than a preset distance threshold, a reverberation voice signal is acquired.
PCT/CN2018/118713 2018-11-30 2018-11-30 Voice processing method and apparatus, storage medium, and electronic device WO2020107455A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201880098277.9A CN112997249B (en) 2018-11-30 2018-11-30 Voice processing method, device, storage medium and electronic equipment
PCT/CN2018/118713 WO2020107455A1 (en) 2018-11-30 2018-11-30 Voice processing method and apparatus, storage medium, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/118713 WO2020107455A1 (en) 2018-11-30 2018-11-30 Voice processing method and apparatus, storage medium, and electronic device

Publications (1)

Publication Number Publication Date
WO2020107455A1

Family

ID=70854469

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/118713 WO2020107455A1 (en) 2018-11-30 2018-11-30 Voice processing method and apparatus, storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN112997249B (en)
WO (1) WO2020107455A1 (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750956A (en) * 2012-06-18 2012-10-24 歌尔声学股份有限公司 Method and device for removing reverberation of single channel voice
CN106340302A (en) * 2015-07-10 2017-01-18 深圳市潮流网络技术有限公司 De-reverberation method and device for speech data
WO2017160294A1 (en) * 2016-03-17 2017-09-21 Nuance Communications, Inc. Spectral estimation of room acoustic parameters
CN108198568A (en) * 2017-12-26 2018-06-22 太原理工大学 A kind of method and system of more auditory localizations

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101315772A (en) * 2008-07-17 2008-12-03 上海交通大学 Speech reverberation eliminating method based on Wiener filtering
JP5172536B2 (en) * 2008-08-22 2013-03-27 日本電信電話株式会社 Reverberation removal apparatus, dereverberation method, computer program, and recording medium
JP5815614B2 (en) * 2013-08-13 2015-11-17 日本電信電話株式会社 Reverberation suppression apparatus and method, program, and recording medium
CN107393550B (en) * 2017-07-14 2021-03-19 深圳永顺智信息科技有限公司 Voice processing method and device
CN108735213B (en) * 2018-05-29 2020-06-16 太原理工大学 Voice enhancement method and system based on phase compensation


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113436613A (en) * 2021-06-30 2021-09-24 Oppo广东移动通信有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113724692A (en) * 2021-10-08 2021-11-30 广东电力信息科技有限公司 Voice print feature-based phone scene audio acquisition and anti-interference processing method
CN113724692B (en) * 2021-10-08 2023-07-14 广东电力信息科技有限公司 Telephone scene audio acquisition and anti-interference processing method based on voiceprint features

Also Published As

Publication number Publication date
CN112997249A (en) 2021-06-18
CN112997249B (en) 2022-06-14

Similar Documents

Publication Publication Date Title
Li et al. On the importance of power compression and phase estimation in monaural speech dereverberation
US9721583B2 (en) Integrated sensor-array processor
WO2016180100A1 (en) Method and device for improving audio processing performance
CN108447496B (en) Speech enhancement method and device based on microphone array
US10622004B1 (en) Acoustic echo cancellation using loudspeaker position
US11587575B2 (en) Hybrid noise suppression
US10755728B1 (en) Multichannel noise cancellation using frequency domain spectrum masking
US11349525B2 (en) Double talk detection method, double talk detection apparatus and echo cancellation system
WO2020097828A1 (en) Echo cancellation method, delay estimation method, echo cancellation apparatus, delay estimation apparatus, storage medium, and device
JP6225245B2 (en) Signal processing apparatus, method and program
CN112489670B (en) Time delay estimation method, device, terminal equipment and computer readable storage medium
US10839820B2 (en) Voice processing method, apparatus, device and storage medium
US20170309292A1 (en) Integrated sensor-array processor
US20240177726A1 (en) Speech enhancement
WO2020107455A1 (en) Voice processing method and apparatus, storage medium, and electronic device
CN109215672B (en) Method, device and equipment for processing sound information
US11380312B1 (en) Residual echo suppression for keyword detection
KR20200128687A (en) Howling suppression method, device and electronic equipment
CN112802490B (en) Beam forming method and device based on microphone array
WO2024041512A1 (en) Audio noise reduction method and apparatus, and electronic device and readable storage medium
WO2024017110A1 (en) Voice noise reduction method, model training method, apparatus, device, medium, and product
CN113160846A (en) Noise suppression method and electronic device
CN107346658B (en) Reverberation suppression method and device
WO2023287782A1 (en) Data augmentation for speech enhancement
CN113205824B (en) Sound signal processing method, device, storage medium, chip and related equipment

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 18941152; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
32PN EP: public notification in the EP bulletin as address of the addressee cannot be established
    Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03.11.2021)
122 EP: PCT application non-entry in European phase
    Ref document number: 18941152; Country of ref document: EP; Kind code of ref document: A1