WO2020107455A1

WO2020107455A1 - Voice processing method and apparatus, storage medium, and electronic device

Info

Publication number: WO2020107455A1
Application number: PCT/CN2018/118713
Authority: WO
Inventors: 陈岩
Original assignee: 深圳市欢太科技有限公司; Oppo广东移动通信有限公司
Priority date: 2018-11-30
Filing date: 2018-11-30
Publication date: 2020-06-04
Also published as: CN112997249A; CN112997249B

Abstract

A voice processing method and apparatus, a storage medium, and an electronic device. The method comprises: obtaining a reverberation voice signal (101); performing signal fidelity processing on the reverberation voice signal to obtain a first voice signal (102); performing Fourier transformation on the first voice signal to obtain a phase spectrum corresponding to the first voice signal and a first power spectrum (103); calculating a second power spectrum according to the first power spectrum (104); and constructing a target clean voice signal according to the phase spectrum and the second power spectrum (105).

Description

Voice processing method, device, storage medium and electronic equipment

Technical field

The present application belongs to the technical field of electronic equipment, and particularly relates to a voice processing method, device, storage medium, and electronic equipment.

Background technique

When using a microphone to collect voice signals indoors, if the sound source is far away from the microphone, there will be reverberation. Excessive reverberation will seriously affect the intelligibility and intelligibility of speech. Thus affecting the quality of the call, the recognition of voice and harmonization. At present, most commonly used reverberation cancellation algorithms mostly process the reverberation speech signal directly to obtain the dereverberation speech signal. However, the dereverberation speech signal obtained by this reverberation cancellation algorithm has low clarity.

Summary of the invention

Embodiments of the present application provide a voice processing method, device, storage medium, and electronic equipment, which can construct a clearer and clean voice signal.

In a first aspect, an embodiment of the present application provides a voice processing method, including:

Obtain the reverberation voice signal;

Performing signal fidelity processing on the reverberation speech signal to obtain a first speech signal;

Performing a Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal;

Calculating a second power spectrum according to the first power spectrum;

According to the phase spectrum corresponding to the first speech signal and the second power spectrum, a target clean speech signal is constructed.

In a second aspect, an embodiment of the present application provides a voice processing device, including:

Acquisition module for acquiring reverberation voice signals;

A processing module, configured to perform signal fidelity processing on the reverberation speech signal to obtain a first speech signal;

A transformation module, configured to perform Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal;

A calculation module, configured to calculate a second power spectrum according to the first power spectrum;

The construction module is configured to construct a target clean speech signal according to the phase spectrum corresponding to the first speech signal and the second power spectrum.

In a third aspect, an embodiment of the present application provides a storage medium on which a computer program is stored, wherein, when the computer program is executed on a computer, the computer is caused to execute the voice processing method provided in this embodiment.

According to a fourth aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, where a computer program is stored in the memory, and the processor is used to execute the computer program by calling the computer program stored in the memory:

Obtain the reverberation voice signal;

Calculating a second power spectrum according to the first power spectrum;

In the embodiment of the present application, the clean speech signal is not directly obtained from the reverberation speech signal, but the fidelity speech signal is processed first to obtain the first speech signal, and then according to the first power spectrum corresponding to the first speech signal Calculate the second power spectrum, and then construct a clean speech signal according to the phase spectrum and the second power spectrum corresponding to the first speech signal. In the embodiment of the present application, the first speech signal is obtained by performing fidelity processing on the reverberation speech signal, and then the first speech signal is processed, so that a clean speech signal with higher definition can be constructed.

BRIEF DESCRIPTION

The technical solutions and beneficial effects of the present application will be apparent through the detailed description of the specific implementation of the present application in conjunction with the accompanying drawings.

FIG. 1 is a first schematic flowchart of a voice processing method provided by an embodiment of the present application.

2 is a second schematic flowchart of a voice processing method provided by an embodiment of the present application.

3 is a third schematic flowchart of a voice processing method provided by an embodiment of the present application.

FIG. 4 is a fourth schematic flowchart of a voice processing method provided by an embodiment of the present application.

5 is a schematic structural diagram of a voice processing device provided by an embodiment of the present application.

6 is a first schematic structural diagram of an electronic device provided by an embodiment of the present application.

7 is a second schematic structural diagram of an electronic device provided by an embodiment of the present application.

detailed description

Please refer to the illustration, where the same component symbol represents the same component. The principle of the present application is illustrated by implementation in an appropriate computing environment. The following description is based on the illustrated specific embodiments of the present application, which should not be considered as limiting other specific embodiments not detailed herein.

Please refer to FIG. 1, which is a first schematic flowchart of a voice processing method provided by an embodiment of the present application. The flow of the voice processing method may include:

For example, in the case of sound signal collection or recording, in addition to receiving the direct sound signal directly from the sound source to emit sound waves, the microphone also receives other sound signals from the sound source and arriving through other channels. , And unwanted sound waves (ie background noise) generated by other sound sources in the environment. Acoustically, the reflected wave with a delay time of more than about 50ms is called echo, and the effect of the remaining reflected waves is called reverb. The phenomenon of reverberation will greatly reduce the intelligibility of the sound, thereby affecting the call quality, voice and voiceprint recognition rate. In this case, how to reduce the reverberation of the voice signal collected by the microphone is particularly important.

In 101, a reverberation speech signal is obtained.

It can be understood that the signal received by the microphone end is easily affected by environmental reverberation. For example, in a room, voice is radiated multiple times through walls, floors, and furniture. The signal received at the microphone is a mixed signal of direct sound and reflected sound, which is a reverberated voice signal. This part of the reflected sound is the reverb signal, and the direct sound is the clean voice signal. The reverb signal will be delayed relative to the clean voice signal. When the speaker is far away from the microphone and the call environment is a relatively closed space, it is easy to produce reverberation. When the reverberation is serious, the voice will be unclear and affect the call quality. In addition, the interference caused by the reverberation phenomenon will also cause the performance of the acoustic receiving system to deteriorate, and the performance of the voice recognition and voiceprint recognition system will be significantly reduced.

In this embodiment, the electronic device acquires the reverberation speech signal.

In 102, signal fidelity processing is performed on the reverberation speech signal to obtain a first speech signal.

For example, when the microphone is collecting reverberant speech signals, there will be some distortion. If the reverberation speech signal is directly dereverberated, the clarity of the final clean speech signal (the speech signal after dereverberation) may not be high enough. Therefore, in this embodiment, the electronic device performs signal fidelity processing on the reverberation speech signal to obtain the first speech signal, so as to reduce the distortion rate of the signal.

In 103, Fourier transform is performed on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal.

The electronic device performs Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal.

At 104, a second power spectrum is calculated based on the first power spectrum.

For example, the electronic device may calculate the second power spectrum according to the first power spectrum.

In 105, the target clean speech signal is constructed according to the phase spectrum and the second power spectrum corresponding to the first speech signal.

For example, the electronic device constructs a target clean speech signal according to the phase spectrum and the second power spectrum corresponding to the first speech signal.

It can be understood that, in this embodiment, the clean speech signal is not directly obtained from the reverberation speech signal, but the fidelity of the reverberation speech signal is first processed to obtain the first speech signal, and then according to the corresponding The first power spectrum calculates the second power spectrum, and then constructs a clean voice signal according to the phase spectrum and the second power spectrum corresponding to the first voice signal. In the embodiment of the present application, the first speech signal is obtained by performing fidelity processing on the reverberation speech signal, and then the first speech signal is processed, so that a clean speech signal with higher definition can be constructed.

Please refer to FIG. 2, which is a second schematic flowchart of a voice processing method provided by an embodiment of the present application. The voice processing method may include:

In 201, the electronic device acquires a reverberation speech signal.

In this embodiment, the electronic device may use a microphone to collect the reverberation voice signal to obtain the reverberation voice signal.

In 202, the electronic device performs windowing and framing processing on the reverberation speech signal to obtain a multi-frame windowed speech signal.

For example, after acquiring the reverberation speech signal, the electronic device may perform windowing and framing processing on the reverberation speech signal to obtain a multi-frame windowing speech signal. Among them, the electronic device may take a frame length of 20 ms and a frame shift of 10 ms to frame the reverberation voice signal. When the electronic device adds a window to the reverberant speech signal, it has priority and is not limited, and the window function can select a rectangular window, that is, w(n)=1.

For example, the windowed speech signal y(i) of length L can be expressed as: y(i)=[y(i-L+1),...y(i-1),y(i)], where, i Indicates the number of frames.

It should be noted that, in this embodiment, the multi-frame windowing signal obtained by the electronic device does not include at least the first frame windowing signal. For example, the electronic device performs windowing processing on the reverberant speech signal to obtain 8 frames of windowed speech signal, and the electronic device can obtain the windowed speech signal of the next 7 frames, that is, the windowed speech signal from frame 2 to frame 7, frame 2 The windowed speech signal up to the eighth frame is the multi-frame windowed speech signal obtained by the electronic device; the electronic device can also obtain the windowed speech signal of the next six frames, ie the windowed speech signal from the third frame to the eighth frame, the third frame The windowed speech signal up to the eighth frame is the multi-frame windowed speech signal obtained by the electronic device. The specific acquisition of windowed speech signals in the next few frames is determined according to the actual situation, and no specific restrictions are made here. In addition, how to perform windowing and framing of the reverberant speech signal is not limited to the above-mentioned methods, and may also be other methods, which are not specifically limited here.

In 203, the electronic device performs signal fidelity processing on the windowed voice signal of each frame to obtain a multi-frame fidelity voice signal.

For example, after obtaining multiple frames of windowing signals, the electronic device may perform signal fidelity processing on each frame of the windowing signals to obtain multiple frames of fidelity voice signals. Wherein, the multi-frame fidelity speech signal constitutes the first voice signal, and the fidelity speech signal may be expressed as z(i), and i represents the number of frames.

For example, the electronic device obtains 5 frames of windowed speech signals, namely the 4th framed windowed speech signal y(4), the 5th framed windowed speech signal y(5), the 6th framed windowed speech signal y(6), The windowed speech signal y(7) in the 7th frame and the windowed speech signal y(8) in the 8th frame. Then, the electronic device performs signal fidelity processing on these 5 frames of windowed voice signals to obtain 5 frames of fidelity voice signals, which are the fourth frame fidelity voice signal z(4) and the fifth frame fidelity voice signal z(5 ), the sixth frame fidelity speech signal z(6), the seventh frame fidelity speech signal z(7) and the eighth frame fidelity speech signal z(8).

It can be understood that the fidelity processing of the signal can reduce the distortion rate of the signal.

In 204, the electronic device performs Fourier transform on each frame of the fidelity speech signal to obtain a phase spectrum and a power spectrum corresponding to the multi-frame fidelity speech signal, respectively.

For example, after obtaining multiple frames of fidelity signals, the electronic device may perform Fourier transform on each frame of fidelity signals, thereby obtaining phase spectra and power spectra corresponding to the multi-frame fidelity speech signals, respectively. Wherein, the phase spectrum and the power spectrum corresponding to the multi-frame fidelity speech signal respectively constitute the phase spectrum and the first power spectrum corresponding to the first speech signal.

For example, suppose the electronic device obtains 5 frames of fidelity speech signals, which are the fourth frame of fidelity speech signal z(4), the fifth frame of fidelity speech signal z(5), and the sixth frame of fidelity speech signal z(6) , Frame 7 fidelity speech signal z(7) and Frame 8 fidelity speech signal z(8). Then the electronic device performs a Fourier transform on z(4), that is, FFT[z(4)]=Z(4), and the phase spectrum arg[Z(4)] and z(4) corresponding to z(4) can be obtained. The corresponding power spectrum |Z(4)| ² . The electronic device performs a Fourier transform on z(5), that is, FFT[z(5)]=Z(5), and the phase spectrum arg[Z(5)] corresponding to z(5) and z(5) are obtained. Power spectrum |Z(5)| ² . The electronic device performs a Fourier transform on z(6), that is, FFT[z(6)]=Z(6), and the phase spectrum arg[Z(6)] corresponding to z(6) and z(6) can be obtained. Power spectrum |Z(6)| ² . The electronic device performs a Fourier transform on z(7), that is, FFT[z(7)]=Z(7), and the phase spectrum arg[Z(7)] corresponding to z(7) and z(7) are obtained. Power spectrum |Z(7)| ² . The electronic device performs a Fourier transform on z(8), that is, FFT[z(8)]=Z(6), and the phase spectrum arg[Z(8)] corresponding to z(8) and z(8) are obtained. Power spectrum |Z(8)| ² .

Among them, FFT (Fast Fourier Transformation) is a fast algorithm of discrete Fourier transform (DFT). This is the Fast Fourier Transform. It is obtained by improving the discrete Fourier transform algorithm according to the odd, even, virtual, real and other characteristics of the discrete Fourier transform.

It can be understood that the phase spectrum corresponding to the first speech signal includes: arg[Z(4)], arg[Z(5)], arg[Z(6), arg[Z(7)], arg[Z(8) ]. The first power spectrum corresponding to the first speech signal includes: |Z(4)| ² , |Z(5)| ² , |Z(6)| ² , |Z(7)| ² , |Z(8)| ² .

In 205, the electronic device calculates a third power spectrum according to the power spectrum corresponding to the fidelity speech signal of each frame to obtain a plurality of third power spectrums.

For example, after obtaining power spectra corresponding to multiple frames of fidelity speech signals, the electronic device may calculate a third power spectrum according to the power spectrum corresponding to each frame of fidelity speech signals to obtain multiple third power spectra. Among them, a plurality of third power spectrums constitute a second power spectrum.

For example, the electronic device can use the following formula to calculate the third power spectrum:

Among them, |Z(i)| ² represents the power spectrum corresponding to the i-th frame fidelity speech signal, and |X(i)| ² represents the i-th third power spectrum, which is based on the power corresponding to the i-th frame fidelity speech signal The spectrum is determined, ρ is the number of reverberation time translation frames, γ is a gain value, and ε represents the value of the direct sound signal attenuation by a certain decibel

i>-a, i>-a, ω(i) is a smoothing function, a is used to control the width of the smoothing function, i represents the number of frames.

The values of ρ, γ, ε, and a may be: ρ=7, γ=0.32, ε=0.01, and a=5. Among them, ρ=7 means that the frame shifts by 7 frames, that is, assuming that the reverberation time is around 50ms, and the window shifts by 8ms, it needs to shift by 7 frames. γ=0.32 means that the gain value is 0.32. ε=0.01 means that the direct sound signal is attenuated by 30dB. a=5 means the width of the smoothing function is 5. It should be noted that ρ=7, γ=0.32, ε=0.01, a=5 are only an example of this embodiment, and are not used to limit this application. In the actual application process, ρ, γ, ε, a The values are not limited to the examples in this embodiment, and the values of ρ, γ, ε, and a can be determined according to the actual situation, which is not specifically limited here.

For example, suppose the electronic device obtains the power spectrum corresponding to the 5 frames of fidelity speech signals as follows: the power spectrum corresponding to the 4th frame fidelity speech signal |Z(4)| ² , the power spectrum corresponding to the 5th frame fidelity speech signal| Z(5)| ² , the power spectrum corresponding to the 6th frame fidelity speech signal|Z(6)| ² , the power spectrum corresponding to the 7th frame fidelity speech signal|Z(7)| ² , the 8th frame fidelity The power spectrum corresponding to the voice signal |Z(8)| ² . Then electronic devices can substitute |Z(4)| ² into the formula

In the calculation, the fourth third power spectrum |X(4)| ^{2 is obtained} , that is

Similarly, the electronic device can obtain the fifth third power spectrum |X(5)| ² , the third sixth power spectrum |X(6)| ² , the seventh third power spectrum |X(7 )| ² , the eighth third power spectrum |X(4)| ² .

It can be understood that the first power spectrum includes: |X(4)| ² , |X(5)| ² , |X(6)| ² , |X(7)| ² , |X(8)| ² These 5 Third power spectrum.

In 206, the electronic device constructs a clean speech signal for each frame according to the phase spectrum corresponding to each frame of the fidelity speech signal and a plurality of third power spectra, to obtain multiple frames of clean speech signals.

For example, suppose that the electronic device obtains a phase spectrum corresponding to 5 frames of fidelity speech signals and 5 third power spectra, and the corresponding phase spectra of 5 frames of fidelity speech signals are: arg[Z(4)], arg[Z(5 )], arg[Z(6), arg[Z(7)], arg[Z(8)]. The five third power spectra are: |X(4)| ² , |X(5)| ² , |X(6)| ² , |X(7)| ² , |X(8)| ² . Then, the electronic device can construct the clean voice signal of the first frame according to arg[Z(4)] and |X(4)| ² . The electronic device can construct the second frame of clean speech signal according to arg[Z(5)] and |Z(5)| ² . The electronic device can construct the third frame of clean speech signal according to arg[Z(6)] and |Z(6)| ² . The electronic device can construct the fourth frame of clean speech signal according to arg[Z(7)] and |Z(7)| ² . The electronic device can construct the fifth frame of clean speech signal according to arg[Z(8)] and |Z(8)| ² . Therefore, the electronic device can construct a total of 5 frames of clean voice signals.

In 207, the electronic device performs windowing and framing processing on multiple frames of clean voice signals to obtain target clean voice signals.

For example, assuming that the electronic device constructs 5 frames of clean voice signals, the electronic device may perform windowing and framing processing on these 5 frames of clean voice signals in chronological order to obtain a target clean voice signal. Among them, there may be a certain overlap between clean speech signals in adjacent frames to construct a clearer target clean speech signal.

As shown in FIG. 3, in some embodiments, the process 203 may include the following process:

2031. The electronic device performs linear prediction and analysis on the windowed speech signal of the i-th frame and each frame after the i-th frame to obtain a linear prediction residual signal of the i-th frame and each frame after the i-th frame.

It can be understood that since the linear prediction and analysis of the current frame requires the data of the first few frames of the current frame, the electronic device does not perform linear prediction and analysis from the first frame. Therefore, in this embodiment, the electronic device obtains The multi-frame windowing signal of at least does not include the first frame windowing signal. For example, the electronic device performs windowing processing on the reverberant speech signal to obtain 8 frames of windowed speech signal, and the electronic device can obtain the windowed speech signal of the next 7 frames, that is, the windowed speech signal from frame 2 to frame 7, frame 2 The windowed speech signal up to the eighth frame is the multi-frame windowed speech signal obtained by the electronic device; the electronic device can also obtain the windowed speech signal of the next six frames, ie the windowed speech signal from the third frame to the eighth frame, the third frame The windowed speech signal up to the eighth frame is the multi-frame windowed speech signal obtained by the electronic device. The specific acquisition of windowed speech signals in the next few frames is determined according to the actual situation, and no specific restrictions are made here.

For example, the electronic device obtains 5 frames of windowed speech signals, namely the 4th framed windowed speech signal y(4), the 5th framed windowed speech signal y(5), the 6th framed windowed speech signal y(6), The windowed speech signal y(7) in the 7th frame and the windowed speech signal y(8) in the 8th frame. Therefore, starting from the fourth frame, the electronic device performs linear prediction analysis on the windowed speech signals from the fourth frame to the eighth frame to obtain the linear prediction residual signal w(4) of the fourth frame and the linear prediction residual signal of the fifth frame w(5), the sixth frame linear prediction residual signal w(6), the seventh frame linear prediction residual signal w(7), the eighth frame linear prediction residual signal w(8). Among them, the linear prediction residual signal can be expressed as: w(i), i is the number of frames.

2032. The electronic device uses an inverse filter to perform inverse filtering on the linear prediction residual signal of the i-th frame to obtain a filtered speech signal of the i-th frame.

For example, the electronic device may use an inverse filter of length L to perform inverse filtering processing on the linear prediction residual signal of the i-th frame to obtain a filtered speech signal of the i-th frame.

Among them, the inverse filter of length L can be expressed as: g(i)=[g(1), g(2),...g(L)]. The value of L can be determined according to the actual situation, and no specific restrictions are made here.

2033. When the electronic device detects that the kurtosis of the i-th frame filtered speech signal is minimum, the electronic device obtains an inverse filter that minimizes the kurtosis of the i-th frame filtered speech signal to obtain an inverse filter corresponding to the i-th windowed voice signal.

For example, the electronic device may use an inverse filter of length L to perform inverse filtering on the fourth-frame linear prediction residual signal to obtain the fourth-frame filtered speech signal. The electronic device continuously changes the parameters of the inverse filter, so that the filtered speech signal of the fourth frame continuously changes. At the same time, the electronic device continuously detects the changing kurtosis of the filtered speech signal in the fourth frame. When the electronic device detects that the kurtosis of the filtered speech signal in the fourth frame is the smallest, an inverse filter that minimizes the kurtosis of the filtered speech signal in the fourth frame is obtained to obtain the inverse filter g(4) corresponding to the windowed speech signal in the fourth frame .

2034. The electronic device uses an inverse filter corresponding to the windowed i-frame speech signal to perform inverse filtering on the windowed i-frame speech signal to obtain the i-th frame fidelity speech signal.

The formula for the fidelity speech signal of frame i is:

z(i)=g(i)y(i). Among them, z(i) represents the fidelity speech signal of the i-th frame, g(i) represents the inverse filter corresponding to the windowed speech signal of the i-th frame, and y(i) represents the windowed speech signal of the i-th frame.

For example, the electronic device may use an inverse filter g(4) corresponding to the windowed speech signal of the fourth frame to perform inverse filtering on the windowed speech signal y(4) of the third frame to obtain the fidelity speech signal z(4 of the fourth frame ), that is, z(4)=g(4)y(4).

2035, the electronic device performs inverse filtering according to the linear prediction residual signal of the previous frame of the linear prediction residual signal of each frame after the i-th frame and the previous frame of the windowed speech signal of the windowed voice signal of each frame after the i-th frame To obtain the inverse filter corresponding to the windowed speech signal of each frame after the i-th frame.

It should be noted that the determination method of the inverse filter corresponding to the windowed speech signal of each frame after the i-th frame is different from the determination method of the inverse filter corresponding to the windowed speech signal of the i-th frame.

For example, for the windowed speech signal every frame after the fourth frame, that is, for the windowed speech signal every frame from the fifth frame to the eighth frame, the electronic device can linearly predict the residual according to each frame of the fifth frame to the eighth frame The linear prediction residual signal of the previous frame of the signal and the inverse filter corresponding to the windowed voice signal of the previous frame of the windowed speech signal of each frame of the 5th frame to the 8th frame, to obtain each frame of the 5th frame to the 8th frame The inverse filter corresponding to the windowed speech signal.

For example, for the windowed speech signal in the fifth frame, the electronic device may obtain the inverse filter corresponding to the windowed speech signal in the fifth frame according to the linear prediction residual signal in the fourth frame and the windowed speech signal in the fourth frame In the same way, the electronic device can obtain the inverse filter corresponding to the windowed voice signal in the sixth frame, the inverse filter corresponding to the windowed voice signal in the seventh frame, and the inverse filter corresponding to the windowed voice signal in the eighth frame.

2036, the electronic device obtains the multi-frame fidelity speech signal after the i-th frame according to the inverse filter corresponding to the windowed speech signal for each frame after the i-th frame and the windowed speech signal for each frame after the i-th frame.

For example, the electronic device may substitute the inverse filter g(5) corresponding to the windowed speech signal in the fifth frame and the windowed speech signal y(5) in the fifth frame into the formula z(i)=g(i)y(i), The fidelity speech signal z(5) of the fifth frame is obtained. Similarly, the electronic device can obtain the sixth frame fidelity signal z(6), the seventh frame fidelity signal z(7), and the eighth frame fidelity signal z(8), that is, the multi-frame fidelity after the fourth frame True voice signal.

2037. The electronic device combines the fidelity speech signal of the i-th frame and the multi-frame fidelity speech signal after the i-th frame to obtain a multi-frame fidelity speech signal.

For example, the electronic device combines the fidelity speech signal z(4) of the fourth frame and the fidelity speech signals z(5), z(6), z(7), and z(8) of the fifth frame to the eighth frame to obtain 5 Frame fidelity speech signal.

In some embodiments, when the electronic device detects that the i-th frame filtered speech signal has the smallest kurtosis, the electronic device may acquire the i-th frame filtered speech signal with the smallest kurtosis.

For example, the electronic device may use an inverse filter to perform inverse filtering on the linear prediction residual signal of the fourth frame to obtain the filtered speech signal of the fourth frame. The electronic device continuously changes the parameters of the inverse filter, so that the filtered speech signal of the fourth frame continuously changes. At the same time, the electronic device continuously detects the changing kurtosis of the filtered speech signal in the fourth frame. When the electronic device detects that the kurtosis of the inverse filtered speech signal in the fourth frame is the smallest, the fourth frame filtered speech signal s(4) with the smallest kurtosis is acquired.

Then, as shown in FIG. 4, the process 2035 may include the following processes:

20351, the electronic device obtains the corresponding i+1 frame windowed speech signal according to the linear prediction residual signal of the i frame, the inverse filter corresponding to the windowed speech signal of the i frame and the i frame filtered speech signal with the smallest kurtosis Inverse filter.

The calculation formula of the inverse filter corresponding to the windowed speech signal in frame i+1 is:

g(i+1)=g(i)+μe(i)w(i), where,

s(i) represents the i-frame filtered speech signal, g(i+1) represents the inverse filter corresponding to the windowed speech signal in frame i+1, and g(i) represents the inverse filter corresponding to the windowed speech signal in frame i W(i) represents the linear prediction residual signal of the ith frame, μ=3*10 ^-9 is the convergence step size, and E(x) represents the expectation.

For example, the electronic device may linearly predict the residual signal w(4) in the fourth frame, the inverse filter g(4) corresponding to the windowed speech signal in the fourth frame, and the fourth frame filtered speech signal s(4) with the least kurtosis To obtain the inverse filter g(5) corresponding to the windowed speech signal in the fifth frame. That is, g(5)=g(4)+μe(4)w(4), where,

20352, the electronic device uses the inverse filter corresponding to the windowed speech signal of the previous frame of the windowed speech signal of each frame after the i+1th frame and the linearly predicted residual signal of the previous frame of each frame after the i+1th frame Linearly predict the residual signal to obtain the filtered speech signal of the previous frame of the filtered speech signal of each frame after the i+1th frame.

20353, the electronic device predicts the linear prediction residual signal of the previous frame of the linear prediction residual signal of each frame after the i+1th frame, and the windowed voice signal of the previous frame of the windowed speech signal of each frame after the i+1th frame Corresponding inverse filter and the previous frame of the filtered speech signal of each frame after the i+1th frame of the filtered speech signal, to obtain an inverse filter corresponding to the windowed speech signal of each frame after the i+1th frame.

The calculation formula of the inverse filter corresponding to the windowed speech signal of each frame after the i+1th frame can be:

g(i+j+1)=g(i+j)+μe(i+j)w(i+j), where,

s(i+j) represents the filtered speech signal in frame i+j, g(i+j) represents the inverse filter corresponding to the windowed speech signal in frame i+j, and g(i+j+1) represents the i+j The inverse filter corresponding to the windowed speech signal of frame j+1, w(i+j) represents the linear prediction residual signal of frame i+j, μ=3*10 ^-9 is the convergence step, and E(x) represents the expected , J≥1.

According to g(i+j+1)=g(i+j)+μe(i+j)w(i+j) and

It can be determined that the inverse filter corresponding to the windowed voice signal of the current frame can linearly predict the residual signal according to the previous frame of the current frame, the inverse filter corresponding to the windowed voice signal of the previous frame of the current frame and the previous frame of the current frame The filtered speech signal is determined.

For example, the inverse filter corresponding to the windowed speech signal in frame i+2 can be linearly predicted residual signal according to frame i+1, the inverse filter corresponding to the windowed speech signal in frame i+1 and the filter in frame i+1 The voice signal is determined. In this embodiment, before performing the process 20353, the electronic device has obtained the windowed speech signal and the linear prediction residual signal of the i+1th frame. Therefore, in order to determine the inverse filter corresponding to the windowed speech signal in the i+2 frame, the filtered speech signal in the i+1 frame needs to be determined, that is, the filtered speech signal in the previous frame of the filtered speech signal in the i+2 frame needs to be determined.

The calculation formula of the filtered speech signal of the previous frame of the filtered speech signal of each frame after the i+1th frame is:

s(i+j)=g(i+j)w(i+j), where s(i+j) represents the filtered speech signal of frame i+j, and g(i+j) represents the frame i+j The inverse filter corresponding to the windowed speech signal, w(i+j) represents the linear prediction residual signal of frame i+j, j≥1.

For example, if i=4, the filtered voice signal of the previous frame of the filtered voice signal of frame i+2 (in this case j=1), that is, the filtered voice signal of frame i+1, that is, the filtered voice signal of frame 5 s( 5)=g(5)w(5). Therefore, the inverse filter corresponding to the windowed speech signal in the i+2 frame, that is, the inverse filter g(6) corresponding to the windowed speech signal in the 6th frame, can linearly predict the residual signal according to the 5th frame, and the windowed speech in the 5th frame The inverse filter corresponding to the signal and the 5th frame filtered speech signal are determined, that is, g(6)=g(5)+μe(5)w(5), where,

When the inverse filter g(6) corresponding to the windowed speech signal in the sixth frame is obtained, the electronic device may determine the filtered speech signal s(6)=g(6)w(6) in the sixth frame. Therefore, the inverse filter g(7)=g(6)+μe(6)w(6) corresponding to the windowed speech signal in the seventh frame, where,

Similarly, the electronic device can obtain the inverse filter g(8) corresponding to the windowed speech signal of the eighth frame.

20354, the electronic device obtains the corresponding windowed speech signal after the i frame according to the inverse filter corresponding to the windowed speech signal of the i+1 frame and the inverse filter corresponding to the windowed speech signal of each frame after the i+1 frame Inverse filter.

For example, the electronic device according to the inverse filter g(5) corresponding to the windowed speech signal in the fifth frame, the inverse filter g(6) corresponding to the windowed speech signal in the sixth frame, and the inverse filter corresponding to the windowed speech signal in the seventh frame G(7), the inverse filter g(8) corresponding to the windowed speech signal of the eighth frame, and the inverse filter corresponding to the windowed speech signal of each frame after the fourth frame is obtained.

In some embodiments, the process 201 may include the following process:

Obtain the user's voice information when the electronic device is in a call state;

The electronic device detects whether the user's voice information includes preset keywords;

If the user's voice information includes preset keywords, the electronic device acquires the reverberation voice signal.

The voice information of the user may be the voice information of the callee. For example, when the electronic device is in a call state, the electronic device obtains the voice information of the callee who is in a call with the current user, and then detects whether the callee's voice information includes preset keywords. Among them, the preset keywords may be "inaudible", "say again" and so on. When the voice information of the callee includes the preset keyword "inaudible", etc., there may be a current user too far away from the electronic device, resulting in the signal collected by the electronic device is a mixture of direct sound and reflected sound The reverberation speech signal in this embodiment. Therefore, when the electronic device detects that the user's voice information includes a preset keyword, the electronic device can obtain the mixed signal of the reverberation voice signal and the microphone.

In some embodiments, when the electronic device performs the process of acquiring the reverberation voice signal if the user's voice information includes preset keywords, the following process may be performed:

If the user's voice information includes preset keywords, the electronic device generates a record and saves the record;

When the number of saved records is greater than the preset number threshold, the electronic device acquires the reverberation voice signal.

In order to reduce the processing load of the processor, it is also considered that the current signal of the electronic device may be poor, which causes the electronic device to detect that the voice information of the call target includes the preset keyword "inaudible" and the like. Therefore, the electronic device can generate a record once and save the record when it is detected that the voice information of the call target includes the preset keyword "inaudible" and the like. When the number of saved records is greater than the preset number threshold, the electronic device acquires the reverberation voice signal. Wherein, the preset number threshold can be set by the user or determined by the electronic device, etc., which is not specifically limited here. Assuming that the preset number threshold is set to 10, then the electronic device will obtain the reverberation speech signal when the number of saved records is 11.

When the electronic device executes the process of acquiring the reverberation voice signal, the electronic device may delete the saved record and stop performing the process of detecting whether the user's voice information includes the preset keyword.

Similarly, in order to reduce the processing load of the processor, after a period of time after the electronic device acquires the reverberation voice signal, if the preset keyword is not detected in the callee's voice information, it may indicate that the user is closer to the electronic device In this case, the microphone no longer collects mixed sound. At this time, the electronic device can stop the process of acquiring the reverberation voice signal.

For example, after 20 minutes after the electronic device acquires the reverberation voice signal, if the preset keyword is not detected in the callee's voice information, it may indicate that the user is closer to the electronic device, and the microphone collected is no longer Mixed sound, at this time, the electronic device can stop the process of acquiring the reverberation voice signal.

In some embodiments, the process 201 may include the following process:

When the electronic device is to perform voiceprint recognition or voice recognition, the electronic device detects whether the distance between the sound source and the electronic device is greater than a preset distance threshold;

If the distance between the sound source and the electronic device is greater than the preset distance threshold, the electronic device acquires the reverberation voice signal.

It can be understood that if the distance between the sound source and the electronic device is too far, the electronic device collects a mixed signal of direct sound and reflected sound, that is, a reverberated speech signal, and the reverberated speech signal has a reverberation phenomenon, and the reverberation The phenomenon can interfere with the results of voiceprint recognition and voice recognition of electronic devices.

Therefore, when the electronic device is to perform voiceprint recognition or voice recognition, the electronic device can detect whether the distance between the sound source and the electronic device is greater than a preset distance threshold, and if the distance between the sound source and the electronic device is greater than the preset distance threshold , The electronic device obtains the reverberation voice signal.

Among them, the preset distance threshold can be set according to the actual situation, and no specific limitation is made here.

Please refer to FIG. 5, which is a schematic structural diagram of a voice processing device 300 according to an embodiment of the present application. The voice processing apparatus 300 may include: an acquisition module 301, a processing module 302, a transformation module 303, a calculation module 304, and a construction module 305.

The obtaining module 301 is used to obtain a reverberation voice signal;

The processing module 302 is configured to perform signal fidelity processing on the reverberation speech signal to obtain a first speech signal.

The transformation module 303 is configured to perform Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal.

The calculation module 304 is configured to calculate a second power spectrum according to the first power spectrum.

The construction module 305 is configured to construct a target clean speech signal according to the phase spectrum corresponding to the first speech signal and the second power spectrum.

In some embodiments, the acquisition module 301 may be used to: perform windowing and framing processing on the reverberation speech signal to obtain a multi-frame windowed speech signal;

The processing module 302 may be used to: perform signal fidelity processing on the windowed voice signal of each frame to obtain a multi-frame fidelity voice signal, and the multi-frame fidelity voice signal constitutes the first voice signal;

The transformation module 303 may be used to: perform Fourier transform on each frame of the fidelity speech signal to obtain a phase spectrum and a power spectrum corresponding to the multi-frame fidelity speech signal respectively, and the multi-frame fidelity speech signal respectively correspond to The phase spectrum and the power spectrum constitute the phase spectrum and the first power spectrum corresponding to the first speech signal;

The calculation module 304 may be used to calculate a third power spectrum according to the power spectrum corresponding to the fidelity speech signal of each frame to obtain a plurality of third power spectrums, the plurality of third power spectrums constituting the second power Spectrum

The construction module 305 may be used to: construct a clean speech signal for each frame according to the phase spectrum corresponding to the fidelity speech signal of each frame and the plurality of third power spectra, and obtain multiple frames of clean speech signals; The clean voice signal is windowed and framed to obtain the target clean voice signal.

In some embodiments, the processing module 302 may be used to: perform linear prediction and analysis on the i-th frame and every windowed voice signal after the i-th frame to obtain the i-th frame and every frame after the i-th frame Prediction residual signal; inverse filter is used to inverse filter the i-th frame linear prediction residual signal to obtain the i-th frame filtered speech signal; when the i-th frame filtered speech signal has the smallest kurtosis, the i-th frame is acquired Filtering the inverse filter with the smallest kurtosis of the speech signal to obtain an inverse filter corresponding to the windowed speech signal of the i frame; using the inverse filter corresponding to the windowed speech signal of the i frame to inverse the windowed speech signal of the i frame Filtering process to get the i-th frame fidelity speech signal; linearly predict the residual signal according to the previous frame of the linear prediction residual signal for each frame after the i-th frame and the previous frame of the windowed speech signal for each frame after the i-th frame An inverse filter corresponding to the windowed speech signal to obtain an inverse filter corresponding to the windowed speech signal for each frame after the i-th frame; according to the inverse filter and the i-th corresponding to the windowed speech signal for each frame after the i-th frame Windowed speech signals for each frame after the frame to obtain multi-frame fidelity speech signals after the i-th frame; combining the i-th frame fidelity speech signal with the multi-frame fidelity speech signal after the i-th frame to obtain multi-frame fidelity speech signal.

In some embodiments, the processing module 302 may be used to: when the i-th frame filtered speech signal has the smallest kurtosis, obtain the i-th frame filtered speech signal with the smallest kurtosis; linearly predict the residual according to the i-th frame The inverse filter corresponding to the signal and the i-th windowed speech signal and the i-th frame filtered speech signal with the smallest kurtosis to obtain the inverse filter corresponding to the i+1th windowed speech signal; The inverse filter corresponding to the windowed speech signal of the previous frame of the windowed speech signal of each frame and the linear predicted residual signal of the previous frame of the linear prediction residual signal of each frame after the i+1th frame, to obtain the i+1th frame Filtered speech signal of the previous frame of the filtered speech signal of each frame; Linearly predicted residual signal of the previous frame of the linear prediction residual signal of each frame after the i+1th frame, plus The inverse filter corresponding to the windowed speech signal in the previous frame of the windowed speech signal and the filtered speech signal in the previous frame of the filtered speech signal in each frame after the i+1th frame, to obtain the windowed speech in each frame after the i+1th frame The inverse filter corresponding to the signal; according to the inverse filter corresponding to the windowed speech signal in the i+1th frame and the inverse filter corresponding to the windowed speech signal in each frame after the i+1th frame, the windowing after the ith frame is obtained The inverse filter corresponding to the voice signal.

In some embodiments, the obtaining module 301 may be used to: obtain the user's voice information when the electronic device is in a call state; detect whether the user's voice information includes preset keywords; if the user's If the voice information includes preset keywords, the reverb voice signal is obtained.

In some embodiments, the acquisition module 301 may be used to: if the user's voice information includes preset keywords, generate a record and save the record; when the number of saved records is greater than the preset number At the threshold, the reverberation speech signal is obtained.

In some embodiments, the acquisition module 301 may be used to: when the electronic device is to perform voiceprint recognition or voice recognition, detect whether the distance between the sound source and the electronic device is greater than a preset distance threshold; if the sound If the distance between the source and the electronic device is greater than the preset distance threshold, the reverberation voice signal is acquired.

An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed on a computer, the computer is caused to execute the process in the voice processing method provided in this embodiment.

An embodiment of the present application also provides an electronic device, including a memory and a processor, where a computer program is stored in the memory, and the processor is used to execute the computer program stored in the memory by executing the computer program In the voice processing method.

For example, the aforementioned electronic device may be a mobile terminal such as a tablet computer or a smart phone.

Please refer to FIG. 6, which is a first schematic structural diagram of an electronic device provided by an embodiment of the present application.

The electronic device 400 may include components such as a microphone 401, a memory 402, and a processor 403. Those skilled in the art may understand that the structure of the electronic device shown in FIG. 6 does not constitute a limitation on the electronic device, and may include more or less components than those illustrated, or combine certain components, or arrange different components.

The microphone 401 can be used to pick up the voice uttered by the user and the like.

The memory 402 may be used to store application programs and data. The application program stored in the memory 402 contains executable code. The application program can form various functional modules. The processor 403 executes application programs stored in the memory 402 to execute various functional applications and data processing.

The processor 403 is the control center of the electronic device, and uses various interfaces and lines to connect the various parts of the entire electronic device, and executes the electronic device by running or executing the application program stored in the memory 402 and calling the data stored in the memory 402 Various functions and processing data, so as to carry out overall monitoring of electronic equipment.

In this embodiment, the processor 403 in the electronic device loads the executable code corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 403 executes and stores the memory in the memory The application in 402, thereby implementing the process:

Obtain the reverberation voice signal;

Calculating a second power spectrum according to the first power spectrum;

Please refer to FIG. 7, which is a second schematic structural diagram of an electronic device according to an embodiment of the present application.

The electronic device 500 may include components such as a microphone 501, a memory 502, a processor 503, an input unit 504, an output unit 505, a speaker 506, and the like.

The microphone 501 can be used to pick up the voice uttered by the user and the like.

The memory 502 may be used to store application programs and data. The application program stored in the memory 502 contains executable code. The application program can form various functional modules. The processor 503 executes application programs stored in the memory 502 to execute various functional applications and data processing.

The processor 503 is the control center of the electronic device, and uses various interfaces and lines to connect various parts of the entire electronic device, and executes the electronic device by running or executing application programs stored in the memory 502 and calling data stored in the memory 502 Various functions and processing data, so as to carry out overall monitoring of electronic equipment.

The input unit 504 may be used to receive input numbers, character information, or user characteristic information (such as fingerprints), and generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.

The output unit 505 may be used to display information input by the user or provided to the user and various graphical user interfaces of the electronic device. These graphical user interfaces may be composed of graphics, text, icons, video, and any combination thereof. The output unit may include a display panel.

The speaker 506 may be used to convert electrical signals into sound.

In this embodiment, the processor 503 in the electronic device loads the executable code corresponding to the process of one or more application programs into the memory 502 according to the following instructions, and the processor 503 runs the stored code in the memory The application in 502, thereby implementing the process:

Obtain the reverberation voice signal;

Calculating a second power spectrum according to the first power spectrum;

In some embodiments, after the processor 503 executes the process of acquiring the reverberation speech signal, it may further perform: performing windowing and framing processing on the reverberation speech signal to obtain a multi-frame windowed speech signal; the processor 503 When performing the process of performing signal fidelity processing on the reverberation speech signal to obtain the first speech signal, it may perform: performing signal fidelity processing on the windowed speech signal of each frame to obtain a multi-frame fidelity speech signal. The multi-frame fidelity speech signal constitutes the first speech signal; the processor 503 executes the Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal The process can be performed by performing a Fourier transform on each frame of the fidelity speech signal to obtain a phase spectrum and a power spectrum corresponding to the multi-frame fidelity speech signal, respectively. The power spectrum constitutes the phase spectrum and the first power spectrum corresponding to the first speech signal; when the processor 503 executes the process of calculating the second power spectrum according to the first power spectrum, it may perform: according to each frame of the fidelity speech signal Corresponding power spectrum, calculate a third power spectrum to obtain multiple third power spectrums, the multiple third power spectrums constitute the second power spectrum; the processor 503 executes the When the phase spectrum and the second power spectrum are used to construct the target clean speech signal, the following steps may be performed: according to the phase spectrum corresponding to each frame of the fidelity speech signal and the multiple third power spectra, construct a clean speech signal for each frame, Obtain multiple frames of clean speech signals; perform windowing and framing processing on the multiple frames of clean speech signals to obtain target clean speech signals.

In some embodiments, the processor 503 executes the process of performing signal fidelity processing on the windowed speech signal of each frame to obtain a multi-frame fidelity speech signal, and may execute: the i frame and each subsequent frame after the i frame Frame windowed speech signal is subjected to linear prediction analysis to obtain the i-th frame and the i-th frame after each linear prediction residual signal; the inverse filter is used to inversely filter the i-th frame linear prediction residual signal to obtain the i-th frame Filter the speech signal; when the kurtosis of the i-th frame filtered speech signal is detected to be the smallest, obtain the inverse filter that minimizes the kurtosis of the i-th frame filtered speech signal to obtain the inverse filter corresponding to the i-th windowed speech signal; The inverse filter corresponding to the i-th windowed speech signal performs inverse filtering on the i-th windowed speech signal to obtain the i-th frame fidelity speech signal; the linear prediction of the residual signal before each frame after the i-th frame A frame of linear prediction residual signal and an inverse filter corresponding to the windowed speech signal of the previous frame of the windowed speech signal of each frame after the i frame to obtain an inverse filter corresponding to the windowed speech signal of each frame after the i frame ; According to the inverse filter corresponding to the windowed speech signal of each frame after the i-th frame and the windowed speech signal of each frame after the i-th frame, a multi-frame fidelity speech signal after the i-th frame is obtained; combining the i-th frame The fidelity speech signal and the multi-frame fidelity speech signal after the i-th frame obtain a multi-frame fidelity speech signal.

In some embodiments, the processor 503 may further execute: when it is detected that the i-th frame filtered speech signal has the smallest kurtosis, obtain the i-th frame filtered speech signal with the smallest kurtosis; after the processor 503 executes the The inverse filter corresponding to the previous frame of the linear prediction residual signal of each frame of the linear prediction residual signal and the previous frame of the windowed speech signal of each frame after the i-th frame, to obtain When the process of the inverse filter corresponding to the framed windowed speech signal can be performed: the linear prediction residual signal according to the ith frame, the inverse filter corresponding to the windowed speech signal of the ith frame, and the ith frame filtered speech signal with the smallest kurtosis , Get the inverse filter corresponding to the windowed voice signal of the i+1th frame; according to the inverse filter corresponding to the windowed voice signal of the previous frame of the windowed voice signal of each frame after the i+1th frame and the i+1th frame The linear prediction residual signal of the previous frame of the linear prediction residual signal of each frame afterwards, to obtain the filtered voice signal of the previous frame of the filtered speech signal of each frame after the i+1 frame; according to each frame after the i+1 frame The linear prediction residual signal in the previous frame of the linear prediction residual signal, the inverse filter corresponding to the windowed voice signal in the previous frame of the windowed speech signal in each frame after the i+1th frame, and each frame after the i+1 frame Frame filter the speech signal in the previous frame to filter the speech signal to obtain the inverse filter corresponding to the windowed speech signal of each frame after the i+1th frame; according to the inverse filter and i The inverse filter corresponding to the windowed speech signal of each frame after the +1 frame is obtained, and the inverse filter corresponding to the windowed speech signal after the i-th frame is obtained.

In some embodiments, when the processor 503 executes the process of acquiring the reverberation voice signal, it may perform: acquiring the user's voice information when the electronic device is in a call state; detecting whether the user's voice information includes a preset Keywords; if the user's voice information includes preset keywords, the reverb voice signal is obtained.

In some embodiments, the processor 503 executes the process of acquiring a reverberation voice signal if the user's voice information includes a preset keyword, and may execute: if the user's voice information includes a preset For keywords, a record is generated and the record is saved; when the number of saved records is greater than a preset number threshold, a reverberation speech signal is acquired.

In some embodiments, when the processor 503 executes the process of acquiring the reverberation speech signal, it may perform: when the electronic device is to perform voiceprint recognition or voice recognition, detect whether the distance between the sound source and the electronic device is greater than the Set a distance threshold; if the distance between the sound source and the electronic device is greater than a preset distance threshold, then obtain a reverberation speech signal.

In the above embodiments, the description of each embodiment has its own emphasis. For a part that is not detailed in an embodiment, you can refer to the detailed description of the voice processing method above, and it will not be repeated here.

The voice processing device provided by the embodiment of the present application and the voice processing method in the above embodiments belong to the same concept, and any method provided in the voice processing method embodiment can be run on the voice processing device, which is specific For the implementation process, please refer to the embodiment of the voice processing method, which will not be repeated here.

It should be noted that for the voice processing method described in the embodiments of the present application, a person of ordinary skill in the art can understand that all or part of the process of implementing the voice processing method described in the embodiments of the present application can be controlled by a computer program related hardware To complete, the computer program may be stored in a computer readable storage medium, such as stored in a memory, and executed by at least one processor, the execution process may include the process of the embodiment of the voice processing method as described . The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM, Read Only Memory), a random access memory (RAM, Random Access Memory), and so on.

For the voice processing device of the embodiment of the present application, each functional module may be integrated into one processing chip, or each module may exist alone physically, or two or more modules are integrated into one module. The above integrated modules may be implemented in the form of hardware or software function modules. If the integrated module is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium, such as a read-only memory, magnetic disk, or optical disk, etc. .

The speech processing method, device, storage medium, and electronic equipment provided by the embodiments of the present application are described in detail above. Specific examples are used to explain the principles and implementation of the present application. The descriptions of the above embodiments are only It is used to help understand the method of this application and its core ideas; meanwhile, for those skilled in the art, according to the ideas of this application, there will be changes in the specific implementation and application scope. In summary, this specification The content should not be interpreted as a limitation of this application.

Claims

A voice processing method, including:

Obtain the reverberation voice signal;

Performing signal fidelity processing on the reverberation speech signal to obtain a first speech signal;

Performing a Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal;

Calculating a second power spectrum according to the first power spectrum;

According to the phase spectrum corresponding to the first speech signal and the second power spectrum, a target clean speech signal is constructed.
The speech processing method according to claim 1, wherein after the reverberation speech signal is acquired, further comprising:

Windowing and framing the reverberation speech signal to obtain a multi-frame windowing speech signal;

The signal fidelity processing of the reverberation speech signal to obtain the first speech signal includes:

Performing signal fidelity processing on the windowed speech signal of each frame to obtain a multi-frame fidelity speech signal, the multi-frame fidelity speech signal constituting the first speech signal;

The performing Fourier transform on the first speech signal to obtain the phase spectrum and the first power spectrum corresponding to the first speech signal includes:

Fourier transform the fidelity speech signal of each frame to obtain the phase spectrum and power spectrum corresponding to the multi-frame fidelity speech signal respectively, the phase spectrum and power spectrum corresponding to the multi-frame fidelity speech signal respectively constitute the first voice signal Corresponding phase spectrum and first power spectrum;

The calculating the second power spectrum according to the first power spectrum includes:

Calculating a third power spectrum according to the power spectrum corresponding to the fidelity speech signal of each frame to obtain a plurality of third power spectrums, the plurality of third power spectrums constituting the second power spectrum;

The constructing the target clean speech signal according to the phase spectrum corresponding to the first speech signal and the second power spectrum includes:

Construct a clean voice signal for each frame according to the phase spectrum corresponding to each frame of the fidelity voice signal and the multiple third power spectrums to obtain multiple frames of clean voice signals;

Windowing and framing the multi-frame clean speech signal to obtain the target clean speech signal.
The speech processing method according to claim 2, wherein said performing signal fidelity processing on the windowed speech signal of each frame to obtain a multi-frame fidelity speech signal includes:

Perform linear prediction and analysis on the windowed speech signal of the i-th frame and each frame after the i-th frame to obtain the linear prediction residual signal of the i-th frame and each frame after the i-th frame;

The inverse filter is used to inversely filter the linear prediction residual signal of the ith frame to obtain the ith frame filtered speech signal;

When the kurtosis of the i-th frame filtered speech signal is detected to be the smallest, an inverse filter that minimizes the kurtosis of the i-th frame filtered speech signal is obtained to obtain an inverse filter corresponding to the i-th windowed speech signal

Adopting an inverse filter corresponding to the windowed speech signal of the i-th frame to perform inverse filtering processing on the windowed speech signal of the i-th frame to obtain a fidelity speech signal of the i-th frame;

According to the inverse filter corresponding to the linear prediction residual signal of the previous frame of the linear prediction residual signal of each frame after the i-th frame and the previous frame windowed voice signal of the windowed speech signal of each frame after the i-th frame, the first The inverse filter corresponding to the windowed speech signal of each frame after the i frame;

Obtaining a multi-frame fidelity speech signal after the i-th frame according to the inverse filter corresponding to the windowed speech signal for each frame after the i-th frame and the windowed speech signal for each frame after the i-th frame;

Combining the i-th frame fidelity speech signal and the multi-frame fidelity speech signal after the i-th frame, a multi-frame fidelity speech signal is obtained.
The voice processing method according to claim 3, wherein the method further comprises:

When it is detected that the i-th frame filtered speech signal has the smallest kurtosis, the i-th frame filtered speech signal with the smallest kurtosis is acquired;

The inverse filter corresponding to the linear prediction residual signal of the previous frame of the linear prediction residual signal of each frame after the i-th frame and the previous frame of the windowed speech signal of the windowed voice signal of each frame after the i-th frame, Obtain the inverse filter corresponding to the windowed speech signal of each frame after the i-th frame, including:

Obtain the inverse filter corresponding to the windowed voice signal in frame i+1 according to the linear prediction residual signal in the i frame, the inverse filter corresponding to the windowed voice signal in the i frame and the i-frame filtered voice signal with the least kurtosis;

According to the inverse filter corresponding to the windowed speech signal of the previous frame of the windowed speech signal of each frame after the i+1th frame and the linear prediction residual of the previous frame of the linear prediction residual signal of each frame after the i+1th frame Signal to obtain the filtered speech signal of the previous frame of the filtered speech signal of each frame after the i+1th frame;

According to the inverse filtering of the linear prediction residual signal of the previous frame of the linear prediction residual signal of each frame after the i+1 frame, and the previous frame of the windowed voice signal of the windowed voice signal of each frame after the i+1 frame The filter and the previous frame of the filtered voice signal of each frame after the i+1th frame, to obtain the inverse filter corresponding to the windowed voice signal of each frame after the i+1th frame;

According to the inverse filter corresponding to the windowed speech signal of the i+1th frame and the inverse filter corresponding to the windowed speech signal of each frame after the i+1th frame, the inverse filter corresponding to the windowed speech signal after the ith frame is obtained .
The speech processing method according to claim 1, wherein the acquiring the reverberation speech signal comprises:

Obtain the user's voice information when the electronic device is in a call state;

Detecting whether the user's voice information includes preset keywords;

If the user's voice information includes preset keywords, a reverb voice signal is obtained.
The voice processing method according to claim 5, wherein, if the user's voice information includes a preset keyword, acquiring the reverberation voice signal includes:

If the user's voice information includes preset keywords, a record is generated and the record is saved;

When the number of saved records is greater than the preset number threshold, the reverberation voice signal is acquired.
The speech processing method according to claim 1, wherein the acquiring the reverberation speech signal comprises:

When the electronic device is to perform voiceprint recognition or voice recognition, detect whether the distance between the sound source and the electronic device is greater than a preset distance threshold;

If the distance between the sound source and the electronic device is greater than a preset distance threshold, a reverberation voice signal is acquired.
A voice processing device, including:

Acquisition module for acquiring reverberation voice signals;

A processing module, configured to perform signal fidelity processing on the reverberation speech signal to obtain a first speech signal;

A transformation module, configured to perform Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal;

A calculation module, configured to calculate a second power spectrum according to the first power spectrum;

The construction module is configured to construct a target clean speech signal according to the phase spectrum corresponding to the first speech signal and the second power spectrum.
The voice processing device according to claim 8, wherein the acquisition module is configured to:

Windowing and framing the reverberation speech signal to obtain a multi-frame windowing speech signal;

The processing module is configured to: perform signal fidelity processing on the windowed voice signal of each frame to obtain a multi-frame fidelity voice signal, and the multi-frame fidelity voice signal constitutes the first voice signal;

The transform module is used for: performing Fourier transform on each frame of the fidelity speech signal to obtain the phase spectrum and power spectrum corresponding to the multi-frame fidelity speech signal respectively, and the phase spectrum corresponding to the multi-frame fidelity speech signal respectively And the power spectrum constitute a phase spectrum and a first power spectrum corresponding to the first speech signal;

The calculation module is configured to calculate a third power spectrum according to the power spectrum corresponding to the fidelity speech signal of each frame to obtain a plurality of third power spectrums, and the plurality of third power spectrums constitute the second power spectrum;

The construction module is configured to: according to the phase spectrum corresponding to the fidelity speech signal of each frame and the plurality of third power spectra, construct a clean speech signal of each frame to obtain multiple frames of clean speech signals; The signal is windowed and framed to obtain the target clean speech signal.
The voice processing device according to claim 9, wherein the processing module is configured to:

Perform linear prediction and analysis on the windowed speech signal of the i-th frame and each frame after the i-th frame to obtain the linear prediction residual signal of the i-th frame and each frame after the i-th frame;

The inverse filter is used to inversely filter the linear prediction residual signal of the ith frame to obtain the ith frame filtered speech signal;

When the kurtosis of the i-th frame filtered speech signal is detected to be the smallest, an inverse filter that minimizes the kurtosis of the i-th frame filtered speech signal is obtained to obtain an inverse filter corresponding to the i-th windowed speech signal;

Adopting an inverse filter corresponding to the windowed speech signal of the i-th frame to perform inverse filtering processing on the windowed speech signal of the i-th frame to obtain a fidelity speech signal of the i-th frame;

According to the inverse filter corresponding to the linear prediction residual signal of the previous frame of the linear prediction residual signal of each frame after the i-th frame and the previous frame windowed voice signal of the windowed speech signal of each frame after the i-th frame, the first The inverse filter corresponding to the windowed speech signal of each frame after the i frame;

Obtaining a multi-frame fidelity speech signal after the i-th frame according to the inverse filter corresponding to the windowed speech signal for each frame after the i-th frame and the windowed speech signal for each frame after the i-th frame;

Combining the i-th frame fidelity speech signal and the multi-frame fidelity speech signal after the i-th frame, a multi-frame fidelity speech signal is obtained.
The voice processing device according to claim 10, wherein the processing module is configured to:

When it is detected that the i-th frame filtered speech signal has the smallest kurtosis, the i-th frame filtered speech signal with the smallest kurtosis is acquired;

Obtain the inverse filter corresponding to the windowed voice signal in frame i+1 according to the linear prediction residual signal in the i frame, the inverse filter corresponding to the windowed voice signal in the i frame and the i-frame filtered voice signal with the least kurtosis;

According to the inverse filter corresponding to the windowed speech signal of the previous frame of the windowed speech signal of each frame after the i+1th frame and the linear prediction residual of the previous frame of the linear prediction residual signal of each frame after the i+1th frame Signal to obtain the filtered speech signal of the previous frame of the filtered speech signal of each frame after the i+1th frame;

According to the inverse filtering of the linear prediction residual signal of the previous frame of the linear prediction residual signal of each frame after the i+1 frame, and the previous frame of the windowed voice signal of the windowed voice signal of each frame after the i+1 frame The filter and the previous frame of the filtered voice signal of each frame after the i+1th frame, to obtain the inverse filter corresponding to the windowed voice signal of each frame after the i+1th frame;

According to the inverse filter corresponding to the windowed speech signal of the i+1th frame and the inverse filter corresponding to the windowed speech signal of each frame after the i+1th frame, the inverse filter corresponding to the windowed speech signal after the ith frame is obtained .
The voice processing device according to claim 8, wherein the acquisition module is configured to:

Obtain the user's voice information when the electronic device is in a call state;

Detecting whether the user's voice information includes preset keywords;

If the user's voice information includes preset keywords, a reverb voice signal is obtained.
A storage medium, wherein a computer program is stored in the storage medium, and when the computer program runs on a computer, the computer is caused to execute the voice processing method according to any one of claims 1 to 7.
An electronic device, wherein the electronic device includes a processor and a memory, a computer program is stored in the memory, and the processor is used to execute the computer program by calling the computer program stored in the memory:

Obtain the reverberation voice signal;

Performing signal fidelity processing on the reverberation speech signal to obtain a first speech signal;

Performing a Fourier transform on the first speech signal to obtain a phase spectrum and a first power spectrum corresponding to the first speech signal;

Calculating a second power spectrum according to the first power spectrum;

According to the phase spectrum corresponding to the first speech signal and the second power spectrum, a target clean speech signal is constructed.
The electronic device according to claim 14, wherein the processor is configured to execute:

Windowing and framing the reverberation speech signal to obtain a multi-frame windowing speech signal;

Performing signal fidelity processing on the windowed speech signal of each frame to obtain a multi-frame fidelity speech signal, the multi-frame fidelity speech signal constituting the first speech signal;

Fourier transform the fidelity speech signal of each frame to obtain the phase spectrum and power spectrum corresponding to the multi-frame fidelity speech signal respectively, the phase spectrum and power spectrum corresponding to the multi-frame fidelity speech signal respectively constitute the first voice signal Corresponding phase spectrum and first power spectrum;

Calculating a third power spectrum according to the power spectrum corresponding to the fidelity speech signal of each frame to obtain a plurality of third power spectrums, the plurality of third power spectrums constituting the second power spectrum;

Construct a clean voice signal for each frame according to the phase spectrum corresponding to each frame of the fidelity voice signal and the multiple third power spectrums to obtain multiple frames of clean voice signals;

Windowing and framing the multi-frame clean speech signal to obtain the target clean speech signal.
The electronic device according to claim 15, wherein the processor is configured to execute:

Perform linear prediction and analysis on the windowed speech signal of the i-th frame and each frame after the i-th frame to obtain the linear prediction residual signal of the i-th frame and each frame after the i-th frame;

The inverse filter is used to inversely filter the linear prediction residual signal of the ith frame to obtain the ith frame filtered speech signal;

When the kurtosis of the i-th frame filtered speech signal is detected to be the smallest, an inverse filter that minimizes the kurtosis of the i-th frame filtered speech signal is obtained to obtain an inverse filter corresponding to the i-th windowed speech signal;

Adopting an inverse filter corresponding to the windowed speech signal of the i-th frame to perform inverse filtering processing on the windowed speech signal of the i-th frame to obtain a fidelity speech signal of the i-th frame;

According to the inverse filter corresponding to the linear prediction residual signal of the previous frame of the linear prediction residual signal of each frame after the i-th frame and the previous frame windowed voice signal of the windowed speech signal of each frame after the i-th frame, the first The inverse filter corresponding to the windowed speech signal of each frame after the i frame;

Obtaining a multi-frame fidelity speech signal after the i-th frame according to the inverse filter corresponding to the windowed speech signal for each frame after the i-th frame and the windowed speech signal for each frame after the i-th frame;

Combining the i-th frame fidelity speech signal and the multi-frame fidelity speech signal after the i-th frame, a multi-frame fidelity speech signal is obtained.
The electronic device according to claim 16, wherein the processor is configured to execute:

When it is detected that the i-th frame filtered speech signal has the smallest kurtosis, the i-th frame filtered speech signal with the smallest kurtosis is acquired;

Obtain the inverse filter corresponding to the windowed voice signal in frame i+1 according to the linear prediction residual signal in the i frame, the inverse filter corresponding to the windowed voice signal in the i frame and the i-frame filtered voice signal with the least kurtosis;

According to the inverse filter corresponding to the windowed speech signal of the previous frame of the windowed speech signal of each frame after the i+1th frame and the linear prediction residual of the previous frame of the linear prediction residual signal of each frame after the i+1th frame Signal to obtain the filtered speech signal of the previous frame of the filtered speech signal of each frame after the i+1th frame;

According to the inverse filtering of the linear prediction residual signal of the previous frame of the linear prediction residual signal of each frame after the i+1 frame, and the previous frame of the windowed voice signal of the windowed voice signal of each frame after the i+1 frame The filter and the previous frame of the filtered voice signal of each frame after the i+1th frame, to obtain the inverse filter corresponding to the windowed voice signal of each frame after the i+1th frame;

According to the inverse filter corresponding to the windowed speech signal of the i+1th frame and the inverse filter corresponding to the windowed speech signal of each frame after the i+1th frame, the inverse filter corresponding to the windowed speech signal after the ith frame is obtained .
The electronic device according to claim 14, wherein the processor is configured to execute:

Obtain the user's voice information when the electronic device is in a call state;

Detecting whether the user's voice information includes preset keywords;

If the user's voice information includes preset keywords, a reverb voice signal is obtained.
The electronic device according to claim 18, wherein the processor is configured to execute:

If the user's voice information includes preset keywords, a record is generated and the record is saved;

When the number of saved records is greater than the preset number threshold, the reverberation voice signal is acquired.
The electronic device according to claim 14, wherein the processor is configured to execute:

When the electronic device is to perform voiceprint recognition or voice recognition, detect whether the distance between the sound source and the electronic device is greater than a preset distance threshold;

If the distance between the sound source and the electronic device is greater than a preset distance threshold, a reverberation voice signal is acquired.