CN115620737A - Voice signal processing device, method, electronic equipment and sound amplification system


Info

Publication number
CN115620737A
Authority
CN
China
Prior art keywords
signal
voice
sample
speech
processing
Prior art date
Legal status
Pending
Application number
CN202211193923.4A
Other languages
Chinese (zh)
Inventor
徐友聚
朱福国
秦亚光
尹悦
Current Assignee
Beijing Eswin Computing Technology Co Ltd
Original Assignee
Beijing Eswin Computing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Eswin Computing Technology Co Ltd filed Critical Beijing Eswin Computing Technology Co Ltd
Priority to CN202211193923.4A
Publication of CN115620737A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L2021/02082 Noise filtering, the noise being echo or reverberation of the speech

Abstract

The invention provides a voice signal processing device, a voice signal processing method, electronic equipment and a sound amplification system, and relates to the technical field of voice processing. The device comprises: a first echo cancellation unit, configured to input a reference signal and the current voice signal acquired by a voice acquisition device into an adaptive filter to obtain a target residual signal; and a second echo cancellation unit, configured to input the target residual signal and the far-end voice signal into a preset voice processing model to obtain the near-end voice processing signal of the current frame. The target residual signal is de-echoed by the voice processing model; because the voice processing model is trained on residual signal samples and far-end voice signal samples, and both the residual signal samples and the target residual signal contain signals, such as nonlinear echo, that the adaptive filter cannot completely eliminate, the voice processing model can cancel the echo of the nonlinear components in the target residual signal, thereby improving the accuracy of voice signal processing.

Description

Voice signal processing device, method, electronic equipment and sound amplification system
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a speech signal processing apparatus and method, an electronic device, and a sound amplification system.
Background
Echo arises when the far-end voice signal played by the near-end loudspeaker is reflected by objects such as walls and picked up by the near-end microphone; the echo signal is mixed with the near-end voice and transmitted back to the far end, so that the far-end user hears his or her own voice through the far-end loudspeaker. In order to improve communication quality, echo cancellation techniques are widely applied in scenarios such as telephone calls and video conferences, and the most common echo cancellation method is echo cancellation by an adaptive filter.
In the related art, it is assumed that the near-end voice signal, taken as the input signal, is uncorrelated with the far-end voice signal, taken as the reference signal. The coefficients of the adaptive filter are iteratively updated with the goal of minimizing the correlation between the output signal of the adaptive filter and the far-end voice signal already played by the loudspeaker; the transmission path from the near-end loudspeaker to the near-end microphone is modeled based on the updated filter coefficients, the echo signal is estimated from the far-end voice signal, and the estimated echo signal is subtracted from the voice signal acquired by the near-end microphone, so that a voice signal with the echo removed is output.
However, in the above related art, the adaptive filter performs a linear operation and cannot remove the nonlinear echo components introduced by the loudspeaker, the acoustic channel, the microphone, and the like, which reduces the accuracy of speech signal processing.
Disclosure of Invention
Aiming at the problems in the prior art, the embodiment of the invention provides a voice signal processing device, a voice signal processing method, electronic equipment and a sound amplifying system.
The present invention provides a speech signal processing apparatus, comprising:
the first echo eliminating unit is used for inputting a reference signal and a current voice signal acquired by a voice acquisition device into an adaptive filter, and processing the current voice signal through the adaptive filter based on the reference signal to obtain a target residual signal; the reference signal comprises a received far-end voice signal;
the second echo cancellation unit is configured to input the target residual signal and the far-end speech signal into a preset speech processing model, so as to obtain a current frame near-end speech processing signal output by the speech processing model and used for transmission;
the voice processing model is used for performing echo removing processing on the target residual signal; the speech processing model is trained based on the far-end speech signal samples and the residual signal samples.
Further, the reference signal also includes a last frame of near-end speech processing signal output by the speech processing model.
Further, the voice processing model comprises a first feature extraction network, a second feature extraction network, a third feature extraction network and a multilayer artificial neural network;
the second echo cancellation unit is specifically configured to:
inputting a target residual signal into a first feature extraction network, and mapping the target residual signal from a time domain to a transform domain through the first feature extraction network to obtain a first feature;
inputting the far-end voice signal into the second feature extraction network, and mapping the far-end voice signal from a time domain to a transform domain through the second feature extraction network to obtain a second feature;
inputting the first feature and the second feature into the multilayer artificial neural network, and extracting a mask of a near-end speech signal of the current frame from the first feature based on the first feature and the second feature through the multilayer artificial neural network;
determining a third feature of the current frame near-end speech signal in a transform domain based on the mask and the first feature;
inputting the third feature into the third feature extraction network, and mapping the third feature from a transform domain to a time domain through the third feature extraction network to obtain the current frame near-end speech processing signal.
Further, the speech processing model is obtained based on the following mode:
inputting the residual signal sample and the far-end voice signal sample into an initial network model to obtain a near-end voice processing sample output by the initial network model;
determining a loss function based on the near-end speech processing samples and a desired signal; the desired signal comprises near-end residual signal samples;
and optimizing the model parameters of the initial network model based on the loss function until a convergence condition is reached to obtain the voice processing model.
Further, the apparatus further comprises:
the system comprises a sample acquisition unit, a voice acquisition unit and a control unit, wherein the sample acquisition unit is used for acquiring a near-end noisy signal sample and an impulse response sample from a voice playing device to a voice acquisition device;
the sample delay unit is used for delaying the near-end noisy signal sample for a preset time to obtain a target near-end noisy signal sample;
an echo sample determination unit, configured to determine a near-end speech echo sample based on the target near-end noisy signal sample and the impulse response sample;
an input signal sample determination unit for determining an input signal sample based on the near-end speech echo sample and the near-end noisy signal sample;
a near-end residual signal sample determining unit, configured to input the input signal sample and the target near-end noisy signal sample into the adaptive filter, and process the input signal sample based on the target near-end noisy signal sample through the adaptive filter to obtain a near-end residual signal sample;
a residual signal sample determination unit for determining the residual signal samples based on the near-end residual signal samples.
Further, the residual signal sample determination unit is specifically configured to:
determining a far-end voice echo sample based on the far-end voice signal sample and the impulse response sample;
inputting the far-end voice echo sample and the far-end voice signal sample into the adaptive filter, and processing the far-end voice echo sample based on the far-end voice signal sample through the adaptive filter to obtain a far-end residual signal sample;
determining the residual signal samples based on the near-end residual signal samples and the far-end residual signal samples.
Further, the echo sample determination unit is specifically configured to:
determining a reference near-end voice echo sample based on the target near-end noisy signal sample and the impulse response sample, delaying the reference near-end voice echo sample for the preset time to obtain a delayed near-end voice echo sample, taking the delayed near-end voice echo sample as a new target near-end noisy signal sample, and repeatedly executing the steps until the delay times reach the preset times;
the near-end speech echo samples are determined based on the reference near-end speech echo samples obtained each time.
Further, the first echo cancellation unit is specifically configured to:
processing the current voice signal based on the reference signal through the adaptive filter to obtain an output signal;
updating the current impulse response of the self-adaptive filter by taking the minimum correlation between the output signal and the reference signal as a target to obtain a target impulse response;
determining a target echo signal based on the target impulse response and the reference signal;
and processing the current voice signal based on the target echo signal to obtain the target residual signal.
The invention provides a voice signal processing method, which comprises the following steps:
inputting a reference signal and a current voice signal acquired by a voice acquisition device into an adaptive filter, and processing the current voice signal through the adaptive filter based on the reference signal to obtain a target residual signal; the reference signal comprises a received far-end voice signal;
inputting the target residual signal and the far-end voice signal into a preset voice processing model to obtain a current frame near-end voice processing signal which is output by the voice processing model and used for transmission;
the voice processing model is used for performing echo removing processing on the target residual signal; the speech processing model is trained based on the far-end speech signal samples and the residual signal samples.
The present invention also provides a sound amplification system comprising:
the voice acquisition device is used for acquiring a current voice signal and inputting the current voice signal to the voice signal processing device;
the speech signal processing apparatus may employ any of the speech signal processing apparatuses described above.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the voice signal processing method.
The present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a speech signal processing method as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a speech signal processing method as described in any one of the above.
According to the voice signal processing device, the voice signal processing method, the electronic equipment and the sound amplification system provided by the invention, the far-end voice signal and the target residual signal output by the adaptive filter are input into a pre-trained voice processing model to obtain the current frame near-end voice processing signal output by the voice processing model and used for transmission. The target residual signal output by the adaptive filter is thus further de-echoed by the voice processing model; because the voice processing model is trained on residual signal samples and far-end voice signal samples, and both the residual signal samples and the target residual signal contain signals, such as nonlinear echo, that the adaptive filter cannot completely eliminate, the voice processing model can cancel the echo of the nonlinear components in the target residual signal, thereby improving the accuracy of voice signal processing.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a speech signal processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a scenario of voice interaction provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of an adaptive filter provided by an embodiment of the invention;
FIG. 4 is a schematic structural diagram of a speech processing model provided by an embodiment of the present invention;
FIG. 5 is a second flowchart illustrating a speech signal processing method according to an embodiment of the present invention;
FIG. 6 is a third schematic flowchart of a speech signal processing method according to an embodiment of the present invention;
FIG. 7 is a fourth flowchart illustrating a speech signal processing method according to an embodiment of the present invention;
FIG. 8 is a fifth flowchart illustrating a voice signal processing method according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a speech signal processing apparatus according to an embodiment of the present invention;
fig. 10 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The speech signal processing method of the present invention is described below in conjunction with fig. 1-8.
Fig. 1 is a schematic flow chart of a speech signal processing method according to an embodiment of the present invention, as shown in fig. 1, the speech signal processing method includes the following steps:
step 101, inputting a reference signal and a current voice signal acquired by a voice acquisition device into an adaptive filter, and processing the current voice signal through the adaptive filter based on the reference signal to obtain a target residual signal.
The voice acquisition device comprises a microphone. The reference signal comprises the received far-end voice signal. The current voice signal comprises a near-end voice signal (the voice spoken by the near-end user), environmental noise and a target signal, and the target signal may comprise a far-end voice echo signal and a near-end voice echo signal. The far-end voice echo signal is the far-end voice signal that is played by the voice playing device and then acquired by the voice acquisition device; the near-end voice echo signal is the near-end voice signal that is played by the voice playing device and then acquired by the voice acquisition device. The voice playing device comprises a loudspeaker.
Exemplarily, fig. 2 is a schematic view of a voice interaction scenario provided by an embodiment of the present invention. As shown in fig. 2, user A and user B hold a teleconference. The voice of user A is collected by the voice acquisition device A1 of user A, transmitted by the electronic device A2 of user A through the network to the electronic device B1 of user B, and played by the electronic device B1 through the voice playing device B2 of user B. If user B is speaking at the same time, the voice acquisition device B3 on the user B side picks up not only the speaking voice of user B but also the speaking voice of user A being played by the voice playing device B2. If the voice signal collected by the voice acquisition device B3 is transmitted to the voice playing device A3 on the user A side without echo cancellation, user A hears through the voice playing device A3 both the speaking voice of user B and his or her own speaking voice, which is the echo phenomenon. The far-end voice signal in step 101 can be understood as the speaking voice of user A that is transmitted from the far end and received on the user B side.
Specifically, the current speech signal is processed by the adaptive filter based on the reference signal to obtain an output signal; updating the current impulse response of the self-adaptive filter by taking the minimum correlation between the output signal and the reference signal as a target to obtain a target impulse response; determining a target echo signal based on the target impulse response and the reference signal; and processing the current voice signal based on the target echo signal to obtain the target residual signal.
Exemplarily, when the reference signal and the current voice signal acquired by the voice acquisition device are obtained, the current voice signal is taken as the input signal, and the input signal and the reference signal are both input into the adaptive filter; the adaptive filter filters the far-end voice echo signal and the near-end voice echo signal in the current voice signal based on the reference signal to obtain an output signal. The correlation between the output signal and the reference signal is then calculated by an adaptive algorithm, the current impulse response of the adaptive filter is updated with the goal of minimizing this correlation, and the transmission path from the voice playing device to the voice acquisition device is thereby modeled to obtain a target impulse response. The target impulse response is convolved with the reference signal to obtain a target echo signal, and the target echo signal is subtracted from the current voice signal to obtain a target residual signal, where the target residual signal comprises the current frame near-end voice signal together with signals other than the current frame near-end voice signal, such as a nonlinear echo signal.
It should be noted that, the adaptive filter may adopt any suitable structure, for example, a single-filter adaptive filter or a dual-filter adaptive filter, and the adaptive filter may also be a time-domain filter or a frequency-domain filter, which is not limited in the present invention.
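As an illustration of step 101, the following is a minimal sketch of a time-domain adaptive echo canceller of the kind described above, using a normalized LMS (NLMS) update; the class name, tap count and step size are illustrative assumptions and are not taken from the embodiment.

```python
import numpy as np

class NLMSEchoCanceller:
    """Time-domain NLMS adaptive filter: it estimates the impulse response of the
    echo path from the reference signal to the microphone and subtracts the
    estimated echo from the microphone signal."""

    def __init__(self, taps=512, step=0.5, eps=1e-8):
        self.w = np.zeros(taps)    # current impulse-response estimate
        self.buf = np.zeros(taps)  # most recent reference samples (delay line)
        self.step = step
        self.eps = eps

    def process(self, mic, ref):
        """mic, ref: 1-D arrays of equal length; returns the residual signal e(n)."""
        e = np.zeros(len(mic))
        for n in range(len(mic)):
            # push the newest reference sample into the delay line
            self.buf = np.roll(self.buf, 1)
            self.buf[0] = ref[n]
            # estimated echo: impulse-response estimate applied to the reference
            y_hat = float(self.w @ self.buf)
            # residual: microphone signal minus the estimated echo
            e[n] = mic[n] - y_hat
            # NLMS update, driven by reducing the correlation between the
            # residual and the reference
            norm = float(self.buf @ self.buf) + self.eps
            self.w = self.w + self.step * e[n] * self.buf / norm
        return e
```

In this sketch the estimate w plays the role of the current impulse response that step 101 updates, and e(n) corresponds to the target residual signal that is passed on to the speech processing model.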
Step 102, inputting the target residual signal and the far-end voice signal into a preset voice processing model to obtain a current frame near-end voice processing signal which is output by the voice processing model and used for transmission.
The voice processing model is used for performing echo removing processing on a target residual signal; the speech processing model is trained based on the far-end speech signal samples and the residual signal samples.
For example, when the target residual signal is obtained, the target residual signal is input into the speech processing model as an input signal of the speech processing model, the far-end speech signal is input into the speech processing model as a reference signal of the speech processing model, the speech processing model further processes other signals in the target residual signal except for the current frame near-end speech signal based on the far-end speech signal, and finally outputs the current frame near-end speech processed signal, which is the speech signal for transmission to the opposite end.
According to the speech signal processing method provided by the invention, the far-end speech signal and the target residual signal output by the adaptive filter are input into a pre-trained speech processing model to obtain the current frame near-end speech processing signal output by the speech processing model and used for transmission. The target residual signal output by the adaptive filter is thus further de-echoed by the speech processing model; because the speech processing model is trained on residual signal samples and far-end speech signal samples, and both the residual signal samples and the target residual signal contain signals, such as nonlinear echo, that the adaptive filter cannot completely eliminate, the speech processing model can cancel the echo of the nonlinear components in the target residual signal, which improves the accuracy of speech signal processing. In addition, the reference signals of the adaptive filter and of the speech processing model do not need to pass through a hardware analog-to-digital converter and are therefore soft references, which reduces hardware cost and the complexity of circuit board wiring and improves the flexibility of electronic equipment deployment.
Optionally, the reference signal further includes a last frame of near-end speech processing signal output by the speech processing model.
For example, as shown in the speech interaction scenario shown in fig. 2, the previous frame of near-end speech processing signal may be understood as a speech signal obtained by processing, by the speech processing model, the previous frame of speech signal acquired by the speech acquisition device B3 on the user B side, and then transmitting the processed speech signal to the user a side.
For example, fig. 3 is a schematic diagram of an adaptive filter according to an embodiment of the present invention. As shown in fig. 3, when the far-end speech signal f(n) and the previous frame near-end speech processing signal u(n-1) are obtained, they may be superimposed to form the reference signal r1(n); the reference signal r1(n) and the current speech signal y(n) collected by the speech acquisition device are then input into the adaptive filter, where y(n) = x1(n) + s(n) + v(n) + x2(n), x1(n) denotes the far-end speech echo signal, s(n) denotes the near-end speech signal, v(n) denotes the environmental noise signal, and x2(n) denotes the near-end speech echo signal. The current speech signal y(n) is processed by the adaptive filter based on the reference signal r1(n) to obtain the target residual signal e(n), and the target residual signal e(n) and the far-end speech signal f(n) are input into the speech processing model, which outputs the current frame near-end speech processing signal u(n). Without suppression, the near-end speech signal s(n) would be amplified and played by the local speech playing device, collected again by the speech acquisition device, amplified and played again, and so on; this loop can produce howling. The purpose of adding the previous frame near-end speech processing signal u(n-1) to the reference signal r1(n) is to break this loop and thereby achieve howling suppression.
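The per-frame flow of fig. 3 could be sketched as follows; aec stands for an adaptive filter such as the NLMS sketch given earlier, and model.infer is a hypothetical placeholder for the speech processing model of step 102.

```python
def process_frame(aec, model, mic_frame, farend_frame, prev_out):
    """One frame of the two-stage pipeline of fig. 3 (illustrative).

    mic_frame    -- y(n): current frame picked up by the microphone (1-D array)
    farend_frame -- f(n): far-end speech frame received over the network
    prev_out     -- u(n-1): previous frame near-end speech processing signal
    """
    # soft reference: far-end speech plus the previous frame of processed near-end speech
    ref_frame = farend_frame + prev_out              # r1(n) = f(n) + u(n-1)
    # stage 1: linear echo processing and howling suppression in the adaptive filter
    residual = aec.process(mic_frame, ref_frame)     # e(n)
    # stage 2: the speech processing model removes the remaining (nonlinear) echo
    out_frame = model.infer(residual, farend_frame)  # u(n)
    return out_frame
```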
In addition, when the reference signal comprises the previous frame near-end speech processing signal and the far-end speech signal, and the input signal comprises the current speech signal, the adaptive filter weights its impulse-response update as follows: when the far-end speech signal, which satisfies the assumption that the input signal is uncorrelated with the reference signal, accounts for a larger share of the reference-signal power, a higher weight is given, which guarantees the modeling accuracy; when the howling-suppression component, which is strongly correlated with the input signal, accounts for a larger share of the power, a lower weight is given, which ensures that the adaptive filter can still converge normally, relying on the near-end speech signal, when there is no far-end speech signal, and that its working state remains stable.
In the speech signal processing method provided in the embodiment of the present invention, the previous frame of near-end speech processing signal is added to the reference signal, and the current speech signal is processed through the adaptive filter based on the reference signal including the previous frame of near-end speech processing signal and the far-end speech signal, so that the processing of far-end echo and the suppression of howling are realized. The echo processing and the howling suppression of the invention are realized by modeling the same transmission path from the voice playing device to the voice collecting device and realizing the echo processing and the howling suppression in the same adaptive filter.
Optionally, fig. 4 is a schematic structural diagram of a speech processing model provided in an embodiment of the present invention. As shown in fig. 4, the speech processing model includes a first feature extraction network 401, a second feature extraction network 402, a third feature extraction network 403, and a multilayer artificial neural network 404. The input ends of the first feature extraction network 401 and the second feature extraction network 402 serve as the input ends of the speech processing model; the output ends of the first feature extraction network 401 and the second feature extraction network 402 are connected with the input end of the multilayer artificial neural network 404; the output end of the multilayer artificial neural network 404 is connected with the input end of the third feature extraction network 403; and the output end of the third feature extraction network 403 serves as the output end of the speech processing model. The first feature extraction network 401, the second feature extraction network 402, and the third feature extraction network 403 may each be composed of one fully connected layer, one one-dimensional convolutional layer, or multiple fully connected layers.
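A minimal PyTorch sketch of the structure in fig. 4 might look as follows; the fully connected encoders, the GRU depth and the feature sizes are assumptions, and the 256-sample frame length follows the segmentation example given later.

```python
import torch
import torch.nn as nn

class SpeechProcessingModel(nn.Module):
    """Sketch of the structure in fig. 4: two time-domain encoders, a recurrent
    mask estimator, and a decoder back to the time domain (sizes assumed)."""

    def __init__(self, frame_len=256, feat_dim=256, hidden=256, layers=2):
        super().__init__()
        self.enc_residual = nn.Linear(frame_len, feat_dim)  # first feature extraction network
        self.enc_farend = nn.Linear(frame_len, feat_dim)    # second feature extraction network
        self.rnn = nn.GRU(2 * feat_dim, hidden, num_layers=layers, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, feat_dim), nn.Sigmoid())
        self.dec = nn.Linear(feat_dim, frame_len)           # third feature extraction network

    def forward(self, residual_frames, farend_frames):
        # both inputs: (batch, n_frames, frame_len) time-domain segments
        f1 = self.enc_residual(residual_frames)  # first feature (transform domain)
        f2 = self.enc_farend(farend_frames)      # second feature (transform domain)
        h, _ = self.rnn(torch.cat([f1, f2], dim=-1))
        m = self.mask(h)                         # mask of the current frame near-end speech
        f3 = m * f1                              # third feature
        return self.dec(f3)                      # near-end speech frames back in the time domain
```

Here enc_residual, enc_farend and dec stand in for the first, second and third feature extraction networks, and the GRU stack with the sigmoid output corresponds to the multilayer artificial neural network that estimates the mask.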
Fig. 5 is a second flowchart of the speech signal processing method according to the embodiment of the present invention, and as shown in fig. 5, the step 102 can be specifically implemented by the following steps:
step 1021, inputting the target residual signal into the first feature extraction network, and mapping the target residual signal from a time domain to a transform domain through the first feature extraction network to obtain a first feature.
The first feature extraction network may be a fully connected layer or a one-dimensional convolutional layer.
Exemplarily, when the adaptive filter is a time-domain filter, the target residual signal is a time-domain signal; when the adaptive filter is a frequency domain filter, the target residual signal is a frequency domain signal, the frequency domain target residual signal needs to be converted into a time domain target residual signal, the time domain target residual signal is input into a first feature extraction network, and the first feature extraction network maps the target residual signal from the time domain to a transform domain learned by the speech processing model to obtain a first feature corresponding to the target residual signal.
It should be noted that, when the target residual signal in the time domain is obtained, the target residual signal in the time domain may be segmented according to a preset length and a preset overlap degree to obtain a plurality of residual signal segments in the time domain corresponding to the target residual signal, and one residual signal segment is input to the first feature extraction network each time; the preset length and the preset overlap may be set based on actual requirements, for example, the preset length is 256 sampling points, and the preset overlap is 50%, which is not limited in the present invention.
Step 1022, inputting the far-end voice signal to the second feature extraction network, and mapping the far-end voice signal from a time domain to a transform domain through the second feature extraction network to obtain a second feature.
The second feature extraction network may be a fully connected layer or a one-dimensional convolutional layer.
Exemplarily, when the adaptive filter is a time-domain filter, the far-end speech signal is a time-domain signal; when the adaptive filter is a frequency domain filter, the far-end speech signal is a frequency domain signal, the far-end speech signal in the frequency domain needs to be converted into a far-end speech signal in the time domain, the far-end speech signal is input into a second feature extraction network, and the far-end speech signal is mapped from the time domain to a transform domain learned by the speech processing model by the second feature extraction network, so that a second feature corresponding to the far-end speech signal is obtained.
It should be noted that, when obtaining the far-end speech signal in the time domain, the far-end speech signal in the time domain may also be segmented according to the preset length and the preset overlap degree to obtain a plurality of far-end speech signal segments in the time domain corresponding to the far-end speech signal, and one far-end speech signal segment is input to the second feature extraction network each time.
Step 1023, inputting the first feature and the second feature into the multilayer artificial neural network, and extracting a mask of the near-end speech signal of the current frame from the first feature through the multilayer artificial neural network based on the first feature and the second feature.
The multilayer artificial neural network may be a recurrent neural network based on Long Short-Term Memory (LSTM) units or Gated Recurrent Units (GRU), or a multilayer convolutional neural network.
Illustratively, when the first feature and the second feature are obtained, they are input into the multilayer artificial neural network as input features, and the multilayer artificial neural network extracts the mask of the current frame near-end speech signal from the first feature based on the first feature and the second feature. The mask is a set of coefficients; each coefficient is the weight by which the corresponding transform-domain point of the current input frame is multiplied to obtain the near-end speech at that point.
It should be noted that, when the first feature and the second feature are obtained, normalization processing may be performed on the first feature, normalization processing is also performed on the second feature, and the normalized first feature and the normalized second feature are input into the multilayer artificial neural network, so that accuracy of extracting a mask of a near-end speech signal of a current frame by the multilayer artificial neural network can be further improved, and accuracy of the speech processing model is further improved.
Step 1024, determining a third feature of the current frame near-end speech signal in a transform domain based on the mask and the first feature.
Illustratively, when obtaining the mask of the current frame near-end speech signal, multiplying the mask and the first feature to obtain a third feature of the current frame near-end speech signal in the transform domain.
Step 1025, inputting the third feature into the third feature extraction network, and mapping the third feature from a transform domain to a time domain through the third feature extraction network to obtain the current frame near-end speech processing signal.
Wherein, the third feature extraction network may be a fully connected layer or a one-dimensional convolutional layer.
Illustratively, when the third feature of the transform domain is obtained, the third feature is input to a third feature extraction network, and the third feature is mapped from the transform domain to the time domain by the third feature extraction network, so as to obtain the current frame near-end speech processing signal.
It should be noted that, each time a residual signal segment is input in the first feature input layer, each time a far-end speech signal segment is input in the second feature input layer, the obtained near-end speech processing signal of the current frame is also a near-end speech processing signal segment, and when all near-end speech processing signal segments are obtained, all near-end speech processing signal segments are combined based on the time sequence and the preset overlap degree to obtain the near-end speech processing signal of the current frame.
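The segmentation into 256-sample segments with 50% overlap mentioned in step 1021 and the recombination of the processed segments described above could be sketched as follows; the synthesis window is an assumption.

```python
import numpy as np

def segment(signal, frame_len=256, overlap=0.5):
    """Split a 1-D signal into overlapping frames (e.g. 256 samples, 50% overlap)."""
    hop = int(frame_len * (1 - overlap))
    n_frames = 1 + (len(signal) - frame_len) // hop  # assumes len(signal) >= frame_len
    return np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])

def overlap_add(frames, overlap=0.5):
    """Recombine processed frames into a signal by windowed overlap-add."""
    frame_len = frames.shape[1]
    hop = int(frame_len * (1 - overlap))
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    win = np.hanning(frame_len)  # synthesis window is an assumption
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame * win
    return out
```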
According to the voice signal processing method provided by the embodiment of the invention, feature extraction is performed on the target residual signal by the first feature extraction network to obtain the first feature, and feature extraction is performed on the far-end voice signal by the second feature extraction network to obtain the second feature; the final current frame near-end voice processing signal is then obtained by the multilayer artificial neural network and the third feature extraction network. In this way, the signals in the target residual signal other than the current frame near-end voice signal are eliminated, and the nonlinear components of the echo are handled by the nonlinear processing capability of the voice processing model. In addition, because the voice processing model has memory capability, it can retain the reference signals of a preceding period of time and process signals with severe reverberation based on multiple reference signals, which further improves the accuracy of voice signal processing.
Optionally, fig. 6 is a third schematic flowchart of the speech signal processing method provided in the embodiment of the present invention, and as shown in fig. 6, the training steps of the speech processing model are as follows:
step 601, inputting the residual signal sample and the far-end voice signal sample into an initial network model to obtain a near-end voice processing sample output by the initial network model.
Illustratively, before training a speech processing model, a plurality of far-end speech signal samples are firstly acquired to form a far-end speech signal data set, a residual signal sample corresponding to each far-end speech signal sample is acquired to form a residual signal sample data set, an initial network model is constructed, one far-end speech signal sample is randomly selected from the far-end speech signal data set to serve as a reference sample to be input into the initial network model, a residual signal sample corresponding to the far-end speech signal sample is selected from the residual signal sample data set to serve as an input sample to be input into the initial network model, and signals except for near-end speech processing samples in the residual signal sample are processed by the initial network model based on the far-end speech signal samples to obtain near-end speech processing samples output by the initial network model.
Step 602, determining a loss function based on the near-end speech processing samples and the desired signal.
Wherein the desired signal comprises near-end residual signal samples.
Specifically, a first loss sub-function and a second loss sub-function may be constructed based on the near-end speech processing samples and the desired signal, and a weighted sum of the first loss sub-function and the second loss sub-function is used as the loss function; the first loss sub-function mainly ensures the accuracy of model convergence, and the second loss sub-function mainly ensures the stability of model convergence.
Wherein the first loss sub-function is expressed by the following formula (1):

Loss_SNR = -10 * log10( Σ_n gt(n)² / Σ_n (u(n) - gt(n))² )        (1)

where Loss_SNR represents the first loss sub-function, u(n) represents the near-end speech processing samples, and gt(n) represents the desired signal.
The second loss sub-function is expressed by the following formula (2):

Loss_SmoothL1 = 0.5 * x²        if |x| < 1
Loss_SmoothL1 = |x| - 0.5       otherwise        (2)

where Loss_SmoothL1 represents the second loss sub-function and x = u(n) - gt(n).
It should be noted that the loss function may also be a loss function related to Perceptual Evaluation of Speech Quality (PESQ) or Short-Time Objective Intelligibility (STOI) for improving the listening experience, which is not limited by the present invention.
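Under the assumption that formula (1) is a negative-SNR term and formula (2) is the standard Smooth-L1 term, the weighted-sum loss could be sketched as follows; the weights w_snr and w_smooth are illustrative.

```python
import torch
import torch.nn.functional as F

def combined_loss(u, gt, w_snr=1.0, w_smooth=1.0, eps=1e-8):
    """Weighted sum of a negative-SNR term (convergence accuracy) and a
    Smooth-L1 term (convergence stability); weights are assumed."""
    err = u - gt
    snr = 10.0 * torch.log10(gt.pow(2).sum(dim=-1) / (err.pow(2).sum(dim=-1) + eps))
    loss_snr = -snr.mean()                 # maximising the SNR between gt(n) and u(n)
    loss_smooth = F.smooth_l1_loss(u, gt)  # 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise
    return w_snr * loss_snr + w_smooth * loss_smooth
```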
Step 603, optimizing the model parameters of the initial network model based on the loss function until a convergence condition is reached to obtain the voice processing model.
Illustratively, when the loss function is obtained, the model parameters of the initial network model are optimized based on the loss function, and the iteration is repeated until the number of iterations reaches a preset number, at which point the convergence condition is considered reached and the speech processing model is obtained.
According to the voice signal processing method provided by the embodiment of the invention, the initial network model is trained based on the residual signal sample and the far-end voice signal sample, the model parameters of the initial network model are optimized based on the loss function, and the trained voice processing model is finally obtained, so that the target residual signal can be further processed based on the voice processing model in the later period, and the accuracy of voice signal processing is improved.
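A compact training loop in the spirit of steps 601 to 603 is sketched below; it reuses combined_loss from the previous sketch, and the optimizer, learning rate and epoch count are assumptions (a fixed number of passes over the data serves as the convergence condition).

```python
import torch

def train(model, loader, epochs=50, lr=1e-3):
    """Training procedure in the spirit of steps 601-603 (hyperparameters assumed)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for residual_frames, farend_frames, target_frames in loader:
            u = model(residual_frames, farend_frames)  # near-end speech processing samples
            loss = combined_loss(u, target_frames)     # desired signal: near-end residual samples
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```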
Optionally, fig. 7 is a fourth schematic flowchart of the speech signal processing method according to an embodiment of the present invention. As shown in fig. 7, before step 601, the speech signal processing method further includes the following steps:
and step 604, acquiring a near-end noise-containing signal sample and an impulse response sample from the voice playing device to the voice acquisition device.
The near-end noisy signal sample is obtained by superposing a near-end voice signal sample and an environmental noise sample, namely, the near-end voice signal sample and the environmental noise sample in an actual application scene are collected, the near-end voice signal sample and the environmental noise sample are superposed to be used as the near-end noisy signal sample, and a plurality of near-end noisy signal samples form a near-end noisy signal data set; in addition, the real impulse response from the voice playing device to the voice collecting device in the actual application scene is collected, the real impulse response is analyzed and synthesized to obtain impulse response samples, and the impulse response data set is formed by a plurality of impulse response samples.
Step 605, delaying the near-end noisy signal sample by a preset time to obtain a target near-end noisy signal sample.
The preset time is chosen to be slightly longer than the fluctuation range of the algorithm latency, where the algorithm latency is the time from the moment the voice acquisition device captures the current voice signal to the moment the voice processing model outputs the near-end voice processing signal. Choosing the preset time in this way makes the delay seen by the trained voice processing model closer to the actual processing delay of the algorithm.
Illustratively, a random near-end noisy signal sample in the near-end noisy signal data set is delayed for a preset time to obtain a near-end noisy signal sample corresponding to the time delayed for the preset time, and the near-end noisy signal sample corresponding to the time delayed for the preset time is determined as a target near-end noisy signal sample.
Step 606, determining a near-end speech echo sample based on the target near-end noisy signal sample and the impulse response sample.
Illustratively, the target near-end noisy signal sample is convolved with a random impulse response sample from the impulse response data set to obtain a locally amplified near-end voice echo sample. Here, local amplification means that the near-end voice signal sample acquired by the voice acquisition device is played through the local voice playing device, and the near-end voice echo sample is what the voice acquisition device captures after the near-end voice signal sample has been played through the local voice playing device.
Step 607, determining an input signal sample based on the near-end speech echo sample and the near-end noisy signal sample.
Illustratively, the near-end speech echo sample and the near-end noisy signal sample are superimposed to obtain an input signal sample.
Step 608, inputting the input signal sample and the target near-end noisy signal sample into the adaptive filter, and processing the input signal sample based on the target near-end noisy signal sample through the adaptive filter to obtain a near-end residual signal sample.
Illustratively, the target near-end noisy signal sample is used as a reference sample and is input into the adaptive filter together with the input signal sample, and the near-end speech echo sample in the input signal sample is processed by the adaptive filter based on the target near-end noisy signal sample to obtain a near-end residual signal sample.
Step 609 determines the residual signal samples based on the near-end residual signal samples.
Illustratively, when the near-end residual signal samples are obtained, the near-end residual signal samples are taken as residual signal samples input to the initial network model.
According to the voice signal processing method provided by the embodiment of the invention, the near-end voice echo sample in the input signal sample is processed through the self-adaptive filter to obtain the near-end residual signal sample, so that the residual signal sample input to the initial network model is more accurate.
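The sample-generation flow of steps 604 to 609 could be sketched as follows; np.convolve and the NLMS canceller from the earlier sketch stand in for the convolution with the impulse response sample and for the adaptive filter, and all names are illustrative.

```python
import numpy as np

def make_near_end_residual(near_noisy, impulse_response, delay_samples, aec):
    """Synthesise a near-end residual signal sample (steps 604-609, illustrative).

    near_noisy       -- near-end speech sample superimposed with an environmental noise sample
    impulse_response -- impulse response sample from the playing device to the acquisition device
    delay_samples    -- preset time expressed in samples
    aec              -- adaptive filter, e.g. the NLMSEchoCanceller sketched earlier
    """
    # step 605: delay the near-end noisy sample by the preset time
    target_near_noisy = np.concatenate([np.zeros(delay_samples), near_noisy])[:len(near_noisy)]
    # step 606: near-end speech echo sample = delayed sample convolved with the impulse response
    near_echo = np.convolve(target_near_noisy, impulse_response)[:len(near_noisy)]
    # step 607: input signal sample = echo sample superimposed on the near-end noisy sample
    input_signal = near_echo + near_noisy
    # step 608: the adaptive filter processes the input with the delayed sample as reference
    near_residual = aec.process(input_signal, target_near_noisy)
    # step 609: the near-end residual is used as (part of) the residual signal sample
    return near_residual
```

The far-end residual signal sample of steps 6091 to 6093 below could be produced in the same way from a far-end speech signal sample and then superimposed on the near-end residual.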
Optionally, fig. 8 is a fifth schematic flowchart of a speech signal processing method according to an embodiment of the present invention, and as shown in fig. 8, the step 609 may be implemented specifically by:
step 6091, determine a far-end speech echo sample based on the far-end speech signal sample and the impulse response sample.
The far-end voice echo sample is collected by the voice collecting device after the far-end voice signal sample is played by the local voice playing device.
Illustratively, the far-end speech signal samples and the impulse response samples are convolved to obtain far-end speech echo samples.
Step 6092, inputting the far-end voice echo sample and the far-end voice signal sample into the adaptive filter, and processing the far-end voice echo sample based on the far-end voice signal sample through the adaptive filter to obtain a far-end residual signal sample.
Exemplarily, the far-end speech echo sample is input into the adaptive filter as the input signal while the far-end speech signal sample is input into the adaptive filter as the reference signal, and the far-end speech echo sample is processed by the adaptive filter based on the far-end speech signal sample to obtain a far-end residual signal sample.
Step 6093, determine the residual signal samples based on the near-end residual signal samples and the far-end residual signal samples.
Illustratively, the near-end residual signal samples and the far-end residual signal samples are superimposed as residual signal samples input to the initial network model.
According to the voice signal processing method provided by the embodiment of the invention, the adaptive filter is used for processing the near-end voice echo sample in the input signal sample to obtain the near-end residual signal sample, the adaptive filter is used for processing the far-end voice echo sample to obtain the far-end residual signal sample, and the near-end residual signal sample and the far-end residual signal sample are superposed to be used as the residual signal sample, so that the accuracy of the residual signal sample input to the initial network model is further improved.
Optionally, the step 606 may be specifically implemented by the following steps:
determining a reference near-end voice echo sample based on the target near-end noisy signal sample and the impulse response sample, delaying the reference near-end voice echo sample for the preset time to obtain a delayed near-end voice echo sample, taking the delayed near-end voice echo sample as a new target near-end noisy signal sample, and repeatedly executing the steps until the delay times reach the preset times;
the near-end speech echo samples are determined based on the reference near-end speech echo samples obtained each time.
The preset number may be selected based on actual requirements, for example, the preset number is 3.
Illustratively, to avoid the situation in which the signals other than the near-end speech signal are incompletely suppressed and are then captured again in a loop, the reference near-end speech echo sample is iterated a preset number of times. Assuming the preset number is 3 and the reference near-end speech echo sample is denoted near_noise_echo0(n): near_noise_echo0(n) is delayed by the preset time and convolved with the impulse response sample to obtain near_noise_echo1(n); near_noise_echo1(n) is delayed by the preset time and convolved with the impulse response sample to obtain near_noise_echo2(n); and near_noise_echo2(n) is delayed by the preset time and convolved with the impulse response sample to obtain near_noise_echo3(n). When the number of delays reaches the preset number, the near-end speech echo sample is determined as near_noise_echo0(n) + near_noise_echo1(n) + near_noise_echo2(n) + near_noise_echo3(n).
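The iteration described above could be sketched as follows; the interpretation that the reference samples from all passes are summed is drawn from the example above and is an assumption.

```python
import numpy as np

def near_end_speech_echo(target_near_noisy, impulse_response, delay_samples, repeats=3):
    """Iteratively build the near-end speech echo sample (illustrative)."""
    def delay(x, d):
        return np.concatenate([np.zeros(d), x])[:len(x)]

    # reference near-end speech echo sample near_noise_echo0(n)
    echo = np.convolve(target_near_noisy, impulse_response)[:len(target_near_noisy)]
    total = echo.copy()
    for _ in range(repeats):
        # delay by the preset time and convolve with the impulse response again
        echo = np.convolve(delay(echo, delay_samples), impulse_response)[:len(echo)]
        total += echo
    return total
```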
The voice signal processing method provided by the embodiment of the invention can be used for executing the reference near-end voice echo sample for the preset times in an iterative manner so as to avoid the situation that signals except the near-end voice signal are incompletely inhibited and are circularly acquired by the voice acquisition device, thereby improving the accuracy of the near-end voice echo sample.
The following describes the speech signal processing apparatus provided by the present invention, and the speech signal processing apparatus described below and the speech signal processing method described above can be referred to correspondingly.
Fig. 9 is a schematic structural diagram of a speech signal processing apparatus according to an embodiment of the present invention, and as shown in fig. 9, the speech signal processing apparatus 900 includes a first echo cancellation unit 901 and a second echo cancellation unit 902; wherein:
a first echo cancellation unit 901, configured to input a reference signal and a current speech signal acquired by a speech acquisition device into an adaptive filter, and process the current speech signal based on the reference signal through the adaptive filter to obtain a target residual signal; the reference signal comprises a received far-end voice signal;
a second echo cancellation unit 902, configured to input the target residual signal and the far-end speech signal into a preset speech processing model, so as to obtain a current frame near-end speech processing signal output by the speech processing model and used for transmission;
the voice processing model is used for performing echo removing processing on the target residual signal; the speech processing model is trained based on the far-end speech signal samples and the residual signal samples.
According to the speech signal processing device provided by the invention, the far-end speech signal and the target residual signal output by the adaptive filter are input into a pre-trained speech processing model to obtain the current frame near-end speech processing signal output by the speech processing model and used for transmission. The target residual signal output by the adaptive filter is thus further de-echoed by the speech processing model; because the speech processing model is trained on residual signal samples and far-end speech signal samples, and both the residual signal samples and the target residual signal contain signals, such as nonlinear echo, that the adaptive filter cannot completely eliminate, the speech processing model can cancel the echo of the nonlinear components in the target residual signal, thereby improving the accuracy of speech signal processing.
Based on any of the above embodiments, the reference signal further includes a previous frame of near-end speech processing signal output by the speech processing model.
Based on any one of the above embodiments, the speech processing model includes a first feature extraction network, a second feature extraction network, a third feature extraction network, and a multilayer artificial neural network; the second echo cancellation unit 902 is specifically configured to:
inputting a target residual signal into the first feature extraction network, and mapping the target residual signal from a time domain to a transform domain through the first feature extraction network to obtain a first feature;
inputting the far-end voice signal into the second feature extraction network, and mapping the far-end voice signal from a time domain to a transform domain through the second feature extraction network to obtain a second feature;
inputting the first feature and the second feature into the multilayer artificial neural network, and extracting a mask of a near-end speech signal of the current frame from the first feature based on the first feature and the second feature through the multilayer artificial neural network;
determining a third feature of the current frame near-end speech signal in a transform domain based on the mask and the first feature;
inputting the third feature into the third feature extraction network, and mapping the third feature from a transform domain to a time domain through the third feature extraction network to obtain the current frame near-end speech processing signal.
Based on any of the above embodiments, the speech processing model is obtained based on the following manner:
inputting the residual signal sample and the far-end voice signal sample into an initial network model to obtain a near-end voice processing sample output by the initial network model;
determining a loss function based on the near-end speech processing samples and a desired signal; the desired signal comprises near-end residual signal samples;
and optimizing the model parameters of the initial network model based on the loss function until a convergence condition is reached to obtain the voice processing model.
Based on any of the above embodiments, the speech signal processing apparatus 900 further includes:
the system comprises a sample acquisition unit, a voice acquisition unit and a processing unit, wherein the sample acquisition unit is used for acquiring a near-end noisy signal sample and an impulse response sample from a voice playing device to a voice acquisition device;
the sample delay unit is used for delaying the near-end noisy signal sample for a preset time to obtain a target near-end noisy signal sample;
an echo sample determination unit, configured to determine a near-end speech echo sample based on the target near-end noisy signal sample and the impulse response sample;
an input signal sample determination unit for determining an input signal sample based on the near-end speech echo sample and the near-end noisy signal sample;
a near-end residual signal sample determining unit, configured to input the input signal sample and the target near-end noisy signal sample into the adaptive filter, and process the input signal sample based on the target near-end noisy signal sample through the adaptive filter to obtain a near-end residual signal sample;
a residual signal sample determination unit for determining the residual signal samples based on the near-end residual signal samples.
Based on any of the embodiments above, the residual signal sample determination unit is specifically configured to:
determining a far-end voice echo sample based on the far-end voice signal sample and the impulse response sample;
inputting the far-end voice echo sample and the far-end voice signal sample into the adaptive filter, and processing the far-end voice echo sample based on the far-end voice signal sample through the adaptive filter to obtain a far-end residual signal sample;
determining the residual signal samples based on the near-end residual signal samples and the far-end residual signal samples.
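Continuing the sketch, the far-end residual sample and the combined residual sample could be produced as follows; summing the two residuals is an assumption, since the embodiment only states that the residual signal samples are determined from both.

```python
# Sketch only: adding the near-end and far-end residuals is an assumed way of
# determining the residual signal samples from both of them.
import numpy as np

def make_residual_sample(near_end_residual, far_end_speech, impulse_response):
    # Far-end voice echo sample: far-end speech filtered by the echo path.
    far_end_echo = np.convolve(far_end_speech, impulse_response)[:len(far_end_speech)]
    # Adaptive filter driven by the far-end speech removes its linear echo.
    far_end_residual = nlms_cancel(mic=far_end_echo, ref=far_end_speech)
    # Combine both residuals into the training residual signal sample.
    n = min(len(near_end_residual), len(far_end_residual))
    return near_end_residual[:n] + far_end_residual[:n]
```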
Based on any of the above embodiments, the echo sample determination unit is specifically configured to:
determining a reference near-end voice echo sample based on the target near-end noisy signal sample and the impulse response sample, delaying the reference near-end voice echo sample by the preset time to obtain a delayed near-end voice echo sample, taking the delayed near-end voice echo sample as a new target near-end noisy signal sample, and repeating these steps until the number of delays reaches a preset number;
and determining the near-end speech echo sample based on the reference near-end speech echo samples obtained in each iteration.
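The repeated delay-and-filter loop described above, which mimics a signal circulating through the amplification path several times, might be sketched as follows; combining the per-pass reference echoes by summation is an assumption.

```python
# Assumed combination rule: the per-pass reference echoes are summed.
import numpy as np

def make_near_end_echo_sample(target_near_end, impulse_response, delay, passes=3):
    echoes = []
    current = target_near_end
    for _ in range(passes):                     # "preset number" = number of passes
        # Reference near-end voice echo sample for this pass.
        ref_echo = np.convolve(current, impulse_response)[:len(target_near_end)]
        echoes.append(ref_echo)
        # Delay it and treat it as the new target near-end noisy signal sample.
        current = np.concatenate([np.zeros(delay), ref_echo])[:len(target_near_end)]
    # Near-end voice echo sample built from every reference echo obtained.
    return np.sum(echoes, axis=0)
```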
Based on any of the above embodiments, the first echo cancellation unit 901 is specifically configured to:
processing the current voice signal based on the reference signal through the adaptive filter to obtain an output signal;
updating the current impulse response of the adaptive filter with the goal of minimizing the correlation between the output signal and the reference signal, to obtain a target impulse response;
determining a target echo signal based on the target impulse response and the reference signal;
and processing the current voice signal based on the target echo signal to obtain the target residual signal.
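As one possible reading of this first-stage filter, the sketch below uses a normalized least-mean-squares (NLMS) update, a common adaptive-filter rule; the embodiment does not fix a specific adaptation algorithm, so the tap count, step size, and update rule here are assumptions.

```python
# NLMS is an assumed choice of adaptation rule; tap count and step size are
# arbitrary illustration values.
import numpy as np

def nlms_cancel(mic, ref, taps=256, mu=0.1, eps=1e-8):
    w = np.zeros(taps)                          # current impulse response estimate
    buf = np.zeros(taps)                        # most recent reference samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n] if n < len(ref) else 0.0
        echo_est = np.dot(w, buf)               # target echo signal estimate
        out[n] = mic[n] - echo_est              # target residual signal sample
        # Update toward a residual that is uncorrelated with the reference.
        w += (mu / (eps + np.dot(buf, buf))) * out[n] * buf
    return out
```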
An embodiment of the invention provides a sound amplification system, which comprises a voice acquisition device and a voice signal processing device, wherein:
the voice acquisition device is used for acquiring a current voice signal and inputting the current voice signal to the voice signal processing device;
the voice signal processing device adopts the voice signal processing device of any one of the above embodiments.
Further, the sound amplification system may further comprise a voice playing device, configured to play the current voice signal and/or the far-end voice signal.
Fig. 10 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 10, the electronic device 1000 may include: a processor (processor) 1010, a communication Interface (Communications Interface) 1020, a memory (memory) 1030, and a communication bus 1040, wherein the processor 1010, the communication Interface 1020, and the memory 1030 communicate with each other via the communication bus 1040. Processor 1010 may invoke logic instructions in memory 1030 to perform a method of speech signal processing, the method comprising: inputting a reference signal and a current voice signal acquired by a voice acquisition device into an adaptive filter, and processing the current voice signal through the adaptive filter based on the reference signal to obtain a target residual signal; the reference signal comprises a received far-end voice signal;
inputting the target residual signal and the far-end voice signal into a preset voice processing model to obtain a current frame near-end voice processing signal which is output by the voice processing model and used for transmission;
the voice processing model is used for performing echo removing processing on the target residual signal; the speech processing model is trained based on the far-end speech signal samples and the residual signal samples.
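Putting the two stages together, one frame of the method invoked by processor 1010 could look like the sketch below. The tensor shapes, the simple addition used to fold the previous-frame output into the reference signal, and the helper nlms_cancel() are assumptions of this illustration.

```python
# Illustration only: tensor shapes, the addition used to build the reference
# signal, and nlms_cancel() are assumptions, not the patented implementation.
import numpy as np
import torch

def process_frame(model, current_speech, far_end, prev_output=None):
    # Reference signal: received far-end speech, optionally plus the previous
    # frame of near-end speech processing output.
    reference = far_end if prev_output is None else far_end + prev_output
    # Stage 1: adaptive filter produces the target residual signal.
    target_residual = nlms_cancel(mic=current_speech, ref=reference)
    # Stage 2: speech processing model removes the residual echo.
    with torch.no_grad():
        out = model(
            torch.tensor(target_residual, dtype=torch.float32)[None, None, :],
            torch.tensor(far_end, dtype=torch.float32)[None, None, :],
        )
    return out.squeeze(0).squeeze(0).numpy()    # near-end speech for transmission
```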
Furthermore, the above logic instructions in the memory 1030 can be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
In another aspect, the present invention also provides a computer program product, the computer program product including a computer program, the computer program being storable on a non-transitory computer-readable storage medium and, when executed by a processor, performing the speech signal processing method provided by the above methods, the method including: inputting a reference signal and a current voice signal acquired by a voice acquisition device into an adaptive filter, and processing the current voice signal through the adaptive filter based on the reference signal to obtain a target residual signal; the reference signal comprises a received far-end voice signal;
inputting the target residual signal and the far-end voice signal into a preset voice processing model to obtain a current frame near-end voice processing signal which is output by the voice processing model and used for transmission;
the voice processing model is used for performing echo removing processing on the target residual signal; the speech processing model is trained based on the far-end speech signal samples and the residual signal samples.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech signal processing method provided by the above methods, the method including: inputting a reference signal and a current voice signal acquired by a voice acquisition device into an adaptive filter, and processing the current voice signal through the adaptive filter based on the reference signal to obtain a target residual signal; the reference signal comprises a received far-end voice signal;
inputting the target residual signal and the far-end voice signal into a preset voice processing model to obtain a current frame near-end voice processing signal which is output by the voice processing model and used for transmission;
the voice processing model is used for performing echo removing processing on the target residual signal; the speech processing model is trained based on the far-end speech signal samples and the residual signal samples.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (13)

1. A speech signal processing apparatus, comprising:
the first echo eliminating unit is used for inputting a reference signal and a current voice signal acquired by the voice acquisition device into an adaptive filter, and processing the current voice signal through the adaptive filter based on the reference signal to obtain a target residual signal; the reference signal comprises a received far-end voice signal;
the second echo cancellation unit is used for inputting the target residual signal and the far-end voice signal into a preset voice processing model to obtain a current frame near-end voice processing signal which is output by the voice processing model and used for transmission;
the voice processing model is used for performing echo removing processing on the target residual signal; the speech processing model is trained based on the far-end speech signal samples and the residual signal samples.
2. The speech signal processing apparatus of claim 1, wherein the reference signal further comprises a last frame of near-end speech processing signal output by the speech processing model.
3. The speech signal processing apparatus of claim 1, wherein the speech processing model comprises a first feature extraction network, a second feature extraction network, a third feature extraction network, and a multi-layer artificial neural network;
the second echo cancellation unit is specifically configured to:
inputting the target residual signal into the first feature extraction network, and mapping the target residual signal from a time domain to a transform domain through the first feature extraction network to obtain a first feature;
inputting the far-end voice signal into the second feature extraction network, and mapping the far-end voice signal from a time domain to a transform domain through the second feature extraction network to obtain a second feature;
inputting the first feature and the second feature into the multilayer artificial neural network, and extracting a mask of a near-end speech signal of the current frame from the first feature through the multilayer artificial neural network based on the first feature and the second feature;
determining a third feature of the current frame near-end speech signal in a transform domain based on the mask and the first feature;
inputting the third feature into the third feature extraction network, and mapping the third feature from a transform domain to a time domain through the third feature extraction network to obtain the current frame near-end speech processing signal.
4. The speech signal processing apparatus of claim 1, wherein the speech processing model is obtained based on:
inputting the residual signal sample and the far-end voice signal sample into an initial network model to obtain a near-end voice processing sample output by the initial network model;
determining a loss function based on the near-end speech processing samples and a desired signal; the desired signal comprises near-end residual signal samples;
and optimizing the model parameters of the initial network model based on the loss function until a convergence condition is reached to obtain the voice processing model.
5. The speech signal processing apparatus of claim 4, wherein the apparatus further comprises:
a sample acquisition unit, configured to acquire a near-end noisy signal sample and an impulse response sample of the path from the voice playing device to the voice acquisition device;
a sample delay unit, configured to delay the near-end noisy signal sample by a preset time to obtain a target near-end noisy signal sample;
an echo sample determination unit, configured to determine a near-end speech echo sample based on the target near-end noisy signal sample and the impulse response sample;
an input signal sample determination unit for determining an input signal sample based on the near-end speech echo sample and the near-end noisy signal sample;
a near-end residual signal sample determining unit, configured to input the input signal sample and the target near-end noisy signal sample into the adaptive filter, and process the input signal sample based on the target near-end noisy signal sample through the adaptive filter to obtain a near-end residual signal sample;
a residual signal sample determination unit for determining the residual signal samples based on the near-end residual signal samples.
6. The speech signal processing apparatus of claim 5, wherein the residual signal sample determination unit is specifically configured to:
determining a far-end speech echo sample based on the far-end speech signal sample and the impulse response sample;
inputting the far-end voice echo sample and the far-end voice signal sample into the adaptive filter, and processing the far-end voice echo sample based on the far-end voice signal sample through the adaptive filter to obtain a far-end residual signal sample;
determining the residual signal samples based on the near-end residual signal samples and the far-end residual signal samples.
7. The speech signal processing apparatus of claim 5, wherein the echo sample determination unit is specifically configured to:
determining a reference near-end voice echo sample based on the target near-end noisy signal sample and the impulse response sample, delaying the reference near-end voice echo sample by the preset time to obtain a delayed near-end voice echo sample, taking the delayed near-end voice echo sample as a new target near-end noisy signal sample, and repeating these steps until the number of delays reaches a preset number;
determining the near-end voice echo sample based on the reference near-end voice echo samples obtained in each iteration.
8. The speech signal processing apparatus according to any one of claims 1 to 7, wherein the first echo cancellation unit is specifically configured to:
processing the current voice signal based on the reference signal through the adaptive filter to obtain an output signal;
updating the current impulse response of the adaptive filter with the goal of minimizing the correlation between the output signal and the reference signal, to obtain a target impulse response;
determining a target echo signal based on the target impulse response and the reference signal;
and processing the current voice signal based on the target echo signal to obtain the target residual signal.
9. A speech signal processing method, comprising:
inputting a reference signal and a current voice signal acquired by a voice acquisition device into an adaptive filter, and processing the current voice signal through the adaptive filter based on the reference signal to obtain a target residual signal; the reference signal comprises a received far-end voice signal;
inputting the target residual signal and the far-end voice signal into a preset voice processing model to obtain a current frame near-end voice processing signal which is output by the voice processing model and used for transmission;
the voice processing model is used for performing echo removing processing on the target residual signal; the speech processing model is trained based on the far-end speech signal samples and the residual signal samples.
10. A sound amplification system, comprising a voice acquisition device and a speech signal processing apparatus, wherein:
the voice acquisition device is configured to acquire a current voice signal and input the current voice signal to the speech signal processing apparatus;
the speech signal processing apparatus is the speech signal processing apparatus according to any one of claims 1 to 8.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the speech signal processing method of claim 9 when executing the program.
12. A non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when being executed by a processor, implementing the speech signal processing method according to claim 9.
13. A computer program product comprising a computer program, characterized in that the computer program realizes the speech signal processing method according to claim 9 when executed by a processor.
CN202211193923.4A 2022-09-28 2022-09-28 Voice signal processing device, method, electronic equipment and sound amplification system Pending CN115620737A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211193923.4A CN115620737A (en) 2022-09-28 2022-09-28 Voice signal processing device, method, electronic equipment and sound amplification system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211193923.4A CN115620737A (en) 2022-09-28 2022-09-28 Voice signal processing device, method, electronic equipment and sound amplification system

Publications (1)

Publication Number Publication Date
CN115620737A true CN115620737A (en) 2023-01-17

Family

ID=84860614

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211193923.4A Pending CN115620737A (en) 2022-09-28 2022-09-28 Voice signal processing device, method, electronic equipment and sound amplification system

Country Status (1)

Country Link
CN (1) CN115620737A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117174105A (en) * 2023-11-03 2023-12-05 深圳市龙芯威半导体科技有限公司 Speech noise reduction and dereverberation method based on improved deep convolutional network

Similar Documents

Publication Publication Date Title
Zhang et al. Deep learning for acoustic echo cancellation in noisy and double-talk scenarios
CN109727604B (en) Frequency domain echo cancellation method for speech recognition front end and computer storage medium
US20190222691A1 (en) Data driven echo cancellation and suppression
JP4377952B1 (en) Adaptive filter and echo canceller having the same
CN110956975B (en) Echo cancellation method and device
CN110992923B (en) Echo cancellation method, electronic device, and storage device
CN110782914B (en) Signal processing method and device, terminal equipment and storage medium
CN111031448B (en) Echo cancellation method, echo cancellation device, electronic equipment and storage medium
CN112863535B (en) Residual echo and noise elimination method and device
CN110211602B (en) Intelligent voice enhanced communication method and device
CN113241085B (en) Echo cancellation method, device, equipment and readable storage medium
WO2020097828A1 (en) Echo cancellation method, delay estimation method, echo cancellation apparatus, delay estimation apparatus, storage medium, and device
CN112259112A (en) Echo cancellation method combining voiceprint recognition and deep learning
US11380312B1 (en) Residual echo suppression for keyword detection
CN115620737A (en) Voice signal processing device, method, electronic equipment and sound amplification system
CN111756906B (en) Echo suppression method and device for voice signal and computer readable medium
CN113689878A (en) Echo cancellation method, echo cancellation device, and computer-readable storage medium
CN112929506A (en) Audio signal processing method and apparatus, computer storage medium, and electronic device
CN112489680B (en) Evaluation method and device of acoustic echo cancellation algorithm and terminal equipment
CN115083431A (en) Echo cancellation method and device, electronic equipment and computer readable medium
CN111370016B (en) Echo cancellation method and electronic equipment
CN116781829A (en) Apparatus and method for performing acoustic echo cancellation
Tiwari et al. A Review Paper on Adaptive Noise Cancellation Implementation using TMS320C6713 DSP Board
JP4964267B2 (en) Adaptive filter and echo canceller having the same
JP4903843B2 (en) Adaptive filter and echo canceller having the same

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination