WO2022160715A1 - Voice signal processing method and electronic device - Google Patents

Voice signal processing method and electronic device

Info

Publication number
WO2022160715A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
voice
feature
features
signal frames
Application number
PCT/CN2021/116212
Other languages
French (fr)
Chinese (zh)
Inventor
邓峰
王晓瑞
王仲远
Original Assignee
北京达佳互联信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 北京达佳互联信息技术有限公司
Publication of WO2022160715A1 publication Critical patent/WO2022160715A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques

Definitions

  • The present disclosure relates to the technical field of voice processing, and in particular to a voice signal processing method and an electronic device.
  • The collected speech signal typically contains noise, and the presence of noise adversely affects subsequent processing of the speech signal. Noise removal therefore plays a crucial role in speech signal processing.
  • A method for processing a voice signal includes: determining a plurality of first voice features of an original voice signal, each first voice feature corresponding to one voice signal frame in the original voice signal; processing the plurality of first voice features to obtain a plurality of non-local voice features, each non-local voice feature corresponding to one voice signal frame and obtained by fusing the first voice feature of that voice signal frame with the first voice features of the other voice signal frames; processing the non-local voice feature of each voice signal frame to obtain a mixed voice feature of each voice signal frame; obtaining denoising parameters based on the mixed voice features of the plurality of voice signal frames; and denoising the original voice signal based on the denoising parameters to obtain a target voice signal.
  • An apparatus for processing a voice signal includes: a feature determination unit configured to determine a plurality of first voice features of an original voice signal, each corresponding to one voice signal frame in the original voice signal; a non-local feature acquisition unit configured to process the plurality of first voice features to obtain a plurality of non-local voice features, each corresponding to one voice signal frame and obtained by fusing the first voice feature of that frame with the first voice features of the other voice signal frames; a mixed feature acquisition unit configured to process the non-local voice feature of each voice signal frame to obtain a mixed voice feature of each frame; a denoising parameter acquisition unit configured to obtain denoising parameters based on the mixed voice features of the plurality of voice signal frames; and a target signal acquisition unit configured to denoise the original voice signal based on the denoising parameters to obtain a target voice signal.
  • An electronic device includes one or more processors and a memory for storing instructions executable by the one or more processors, wherein the one or more processors are configured to perform the steps of: determining a plurality of first speech features of an original speech signal, each corresponding to one speech signal frame in the original speech signal; processing the plurality of first speech features to obtain a plurality of non-local speech features, each corresponding to one speech signal frame and obtained by fusing the first speech feature of that frame with the first speech features of the other speech signal frames; processing the non-local speech feature of each speech signal frame to obtain a mixed speech feature of each frame; obtaining denoising parameters based on the mixed speech features of the plurality of speech signal frames; and denoising the original speech signal based on the denoising parameters to obtain a target speech signal.
  • A computer-readable storage medium is provided; when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device performs the steps of: determining a plurality of first voice features of an original voice signal, each corresponding to one voice signal frame in the original voice signal; processing the plurality of first voice features to obtain a plurality of non-local voice features, each corresponding to one voice signal frame and obtained by fusing the first voice feature of that frame with the first voice features of the other voice signal frames; processing the non-local voice feature of each voice signal frame to obtain a mixed voice feature of each frame; obtaining denoising parameters based on the mixed voice features of the plurality of voice signal frames; and denoising the original voice signal based on the denoising parameters to obtain a target voice signal.
  • A computer program product includes a computer program which, when executed by a processor, performs the steps of: determining a plurality of first speech features of an original speech signal, each corresponding to one voice signal frame in the original voice signal; processing the plurality of first voice features to obtain a plurality of non-local voice features, each corresponding to one voice signal frame and obtained by fusing the first voice feature of that frame with the first voice features of the other voice signal frames; processing the non-local voice feature of each voice signal frame to obtain a mixed voice feature of each frame; obtaining denoising parameters based on the mixed voice features of the plurality of voice signal frames; and denoising the original speech signal based on the denoising parameters to obtain a target speech signal.
  • In the above method, the context information of each speech signal frame is taken into account when acquiring its non-local speech feature, and the non-local speech feature of each frame is then processed individually to obtain the speech feature of the frame itself, yielding a mixed-form speech feature. The denoising parameters obtained from these mixed speech features are more accurate and precisely represent the proportion of the signal other than the noise signal in each speech signal frame; therefore, denoising the original speech signal with these parameters improves the denoising effect.
  • Fig. 1 is a schematic diagram of a speech processing model according to an exemplary embodiment.
  • Fig. 2 is a schematic diagram of another speech processing model according to an exemplary embodiment.
  • Fig. 3 is a schematic diagram of another speech processing model according to an exemplary embodiment.
  • Fig. 4 is a flow chart of a method for processing a speech signal according to an exemplary embodiment.
  • Fig. 5 is a flow chart of another voice signal processing method according to an exemplary embodiment.
  • Fig. 6 is a schematic diagram of a non-local attention sub-model according to an exemplary embodiment.
  • Fig. 7 is a flow chart of a method for acquiring non-local speech features according to an exemplary embodiment.
  • Fig. 8 is a schematic diagram of a first processing network according to an exemplary embodiment.
  • Fig. 9 is a schematic diagram of a second processing network according to an exemplary embodiment.
  • Fig. 10 is a schematic diagram of another second processing network according to an exemplary embodiment.
  • Fig. 11 is a schematic diagram of a residual non-local sub-network according to an exemplary embodiment.
  • Fig. 12 is a schematic diagram of another non-local attention sub-model according to an exemplary embodiment.
  • Fig. 13 is a flowchart showing another method for processing a speech signal according to an exemplary embodiment.
  • Fig. 14 is a block diagram of an apparatus for processing a speech signal according to an exemplary embodiment.
  • Fig. 15 is a block diagram of another apparatus for processing speech signals according to an exemplary embodiment.
  • Fig. 16 is a structural block diagram of a terminal according to an exemplary embodiment.
  • Fig. 17 is a structural block diagram of a server according to an exemplary embodiment.
  • The user information (including but not limited to user equipment information, user personal information, etc.) involved in this disclosure is information authorized by the user or fully authorized by all parties.
  • In the related art, spectral subtraction is used to denoise the speech signal: a silent segment of the speech signal is located, a noise estimate is extracted from it, and the noise in the speech signal is removed by subtracting the noise estimate from the speech signal. However, when the noise in the speech signal changes over time, spectral subtraction has difficulty removing it, and the denoising effect is poor.
  • The voice signal processing method provided by the embodiments of the present disclosure can be applied to various scenarios. For example, in a live-streaming scenario, the method removes the noise signal from the voice signal and improves its voice quality, so that viewer terminals play a clear voice signal and the live broadcast effect is improved. In a speech recognition scenario, the speech signal is denoised first and the denoised signal is then recognized, improving the accuracy of speech recognition. The methods provided by the embodiments of the present disclosure can also be applied to scenarios such as video playback, language recognition, speech synthesis, and identity recognition.
  • Fig. 1 is a schematic diagram of a speech processing model according to an exemplary embodiment. The speech processing model includes a non-local attention network 101 and a local attention network 102, the non-local attention network 101 being connected to the local attention network 102. The non-local attention network 101 processes the first speech features of the input original speech signal to obtain the non-local speech features of the original speech signal, and the local attention network 102 further processes these non-local speech features to obtain the mixed speech features of the original speech signal.
  • In some embodiments, the speech processing model further includes a feature extraction network 103, a feature reconstruction network 104, and a speech denoising network 105. The feature extraction network 103 is connected to the non-local attention network 101, the feature reconstruction network 104 is connected to the local attention network 102, and the speech denoising network 105 is connected to the feature reconstruction network 104.
  • The feature extraction network 103 extracts the first voice features of the original voice signal; the feature reconstruction network 104 performs feature reconstruction on the mixed voice features of the processed original voice signal to obtain the denoising parameters of the original voice signal; and the voice denoising network 105 denoises the original voice signal.
  • In some embodiments, the speech processing model includes a plurality of non-local attention networks 101 and a plurality of local attention networks 102, which can be connected in sequence in any order.
  • For example, the speech processing model includes two non-local attention networks 101 and two local attention networks 102: the feature extraction network 103 is connected to the first non-local attention network 101, the first non-local attention network 101 is connected to the first local attention network 102, the first local attention network 102 is connected to the second local attention network 102, the second local attention network 102 is connected to the second non-local attention network 101, and the second non-local attention network 101 is connected to the feature reconstruction network 104.
  • In the embodiments of the present disclosure, the non-local attention network may be referred to as a non-local attention sub-model, the local attention network as a local attention sub-model, the feature extraction network as a feature extraction sub-model, the feature reconstruction network as a feature reconstruction sub-model, and the speech denoising network as a speech denoising sub-model.
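  • To make the data flow concrete, the following is a minimal sketch of how these sub-models could be wired together, assuming PyTorch; the module classes passed in are illustrative stand-ins, not the patent's actual implementation.

```python
import torch
import torch.nn as nn

class SpeechProcessingModel(nn.Module):
    """Illustrative wiring of the sub-models described above."""
    def __init__(self, feature_extraction, non_local_attention,
                 local_attention, feature_reconstruction):
        super().__init__()
        self.feature_extraction = feature_extraction          # network 103
        self.non_local_attention = non_local_attention        # network 101
        self.local_attention = local_attention                # network 102
        self.feature_reconstruction = feature_reconstruction  # network 104

    def forward(self, original_amplitude: torch.Tensor) -> torch.Tensor:
        first_features = self.feature_extraction(original_amplitude)
        non_local_features = self.non_local_attention(first_features)
        mixed_features = self.local_attention(non_local_features)
        # Feature reconstruction yields the denoising parameters.
        denoising_params = self.feature_reconstruction(mixed_features)
        # Speech denoising network 105: apply the parameters to the amplitude.
        return denoising_params * original_amplitude
```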
  • The voice signal processing method provided by the embodiments of the present disclosure is performed by an electronic device, which is a terminal or a server.
  • The terminal is a portable, pocket-sized, hand-held, or other type of terminal, such as a mobile phone, a computer, or a tablet computer.
  • The server is a single server, a server cluster composed of several servers, or a cloud computing service center.
  • Fig. 4 is a flow chart showing a method for processing a voice signal according to an exemplary embodiment. Referring to Fig. 4, the method is executed by an electronic device and includes the following steps:
  • Each non-local voice feature corresponds to one voice signal frame and is obtained by fusing the first voice feature of that voice signal frame with the first voice features of the other voice signal frames.
  • Since the non-local attention sub-model is invoked to obtain the non-local speech feature of each speech signal frame, the context information of each frame is taken into account; the local attention sub-model is then invoked to process the non-local speech feature of each frame separately, obtaining the speech feature of the frame itself and thus a mixed-form speech feature. The denoising parameters obtained from these mixed features accurately represent the proportion of the signal other than the noise signal in each frame, so using them to denoise the original speech signal improves the denoising effect.
  • Fig. 5 is a flowchart of another voice signal processing method according to an exemplary embodiment. Referring to Fig. 5, the method is executed by an electronic device and includes the following steps:
  • the electronic device acquires the original amplitude and original phase of multiple speech signal frames in the original speech signal.
  • A speech signal consists of amplitude and phase, and the noise in the speech signal is contained in the amplitude. Therefore, the original amplitude and original phase of each speech signal frame in the original speech signal are obtained and only the original amplitude is denoised; the original phase is not processed, which realizes the denoising of the original speech signal while reducing the amount of processing.
  • the original voice signal is collected by an electronic device, or is a voice signal containing noise signals sent by other electronic devices to the electronic device.
  • the noise signal is a noise signal of environmental noise, white noise, or the like.
  • The original voice signal includes multiple voice signal frames. The electronic device performs a Fourier transform on each voice signal frame to obtain its original amplitude and original phase, and subsequently processes the original amplitude of each voice signal frame to achieve denoising of the amplitude.
  • the Fourier transform includes fast Fourier transform, short-time Fourier transform, and the like.
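  • As a concrete illustration of this step, the short-time Fourier transform splits a signal into per-frame amplitude and phase. A minimal sketch using scipy; the frame parameters are illustrative assumptions:

```python
import numpy as np
from scipy.signal import stft

def decompose(signal: np.ndarray, sample_rate: int):
    """Split a speech signal into per-frame original amplitude and phase."""
    # Each column of Zxx is the spectrum of one speech signal frame.
    _, _, Zxx = stft(signal, fs=sample_rate, nperseg=512, noverlap=256)
    original_amplitude = np.abs(Zxx)   # the noise is contained here
    original_phase = np.angle(Zxx)     # left untouched during denoising
    return original_amplitude, original_phase
```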
  • In some embodiments, the signal length of the original speech signal cannot exceed a reference signal length; that is, the duration of the original speech signal cannot exceed a reference duration. The reference signal length and the reference duration can be set to any values; for example, the reference signal length is 64 speech signal frames.
  • The electronic device invokes the feature extraction sub-model to perform feature extraction on the original amplitude of each voice signal frame, obtaining the first voice feature of each frame, that is, the multiple first voice features of the original voice signal.
  • the first voice feature of the voice signal frame is used to describe the corresponding voice signal frame, and the first voice feature is represented by a vector, a matrix or other forms.
  • The first voice features of the multiple voice signal frames may be represented individually or combined into a single representation. For example, if the first voice feature of each voice signal frame is a vector, the vectors can be combined into a matrix in which each column represents the first voice feature of one voice signal frame.
  • the feature extraction sub-model includes a convolution layer, a batch normalization layer, and an activation function layer.
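  • A minimal sketch of such a feature extraction sub-model, assuming PyTorch; the channel counts and kernel size are illustrative assumptions:

```python
import torch.nn as nn

# Convolution + batch normalization + activation, as described above.
feature_extraction = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)
```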
  • the electronic device invokes the non-local attention sub-model to fuse the first speech features of the multiple speech signal frames to obtain the non-local speech features of each speech signal frame.
  • Each non-local voice feature corresponds to one voice signal frame and is obtained by fusing the first voice feature of that frame with the first voice features of the other voice signal frames. That is, the non-local voice feature of each voice signal frame combines the first voice features of multiple frames, taking into account the features of the voice signal frames before and after it.
  • In some embodiments, the non-local attention sub-model uses an attention mechanism and residual learning to process the first speech features. The attention mechanism takes the context information of each speech signal frame into account, making the resulting non-local speech features more accurate. Because some speech features are lost while the first speech features are being processed, residual learning combines the processed features with the input first speech features to obtain the non-local speech features, avoiding the loss of important speech features during processing.
  • the non-local attention sub-model includes a first processing unit, a second processing unit, a first fusion unit, and a second fusion unit.
  • The first processing unit may be referred to as a first processing network, the second processing unit as a second processing network, the first fusion unit as a first fusion network, and the second fusion unit as a second fusion network.
  • In some embodiments, the first processing network is a trunk branch (Trunk Branch) and the second processing network is a mask branch (Mask Branch).
  • The first processing network and the second processing network each process the input first voice features of the multiple voice signal frames; the first fusion network fuses the features output by the two processing networks; and the second fusion network fuses the result of the first fusion network with the features input to the non-local attention sub-model.
  • The process by which the electronic device invokes the non-local attention sub-model to process the first speech feature of each speech signal frame is shown in Fig. 7 and includes the following steps:
  • The electronic device invokes the first processing network to perform feature extraction on the first voice feature of each voice signal frame, obtaining the second voice feature of each frame. The second voice feature is a further extraction of the first voice feature and contains fewer noise features.
  • In some embodiments, the first processing network includes a plurality of atrous residual sub-units (Res. Unit), each of which may be called an atrous residual sub-network. Each atrous residual sub-network includes an atrous (dilated) convolution layer, a batch normalization layer, and an activation function layer, and the multiple atrous residual sub-networks are connected using the network structure of a residual learning network. The atrous convolution layer expands the receptive field and obtains more contextual information.
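  • A minimal sketch of one atrous residual sub-network under these assumptions, in PyTorch; the dilation rate and channel count are illustrative:

```python
import torch
import torch.nn as nn

class AtrousResidualSubNetwork(nn.Module):
    """Atrous convolution + batch norm + activation with a residual skip."""
    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        self.body = nn.Sequential(
            # The dilated convolution widens the receptive field over frames.
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps features that processing might lose.
        return x + self.body(x)
```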
  • In some embodiments, the non-local attention sub-model further includes at least one atrous residual unit, which may be called an atrous residual network. Each atrous residual network includes two atrous residual sub-networks connected using the network structure of a residual learning network. Before calling the first processing network and the second processing network to process the first voice feature of each voice signal frame, the electronic device first calls the at least one atrous residual network to perform feature extraction on the first voice feature of each frame, obtaining further-extracted first voice features; the first and second processing networks then process these further-extracted features. Invoking the first processing network, which includes a plurality of atrous residual sub-networks, extracts the first speech features further to obtain deeper speech features.
  • The electronic device invokes the second processing network to fuse the first voice feature of each voice signal frame with the first voice features of the other voice signal frames, obtaining the third voice feature of each frame. That is, for each voice signal frame, the second processing network fuses the first voice feature of that frame with the first voice features of the frames other than it, so that the third voice feature combines the first voice features of the other voice signal frames.
  • In some embodiments, the second processing network includes a residual non-local subunit, a convolution subunit, and a deconvolution subunit, which may be referred to as a residual non-local sub-network, a convolution sub-network, and a deconvolution sub-network, respectively.
  • The electronic device calls the residual non-local sub-network and, based on the weights of the multiple voice signal frames, performs weighted fusion of the first voice feature of each voice signal frame with the first voice features of the other voice signal frames, obtaining the weighted-fused first voice feature of each frame. The convolution sub-network is then called to encode the weighted-fused first voice feature of each frame, obtaining the encoded feature of each frame, and the deconvolution sub-network is called to decode the encoded features, obtaining the third voice feature of each frame.
  • In some embodiments, the second processing network further includes a plurality of feature reduction subunits, a plurality of first atrous residual subunits, a plurality of second atrous residual subunits, and an activation function subunit, which may be referred to as feature reduction sub-networks, first atrous residual sub-networks, second atrous residual sub-networks, and an activation function sub-network, respectively.
  • The residual non-local sub-network is connected to the first of the first atrous residual sub-networks, which are connected in sequence, and the last of them is connected to the convolution sub-network. The convolution sub-network is connected to the first feature reduction sub-network; the feature reduction sub-networks are connected in sequence, and the last is connected to the deconvolution sub-network. The deconvolution sub-network is connected to the first of the second atrous residual sub-networks, which are connected in sequence, and the last of them is connected to the activation function sub-network.
  • Fig. 10 shows two first atrous residual sub-networks, two second atrous residual sub-networks, and two feature reduction sub-networks only as an example; other numbers of these sub-networks are also possible.
  • In some embodiments, the activation function in the activation function sub-network is a sigmoid function or another activation function.
  • Although the first and second atrous residual sub-networks occupy different positions, each atrous residual sub-network includes an atrous convolution layer, a batch normalization layer, and an activation function layer. In some embodiments, the feature reduction sub-network is also an atrous residual sub-network.
  • The electronic device invokes the plurality of first atrous residual sub-networks to process the weighted-fused first voice feature of each voice signal frame, obtaining the further-processed first voice feature of each frame; calls the convolution sub-network to encode the further-processed first speech features, obtaining the encoded feature of each frame; performs feature reduction on the encoded features, obtaining multiple reduced encoded features; calls the deconvolution sub-network to decode the reduced encoded features, obtaining the decoded speech feature of each frame; and calls the plurality of second atrous residual sub-networks to process the decoded speech features, obtaining the third speech feature of each frame.
  • Performing reduction processing on the encoded features shrinks them, which reduces the amount of calculation and improves the processing speed.
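  • Under the structure just described, the second processing network can be sketched roughly as follows, reusing the AtrousResidualSubNetwork class from the earlier sketch; the layer sizes, block counts, and downsampling factor are illustrative assumptions:

```python
import torch.nn as nn

def make_mask_branch(channels: int, nl_unit: nn.Module) -> nn.Sequential:
    """Residual non-local unit -> first atrous residual blocks -> encode ->
    feature reduction -> decode -> second atrous residual blocks -> sigmoid."""
    # AtrousResidualSubNetwork is the class defined in the earlier sketch.
    return nn.Sequential(
        nl_unit,                                    # residual non-local sub-network
        AtrousResidualSubNetwork(channels),         # first atrous residual sub-networks
        AtrousResidualSubNetwork(channels),
        nn.Conv2d(channels, channels, 3, stride=2, padding=1),  # encode
        AtrousResidualSubNetwork(channels),         # feature reduction sub-networks
        AtrousResidualSubNetwork(channels),
        nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1),  # decode
        AtrousResidualSubNetwork(channels),         # second atrous residual sub-networks
        AtrousResidualSubNetwork(channels),
        nn.Sigmoid(),                               # activation function sub-network
    )
```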
  • In some embodiments, the residual non-local sub-network includes a first fusion layer and a second fusion layer. The electronic device calls the first fusion layer and, based on the weights of the multiple speech signal frames, performs weighted fusion of the first speech feature of each speech signal frame with the first speech features of the other speech signal frames, obtaining the fusion feature of each frame; the second fusion layer is then called to fuse the first speech feature of each frame with its fusion feature, obtaining the weighted-fused first speech feature of each frame.
  • Calling the first fusion layer to fuse the first speech features of different speech signal frames with their corresponding weights yields more accurate fusion features. When the first fusion layer and the second fusion layer are both included, the residual non-local sub-network is a residual learning network: it fuses the fusion feature with the input first speech feature, so that the final weighted-fused first speech feature is more accurate and important features are not lost, improving the accuracy of the weighted-fused first speech feature. Moreover, the residual learning network is easier to optimize, which improves the training efficiency of the model during the training process.
  • Fig. 11 illustrates the processing of three speech signal frames. The residual non-local sub-network further includes a plurality of convolution layers, a third fusion layer, and a normalization layer. The third fusion layer is connected to two of the convolution layers and fuses the first speech features processed by those two layers; it is also connected to the normalization layer, which normalizes the fused speech features output by the third fusion layer. The normalization layer is connected to the first fusion layer, which fuses the first speech feature processed by another convolution layer with the normalized speech feature output by the normalization layer, obtaining the fusion feature of each speech signal frame. The fusion feature is then processed by a convolution layer and fused with the input first speech feature to obtain the weighted-fused first speech feature.
  • In some embodiments, the first fusion layer and the third fusion layer fuse speech features by matrix multiplication, and the second fusion layer fuses speech features by matrix addition.
  • In some embodiments, the first voice feature of a voice signal frame has shape T*K*C, representing a voice feature of dimension C at time T and frequency K. To multiply or add the speech features of different voice signal frames, the speech features must first be reshaped into compatible forms.
  • In some embodiments, the residual non-local sub-network processes the first speech feature of each speech signal frame $x_i$ with a formula of the following form, written here in the standard non-local attention formulation consistent with the symbol definitions that follow: $o_i = x_i + W_z y_i$, where $y_i = \sum_{j} \mathrm{softmax}\big((W_u x_i)^{\top} W_v x_j\big)\, W_g x_j$.
  • Here $o_i$ denotes the weighted-fused first speech feature of speech signal frame $x_i$; $W_z$, $W_u$, $W_v$ and $W_g$ are learned model parameters; softmax denotes normalization; $x_j$ ranges over the speech signal frames other than $x_i$; and $y_i$ denotes the fusion feature of speech signal frame $x_i$.
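  • The same computation can be sketched in code, assuming PyTorch; realizing $W_u$, $W_v$, $W_g$ and $W_z$ as 1x1 convolutions is an illustrative choice, not the patent's stated implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualNonLocal(nn.Module):
    """Residual non-local fusion: o_i = x_i + W_z * sum_j softmax(...) W_g x_j."""
    def __init__(self, channels: int):
        super().__init__()
        self.w_u = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_v = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_g = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_z = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, C, T, K), C features per time frame T and frequency bin K
        b, c, t, k = x.shape
        u = self.w_u(x).reshape(b, c, t * k)  # queries
        v = self.w_v(x).reshape(b, c, t * k)  # keys
        g = self.w_g(x).reshape(b, c, t * k)  # values
        # Pairwise similarities between positions, normalized with softmax.
        attn = F.softmax(torch.bmm(u.transpose(1, 2), v), dim=-1)
        # Fusion feature y: weighted combination of the other frames' features.
        y = torch.bmm(g, attn.transpose(1, 2)).reshape(b, c, t, k)
        # Residual connection with the input first speech feature.
        return x + self.w_z(y)
```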
  • The electronic device invokes the first fusion network to fuse the second voice feature and the third voice feature of each voice signal frame, obtaining the non-local voice feature of each frame. In some embodiments, the first fusion network is a multiplication unit; that is, the second speech feature and the third speech feature of each frame are multiplied to obtain the fused non-local speech feature.
  • The electronic device then invokes the second fusion network to fuse the non-local speech feature of each speech signal frame with its first speech feature, obtaining the fused non-local speech feature of each frame. In some embodiments, the second fusion network is an addition unit; that is, the electronic device adds the non-local speech feature and the first speech feature of each frame.
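  • Putting the pieces together, the trunk/mask fusion can be sketched as follows, assuming PyTorch; the trunk and mask modules stand in for the first and second processing networks and are hypothetical names:

```python
import torch.nn as nn

class NonLocalAttention(nn.Module):
    """First fusion network: multiply; second fusion network: add input back."""
    def __init__(self, trunk: nn.Module, mask: nn.Module):
        super().__init__()
        self.trunk = trunk  # first processing network (deeper features)
        self.mask = mask    # second processing network (cross-frame fusion)

    def forward(self, x):
        fused = self.trunk(x) * self.mask(x)  # first fusion network
        return x + fused                      # second fusion network
```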
  • In the embodiments of the present disclosure, different networks in the non-local attention sub-model process different aspects of the first speech features. The first processing network, which includes multiple atrous residual sub-networks, further extracts the first voice features to obtain deeper voice features, while the second processing network adopts a non-local attention mechanism so that, when the first voice feature of each voice signal frame is processed, the features of the other frames in the voice signal are taken into account. The first fusion network then fuses the speech features obtained by the two processing networks to obtain the non-local speech features. In addition, the atrous residual sub-networks expand the receptive field and capture more contextual information.
  • In some embodiments, when the non-local attention sub-model includes the second fusion network, the non-local attention sub-model is a residual learning network: after the non-local speech features are obtained, they are fused with the input first speech features, so the final non-local speech features are more accurate and important features are not lost, improving their accuracy. Moreover, a residual learning network is easier to optimize, which improves the training efficiency of the model during the training process.
  • In some embodiments, the non-local attention sub-model includes a plurality of atrous residual units, which may be called atrous residual networks. The electronic device first calls the plurality of atrous residual networks to process the input first voice feature of each voice signal frame, and then feeds the processed first voice features to the first processing network and the second processing network.
  • Similarly, after the fused non-local speech features are obtained, multiple atrous residual networks are called to process them, and the processed non-local speech features are input into the subsequent local attention sub-model.
  • Fig. 12 shows four atrous residual networks only as an example.
  • The electronic device invokes the local attention sub-model to process the non-local speech features of each speech signal frame separately, obtaining the mixed speech feature of each speech signal frame.
  • The mixed speech feature no longer includes the noise feature, and because the speech features of the other speech signal frames were considered in deriving it, it is more accurate.
  • In some embodiments, the network structure of the local attention sub-model is similar to that of the non-local attention sub-model, except that the local attention sub-model does not include the residual non-local sub-network; its network structure is therefore not repeated here.
  • The embodiments of the present disclosure take one non-local attention sub-model and one local attention sub-model as an example for description.
  • In some embodiments, multiple non-local attention sub-models and multiple local attention sub-models are included; that is, after the mixed speech feature is obtained, it can be input into a subsequent non-local attention sub-model or local attention sub-model for further processing, yielding a more accurate mixed speech feature.
  • When the electronic device invokes the local attention sub-model, it processes the non-local speech features of each speech signal frame separately: the non-local voice features of the other voice signal frames, i.e. the context information of the frame, are no longer considered, so the processing extracts the voice features of the frame itself. Since the context information of each speech signal frame was already considered when its non-local speech feature was obtained, the resulting mixed speech feature reflects both the characteristics of the frame within the whole speech signal and the characteristics of the frame itself.
  • the electronic device invokes the feature recognition sub-model to perform feature recognition on the mixed speech features of multiple speech signal frames to obtain denoising parameters.
  • The feature recognition sub-model performs feature recognition on the mixed speech features of the multiple speech signal frames. From the mixed speech feature of each frame, it identifies the ratio between the noise signal and the remaining signal in the corresponding voice signal frame; identifying the multiple mixed voice features thus yields the denoising parameters corresponding to the multiple voice signal frames, i.e. the denoising parameters of the original voice signal. The denoising parameters represent the proportion of the speech signal other than the noise signal in each speech signal frame and are subsequently used to denoise the original speech signal.
  • In some embodiments, the denoising parameters are represented as a matrix in which each element, or each column or row, represents the denoising parameter of one speech signal frame.
  • the feature recognition sub-model is a convolutional network or other types of networks.
  • the electronic device invokes the speech denoising sub-model, and denoises the original amplitudes of the multiple speech signal frames according to the denoising parameters to obtain the target amplitudes of the multiple speech signal frames.
  • In some embodiments, the speech denoising sub-model is a multiplication network: the denoising parameters are multiplied by the original amplitudes to obtain the target amplitudes of the multiple speech signal frames, and the target amplitudes no longer contain the noise signal.
  • In some embodiments, the denoising parameter is a matrix, and each element of the matrix is multiplied by the original amplitude of the corresponding speech signal frame; alternatively, each column or row of the matrix is multiplied by the original amplitudes of the corresponding speech signal frame.
  • the electronic device combines the original phases and target amplitudes of multiple voice signal frames to obtain a target voice signal.
  • the electronic device performs inverse Fourier transform on the original phases and target amplitudes of the plurality of speech signal frames to obtain the target speech signal, where the target speech signal is the speech signal after removing the noise signal.
  • This method of denoising the original amplitudes of the speech signal frames only needs to process the amplitude of the speech signal, not the phase, which reduces the features to be processed and improves the processing speed.
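  • Continuing the earlier scipy sketch, applying the denoising parameters and recombining with the original phase could look like this; the denoising parameters are assumed to have the same shape as the amplitude spectrogram:

```python
import numpy as np
from scipy.signal import istft

def denoise(original_amplitude, original_phase, denoising_params, sample_rate):
    """Multiply the denoising parameters with the original amplitudes,
    then recombine with the untouched original phases."""
    target_amplitude = denoising_params * original_amplitude
    Zxx = target_amplitude * np.exp(1j * original_phase)
    # The inverse transform recovers the target speech signal.
    _, target_signal = istft(Zxx, fs=sample_rate, nperseg=512, noverlap=256)
    return target_signal
```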
  • Since the non-local attention sub-model is invoked to obtain the non-local speech feature of each speech signal frame, the context information of each frame is taken into account; the local attention sub-model is then invoked to process the non-local speech feature of each frame separately, obtaining the speech feature of the frame itself and thus a mixed-form speech feature. The denoising parameters obtained from these mixed features accurately represent the proportion of the signal other than the noise signal in each frame, so using them to denoise the original speech signal improves the denoising effect.
  • In the embodiments of the present disclosure, feature extraction is performed on the original amplitude of each speech signal frame, and the original amplitude is denoised according to the acquired denoising parameters to obtain a noise-free target amplitude; the target speech signal without the noise signal is then recovered from the target amplitude and the original phase, realizing the denoising of the original speech signal. This denoising method only needs to process the amplitude of the speech signal, not the phase, which reduces the features that need to be processed.
  • In some embodiments, the speech processing model needs to be trained before it is called to process the original speech signal. The training process is as follows: obtain a sample speech signal and a sample noise signal; mix the sample speech signal and the sample noise signal to obtain a sample mixed signal; call the speech processing model to process the multiple sample speech signal frames in the sample mixed signal to obtain the predicted denoising parameters corresponding to the sample mixed signal; denoise the sample mixed signal based on the predicted denoising parameters to obtain a predicted speech signal; and train the speech processing model based on the difference between the predicted speech signal and the sample speech signal.
  • In some embodiments, the sample speech signal is a clean speech signal that contains no noise, which improves the training speed of the model during the training process.
  • In some embodiments, sample voice signals of multiple users are obtained from a voice database, multiple sample noise signals are obtained from a noise database, and the sample noise signals are mixed with the sample voice signals at different signal-to-noise ratios to obtain multiple sample mixed signals, which are used to train the speech processing model.
  • In some embodiments, the sample amplitudes of the multiple sample speech signal frames in the sample mixed signal are obtained, and the speech processing model is invoked to process the multiple sample amplitudes to obtain the predicted denoising parameters corresponding to the sample mixed signal; the sample amplitudes are denoised based on the predicted denoising parameters to obtain the predicted amplitude of each frame, and the speech processing model is trained based on the difference between the predicted amplitude of each frame and the amplitudes of the corresponding frames of the sample speech signal.
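  • A minimal training sketch under these assumptions, in PyTorch; the SNR mixing and the magnitude-based loss follow the description above, while the optimizer and exact loss form are illustrative choices:

```python
import numpy as np
import torch
import torch.nn.functional as F

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the noise so the mixture has the requested signal-to-noise ratio."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def train_step(model, optimizer, sample_amplitude, clean_amplitude):
    """One step: predict denoising parameters, apply them to the sample
    amplitudes, and compare against the clean sample amplitudes."""
    optimizer.zero_grad()
    predicted_params = model(sample_amplitude)
    predicted_amplitude = predicted_params * sample_amplitude
    loss = F.mse_loss(predicted_amplitude, clean_amplitude)
    loss.backward()
    optimizer.step()
    return loss.item()
```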
  • In the figures, Conv. denotes the feature extraction sub-model or the feature recognition sub-model; RNAM denotes the non-local attention sub-model; RAM denotes the local attention sub-model; Res. Unit denotes an atrous residual network or atrous residual sub-network; Conv denotes the convolution sub-network; Deconv denotes the deconvolution sub-network; and NL Unit denotes the residual non-local sub-network.
  • The following related methods are compared with the method (RNANet) provided by the embodiments of the present disclosure:
  • Wiener Filtering
  • SEGAN (Speech Enhancement Generative Adversarial Network)
  • Wavenet (a WaveNet-based speech denoising method)
  • MMSE-GAN (a speech enhancement generative adversarial network)
  • DFL (Deep Feature Loss) method
  • MDPhD (a hybrid model)
  • RSGAN-GP (Speech Enhancement using Relativistic Generative Adversarial Networks with Gradient Penalty)
  • CSIG: an evaluation index, the mean opinion score of signal distortion; the larger the CSIG, the better the denoising effect.
  • CBAK: an evaluation index (the mean opinion score of background noise intrusiveness).
  • COVL: an evaluation index (the mean opinion score of overall quality).
  • STOI: Short-Time Objective Intelligibility.
  • In other embodiments, the electronic device denoises the original voice signal without invoking the voice processing model.
  • Fig. 13 is a flowchart of another voice signal processing method provided by an embodiment of the present application. Referring to Fig. 13, the method is performed by an electronic device and includes the following steps:
  • the electronic device performs feature extraction on the original amplitude of each voice signal frame, respectively, to obtain the first voice feature of each voice signal frame.
  • each non-local voice feature is obtained by fusing the first voice feature of the voice signal frame corresponding to the non-local voice feature and the first voice features of other voice signal frames except the voice signal frame.
  • The electronic device performs feature extraction on the first voice feature of each voice signal frame to obtain the second voice feature of each frame; fuses the first voice feature of each frame with the first voice features of the other voice signal frames to obtain the third voice feature of each frame; and fuses the second voice feature and the third voice feature of each frame to obtain the non-local voice feature of each frame.
  • In some embodiments, based on the weights of the multiple voice signal frames, the electronic device performs weighted fusion of the first voice feature of each voice signal frame with the first voice features of the other voice signal frames to obtain the fusion feature of each frame, and then fuses the first voice feature of each frame with its fusion feature to obtain the weighted-fused first voice feature of each frame.
  • The electronic device performs feature reduction on the encoded feature of each voice signal frame to obtain multiple reduced encoded features, and decodes each reduced encoded feature to obtain the third voice feature of each voice signal frame.
  • the electronic device fuses the non-local speech feature of each speech signal frame with the first speech feature to obtain the fused non-local speech feature of each speech signal frame.
  • the electronic device performs feature recognition on the mixed speech features of the multiple speech signal frames to obtain the denoising parameter.
  • The electronic device denoises the original amplitudes of the multiple speech signal frames based on the denoising parameters to obtain the target amplitudes of the multiple speech signal frames, and then combines the original phases with the target amplitudes of the multiple speech signal frames to obtain the target speech signal.
  • In this method, the context information of each speech signal frame is taken into account when acquiring its non-local speech feature, and the non-local speech feature of each frame is then processed individually to obtain the speech feature of the frame itself, yielding a mixed-form speech feature. The denoising parameters obtained from these mixed speech features are more accurate and precisely represent the proportion of the signal other than the noise signal in each speech signal frame; therefore, denoising the original speech signal with these parameters improves the denoising effect.
  • Fig. 14 is a block diagram of an apparatus for processing a speech signal according to an exemplary embodiment.
  • the device includes:
  • the feature determining unit 1401 is configured to determine a plurality of first voice features of the original voice signal, each first voice feature corresponds to a voice signal frame in the original voice signal;
  • the non-local feature acquisition unit 1402 is configured to process a plurality of first speech features to obtain a plurality of non-local speech features, each non-local speech feature corresponds to a speech signal frame, and each non-local speech feature is based on Obtained by fusing the first voice feature of the voice signal frame corresponding to the non-local voice feature and the first voice feature of other voice signal frames except the voice signal frame;
  • the mixed feature acquisition unit 1403 is configured to process the non-local voice features of each voice signal frame in the original voice signal respectively to obtain the mixed voice feature of each voice signal frame;
  • the denoising parameter obtaining unit 1404 is configured to obtain denoising parameters based on the mixed speech features of a plurality of speech signal frames;
  • the target signal obtaining unit 1405 is configured to perform denoising on the original speech signal based on the denoising parameter to obtain the target speech signal.
  • In the above apparatus, the context information of each voice signal frame is taken into account when acquiring its non-local voice feature, and the non-local voice feature of each frame is then processed individually to obtain the voice feature of the frame itself, yielding a mixed-form voice feature. The denoising parameters obtained from these mixed voice features are more accurate and precisely represent the proportion of the signal other than the noise signal in each voice signal frame; therefore, denoising the original voice signal with these parameters improves the denoising effect.
  • the feature determining unit 1401 is configured to perform feature extraction on the original amplitude of each speech signal frame, respectively, to obtain the first speech feature of each speech signal frame.
  • the target signal acquisition unit 1405 includes:
  • the amplitude obtaining subunit 1415 is configured to de-noise the original amplitudes of the multiple speech signal frames based on the denoising parameters to obtain the target amplitudes of the multiple speech signal frames;
  • the signal acquisition subunit 1425 is configured to combine the original phase and target amplitude of multiple speech signal frames to obtain the target speech signal.
  • the denoising parameter obtaining unit 1404 is configured to perform feature recognition on mixed speech features of multiple speech signal frames to obtain denoising parameters.
  • the non-local feature acquisition unit 1402 includes:
  • the feature extraction subunit 1412 is configured to perform feature extraction on the first voice feature of each voice signal frame, respectively, to obtain the second voice feature of each voice signal frame;
  • the first fusion subunit 1422 is configured to fuse the first speech features of each speech signal frame with the first speech features of other speech signal frames to obtain the third speech feature of each speech signal frame;
  • the second fusion subunit 1432 is configured to respectively fuse the second speech feature and the third speech feature of each speech signal frame to obtain the non-local speech feature of each speech signal frame.
  • the non-local feature acquisition unit 1402 further includes:
  • the third fusion subunit 1442 is configured to fuse the non-local speech feature of each speech signal frame with the first speech feature to obtain the fused non-local speech feature of each speech signal frame.
  • In some embodiments, the first fusion subunit 1422 is configured to: perform weighted fusion of the first speech feature of each speech signal frame with the first speech features of the other speech signal frames to obtain the weighted-fused first speech feature of each frame; encode the weighted-fused first speech features to obtain the encoded feature of each frame; perform feature reduction on the encoded features to obtain multiple reduced encoded features; and decode each reduced encoded feature to obtain the third voice feature of each voice signal frame.
  • In some embodiments, the first fusion subunit 1422 is configured to: perform weighted fusion of the first speech feature of each speech signal frame with the first speech features of the other speech signal frames to obtain the fusion feature of each frame; and fuse the first voice feature of each frame with its fusion feature to obtain the weighted-fused first voice feature of each frame.
  • the non-local feature acquisition unit 1402 is configured to invoke the non-local attention sub-model to process multiple first speech features to obtain multiple non-local speech features;
  • the mixed feature acquisition unit 1403 is configured to invoke the local attention sub-model to process the non-local speech features of each speech signal frame respectively to obtain the mixed speech features of each speech signal frame.
  • the non-local attention sub-model includes a first processing network, a second processing network and a first fusion network.
  • the non-local feature acquisition unit 1402 includes:
  • the feature extraction subunit 1412 is configured to call the first processing network to perform feature extraction on the first voice feature of each voice signal frame to obtain the second voice feature of each frame, the first processing network including a plurality of atrous residual sub-networks;
  • the first fusion subunit 1422 is configured to call the second processing network to fuse the first voice feature of each voice signal frame with the first voice features of the other voice signal frames, obtaining the third voice feature of each frame;
  • the second fusion subunit 1432 is configured to invoke the first fusion network to respectively fuse the second speech feature and the third speech feature of each speech signal frame to obtain the non-local speech feature of each speech signal frame.
  • the second processing network includes a residual non-local sub-network, a convolution sub-network and a deconvolution sub-network; the first fusion subunit 1422 is configured to:
  • invoke the residual non-local sub-network to weight-fuse, based on the weights of the multiple voice signal frames, the first voice feature of each voice signal frame with the first voice features of the other voice signal frames, to obtain the weighted-fused first voice feature of each voice signal frame;
  • invoke the convolution sub-network to encode each weighted-fused first voice feature, to obtain the encoded feature of each voice signal frame;
  • invoke the deconvolution sub-network to decode the encoded feature of each speech signal frame, to obtain the third speech feature of each speech signal frame.
  • the residual non-local sub-network includes a first fusion layer and a second fusion layer;
  • the first fusion subunit 1422 is configured to:
  • invoke the first fusion layer to weight-fuse, based on the weights of the multiple speech signal frames, the first speech feature of each speech signal frame with the first speech features of the other speech signal frames, to obtain the fusion feature of each speech signal frame;
  • invoke the second fusion layer to fuse the first speech feature of each speech signal frame with its fusion feature, to obtain the weighted-fused first speech feature of each speech signal frame.
  • an electronic device comprising one or more processors and a volatile or non-volatile memory for storing instructions executable by the one or more processors, wherein the one or more processors are configured to execute the voice signal processing method in the above embodiments.
  • FIG. 16 is a structural block diagram of a terminal 1600 according to an exemplary embodiment.
  • the terminal 1600 may be a portable mobile terminal, such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop or a desktop computer.
  • Terminal 1600 may also be called user equipment, portable terminal, laptop terminal, desktop terminal, and the like by other names.
  • the terminal 1600 includes: a processor 1601 and a memory 1602 .
  • the processor 1601 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • the processor 1601 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) or PLA (Programmable Logic Array).
  • the processor 1601 may also include a main processor and a coprocessor.
  • the main processor is a processor for processing data in the wake-up state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in a standby state.
  • the processor 1601 may be integrated with a GPU (Graphics Processing Unit), which is used for rendering and drawing the content that needs to be displayed on the display screen.
  • the processor 1601 may further include an AI (Artificial Intelligence) processor, where the AI processor is used to process computing operations related to machine learning.
  • Memory 1602 may include one or more computer-readable storage media, which may be non-transitory. Memory 1602 may also include high-speed random access memory, as well as non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 1602 is used to store at least one piece of program code, which is executed by the processor 1601 to implement the voice signal processing methods provided by the method embodiments of the present disclosure.
  • the terminal 1600 may also optionally include: a peripheral device interface 1603 and at least one peripheral device.
  • the processor 1601, the memory 1602 and the peripheral device interface 1603 can be connected through a bus or a signal line.
  • Each peripheral device can be connected to the peripheral device interface 1603 through a bus, a signal line or a circuit board.
  • the peripheral devices include: at least one of a radio frequency circuit 1604 , a display screen 1605 , a camera assembly 1606 , an audio circuit 1607 , a positioning assembly 1608 and a power supply 1609 .
  • the peripheral device interface 1603 may be used to connect at least one peripheral device related to I/O (Input/Output) to the processor 1601 and the memory 1602 .
  • in some embodiments, the processor 1601, the memory 1602 and the peripheral device interface 1603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1601, the memory 1602 and the peripheral device interface 1603 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
  • the radio frequency circuit 1604 is used for receiving and transmitting RF (Radio Frequency, radio frequency) signals, also called electromagnetic signals.
  • the radio frequency circuit 1604 communicates with communication networks and other communication devices via electromagnetic signals.
  • the radio frequency circuit 1604 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals.
  • radio frequency circuitry 1604 includes an antenna system, an RF transceiver, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, subscriber identity module cards, and the like.
  • the radio frequency circuit 1604 may communicate with other terminals through at least one wireless communication protocol.
  • the wireless communication protocol includes but is not limited to: the World Wide Web, metropolitan area networks, intranets, various generations of mobile communication networks (2G, 3G, 4G and 5G), wireless local area networks and/or WiFi (Wireless Fidelity) networks.
  • the radio frequency circuit 1604 may further include a circuit related to NFC (Near Field Communication), which is not limited in the present disclosure.
  • the display screen 1605 is used for displaying UI (User Interface, user interface).
  • the UI can include graphics, text, icons, video, and any combination thereof.
  • the display screen 1605 also has the ability to acquire touch signals on or above the surface of the display screen 1605 .
  • the touch signal can be input to the processor 1601 as a control signal for processing.
  • the display screen 1605 may also be used to provide virtual buttons and/or virtual keyboards, also referred to as soft buttons and/or soft keyboards.
  • in some embodiments, there is one display screen 1605, which is arranged on the front panel of the terminal 1600; in other embodiments, there are at least two display screens 1605, which are respectively arranged on different surfaces of the terminal 1600 or adopt a folded design; in still other embodiments, the display screen 1605 may be a flexible display screen disposed on a curved or folding surface of the terminal 1600. The display screen 1605 can even be set as a non-rectangular irregular figure, that is, a special-shaped screen.
  • the display screen 1605 can be made of materials such as LCD (Liquid Crystal Display) and OLED (Organic Light-Emitting Diode).
  • the camera assembly 1606 is used to capture images or video.
  • the camera assembly 1606 includes a front camera and a rear camera.
  • the front camera is arranged on the front panel of the terminal, and the rear camera is arranged on the back of the terminal.
  • in some embodiments, there are at least two rear cameras, each of which is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize the background blur function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fused shooting functions.
  • the camera assembly 1606 may also include a flash.
  • the flash can be a single color temperature flash or a dual color temperature flash. Dual color temperature flash refers to the combination of warm light flash and cold light flash, which can be used for light compensation under different color temperatures.
  • Audio circuitry 1607 may include a microphone and speakers.
  • the microphone is used to collect the sound waves of the user and the environment, convert the sound waves into electrical signals and input them to the processor 1601 for processing, or to the radio frequency circuit 1604 to realize voice communication.
  • the microphone may also be an array microphone or an omnidirectional collection microphone.
  • the speaker is used to convert the electrical signal from the processor 1601 or the radio frequency circuit 1604 into sound waves.
  • the loudspeaker can be a traditional thin-film loudspeaker or a piezoelectric ceramic loudspeaker.
  • audio circuitry 1607 may also include a headphone jack.
  • the positioning component 1608 is used to locate the current geographic location of the terminal 1600 to implement navigation or LBS (Location Based Service).
  • the positioning component 1608 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
  • Power supply 1609 is used to power various components in terminal 1600 .
  • the power source 1609 may be alternating current, direct current, primary batteries, or rechargeable batteries.
  • the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. Wired rechargeable batteries are batteries that are charged through wired lines, and wireless rechargeable batteries are batteries that are charged through wireless coils.
  • the rechargeable battery can also be used to support fast charging technology.
  • terminal 1600 also includes one or more sensors 1610 .
  • the one or more sensors 1610 include, but are not limited to, an acceleration sensor 1611 , a gyro sensor 1612 , a pressure sensor 1613 , a fingerprint sensor 1614 , an optical sensor 1615 , and a proximity sensor 1616 .
  • the acceleration sensor 1611 can detect the magnitude of acceleration on the three coordinate axes of the coordinate system established by the terminal 1600 .
  • the acceleration sensor 1611 can be used to detect the components of the gravitational acceleration on the three coordinate axes.
  • the processor 1601 can control the display screen 1605 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1611 .
  • the acceleration sensor 1611 can also be used for game or user movement data collection.
  • the gyroscope sensor 1612 can detect the body direction and rotation angle of the terminal 1600 , and the gyroscope sensor 1612 can cooperate with the acceleration sensor 1611 to collect 3D actions of the user on the terminal 1600 .
  • the processor 1601 can implement the following functions according to the data collected by the gyro sensor 1612: motion sensing (such as changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.
  • the pressure sensor 1613 may be disposed on the side frame of the terminal 1600 and/or the lower layer of the display screen 1605 .
  • the processor 1601 can perform left and right hand identification or shortcut operations according to the holding signal collected by the pressure sensor 1613.
  • the processor 1601 controls the operability controls on the UI interface according to the user's pressure operation on the display screen 1605.
  • the operability controls include at least one of button controls, scroll bar controls, icon controls, and menu controls.
  • the fingerprint sensor 1614 is used to collect the user's fingerprint, and the processor 1601 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 1614, or the fingerprint sensor 1614 identifies the user's identity according to the collected fingerprint. When the user's identity is identified as a trusted identity, the processor 1601 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings.
  • the fingerprint sensor 1614 may be disposed on the front, back, or side of the terminal 1600 . When the terminal 1600 is provided with physical buttons or a manufacturer's logo, the fingerprint sensor 1614 may be integrated with the physical buttons or the manufacturer's logo.
  • Optical sensor 1615 is used to collect ambient light intensity.
  • the processor 1601 may control the display brightness of the display screen 1605 according to the ambient light intensity collected by the optical sensor 1615 . Specifically, when the ambient light intensity is high, the display brightness of the display screen 1605 is increased; when the ambient light intensity is low, the display brightness of the display screen 1605 is decreased.
  • the processor 1601 can also dynamically adjust the shooting parameters of the camera assembly 1606 according to the ambient light intensity collected by the optical sensor 1615 .
  • the proximity sensor 1616, also called a distance sensor, is provided on the front panel of the terminal 1600.
  • the proximity sensor 1616 is used to collect the distance between the user and the front of the terminal 1600.
  • when the proximity sensor 1616 detects that the distance between the user and the front of the terminal 1600 gradually decreases, the processor 1601 controls the display screen 1605 to switch from the screen-on state to the screen-off state; when the proximity sensor 1616 detects that the distance between the user and the front of the terminal 1600 gradually increases, the processor 1601 controls the display screen 1605 to switch from the screen-off state to the screen-on state.
  • FIG. 16 does not constitute a limitation on the terminal 1600, which may include more or fewer components than shown, combine some components, or adopt a different component arrangement.
  • FIG. 17 is a structural block diagram of a server according to an exemplary embodiment.
  • the server 1700 may vary greatly due to different configurations or performance, and may include one or more processors (Central Processing Units, CPU) 1701 and one or more memories 1702, where at least one piece of program code is stored in the memory 1702, and the at least one piece of program code is loaded and executed by the processor 1701 to implement the methods provided by the above method embodiments.
  • the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the server may also include other components for implementing device functions, which will not be described here.
  • a non-transitory computer-readable storage medium is provided; when the instructions in the storage medium are executed by the processor of the electronic device, the electronic device can execute the steps performed by the terminal or the server in the above voice signal processing method.
  • the storage medium may be a ROM (Read-Only Memory), a RAM (Random Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, or the like.
  • a computer program product is also provided; when the instructions in the computer program product are executed by the processor of the electronic device, the electronic device can execute the steps of the above voice signal processing method performed by the terminal or the server.
  • a method for processing a speech signal comprising:
  • the local attention network is called to process the non-local speech features of each speech signal frame separately, and the mixed speech features of each speech signal frame are obtained;
  • the original speech signal is denoised according to the denoising parameters to obtain the target speech signal.
  • determining the first speech feature of multiple speech signal frames in the original speech signal includes:
  • the feature extraction network is invoked to perform feature extraction on the original amplitudes of the multiple speech signal frames, respectively, to obtain the first speech features of the multiple speech signal frames.
  • the original speech signal is denoised according to the denoising parameters to obtain the target speech signal, including:
  • the target amplitude of each speech signal frame is obtained by weighting its original amplitude with the denoising parameters, and the original phase and target amplitude of the multiple speech signal frames are combined to obtain the target speech signal (see the sketch after this list).
  • the denoising parameters are obtained based on the mixed speech features of multiple speech signal frames, including:
  • the feature reconstruction network is called to perform feature reconstruction on the mixed speech features of multiple speech signal frames, and the denoising parameters are obtained.
  • the non-local attention network further includes a second fusion unit; the first fusion unit is invoked to fuse the second speech feature and the third speech feature of each speech signal frame, respectively, to obtain the non-local speech feature of each speech signal frame;
  • the processing method also includes:
  • invoking the second fusion unit to fuse the non-local speech feature of each speech signal frame with the first speech feature, to obtain the fused non-local speech feature of each speech signal frame.
  • the second processing unit further includes a feature reduction subunit; the convolution subunit is invoked to encode the weighted-fused first speech feature of each speech signal frame to obtain the encoded feature of each speech signal frame, and the feature reduction subunit is invoked to reduce the encoded features to obtain a plurality of reduced encoded features;
  • the processing method also includes:
  • invoking the deconvolution subunit to decode the encoded features of each voice signal frame to obtain the third voice feature of each voice signal frame, including:
  • invoking the deconvolution subunit to decode the plurality of reduced encoded features to obtain the third speech feature of each speech signal frame.
  • the speech processing model includes at least a non-local attention network and a local attention network;
  • the training process of the speech processing model is as follows:
  • the speech processing model is trained based on the difference between the predicted speech signal and the sample speech signal.
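
As referenced above, the following is a minimal sketch of how the denoising parameters might be applied, assuming the denoising parameter is a per-frame, per-frequency-bin mask in [0, 1] over the original amplitudes (the disclosure describes the parameter as the proportion of each speech signal frame that is not noise; the concrete mask form and NumPy implementation are illustrative assumptions):

```python
import numpy as np

def apply_denoising(frame_spectra, denoise_mask):
    """Weight the original amplitudes by the denoising parameters, recombine
    with the untouched original phases, and invert the per-frame transform.
    frame_spectra: complex (n_frames, n_bins) spectra of the original signal.
    denoise_mask: assumed mask in [0, 1] of the same shape (an illustrative
    form of the denoising parameter)."""
    magnitude = np.abs(frame_spectra)            # original amplitude
    phase = np.angle(frame_spectra)              # original phase, kept as-is
    target_magnitude = denoise_mask * magnitude  # target amplitude
    target_spectra = target_magnitude * np.exp(1j * phase)
    return np.fft.irfft(target_spectra, axis=-1)  # time-domain target frames
```

The resulting frames would then be overlap-added to reconstruct the target speech signal.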

Abstract

The present invention relates to the technical field of voice processing, and relates to a voice signal processing method and an electronic device. The method comprises: determining a plurality of first voice features in an original voice signal, each first voice feature corresponding to one voice signal frame in the original voice signal; processing the plurality of first voice features to obtain a plurality of non-local voice features, each non-local voice feature corresponding to one voice signal frame; processing the non-local voice feature of each voice signal frame in the original voice signal to obtain a mixed voice feature of each voice signal frame; obtaining a denoising parameter on the basis of the mixed voice features of the plurality of voice signal frames; and denoising the original voice signal on the basis of the denoising parameter to obtain a target voice signal.

Description

Voice signal processing method and electronic device
The present disclosure is based on, and claims priority to, Chinese patent application No. 202110125640.5 filed on January 29, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of voice processing, and in particular to a voice signal processing method and an electronic device.
Background
A collected speech signal usually contains noise, and the presence of noise adversely affects subsequent processing of the speech signal. Noise removal therefore plays a crucial role in speech signal processing.
Summary
According to an aspect of the embodiments of the present disclosure, a voice signal processing method is provided. The method includes: determining a plurality of first voice features of an original voice signal, each first voice feature corresponding to one voice signal frame in the original voice signal; processing the plurality of first voice features to obtain a plurality of non-local voice features, each non-local voice feature corresponding to one voice signal frame and obtained by fusing the first voice feature of the voice signal frame corresponding to the non-local voice feature with the first voice features of the other voice signal frames; processing the non-local voice feature of each voice signal frame in the original voice signal to obtain a mixed voice feature of each voice signal frame; obtaining a denoising parameter based on the mixed voice features of the plurality of voice signal frames; and denoising the original voice signal based on the denoising parameter to obtain a target voice signal.
According to another aspect of the embodiments of the present disclosure, a voice signal processing apparatus is provided. The apparatus includes: a feature determination unit configured to determine a plurality of first voice features of an original voice signal, each first voice feature corresponding to one voice signal frame in the original voice signal; a non-local feature acquisition unit configured to process the plurality of first voice features to obtain a plurality of non-local voice features, each non-local voice feature corresponding to one voice signal frame and obtained by fusing the first voice feature of the corresponding voice signal frame with the first voice features of the other voice signal frames; a mixed feature acquisition unit configured to process the non-local voice feature of each voice signal frame in the original voice signal to obtain a mixed voice feature of each voice signal frame; a denoising parameter acquisition unit configured to obtain a denoising parameter based on the mixed voice features of the plurality of voice signal frames; and a target signal acquisition unit configured to denoise the original voice signal based on the denoising parameter to obtain a target voice signal.
According to yet another aspect of the embodiments of the present disclosure, an electronic device is provided. The electronic device includes one or more processors and a memory for storing instructions executable by the one or more processors, wherein the one or more processors are configured to perform the steps of the voice signal processing method described above.
According to yet another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided. When the instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is caused to perform the steps of the voice signal processing method described above.
According to yet another aspect of the embodiments of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, performs the steps of the voice signal processing method described above.
In the method provided by the embodiments of the present disclosure, the context information of each speech signal frame is taken into account when acquiring its non-local speech feature; the non-local speech feature of each speech signal frame is then processed separately to obtain the speech feature of the frame itself, yielding a mixed speech feature. The denoising parameter obtained from this mixed speech feature is more accurate and can precisely represent the proportion of each speech signal frame that is not noise, so denoising the original speech signal with this parameter improves the denoising effect.
Description of the Drawings
Fig. 1 is a schematic diagram of a speech processing model according to an exemplary embodiment.
Fig. 2 is a schematic diagram of another speech processing model according to an exemplary embodiment.
Fig. 3 is a schematic diagram of another speech processing model according to an exemplary embodiment.
Fig. 4 is a flowchart of a method for processing a speech signal according to an exemplary embodiment.
Fig. 5 is a flowchart of another method for processing a speech signal according to an exemplary embodiment.
Fig. 6 is a schematic diagram of a non-local attention sub-model according to an exemplary embodiment.
Fig. 7 is a flowchart of a method for acquiring non-local speech features according to an exemplary embodiment.
Fig. 8 is a schematic diagram of a first processing network according to an exemplary embodiment.
Fig. 9 is a schematic diagram of a second processing network according to an exemplary embodiment.
Fig. 10 is a schematic diagram of another second processing network according to an exemplary embodiment.
Fig. 11 is a schematic diagram of a residual non-local sub-network according to an exemplary embodiment.
Fig. 12 is a schematic diagram of another non-local attention sub-model according to an exemplary embodiment.
Fig. 13 is a flowchart of another method for processing a speech signal according to an exemplary embodiment.
Fig. 14 is a block diagram of an apparatus for processing a speech signal according to an exemplary embodiment.
Fig. 15 is a block diagram of another apparatus for processing a speech signal according to an exemplary embodiment.
Fig. 16 is a structural block diagram of a terminal according to an exemplary embodiment.
Fig. 17 is a structural block diagram of a server according to an exemplary embodiment.
Detailed Description
The terms "first", "second", and the like in the description and claims of the present disclosure and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that the data so used may be interchanged under appropriate circumstances, so that the embodiments of the disclosure described herein can be implemented in sequences other than those illustrated or described herein.
The user information involved in the present disclosure (including but not limited to user device information, user personal information, and the like) is information authorized by the user or fully authorized by all parties.
In the related art, spectral subtraction is used to denoise the speech signal: the silent segments of the speech signal are located, a noise signal is extracted from the silent segments, and the noise is removed by subtracting the noise signal from the speech signal. However, when the noise in the speech signal changes, spectral subtraction has difficulty removing it and the denoising effect is poor.
The voice signal processing method provided by the embodiments of the present disclosure can be applied in a variety of scenarios.
For example, it can be applied in a live streaming scenario.
During a live broadcast, the anchor's voice signal collected by the anchor terminal may contain noise; if the viewer terminal plays this voice signal directly, the noise makes the voice unclear and degrades the viewing experience. In this case, the method provided by the embodiments of the present disclosure can remove the noise from the voice signal and improve its quality, so that the viewer terminal plays a clear voice signal and the live broadcast effect is improved.
For another example, it can be applied in an automatic speech recognition scenario.
During speech recognition, a noise signal in the speech signal affects recognition, lowering recognition accuracy and making it difficult to accurately recognize the content of the speech. In this case, the method provided by the embodiments of the present disclosure can first denoise the speech signal and then recognize the denoised signal, improving the accuracy of speech recognition.
The methods provided by the embodiments of the present disclosure can also be applied in scenarios such as video playback, language identification, speech synthesis and identity recognition.
Fig. 1 is a schematic diagram of a speech processing model provided according to an exemplary embodiment. The speech processing model includes a non-local attention network 101 and a local attention network 102, which are connected. The non-local attention network 101 processes the first speech features of the input original speech signal to obtain the non-local speech features of the original speech signal, and the local attention network 102 further processes the non-local speech features to obtain the mixed speech features of the original speech signal.
In some embodiments, referring to Fig. 2, the speech processing model further includes a feature extraction network 103, a feature reconstruction network 104 and a speech denoising network 105. The feature extraction network 103 is connected to the non-local attention network 101, the feature reconstruction network 104 is connected to the local attention network 102, and the speech denoising network 105 is connected to the feature reconstruction network 104. The feature extraction network 103 extracts the first speech features of the original speech signal, the feature reconstruction network 104 performs feature reconstruction on the processed mixed speech features to obtain the denoising parameters of the original speech signal, and the speech denoising network 105 denoises the original speech signal.
In some embodiments, the speech processing model includes multiple non-local attention networks 101 and multiple local attention networks 102, which can be connected in sequence in any order. For example, referring to Fig. 3, the speech processing model includes two non-local attention networks 101 and two local attention networks 102: the feature extraction network 103 is connected to the first non-local attention network 101, which is connected to the first local attention network 102; the first local attention network 102 is connected to the second local attention network 102, which is connected to the second non-local attention network 101; and the second non-local attention network 101 is connected to the feature reconstruction network 104.
In the embodiments of the present disclosure, the non-local attention network may be called the non-local attention sub-model, the local attention network the local attention sub-model, the feature extraction network the feature extraction sub-model, the feature reconstruction network the feature recognition sub-model, and the speech denoising network the speech denoising sub-model.
The voice signal processing method provided by the embodiments of the present disclosure is performed by an electronic device, which is a terminal or a server. The terminal is a portable, pocket-sized, handheld or other type of terminal, such as a mobile phone, a computer or a tablet computer. The server is a single server, a server cluster composed of several servers, or a cloud computing service center.
Fig. 4 is a flowchart of a method for processing a speech signal according to an exemplary embodiment. Referring to Fig. 4, the method is executed by an electronic device and includes the following steps:
401. Determine multiple first voice features of the original voice signal, each first voice feature corresponding to one voice signal frame in the original voice signal.
402. Call the non-local attention sub-model to fuse the multiple first voice features to obtain multiple non-local voice features, each non-local voice feature corresponding to one voice signal frame and obtained by fusing the first voice feature of the corresponding voice signal frame with the first voice features of the other voice signal frames.
403. Call the local attention sub-model to separately process the non-local voice feature of each voice signal frame in the original voice signal to obtain the mixed voice feature of each voice signal frame.
404. Obtain a denoising parameter based on the mixed voice features of the multiple voice signal frames.
405. Denoise the original voice signal based on the denoising parameter to obtain the target voice signal.
In the method provided by the embodiments of the present disclosure, the non-local attention sub-model is called to obtain the non-local voice feature of each voice signal frame, taking the frame's context information into account; the local attention sub-model is then called to process the non-local voice feature of each frame separately, obtaining the voice feature of the frame itself and thus a mixed voice feature. The denoising parameter obtained from this mixed voice feature is more accurate and can precisely represent the proportion of each voice signal frame that is not noise, so denoising the original voice signal with this parameter improves the denoising effect.
Fig. 5 is a flowchart of another method for processing a speech signal according to an exemplary embodiment. Referring to Fig. 5, the method is executed by an electronic device and includes the following steps:
501. The electronic device acquires the original amplitude and original phase of multiple speech signal frames in the original speech signal.
Since a speech signal consists of amplitude and phase, and the noise in a speech signal is contained in the amplitude, the embodiments of the present disclosure acquire the original amplitude and original phase of each speech signal frame in the original speech signal and denoise only the original amplitude, thereby denoising the original speech signal without processing the original phase, which reduces the amount of processing. The original speech signal is collected by the electronic device, or is a noise-containing speech signal sent to it by another electronic device; the noise signal is, for example, environmental noise, white noise or another type of noise.
The original speech signal includes multiple speech signal frames. The electronic device performs a Fourier transform on each speech signal frame to obtain its original amplitude and original phase, and subsequently processes the original amplitude of each frame to denoise it. The Fourier transform includes the fast Fourier transform, the short-time Fourier transform, and the like.
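As a concrete illustration of step 501, the following is a minimal sketch of per-frame amplitude and phase extraction, assuming a Hann window and illustrative frame and hop sizes (the disclosure does not fix these values):

```python
import numpy as np

def frame_magnitude_phase(signal, frame_len=512, hop=256):
    """Split a waveform into frames and take each frame's Fourier transform,
    returning the original amplitude (magnitude) and original phase per frame.
    frame_len and hop are illustrative assumptions."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=-1)  # per-frame Fourier transform
    return np.abs(spectra), np.angle(spectra)
```

Only the magnitudes would then be passed on for feature extraction; the phases are kept unchanged for later reconstruction.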
In some embodiments, since the signal length that the speech processing model can process at a time is limited (for example, one minute or two minutes of speech at a time), the signal length of the original speech signal cannot exceed a reference signal length, that is, the duration of the original speech signal cannot exceed a reference duration, where the reference signal length is any length and the reference duration is any duration. For example, the reference signal length is 64 speech signal frames.
502. The electronic device invokes the feature extraction sub-model to perform feature extraction on the original amplitudes of the multiple speech signal frames, respectively, to obtain the first speech features of the multiple speech signal frames.
That is, the electronic device invokes the feature extraction sub-model to perform feature extraction on the original amplitude of each speech signal frame, obtaining the first speech feature of each frame, that is, the multiple first speech features of the original speech signal.
The first speech feature of a speech signal frame describes the corresponding frame and is represented as a vector, a matrix or another form. Optionally, the first speech features of the multiple speech signal frames are represented separately, or combined together: for example, if the first speech feature of each frame is a vector, the multiple vectors are combined into a matrix in which each column represents the first speech feature of one speech signal frame.
In some embodiments, the feature extraction sub-model includes a convolution layer, a batch normalization layer and an activation function layer.
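As a rough sketch of such a feature extraction sub-model, the block below stacks one convolution, batch normalization and activation layer, as described above; the channel counts, kernel size and the choice of ReLU are assumptions:

```python
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Sketch of the feature extraction sub-model: convolution, batch
    normalization and activation. Input is a batch of original amplitudes
    shaped (batch, 1, frames, freq_bins); output is the first speech features."""
    def __init__(self, in_ch=1, out_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
        )

    def forward(self, magnitudes):
        return self.net(magnitudes)  # one feature map column per signal frame
```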
503. The electronic device invokes the non-local attention sub-model to fuse the first speech features of the multiple speech signal frames to obtain the non-local speech feature of each speech signal frame.
That is, the electronic device invokes the non-local attention sub-model to process the multiple first speech features to obtain multiple non-local speech features. Each non-local speech feature corresponds to one speech signal frame and is obtained by fusing the first speech feature of the corresponding frame with the first speech features of the other speech signal frames. In other words, the non-local speech feature of each frame combines the first speech features of multiple frames, taking into account the features of the frames before and after it.
In the embodiments of the present disclosure, the non-local attention sub-model uses an attention mechanism and residual learning to process the first speech features. When the first speech feature of each speech signal frame is processed, the frame's context information is considered, making the resulting non-local speech feature more accurate. Because some speech features are lost when the first speech feature is processed, residual learning combines the processed result with the input first speech feature to obtain the non-local speech feature, avoiding the loss of important speech features in the process.
In some embodiments, referring to Fig. 6, the non-local attention sub-model includes a first processing unit, a second processing unit, a first fusion unit and a second fusion unit. The first processing unit may be called the first processing network, the second processing unit the second processing network, the first fusion unit the first fusion network, and the second fusion unit the second fusion network; the first processing network is a trunk branch (Trunk Branch) and the second processing network is a mask branch (Mask Branch). The first processing network and the second processing network separately process the first speech features of the input speech signal frames, the first fusion network fuses the features produced by the two processing networks, and the second fusion network fuses the output of the first fusion network with the features input to the non-local attention sub-model.
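A minimal sketch of this trunk/mask layout follows. The disclosure states only that the first fusion network fuses the two branch outputs and the second fusion network fuses that result with the block input; the elementwise product for the first fusion and the addition for the second are assumptions in the spirit of attention-plus-residual designs:

```python
import torch.nn as nn

class NonLocalAttentionBlock(nn.Module):
    """Sketch of the non-local attention sub-model's overall layout."""
    def __init__(self, trunk: nn.Module, mask: nn.Module):
        super().__init__()
        self.trunk = trunk  # first processing network (Trunk Branch)
        self.mask = mask    # second processing network (Mask Branch)

    def forward(self, x):
        fused = self.trunk(x) * self.mask(x)  # first fusion network (assumed product)
        return fused + x                      # second fusion network (residual add)
```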
The process by which the electronic device invokes the non-local attention sub-model to process the first speech feature of each speech signal frame is shown in Fig. 7 and includes the following steps:
701. The electronic device invokes the first processing network to perform feature extraction on the first speech features of the multiple speech signal frames, respectively, to obtain the second speech feature of each speech signal frame.
That is, the first processing network is invoked to perform feature extraction on the first speech feature of each speech signal frame to obtain its second speech feature. The second speech feature is obtained by further extraction from the first speech feature and contains fewer noise features than the first speech feature.
In some embodiments, referring to Fig. 8, the first processing network includes multiple atrous residual sub-units (Res. Unit), which may be called atrous residual sub-networks; Fig. 8 only takes two atrous residual sub-networks as an example. Each atrous residual sub-network includes an atrous (dilated) convolution layer, a batch normalization layer and an activation function layer, and the multiple atrous residual sub-networks are connected using the network structure of a residual learning network. The atrous convolution layer enlarges the receptive field and captures more context information.
In some embodiments, the non-local attention sub-model further includes at least one atrous residual unit, which may be called an atrous residual network. Each atrous residual network includes two atrous residual sub-networks connected using the network structure of a residual learning network. Before invoking the first processing network and the second processing network to process the first speech feature of each speech signal frame, the electronic device first invokes the at least one atrous residual network to perform feature extraction on the first speech feature of each frame, obtaining a further-extracted first speech feature, which the first processing network and the second processing network then process. Invoking the first processing network, which includes multiple atrous residual sub-networks, further extracts the first speech features to obtain deeper speech features.
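A sketch of a single atrous residual sub-network follows, assuming 2D features (frames × frequency bins), a 3×3 kernel and a dilation rate of 2 (all illustrative choices):

```python
import torch.nn as nn

class AtrousResidualUnit(nn.Module):
    """Sketch of one atrous residual sub-network: dilated convolution,
    batch normalization and activation, wrapped in a residual skip."""
    def __init__(self, channels=64, dilation=2):
        super().__init__()
        self.body = nn.Sequential(
            # The dilated convolution enlarges the receptive field, so each
            # frame's feature sees more of its context.
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )

    def forward(self, x):
        return x + self.body(x)  # residual connection preserves the input features
```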
702、电子设备调用第二处理网络将每个语音信号帧的第一语音特征分别与其他语音信号帧的第一语音特征进行融合,得到每个语音信号帧的第三语音特征。702. The electronic device invokes the second processing network to fuse the first voice features of each voice signal frame with the first voice features of other voice signal frames respectively to obtain a third voice feature of each voice signal frame.
对于每个语音信号帧,电子设备调用第二处理网络,将该语音信号帧的第一语音特征分别与除该语音信号帧之外的其他语音信号帧的第一语音特征进行融合,得到该语音信号帧的第三语音特征。其中,每个语音信号帧的第三语音特征是结合了其他语音信号帧的第一语音特征得到的。For each voice signal frame, the electronic device invokes the second processing network, and fuses the first voice feature of the voice signal frame with the first voice features of other voice signal frames except the voice signal frame to obtain the voice The third speech feature of the signal frame. The third voice feature of each voice signal frame is obtained by combining the first voice features of other voice signal frames.
在一些实施例中,参见图9,第二处理网络包括残差非局部子单元、卷积子单元和反卷积子单元。其中,残差非局部子单元可称为残差非局部子网络,卷积子单元可称为卷积子网络,反卷积子单元可称为反卷积子网络。电子设备调用残差非局部子网络,基于多个语音信号帧的权重,将每个语音信号帧的第一语音特征分别与其他语音信号帧的第一语音特征进行加权融合,得到每个语音信号帧加权融合后的第一语音特征;调用卷积子网络,对每个语音信号帧加权融合后的第一语音特征进行编码,得到每个语音信号帧的编码特征;调用反卷积子网络,对每个语音信号帧的编码特征进行解码,得到每个语音信号帧的第三语音特征。也即是,对于每个语音信号帧,调用残差非局部子网络,基于多个语音信号帧的权重,将该语音信号帧的第一语音特征分别与除该语音信号帧之外的其他语音信号帧的第一语音特征进行加权融合,得到该语音信号帧加权融合后的第一语音特征。In some embodiments, referring to Figure 9, the second processing network includes a residual non-local subunit, a convolution subunit, and a deconvolution subunit. Among them, the residual non-local sub-unit may be referred to as a residual non-local sub-network, the convolution sub-unit may be referred to as a convolution sub-network, and the deconvolution sub-unit may be referred to as a de-convolution sub-network. The electronic device calls the residual non-local sub-network, and based on the weights of multiple voice signal frames, weighted fusion of the first voice feature of each voice signal frame and the first voice features of other voice signal frames respectively, to obtain each voice signal The first speech feature after frame weighted fusion; the convolution sub-network is called to encode the first speech feature after weighted fusion of each speech signal frame, and the encoded feature of each speech signal frame is obtained; the deconvolution sub-network is called, Decoding the encoded features of each voice signal frame to obtain a third voice feature of each voice signal frame. That is, for each speech signal frame, the residual non-local sub-network is called, and based on the weights of multiple speech signal frames, the first speech feature of the speech signal frame is respectively compared with other speech features except the speech signal frame. The first speech features of the signal frame are weighted and fused to obtain the weighted and fused first speech features of the speech signal frame.
在一些实施例中,参见图10,第二处理网络还包括多个特征缩小子单元、多个第一空洞残差子单元、多个第二空洞残差子单元和激活函数子单元,其中,特征缩小子单元可称为特征缩小子网络,第一空洞残差子单元可称为第一空洞残差子网络,第二空洞残差子单元可称为第二空洞残差子网络,激活函数子单元可称为激活函数子网络。残差非局部子网络与第一个第一空洞残差子网络连接,多个第一空洞残差子网络依次连接,最后一个空洞残差子网络与卷积子网络连接,卷积子网络与第一个特征缩小子网络连接,多个特征缩小子网络依次连接,最后一个特征缩小子网络与反卷积子网络连接,反卷积子网络与第一个第二空洞残差子网络连接,多个第二空洞残差子网络依次连接,最后一个空洞残差子网络与激活函数子网络连接。另外,图10仅是以两个第一空洞残差子网络、两个第二空洞残差子网络和两个特征缩小子网络为例,第一空洞残差子单、第二空洞残差子网络和特征缩小子网络还可以是其他数量。In some embodiments, referring to FIG. 10 , the second processing network further includes a plurality of feature reduction subunits, a plurality of first hole residual subunits, a plurality of second hole residual subunits, and an activation function subunit, wherein, The feature reduction sub-unit may be called the feature reduction sub-network, the first hole residual sub-unit may be called the first hole residual sub-network, the second hole residual sub-unit may be called the second hole residual sub-network, and the activation function The subunits may be referred to as activation function sub-networks. The residual non-local sub-network is connected to the first first hole residual sub-network, multiple first hole residual sub-networks are connected in turn, and the last hole residual sub-network is connected to the convolution sub-network, and the convolution sub-network is connected to the convolution sub-network. The first feature reduction sub-network is connected, multiple feature reduction sub-networks are connected in turn, the last feature reduction sub-network is connected with the deconvolution sub-network, and the deconvolution sub-network is connected with the first and second hole residual sub-network, A plurality of second hole residual sub-networks are connected in sequence, and the last hole residual sub-network is connected with the activation function sub-network. In addition, Figure 10 only takes two first hole residual sub-networks, two second hole residual sub-networks and two feature reduction sub-networks as examples. Other numbers of networks and feature reduction sub-networks are also possible.
The activation function in the activation function sub-network is a Sigmoid function or another activation function. Optionally, the first dilated residual sub-networks are identical to the second dilated residual sub-networks, or the two differ; each dilated residual sub-network includes a dilated (atrous) convolutional layer, a batch normalization layer, and an activation function layer. Optionally, the feature reduction sub-network is itself a kind of dilated residual sub-network.
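As a concrete illustration, the following is a minimal PyTorch sketch of one dilated residual sub-network as just described (a dilated convolution, batch normalization, and an activation, wrapped in a residual connection); the kernel size and the choice of PReLU are assumptions made for illustration, not fixed by the disclosure.

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """One dilated residual sub-network: dilated conv + batch norm + activation,
    with a residual (skip) connection around it."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        # padding = dilation keeps the time-frequency shape unchanged for kernel 3
        self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, T frames, K frequency bins)
        return x + self.act(self.bn(self.conv(x)))
```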
In some embodiments, the electronic device invokes the multiple first dilated residual sub-networks to process the weighted-fused first speech feature of each speech signal frame, obtaining a further-processed first speech feature of each speech signal frame; invokes the convolution sub-network to encode the further-processed first speech feature of each speech signal frame, obtaining the encoded feature of each speech signal frame; invokes the multiple feature reduction sub-networks to perform feature reduction on the encoded feature of each speech signal frame, obtaining multiple reduced encoded features; invokes the deconvolution layer to decode the multiple reduced encoded features, obtaining the decoded speech feature of each speech signal frame; and invokes the multiple second dilated residual sub-networks to process the decoded speech feature of each speech signal frame, obtaining the third speech feature of each speech signal frame. Reducing the encoded features shrinks them, which cuts the amount of computation and speeds up their subsequent processing.
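The encode / reduce / decode path can be sketched as below, reusing the imports and DilatedResidualBlock from the previous sketch; the strides, channel count, and use of strided convolutions for feature reduction are illustrative assumptions (the disclosure does not fix them), chosen only to show how the reduction shrinks the encoded features before the deconvolution restores their resolution. The residual non-local stage that precedes this path is omitted here.

```python
class EncodeReduceDecode(nn.Module):
    """Sketch of the dilated-residual / conv / feature-reduction / deconv /
    dilated-residual / activation path of the second processing network."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.pre = nn.Sequential(DilatedResidualBlock(channels, 1),
                                 DilatedResidualBlock(channels, 2))   # first dilated residual sub-networks
        self.encode = nn.Conv2d(channels, channels, 3, padding=1)     # convolution sub-network
        # feature reduction: two strided convs halve the feature map twice
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.PReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.PReLU())
        # deconvolution sub-network restores the original resolution
        # (assumes T and K are divisible by 4)
        self.decode = nn.ConvTranspose2d(channels, channels, 4, stride=4)
        self.post = nn.Sequential(DilatedResidualBlock(channels, 1),
                                  DilatedResidualBlock(channels, 2),  # second dilated residual sub-networks
                                  nn.Sigmoid())                       # activation function sub-network

    def forward(self, x):
        x = self.pre(x)
        x = self.reduce(self.encode(x))
        x = self.decode(x)
        return self.post(x)
```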
In some embodiments, the residual non-local sub-network includes a first fusion layer and a second fusion layer. The electronic device invokes the first fusion layer to perform, based on the weights of the multiple speech signal frames, weighted fusion of the first speech feature of each speech signal frame with the first speech features of the speech signal frames other than that frame, obtaining the fusion feature of each speech signal frame; and invokes the second fusion layer to fuse, for each speech signal frame, the first speech feature with the fusion feature, obtaining the weighted-fused first speech feature of each speech signal frame. That is, for each speech signal frame, the first fusion layer is invoked to perform, based on the weights of the multiple speech signal frames, weighted fusion of the first speech feature of that frame with the first speech features of the other speech signal frames, obtaining the fusion feature of that frame.
In the embodiments of the present disclosure, the first fusion layer is invoked to fuse the first speech features of different speech signal frames according to their corresponding weights, yielding a more accurate fusion feature. Moreover, when both the first fusion layer and the second fusion layer are included, the residual non-local sub-network is a residual learning network: fusing the fusion feature with the input first speech feature makes the final weighted-fused first speech feature more accurate and prevents the fusion feature from losing important information, improving the accuracy of the weighted-fused first speech feature. In addition, a residual learning network is easier to optimize, which improves training efficiency during model training.
In some embodiments, referring to Figure 11, which illustrates the processing of three speech signal frames as an example, the residual non-local sub-network further includes multiple convolutional layers, a third fusion layer, and a normalization layer. The third fusion layer is connected to two of the convolutional layers and fuses the first speech features processed by those two layers; the third fusion layer is connected to the normalization layer, which normalizes the fused speech features output by the third fusion layer; the normalization layer is connected to the first fusion layer, which fuses the first speech features processed by another convolutional layer with the normalized speech features output by the normalization layer, obtaining the fusion feature of each speech signal frame. The fusion feature is then processed by a further convolutional layer and fused with the first speech feature, obtaining the weighted-fused first speech feature.
In some embodiments, the first fusion layer and the third fusion layer fuse speech features by matrix multiplication, and the second fusion layer fuses speech features by matrix addition. Optionally, for each speech signal frame, the first speech feature of the frame has the form T*K*C, representing the speech feature C corresponding to time T and frequency K; in order to multiply or add the speech features of different speech signal frames, the speech features must first be reshaped.
For example, the residual non-local sub-network processes the first speech feature of speech signal frame x_i with the following formula:
o_i = W_z y_i + x_i = W_z softmax((W_u x_i)^T (W_v x_j)) (W_g x_j) + x_i;
where o_i denotes the weighted-fused first speech feature of speech signal frame x_i; W_z, W_u, W_v, and W_g are known model parameters; softmax denotes normalization; x_j denotes the speech signal frames other than x_i; and y_i denotes the fusion feature of speech signal frame x_i.
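A minimal PyTorch sketch of this residual non-local computation follows (an embedded-Gaussian non-local block applied jointly over all T*K positions of a spectrogram-like feature map). Parameterizing W_u, W_v, W_g, and W_z as 1x1 convolutions, and the inner channel width, are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualNonLocal(nn.Module):
    """Sketch of o_i = W_z softmax((W_u x_i)^T W_v x_j)(W_g x_j) + x_i."""
    def __init__(self, channels: int, inner: int):
        super().__init__()
        self.w_u = nn.Conv2d(channels, inner, 1)  # W_u
        self.w_v = nn.Conv2d(channels, inner, 1)  # W_v
        self.w_g = nn.Conv2d(channels, inner, 1)  # W_g
        self.w_z = nn.Conv2d(inner, channels, 1)  # W_z

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, k = x.shape
        n = t * k
        u = self.w_u(x).flatten(2).transpose(1, 2)   # (b, n, inner)
        v = self.w_v(x).flatten(2)                   # (b, inner, n)
        g = self.w_g(x).flatten(2).transpose(1, 2)   # (b, n, inner)
        attn = F.softmax(u @ v, dim=-1)              # pairwise weights across positions
        y = (attn @ g).transpose(1, 2).reshape(b, -1, t, k)  # fusion feature y_i
        return self.w_z(y) + x                       # second fusion layer: residual add
```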
703. The electronic device invokes the first fusion network to fuse, for each speech signal frame, the second speech feature and the third speech feature, obtaining the non-local speech feature of each speech signal frame.
In some embodiments, the first fusion network is a multiplication unit; that is, the second speech feature and the third speech feature of each speech signal frame are multiplied to obtain the fused non-local speech feature.
704. The electronic device invokes the second fusion network to fuse the non-local speech feature of each speech signal frame with the first speech feature, obtaining the fused non-local speech feature of each speech signal frame.
In some embodiments, the second fusion network is an addition unit; that is, the electronic device adds the non-local speech feature of each speech signal frame to the first speech feature, obtaining the fused non-local speech feature of each speech signal frame.
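In code, these two fusion networks reduce to an elementwise multiply followed by a residual add; the sketch below assumes the two branch outputs have the same shape as the first speech feature.

```python
def fuse_branches(first_feat, second_feat, third_feat):
    """First fusion network (step 703): elementwise multiplication of the two
    branch outputs; second fusion network (step 704): residual addition of the
    input first speech feature."""
    non_local = second_feat * third_feat   # multiplication unit
    return non_local + first_feat          # addition unit (residual)
```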
In the embodiment shown in Figure 7, different networks of the non-local attention sub-model process the first speech features in different respects. The first processing network, which includes multiple dilated residual sub-networks, further extracts the first speech features to obtain deeper speech features, while the second processing network uses a non-local attention mechanism: when processing the first speech feature of each speech signal frame, it takes into account the speech signal frames other than that frame, i.e., it combines context information to obtain more accurate speech features. The first fusion network is then invoked to fuse the speech features obtained by the two processing networks, yielding the non-local speech features. In addition, the dilated residual sub-networks enlarge the receptive field, which also captures more context information.
Moreover, when the non-local attention sub-model includes the second fusion network, the sub-model is a residual learning network: after the non-local speech features are obtained, they are fused with the input first speech features, which makes the final non-local speech features more accurate, prevents them from losing important information, and improves their accuracy. In addition, a residual learning network is easier to optimize, which improves training efficiency during model training.
In addition, in some embodiments, referring to Figure 12, the non-local attention sub-model includes multiple dilated residual units, which may be referred to as dilated residual networks. The electronic device first invokes the multiple dilated residual networks to process the input first speech feature of each speech signal frame, then feeds the processed first speech features into the first processing network and the second processing network. Likewise, after the non-local speech features are obtained through the second fusion network, the multiple dilated residual networks are invoked to process them, and the processed non-local speech features are fed into the subsequent local attention sub-model. Figure 12 merely takes four dilated residual networks as an example.
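Putting the pieces together, the sketch below assembles an illustrative non-local attention sub-model from the blocks defined above; the branch compositions and block counts are assumptions that mirror the description, not the exact configuration of Figures 7 and 12.

```python
class NonLocalAttentionSubModel(nn.Module):
    """Illustrative assembly: dilated residual pre-processing, a first
    (feature-extraction) branch and a second (non-local) branch, multiplicative
    fusion, residual addition, and dilated residual post-processing."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.pre = nn.Sequential(DilatedResidualBlock(channels, 1),
                                 DilatedResidualBlock(channels, 2))
        self.branch1 = nn.Sequential(DilatedResidualBlock(channels, 1),
                                     DilatedResidualBlock(channels, 2))  # first processing network
        self.branch2 = ResidualNonLocal(channels, channels // 2)         # second processing network (simplified)
        self.post = nn.Sequential(DilatedResidualBlock(channels, 1),
                                  DilatedResidualBlock(channels, 2))

    def forward(self, x):
        x = self.pre(x)
        fused = self.branch1(x) * self.branch2(x)  # first fusion network
        return self.post(fused + x)                # second fusion network
```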
504. The electronic device invokes the local attention sub-model to process the non-local speech features of each speech signal frame separately, obtaining the mixed speech feature of each speech signal frame.
The mixed speech features no longer contain noise features, and the mixed speech feature of each speech signal frame is obtained after the speech features of the other speech signal frames have been taken into account, making it more accurate.
In the embodiments of the present disclosure, the network structure of the local attention sub-model is similar to that of the non-local attention sub-model, except that the local attention sub-model does not include the residual non-local sub-network; the network structure of the local attention sub-model is therefore not described again here.
It should be noted that the embodiments of the present disclosure are described with one non-local attention sub-model and one local attention sub-model as an example. Another embodiment includes multiple non-local attention sub-models and multiple local attention sub-models; that is, after the mixed speech features are obtained, they can be fed into a subsequent non-local attention sub-model or local attention sub-model for further processing, so as to obtain more accurate mixed speech features.
In the embodiments of the present disclosure, for each speech signal frame, the electronic device invokes the local attention sub-model to process the non-local speech feature of that frame on its own; during this processing, the non-local speech features of the speech signal frames other than that frame, i.e., the context information of the frame, are no longer considered, so the processing captures the speech features of the frame itself. Since the context information of the frame was already taken into account when the non-local speech features were obtained, the resulting mixed speech feature reflects both the speech characteristics of the frame within the whole speech signal and the speech characteristics of the frame itself.
505. The electronic device invokes the feature recognition sub-model to perform feature recognition on the mixed speech features of the multiple speech signal frames, obtaining the denoising parameters.
The feature recognition sub-model performs feature recognition on the mixed speech features of the multiple speech signal frames. For the mixed speech feature of each speech signal frame, the sub-model can identify, from the mixed speech feature, the ratio between the noise signal in the corresponding speech signal frame and the speech signal other than the noise signal. Identifying the multiple mixed speech features separately yields the denoising parameters corresponding to the multiple speech signal frames, i.e., the denoising parameters corresponding to the original speech signal. A denoising parameter represents the proportion of the speech signal other than the noise signal within a speech signal frame, and can subsequently be used to denoise the original speech signal. Optionally, the denoising parameters are expressed as a matrix, in which each element represents the denoising parameter of one speech signal frame, or one column or one row of elements represents the denoising parameters of one speech signal frame. The feature recognition sub-model is a convolutional network or another type of network.
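Since the denoising parameters are bounded ratios, one natural realization of the feature recognition sub-model is a convolution followed by a Sigmoid that emits one value in [0, 1] per time-frequency bin; the following is a sketch under that assumption, not the configuration fixed by the disclosure.

```python
class FeatureRecognition(nn.Module):
    """Sketch: map mixed speech features to denoising parameters (a mask)
    in [0, 1], one value per time-frequency bin."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, mixed_feat: torch.Tensor) -> torch.Tensor:
        # mixed_feat: (batch, channels, T, K) -> mask: (batch, 1, T, K)
        return torch.sigmoid(self.conv(mixed_feat))
```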
506. The electronic device invokes the speech denoising sub-model to denoise the original amplitudes of the multiple speech signal frames according to the denoising parameters, obtaining the target amplitudes of the multiple speech signal frames.
In some embodiments, the speech denoising sub-model is a multiplication network: the denoising parameters are multiplied by the multiple original amplitudes to obtain the target amplitudes of the multiple speech signal frames, and the target amplitudes no longer contain the noise signal. Optionally, when the denoising parameters form a matrix, each element of the matrix is multiplied by the original amplitude of the corresponding speech signal frame, or one column or one row of elements of the matrix is multiplied by the original amplitude of the corresponding speech signal frame.
507. The electronic device combines the original phases and the target amplitudes of the multiple speech signal frames to obtain the target speech signal.
In some embodiments, the electronic device performs an inverse Fourier transform on the original phases and the target amplitudes of the multiple speech signal frames to obtain the target speech signal, i.e., the speech signal after the noise signal has been removed.
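Steps 506 and 507 amount to magnitude masking followed by inverse STFT reconstruction with the original phase. The sketch below uses torch.stft/torch.istft with an assumed frame configuration; `model` is a placeholder assumed to map the magnitude spectrogram to a same-shaped mask of denoising parameters.

```python
import torch

def denoise_waveform(wav: torch.Tensor, model, n_fft: int = 512, hop: int = 128):
    """Sketch of steps 506-507: mask the magnitudes, keep the phases,
    and reconstruct the waveform (frame sizes are assumed)."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(wav, n_fft, hop_length=hop, window=window,
                      return_complex=True)           # (freq bins, frames)
    magnitude, phase = spec.abs(), spec.angle()      # original amplitude / phase
    mask = model(magnitude)                          # denoising parameters in [0, 1]
    target_mag = mask * magnitude                    # step 506: multiplication network
    target_spec = torch.polar(target_mag, phase)     # recombine with original phase
    return torch.istft(target_spec, n_fft, hop_length=hop, window=window)  # step 507
```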
This way of denoising the original amplitudes of the speech signal frames only needs to process the amplitudes of the speech signal, not the phases, which reduces the features to be processed and increases the processing speed.
With the method provided by the embodiments of the present disclosure, the non-local attention sub-model is invoked to obtain the non-local speech feature of each speech signal frame, taking the context information of the frame into account; the local attention sub-model is then invoked to process the non-local speech feature of each frame separately, capturing the speech features of the frame itself and thereby yielding speech features in a mixed form. The denoising parameters obtained from these mixed-form speech features are more accurate, accurately representing the proportion of the signal other than the noise signal in each speech signal frame; denoising the original speech signal with these parameters therefore improves the denoising effect on the original speech signal.
Furthermore, since the noise signal of a speech signal frame resides in the frame's original amplitude, feature extraction is performed on the original amplitude of each speech signal frame, and the original amplitudes are denoised according to the obtained denoising parameters, yielding target amplitudes that no longer contain the noise signal. This denoises the original amplitudes of the original speech signal; the target speech signal without the noise signal can then be recovered from the target amplitudes and the original phases, achieving denoising of the original speech signal. This way of denoising only needs to process the amplitudes of the speech signal, not the phases, which reduces the features to be processed.
In addition, before the speech processing model is invoked to process the original speech signal, it needs to be trained. The training process is as follows: obtain a sample speech signal and a sample noise signal; mix the sample speech signal with the sample noise signal to obtain a sample mixed signal; invoke the speech processing model to process the multiple sample speech signal frames in the sample mixed signal, obtaining the predicted denoising parameters corresponding to the sample mixed signal; denoise the sample mixed signal based on the predicted denoising parameters, obtaining a denoised predicted speech signal; and train the speech processing model based on the difference between the predicted speech signal and the sample speech signal. The sample speech signal is a clean speech signal that contains no noise signal. Moreover, since the speech processing model adopts the network structure of a residual learning network, the training speed of the model is improved during training.
For example, sample speech signals of multiple users are obtained from a speech database, and various sample noise signals are obtained from a noise database; the sample noise signals are mixed with the sample speech signals at different signal-to-noise ratios to obtain multiple sample mixed signals, which are then used to train the speech processing model.
In some embodiments, the sample amplitudes of the multiple sample speech signal frames in the sample mixed signal are obtained, and the speech processing model is invoked to process the multiple sample amplitudes, obtaining the predicted denoising parameters corresponding to the sample mixed signal; the sample amplitudes are denoised based on the predicted denoising parameters, obtaining the predicted amplitude of each speech signal frame, and the speech processing model is trained based on the difference between the predicted amplitude of each speech signal frame and the amplitudes of the corresponding speech signal frames in the sample speech signal.
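A minimal training-step sketch under these assumptions follows: the clean sample and noise are mixed at a chosen SNR, the model predicts a magnitude mask, and the loss is the difference between predicted and clean magnitudes. The SNR mixing rule, the MSE loss, and the STFT settings are illustrative choices, not specified by the disclosure; the noise is assumed to be at least as long as the speech.

```python
import torch
import torch.nn.functional as F

def mix_at_snr(speech: torch.Tensor, noise: torch.Tensor, snr_db: float):
    """Scale the noise so the mixture has the requested SNR (assumed rule)."""
    noise = noise[: speech.numel()]
    scale = (speech.pow(2).mean() /
             (noise.pow(2).mean() * 10 ** (snr_db / 10))).sqrt()
    return speech + scale * noise

def train_step(model, optimizer, speech, noise, snr_db, n_fft=512, hop=128):
    window = torch.hann_window(n_fft)
    mixed = mix_at_snr(speech, noise, snr_db)
    noisy_mag = torch.stft(mixed, n_fft, hop_length=hop, window=window,
                           return_complex=True).abs()   # sample amplitudes
    clean_mag = torch.stft(speech, n_fft, hop_length=hop, window=window,
                           return_complex=True).abs()
    predicted_mag = model(noisy_mag) * noisy_mag        # apply predicted denoising parameters
    loss = F.mse_loss(predicted_mag, clean_mag)         # difference to the clean amplitudes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```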
For example, when training the speech processing model, the convolution kernels, filters, and convolution parameters of the convolutional layers in the speech processing model are set as shown in Table 1 below:
Table 1 (provided as image PCTCN2021116212-appb-000001 in the original publication; its contents are not reproduced here)
In Table 1, Conv. denotes the feature extraction sub-model or the feature recognition sub-model, RNAM denotes the non-local attention sub-model, RAM denotes the local attention sub-model, Res.Unit denotes a dilated residual network or dilated residual sub-network, Conv denotes the convolution sub-network, Deconv denotes the deconvolution sub-network, and NL Unit denotes the residual non-local sub-network.
In addition, in some embodiments, the Wiener filtering method, the SEGAN (Speech Enhancement Generative Adversarial Network) method, the Wavelnet method, the MMSE-GAN method (a speech enhancement generative adversarial network), the DFL (Deep Feature Loss) method, MDPhD (a hybrid model), and the RSGAN-GP (Speech Enhancement using Relativistic Generative Adversarial Networks with Gradient Penalty) method are taken as reference methods, and these methods are compared with the method provided by the embodiments of the present disclosure (RNANet).
The comparison results between the above reference methods and the method provided by the embodiments of the present disclosure are shown in Table 2 below:
Table 2

| Method   | SSNR  | PESQ | CSIG | CBAK | COVL |
|----------|-------|------|------|------|------|
| Noisy    | 1.68  | 1.97 | 3.35 | 2.44 | 2.63 |
| Wiener   | 5.07  | 2.22 | 3.23 | 2.68 | 2.67 |
| SEGAN    | 7.73  | 2.16 | 3.48 | 2.94 | 2.80 |
| Wavelnet |       |      | 3.62 | 3.23 | 2.98 |
| DFL      |       |      | 3.86 | 3.33 | 3.22 |
| MMSE-GAN |       | 2.53 | 3.80 | 3.12 | 3.14 |
| MDPhD    | 10.22 | 2.70 | 3.85 | 3.39 | 3.27 |
| RNANet   | 10.16 | 2.71 | 3.98 | 3.42 | 3.35 |
Here, a larger SSNR (Segmental Signal-to-Noise Ratio) indicates a better denoising effect, and likewise a larger PESQ (Perceptual Evaluation of Speech Quality) indicates a better denoising effect; CSIG (an evaluation metric) is the mean opinion score of signal distortion, and a larger CSIG indicates a better denoising effect; CBAK (an evaluation metric) is the background noise prediction score, and a larger CBAK indicates a better denoising effect; COVL (an evaluation metric) is the score of the overall signal quality of the speech signal.
In some embodiments, to show the improvement in the intelligibility of the speech signal, STOI (Short-Time Objective Intelligibility) is used to compare the method provided by the embodiments of the present disclosure with the reference methods; the comparison results are shown in Table 3:
Table 3

| Evaluation metric | Noisy | MMSE-GAN | RSGAN-GP | RNANet |
|-------------------|-------|----------|----------|--------|
| STOI              | 0.921 | 0.930    | 0.942    | 0.946  |
Here, a larger STOI indicates a better denoising effect.
From the comparison results in Tables 2 and 3 above, it can be seen that the denoising effect of the method provided by the embodiments of the present disclosure is clearly better than that of the other methods.
The embodiment shown in Figure 5 above is described only with the example of invoking a model to denoise the original speech signal; in another embodiment, the electronic device may denoise the original speech signal without invoking the speech processing model.
Figure 13 is a flowchart of another speech signal processing method provided by an embodiment of the present application. The method is performed by an electronic device. Referring to Figure 13, the method includes:
1301. Determine multiple first speech features of the original speech signal, each first speech feature corresponding to one speech signal frame in the original speech signal.
In some embodiments, the electronic device performs feature extraction on the original amplitude of each speech signal frame to obtain the first speech feature of each speech signal frame.
1302. Process the multiple first speech features to obtain multiple non-local speech features, each non-local speech feature corresponding to one speech signal frame.
Each non-local speech feature is obtained by fusing the first speech feature of the speech signal frame corresponding to the non-local speech feature with the first speech features of the speech signal frames other than that frame.
In some embodiments, the electronic device performs feature extraction on the first speech feature of each speech signal frame to obtain the second speech feature of each speech signal frame; fuses the first speech feature of each speech signal frame with the first speech features of the other speech signal frames to obtain the third speech feature of each speech signal frame; and fuses, for each speech signal frame, the second speech feature and the third speech feature to obtain the non-local speech feature of each speech signal frame.
In some embodiments, based on the weights of the multiple speech signal frames, the first speech feature of each speech signal frame is weighted-fused with the first speech features of the other speech signal frames to obtain the weighted-fused first speech feature of each speech signal frame; the weighted-fused first speech feature of each speech signal frame is encoded to obtain the encoded feature of each speech signal frame; and the encoded feature of each speech signal frame is decoded to obtain the third speech feature of each speech signal frame.
In some embodiments, based on the weights of the multiple speech signal frames, the electronic device performs weighted fusion of the first speech feature of each speech signal frame with the first speech features of the other speech signal frames to obtain the fusion feature of each speech signal frame, and fuses the first speech feature of each speech signal frame with the fusion feature to obtain the weighted-fused first speech feature of each speech signal frame.
In some embodiments, the electronic device performs feature reduction on the encoded feature of each speech signal frame to obtain multiple reduced encoded features, and decodes each reduced encoded feature to obtain the third speech feature of each speech signal frame.
In some embodiments, the electronic device fuses the non-local speech feature of each speech signal frame with the first speech feature to obtain the fused non-local speech feature of each speech signal frame.
1303. Process the non-local speech features of each speech signal frame in the original speech signal separately to obtain the mixed speech feature of each speech signal frame.
1304. Obtain the denoising parameters based on the mixed speech features of the multiple speech signal frames.
In some embodiments, the electronic device performs feature recognition on the mixed speech features of the multiple speech signal frames to obtain the denoising parameters.
1305. Denoise the original speech signal based on the denoising parameters to obtain the target speech signal.
In some embodiments, the electronic device denoises the original amplitudes of the multiple speech signal frames based on the denoising parameters to obtain the target amplitudes of the multiple speech signal frames, and combines the original phases and the target amplitudes of the multiple speech signal frames to obtain the target speech signal.
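As an illustration, steps 1301 to 1305 can be strung together as follows, reusing the sketches above; everything here (class names, block counts, feature shapes) is an assumption made for illustration rather than the configuration of the disclosure, and the resulting mask would be applied to the amplitudes as in the denoise_waveform sketch.

```python
class RNANetSketch(nn.Module):
    """Illustrative end-to-end pipeline: feature extraction, a non-local
    attention stage, a local attention stage, and a mask head."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.extract = nn.Conv2d(1, channels, 3, padding=1)            # step 1301
        self.non_local = NonLocalAttentionSubModel(channels)           # step 1302
        self.local = nn.Sequential(DilatedResidualBlock(channels, 1),
                                   DilatedResidualBlock(channels, 2))  # step 1303
        self.mask = FeatureRecognition(channels)                       # step 1304

    def forward(self, magnitude: torch.Tensor) -> torch.Tensor:
        # magnitude: (batch, T, K) original amplitudes of the frames
        feat = self.extract(magnitude.unsqueeze(1))
        feat = self.local(self.non_local(feat))
        return self.mask(feat).squeeze(1)  # denoising parameters for step 1305
```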
With the method provided by the embodiments of the present disclosure, the context information of each speech signal frame is taken into account when the non-local speech feature of the frame is obtained; the non-local speech feature of each frame is then processed separately to capture the speech features of the frame itself, yielding speech features in a mixed form. The denoising parameters obtained from these mixed-form speech features are more accurate, accurately representing the proportion of the signal other than the noise signal in each speech signal frame; denoising the original speech signal with these parameters therefore improves the denoising effect on the original speech signal.
Figure 14 is a block diagram of a speech signal processing apparatus according to an exemplary embodiment. Referring to Figure 14, the apparatus includes:
a feature determination unit 1401, configured to determine multiple first speech features of the original speech signal, each first speech feature corresponding to one speech signal frame in the original speech signal;
a non-local feature acquisition unit 1402, configured to process the multiple first speech features to obtain multiple non-local speech features, each non-local speech feature corresponding to one speech signal frame and being obtained by fusing the first speech feature of the speech signal frame corresponding to the non-local speech feature with the first speech features of the speech signal frames other than that frame;
a mixed feature acquisition unit 1403, configured to process the non-local speech features of each speech signal frame in the original speech signal separately to obtain the mixed speech feature of each speech signal frame;
a denoising parameter acquisition unit 1404, configured to obtain the denoising parameters based on the mixed speech features of the multiple speech signal frames;
a target signal acquisition unit 1405, configured to denoise the original speech signal based on the denoising parameters to obtain the target speech signal.
With the apparatus provided by the embodiments of the present disclosure, the context information of each speech signal frame is taken into account when the non-local speech feature of the frame is obtained; the non-local speech feature of each frame is then processed separately to capture the speech features of the frame itself, yielding speech features in a mixed form. The denoising parameters obtained from these mixed-form speech features are more accurate, accurately representing the proportion of the signal other than the noise signal in each speech signal frame; denoising the original speech signal with these parameters therefore improves the denoising effect on the original speech signal.
In some embodiments, the feature determination unit 1401 is configured to perform feature extraction on the original amplitude of each speech signal frame to obtain the first speech feature of each speech signal frame.
In some embodiments, referring to Figure 15, the target signal acquisition unit 1405 includes:
an amplitude acquisition subunit 1415, configured to denoise the original amplitudes of the multiple speech signal frames based on the denoising parameters to obtain the target amplitudes of the multiple speech signal frames;
a signal acquisition subunit 1425, configured to combine the original phases and the target amplitudes of the multiple speech signal frames to obtain the target speech signal.
In some embodiments, the denoising parameter acquisition unit 1404 is configured to perform feature recognition on the mixed speech features of the multiple speech signal frames to obtain the denoising parameters.
In some embodiments, referring to Figure 15, the non-local feature acquisition unit 1402 includes:
a feature extraction subunit 1412, configured to perform feature extraction on the first speech feature of each speech signal frame to obtain the second speech feature of each speech signal frame;
a first fusion subunit 1422, configured to fuse the first speech feature of each speech signal frame with the first speech features of the other speech signal frames to obtain the third speech feature of each speech signal frame;
a second fusion subunit 1432, configured to fuse, for each speech signal frame, the second speech feature and the third speech feature to obtain the non-local speech feature of each speech signal frame.
In some embodiments, referring to Figure 15, the non-local feature acquisition unit 1402 further includes:
a third fusion subunit 1442, configured to fuse the non-local speech feature of each speech signal frame with the first speech feature to obtain the fused non-local speech feature of each speech signal frame.
In some embodiments, referring to Figure 15, the first fusion subunit 1422 is configured to:
perform, based on the weights of the multiple speech signal frames, weighted fusion of the first speech feature of each speech signal frame with the first speech features of the other speech signal frames to obtain the weighted-fused first speech feature of each speech signal frame;
encode the weighted-fused first speech feature of each speech signal frame to obtain the encoded feature of each speech signal frame;
decode the encoded feature of each speech signal frame to obtain the third speech feature of each speech signal frame.
In some embodiments, referring to Figure 15, the first fusion subunit 1422 is configured to:
perform feature reduction on the encoded feature of each speech signal frame to obtain multiple reduced encoded features;
decode each reduced encoded feature to obtain the third speech feature of each speech signal frame.
In some embodiments, referring to Figure 15, the first fusion subunit 1422 is configured to:
perform, based on the weights of the multiple speech signal frames, weighted fusion of the first speech feature of each speech signal frame with the first speech features of the other speech signal frames to obtain the fusion feature of each speech signal frame;
fuse the first speech feature of each speech signal frame with the fusion feature to obtain the weighted-fused first speech feature of each speech signal frame.
In some embodiments, the non-local feature acquisition unit 1402 is configured to invoke the non-local attention sub-model to process the multiple first speech features to obtain the multiple non-local speech features;
the mixed feature acquisition unit 1403 is configured to invoke the local attention sub-model to process the non-local speech features of each speech signal frame separately to obtain the mixed speech feature of each speech signal frame.
In some embodiments, the non-local attention sub-model includes a first processing network, a second processing network, and a first fusion network. Referring to Figure 15, the non-local feature acquisition unit 1402 includes:
a feature extraction subunit 1412, configured to invoke the first processing network to perform feature extraction on the first speech feature of each speech signal frame to obtain the second speech feature of each speech signal frame, the first processing network including multiple dilated residual sub-networks;
a first fusion subunit 1422, configured to invoke the second processing network to fuse the first speech feature of each speech signal frame with the first speech features of the other speech signal frames to obtain the third speech feature of each speech signal frame;
a second fusion subunit 1432, configured to invoke the first fusion network to fuse, for each speech signal frame, the second speech feature and the third speech feature to obtain the non-local speech feature of each speech signal frame.
In some embodiments, the second processing network includes a residual non-local sub-network, a convolution sub-network, and a deconvolution sub-network, and the first fusion subunit 1422 is configured to:
invoke the residual non-local sub-network to perform, based on the weights of the multiple speech signal frames, weighted fusion of the first speech feature of each speech signal frame with the first speech features of the other speech signal frames to obtain the weighted-fused first speech feature of each speech signal frame;
invoke the convolution sub-network to encode the weighted-fused first speech feature of each speech signal frame to obtain the encoded feature of each speech signal frame;
invoke the deconvolution sub-network to decode the encoded feature of each speech signal frame to obtain the third speech feature of each speech signal frame.
In some embodiments, the residual non-local sub-network includes a first fusion layer and a second fusion layer, and the first fusion subunit 1422 is configured to:
invoke the first fusion layer to perform, based on the weights of the multiple speech signal frames, weighted fusion of the first speech feature of each speech signal frame with the first speech features of the other speech signal frames to obtain the fusion feature of each speech signal frame;
invoke the second fusion layer to fuse the first speech feature of each speech signal frame with the fusion feature to obtain the weighted-fused first speech feature of each speech signal frame.
In an exemplary embodiment, an electronic device is provided, including one or more processors and a volatile or non-volatile memory for storing instructions executable by the one or more processors, where the one or more processors are configured to perform the speech signal processing method of the above embodiments.
In some embodiments, the electronic device is provided as a terminal. Figure 16 is a structural block diagram of a terminal 1600 according to an exemplary embodiment. The terminal 1600 may be a portable mobile terminal such as a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 1600 may also be called user equipment, a portable terminal, a laptop terminal, a desktop terminal, or other names.
The terminal 1600 includes a processor 1601 and a memory 1602.
The processor 1601 may include one or more processing cores, for example a 4-core processor or an 8-core processor. The processor 1601 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 1601 may also include a main processor and a coprocessor: the main processor, also called a CPU (Central Processing Unit), processes data in the awake state, while the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 1601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 1601 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory 1602 may include one or more computer-readable storage media, which may be non-transitory. The memory 1602 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 1602 stores at least one piece of program code, which is executed by the processor 1601 to implement the speech signal processing methods provided by the method embodiments of the present disclosure.
In some embodiments, the terminal 1600 may optionally further include a peripheral device interface 1603 and at least one peripheral device. The processor 1601, the memory 1602, and the peripheral device interface 1603 may be connected by buses or signal lines, and each peripheral device may be connected to the peripheral device interface 1603 by a bus, a signal line, or a circuit board. Specifically, the peripheral devices include at least one of a radio frequency circuit 1604, a display screen 1605, a camera assembly 1606, an audio circuit 1607, a positioning assembly 1608, and a power supply 1609.
The peripheral device interface 1603 may be used to connect at least one I/O (Input/Output) related peripheral device to the processor 1601 and the memory 1602. In some embodiments, the processor 1601, the memory 1602, and the peripheral device interface 1603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of them may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 1604 receives and transmits RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1604 communicates with communication networks and other communication devices via electromagnetic signals, converting electrical signals into electromagnetic signals for transmission, or converting received electromagnetic signals into electrical signals. Optionally, the radio frequency circuit 1604 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 1604 may communicate with other terminals through at least one wireless communication protocol, including but not limited to the World Wide Web, metropolitan area networks, intranets, the various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1604 may further include NFC (Near Field Communication) related circuits, which is not limited in the present disclosure.
The display screen 1605 displays a UI (User Interface), which may include graphics, text, icons, video, and any combination thereof. When the display screen 1605 is a touch display screen, it also has the ability to acquire touch signals on or above its surface; such touch signals may be input to the processor 1601 as control signals for processing, in which case the display screen 1605 may also provide virtual buttons and/or a virtual keyboard, also called soft buttons and/or a soft keyboard. In some embodiments, there is one display screen 1605, arranged on the front panel of the terminal 1600; in other embodiments, there are at least two display screens 1605, arranged on different surfaces of the terminal 1600 or in a folded design; in still other embodiments, the display screen 1605 is a flexible display screen arranged on a curved surface or a folding surface of the terminal 1600. The display screen 1605 may even be set as a non-rectangular irregular figure, i.e., a special-shaped screen, and may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera assembly 1606 captures images or video. Optionally, the camera assembly 1606 includes a front camera and a rear camera; the front camera is arranged on the front panel of the terminal, and the rear camera is arranged on the back of the terminal. In some embodiments, there are at least two rear cameras, each being one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to achieve a background-blur function, and the main camera and the wide-angle camera can be fused to achieve panoramic shooting, VR (Virtual Reality) shooting, or other fused shooting functions. In some embodiments, the camera assembly 1606 may also include a flash, which may be a single-color-temperature flash or a dual-color-temperature flash; a dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.
The audio circuit 1607 may include a microphone and a speaker. The microphone collects sound waves from the user and the environment and converts them into electrical signals that are input to the processor 1601 for processing, or to the radio frequency circuit 1604 for voice communication. For stereo acquisition or noise reduction, there may be multiple microphones arranged at different parts of the terminal 1600; the microphone may also be an array microphone or an omnidirectional microphone. The speaker converts electrical signals from the processor 1601 or the radio frequency circuit 1604 into sound waves, and may be a traditional thin-film speaker or a piezoelectric ceramic speaker. A piezoelectric ceramic speaker can convert electrical signals not only into sound waves audible to humans but also into sound waves inaudible to humans for purposes such as ranging. In some embodiments, the audio circuit 1607 may also include a headphone jack.
The positioning assembly 1608 locates the current geographic position of the terminal 1600 to implement navigation or LBS (Location Based Service). The positioning assembly 1608 may be based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS positioning system of Russia, or the Galileo positioning system of the European Union.
The power supply 1609 supplies power to the components in the terminal 1600. The power supply 1609 may be alternating current, direct current, disposable batteries, or rechargeable batteries. When the power supply 1609 includes a rechargeable battery, it may be a wired rechargeable battery, charged through a wired line, or a wireless rechargeable battery, charged through a wireless coil. The rechargeable battery may also support fast-charging technology.
In some embodiments, the terminal 1600 further includes one or more sensors 1610. The one or more sensors 1610 include, but are not limited to, an acceleration sensor 1611, a gyroscope sensor 1612, a pressure sensor 1613, a fingerprint sensor 1614, an optical sensor 1615, and a proximity sensor 1616.
The acceleration sensor 1611 can detect the magnitude of acceleration along the three axes of the coordinate system established by the terminal 1600. For example, the acceleration sensor 1611 can detect the components of gravitational acceleration along the three axes. The processor 1601 can control the display screen 1605 to display the user interface in landscape or portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1611. The acceleration sensor 1611 can also be used to collect motion data for games or for the user.
The gyroscope sensor 1612 can detect the body orientation and rotation angle of the terminal 1600, and can cooperate with the acceleration sensor 1611 to capture the user's 3D actions on the terminal 1600. Based on the data collected by the gyroscope sensor 1612, the processor 1601 can implement functions such as motion sensing (for example, changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 1613 may be disposed on the side frame of the terminal 1600 and/or beneath the display screen 1605. When the pressure sensor 1613 is disposed on the side frame, it can detect the user's grip on the terminal 1600, and the processor 1601 can perform left/right-hand recognition or shortcut operations according to the grip signal collected by the pressure sensor 1613. When the pressure sensor 1613 is disposed beneath the display screen 1605, the processor 1601 controls the operable controls on the UI according to the user's pressure operations on the display screen 1605. The operable controls include at least one of a button control, a scroll-bar control, an icon control, and a menu control.
The fingerprint sensor 1614 is used to collect the user's fingerprint, and the user's identity is recognized either by the processor 1601 or by the fingerprint sensor 1614 itself based on the collected fingerprint. When the user's identity is recognized as trusted, the processor 1601 authorizes the user to perform sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 1614 may be disposed on the front, back, or side of the terminal 1600. When the terminal 1600 is provided with a physical button or a manufacturer's logo, the fingerprint sensor 1614 may be integrated with the physical button or the manufacturer's logo.
The optical sensor 1615 is used to collect the ambient light intensity. In one embodiment, the processor 1601 can control the display brightness of the display screen 1605 according to the ambient light intensity collected by the optical sensor 1615: when the ambient light intensity is high, the display brightness is increased; when the ambient light intensity is low, the display brightness is decreased. In another embodiment, the processor 1601 can also dynamically adjust the shooting parameters of the camera assembly 1606 according to the ambient light intensity collected by the optical sensor 1615.
The proximity sensor 1616, also called a distance sensor, is disposed on the front panel of the terminal 1600 and is used to measure the distance between the user and the front of the terminal 1600. In one embodiment, when the proximity sensor 1616 detects that this distance is gradually decreasing, the processor 1601 controls the display screen 1605 to switch from the screen-on state to the screen-off state; when the proximity sensor 1616 detects that the distance is gradually increasing, the processor 1601 controls the display screen 1605 to switch from the screen-off state to the screen-on state.
Those skilled in the art will understand that the structure shown in FIG. 16 does not limit the terminal 1600, which may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components.
In some embodiments, the electronic device is provided as a server. FIG. 17 is a structural block diagram of a server according to an exemplary embodiment. The server 1700 may vary greatly in configuration or performance, and may include one or more processors (Central Processing Units, CPUs) 1701 and one or more memories 1702, where the memory 1702 stores at least one piece of program code, which is loaded and executed by the processor 1701 to implement the methods provided by the above method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and may further include other components for implementing device functions, which are not described here.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided. When the instructions in the storage medium are executed by the processor of an electronic device, the electronic device is enabled to perform the steps performed by the terminal or the server in the above voice signal processing method. For example, the non-transitory computer-readable storage medium may be a ROM (Read-Only Memory), a RAM (Random Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product is also provided. When the instructions in the computer program product are executed by the processor of an electronic device, the electronic device is enabled to perform the steps performed by the terminal or the server in the above voice signal processing method.
In an exemplary embodiment, a method for processing a voice signal is provided; the method comprises the following steps, sketched in code after the list:
determining first voice features of a plurality of voice signal frames in an original voice signal;
invoking a non-local attention network to fuse the first voice features of the plurality of voice signal frames to obtain a non-local voice feature of each voice signal frame;
invoking a local attention network to separately process the non-local voice feature of each voice signal frame to obtain a mixed voice feature of each voice signal frame;
obtaining denoising parameters based on the mixed voice features of the plurality of voice signal frames;
denoising the original voice signal according to the denoising parameters to obtain a target voice signal.
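The five steps above can be read as a small neural pipeline. Below is a minimal PyTorch sketch of that reading; the module sizes, the use of multi-head attention for the non-local attention network, and a 1-D convolution standing in for the local attention network are all illustrative assumptions rather than the patent's implementation, and `SpeechDenoiser` is a hypothetical name.

```python
import torch
import torch.nn as nn

class SpeechDenoiser(nn.Module):
    def __init__(self, n_bins=257, d_model=128):
        super().__init__()
        # first voice features: one feature vector per voice signal frame
        self.extract = nn.Conv1d(n_bins, d_model, kernel_size=3, padding=1)
        # non-local attention network (assumed): every frame attends to all others
        self.non_local = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        # local attention network (assumed): per-frame processing, small receptive field
        self.local = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        # denoising parameters: a mask in [0, 1] per time-frequency bin
        self.mask = nn.Sequential(nn.Conv1d(d_model, n_bins, kernel_size=1), nn.Sigmoid())

    def forward(self, amp):                              # amp: (batch, n_bins, frames)
        feat = self.extract(amp)                         # first voice features
        f = feat.transpose(1, 2)                         # (batch, frames, d_model)
        fused, _ = self.non_local(f, f, f)               # non-local voice features
        mixed = self.local((fused + f).transpose(1, 2))  # mixed voice features
        return amp * self.mask(mixed)                    # denoised target amplitudes
```

Here `amp` is assumed to be a batch of per-frame spectral amplitudes; the embodiments below spell out where those amplitudes come from and how the waveform is rebuilt.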
In some embodiments, determining the first voice features of the plurality of voice signal frames in the original voice signal includes:
invoking a feature extraction network to separately perform feature extraction on the original amplitudes of the plurality of voice signal frames to obtain the first voice features of the plurality of voice signal frames; the sketch below illustrates where those amplitudes come from.
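One hedged reading of this step is that the "original amplitude" of each frame is its short-time spectral magnitude. The helper below, with assumed window and hop sizes, is illustrative only:

```python
import torch

def frame_amplitudes(wave, n_fft=512, hop=128):
    """Frame a waveform; return (bins, frames) amplitudes and the original phases."""
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    return spec.abs(), spec.angle()  # original amplitudes, original phases
```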
In some embodiments, denoising the original voice signal according to the denoising parameters to obtain the target voice signal includes:
invoking a voice denoising network to separately denoise the original amplitudes of the plurality of voice signal frames according to the denoising parameters to obtain target amplitudes of the plurality of voice signal frames;
combining the original phases and the target amplitudes of the plurality of voice signal frames to obtain the target voice signal, as sketched below.
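A minimal sketch of the recombination, assuming the same STFT parameters as the hypothetical `frame_amplitudes` helper above:

```python
import torch

def reconstruct(target_amp, orig_phase, n_fft=512, hop=128):
    # target amplitude + original phase -> complex spectrum -> target voice signal
    spec = torch.polar(target_amp, orig_phase)
    return torch.istft(spec, n_fft=n_fft, hop_length=hop,
                       window=torch.hann_window(n_fft))
```

Keeping the noisy phase and replacing only the amplitude is a common design choice in amplitude-domain enhancement, since amplitude errors tend to be far more audible than phase errors at typical frame lengths.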
In some embodiments, obtaining the denoising parameters based on the mixed voice features of the plurality of voice signal frames includes:
invoking a feature reconstruction network to perform feature reconstruction on the mixed voice features of the plurality of voice signal frames to obtain the denoising parameters.
In some embodiments, the non-local attention network further includes a second fusion unit. After the first fusion unit is invoked to fuse the second voice feature and the third voice feature of each voice signal frame to obtain the non-local voice feature of each voice signal frame, the processing method further includes:
invoking the second fusion unit to fuse the non-local voice feature and the first voice feature of each voice signal frame to obtain a fused non-local voice feature of each voice signal frame.
In some embodiments, the second processing unit further includes a feature reduction subunit. After the convolution subunit is invoked to encode the weighted-fused first voice feature of each voice signal frame to obtain the encoded feature of each voice signal frame, the processing method further includes:
invoking the feature reduction subunit to perform feature reduction on the encoded feature of each voice signal frame to obtain a plurality of reduced encoded features;
and invoking the deconvolution subunit to decode the encoded feature of each voice signal frame to obtain the third voice feature of each voice signal frame includes:
invoking the deconvolution subunit to decode the plurality of reduced encoded features to obtain the third voice feature of each voice signal frame; one plausible rendering follows.
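Read together, these embodiments describe an attention-weighted fusion followed by a shrink-then-expand encode/decode path. The sketch below is one plausible rendering under assumed layer shapes: the strided convolution plays the roles of the convolution and feature reduction subunits, and the transposed convolution plays the deconvolution subunit. `SecondProcessingUnit` is a hypothetical name.

```python
import torch
import torch.nn as nn

class SecondProcessingUnit(nn.Module):
    def __init__(self, d_model=128):
        super().__init__()
        # weighted fusion of each frame's feature with all other frames' features
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        # encode + feature reduction: halves the number of frames
        self.encode = nn.Conv1d(d_model, d_model, kernel_size=4, stride=2, padding=1)
        # decode: expands back to one third voice feature per frame
        self.decode = nn.ConvTranspose1d(d_model, d_model, kernel_size=4, stride=2, padding=1)

    def forward(self, feat):                      # feat: (batch, frames, d_model)
        fused, _ = self.attn(feat, feat, feat)    # weighted-fused first voice features
        fused = fused + feat                      # residual fusion with the input features
        enc = self.encode(fused.transpose(1, 2))  # reduced encoded features
        return self.decode(enc).transpose(1, 2)   # third voice features
```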
In some embodiments, the voice processing model includes at least the non-local attention network and the local attention network, and the training process of the voice processing model is as follows (a sketch follows these steps):
obtaining a sample voice signal and a sample noise signal;
mixing the sample voice signal with the sample noise signal to obtain a sample mixed signal;
invoking the voice processing model to process a plurality of sample voice signal frames in the sample mixed signal to obtain predicted denoising parameters corresponding to the sample mixed signal;
denoising the sample mixed signal according to the predicted denoising parameters to obtain a denoised predicted voice signal;
training the voice processing model according to the difference between the predicted voice signal and the sample voice signal.
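A hedged sketch of one training step for these five operations, reusing the hypothetical `frame_amplitudes` helper and `SpeechDenoiser` model from the earlier sketches; the MSE loss on amplitudes is an assumption, since the text only specifies training on the difference between the predicted and sample voice signals:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, clean_wave, noise_wave):
    mixed = clean_wave + noise_wave                      # sample mixed signal
    mixed_amp, _ = frame_amplitudes(mixed)
    clean_amp, _ = frame_amplitudes(clean_wave)
    pred_amp = model(mixed_amp.unsqueeze(0)).squeeze(0)  # predicted target amplitudes
    loss = F.mse_loss(pred_amp, clean_amp)               # gap to the sample voice signal
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```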
All embodiments of the present disclosure may be implemented independently or in combination with other embodiments, and both cases fall within the protection scope claimed by the present disclosure.

Claims (41)

  1. A method for processing a voice signal, performed by an electronic device, the method comprising:
    determining a plurality of first voice features of an original voice signal, each of the first voice features corresponding to one voice signal frame in the original voice signal;
    processing the plurality of first voice features to obtain a plurality of non-local voice features, each of the non-local voice features corresponding to one voice signal frame, and each of the non-local voice features being obtained by fusing the first voice feature of the voice signal frame corresponding to the non-local voice feature with the first voice features of voice signal frames other than the voice signal frame;
    separately processing the non-local voice feature of each voice signal frame in the original voice signal to obtain a mixed voice feature of each of the voice signal frames;
    obtaining denoising parameters based on the mixed voice features of the plurality of voice signal frames;
    denoising the original voice signal based on the denoising parameters to obtain a target voice signal.
  2. The method according to claim 1, wherein determining the plurality of first voice features of the original voice signal comprises:
    separately performing feature extraction on an original amplitude of each of the voice signal frames to obtain the first voice feature of each of the voice signal frames.
  3. The method according to claim 2, wherein denoising the original voice signal based on the denoising parameters to obtain the target voice signal comprises:
    based on the denoising parameters, separately denoising the original amplitudes of the plurality of voice signal frames to obtain target amplitudes of the plurality of voice signal frames;
    combining the original phases and the target amplitudes of the plurality of voice signal frames to obtain the target voice signal.
  4. The method according to claim 1, wherein obtaining the denoising parameters based on the mixed voice features of the plurality of voice signal frames comprises:
    performing feature recognition on the mixed voice features of the plurality of voice signal frames to obtain the denoising parameters.
  5. The method according to claim 1, wherein processing the plurality of first voice features to obtain the plurality of non-local voice features comprises:
    separately performing feature extraction on the first voice feature of each of the voice signal frames to obtain a second voice feature of each of the voice signal frames;
    fusing the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain a third voice feature of each of the voice signal frames;
    separately fusing the second voice feature and the third voice feature of each of the voice signal frames to obtain the non-local voice feature of each of the voice signal frames.
  6. The method according to claim 5, wherein the method further comprises:
    fusing the non-local voice feature and the first voice feature of each of the voice signal frames to obtain a fused non-local voice feature of each of the voice signal frames.
  7. The method according to claim 5, wherein fusing the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the third voice feature of each of the voice signal frames comprises:
    based on weights of the plurality of voice signal frames, performing weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain a weighted-fused first voice feature of each of the voice signal frames;
    encoding the weighted-fused first voice feature of each of the voice signal frames to obtain an encoded feature of each of the voice signal frames;
    decoding the encoded feature of each of the voice signal frames to obtain the third voice feature of each of the voice signal frames.
  8. The method according to claim 7, wherein the method further comprises:
    performing feature reduction on the encoded feature of each of the voice signal frames to obtain a plurality of reduced encoded features;
    wherein decoding the encoded feature of each of the voice signal frames to obtain the third voice feature of each of the voice signal frames comprises:
    decoding each of the reduced encoded features to obtain the third voice feature of each of the voice signal frames.
  9. The method according to claim 7, wherein, based on the weights of the plurality of voice signal frames, performing weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the weighted-fused first voice feature of each of the voice signal frames comprises:
    based on the weights of the plurality of voice signal frames, performing weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain a fusion feature of each of the voice signal frames;
    fusing the first voice feature of each of the voice signal frames with the fusion feature to obtain the weighted-fused first voice feature of each of the voice signal frames.
  10. The method according to claim 1, wherein processing the plurality of first voice features to obtain the plurality of non-local voice features comprises:
    invoking a non-local attention sub-model to process the plurality of first voice features to obtain the plurality of non-local voice features;
    wherein separately processing the non-local voice feature of each voice signal frame in the original voice signal to obtain the mixed voice feature of each of the voice signal frames comprises:
    invoking a local attention sub-model to separately process the non-local voice feature of each of the voice signal frames to obtain the mixed voice feature of each of the voice signal frames.
  11. The method according to claim 10, wherein the non-local attention sub-model comprises a first processing network, a second processing network, and a first fusion network, and invoking the non-local attention sub-model to process the plurality of first voice features to obtain the plurality of non-local voice features comprises:
    invoking the first processing network to separately perform feature extraction on the first voice feature of each of the voice signal frames to obtain the second voice feature of each of the voice signal frames, the first processing network comprising a plurality of dilated residual sub-networks;
    invoking the second processing network to fuse the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the third voice feature of each of the voice signal frames;
    invoking the first fusion network to separately fuse the second voice feature and the third voice feature of each of the voice signal frames to obtain the non-local voice feature of each of the voice signal frames.
  12. The method according to claim 11, wherein the second processing network comprises a residual non-local sub-network, a convolution sub-network, and a deconvolution sub-network; and invoking the second processing network to fuse the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the third voice feature of each of the voice signal frames comprises:
    invoking the residual non-local sub-network to, based on the weights of the plurality of voice signal frames, perform weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the weighted-fused first voice feature of each of the voice signal frames;
    invoking the convolution sub-network to encode the weighted-fused first voice feature of each of the voice signal frames to obtain the encoded feature of each of the voice signal frames;
    invoking the deconvolution sub-network to decode the encoded feature of each of the voice signal frames to obtain the third voice feature of each of the voice signal frames.
  13. The method according to claim 12, wherein the residual non-local sub-network comprises a first fusion layer and a second fusion layer, and invoking the residual non-local sub-network to, based on the weights of the plurality of voice signal frames, perform weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the weighted-fused first voice feature of each of the voice signal frames comprises:
    invoking the first fusion layer to, based on the weights of the plurality of voice signal frames, perform weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the fusion feature of each of the voice signal frames;
    invoking the second fusion layer to fuse the first voice feature of each of the voice signal frames with the fusion feature to obtain the weighted-fused first voice feature of each of the voice signal frames.
  14. An apparatus for processing a voice signal, the apparatus comprising:
    a feature determining unit configured to determine a plurality of first voice features of an original voice signal, each of the first voice features corresponding to one voice signal frame in the original voice signal;
    a non-local feature obtaining unit configured to process the plurality of first voice features to obtain a plurality of non-local voice features, each of the non-local voice features corresponding to one voice signal frame, and each of the non-local voice features being obtained by fusing the first voice feature of the voice signal frame corresponding to the non-local voice feature with the first voice features of voice signal frames other than the voice signal frame;
    a mixed feature obtaining unit configured to separately process the non-local voice feature of each voice signal frame in the original voice signal to obtain a mixed voice feature of each of the voice signal frames;
    a denoising parameter obtaining unit configured to obtain denoising parameters based on the mixed voice features of the plurality of voice signal frames;
    a target signal obtaining unit configured to denoise the original voice signal based on the denoising parameters to obtain a target voice signal.
  15. The apparatus according to claim 14, wherein the feature determining unit is configured to separately perform feature extraction on an original amplitude of each of the voice signal frames to obtain the first voice feature of each of the voice signal frames.
  16. The apparatus according to claim 15, wherein the target signal obtaining unit comprises:
    an amplitude obtaining subunit configured to, based on the denoising parameters, separately denoise the original amplitudes of the plurality of voice signal frames to obtain target amplitudes of the plurality of voice signal frames;
    a signal obtaining subunit configured to combine the original phases and the target amplitudes of the plurality of voice signal frames to obtain the target voice signal.
  17. The apparatus according to claim 14, wherein the denoising parameter obtaining unit is configured to perform feature recognition on the mixed voice features of the plurality of voice signal frames to obtain the denoising parameters.
  18. The apparatus according to claim 14, wherein the non-local feature obtaining unit comprises:
    a feature extraction subunit configured to separately perform feature extraction on the first voice feature of each of the voice signal frames to obtain a second voice feature of each of the voice signal frames;
    a first fusion subunit configured to fuse the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain a third voice feature of each of the voice signal frames;
    a second fusion subunit configured to separately fuse the second voice feature and the third voice feature of each of the voice signal frames to obtain the non-local voice feature of each of the voice signal frames.
  19. The apparatus according to claim 18, wherein the non-local feature obtaining unit further comprises:
    a third fusion subunit configured to fuse the non-local voice feature and the first voice feature of each of the voice signal frames to obtain a fused non-local voice feature of each of the voice signal frames.
  20. The apparatus according to claim 18, wherein the first fusion subunit is configured to:
    based on weights of the plurality of voice signal frames, perform weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain a weighted-fused first voice feature of each of the voice signal frames;
    encode the weighted-fused first voice feature of each of the voice signal frames to obtain an encoded feature of each of the voice signal frames;
    decode the encoded feature of each of the voice signal frames to obtain the third voice feature of each of the voice signal frames.
  21. The apparatus according to claim 20, wherein the first fusion subunit is configured to:
    perform feature reduction on the encoded feature of each of the voice signal frames to obtain a plurality of reduced encoded features;
    decode each of the reduced encoded features to obtain the third voice feature of each of the voice signal frames.
  22. The apparatus according to claim 20, wherein the first fusion subunit is configured to:
    based on the weights of the plurality of voice signal frames, perform weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain a fusion feature of each of the voice signal frames;
    fuse the first voice feature of each of the voice signal frames with the fusion feature to obtain the weighted-fused first voice feature of each of the voice signal frames.
  23. The apparatus according to claim 14, wherein the non-local feature obtaining unit is configured to invoke a non-local attention sub-model to process the plurality of first voice features to obtain the plurality of non-local voice features;
    and the mixed feature obtaining unit is configured to invoke a local attention sub-model to separately process the non-local voice feature of each of the voice signal frames to obtain the mixed voice feature of each of the voice signal frames.
  24. The apparatus according to claim 23, wherein the non-local attention sub-model comprises a first processing network, a second processing network, and a first fusion network, and the non-local feature obtaining unit comprises:
    a feature extraction subunit configured to invoke the first processing network to separately perform feature extraction on the first voice feature of each of the voice signal frames to obtain the second voice feature of each of the voice signal frames, the first processing network comprising a plurality of dilated residual sub-networks;
    a first fusion subunit configured to invoke the second processing network to fuse the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the third voice feature of each of the voice signal frames;
    a second fusion subunit configured to invoke the first fusion network to separately fuse the second voice feature and the third voice feature of each of the voice signal frames to obtain the non-local voice feature of each of the voice signal frames.
  25. The apparatus according to claim 24, wherein the second processing network comprises a residual non-local sub-network, a convolution sub-network, and a deconvolution sub-network; and the first fusion subunit is configured to:
    invoke the residual non-local sub-network to, based on the weights of the plurality of voice signal frames, perform weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the weighted-fused first voice feature of each of the voice signal frames;
    invoke the convolution sub-network to encode the weighted-fused first voice feature of each of the voice signal frames to obtain the encoded feature of each of the voice signal frames;
    invoke the deconvolution sub-network to decode the encoded feature of each of the voice signal frames to obtain the third voice feature of each of the voice signal frames.
  26. The apparatus according to claim 25, wherein the residual non-local sub-network comprises a first fusion layer and a second fusion layer, and the first fusion subunit is configured to:
    invoke the first fusion layer to, based on the weights of the plurality of voice signal frames, perform weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the fusion feature of each of the voice signal frames;
    invoke the second fusion layer to fuse the first voice feature of each of the voice signal frames with the fusion feature to obtain the weighted-fused first voice feature of each of the voice signal frames.
  27. An electronic device, comprising:
    one or more processors;
    a memory for storing instructions executable by the one or more processors;
    wherein the one or more processors are configured to perform the following steps:
    determining a plurality of first voice features of an original voice signal, each of the first voice features corresponding to one voice signal frame in the original voice signal;
    processing the plurality of first voice features to obtain a plurality of non-local voice features, each of the non-local voice features corresponding to one voice signal frame, and each of the non-local voice features being obtained by fusing the first voice feature of the voice signal frame corresponding to the non-local voice feature with the first voice features of voice signal frames other than the voice signal frame;
    separately processing the non-local voice feature of each voice signal frame in the original voice signal to obtain a mixed voice feature of each of the voice signal frames;
    obtaining denoising parameters based on the mixed voice features of the plurality of voice signal frames;
    denoising the original voice signal based on the denoising parameters to obtain a target voice signal.
  28. The electronic device according to claim 27, wherein the one or more processors are configured to perform the following step:
    separately performing feature extraction on an original amplitude of each of the voice signal frames to obtain the first voice feature of each of the voice signal frames.
  29. The electronic device according to claim 28, wherein the one or more processors are configured to perform the following steps:
    based on the denoising parameters, separately denoising the original amplitudes of the plurality of voice signal frames to obtain target amplitudes of the plurality of voice signal frames;
    combining the original phases and the target amplitudes of the plurality of voice signal frames to obtain the target voice signal.
  30. The electronic device according to claim 27, wherein the one or more processors are configured to perform the following step:
    performing feature recognition on the mixed voice features of the plurality of voice signal frames to obtain the denoising parameters.
  31. The electronic device according to claim 27, wherein the one or more processors are configured to perform the following steps:
    separately performing feature extraction on the first voice feature of each of the voice signal frames to obtain a second voice feature of each of the voice signal frames;
    fusing the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain a third voice feature of each of the voice signal frames;
    separately fusing the second voice feature and the third voice feature of each of the voice signal frames to obtain the non-local voice feature of each of the voice signal frames.
  32. The electronic device according to claim 31, wherein the one or more processors are configured to perform the following step:
    fusing the non-local voice feature and the first voice feature of each of the voice signal frames to obtain a fused non-local voice feature of each of the voice signal frames.
  33. The electronic device according to claim 31, wherein the one or more processors are configured to perform the following steps:
    based on weights of the plurality of voice signal frames, performing weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain a weighted-fused first voice feature of each of the voice signal frames;
    encoding the weighted-fused first voice feature of each of the voice signal frames to obtain an encoded feature of each of the voice signal frames;
    decoding the encoded feature of each of the voice signal frames to obtain the third voice feature of each of the voice signal frames.
  34. The electronic device according to claim 33, wherein the one or more processors are configured to perform the following steps:
    performing feature reduction on the encoded feature of each of the voice signal frames to obtain a plurality of reduced encoded features;
    decoding each of the reduced encoded features to obtain the third voice feature of each of the voice signal frames.
  35. The electronic device according to claim 33, wherein the one or more processors are configured to perform the following steps:
    based on the weights of the plurality of voice signal frames, performing weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain a fusion feature of each of the voice signal frames;
    fusing the first voice feature of each of the voice signal frames with the fusion feature to obtain the weighted-fused first voice feature of each of the voice signal frames.
  36. The electronic device according to claim 27, wherein the one or more processors are configured to perform the following steps:
    invoking a non-local attention sub-model to process the plurality of first voice features to obtain the plurality of non-local voice features;
    wherein separately processing the non-local voice feature of each voice signal frame in the original voice signal to obtain the mixed voice feature of each of the voice signal frames comprises:
    invoking a local attention sub-model to separately process the non-local voice feature of each of the voice signal frames to obtain the mixed voice feature of each of the voice signal frames.
  37. The electronic device according to claim 36, wherein the non-local attention sub-model comprises a first processing network, a second processing network, and a first fusion network, and the one or more processors are configured to perform the following steps:
    invoking the first processing network to separately perform feature extraction on the first voice feature of each of the voice signal frames to obtain the second voice feature of each of the voice signal frames, the first processing network comprising a plurality of dilated residual sub-networks;
    invoking the second processing network to fuse the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the third voice feature of each of the voice signal frames;
    invoking the first fusion network to separately fuse the second voice feature and the third voice feature of each of the voice signal frames to obtain the non-local voice feature of each of the voice signal frames.
  38. The electronic device according to claim 37, wherein the second processing network comprises a residual non-local sub-network, a convolution sub-network, and a deconvolution sub-network; and the one or more processors are configured to perform the following steps:
    invoking the residual non-local sub-network to, based on the weights of the plurality of voice signal frames, perform weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the weighted-fused first voice feature of each of the voice signal frames;
    invoking the convolution sub-network to encode the weighted-fused first voice feature of each of the voice signal frames to obtain the encoded feature of each of the voice signal frames;
    invoking the deconvolution sub-network to decode the encoded feature of each of the voice signal frames to obtain the third voice feature of each of the voice signal frames.
  39. The electronic device according to claim 38, wherein the residual non-local sub-network comprises a first fusion layer and a second fusion layer, and the one or more processors are configured to perform the following steps:
    invoking the first fusion layer to, based on the weights of the plurality of voice signal frames, perform weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the fusion feature of each of the voice signal frames;
    invoking the second fusion layer to fuse the first voice feature of each of the voice signal frames with the fusion feature to obtain the weighted-fused first voice feature of each of the voice signal frames.
  40. A non-transitory computer-readable storage medium, wherein, when instructions in the non-transitory computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the following steps:
    determining a plurality of first voice features of an original voice signal, each of the first voice features corresponding to one voice signal frame in the original voice signal;
    processing the plurality of first voice features to obtain a plurality of non-local voice features, each of the non-local voice features corresponding to one voice signal frame, and each of the non-local voice features being obtained by fusing the first voice feature of the voice signal frame corresponding to the non-local voice feature with the first voice features of voice signal frames other than the voice signal frame;
    separately processing the non-local voice feature of each voice signal frame in the original voice signal to obtain a mixed voice feature of each of the voice signal frames;
    obtaining denoising parameters based on the mixed voice features of the plurality of voice signal frames;
    denoising the original voice signal based on the denoising parameters to obtain a target voice signal.
  41. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the following steps:
    determining a plurality of first voice features of an original voice signal, each of the first voice features corresponding to one voice signal frame in the original voice signal;
    processing the plurality of first voice features to obtain a plurality of non-local voice features, each of the non-local voice features corresponding to one voice signal frame, and each of the non-local voice features being obtained by fusing the first voice feature of the voice signal frame corresponding to the non-local voice feature with the first voice features of voice signal frames other than the voice signal frame;
    separately processing the non-local voice feature of each voice signal frame in the original voice signal to obtain a mixed voice feature of each of the voice signal frames;
    obtaining denoising parameters based on the mixed voice features of the plurality of voice signal frames;
    denoising the original voice signal based on the denoising parameters to obtain a target voice signal.
PCT/CN2021/116212 2021-01-29 2021-09-02 Voice signal processing method and electronic device WO2022160715A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110125640.5A CN112967730A (en) 2021-01-29 2021-01-29 Voice signal processing method and device, electronic equipment and storage medium
CN202110125640.5 2021-01-29

Publications (1)

Publication Number Publication Date
WO2022160715A1 true WO2022160715A1 (en) 2022-08-04

Family

ID=76273584

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/116212 WO2022160715A1 (en) 2021-01-29 2021-09-02 Voice signal processing method and electronic device

Country Status (2)

Country Link
CN (1) CN112967730A (en)
WO (1) WO2022160715A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111028861B (en) * 2019-12-10 2022-02-22 思必驰科技股份有限公司 Spectrum mask model training method, audio scene recognition method and system
CN112967730A (en) * 2021-01-29 2021-06-15 北京达佳互联信息技术有限公司 Voice signal processing method and device, electronic equipment and storage medium
CN113343924B (en) * 2021-07-01 2022-05-17 齐鲁工业大学 Modulation signal identification method based on cyclic spectrum characteristics and generation countermeasure network
CN113674753B (en) * 2021-08-11 2023-08-01 河南理工大学 Voice enhancement method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012096072A1 (en) * 2011-01-13 2012-07-19 日本電気株式会社 Audio-processing device, control method therefor, recording medium containing control program for said audio-processing device, vehicle provided with said audio-processing device, information-processing device, and information-processing system
WO2014070139A2 (en) * 2012-10-30 2014-05-08 Nuance Communications, Inc. Speech enhancement
CN106486131A (en) * 2016-10-14 2017-03-08 上海谦问万答吧云计算科技有限公司 A kind of method and device of speech de-noising
CN112071307A (en) * 2020-09-15 2020-12-11 江苏慧明智能科技有限公司 Intelligent incomplete voice recognition method for elderly people
CN112967730A (en) * 2021-01-29 2021-06-15 北京达佳互联信息技术有限公司 Voice signal processing method and device, electronic equipment and storage medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284749A (en) * 2017-07-19 2019-01-29 微软技术许可有限责任公司 Refine image recognition
CN108010514B (en) * 2017-11-20 2021-09-10 四川大学 Voice classification method based on deep neural network
CN109147798B (en) * 2018-07-27 2023-06-09 北京三快在线科技有限公司 Speech recognition method, device, electronic equipment and readable storage medium
CN109919114A (en) * 2019-03-14 2019-06-21 浙江大学 One kind is based on the decoded video presentation method of complementary attention mechanism cyclic convolution
KR20200119410A (en) * 2019-03-28 2020-10-20 한국과학기술원 System and Method for Recognizing Emotions from Korean Dialogues based on Global and Local Contextual Information
US11580970B2 (en) * 2019-04-05 2023-02-14 Samsung Electronics Co., Ltd. System and method for context-enriched attentive memory network with global and local encoding for dialogue breakdown detection
CN110148091A (en) * 2019-04-10 2019-08-20 深圳市未来媒体技术研究院 Neural network model and image super-resolution method based on non local attention mechanism
CN110415702A (en) * 2019-07-04 2019-11-05 北京搜狗科技发展有限公司 Training method and device, conversion method and device
CN110298413B (en) * 2019-07-08 2021-07-16 北京字节跳动网络技术有限公司 Image feature extraction method and device, storage medium and electronic equipment
CN110739002B (en) * 2019-10-16 2022-02-22 中山大学 Complex domain speech enhancement method, system and medium based on generation countermeasure network
CN110992974B (en) * 2019-11-25 2021-08-24 百度在线网络技术(北京)有限公司 Speech recognition method, apparatus, device and computer readable storage medium
CN111341331B (en) * 2020-02-25 2023-04-18 厦门亿联网络技术股份有限公司 Voice enhancement method, device and medium based on local attention mechanism
CN112257758A (en) * 2020-09-27 2021-01-22 浙江大华技术股份有限公司 Fine-grained image recognition method, convolutional neural network and training method thereof

Also Published As

Publication number Publication date
CN112967730A (en) 2021-06-15

Similar Documents

Publication Publication Date Title
WO2022160715A1 (en) Voice signal processing method and electronic device
CN110097019B (en) Character recognition method, character recognition device, computer equipment and storage medium
CN110543289B (en) Method for controlling volume and electronic equipment
CN111696532B (en) Speech recognition method, device, electronic equipment and storage medium
CN111445901B (en) Audio data acquisition method and device, electronic equipment and storage medium
CN110047468B (en) Speech recognition method, apparatus and storage medium
CN108320756B (en) Method and device for detecting whether audio is pure music audio
CN110933334B (en) Video noise reduction method, device, terminal and storage medium
CN109003621B (en) Audio processing method and device and storage medium
CN109243479B (en) Audio signal processing method and device, electronic equipment and storage medium
CN112233689B (en) Audio noise reduction method, device, equipment and medium
CN112581358A (en) Training method of image processing model, image processing method and device
CN111276122A (en) Audio generation method and device and storage medium
CN111613213A (en) Method, device, equipment and storage medium for audio classification
CN109961802B (en) Sound quality comparison method, device, electronic equipment and storage medium
CN109065068B (en) Audio processing method, device and storage medium
CN111176465A (en) Use state identification method and device, storage medium and electronic equipment
CN111223475A (en) Voice data generation method and device, electronic equipment and storage medium
CN112233688B (en) Audio noise reduction method, device, equipment and medium
CN116208704A (en) Sound processing method and device
CN112133319A (en) Audio generation method, device, equipment and storage medium
CN112508959A (en) Video object segmentation method and device, electronic equipment and storage medium
CN110232417B (en) Image recognition method and device, computer equipment and computer readable storage medium
CN113301444B (en) Video processing method and device, electronic equipment and storage medium
CN112151017B (en) Voice processing method, device, system, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21922310

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 06.11.2023)