CN112967730A - Voice signal processing method and device, electronic equipment and storage medium - Google Patents
Voice signal processing method and device, electronic equipment and storage medium
- Publication number: CN112967730A
- Application number: CN202110125640.5A
- Authority: CN (China)
- Prior art keywords: voice, feature, speech, voice signal, local
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0264—Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
Abstract
The disclosure relates to a voice signal processing method and device, an electronic device, and a storage medium, and belongs to the technical field of speech processing. The method comprises: determining a first speech feature for each of a plurality of speech signal frames in an original speech signal; calling a non-local attention network to fuse the first speech features of the plurality of frames, obtaining a non-local speech feature for each frame; calling a local attention network to process the non-local speech feature of each frame, obtaining a mixed speech feature for each frame; acquiring denoising parameters based on the mixed speech features of the plurality of frames; and denoising the original speech signal according to the denoising parameters to obtain a target speech signal. Because the method takes the context of each speech signal frame into account, the obtained denoising parameters are more accurate and the denoising effect on the original speech signal is improved.
Description
Technical Field
The present disclosure relates to the field of speech processing technologies, and in particular, to a method and an apparatus for processing a speech signal, an electronic device, and a storage medium.
Background
Collected speech signals usually contain noise, and this noise adversely affects subsequent processing, so removing it is of great importance for speech signal processing.
In the related art, spectral subtraction is used to denoise a speech signal: a silence segment of the signal is located, a noise estimate is extracted from that segment, and the noise spectrum is subtracted from the speech spectrum to remove the noise.
Disclosure of Invention
The disclosure provides a voice signal processing method and device, an electronic device, and a storage medium, which improve the denoising effect on a speech signal.
According to an aspect of the embodiments of the present disclosure, there is provided a method for processing a speech signal, the method including:
determining a first speech feature of a plurality of speech signal frames in an original speech signal;
calling a non-local attention network to fuse the first voice features of the voice signal frames to obtain the non-local voice feature of each voice signal frame;
calling a local attention network to respectively process the non-local voice features of each voice signal frame to obtain the mixed voice features of each voice signal frame;
acquiring denoising parameters based on mixed voice characteristics of the voice signal frames;
and denoising the original voice signal according to the denoising parameter to obtain a target voice signal.
In the method provided by the embodiments of the disclosure, the non-local attention network and the local attention network are called to process the first speech features of a plurality of speech signal frames in the original speech signal, yielding denoising parameters. Each denoising parameter represents the proportion of the signal in a frame that is not noise, so denoising the original speech signal with these parameters removes the noise it contains. Moreover, because the non-local attention network considers the context of each speech signal frame when processing its first speech feature, the obtained denoising parameters are more accurate, which improves the denoising effect on the original speech signal.
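The claimed steps can be sketched as a single data-flow pipeline. In the sketch below, the networks are supplied as plain callables and the denoising parameters act as a multiplicative mask on the frame features; both choices are illustrative assumptions, since the patent fixes only the order of the steps, not the network internals.

```python
import numpy as np

def denoise_pipeline(first_feats, nonlocal_net, local_net, get_params, apply_params):
    """Sketch of the claimed data flow. All four callables are
    placeholders for the patent's networks; only the sequencing of the
    steps is taken from the text."""
    nonlocal_feats = nonlocal_net(first_feats)   # fuse features across all frames
    mixed_feats = local_net(nonlocal_feats)      # per-frame local attention
    params = get_params(mixed_feats)             # denoising parameters
    return apply_params(params)                  # denoised target signal

# Toy run with identity stand-ins: the "denoising parameters" are all
# ones, so the "target signal" equals the input features.
feats = np.ones((4, 8))
out = denoise_pipeline(
    feats,
    nonlocal_net=lambda f: f,
    local_net=lambda f: f,
    get_params=lambda f: np.ones_like(f),
    apply_params=lambda p: feats * p,
)
```

With real networks, only the four callables change; the wiring between them stays as above.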
In one possible implementation, the determining a first speech feature of a plurality of speech signal frames in an original speech signal includes:
and calling a feature extraction network to respectively perform feature extraction on the original amplitudes of the voice signal frames to obtain first voice features of the voice signal frames.
In the embodiments of the disclosure, because the noise in a speech signal frame manifests in the frame's original amplitude, feature extraction is performed on the original amplitudes only; the original phases of the original speech signal need not be processed, which reduces the amount of computation.
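One conventional way to obtain per-frame amplitudes and phases is to window the signal into overlapping frames and take the FFT of each frame; the magnitude goes to feature extraction and the phase is set aside. The frame length, hop size, and window below are illustrative choices, not values fixed by the patent.

```python
import numpy as np

def frame_signal(x, frame_len=512, hop=256):
    """Split a 1-D signal into overlapping Hann-windowed frames (a
    common front end; sizes are assumptions, not from the patent)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hanning(frame_len)

def magnitude_phase(frames):
    """Per-frame FFT split into magnitude (the amplitudes that are
    processed) and phase (kept aside, reused at reconstruction)."""
    spec = np.fft.rfft(frames, axis=1)
    return np.abs(spec), np.angle(spec)

x = np.random.default_rng(0).standard_normal(4096)
frames = frame_signal(x)
mag, phase = magnitude_phase(frames)
```

Only `mag` feeds the feature extraction network; `phase` is carried unchanged to the reconstruction step.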
In another possible implementation manner, the denoising the original speech signal according to the denoising parameter to obtain a target speech signal includes:
calling a voice denoising network, and denoising the original amplitudes of the voice signal frames according to the denoising parameters to obtain target amplitudes of the voice signal frames;
and combining the original phases and the target amplitudes of the voice signal frames to obtain the target voice signal.
In the embodiments of the disclosure, the original amplitudes are denoised according to the obtained denoising parameters, yielding target amplitudes free of the noise signal; the target speech signal, with the noise removed, is then recovered from the target amplitudes and the original phases. This denoising method processes only the amplitudes of the speech signal and leaves the phases untouched, which reduces the number of features to process and improves processing speed.
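The recombination step can be sketched as follows: the target amplitude is the original amplitude scaled by a per-bin denoising parameter (a hypothetical mask here; in the patent it is predicted by the model), the original phase is reattached, and the inverse-FFT frames are overlap-added back into a waveform.

```python
import numpy as np

def reconstruct(mag, phase, mask, frame_len=512, hop=256):
    """Target amplitude = original amplitude * denoising parameter;
    combine with the original phase and overlap-add the inverse FFTs.
    The mask and sizes are illustrative assumptions."""
    spec = (mag * mask) * np.exp(1j * phase)        # target amplitude + original phase
    frames = np.fft.irfft(spec, n=frame_len, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, f in enumerate(frames):                  # overlap-add back to a waveform
        out[i * hop:i * hop + frame_len] += f
    return out
```

With a mask of all ones the frames are reproduced exactly, so the overlap-add output equals the sum of the original frames.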
In another possible implementation manner, the obtaining denoising parameters based on the mixed speech features of the plurality of speech signal frames includes:
and calling a feature reconstruction network to perform feature reconstruction on the mixed voice features of the voice signal frames to obtain the denoising parameters.
In the embodiments of the disclosure, the denoising parameter obtained by calling the feature reconstruction network represents the proportion of the signal in each speech signal frame that is not noise; the denoising parameter is subsequently used to denoise the original speech signal.
In another possible implementation manner, the non-local attention network includes a first processing unit, a second processing unit, and a first fusion unit, and the invoking of the non-local attention network to fuse the first speech features of the plurality of speech signal frames to obtain the non-local speech feature of each speech signal frame includes:
calling the first processing unit to perform feature extraction on the first speech features of the plurality of speech signal frames respectively to obtain a second speech feature of each speech signal frame, wherein the first processing unit comprises a plurality of dilated residual subunits;
calling the second processing unit, and fusing the first voice feature of each voice signal frame with the first voice features of other voice signal frames respectively to obtain a third voice feature of each voice signal frame;
and calling the first fusion unit, and respectively fusing the second voice feature and the third voice feature of each voice signal frame to obtain the non-local voice feature of each voice signal frame.
In the embodiments of the disclosure, different processing units of the non-local attention network process the first speech feature in different ways. The first processing unit, which comprises a plurality of dilated residual subunits, further extracts the first speech feature to obtain a deeper speech feature. The second processing unit uses a non-local attention mechanism: when processing the first speech feature of a speech signal frame, it also considers the other speech signal frames in the signal, that is, it combines context information to obtain a more accurate speech feature. The first fusion unit is then called to fuse the features produced by the two processing units into the non-local speech feature. In addition, the dilated residual subunits enlarge the receptive field, capturing even more context information.
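A dilated residual subunit of the first processing unit might be sketched as below: a 1-D dilated convolution over the frame axis plus an identity skip connection. The kernel, the dilation rate, and the omission of normalization and activation are all simplifying assumptions; the patent specifies only that the subunits are dilated and residual.

```python
import numpy as np

def dilated_residual(x, w, dilation):
    """One dilated residual subunit (sketch): a same-length 1-D dilated
    convolution plus a skip connection. Dilation widens the receptive
    field without adding parameters."""
    k = len(w)
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)                       # zero-pad so output length == input length
    conv = np.array([sum(w[j] * xp[t + j * dilation] for j in range(k))
                     for t in range(len(x))])
    return x + conv                           # residual (skip) connection

# Stacking subunits with dilations 1, 2, 4, ... grows the receptive
# field exponentially while the number of frames stays fixed.
```

With the identity kernel `[0, 1, 0]` the convolution reproduces its input, so the subunit's output is exactly twice the input, which makes the skip connection easy to verify.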
In another possible implementation manner, the non-local attention network further includes a second fusion unit, and after the first fusion unit is called to fuse the second speech feature and the third speech feature of each speech signal frame to obtain the non-local speech feature of each speech signal frame, the processing method further includes:
and calling the second fusion unit to fuse the non-local voice feature of each voice signal frame with the first voice feature to obtain the fused non-local voice feature of each voice signal frame.
In the embodiments of the disclosure, residual learning is used in the non-local attention network: after the non-local speech feature is obtained, it is fused with the input first speech feature, making the final non-local speech feature more accurate and preventing the loss of important features. In addition, a residual learning network is easier to optimize, which improves training efficiency.
In another possible implementation, the second processing unit includes a residual non-local subunit, a convolution subunit, and a deconvolution subunit; the calling the second processing unit to fuse the first voice feature of each voice signal frame with the first voice features of other voice signal frames respectively to obtain a third voice feature of each voice signal frame, including:
calling the residual non-local subunit, and performing weighted fusion on the first voice feature of each voice signal frame and the first voice features of other voice signal frames according to the weights corresponding to the voice signal frames to obtain the first voice feature after weighted fusion of each voice signal frame;
calling the convolution subunit, and coding the first voice feature after weighted fusion of each voice signal frame to obtain the coding feature of each voice signal frame;
and calling the deconvolution subunit, and decoding the coding features of each voice signal frame to obtain a third voice feature of each voice signal frame.
In the embodiments of the disclosure, when processing the first speech feature of each speech signal frame, the residual non-local subunit also considers the other speech signal frames in the signal, that is, it combines context information to obtain a more accurate speech feature.
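The weighted fusion across frames can be sketched as a non-local (self-attention-style) average: each frame's feature is replaced by a similarity-weighted sum over all frames, so every output mixes in context from the whole utterance. Dot-product similarity with a softmax is one common choice; the patent does not commit to this exact weighting.

```python
import numpy as np

def nonlocal_fusion(feats):
    """Non-local weighted fusion (sketch): feats is (frames, channels).
    Each output row is a softmax-weighted combination of ALL rows, with
    weights given by pairwise dot-product similarity."""
    sim = feats @ feats.T                          # (T, T) frame-to-frame similarities
    w = np.exp(sim - sim.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)              # softmax: each row of weights sums to 1
    return w @ feats                               # context-aware fused features
```

A useful sanity check: when every frame already carries the same feature, the weighted average of identical rows returns the input unchanged.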
In another possible implementation manner, the second processing unit further includes a feature reduction subunit. After the convolution subunit is called to encode the weighted-fused first speech feature of each speech signal frame and obtain the encoding feature of each speech signal frame, the processing method further includes:
calling the feature reduction subunit to perform feature reduction on the coding features of each voice signal frame to obtain a plurality of reduced coding features;
the calling of the deconvolution subunit to decode the encoding feature of each speech signal frame to obtain a third speech feature of each speech signal frame includes:
calling the deconvolution subunit to decode the plurality of reduced encoding features to obtain the third speech feature of each speech signal frame.
In the embodiments of the disclosure, reducing the encoding features shrinks them, which lowers the amount of calculation and speeds up their subsequent processing.
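One simple form such a reduction could take is a learned linear projection to fewer channels before the expensive decoding step; the projection matrix and channel counts below are hypothetical, since the patent states only that reduction cuts the computation.

```python
import numpy as np

rng = np.random.default_rng(1)
enc = rng.standard_normal((15, 64))          # encoding features: 15 frames x 64 channels
w_down = rng.standard_normal((64, 16)) / 8.0 # hypothetical reduction matrix, 64 -> 16

reduced = enc @ w_down  # a 4x smaller tensor for the deconvolution subunit to decode
```

Since the cost of the decoding step scales with the number of channels, halving or quartering the channel count before decoding gives a proportional saving.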
In another possible implementation manner, the residual non-local subunit includes a first fusion layer and a second fusion layer, and the invoking of the residual non-local subunit to perform weighted fusion of the first speech feature of each speech signal frame with the first speech features of the other speech signal frames according to the weights corresponding to the plurality of speech signal frames, obtaining the weighted-fused first speech feature of each speech signal frame, includes:
calling the first fusion layer, and performing weighted fusion on the first voice feature of each voice signal frame and the first voice features of other voice signal frames according to the weights corresponding to the voice signal frames to obtain the fusion feature of each voice signal frame;
and calling the second fusion layer, and respectively fusing the first voice feature and the fusion feature of each voice signal frame to obtain the first voice feature after weighted fusion of each voice signal frame.
In the embodiments of the disclosure, the first fusion layer fuses the first speech features of different speech signal frames according to their corresponding weights, yielding more accurate fusion features. Because the residual non-local subunit includes both a first fusion layer and a second fusion layer, the fusion feature is further fused with the input first speech feature, so the final weighted-fused first speech feature is more accurate and important features are not lost. In addition, residual learning is easier to optimize, which improves training efficiency.
In another possible implementation, the speech processing model includes at least the non-local attention network and the local attention network, and the training process of the speech processing model is as follows:
acquiring a sample voice signal and a sample noise signal;
mixing the sample voice signal and the sample noise signal to obtain a sample mixed signal;
calling the voice processing model, and processing a plurality of sample voice signal frames in the sample mixed signal to obtain a prediction denoising parameter corresponding to the sample mixed signal;
denoising the sample mixed signal according to the predicted denoising parameter to obtain a denoised predicted speech signal;
training the speech processing model based on a difference between the predicted speech signal and the sample speech signal.
In the embodiments of the disclosure, the sample speech signal and the sample noise signal are mixed to obtain a sample mixed signal, which is used to train the speech processing model; because the clean sample speech signal is known, it provides the supervision target for training.
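The data preparation and loss steps above can be sketched as follows. The patent only says the signals are mixed; scaling the noise to a target signal-to-noise ratio and using a mean-squared-error loss are common conventions assumed here, not claimed by the patent.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix a clean sample speech signal with a sample noise signal,
    scaling the noise so the mixture has the requested SNR in dB
    (an assumed convention; the patent just says 'mix')."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def training_loss(predicted, clean):
    """Train on the difference between the denoised prediction and the
    clean sample signal; MSE is one standard choice of difference."""
    return float(np.mean((predicted - clean) ** 2))
```

At 0 dB the scaled noise carries exactly as much power as the speech, and the loss is zero only when the prediction matches the clean signal, which is the fixed point training drives toward.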
According to still another aspect of the embodiments of the present disclosure, there is provided an apparatus for processing a speech signal, the apparatus including:
a feature determination unit configured to perform determining a first speech feature of a plurality of speech signal frames in an original speech signal;
a non-local feature obtaining unit configured to perform a non-local attention network invoking to fuse first voice features of the plurality of voice signal frames, so as to obtain a non-local voice feature of each voice signal frame;
the mixed feature acquisition unit is configured to execute calling of a local attention network to respectively process the non-local voice features of each voice signal frame to obtain mixed voice features of each voice signal frame;
a denoising parameter acquiring unit configured to perform acquiring a denoising parameter based on a mixed speech feature of the plurality of speech signal frames;
and the target signal acquisition unit is configured to perform denoising on the original voice signal according to the denoising parameter to obtain a target voice signal.
In a possible implementation manner, the feature determination unit is configured to perform feature extraction on the original amplitudes of the plurality of speech signal frames respectively by invoking a feature extraction network, so as to obtain a first speech feature of the plurality of speech signal frames.
In another possible implementation manner, the target signal acquiring unit includes:
the amplitude obtaining subunit is configured to execute calling of a voice denoising network, and denoise the original amplitudes of the voice signal frames according to the denoising parameters to obtain target amplitudes of the voice signal frames;
a signal obtaining subunit configured to perform combining original phases and target amplitudes of the plurality of speech signal frames to obtain the target speech signal.
In another possible implementation manner, the denoising parameter obtaining unit is configured to perform feature reconstruction on the mixed speech features of the speech signal frames by invoking a feature reconstruction network, so as to obtain the denoising parameter.
In another possible implementation manner, the non-local attention network includes a first processing unit, a second processing unit, and a first fusion unit, and the non-local feature obtaining unit includes:
a feature extraction subunit, configured to invoke the first processing unit to perform feature extraction on the first speech features of the plurality of speech signal frames, respectively, to obtain a second speech feature of each speech signal frame, where the first processing unit includes a plurality of dilated residual subunits;
the first fusion subunit is configured to execute and call the second processing unit, and fuse the first voice feature of each voice signal frame with the first voice features of other voice signal frames respectively to obtain a third voice feature of each voice signal frame;
and the second fusion subunit is configured to execute calling of the first fusion unit, and respectively fuse the second voice feature and the third voice feature of each voice signal frame to obtain the non-local voice feature of each voice signal frame.
In another possible implementation manner, the non-local attention network further includes a second fusion unit, and the non-local feature obtaining unit further includes:
and the third fusion subunit is configured to execute calling the second fusion unit to fuse the non-local voice feature of each voice signal frame with the first voice feature, so as to obtain the fused non-local voice feature of each voice signal frame.
In another possible implementation, the second processing unit includes a residual non-local subunit, a convolution subunit, and a deconvolution subunit; the first fusion subunit configured to perform:
calling the residual non-local subunit, and performing weighted fusion on the first voice feature of each voice signal frame and the first voice features of other voice signal frames according to the weights corresponding to the voice signal frames to obtain the first voice feature after weighted fusion of each voice signal frame;
calling the convolution subunit, and coding the first voice feature after weighted fusion of each voice signal frame to obtain the coding feature of each voice signal frame;
and calling the deconvolution subunit, and decoding the coding features of each voice signal frame to obtain a third voice feature of each voice signal frame.
In another possible implementation manner, the second processing unit further includes a feature reduction subunit, and the first fusion subunit is configured to perform:
calling the feature reduction subunit to perform feature reduction on the coding features of each voice signal frame to obtain a plurality of reduced coding features;
the calling of the deconvolution subunit to decode the encoding feature of each speech signal frame to obtain a third speech feature of each speech signal frame includes:
calling the deconvolution subunit to decode the plurality of reduced encoding features to obtain the third speech feature of each speech signal frame.
In another possible implementation, the residual non-local sub-unit includes a first fusion layer and a second fusion layer, and the first fusion sub-unit is configured to perform:
calling the first fusion layer, and performing weighted fusion on the first voice feature of each voice signal frame and the first voice features of other voice signal frames according to the weights corresponding to the voice signal frames to obtain the fusion feature of each voice signal frame;
and calling the second fusion layer, and respectively fusing the first voice feature and the fusion feature of each voice signal frame to obtain the first voice feature after weighted fusion of each voice signal frame.
In another possible implementation, the speech processing model includes at least the non-local attention network and the local attention network, and the training process of the speech processing model is as follows:
acquiring a sample voice signal and a sample noise signal;
mixing the sample voice signal and the sample noise signal to obtain a sample mixed signal;
calling the voice processing model, and processing a plurality of sample voice signal frames in the sample mixed signal to obtain a prediction denoising parameter corresponding to the sample mixed signal;
denoising the sample mixed signal according to the predicted denoising parameter to obtain a denoised predicted speech signal;
training the speech processing model based on a difference between the predicted speech signal and the sample speech signal.
According to still another aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
one or more processors;
a memory for storing instructions executable by the one or more processors;
wherein the one or more processors are configured to perform the method of processing a speech signal of the above aspect.
According to still another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions of the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method for processing a voice signal according to the above aspect.
According to yet another aspect of the embodiments of the present disclosure, there is provided a computer program product comprising a computer program, which when executed by a processor, implements the processing method of the speech signal according to the above aspect.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a schematic diagram illustrating a speech processing model according to an exemplary embodiment.
FIG. 2 is a schematic diagram illustrating another speech processing model according to an exemplary embodiment.
FIG. 3 is a schematic diagram illustrating another speech processing model according to an exemplary embodiment.
Fig. 4 is a flow chart illustrating a method of processing a speech signal according to an example embodiment.
Fig. 5 is a flow chart illustrating another method of processing a speech signal according to an example embodiment.
Fig. 6 is a schematic diagram illustrating a non-local attention network in accordance with an exemplary embodiment.
FIG. 7 is a flow diagram illustrating a method for non-local speech feature acquisition according to an example embodiment.
FIG. 8 is a schematic diagram illustrating a first processing unit in accordance with an exemplary embodiment.
FIG. 9 is a schematic diagram illustrating a second processing unit in accordance with an exemplary embodiment.
Fig. 10 is a schematic diagram illustrating another second processing unit according to an example embodiment.
Fig. 11 is a schematic diagram illustrating a residual non-local sub-unit in accordance with an exemplary embodiment.
Fig. 12 is a schematic diagram illustrating another non-local attention network in accordance with an example embodiment.
Fig. 13 is a block diagram illustrating a speech signal processing apparatus according to an exemplary embodiment.
Fig. 14 is a block diagram illustrating another speech signal processing apparatus according to an exemplary embodiment.
Fig. 15 is a block diagram illustrating a structure of a terminal according to an exemplary embodiment.
Fig. 16 is a block diagram illustrating a configuration of a server according to an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the description of the above-described figures are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should be noted that the user information (including but not limited to user device information, user personal information, etc.) referred to in the present disclosure is information authorized by the user or sufficiently authorized by each party.
The processing method of the voice signal provided by the embodiment of the disclosure can be applied to various scenes.
For example, in a live scene.
During a live broadcast, the streamer's speech signal collected by the streamer's terminal may contain a noise signal; if the viewer's terminal plays this speech signal directly, the noise may make the speech unclear and degrade the viewing experience.
Also for example, in an automatic speech recognition scenario.
During speech recognition, a noise signal present in the voice signal interferes with recognition, lowering the recognition accuracy and making it difficult to recognize the content of the voice signal correctly.
The method provided by the embodiment of the disclosure can also be applied to scenes such as video playing, language identification, voice synthesis, identity identification and the like.
FIG. 1 is a schematic illustration of a speech processing model provided according to an exemplary embodiment, the speech processing model including: a non-local attention network 101 and a local attention network 102, the non-local attention network 101 and the local attention network 102 being connected. The non-local attention network 101 is configured to process a first speech feature of an input original speech signal to obtain a non-local speech feature of the original speech signal, and the local attention network 102 is configured to further process the non-local speech feature of the original speech signal to obtain a mixed speech feature of the original speech signal.
In one possible implementation, referring to fig. 2, the speech processing model further includes: the system comprises a feature extraction network 103, a feature reconstruction network 104 and a voice denoising network 105, wherein the feature extraction network 103 is connected with a non-local attention network 101, the feature reconstruction network 104 is connected with a local attention network 102, and the voice denoising network 105 is connected with the feature reconstruction network 104. The feature extraction network 103 is configured to extract a first voice feature of an original voice signal, the feature reconstruction network 104 is configured to perform feature reconstruction on a mixed voice feature of the processed original voice signal to obtain a denoising parameter of the original voice signal, and the voice denoising network 105 is configured to denoise the original voice signal.
In one possible implementation, the speech processing model includes a plurality of non-local attention networks 101 and a plurality of local attention networks 102, and the plurality of non-local attention networks 101 and the plurality of local attention networks 102 can be connected in sequence in any order. For example, referring to fig. 3, the speech processing model includes two non-local attention networks 101 and two local attention networks 102: the feature extraction network 103 is connected to the first non-local attention network 101, the first non-local attention network 101 is connected to the first local attention network 102, the first local attention network 102 is connected to the second local attention network 102, the second local attention network 102 is connected to the second non-local attention network 101, and the second non-local attention network 101 is connected to the feature reconstruction network 104.
The voice signal processing method provided by the embodiments of the present disclosure can be applied to an electronic device, where the electronic device is a terminal or a server. The terminal is any of various types of terminals, such as a portable terminal, a pocket terminal, or a handheld terminal, for example a mobile phone, a computer, or a tablet computer. The server is a single server, a server cluster composed of multiple servers, or a cloud computing service center.
Fig. 4 is a flowchart illustrating a method for processing a speech signal according to an exemplary embodiment, and referring to fig. 4, the method is applied to an electronic device, and includes the following steps:
401. Determine first speech features of a plurality of speech signal frames in an original speech signal.
402. Invoke a non-local attention network to fuse the first speech features of the speech signal frames, obtaining the non-local speech feature of each speech signal frame.
403. Invoke a local attention network to separately process the non-local speech feature of each speech signal frame, obtaining the mixed speech feature of each speech signal frame.
404. Acquire denoising parameters based on the mixed speech features of the speech signal frames.
405. Denoise the original speech signal according to the denoising parameters to obtain a target speech signal.
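The five steps above can be sketched end to end in numpy. This is a minimal illustration, not the patented networks: every function here (random projections, a softmax fusion, a ReLU, a sigmoid mask) is a hypothetical stand-in for the corresponding network.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(magnitudes):
    # Step 401: derive a first speech feature per frame (here, a random linear projection).
    return magnitudes @ rng.standard_normal((magnitudes.shape[1], 8))

def non_local_attention(features):
    # Step 402: fuse each frame's feature with every other frame's feature.
    scores = features @ features.T
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ features

def local_attention(features):
    # Step 403: per-frame (local) processing, here a simple nonlinearity.
    return np.maximum(features, 0.0)

def reconstruct_mask(features, n_bins):
    # Step 404: map mixed features to denoising parameters in (0, 1].
    logits = features @ rng.standard_normal((features.shape[1], n_bins))
    return 1.0 / (1.0 + np.exp(-logits))

magnitudes = np.abs(rng.standard_normal((4, 16)))  # 4 frames x 16 frequency bins
feats = local_attention(non_local_attention(extract_features(magnitudes)))
mask = reconstruct_mask(feats, magnitudes.shape[1])
denoised = mask * magnitudes                       # Step 405: apply the parameters
assert denoised.shape == magnitudes.shape
```

Because the mask never exceeds 1, applying it can only attenuate each time-frequency bin, which matches the role of the denoising parameters as a per-bin proportion.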
The method provided by the embodiments of the present disclosure invokes the non-local attention network and the local attention network to process the first speech features of a plurality of speech signal frames in the original speech signal, obtaining denoising parameters. The denoising parameters represent the proportion of the signal other than the noise signal in each speech signal frame, so denoising the original speech signal with these parameters removes the noise in it. Moreover, because the non-local attention network considers the context of each speech signal frame when processing its first speech feature, the resulting denoising parameters are more accurate, which improves the denoising effect on the original speech signal.
In one possible implementation, determining a first speech feature of a plurality of speech signal frames in an original speech signal includes:
and calling a feature extraction network to respectively perform feature extraction on the original amplitudes of the voice signal frames to obtain first voice features of the voice signal frames.
In another possible implementation manner, denoising an original speech signal according to a denoising parameter to obtain a target speech signal, includes:
calling a voice denoising network, and denoising the original amplitudes of the voice signal frames according to denoising parameters to obtain target amplitudes of the voice signal frames;
and combining the original phases and the target amplitudes of the voice signal frames to obtain a target voice signal.
In another possible implementation manner, acquiring denoising parameters based on mixed speech features of a plurality of speech signal frames includes:
and calling a feature reconstruction network to perform feature reconstruction on the mixed voice features of the voice signal frames to obtain denoising parameters.
In another possible implementation manner, the non-local attention network includes a first processing unit, a second processing unit, and a first fusion unit, and the non-local attention network is invoked to fuse the first speech features of the plurality of speech signal frames to obtain the non-local speech feature of each speech signal frame, including:
calling a first processing unit to respectively perform feature extraction on the first speech features of the plurality of speech signal frames to obtain a second speech feature of each speech signal frame, wherein the first processing unit comprises a plurality of hole residual subunits;
calling a second processing unit, and fusing the first voice feature of each voice signal frame with the first voice features of other voice signal frames respectively to obtain a third voice feature of each voice signal frame;
and calling a first fusion unit, and respectively fusing the second voice feature and the third voice feature of each voice signal frame to obtain the non-local voice feature of each voice signal frame.
In another possible implementation manner, the non-local attention network further includes a second fusion unit, the first fusion unit is invoked to respectively fuse the second speech feature and the third speech feature of each speech signal frame, and after the non-local speech feature of each speech signal frame is obtained, the processing method further includes:
and calling a second fusion unit to fuse the non-local voice feature of each voice signal frame with the first voice feature to obtain the fused non-local voice feature of each voice signal frame.
In another possible implementation, the second processing unit includes a residual non-local subunit, a convolution subunit, and a deconvolution subunit; calling a second processing unit, and fusing the first voice feature of each voice signal frame with the first voice features of other voice signal frames respectively to obtain a third voice feature of each voice signal frame, wherein the third voice feature comprises:
calling a residual non-local subunit, and performing weighted fusion on the first voice feature of each voice signal frame and the first voice features of other voice signal frames according to weights corresponding to a plurality of voice signal frames to obtain the first voice feature after weighted fusion of each voice signal frame;
calling a convolution subunit, and coding the first voice feature after weighted fusion of each voice signal frame to obtain the coding feature of each voice signal frame;
and calling a deconvolution subunit, and decoding the coding characteristics of each voice signal frame to obtain third voice characteristics of each voice signal frame.
In another possible implementation manner, the second processing unit further includes a feature reduction subunit. After the convolution subunit is invoked to encode the weighted-fused first speech feature of each speech signal frame to obtain the coding feature of each speech signal frame, the processing method further includes:
calling the feature reduction subunit to perform feature reduction on the coding feature of each speech signal frame to obtain a plurality of reduced coding features;
and calling the deconvolution subunit to decode the coding feature of each speech signal frame to obtain the third speech feature of each speech signal frame includes:
calling the deconvolution subunit to decode the plurality of reduced coding features to obtain the third speech feature of each speech signal frame.
In another possible implementation manner, the calling the residual non-local subunit, and performing weighted fusion on the first speech feature of each speech signal frame and the first speech features of other speech signal frames according to weights corresponding to the multiple speech signal frames to obtain the weighted-fused first speech feature of each speech signal frame, where the method includes:
calling a first fusion layer, and performing weighted fusion on the first voice feature of each voice signal frame and the first voice features of other voice signal frames according to the weights corresponding to the voice signal frames to obtain the fusion feature of each voice signal frame;
and calling a second fusion layer, and respectively fusing the first voice feature and the fusion feature of each voice signal frame to obtain the first voice feature after weighted fusion of each voice signal frame.
In another possible implementation, the speech processing model includes at least a non-local attention network and a local attention network, and the training process of the speech processing model is as follows:
acquiring a sample voice signal and a sample noise signal;
mixing the sample voice signal and the sample noise signal to obtain a sample mixed signal;
calling the speech processing model to process a plurality of sample speech signal frames in the sample mixed signal to obtain predicted denoising parameters corresponding to the sample mixed signal;
denoising the sample mixed signal according to the predicted denoising parameters to obtain a denoised predicted speech signal;
and training the speech processing model based on a difference between the predicted speech signal and the sample speech signal.
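The training steps above can be sketched in numpy. This is a hedged illustration under stated assumptions: the "model output" is a random mask rather than a real network, and the loss is assumed to be mean squared error, which the patent does not fix.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample speech and sample noise magnitudes (4 frames x 16 bins, illustrative).
speech = np.abs(rng.standard_normal((4, 16)))
noise = 0.3 * np.abs(rng.standard_normal((4, 16)))
mixed = speech + noise                      # the sample mixed signal

# Hypothetical model output: predicted denoising parameters (a mask in (0, 1)).
mask = rng.uniform(0.1, 0.9, size=mixed.shape)
predicted = mask * mixed                    # denoised predicted speech signal

# Train on the difference between the prediction and the clean sample speech.
loss = np.mean((predicted - speech) ** 2)

# One illustrative gradient-descent step, taken directly on the mask.
grad = 2.0 * (predicted - speech) * mixed / predicted.size
mask_updated = mask - 0.5 * grad
new_loss = np.mean((mask_updated * mixed - speech) ** 2)
assert new_loss < loss                      # the step reduces the training loss
```

In the actual method the gradient would flow back through the attention networks' parameters rather than the mask itself; optimizing the mask directly just makes the loss-reduction step visible in a few lines.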
Fig. 5 is a flowchart illustrating another speech signal processing method according to an exemplary embodiment, and referring to fig. 5, the method is applied to an electronic device, and includes the following steps:
501. the electronic device obtains original amplitudes and original phases of a plurality of speech signal frames in an original speech signal.
Because a speech signal comprises amplitude and phase, and the noise signal in a speech signal is contained in the amplitude, the embodiments of the present disclosure obtain the original amplitude and original phase of each speech signal frame in the original speech signal and denoise only the original amplitude. This denoises the original speech signal without processing the original phase, reducing the amount of processing. The original speech signal is collected by the electronic device itself, or is a noise-containing speech signal sent to the electronic device by another electronic device; the noise signal is, for example, ambient noise or white noise.
The original speech signal comprises a plurality of speech signal frames. The electronic device performs a Fourier transform on each speech signal frame to obtain its original amplitude and original phase, and subsequently processes the original amplitude of each speech signal frame to denoise it. The Fourier transform may be a fast Fourier transform, a short-time Fourier transform, or the like.
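The per-frame transform can be illustrated with numpy's FFT (a minimal sketch; the real method would choose frame length, windowing, and overlap, which are arbitrary here):

```python
import numpy as np

rng = np.random.default_rng(2)
frame = rng.standard_normal(256)            # one speech signal frame

spectrum = np.fft.rfft(frame)               # Fourier transform of the frame
magnitude = np.abs(spectrum)                # original amplitude (to be denoised)
phase = np.angle(spectrum)                  # original phase (kept unchanged)

# Magnitude and phase together recover the frame exactly.
recovered = np.fft.irfft(magnitude * np.exp(1j * phase), n=len(frame))
assert np.allclose(recovered, frame)
```

The round trip confirms why the phase can be left untouched: as long as it is retained, the frame is fully reconstructible from the (possibly modified) magnitude.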
In one possible implementation, the speech processing model can process only a limited signal length at a time, for example one minute or two minutes of speech. The signal length of the original speech signal therefore cannot exceed this reference signal length, i.e., the duration of the original speech signal cannot exceed the reference duration. For example, 64 speech signal frames are processed at a time.
502. The electronic equipment calls a feature extraction network to respectively extract the original amplitudes of the voice signal frames to obtain first voice features of the voice signal frames.
The first speech feature of a speech signal frame describes that frame and is represented as a vector, a matrix, or another form. The first speech features of the speech signal frames may be represented separately, or combined together; for example, if the first speech feature of each speech signal frame is a vector, the vectors are combined into a matrix in which each column represents the first speech feature of one speech signal frame.
In one possible implementation, the feature extraction network includes a convolution layer, a batch normalization layer, and an activation function layer.
503. The electronic equipment calls a non-local attention network to fuse the first voice features of the voice signal frames to obtain the non-local voice feature of each voice signal frame.
Wherein the non-local speech feature of each speech signal frame is obtained by combining the first speech features of a plurality of speech signal frames, i.e. taking into account the features of the speech signal frames before and after the speech signal frame.
In the embodiments of the present disclosure, the non-local attention network processes the first speech feature using an attention mechanism and residual learning. When processing the first speech feature of each speech signal frame, the context information of that frame can be taken into account, so the resulting non-local speech feature is more accurate. Because some speech features may be lost while the first speech feature is being processed, residual learning is used: after the first speech feature is processed, the result is combined with the input first speech feature to obtain the non-local speech feature. This avoids losing important speech features in the course of deriving the non-local speech feature from the first speech feature.
In one possible implementation, referring to fig. 6, the non-local attention network includes a first processing unit, a second processing unit, a first fusion unit, and a second fusion unit, where the first processing unit is a Trunk Branch and the second processing unit is a Mask Branch. The first processing unit and the second processing unit separately process the first speech features of a plurality of input speech signal frames, the first fusion unit fuses the features produced by the first and second processing units, and the second fusion unit fuses the feature produced by the first fusion unit with the feature input to the non-local attention network.
Referring to fig. 7, a process for an electronic device to invoke a non-local attention network to process a first speech feature of each speech signal frame includes the steps of:
701. the electronic equipment calls the first processing unit to respectively carry out feature extraction on the first voice features of the voice signal frames to obtain the second voice feature of each voice signal frame.
The second speech feature is obtained by further extracting the first speech feature and contains fewer noise features than the first speech feature.
In one possible implementation, referring to fig. 8, the first processing unit includes a plurality of hole residual subunits (Res.Unit); fig. 8 illustrates only two. Each hole residual subunit includes a hole convolution layer, a batch normalization layer, and an activation function layer, and the hole residual subunits are connected using the network structure of a residual learning network. The hole convolution layer enlarges the receptive field, allowing more context information to be captured.
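The receptive-field effect of hole (dilated) convolution can be shown with a hand-rolled 1-D sketch in numpy (a toy illustration, not the patented layer; the kernel and padding scheme are assumptions):

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """1-D 'hole' (dilated) convolution, zero-padded to keep the input length."""
    k = len(kernel)
    span = (k - 1) * dilation              # receptive-field span of the kernel
    padded = np.pad(x, (span // 2, span - span // 2))
    return np.array([
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(x))
    ])

x = np.arange(8, dtype=float)
kernel = np.array([1.0, 1.0, 1.0])

out_d1 = dilated_conv1d(x, kernel, dilation=1)   # covers 3 adjacent samples
out_d2 = dilated_conv1d(x, kernel, dilation=2)   # covers a 5-sample span
assert out_d1.shape == out_d2.shape == x.shape
```

With the same three-tap kernel, dilation 2 reaches across five input samples instead of three: the receptive field grows without adding parameters, which is exactly why the patent uses hole convolutions to gather more context.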
In one possible implementation, the non-local attention network further includes at least one hole residual unit, where each hole residual unit includes two hole residual subunits connected using the network structure of a residual learning network. Before the electronic device invokes the first processing unit and the second processing unit to process the first speech feature of each speech signal frame, the at least one hole residual unit is invoked to perform feature extraction on the first speech feature of each speech signal frame, obtaining a further-extracted first speech feature; the first and second processing units then process this further-extracted first speech feature. Invoking the first processing unit, which comprises a plurality of hole residual subunits, extracts the first speech feature further to obtain a deeper speech feature.
702. And the electronic equipment calls the second processing unit to fuse the first voice feature of each voice signal frame with the first voice features of other voice signal frames respectively to obtain the third voice feature of each voice signal frame.
Wherein the third speech feature of each speech signal frame is obtained by combining the first speech features of other speech signal frames.
In one possible implementation, referring to fig. 9, the second processing unit includes a residual non-local sub-unit, a convolution sub-unit, and a deconvolution sub-unit. The electronic equipment calls a residual non-local subunit, and carries out weighted fusion on the first voice feature of each voice signal frame and the first voice features of other voice signal frames respectively according to the weights corresponding to the voice signal frames to obtain the first voice feature after weighted fusion of each voice signal frame; calling a convolution subunit, and coding the first voice feature after weighted fusion of each voice signal frame to obtain the coding feature of each voice signal frame; and calling a deconvolution subunit, and decoding the coding characteristics of each voice signal frame to obtain third voice characteristics of each voice signal frame.
In one possible implementation manner, referring to fig. 10, the second processing unit further includes a plurality of feature reduction subunits, a plurality of first hole residual subunits, a plurality of second hole residual subunits, and an activation function subunit. The residual non-local subunit is connected to the first of the first hole residual subunits, the first hole residual subunits are connected in sequence, and the last of them is connected to the convolution subunit. The convolution subunit is connected to the first feature reduction subunit, the feature reduction subunits are connected in sequence, and the last of them is connected to the deconvolution subunit. The deconvolution subunit is connected to the first of the second hole residual subunits, the second hole residual subunits are connected in sequence, and the last of them is connected to the activation function subunit. Fig. 10 shows only two first hole residual subunits, two second hole residual subunits, and two feature reduction subunits as an example; each may be present in other numbers.
The activation function in the activation function subunit may be a Sigmoid function or another activation function. The first hole residual subunits and the second hole residual subunits may be the same or different, and each hole residual subunit includes a hole convolution layer, a batch normalization layer, and an activation function layer. Optionally, each feature reduction subunit is also a hole residual subunit.
In a possible implementation manner, the electronic device invokes the plurality of first hole residual subunits to process the weighted-fused first speech feature of each speech signal frame, obtaining a further-processed first speech feature; invokes the convolution subunit to encode the further-processed first speech feature of each speech signal frame, obtaining the coding feature of each speech signal frame; invokes the plurality of feature reduction subunits to perform feature reduction on the coding feature of each speech signal frame, obtaining a plurality of reduced coding features; invokes the deconvolution subunit to decode the reduced coding features, obtaining the decoded speech feature of each speech signal frame; and invokes the plurality of second hole residual subunits to process the decoded speech feature of each speech signal frame, obtaining the third speech feature of each speech signal frame. Reducing the coding features shrinks them, which lowers the amount of computation and speeds up their processing.
In one possible implementation manner, the residual non-local subunit includes a first fusion layer and a second fusion layer, the electronic device invokes the first fusion layer, and performs weighted fusion on the first speech feature of each speech signal frame and the first speech feature of other speech signal frames according to weights corresponding to the plurality of speech signal frames to obtain a fusion feature of each speech signal frame; and calling a second fusion layer, and respectively fusing the first voice feature and the fusion feature of each voice signal frame to obtain the first voice feature after weighted fusion of each voice signal frame.
In the embodiments of the present disclosure, the first fusion layer fuses the first speech features of different speech signal frames according to their corresponding weights, producing a more accurate fusion feature. When the first fusion layer and the second fusion layer are both present, the residual non-local subunit is a residual learning network: the fusion feature is fused with the input first speech feature, so the final weighted-fused first speech feature is more accurate and the fusion step does not lose important features. In addition, a residual learning network is easier to optimize, which improves the efficiency of model training.
In a possible implementation manner, referring to fig. 11 (which illustrates processing three speech signal frames), the residual non-local subunit further includes a plurality of convolution layers, a third fusion layer, and a normalization layer. The third fusion layer is connected to two of the convolution layers and fuses the first speech features processed by them. The third fusion layer is connected to the normalization layer, which normalizes the fused speech feature output by the third fusion layer. The normalization layer is connected to the first fusion layer, which fuses the first speech feature processed by another convolution layer with the normalized speech feature output by the normalization layer to obtain the fusion feature of each speech signal frame. The fusion feature, after being processed by one more convolution layer, is fused with the input first speech feature to obtain the weighted-fused first speech feature.
In a possible implementation manner, the first fusion layer and the third fusion layer fuse speech features by matrix multiplication, and the second fusion layer fuses speech features by matrix addition. Optionally, for each speech signal frame the first speech feature has the form T × K × C, representing the speech feature C corresponding to time T and frequency K; to multiply or add the speech features of different speech signal frames, the features must be reshaped accordingly.
For example, the residual non-local subunit processes the first speech feature of speech signal frame x_i with the following formula:

o_i = W_z y_i + x_i = W_z softmax((W_u x_i)^T W_v x_j)(W_g x_j) + x_i

where o_i denotes the weighted-fused first speech feature of speech signal frame x_i; W_z, W_u, W_v, and W_g are known model parameters; softmax denotes the normalization operation; x_j denotes a speech signal frame other than x_i; and y_i denotes the fusion feature of speech signal frame x_i.
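The formula can be sketched for a whole feature matrix in numpy. This is an illustrative sketch of the standard residual non-local operation under simplifying assumptions: the parameter matrices are square, and the softmax is taken over all frames (including x_i itself) rather than only the other frames.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 6, 4                                   # feature dim, number of frames
X = rng.standard_normal((d, n))               # columns are per-frame features x_i
Wu, Wv, Wg, Wz = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))

scores = (Wu @ X).T @ (Wv @ X)                # (W_u x_i)^T W_v x_j for all i, j
scores -= scores.max(axis=1, keepdims=True)   # numerical stability
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax over j

Y = (Wg @ X) @ attn.T                         # fusion features y_i (weighted sum over j)
O = Wz @ Y + X                                # residual connection back to the input
assert O.shape == X.shape
```

Column i of `Y` is the weighted combination of every frame's transformed feature, and the final `+ X` is the residual term that keeps the input first speech feature from being lost.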
703. And the electronic equipment calls the first fusion unit to respectively fuse the second voice feature and the third voice feature of each voice signal frame to obtain the non-local voice feature of each voice signal frame.
In a possible implementation manner, the first fusion unit is a multiplication unit; that is, the second speech feature and the third speech feature of each speech signal frame are multiplied to obtain the non-local speech feature of that frame.
704. And the electronic equipment calls the second fusion unit to fuse the non-local voice feature of each voice signal frame with the first voice feature to obtain the fused non-local voice feature of each voice signal frame.
In a possible implementation manner, the second fusion unit is an addition unit, that is, the electronic device adds the non-local speech feature of each speech signal frame and the first speech feature to obtain the fused non-local speech feature of each speech signal frame.
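The two fusion units can be sketched together in numpy. The trunk-branch and mask-branch computations below (a ReLU and a sigmoid) are hypothetical stand-ins; only the multiply-then-add fusion structure comes from the description above.

```python
import numpy as np

rng = np.random.default_rng(4)
first_feature = rng.standard_normal((4, 8))          # input to the attention block

trunk_feature = np.maximum(first_feature, 0.0)       # stand-in for the trunk branch
mask_feature = 1.0 / (1.0 + np.exp(-first_feature))  # stand-in for the mask branch

# First fusion unit: element-wise multiplication of the two branch outputs.
non_local_feature = trunk_feature * mask_feature
# Second fusion unit: residual addition of the block input.
fused = non_local_feature + first_feature
assert fused.shape == first_feature.shape
```

The multiplication lets the mask branch gate the trunk branch per feature, while the residual addition guarantees the input first speech feature survives the block unchanged in the worst case.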
In the embodiment shown in fig. 7, different processing units in the non-local attention network apply different kinds of processing to the first speech feature. The first processing unit, which comprises a plurality of hole residual subunits, extracts the first speech feature further to obtain a deeper speech feature. The second processing unit uses a non-local attention mechanism: when processing the first speech feature of each speech signal frame, it considers the other speech signal frames in the speech signal, that is, it combines context information to obtain a more accurate speech feature. The first fusion unit then fuses the speech features produced by the two processing units to obtain the non-local speech feature. In addition, the hole residual subunits enlarge the receptive field, capturing still more context information.
When the non-local attention network includes the second fusion unit, the network is a residual learning network: after the non-local speech feature is obtained, it is fused with the input first speech feature, so the final non-local speech feature is more accurate and important features are not lost. In addition, a residual learning network is easier to optimize, which improves the efficiency of model training.
In addition, in a possible implementation manner, referring to fig. 12, the non-local attention network further includes a plurality of hole residual units. The electronic device first invokes several hole residual units to process the input first speech feature of each speech signal frame, then feeds the processed first speech feature to the first processing unit and the second processing unit. Similarly, after the second fusion unit produces the non-local speech feature, several hole residual units are invoked to process it, and the processed non-local speech feature is input to the subsequent local attention network. Fig. 12 illustrates only four hole residual units.
504. The electronic equipment calls a local attention network to respectively process the non-local voice features of each voice signal frame to obtain the mixed voice features of each voice signal frame.
The mixed speech features contain no noise features, and the mixed speech feature of each speech signal frame is obtained after considering the speech features of the other speech signal frames, making it more accurate.
In the embodiment of the present disclosure, the network structure of the local attention network is similar to that of the non-local attention network, except that the local attention network does not include the residual non-local subunit, and the description of the network structure of the local attention network is omitted here.
It should be noted that the embodiments of the present disclosure are described with only one non-local attention network and one local attention network as an example. In another embodiment, the mixed speech feature may be input into a further non-local attention network or local attention network for additional processing, so as to obtain a more accurate mixed speech feature.
505. And the electronic equipment calls a characteristic reconstruction network to carry out characteristic reconstruction on the mixed voice characteristics of the voice signal frames to obtain denoising parameters.
The denoising parameters correspond to the original speech signal and represent the proportion of the signal other than the noise signal in each speech signal frame; they are subsequently used to denoise the original speech signal. Optionally, the denoising parameters are represented as a matrix in which each element, or each row or column of elements, represents the denoising parameter of one speech signal frame. The feature reconstruction network is a convolutional network or another type of network.
506. And the electronic equipment calls a voice denoising network, and denoises the original amplitudes of the voice signal frames according to the denoising parameters to obtain the target amplitudes of the voice signal frames.
In one possible implementation, the voice denoising network is a multiplication network, and the denoising parameter is multiplied by the original amplitudes to obtain target amplitudes of the voice signal frames, where the target amplitudes do not include the noise signal. Alternatively, if the denoising parameter is a matrix, each element in the matrix is multiplied by the original amplitude of the corresponding speech signal frame, or a column of elements or a row of elements in the matrix is multiplied by the original amplitude of the corresponding speech signal frame.
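The element-wise masking performed by such a multiplication network can be sketched as follows; the shapes and the `mask` values here are illustrative assumptions rather than an actual model output:

```python
import numpy as np

# Hypothetical shapes: T frames, F frequency bins per frame.
T, F = 4, 5
rng = np.random.default_rng(0)

original_magnitudes = rng.uniform(0.0, 2.0, size=(T, F))  # |STFT| of the noisy signal
mask = rng.uniform(0.0, 1.0, size=(T, F))                 # denoising parameters in [0, 1]

# The "multiplication network" reduces to an element-wise product:
# each mask entry scales the magnitude of its frame/bin toward the clean speech.
target_magnitudes = mask * original_magnitudes

assert target_magnitudes.shape == (T, F)
# With a mask in [0, 1], no magnitude can grow.
assert np.all(target_magnitudes <= original_magnitudes)
```

When the denoising parameter is a per-frame vector instead of a full matrix, broadcasting over the frequency axis achieves the per-row multiplication described above.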
507. The electronic equipment combines the original phases and the target amplitudes of the multiple voice signal frames to obtain a target voice signal.
In one possible implementation manner, the electronic device performs an inverse Fourier transform on the original phases and the target amplitudes of the plurality of voice signal frames to obtain the target voice signal, which is the voice signal from which the noise signal has been removed.
The method for denoising the original amplitude in the voice signal frame only needs to process the amplitude in the voice signal without processing the phase, thereby reducing the characteristics needing to be processed and improving the processing speed.
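The amplitude-plus-original-phase reconstruction can be sketched using SciPy's STFT/ISTFT as a stand-in for the framing and inverse Fourier transform described above; the identity mask is a placeholder for a real model's denoising parameters:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
noisy = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.default_rng(1).standard_normal(fs)

# Analysis: keep the original phase, process only the magnitude.
_, _, Z = stft(noisy, fs=fs, nperseg=512)
original_phase = np.angle(Z)
original_magnitude = np.abs(Z)

# Stand-in for the model output: an identity mask (a trained model predicts this).
mask = np.ones_like(original_magnitude)
target_magnitude = mask * original_magnitude

# Synthesis: recombine the denoised magnitude with the untouched phase.
_, reconstructed = istft(target_magnitude * np.exp(1j * original_phase), fs=fs, nperseg=512)

# With an identity mask, the round trip reproduces the input up to float error.
n = min(len(reconstructed), len(noisy))
assert np.allclose(reconstructed[:n], noisy[:n], atol=1e-8)
```

This illustrates why only the magnitudes need processing: the phase array is carried through unchanged from analysis to synthesis.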
The method provided by the embodiment of the disclosure calls the non-local attention network and the local attention network to process the first voice features of a plurality of voice signal frames in the original voice signal to obtain the denoising parameter. The denoising parameter represents the proportion of the signals other than the noise signal in each voice signal frame, so denoising the original voice signal with this parameter removes the noise in the original voice signal. Moreover, when the non-local attention network processes the first voice feature of each voice signal frame, the context information of the other voice signal frames is taken into account, so the obtained denoising parameter is more accurate and the denoising effect on the original voice signal is improved.
Because the noise signal in a voice signal frame exists in the original amplitude of that frame, feature extraction is performed on the original amplitudes of the voice signal frames, and the original amplitudes are denoised according to the obtained denoising parameter to obtain target amplitudes that no longer contain the noise signal. The target voice signal without the noise signal can then be recovered from the target amplitudes and the original phases, thereby denoising the original voice signal. This denoising method only needs to process the amplitudes of the voice signal, without processing the phases, which reduces the features that need to be processed.
In addition, before the speech processing model is called to process the original speech signal, it needs to be trained. The training process is as follows: acquiring a sample voice signal and a sample noise signal; mixing the sample voice signal and the sample noise signal to obtain a sample mixed signal; calling the voice processing model to process a plurality of sample voice signal frames in the sample mixed signal to obtain a prediction denoising parameter corresponding to the sample mixed signal; denoising the sample mixed signal according to the prediction denoising parameter to obtain a denoised prediction voice signal; and training the speech processing model based on the difference between the prediction voice signal and the sample voice signal. Here the sample speech signal is a clean speech signal containing no noise signal. In addition, the voice processing model adopts the network structure of a residual learning network, which improves the training speed of the model during the training process.
For example, sample voice signals of a plurality of users are obtained from a voice database, a plurality of sample noise signals are obtained from a noise database, the sample noise signals and the sample voice signals are mixed according to different signal-to-noise ratios to obtain a plurality of sample mixed signals, and a voice processing model is trained by using the sample mixed signals.
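Mixing a noise signal with a clean speech signal at a target signal-to-noise ratio reduces to solving for a single gain on the noise; a minimal sketch (the function name `mix_at_snr` is hypothetical) might look like this:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested signal-to-noise ratio."""
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Solve for the gain g such that speech_power / (g^2 * noise_power) = 10^(snr_db/10).
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)

mixture = mix_at_snr(speech, noise, snr_db=5.0)
achieved = 10 * np.log10(np.mean(speech ** 2) / np.mean((mixture - speech) ** 2))
assert abs(achieved - 5.0) < 1e-6
```

Repeating this over several SNR values per sample pair yields the "different signal-to-noise ratios" variety of sample mixed signals described above.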
In a possible implementation manner, sample amplitudes of a plurality of sample voice signal frames in a sample mixed signal are obtained, a voice processing model is called to process the sample amplitudes, and a prediction denoising parameter corresponding to the sample mixed signal is obtained; and denoising the sample amplitude according to the prediction denoising parameter to obtain the prediction amplitude of each voice signal frame, and training a voice processing model according to the difference between the prediction amplitude of each voice signal frame and the amplitudes of a plurality of voice signal frames in the sample voice signal.
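One training step of the amplitude-domain scheme above might be sketched in PyTorch as follows; the tiny linear-plus-sigmoid model is a stand-in for the disclosure's attention networks, and the additive magnitude mix is a deliberate simplification:

```python
import torch
import torch.nn as nn

T, F = 10, 257  # frames x frequency bins (hypothetical shapes)

# Stand-in for the speech processing model: any network mapping noisy
# magnitudes to a mask in [0, 1]; the patent's attention networks would go here.
model = nn.Sequential(nn.Linear(F, F), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

sample_magnitude = torch.rand(T, F)                    # |STFT| of the clean sample speech
noise_magnitude = 0.3 * torch.rand(T, F)
mixed_magnitude = sample_magnitude + noise_magnitude   # simplistic additive mix

predicted_mask = model(mixed_magnitude)                # predicted denoising parameters
predicted_magnitude = predicted_mask * mixed_magnitude

# Train on the difference between the predicted and the clean magnitudes.
loss = nn.functional.mse_loss(predicted_magnitude, sample_magnitude)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Iterating this step over many sample mixed signals realizes the training procedure described above.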
For example, when training a speech processing model, the convolution kernels, filters, and convolution parameters of the convolutional layers in the speech processing model are set as shown in table 1 below:
TABLE 1
Here, Conv represents a feature extraction network or a feature reconstruction network, RNAM represents a non-local attention network, RAM represents a local attention network, Res.Unit represents a dilated residual unit or dilated residual subunit, conv represents a convolution subunit, Deconv represents a deconvolution subunit, and NL Unit represents a residual non-local subunit.
In addition, in a possible implementation manner, Wiener Filtering, SEGAN (Speech Enhancement Generative Adversarial Network), WaveNet, MMSE-GAN, Deep Feature Loss (DFL), a hybrid model (MDPhD), and RSGAN-GP (Relativistic Standard GAN with Gradient Penalty) are adopted as reference methods, and these methods are compared with the method (RNANet) provided by the embodiment of the present disclosure.
The results of the comparison of the above referenced methods with the methods provided by the examples of the present disclosure are seen in table 2 below:
TABLE 2
Method | SSNR | PESQ | CSIG | CBAK | COVL |
Noisy | 1.68 | 1.97 | 3.35 | 2.44 | 2.63 |
Wiener | 5.07 | 2.22 | 3.23 | 2.68 | 2.67 |
SEGAN | 7.73 | 2.16 | 3.48 | 2.94 | 2.80 |
WaveNet | — | — | 3.62 | 3.23 | 2.98 |
DFL | — | — | 3.86 | 3.33 | 3.22 |
MMSE-GAN | — | 2.53 | 3.80 | 3.12 | 3.14 |
MDPhD | 10.22 | 2.70 | 3.85 | 3.39 | 3.27 |
RNANet | 10.16 | 2.71 | 3.98 | 3.42 | 3.35 |
Here, the larger the SSNR (Segmental Signal-to-Noise Ratio), the better the denoising effect; the larger the PESQ (Perceptual Evaluation of Speech Quality), the better the denoising effect; CSIG (an evaluation index) is the mean opinion score of signal distortion, and the larger the CSIG, the better the denoising effect; CBAK (an evaluation index) is the background noise prediction score, and the larger the CBAK, the better the denoising effect; COVL (an evaluation index) is a score of the overall quality of the speech signal, and the larger the COVL, the better the denoising effect.
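As one concrete instance of these metrics, segmental SNR can be computed by averaging per-frame SNRs; a minimal sketch follows (the frame length and the [-10, 35] dB clipping range are conventional choices, not values specified by this disclosure):

```python
import numpy as np

def segmental_snr(clean: np.ndarray, enhanced: np.ndarray, frame_len: int = 256) -> float:
    """Average per-frame SNR in dB, clipped to [-10, 35] as is conventional for SSNR."""
    n_frames = len(clean) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = enhanced[i * frame_len:(i + 1) * frame_len]
        noise_power = np.sum((s - e) ** 2) + 1e-12  # residual noise in this frame
        snr = 10 * np.log10(np.sum(s ** 2) / noise_power + 1e-12)
        snrs.append(np.clip(snr, -10.0, 35.0))
    return float(np.mean(snrs))

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 100 * np.arange(4096) / 8000)
assert segmental_snr(clean, clean) == 35.0            # identical signals hit the upper clip
assert segmental_snr(clean, clean + 0.5 * rng.standard_normal(4096)) < 35.0
```

The frame-wise averaging is what distinguishes SSNR from a plain global SNR: quiet frames and loud frames contribute equally.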
In another possible implementation, in order to show the improvement in the intelligibility of the speech signal, STOI (Short-Time Objective Intelligibility) is used to compare the method provided by the present disclosure with the reference methods, and the comparison results are shown in table 3:
TABLE 3
Evaluation method | Noisy | MMSE-GAN | RSGAN-GP | RNANet |
STOI | 0.921 | 0.930 | 0.942 | 0.946 |
Wherein, the larger STOI is, the better the denoising effect is.
As can be seen from the comparison results in tables 2 and 3, the denoising effect of the method provided by the embodiment of the disclosure is significantly higher than that of other methods.
Fig. 13 is a block diagram illustrating a speech signal processing apparatus according to an exemplary embodiment. Referring to fig. 13, the apparatus includes:
a feature determination unit 1301 configured to perform determining a first speech feature of a plurality of speech signal frames in an original speech signal;
a non-local feature obtaining unit 1302, configured to perform a non-local attention network invoking to fuse first voice features of a plurality of voice signal frames, so as to obtain a non-local voice feature of each voice signal frame;
a mixed feature obtaining unit 1303, configured to perform processing on the non-local speech features of each speech signal frame by invoking a local attention network, so as to obtain a mixed speech feature of each speech signal frame;
a denoising parameter acquiring unit 1304 configured to perform acquiring a denoising parameter based on a mixed speech feature of a plurality of speech signal frames;
the target signal obtaining unit 1305 is configured to perform denoising on the original voice signal according to the denoising parameter, so as to obtain a target voice signal.
The device provided by the embodiment of the disclosure calls the non-local attention network and the local attention network to process the first voice features of a plurality of voice signal frames in the original voice signal to obtain the denoising parameter. The denoising parameter represents the proportion of the signals other than the noise signal in each voice signal frame, so denoising the original voice signal with this parameter removes the noise in the original voice signal. Moreover, when the non-local attention network processes the first voice feature of each voice signal frame, the context information of the other voice signal frames is taken into account, so the obtained denoising parameter is more accurate and the denoising effect on the original voice signal is improved.
In one possible implementation manner, the feature determining unit 1301 is configured to perform feature extraction on the original amplitudes of the plurality of speech signal frames by invoking a feature extraction network, so as to obtain first speech features of the plurality of speech signal frames.
In another possible implementation, referring to fig. 14, the target signal acquiring unit 1305 includes:
an amplitude obtaining subunit 1315, configured to execute calling a voice denoising network, and denoise the original amplitudes of the multiple voice signal frames according to the denoising parameters, to obtain target amplitudes of the multiple voice signal frames;
a signal obtaining subunit 1325 configured to perform combining the original phases and the target amplitudes of the plurality of speech signal frames to obtain a target speech signal.
In another possible implementation manner, the denoising parameter obtaining unit 1304 is configured to perform feature reconstruction on a mixed speech feature of a plurality of speech signal frames by invoking a feature reconstruction network, so as to obtain a denoising parameter.
In another possible implementation, the non-local attention network includes a first processing unit, a second processing unit and a first fusion unit, referring to fig. 14, the non-local feature obtaining unit 1302 includes:
a feature extraction subunit 1312 configured to invoke the first processing unit to perform feature extraction on the first speech features of the plurality of speech signal frames respectively to obtain a second speech feature of each speech signal frame, where the first processing unit includes a plurality of dilated residual subunits;
a first fusion subunit 1322, configured to execute calling the second processing unit, and fuse the first speech feature of each speech signal frame with the first speech features of other speech signal frames, respectively, to obtain a third speech feature of each speech signal frame;
and a second fusion subunit 1332 configured to perform calling the first fusion unit, and respectively fuse the second speech feature and the third speech feature of each speech signal frame to obtain a non-local speech feature of each speech signal frame.
In another possible implementation manner, the non-local attention network further includes a second fusion unit, see fig. 14, and the non-local feature obtaining unit 1302 further includes:
and a third fusion subunit 1342, configured to perform invoking the second fusion unit to fuse the non-local speech feature of each speech signal frame with the first speech feature, so as to obtain a fused non-local speech feature of each speech signal frame.
In another possible implementation, the second processing unit includes a residual non-local subunit, a convolution subunit, and a deconvolution subunit; referring to fig. 14, a first fusion subunit 1322 is configured to perform:
calling a residual non-local subunit, and performing weighted fusion on the first voice feature of each voice signal frame and the first voice features of other voice signal frames according to weights corresponding to a plurality of voice signal frames to obtain the first voice feature after weighted fusion of each voice signal frame;
calling a convolution subunit, and coding the first voice feature after weighted fusion of each voice signal frame to obtain the coding feature of each voice signal frame;
and calling a deconvolution subunit, and decoding the coding characteristics of each voice signal frame to obtain third voice characteristics of each voice signal frame.
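The weighted fusion performed by the residual non-local subunit can be sketched as a similarity-weighted sum over all frames followed by a residual connection; the dot-product softmax weights below are an illustrative assumption, since the patent's subunit learns its own projections:

```python
import numpy as np

def nonlocal_fusion(features: np.ndarray) -> np.ndarray:
    """Fuse each frame's feature with every other frame's, weighted by similarity.

    Minimal sketch: weights are a softmax over dot-product similarities between
    frame features; the learned projections of a real non-local block are omitted.
    """
    scores = features @ features.T                       # (T, T) frame-pair similarities
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)        # each row sums to 1
    fused = weights @ features                           # weighted sum over all frames
    return features + fused                              # residual connection

frames = np.random.default_rng(0).standard_normal((6, 8))  # 6 frames, 8-dim features
out = nonlocal_fusion(frames)
assert out.shape == frames.shape
```

Because every frame attends to every other frame, this is exactly where the context information of the whole utterance enters each frame's feature.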
In another possible implementation, the second processing unit further includes a feature reduction subunit; referring to fig. 14, the first fusion subunit 1322 is configured to perform:
calling the feature reduction subunit, and performing feature reduction on the coding features of each voice signal frame to obtain a plurality of reduced coding features;
where calling the deconvolution subunit and decoding the coding features of each voice signal frame to obtain the third voice feature of each voice signal frame includes:
calling the deconvolution subunit, and decoding the plurality of reduced coding features to obtain the third voice feature of each voice signal frame.
In another possible implementation, the residual non-local sub-unit includes a first fusion layer and a second fusion layer, see fig. 14, the first fusion sub-unit 1322 is configured to perform:
calling a first fusion layer, and performing weighted fusion on the first voice feature of each voice signal frame and the first voice features of other voice signal frames according to the weights corresponding to the voice signal frames to obtain the fusion feature of each voice signal frame;
and calling a second fusion layer, and respectively fusing the first voice feature and the fusion feature of each voice signal frame to obtain the first voice feature after weighted fusion of each voice signal frame.
In another possible implementation, the speech processing model includes at least a non-local attention network and a local attention network, and the training process of the speech processing model is as follows:
acquiring a sample voice signal and a sample noise signal;
mixing the sample voice signal and the sample noise signal to obtain a sample mixed signal;
calling a voice processing model, and processing a plurality of sample voice signal frames in the sample mixed signal to obtain a prediction denoising parameter corresponding to the sample mixed signal;
denoising the sample mixed signal according to the prediction denoising parameter to obtain a denoised prediction voice signal;
a speech processing model is trained based on a difference between the predicted speech signal and the sample speech signal.
With regard to the apparatus in the above-described embodiment, the specific manner in which each unit performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
In an exemplary embodiment, an electronic device is provided that includes one or more processors, and volatile or non-volatile memory for storing instructions executable by the one or more processors; wherein the one or more processors are configured to perform the processing method of the voice signal in the above-described embodiment.
In one possible implementation, the electronic device is provided as a terminal. Fig. 15 is a block diagram illustrating a structure of a terminal 1500 according to an example embodiment. The terminal 1500 may be a portable mobile terminal such as: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III, motion video Experts compression standard Audio Layer 3), an MP4 player (Moving Picture Experts Group Audio Layer IV, motion video Experts compression standard Audio Layer 4), a notebook computer, or a desktop computer. Terminal 1500 may also be referred to as user equipment, a portable terminal, a laptop terminal, a desktop terminal, or other names.
The terminal 1500 includes: a processor 1501 and memory 1502.
The memory 1502 may include one or more computer-readable storage media, which may be non-transitory. The memory 1502 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 1502 is used to store at least one program code for execution by the processor 1501 to implement the method of processing a speech signal provided by the method embodiments of the present disclosure.
In some embodiments, the terminal 1500 may further include: a peripheral interface 1503 and at least one peripheral. The processor 1501, memory 1502, and peripheral interface 1503 may be connected by buses or signal lines. Various peripheral devices may be connected to peripheral interface 1503 via buses, signal lines, or circuit boards. Specifically, the peripheral device includes: at least one of a radio frequency circuit 1504, a display 1505, a camera assembly 1506, an audio circuit 1507, a positioning assembly 1508, and a power supply 1509.
The peripheral interface 1503 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 1501 and the memory 1502. In some embodiments, the processor 1501, memory 1502, and peripheral interface 1503 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1501, the memory 1502, and the peripheral interface 1503 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.
The Radio Frequency circuit 1504 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuitry 1504 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1504 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1504 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1504 can communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, intranets, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuitry 1504 may also include NFC (Near Field Communication) related circuitry, which is not limited by the present disclosure.
The display screen 1505 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1505 is a touch display screen, the display screen 1505 also has the ability to capture touch signals on or over the surface of the display screen 1505. The touch signal may be input to the processor 1501 as a control signal for processing. In this case, the display screen 1505 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, display 1505 may be one, provided on the front panel of terminal 1500; in other embodiments, display 1505 may be at least two, each disposed on a different surface of terminal 1500 or in a folded design; in other embodiments, display 1505 may be a flexible display disposed on a curved surface or a folded surface of terminal 1500. Even further, the display 1505 may be configured in a non-rectangular irregular pattern, i.e., a shaped screen. The Display 1505 can be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and other materials.
The camera assembly 1506 is used to capture images or video. Optionally, the camera assembly 1506 includes a front camera and a rear camera. The front camera is arranged on the front panel of the terminal, and the rear camera is arranged on the back of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 1506 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuitry 1507 may include a microphone and speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 1501 for processing or inputting the electric signals to the radio frequency circuit 1504 to realize voice communication. For stereo capture or noise reduction purposes, multiple microphones may be provided, each at a different location of the terminal 1500. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 1501 or the radio frequency circuit 1504 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuitry 1507 may also include a headphone jack.
The positioning component 1508 is used to locate the current geographic position of the terminal 1500 for navigation or LBS (Location Based Service). The Positioning component 1508 may be a Positioning component based on the united states GPS (Global Positioning System), the chinese beidou System, the russian glonass Positioning System, or the european union galileo Positioning System.
In some embodiments, the terminal 1500 also includes one or more sensors 1510. The one or more sensors 1510 include, but are not limited to: acceleration sensor 1511, gyro sensor 1512, pressure sensor 1513, fingerprint sensor 1514, optical sensor 1515, and proximity sensor 1516.
The acceleration sensor 1511 may detect the magnitude of acceleration on three coordinate axes of the coordinate system established with the terminal 1500. For example, the acceleration sensor 1511 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 1501 may control the display screen 1505 to display the user interface in a landscape view or a portrait view based on the gravitational acceleration signal collected by the acceleration sensor 1511. The acceleration sensor 1511 may also be used for acquisition of motion data of a game or a user.
The gyroscope sensor 1512 can detect the body direction and the rotation angle of the terminal 1500, and the gyroscope sensor 1512 and the acceleration sensor 1511 cooperate to collect the 3D motion of the user on the terminal 1500. The processor 1501 may implement the following functions according to the data collected by the gyro sensor 1512: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensor 1513 may be disposed on a side frame of terminal 1500 and/or underneath display 1505. When the pressure sensor 1513 is disposed on the side frame of the terminal 1500, the holding signal of the user to the terminal 1500 may be detected, and the processor 1501 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 1513. When the pressure sensor 1513 is disposed at a lower layer of the display screen 1505, the processor 1501 controls the operability control on the UI interface in accordance with the pressure operation of the user on the display screen 1505. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 1514 is configured to capture a fingerprint of the user, and the processor 1501 identifies the user based on the fingerprint captured by the fingerprint sensor 1514, or the fingerprint sensor 1514 identifies the user based on the captured fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 1501 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 1514 may be disposed on the front, back, or side of the terminal 1500. When a physical key or vendor Logo is provided on the terminal 1500, the fingerprint sensor 1514 may be integrated with the physical key or vendor Logo.
The optical sensor 1515 is used to collect ambient light intensity. In one embodiment, processor 1501 may control the brightness of display screen 1505 based on the intensity of ambient light collected by optical sensor 1515. Specifically, when the ambient light intensity is high, the display brightness of the display screen 1505 is increased; when the ambient light intensity is low, the display brightness of the display screen 1505 is adjusted down. In another embodiment, the processor 1501 may also dynamically adjust the shooting parameters of the camera assembly 1506 based on the ambient light intensity collected by the optical sensor 1515.
A proximity sensor 1516, also called a distance sensor, is provided on the front panel of the terminal 1500. The proximity sensor 1516 is used to collect the distance between the user and the front surface of the terminal 1500. In one embodiment, when the proximity sensor 1516 detects that the distance between the user and the front surface of the terminal 1500 gradually decreases, the processor 1501 controls the display 1505 to switch from the bright screen state to the dark screen state; when the proximity sensor 1516 detects that the distance between the user and the front surface of the terminal 1500 gradually becomes larger, the processor 1501 controls the display 1505 to switch from the breath screen state to the bright screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 15 does not constitute a limitation of terminal 1500, and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components may be employed.
In another possible implementation, the electronic device is provided as a server. Fig. 16 is a block diagram illustrating a server 1600, which may have a relatively large difference due to different configurations or performances according to an exemplary embodiment, and may include one or more processors (CPUs) 1601 and one or more memories 1602, where the memory 1602 stores at least one program code, and the at least one program code is loaded and executed by the processors 1601 to implement the methods provided by the above method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface, so as to perform input/output, and the server may also include other components for implementing the functions of the device, which are not described herein again.
In an exemplary embodiment, there is also provided a non-transitory computer readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the steps performed by a terminal or a server in the above-mentioned voice signal processing method. For example, the non-transitory computer readable storage medium may be a ROM (Read Only Memory), a RAM (Random Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided a computer program product, wherein instructions of the computer program product, when executed by a processor of an electronic device, enable the electronic device to perform the steps performed by the terminal or the server in the above-mentioned method for processing a voice signal.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
Claims (10)
1. A method for processing a speech signal, the method comprising:
determining a first speech feature of each of a plurality of speech signal frames in an original speech signal;
calling a non-local attention network to fuse the first speech features of the plurality of speech signal frames, to obtain a non-local speech feature of each speech signal frame;
calling a local attention network to separately process the non-local speech feature of each speech signal frame, to obtain a mixed speech feature of each speech signal frame;
acquiring a denoising parameter based on the mixed speech features of the plurality of speech signal frames; and
denoising the original speech signal according to the denoising parameter to obtain a target speech signal.
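The five claimed steps can be sketched as a toy pipeline. This is an illustrative approximation only, not the patented network: the function names (`non_local_attention`, `local_attention`, `denoise`) and the choices of dot-product attention, sigmoid gating, and a sigmoid mask as the "denoising parameter" are all assumptions for the sake of a runnable example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local_attention(feats):
    # Fuse every frame's feature with all other frames' features
    # (self-attention over the time axis), per the second claimed step.
    scores = feats @ feats.T / np.sqrt(feats.shape[1])  # (T, T) frame similarities
    return softmax(scores, axis=-1) @ feats             # (T, D) non-local features

def local_attention(feats):
    # Per-frame processing: each frame is reweighted independently
    # (a sigmoid self-gate stands in for the local attention network).
    return feats * (1.0 / (1.0 + np.exp(-feats)))

def denoise(magnitudes):
    # magnitudes: (T, D) spectral magnitudes of T speech signal frames.
    first = magnitudes                        # stand-in "first speech feature"
    non_local = non_local_attention(first)    # step 2: cross-frame fusion
    mixed = local_attention(non_local)        # step 3: per-frame processing
    mask = 1.0 / (1.0 + np.exp(-mixed))       # step 4: denoising parameter in (0, 1)
    return magnitudes * mask                  # step 5: apply it to the original signal

rng = np.random.default_rng(0)
mags = np.abs(rng.standard_normal((8, 4)))    # 8 frames, 4 frequency bins
out = denoise(mags)
```

Because the assumed mask lies in (0, 1), the output magnitudes can only be attenuated, which matches the mask-style denoising described for claim 3 below.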
2. The processing method according to claim 1, wherein determining the first speech feature of each of the plurality of speech signal frames in the original speech signal comprises:
calling a feature extraction network to separately perform feature extraction on original amplitudes of the plurality of speech signal frames, to obtain the first speech features of the plurality of speech signal frames.
3. The processing method according to claim 2, wherein denoising the original speech signal according to the denoising parameter to obtain the target speech signal comprises:
calling a speech denoising network to denoise the original amplitudes of the plurality of speech signal frames according to the denoising parameter, to obtain target amplitudes of the plurality of speech signal frames; and
combining original phases of the plurality of speech signal frames with the target amplitudes to obtain the target speech signal.
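One common reading of claim 3 is magnitude-mask denoising with phase reuse: scale each frame's spectral magnitude by the denoising parameter, then recombine with the frame's original phase. The sketch below assumes complex STFT-style frames and a precomputed mask; `reconstruct` is a hypothetical helper, not a name from the patent.

```python
import numpy as np

def reconstruct(orig_complex_frames, mask):
    """Denoise magnitudes with a mask and keep the original phases."""
    mags = np.abs(orig_complex_frames)        # original amplitudes
    phases = np.angle(orig_complex_frames)    # original phases
    target_mags = mags * mask                 # target amplitudes (claim 3, step 1)
    return target_mags * np.exp(1j * phases)  # target amplitude + original phase

rng = np.random.default_rng(1)
frames = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
mask = np.full((4, 3), 0.5)                   # assumed denoising parameter
out = reconstruct(frames, mask)
```

The phase of every bin is untouched; only the magnitude changes, so an inverse transform of `out` would reuse the noisy signal's phase, as the claim describes.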
4. The processing method according to claim 1, wherein acquiring the denoising parameter based on the mixed speech features of the plurality of speech signal frames comprises:
calling a feature reconstruction network to perform feature reconstruction on the mixed speech features of the plurality of speech signal frames, to obtain the denoising parameter.
5. The processing method according to claim 1, wherein the non-local attention network comprises a first processing unit, a second processing unit, and a first fusion unit, and calling the non-local attention network to fuse the first speech features of the plurality of speech signal frames to obtain the non-local speech feature of each speech signal frame comprises:
calling the first processing unit to separately perform feature extraction on the first speech features of the plurality of speech signal frames, to obtain a second speech feature of each speech signal frame, wherein the first processing unit comprises a plurality of dilated residual subunits;
calling the second processing unit to fuse the first speech feature of each speech signal frame with the first speech features of the other speech signal frames, to obtain a third speech feature of each speech signal frame; and
calling the first fusion unit to separately fuse the second speech feature and the third speech feature of each speech signal frame, to obtain the non-local speech feature of each speech signal frame.
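Claim 5's two-branch structure can be sketched as follows. The dilated-residual branch, the dot-product attention branch, and the use of elementwise addition as the "first fusion unit" are all assumptions for illustration; the patent does not specify these operations at this level of detail.

```python
import numpy as np

def dilated_branch(x, dilation=2):
    # Stand-in for the "dilated residual subunits": a dilated temporal
    # difference with a residual connection (illustrative only).
    shifted = np.roll(x, dilation, axis=0)    # look `dilation` frames back
    return x + 0.1 * (x - shifted)            # residual update per frame

def attention_branch(x):
    # Fuse each frame's feature with every other frame's feature.
    scores = x @ x.T / np.sqrt(x.shape[1])    # (T, T) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def non_local_block(first):
    second = dilated_branch(first)            # first processing unit (claim 5)
    third = attention_branch(first)           # second processing unit (claim 5)
    return second + third                     # first fusion unit (sum as fusion)

rng = np.random.default_rng(2)
first = rng.standard_normal((6, 5))           # 6 frames, 5-dim features
non_local = non_local_block(first)
```

Claim 6's second fusion unit can then be read as one more residual step over this output, e.g. `non_local + first`, again with addition standing in for an unspecified fusion operation.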
6. The processing method according to claim 5, wherein the non-local attention network further comprises a second fusion unit, and after calling the first fusion unit to separately fuse the second speech feature and the third speech feature of each speech signal frame to obtain the non-local speech feature of each speech signal frame, the processing method further comprises:
calling the second fusion unit to fuse the non-local speech feature of each speech signal frame with the first speech feature of that frame, to obtain a fused non-local speech feature of each speech signal frame.
7. An apparatus for processing a speech signal, the apparatus comprising:
a feature determination unit configured to determine a first speech feature of each of a plurality of speech signal frames in an original speech signal;
a non-local feature acquisition unit configured to call a non-local attention network to fuse the first speech features of the plurality of speech signal frames, to obtain a non-local speech feature of each speech signal frame;
a mixed feature acquisition unit configured to call a local attention network to separately process the non-local speech feature of each speech signal frame, to obtain a mixed speech feature of each speech signal frame;
a denoising parameter acquisition unit configured to acquire a denoising parameter based on the mixed speech features of the plurality of speech signal frames; and
a target signal acquisition unit configured to denoise the original speech signal according to the denoising parameter, to obtain a target speech signal.
8. An electronic device, comprising:
one or more processors; and
a memory for storing instructions executable by the one or more processors;
wherein the one or more processors are configured to perform the method for processing a speech signal according to any one of claims 1 to 6.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method for processing a speech signal according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method for processing a speech signal according to any one of claims 1 to 6.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110125640.5A CN112967730A (en) | 2021-01-29 | 2021-01-29 | Voice signal processing method and device, electronic equipment and storage medium |
PCT/CN2021/116212 WO2022160715A1 (en) | 2021-01-29 | 2021-09-02 | Voice signal processing method and electronic device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112967730A | 2021-06-15 |
Family
ID=76273584
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110125640.5A Pending CN112967730A (en) | 2021-01-29 | 2021-01-29 | Voice signal processing method and device, electronic equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112967730A (en) |
WO (1) | WO2022160715A1 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111028861A (en) * | 2019-12-10 | 2020-04-17 | 苏州思必驰信息科技有限公司 | Spectrum mask model training method, audio scene recognition method and system |
CN113343924A (en) * | 2021-07-01 | 2021-09-03 | 齐鲁工业大学 | Modulation signal identification method based on multi-scale cyclic spectrum feature and self-attention generation countermeasure network |
CN113674753A (en) * | 2021-08-11 | 2021-11-19 | 河南理工大学 | New speech enhancement method |
WO2022160715A1 (en) * | 2021-01-29 | 2022-08-04 | 北京达佳互联信息技术有限公司 | Voice signal processing method and electronic device |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108010514A (en) * | 2017-11-20 | 2018-05-08 | 四川大学 | A kind of method of speech classification based on deep neural network |
CN109919114A (en) * | 2019-03-14 | 2019-06-21 | 浙江大学 | One kind is based on the decoded video presentation method of complementary attention mechanism cyclic convolution |
CN110148091A (en) * | 2019-04-10 | 2019-08-20 | 深圳市未来媒体技术研究院 | Neural network model and image super-resolution method based on non local attention mechanism |
CN110298413A (en) * | 2019-07-08 | 2019-10-01 | 北京字节跳动网络技术有限公司 | Image characteristic extracting method, device, storage medium and electronic equipment |
CN110415702A (en) * | 2019-07-04 | 2019-11-05 | 北京搜狗科技发展有限公司 | Training method and device, conversion method and device |
WO2020020375A1 (en) * | 2018-07-27 | 2020-01-30 | 北京三快在线科技有限公司 | Voice processing method and apparatus, electronic device, and readable storage medium |
CN110739002A (en) * | 2019-10-16 | 2020-01-31 | 中山大学 | Complex domain speech enhancement method, system and medium based on generation countermeasure network |
CN110992974A (en) * | 2019-11-25 | 2020-04-10 | 百度在线网络技术(北京)有限公司 | Speech recognition method, apparatus, device and computer readable storage medium |
US20200160124A1 (en) * | 2017-07-19 | 2020-05-21 | Microsoft Technology Licensing, Llc | Fine-grained image recognition |
CN111341331A (en) * | 2020-02-25 | 2020-06-26 | 厦门亿联网络技术股份有限公司 | Voice enhancement method, device and medium based on local attention mechanism |
WO2020204655A1 (en) * | 2019-04-05 | 2020-10-08 | Samsung Electronics Co., Ltd. | System and method for context-enriched attentive memory network with global and local encoding for dialogue breakdown detection |
KR20200119410A (en) * | 2019-03-28 | 2020-10-20 | 한국과학기술원 | System and Method for Recognizing Emotions from Korean Dialogues based on Global and Local Contextual Information |
CN112257758A (en) * | 2020-09-27 | 2021-01-22 | 浙江大华技术股份有限公司 | Fine-grained image recognition method, convolutional neural network and training method thereof |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012096072A1 (en) * | 2011-01-13 | 2012-07-19 | 日本電気株式会社 | Audio-processing device, control method therefor, recording medium containing control program for said audio-processing device, vehicle provided with said audio-processing device, information-processing device, and information-processing system |
WO2014070139A2 (en) * | 2012-10-30 | 2014-05-08 | Nuance Communications, Inc. | Speech enhancement |
CN106486131B (en) * | 2016-10-14 | 2019-10-11 | 上海谦问万答吧云计算科技有限公司 | A kind of method and device of speech de-noising |
CN112071307A (en) * | 2020-09-15 | 2020-12-11 | 江苏慧明智能科技有限公司 | Intelligent incomplete voice recognition method for elderly people |
CN112967730A (en) * | 2021-01-29 | 2021-06-15 | 北京达佳互联信息技术有限公司 | Voice signal processing method and device, electronic equipment and storage medium |
- 2021-01-29: CN application CN202110125640.5A filed (CN112967730A) — legal status: Pending
- 2021-09-02: PCT application PCT/CN2021/116212 filed (WO2022160715A1) — legal status: unknown
Non-Patent Citations (2)
Title |
---|
QIANLI MA ET AL.: "Global-Local Mutual Attention Model for Text Classification", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 12, 18 September 2019 (2019-09-18) *
ZHAO Dongyang: "Research on Image Retrieval Methods Based on Saliency Detection" (in Chinese), China Master's Theses Full-text Database, Information Science and Technology, no. 01, 15 January 2021 (2021-01-15) *
Also Published As
Publication number | Publication date |
---|---|
WO2022160715A1 (en) | 2022-08-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||