CN112967730B - Voice signal processing method and device, electronic equipment and storage medium - Google Patents

Voice signal processing method and device, electronic equipment and storage medium

Info

Publication number
CN112967730B
CN112967730B (grant) · CN112967730A (application publication) · Application CN202110125640.5A
Authority
CN
China
Prior art keywords
voice
voice signal
feature
speech
signal frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110125640.5A
Other languages
Chinese (zh)
Other versions
CN112967730A (en)
Inventor
邓峰
王晓瑞
王仲远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110125640.5A priority Critical patent/CN112967730B/en
Publication of CN112967730A publication Critical patent/CN112967730A/en
Priority to PCT/CN2021/116212 priority patent/WO2022160715A1/en
Application granted granted Critical
Publication of CN112967730B publication Critical patent/CN112967730B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0264 - Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The disclosure relates to a voice signal processing method and device, an electronic device, and a storage medium, and belongs to the technical field of voice processing. The method comprises: determining first speech features of a plurality of speech signal frames in an original speech signal; invoking a non-local attention network to fuse the first speech features of the plurality of speech signal frames to obtain a non-local speech feature of each speech signal frame; invoking a local attention network to respectively process the non-local speech feature of each speech signal frame to obtain a mixed speech feature of each speech signal frame; acquiring denoising parameters based on the mixed speech features of the plurality of speech signal frames; and denoising the original speech signal according to the denoising parameters to obtain a target speech signal. Because the method takes the context of each speech signal frame into account, the obtained denoising parameters are more accurate and the denoising effect on the original speech signal is improved.

Description

Voice signal processing method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of voice processing, and in particular relates to a method and a device for processing a voice signal, electronic equipment and a storage medium.
Background
Collected speech signals typically contain noise, and the presence of noise adversely affects subsequent processing of the speech signal, so noise removal plays a critical role in speech signal processing.
In the related art, spectral subtraction is used to denoise a speech signal: a silent segment of the speech signal is located, a noise estimate is extracted from that segment, and the noise estimate is subtracted from the speech signal to remove the noise. However, spectral subtraction struggles to remove noise when the noise in the speech signal changes over time, and its denoising effect is poor.
Disclosure of Invention
The disclosure provides a voice signal processing method and device, an electronic device, and a storage medium, which improve the denoising effect on voice signals.
According to an aspect of the embodiments of the present disclosure, there is provided a method for processing a voice signal, the method including:
determining a first speech feature of a plurality of speech signal frames in the original speech signal;
invoking a non-local attention network to fuse the first voice characteristics of the voice signal frames to obtain the non-local voice characteristics of each voice signal frame;
Invoking a local attention network to respectively process the non-local voice characteristics of each voice signal frame to obtain the mixed voice characteristics of each voice signal frame;
acquiring denoising parameters based on the mixed speech features of the plurality of speech signal frames;
and denoising the original voice signal according to the denoising parameters to obtain a target voice signal.
According to the method provided by the embodiments of the disclosure, the non-local attention network and the local attention network are invoked to process the first voice features of a plurality of voice signal frames in the original voice signal, yielding denoising parameters that represent, for each voice signal frame, the proportion of the signal other than the noise signal. Denoising the original voice signal with these parameters therefore removes the noise in the original voice signal. Moreover, because the non-local attention network considers the context of each voice signal frame when processing its first voice feature, the denoising parameters are more accurate and the denoising effect on the original voice signal is improved.
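Purely as an illustration of this pipeline (the patent publishes no code), the following PyTorch sketch wires the described components together. The module names, layer sizes, and the Sigmoid output are assumptions, and the two attention networks are stand-in placeholders that are described in more detail later in this document.

```python
import torch
import torch.nn as nn

class SpeechDenoiser(nn.Module):
    """Hypothetical end-to-end wiring of the described pipeline."""
    def __init__(self, channels: int = 32):
        super().__init__()
        # feature extraction network: convolution + batch normalization + activation
        self.feature_extraction = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )
        # stand-in placeholders for the non-local and local attention networks
        self.non_local_attention = nn.Identity()
        self.local_attention = nn.Identity()
        # feature reconstruction network: maps mixed features to denoising
        # parameters in [0, 1] (the Sigmoid is an assumption)
        self.feature_reconstruction = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, magnitude: torch.Tensor) -> torch.Tensor:
        # magnitude: (batch, 1, time_frames, freq_bins) original amplitudes
        first_features = self.feature_extraction(magnitude)
        non_local_features = self.non_local_attention(first_features)
        mixed_features = self.local_attention(non_local_features)
        denoising_parameters = self.feature_reconstruction(mixed_features)
        # voice denoising: scale the original amplitudes; the phase is handled separately
        return magnitude * denoising_parameters
```

Denoising parameters in [0, 1] match the description above of the parameters as the proportion of the non-noise signal in each frame.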
In one possible implementation, the determining the first speech feature of the plurality of speech signal frames in the original speech signal includes:
invoking a feature extraction network to respectively perform feature extraction on the original amplitudes of the plurality of voice signal frames to obtain the first voice features of the plurality of voice signal frames.
In the embodiments of the disclosure, the noise signal in a voice signal frame is contained in the original amplitude of that frame, so feature extraction is performed on the original amplitudes of the voice signal frames and the original phases of the original voice signal do not need to be processed, which reduces the amount of processing.
In another possible implementation manner, denoising the original speech signal according to the denoising parameter to obtain a target speech signal, including:
calling a voice denoising network, and denoising the original amplitudes of the voice signal frames according to the denoising parameters to obtain target amplitudes of the voice signal frames;
and combining the original phases of the voice signal frames and the target amplitude to obtain the target voice signal.
In the embodiments of the disclosure, the original amplitudes are denoised according to the obtained denoising parameters to obtain target amplitudes that no longer contain the noise signal, which denoises the original amplitudes of the original voice signal; the target voice signal, free of the noise signal, can then be recovered from the target amplitudes and the original phases, thereby denoising the original voice signal. This denoising approach only needs to process the amplitude of the voice signal, not the phase, which reduces the number of features to be processed and increases the processing speed.
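A minimal sketch of this amplitude-only denoising is given below, assuming a short-time Fourier transform front end (the detailed description later mentions the short-time Fourier transform as one option); only the magnitude is scaled by the denoising parameters while the original phase is kept untouched.

```python
import torch

def denoise_with_mask(waveform: torch.Tensor, mask: torch.Tensor,
                      n_fft: int = 512, hop: int = 128) -> torch.Tensor:
    """Apply denoising parameters (mask) to the magnitude spectrogram and
    rebuild the waveform with the untouched original phase."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(waveform, n_fft, hop_length=hop, window=window,
                      return_complex=True)            # complex spectrogram
    magnitude = spec.abs()                            # original amplitudes
    phase = torch.angle(spec)                         # original phases (not processed)
    target_magnitude = magnitude * mask               # mask: same shape as magnitude
    denoised_spec = torch.polar(target_magnitude, phase)
    return torch.istft(denoised_spec, n_fft, hop_length=hop, window=window,
                       length=waveform.shape[-1])     # target speech signal
```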
In another possible implementation manner, the obtaining the denoising parameter based on the mixed speech feature of the plurality of speech signal frames includes:
and calling a feature reconstruction network to reconstruct the characteristics of the mixed voice features of the voice signal frames to obtain the denoising parameters.
In the embodiments of the disclosure, the denoising parameters obtained by invoking the feature reconstruction network represent, for each voice signal frame, the proportion of the signal other than the noise signal; these denoising parameters are then used to denoise the original voice signal.
In another possible implementation manner, the non-local attention network includes a first processing unit, a second processing unit, and a first fusion unit, and the invoking the non-local attention network to fuse the first speech features of the plurality of speech signal frames to obtain non-local speech features of each speech signal frame includes:
invoking the first processing unit to respectively perform feature extraction on the first voice features of the plurality of voice signal frames to obtain the second voice feature of each voice signal frame, wherein the first processing unit comprises a plurality of dilated (hole) residual subunits;
invoking the second processing unit to fuse the first voice feature of each voice signal frame with the first voice features of the other voice signal frames respectively to obtain the third voice feature of each voice signal frame;
and invoking the first fusion unit to respectively fuse the second voice feature and the third voice feature of each voice signal frame to obtain the non-local voice feature of each voice signal frame.
In the embodiments of the disclosure, different processing units in the non-local attention network process the first voice features in different ways. The first processing unit, which comprises a plurality of dilated residual subunits, further extracts the first voice features to obtain deeper voice features. The second processing unit adopts a non-local attention mechanism: when processing the first voice feature of each voice signal frame, it also considers the other voice signal frames in the signal, that is, it combines context information, so a more accurate voice feature is obtained. The first fusion unit then fuses the voice features produced by the two processing units to obtain the non-local voice feature. In addition, the dilated residual subunits enlarge the receptive field and thereby capture even more context information.
In another possible implementation manner, the non-local attention network further includes a second fusing unit, the invoking the first fusing unit fuses the second voice feature and the third voice feature of each voice signal frame, respectively, so as to obtain the non-local voice feature of each voice signal frame, and after that, the processing method further includes:
and invoking the second fusion unit to fuse the non-local voice characteristics of each voice signal frame with the first voice characteristics to obtain the fused non-local voice characteristics of each voice signal frame.
In the embodiments of the disclosure, the non-local attention network adopts a residual learning structure: after the non-local voice feature is obtained, it is fused with the input first voice feature, so that the finally obtained non-local voice feature is more accurate, important features are not lost, and the accuracy of the non-local voice feature is improved. In addition, a residual learning network is easier to optimize, which improves the training efficiency of the model during training.
In another possible implementation, the second processing unit includes a residual non-local subunit, a convolution subunit, and a deconvolution subunit; the invoking the second processing unit to fuse the first voice feature of each voice signal frame with the first voice features of other voice signal frames respectively to obtain a third voice feature of each voice signal frame, including:
invoking the residual non-local subunit, and respectively carrying out weighted fusion on the first voice characteristic of each voice signal frame and the first voice characteristic of other voice signal frames according to the weights corresponding to the voice signal frames to obtain the first voice characteristic after weighted fusion of each voice signal frame;
Invoking the convolution subunit to encode the first voice characteristic after weighting and fusing each voice signal frame to obtain the encoding characteristic of each voice signal frame;
And calling the deconvolution subunit to decode the coding feature of each voice signal frame to obtain a third voice feature of each voice signal frame.
In the embodiment of the disclosure, when the residual non-local subunit processes the first speech feature of each speech signal frame, the speech signal frames except the speech signal frame in the speech signal are considered, that is, the context information is combined, so that more accurate speech features are obtained.
In another possible implementation manner, the second processing unit further includes a feature reduction subunit; after the convolution subunit is invoked to encode the weighted and fused first speech feature of each speech signal frame and the encoded feature of each speech signal frame is obtained, the processing method further includes:
invoking the feature reduction subunit to perform feature reduction on the encoded features of each voice signal frame to obtain a plurality of reduced encoded features.
Accordingly, invoking the deconvolution subunit to decode the encoded feature of each voice signal frame to obtain the third voice feature of each voice signal frame comprises:
invoking the deconvolution subunit to decode the plurality of reduced encoded features to obtain the third voice feature of each voice signal frame.
In the embodiments of the disclosure, reducing the encoded features decreases the amount of computation and increases the speed at which the encoded features are processed.
In another possible implementation manner, the residual non-local subunit includes a first fusion layer and a second fusion layer, and the invoking the residual non-local subunit performs weighted fusion on the first speech feature of each speech signal frame and the first speech feature of the other speech signal frames according to weights corresponding to the plurality of speech signal frames to obtain the weighted and fused first speech feature of each speech signal frame, where the method includes:
Invoking the first fusion layer, and respectively carrying out weighted fusion on the first voice characteristic of each voice signal frame and the first voice characteristic of other voice signal frames according to the weights corresponding to the voice signal frames to obtain fusion characteristics of each voice signal frame;
And invoking the second fusion layer to fuse the first voice characteristics and the fusion characteristics of each voice signal frame respectively to obtain the first voice characteristics after weighting and fusing of each voice signal frame.
In the embodiments of the disclosure, the first fusion layer is invoked to fuse the first voice features of different voice signal frames together according to their corresponding weights, yielding more accurate fusion features. When the first fusion layer and the second fusion layer are both included, the residual non-local subunit is a residual learning network: the fusion feature is fused with the input first voice feature, so that the finally obtained weighted and fused first voice feature is more accurate, important features are not lost from the fusion feature, and the accuracy of the weighted and fused first voice feature is improved. In addition, a residual learning network is easier to optimize, which improves the training efficiency of the model during training.
In another possible implementation, the speech processing model includes at least the non-local attention network and the local attention network, and the training process of the speech processing model is as follows:
Acquiring a sample voice signal and a sample noise signal;
mixing the sample voice signal with the sample noise signal to obtain a sample mixed signal;
invoking the voice processing model to process a plurality of sample voice signal frames in the sample mixed signal to obtain a predicted denoising parameter corresponding to the sample mixed signal;
denoising the sample mixed signal according to the predicted denoising parameter to obtain a denoised predicted voice signal;
and training the speech processing model based on the difference between the predicted speech signal and the sample speech signal.
In the embodiments of the disclosure, the sample voice signal and the sample noise signal are mixed to obtain a sample mixed signal, which is used to train the voice processing model. Because the voice processing model uses the network structure of a residual learning network, the training speed of the model is improved during the training process.
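A hedged sketch of one such training step follows. Mixing in the magnitude domain and the mean-squared-error loss are simplifying assumptions made only for illustration; the patent states only that the model is trained on the difference between the predicted speech signal and the sample speech signal.

```python
import torch
import torch.nn as nn

def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               sample_speech_mag: torch.Tensor,
               sample_noise_mag: torch.Tensor) -> float:
    """One illustrative training step for the speech processing model."""
    # mix clean speech with noise to form the sample mixed signal
    # (done here in the magnitude domain purely for brevity)
    sample_mixed_mag = sample_speech_mag + sample_noise_mag
    # the model predicts denoising parameters and applies them internally,
    # returning the denoised predicted magnitudes
    predicted_speech_mag = model(sample_mixed_mag)
    # train on the difference between the predicted and the sample speech signal
    loss = nn.functional.mse_loss(predicted_speech_mag, sample_speech_mag)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```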
According to still another aspect of the embodiments of the present disclosure, there is provided a processing apparatus for a voice signal, the apparatus including:
A feature determination unit configured to perform determination of a first speech feature of a plurality of speech signal frames in an original speech signal;
a non-local feature acquisition unit configured to perform fusion of first voice features of the plurality of voice signal frames by invoking a non-local attention network to obtain non-local voice features of each voice signal frame;
The mixed characteristic acquisition unit is configured to execute the call of the local attention network to process the non-local voice characteristics of each voice signal frame respectively to obtain the mixed voice characteristics of each voice signal frame;
a denoising parameter acquisition unit configured to perform acquisition of denoising parameters based on mixed speech features of the plurality of speech signal frames;
And the target signal acquisition unit is configured to perform denoising on the original voice signal according to the denoising parameters to obtain a target voice signal.
In a possible implementation manner, the feature determining unit is configured to perform feature extraction on original amplitudes of the plurality of voice signal frames by calling a feature extraction network, so as to obtain first voice features of the plurality of voice signal frames.
In another possible implementation manner, the target signal acquiring unit includes:
The amplitude acquisition subunit is configured to execute calling of a voice denoising network, and denoise the original amplitudes of the voice signal frames according to the denoising parameters to obtain target amplitudes of the voice signal frames;
And a signal acquisition subunit configured to perform combination of the original phases of the plurality of speech signal frames and a target amplitude to obtain the target speech signal.
In another possible implementation manner, the denoising parameter obtaining unit is configured to perform feature reconstruction on the mixed speech features of the plurality of speech signal frames by calling a feature reconstruction network, so as to obtain the denoising parameter.
In another possible implementation manner, the non-local attention network includes a first processing unit, a second processing unit, and a first fusion unit, and the non-local feature acquisition unit includes:
The feature extraction subunit is configured to invoke the first processing unit to respectively perform feature extraction on the first voice features of the plurality of voice signal frames to obtain the second voice feature of each voice signal frame, wherein the first processing unit comprises a plurality of dilated residual subunits;
The first fusion subunit is configured to execute the calling of the second processing unit, and respectively fuse the first voice characteristic of each voice signal frame with the first voice characteristics of other voice signal frames to obtain the third voice characteristic of each voice signal frame;
and the second fusion subunit is configured to execute the calling of the first fusion unit, and respectively fuse the second voice characteristic and the third voice characteristic of each voice signal frame to obtain the non-local voice characteristic of each voice signal frame.
In another possible implementation manner, the non-local attention network further includes a second fusion unit, and the non-local feature acquisition unit further includes:
And the third fusion subunit is configured to execute the second fusion unit to fuse the non-local voice characteristic and the first voice characteristic of each voice signal frame so as to obtain the fused non-local voice characteristic of each voice signal frame.
In another possible implementation, the second processing unit includes a residual non-local subunit, a convolution subunit, and a deconvolution subunit; the first fusion subunit is configured to perform:
invoking the residual non-local subunit, and respectively carrying out weighted fusion on the first voice characteristic of each voice signal frame and the first voice characteristic of other voice signal frames according to the weights corresponding to the voice signal frames to obtain the first voice characteristic after weighted fusion of each voice signal frame;
Invoking the convolution subunit to encode the first voice characteristic after weighting and fusing each voice signal frame to obtain the encoding characteristic of each voice signal frame;
And calling the deconvolution subunit to decode the coding feature of each voice signal frame to obtain a third voice feature of each voice signal frame.
In another possible implementation manner, the second processing unit further includes a feature reduction subunit, and the first fusion subunit is configured to perform:
invoking the feature reduction subunit to perform feature reduction on the encoded features of each voice signal frame to obtain a plurality of reduced encoded features;
and invoking the deconvolution subunit to decode the plurality of reduced encoded features to obtain the third voice feature of each voice signal frame.
In another possible implementation, the residual non-local subunit includes a first fusion layer and a second fusion layer, the first fusion subunit configured to perform:
Invoking the first fusion layer, and respectively carrying out weighted fusion on the first voice characteristic of each voice signal frame and the first voice characteristic of other voice signal frames according to the weights corresponding to the voice signal frames to obtain fusion characteristics of each voice signal frame;
And invoking the second fusion layer to fuse the first voice characteristics and the fusion characteristics of each voice signal frame respectively to obtain the first voice characteristics after weighting and fusing of each voice signal frame.
In another possible implementation, the speech processing model includes at least the non-local attention network and the local attention network, and the training process of the speech processing model is as follows:
Acquiring a sample voice signal and a sample noise signal;
mixing the sample voice signal with the sample noise signal to obtain a sample mixed signal;
invoking the voice processing model to process a plurality of sample voice signal frames in the sample mixed signal to obtain a predicted denoising parameter corresponding to the sample mixed signal;
denoising the sample mixed signal according to the predicted denoising parameter to obtain a denoised predicted voice signal;
and training the speech processing model based on the difference between the predicted speech signal and the sample speech signal.
According to still another aspect of the embodiments of the present disclosure, there is provided an electronic device including:
one or more processors;
A memory for storing the one or more processor-executable instructions;
wherein the one or more processors are configured to perform the method of processing speech signals of the above aspects.
According to still another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the method for processing a speech signal according to the above aspect.
According to a further aspect of the disclosed embodiments, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of processing a speech signal according to the above aspect.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a schematic diagram of a speech processing model, according to an example embodiment.
FIG. 2 is a schematic diagram of another speech processing model, shown according to an exemplary embodiment.
FIG. 3 is a schematic diagram of another speech processing model, shown according to an exemplary embodiment.
Fig. 4 is a flowchart illustrating a method of processing a speech signal according to an exemplary embodiment.
Fig. 5 is a flowchart illustrating another method of processing a speech signal according to an exemplary embodiment.
Fig. 6 is a schematic diagram of a non-local attention network, according to an example embodiment.
Fig. 7 is a flowchart illustrating a non-local speech feature acquisition method according to an exemplary embodiment.
Fig. 8 is a schematic diagram of a first processing unit, according to an example embodiment.
Fig. 9 is a schematic diagram of a second processing unit, according to an example embodiment.
Fig. 10 is a schematic diagram of another second processing unit shown according to an exemplary embodiment.
FIG. 11 is a schematic diagram illustrating a residual non-local subunit, according to an example embodiment.
Fig. 12 is a schematic diagram of another non-local attention network, shown according to an exemplary embodiment.
Fig. 13 is a block diagram illustrating a processing apparatus for a voice signal according to an exemplary embodiment.
Fig. 14 is a block diagram of another speech signal processing apparatus according to an exemplary embodiment.
Fig. 15 is a block diagram illustrating a structure of a terminal according to an exemplary embodiment.
Fig. 16 is a block diagram illustrating a structure of a server according to an exemplary embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description of the present disclosure and the claims and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The user information (including but not limited to user equipment information, user personal information, etc.) related to the present disclosure is information authorized by the user or sufficiently authorized by each party.
The processing method of the voice signal provided by the embodiment of the disclosure can be applied to various scenes.
For example, in live scenes.
During a live broadcast, the anchor's voice signal collected by the anchor terminal may contain noise. If the audience terminal plays this voice signal directly, the noise may make the speech unclear and degrade the viewing experience, so the method provided by the embodiments of the disclosure can be used to denoise the voice signal before it is played.
As another example, in an automatic speech recognition scenario.
During speech recognition, noise in the voice signal interferes with recognition, lowering the recognition accuracy and making it difficult to recognize the content of the voice signal accurately. In this case, the method provided by the embodiments of the disclosure can be used to first denoise the voice signal and then recognize the denoised signal, thereby improving the recognition accuracy.
The method provided by the embodiment of the disclosure can also be applied to scenes such as video playing, language identification, voice synthesis, identity identification and the like.
FIG. 1 is a schematic illustration of a speech processing model provided in accordance with an exemplary embodiment, the speech processing model comprising: a non-local attention network 101 and a local attention network 102, the non-local attention network 101 and the local attention network 102 being connected. The non-local attention network 101 is configured to process a first voice feature of an input original voice signal to obtain a non-local voice feature of the original voice signal, and the local attention network 102 is configured to further process the non-local voice feature of the original voice signal to obtain a mixed voice feature of the original voice signal.
In one possible implementation, referring to fig. 2, the speech processing model further includes: the feature extraction network 103, the feature reconstruction network 104 and the voice denoising network 105, wherein the feature extraction network 103 is connected with the non-local attention network 101, the feature reconstruction network 104 is connected with the local attention network 102, and the voice denoising network 105 is connected with the feature reconstruction network 104. The feature extraction network 103 is configured to extract a first voice feature of the original voice signal, the feature reconstruction network 104 is configured to perform feature reconstruction on the processed mixed voice feature of the original voice signal, so as to obtain a denoising parameter of the original voice signal, and the voice denoising network 105 is configured to denoise the original voice signal.
In one possible implementation, the speech processing model includes a plurality of non-local attention networks 101 and a plurality of local attention networks 102, where the plurality of non-local attention networks 101 and the plurality of local attention networks 102 can be connected in sequence in any order. For example, referring to fig. 3, the speech processing model includes two non-local attention networks 101 and two local attention networks 102, the feature extraction network 103 is connected to a first non-local attention network 101, the first non-local attention network 101 is connected to the first local attention network 102, the first local attention network 102 is connected to a second local attention network 102, the second local attention network 102 is connected to the second non-local attention network 101, and the second non-local attention network 101 is connected to the feature reconstruction network 104.
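For illustration only, the ordering in FIG. 3 could be expressed as the following sketch, in which nn.Identity placeholders stand in for the attention networks (their internals are sketched further below); the final Sigmoid, producing denoising parameters in [0, 1], is an assumption, and the voice denoising network that applies the parameters is not shown.

```python
import torch.nn as nn

def build_fig3_backbone(channels: int = 32) -> nn.Sequential:
    # Placeholder modules (nn.Identity) stand in for the attention networks;
    # only the connection order described above is illustrated here.
    return nn.Sequential(
        nn.Conv2d(1, channels, kernel_size=3, padding=1),  # feature extraction network 103
        nn.Identity(),   # first non-local attention network 101
        nn.Identity(),   # first local attention network 102
        nn.Identity(),   # second local attention network 102
        nn.Identity(),   # second non-local attention network 101
        nn.Conv2d(channels, 1, kernel_size=3, padding=1),  # feature reconstruction network 104
        nn.Sigmoid(),    # denoising parameters in [0, 1] (assumed)
    )
```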
The voice signal processing method provided by the embodiments of the disclosure can be applied to an electronic device, where the electronic device is a terminal or a server. The terminal is any portable, pocket-sized, or hand-held terminal, such as a mobile phone, a computer, or a tablet computer. The server is a single server, a server cluster formed by a plurality of servers, or a cloud computing service center.
Fig. 4 is a flowchart illustrating a method for processing a voice signal according to an exemplary embodiment, and referring to fig. 4, the method is applied to an electronic device, and includes the following steps:
401. a first speech feature of a plurality of speech signal frames in the original speech signal is determined.
402. And calling a non-local attention network to fuse the first voice characteristics of the voice signal frames to obtain the non-local voice characteristics of each voice signal frame.
403. And calling the local attention network to respectively process the non-local voice characteristics of each voice signal frame to obtain the mixed voice characteristics of each voice signal frame.
404. Denoising parameters are obtained based on mixed speech features of a plurality of speech signal frames.
405. And denoising the original voice signal according to the denoising parameters to obtain a target voice signal.
According to the method provided by the embodiments of the disclosure, the non-local attention network and the local attention network are invoked to process the first voice features of a plurality of voice signal frames in the original voice signal, yielding denoising parameters that represent, for each voice signal frame, the proportion of the signal other than the noise signal. Denoising the original voice signal with these parameters therefore removes the noise in the original voice signal. Moreover, because the non-local attention network considers the context of each voice signal frame when processing its first voice feature, the denoising parameters are more accurate and the denoising effect on the original voice signal is improved.
In one possible implementation, determining a first speech feature of a plurality of speech signal frames in an original speech signal includes:
And calling a feature extraction network to respectively perform feature extraction on the original amplitudes of the plurality of voice signal frames to obtain first voice features of the plurality of voice signal frames.
In another possible implementation manner, denoising the original speech signal according to a denoising parameter to obtain a target speech signal, including:
calling a voice denoising network, and denoising original amplitudes of a plurality of voice signal frames according to denoising parameters to obtain target amplitudes of the plurality of voice signal frames;
and combining the original phases of the voice signal frames and the target amplitude to obtain a target voice signal.
In another possible implementation, obtaining the denoising parameter based on a mixed speech feature of a plurality of speech signal frames includes:
And calling a feature reconstruction network to reconstruct the characteristics of the mixed voice features of the plurality of voice signal frames to obtain denoising parameters.
In another possible implementation manner, the non-local attention network includes a first processing unit, a second processing unit and a first fusion unit, and the invoking the non-local attention network to fuse the first voice features of the plurality of voice signal frames to obtain non-local voice features of each voice signal frame includes:
invoking the first processing unit to respectively perform feature extraction on the first voice features of the plurality of voice signal frames to obtain the second voice feature of each voice signal frame, wherein the first processing unit comprises a plurality of dilated residual subunits;
Invoking a second processing unit to fuse the first voice characteristic of each voice signal frame with the first voice characteristics of other voice signal frames respectively to obtain a third voice characteristic of each voice signal frame;
And calling a first fusion unit to fuse the second voice characteristic and the third voice characteristic of each voice signal frame respectively to obtain the non-local voice characteristic of each voice signal frame.
In another possible implementation manner, the non-local attention network further includes a second fusion unit, invokes the first fusion unit to fuse the second voice feature and the third voice feature of each voice signal frame, so as to obtain the non-local voice feature of each voice signal frame, and after that, the processing method further includes:
And invoking a second fusion unit to fuse the non-local voice characteristics of each voice signal frame with the first voice characteristics to obtain the fused non-local voice characteristics of each voice signal frame.
In another possible implementation, the second processing unit includes a residual non-local subunit, a convolution subunit, and a deconvolution subunit; invoking a second processing unit to fuse the first voice feature of each voice signal frame with the first voice features of other voice signal frames respectively to obtain a third voice feature of each voice signal frame, including:
Invoking a residual non-local subunit, and respectively carrying out weighted fusion on the first voice characteristic of each voice signal frame and the first voice characteristics of other voice signal frames according to the weights corresponding to a plurality of voice signal frames to obtain the first voice characteristic after weighted fusion of each voice signal frame;
Invoking a convolution subunit to encode the first voice characteristic after weighting and fusing each voice signal frame to obtain the encoding characteristic of each voice signal frame;
And calling a deconvolution subunit to decode the coding characteristic of each voice signal frame to obtain a third voice characteristic of each voice signal frame.
In another possible implementation manner, the second processing unit further includes a feature reduction subunit; after the convolution subunit is invoked to encode the weighted and fused first speech feature of each speech signal frame and the encoded feature of each speech signal frame is obtained, the processing method further includes:
invoking the feature reduction subunit to perform feature reduction on the encoded features of each voice signal frame to obtain a plurality of reduced encoded features.
Accordingly, invoking the deconvolution subunit to decode the encoded feature of each speech signal frame to obtain the third speech feature of each speech signal frame comprises:
invoking the deconvolution subunit to decode the plurality of reduced encoded features to obtain the third voice feature of each voice signal frame.
In another possible implementation manner, the residual non-local subunit includes a first fusion layer and a second fusion layer, calls the residual non-local subunit, and respectively performs weighted fusion on the first voice feature of each voice signal frame and the first voice features of other voice signal frames according to weights corresponding to a plurality of voice signal frames to obtain the weighted fused first voice feature of each voice signal frame, where the steps include:
Invoking a first fusion layer, and respectively carrying out weighted fusion on the first voice characteristic of each voice signal frame and the first voice characteristic of other voice signal frames according to the weights corresponding to the voice signal frames to obtain fusion characteristics of each voice signal frame;
and calling a second fusion layer to fuse the first voice characteristics and the fusion characteristics of each voice signal frame respectively, so as to obtain the first voice characteristics of each voice signal frame after weighted fusion.
In another possible implementation, the speech processing model includes at least a non-local attention network and a local attention network, and the training process of the speech processing model is as follows:
Acquiring a sample voice signal and a sample noise signal;
Mixing a sample voice signal with a sample noise signal to obtain a sample mixed signal;
invoking the voice processing model to process a plurality of sample voice signal frames in the sample mixed signal to obtain a predicted denoising parameter corresponding to the sample mixed signal;
denoising the sample mixed signal according to the predicted denoising parameter to obtain a denoised predicted voice signal;
a speech processing model is trained based on differences between the predicted speech signal and the sample speech signal.
Fig. 5 is a flowchart illustrating another method for processing a voice signal according to an exemplary embodiment, and referring to fig. 5, the method is applied to an electronic device, and includes the following steps:
501. the electronic device obtains an original amplitude and an original phase of a plurality of speech signal frames in an original speech signal.
Because a voice signal comprises amplitude and phase, and the noise signal in the voice signal is contained in the amplitude, in the embodiments of the disclosure the original amplitude and original phase of each voice signal frame in the original voice signal are obtained and only the original amplitudes are denoised; this denoises the original voice signal without processing the original phases, which reduces the amount of processing. The original voice signal is a noise-containing voice signal that is either collected by the electronic device or sent to it by another electronic device; the noise signal is, for example, environmental noise or white noise.
The electronic device performs a Fourier transform on each voice signal frame to obtain the original amplitude and original phase of that frame, and then processes the original amplitudes to denoise them. The Fourier transform may be, for example, a fast Fourier transform or a short-time Fourier transform.
In one possible implementation, the speech processing model can only process a limited signal length at a time, for example one minute or two minutes of speech per pass. Therefore, the signal length of the original speech signal cannot exceed the reference signal length, that is, the duration of the original speech signal cannot exceed the reference duration. For example, 64 speech signal frames are processed at a time.
502. The electronic equipment invokes a feature extraction network to respectively perform feature extraction on the original amplitudes of the plurality of voice signal frames to obtain first voice features of the plurality of voice signal frames.
The first voice characteristic of the voice signal frame is used for describing the corresponding voice signal frame, and the first voice characteristic is expressed in a vector, a matrix or other forms. The first speech features of the plurality of speech signal frames may be represented separately or may be combined together, e.g. the first speech features of each speech signal frame are vectors, and the plurality of vectors are combined together to form a matrix, each column in the matrix representing the first speech features of one speech signal frame.
In one possible implementation, the feature extraction network includes a convolution layer, a batch normalization layer, and an activation function layer.
503. The electronic equipment invokes the non-local attention network to fuse the first voice characteristics of the voice signal frames to obtain the non-local voice characteristics of each voice signal frame.
Wherein the non-local speech feature of each speech signal frame is derived in combination with the first speech feature of the plurality of speech signal frames, i.e. taking into account the features of the speech signal frames preceding and following the speech signal frame.
In the embodiments of the disclosure, the non-local attention network processes the first voice features with an attention mechanism and residual learning. When processing the first voice feature of each voice signal frame it can take the context of that frame into account, so the resulting non-local voice feature is more accurate. Because some voice features are inevitably lost while the first voice feature is processed, residual learning is used: after the first voice feature has been processed, the non-local voice feature is obtained by combining the result with the input first voice feature, which prevents important voice features from being lost.
In one possible implementation, referring to fig. 6, the non-local attention network includes a first processing unit, a second processing unit, a first fusion unit, and a second fusion unit, where the first processing unit is a Trunk Branch (Trunk Branch), and the second processing unit is a Mask Branch (Mask Branch). The first processing unit and the second processing unit respectively process first voice signals of a plurality of input voice signal frames, the first fusion unit fuses the characteristics obtained after the processing of the first processing unit and the second processing unit, and the second fusion unit fuses the characteristics obtained by the fusion of the first fusion unit and the characteristics input in a non-local attention network.
The process by which the electronic device invokes the non-local attention network to process the first speech feature of each speech signal frame is seen in fig. 7, the process comprising the steps of:
701. The electronic equipment calls the first processing unit to respectively conduct feature extraction on first voice features of a plurality of voice signal frames to obtain second voice features of each voice signal frame.
The second voice feature is obtained by further extracting the first voice feature, and the second voice feature contains fewer noise features compared with the first voice feature.
In one possible implementation, referring to fig. 8, the first processing unit includes a plurality of dilated residual subunits (Res. Unit); fig. 8 shows two of them only as an example. Each dilated residual subunit includes a dilated convolution layer, a batch normalization layer, and an activation function layer, and the subunits are connected using the network structure of a residual learning network. The dilated convolution layer enlarges the receptive field and captures more context information.
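A minimal sketch of such a dilated residual subunit is shown below; the kernel size and dilation value are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

class DilatedResidualSubunit(nn.Module):
    """Dilated (hole) convolution + batch normalization + activation,
    wrapped in a residual connection; dilation enlarges the receptive field."""
    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)  # residual learning: output fused with the input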
In one possible implementation, the non-local attention network further comprises at least one dilated residual unit, each dilated residual unit comprising two dilated residual subunits connected using the network structure of a residual learning network. Before the electronic device invokes the first processing unit and the second processing unit to process the first voice feature of each voice signal frame, it invokes the at least one dilated residual unit to perform feature extraction on the first voice feature of each voice signal frame, obtaining a further extracted first voice feature for each frame; the first processing unit and the second processing unit then process these further extracted first voice features. Invoking the first processing unit, which comprises a plurality of dilated residual subunits, extracts the first voice features further and yields deeper voice features.
702. The electronic equipment calls the second processing unit to fuse the first voice characteristic of each voice signal frame with the first voice characteristic of other voice signal frames respectively to obtain a third voice characteristic of each voice signal frame.
Wherein the third speech feature of each speech signal frame is derived in combination with the first speech features of the other speech signal frames.
In one possible implementation, referring to fig. 9, the second processing unit includes a residual non-local subunit, a convolution subunit, and a deconvolution subunit. The electronic equipment calls a residual non-local subunit, and according to weights corresponding to a plurality of voice signal frames, the first voice characteristics of each voice signal frame are respectively subjected to weighted fusion with the first voice characteristics of other voice signal frames to obtain the first voice characteristics after weighted fusion of each voice signal frame; invoking a convolution subunit to encode the first voice characteristic after weighting and fusing each voice signal frame to obtain the encoding characteristic of each voice signal frame; and calling a deconvolution subunit to decode the coding characteristic of each voice signal frame to obtain a third voice characteristic of each voice signal frame.
In one possible implementation, referring to fig. 10, the second processing unit further includes a plurality of feature reduction subunits, a plurality of first dilated residual subunits, a plurality of second dilated residual subunits, and an activation function subunit. The residual non-local subunit is connected to the first of the first dilated residual subunits, the first dilated residual subunits are connected in sequence, and the last of them is connected to the convolution subunit. The convolution subunit is connected to the first feature reduction subunit, the feature reduction subunits are connected in sequence, and the last of them is connected to the deconvolution subunit. The deconvolution subunit is connected to the first of the second dilated residual subunits, the second dilated residual subunits are connected in sequence, and the last of them is connected to the activation function subunit. Fig. 10 shows two first dilated residual subunits, two second dilated residual subunits, and two feature reduction subunits only as an example; other numbers are possible.
The activation function in the activation function subunit may be a Sigmoid function or another activation function. The first dilated residual subunits and the second dilated residual subunits may be the same or different, and each dilated residual subunit includes a dilated convolution layer, a batch normalization layer, and an activation function layer. Optionally, the feature reduction subunits are also dilated residual subunits.
In one possible implementation, the electronic device invokes the plurality of first dilated residual subunits to process the weighted and fused first voice feature of each voice signal frame, obtaining a further processed first voice feature for each frame; invokes the convolution subunit to encode the further processed first voice feature of each voice signal frame, obtaining the encoded feature of each frame; invokes the plurality of feature reduction subunits to perform feature reduction on the encoded features, obtaining a plurality of reduced encoded features; invokes the deconvolution subunit to decode the reduced encoded features, obtaining the decoded voice feature of each voice signal frame; and invokes the plurality of second dilated residual subunits to process the decoded voice feature of each voice signal frame, obtaining the third voice feature of each frame. Reducing the encoded features decreases the amount of computation and increases the processing speed.
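The following sketch strings these pieces together for the second processing unit, reusing the DilatedResidualSubunit class from the sketch above. The strides, kernel sizes, and channel-reduction factor are assumptions, and the residual non-local subunit is left as a placeholder (its computation is sketched after the formula below).

```python
import torch
import torch.nn as nn

class MaskBranch(nn.Module):
    """Sketch of the second processing unit: residual non-local subunit,
    first dilated residual subunits, an encoding convolution, feature
    reduction, a decoding deconvolution, second dilated residual subunits,
    and a Sigmoid activation."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.residual_non_local = nn.Identity()   # placeholder, see formula sketch below
        self.first_dilated = nn.Sequential(DilatedResidualSubunit(channels),
                                           DilatedResidualSubunit(channels))
        self.encode = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.reduce = nn.Conv2d(channels, channels // 2, 1)   # feature reduction
        self.decode = nn.ConvTranspose2d(channels // 2, channels, 4,
                                         stride=2, padding=1)
        self.second_dilated = nn.Sequential(DilatedResidualSubunit(channels),
                                            DilatedResidualSubunit(channels))
        self.activation = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.residual_non_local(x)
        x = self.first_dilated(x)
        x = self.encode(x)      # encoded features
        x = self.reduce(x)      # reduced encoded features (less computation)
        x = self.decode(x)      # decoded voice features
        x = self.second_dilated(x)
        return self.activation(x)   # third voice features
```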
In one possible implementation manner, the residual non-local subunit includes a first fusion layer and a second fusion layer, the electronic device invokes the first fusion layer, and according to weights corresponding to a plurality of voice signal frames, the first voice feature of each voice signal frame is respectively subjected to weighted fusion with the first voice features of other voice signal frames to obtain a fusion feature of each voice signal frame; and calling a second fusion layer to fuse the first voice characteristics and the fusion characteristics of each voice signal frame respectively, so as to obtain the first voice characteristics of each voice signal frame after weighted fusion.
In the embodiment of the disclosure, the first fusion layer is invoked to fuse the first voice features of different voice signal frames together according to the corresponding weights, so that more accurate fusion features are obtained, and under the condition of comprising the first fusion layer and the second fusion layer, the residual non-local subunit is a residual learning network, and the fusion features are fused with the input first voice features, so that the finally obtained weighted and fused first voice features are more accurate, the fusion features are prevented from losing some important features, and the accuracy of the weighted and fused first voice features is improved. In addition, the residual learning network is easier to optimize, and the training efficiency of the model can be improved in the training process.
In one possible implementation, referring to fig. 11 (which takes the processing of three speech signal frames as an example), the residual non-local subunit further includes a plurality of convolution layers, a third fusion layer, and a normalization layer. The third fusion layer is connected to two of the convolution layers and fuses the first speech features processed by those two convolution layers. The third fusion layer is connected to the normalization layer, which normalizes the fused speech feature output by the third fusion layer. The normalization layer is connected to the first fusion layer, which fuses the first speech feature processed by another convolution layer with the normalized speech feature output by the normalization layer to obtain the fusion feature of each speech signal frame. The fusion feature is then processed by one more convolution layer and fused with the input first speech feature to obtain the weighted and fused first speech feature.
In one possible implementation, the first fusion layer and the third fusion layer fuse the speech features by matrix multiplication, and the second fusion layer fuses the speech features by matrix addition. Optionally, for each speech signal frame, the first speech feature has dimensions T×K×C, that is, a C-dimensional feature for each time T and frequency K; in order to multiply or add the speech features of different speech signal frames, the features first need to be reshaped accordingly.
For example, the residual non-local subunit processes the first speech feature of speech signal frame x_i using the following formula:

o_i = W_z·y_i + x_i = W_z·softmax((W_u·x_i)^T·(W_v·x_j))·(W_g·x_j) + x_i

where o_i denotes the weighted and fused first speech feature of speech signal frame x_i, W_z, W_u, W_v and W_g are known model parameters, softmax denotes normalization, x_j denotes the speech signal frames other than x_i, and y_i denotes the fusion feature of speech signal frame x_i.
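Read as a non-local attention block, the formula can be realized with 1×1 convolutions standing in for W_u, W_v, W_g and W_z. The PyTorch sketch below is only an illustration under that assumption; the channel sizes and the flattening over time–frequency positions are choices made here, not specified by the patent.

```python
# Illustrative sketch of the residual non-local fusion o_i = W_z*y_i + x_i described
# above, with 1x1 convolutions standing in for W_u, W_v, W_g and W_z. Channel sizes
# and the flattening over time-frequency positions are assumptions, not patent values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualNonLocalSubunit(nn.Module):
    def __init__(self, channels: int = 64, inner: int = 32):
        super().__init__()
        self.w_u = nn.Conv2d(channels, inner, kernel_size=1)
        self.w_v = nn.Conv2d(channels, inner, kernel_size=1)
        self.w_g = nn.Conv2d(channels, inner, kernel_size=1)
        self.w_z = nn.Conv2d(inner, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, C, T, K) -- a C-dimensional feature per time frame T and frequency bin K.
        b, c, t, k = x.shape
        u = self.w_u(x).flatten(2)                       # (b, inner, T*K)
        v = self.w_v(x).flatten(2)                       # (b, inner, T*K)
        g = self.w_g(x).flatten(2)                       # (b, inner, T*K)
        # Third fusion layer + normalization layer: pairwise similarities, softmax-normalized.
        attn = F.softmax(u.transpose(1, 2) @ v, dim=-1)  # (b, T*K, T*K)
        # First fusion layer: weighted fusion with the features at all other positions.
        y = (attn @ g.transpose(1, 2)).transpose(1, 2)   # (b, inner, T*K)
        y = self.w_z(y.reshape(b, -1, t, k))
        # Second fusion layer: residual addition with the input first speech feature.
        return y + x
```

The softmax over all time–frequency positions is what lets every frame attend to every other frame, i.e. the context information emphasized in this disclosure.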
703. The electronic equipment invokes a first fusion unit to respectively fuse the second voice characteristic and the third voice characteristic of each voice signal frame to obtain the non-local voice characteristic of each voice signal frame.
In one possible implementation, the first fusion unit is a multiplication unit, that is, multiplies the second speech feature and the third speech feature of each speech signal frame respectively to obtain the fused non-local speech feature.
704. The electronic equipment invokes a second fusion unit to fuse the non-local voice characteristics of each voice signal frame with the first voice characteristics to obtain the fused non-local voice characteristics of each voice signal frame.
In one possible implementation manner, the second fusion unit is an addition unit, that is, the electronic device adds the non-local voice feature of each voice signal frame to the first voice feature to obtain the non-local voice feature after fusion of each voice signal frame.
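Taken together, steps 703 and 704 amount to an element-wise multiplication followed by a residual addition. A few illustrative lines (the names are placeholders, not identifiers from the patent):

```python
# Illustrative combination of the two branch outputs; function and argument names
# are placeholders rather than identifiers from the patent.
def non_local_attention_forward(first_feature, first_branch, second_branch):
    second_feature = first_branch(first_feature)   # output of the first processing unit
    third_feature = second_branch(first_feature)   # output of the second processing unit
    non_local = second_feature * third_feature     # first fusion unit: multiplication (step 703)
    return non_local + first_feature               # second fusion unit: addition (step 704)
```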
In the embodiment shown in fig. 7, different processing units in the non-local attention network are invoked to process the first speech feature in different respects. The first processing unit, which includes a plurality of hole residual subunits, further extracts the first speech feature to obtain a deeper speech feature. The second processing unit adopts a non-local attention mechanism: when processing the first speech feature of each speech signal frame, it considers the other speech signal frames in the speech signal, that is, it combines context information to obtain a more accurate speech feature. The first fusion unit is then invoked to fuse the speech features obtained by the two processing units, yielding the non-local speech feature. In addition, the hole residual subunits expand the receptive field and can therefore capture even more context information.
When the non-local attention network comprises a second fusion unit, the non-local attention network is a residual learning network, and after the non-local voice characteristics are obtained, the non-local voice characteristics are fused with the input first voice characteristics, so that the finally obtained non-local voice characteristics are more accurate, important characteristics of the non-local voice characteristics are prevented from being lost, and the accuracy of the non-local voice characteristics is improved. In addition, the residual learning network is easier to optimize, and the training efficiency of the model can be improved in the training process.
In addition, in one possible implementation, referring to fig. 12, the non-local attention network further includes a plurality of hole residual units. The electronic device invokes the plurality of hole residual units to process the first speech feature of each input speech signal frame and inputs the processed first speech feature into the first processing unit and the second processing unit; similarly, after obtaining the non-local speech feature through the second fusion unit, it invokes the plurality of hole residual units to process the non-local speech feature and inputs the processed non-local speech feature into the subsequent local attention network. Fig. 12 illustrates only four hole residual units as an example.
504. The electronic equipment invokes the local attention network to process the non-local voice characteristics of each voice signal frame respectively to obtain the mixed voice characteristics of each voice signal frame.
The mixed voice characteristics of each voice signal frame are obtained by considering the voice characteristics of other voice signal frames, and are more accurate.
In the embodiment of the present disclosure, the local attention network is similar to the network structure of the non-local attention network, except that the local attention network does not include a residual non-local subunit, and the network structure of the local attention network is not described herein.
It should be noted that the embodiments of the present disclosure are described with only one non-local attention network and one local attention network as an example. In another embodiment, a plurality of non-local attention networks and a plurality of local attention networks are included; that is, after the mixed speech feature is obtained, it can be input into a further non-local attention network or local attention network for additional processing, so as to obtain a more accurate mixed speech feature.
505. And the electronic equipment invokes a feature reconstruction network to reconstruct the characteristics of the mixed voice features of the plurality of voice signal frames, so as to obtain denoising parameters.
The denoising parameters correspond to the original speech signal and represent the proportion of the signal other than noise in each speech signal frame; the original speech signal can be denoised with these parameters. Optionally, the denoising parameters are represented in matrix form, where each element of the matrix represents the denoising parameter of one speech signal frame, or one column or one row of the matrix represents the denoising parameters of one speech signal frame. The feature reconstruction network is a convolutional network or another type of network.
506. The electronic equipment calls a voice denoising network, and denoises the original amplitudes of the voice signal frames according to denoising parameters to obtain target amplitudes of the voice signal frames.
In one possible implementation, the voice denoising network is a multiplication network, and multiplies the denoising parameters by a plurality of original amplitudes to obtain target amplitudes of a plurality of voice signal frames, where the target amplitudes do not include noise signals. Alternatively, if the denoising parameter is a matrix, each element in the matrix is multiplied by the original amplitude of the corresponding speech signal frame, or a column or row of elements in the matrix is multiplied by the original amplitude of the corresponding speech signal frame, respectively.
507. The electronic device combines the original phases of the plurality of speech signal frames with the target amplitude to obtain a target speech signal.
In one possible implementation, the electronic device performs inverse fourier transform on the original phases and the target amplitudes of the plurality of speech signal frames to obtain a target speech signal, where the target speech signal is a speech signal from which the noise signal is removed.
This way of denoising the original amplitude of the speech signal frames only requires processing the amplitude of the speech signal rather than the phase, which reduces the features to be processed and improves the processing speed.
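A minimal numpy/scipy sketch of this amplitude-only path is given below: the STFT magnitude is multiplied by the predicted denoising parameters while the original phase is kept, and the waveform is recovered by an inverse STFT. The predict_mask function is a hypothetical stand-in for the trained model, and the frame parameters are assumptions.

```python
# Amplitude-only denoising sketch: mask the STFT magnitude, keep the original phase,
# and reconstruct with the inverse STFT. `predict_mask` is a hypothetical stand-in
# for the trained speech processing model; the frame parameters are assumptions.
import numpy as np
from scipy.signal import stft, istft

def denoise(noisy: np.ndarray, fs: int, predict_mask) -> np.ndarray:
    _, _, spec = stft(noisy, fs=fs, nperseg=512, noverlap=384)
    magnitude = np.abs(spec)           # original amplitudes of the speech signal frames
    phase = np.angle(spec)             # original phases are left untouched
    mask = predict_mask(magnitude)     # denoising parameters, same shape, values in [0, 1]
    target_magnitude = mask * magnitude
    _, clean = istft(target_magnitude * np.exp(1j * phase),
                     fs=fs, nperseg=512, noverlap=384)
    return clean
```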
According to the method provided by the embodiments of the disclosure, the non-local attention network and the local attention network are invoked to process the first speech features of a plurality of speech signal frames in an original speech signal and obtain denoising parameters. These denoising parameters represent the proportion of the signal other than noise in each speech signal frame, so denoising the original speech signal with them removes the noise from the original speech signal. Moreover, when the non-local attention network is invoked to process the first speech feature of each speech signal frame, the context information of the speech signal frame is taken into account, so the denoising parameters are more accurate and the denoising effect on the original speech signal is improved.
Furthermore, because the noise in a speech signal frame resides in its original amplitude, features are extracted from the original amplitudes of the speech signal frames, and the original amplitudes are denoised according to the obtained denoising parameters to produce target amplitudes that no longer contain the noise signal. The target speech signal, which does not contain the noise signal, can then be recovered from the target amplitudes and the original phases, thereby denoising the original speech signal. This denoising approach only needs to process the amplitude of the speech signal, not the phase, which reduces the features to be processed.
In addition, before invoking the speech processing model to process the original speech signal, the speech processing model needs to be trained, and the training process is as follows: acquiring a sample voice signal and a sample noise signal; mixing a sample voice signal with a sample noise signal to obtain a sample mixed signal; invoking a voice processing model, and processing a plurality of sample voice signal frames in the sample mixed signal to obtain a predicted denoising parameter corresponding to the sample mixed signal; denoising the original voice signal according to the predicted denoising parameters to obtain a denoised predicted voice signal; a speech processing model is trained based on differences between the predicted speech signal and the sample speech signal. Wherein the sample speech signal is a clean speech signal that does not contain a noise signal. In addition, because the voice processing model adopts a network structure of a residual learning network, the training speed of the model is improved in the training process.
For example, sample voice signals of a plurality of users are obtained from a voice database, then a plurality of sample noise signals are obtained from a noise database, the plurality of sample noise signals are respectively mixed with the sample voice signals according to different signal to noise ratios, a plurality of sample mixed signals are obtained, and the plurality of sample mixed signals are adopted to train a voice processing model.
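Mixing at different signal-to-noise ratios can be done by scaling the noise before adding it to the clean sample. The following small function is an illustrative sketch of that standard scaling rule; it is not code from the patent.

```python
# Illustrative mixing of a clean sample speech signal with a sample noise signal
# at a target signal-to-noise ratio in dB (the function itself is not from the patent).
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = noise[: len(speech)]                  # assumes the noise clip is long enough
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```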
In one possible implementation manner, sample amplitudes of a plurality of sample voice signal frames in a sample mixed signal are obtained, and a voice processing model is called to process the plurality of sample amplitudes so as to obtain predicted denoising parameters corresponding to the sample mixed signal; denoising the sample amplitude according to the predicted denoising parameters to obtain the predicted amplitude of each voice signal frame, and training a voice processing model according to the difference between the predicted amplitude of each voice signal frame and the amplitudes of a plurality of voice signal frames in the sample voice signal.
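Put together, one training step on sample amplitudes might look like the hedged PyTorch sketch below. The L1 loss and the optimizer are assumptions; the disclosure only states that the model is trained on the difference between the predicted speech signal and the clean sample speech signal.

```python
# One illustrative training step on sample amplitudes. The L1 loss and the optimizer
# are assumptions; the disclosure only states that the model is trained on the
# difference between the predicted speech signal and the clean sample speech signal.
import torch

def train_step(model, optimizer, mixed_magnitude, clean_magnitude):
    # mixed_magnitude / clean_magnitude: (batch, 1, T, K) STFT amplitudes.
    predicted_mask = model(mixed_magnitude)               # predicted denoising parameters
    predicted_magnitude = predicted_mask * mixed_magnitude
    loss = torch.nn.functional.l1_loss(predicted_magnitude, clean_magnitude)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```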
For example, in training a speech processing model, convolution kernels, filters, and convolution parameters of a convolution layer in the speech processing model are set as shown in table 1 below:
TABLE 1
Wherein conv. denotes a feature extraction network, a feature reconstruction network or a convolution subunit, RNAM denotes a non-local attention network, RAM denotes a local attention network, Res. Unit denotes a hole residual unit or a hole residual subunit, Deconv. denotes a deconvolution subunit, and NL Unit denotes a residual non-local subunit.
In addition, in one possible implementation, the Wiener filtering method, the SEGAN (Speech Enhancement Generative Adversarial Network) method, the Wavelnet method, the MMSE-GAN (a speech enhancement generative adversarial network) method, the DFL (Deep Feature Loss) method, the MDPhD (a hybrid model) method and the RSGAN-GP (Speech Enhancement using Relativistic Generative Adversarial Networks with Gradient Penalty) method are adopted as reference methods, and these methods are compared with the method (RNANet) provided by the embodiments of the present disclosure.
The comparison of the above-described reference method with the method provided by the examples of the present disclosure is shown in table 2 below:
TABLE 2
Method     SSNR    PESQ   CSIG   CBAK   COVL
Noisy      1.68    1.97   3.35   2.44   2.63
Wiener     5.07    2.22   3.23   2.68   2.67
SEGAN      7.73    2.16   3.48   2.94   2.80
Wavelnet   —       —      3.62   3.23   2.98
DFL        —       —      3.86   3.33   3.22
MMSE-GAN   —       2.53   3.80   3.12   3.14
MDPhD      10.22   2.70   3.85   3.39   3.27
RNANet     10.16   2.71   3.98   3.42   3.35
Wherein, the larger the SSNR (Segmental Signal-to-Noise Ratio), the better the denoising effect; the larger the PESQ (Perceptual Evaluation of Speech Quality), the better the denoising effect; CSIG is the mean opinion score of signal distortion, and the larger the CSIG, the better the denoising effect; CBAK is the background noise prediction score, and the larger the CBAK, the better the denoising effect; COVL is a score of the overall quality of the speech signal, and likewise, the larger the COVL, the better.
In another possible implementation, to show the improvement in the intelligibility of the speech signal, STOI (Short-Time Objective Intelligibility) is used to compare the method provided by the present disclosure with the reference methods; the comparison results are shown in Table 3:
TABLE 3

Evaluation method   Noisy   MMSE-GAN   RSGAN-GP   RNANet
STOI                0.921   0.930      0.942      0.946
Wherein, the larger the STOI, the better the denoising effect.
As can be seen from the comparison results in Table 2 and Table 3 above, the denoising effect of the method provided by the embodiments of the disclosure is significantly better than that of the other methods.
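For reference, PESQ and STOI scores like those in Tables 2 and 3 can be computed with common third-party Python packages; the package names and calls below are assumptions about tooling, not part of the patent, which does not prescribe any particular implementation of these measures.

```python
# Hedged example of computing two of the reported metrics with the third-party
# `pesq` and `pystoi` packages (an assumption about tooling; the patent does not
# prescribe any particular implementation of these measures).
from pesq import pesq
from pystoi import stoi

def evaluate(clean, denoised, fs=16000):
    return {
        "PESQ": pesq(fs, clean, denoised, "wb"),            # wideband PESQ
        "STOI": stoi(clean, denoised, fs, extended=False),  # short-time objective intelligibility
    }
```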
Fig. 13 is a block diagram illustrating a processing apparatus for a voice signal according to an exemplary embodiment. Referring to fig. 13, the apparatus includes:
A feature determination unit 1301 configured to perform determining a first speech feature of a plurality of speech signal frames in the original speech signal;
a non-local feature obtaining unit 1302 configured to perform invoking a non-local attention network to fuse the first speech features of the plurality of speech signal frames, so as to obtain a non-local speech feature of each speech signal frame;
a mixed feature obtaining unit 1303 configured to perform calling of the local attention network to process the non-local voice feature of each voice signal frame, respectively, to obtain a mixed voice feature of each voice signal frame;
a denoising parameter acquisition unit 1304 configured to perform acquisition of denoising parameters based on mixed speech features of a plurality of speech signal frames;
the target signal obtaining unit 1305 is configured to perform denoising on the original speech signal according to the denoising parameter, so as to obtain a target speech signal.
According to the apparatus provided by the embodiments of the disclosure, the non-local attention network and the local attention network are invoked to process the first speech features of a plurality of speech signal frames in an original speech signal and obtain denoising parameters. These denoising parameters represent the proportion of the signal other than noise in each speech signal frame, so denoising the original speech signal with them removes the noise from the original speech signal. Moreover, when the non-local attention network is invoked to process the first speech feature of each speech signal frame, the context information of each speech signal frame is taken into account, so the obtained denoising parameters are more accurate and the denoising effect on the original speech signal is improved.
In one possible implementation, the feature determining unit 1301 is configured to perform feature extraction on the original amplitudes of the plurality of speech signal frames by invoking the feature extraction network, so as to obtain a first speech feature of the plurality of speech signal frames.
In another possible implementation manner, referring to fig. 14, the target signal acquiring unit 1305 includes:
an amplitude obtaining subunit 1315, configured to perform invoking of a voice denoising network, and denoise the original amplitudes of the plurality of voice signal frames according to the denoising parameters, so as to obtain target amplitudes of the plurality of voice signal frames;
The signal acquisition subunit 1325 is configured to perform combining the original phases of the plurality of speech signal frames and the target amplitude to obtain the target speech signal.
In another possible implementation manner, the denoising parameter obtaining unit 1304 is configured to perform feature reconstruction on a mixed speech feature of a plurality of speech signal frames by calling a feature reconstruction network, so as to obtain denoising parameters.
In another possible implementation, the non-local attention network includes a first processing unit, a second processing unit, and a first fusion unit, see fig. 14, and the non-local feature acquisition unit 1302 includes:
The feature extraction subunit 1312 is configured to perform feature extraction on the first speech features of the plurality of speech signal frames respectively by invoking a first processing unit, so as to obtain a second speech feature of each speech signal frame, where the first processing unit includes a plurality of hole residual subunits;
A first fusion subunit 1322 configured to execute invoking the second processing unit to fuse the first speech feature of each speech signal frame with the first speech features of other speech signal frames, respectively, to obtain a third speech feature of each speech signal frame;
The second merging subunit 1332 is configured to perform invoking the first merging unit to merge the second voice feature and the third voice feature of each voice signal frame respectively, so as to obtain the non-local voice feature of each voice signal frame.
In another possible implementation manner, the non-local attention network further includes a second merging unit, see fig. 14, and the non-local feature acquiring unit 1302 further includes:
The third fusion subunit 1342 is configured to perform fusion on the non-local speech feature and the first speech feature of each speech signal frame by invoking the second fusion unit, so as to obtain the fused non-local speech feature of each speech signal frame.
In another possible implementation, the second processing unit includes a residual non-local subunit, a convolution subunit, and a deconvolution subunit; referring to fig. 14, a first fusion subunit 1322 is configured to perform:
Invoking a residual non-local subunit, and respectively carrying out weighted fusion on the first voice characteristic of each voice signal frame and the first voice characteristics of other voice signal frames according to the weights corresponding to a plurality of voice signal frames to obtain the first voice characteristic after weighted fusion of each voice signal frame;
Invoking a convolution subunit to encode the first voice characteristic after weighting and fusing each voice signal frame to obtain the encoding characteristic of each voice signal frame;
And calling a deconvolution subunit to decode the coding characteristic of each voice signal frame to obtain a third voice characteristic of each voice signal frame.
In another possible implementation, the second processing unit further includes a feature reduction subunit, see fig. 14, the first fusion subunit 1322 configured to perform:
Invoking a feature reduction subunit to perform feature reduction on the coding features of each voice signal frame to obtain a plurality of reduced coding features;
Invoking a deconvolution layer to decode the encoded features of each speech signal frame to obtain a third speech feature of each speech signal frame, comprising:
And calling a deconvolution layer, and decoding the reduced coding features to obtain a third voice feature of each voice signal frame.
In another possible implementation, the residual non-local subunit includes a first fusion layer and a second fusion layer, see fig. 14, the first fusion subunit 1322 configured to perform:
Invoking a first fusion layer, and respectively carrying out weighted fusion on the first voice characteristic of each voice signal frame and the first voice characteristic of other voice signal frames according to the weights corresponding to the voice signal frames to obtain fusion characteristics of each voice signal frame;
and calling a second fusion layer to fuse the first voice characteristics and the fusion characteristics of each voice signal frame respectively, so as to obtain the first voice characteristics of each voice signal frame after weighted fusion.
In another possible implementation, the speech processing model includes at least a non-local attention network and a local attention network, and the training process of the speech processing model is as follows:
Acquiring a sample voice signal and a sample noise signal;
Mixing a sample voice signal with a sample noise signal to obtain a sample mixed signal;
Invoking a voice processing model, and processing a plurality of sample voice signal frames in the sample mixed signal to obtain a predicted denoising parameter corresponding to the sample mixed signal;
denoising the original voice signal according to the predicted denoising parameters to obtain a denoised predicted voice signal;
a speech processing model is trained based on differences between the predicted speech signal and the sample speech signal.
The specific manner in which the individual units perform the operations in relation to the apparatus of the above embodiments has been described in detail in relation to the embodiments of the method and will not be described in detail here.
In an exemplary embodiment, an electronic device is provided that includes one or more processors and volatile or non-volatile memory for storing instructions executable by the one or more processors; wherein the one or more processors are configured to perform the method of processing speech signals in the above-described embodiments.
In one possible implementation, the electronic device is provided as a terminal. Fig. 15 is a block diagram illustrating a structure of a terminal 1500 according to an exemplary embodiment. The terminal 1500 may be a portable mobile terminal such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. Terminal 1500 may also be referred to as a user device, a portable terminal, a laptop terminal, a desktop terminal, and the like.
The terminal 1500 includes: a processor 1501 and a memory 1502.
The processor 1501 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1501 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array) or a PLA (Programmable Logic Array). The processor 1501 may also include a main processor and a coprocessor; the main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 1501 may be integrated with a GPU (Graphics Processing Unit) for rendering and drawing the content to be displayed by the display screen. In some embodiments, the processor 1501 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1502 may include one or more computer-readable storage media, which may be non-transitory. Memory 1502 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 1502 is used to store at least one program code for execution by the processor 1501 to implement the method of processing a speech signal provided by the method embodiments in the present disclosure.
In some embodiments, the terminal 1500 may further optionally include: a peripheral interface 1503 and at least one peripheral device. The processor 1501, memory 1502 and peripheral interface 1503 may be connected by a bus or signal lines. The individual peripheral devices may be connected to the peripheral device interface 1503 via a bus, signal lines, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1504, a display screen 1505, a camera assembly 1506, audio circuitry 1507, a positioning assembly 1508, and a power supply 1509.
A peripheral interface 1503 may be used to connect I/O (Input/Output) related at least one peripheral device to the processor 1501 and the memory 1502. In some embodiments, processor 1501, memory 1502, and peripheral interface 1503 are integrated on the same chip or circuit board; in some other embodiments, either or both of the processor 1501, the memory 1502, and the peripheral interface 1503 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.
The radio frequency circuit 1504 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1504 communicates with a communication network and other communication devices via electromagnetic signals; it converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals. Optionally, the radio frequency circuit 1504 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1504 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the World Wide Web, metropolitan area networks, intranets, mobile communication networks of each generation (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1504 may also include NFC (Near Field Communication) related circuits, which is not limited by the present disclosure.
Display 1505 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When display screen 1505 is a touch display screen, it also has the ability to collect touch signals at or above its surface. The touch signal may be input to the processor 1501 as a control signal for processing. At this point, display 1505 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display 1505, disposed on the front panel of the terminal 1500; in other embodiments, there may be at least two displays 1505, respectively disposed on different surfaces of the terminal 1500 or in a folded design; in still other embodiments, display 1505 may be a flexible display disposed on a curved surface or a folded surface of terminal 1500. Moreover, the display 1505 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The display screen 1505 may be made of materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
The camera assembly 1506 is used to capture images or video. Optionally, the camera assembly 1506 includes a front camera and a rear camera. The front camera is arranged on the front panel of the terminal, and the rear camera is arranged on the back of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, Virtual Reality (VR) shooting or other fused shooting functions. In some embodiments, the camera assembly 1506 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.
The audio circuitry 1507 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and the environment, converting the sound waves into electric signals, inputting the electric signals to the processor 1501 for processing, or inputting the electric signals to the radio frequency circuit 1504 for voice communication. For purposes of stereo acquisition or noise reduction, a plurality of microphones may be respectively disposed at different portions of the terminal 1500. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 1501 or the radio frequency circuit 1504 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 1507 may also include a headphone jack.
The positioning component 1508 is used to locate the current geographic location of the terminal 1500 to enable navigation or LBS (Location Based Service). The positioning component 1508 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 1509 is used to power the various components in the terminal 1500. The power supply 1509 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power supply 1509 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the terminal 1500 also includes one or more sensors 1510. The one or more sensors 1510 include, but are not limited to: acceleration sensor 1511, gyroscope sensor 1512, pressure sensor 1513, fingerprint sensor 1514, optical sensor 1515, and proximity sensor 1516.
The acceleration sensor 1511 may detect the magnitudes of accelerations on three coordinate axes of the coordinate system established with the terminal 1500. For example, the acceleration sensor 1511 may be used to detect components of gravitational acceleration in three coordinate axes. The processor 1501 may control the display screen 1505 to display the user interface in a landscape view or a portrait view based on the gravitational acceleration signal acquired by the acceleration sensor 1511. The acceleration sensor 1511 may also be used for the acquisition of motion data of a game or user.
The gyro sensor 1512 may detect a body direction and a rotation angle of the terminal 1500, and the gyro sensor 1512 may collect 3D motion of the terminal 1500 by a user in cooperation with the acceleration sensor 1511. The processor 1501, based on the data collected by the gyro sensor 1512, may implement the following functions: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
The pressure sensor 1513 may be disposed on a side frame of the terminal 1500 and/or under the display 1505. When the pressure sensor 1513 is disposed on the side frame of the terminal 1500, a grip signal of the user on the terminal 1500 may be detected, and the processor 1501 performs left-right hand recognition or quick operation according to the grip signal collected by the pressure sensor 1513. When the pressure sensor 1513 is disposed at the lower layer of the display screen 1505, the processor 1501 realizes control of the operability control on the UI interface according to the pressure operation of the user on the display screen 1505. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 1514 is used to collect a fingerprint of a user, and the processor 1501 identifies the identity of the user based on the fingerprint collected by the fingerprint sensor 1514, or the fingerprint sensor 1514 identifies the identity of the user based on the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 1501 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying for and changing settings, etc. The fingerprint sensor 1514 may be disposed on the front, back, or side of the terminal 1500. When a physical key or vendor Logo is provided on the terminal 1500, the fingerprint sensor 1514 may be integrated with the physical key or vendor Logo.
The optical sensor 1515 is used to collect the ambient light intensity. In one embodiment, processor 1501 may control the display brightness of display screen 1505 based on the intensity of ambient light collected by optical sensor 1515. Specifically, when the ambient light intensity is high, the display brightness of the display screen 1505 is turned up; when the ambient light intensity is low, the display luminance of the display screen 1505 is turned down. In another embodiment, the processor 1501 may also dynamically adjust the shooting parameters of the camera assembly 1506 based on the ambient light intensity collected by the optical sensor 1515.
A proximity sensor 1516, also referred to as a distance sensor, is provided on the front panel of the terminal 1500. The proximity sensor 1516 is used to collect the distance between the user and the front of the terminal 1500. In one embodiment, when the proximity sensor 1516 detects a gradual decrease in the distance between the user and the front of the terminal 1500, the processor 1501 controls the display 1505 to switch from the on-screen state to the off-screen state; when the proximity sensor 1516 detects that the distance between the user and the front surface of the terminal 1500 gradually increases, the processor 1501 controls the display screen 1505 to switch from the off-screen state to the on-screen state.
Those skilled in the art will appreciate that the structure shown in fig. 15 is not limiting and that more or fewer components than shown may be included or certain components may be combined or a different arrangement of components may be employed.
In another possible implementation, the electronic device is provided as a server. Fig. 16 is a block diagram illustrating a server 1600 according to an exemplary embodiment, which may vary considerably in configuration or performance, and may include one or more processors (Central Processing Units, CPU) 1601 and one or more memories 1602, wherein the memories 1602 store at least one program code that is loaded and executed by the processors 1601 to implement the methods provided by the various method embodiments described above. Of course, the server may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described herein.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided; when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is caused to perform the steps performed by the terminal or the server in the above-described method of processing a speech signal. For example, the non-transitory computer-readable storage medium may be a ROM (Read-Only Memory), a RAM (Random Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, which, when executed by a processor of an electronic device, enables the electronic device to perform the steps performed by the terminal or server in the above-described method of processing a speech signal.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This disclosure is intended to cover any adaptations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (21)

1. A method for processing a speech signal, the method comprising:
determining a first speech feature of a plurality of speech signal frames in the original speech signal;
Invoking a first processing unit to respectively perform feature extraction on first voice features of the plurality of voice signal frames to obtain second voice features of each voice signal frame;
invoking a second processing unit to fuse the first voice characteristic of each voice signal frame with the first voice characteristics of other voice signal frames respectively to obtain a third voice characteristic of each voice signal frame;
Invoking a first fusion unit to respectively fuse the second voice characteristic and the third voice characteristic of each voice signal frame to obtain a non-local voice characteristic of each voice signal frame, wherein a non-local attention network comprises the first processing unit, the second processing unit and the first fusion unit;
invoking a third processing unit to respectively extract the characteristics of the non-local voice characteristics of each voice signal frame to obtain a fourth voice characteristic of each voice signal frame;
Invoking a first convolution subunit to encode the fourth voice feature of each voice signal frame to obtain a first encoding feature of each voice signal frame; invoking a first deconvolution subunit to decode the first coding feature of each voice signal frame to obtain a fifth voice feature of each voice signal frame;
Invoking a third fusion unit to respectively fuse a fourth voice characteristic and a fifth voice characteristic of each voice signal frame to obtain a mixed voice characteristic of each voice signal frame, wherein a local attention network comprises the third processing unit, a fourth processing unit and the third fusion unit, and the fourth processing unit comprises the first convolution subunit and the first deconvolution subunit;
Invoking a feature reconstruction network to reconstruct the characteristics of the mixed voice features of the voice signal frames to obtain denoising parameters;
and denoising the original voice signal according to the denoising parameters to obtain a target voice signal.
2. The processing method of claim 1, wherein said determining a first speech feature of a plurality of speech signal frames in the original speech signal comprises:
And calling a feature extraction network to respectively perform feature extraction on the original amplitudes of the voice signal frames to obtain first voice features of the voice signal frames.
3. The processing method according to claim 2, wherein denoising the original speech signal according to the denoising parameter to obtain a target speech signal, comprises:
calling a voice denoising network, and denoising the original amplitudes of the voice signal frames according to the denoising parameters to obtain target amplitudes of the voice signal frames;
and combining the original phases of the voice signal frames and the target amplitude to obtain the target voice signal.
4. The processing method of claim 1, wherein the first processing unit includes a plurality of hole residual subunits therein.
5. The processing method according to claim 1, wherein the non-local attention network further includes a second fusion unit, the invoking the first fusion unit fuses the second speech feature and the third speech feature of each speech signal frame, respectively, to obtain the non-local speech feature of each speech signal frame, and the processing method further includes:
and invoking the second fusion unit to fuse the non-local voice characteristics of each voice signal frame with the first voice characteristics to obtain the fused non-local voice characteristics of each voice signal frame.
6. The processing method of claim 1, wherein the second processing unit comprises a residual non-local subunit, a second convolution subunit, and a second deconvolution subunit; the invoking the second processing unit to fuse the first voice feature of each voice signal frame with the first voice features of other voice signal frames respectively to obtain a third voice feature of each voice signal frame, including:
invoking the residual non-local subunit, and respectively carrying out weighted fusion on the first voice characteristic of each voice signal frame and the first voice characteristic of other voice signal frames according to the weights corresponding to the voice signal frames to obtain the first voice characteristic after weighted fusion of each voice signal frame;
invoking the second convolution subunit to encode the first voice characteristic after weighting and fusing each voice signal frame to obtain the encoding characteristic of each voice signal frame;
And calling the second deconvolution subunit to decode the coding feature of each voice signal frame to obtain a third voice feature of each voice signal frame.
7. The processing method according to claim 6, wherein the second processing unit further includes a feature reduction subunit, and the invoking the second convolution subunit encodes the weighted and fused first speech feature of each speech signal frame to obtain the encoded feature of each speech signal frame, and further includes:
Invoking the feature reduction subunit to perform feature reduction on the coding features of each voice signal frame to obtain a plurality of reduced coding features;
the invoking the second deconvolution subunit, decoding the coding feature of each voice signal frame to obtain a third voice feature of each voice signal frame, including:
and calling the second deconvolution subunit to decode the plurality of reduced coding features to obtain a third voice feature of each voice signal frame.
8. The processing method according to claim 6, wherein the residual non-local subunit includes a first fusion layer and a second fusion layer, the invoking the residual non-local subunit performs weighted fusion on the first speech feature of each speech signal frame and the first speech feature of the other speech signal frames according to weights corresponding to the plurality of speech signal frames, to obtain the weighted fused first speech feature of each speech signal frame, and includes:
Invoking the first fusion layer, and respectively carrying out weighted fusion on the first voice characteristic of each voice signal frame and the first voice characteristic of other voice signal frames according to the weights corresponding to the voice signal frames to obtain fusion characteristics of each voice signal frame;
And invoking the second fusion layer to fuse the first voice characteristics and the fusion characteristics of each voice signal frame respectively to obtain the first voice characteristics after weighting and fusing of each voice signal frame.
9. The processing method according to any one of claims 1 to 8, wherein a speech processing model comprises at least the non-local attention network, the local attention network and the feature reconstruction network, and the training process of the speech processing model is as follows:
Acquiring a sample voice signal and a sample noise signal;
mixing the sample voice signal with the sample noise signal to obtain a sample mixed signal;
Invoking the voice processing model to process a plurality of sample voice signal frames in the sample mixed signal to obtain a predicted denoising parameter corresponding to the sample mixed signal;
denoising the original voice signal according to the predicted denoising parameters to obtain a denoised predicted voice signal;
The speech processing model is trained based on differences between the predicted speech signal and the sample speech signal.
10. A processing apparatus for a speech signal, the processing apparatus comprising:
A feature determination unit configured to perform determination of a first speech feature of a plurality of speech signal frames in an original speech signal;
a non-local feature acquisition unit, the non-local feature acquisition unit comprising:
The feature extraction subunit is configured to execute calling of the first processing unit, and respectively perform feature extraction on the first voice features of the plurality of voice signal frames to obtain second voice features of each voice signal frame;
The first fusion subunit is configured to execute the call of the second processing unit, and respectively fuse the first voice characteristic of each voice signal frame with the first voice characteristics of other voice signal frames to obtain the third voice characteristic of each voice signal frame;
The second fusion subunit is configured to execute the calling of the first fusion unit, and respectively fuse the second voice feature and the third voice feature of each voice signal frame to obtain the non-local voice feature of each voice signal frame, wherein the non-local attention network comprises the first processing unit, the second processing unit and the first fusion unit;
The mixed feature acquisition unit is configured to execute calling of the third processing unit, and feature extraction is respectively carried out on the non-local voice features of each voice signal frame to obtain fourth voice features of each voice signal frame; invoking a first convolution subunit to encode the fourth voice feature of each voice signal frame to obtain a first encoding feature of each voice signal frame; invoking a first deconvolution subunit to decode the first coding feature of each voice signal frame to obtain a fifth voice feature of each voice signal frame; invoking a third fusion unit to respectively fuse a fourth voice characteristic and a fifth voice characteristic of each voice signal frame to obtain a mixed voice characteristic of each voice signal frame, wherein a local attention network comprises the third processing unit, a fourth processing unit and the third fusion unit, and the fourth processing unit comprises the first convolution subunit and the first deconvolution subunit;
the denoising parameter acquisition unit is configured to execute feature reconstruction on the mixed voice features of the plurality of voice signal frames by calling a feature reconstruction network to obtain denoising parameters;
And the target signal acquisition unit is configured to perform denoising on the original voice signal according to the denoising parameters to obtain a target voice signal.
11. The processing device according to claim 10, wherein the feature determining unit is configured to perform feature extraction of the original amplitudes of the plurality of speech signal frames, respectively, by invoking a feature extraction network, resulting in a first speech feature of the plurality of speech signal frames.
12. The processing apparatus according to claim 11, wherein the target signal acquisition unit includes:
The amplitude acquisition subunit is configured to execute calling of a voice denoising network, and denoise the original amplitudes of the voice signal frames according to the denoising parameters to obtain target amplitudes of the voice signal frames;
And a signal acquisition subunit configured to perform combination of the original phases of the plurality of speech signal frames and a target amplitude to obtain the target speech signal.
13. The processing apparatus of claim 10, wherein the first processing unit comprises a plurality of hole residual subunits therein.
14. The processing apparatus according to claim 10, wherein the non-local attention network further comprises a second fusion unit, the non-local feature acquisition unit further comprising:
And the third fusion subunit is configured to execute the second fusion unit to fuse the non-local voice characteristic and the first voice characteristic of each voice signal frame so as to obtain the fused non-local voice characteristic of each voice signal frame.
15. The processing apparatus of claim 10, wherein the second processing unit comprises a residual non-local subunit, a second convolution subunit, and a second deconvolution subunit; the first fusion subunit is configured to perform:
invoking the residual non-local subunit, and respectively carrying out weighted fusion on the first voice characteristic of each voice signal frame and the first voice characteristic of other voice signal frames according to the weights corresponding to the voice signal frames to obtain the first voice characteristic after weighted fusion of each voice signal frame;
invoking the second convolution subunit to encode the first voice characteristic after weighting and fusing each voice signal frame to obtain the encoding characteristic of each voice signal frame;
And calling the second deconvolution subunit to decode the coding feature of each voice signal frame to obtain a third voice feature of each voice signal frame.
16. The processing apparatus of claim 15, wherein the second processing unit further comprises a feature reduction subunit, the first fusion subunit configured to perform:
Invoking the feature reduction subunit to perform feature reduction on the coding features of each voice signal frame to obtain a plurality of reduced coding features;
and calling the second deconvolution subunit to decode the plurality of reduced coding features to obtain a third voice feature of each voice signal frame.
17. The processing apparatus of claim 15, wherein the residual non-local subunit comprises a first fusion layer and a second fusion layer, the first fusion subunit configured to perform:
Invoking the first fusion layer, and respectively carrying out weighted fusion on the first voice characteristic of each voice signal frame and the first voice characteristic of other voice signal frames according to the weights corresponding to the voice signal frames to obtain fusion characteristics of each voice signal frame;
And invoking the second fusion layer to fuse the first voice characteristics and the fusion characteristics of each voice signal frame respectively to obtain the first voice characteristics after weighting and fusing of each voice signal frame.
18. The processing apparatus according to any one of claims 10-17, wherein a speech processing model comprises at least the non-local attention network, the local attention network and the feature reconstruction network, the training process of the speech processing model being as follows:
Acquiring a sample voice signal and a sample noise signal;
mixing the sample voice signal with the sample noise signal to obtain a sample mixed signal;
Invoking the voice processing model to process a plurality of sample voice signal frames in the sample mixed signal to obtain a predicted denoising parameter corresponding to the sample mixed signal;
denoising the original voice signal according to the predicted denoising parameters to obtain a denoised predicted voice signal;
The speech processing model is trained based on differences between the predicted speech signal and the sample speech signal.
19. An electronic device, the electronic device comprising:
one or more processors;
A memory for storing the one or more processor-executable instructions;
Wherein the one or more processors are configured to perform the method of processing a speech signal according to any one of claims 1 to 9.
20. A computer readable storage medium, characterized in that instructions in the computer readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of processing a speech signal according to any one of claims 1 to 9.
21. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the method of processing a speech signal according to any one of claims 1 to 9.
CN202110125640.5A 2021-01-29 2021-01-29 Voice signal processing method and device, electronic equipment and storage medium Active CN112967730B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110125640.5A CN112967730B (en) 2021-01-29 2021-01-29 Voice signal processing method and device, electronic equipment and storage medium
PCT/CN2021/116212 WO2022160715A1 (en) 2021-01-29 2021-09-02 Voice signal processing method and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110125640.5A CN112967730B (en) 2021-01-29 2021-01-29 Voice signal processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112967730A CN112967730A (en) 2021-06-15
CN112967730B true CN112967730B (en) 2024-07-02

Family

ID=76273584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110125640.5A Active CN112967730B (en) 2021-01-29 2021-01-29 Voice signal processing method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112967730B (en)
WO (1) WO2022160715A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111028861B (en) * 2019-12-10 2022-02-22 思必驰科技股份有限公司 Spectrum mask model training method, audio scene recognition method and system
CN112967730B (en) * 2021-01-29 2024-07-02 北京达佳互联信息技术有限公司 Voice signal processing method and device, electronic equipment and storage medium
CN113343924B (en) * 2021-07-01 2022-05-17 齐鲁工业大学 Modulation signal identification method based on cyclic spectrum features and a generative adversarial network
CN113674753B (en) * 2021-08-11 2023-08-01 河南理工大学 Voice enhancement method
CN114360566A (en) * 2022-01-25 2022-04-15 杭州涂鸦信息技术有限公司 Noise reduction processing method and device for voice signal and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108010514A (en) * 2017-11-20 2018-05-08 四川大学 Speech classification method based on a deep neural network
CN109919114A (en) * 2019-03-14 2019-06-21 浙江大学 Video description method based on recurrent convolutional decoding with a complementary attention mechanism

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5936069B2 (en) * 2011-01-13 2016-06-15 日本電気株式会社 VOICE PROCESSING DEVICE, ITS CONTROL METHOD AND ITS CONTROL PROGRAM, VEHICLE EQUIPPED WITH THE VOICE PROCESSING DEVICE, INFORMATION PROCESSING DEVICE, AND INFORMATION PROCESSING SYSTEM
US9613633B2 (en) * 2012-10-30 2017-04-04 Nuance Communications, Inc. Speech enhancement
CN106486131B (en) * 2016-10-14 2019-10-11 上海谦问万答吧云计算科技有限公司 Speech denoising method and device
CN109284749A (en) * 2017-07-19 2019-01-29 微软技术许可有限责任公司 Refine image recognition
CN109147798B (en) * 2018-07-27 2023-06-09 北京三快在线科技有限公司 Speech recognition method, device, electronic equipment and readable storage medium
KR20200119410A (en) * 2019-03-28 2020-10-20 한국과학기술원 System and Method for Recognizing Emotions from Korean Dialogues based on Global and Local Contextual Information
US11580970B2 (en) * 2019-04-05 2023-02-14 Samsung Electronics Co., Ltd. System and method for context-enriched attentive memory network with global and local encoding for dialogue breakdown detection
CN110148091A (en) * 2019-04-10 2019-08-20 深圳市未来媒体技术研究院 Neural network model and image super-resolution method based on non local attention mechanism
CN110415702A (en) * 2019-07-04 2019-11-05 北京搜狗科技发展有限公司 Training method and device, conversion method and device
CN110298413B (en) * 2019-07-08 2021-07-16 北京字节跳动网络技术有限公司 Image feature extraction method and device, storage medium and electronic equipment
CN110739002B (en) * 2019-10-16 2022-02-22 中山大学 Complex-domain speech enhancement method, system and medium based on a generative adversarial network
CN110992974B (en) * 2019-11-25 2021-08-24 百度在线网络技术(北京)有限公司 Speech recognition method, apparatus, device and computer readable storage medium
CN111341331B (en) * 2020-02-25 2023-04-18 厦门亿联网络技术股份有限公司 Voice enhancement method, device and medium based on local attention mechanism
CN112071307A (en) * 2020-09-15 2020-12-11 江苏慧明智能科技有限公司 Intelligent incomplete voice recognition method for elderly people
CN112257758A (en) * 2020-09-27 2021-01-22 浙江大华技术股份有限公司 Fine-grained image recognition method, convolutional neural network and training method thereof
CN112967730B (en) * 2021-01-29 2024-07-02 北京达佳互联信息技术有限公司 Voice signal processing method and device, electronic equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108010514A (en) * 2017-11-20 2018-05-08 四川大学 Speech classification method based on a deep neural network
CN109919114A (en) * 2019-03-14 2019-06-21 浙江大学 Video description method based on recurrent convolutional decoding with a complementary attention mechanism

Also Published As

Publication number Publication date
CN112967730A (en) 2021-06-15
WO2022160715A1 (en) 2022-08-04

Similar Documents

Publication Publication Date Title
CN112967730B (en) Voice signal processing method and device, electronic equipment and storage medium
CN110097019B (en) Character recognition method, character recognition device, computer equipment and storage medium
CN110110787A (en) Method and device for acquiring a target location, computer equipment and storage medium
CN111696532B (en) Speech recognition method, device, electronic equipment and storage medium
CN108320756B (en) Method and device for detecting whether audio is pure music audio
CN111445901B (en) Audio data acquisition method and device, electronic equipment and storage medium
CN112084811A (en) Identity information determining method and device and storage medium
CN112581358A (en) Training method of image processing model, image processing method and device
CN110503160A (en) Image-recognizing method, device, electronic equipment and storage medium
CN110572710B (en) Video generation method, device, equipment and storage medium
CN109961802B (en) Sound quality comparison method, device, electronic equipment and storage medium
CN112133319B (en) Audio generation method, device, equipment and storage medium
CN112233688B (en) Audio noise reduction method, device, equipment and medium
CN113763931A (en) Waveform feature extraction method and device, computer equipment and storage medium
CN114547429A (en) Data recommendation method and device, server and storage medium
CN112508959A (en) Video object segmentation method and device, electronic equipment and storage medium
CN115206305B (en) Semantic text generation method and device, electronic equipment and storage medium
CN113724739B (en) Method, terminal and storage medium for retrieving audio and training acoustic model
CN113301444B (en) Video processing method and device, electronic equipment and storage medium
CN112151017B (en) Voice processing method, device, system, equipment and storage medium
CN114996515A (en) Training method of video feature extraction model, text generation method and device
CN114332709A (en) Video processing method, video processing device, storage medium and electronic equipment
CN114429768A (en) Training method, device, equipment and storage medium for speaker log model
CN113920979A (en) Voice data acquisition method, device, equipment and computer readable storage medium
CN113012064A (en) Image processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant