CN114360562A

CN114360562A - Voice processing method, device, electronic equipment and storage medium

Info

Publication number: CN114360562A
Application number: CN202111555187.8A
Authority: CN
Inventors: 宁峻; 于利标; 魏建强
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-12-17
Filing date: 2021-12-17
Publication date: 2022-04-15

Abstract

The present disclosure provides a speech processing method and apparatus, an electronic device and a storage medium, which relate to the technical field of artificial intelligence, and in particular to the technical fields of speech enhancement and deep learning. The specific implementation scheme is as follows: amplitude component features and phase component features corresponding to a plurality of frequency band components of a voice signal are obtained, which increases the amount of information available to the model; according to the amplitude component features and the phase component features corresponding to each frequency band component, an attention model captures the correlation of each frequency band component in time and frequency, which improves the processing effect of the model, and outputs an amplitude correction factor and a phase correction factor corresponding to each frequency band component; the amplitude component and the phase component corresponding to each frequency band component are corrected accordingly; and a target voice signal is then obtained through sub-band synthesis, which improves the effects of noise suppression and reverberation elimination.

Description

Voice processing method, device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of artificial intelligence technologies, in particular to the fields of speech enhancement and deep learning technologies, and more specifically to a speech processing method and apparatus, an electronic device, and a storage medium.

Background

In everyday acoustic environments, speech is often disturbed by reverberation and background noise, which can reduce the intelligibility and quality of speech, affect the user experience in audio calls, and affect the performance of speech synthesis systems. Therefore, noise suppression and reverberation cancellation have become an active research topic in speech signal enhancement.

Disclosure of Invention

The disclosure provides a voice processing method, a voice processing device, an electronic device and a storage medium.

According to an aspect of the present disclosure, there is provided a speech processing method including:

acquiring a plurality of frequency band components corresponding to the voice signal;

determining amplitude components and phase components corresponding to the plurality of frequency band components;

performing feature extraction on the amplitude components and the phase components corresponding to the plurality of frequency band components to obtain amplitude component features and phase component features corresponding to the plurality of frequency band components;

inputting the amplitude component characteristics and the phase component characteristics corresponding to the plurality of frequency band components into an attention model, and outputting amplitude correction factors and phase correction factors corresponding to the plurality of frequency band components;

correcting the amplitude components and the phase components corresponding to the plurality of frequency band components according to the amplitude correction factors and the phase correction factors corresponding to the plurality of frequency band components;

and performing sub-band synthesis according to the corrected amplitude components and phase components corresponding to the plurality of frequency band components to obtain a target voice signal.

According to another aspect of the present disclosure, there is provided a voice processing apparatus including:

the acquisition module is used for acquiring a plurality of frequency band components corresponding to the voice signals;

a determining module, configured to determine amplitude components and phase components corresponding to the plurality of frequency band components;

the characteristic extraction module is used for extracting the characteristics of the amplitude components and the phase components corresponding to the plurality of frequency band components to obtain the amplitude component characteristics and the phase component characteristics corresponding to the plurality of frequency band components;

the processing module is used for inputting the amplitude component characteristics and the phase component characteristics corresponding to the plurality of frequency band components into an attention model and outputting amplitude correction factors and phase correction factors corresponding to the plurality of frequency band components;

the correction module is used for correcting the amplitude components and the phase components corresponding to the plurality of frequency band components according to the amplitude correction factors and the phase correction factors corresponding to the plurality of frequency band components;

and the synthesis module is used for carrying out sub-band synthesis according to the corrected amplitude components and phase components corresponding to the plurality of frequency band components to obtain the target voice signal.

According to another aspect of the present disclosure, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to the previous aspect when executing the program.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as set forth in the preceding aspect.

According to another aspect of the present disclosure, a computer program product is provided, having a computer program stored thereon, which when executed by a processor, implements a method as described in the preceding aspect.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a schematic flow chart of a speech processing method according to an embodiment of the present disclosure;

FIG. 2 is a schematic flow chart diagram illustrating another speech processing method according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram illustrating a speech processing effect provided by an embodiment of the present disclosure;

FIG. 4 is a schematic flow chart diagram illustrating another speech processing method according to an embodiment of the present disclosure;

FIG. 5 is a schematic structural diagram of an attention model provided in an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of an attention network according to an embodiment of the present disclosure;

FIG. 7 is a schematic flow chart of another speech processing method provided in an embodiment of the present disclosure;

FIG. 8 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present disclosure;

FIG. 9 is a schematic block diagram of an electronic device provided by an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

A voice processing method, apparatus, electronic device, and storage medium of the embodiments of the present disclosure are described below with reference to the accompanying drawings.

In everyday acoustic environments, speech is often disturbed by room reverberation and background noise, which can degrade the intelligibility and quality of speech and affect the user experience in audio telephony. However, because the available information is limited, single-channel algorithms in the related art have difficulty achieving ideal noise suppression and reverberation elimination. Therefore, the present disclosure provides a speech processing method. The method obtains amplitude component features and phase component features corresponding to a plurality of frequency band components of a speech signal, which increases the amount of information available to the model. An attention model then captures the correlation of each frequency band component in time and frequency, which improves the processing effect of the model, and outputs an amplitude correction factor and a phase correction factor corresponding to each frequency band component. These factors are used to correct the amplitude component and the phase component corresponding to each frequency band component, and a target speech signal is obtained through sub-band synthesis, thereby improving the noise suppression and reverberation elimination effects.

Fig. 1 is a schematic flow chart of a speech processing method according to an embodiment of the present disclosure.

The speech processing method in the embodiment of the present disclosure is executed by a speech processing apparatus, which may be disposed in an electronic device. The electronic device may be a terminal or a server, which is not limited in this embodiment.

As shown in fig. 1, the method may include the steps of:

Step 101, obtaining a plurality of frequency band components corresponding to a voice signal.

In the embodiment of the present disclosure, a speech signal collected by a speech collection device is obtained, that is, an unprocessed original speech signal, for example, a speech signal collected by a microphone. The collected speech signal is divided into frames and windowed to obtain each frame of the speech signal, and the frequency band components corresponding to each frame of the speech signal are then determined. This can be implemented in either of the following manners:

In an implementation manner of the embodiment of the present disclosure, sub-band decomposition is performed on each frame of the voice signal to divide it into a plurality of frequency bands, so as to obtain a plurality of frequency band components corresponding to each frame of the voice signal. Sub-band decomposition better prevents leakage of information between frequency bands, so that the information in different frequency bands is more independent and the efficiency of subsequent voice processing is improved.

In another implementation manner of the embodiment of the present disclosure, a short-time fourier transform is performed on each frame of voice signal to obtain a frequency spectrum of each frame of voice signal, and the frequency spectrum is subjected to frequency band division to obtain a plurality of frequency band components corresponding to the voice signal, so as to convert a time domain signal into a frequency domain signal.

It should be noted that the granularity of the frequency band division may be set according to the requirement, and is not limited in this embodiment.
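As a non-limiting illustration of the second implementation manner, the following minimal sketch frames a signal, applies a window, transforms each frame to the frequency domain, and divides the spectrum into bands. The function names, the Hann window, the frame length, the hop size, and the equal-width band division are all assumptions made for illustration and are not specified by the disclosure; a naive discrete Fourier transform is used only to keep the sketch self-contained.

```python
import cmath
import math

def frame_signal(signal, frame_len, hop):
    """Split a signal into overlapping frames and apply a Hann window to each."""
    # ASSUMPTION: a periodic Hann window; the disclosure does not name a window.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] * window[n] for n in range(frame_len)])
    return frames

def dft_half(frame):
    """Naive DFT of a real frame, keeping the non-redundant bins 0..N/2."""
    n_fft = len(frame)
    return [
        sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft))
        for k in range(n_fft // 2 + 1)
    ]

def split_bands(spectrum, num_bands):
    """Divide the spectrum bins into num_bands contiguous bands of near-equal width."""
    # ASSUMPTION: equal-width bands; the granularity may be set as required.
    n = len(spectrum)
    edges = [i * n // num_bands for i in range(num_bands + 1)]
    return [spectrum[edges[i]:edges[i + 1]] for i in range(num_bands)]

# Hypothetical input: a 1 kHz tone sampled at 8 kHz, 64 samples long.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(64)]
frames = frame_signal(signal, frame_len=16, hop=8)
bands_per_frame = [split_bands(dft_half(f), num_bands=3) for f in frames]
```

With these hypothetical settings each frame yields 9 frequency bins grouped into 3 bands, and the energy of the tone is concentrated in the lowest band.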

In the technical solution of the present disclosure, the acquisition, storage, and application of the personal information of the users involved all comply with the relevant laws and regulations and do not violate public order and good customs.

Step 102, determining amplitude components and phase components corresponding to a plurality of frequency band components.

In the embodiment of the present disclosure, for each frequency band component corresponding to a voice signal, an amplitude component and a phase component corresponding to each frequency band component are determined, so that the amount of information acquired from the voice signal is increased.
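As a non-limiting illustration, when a frequency band component is represented by complex frequency-domain values, its amplitude component and phase component can be read off as the modulus and the argument of each value. The function name and the sample values below are hypothetical.

```python
import cmath

def magnitude_and_phase(band):
    """Return (magnitudes, phases in radians) for the complex bins of one band."""
    polar = [cmath.polar(z) for z in band]
    magnitudes = [r for r, _ in polar]
    phases = [phi for _, phi in polar]
    return magnitudes, phases

# Hypothetical complex bins of one frequency band component.
band = [3 + 4j, 1j, -2 + 0j]
magnitudes, phases = magnitude_and_phase(band)
```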

Step 103, performing feature extraction on the amplitude components and the phase components corresponding to the plurality of frequency band components to obtain amplitude component features and phase component features corresponding to the plurality of frequency band components.

In the embodiment of the disclosure, for each frequency band component, feature extraction is performed separately on the corresponding amplitude component and phase component to obtain an amplitude component feature and a phase component feature. This increases the amount of information available to the subsequent attention model and thereby improves the performance of the attention model.

Step 104, inputting the amplitude component features and the phase component features corresponding to the plurality of frequency band components into the attention model, and outputting the amplitude correction factors and the phase correction factors corresponding to the plurality of frequency band components.

In the embodiment of the present disclosure, for the input amplitude component feature and phase component feature corresponding to each frequency band component, the attention model analyzes the correlation between the amplitude component feature and the phase component feature so as to capture the correlation of each frequency band component in time and frequency, which improves the processing capability of the attention model. The attention model then outputs an amplitude correction factor for correcting the amplitude component corresponding to each frequency band component and a phase correction factor for correcting the corresponding phase component.

Step 105, correcting the amplitude components and the phase components corresponding to the plurality of frequency band components according to the amplitude correction factors and the phase correction factors corresponding to the plurality of frequency band components.

In the embodiment of the disclosure, for each frequency band component, a corrected amplitude component is obtained according to the corresponding amplitude component and the amplitude correction factor, and a corrected phase component is obtained according to the corresponding phase component and the phase correction factor. As an implementation manner, for each frequency band component, a dot product (element-wise multiplication) operation is performed on the amplitude component and the corresponding amplitude correction factor to obtain the corrected amplitude component of the frequency band component, and the corrected phase component of the frequency band component is obtained by using the phase component and the corresponding phase correction factor.
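As a non-limiting illustration, the correction of one frequency band component can be sketched as follows. The amplitude correction is an element-wise multiplication. The disclosure does not state which operation combines the phase component with the phase correction factor, so the additive phase offset used here is purely an assumption, as are the function names and sample values.

```python
import math

def wrap_phase(angle):
    """Wrap an angle in radians to the interval [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def correct_band(magnitudes, phases, mag_factors, phase_factors):
    """Apply per-bin correction factors to one frequency band component."""
    # Element-wise multiplication of amplitude and amplitude correction factor.
    corrected_mags = [m * g for m, g in zip(magnitudes, mag_factors)]
    # ASSUMPTION: the phase correction factor is applied as an additive offset.
    corrected_phases = [wrap_phase(p + d) for p, d in zip(phases, phase_factors)]
    return corrected_mags, corrected_phases

# Hypothetical values for a band with two bins.
mags, phases = correct_band([2.0, 4.0], [3.0, -1.0], [0.5, 0.25], [0.5, 0.2])
```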

Step 106, performing sub-band synthesis according to the corrected amplitude components and phase components corresponding to the plurality of frequency band components to obtain a target voice signal.

In the embodiment of the disclosure, for each frequency band component, the corresponding amplitude component and phase component are corrected according to the corresponding amplitude correction factor and phase correction factor, so as to obtain the corrected amplitude components and phase components corresponding to the plurality of frequency band components. Sub-band synthesis is then performed to obtain a target speech signal, where the target speech signal is a time domain signal obtained by enhancing the acquired speech signal.
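As a non-limiting illustration, when the frequency band components are obtained by dividing a short-time spectrum, the synthesis can be sketched as rebuilding complex bins from the amplitude and phase components, applying an inverse transform per frame, and overlap-adding the frames. The function names, the naive transforms, and the sample frame are assumptions for illustration; a sub-band filter bank would use its own synthesis filters instead.

```python
import cmath
import math

def dft_half(frame):
    """Naive forward DFT of a real frame, keeping bins 0..N/2."""
    n_fft = len(frame)
    return [
        sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft))
        for k in range(n_fft // 2 + 1)
    ]

def bands_to_spectrum(band_mags, band_phases):
    """Rebuild complex bins from per-band magnitudes and phases, in band order."""
    return [
        cmath.rect(m, p)
        for mags, phases in zip(band_mags, band_phases)
        for m, p in zip(mags, phases)
    ]

def inverse_dft_real(half_spectrum, n_fft):
    """Inverse DFT of bins 0..N/2 (even N), using conjugate symmetry."""
    full = list(half_spectrum) + [
        half_spectrum[n_fft - k].conjugate() for k in range(n_fft // 2 + 1, n_fft)
    ]
    return [
        sum(full[k] * cmath.exp(2j * math.pi * k * n / n_fft) for k in range(n_fft)).real / n_fft
        for n in range(n_fft)
    ]

def overlap_add(frames, hop):
    """Sum time-domain frames back into one signal at the given hop size."""
    out = [0.0] * (hop * (len(frames) - 1) + len(frames[0]))
    for i, frame in enumerate(frames):
        for n, value in enumerate(frame):
            out[i * hop + n] += value
    return out

# Round trip on one hypothetical frame, split into two bands.
frame = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, -2.0, 1.5]
spectrum = dft_half(frame)
bands = [spectrum[:3], spectrum[3:]]
band_mags = [[abs(z) for z in band] for band in bands]
band_phases = [[cmath.phase(z) for z in band] for band in bands]
rebuilt = inverse_dft_real(bands_to_spectrum(band_mags, band_phases), n_fft=len(frame))
signal = overlap_add([rebuilt, rebuilt], hop=4)
```

With unit correction factors the rebuilt frame matches the original frame up to floating-point error, which is a convenient sanity check before corrected components are substituted.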

In the speech processing method of the embodiment of the disclosure, amplitude component features and phase component features corresponding to a plurality of frequency band components of a speech signal are obtained, which increases the amount of information available to the model. According to the amplitude component features and the phase component features corresponding to each frequency band component, the attention model captures the feature correlation of each frequency band component so as to reinforce the features of important regions, which improves the processing effect of the model. The attention model outputs an amplitude correction factor and a phase correction factor corresponding to each frequency band component, which are used to correct the amplitude component and the phase component corresponding to the frequency band component, and a target speech signal is then obtained through sub-band synthesis, thereby improving the speech enhancement effect.

Based on the foregoing embodiments, FIG. 2 is a schematic flow chart of another speech processing method provided in the embodiments of the present disclosure. As shown in FIG. 2, the method includes the following steps:

Step 201, obtaining a plurality of frequency band components corresponding to the voice signal.

Step 202, determining amplitude components and phase components corresponding to the plurality of frequency band components.

Step 203, extracting the features of the amplitude component and the phase component corresponding to the plurality of frequency band components to obtain the amplitude component features and the phase component features corresponding to the plurality of frequency band components.

Specifically, the explanation of steps 201 to 203 may refer to the explanation in the foregoing embodiments, and the principle is the same, which is not repeated in this embodiment.

Step 204, inputting the amplitude component characteristics and the phase component characteristics corresponding to the plurality of frequency band components into an encoding network of the attention model to obtain the encoding characteristics corresponding to the plurality of frequency band components.

In the embodiment of the disclosure, for each frequency band component, the corresponding amplitude component feature and phase component feature are input into the encoding network of the attention model, which increases the amount of voice signal information input to the encoding network. The encoding network encodes and fuses the amplitude component feature and the phase component feature to obtain finer-grained coding features that carry more information about the voice signal.

Step 205, inputting the coding features corresponding to the plurality of frequency band components into the attention network of the attention model to obtain the fusion features corresponding to the plurality of frequency band components.

Wherein the fused features comprise feature correlations of the corresponding band components in the time and frequency dimensions.

Step 206, inputting the fusion features corresponding to the plurality of frequency band components into a decoding network of the attention model to obtain the amplitude correction factors and the phase correction factors corresponding to the plurality of frequency band components.

In the embodiment of the disclosure, the coding features are processed through the attention network, and the feature correlation of each frequency band component in the time dimension and the frequency dimension is extracted to obtain a fusion feature containing the global correlation. The fusion feature and the decoding feature are input into the decoding network. The decoding network is formed by three layers of two-dimensional deconvolution and is used for reconstructing the spectral features compressed by the encoding network and the attention network, so as to obtain the amplitude correction factor and the phase correction factor corresponding to each frequency band component and improve the effect of the model.
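As a non-limiting illustration of how three layers of two-dimensional deconvolution can restore a compressed feature map, the following sketch computes the output shape of each layer. Only the number of layers comes from the description; the kernel size, stride, padding, channel counts, and input shape are assumptions chosen so that the frequency axis doubles at each layer and two output channels remain (one for the amplitude correction factor and one for the phase correction factor).

```python
def deconv2d_out(size, kernel, stride, padding, output_padding):
    """Output length of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

# ASSUMPTIONS: channel counts 64 -> 32 -> 16 -> 2, a 3x3 kernel, stride 1 in
# time and stride 2 in frequency; none of these are stated in the disclosure.
channels = [64, 32, 16, 2]
time_frames, freq_bins = 100, 8

shapes = [(channels[0], time_frames, freq_bins)]
for c_out in channels[1:]:
    t = deconv2d_out(shapes[-1][1], kernel=3, stride=1, padding=1, output_padding=0)
    f = deconv2d_out(shapes[-1][2], kernel=3, stride=2, padding=1, output_padding=1)
    shapes.append((c_out, t, f))
```

Under these assumptions the feature map grows from (64, 100, 8) to (2, 100, 64), so the time resolution is preserved while the frequency resolution is restored.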

Step 207, correcting the amplitude components and the phase components corresponding to the plurality of frequency band components according to the amplitude correction factors and the phase correction factors corresponding to the plurality of frequency band components.

Step 208, performing sub-band synthesis according to the corrected amplitude components and phase components corresponding to the plurality of frequency band components to obtain a target voice signal.

Step 207 and step 208 may refer to the explanations in the foregoing embodiments, and the principle is the same, which is not described again in this embodiment.

In a daily acoustic environment, speech is usually disturbed by reverberation and background noise. As an example, FIG. 3 is a schematic diagram of a voice processing effect provided by an embodiment of the present disclosure, in which the voice signal to be processed contains both reverberation and background noise. As shown in FIG. 3, the upper diagram shows the time domain waveform and the spectrogram of the voice signal to be processed. The spectrogram in the upper diagram shows that the voice signal to be processed has a long tail, which indicates strong reverberation, and that a certain amount of background noise is also present. The lower diagram shows the target voice signal obtained after processing. As can be seen from the spectrogram and the time domain waveform in the lower diagram, the background noise in the target voice signal is suppressed and the reverberation is eliminated, while the voice signal itself is almost undamaged, so the voice enhancement effect is improved.

In the speech processing method of the embodiment of the disclosure, amplitude component features and phase component features corresponding to a plurality of frequency band components of a speech signal are obtained, which increases the amount of information available to the model. According to the amplitude component features and the phase component features corresponding to each frequency band component, the attention model captures the feature correlation of each frequency band component in time and frequency so as to reinforce the corresponding important regions in the spectrogram, which improves the processing effect of the model. The attention model outputs an amplitude correction factor and a phase correction factor corresponding to each frequency band component, which are used to correct the amplitude component and the phase component corresponding to the frequency band component, and a target speech signal is then obtained through sub-band synthesis, thereby improving the speech enhancement effect.

Based on the foregoing embodiments, FIG. 4 is a schematic flow chart of another speech processing method provided in the embodiments of the present disclosure. As shown in FIG. 4, the method includes the following steps:

Step 401, obtaining a plurality of frequency band components corresponding to the voice signal.

Step 402, determining amplitude components and phase components corresponding to a plurality of frequency band components.

Step 403, performing feature extraction on the amplitude component and the phase component corresponding to the multiple frequency band components to obtain amplitude component features and phase component features corresponding to the multiple frequency band components.

Step 404, inputting the amplitude component characteristics and the phase component characteristics corresponding to the plurality of frequency band components into an encoding network of the attention model to obtain encoding characteristics corresponding to the plurality of frequency band components.

The coding network is composed of three layers of two-dimensional convolution, and the number of channels is 16, 32 and 64 respectively.
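As a non-limiting illustration, the shape of the feature map and the number of parameters at each of the three convolution layers can be computed as follows. The channel counts 16, 32 and 64 come from the description; the two input channels (one amplitude feature map and one phase feature map), the 3x3 kernel, the strides, the padding, and the input size are assumptions made only for this sketch.

```python
def conv2d_out(size, kernel, stride, padding):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

# Channel counts of the three layers are from the description; the 2 input
# channels and all other hyperparameters below are ASSUMPTIONS.
channels = [2, 16, 32, 64]
time_frames, freq_bins = 100, 64
kernel, padding = 3, 1
stride_t, stride_f = 1, 2

shapes = [(channels[0], time_frames, freq_bins)]
params = []
for c_in, c_out in zip(channels, channels[1:]):
    t = conv2d_out(shapes[-1][1], kernel, stride_t, padding)
    f = conv2d_out(shapes[-1][2], kernel, stride_f, padding)
    shapes.append((c_out, t, f))
    # Weights plus one bias per output channel.
    params.append(c_out * (c_in * kernel * kernel + 1))
```

Under these assumptions the encoder maps a (2, 100, 64) input to a (64, 100, 8) coding feature with 304, 4640 and 18496 parameters in the three layers.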

As an implementation, before the amplitude component feature corresponding to each frequency band component is input into the attention model, the amplitude component may be subjected to compression processing, for example, a root-extraction operation such as taking the square root, so as to compress the dynamic range of the amplitude component feature and allow the model to obtain better performance.
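As a non-limiting illustration of amplitude compression, the following sketch assumes that the compression is a square root (a power of 0.5) and measures the dynamic range before and after. The function names and sample amplitudes are hypothetical.

```python
import math

def compress(magnitudes, power=0.5):
    """Power-law compression; power=0.5 is a square root (an ASSUMPTION)."""
    return [m ** power for m in magnitudes]

def dynamic_range_db(magnitudes):
    """Ratio of the largest to the smallest amplitude, in decibels."""
    return 20 * math.log10(max(magnitudes) / min(magnitudes))

# Hypothetical amplitudes spanning four orders of magnitude.
magnitudes = [0.01, 1.0, 100.0]
compressed = compress(magnitudes)
range_before = dynamic_range_db(magnitudes)
range_after = dynamic_range_db(compressed)
```

For these values the dynamic range is halved on the decibel scale, from about 80 dB to about 40 dB, which is the effect the compression step relies on.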

Specifically, for steps 401 to 404, reference may be made to the explanation in the foregoing embodiments; the principle is the same, and details are not repeated in this embodiment.

As an example, FIG. 5 is a schematic structural diagram of an attention model provided in an embodiment of the present disclosure. As shown in FIG. 5, the attention model includes an encoding network, an attention network, and a decoding network. The attention model may include a plurality of attention networks; the output of a previous attention network is restored to the original 2-channel feature data and then used as the input of the subsequent attention network. Stacking multiple attention networks in this way improves the accuracy of the fusion feature.

Step 405, for the coding feature corresponding to each frequency band component, inputting the coding feature into a residual module of the attention network to obtain an intermediate coding feature.

In one implementation of the embodiment of the present disclosure, the attention network includes a residual module, a frequency attention module, and a frequency transformation module. FIG. 6 is a schematic structural diagram of an attention network according to an embodiment of the disclosure; in FIG. 6, a single residual module is taken as an example for illustration.

In the embodiment of the disclosure, for the coding feature corresponding to each of the plurality of frequency band components, the coding feature is input into the residual module of the attention network. While processing the coding feature, the residual module adds the input coding feature back to the processed result, so as to restore information in the original feature that may be lost during processing. This avoids loss of information in the intermediate coding feature and improves its accuracy.

It should be noted that the attention network in the embodiment of the present disclosure may also include a plurality of residual modules, so as to improve the accuracy of the intermediate coding feature.
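As a non-limiting illustration, the residual connection adds the input feature back to the processed feature. The processing function used below is a hypothetical stand-in for the layers inside the residual module, which the description does not detail.

```python
def residual_module(features, transform):
    """Return transform(features) + features, element by element."""
    transformed = transform(features)
    return [x + t for x, t in zip(features, transformed)]

def toy_transform(features):
    """HYPOTHETICAL processing: a scaled rectifier standing in for real layers."""
    return [0.5 * max(x, 0.0) for x in features]

coding_features = [1.0, -2.0, 3.0]
intermediate = residual_module(coding_features, toy_transform)
```

The negative input value, which the toy processing would discard, survives in the output because of the skip connection.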

Step 406, inputting the intermediate coding features into a frequency attention module of the attention network to obtain a first weighting coefficient, and weighting the intermediate coding features according to the first weighting coefficient to obtain first coding features weighted in a time dimension.

The first weighting coefficient indicates the frequency correlation of the corresponding frequency band component in the time dimension.

In the embodiment of the disclosure, for each frequency band component, the frequency attention module calculates, according to the intermediate coding feature, the frequency correlation of the corresponding frequency band component in the time dimension, so as to obtain the first weighting coefficient used to weight the intermediate coding feature in the time dimension. The first weighting coefficient is multiplied by the intermediate coding feature to output a coding feature in which important regions are enhanced. The frequency attention module captures the correlation between the time dimension and the frequency dimension while emphasizing the use of feature information in the time dimension, thereby capturing the global correlation of each frequency band component and improving the modeling capability of the model.

Step 407, inputting the intermediate coding features into a frequency transformation module of the attention network to obtain a second weighting coefficient, and weighting the intermediate coding features according to the second weighting coefficient to obtain a second coding feature weighted in a frequency dimension.

The second weighting coefficient indicates the time correlation of the corresponding frequency band component in the frequency dimension.

In the embodiment of the disclosure, for each frequency band component, the frequency transformation module calculates, according to the intermediate coding feature, the time correlation of the corresponding frequency band component in the frequency dimension, so as to obtain the second weighting coefficient used to weight the intermediate coding feature in the frequency dimension. The second weighting coefficient is multiplied by the intermediate coding feature to output a coding feature in which important regions are enhanced. The frequency transformation module captures the correlation between the time dimension and the frequency dimension while emphasizing the use of feature information in the frequency dimension, thereby capturing the global correlation of each frequency band component and improving the modeling capability of the model.
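As a non-limiting illustration of weighting along the two dimensions, the following sketch treats the intermediate coding feature as a matrix with one row per time frame and one column per frequency bin. The description does not specify how the weighting coefficients are computed, so the sigmoid of an axis-wise mean used here is only an assumed stand-in, and all function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def time_coefficients(features):
    """ASSUMED first weighting coefficient: one value per time frame (row)."""
    return [sigmoid(sum(row) / len(row)) for row in features]

def freq_coefficients(features):
    """ASSUMED second weighting coefficient: one value per frequency bin (column)."""
    num_frames = len(features)
    return [
        sigmoid(sum(row[f] for row in features) / num_frames)
        for f in range(len(features[0]))
    ]

def weight_in_time(features):
    """Multiply every row by its time coefficient."""
    w = time_coefficients(features)
    return [[w[t] * v for v in row] for t, row in enumerate(features)]

def weight_in_frequency(features):
    """Multiply every column by its frequency coefficient."""
    w = freq_coefficients(features)
    return [[w[f] * v for f, v in enumerate(row)] for row in features]

# Hypothetical intermediate coding feature: 3 time frames x 2 frequency bins.
intermediate = [[0.0, 2.0], [4.0, 0.0], [0.0, 0.0]]
first_coding = weight_in_time(intermediate)
second_coding = weight_in_frequency(intermediate)
```

Both weighted features keep the shape of the intermediate coding feature, which is what allows them to be combined with it in the subsequent fusion step.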

Step 408, obtaining a fusion feature according to the first coding feature, the second coding feature and the intermediate coding feature.

In the embodiment of the present disclosure, the first coding feature, the second coding feature, and the intermediate coding feature are fused by the input and output module to obtain a fusion feature, which may be implemented in either of the following manners:

In one implementation manner, the first coding feature, the second coding feature and the intermediate coding feature are spliced (concatenated) to obtain a fusion feature with increased feature dimensions, so that the amount of information included in the fusion feature is increased.

In another implementation manner, the first coding feature, the second coding feature and the intermediate coding feature are added to obtain a fusion feature without increasing feature dimension, so that the computation amount is reduced and the processing efficiency is improved under the condition of ensuring the feature information amount.
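As a non-limiting illustration of the two fusion manners, the following sketch represents each feature as a list of channels. Splicing along the channel axis and element-wise addition are shown side by side; the function names, the choice of the channel axis for splicing, and the sample values are assumptions.

```python
def fuse_by_splicing(first, second, intermediate):
    """Concatenate along the channel axis; the channel count triples."""
    return first + second + intermediate

def fuse_by_addition(first, second, intermediate):
    """Add element-wise; the feature dimensions stay unchanged."""
    return [
        [a + b + c for a, b, c in zip(ch1, ch2, ch3)]
        for ch1, ch2, ch3 in zip(first, second, intermediate)
    ]

# Hypothetical features with 2 channels of 3 values each.
first_coding = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
second_coding = [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]
intermediate = [[2.0, 0.0, 2.0], [0.0, 2.0, 0.0]]

spliced = fuse_by_splicing(first_coding, second_coding, intermediate)
added = fuse_by_addition(first_coding, second_coding, intermediate)
```

Splicing yields 6 channels that the decoding network must consume, whereas addition keeps 2 channels and therefore the smaller computation amount.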

Step 409, inputting the fusion features corresponding to the plurality of frequency band components into a decoding network of the attention model to obtain amplitude correction factors and phase correction factors corresponding to the plurality of frequency band components.

Step 410, according to the amplitude correction factor and the phase correction factor corresponding to the plurality of frequency band components, correcting the amplitude component and the phase component corresponding to the plurality of frequency band components.

Step 411, performing subband synthesis according to the amplitude component and the phase component corresponding to the plurality of corrected band components to obtain the target speech signal.

The explanation of steps 409 to 411 may refer to the explanations in the foregoing embodiments, and the principle is the same, which is not described again in this embodiment.

In the embodiment of the disclosure, the amplitude component and the phase component of the noisy signal are input into the model together. Compared with earlier approaches that use only the amplitude feature, the model can obtain more sufficient information, so that the performance of the model is improved. The amplitude component is subjected to a root-extraction operation, which compresses the dynamic range of the amplitude, so that the model obtains better performance. A frequency attention module and a frequency transformation module are introduced into the attention network to capture the global correlation of the speech spectrum, which greatly improves the modeling capability of the model and the speech enhancement effect.

Based on the foregoing embodiment, fig. 7 is a schematic flowchart of another speech processing method provided in the embodiment of the present disclosure, illustrating how the attention model is trained. As shown in fig. 7, before the foregoing step 104, the method includes the following steps:

Step 701, obtaining a training sample.

Wherein, the training sample comprises a speech signal to be processed and a standard speech signal.

The standard speech signal refers to a clean speech signal without noise, reverberation and other interference signals.

Step 702, obtaining a plurality of frequency band components corresponding to the voice signal to be processed.

Step 703, inputting a plurality of frequency band components corresponding to the voice signal to be processed into the attention model, and obtaining an amplitude correction factor and a phase correction factor corresponding to the plurality of frequency band components.

Step 704, correcting the amplitude component and the phase component corresponding to the plurality of frequency band components according to the amplitude correction factor and the phase correction factor corresponding to the plurality of frequency band components.

Step 705, performing subband synthesis according to the amplitude component and the phase component corresponding to the plurality of corrected frequency band components to obtain a target speech signal.

For steps 701 to 702, reference may be made to the explanations in the foregoing embodiments; the principle is the same and is not repeated in this embodiment.

Step 706, determining a loss function according to the target voice signal and the standard voice signal, and training the attention model according to the loss function.

Wherein the loss function comprises a complex loss function and an amplitude loss function.

The loss function of the embodiment of the disclosure comprises an amplitude loss function and a complex loss function. The complex loss function is determined according to the difference between the real parts, and the difference between the imaginary parts, of the frequency spectrum of the target voice signal and the frequency spectrum of the standard voice signal. It therefore indicates the difference between the two frequency spectra more accurately and imposes a stricter constraint on the voice frequency spectrum, so that the attention model trained with this loss function achieves better performance.

The loss function can be expressed by a formula in which L represents the loss function and is determined from the complex loss function and the amplitude loss function.

In the complex loss function, Sr is the real part and Si is the imaginary part of the frequency spectrum of the target speech signal; they are compared with the real part and the imaginary part, respectively, of the frequency spectrum of the standard speech signal.

The function MSE() is used to calculate the mean square error.
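One form consistent with this description is sketched below. It is an assumption, since the formula itself is not reproduced in this text: the two terms are simply summed with equal weight, the hatted symbols for the standard speech signal are introduced here for illustration, and |S| denotes the amplitude of the spectrum.

```latex
L = L_{\mathrm{complex}} + L_{\mathrm{mag}}, \qquad
L_{\mathrm{complex}} = \mathrm{MSE}\left(S_r, \hat{S}_r\right) + \mathrm{MSE}\left(S_i, \hat{S}_i\right), \qquad
L_{\mathrm{mag}} = \mathrm{MSE}\left(|S|, |\hat{S}|\right)
```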

In the speech processing method of the embodiment of the disclosure, the attention model is trained with a loss function determined from an amplitude loss function and a complex loss function. The complex loss function is determined according to the differences between the real parts and between the imaginary parts of the frequency spectrum of the target speech signal and the frequency spectrum of the standard speech signal. This indicates the difference between the two frequency spectra more accurately and constrains the speech spectrum more strictly, so that the trained attention model achieves better performance.
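The following NumPy sketch shows why a complex loss constrains the spectrum more strictly than an amplitude loss alone. The function names and the equal-weight sum of the two terms are assumptions for illustration; the disclosure states only that the loss function comprises both terms and that MSE() computes the mean square error.

```python
import numpy as np

def mse(a, b):
    """Mean square error between two real-valued arrays."""
    return float(np.mean((a - b) ** 2))

def complex_loss(target_spec, standard_spec):
    """MSE over the real parts plus MSE over the imaginary parts."""
    return (mse(target_spec.real, standard_spec.real)
            + mse(target_spec.imag, standard_spec.imag))

def amplitude_loss(target_spec, standard_spec):
    """MSE over the spectral amplitudes."""
    return mse(np.abs(target_spec), np.abs(standard_spec))

def total_loss(target_spec, standard_spec):
    """Assumed combination: equal-weight sum of the two terms."""
    return (complex_loss(target_spec, standard_spec)
            + amplitude_loss(target_spec, standard_spec))

# Hypothetical spectra with shape (time frames, frequency bins).
rng = np.random.default_rng(0)
standard = rng.standard_normal((5, 8)) + 1j * rng.standard_normal((5, 8))

# A target whose amplitudes are correct but whose phase is rotated by 0.5 rad.
target = standard * np.exp(1j * 0.5)

print(amplitude_loss(target, standard))  # numerically zero: amplitudes match
print(complex_loss(target, standard))    # positive: the phase error is penalized
```

In this example the amplitude loss cannot see the phase error at all, while the complex loss does, which is the stricter constraint the paragraph above refers to.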

In order to implement the foregoing embodiments, the embodiments of the present disclosure further provide a speech processing apparatus.

Fig. 8 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present disclosure.

As shown in fig. 8, the speech processing device 80 may include:

an obtaining module 81, configured to obtain a plurality of frequency band components corresponding to a voice signal;

a determining module 82, configured to determine an amplitude component and a phase component corresponding to the plurality of frequency band components;

a feature extraction module 83, configured to perform feature extraction on the amplitude component and the phase component corresponding to the plurality of frequency band components, so as to obtain an amplitude component feature and a phase component feature corresponding to the plurality of frequency band components;

a processing module 84, configured to input the amplitude component features and the phase component features corresponding to the plurality of frequency band components into the attention model, and output the amplitude correction factors and the phase correction factors corresponding to the plurality of frequency band components;

a correction module 85, configured to correct the amplitude component and the phase component corresponding to the plurality of frequency band components according to the amplitude correction factor and the phase correction factor corresponding to the plurality of frequency band components; and

a synthesis module 86, configured to perform sub-band synthesis according to the amplitude component and the phase component corresponding to the plurality of corrected frequency band components, so as to obtain a target speech signal.

Further, in an implementation manner of the embodiment of the present disclosure, the processing module 84 is specifically configured to:

inputting the amplitude component characteristics and the phase component characteristics corresponding to the plurality of frequency band components into the coding network of the attention model to obtain the coding characteristics corresponding to the plurality of frequency band components;

inputting the coding features corresponding to the plurality of frequency band components into an attention network of the attention model to obtain fusion features corresponding to the plurality of frequency band components; wherein the fusion features comprise feature correlations of the corresponding band components in a time dimension and a frequency dimension;

and inputting the fusion characteristics corresponding to the plurality of frequency band components into a decoding network of the attention model to obtain amplitude correction factors and phase correction factors corresponding to the plurality of frequency band components.

In an implementation manner of the embodiment of the present disclosure, the processing module 84 is further specifically configured to:

inputting the coding features corresponding to the plurality of frequency band components into a residual error module of the attention network to obtain intermediate coding features corresponding to the plurality of frequency band components;

inputting the intermediate coding features corresponding to the plurality of frequency band components into a frequency attention module of the attention network to obtain first weighting coefficients corresponding to the plurality of frequency band components, and weighting the intermediate coding features corresponding to the plurality of frequency band components according to the first weighting coefficients corresponding to the plurality of frequency band components to obtain first coding features corresponding to the plurality of frequency band components weighted in a time dimension; wherein the first weighting coefficient indicates a frequency correlation of the corresponding frequency band component in the time dimension;

inputting the intermediate coding features corresponding to the plurality of frequency band components into a frequency transformation module of the attention network to obtain second weighting coefficients corresponding to the plurality of frequency band components, and weighting the intermediate coding features corresponding to the plurality of frequency band components according to the second weighting coefficients corresponding to the plurality of frequency band components to obtain second coding features corresponding to the plurality of frequency band components weighted in a frequency dimension; wherein the second weighting coefficient indicates a time correlation of the corresponding frequency band component in the frequency dimension;

and obtaining fusion characteristics corresponding to the plurality of frequency band components according to the first coding characteristics, the second coding characteristics and the intermediate coding characteristics corresponding to the plurality of frequency band components.
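The data flow through the attention network described above (residual module, two weighting branches, fusion) can be sketched as follows. The internal structure of each module is not given in this section, so the sketch substitutes simple stand-ins: a tanh residual transform, and softmax-normalized weights along the time axis and the frequency axis. The function names, the weight matrix `w_res`, the feature layout, and the choice of axis for each branch are all assumptions.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_network(coding, w_res):
    """Sketch of the attention network for one frequency band component.

    coding: coding features, shape (time frames, frequency bins)
    w_res:  hypothetical weights of the residual stand-in, shape (bins, bins)
    """
    # Residual module stand-in: the input plus a nonlinear transform of it.
    intermediate = coding + np.tanh(coding @ w_res)

    # First branch: weighting coefficients normalized along the time axis.
    first_weights = softmax(intermediate, axis=0)
    first = first_weights * intermediate

    # Second branch: weighting coefficients normalized along the frequency axis.
    second_weights = softmax(intermediate, axis=1)
    second = second_weights * intermediate

    # Fusion by addition keeps the feature dimension unchanged.
    return first + second + intermediate

rng = np.random.default_rng(0)
coding = rng.standard_normal((10, 8))
w_res = 0.1 * rng.standard_normal((8, 8))
fused = attention_network(coding, w_res)
print(fused.shape)
```

The point of the sketch is only the branching structure: both branches weight the same intermediate coding features, and the fusion feature combines the two weighted results with the unweighted features.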

In an implementation manner of the embodiment of the present disclosure, the obtaining module 81 is configured to:

acquiring a voice signal; and carrying out sub-band decomposition on the voice signal to obtain a plurality of frequency band components corresponding to the voice signal.
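Sub-band decomposition and the matching sub-band synthesis can be illustrated with a minimal sketch. The disclosure does not state which filter bank is used; the sketch below substitutes ideal, non-overlapping FFT masks without downsampling, which is a simplification chosen because it reconstructs the input exactly. The function names and the number of bands are assumptions.

```python
import numpy as np

def subband_decompose(signal, num_bands):
    """Split a real signal into num_bands band-limited time-domain components.

    Stand-in filter bank: the one-sided spectrum is cut into contiguous,
    non-overlapping bin ranges and each range is transformed back on its own.
    """
    spectrum = np.fft.rfft(signal)
    edges = np.linspace(0, spectrum.size, num_bands + 1).astype(int)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        masked = np.zeros_like(spectrum)
        masked[lo:hi] = spectrum[lo:hi]
        bands.append(np.fft.irfft(masked, n=signal.size))
    return np.stack(bands)

def subband_synthesize(bands):
    """Recombine the band components by summation."""
    return bands.sum(axis=0)

rng = np.random.default_rng(0)
signal = rng.standard_normal(1024)          # hypothetical speech frame
bands = subband_decompose(signal, num_bands=4)
restored = subband_synthesize(bands)
print(bands.shape)
print(np.allclose(restored, signal))
```

Because the masks partition the spectrum and the inverse transform is linear, summing the uncorrected band components returns the original signal; in the method, the components would be corrected before synthesis.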

In an implementation manner of the embodiment of the present disclosure, the apparatus further includes a training module, where the training module is specifically configured to:

obtaining a training sample; the training sample comprises a voice signal to be processed and a standard voice signal;

acquiring a plurality of frequency band components corresponding to the voice signal to be processed;

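inputting a plurality of frequency band components corresponding to the voice signal to be processed into an attention model to obtain amplitude correction factors and phase correction factors corresponding to the frequency band components;

correcting the amplitude components and the phase components corresponding to the plurality of frequency band components according to the amplitude correction factors and the phase correction factors corresponding to the plurality of frequency band components;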

performing sub-band synthesis according to the amplitude components and the phase components corresponding to the plurality of corrected frequency band components to obtain a target voice signal;

determining a loss function according to the target voice signal and the standard voice signal, and training the attention model according to the loss function; wherein the loss function comprises a complex loss function and an amplitude loss function.

It should be noted that the foregoing explanation of the method embodiment is also applicable to the apparatus of this embodiment, and is not repeated herein.

In the speech processing apparatus according to the embodiment of the disclosure, the amplitude component features and the phase component features corresponding to a plurality of frequency band components of a speech signal are obtained, which increases the amount of signal information available to the model. Based on the amplitude component features and the phase component features corresponding to each frequency band component, the attention model captures the feature correlations of each frequency band component in time and frequency so as to reinforce the corresponding important regions of the spectrogram, which improves the processing effect of the model. The amplitude correction factor and the phase correction factor obtained for each frequency band component are used to correct the amplitude component and the phase component corresponding to that frequency band component, and the target speech signal is then obtained through sub-band synthesis, which improves the speech enhancement effect.

In order to implement the foregoing embodiments, the present disclosure also proposes an electronic device, which includes a memory, a processor and a computer program stored on the memory and executable on the processor, and when the processor executes the program, the electronic device implements the method according to the foregoing method embodiments.

In order to implement the above embodiments, the present disclosure also proposes a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method as described in the aforementioned method embodiments.

In order to implement the above embodiments, the present disclosure also proposes a computer program product having a computer program stored thereon, which, when being executed by a processor, implements the method as described in the aforementioned method embodiments.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

Fig. 9 is a schematic block diagram of an electronic device provided by an embodiment of the disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 9, the device 900 includes a computing unit 901 that can perform various appropriate actions and processes in accordance with a computer program stored in a ROM (Read-Only Memory) 902 or a computer program loaded from a storage unit 908 into a RAM (Random Access Memory) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An I/O (Input/Output) interface 905 is also connected to the bus 904.

A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

The computing unit 901 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing units running machine learning model algorithms, a DSP (Digital Signal Processor), and any suitable processor, controller, microcontroller, and the like. The computing unit 901 performs the respective methods and processes described above, such as the speech processing method. For example, in some embodiments, the speech processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the speech processing method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the speech processing method by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, FPGAs (Field Programmable Gate Arrays), ASICs (Application-Specific Integrated Circuits), ASSPs (Application-Specific Standard Products), SOCs (Systems On a Chip), CPLDs (Complex Programmable Logic Devices), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an EPROM (Erasable Programmable Read-Only Memory) or flash memory, an optical fiber, a CD-ROM (Compact Disc Read-Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (Cathode Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, and a blockchain network.

The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the high management difficulty and weak service scalability of traditional physical hosts and VPS (Virtual Private Server) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.

It should be noted that artificial intelligence is the discipline that studies how to make computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it includes both hardware technologies and software technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technology, and the like.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A method of speech processing comprising:

and performing sub-band synthesis according to the corrected amplitude components and phase components corresponding to the plurality of frequency band components to obtain the target voice signal.

2. The method of claim 1, wherein the inputting amplitude component characteristics and phase component characteristics corresponding to the plurality of frequency band components into an attention model and outputting amplitude correction factors and phase correction factors corresponding to the plurality of frequency band components comprises:

3. The method of claim 2, wherein the inputting the coding features corresponding to the plurality of frequency band components into the attention network of the attention model to obtain the fused features corresponding to the plurality of frequency band components comprises:

4. The method of claim 1, wherein the obtaining the plurality of frequency band components corresponding to the speech signal comprises:

acquiring a voice signal;

and carrying out sub-band decomposition on the voice signal to obtain a plurality of frequency band components corresponding to the voice signal.

5. The method according to any one of claims 1-4, wherein before inputting the amplitude component characteristic and the phase component characteristic corresponding to the plurality of frequency band components into the attention model and outputting the amplitude correction factor and the phase correction factor corresponding to the plurality of frequency band components, the method further comprises:

6. A speech processing apparatus comprising:

7. The apparatus of claim 6, wherein the processing module is specifically configured to:

8. The apparatus of claim 7, wherein the processing module is further specifically configured to:

9. The apparatus of claim 6, wherein the obtaining module is specifically configured to:

acquiring a voice signal;

10. The apparatus of any of claims 6-9, wherein the apparatus further comprises: a training module;

the training module is used for obtaining a training sample; the training sample comprises a voice signal to be processed and a standard voice signal; acquiring a plurality of frequency band components corresponding to the voice signal to be processed; inputting a plurality of frequency band components corresponding to the voice signal to be processed into an attention model to obtain amplitude correction factors and phase correction factors corresponding to the frequency band components; correcting the amplitude components and the phase components corresponding to the plurality of frequency band components according to the amplitude correction factors and the phase correction factors corresponding to the plurality of frequency band components; performing sub-band synthesis according to the amplitude components and the phase components corresponding to the plurality of corrected frequency band components to obtain a target voice signal; determining a loss function according to the target voice signal and the standard voice signal, and training the attention model according to the loss function; wherein the loss function comprises a complex loss function and an amplitude loss function.

11. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.

12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-5.

13. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-5.