CN112309426A - Voice processing model training method and device and voice processing method and device

Voice processing model training method and device and voice processing method and device

Info

Publication number
CN112309426A
CN112309426A
Authority
CN
China
Prior art keywords
signal
amplitude mask
speech processing
ideal amplitude
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011330109.3A
Other languages
Chinese (zh)
Inventor
郑羲光
李楠
任新蕾
张晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202011330109.3A
Publication of CN112309426A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0316 - Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L 21/0324 - Details of processing therefor
    • G10L 21/034 - Automatic adjustment
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/21 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being power information

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The disclosure provides a training method and apparatus for a voice processing model, and a voice processing method and apparatus. The training method comprises the following steps: generating a mixed signal and a target signal based on at least one of a voice signal, a noise signal, and a specific signal; inputting the mixed signal into a speech processing model to obtain estimation data; determining a loss function based on the target signal and the estimation data; and training the speech processing model based on the loss function to adjust parameters of the speech processing model.

Description

Voice processing model training method and device and voice processing method and device
Technical Field
The present disclosure relates to the field of audio technologies, and in particular, to a method and an apparatus for training a speech processing model, and a method and an apparatus for speech processing.
Background
With the rapid development of electronic technology and network technology, electronic devices can process audio signals in a time-frequency domain based on a neural network speech processing algorithm.
Although neural network-based speech enhancement and noise reduction have surpassed conventional signal processing methods in performance and can run efficiently on electronic devices, speech enhancement (increasing the speech component while leaving non-speech components unchanged) and speech noise reduction (reducing non-speech components while leaving the speech component unchanged) are usually handled by training two separate neural networks. Moreover, when both networks are used for speech processing, one type of signal is always amplified or attenuated while the other type is kept unchanged.
Disclosure of Invention
The present disclosure provides a training method and apparatus for a speech processing model, and a speech processing method and apparatus, so as to at least solve the problem that speech enhancement and denoising cannot be accomplished simultaneously with a single neural network.
According to a first aspect of embodiments of the present disclosure, there is provided a method for training a speech processing model, which may include: generating a mixed signal and a target signal based on at least one of a voice signal, a noise signal, and a specific signal; inputting the mixed signal into a speech processing model to obtain estimation data; determining a loss function based on the target signal and the estimation data; training the speech processing model based on the loss function to adjust parameters of the speech processing model.
Alternatively, the generating of the mixed signal based on at least one of the voice signal, the noise signal, and the specific signal may include: multiplying the specific signal by a first gain to obtain a first signal and multiplying the noise signal by a second gain to obtain a second signal; the mixed signal is generated by mixing the first signal, the second signal and the speech signal.
Alternatively, the first gain may be determined based on a first predetermined signal-to-noise ratio, and the second gain may be determined based on a second signal-to-noise ratio and the first gain.
Alternatively, the step of generating the target signal based on at least one of the voice signal, the noise signal, and the specific signal may include: multiplying the speech signal by a third gain to obtain a third signal; the target signal is generated by mixing the third signal and the second signal.
Alternatively, the estimation data may be an estimated target signal or an estimated ideal amplitude mask, wherein the ideal amplitude mask is related to the signal energy.
Optionally, in case the estimation data is an estimated ideal amplitude mask, the step of determining a loss function based on the target signal and the estimation data may comprise: calculating a target ideal amplitude mask based on the target signal and the mixed signal; determining a loss function based on the target ideal amplitude mask and the estimation data.
Alternatively, the target ideal amplitude mask may be an amplitude ratio of the target signal to the mixed signal in a time-frequency domain.
According to a second aspect of the embodiments of the present disclosure, there is provided a speech processing method, which may include: acquiring an audio signal, wherein the audio signal comprises at least one of a speech signal, a noise signal and a specific signal, and the specific signal belongs to an audio type which needs neither enhancement nor suppression; obtaining an ideal amplitude mask using a speech processing model based on the audio signal; and performing different processing on the audio signal according to the size of the ideal amplitude mask to obtain a desired signal.
Alternatively, the speech processing model may be obtained by training according to the above-mentioned training method.
Optionally, the step of performing different processing on the audio signal to obtain the desired signal according to the size of the ideal amplitude mask may include: determining whether to obtain the desired signal based on an estimated signal resulting from multiplying the audio signal by the ideal amplitude mask by comparing the ideal amplitude mask to a predetermined threshold.
Optionally, the step of determining whether to obtain the desired signal based on an estimated signal resulting from multiplying the audio signal by the ideal amplitude mask may comprise: multiplying the estimated signal by a user-defined gain to obtain the desired signal if the ideal amplitude mask is greater than the predetermined threshold; otherwise, the audio signal is taken as the desired signal.
Optionally, the step of determining whether to obtain the desired signal based on an estimated signal resulting from multiplying the audio signal by the ideal amplitude mask may comprise: if the ideal amplitude mask is less than the predetermined threshold, treating the estimated signal as the desired signal; otherwise, the audio signal is taken as the desired signal.
Optionally, the step of determining whether to obtain the desired signal based on an estimated signal resulting from multiplying the audio signal by the ideal amplitude mask may comprise: multiplying the estimated signal by a user-defined gain to obtain the desired signal if the ideal amplitude mask is greater than the predetermined threshold; if the ideal amplitude mask is less than the predetermined threshold, treating the estimated signal as the desired signal; otherwise, the audio signal is taken as the desired signal.
Optionally, the output of the speech processing model is the ideal amplitude mask or the estimated target signal, wherein in case the output of the speech processing model is the estimated target signal, the step of obtaining the ideal amplitude mask may comprise: obtaining an estimated target signal by applying the audio signal to a speech processing model; obtaining the ideal amplitude mask based on the estimated target signal and the audio signal.
According to a third aspect of the embodiments of the present disclosure, there is provided an apparatus for training a speech processing model, the apparatus may include: a data generation module configured to: generate a mixed signal and a target signal based on at least one of a voice signal, a noise signal, and a specific signal; and a data training module configured to: input the mixed signal into a speech processing model to obtain estimation data; determine a loss function based on the target signal and the estimation data; and train the speech processing model based on the loss function to adjust parameters of the speech processing model.
Optionally, the data generation module may be configured to: multiplying the specific signal by a first gain to obtain a first signal and multiplying the noise signal by a second gain to obtain a second signal; and generating the mixed signal by mixing the first signal, the second signal, and the voice signal.
Alternatively, the first gain may be determined based on a first predetermined signal-to-noise ratio, and the second gain may be determined based on a second signal-to-noise ratio and the first gain.
Optionally, the data generation module may be configured to: multiplying the speech signal by a third gain to obtain a third signal; the target signal is generated by mixing the third signal and the second signal.
Alternatively, the estimation data may be an estimated target signal or an estimated ideal amplitude mask, wherein the ideal amplitude mask may be related to the signal energy.
Optionally, in case the estimated data is an estimated ideal amplitude mask, the data training module may be configured to: calculating a target ideal amplitude mask based on the target signal and the mixed signal; determining a loss function based on the target ideal amplitude mask and the estimation data.
Alternatively, the target ideal amplitude mask may be an amplitude ratio of the target signal to the mixed signal in a time-frequency domain.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a speech processing apparatus, which may include: a data acquisition module configured to acquire an audio signal, wherein the audio signal includes at least one of a speech signal, a noise signal, and a specific signal, the specific signal being of an audio type that needs neither enhancement nor suppression; and a data processing module configured to: obtain an ideal amplitude mask using a voice processing model based on the audio signal; and perform different processing on the audio signal according to the size of the ideal amplitude mask to obtain a desired signal.
Optionally, the data processing module may be configured to: determining whether to obtain the desired signal based on an estimated signal resulting from multiplying the audio signal by the ideal amplitude mask by comparing the ideal amplitude mask to a predetermined threshold.
Optionally, the data processing module may be configured to: multiplying the estimated signal by a user-defined gain to obtain the desired signal if the ideal amplitude mask is greater than the predetermined threshold; otherwise, the audio signal is taken as the desired signal.
Optionally, the data processing module may be configured to: if the ideal amplitude mask is less than the predetermined threshold, treating the estimated signal as the desired signal; otherwise, the audio signal is taken as the desired signal.
Optionally, the data processing module may be configured to: multiplying the estimated signal by a user-defined gain to obtain the desired signal if the ideal amplitude mask is greater than the predetermined threshold; if the ideal amplitude mask is less than the predetermined threshold, treating the estimated signal as the desired signal; otherwise, the audio signal is taken as the desired signal.
Optionally, the output of the speech processing model may be the ideal amplitude mask or the estimated target signal, wherein in case the output of the speech processing model is the estimated target signal, the data processing module may be configured to: obtaining an estimated target signal by applying the audio signal to a speech processing model; obtaining the ideal amplitude mask based on the estimated target signal and the audio signal.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus, which may include: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the speech processing method and the model training method as described above.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the speech processing method and the model training method as described above.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product, instructions of which are executed by at least one processor in an electronic device to perform the speech processing method and the model training method as described above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the method integrates speech enhancement and denoising into one deep neural network for training, and, through post-processing based on the ideal amplitude mask (IRM), can perform speech enhancement and denoising either separately or simultaneously. Furthermore, at the time of model design the training targets are classified into three categories, namely speech (requiring enhancement), noise (requiring suppression), and other audio such as music (requiring neither enhancement nor suppression). A speech processing model trained with such data differs from a single-purpose speech enhancement or speech noise reduction model, so the model is better suited to practical use and speech processing can be performed more efficiently.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow chart of a method of speech processing according to an embodiment of the present disclosure;
FIG. 2 is a flow diagram of a method of training a speech processing model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of training a speech processing model according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of training a speech processing model according to another embodiment of the present disclosure;
FIG. 5 is a flow chart diagram of a method of speech processing according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of a model training apparatus according to an embodiment of the present disclosure;
fig. 8 is a block diagram of an electronic device according to an embodiment of the disclosure.
Throughout the drawings, it should be noted that the same reference numerals are used to designate the same or similar elements, features and structures.
Detailed Description
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of the embodiments of the disclosure as defined by the claims and their equivalents. Various specific details are included to aid understanding, but these are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In the present disclosure, the expression "at least one of the items" covers three parallel cases: "any one of the items", "a combination of any plural ones of the items", and "all of the items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. For another example, "performing at least one of step one and step two" covers the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
In related speech enhancement and denoising applications, two neural networks are used independently, one for speech enhancement and one for noise cancellation, which doubles the complexity and is unfavorable for application and use on electronic devices. Therefore, the present disclosure proposes a method for performing speech enhancement and denoising simultaneously with a single neural network, i.e., ensuring both noise suppression and speech enhancement.
In addition, a new class, namely audio types such as music and the like which are not expected to be amplified or weakened, is introduced in the model design, so that the method is more suitable for practical application requirements. Therefore, noise suppression, speech enhancement and other types of sound magnitude invariance can be simultaneously ensured through the speech processing model of the present disclosure.
Hereinafter, according to various embodiments of the present disclosure, a method, an apparatus, and a system of the present disclosure will be described in detail with reference to the accompanying drawings.
FIG. 1 is a flow chart of a method of speech processing according to an embodiment of the present disclosure. The speech processing method illustrated in fig. 1 may be executed on a network side connected to the electronic device or locally on the electronic device.
The electronic device may be any electronic device having functions of voice/text reception, voice processing, and command execution. In an example embodiment of the present disclosure, the electronic device may include, for example, but not limited to, a portable communication device (e.g., a smartphone), a computer device, a portable multimedia device, a portable medical device, a camera, a wearable device, or a server, among others. According to the embodiments of the present disclosure, the electronic device is not limited to the above.
Referring to fig. 1, in step S101, an audio signal is acquired. Since a new audio class, i.e., a type that is neither desired to be amplified nor desired to be attenuated (such as a music signal), is introduced during the model training phase of the present disclosure, the speech processing model of the present disclosure is not used purely for speech enhancement and speech noise reduction. Thus, the present disclosure may perform speech processing on multiple types of signals. For example, the audio signal may include at least one of a voice signal, a noise signal, and a specific signal. Here, the specific signal is of an audio type that needs neither enhancement nor suppression; for example, the specific signal is a music signal. However, the above examples are merely exemplary, and the present disclosure is not limited thereto.
In step S102, an ideal amplitude mask (IRM) is obtained using a speech processing model based on the acquired audio signal. How the speech processing model is obtained will be described in detail below with reference to fig. 2. FIG. 2 is a flow diagram of a method of training a speech processing model according to an embodiment of the present disclosure. The model training method provided in the embodiments of the present disclosure may be executed by the model training apparatus provided in the embodiments of the present disclosure, or by an electronic device including the model training apparatus. This can be determined according to actual use requirements, and the embodiments of the present disclosure are not limited in this respect.
Referring to fig. 2, in step S201, a mixed signal and a target signal are generated based on at least one of a voice signal, a noise signal, and a specific signal, where the specific signal is of an audio type that needs neither enhancement nor suppression. For example, the specific signal may be a music signal. According to an embodiment of the present disclosure, in the process of generating the mixed signal and the target signal, other types of signals may be included in addition to the above-listed signals; that is, the training data is not limited to the above three categories and may include more types of audio signals.
As an example, the mixed signal may include three data sources: a speech signal S(t), a specific signal M(t), and a noise signal N(t), where t represents time. The speech signal S(t) may refer to a signal that requires enhancement, the specific signal M(t) may refer to a type of audio that requires neither enhancement nor suppression, and the noise signal N(t) may refer to a signal that requires suppression.
In generating the mixed signal, the specific signal M(t) may be multiplied by a first gain to obtain a first signal, and the noise signal N(t) may be multiplied by a second gain to obtain a second signal; the mixed signal may then be generated by mixing the first signal, the second signal, and the speech signal S(t). For example, the mixed signal may be represented by the following equation (1):
Mix(t) = S(t) + M(t)*g_SNR1 + N(t)*g_SNR2    (1)
where Mix(t) is the mixed signal, g_SNR1 is the first gain, and g_SNR2 is the second gain.
In generating the target signal, the speech signal S(t) may be multiplied by a third gain to obtain a third signal, and then the target signal may be generated by mixing the third signal and the second signal. For example, the target signal may be represented by the following equation (2):
Tar(t) = S(t)*g_tar + N(t)*g_SNR2    (2)
where Tar(t) is the target signal and g_tar is the third gain. Here, the third gain may be a target voice amplification gain.
According to the embodiment of the disclosure, the first gain, the second gain and the third gain can be determined according to the preset signal-to-noise ratio, so that the generated mixed signal and the target signal are more consistent with the actual situation, and the trained voice processing model is more accurate. The third gain may be adjusted by a user according to actual needs, or may be a predetermined value, to which the present disclosure is not limited.
As an example, a first gain may be determined based on a first predetermined signal-to-noise ratio, and a second gain may be determined based on a second signal-to-noise ratio and the first gain. For example, the first gain and the second gain may be determined using the following equations (3) and (4):
[Equations (3) and (4) appear as images in the original publication: equation (3) determines the first gain g_SNR1 from the first predetermined signal-to-noise ratio target SNR1, and equation (4) determines the second gain g_SNR2 from the second signal-to-noise ratio target SNR2 and the first gain.]
where target SNR1 is the first predetermined signal-to-noise ratio and target SNR2 is the second signal-to-noise ratio. target SNR1 represents the energy ratio between the speech signal and the specific signal, and target SNR2 represents the energy ratio of the speech signal plus the specific signal to the noise signal. The above examples are merely illustrative, and the present disclosure is not limited thereto. Alternatively, different signal-to-noise ratios can be set according to actual requirements.
Further, in generating the mixed signal and the target signal, if the training data includes other types of audio signals in addition to the above-described voice signal, noise signal, and specific signal, it is possible to distinguish by applying different target gains to each type of signal and satisfy the signal-to-noise ratio of the actual demand.
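As a non-authoritative illustration of the data generation described above, the following Python sketch builds one (Mix, Tar) training pair. The energy-based gain formulas are an assumption consistent with the stated meanings of target SNR1 and target SNR2, not the patent's exact equations (3) and (4); the function and parameter names are hypothetical.

```python
import numpy as np

def mix_and_target(speech, music, noise, target_snr1_db=5.0, target_snr2_db=10.0, speech_gain=2.0):
    """Build one (Mix, Tar) training pair from speech, a specific signal (music) and noise."""
    def energy(x):
        return float(np.mean(x ** 2)) + 1e-12

    # Assumed form of equation (3): scale the specific signal so that the
    # speech-to-specific energy ratio equals target SNR1 (in dB).
    g_snr1 = np.sqrt(energy(speech) / (energy(music) * 10.0 ** (target_snr1_db / 10.0)))
    first = music * g_snr1

    # Assumed form of equation (4): scale the noise so that the
    # (speech + scaled specific)-to-noise energy ratio equals target SNR2.
    g_snr2 = np.sqrt(energy(speech + first) / (energy(noise) * 10.0 ** (target_snr2_db / 10.0)))
    second = noise * g_snr2

    mix = speech + first + second            # equation (1)
    target = speech * speech_gain + second   # equation (2): third signal mixed with the second signal
    return mix, target

# Usage with random placeholder waveforms of equal length.
rng = np.random.default_rng(0)
s, m, n = (rng.standard_normal(16000) for _ in range(3))
mix, tar = mix_and_target(s, m, n)
```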
In step S202, the mixed signal is input to a speech processing model to obtain estimation data. Here, the speech processing model may be obtained by training a deep neural network.
According to embodiments of the present disclosure, different speech processing models may be obtained from different training data. Here, the estimation data may be an estimated target signal or an estimated ideal amplitude mask.
In step S203, a loss function is determined based on the target signal and the estimation data. As an example, in case the estimation data is an estimated ideal amplitude mask, the target ideal amplitude mask may first be calculated based on the target signal and the mixed signal, and then the loss function may be determined based on the target ideal amplitude mask and the estimation data.
In step S204, the speech processing model is trained based on the loss function to adjust parameters of the speech processing model. The training process of the speech processing model in the case where the output of the speech processing model is the estimated target signal will be described in detail below with reference to fig. 3, and the training process of the speech processing model in the case where the output of the speech processing model is the estimated ideal amplitude mask will be described in detail with reference to fig. 4.
In the case where the output of the speech processing model is an estimated target signal, the speech processing model may be trained with reference to FIG. 3. FIG. 3 is a schematic diagram of training a speech processing model according to an embodiment of the present disclosure.
Referring to fig. 3, the mixed signal Mix(t) and the target signal Tar(t) are respectively transformed to the time-frequency domain by the short-time Fourier transform (STFT) to obtain a time-frequency-domain mixed signal Mix(n, k) and target signal Tar(n, k). For example, if the target signal and the mixed signal of length T are Tar(t) and Mix(t) in the time domain, where t represents time and 0 < t ≤ T, then after the STFT, Tar(t) and Mix(t) can be expressed as:
Tar(n,k)=STFT(Tar(t)) (5)
Mix(n,k)=STFT(Mix(t)) (6)
where n is the frame index, 0 < n ≤ N, and N is the total number of frames; k is the frequency-bin (center frequency) index, 0 < k ≤ K, and K is the total number of frequency bins.
Next, the mixed signal Mix(n, k) in the time-frequency domain is input to a deep neural network DNN, which outputs an estimated target signal Tar_est(n, k). A loss function is then constructed based on the target signal Tar(n, k) and the estimated target signal Tar_est(n, k), and the deep neural network DNN is iteratively optimized on this loss function until convergence, completing the training stage and yielding the speech processing model. However, the above way of constructing the loss function is merely exemplary, and the present disclosure is not limited thereto.
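For illustration only, the sketch below shows one way such a training step could look in PyTorch. The network architecture, the mean-squared-error loss, and the STFT settings are assumptions made for the example; the patent only specifies that a loss is built from Tar(n, k) and Tar_est(n, k).

```python
import torch
import torch.nn as nn

N_FFT, HOP = 512, 128
window = torch.hann_window(N_FFT)

def spectrum(wave):
    # STFT magnitude on the time-frequency grid; rows are frames n, columns frequency bins k.
    spec = torch.stft(wave, n_fft=N_FFT, hop_length=HOP, window=window, return_complex=True)
    return spec.abs().transpose(-1, -2)

# A small stand-in DNN; the patent does not prescribe a particular architecture.
dnn = nn.Sequential(nn.Linear(N_FFT // 2 + 1, 256), nn.ReLU(),
                    nn.Linear(256, N_FFT // 2 + 1), nn.ReLU())
optimizer = torch.optim.Adam(dnn.parameters(), lr=1e-3)
mse = nn.MSELoss()

def train_step(mix_wave, tar_wave):
    mix_mag = spectrum(mix_wave)      # Mix(n, k)
    tar_mag = spectrum(tar_wave)      # Tar(n, k)
    tar_est = dnn(mix_mag)            # Tar_est(n, k)
    loss = mse(tar_est, tar_mag)      # one possible loss between target and estimate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```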
After inputting the audio signal into the speech processing model trained as shown in fig. 3, an estimated target signal can be obtained.
In the case where the output of the speech processing model is an estimated ideal amplitude mask, the speech processing model may be trained with reference to FIG. 4. FIG. 4 is a schematic diagram of training a speech processing model according to another embodiment of the present disclosure.
Referring to fig. 4, the mixed signal Mix(t) and the target signal Tar(t) are respectively transformed to the time-frequency domain by the short-time Fourier transform (STFT) to obtain a time-frequency-domain mixed signal Mix(n, k) and target signal Tar(n, k). For example, if the target signal and the mixed signal of length T are Tar(t) and Mix(t) in the time domain, where t represents time and 0 < t ≤ T, then after the STFT, Tar(t) and Mix(t) can be expressed in the time-frequency domain as equation (5) and equation (6):
Tar(n,k)=STFT(Tar(t)) (5)
Mix(n,k)=STFT(Mix(t)) (6)
where n is the frame index, 0 < n ≤ N, and N is the total number of frames; k is the frequency-bin (center frequency) index, 0 < k ≤ K, and K is the total number of frequency bins.
The target ideal amplitude mask is calculated based on the mixed signal Mix (n, k) and the target signal Tar (n, k). For example, the target ideal amplitude mask may be calculated using equation (7) below:
IRM_obj(n, k) = |Tar(n, k)| / |Mix(n, k)|    (7)
As can be seen from the above equation (7), the target ideal amplitude mask is the amplitude ratio of the target signal to the mixed signal in the time-frequency domain, and it can be greater than 1 where the target amplitude exceeds the mixed amplitude.
Next, the mixed signal Mix(n, k) in the time-frequency domain is input into a deep neural network DNN, which outputs an estimated ideal amplitude mask IRM_est(n, k). A loss function is then constructed based on the target ideal amplitude mask IRM_obj(n, k) and the estimated ideal amplitude mask IRM_est(n, k), and the deep neural network DNN is trained by optimizing this loss to adjust the network parameters, thereby obtaining the speech processing model. However, the above example of constructing the loss function is merely exemplary, and the present disclosure is not limited thereto.
After the audio signal is input to the speech processing model trained as shown in fig. 4, an estimated ideal amplitude mask can be obtained.
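Analogously, a hedged sketch of one training step for this mask-output variant is given below, reusing spectrum, dnn, mse, and optimizer from the previous sketch; the mean-squared-error loss is again an assumption made for illustration.

```python
def train_step_irm(mix_wave, tar_wave, eps=1e-8):
    mix_mag = spectrum(mix_wave)             # Mix(n, k)
    tar_mag = spectrum(tar_wave)             # Tar(n, k)
    irm_obj = tar_mag / (mix_mag + eps)      # equation (7): target ideal amplitude mask
    irm_est = dnn(mix_mag)                   # IRM_est(n, k) from the network
    loss = mse(irm_est, irm_obj)             # loss between IRM_obj and IRM_est
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```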
Referring back to fig. 1, in step S102, in the case where the output of the speech processing model is an estimated target signal, the estimated target signal may be obtained by applying the obtained audio signal to the speech processing model, and then an ideal amplitude mask is obtained based on the estimated target signal and the audio signal. For example, when the estimated target signal is output from the speech processing model, the ideal amplitude mask can be calculated using equation (8) below:
IRM(n, k) = |Tar_est(n, k)| / |Aud(n, k)|    (8)
where Tar_est(n, k) is the estimated target signal output from the speech processing model, and Aud(n, k) is the time-frequency-domain signal obtained by applying a short-time Fourier transform to the acquired audio signal.
In step S103, the audio signal is processed differently according to the size of the obtained ideal amplitude mask to obtain a desired signal. Here, the desired signal may be a signal subjected to speech enhancement, a signal subjected to noise reduction, or a signal subjected to both speech enhancement and noise reduction. Whether the desired signal is obtained from an estimated signal, i.e., the product of the obtained audio signal and the ideal amplitude mask, is determined by comparing the ideal amplitude mask with a predetermined threshold.
As an example, if the ideal amplitude mask is larger than the predetermined threshold, the estimated signal is multiplied by a gain defined by the user to obtain the desired signal, otherwise the obtained audio signal is taken as the desired signal. For example, the desired signal may be obtained according to equation (9) below:
Est(n, k) = g_user * IRM(n, k) * Aud(n, k), if IRM(n, k) > threshold; Est(n, k) = Aud(n, k), otherwise    (9)
where Est(n, k) represents the desired signal, Aud(n, k) represents the audio signal after the short-time Fourier transform, and g_user is an adjustable, user-defined gain. Here, the predetermined threshold may be 1, or an arbitrary value set by the user.
After obtaining the desired signal Est(n, k) in the time-frequency domain, the desired signal Est(t) in the time domain is obtained by an inverse short-time Fourier transform.
Through the above-described processing, the speech portion in the obtained audio signal can be further enhanced, and the gain of the speech portion desired to be enhanced can be arbitrarily adjusted according to the user's needs.
As another example, the estimated signal is taken as the desired signal if the ideal amplitude mask is smaller than a predetermined threshold, and the obtained audio signal is taken as the desired signal otherwise. For example, the desired signal may be obtained according to equation (10) below:
Est(n, k) = IRM(n, k) * Aud(n, k), if IRM(n, k) < threshold; Est(n, k) = Aud(n, k), otherwise    (10)
where Est (n, k) represents the desired signal and Aud (n, k) represents the audio signal. Here, the preset threshold may be 1, or may be an arbitrary value set by the user.
By the processing, the denoising effect of the obtained audio signal can be realized.
As another example, if the ideal amplitude mask is greater than a predetermined threshold, the estimated signal is multiplied by a gain defined by the user to obtain the desired signal, if the ideal amplitude mask is less than the predetermined threshold, the estimated signal is treated as the desired signal, otherwise the obtained audio signal is treated as the desired signal. For example, the desired signal may be obtained according to equation (11) below:
Est(n, k) = g_user * IRM(n, k) * Aud(n, k), if IRM(n, k) > threshold; Est(n, k) = IRM(n, k) * Aud(n, k), if IRM(n, k) < threshold; Est(n, k) = Aud(n, k), otherwise    (11)
where Est(n, k) represents the desired signal, Aud(n, k) represents the audio signal, and g_user is an adjustable, user-defined additional gain. Here, the predetermined threshold may be 1 or an arbitrary value set by the user.
Through the above processing, the speech portion of the obtained audio signal can be further enhanced, the gain of the speech portion to be enhanced can be adjusted arbitrarily according to the user's needs, and noise reduction can be performed on the audio signal at the same time.
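The three post-processing rules of equations (9) to (11) can be summarized in a single thresholding routine, sketched below. The function name, its mode argument, and the default gain and threshold values are hypothetical; the threshold of 1 is taken from the text as one possible choice.

```python
import numpy as np

def post_process(aud_spec, irm, mode="both", user_gain=1.5, threshold=1.0):
    """Threshold-based post-processing following equations (9)-(11).

    aud_spec: complex STFT Aud(n, k); irm: ideal amplitude mask IRM(n, k), same shape.
    mode: "enhance" follows equation (9), "denoise" equation (10), "both" equation (11).
    """
    aud_spec = np.asarray(aud_spec)
    irm = np.asarray(irm)
    est = aud_spec.astype(complex).copy()   # default ("otherwise") branch: keep the audio signal
    above = irm > threshold
    below = irm < threshold
    if mode in ("enhance", "both"):
        # Speech-dominated cells: apply the mask plus the user-defined gain (equations (9)/(11)).
        est[above] = user_gain * irm[above] * aud_spec[above]
    if mode in ("denoise", "both"):
        # Noise-dominated cells: apply the mask only, suppressing the noise (equations (10)/(11)).
        est[below] = irm[below] * aud_spec[below]
    return est
```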
In the above-described embodiment, in the model training stage and the speech processing stage, the obtained signal in the time domain may be first converted into a signal in the time-frequency domain via short-time fourier transform, then model training and speech processing are performed, and finally the finally obtained signal in the time-frequency domain may be converted into a signal in the time domain via short-time inverse fourier transform.
FIG. 5 is a flow chart diagram of a speech processing method according to an embodiment of the present disclosure. In this embodiment, it is assumed that the output of the speech processing model is the estimated target signal.
Referring to fig. 5, the obtained audio signal Aud (t) is transformed into a signal Aud (n, k) on a time-frequency domain by a short-time fourier transform STFT, and then the signal Aud (n, k) is input to a trained speech processing model.
After the speech processing model outputs an estimated target signal Tar_est(n, k), an ideal amplitude mask IRM(n, k) may be calculated using equation (8), and how to post-process the acquired audio signal is then determined based on a comparison of the calculated ideal amplitude mask with a predetermined threshold.
The preset threshold may be set to 1, the desired signal Est (n, k) in the time-frequency domain is obtained using equation (9), equation (10), or equation (11), and then the signal in the time domain is obtained by performing a short-time inverse fourier transform ISTFT on the desired signal Est (n, k) in the time-frequency domain.
Furthermore, in the case where the output of the speech processing model is an estimated ideal amplitude mask, the step in fig. 5 of converting the estimated target signal into a mask may be omitted: the estimated ideal amplitude mask is obtained directly from the speech processing model, and different post-processing operations are then performed based on the comparison of the ideal amplitude mask with the preset threshold.
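Putting the pieces together, a minimal end-to-end sketch of the flow in fig. 5 might look as follows. It assumes a trained model that maps the magnitude spectrogram to an estimated target spectrogram (the fig. 3 variant) and reuses post_process from the earlier sketch; all names and STFT settings are illustrative, and a model that outputs the mask directly would skip the division implementing equation (8).

```python
import numpy as np
from scipy.signal import stft, istft

def process_audio(aud_wave, model, fs=16000, n_fft=512, user_gain=1.5, threshold=1.0):
    """Fig. 5 flow: STFT -> speech processing model -> IRM -> post-processing -> ISTFT."""
    _, _, aud_spec = stft(aud_wave, fs=fs, nperseg=n_fft)   # Aud(n, k), complex spectrogram
    tar_est_mag = model(np.abs(aud_spec))                   # |Tar_est(n, k)| from the trained model
    irm = tar_est_mag / (np.abs(aud_spec) + 1e-8)           # equation (8)
    est_spec = post_process(aud_spec, irm, mode="both",
                            user_gain=user_gain, threshold=threshold)
    _, est_wave = istft(est_spec, fs=fs, nperseg=n_fft)     # desired signal back in the time domain
    return est_wave
```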
Fig. 6 is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure. Referring to fig. 6, the speech processing apparatus 600 may include a data acquisition module 601, a data processing module 602, and a model training module 603. Each module in the voice processing apparatus 600 may be implemented by one or more modules, and names of the corresponding modules may vary according to types of the modules. In various embodiments, some modules in the speech processing apparatus 600 may be omitted, or additional modules may also be included. Furthermore, modules/elements according to various embodiments of the present disclosure may be combined to form a single entity, and thus may equivalently perform the functions of the respective modules/elements prior to combination.
The data acquisition module 601 may acquire an audio signal, wherein the audio signal may include at least one of a speech signal, a noise signal, and a specific signal, the specific signal being of an audio type that needs neither enhancement nor suppression. Since a new audio class, i.e., a type that is neither desired to be amplified nor desired to be attenuated (such as a music signal), is introduced in the model training stage of the present disclosure, the speech processing model of the present disclosure is not used purely for speech enhancement and speech noise reduction, and such a design is better suited to practical applications. Thus, the present disclosure may perform speech processing on multiple types of signals.
The data processing module 602 may obtain an ideal amplitude mask using a speech processing model based on the obtained audio signal, and then perform different processing on the audio signal according to the size of the ideal amplitude mask to obtain a desired signal.
As an example, the data processing module 602 may determine whether to obtain the desired signal based on an estimated signal resulting from multiplying the audio signal by the ideal amplitude mask by comparing the ideal amplitude mask to a predetermined threshold.
For example, if the ideal amplitude mask is greater than the predetermined threshold, the data processing module 602 may multiply the estimated signal resulting from multiplying the audio signal and the ideal amplitude mask by a user-defined gain to obtain the desired signal; otherwise the audio signal may be taken as the desired signal. Here, the preset threshold may be set to 1, or an arbitrary value set by the user. The voice post-processing operation can be performed with reference to equation (9).
For another example, if the ideal amplitude mask is less than the predetermined threshold, the data processing module 602 may treat the estimated signal as the desired signal; otherwise the audio signal may be taken as the desired signal. The voice post-processing operation can be performed with reference to equation (10).
For another example, if the ideal amplitude mask is greater than the predetermined threshold, the data processing module 602 may multiply the estimated signal by a user-defined gain to obtain the desired signal. If the ideal amplitude mask is less than the predetermined threshold, the data processing module 602 may treat the estimated signal as the desired signal; otherwise the data processing module 602 may treat the audio signal as the desired signal. The voice post-processing operation can be performed with reference to equation (11).
Different speech processing models can be trained due to different training data. In the present disclosure, the output of the speech processing model may be an ideal amplitude mask or an estimated target signal.
In the case where the output of the speech processing model is an estimated target signal, the data processing module 602 may obtain the estimated target signal by applying the obtained audio signal to the speech processing model and then obtain an ideal amplitude mask based on the estimated target signal and the audio signal. And performing voice post-processing operation according to the obtained ideal amplitude mask.
Optionally, the speech processing apparatus may further comprise a model training module 603. The model training module 603 may train the speech processing model based on the following methods: generating a mixed signal and a target signal based on at least one of the speech signal, the noise signal and the specific signal, inputting the mixed signal into a speech processing model to obtain estimation data; determining a loss function based on the target signal and the estimation data; the speech processing model is trained based on the loss function to adjust parameters of the speech processing model.
Alternatively, the model training module 603 may multiply a specific signal by a first gain to obtain a first signal and multiply a noise signal by a second gain to obtain a second signal, and generate a mixed signal by mixing the first signal, the second signal, and the speech signal. For example, the mixed signal may be generated using equation (1).
Alternatively, the model training module 603 may multiply the speech signal by a third gain to obtain a third signal, and generate the target signal by mixing the third signal and the second signal. For example, the target signal may be generated using equation (2).
Alternatively, the first gain may be determined based on a first predetermined signal-to-noise ratio, and the second gain may be determined based on a second signal-to-noise ratio and the first gain. Through the design, the generated mixed signal as training data and the target signal are more consistent with the requirements of practical application.
Since the speech processing models are different, the estimated data output by the speech processing models may be different. For example, the estimated data output by the speech processing model may be an estimated target signal or an estimated ideal amplitude mask.
In the case where the estimation data is an estimated ideal amplitude mask, model training module 603 may calculate a target ideal amplitude mask based on the target signal and the mixed signal, and then determine a loss function based on the target ideal amplitude mask and the estimation data. Here, the target ideal amplitude mask may be an energy ratio of the target signal to the mix signal.
FIG. 7 is a block diagram of a model training apparatus according to an embodiment of the present disclosure. Referring to FIG. 7, model training apparatus 700 may include a data generation module 701 and a data training module 702. Each module in the model training apparatus 700 may be implemented by one or more modules, and the name of the corresponding module may vary according to the type of the module. In various embodiments, some modules in model training apparatus 700 may be omitted, or additional modules may also be included. Furthermore, modules/elements according to various embodiments of the present disclosure may be combined to form a single entity, and thus may equivalently perform the functions of the respective modules/elements prior to combination.
Unlike single-purpose speech enhancement or speech noise reduction, the training data is divided into three categories at the model design stage, so the mixed data has three input sources: speech (requiring enhancement), music (an audio type requiring neither enhancement nor suppression), and noise (requiring suppression).
The data generation module 701 may generate a mixed signal and a target signal based on at least one of a voice signal, a noise signal, and a specific signal. Specifically, the data generation module 701 may multiply a specific signal by a first gain to obtain a first signal and multiply a noise signal by a second gain to obtain a second signal, and generate a mixed signal by mixing the first signal, the second signal, and the voice signal. For example, a mixed signal as shown in equation (1) may be generated.
The data generation module 701 may multiply the voice signal by a third gain to obtain a third signal, and generate the target signal by mixing the third signal and the second signal. For example, a target signal as shown in equation (2) may be generated.
Here, the first gain may be determined based on a first predetermined signal-to-noise ratio, and the second gain may be determined based on a second signal-to-noise ratio and the first gain. Through the design, the generated mixed signal as training data and the target signal are more consistent with the requirements of practical application. For example, equations (3) and (4) may be utilized to determine gain values for different signals.
The data training module 702 may input the mixed signal into a speech processing model (such as a deep neural network) to obtain estimated data, determine a loss function based on the target signal and the estimated data, train the speech processing model based on the loss function to adjust parameters of the speech processing model.
According to embodiments of the present disclosure, different training data may be utilized to obtain different speech processing models. Assuming the trained model outputs the target signal, the data training module 702 inputs the time-frequency-domain mixed signal Mix(n, k) into the deep neural network DNN, which outputs the estimated target signal Tar_est(n, k). A loss function is then constructed based on the target signal Tar(n, k) and the estimated target signal Tar_est(n, k), and the DNN is iteratively optimized on this loss until convergence, completing the training stage and yielding the speech processing model.
Assuming the trained model outputs an ideal amplitude mask, the data training module 702 may calculate a target ideal amplitude mask based on the target signal Tar(n, k) and the mixed signal Mix(n, k); it then inputs the time-frequency-domain mixed signal Mix(n, k) into the deep neural network DNN, which outputs an estimated ideal amplitude mask IRM_est(n, k), and a loss function is determined based on the target ideal amplitude mask IRM_obj(n, k) and the estimated ideal amplitude mask IRM_est(n, k). Here, the target ideal amplitude mask may be an energy ratio of the target signal to the mixed signal.
According to an embodiment of the present disclosure, an electronic device may be provided. Fig. 8 is a block diagram of an electronic device 800 that may include at least one memory 802 and at least one processor 801, the at least one memory 802 storing a set of computer-executable instructions that, when executed by the at least one processor 801, perform a method of speech processing or a method of training a speech processing model according to an embodiment of the disclosure, according to an embodiment of the disclosure.
The processor 801 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 801 may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The memory 802, which is a kind of storage medium, may include an operating system, a data storage module, a network communication module, a user interface module, a video playback parameter determination program, and a database.
The memory 802 may be integrated with the processor 801, for example, a RAM or flash memory may be disposed within an integrated circuit microprocessor or the like. Further, memory 802 may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory and the processor may be operatively coupled or may communicate with each other, such as through an I/O port, a network connection, etc., so that the processor can read files stored in the memory.
Further, the electronic device 800 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 800 may be connected to each other via a bus and/or a network.
By way of example, the electronic device 800 may be a PC computer, tablet device, personal digital assistant, smart phone, or other device capable of executing the set of instructions described above. Here, the electronic device 800 need not be a single electronic device, but can be any collection of devices or circuits that can execute the above instructions (or sets of instructions) either individually or in combination. The electronic device 800 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with local or remote (e.g., via wireless transmission).
Those skilled in the art will appreciate that the configuration shown in fig. 8 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform a speech processing method or a training method of a speech processing model according to the present disclosure. Examples of the computer-readable storage medium herein include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, hard disk drive (HDD), solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a solid-state disk, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the computer-readable storage medium described above can run in an environment deployed in computer equipment such as a client, a host, a proxy device, or a server. Further, in one example, the computer program and any associated data, data files, and data structures are distributed across a networked computer system such that they are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an embodiment of the present disclosure, there may also be provided a computer program product, in which instructions are executable by a processor of a computer device to perform the above-mentioned speech processing method or training method of a speech processing model.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for training a speech processing model, the method comprising:
generating a mixed signal and a target signal based on at least one of a speech signal, a noise signal, and a specific signal;
inputting the mixed signal into a speech processing model to obtain estimation data;
determining a loss function based on the target signal and the estimation data;
training the speech processing model based on the loss function to adjust parameters of the speech processing model.
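
As a purely illustrative aid (not part of the claims), the following sketch shows one way the training loop of claim 1 could look in practice. The GRU mask-estimation network, the MSE loss, and the Adam optimizer are assumptions chosen only for demonstration; the claim does not prescribe a particular architecture, loss, or optimizer.

# Illustrative sketch only -- not part of the claims. The GRU mask-estimation
# network, MSE loss, and Adam optimizer are assumptions for demonstration.
import torch
import torch.nn as nn

class SpeechProcessingModel(nn.Module):
    """Hypothetical network mapping a mixed magnitude spectrogram to a mask."""
    def __init__(self, n_bins=257, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_bins, hidden, batch_first=True)
        self.out = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

    def forward(self, mixed_mag):                 # (batch, frames, bins)
        h, _ = self.rnn(mixed_mag)
        return self.out(h)                        # estimation data: estimated mask

model = SpeechProcessingModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def training_step(mixed_mag, target_mask):
    """One update: input the mixed signal, compare the estimate with the target, adjust parameters."""
    estimated = model(mixed_mag)
    loss = loss_fn(estimated, target_mask)        # loss based on the target and the estimation data
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In this sketch the estimation data is an estimated ideal amplitude mask, which is one of the two alternatives named in claim 3; an estimated target signal could be trained in the same way.
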
2. The method of claim 1, wherein the step of generating the mixed signal based on at least one of the speech signal, the noise signal, and the specific signal comprises:
multiplying the specific signal by a first gain to obtain a first signal and multiplying the noise signal by a second gain to obtain a second signal;
generating the mixed signal by mixing the first signal, the second signal, and the speech signal,
wherein the first gain is determined based on a first predetermined signal-to-noise ratio, and the second gain is determined based on a second signal-to-noise ratio and the first gain,
wherein the step of generating the target signal based on at least one of the speech signal, the noise signal, and the specific signal comprises:
multiplying the speech signal by a third gain to obtain a third signal;
generating the target signal by mixing the third signal and the second signal.
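
For illustration only, a minimal sketch of the gain-based mixing in claim 2 follows. The claim states only that the gains are determined based on the respective signal-to-noise ratios; the energy-ratio formula below is one common choice and is an assumption, as is the literal reading of the target mixture as the third signal plus the second signal.

# Illustrative sketch only. The gain formula and default SNR values are assumptions.
import numpy as np

def energy(x):
    return np.sum(x ** 2) + 1e-12

def gain_for_snr(reference, signal, snr_db):
    """Gain scaling `signal` so the reference-to-signal energy ratio equals snr_db."""
    return np.sqrt(energy(reference) / (energy(signal) * 10 ** (snr_db / 10)))

def make_training_pair(speech, noise, specific, first_snr_db=10.0, second_snr_db=5.0, third_gain=1.0):
    first_gain = gain_for_snr(speech, specific, first_snr_db)                # from the first predetermined SNR
    first = first_gain * specific                                            # first signal
    second_gain = gain_for_snr(first_gain * specific, noise, second_snr_db)  # from the second SNR and the first gain
    second = second_gain * noise                                             # second signal
    mixed = first + second + speech                                          # mixed signal
    third = third_gain * speech                                              # third signal
    target = third + second                                                  # target signal as worded in the claim
    return mixed, target
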
3. The method of claim 1, wherein the estimation data is an estimated target signal or an estimated ideal amplitude mask,
wherein the ideal amplitude mask is related to the signal energy,
wherein, in the case that the estimation data is an estimated ideal amplitude mask, the step of determining a loss function based on the target signal and the estimation data comprises:
calculating a target ideal amplitude mask based on the target signal and the mixed signal;
determining a loss function based on the target ideal amplitude mask and the estimation data,
wherein the target ideal amplitude mask is an amplitude ratio of the target signal to the mixed signal in a time-frequency domain.
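
A minimal sketch of the target ideal amplitude mask described in claim 3, computed as the magnitude ratio of the target and mixed signals in the time-frequency domain. The Hann-windowed framed FFT front end and the mean-squared-error loss are assumptions used only to make the example concrete.

# Illustrative sketch only. The STFT parameters and MSE loss are assumptions;
# the claim requires only the amplitude ratio in a time-frequency domain.
import numpy as np

def stft_mag(x, n_fft=512, hop=256):
    """Magnitude spectrogram (frames x bins) from a simple Hann-windowed framed FFT."""
    window = np.hanning(n_fft)
    frames = [np.abs(np.fft.rfft(window * x[s:s + n_fft]))
              for s in range(0, len(x) - n_fft + 1, hop)]
    return np.array(frames)

def target_ideal_amplitude_mask(target, mixed, eps=1e-8):
    """Amplitude ratio |Target(t, f)| / |Mixed(t, f)| of the target and mixed signals."""
    return stft_mag(target) / (stft_mag(mixed) + eps)

def mask_loss(estimated_mask, target_mask):
    """Loss between the estimated ideal amplitude mask and the target ideal amplitude mask."""
    return float(np.mean((estimated_mask - target_mask) ** 2))
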
4. A method of speech processing, the method comprising:
acquiring an audio signal, wherein the audio signal comprises at least one of a speech signal, a noise signal, and a specific signal, and the specific signal belongs to an audio type that needs to be neither enhanced nor suppressed;
obtaining an ideal amplitude mask based on the audio signal using a speech processing model trained by the training method of any one of claims 1 to 3; and
performing different processing on the audio signal according to the magnitude of the ideal amplitude mask to obtain a desired signal.
5. The method of claim 4, wherein the step of performing different processing on the audio signal according to the magnitude of the ideal amplitude mask to obtain the desired signal comprises:
comparing the ideal amplitude mask with a predetermined threshold to determine whether the desired signal is obtained based on an estimated signal, the estimated signal resulting from multiplying the audio signal by the ideal amplitude mask,
wherein if the ideal amplitude mask is greater than the predetermined threshold, multiplying the estimated signal by a user-defined gain to obtain the desired signal; otherwise, taking the audio signal as the desired signal; or
wherein if the ideal amplitude mask is less than the predetermined threshold, taking the estimated signal as the desired signal; otherwise, taking the audio signal as the desired signal; or
wherein if the ideal amplitude mask is greater than the predetermined threshold, multiplying the estimated signal by a user-defined gain to obtain the desired signal; if the ideal amplitude mask is less than the predetermined threshold, taking the estimated signal as the desired signal; otherwise, taking the audio signal as the desired signal.
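
A hedged sketch of the third alternative in claim 5. Treating the comparison element-wise over time-frequency bins, and the particular threshold and user-defined gain values, are assumptions not fixed by the claim.

# Illustrative sketch of the third alternative in claim 5, applied per
# time-frequency bin; the threshold and user-defined gain values are assumptions.
import numpy as np

def process_with_mask(audio_mag, mask, threshold=0.5, user_gain=1.5):
    """Choose, per bin, between a boosted estimate, the plain estimate, and pass-through."""
    estimated = audio_mag * mask                       # estimated signal
    return np.where(mask > threshold,
                    estimated * user_gain,             # mask above threshold: boost the estimate
                    np.where(mask < threshold,
                             estimated,                # mask below threshold: keep the estimate
                             audio_mag))               # otherwise: keep the audio signal
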
6. The method of claim 4, wherein the output of the speech processing model is the ideal amplitude mask or an estimated target signal,
wherein, in a case where the output of the speech processing model is the estimated target signal, the step of obtaining the ideal amplitude mask comprises:
obtaining the estimated target signal by applying the audio signal to the speech processing model;
obtaining the ideal amplitude mask based on the estimated target signal and the audio signal.
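
Finally, a small sketch of claim 6: when the model outputs an estimated target signal rather than a mask, the ideal amplitude mask can be recovered as the magnitude ratio of the estimated target signal to the audio signal. The frame-wise FFT helper below is an assumption standing in for whichever time-frequency transform an implementation actually uses.

# Illustrative sketch of claim 6; the framed FFT helper and its parameters are assumptions.
import numpy as np

def frame_magnitudes(x, n_fft=512, hop=256):
    """Magnitude spectra of Hann-windowed frames (frames x bins)."""
    win = np.hanning(n_fft)
    return np.array([np.abs(np.fft.rfft(win * x[s:s + n_fft]))
                     for s in range(0, len(x) - n_fft + 1, hop)])

def mask_from_estimated_target(estimated_target, audio, eps=1e-8):
    """Ideal amplitude mask derived from the estimated target signal and the audio signal."""
    return frame_magnitudes(estimated_target) / (frame_magnitudes(audio) + eps)
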
7. An apparatus for training a speech processing model, the apparatus comprising:
a data generation module configured to generate a mixed signal and a target signal based on at least one of a speech signal, a noise signal, and a specific signal; and
a data training module configured to: input the mixed signal into a speech processing model to obtain estimation data; determine a loss function based on the target signal and the estimation data; and train the speech processing model based on the loss function to adjust parameters of the speech processing model.
8. A speech processing apparatus, characterized in that the apparatus comprises:
a data acquisition module configured to acquire an audio signal, wherein the audio signal includes at least one of a speech signal, a noise signal, and a specific signal, the specific signal belonging to an audio type that needs to be neither enhanced nor suppressed;
a data processing module configured to:
obtain an ideal amplitude mask based on the audio signal using a speech processing model trained by the training method of any one of claims 1 to 3; and
perform different processing on the audio signal according to the magnitude of the ideal amplitude mask to obtain a desired signal.
9. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the training method of any one of claims 1 to 3 or the speech processing method of any one of claims 4 to 6.
10. A computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the training method of any one of claims 1 to 3 or the speech processing method of any one of claims 4 to 6.
