CN112309426A - Voice processing model training method and device and voice processing method and device

Voice processing model training method and device and voice processing method and device

Info

Publication number
CN112309426A
CN112309426A
Authority
CN
China
Prior art keywords
signal
amplitude mask
speech processing
ideal amplitude
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011330109.3A
Other languages
Chinese (zh)
Inventor
郑羲光
李楠
任新蕾
张晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202011330109.3A
Publication of CN112309426A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0316 - Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L 21/0324 - Details of processing therefor
    • G10L 21/034 - Automatic adjustment
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/21 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being power information

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The disclosure provides a training method and apparatus for a voice processing model, and a voice processing method and apparatus. The training method comprises the following steps: generating a mixed signal and a target signal based on at least one of a voice signal, a noise signal, and a specific signal; inputting the mixed signal into a speech processing model to obtain estimation data; determining a loss function based on the target signal and the estimation data; and training the speech processing model based on the loss function to adjust parameters of the speech processing model.

Description

Voice processing model training method and device and voice processing method and device
Technical Field
The present disclosure relates to the field of audio technologies, and in particular, to a method and an apparatus for training a speech processing model, and a method and an apparatus for speech processing.
Background
With the rapid development of electronic technology and network technology, electronic devices can process audio signals in a time-frequency domain based on a neural network speech processing algorithm.
Although neural network-based speech enhancement and noise reduction have surpassed conventional signal processing methods in performance and can run efficiently on electronic devices, speech enhancement (increasing the speech component while leaving non-speech components unchanged) and speech noise reduction (reducing non-speech components while leaving the speech component unchanged) are usually handled by training two separate neural networks. Moreover, when both networks are used for speech processing, one type of signal is always amplified or attenuated while the other type is kept unchanged.
Disclosure of Invention
The present disclosure provides a training method and apparatus for a speech processing model, and a speech processing method and apparatus, so as to at least solve the problem that speech enhancement and denoising cannot be accomplished simultaneously with a single neural network.
According to a first aspect of embodiments of the present disclosure, there is provided a method for training a speech processing model, which may include: generating a mixed signal and a target signal based on at least one of a voice signal, a noise signal, and a specific signal; inputting the mixed signal into a speech processing model to obtain estimation data; determining a loss function based on the target signal and the estimation data; training the speech processing model based on the loss function to adjust parameters of the speech processing model.
Alternatively, the generating of the mixed signal based on at least one of the voice signal, the noise signal, and the specific signal may include: multiplying the specific signal by a first gain to obtain a first signal and multiplying the noise signal by a second gain to obtain a second signal; the mixed signal is generated by mixing the first signal, the second signal and the speech signal.
Alternatively, the first gain may be determined based on a first predetermined signal-to-noise ratio, and the second gain may be determined based on a second signal-to-noise ratio and the first gain.
Alternatively, the step of generating the target signal based on at least one of the voice signal, the noise signal, and the specific signal may include: multiplying the speech signal by a third gain to obtain a third signal; the target signal is generated by mixing the third signal and the second signal.
Alternatively, the estimation data may be an estimated target signal or an estimated ideal amplitude mask, wherein the ideal amplitude mask is related to the signal energy.
Optionally, in case the estimation data is an estimated ideal amplitude mask, the step of determining a loss function based on the target signal and the estimation data may comprise: calculating a target ideal amplitude mask based on the target signal and the mixed signal; determining a loss function based on the target ideal amplitude mask and the estimation data.
Alternatively, the target ideal amplitude mask may be an amplitude ratio of the target signal to the mixed signal in a time-frequency domain.
According to a second aspect of the embodiments of the present disclosure, there is provided a speech processing method, which may include: acquiring an audio signal, wherein the audio signal comprises at least one of a speech signal, a noise signal and a specific signal, and the specific signal belongs to an audio type which needs neither enhancement nor suppression; obtaining an ideal amplitude mask using a speech processing model based on the audio signal; and performing different processing on the audio signal according to the size of the ideal amplitude mask to obtain a desired signal.
Alternatively, the speech processing model may be obtained by training according to the above-mentioned training method.
Optionally, the step of performing different processing on the audio signal to obtain the desired signal according to the size of the ideal amplitude mask may include: determining whether to obtain the desired signal based on an estimated signal resulting from multiplying the audio signal by the ideal amplitude mask by comparing the ideal amplitude mask to a predetermined threshold.
Optionally, the step of determining whether to obtain the desired signal based on an estimated signal resulting from multiplying the audio signal by the ideal amplitude mask may comprise: multiplying the estimated signal by a user-defined gain to obtain the desired signal if the ideal amplitude mask is greater than the predetermined threshold; otherwise, the audio signal is taken as the desired signal.
Optionally, the step of determining whether to obtain the desired signal based on an estimated signal resulting from multiplying the audio signal by the ideal amplitude mask may comprise: if the ideal amplitude mask is less than the predetermined threshold, treating the estimated signal as the desired signal; otherwise, the audio signal is taken as the desired signal.
Optionally, the step of determining whether to obtain the desired signal based on an estimated signal resulting from multiplying the audio signal by the ideal amplitude mask may comprise: multiplying the estimated signal by a user-defined gain to obtain the desired signal if the ideal amplitude mask is greater than the predetermined threshold; if the ideal amplitude mask is less than the predetermined threshold, treating the estimated signal as the desired signal; otherwise, the audio signal is taken as the desired signal.
Optionally, the output of the speech processing model is the ideal amplitude mask or the estimated target signal, wherein in case the output of the speech processing model is the estimated target signal, the step of obtaining the ideal amplitude mask may comprise: obtaining an estimated target signal by applying the audio signal to a speech processing model; obtaining the ideal amplitude mask based on the estimated target signal and the audio signal.
According to a third aspect of the embodiments of the present disclosure, there is provided an apparatus for training a speech processing model, the apparatus may include: a data generation module configured to: generate a mixed signal and a target signal based on at least one of a voice signal, a noise signal, and a specific signal; and a data training module configured to: input the mixed signal into a speech processing model to obtain estimation data; determine a loss function based on the target signal and the estimation data; and train the speech processing model based on the loss function to adjust parameters of the speech processing model.
Optionally, the data generation module may be configured to: multiplying the specific signal by a first gain to obtain a first signal and multiplying the noise signal by a second gain to obtain a second signal; and generating the mixed signal by mixing the first signal, the second signal, and the voice signal.
Alternatively, the first gain may be determined based on a first predetermined signal-to-noise ratio, and the second gain may be determined based on a second signal-to-noise ratio and the first gain.
Optionally, the data generation module may be configured to: multiplying the speech signal by a third gain to obtain a third signal; the target signal is generated by mixing the third signal and the second signal.
Alternatively, the estimation data may be an estimated target signal or an estimated ideal amplitude mask, wherein the ideal amplitude mask may be related to the signal energy.
Optionally, in case the estimated data is an estimated ideal amplitude mask, the data training module may be configured to: calculating a target ideal amplitude mask based on the target signal and the mixed signal; determining a loss function based on the target ideal amplitude mask and the estimation data.
Alternatively, the target ideal amplitude mask may be an amplitude ratio of the target signal to the mixed signal in a time-frequency domain.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a speech processing apparatus, which may include: a data acquisition module configured to acquire an audio signal, wherein the audio signal includes at least one of a speech signal, a noise signal, and a specific signal, the specific signal being of an audio type that needs neither enhancement nor suppression; and a data processing module configured to: obtain an ideal amplitude mask using a voice processing model based on the audio signal; and perform different processing on the audio signal according to the size of the ideal amplitude mask to obtain a desired signal.
Optionally, the data processing module may be configured to: determining whether to obtain the desired signal based on an estimated signal resulting from multiplying the audio signal by the ideal amplitude mask by comparing the ideal amplitude mask to a predetermined threshold.
Optionally, the data processing module may be configured to: multiplying the estimated signal by a user-defined gain to obtain the desired signal if the ideal amplitude mask is greater than the predetermined threshold; otherwise, the audio signal is taken as the desired signal.
Optionally, the data processing module may be configured to: if the ideal amplitude mask is less than the predetermined threshold, treating the estimated signal as the desired signal; otherwise, the audio signal is taken as the desired signal.
Optionally, the data processing module may be configured to: multiplying the estimated signal by a user-defined gain to obtain the desired signal if the ideal amplitude mask is greater than the predetermined threshold; if the ideal amplitude mask is less than the predetermined threshold, treating the estimated signal as the desired signal; otherwise, the audio signal is taken as the desired signal.
Optionally, the output of the speech processing model may be the ideal amplitude mask or the estimated target signal, wherein in case the output of the speech processing model is the estimated target signal, the data processing module may be configured to: obtaining an estimated target signal by applying the audio signal to a speech processing model; obtaining the ideal amplitude mask based on the estimated target signal and the audio signal.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus, which may include: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the speech processing method and the model training method as described above.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the speech processing method and the model training method as described above.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product, instructions of which are executed by at least one processor in an electronic device to perform the speech processing method and the model training method as described above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the method integrates speech enhancement and denoising into one deep neural network for training, and, through post-processing based on the ideal amplitude mask (IRM), can perform speech enhancement and denoising either separately or simultaneously. Furthermore, at the time of model design the training targets are classified into three categories, namely speech (requiring enhancement), noise (requiring suppression), and other audio such as music (requiring neither enhancement nor suppression). A speech processing model trained with such data differs from a single-purpose speech enhancement or speech noise reduction model, so the model is better suited to practical use and speech processing can be performed more efficiently.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow chart of a method of speech processing according to an embodiment of the present disclosure;
FIG. 2 is a flow diagram of a method of training a speech processing model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of training a speech processing model according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of training a speech processing model according to another embodiment of the present disclosure;
FIG. 5 is a flow chart diagram of a method of speech processing according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of a model training apparatus according to an embodiment of the present disclosure;
fig. 8 is a block diagram of an electronic device according to an embodiment of the disclosure.
Throughout the drawings, it should be noted that the same reference numerals are used to designate the same or similar elements, features and structures.
Detailed Description
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of the embodiments of the disclosure as defined by the claims and their equivalents. Various specific details are included to aid understanding, but these are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In the present disclosure, the expression "at least one of the items" covers three parallel cases: "any one of the items", "a combination of any plural ones of the items", and "all of the items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. For another example, "performing at least one of step one and step two" covers the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
In related speech enhancement and denoising applications, two neural networks are used independently, one for speech enhancement and one for noise cancellation, which doubles the complexity and is unfavorable for application and use on electronic devices. Therefore, the present disclosure proposes a method for performing speech enhancement and denoising simultaneously with a single neural network, i.e., ensuring both noise suppression and speech enhancement.
In addition, a new class, namely audio types such as music and the like which are not expected to be amplified or weakened, is introduced in the model design, so that the method is more suitable for practical application requirements. Therefore, noise suppression, speech enhancement and other types of sound magnitude invariance can be simultaneously ensured through the speech processing model of the present disclosure.
Hereinafter, according to various embodiments of the present disclosure, a method, an apparatus, and a system of the present disclosure will be described in detail with reference to the accompanying drawings.
FIG. 1 is a flow chart of a method of speech processing according to an embodiment of the present disclosure. The speech processing method illustrated in fig. 1 may be executed on a network side connected to the electronic device or locally on the electronic device.
The electronic device may be any electronic device having functions of voice/text reception, voice processing, and command execution. In an example embodiment of the present disclosure, the electronic device may include, for example, but not limited to, a portable communication device (e.g., a smartphone), a computer device, a portable multimedia device, a portable medical device, a camera, a wearable device, or a server, among others. According to the embodiments of the present disclosure, the electronic device is not limited to the above.
Referring to fig. 1, in step S101, an audio signal is acquired. Since a new audio class, i.e., a type that is neither desired to be amplified nor desired to be attenuated (such as a music signal), is introduced during the model training phase of the present disclosure, the speech processing model of the present disclosure is not used purely for speech enhancement and speech noise reduction. Thus, the present disclosure may perform speech processing on multiple types of signals. For example, the audio signal may include at least one of a voice signal, a noise signal, and a specific signal. Here, the specific signal is of an audio type that needs neither enhancement nor suppression; for example, the specific signal is a music signal. However, the above examples are merely exemplary, and the present disclosure is not limited thereto.
In step S102, an ideal amplitude mask (IRM) is obtained using a speech processing model based on the acquired audio signal. How the speech processing model is obtained will be described in detail below with reference to fig. 2. FIG. 2 is a flow diagram of a method of training a speech processing model according to an embodiment of the present disclosure. The model training method provided in the embodiments of the present disclosure may be executed by the model training apparatus provided in the embodiments of the present disclosure, or by an electronic device including the model training apparatus. This can be determined according to actual use requirements, and the embodiments of the present disclosure are not limited in this respect.
Referring to fig. 2, in step S201, a mixed signal and a target signal are generated based on at least one of a voice signal, a noise signal, and a specific signal, where the specific signal is of an audio type that needs neither enhancement nor suppression. For example, the specific signal may be a music signal. According to an embodiment of the present disclosure, in the process of generating the mixed signal and the target signal, other types of signals may be included in addition to the above-listed signals; that is, the training data is not limited to the above three categories and may include more types of audio signals.
As an example, the mixed signal may include three data sources: a speech signal S(t), a specific signal M(t), and a noise signal N(t), where t represents time. The speech signal S(t) may refer to a signal that requires enhancement, the specific signal M(t) may refer to a type of audio that requires neither enhancement nor suppression, and the noise signal N(t) may refer to a signal that requires suppression.
In generating the mixed signal, the specific signal M(t) may be multiplied by a first gain to obtain a first signal, and the noise signal N(t) may be multiplied by a second gain to obtain a second signal; the mixed signal may then be generated by mixing the first signal, the second signal, and the speech signal S(t). For example, the mixed signal may be represented by the following equation (1):
Mix(t) = S(t) + M(t)*g_SNR1 + N(t)*g_SNR2    (1)
where Mix(t) is the mixed signal, g_SNR1 is the first gain, and g_SNR2 is the second gain.
In generating the target signal, the speech signal S(t) may be multiplied by a third gain to obtain a third signal, and then the target signal may be generated by mixing the third signal and the second signal. For example, the target signal may be represented by the following equation (2):
Tar(t) = S(t)*g_tar + N(t)*g_SNR2    (2)
where Tar(t) is the target signal and g_tar is the third gain. Here, the third gain may be a target voice amplification gain.
According to the embodiment of the disclosure, the first gain, the second gain and the third gain can be determined according to the preset signal-to-noise ratio, so that the generated mixed signal and the target signal are more consistent with the actual situation, and the trained voice processing model is more accurate. The third gain may be adjusted by a user according to actual needs, or may be a predetermined value, to which the present disclosure is not limited.
As an example, a first gain may be determined based on a first predetermined signal-to-noise ratio, and a second gain may be determined based on a second signal-to-noise ratio and the first gain. For example, the first gain and the second gain may be determined using the following equations (3) and (4):
[Equations (3) and (4) appear as images in the original publication: equation (3) determines the first gain g_SNR1 from the first predetermined signal-to-noise ratio target SNR1, and equation (4) determines the second gain g_SNR2 from the second signal-to-noise ratio target SNR2 and the first gain.]
where target SNR1 is the first predetermined signal-to-noise ratio and target SNR2 is the second signal-to-noise ratio. target SNR1 represents the energy ratio between the speech signal and the specific signal, and target SNR2 represents the energy ratio of the speech signal plus the specific signal to the noise signal. The above examples are merely illustrative, and the present disclosure is not limited thereto. Alternatively, different signal-to-noise ratios can be set according to actual requirements.
Further, in generating the mixed signal and the target signal, if the training data includes other types of audio signals in addition to the above-described voice signal, noise signal, and specific signal, it is possible to distinguish by applying different target gains to each type of signal and satisfy the signal-to-noise ratio of the actual demand.
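As a non-authoritative illustration of the data generation described above, the following Python sketch builds one (Mix, Tar) training pair. The energy-based gain formulas are an assumption consistent with the stated meanings of target SNR1 and target SNR2, not the patent's exact equations (3) and (4); the function and parameter names are hypothetical.

```python
import numpy as np

def mix_and_target(speech, music, noise, target_snr1_db=5.0, target_snr2_db=10.0, speech_gain=2.0):
    """Build one (Mix, Tar) training pair from speech, a specific signal (music) and noise."""
    def energy(x):
        return float(np.mean(x ** 2)) + 1e-12

    # Assumed form of equation (3): scale the specific signal so that the
    # speech-to-specific energy ratio equals target SNR1 (in dB).
    g_snr1 = np.sqrt(energy(speech) / (energy(music) * 10.0 ** (target_snr1_db / 10.0)))
    first = music * g_snr1

    # Assumed form of equation (4): scale the noise so that the
    # (speech + scaled specific)-to-noise energy ratio equals target SNR2.
    g_snr2 = np.sqrt(energy(speech + first) / (energy(noise) * 10.0 ** (target_snr2_db / 10.0)))
    second = noise * g_snr2

    mix = speech + first + second            # equation (1)
    target = speech * speech_gain + second   # equation (2): third signal mixed with the second signal
    return mix, target

# Usage with random placeholder waveforms of equal length.
rng = np.random.default_rng(0)
s, m, n = (rng.standard_normal(16000) for _ in range(3))
mix, tar = mix_and_target(s, m, n)
```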
In step S202, the mixed signal is input to a speech processing model to obtain estimation data. Here, the speech processing model may be obtained by training a deep neural network.
According to embodiments of the present disclosure, different speech processing models may be obtained from different training data. Here, the estimation data may be an estimated target signal or an estimated ideal amplitude mask.
In step S203, a loss function is determined based on the target signal and the estimation data. As an example, in case the estimation data is an estimated ideal amplitude mask, the target ideal amplitude mask may first be calculated based on the target signal and the mixed signal, and then the loss function may be determined based on the target ideal amplitude mask and the estimation data.
In step S204, the speech processing model is trained based on the loss function to adjust parameters of the speech processing model. The training process of the speech processing model in the case where the output of the speech processing model is the estimated target signal will be described in detail below with reference to fig. 3, and the training process of the speech processing model in the case where the output of the speech processing model is the estimated ideal amplitude mask will be described in detail with reference to fig. 4.
In the case where the output of the speech processing model is an estimated target signal, the speech processing model may be trained with reference to FIG. 3. FIG. 3 is a schematic diagram of training a speech processing model according to an embodiment of the present disclosure.
Referring to fig. 3, the mixed signal Mix(t) and the target signal Tar(t) are respectively transformed to the time-frequency domain by the short-time Fourier transform (STFT) to obtain a time-frequency-domain mixed signal Mix(n, k) and target signal Tar(n, k). For example, if the target signal and the mixed signal of length T are Tar(t) and Mix(t) in the time domain, where t represents time and 0 < t ≤ T, then after the STFT, Tar(t) and Mix(t) can be expressed as:
Tar(n,k)=STFT(Tar(t)) (5)
Mix(n,k)=STFT(Mix(t)) (6)
where n is the frame index, 0 < n ≤ N, and N is the total number of frames; k is the frequency-bin (center frequency) index, 0 < k ≤ K, and K is the total number of frequency bins.
Next, the mixed signal Mix(n, k) in the time-frequency domain is input to a deep neural network DNN, which outputs an estimated target signal Tar_est(n, k). A loss function is then constructed based on the target signal Tar(n, k) and the estimated target signal Tar_est(n, k), and the deep neural network DNN is iteratively optimized on this loss function until convergence, completing the training stage and yielding the speech processing model. However, the above way of constructing the loss function is merely exemplary, and the present disclosure is not limited thereto.
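For illustration only, the sketch below shows one way such a training step could look in PyTorch. The network architecture, the mean-squared-error loss, and the STFT settings are assumptions made for the example; the patent only specifies that a loss is built from Tar(n, k) and Tar_est(n, k).

```python
import torch
import torch.nn as nn

N_FFT, HOP = 512, 128
window = torch.hann_window(N_FFT)

def spectrum(wave):
    # STFT magnitude on the time-frequency grid; rows are frames n, columns frequency bins k.
    spec = torch.stft(wave, n_fft=N_FFT, hop_length=HOP, window=window, return_complex=True)
    return spec.abs().transpose(-1, -2)

# A small stand-in DNN; the patent does not prescribe a particular architecture.
dnn = nn.Sequential(nn.Linear(N_FFT // 2 + 1, 256), nn.ReLU(),
                    nn.Linear(256, N_FFT // 2 + 1), nn.ReLU())
optimizer = torch.optim.Adam(dnn.parameters(), lr=1e-3)
mse = nn.MSELoss()

def train_step(mix_wave, tar_wave):
    mix_mag = spectrum(mix_wave)      # Mix(n, k)
    tar_mag = spectrum(tar_wave)      # Tar(n, k)
    tar_est = dnn(mix_mag)            # Tar_est(n, k)
    loss = mse(tar_est, tar_mag)      # one possible loss between target and estimate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```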
After inputting the audio signal into the speech processing model trained as shown in fig. 3, an estimated target signal can be obtained.
In the case where the output of the speech processing model is an estimated ideal amplitude mask, the speech processing model may be trained with reference to FIG. 4. FIG. 4 is a schematic diagram of training a speech processing model according to another embodiment of the present disclosure.
Referring to fig. 4, the mixed signal Mix(t) and the target signal Tar(t) are respectively transformed to the time-frequency domain by the short-time Fourier transform (STFT) to obtain a time-frequency-domain mixed signal Mix(n, k) and target signal Tar(n, k). For example, if the target signal and the mixed signal of length T are Tar(t) and Mix(t) in the time domain, where t represents time and 0 < t ≤ T, then after the STFT, Tar(t) and Mix(t) can be expressed in the time-frequency domain as equation (5) and equation (6):
Tar(n,k)=STFT(Tar(t)) (5)
Mix(n,k)=STFT(Mix(t)) (6)
where n is the frame index, 0 < n ≤ N, and N is the total number of frames; k is the frequency-bin (center frequency) index, 0 < k ≤ K, and K is the total number of frequency bins.
The target ideal amplitude mask is calculated based on the mixed signal Mix (n, k) and the target signal Tar (n, k). For example, the target ideal amplitude mask may be calculated using equation (7) below:
IRM_obj(n, k) = |Tar(n, k)| / |Mix(n, k)|    (7)
As can be seen from the above equation (7), the target ideal amplitude mask is the amplitude ratio of the target signal to the mixed signal in the time-frequency domain, and it can be greater than 1 where the target amplitude exceeds the mixed amplitude.
Next, the mixed signal Mix(n, k) in the time-frequency domain is input into a deep neural network DNN, which outputs an estimated ideal amplitude mask IRM_est(n, k). A loss function is then constructed based on the target ideal amplitude mask IRM_obj(n, k) and the estimated ideal amplitude mask IRM_est(n, k), and the deep neural network DNN is trained by optimizing this loss to adjust the network parameters, thereby obtaining the speech processing model. However, the above example of constructing the loss function is merely exemplary, and the present disclosure is not limited thereto.
After the audio signal is input to the speech processing model trained as shown in fig. 4, an estimated ideal amplitude mask can be obtained.
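Analogously, a hedged sketch of one training step for this mask-output variant is given below, reusing spectrum, dnn, mse, and optimizer from the previous sketch; the mean-squared-error loss is again an assumption made for illustration.

```python
def train_step_irm(mix_wave, tar_wave, eps=1e-8):
    mix_mag = spectrum(mix_wave)             # Mix(n, k)
    tar_mag = spectrum(tar_wave)             # Tar(n, k)
    irm_obj = tar_mag / (mix_mag + eps)      # equation (7): target ideal amplitude mask
    irm_est = dnn(mix_mag)                   # IRM_est(n, k) from the network
    loss = mse(irm_est, irm_obj)             # loss between IRM_obj and IRM_est
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```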
Referring back to fig. 1, in step S102, in the case where the output of the speech processing model is an estimated target signal, the estimated target signal may be obtained by applying the obtained audio signal to the speech processing model, and then an ideal amplitude mask is obtained based on the estimated target signal and the audio signal. For example, when the estimated target signal is output from the speech processing model, the ideal amplitude mask can be calculated using equation (8) below:
IRM(n, k) = |Tar_est(n, k)| / |Aud(n, k)|    (8)
where Tar_est(n, k) is the estimated target signal output from the speech processing model, and Aud(n, k) is the time-frequency-domain signal obtained by applying a short-time Fourier transform to the acquired audio signal.
In step S103, the audio signal is processed differently according to the size of the obtained ideal amplitude mask to obtain a desired signal. Here, the desired signal may be a signal subjected to speech enhancement, a signal subjected to noise reduction, or a signal subjected to both speech enhancement and noise reduction. Whether the desired signal is obtained from an estimated signal, i.e., the product of the obtained audio signal and the ideal amplitude mask, is determined by comparing the ideal amplitude mask with a predetermined threshold.
As an example, if the ideal amplitude mask is larger than the predetermined threshold, the estimated signal is multiplied by a gain defined by the user to obtain the desired signal, otherwise the obtained audio signal is taken as the desired signal. For example, the desired signal may be obtained according to equation (9) below:
Est(n, k) = g_user * IRM(n, k) * Aud(n, k), if IRM(n, k) > threshold; Est(n, k) = Aud(n, k), otherwise    (9)
where Est(n, k) represents the desired signal, Aud(n, k) represents the audio signal after the short-time Fourier transform, and g_user is an adjustable, user-defined gain. Here, the predetermined threshold may be 1, or an arbitrary value set by the user.
After obtaining the desired signal Est(n, k) in the time-frequency domain, the desired signal Est(t) in the time domain is obtained by an inverse short-time Fourier transform.
Through the above-described processing, the speech portion in the obtained audio signal can be further enhanced, and the gain of the speech portion desired to be enhanced can be arbitrarily adjusted according to the user's needs.
As another example, the estimated signal is taken as the desired signal if the ideal amplitude mask is smaller than a predetermined threshold, and the obtained audio signal is taken as the desired signal otherwise. For example, the desired signal may be obtained according to equation (10) below:
Est(n, k) = IRM(n, k) * Aud(n, k), if IRM(n, k) < threshold; Est(n, k) = Aud(n, k), otherwise    (10)
where Est (n, k) represents the desired signal and Aud (n, k) represents the audio signal. Here, the preset threshold may be 1, or may be an arbitrary value set by the user.
By the processing, the denoising effect of the obtained audio signal can be realized.
As another example, if the ideal amplitude mask is greater than a predetermined threshold, the estimated signal is multiplied by a gain defined by the user to obtain the desired signal, if the ideal amplitude mask is less than the predetermined threshold, the estimated signal is treated as the desired signal, otherwise the obtained audio signal is treated as the desired signal. For example, the desired signal may be obtained according to equation (11) below:
Est(n, k) = g_user * IRM(n, k) * Aud(n, k), if IRM(n, k) > threshold; Est(n, k) = IRM(n, k) * Aud(n, k), if IRM(n, k) < threshold; Est(n, k) = Aud(n, k), otherwise    (11)
where Est(n, k) represents the desired signal, Aud(n, k) represents the audio signal, and g_user is an adjustable, user-defined additional gain. Here, the predetermined threshold may be 1 or an arbitrary value set by the user.
Through the above processing, the speech portion of the obtained audio signal can be further enhanced, the gain of the speech portion to be enhanced can be adjusted arbitrarily according to the user's needs, and noise reduction can be performed on the audio signal at the same time.
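The three post-processing rules of equations (9) to (11) can be summarized in a single thresholding routine, sketched below. The function name, its mode argument, and the default gain and threshold values are hypothetical; the threshold of 1 is taken from the text as one possible choice.

```python
import numpy as np

def post_process(aud_spec, irm, mode="both", user_gain=1.5, threshold=1.0):
    """Threshold-based post-processing following equations (9)-(11).

    aud_spec: complex STFT Aud(n, k); irm: ideal amplitude mask IRM(n, k), same shape.
    mode: "enhance" follows equation (9), "denoise" equation (10), "both" equation (11).
    """
    aud_spec = np.asarray(aud_spec)
    irm = np.asarray(irm)
    est = aud_spec.astype(complex).copy()   # default ("otherwise") branch: keep the audio signal
    above = irm > threshold
    below = irm < threshold
    if mode in ("enhance", "both"):
        # Speech-dominated cells: apply the mask plus the user-defined gain (equations (9)/(11)).
        est[above] = user_gain * irm[above] * aud_spec[above]
    if mode in ("denoise", "both"):
        # Noise-dominated cells: apply the mask only, suppressing the noise (equations (10)/(11)).
        est[below] = irm[below] * aud_spec[below]
    return est
```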
In the above-described embodiment, in the model training stage and the speech processing stage, the obtained signal in the time domain may be first converted into a signal in the time-frequency domain via short-time fourier transform, then model training and speech processing are performed, and finally the finally obtained signal in the time-frequency domain may be converted into a signal in the time domain via short-time inverse fourier transform.
FIG. 5 is a flow chart diagram of a speech processing method according to an embodiment of the present disclosure. In this embodiment, it is assumed that the output of the speech processing model is the estimated target signal.
Referring to fig. 5, the obtained audio signal Aud (t) is transformed into a signal Aud (n, k) on a time-frequency domain by a short-time fourier transform STFT, and then the signal Aud (n, k) is input to a trained speech processing model.
After the speech processing model outputs an estimated target signal Tar_est(n, k), an ideal amplitude mask IRM(n, k) may be calculated using equation (8), and how to post-process the acquired audio signal is then determined based on a comparison of the calculated ideal amplitude mask with a predetermined threshold.
The preset threshold may be set to 1, the desired signal Est (n, k) in the time-frequency domain is obtained using equation (9), equation (10), or equation (11), and then the signal in the time domain is obtained by performing a short-time inverse fourier transform ISTFT on the desired signal Est (n, k) in the time-frequency domain.
Furthermore, in the case where the output of the speech processing model is an estimated ideal amplitude mask, the step in fig. 5 of converting the estimated target signal into a mask may be omitted: the estimated ideal amplitude mask is obtained directly from the speech processing model, and different post-processing operations are then performed based on the comparison of the ideal amplitude mask with the preset threshold.
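Putting the pieces together, a minimal end-to-end sketch of the flow in fig. 5 might look as follows. It assumes a trained model that maps the magnitude spectrogram to an estimated target spectrogram (the fig. 3 variant) and reuses post_process from the earlier sketch; all names and STFT settings are illustrative, and a model that outputs the mask directly would skip the division implementing equation (8).

```python
import numpy as np
from scipy.signal import stft, istft

def process_audio(aud_wave, model, fs=16000, n_fft=512, user_gain=1.5, threshold=1.0):
    """Fig. 5 flow: STFT -> speech processing model -> IRM -> post-processing -> ISTFT."""
    _, _, aud_spec = stft(aud_wave, fs=fs, nperseg=n_fft)   # Aud(n, k), complex spectrogram
    tar_est_mag = model(np.abs(aud_spec))                   # |Tar_est(n, k)| from the trained model
    irm = tar_est_mag / (np.abs(aud_spec) + 1e-8)           # equation (8)
    est_spec = post_process(aud_spec, irm, mode="both",
                            user_gain=user_gain, threshold=threshold)
    _, est_wave = istft(est_spec, fs=fs, nperseg=n_fft)     # desired signal back in the time domain
    return est_wave
```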
Fig. 6 is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure. Referring to fig. 6, the speech processing apparatus 600 may include a data acquisition module 601, a data processing module 602, and a model training module 603. Each module in the voice processing apparatus 600 may be implemented by one or more modules, and names of the corresponding modules may vary according to types of the modules. In various embodiments, some modules in the speech processing apparatus 600 may be omitted, or additional modules may also be included. Furthermore, modules/elements according to various embodiments of the present disclosure may be combined to form a single entity, and thus may equivalently perform the functions of the respective modules/elements prior to combination.
The data acquisition module 601 may acquire an audio signal, wherein the audio signal may include at least one of a speech signal, a noise signal, and a specific signal, the specific signal being of an audio type that needs neither enhancement nor suppression. Since a new audio class, i.e., a type that is neither desired to be amplified nor desired to be attenuated (such as a music signal), is introduced in the model training stage of the present disclosure, the speech processing model of the present disclosure is not used purely for speech enhancement and speech noise reduction, and such a design is better suited to practical applications. Thus, the present disclosure may perform speech processing on multiple types of signals.
The data processing module 602 may obtain an ideal amplitude mask using a speech processing model based on the obtained audio signal, and then perform different processing on the audio signal according to the size of the ideal amplitude mask to obtain a desired signal.
As an example, the data processing module 602 may determine whether to obtain the desired signal based on an estimated signal resulting from multiplying the audio signal by the ideal amplitude mask by comparing the ideal amplitude mask to a predetermined threshold.
For example, if the ideal amplitude mask is greater than the predetermined threshold, the data processing module 602 may multiply the estimated signal resulting from multiplying the audio signal and the ideal amplitude mask by a user-defined gain to obtain the desired signal; otherwise the audio signal may be taken as the desired signal. Here, the preset threshold may be set to 1, or an arbitrary value set by the user. The voice post-processing operation can be performed with reference to equation (9).
For another example, if the ideal amplitude mask is less than the predetermined threshold, the data processing module 602 may treat the estimated signal as the desired signal; otherwise the audio signal may be taken as the desired signal. The voice post-processing operation can be performed with reference to equation (10).
For another example, if the ideal amplitude mask is greater than the predetermined threshold, the data processing module 602 may multiply the estimated signal by a user-defined gain to obtain the desired signal. If the ideal amplitude mask is less than the predetermined threshold, the data processing module 602 may treat the estimated signal as the desired signal; otherwise the data processing module 602 may treat the audio signal as the desired signal. The voice post-processing operation can be performed with reference to equation (11).
Different speech processing models can be trained due to different training data. In the present disclosure, the output of the speech processing model may be an ideal amplitude mask or an estimated target signal.
In the case where the output of the speech processing model is an estimated target signal, the data processing module 602 may obtain the estimated target signal by applying the obtained audio signal to the speech processing model and then obtain an ideal amplitude mask based on the estimated target signal and the audio signal. And performing voice post-processing operation according to the obtained ideal amplitude mask.
Optionally, the speech processing apparatus may further comprise a model training module 603. The model training module 603 may train the speech processing model based on the following methods: generating a mixed signal and a target signal based on at least one of the speech signal, the noise signal and the specific signal, inputting the mixed signal into a speech processing model to obtain estimation data; determining a loss function based on the target signal and the estimation data; the speech processing model is trained based on the loss function to adjust parameters of the speech processing model.
Alternatively, the model training module 603 may multiply a specific signal by a first gain to obtain a first signal and multiply a noise signal by a second gain to obtain a second signal, and generate a mixed signal by mixing the first signal, the second signal, and the speech signal. For example, the mixed signal may be generated using equation (1).
Alternatively, the model training module 603 may multiply the speech signal by a third gain to obtain a third signal, and generate the target signal by mixing the third signal and the second signal. For example, the target signal may be generated using equation (2).
Alternatively, the first gain may be determined based on a first predetermined signal-to-noise ratio, and the second gain may be determined based on a second signal-to-noise ratio and the first gain. Through the design, the generated mixed signal as training data and the target signal are more consistent with the requirements of practical application.
Since the speech processing models are different, the estimated data output by the speech processing models may be different. For example, the estimated data output by the speech processing model may be an estimated target signal or an estimated ideal amplitude mask.
In the case where the estimation data is an estimated ideal amplitude mask, model training module 603 may calculate a target ideal amplitude mask based on the target signal and the mixed signal, and then determine a loss function based on the target ideal amplitude mask and the estimation data. Here, the target ideal amplitude mask may be an energy ratio of the target signal to the mix signal.
FIG. 7 is a block diagram of a model training apparatus according to an embodiment of the present disclosure. Referring to FIG. 7, model training apparatus 700 may include a data generation module 701 and a data training module 702. Each module in the model training apparatus 700 may be implemented by one or more modules, and the name of the corresponding module may vary according to the type of the module. In various embodiments, some modules in model training apparatus 700 may be omitted, or additional modules may also be included. Furthermore, modules/elements according to various embodiments of the present disclosure may be combined to form a single entity, and thus may equivalently perform the functions of the respective modules/elements prior to combination.
Unlike single-purpose speech enhancement or speech noise reduction, the training data is divided into three categories at the model design stage, so the mixed data has three input sources: speech (requiring enhancement), music (an audio type requiring neither enhancement nor suppression), and noise (requiring suppression).
The data generation module 701 may generate a mixed signal and a target signal based on at least one of a voice signal, a noise signal, and a specific signal. Specifically, the data generation module 701 may multiply a specific signal by a first gain to obtain a first signal and multiply a noise signal by a second gain to obtain a second signal, and generate a mixed signal by mixing the first signal, the second signal, and the voice signal. For example, a mixed signal as shown in equation (1) may be generated.
The data generation module 701 may multiply the voice signal by a third gain to obtain a third signal, and generate the target signal by mixing the third signal and the second signal. For example, a target signal as shown in equation (2) may be generated.
Here, the first gain may be determined based on a first predetermined signal-to-noise ratio, and the second gain may be determined based on a second signal-to-noise ratio and the first gain. Through the design, the generated mixed signal as training data and the target signal are more consistent with the requirements of practical application. For example, equations (3) and (4) may be utilized to determine gain values for different signals.
The data training module 702 may input the mixed signal into a speech processing model (such as a deep neural network) to obtain estimated data, determine a loss function based on the target signal and the estimated data, train the speech processing model based on the loss function to adjust parameters of the speech processing model.
According to embodiments of the present disclosure, different training data may be utilized to obtain different speech processing models. Assuming the trained model outputs the target signal, the data training module 702 inputs the time-frequency-domain mixed signal Mix(n, k) into the deep neural network DNN, which outputs the estimated target signal Tar_est(n, k). A loss function is then constructed based on the target signal Tar(n, k) and the estimated target signal Tar_est(n, k), and the DNN is iteratively optimized on this loss until convergence, completing the training stage and yielding the speech processing model.
Assuming the trained model outputs an ideal amplitude mask, the data training module 702 may calculate a target ideal amplitude mask based on the target signal Tar(n, k) and the mixed signal Mix(n, k); it then inputs the time-frequency-domain mixed signal Mix(n, k) into the deep neural network DNN, which outputs an estimated ideal amplitude mask IRM_est(n, k), and a loss function is determined based on the target ideal amplitude mask IRM_obj(n, k) and the estimated ideal amplitude mask IRM_est(n, k). Here, the target ideal amplitude mask may be an energy ratio of the target signal to the mixed signal.
According to an embodiment of the present disclosure, an electronic device may be provided. Fig. 8 is a block diagram of an electronic device 800 that may include at least one memory 802 and at least one processor 801, the at least one memory 802 storing a set of computer-executable instructions that, when executed by the at least one processor 801, perform a method of speech processing or a method of training a speech processing model according to an embodiment of the disclosure, according to an embodiment of the disclosure.
The processor 801 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 801 may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The memory 802, which is a kind of storage medium, may include an operating system, a data storage module, a network communication module, a user interface module, a video playback parameter determination program, and a database.
The memory 802 may be integrated with the processor 801, for example, a RAM or flash memory may be disposed within an integrated circuit microprocessor or the like. Further, memory 802 may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory and the processor may be operatively coupled or may communicate with each other, such as through an I/O port, a network connection, etc., so that the processor can read files stored in the memory.
Further, the electronic device 800 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 800 may be connected to each other via a bus and/or a network.
By way of example, the electronic device 800 may be a PC computer, tablet device, personal digital assistant, smart phone, or other device capable of executing the set of instructions described above. Here, the electronic device 800 need not be a single electronic device, but can be any collection of devices or circuits that can execute the above instructions (or sets of instructions) either individually or in combination. The electronic device 800 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with local or remote (e.g., via wireless transmission).
Those skilled in the art will appreciate that the configuration shown in fig. 8 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform a speech processing method or a training method of a speech processing model according to the present disclosure. Examples of the computer-readable storage medium herein include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, hard disk drive (HDD), solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a solid-state disk, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the computer-readable storage medium described above can run in an environment deployed in computer equipment such as a client, a host, a proxy device, or a server. Further, in one example, the computer program and any associated data, data files, and data structures are distributed across a networked computer system such that they are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an embodiment of the present disclosure, there may also be provided a computer program product, in which instructions are executable by a processor of a computer device to perform the above-mentioned speech processing method or training method of a speech processing model.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for training a speech processing model, the method comprising:
generating a mixed signal and a target signal based on at least one of a speech signal, a noise signal, and a specific signal;
inputting the mixed signal into a speech processing model to obtain estimation data;
determining a loss function based on the target signal and the estimation data;
training the speech processing model based on the loss function to adjust parameters of the speech processing model.
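
As a purely illustrative aid (not part of the claims), the following sketch shows one way the training loop of claim 1 could look in practice. The GRU mask-estimation network, the MSE loss, and the Adam optimizer are assumptions chosen only for demonstration; the claim does not prescribe a particular architecture, loss, or optimizer.

# Illustrative sketch only -- not part of the claims. The GRU mask-estimation
# network, MSE loss, and Adam optimizer are assumptions for demonstration.
import torch
import torch.nn as nn

class SpeechProcessingModel(nn.Module):
    """Hypothetical network mapping a mixed magnitude spectrogram to a mask."""
    def __init__(self, n_bins=257, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_bins, hidden, batch_first=True)
        self.out = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

    def forward(self, mixed_mag):                 # (batch, frames, bins)
        h, _ = self.rnn(mixed_mag)
        return self.out(h)                        # estimation data: estimated mask

model = SpeechProcessingModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def training_step(mixed_mag, target_mask):
    """One update: input the mixed signal, compare the estimate with the target, adjust parameters."""
    estimated = model(mixed_mag)
    loss = loss_fn(estimated, target_mask)        # loss based on the target and the estimation data
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In this sketch the estimation data is an estimated ideal amplitude mask, which is one of the two alternatives named in claim 3; an estimated target signal could be trained in the same way.
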
2. The method of claim 1, wherein the step of generating the mixed signal based on at least one of the speech signal, the noise signal, and the specific signal comprises:
multiplying the specific signal by a first gain to obtain a first signal and multiplying the noise signal by a second gain to obtain a second signal;
generating the mixed signal by mixing the first signal, the second signal, and the speech signal,
wherein the first gain is determined based on a first predetermined signal-to-noise ratio, and the second gain is determined based on a second signal-to-noise ratio and the first gain,
wherein the step of generating the target signal based on at least one of the speech signal, the noise signal, and the specific signal comprises:
multiplying the speech signal by a third gain to obtain a third signal;
generating the target signal by mixing the third signal and the second signal.
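
For illustration only, a minimal sketch of the gain-based mixing in claim 2 follows. The claim states only that the gains are determined based on the respective signal-to-noise ratios; the energy-ratio formula below is one common choice and is an assumption, as is the literal reading of the target mixture as the third signal plus the second signal.

# Illustrative sketch only. The gain formula and default SNR values are assumptions.
import numpy as np

def energy(x):
    return np.sum(x ** 2) + 1e-12

def gain_for_snr(reference, signal, snr_db):
    """Gain scaling `signal` so the reference-to-signal energy ratio equals snr_db."""
    return np.sqrt(energy(reference) / (energy(signal) * 10 ** (snr_db / 10)))

def make_training_pair(speech, noise, specific, first_snr_db=10.0, second_snr_db=5.0, third_gain=1.0):
    first_gain = gain_for_snr(speech, specific, first_snr_db)                # from the first predetermined SNR
    first = first_gain * specific                                            # first signal
    second_gain = gain_for_snr(first_gain * specific, noise, second_snr_db)  # from the second SNR and the first gain
    second = second_gain * noise                                             # second signal
    mixed = first + second + speech                                          # mixed signal
    third = third_gain * speech                                              # third signal
    target = third + second                                                  # target signal as worded in the claim
    return mixed, target
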
3. The method of claim 1, wherein the estimation data is an estimated target signal or an estimated ideal amplitude mask,
wherein the ideal amplitude mask is related to the signal energy,
wherein, in the case that the estimation data is an estimated ideal amplitude mask, the step of determining a loss function based on the target signal and the estimation data comprises:
calculating a target ideal amplitude mask based on the target signal and the mixed signal;
determining a loss function based on the target ideal amplitude mask and the estimation data,
wherein the target ideal amplitude mask is an amplitude ratio of the target signal to the mixed signal in a time-frequency domain.
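
A minimal sketch of the target ideal amplitude mask described in claim 3, computed as the magnitude ratio of the target and mixed signals in the time-frequency domain. The Hann-windowed framed FFT front end and the mean-squared-error loss are assumptions used only to make the example concrete.

# Illustrative sketch only. The STFT parameters and MSE loss are assumptions;
# the claim requires only the amplitude ratio in a time-frequency domain.
import numpy as np

def stft_mag(x, n_fft=512, hop=256):
    """Magnitude spectrogram (frames x bins) from a simple Hann-windowed framed FFT."""
    window = np.hanning(n_fft)
    frames = [np.abs(np.fft.rfft(window * x[s:s + n_fft]))
              for s in range(0, len(x) - n_fft + 1, hop)]
    return np.array(frames)

def target_ideal_amplitude_mask(target, mixed, eps=1e-8):
    """Amplitude ratio |Target(t, f)| / |Mixed(t, f)| of the target and mixed signals."""
    return stft_mag(target) / (stft_mag(mixed) + eps)

def mask_loss(estimated_mask, target_mask):
    """Loss between the estimated ideal amplitude mask and the target ideal amplitude mask."""
    return float(np.mean((estimated_mask - target_mask) ** 2))
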
4. A method of speech processing, the method comprising:
acquiring an audio signal, wherein the audio signal comprises at least one of a speech signal, a noise signal, and a specific signal, and the specific signal belongs to an audio type that needs to be neither enhanced nor suppressed;
obtaining an ideal amplitude mask based on the audio signal using a speech processing model trained by the training method of any one of claims 1 to 3; and
performing different processing on the audio signal according to the magnitude of the ideal amplitude mask to obtain a desired signal.
5. The method of claim 4, wherein the step of performing different processing on the audio signal according to the magnitude of the ideal amplitude mask to obtain the desired signal comprises:
comparing the ideal amplitude mask with a predetermined threshold to determine whether the desired signal is obtained based on an estimated signal, the estimated signal resulting from multiplying the audio signal by the ideal amplitude mask,
wherein if the ideal amplitude mask is greater than the predetermined threshold, multiplying the estimated signal by a user-defined gain to obtain the desired signal; otherwise, taking the audio signal as the desired signal; or
wherein if the ideal amplitude mask is less than the predetermined threshold, taking the estimated signal as the desired signal; otherwise, taking the audio signal as the desired signal; or
wherein if the ideal amplitude mask is greater than the predetermined threshold, multiplying the estimated signal by a user-defined gain to obtain the desired signal; if the ideal amplitude mask is less than the predetermined threshold, taking the estimated signal as the desired signal; otherwise, taking the audio signal as the desired signal.
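
A hedged sketch of the third alternative in claim 5. Treating the comparison element-wise over time-frequency bins, and the particular threshold and user-defined gain values, are assumptions not fixed by the claim.

# Illustrative sketch of the third alternative in claim 5, applied per
# time-frequency bin; the threshold and user-defined gain values are assumptions.
import numpy as np

def process_with_mask(audio_mag, mask, threshold=0.5, user_gain=1.5):
    """Choose, per bin, between a boosted estimate, the plain estimate, and pass-through."""
    estimated = audio_mag * mask                       # estimated signal
    return np.where(mask > threshold,
                    estimated * user_gain,             # mask above threshold: boost the estimate
                    np.where(mask < threshold,
                             estimated,                # mask below threshold: keep the estimate
                             audio_mag))               # otherwise: keep the audio signal
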
6. The method of claim 4, wherein the output of the speech processing model is the ideal amplitude mask or an estimated target signal,
wherein, in a case where the output of the speech processing model is the estimated target signal, the step of obtaining the ideal amplitude mask comprises:
obtaining the estimated target signal by applying the audio signal to the speech processing model;
obtaining the ideal amplitude mask based on the estimated target signal and the audio signal.
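
Finally, a small sketch of claim 6: when the model outputs an estimated target signal rather than a mask, the ideal amplitude mask can be recovered as the magnitude ratio of the estimated target signal to the audio signal. The frame-wise FFT helper below is an assumption standing in for whichever time-frequency transform an implementation actually uses.

# Illustrative sketch of claim 6; the framed FFT helper and its parameters are assumptions.
import numpy as np

def frame_magnitudes(x, n_fft=512, hop=256):
    """Magnitude spectra of Hann-windowed frames (frames x bins)."""
    win = np.hanning(n_fft)
    return np.array([np.abs(np.fft.rfft(win * x[s:s + n_fft]))
                     for s in range(0, len(x) - n_fft + 1, hop)])

def mask_from_estimated_target(estimated_target, audio, eps=1e-8):
    """Ideal amplitude mask derived from the estimated target signal and the audio signal."""
    return frame_magnitudes(estimated_target) / (frame_magnitudes(audio) + eps)
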
7. An apparatus for training a speech processing model, the apparatus comprising:
a data generation module configured to generate a mixed signal and a target signal based on at least one of a speech signal, a noise signal, and a specific signal; and
a data training module configured to: input the mixed signal into a speech processing model to obtain estimation data; determine a loss function based on the target signal and the estimation data; and train the speech processing model based on the loss function to adjust parameters of the speech processing model.
8. A speech processing apparatus, characterized in that the apparatus comprises:
a data acquisition module configured to acquire an audio signal, wherein the audio signal includes at least one of a speech signal, a noise signal, and a specific signal, the specific signal belonging to an audio type that needs to be neither enhanced nor suppressed;
a data processing module configured to:
obtain an ideal amplitude mask based on the audio signal using a speech processing model trained by the training method of any one of claims 1 to 3; and
perform different processing on the audio signal according to the magnitude of the ideal amplitude mask to obtain a desired signal.
9. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the training method of any one of claims 1 to 3 or the speech processing method of any one of claims 4 to 6.
10. A computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the training method of any one of claims 1 to 3 or the speech processing method of any one of claims 4 to 6.
