WO2023093477A1 - Speech enhancement model training method and apparatus, storage medium, and device - Google Patents

Info

Publication number
WO2023093477A1
Authority
WO
WIPO (PCT)
Prior art keywords
impulse response
training
room impulse
data
training samples
Application number
PCT/CN2022/129232
Other languages
French (fr)
Chinese (zh)
Inventor
刘荣
Original Assignee
广州视源电子科技股份有限公司
广州视源人工智能创新研究院有限公司
Application filed by 广州视源电子科技股份有限公司, 广州视源人工智能创新研究院有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Definitions

  • The present disclosure relates to the technical field of speech enhancement, and in particular to a speech enhancement model training method and apparatus, a readable storage medium, and an electronic device.
  • Speech enhancement aims to improve speech quality by computational means, extracting as pure a speech signal as possible from a speech signal that contains interfering sounds.
  • Common speech enhancement approaches include: speech enhancement algorithms based on spectral subtraction, on wavelet analysis, on Kalman filtering, on signal subspaces, and on the auditory masking effect, as well as training methods for speech enhancement models based on independent component analysis and on neural networks.
  • However, when a neural network is used to train the speech enhancement model, it inevitably causes considerable damage to the speech signal, thereby degrading speech quality. Furthermore, when pure speech signals and room impulse responses are used to generate training data, the target signal can lead the input signal, which may make the model unrealizable and therefore untrainable.
  • The purpose of the present disclosure is to provide a speech enhancement model training method and apparatus, a readable storage medium, and an electronic device that overcome, at least to a certain extent, the shortcomings of the related art: early reverberation is not preserved when reverberation is removed, model complexity is relatively high, and the speech is damaged more.
  • A method for training a speech enhancement model, comprising: obtaining N groups of training samples, wherein the i-th group of training samples includes the i-th training data and the i-th target data, N is a positive integer, and i is a positive integer not greater than N; and training the speech enhancement model with the N groups of training samples; wherein obtaining the i-th group of training samples includes: obtaining the i-th room impulse response and the i-th pure speech data, and processing them to obtain the i-th training data in the i-th group of training samples; determining the i-th control curve according to the i-th room impulse response, and multiplying the i-th control curve by the i-th room impulse response to obtain the i'th room impulse response, where i' is a positive integer not greater than N; and convolving the i-th pure speech data with the i'th room impulse response to obtain the i-th target data in the i-th group of training samples.
  • The i-th room impulse response includes a plurality of sampling points; the i-th control curve includes a plurality of control values, and the number of control values equals the number of sampling points of the i-th room impulse response; when the sampling points at the tail of the i'th room impulse response are zero or very small in absolute value, tail truncation may optionally be applied.
  • Determining the i-th control curve according to the i-th room impulse response includes: determining the absolute value of each sampling point in the i-th room impulse response, where the absolute values may contain multiple maxima of equal value; determining the sampling point corresponding to the first maximum of the absolute values as the peak position point; and determining the control value in the i-th control curve that corresponds to the peak position point of the i-th room impulse response as the main control value of the i-th control curve.
  • Determining the i-th control curve according to the i-th room impulse response further includes: adjusting the control values of the i-th control curve through parameters to determine the i-th control curve, wherein no control value other than the main control value is greater than the main control value.
  • Processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples includes: convolving the i-th pure speech data with the i-th room impulse response to obtain the i-th training data.
  • Alternatively, the processing includes: convolving the i-th pure speech data with the i-th room impulse response and adding the result to noise data to obtain the i-th training data in the i-th group of training samples.
  • Alternatively, the processing includes: adding the i-th pure speech data to the noise data and convolving the sum with the i-th room impulse response to obtain the i-th training data in the i-th group of training samples.
  • A training device for a speech enhancement model, including: an acquisition module configured to acquire N groups of training samples, wherein the i-th group of training samples includes the i-th training data and the i-th target data, N is a positive integer, and i is a positive integer not greater than N; and a training module configured to train the speech enhancement model with the N groups of training samples; wherein the acquisition module is specifically configured to: acquire the i-th room impulse response and the i-th pure speech data, and process them to obtain the i-th training data; determine the i-th control curve according to the i-th room impulse response, and multiply the i-th control curve by the i-th room impulse response to obtain the i'th room impulse response, where i' is a positive integer not greater than N; and convolve the i-th pure speech data with the i'th room impulse response to obtain the i-th target data in the i-th group of training samples.
  • A terminal, including: a memory, a processor, and a computer program stored in the memory and operable on the processor; when the processor executes the computer program, the method for training a speech enhancement model in the first aspect is implemented.
  • a readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the method for training a speech enhancement model in the first aspect is implemented.
  • the speech enhancement model training method and device, readable storage medium and electronic equipment provided by the embodiments of the present disclosure have the following technical effects:
  • N groups of training samples are obtained, where the i-th group of training samples includes the i-th training data and the i-th target data, and the speech enhancement model is trained with the N groups of training samples. Obtaining the i-th group of training samples includes: obtaining the i-th room impulse response and the i-th pure speech data, and processing them to obtain the i-th training data in the i-th group of training samples; determining the i-th control curve according to the i-th room impulse response, and multiplying the i-th control curve by the i-th room impulse response to obtain the i'th room impulse response, where i' is a positive integer not greater than N; and convolving the i-th pure speech data with the i'th room impulse response to obtain the i-th target data in the i-th group of training samples.
  • FIG. 1 schematically shows a flowchart of a method for training a speech enhancement model provided by an embodiment of the present disclosure
  • Figure 2 shows a schematic diagram of a speech enhancement model
  • Fig. 3 shows a schematic diagram of the control curve
  • Fig. 4 shows a schematic diagram of the i-th room impulse response
  • Fig. 5 shows a schematic diagram of the i'th room impulse response
  • FIG. 6 schematically shows a structural diagram of a training device for a speech enhancement model provided by an embodiment of the present disclosure
  • Fig. 7 schematically shows a block diagram of an electronic device provided by an embodiment of the present disclosure.
  • FIG. 1 schematically shows a flowchart of a method for training a speech enhancement model according to an exemplary embodiment of the present disclosure.
  • this method comprises the following steps:
  • S101: acquire the i-th group of training samples, obtaining N groups of training samples in total, wherein the i-th group of training samples includes the i-th training data and the i-th target data, N is a positive integer, and i is a positive integer not greater than N.
  • Obtaining the i-th group of training samples includes: obtaining the i-th room impulse response and the i-th pure speech data, and processing them to obtain the i-th training data in the i-th group of training samples.
  • S12 Determine the i-th control curve according to the i-th room impulse response, and multiply the i-th control curve by the i-th room impulse response to obtain the i'th room impulse response, where i' is a positive integer not greater than N ;
  • the i-th pure speech data is convolved with the i'th room impulse response to obtain the i-th target data in the i-th group of training samples.
  • multiplying the i-th control curve by the i-th room impulse response refers to multiplying the control value of each control point in the i-th control curve by the corresponding sampling point in the i-th room impulse response.
  • Where a control value is 1, the multiplication can be omitted.
  • For any sampling point in the i-th room impulse response that is not operated on, or does not need to be operated on, the control value of the corresponding control point can be regarded as 1.
  • Control points with a control value of 1 can therefore be omitted; that is, the number of points in the i-th control curve and the number of samples in the i-th room impulse response need not be strictly equal, only equal in a broad sense.
  • When the sampling points at the tail are zero or very small in absolute value, tail truncation may optionally be applied, and the number of sampling points is reduced correspondingly after truncation.
  • If the i'th room impulse response is obtained through tail truncation, this can be regarded as setting the control values of the corresponding tail control points in the i-th control curve to 0.
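The point-by-point multiplication and its relationship to tail truncation can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `apply_control_curve` and the toy sample values are assumptions introduced here.

```python
import numpy as np

def apply_control_curve(rir, curve):
    """Multiply each impulse-response sample by the control value at the same index."""
    rir = np.asarray(rir, dtype=float)
    curve = np.asarray(curve, dtype=float)
    if rir.shape != curve.shape:
        raise ValueError("control curve and impulse response must have the same length")
    return rir * curve

# Toy 6-sample impulse response (hypothetical values).
rir = np.array([0.0, 1.0, 0.5, -0.25, 0.125, -0.0625])
# Control value 1 leaves a sample untouched; 0 at the tail removes it.
curve = np.array([1.0, 1.0, 1.0, 0.5, 0.0, 0.0])

shaped = apply_control_curve(rir, curve)
# Dropping the zeroed tail gives the same non-zero samples in a shorter array.
truncated = shaped[:4]
```

Setting tail control values to 0 and then discarding those samples leaves the same effective impulse response, which is why truncation can be read as a special case of the control curve.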
  • N groups of training samples are obtained, where the i-th group of training samples includes the i-th training data and the i-th target data, N is a positive integer, and i is a positive integer not greater than N; the speech enhancement model is trained with the N groups of training samples. Obtaining the i-th group of training samples includes: obtaining the i-th room impulse response and the i-th pure speech data, and processing them to obtain the i-th training data in the i-th group of training samples; determining the i-th control curve according to the i-th room impulse response, and multiplying the i-th control curve by the i-th room impulse response to obtain the i'th room impulse response, where i' is a positive integer not greater than N; and convolving the i-th pure speech data with the i'th room impulse response to obtain the i-th target data in the i-th group of training samples.
  • Fig. 2 shows a schematic diagram of a speech enhancement model.
  • the specific implementation manner of each step included in the embodiment shown in FIG. 1 will be described in detail below with reference to FIG. 2 .
  • The present embodiment provides a training method for a speech enhancement model, implemented as follows:
  • The i-th group of training samples is obtained, giving N groups of training samples in total, wherein the i-th group of training samples includes the i-th training data and the i-th target data, N is a positive integer, and i is a positive integer not greater than N.
  • The speech enhancement model is trained with the N groups of training samples.
  • Obtaining the i-th group of training samples includes: obtaining the i-th room impulse response and the i-th pure speech data, and processing them to obtain the i-th training data in the i-th group of training samples.
  • The i-th pure speech data and the i-th room impulse response are randomly selected from the sample library, and convolving the i-th pure speech data with the i-th room impulse response yields the i-th training data.
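Generating the training input by convolution can be sketched as below. The in-memory "sample library", the 16 kHz rate, and the function name `make_training_data` are assumptions made for illustration; random noise stands in for real speech recordings.

```python
import numpy as np

def make_training_data(clean, rir):
    """Reverberant model input: pure speech convolved with the full room impulse response."""
    return np.convolve(clean, rir, mode="full")

rng = np.random.default_rng(0)

# Hypothetical sample library: random signals stand in for pure speech clips.
speech_library = [rng.standard_normal(16000) for _ in range(3)]

# Hypothetical impulse responses: a direct path plus one reflection each.
rir_a = np.zeros(4000)
rir_a[40], rir_a[400] = 1.0, 0.3
rir_b = np.zeros(4000)
rir_b[25], rir_b[900] = 1.0, 0.2
rir_library = [rir_a, rir_b]

# Randomly draw one speech clip and one impulse response, then convolve.
clean = speech_library[rng.integers(len(speech_library))]
rir = rir_library[rng.integers(len(rir_library))]
training_data = make_training_data(clean, rir)
```

With `mode="full"` the output has `len(clean) + len(rir) - 1` samples, so the reverberant tail after the last speech sample is kept.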
  • the i-th training data and the i-th target data obtained in the following embodiments will be used together as the i-th group of training samples.
  • The i-th control curve is determined according to the i-th room impulse response, and the i-th control curve is multiplied by the i-th room impulse response to obtain the i'th room impulse response, where i' is a positive integer not greater than N; the i-th pure speech data is convolved with the i'th room impulse response to obtain the i-th target data in the i-th group of training samples.
  • The i-th group of training samples, which includes the i-th training data and the i-th target data, is obtained first; repeating this for each i yields the N groups of training samples.
  • the method for obtaining the i-th target data is as follows: as shown in FIG. 2 , after the i-th pure voice data and the i-th room impulse response are obtained, S12 is executed.
  • For the specific implementation, refer to the following examples:
  • After the i-th pure speech data and the i-th room impulse response are obtained, the first sampling point with the largest absolute value in the i-th room impulse response is determined; this is the peak position point. The i-th control curve also needs a point corresponding to this peak. Therefore, in the i-th control curve, the control point corresponding to the peak position point of the i-th room impulse response is recorded as the main control point p, and the control value at the main control point p is recorded as the main control value.
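Locating the peak position point can be sketched in one line with NumPy, since `np.argmax` returns the first index when several entries tie for the maximum. The function name `peak_position` and the sample values are assumptions for illustration.

```python
import numpy as np

def peak_position(rir):
    """Index of the first sample whose absolute value equals the maximum."""
    return int(np.argmax(np.abs(rir)))

# Two samples tie at |0.8|; the earlier one (index 2) is the peak position point.
rir = np.array([0.0, 0.01, -0.8, 0.3, 0.8, -0.1])
p = peak_position(rir)
```

The index `p` is then used as the main control point of the control curve.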
  • Reverb includes early and late reflections.
  • early reflections can have the effect of enhancing the speech signal, so they can be preserved. Therefore, parameters can be used to control the duration and intensity of the reverberation in the room impulse response, so as to selectively retain the reverberation in different stages.
  • The i-th control curve can be generated from preset parameters and multiplied by the value of each sampling point in the i-th room impulse response to generate an i'th room impulse response that contains only direct sound and/or early reverberation.
  • Since sound propagation takes a certain time, and the direct sound and early reflections are stronger than the late reflections, the sampling point amplitude of the i'th room impulse response is zero or very small at first, then increases rapidly, reaches a maximum at the direct sound and/or early reflections, and gradually decreases during the late reflections. Therefore, the control values of the i-th control curve can follow the same pattern.
  • The main control point p is located near the direct sound or the early reflections, and no control value in the i-th control curve is greater than the control value at the main control point p.
  • Parameters can be used to adjust the control values in the i-th control curve, so as to control the amplitude, position, and duration of the direct sound, early reverberation, and late reverberation in the i'th room impulse response.
  • The control values in the i-th control curve can be adjusted as required; that is, the shape of the i-th control curve can be changed freely and is not limited in this embodiment. The examples given below are illustrative only and do not represent all feasible solutions; for example, a bell curve similar to a normal (Gaussian) distribution can also be used.
  • Figure 3 shows a schematic diagram of the control curve.
  • Since the sampling point amplitude of the i'th room impulse response is zero or very small at first, this part can be ignored and its control values set to 0, as shown by the dotted line on the left side of Fig. 3.
  • The magnitude of the i'th room impulse response before the main control point p shows a rapid upward trend.
  • The i-th control curve can be denoted m. The corresponding section of m is recorded as m1, and its control values can be adjusted through parameters; for example, m1 is preset as an exponentially rising curve, as shown in Fig. 3.
  • Next, the i'th room impulse response enters the stage of direct sound and the earliest reflections. The corresponding section of the control curve m is recorded as m2, and the amplitude of this section of the impulse response is left unchanged: all control values in m2 are set to 1, so m2 is a horizontal straight line, as shown in Fig. 3.
  • The i-th control curve is also used to control the early reverberation and the late reverberation.
  • The section of the curve that controls the late reverberation is recorded as m4, and its control parameters can also be adjusted; for example, m4 is preset as an exponentially decaying curve, as shown in Fig. 3.
  • The duration of m3 can be set to the length of the early reverberation that needs to be preserved, and the duration of m4 can be set to the required reverberation time, such as T60.
  • The corresponding control values in the control curve can also be set directly to 0, as shown by the dotted line on the right in Fig. 3. Following the embodiment described above, a curve as shown in Fig. 3 is generated.
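One way to realize a piecewise curve of this kind is sketched below. Several choices are assumptions made here rather than statements from the source: the function name `build_control_curve`, the segment lengths, holding m3 at 1, and using a 60 dB span for the exponential rise and decay (chosen so that an m4 lasting T60 samples ends 60 dB down).

```python
import numpy as np

def build_control_curve(n, p, m1_len, m2_len, m3_len, m4_len, floor_db=-60.0):
    """Piecewise control curve: 0 | exponential rise (m1) | 1 (m2, m3) | exponential decay (m4) | 0."""
    curve = np.zeros(n)
    floor = 10.0 ** (floor_db / 20.0)

    # m1: rises exponentially from `floor` towards 1 over the samples just before p.
    start = max(p - m1_len, 0)
    k = p - start
    if k > 0:
        curve[start:p] = floor ** (1.0 - np.arange(k) / k)

    # m2 and m3: control value 1 from the main control point p onwards (assumed flat).
    hold_end = min(p + m2_len + m3_len, n)
    curve[p:hold_end] = 1.0

    # m4: decays exponentially, reaching `floor` after m4_len samples.
    decay_end = min(hold_end + m4_len, n)
    k = decay_end - hold_end
    if k > 0:
        curve[hold_end:decay_end] = floor ** (np.arange(1, k + 1) / m4_len)

    # Samples outside m1..m4 keep the control value 0.
    return curve

# Hypothetical lengths in samples for a 16 kHz impulse response.
curve = build_control_curve(n=8000, p=100, m1_len=20, m2_len=80, m3_len=720, m4_len=1600)
```

In this sketch the control value at `p` is 1 and no other control value exceeds it, matching the constraint on the main control value; swapping the exponential segments for linear ramps only changes the two slice assignments.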
  • the parameters of the m1 segment can also be set to all 0s (or the duration of the m1 segment is set to 0), or m1 can be changed to a straight line that changes linearly.
  • the parameters of the m4 segment can also be set to all 0 (or set the duration of the m4 segment to 0), or change m4 to a linear attenuation line.
  • the m2 segment and the m3 segment can also be set as a straight line with a linear change, or as an exponential curve.
  • The shapes and lengths of m1, m2, m3, and m4 (that is, the control over the amplitude and duration of the different stages of the i'th room impulse response) can be adjusted according to actual conditions; there are no fixed rules.
  • the control value corresponding to the main control point p still needs to be kept as the maximum value in the i-th control curve, and the control values of other control points are not greater than the main control value.
  • the segmentation of the i-th control curve is not limited to the four segments m1, m2, m3, and m4, and the number of segments can be increased or decreased arbitrarily, which is not limited here.
  • The control value of each control point in the i-th control curve is multiplied by the value of the corresponding sampling point in the i-th room impulse response to obtain the controlled i'th room impulse response.
  • Samples at the tail whose values are zero or very small in amplitude can be deleted (truncated), and the number of samples in the processed i'th room impulse response decreases accordingly. According to the convolution theorem, such processing does not affect the result of the convolution, and it saves storage and computing resources.
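The effect of tail truncation on the convolution result can be checked numerically. In this sketch the function name `truncate_tail` and the threshold parameter are assumptions; with exactly zero tail samples the outputs agree sample for sample, while a non-zero threshold would make the result an approximation.

```python
import numpy as np

def truncate_tail(rir, threshold=0.0):
    """Drop trailing samples whose magnitude does not exceed `threshold`."""
    significant = np.flatnonzero(np.abs(rir) > threshold)
    if significant.size == 0:
        return rir[:0]
    return rir[: significant[-1] + 1]

rng = np.random.default_rng(1)
clean = rng.standard_normal(1000)

# Impulse response with four meaningful samples followed by 500 zeros.
rir = np.concatenate([[0.0, 1.0, 0.4, -0.2], np.zeros(500)])
short_rir = truncate_tail(rir)

full_out = np.convolve(clean, rir)
short_out = np.convolve(clean, short_rir)
```

Here `full_out` is simply `short_out` followed by zeros, so the shorter impulse response gives the same signal with fewer multiply-adds and less storage.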
  • Fig. 4 shows a schematic diagram of the impulse response of the i-th room
  • Fig. 5 shows a schematic diagram of the impulse response of the i'th room.
  • the impulse response of the i-th room has a long duration of reverberation, and there is a certain amount of noise at the head and tail.
  • the late reverberation in the impulse response of the i-th room can be removed, and the early reverberation can be retained.
  • The obtained i'th room impulse response is shown in Fig. 5: only the direct sound and early reverberation are retained in the i'th room impulse response, and the head and tail noise of the i-th room impulse response are removed.
  • The i-th pure speech data is convolved with the i'th room impulse response to obtain the i-th target data.
  • the i-th target data is the target label data during speech enhancement model training.
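Putting the pieces together, one training pair can be sketched as below. The synthetic impulse response, the rectangular control curve, and the function name `make_pair` are assumptions for illustration only. Because both signals are derived from the same impulse response, they share the same direct-path delay, so the target does not lead the input.

```python
import numpy as np

def make_pair(clean, rir, curve):
    """Return (training data, target data) for one pure speech clip."""
    training = np.convolve(clean, rir)          # full reverberation
    target = np.convolve(clean, rir * curve)    # direct sound and early reverberation only
    return training, target

rng = np.random.default_rng(2)
clean = rng.standard_normal(2000)

# Synthetic impulse response: 50-sample propagation delay, direct sound, early and late parts.
n = 800
rir = np.zeros(n)
rir[50] = 1.0
rir[60:200] = 0.1 * rng.standard_normal(140) * np.exp(-np.arange(140) / 50.0)
rir[200:] = 0.02 * rng.standard_normal(600) * np.exp(-np.arange(600) / 200.0)

# Hypothetical control curve: keep samples 50..199, discard everything else.
curve = np.zeros(n)
curve[50:200] = 1.0

training, target = make_pair(clean, rir, curve)
```

Both outputs have the same length and both start with 50 silent samples, so each target frame is aligned with the input frame it was derived from.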
  • The i-th pure speech data and the i-th room impulse response are convolved to obtain the i-th training data in the i-th group of training samples; together with the i-th target data, it forms the i-th group of training samples, which is input into the neural network, as shown in Fig. 2.
  • the i-th group of training samples is used to continuously train the speech enhancement model until the output of the model can achieve excellent speech enhancement results.
  • Obtaining the i-th training data in the i-th group of training samples may also include: convolving the i-th pure speech data with the i-th room impulse response and adding the result to noise data to obtain the i-th training data; or adding the i-th pure speech data to the noise data and convolving the sum with the i-th room impulse response to obtain the i-th training data.
  • Whether to add noise depends on whether the model needs noise reduction capability.
  • the amplitude of the i-th pure speech data and noise data can also be randomly scaled.
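The two noise-mixing orders and the random amplitude scaling can be sketched as follows. The function names, the uniform gain range in decibels, and the use of random signals in place of real recordings are all assumptions introduced for this example.

```python
import numpy as np

def reverb_then_noise(clean, rir, noise):
    """Convolve speech with the impulse response, then add (non-reverberant) noise."""
    wet = np.convolve(clean, rir)
    return wet + noise[: len(wet)]

def noise_then_reverb(clean, rir, noise):
    """Add noise to the speech first, so the noise is reverberated as well."""
    return np.convolve(clean + noise[: len(clean)], rir)

def random_gain(x, rng, low_db=-10.0, high_db=10.0):
    """Scale a signal by a gain drawn uniformly in decibels (hypothetical range)."""
    gain_db = rng.uniform(low_db, high_db)
    return x * 10.0 ** (gain_db / 20.0)

rng = np.random.default_rng(3)
clean = random_gain(rng.standard_normal(1000), rng)
noise = random_gain(0.1 * rng.standard_normal(1200), rng)
rir = np.zeros(100)
rir[5], rir[40] = 1.0, 0.3

out_a = reverb_then_noise(clean, rir, noise)
out_b = noise_then_reverb(clean, rir, noise)
```

The first variant models noise picked up at the microphone without room colouring, while the second models a noise source inside the same room; which one fits depends on the deployment scenario.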
  • The signal enhancement method used is not limited; for example, it can be an ideal binary mask (IBM), an ideal ratio mask (IRM), an ideal amplitude mask (IAM), a phase-sensitive mask (PSM), a complex ideal ratio mask (cIRM), or the like.
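The source does not give formulas for these masks. As a rough illustration only, the sketch below uses commonly cited textbook definitions of the ideal binary mask and the ideal ratio mask on magnitude spectrograms; the function names, the 0 dB threshold, and the toy 2x2 "spectrograms" are assumptions, and "interference" stands for whatever component the model is meant to suppress.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log of zero

def ideal_binary_mask(target_mag, interference_mag, threshold_db=0.0):
    """1 where the local target-to-interference ratio exceeds the threshold, else 0."""
    ratio_db = 20.0 * np.log10((target_mag + EPS) / (interference_mag + EPS))
    return (ratio_db > threshold_db).astype(np.float64)

def ideal_ratio_mask(target_mag, interference_mag):
    """Square root of the target's share of the total energy in each time-frequency bin."""
    t2 = target_mag ** 2
    n2 = interference_mag ** 2
    return np.sqrt(t2 / (t2 + n2 + EPS))

# Toy magnitude "spectrograms" (frequency bins x frames), hypothetical values.
target_mag = np.array([[1.0, 0.1], [3.0, 0.0]])
interference_mag = np.array([[0.5, 1.0], [4.0, 1.0]])

ibm = ideal_binary_mask(target_mag, interference_mag)
irm = ideal_ratio_mask(target_mag, interference_mag)
```

A binary mask makes a hard keep-or-discard decision per bin, whereas the ratio mask takes values between 0 and 1 and usually produces fewer audible artifacts.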
  • The above-mentioned neural network can be a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a long short-term memory network (LSTM), etc., and is not limited here.
  • FIG. 6 shows a structural diagram of an apparatus for training a speech enhancement model according to an exemplary embodiment of the present disclosure.
  • the speech enhancement model training device shown in this figure can be implemented as all or part of the terminal through software, hardware or a combination of the two, and can also be integrated on the server as an independent module.
  • the training device 600 of the speech enhancement model includes: an acquisition module 601 and a training module 602, wherein:
  • An acquisition module 601 configured to: acquire N groups of training samples, wherein the i-th group of training samples includes: the i-th training data and the i-th target data, where N is a positive integer, and i is a positive integer not greater than N;
  • the training module 602 is used for: training the speech enhancement model by N groups of training samples;
  • The acquisition module 601 is specifically configured to: acquire the i-th room impulse response and the i-th pure speech data, and process them to obtain the i-th training data in the i-th group of training samples; determine the i-th control curve according to the i-th room impulse response, and multiply the i-th control curve by the i-th room impulse response to obtain the i'th room impulse response, where i' is a positive integer not greater than N; and convolve the i-th pure speech data with the i'th room impulse response to obtain the i-th target data in the i-th group of training samples.
  • The division into the functional modules above is used only as an example for illustration.
  • In actual applications, the above functions can be assigned to different functional modules as needed; that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • The speech enhancement model training device and the speech enhancement model training method provided by the above embodiments belong to the same concept; for details not disclosed in the device embodiments, refer to the embodiments of the speech enhancement model training method described above in this disclosure, which are not repeated here.
  • An embodiment of the present disclosure also provides a readable storage medium on which a computer program is stored, and when the program is executed by a processor, the steps of the method in any one of the foregoing embodiments are implemented.
  • the readable storage medium may include but is not limited to any type of disk, including floppy disk, optical disk, DVD, CD-ROM, microdrive and magneto-optical disk, ROM, RAM, EPROM, EEPROM, DRAM, VRAM, flash memory device, Magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of medium or device suitable for storing instructions and/or data.
  • An embodiment of the present disclosure also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and operable on the processor; when the processor executes the program, the steps of the method in any of the foregoing embodiments are implemented.
  • an electronic device 700 includes: a processor 701 and a memory 702 .
  • the processor 701 is a control center of the computer system, and may be a processor of a physical machine or a processor of a virtual machine.
  • the processor 701 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • The processor 701 can be implemented in at least one hardware form among digital signal processing (DSP), field-programmable gate array (FPGA), and programmable logic array (PLA).
  • The processor 701 may also include a main processor and a coprocessor: the main processor, also called a central processing unit (CPU), processes data in the wake-up state; the coprocessor is a low-power processor that processes data in the standby state.
  • the processor 701 is specifically configured to: acquire N groups of training samples, where the i-th group of training samples includes i-th training data and i-th target data, N is a positive integer, and i is a positive integer not greater than N; and train a speech enhancement model with the N groups of training samples. Acquiring the i-th group of training samples includes: acquiring an i-th room impulse response and i-th pure speech data, and processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples; determining an i-th control curve according to the i-th room impulse response, and multiplying the i-th control curve by the i-th room impulse response to obtain an i'-th room impulse response, where i' is a positive integer not greater than N; and convolving the i-th pure speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples.
  • the i-th room impulse response includes a plurality of sampling points; the i-th control curve includes a plurality of control values, the number of which equals the number of sampling points of the i-th room impulse response; when the sampling points at the tail of the i'-th room impulse response are zero or very small in absolute value, tail truncation may optionally be performed.
  • determining the i-th control curve according to the i-th room impulse response includes: determining the absolute value of each sampling point in the i-th room impulse response, where the absolute values contain multiple maximum values of equal magnitude; determining, in the i-th room impulse response, the sampling point corresponding to the first maximum of the absolute values as the peak position point; and determining the control value in the i-th control curve that corresponds to the peak position point of the i-th room impulse response as the main control value of the i-th control curve.
  • determining the i-th control curve according to the i-th room impulse response further includes: adjusting the control values of the i-th control curve through parameters to determine the i-th control curve, where no control value other than the main control value is greater than the main control value.
  • processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples includes: convolving the i-th pure speech data with the i-th room impulse response to obtain the i-th training data in the i-th group of training samples.
  • alternatively, processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples includes: convolving the i-th pure speech data with the i-th room impulse response and adding noise data to the result to obtain the i-th training data in the i-th group of training samples.
  • alternatively, processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples includes: adding noise data to the i-th pure speech data and convolving the sum with the i-th room impulse response to obtain the i-th training data in the i-th group of training samples.
  • Memory 702 may include one or more readable storage media, which may be non-transitory.
  • the memory 702 may also include high-speed random access memory, and non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices.
  • non-transitory readable storage medium in the memory 702 is used to store at least one instruction, and the at least one instruction is used to be executed by the processor 701 to implement the methods in the embodiments of the present disclosure.
  • the electronic device 700 further includes: a peripheral device interface 703 and at least one peripheral device.
  • the processor 701, the memory 702, and the peripheral device interface 703 may be connected through buses or signal lines.
  • Each peripheral device can be connected to the peripheral device interface 703 through a bus, a signal line or a circuit board.
  • the peripheral device includes: at least one of a display screen 704 , a camera 707 and an audio circuit 706 .
  • the peripheral device interface 703 may be used to connect at least one peripheral device related to input/output (Input/Output, I/O) to the processor 701 and the memory 702 .
  • in some embodiments of the present disclosure, the processor 701, the memory 702 and the peripheral device interface 703 are integrated on the same chip or circuit board; in some other embodiments of the present disclosure, any one or two of the processor 701, the memory 702 and the peripheral device interface 703 may be implemented on a separate chip or circuit board. Embodiments of the present disclosure do not specifically limit this.
  • the display screen 704 is used to display a user interface (User Interface, UI).
  • the UI can include graphics, text, icons, video, and any combination thereof.
  • the display screen 704 also has the ability to collect touch signals on or above the surface of the display screen 704 .
  • the touch signal can be input to the processor 701 as a control signal for processing.
  • the display screen 704 can also be used to provide virtual buttons and/or virtual keyboards, also called soft buttons and/or soft keyboards.
  • in some embodiments of the present disclosure, there may be one display screen 704, arranged on the front panel of the electronic device 700; in other embodiments of the present disclosure, there may be at least two display screens 704, arranged on different surfaces of the electronic device 700 or in a folded design; in some other embodiments of the present disclosure, the display screen 704 may be a flexible display screen arranged on a curved or folded surface of the electronic device 700. The display screen 704 may even be given a non-rectangular irregular shape, that is, a special-shaped screen.
  • the display screen 704 can be made of liquid crystal display (Liquid Crystal Display, LCD), organic light-emitting diode (Organic Light-Emitting Diode, OLED) and other materials.
  • the camera 707 is used to collect images or videos.
  • the camera 707 includes a front camera and a rear camera.
  • the front camera is set on the front panel of the electronic device, and the rear camera is set on the back of the electronic device.
  • there are at least two rear cameras, each of which is any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that, for example, the main camera and the depth-of-field camera can be fused to provide a background blur function.
  • the camera 707 may also include a flash.
  • the flash can be a single-color temperature flash or a dual-color temperature flash. Dual color temperature flash refers to the combination of warm light flash and cold light flash, which can be used for light compensation under different color temperatures.
  • Audio circuitry 706 may include a microphone and speakers.
  • the microphone is used to collect sound waves of the user and the environment, and convert the sound waves into electrical signals and input them to the processor 701 for processing.
  • the microphone can also be an array microphone or an omnidirectional collection microphone.
  • the power supply 707 is used to supply power to various components in the electronic device 700 .
  • Power source 707 may be AC, DC, disposable or rechargeable batteries.
  • the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery.
  • a wired rechargeable battery is a battery charged through a wired line
  • a wireless rechargeable battery is a battery charged through a wireless coil.
  • the rechargeable battery can also be used to support fast charging technology.
  • the structural block diagram of the electronic device shown in the embodiment of the present disclosure does not constitute a limitation on the electronic device 700, and the electronic device 700 may include more or fewer components than shown in the figure, or combine some components, or adopt a different component arrangement .
  • connection can be fixed connection, detachable connection, or integral connection; “connection” can be directly or indirectly through an intermediary.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Filters That Use Time-Delay Elements (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

A speech enhancement model training method and apparatus, a storage medium, and a device. The method comprises: acquiring N groups of training samples, an i-th group of training samples comprising: i-th training data and i-th target data (S101); training a speech enhancement model by means of the N groups of training samples (S102); acquiring the i-th group of training samples comprising: acquiring an i-th room impulse response and i-th pure speech data, and processing the i-th room impulse response and the i-th pure speech data to obtain i-th training data in the i-th group of training samples (S11); and determining an i-th control curve according to the i-th room impulse response, and multiplying the i-th control curve by the i-th room impulse response to obtain an i'-th room impulse response, the i-th pure speech data being convolved with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples (S12).

Description

Speech enhancement model training method and apparatus, storage medium, and device
This application claims priority to Chinese Patent Application No. 2021114275387, entitled "Speech enhancement model training method and apparatus, storage medium, and device", filed with the China Patent Office on November 25, 2021, the entire contents of which are incorporated herein by reference.
Technical field
The present disclosure relates to the technical field of speech enhancement, and in particular to a speech enhancement model training method and apparatus, a readable storage medium, and an electronic device.
Background
The purpose of speech enhancement is to improve speech quality by means of various algorithms, extracting a speech signal that is as pure as possible from a speech signal containing interfering sounds. Commonly used speech enhancement approaches include those based on spectral subtraction, wavelet analysis, Kalman filtering, signal subspaces, the auditory masking effect, independent component analysis, and neural networks.
However, when a neural network is used to train a speech enhancement model, the speech signal is inevitably damaged to a considerable extent, which degrades speech quality. In addition, when training data is generated from pure speech signals and room impulse responses, the target signal may lead the input signal, which can ultimately make the model unrealizable and therefore untrainable.
It should be noted that the information disclosed in the Background section above is only intended to enhance understanding of the background of the present disclosure, and may therefore include information that does not constitute prior art known to a person of ordinary skill in the art.
Summary
An object of the present disclosure is to provide a speech enhancement model training method and apparatus, a readable storage medium, and an electronic device that overcome, at least to some extent, the shortcomings of the related art, in which early reverberation is not preserved during dereverberation, model complexity is high, and the speech is heavily damaged.
Other features and advantages of the present disclosure will become apparent from the following detailed description, or may be learned in part through practice of the present disclosure.
According to a first aspect of the present disclosure, a method for training a speech enhancement model is provided. The method includes: acquiring N groups of training samples, where the i-th group of training samples includes i-th training data and i-th target data, N is a positive integer, and i is a positive integer not greater than N; and training a speech enhancement model with the N groups of training samples. Acquiring the i-th group of training samples includes: acquiring an i-th room impulse response and i-th pure speech data, and processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples; determining an i-th control curve according to the i-th room impulse response, and multiplying the i-th control curve by the i-th room impulse response to obtain an i'-th room impulse response, where i' is a positive integer not greater than N; and convolving the i-th pure speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples.
In an embodiment of the present disclosure, the i-th room impulse response includes a plurality of sampling points; the i-th control curve includes a plurality of control values, the number of which equals the number of sampling points of the i-th room impulse response; and when the sampling points at the tail of the i'-th room impulse response are zero or very small in absolute value, tail truncation may optionally be performed.
In an embodiment of the present disclosure, determining the i-th control curve according to the i-th room impulse response includes: determining the absolute value of each sampling point in the i-th room impulse response, where the absolute values contain multiple maximum values of equal magnitude; determining, in the i-th room impulse response, the sampling point corresponding to the first maximum of the absolute values as the peak position point; and determining the control value in the i-th control curve that corresponds to the peak position point of the i-th room impulse response as the main control value of the i-th control curve.
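The peak-position rule in this embodiment can be illustrated with a minimal sketch. The function name `peak_position` and the toy impulse response are assumptions made for illustration and do not come from the disclosure; the only behaviour taken from the text is that, when several samples share the maximum absolute value, the first one is chosen.

```python
def peak_position(rir):
    """Return the index of the first sample whose absolute value is maximal.

    `rir` is a sequence of impulse-response samples (hypothetical input).
    """
    magnitudes = [abs(sample) for sample in rir]
    largest = max(magnitudes)
    # list.index returns the first match, so ties resolve to the earliest sample.
    return magnitudes.index(largest)


# Toy impulse response in which indices 2 and 4 share the maximum magnitude.
rir = [0.0, 0.1, -0.8, 0.3, 0.8, -0.2]
peak = peak_position(rir)
print(peak)
```

With array libraries the same rule is usually available directly; for example, NumPy's `numpy.argmax` also returns the first occurrence when the maximum is repeated.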
In an embodiment of the present disclosure, determining the i-th control curve according to the i-th room impulse response further includes: adjusting the control values of the i-th control curve through parameters to determine the i-th control curve, where no control value other than the main control value is greater than the main control value.
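The disclosure does not fix the shape of the control curve beyond the constraint that no control value exceeds the main control value. The sketch below shows one possible parameterisation under stated assumptions: the curve holds the main control value up to a few samples past the peak and then decays exponentially. The parameters `hold`, `decay`, and `main_value` are hypothetical and chosen only to show how parameters could shape the curve.

```python
import math


def control_curve(length, peak, hold=2, decay=0.5, main_value=1.0):
    """Build a control curve with `length` control values.

    Assumed shape (not specified by the disclosure): the main control value is
    kept from the start up to `hold` samples after the peak, and later values
    decay exponentially at rate `decay`.
    """
    curve = []
    for n in range(length):
        if n <= peak + hold:
            curve.append(main_value)
        else:
            curve.append(main_value * math.exp(-decay * (n - peak - hold)))
    return curve


curve = control_curve(length=8, peak=2)
# Constraint stated in the text: no control value exceeds the main control value.
assert all(value <= curve[2] for value in curve)
```

A faster decay suppresses more late reverberation in the target, while a longer hold keeps more early reflections; both choices are tuning decisions outside the text.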
In an embodiment of the present disclosure, processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples includes: convolving the i-th pure speech data with the i-th room impulse response to obtain the i-th training data in the i-th group of training samples.
In an embodiment of the present disclosure, processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples includes: convolving the i-th pure speech data with the i-th room impulse response and adding noise data to the result to obtain the i-th training data in the i-th group of training samples.
In an embodiment of the present disclosure, processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples includes: adding noise data to the i-th pure speech data and convolving the sum with the i-th room impulse response to obtain the i-th training data in the i-th group of training samples.
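The three ways of producing training data described in these embodiments can be compared side by side. The sketch below uses plain Python lists and a direct-form convolution to stay self-contained; the helper names `convolve` and `add`, and the toy signals, are illustrative assumptions. In practice a library routine such as NumPy's `numpy.convolve` would typically replace the hand-written loop.

```python
def convolve(signal, kernel):
    """Full linear convolution of two sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def add(a, b):
    """Element-wise sum, treating the shorter sequence as zero-padded."""
    n = max(len(a), len(b))
    a = list(a) + [0.0] * (n - len(a))
    b = list(b) + [0.0] * (n - len(b))
    return [x + y for x, y in zip(a, b)]


# Toy signals (hypothetical values).
speech = [1.0, 0.5, -0.25]
rir = [1.0, 0.0, 0.5]
noise = [0.1, -0.1, 0.1]

# Variant 1: reverberant speech only.
reverberant = convolve(speech, rir)
# Variant 2: noise is added after the convolution, so the noise stays dry.
reverberant_plus_noise = add(reverberant, noise)
# Variant 3: noise is added before the convolution, so the noise is reverberated too.
noisy_then_reverberant = convolve(add(speech, noise), rir)
```

Because convolution is linear, variant 3 equals the reverberant speech plus the noise convolved with the same room impulse response, which is what distinguishes it from variant 2.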
According to a second aspect of the present disclosure, an apparatus for training a speech enhancement model is provided. The apparatus includes: an acquisition module configured to acquire N groups of training samples, where the i-th group of training samples includes i-th training data and i-th target data, N is a positive integer, and i is a positive integer not greater than N; and a training module configured to train a speech enhancement model with the N groups of training samples. The acquisition module is specifically configured to: acquire an i-th room impulse response and i-th pure speech data, and process the i-th room impulse response and the i-th pure speech data to obtain the i-th training data; determine an i-th control curve according to the i-th room impulse response, and multiply the i-th control curve by the i-th room impulse response to obtain an i'-th room impulse response, where i' is a positive integer not greater than N; and convolve the i-th pure speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples.
According to a third aspect of the present disclosure, a terminal is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the speech enhancement model training method of the first aspect.
According to a fourth aspect of the present disclosure, a readable storage medium is provided, on which a computer program is stored, where the computer program, when executed by a processor, implements the speech enhancement model training method of the first aspect.
The speech enhancement model training method and apparatus, readable storage medium, and electronic device provided by the embodiments of the present disclosure have the following technical effects:
In the training process of the speech enhancement model provided by the embodiments of the present disclosure, N groups of training samples are acquired, where the i-th group of training samples includes i-th training data and i-th target data, and the speech enhancement model is trained with the N groups of training samples. Acquiring the i-th group of training samples includes: acquiring an i-th room impulse response and i-th pure speech data, and processing them to obtain the i-th training data in the i-th group of training samples; determining an i-th control curve according to the i-th room impulse response, and multiplying the i-th control curve by the i-th room impulse response to obtain an i'-th room impulse response, where i' is a positive integer not greater than N; and convolving the i-th pure speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples. The present application can reduce distortion after signal processing, solve the alignment problem between training data and target data, and preserve early reverberation.
It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure. The drawings described below are merely some embodiments of the present disclosure; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 schematically shows a flowchart of a speech enhancement model training method provided by an embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of the speech enhancement model;
FIG. 3 shows a schematic diagram of the control curve;
FIG. 4 shows a schematic diagram of the i-th room impulse response;
FIG. 5 shows a schematic diagram of the i'-th room impulse response;
FIG. 6 schematically shows a structural diagram of a speech enhancement model training apparatus provided by an embodiment of the present disclosure;
FIG. 7 schematically shows a block diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed description
To make the objectives, technical solutions, and advantages of the present disclosure clearer, the embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings.
When the following description refers to the accompanying drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as recited in the appended claims.
In the description of the present disclosure, it should be understood that the terms "first", "second", and the like are used for descriptive purposes only and should not be understood as indicating or implying relative importance. A person of ordinary skill in the art can understand the specific meanings of these terms in the present disclosure according to the specific situation. In addition, in the description of the present disclosure, unless otherwise specified, "a plurality of" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean: A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the objects before and after it.
The steps of the speech enhancement model training method in this exemplary implementation are described in more detail below with reference to the accompanying drawings and embodiments.
FIG. 1 schematically shows a flowchart of a speech enhancement model training method according to an exemplary embodiment of the present disclosure. Referring to FIG. 1, the method includes the following steps:
S101: acquire the i-th group of training samples, obtaining N groups of training samples in total, where the i-th group of training samples includes i-th training data and i-th target data, N is a positive integer, and i is a positive integer not greater than N.
S102: train the speech enhancement model with the N groups of training samples.
S11: acquiring the i-th group of training samples includes: acquiring an i-th room impulse response and i-th pure speech data, and processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples.
S12: determine an i-th control curve according to the i-th room impulse response, and multiply the i-th control curve by the i-th room impulse response to obtain an i'-th room impulse response, where i' is a positive integer not greater than N; convolve the i-th pure speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples.
Here, multiplying the i-th control curve by the i-th room impulse response means multiplying each sampling point of the i-th room impulse response by the control value of the corresponding control point in the i-th control curve. For control points whose control value is 1, the multiplication may simply be omitted. Likewise, for sampling points of the i-th room impulse response on which no operation is performed or needed, the control value of the corresponding control point can be regarded as 1. In an implementation, control points with a control value of 1 may be omitted; that is, the number of points in the i-th control curve and the number of samples in the i-th room impulse response need not be strictly equal, only equal in a broad sense.
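The point-wise multiplication, and the remark that control values of 1 may be left out, can be sketched as follows. The function `apply_control_curve` and its `start` offset are hypothetical names introduced for illustration; samples not covered by the supplied curve are left unchanged, which corresponds to an implicit control value of 1.

```python
def apply_control_curve(rir, curve, start=0):
    """Multiply rir[start : start + len(curve)] by `curve`, sample by sample.

    Samples outside that range are copied unchanged, i.e. their control value
    is treated as 1 and the multiplication is skipped.
    """
    shaped = list(rir)
    for offset, value in enumerate(curve):
        shaped[start + offset] = rir[start + offset] * value
    return shaped


rir = [0.0, 0.9, 0.4, -0.3, 0.2, -0.1]

# A full-length curve with one control value per sampling point ...
full_curve = [1.0, 1.0, 1.0, 0.5, 0.25, 0.0]
# ... and the same curve with its leading ones omitted.
tail_curve = [0.5, 0.25, 0.0]

shaped_full = apply_control_curve(rir, full_curve)
shaped_sparse = apply_control_curve(rir, tail_curve, start=3)
assert shaped_full == shaped_sparse
```

Both calls give the same shaped impulse response, which is the sense in which the two lengths are "equal in a broad sense".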
According to the convolution theorem, when the sampling points at the tail of the i'-th room impulse response are zero or very small in absolute value, tail truncation may optionally be performed, after which the number of sampling points decreases accordingly. When the i'-th room impulse response is obtained by truncating the tail of the i-th room impulse response, this can be regarded as setting the control values of the corresponding tail control points in the i-th control curve to 0.
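Tail truncation can be sketched as dropping trailing samples whose magnitude falls below a threshold. The function name `truncate_tail` and the threshold value are assumptions; the disclosure only says the tail samples are zero or "very small" without giving a number.

```python
def truncate_tail(rir, threshold=1e-4):
    """Drop trailing samples whose absolute value is at most `threshold`.

    `threshold` is a hypothetical tuning value, not one given in the text.
    Leading and interior samples are never removed, so onset timing is kept.
    """
    end = len(rir)
    while end > 0 and abs(rir[end - 1]) <= threshold:
        end -= 1
    return rir[:end]


shaped_rir = [0.0, 0.9, 0.4, -0.15, 0.00005, 0.0, 0.0]
trimmed = truncate_tail(shaped_rir)
print(len(shaped_rir), len(trimmed))
```

Only the end of the response is shortened, so the convolution output becomes shorter without shifting the direct sound.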
In the training process of the speech enhancement model provided by the embodiment shown in FIG. 1, N groups of training samples are acquired, where the i-th group of training samples includes i-th training data and i-th target data, N is a positive integer, and i is a positive integer not greater than N, and the speech enhancement model is trained with the N groups of training samples. Acquiring the i-th group of training samples includes: acquiring an i-th room impulse response and i-th pure speech data, and processing them to obtain the i-th training data in the i-th group of training samples; determining an i-th control curve according to the i-th room impulse response, and multiplying the i-th control curve by the i-th room impulse response to obtain an i'-th room impulse response, where i' is a positive integer not greater than N; and convolving the i-th pure speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples. The present application can reduce distortion after signal processing, solve the alignment problem between training data and target data, and retain a certain degree of early reverberation as required.
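The way one training pair is assembled, and why the target stays time-aligned with the input, can be seen in a small end-to-end sketch. The names `make_pair` and `first_nonzero`, the toy signals, and the particular control curve are assumptions for illustration. Because the target is produced from the same room impulse response with only its later samples attenuated, the direct-sound delay is identical in both signals.

```python
def convolve(signal, kernel):
    """Full linear convolution of two sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def make_pair(speech, rir, curve):
    """Return (training data, target data) for one group of training samples."""
    assert len(rir) == len(curve), "one control value per sampling point"
    shaped_rir = [h * c for h, c in zip(rir, curve)]  # the i'-th impulse response
    training = convolve(speech, rir)
    target = convolve(speech, shaped_rir)
    return training, target


def first_nonzero(samples):
    """Index of the first non-zero sample, used here as a crude onset marker."""
    return next(i for i, value in enumerate(samples) if value != 0.0)


speech = [1.0, 0.5]
rir = [0.0, 0.0, 1.0, 0.5, 0.25]   # direct sound arrives at sample 2
curve = [1.0, 1.0, 1.0, 0.5, 0.0]  # keep the direct sound, attenuate the tail

training, target = make_pair(speech, rir, curve)
# Both signals start at the same sample, so the target does not lead the input.
assert first_nonzero(training) == first_nonzero(target)
```

Convolving the pure speech directly (without the leading delay of the room impulse response) would instead place the target two samples ahead of the input in this toy case, which is the misalignment the background section describes.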
Fig. 2 shows a schematic diagram of the speech enhancement model. The specific implementation of each step of the embodiment shown in Fig. 1 is described in detail below with reference to Fig. 2. This embodiment provides a training method for a speech enhancement model, implemented as follows:
In S101, the i-th group of training samples is obtained for each i, yielding N groups of training samples, where the i-th group of training samples includes the i-th training data and the i-th target data, N is a positive integer, and i is a positive integer not greater than N.
In S102, the speech enhancement model is trained with the N groups of training samples.
In S11, obtaining the i-th group of training samples includes: obtaining the i-th room impulse response and the i-th clean speech data, and processing the i-th room impulse response and the i-th clean speech data to obtain the i-th training data in the i-th group of training samples.
In an exemplary embodiment, as shown in Fig. 2, the i-th clean speech data and the i-th room impulse response are drawn at random from a sample library and convolved with each other to obtain the i-th training data. The i-th training data, together with the i-th target data obtained in the embodiments described below, forms the i-th group of training samples.
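As an illustration only, the convolution step that produces the i-th training data might be sketched as follows. The function name `make_training_data`, the 16 kHz sampling rate, and the synthetic stand-in signals are assumptions made for this sketch and do not come from the disclosure.

```python
import numpy as np

def make_training_data(clean, rir):
    """Convolve clean speech with a room impulse response (full linear convolution)."""
    clean = np.asarray(clean, dtype=np.float64)
    rir = np.asarray(rir, dtype=np.float64)
    return np.convolve(clean, rir, mode="full")

rng = np.random.default_rng(0)

# Stand-in for 1 s of clean speech at an assumed 16 kHz sampling rate.
clean_i = rng.standard_normal(16000)

# Synthetic stand-in for a room impulse response: 40 samples of propagation
# delay, a direct-sound peak, then an exponentially decaying noisy tail.
decay = np.exp(-np.arange(4000) / 800.0)
rir_i = 0.3 * rng.standard_normal(4000) * decay
rir_i[:40] = 0.0
rir_i[40] = 1.0

train_i = make_training_data(clean_i, rir_i)
```

The full convolution is `len(clean) + len(rir) - 1` samples long, and the leading zeros of the impulse response show up as the same delay in the reverberant signal.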
In S12, the i-th control curve is determined according to the i-th room impulse response, and the i-th control curve is multiplied by the i-th room impulse response to obtain the i'-th room impulse response, where i' is a positive integer not greater than N; the i-th clean speech data is convolved with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples.
In an exemplary embodiment, the i-th group of training samples is obtained first; it includes the i-th training data and the i-th target data, and the groups obtained for each i together form the N groups of training samples. The i-th target data is obtained as follows: as shown in Fig. 2, after the i-th clean speech data and the i-th room impulse response are obtained, S12 is executed. For the specific procedure, refer to the following embodiments:
In an exemplary embodiment, after the i-th clean speech data and the i-th room impulse response are obtained, the first sampling point with the largest absolute value in the i-th room impulse response is determined; this is the peak position point. The i-th control curve must also contain a peak point corresponding to it. Therefore, the control point in the i-th control curve that corresponds to the peak position point of the i-th room impulse response is denoted the main control point p, and the control value at p is denoted the main control value.
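A minimal sketch of locating the peak position point is given below. The helper name `peak_position` and the toy impulse response are assumptions for illustration. `np.argmax` returns the first index when several samples share the maximum, which matches the "first sampling point with the largest absolute value" rule.

```python
import numpy as np

def peak_position(rir):
    """Index of the first sample whose absolute value is the maximum."""
    return int(np.argmax(np.abs(rir)))

# Toy impulse response in which |-0.9| and |0.9| tie for the maximum.
rir = np.array([0.0, 0.01, -0.9, 0.4, 0.9, -0.2])
p = peak_position(rir)  # index of the main control point in the control curve
```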
Reverberation comprises early reflections and late reflections. In dereverberation, early reflections can reinforce the speech signal and may therefore be retained. Parameters can thus be used to control the duration and strength of the reverberation in the room impulse response, so that reverberation from different stages is retained selectively.
In an exemplary embodiment, the i-th control curve can be generated from preset parameters and multiplied by the value of each sampling point in the i-th room impulse response to produce an i'-th room impulse response that contains only the direct sound and/or only early reverberation. Because sound takes time to propagate, and because the direct sound and early reflections are stronger than the late reflections, the sampling-point amplitude of the i'-th room impulse response is initially zero or very small, then rises rapidly, reaches its maximum during the direct sound and/or early reflections, and gradually decreases during the late reflections. The control values of the i-th control curve can follow the same pattern: the main control point p lies near the direct sound or early reflections, and no control value in the i-th control curve exceeds the main control value at p.
In an exemplary embodiment, the control values of the i-th control curve can be adjusted through parameters in order to control the amplitude, position, and duration of the direct sound, early reverberation, and late reverberation in the i'-th room impulse response. The control values can be adjusted as needed; that is, the shape of the i-th control curve can vary freely and is not limited in this embodiment. Only two or three examples are given below for illustration, and they do not represent all feasible schemes. For example, a bell-shaped curve resembling a normal (Gaussian) distribution may also be used.
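The bell-shaped alternative mentioned above could be sketched as follows, under assumptions. The function `gaussian_control_curve` and its two width parameters `sigma_left` and `sigma_right` (in samples) are hypothetical; the disclosure only says that a Gaussian-like curve may be used. The curve is centered on the main control point p so that the main control value is the maximum.

```python
import numpy as np

def gaussian_control_curve(n, p, sigma_left, sigma_right):
    """Bell-shaped control curve of length n that peaks (value 1) at index p.

    sigma_left and sigma_right are assumed widths, in samples, before and
    after the peak; a wider right side retains more of the reverberation.
    """
    k = np.arange(n) - p                      # offset from the main control point
    sigma = np.where(k < 0, sigma_left, sigma_right)
    return np.exp(-0.5 * (k / sigma) ** 2)

curve = gaussian_control_curve(n=200, p=50, sigma_left=5.0, sigma_right=40.0)
```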
Fig. 3 shows a schematic diagram of the control curve. In an exemplary embodiment, because the sampling-point amplitude of the i'-th room impulse response is initially zero or very small, it can be ignored, and the control values for this portion may be set to 0, as shown by the dashed portion on the left of Fig. 3. Before the main control point p, the amplitude of the i'-th room impulse response rises rapidly. To control this portion, denote the i-th control curve as m and one segment of it as m1, and adjust the control-value parameters, for example by presetting m1 as an exponentially rising curve, as shown in Fig. 3. The i'-th room impulse response then enters the earlier stage of the direct sound and early reflections. Denote the corresponding segment of the control curve m as m2; the amplitude of this portion of the i'-th room impulse response may be left unchanged by setting all control values in m2 to 1, so that m2 is a straight line, as shown in Fig. 3.
In an exemplary embodiment, after the main control point p, the i-th control curve controls the early and late reverberation. Denote the segment of the i-th control curve that controls early reverberation as m3; its amplitude is likewise left unchanged by setting all control values in m3 to 1, so that m3 is a straight line, as shown in Fig. 3. Denote the segment that controls late reverberation as m4; its control parameters can also be adjusted, for example by presetting m4 as an exponentially decaying curve, as shown in Fig. 3. The duration of m3 can be set to the length of early reverberation to be retained, and the duration of m4 can be set to the desired reverberation time, for example T60. After m4, where the amplitude of the i'-th room impulse response has gradually decayed to 0, the corresponding control values can be set directly to 0, as shown by the dashed portion on the right of Fig. 3. The curve shown in Fig. 3 is generated according to the above embodiments.
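The four-segment curve described with reference to Fig. 3 might be generated as in the sketch below. The function `control_curve`, the segment lengths expressed in samples, and the `floor_db` parameter are assumptions for illustration. In particular, letting m1 and m4 span 60 dB is only one possible reading of the T60 remark; the disclosure does not fix the exponent.

```python
import numpy as np

def control_curve(n, p, len_m1, len_m2, len_m3, len_m4, floor_db=-60.0):
    """Piecewise control curve: 0 | m1 (exp. rise) | m2 = 1 | p | m3 = 1 | m4 (exp. decay) | 0.

    All lengths are in samples. floor_db is the assumed level at the outer
    ends of m1 and m4 relative to the flat segments.
    """
    k = np.arange(n) - p                       # offset from the main control point
    curve = np.zeros(n)
    floor = 10.0 ** (floor_db / 20.0)

    # m2 (before p), the main control point p itself, and m3 (after p): value 1.
    curve[(k >= -len_m2) & (k <= len_m3)] = 1.0

    # m1: exponential rise from `floor` up toward 1.
    m1 = (k >= -(len_m2 + len_m1)) & (k < -len_m2)
    if len_m1 > 0:
        curve[m1] = floor ** ((-len_m2 - k[m1]) / len_m1)

    # m4: exponential decay from just below 1 down to `floor`.
    m4 = (k > len_m3) & (k <= len_m3 + len_m4)
    if len_m4 > 0:
        curve[m4] = floor ** ((k[m4] - len_m3) / len_m4)

    return curve

curve = control_curve(n=100, p=20, len_m1=5, len_m2=3, len_m3=10, len_m4=30)
```

Setting `len_m1` or `len_m4` to 0 removes that segment entirely, and no control value exceeds the main control value of 1 at p.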
In an exemplary embodiment, the parameters of segment m1 may also be set to all zeros (or the duration of m1 may be set to 0), or m1 may be made a linearly varying straight line. The parameters of segment m4 may likewise be set to all zeros (or the duration of m4 may be set to 0), or m4 may be made a linearly decaying straight line. Segments m2 and m3 may also be set to linearly varying straight lines or to exponential curves.
In an exemplary embodiment, apart from the above examples, the shapes and lengths of m1, m2, m3, and m4 (that is, the control of the amplitude and duration of the different stages of the i'-th room impulse response) can be adjusted according to actual conditions; there is no fixed rule. However, to ensure the speech enhancement effect, the control value at the main control point p must remain the maximum of the i-th control curve, and the control values of all other control points must not exceed the main control value. In addition, the segmentation of the i-th control curve is not limited to the four segments m1, m2, m3, and m4; the number of segments can be increased or decreased arbitrarily, which is not limited here.
In an exemplary embodiment, after the i-th control curve is generated, the control value of each control point in the i-th control curve is multiplied by the value of the corresponding sampling point in the i-th room impulse response to obtain the controlled i'-th room impulse response.
In an exemplary embodiment, after the i'-th room impulse response is obtained, tail samples whose amplitude is zero or very small can be deleted (truncated), which reduces the number of samples in the i'-th room impulse response accordingly. According to the convolution theorem, such processing does not affect the convolution result (exactly so when the removed samples are zero, and only negligibly when they are very small), and it saves storage and computing resources.
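The multiplication and optional tail truncation can be sketched as follows. The helper names `apply_control` and `truncate_tail`, the threshold `eps`, and the toy signals are assumptions for illustration. The example also checks numerically that removing zero-valued tail samples leaves the convolution unchanged apart from trailing zeros.

```python
import numpy as np

def apply_control(rir, curve):
    """Element-wise product of the room impulse response and the control curve."""
    rir = np.asarray(rir, dtype=np.float64)
    curve = np.asarray(curve, dtype=np.float64)
    if rir.shape != curve.shape:
        raise ValueError("control curve and impulse response must have equal length")
    return rir * curve

def truncate_tail(rir, eps=0.0):
    """Drop trailing samples whose absolute value is <= eps (assumed threshold)."""
    keep = np.flatnonzero(np.abs(rir) > eps)
    if keep.size == 0:
        return rir[:1]            # keep one sample so the result stays non-empty
    return rir[: keep[-1] + 1]

rir = np.array([0.0, 0.2, 1.0, 0.5, 0.3, 0.1, 0.05])
curve = np.array([0.0, 1.0, 1.0, 1.0, 0.5, 0.0, 0.0])

rir_ctrl = apply_control(rir, curve)       # controlled response, same length as rir
rir_short = truncate_tail(rir_ctrl)        # zero-valued tail removed

speech = np.random.default_rng(0).standard_normal(50)
full = np.convolve(speech, rir_ctrl)
short = np.convolve(speech, rir_short)
```

With a non-zero `eps` the two results agree only approximately, so the threshold trades accuracy against storage and computation.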
Fig. 4 shows a schematic diagram of the i-th room impulse response, and Fig. 5 shows a schematic diagram of the i'-th room impulse response. The i-th room impulse response shown in Fig. 4 has a long reverberation duration and some noise at its head and tail. With the method described in the above embodiments, the late reverberation in the i-th room impulse response can be removed while the early reverberation is retained. After the i-th room impulse response is processed with the i-th control curve, the resulting i'-th room impulse response is as shown in Fig. 5: only the direct sound and early reverberation are retained, and the noise at the head and tail of the i-th room impulse response is removed.
In an exemplary embodiment, the i-th clean speech data is convolved with the i'-th room impulse response to obtain the i-th target data. The i-th target data serves as the target label data when training the speech enhancement model.
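A sketch of assembling one (training data, target data) pair is shown below. The function `make_pair` and the toy signals are assumptions. Zero-padding the target at the end when the i'-th response has been tail-truncated is a choice made for this sketch, not something the disclosure specifies; it keeps both signals the same length without shifting them. Because both signals come from the same clean speech, and the control curve leaves the peak position in place, the two stay time-aligned.

```python
import numpy as np

def make_pair(clean, rir, rir_prime):
    """Return (training data, target data) for one group of training samples."""
    if len(rir_prime) > len(rir):
        raise ValueError("controlled response must not be longer than the original")
    train = np.convolve(clean, rir)
    target = np.convolve(clean, rir_prime)
    # Pad only at the end so the direct-sound position is not shifted.
    target = np.pad(target, (0, len(train) - len(target)))
    return train, target

clean = np.array([1.0, 0.0, 0.0, 0.0])        # unit impulse as stand-in speech
rir = np.array([0.0, 1.0, 0.5, 0.25])         # original response
rir_prime = np.array([0.0, 1.0, 0.5])         # controlled, tail-truncated response

train, target = make_pair(clean, rir, rir_prime)
```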
In an exemplary embodiment, the i-th clean speech data is convolved with the i-th room impulse response to obtain the i-th training data of the i-th group of training samples, which, together with the i-th target data, is input to the neural network as the i-th group of training samples, as shown in Fig. 2. The speech enhancement model is trained repeatedly on these groups of training samples until its output achieves good speech enhancement results.
In an exemplary embodiment, the i-th training data input to the neural network may contain noise in addition to the reverberant speech produced by the room impulse response. For example, in the above embodiments, obtaining the i-th training data in the i-th group of training samples may further include: convolving the i-th clean speech data with the i-th room impulse response and adding noise data to the result to obtain the i-th training data; or adding the noise data to the i-th clean speech data and convolving the sum with the i-th room impulse response to obtain the i-th training data. Whether noise is added depends on whether the model needs noise reduction capability. In addition, to enrich the training samples, the amplitudes of the i-th clean speech data and the noise data may be scaled randomly.
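The two noise-mixing orders and the random amplitude scaling might look like the sketch below. All function names, the gain range, and the use of `np.resize` to repeat or trim the noise to the required length are assumptions for illustration. If the clean speech is scaled, the same scaled signal would presumably also be used when building the target data so that the pair stays consistent.

```python
import numpy as np

def random_gain(x, rng, low=0.1, high=1.0):
    """Scale a signal by a random gain drawn from an assumed range."""
    return rng.uniform(low, high) * x

def fit_length(x, n):
    """Repeat or trim x to exactly n samples (np.resize tiles the input)."""
    return np.resize(x, n)

def reverb_then_noise(clean, rir, noise):
    """Variant 1: convolve with the room impulse response, then add noise."""
    reverberant = np.convolve(clean, rir)
    return reverberant + fit_length(noise, len(reverberant))

def noise_then_reverb(clean, rir, noise):
    """Variant 2: add noise to the clean speech, then convolve the sum."""
    noisy = clean + fit_length(noise, len(clean))
    return np.convolve(noisy, rir)

rng = np.random.default_rng(0)
clean = random_gain(rng.standard_normal(800), rng)
noise = random_gain(rng.standard_normal(300), rng)
rir = np.array([0.0, 1.0, 0.5, 0.25])

train_a = reverb_then_noise(clean, rir, noise)
train_b = noise_then_reverb(clean, rir, noise)
```

In the second variant the noise is reverberated along with the speech, which models a noise source inside the same room; in the first variant the noise is added dry.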
In an exemplary embodiment, the signal enhancement method used to remove reverberation and noise is not limited. For example, it may be any of an Ideal Binary Mask (IBM), an Ideal Ratio Mask (IRM), an Ideal Amplitude Mask (IAM), a Phase-Shifting Mask (PSM), a Complex Ideal Ratio Mask (CIRM), and the like.
In an exemplary embodiment, the neural network may be a Deep Neural Network (DNN), a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Long Short-Term Memory network (LSTM), and so on, which is not limited here.
The following are apparatus embodiments of the present disclosure, which can be used to carry out the method embodiments of the present disclosure. For details not disclosed in the apparatus embodiments, refer to the method embodiments of the present disclosure.
Fig. 6 shows a structural diagram of a training apparatus for a speech enhancement model according to an exemplary embodiment of the present disclosure. Referring to Fig. 6, the training apparatus shown there can be implemented as all or part of a terminal in software, hardware, or a combination of the two, and can also be integrated on a server as an independent module.
In an exemplary embodiment, the training apparatus 600 for the speech enhancement model includes an acquisition module 601 and a training module 602, wherein:
the acquisition module 601 is configured to obtain N groups of training samples, where the i-th group of training samples includes the i-th training data and the i-th target data, N is a positive integer, and i is a positive integer not greater than N;
the training module 602 is configured to train the speech enhancement model with the N groups of training samples;
wherein the acquisition module 601 is specifically configured to: obtain the i-th room impulse response and the i-th clean speech data, and process the i-th room impulse response and the i-th clean speech data to obtain the i-th training data in the i-th group of training samples; determine the i-th control curve according to the i-th room impulse response, and multiply the i-th control curve by the i-th room impulse response to obtain the i'-th room impulse response, where i' is a positive integer not greater than N; and convolve the i-th clean speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples.
It should be noted that, when the training apparatus for the speech enhancement model provided by the above embodiments performs the training method, the division into the functional modules described above is only an example. In practical applications, the functions can be assigned to different functional modules as needed; that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the training apparatus and the training method for the speech enhancement model provided by the above embodiments belong to the same concept. For details not disclosed in the apparatus embodiments, refer to the embodiments of the training method described above; they are not repeated here.
The serial numbers of the above embodiments of the present disclosure are for description only and do not indicate the relative merits of the embodiments.
An embodiment of the present disclosure further provides a readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method of any of the foregoing embodiments. The readable storage medium may include, but is not limited to, any type of disk, including floppy disks, optical discs, DVDs, CD-ROMs, microdrives, and magneto-optical disks; ROM, RAM, EPROM, EEPROM, DRAM, VRAM, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs); or any type of medium or device suitable for storing instructions and/or data.
An embodiment of the present disclosure further provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the program, the processor implements the steps of the method of any of the above embodiments.
Fig. 7 schematically shows a structural diagram of an electronic device according to an exemplary embodiment of the present disclosure. Referring to Fig. 7, the electronic device 700 includes a processor 701 and a memory 702.
In the embodiments of the present disclosure, the processor 701 is the control center of the computer system and may be the processor of a physical machine or of a virtual machine. The processor 701 may include one or more processing cores, such as a 4-core or 8-core processor. The processor 701 may be implemented in at least one of the following hardware forms: Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 701 may also include a main processor and a coprocessor. The main processor, also called the Central Processing Unit (CPU), processes data in the awake state; the coprocessor is a low-power processor that processes data in the standby state.
In the embodiments of the present disclosure, the processor 701 is specifically configured to:
obtain N groups of training samples, where the i-th group of training samples includes the i-th training data and the i-th target data, N is a positive integer, and i is a positive integer not greater than N; and train the speech enhancement model with the N groups of training samples; wherein obtaining the i-th group of training samples includes: obtaining the i-th room impulse response and the i-th clean speech data, and processing the i-th room impulse response and the i-th clean speech data to obtain the i-th training data in the i-th group of training samples; determining the i-th control curve according to the i-th room impulse response, and multiplying the i-th control curve by the i-th room impulse response to obtain the i'-th room impulse response, where i' is a positive integer not greater than N; and convolving the i-th clean speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples.
Further, in an embodiment of the present disclosure, the i-th room impulse response includes a plurality of sampling points; the i-th control curve includes a plurality of control values, and the number of control values equals the number of sampling points of the i-th room impulse response; and when the tail sampling points of the i'-th room impulse response are zero or very small in absolute value, tail truncation may optionally be applied.
Optionally, determining the i-th control curve according to the i-th room impulse response includes: determining the absolute value of each sampling point in the i-th room impulse response, where the absolute values may contain several equal maximum values; determining, in the i-th room impulse response, the sampling point corresponding to the first of these maximum values as the peak position point; and determining the control value in the i-th control curve that corresponds to the peak position point of the i-th room impulse response as the main control value of the i-th control curve.
Optionally, determining the i-th control curve according to the i-th room impulse response further includes: adjusting the control values of the i-th control curve through parameters to determine the i-th control curve, where, apart from the main control point, the control values of all other control points are not greater than the main control value.
Optionally, processing the i-th room impulse response and the i-th clean speech data to obtain the i-th training data in the i-th group of training samples includes: convolving the i-th clean speech data with the i-th room impulse response to obtain the i-th training data in the i-th group of training samples.
Optionally, processing the i-th room impulse response and the i-th clean speech data to obtain the i-th training data in the i-th group of training samples includes: convolving the i-th clean speech data with the i-th room impulse response and adding noise data to the result to obtain the i-th training data in the i-th group of training samples.
Optionally, processing the i-th room impulse response and the i-th clean speech data to obtain the i-th training data in the i-th group of training samples includes: adding noise data to the i-th clean speech data and convolving the sum with the i-th room impulse response to obtain the i-th training data in the i-th group of training samples.
The memory 702 may include one or more readable storage media, which may be non-transitory. The memory 702 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory devices. In some embodiments of the present disclosure, the non-transitory readable storage medium in the memory 702 stores at least one instruction, which is executed by the processor 701 to implement the methods in the embodiments of the present disclosure.
In some embodiments, the electronic device 700 further includes a peripheral device interface 703 and at least one peripheral device. The processor 701, the memory 702, and the peripheral device interface 703 may be connected by a bus or signal lines. Each peripheral device may be connected to the peripheral device interface 703 by a bus, a signal line, or a circuit board. Specifically, the peripheral devices include at least one of a display screen 704, a camera 707, and an audio circuit 706.
The peripheral device interface 703 may be used to connect at least one Input/Output (I/O) peripheral device to the processor 701 and the memory 702. In some embodiments of the present disclosure, the processor 701, the memory 702, and the peripheral device interface 703 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 701, the memory 702, and the peripheral device interface 703 may be implemented on a separate chip or circuit board. The embodiments of the present disclosure do not specifically limit this.
The display screen 704 is used to display a User Interface (UI). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 704 is a touch display screen, it can also collect touch signals on or above its surface. The touch signal can be input to the processor 701 as a control signal for processing. In this case, the display screen 704 can also provide virtual buttons and/or a virtual keyboard, also called soft buttons and/or a soft keyboard. In some embodiments of the present disclosure, there may be one display screen 704, arranged on the front panel of the electronic device 700; in other embodiments, there may be at least two display screens 704, arranged on different surfaces of the electronic device 700 or in a folding design; in still other embodiments, the display screen 704 may be a flexible display screen arranged on a curved or folding surface of the electronic device 700. The display screen 704 may even be given a non-rectangular irregular shape, that is, a special-shaped screen. The display screen 704 may be a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like.
The camera 707 is used to capture images or video. Optionally, the camera 707 includes a front camera and a rear camera. Typically, the front camera is arranged on the front panel of the electronic device and the rear camera on its back. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be combined for background blurring, and the main camera and the wide-angle camera can be combined for panoramic shooting, Virtual Reality (VR) shooting, or other combined shooting functions. In some embodiments of the present disclosure, the camera 707 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash and can be used for light compensation at different color temperatures.
The audio circuit 706 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, convert the sound waves into electrical signals, and input them to the processor 701 for processing. For stereo capture or noise reduction, there may be multiple microphones, disposed at different parts of the electronic device 700. The microphone may also be an array microphone or an omnidirectional microphone.
The power supply 707 is used to supply power to the components in the electronic device 700. The power supply 707 may be alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 707 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is charged through a wired line, and a wireless rechargeable battery is charged through a wireless coil. The rechargeable battery may also support fast-charging technology.
The structural block diagram of the electronic device shown in the embodiments of the present disclosure does not limit the electronic device 700. The electronic device 700 may include more or fewer components than shown, may combine certain components, or may adopt a different component arrangement.
In the present disclosure, the terms "first", "second", and the like are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or order; the term "plurality" means two or more, unless expressly defined otherwise. The terms "mounted", "coupled", "connected", "fixed", and the like shall be understood broadly. For example, "connected" may be a fixed connection, a detachable connection, or an integral connection; "coupled" may be a direct connection or an indirect connection through an intermediary. Those of ordinary skill in the art can understand the specific meanings of these terms in the present disclosure according to the specific situation.
In the description of the present disclosure, it should be understood that the orientations or positional relationships indicated by terms such as "upper" and "lower" are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description of the present disclosure, and do not indicate or imply that the device or unit referred to must have a specific orientation or be constructed and operated in a specific orientation; therefore, they shall not be construed as limiting the present disclosure.
The above is merely specific embodiments of the present disclosure, but the scope of protection of the present disclosure is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed herein shall fall within the scope of protection of the present disclosure. Therefore, equivalent changes made according to the claims of the present disclosure still fall within the scope covered by the present disclosure.

Claims (10)

  1. A training method for a speech enhancement model, characterized by comprising:
    obtaining N groups of training samples, wherein the i-th group of training samples comprises i-th training data and i-th target data, N is a positive integer, and i is a positive integer not greater than N;
    training a speech enhancement model with the N groups of training samples;
    wherein obtaining the i-th group of training samples comprises:
    obtaining an i-th room impulse response and i-th clean speech data, and processing the i-th room impulse response and the i-th clean speech data to obtain the i-th training data in the i-th group of training samples;
    determining an i-th control curve according to the i-th room impulse response, and multiplying the i-th control curve by the i-th room impulse response to obtain an i'-th room impulse response, wherein i' is a positive integer not greater than N; and convolving the i-th clean speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples.
  2. The training method for a speech enhancement model according to claim 1, characterized in that:
    the i-th room impulse response comprises a plurality of sampling points;
    the i-th control curve comprises a plurality of control values, the number of control values being the same as the number of sampling points of the i-th room impulse response; and
    when the sampling points at the tail of the i'-th room impulse response are zero or very small in absolute value, tail truncation may optionally be applied to the i'-th room impulse response.
  3. The training method for a speech enhancement model according to claim 2, characterized in that determining the i-th control curve according to the i-th room impulse response comprises:
    determining the absolute value of each sampling point in the i-th room impulse response, wherein the absolute values include a plurality of equal maximum values;
    in the i-th room impulse response, determining the sampling point corresponding to the first of the maximum absolute values as a peak position point; and
    determining the control value in the i-th control curve that corresponds to the peak position point of the i-th room impulse response as a main control value of the i-th control curve.
  4. The training method for a speech enhancement model according to claim 3, characterized in that determining the i-th control curve according to the i-th room impulse response further comprises:
    adjusting the control values of the i-th control curve by means of parameters to determine the i-th control curve, wherein none of the control values other than the main control value is greater than the main control value.
  5. The training method for a speech enhancement model according to claim 1, characterized in that processing the i-th room impulse response and the i-th clean speech data to obtain the i-th training data in the i-th group of training samples comprises:
    convolving the i-th clean speech data with the i-th room impulse response to obtain the i-th training data in the i-th group of training samples.
  6. The training method for a speech enhancement model according to claim 5, characterized in that processing the i-th room impulse response and the i-th clean speech data to obtain the i-th training data in the i-th group of training samples comprises:
    convolving the i-th clean speech data with the i-th room impulse response and adding noise data to the result to obtain the i-th training data in the i-th group of training samples.
  7. The training method for a speech enhancement model according to claim 5, characterized in that processing the i-th room impulse response and the i-th clean speech data to obtain the i-th training data in the i-th group of training samples comprises:
    adding noise data to the i-th clean speech data and convolving the sum with the i-th room impulse response to obtain the i-th training data in the i-th group of training samples.
  8. A training apparatus for a speech enhancement model, characterized by comprising:
    an obtaining module, configured to obtain N groups of training samples, wherein the i-th group of training samples comprises i-th training data and i-th target data, N is a positive integer, and i is a positive integer not greater than N;
    a training module, configured to train a speech enhancement model with the N groups of training samples;
    wherein the obtaining module is specifically configured to: obtain an i-th room impulse response and i-th clean speech data, and process the i-th room impulse response and the i-th clean speech data to obtain the i-th training data in the i-th group of training samples; determine an i-th control curve according to the i-th room impulse response, and multiply the i-th control curve by the i-th room impulse response to obtain an i'-th room impulse response, wherein i' is a positive integer not greater than N; and convolve the i-th clean speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples.
  9. A terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the training method for a speech enhancement model according to any one of claims 1 to 7.
  10. A readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the training method for a speech enhancement model according to any one of claims 1 to 7.
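
The sample-generation steps recited in claims 1 to 6 can be illustrated with a minimal NumPy sketch. This is not the applicant's implementation. The claims only require that the main control value sit at the first peak of the room impulse response and that no other control value exceed it; the exponential decay after the peak, the `decay` and `main_value` parameters, the zero-padding of the target, and all function names below are assumptions made for illustration.

```python
import numpy as np


def peak_index(rir):
    """Index of the first sample with the largest absolute value (claim 3)."""
    # np.argmax returns the first occurrence when several maxima are equal.
    return int(np.argmax(np.abs(rir)))


def control_curve(rir, decay=0.01, main_value=1.0):
    """Hypothetical control curve: flat up to the peak, exponential decay after.

    With decay >= 0, no control value exceeds the main control value (claim 4).
    """
    peak = peak_index(rir)
    curve = np.full(len(rir), main_value, dtype=float)
    steps_after_peak = np.arange(len(rir) - peak)
    curve[peak:] = main_value * np.exp(-decay * steps_after_peak)
    return curve


def truncate_tail(rir, eps=0.0):
    """Drop trailing samples whose absolute value is at most eps (claim 2)."""
    kept = np.nonzero(np.abs(rir) > eps)[0]
    return rir[: kept[-1] + 1] if kept.size else rir[:1]


def make_pair(clean, rir, noise=None, decay=0.01):
    """Build one (training data, target data) pair from clean speech and an RIR."""
    # Training data: clean speech convolved with the original RIR (claim 5),
    # optionally with noise added afterwards (claim 6). The noise array is
    # assumed to have length len(clean) + len(rir) - 1.
    training = np.convolve(clean, rir)
    if noise is not None:
        training = training + noise
    # Target data: clean speech convolved with the curve-weighted RIR (claim 1).
    shaped = truncate_tail(rir * control_curve(rir, decay))
    target = np.convolve(clean, shaped)
    # Zero-pad so both signals have the same length (an assumption of this sketch).
    target = np.pad(target, (0, len(training) - len(target)))
    return training, target
```

Under these assumptions, the target keeps the direct sound at the peak unchanged while attenuating the later reflections, so a model trained on such pairs would be asked to suppress late reverberation rather than remove the room response entirely.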
PCT/CN2022/129232 2021-11-25 2022-11-02 Speech enhancement model training method and apparatus, storage medium, and device WO2023093477A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111427538.7A CN116189698A (en) 2021-11-25 2021-11-25 Training method and device for voice enhancement model, storage medium and equipment
CN202111427538.7 2021-11-25

Publications (1)

Publication Number Publication Date
WO2023093477A1 true WO2023093477A1 (en) 2023-06-01

Family

ID=86431208

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/129232 WO2023093477A1 (en) 2021-11-25 2022-11-02 Speech enhancement model training method and apparatus, storage medium, and device

Country Status (2)

Country Link
CN (1) CN116189698A (en)
WO (1) WO2023093477A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105325013A (en) * 2013-05-29 2016-02-10 高通股份有限公司 Filtering with binaural room impulse responses
CN105513592A (en) * 2014-10-13 2016-04-20 福特全球技术公司 Acoustic impulse response simulation
CN109523999A (en) * 2018-12-26 2019-03-26 中国科学院声学研究所 A kind of front end processing method and system promoting far field speech recognition
CN110930991A (en) * 2018-08-30 2020-03-27 阿里巴巴集团控股有限公司 Far-field speech recognition model training method and device
CN111341303A (en) * 2018-12-19 2020-06-26 北京猎户星空科技有限公司 Acoustic model training method and device and voice recognition method and device
CN111933164A (en) * 2020-06-29 2020-11-13 北京百度网讯科技有限公司 Training method and device of voice processing model, electronic equipment and storage medium
WO2021022079A1 (en) * 2019-08-01 2021-02-04 Dolby Laboratories Licensing Corporation System and method for enhancement of a degraded audio signal

Also Published As

Publication number Publication date
CN116189698A (en) 2023-05-30

Similar Documents

Publication Publication Date Title
US20210217433A1 (en) Voice processing method and apparatus, and device
US20220188619A1 (en) Microcontroller Interface for Audio Signal Processing
JP2019128939A (en) Gesture based voice wakeup method, apparatus, arrangement and computer readable medium
CN107666638B (en) A kind of method and terminal device for estimating tape-delayed
CN106454139A (en) Shooting method and mobile terminal
WO2022022536A1 (en) Audio playback method, audio playback apparatus, and electronic device
WO2023279739A1 (en) Image processing method and apparatus, and electronic device and storage medium
CN110289024B (en) Audio editing method and device, electronic equipment and storage medium
WO2015130270A1 (en) Photo and document integration
WO2020020375A1 (en) Voice processing method and apparatus, electronic device, and readable storage medium
CN115049783B (en) Model determining method, scene reconstruction model, medium, equipment and product
CN111863020A (en) Voice signal processing method, device, equipment and storage medium
EP4120242A1 (en) Method for in-chorus mixing, apparatus, electronic device and storage medium
US20170092321A1 (en) Perceptual computing input to determine post-production effects
CN104345880B (en) The method and electronic equipment of a kind of information processing
WO2021184956A1 (en) Image editing method and device, storage medium, and terminal
WO2023093477A1 (en) Speech enhancement model training method and apparatus, storage medium, and device
TWI605261B (en) Method,media and apparatus for measurement of disiance between devices using audio signals
WO2021179869A1 (en) Audio playing method and apparatus, and storage medium and terminal
WO2021120383A1 (en) Screen color temperature control method and apparatus, storage medium, and mobile terminal
CN108495160A (en) Intelligent control method, system, equipment and storage medium
CN109360582B (en) Audio processing method, device and storage medium
WO2023279740A1 (en) Image processing method and apparatus, and electronic device and storage medium
CN116343716A (en) Techniques for low power selective frame update on a display
WO2020113473A1 (en) Audio playing control method and apparatus, and terminal and computer-readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application Ref document number: 22897560; Country of ref document: EP; Kind code of ref document: A1