WO2023093477A1 - Speech enhancement model training method and apparatus, storage medium, and device


Info

Publication number
WO2023093477A1
Authority
WO
WIPO (PCT)
Prior art keywords: impulse response, training, room impulse, data, training samples
Prior art date
Application number
PCT/CN2022/129232
Other languages
English (en)
Chinese (zh)
Inventor
刘荣
Original Assignee
广州视源电子科技股份有限公司
广州视源人工智能创新研究院有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州视源电子科技股份有限公司 and 广州视源人工智能创新研究院有限公司
Publication of WO2023093477A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Definitions

  • The present disclosure relates to the technical field of speech enhancement, and in particular to a speech enhancement model training method and device, a readable storage medium, and an electronic device.
  • Speech enhancement improves speech quality through various computational methods, extracting a speech signal that is as pure as possible from a signal containing interfering sounds.
  • Common speech enhancement algorithms include: spectral subtraction, wavelet analysis, Kalman filtering, signal-subspace methods, and methods based on the auditory masking effect.
  • Known training approaches include speech enhancement models based on independent component analysis and speech enhancement models based on neural networks.
  • However, when a neural network is used to train the speech enhancement model, it inevitably causes more damage to the speech signal, degrading speech quality. Furthermore, when clean speech signals and room impulse responses are used to generate training data, the target signal may come to lead the input signal, which can make the model unrealizable (non-causal) and thus untrainable.
  • The purpose of the present disclosure is to provide a speech enhancement model training method and device, a readable storage medium, and an electronic device that overcome, at least to a certain extent, the drawbacks of the related art: early reverberation is not preserved when reverberation is removed, model complexity is relatively high, and the speech signal suffers greater damage.
  • According to a first aspect, a method for training a speech enhancement model comprises: obtaining N groups of training samples, wherein the i-th group of training samples includes the i-th training data and the i-th target data, N is a positive integer, and i is a positive integer not greater than N; and training the speech enhancement model with the N groups of training samples. Obtaining the i-th group of training samples includes: obtaining the i-th room impulse response and the i-th pure speech data, and processing them to obtain the i-th training data in the i-th group of training samples; determining the i-th control curve according to the i-th room impulse response, and multiplying the i-th control curve by the i-th room impulse response to obtain the i'-th room impulse response; and convolving the i-th pure speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples.
  • The i-th room impulse response includes a plurality of sampling points; the i-th control curve includes a plurality of control values, and the number of control values equals the number of sampling points of the i-th room impulse response. For the i'-th room impulse response, when the tail sampling points are zero or very small in absolute value, tail truncation may optionally be applied.
  • Determining the i-th control curve based on the i-th room impulse response includes: determining the absolute value of each sampling point in the i-th room impulse response, wherein the absolute values may contain multiple equal maxima; determining, in the i-th room impulse response, the sampling point corresponding to the first of these maxima as the peak position point; and determining the control value of the i-th control curve corresponding to the peak position point of the i-th room impulse response as the main control value of the i-th control curve.
  • Determining the i-th control curve further includes adjusting its control values through parameters, wherein all control values other than the main control value are not greater than the main control value.
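  • As an illustration only (not part of the patent text), a minimal numpy sketch of the peak-position rule above; the function name and the 1-D array layout of the impulse response are our assumptions:

        import numpy as np

        def peak_position(rir: np.ndarray) -> int:
            """Index of the first sampling point whose absolute value is maximal."""
            # np.argmax returns the first occurrence when several maxima are equal,
            # matching the "first maximum" rule described above.
            return int(np.argmax(np.abs(rir)))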
  • Processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples includes: convolving the i-th pure speech data with the i-th room impulse response.
  • Alternatively, it includes: convolving the i-th pure speech data with the i-th room impulse response and then adding the result to noise data.
  • Alternatively, it includes: adding the i-th pure speech data to noise data and then convolving the sum with the i-th room impulse response. (The three variants are sketched below.)
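  • A hedged sketch of the three data-generation variants just listed, assuming numpy/scipy and 1-D float arrays; all function names are illustrative, not from the patent:

        import numpy as np
        from scipy.signal import fftconvolve

        def _fit(x, n):
            """Trim or zero-pad x to length n so two signals can be added."""
            out = np.zeros(n)
            m = min(len(x), n)
            out[:m] = x[:m]
            return out

        def make_training_data(speech, rir, noise=None, noise_first=False):
            if noise is None:
                return fftconvolve(speech, rir)        # reverberant speech only
            if noise_first:
                # variant 3: add noise to the clean speech, then convolve the sum
                return fftconvolve(speech + _fit(noise, len(speech)), rir)
            # variant 2: convolve first, then add noise to the reverberant speech
            rev = fftconvolve(speech, rir)
            return rev + _fit(noise, len(rev))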
  • According to a second aspect, a training device for a speech enhancement model includes: an acquisition module configured to acquire N groups of training samples, wherein the i-th group of training samples includes the i-th training data and the i-th target data, N is a positive integer, and i is a positive integer not greater than N; and a training module configured to train the speech enhancement model with the N groups of training samples. The acquisition module is specifically configured to: acquire the i-th room impulse response and the i-th pure speech data, and process them to obtain the i-th training data; determine the i-th control curve according to the i-th room impulse response, and multiply the i-th control curve by the i-th room impulse response to obtain the i'-th room impulse response; and convolve the i-th pure speech data with the i'-th room impulse response to obtain the i-th target data.
  • According to a third aspect, a terminal includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the method for training a speech enhancement model according to the first aspect is implemented.
  • According to a fourth aspect, a readable storage medium has a computer program stored thereon; when the computer program is executed by a processor, the method for training a speech enhancement model according to the first aspect is implemented.
  • the speech enhancement model training method and device, readable storage medium and electronic equipment provided by the embodiments of the present disclosure have the following technical effects:
  • N groups of training samples are obtained, where the i-th group of training samples includes the i-th training data and the i-th target data, and the speech enhancement model is trained with the N groups of training samples. Obtaining the i-th group of training samples includes: obtaining the i-th room impulse response and the i-th pure speech data, and processing them to obtain the i-th training data in the i-th group of training samples; determining the i-th control curve according to the i-th room impulse response, and multiplying the i-th control curve by the i-th room impulse response to obtain the i'-th room impulse response, where i' is a positive integer not greater than N; and convolving the i-th pure speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples. In this way, early reverberation can be selectively preserved in the target data while late reverberation is removed, so the trained model causes less damage to the speech signal.
  • FIG. 1 schematically shows a flowchart of a method for training a speech enhancement model provided by an embodiment of the present disclosure
  • Figure 2 shows a schematic diagram of a speech enhancement model
  • Fig. 3 shows a schematic diagram of the control curve;
  • Fig. 4 shows a schematic diagram of the i-th room impulse response
  • Fig. 5 shows the schematic diagram of the i'th room impulse response
  • FIG. 6 schematically shows a structural diagram of a training device for a speech enhancement model provided by an embodiment of the present disclosure
  • Fig. 7 schematically shows a block diagram of an electronic device provided by an embodiment of the present disclosure.
  • FIG. 1 schematically shows a flowchart of a method for training a speech enhancement model according to an exemplary embodiment of the present disclosure.
  • this method comprises the following steps:
  • S101: Acquire the i-th group of training samples, obtaining N groups of training samples in total, wherein the i-th group of training samples includes the i-th training data and the i-th target data, N is a positive integer, and i is a positive integer not greater than N.
  • S102: Train the speech enhancement model with the N groups of training samples.
  • S11: Obtaining the i-th group of training samples includes: obtaining the i-th room impulse response and the i-th pure speech data, and processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples.
  • S12: Determine the i-th control curve according to the i-th room impulse response, and multiply the i-th control curve by the i-th room impulse response to obtain the i'-th room impulse response, where i' is a positive integer not greater than N.
  • the i-th pure speech data is convolved with the i'th room impulse response to obtain the i-th target data in the i-th group of training samples.
  • multiplying the i-th control curve by the i-th room impulse response refers to multiplying the control value of each control point in the i-th control curve by the corresponding sampling point in the i-th room impulse response.
  • For sampling points of the i-th room impulse response that require no adjustment, the multiplication operation can be omitted; this is equivalent to regarding the corresponding control value as 1. Control points whose control value is 1 may therefore be left out, so the number of points of the i-th control curve and the number of sampling points of the i-th room impulse response are not necessarily strictly equal, but equal in a broad sense.
  • Tail truncation may optionally be applied, and the number of sampling points is reduced correspondingly after the tail is truncated. When the i'-th room impulse response is obtained by tail truncation, this can be regarded as setting the control values of the corresponding tail control points of the control curve to 0 (see the sketch below).
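  • A minimal sketch of the point-by-point multiplication and optional tail truncation described above (our names and tolerance; the patent fixes no specific threshold):

        import numpy as np

        def apply_control_curve(rir, curve, tail_eps=1e-4):
            n = min(len(rir), len(curve))
            out = rir[:n] * curve[:n]                     # point-by-point product
            if len(rir) > n:                              # omitted control points act as 1
                out = np.concatenate([out, rir[n:]])
            keep = np.nonzero(np.abs(out) > tail_eps)[0]  # optional tail truncation:
            return out[: keep[-1] + 1] if keep.size else out[:1]  # drop the near-zero tail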
  • In summary, N groups of training samples are obtained, where the i-th group includes the i-th training data and the i-th target data, N is a positive integer, and i is a positive integer not greater than N; the speech enhancement model is trained with the N groups of training samples. Obtaining the i-th group of training samples includes: obtaining the i-th room impulse response and the i-th pure speech data and processing them to obtain the i-th training data; determining the i-th control curve according to the i-th room impulse response and multiplying it by the i-th room impulse response to obtain the i'-th room impulse response, where i' is a positive integer not greater than N; and convolving the i-th pure speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples.
  • Fig. 2 shows a schematic diagram of a speech enhancement model.
  • the specific implementation manner of each step included in the embodiment shown in FIG. 1 will be described in detail below with reference to FIG. 2 .
  • This embodiment provides a method for training a speech enhancement model; the specific implementation is as follows:
  • The i-th group of training samples is obtained until N groups of training samples are obtained, wherein the i-th group of training samples includes the i-th training data and the i-th target data, N is a positive integer, and i is a positive integer not greater than N.
  • the speech enhancement model is trained through N groups of training samples.
  • Obtaining the i-th group of training samples includes: obtaining the i-th room impulse response and the i-th pure speech data, and processing them to obtain the i-th training data in the i-th group of training samples.
  • In one embodiment, the i-th pure speech data and the i-th room impulse response are drawn at random from a sample library, and the i-th training data is obtained by convolving them.
  • the i-th training data and the i-th target data obtained in the following embodiments will be used together as the i-th group of training samples.
  • The i-th control curve is determined according to the i-th room impulse response and multiplied by the i-th room impulse response to obtain the i'-th room impulse response, where i' is a positive integer not greater than N; the i-th pure speech data is convolved with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples.
  • The i-th group of training samples, including the i-th training data and the i-th target data, is first obtained; such groups together form the N groups of training samples.
  • the method for obtaining the i-th target data is as follows: as shown in FIG. 2 , after the i-th pure voice data and the i-th room impulse response are obtained, S12 is executed.
  • For the specific implementation, refer to the following examples:
  • After the i-th pure speech data and the i-th room impulse response are obtained, the first sampling point with the largest absolute value in the i-th room impulse response is determined; this is the peak position point. The control curve needs a peak point corresponding to it, so in the i-th control curve the control point corresponding to the peak position point of the i-th room impulse response is recorded as the main control point p, and the control value corresponding to p is recorded as the main control value.
  • Reverberation includes early reflections and late reflections. Early reflections can enhance the speech signal, so they can be preserved; parameters can therefore be used to control the duration and intensity of the reverberation in the room impulse response, selectively retaining the reverberation of different stages.
  • The i-th control curve can be generated from preset parameters and multiplied by the value of each sampling point in the i-th room impulse response to produce an i'-th room impulse response containing only direct sound and/or early reverberation.
  • Since sound propagation takes time, and the direct sound and early reflections are stronger than the late reflections, the sampling-point amplitude of the room impulse response is zero or very small at first, then rises rapidly, reaches its maximum at the direct sound and/or early reflections, and gradually decreases during the late reflections. The control values of the i-th control curve can follow the same pattern.
  • The main control point p is located near the direct sound or the early reflections, and the other control values in the i-th control curve are not greater than the main control value.
  • parameters can be used to adjust the control value in the i-th control curve, so as to control the amplitude, position and duration of the direct sound, early reverberation, and late reverberation in the i'th room impulse response .
  • The control values of the i-th control curve can be adjusted as required; that is, the shape of the curve can be changed freely, and this embodiment imposes no limitation. The few examples given below do not represent all feasible solutions; for example, a bell curve similar to a normal (Gaussian) distribution can also be used.
  • Figure 3 shows a schematic diagram of the control curve.
  • Since the sampling-point amplitude of the room impulse response is zero or very small at first, this part can be ignored and its control values set to 0, as shown by the dotted line on the left side of Fig. 3.
  • Before the main control point p, the magnitude of the room impulse response rises rapidly. The control curve is denoted m, and this section of m is denoted m1; its control values can be adjusted by parameters, for example presetting m1 as an exponentially rising curve, as shown in Fig. 3.
  • When the impulse response enters the stage of direct sound and early reflections, the corresponding section of the control curve m is denoted m2, and the amplitude of this section is left unchanged; that is, all control values of the m2 section are set to 1, so m2 is a straight line, as shown in Fig. 3.
  • Subsequent sections of the control curve govern the early and late reverberation. The section controlling the early reverberation is denoted m3, and the section controlling the late reverberation is denoted m4; m4 can likewise be adjusted by parameters, for example presetting it as an exponentially decaying curve, as shown in Fig. 3. The duration of m3 can be set to the length of the early reverberation that needs to be preserved, and the duration of m4 to the required reverberation time, such as T60.
  • For the part of the reverberation to be removed entirely, the corresponding control values in the control curve can be set directly to 0, as shown by the dotted line on the right of Fig. 3. Following the embodiment described above, a curve of the form shown in Fig. 3 is generated.
  • The parameters of the m1 segment can also be set to all 0s (or the duration of m1 set to 0), or m1 can be changed to a linearly rising line.
  • The parameters of the m4 segment can also be set to all 0s (or the duration of m4 set to 0), or m4 can be changed to a linearly decaying line.
  • The m2 and m3 segments can also be set as linearly changing lines, or as exponential curves.
  • The shapes and lengths of m1, m2, m3, and m4 (that is, the amplitude and duration of the different stages of the i'-th room impulse response) can be adjusted according to actual conditions, with no fixed rules.
  • the control value corresponding to the main control point p still needs to be kept as the maximum value in the i-th control curve, and the control values of other control points are not greater than the main control value.
  • the segmentation of the i-th control curve is not limited to the four segments m1, m2, m3, and m4, and the number of segments can be increased or decreased arbitrarily, which is not limited here.
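  • A sketch of one possible piecewise curve of the m1-m4 kind described above, under our own assumptions (exponential m1 rise, m2/m3 held at 1, m4 decay sized to a T60 reverberation time); every shape and duration here is an arbitrary choice, as the text emphasizes:

        import numpy as np

        def control_curve(n_samples, p, fs, pre_zeros, early_len, t60=0.3):
            m = np.zeros(n_samples)                  # leading zeros (left dotted part of Fig. 3)
            if pre_zeros < p:                        # m1: exponential rise up to the peak p
                m[pre_zeros:p] = np.exp(np.linspace(-4.0, 0.0, p - pre_zeros))
            early_end = min(p + early_len, n_samples)
            m[p:early_end] = 1.0                     # m2/m3: hold direct sound and early part at 1
            if early_end < n_samples:                # m4: exponential decay; amplitude falls
                t = np.arange(n_samples - early_end) / fs   # 60 dB over t60 seconds
                m[early_end:] = 10.0 ** (-3.0 * t / t60)
            return m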
  • Each control value of the i-th control curve is multiplied by the value of the corresponding sampling point in the i-th room impulse response to obtain the controlled i'-th room impulse response.
  • Tail samples whose values are zero or very small in amplitude can be deleted, and the number of samples of the processed i'-th room impulse response decreases accordingly. By the convolution theorem, such processing does not affect the result of the convolution, and it saves storage and computing resources; the check below illustrates this.
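  • A quick numeric check (ours, not the patent's) that trimming a zero tail from the impulse response leaves the convolution unchanged:

        import numpy as np
        from scipy.signal import fftconvolve

        rng = np.random.default_rng(0)
        speech = rng.standard_normal(16000)
        rir = np.concatenate([rng.standard_normal(2000), np.zeros(500)])  # zero tail

        full = fftconvolve(speech, rir)         # with the zero tail
        trim = fftconvolve(speech, rir[:2000])  # tail truncated
        assert np.allclose(full[:trim.size], trim)  # identical where both are defined
        assert np.allclose(full[trim.size:], 0.0)   # the remainder is (numerically) zero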
  • Fig. 4 shows a schematic diagram of the impulse response of the i-th room
  • Fig. 5 shows a schematic diagram of the impulse response of the i'th room.
  • the impulse response of the i-th room has a long duration of reverberation, and there is a certain amount of noise at the head and tail.
  • the late reverberation in the impulse response of the i-th room can be removed, and the early reverberation can be retained.
  • The obtained i'-th room impulse response is shown in Fig. 5: only the direct sound and early reverberation are retained, and the head and tail noise of the i-th room impulse response is removed.
  • The i-th pure speech data is convolved with the i'-th room impulse response to obtain the i-th target data.
  • the i-th target data is the target label data during speech enhancement model training.
  • The i-th pure speech data is convolved with the i-th room impulse response to obtain the i-th training data, which together with the i-th target data forms the i-th group of training samples input into the neural network, as shown in Fig. 2.
  • the i-th group of training samples is used to continuously train the speech enhancement model until the output of the model can achieve excellent speech enhancement results.
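  • Putting the pieces together, an end-to-end sketch of building one training pair from the helpers sketched earlier; the parameter values (2 ms rise, 50 ms preserved early part, 0.1 s residual T60) are ours, chosen only for illustration:

        import numpy as np
        from scipy.signal import fftconvolve

        def make_training_pair(speech, rir, fs, noise=None):
            p = peak_position(rir)                        # first |max| sampling point
            curve = control_curve(len(rir), p, fs,
                                  pre_zeros=max(p - int(0.002 * fs), 0),
                                  early_len=int(0.05 * fs),
                                  t60=0.1)
            target_rir = apply_control_curve(rir, curve)  # the i'-th room impulse response
            x = make_training_data(speech, rir, noise)    # i-th training data (model input)
            y = fftconvolve(speech, target_rir)           # i-th target data (training label)
            return x, y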
  • Obtaining the i-th training data may also include: convolving the i-th pure speech data with the i-th room impulse response and adding the result to noise data; or adding the i-th pure speech data to noise data and convolving the sum with the i-th room impulse response.
  • Whether to add noise depends on whether the model needs noise reduction capability.
  • the amplitude of the i-th pure speech data and noise data can also be randomly scaled.
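  • One possible way to randomize the mixture level (our illustration; the patent does not prescribe a scheme) is to draw a target signal-to-noise ratio and scale the noise accordingly:

        import numpy as np

        def scale_to_snr(speech, noise, snr_db):
            ps = np.mean(speech ** 2)
            pn = np.mean(noise ** 2) + 1e-12          # guard against silent noise
            gain = np.sqrt(ps / (pn * 10.0 ** (snr_db / 10.0)))
            return noise * gain                        # mix as speech + scaled noise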
  • The signal enhancement method used is not limited; for example, it can be an Ideal Binary Mask (IBM), an Ideal Ratio Mask (IRM), an Ideal Amplitude Mask (IAM), a Phase-Sensitive Mask (PSM), a Complex Ideal Ratio Mask (cIRM), or a similar approach.
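  • For concreteness, a sketch of one of the listed options, the Ideal Ratio Mask, computed per time-frequency bin from magnitude spectrograms (the exponent beta = 0.5 is the common choice, not mandated by the patent):

        import numpy as np

        def ideal_ratio_mask(S, N, beta=0.5):
            """S, N: magnitude spectrograms of the target speech and the interference."""
            return (S ** 2 / (S ** 2 + N ** 2 + 1e-12)) ** beta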
  • The above-mentioned neural network can be a deep neural network (Deep Neural Network, DNN), a convolutional neural network (Convolutional Neural Network, CNN), a recurrent neural network (Recurrent Neural Network, RNN), a long short-term memory network (Long Short-Term Memory, LSTM), or the like; this is not limited here.
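  • A minimal PyTorch sketch of an LSTM-based mask estimator of the kind this passage allows; the layer sizes are arbitrary and not specified by the patent:

        import torch
        import torch.nn as nn

        class MaskEstimator(nn.Module):
            def __init__(self, n_bins=257, hidden=256):
                super().__init__()
                self.lstm = nn.LSTM(n_bins, hidden, num_layers=2, batch_first=True)
                self.proj = nn.Linear(hidden, n_bins)

            def forward(self, mag):                   # mag: (batch, frames, n_bins)
                h, _ = self.lstm(mag)
                return torch.sigmoid(self.proj(h))    # mask in [0, 1] per T-F bin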
  • FIG. 6 shows a structural diagram of an apparatus for training a speech enhancement model according to an exemplary embodiment of the present disclosure.
  • the speech enhancement model training device shown in this figure can be implemented as all or part of the terminal through software, hardware or a combination of the two, and can also be integrated on the server as an independent module.
  • the training device 600 of the speech enhancement model includes: an acquisition module 601 and a training module 602, wherein:
  • An acquisition module 601 configured to: acquire N groups of training samples, wherein the i-th group of training samples includes: the i-th training data and the i-th target data, where N is a positive integer, and i is a positive integer not greater than N;
  • the training module 602 is used for: training the speech enhancement model by N groups of training samples;
  • The acquisition module 601 is specifically configured to: acquire the i-th room impulse response and the i-th pure speech data, and process them to obtain the i-th training data in the i-th group of training samples; determine the i-th control curve according to the i-th room impulse response, and multiply the i-th control curve by the i-th room impulse response to obtain the i'-th room impulse response, where i' is a positive integer not greater than N; and convolve the i-th pure speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples.
  • The division into the functional modules above is given only as an example; in practical applications, the functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.
  • The speech enhancement model training device and the speech enhancement model training method provided by the above embodiments follow the same concept; for details not disclosed in the device embodiments, refer to the method embodiments described above, which are not repeated here.
  • An embodiment of the present disclosure also provides a readable storage medium on which a computer program is stored, and when the program is executed by a processor, the steps of the method in any one of the foregoing embodiments are implemented.
  • the readable storage medium may include but is not limited to any type of disk, including floppy disk, optical disk, DVD, CD-ROM, microdrive and magneto-optical disk, ROM, RAM, EPROM, EEPROM, DRAM, VRAM, flash memory device, Magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of medium or device suitable for storing instructions and/or data.
  • An embodiment of the present disclosure also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, the steps of the method in any of the foregoing embodiments are implemented.
  • an electronic device 700 includes: a processor 701 and a memory 702 .
  • the processor 701 is a control center of the computer system, and may be a processor of a physical machine or a processor of a virtual machine.
  • the processor 701 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • The processor 701 can be implemented in at least one of the following hardware forms: digital signal processing (Digital Signal Processing, DSP), field-programmable gate array (Field-Programmable Gate Array, FPGA), or programmable logic array (Programmable Logic Array, PLA).
  • The processor 701 may also include a main processor and a coprocessor. The main processor is a processor for processing data in the awake state, also called a central processing unit (Central Processing Unit, CPU); the coprocessor is a low-power processor for processing data in the standby state.
  • The processor 701 is specifically configured to: obtain N groups of training samples, wherein the i-th group of training samples includes the i-th training data and the i-th target data, N is a positive integer, and i is a positive integer not greater than N; and train the speech enhancement model with the N groups of training samples. Obtaining the i-th group of training samples includes: obtaining the i-th room impulse response and the i-th pure speech data, and processing them to obtain the i-th training data in the i-th group of training samples; determining the i-th control curve according to the i-th room impulse response, and multiplying the i-th control curve by the i-th room impulse response to obtain the i'-th room impulse response, where i' is a positive integer not greater than N; and convolving the i-th pure speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples.
  • The i-th room impulse response includes a plurality of sampling points; the i-th control curve includes a plurality of control values, and the number of control values is the same as the number of sampling points of the i-th room impulse response. For the i'-th room impulse response, tail truncation may optionally be applied when the tail sampling points are zero or very small in absolute value.
  • Determining the i-th control curve according to the i-th room impulse response includes: determining the absolute value of each sampling point in the i-th room impulse response, wherein the absolute values may contain multiple equal maxima; determining the sampling point corresponding to the first of these maxima as the peak position point; and determining the control value of the i-th control curve corresponding to the peak position point as the main control value of the i-th control curve.
  • Determining the i-th control curve further includes adjusting its control values through parameters, wherein, except for the main control point, the control values of the other control points are not greater than the main control value.
  • Processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples includes: convolving the i-th pure speech data with the i-th room impulse response.
  • Alternatively, it includes: convolving the i-th pure speech data with the i-th room impulse response and then adding the result to noise data.
  • Alternatively, it includes: adding the i-th pure speech data to noise data and then convolving the sum with the i-th room impulse response.
  • Memory 702 may include one or more readable storage media, which may be non-transitory.
  • the memory 702 may also include high-speed random access memory, and non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices.
  • The non-transitory readable storage medium in the memory 702 is used to store at least one instruction, and the at least one instruction is executed by the processor 701 to implement the methods in the embodiments of the present disclosure.
  • the electronic device 700 further includes: a peripheral device interface 703 and at least one peripheral device.
  • the processor 701, the memory 702, and the peripheral device interface 703 may be connected through buses or signal lines.
  • Each peripheral device can be connected to the peripheral device interface 703 through a bus, a signal line or a circuit board.
  • the peripheral device includes: at least one of a display screen 704 , a camera 707 and an audio circuit 706 .
  • the peripheral device interface 703 may be used to connect at least one peripheral device related to input/output (Input/Output, I/O) to the processor 701 and the memory 702 .
  • In some embodiments, the processor 701, the memory 702, and the peripheral device interface 703 are integrated on the same chip or circuit board; in some other embodiments of the present disclosure, any one or two of the processor 701, the memory 702, and the peripheral device interface 703 may be implemented on a separate chip or circuit board, which is not specifically limited in the embodiments of the present disclosure.
  • the display screen 704 is used to display a user interface (User Interface, UI).
  • the UI can include graphics, text, icons, video, and any combination thereof.
  • the display screen 704 also has the ability to collect touch signals on or above the surface of the display screen 704 .
  • the touch signal can be input to the processor 701 as a control signal for processing.
  • the display screen 704 can also be used to provide virtual buttons and/or virtual keyboards, also called soft buttons and/or soft keyboards.
  • In some embodiments there is one display screen 704, set on the front panel of the electronic device 700; in other embodiments there are at least two display screens 704, set on different surfaces of the electronic device 700 or in a folded design; in still other embodiments, the display screen 704 may be a flexible display screen arranged on a curved or folded surface of the electronic device 700. The display screen 704 can even be set as a non-rectangular irregular figure, that is, a special-shaped screen.
  • the display screen 704 can be made of liquid crystal display (Liquid Crystal Display, LCD), organic light-emitting diode (Organic Light-Emitting Diode, OLED) and other materials.
  • the camera 707 is used to collect images or videos.
  • the camera 707 includes a front camera and a rear camera.
  • the front camera is set on the front panel of the electronic device, and the rear camera is set on the back of the electronic device.
  • In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, or a telephoto camera, so that, for example, the main camera and the depth-of-field camera can be fused to realize a background blur function.
  • the camera 707 may also include a flash.
  • the flash can be a single-color temperature flash or a dual-color temperature flash. Dual color temperature flash refers to the combination of warm light flash and cold light flash, which can be used for light compensation under different color temperatures.
  • Audio circuitry 706 may include a microphone and speakers.
  • The microphone is used to collect sound waves of the user and the environment, convert the sound waves into electrical signals, and input them to the processor 701 for processing.
  • the microphone can also be an array microphone or an omnidirectional collection microphone.
  • the power supply 707 is used to supply power to various components in the electronic device 700 .
  • Power source 707 may be AC, DC, disposable or rechargeable batteries.
  • the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery.
  • a wired rechargeable battery is a battery charged through a wired line
  • a wireless rechargeable battery is a battery charged through a wireless coil.
  • the rechargeable battery can also be used to support fast charging technology.
  • The structural block diagram shown in the embodiment of the present disclosure does not constitute a limitation on the electronic device 700, which may include more or fewer components than shown, combine some components, or adopt a different component arrangement.
  • A "connection" can be a fixed connection, a detachable connection, or an integral connection; it can be direct, or indirect through an intermediary.


Abstract

The present invention relates to a speech enhancement model training method and apparatus, a storage medium, and a device. The method comprises: acquiring N groups of training samples, wherein the i-th group of training samples comprises the i-th training data and the i-th target data (S101); training a speech enhancement model by means of the N groups of training samples (S102); acquiring the i-th group of training samples, which comprises: acquiring an i-th room impulse response and i-th pure speech data, and processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples (S11); and determining an i-th control curve according to the i-th room impulse response, and multiplying the i-th control curve by the i-th room impulse response to obtain an i'-th room impulse response, the i-th pure speech data being convolved with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples (S12).
PCT/CN2022/129232 2021-11-25 2022-11-02 Speech enhancement model training method and apparatus, storage medium, and device WO2023093477A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111427538.7 2021-11-25
CN202111427538.7A CN116189698A (zh) 2021-11-25 2021-11-25 Speech enhancement model training method and apparatus, storage medium and device

Publications (1)

Publication Number Publication Date
WO2023093477A1 (fr)

Family

ID=86431208

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/129232 WO2023093477A1 (fr) 2021-11-25 2022-11-02 Speech enhancement model training method and apparatus, storage medium, and device

Country Status (2)

Country Link
CN (1) CN116189698A (fr)
WO (1) WO2023093477A1 (fr)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105325013A (zh) * 2013-05-29 2016-02-10 高通股份有限公司 Filtering with stereo room impulse responses
CN105513592A (zh) * 2014-10-13 2016-04-20 福特全球技术公司 Acoustic impulse response simulation
CN109523999A (zh) * 2018-12-26 2019-03-26 中国科学院声学研究所 Front-end processing method and system for improving far-field speech recognition
CN110930991A (zh) * 2018-08-30 2020-03-27 阿里巴巴集团控股有限公司 Far-field speech recognition model training method and apparatus
CN111341303A (zh) * 2018-12-19 2020-06-26 北京猎户星空科技有限公司 Acoustic model training method and apparatus, and speech recognition method and apparatus
CN111933164A (zh) * 2020-06-29 2020-11-13 北京百度网讯科技有限公司 Speech processing model training method and apparatus, electronic device and storage medium
WO2021022079A1 (fr) * 2019-08-01 2021-02-04 Dolby Laboratories Licensing Corporation System and method for enhancement of a degraded audio signal

Also Published As

Publication number Publication date
CN116189698A (zh) 2023-05-30


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22897560

Country of ref document: EP

Kind code of ref document: A1