WO2023093477A1 - Speech enhancement model training method and apparatus, storage medium, and device

Info

Publication number: WO2023093477A1
Application number: PCT/CN2022/129232
Authority: WO
Inventors: 刘荣
Original assignee: 广州视源电子科技股份有限公司; 广州视源人工智能创新研究院有限公司
Priority date: 2021-11-25
Filing date: 2022-11-02
Publication date: 2023-06-01
Also published as: CN116189698A

Abstract

A speech enhancement model training method and apparatus, a storage medium, and a device. The method comprises: acquiring N groups of training samples, an i-th group of training samples comprising: i-th training data and i-th target data (S101); training a speech enhancement model by means of the N groups of training samples (S102); acquiring the i-th group of training samples comprising: acquiring an i-th room impulse response and i-th pure speech data, and processing the i-th room impulse response and the i-th pure speech data to obtain i-th training data in the i-th group of training samples (S11); and determining an i-th control curve according to the i-th room impulse response, and multiplying the i-th control curve by the i-th room impulse response to obtain an i'-th room impulse response, the i-th pure speech data being convolved with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples (S12).

Description

Speech enhancement model training method and device, storage medium and equipment

This application claims priority to Chinese patent application No. 2021114275387, entitled "Speech enhancement model training method and device, storage medium and equipment", filed with the China Patent Office on November 25, 2021, the entire content of which is incorporated herein by reference.

Technical field

The present disclosure relates to the technical field of speech enhancement, in particular to a speech enhancement model training method and device, a readable storage medium and electronic equipment.

Background art

The purpose of speech enhancement is to improve speech quality through various calculation methods, and to extract as pure a speech signal as possible from a speech signal containing interfering sounds. Commonly used speech enhancement approaches include: algorithms based on spectral subtraction, algorithms based on wavelet analysis, algorithms based on Kalman filtering, enhancement methods based on signal subspaces, methods based on the auditory masking effect, methods based on independent component analysis, and methods based on neural networks.

However, when a neural network is used for speech enhancement, the model inevitably causes considerable damage to the speech signal, thereby degrading speech quality. Furthermore, when clean speech signals and room impulse responses are used to generate training data, the target signal may lead the input signal in time, which can make the model non-causal (unrealizable) and therefore untrainable.

It should be noted that the information disclosed in the above background section is only for enhancing the understanding of the background of the present disclosure, and therefore may include information that does not constitute the prior art known to those of ordinary skill in the art.

Summary of the invention

The purpose of the present disclosure is to provide a speech enhancement model training method and device, a readable storage medium and electronic equipment that overcome, at least to a certain extent, the following drawbacks of the related art: early reverberation is not preserved when reverberation is removed, the complexity of the model is relatively high, and the speech is damaged considerably.

Other features and advantages of the present disclosure will become apparent from the following detailed description, or in part, be learned by practice of the present disclosure.

According to a first aspect of the present disclosure, a method for training a speech enhancement model is provided, the method comprising: obtaining N groups of training samples, wherein the i-th group of training samples includes the i-th training data and the i-th target data, N is a positive integer, and i is a positive integer not greater than N; and training the speech enhancement model with the N groups of training samples; wherein obtaining the i-th group of training samples includes: obtaining the i-th room impulse response and the i-th pure speech data, and processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples; determining the i-th control curve according to the i-th room impulse response, and multiplying the i-th control curve by the i-th room impulse response to obtain the i'-th room impulse response, where i' is a positive integer not greater than N; and convolving the i-th pure speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples.

In an embodiment of the present disclosure, the i-th room impulse response includes a plurality of sampling points; the i-th control curve includes a plurality of control values, and the number of control values is the same as the number of sampling points of the i-th room impulse response; for the i'-th room impulse response, when the values of the sampling points at the tail are zero or very small in absolute value, tail truncation may optionally be applied.

In an embodiment of the present disclosure, determining the i-th control curve according to the i-th room impulse response includes: determining the absolute value of each sampling point in the i-th room impulse response, wherein the absolute values may contain multiple maxima of equal value; determining, in the i-th room impulse response, the sampling point corresponding to the first maximum of the absolute values as the peak position point; and determining the control value of the i-th control curve that corresponds to the peak position point of the i-th room impulse response as the main control value of the i-th control curve.

In an embodiment of the present disclosure, determining the i-th control curve according to the i-th room impulse response further includes: adjusting the control values of the i-th control curve through parameters to determine the i-th control curve, wherein all control values other than the main control value are not greater than the main control value.

In an embodiment of the present disclosure, processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples includes: convolving the i-th pure speech data with the i-th room impulse response to obtain the i-th training data in the i-th group of training samples.

In an embodiment of the present disclosure, processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples includes: convolving the i-th pure speech data with the i-th room impulse response and adding the result to noise data to obtain the i-th training data in the i-th group of training samples.

In an embodiment of the present disclosure, processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples includes: adding the i-th pure speech data to noise data and convolving the sum with the i-th room impulse response to obtain the i-th training data in the i-th group of training samples.

According to a second aspect of the present disclosure, a training device for a speech enhancement model is provided, the device comprising: an acquisition module, configured to acquire N groups of training samples, wherein the i-th group of training samples includes the i-th training data and the i-th target data, N is a positive integer, and i is a positive integer not greater than N; and a training module, configured to train the speech enhancement model with the N groups of training samples; wherein the acquisition module is specifically configured to: acquire the i-th room impulse response and the i-th pure speech data, and process the i-th room impulse response and the i-th pure speech data to obtain the i-th training data; determine the i-th control curve according to the i-th room impulse response, and multiply the i-th control curve by the i-th room impulse response to obtain the i'-th room impulse response, where i' is a positive integer not greater than N; and convolve the i-th pure speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples.

According to a third aspect of the present disclosure, there is provided a terminal, including: a memory, a processor, and a computer program stored in the memory and operable on the processor. When the processor executes the computer program, the method for training a speech enhancement model of the first aspect is implemented.

According to a fourth aspect of the present disclosure, there is provided a readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the method for training a speech enhancement model in the first aspect is implemented.

The speech enhancement model training method and device, readable storage medium and electronic equipment provided by the embodiments of the present disclosure have the following technical effects:

In the training process of the speech enhancement model provided by the embodiments of the present disclosure, N groups of training samples are obtained, the i-th group of training samples including the i-th training data and the i-th target data, and the speech enhancement model is trained with the N groups of training samples. Obtaining the i-th group of training samples includes: obtaining the i-th room impulse response and the i-th pure speech data, and processing them to obtain the i-th training data in the i-th group of training samples; determining the i-th control curve according to the i-th room impulse response, and multiplying the i-th control curve by the i-th room impulse response to obtain the i'-th room impulse response, where i' is a positive integer not greater than N; and convolving the i-th pure speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples. The present application can reduce the distortion introduced by signal processing, solve the alignment problem between training data and target data, and preserve early reverberation.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the present disclosure.

Description of drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description serve to explain the principles of the disclosure. Apparently, the drawings in the following description are only some embodiments of the present disclosure, and those skilled in the art can obtain other drawings according to these drawings without creative efforts.

FIG. 1 schematically shows a flowchart of a method for training a speech enhancement model provided by an embodiment of the present disclosure;

FIG. 2 shows a schematic diagram of a speech enhancement model;

FIG. 3 shows a schematic diagram of the control curve;

FIG. 4 shows a schematic diagram of the i-th room impulse response;

FIG. 5 shows a schematic diagram of the i'-th room impulse response;

FIG. 6 schematically shows a structural diagram of a training device for a speech enhancement model provided by an embodiment of the present disclosure;

FIG. 7 schematically shows a block diagram of an electronic device provided by an embodiment of the present disclosure.

Detailed description

In order to make the purpose, technical solution and advantages of the present disclosure clearer, the embodiments of the present disclosure will be further described in detail below in conjunction with the accompanying drawings.

When the following description refers to the accompanying drawings, the same numerals in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with aspects of the present disclosure as recited in the appended claims.

In the description of the present disclosure, it should be understood that the terms "first", "second", etc. are used for descriptive purposes only, and should not be understood as indicating or implying relative importance. Those of ordinary skill in the art can understand the specific meanings of the above terms in the present disclosure in specific situations. In addition, in the description of the present disclosure, unless otherwise specified, "plurality" means two or more. "And/or" describes the association relationship of associated objects, indicating that there may be three types of relationships, for example, A and/or B may indicate: A exists alone, A and B exist simultaneously, and B exists independently. The character "/" generally indicates that the contextual objects are an "or" relationship.

Next, each step of the method for training a speech enhancement model in this exemplary embodiment will be described in more detail with reference to the accompanying drawings and embodiments.

Wherein, FIG. 1 schematically shows a flowchart of a method for training a speech enhancement model according to an exemplary embodiment of the present disclosure. With reference to Fig. 1, this method comprises the following steps:

S101. Acquire the i-th group of training samples, obtaining N groups of training samples in total, wherein the i-th group of training samples includes the i-th training data and the i-th target data, N is a positive integer, and i is a positive integer not greater than N.

S102. Train the speech enhancement model by using N groups of training samples.

S11. Obtaining the i-th group of training samples includes: obtaining the i-th room impulse response and the i-th pure speech data, and processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples.

S12. Determine the i-th control curve according to the i-th room impulse response, and multiply the i-th control curve by the i-th room impulse response to obtain the i'-th room impulse response, where i' is a positive integer not greater than N; convolve the i-th pure speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples.

Wherein, multiplying the i-th control curve by the i-th room impulse response refers to multiplying the control value of each control point in the i-th control curve by the corresponding sampling point in the i-th room impulse response. Obviously, for the control point with a control value of 1 in the above i-th control curve, the multiplication operation can be omitted. Similarly, the control value of the corresponding control point can be regarded as 1 for the sampling point in the impulse response of the i-th room that has no operation or does not need to be operated. During implementation, the control point with a control value of 1 can be omitted, that is, the number of points of the above-mentioned i-th control curve and the number of samples of the above-mentioned i-th room impulse response are not necessarily strictly equal, but equal in a broad sense.
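The point-by-point multiplication described above can be illustrated with a minimal NumPy sketch. The function name `apply_control_curve` and the convention that omitted control values sit at the end of the curve are assumptions made for illustration only; the disclosure states only that control points with a value of 1 may be omitted.

```python
import numpy as np

def apply_control_curve(rir, curve):
    """Multiply each RIR sample by its control value.

    Assumption for illustration: if the curve is shorter than the RIR,
    the missing (omitted) control values are treated as 1, so those
    samples pass through unchanged.
    """
    rir = np.asarray(rir, dtype=float)
    curve = np.asarray(curve, dtype=float)
    if curve.size > rir.size:
        raise ValueError("control curve is longer than the impulse response")
    full_curve = np.ones_like(rir)       # omitted control values default to 1
    full_curve[:curve.size] = curve
    return rir * full_curve

rir = np.array([0.0, 0.1, 1.0, 0.5, 0.25])
curve = np.array([0.0, 1.0, 1.0])        # last two control values omitted
controlled = apply_control_curve(rir, curve)
```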

According to the convolution theorem, when the sampling points at the tail of the i'-th room impulse response are zero or very small in absolute value, tail truncation may optionally be applied, and the number of sampling points decreases accordingly after truncation. Truncating the tail of the room impulse response in this way is equivalent to setting the control values of the corresponding tail control points in the i-th control curve to 0.
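As an illustrative check of the tail truncation described above, the following sketch removes trailing samples at or below a threshold and compares the convolution results. The helper name `truncate_tail` and the `threshold` parameter are hypothetical and not taken from the disclosure.

```python
import numpy as np

def truncate_tail(rir, threshold=0.0):
    """Drop trailing samples whose absolute value is <= threshold."""
    rir = np.asarray(rir, dtype=float)
    significant = np.flatnonzero(np.abs(rir) > threshold)
    if significant.size == 0:
        return np.zeros(1)               # nothing significant remains
    return rir[: significant[-1] + 1]

speech = np.array([1.0, -0.5, 0.25, 0.0, 0.75])
rir_controlled = np.array([0.0, 1.0, 0.5, 0.0, 0.0, 0.0])
rir_short = truncate_tail(rir_controlled)

full = np.convolve(speech, rir_controlled)   # length 5 + 6 - 1
short = np.convolve(speech, rir_short)       # length 5 + 3 - 1
```

With exactly zero tail samples, `short` equals the leading part of `full`, and the samples of `full` beyond that are all zero, so only storage and computation are saved. With a non-zero threshold the result is an approximation.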

In the training process of the speech enhancement model provided by the embodiment shown in FIG. 1, N groups of training samples are obtained, the i-th group of training samples including the i-th training data and the i-th target data, where N is a positive integer and i is a positive integer not greater than N; the speech enhancement model is trained with the N groups of training samples. Obtaining the i-th group of training samples includes: obtaining the i-th room impulse response and the i-th pure speech data, and processing them to obtain the i-th training data in the i-th group of training samples; determining the i-th control curve according to the i-th room impulse response, and multiplying the i-th control curve by the i-th room impulse response to obtain the i'-th room impulse response, where i' is a positive integer not greater than N; and convolving the i-th pure speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples. The present application can reduce the distortion introduced by signal processing, solve the alignment problem between training data and target data, and retain a certain degree of early reverberation as required.

FIG. 2 shows a schematic diagram of a speech enhancement model. The specific implementation of each step included in the embodiment shown in FIG. 1 is described in detail below with reference to FIG. 2. This embodiment provides a training method for a speech enhancement model, implemented as follows:

In S101, the i-th group of training samples is acquired to obtain N groups of training samples, wherein the i-th group of training samples includes the i-th training data and the i-th target data, N is a positive integer, and i is a positive integer not greater than N.

In S102, the speech enhancement model is trained through N groups of training samples.

In S11, obtaining the i-th group of training samples includes: obtaining the i-th room impulse response and the i-th pure speech data, and processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples.

In an exemplary embodiment, as shown in FIG. 2, the i-th pure speech data and the i-th room impulse response are randomly drawn from a sample library, and the i-th pure speech data is convolved with the i-th room impulse response to obtain the i-th training data. The i-th training data and the i-th target data obtained in the following embodiments are used together as the i-th group of training samples.
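A minimal sketch of this step is given below. The sample library is replaced by small synthetic arrays, and the names `speech_library`, `rir_library` and `draw_training_data` are hypothetical placeholders rather than anything defined in the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample library: synthetic stand-ins for clean speech and RIRs.
speech_library = [rng.standard_normal(160) for _ in range(4)]
rir_library = [np.exp(-np.arange(32) / 8.0) * rng.standard_normal(32)
               for _ in range(3)]

def draw_training_data(speech_library, rir_library, rng):
    """Randomly pick one clean utterance and one RIR, then convolve them."""
    speech = speech_library[rng.integers(len(speech_library))]
    rir = rir_library[rng.integers(len(rir_library))]
    training_data = np.convolve(speech, rir)   # reverberant input signal
    return speech, rir, training_data

speech, rir, training_data = draw_training_data(speech_library, rir_library, rng)
```

The same `speech` and `rir` would then be reused to build the matching target data, so that both members of the pair come from the same utterance and room.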

In S12, the i-th control curve is determined according to the i-th room impulse response, and the i-th control curve is multiplied by the i-th room impulse response to obtain the i'-th room impulse response, where i' is a positive integer not greater than N; the i-th pure speech data is convolved with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples.

In an exemplary embodiment, the i-th group of training samples is first obtained; it includes the i-th training data and the i-th target data, and the groups obtained for all values of i together form the N groups of training samples. The i-th target data is obtained as follows: as shown in FIG. 2, after the i-th pure speech data and the i-th room impulse response are obtained, S12 is executed. The specific implementation is described in the following examples:

In an exemplary embodiment, after the i-th pure speech data and the i-th room impulse response are obtained, the first sampling point with the largest absolute value in the i-th room impulse response is determined; this is the peak position point. The i-th control curve needs a point corresponding to this peak. Therefore, in the i-th control curve, the control point corresponding to the peak position point of the i-th room impulse response is recorded as the main control point p, and the control value corresponding to the main control point p is recorded as the main control value.
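The peak position point can be located as in the following sketch. `numpy.argmax` returns the index of the first occurrence when several samples share the maximum, which matches the "first sampling point with the largest absolute value" rule; the function name `find_peak_position` is a hypothetical choice.

```python
import numpy as np

def find_peak_position(rir):
    """Index of the first sample with the largest absolute value."""
    return int(np.argmax(np.abs(np.asarray(rir, dtype=float))))

# Two samples share the largest absolute value (0.9); the first one wins.
rir = np.array([0.0, 0.02, -0.9, 0.4, 0.9, -0.1])
p = find_peak_position(rir)
```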

Reverb includes early and late reflections. In the process of reverberation, early reflections can have the effect of enhancing the speech signal, so they can be preserved. Therefore, parameters can be used to control the duration and intensity of the reverberation in the room impulse response, so as to selectively retain the reverberation in different stages.

In an exemplary embodiment, the i-th control curve can be generated from preset parameters and multiplied by the value of each sampling point in the i-th room impulse response to generate an i'-th room impulse response that contains only direct sound and/or early reverberation. Since sound propagation takes a certain time, and the intensity of the direct sound and the early reflections is greater than that of the late reflections, the amplitude of the sampling points of the i'th room impulse response is zero or very small at first, then increases rapidly, reaches a maximum at the direct sound and/or early reflections, and gradually decreases during the late reflections. The control values of the i-th control curve can therefore follow the same pattern. The main control point p is located near the direct sound or the early reflections, and no control value in the i-th control curve is greater than the control value at the main control point p.

In an exemplary embodiment, parameters can be used to adjust the control values in the i-th control curve, so as to control the amplitude, position and duration of the direct sound, early reverberation and late reverberation in the i'-th room impulse response. The control values can be adjusted as required, that is, the shape of the i-th control curve can be changed freely and is not limited in this embodiment. The examples given below are illustrative only and do not represent all feasible solutions; for example, a bell curve similar to a normal (Gaussian) distribution can also be used.

FIG. 3 shows a schematic diagram of the control curve. In an exemplary embodiment, since the amplitude of the sampling points of the i'th room impulse response is zero or very small at first, this part can be ignored and its control values can be set to 0, as shown by the dotted line on the left side of FIG. 3. Before the main control point p, the amplitude of the i'th room impulse response rises rapidly. To control this section, the i-th control curve is denoted m and the corresponding segment of m is denoted m1, whose control values can be adjusted through parameters; for example, m1 is preset as an exponentially rising curve, as shown in FIG. 3. The i'th room impulse response then enters the stage of direct sound and early reflections; the corresponding segment of the control curve m is denoted m2, and the amplitude of this section of the i'th room impulse response is left unchanged. That is, all control values of the control points in segment m2 are set to 1, so that m2 is a straight line, as shown in FIG. 3.

In an exemplary embodiment, after the main control point p, the i-th control curve is used to control early reverberation and late reverberation. The segment of the i-th control curve that controls the early reverberation is denoted m3, and the amplitude in this segment is left unchanged; that is, all control values of the control points in segment m3 are set to 1, so that m3 is a straight line, as shown in FIG. 3. The segment of the i-th control curve that controls the late reverberation is denoted m4, and its control parameters can also be adjusted; for example, m4 is preset as an exponentially decaying curve, as shown in FIG. 3. The duration of m3 can be set to the length of the early reverberation that needs to be preserved, and the duration of m4 can be set to the required reverberation time, such as T60. After m4, corresponding to the part where the amplitude of the i'th room impulse response gradually decays to 0, the corresponding control values in the control curve can be set directly to 0, as shown by the dotted line on the right side of FIG. 3. A curve as shown in FIG. 3 is generated according to the embodiment described above.
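The segmented curve of FIG. 3 can be sketched as follows. This is only one possible parameterization: the function name `build_control_curve`, the segment lengths given in samples, and the `floor_db` parameter (the level at the outer ends of m1 and m4, here -60 dB by analogy with T60) are all assumptions made for illustration.

```python
import numpy as np

def build_control_curve(n, p, m1_len, m2_len, m3_len, m4_len, floor_db=-60.0):
    """Piecewise curve: 0 | m1 (exp. rise) | m2 = 1 | p | m3 = 1 | m4 (exp. decay) | 0."""
    floor = 10.0 ** (floor_db / 20.0)    # control value at the outer ends of m1 and m4
    m2_start = p - m2_len
    m1_start = m2_start - m1_len
    m3_end = p + 1 + m3_len              # m3 begins right after the main control point
    m4_end = m3_end + m4_len
    if m1_start < 0 or m4_end > n:
        raise ValueError("segments do not fit inside the impulse response")

    curve = np.zeros(n)                  # dotted parts of FIG. 3: control value 0
    if m1_len > 0:                       # m1: exponential rise from floor towards 1
        curve[m1_start:m2_start] = floor ** (1.0 - np.arange(m1_len) / m1_len)
    curve[m2_start:m3_end] = 1.0         # m2, main control point p, and m3
    if m4_len > 0:                       # m4: exponential decay from 1 down to floor
        curve[m3_end:m4_end] = floor ** (np.arange(1, m4_len + 1) / m4_len)
    return curve

curve = build_control_curve(n=64, p=10, m1_len=4, m2_len=2, m3_len=8, m4_len=16)
```

In this example the rise occupies samples 4 to 7, the flat part samples 8 to 18, and the decay samples 19 to 34; everything else is zero. Setting `m1_len` or `m4_len` to 0 removes the corresponding segment.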

In an exemplary embodiment, the parameters of the m1 segment can also be set to all 0s (or the duration of the m1 segment is set to 0), or m1 can be changed to a straight line that changes linearly. The parameters of the m4 segment can also be set to all 0 (or set the duration of the m4 segment to 0), or change m4 to a linear attenuation line. The m2 segment and the m3 segment can also be set as a straight line with a linear change, or as an exponential curve.

In an exemplary embodiment, in addition to the above examples, the shapes and lengths of m1, m2, m3, and m4 (that is, the control of the amplitude and duration of different stages of the i'th room impulse response) can be adjusted according to actual conditions , with no established rules. However, in order to ensure the effect of speech enhancement, the control value corresponding to the main control point p still needs to be kept as the maximum value in the i-th control curve, and the control values of other control points are not greater than the main control value. In addition, the segmentation of the i-th control curve is not limited to the four segments m1, m2, m3, and m4, and the number of segments can be increased or decreased arbitrarily, which is not limited here.
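Whatever shape is chosen, the constraint that no control value exceeds the main control value can be checked directly. The sketch below uses a hypothetical helper `is_valid_control_curve` and, as an example shape, a Gaussian bell curve centred on p; the width `sigma` is an arbitrary illustrative value.

```python
import numpy as np

def is_valid_control_curve(curve, p):
    """True if the control value at main control point p is the maximum."""
    curve = np.asarray(curve, dtype=float)
    return bool(np.all(curve <= curve[p]))

n, p, sigma = 64, 10, 6.0
bell_curve = np.exp(-0.5 * ((np.arange(n) - p) / sigma) ** 2)  # peak of 1 at p

bad_curve = bell_curve.copy()
bad_curve[40] = 1.5                      # boosts late reverberation above the peak

bell_ok = is_valid_control_curve(bell_curve, p)
bad_ok = is_valid_control_curve(bad_curve, p)
```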

In an exemplary embodiment, after the i-th control curve is generated, the control value of each control point in the i-th control curve is multiplied by the value of the corresponding sampling point in the i-th room impulse response to obtain the controlled i'-th room impulse response.

In an exemplary embodiment, after the i'-th room impulse response is obtained, the tail samples whose amplitude is zero or very small can be truncated, and the number of samples of the processed i'-th room impulse response decreases accordingly. According to the convolution theorem, such processing does not affect the result of the convolution, and it saves storage and computing resources.

FIG. 4 shows a schematic diagram of the i-th room impulse response, and FIG. 5 shows a schematic diagram of the i'-th room impulse response. As shown in FIG. 4, the i-th room impulse response has a long reverberation, and there is a certain amount of noise at its head and tail. With the method described in the above embodiment, the late reverberation in the i-th room impulse response can be removed while the early reverberation is retained. After the i-th room impulse response is processed with the i-th control curve, the resulting i'-th room impulse response is as shown in FIG. 5: only the direct sound and early reverberation are retained, and the head and tail noise of the i-th room impulse response is removed.

In an exemplary embodiment, the i-th pure speech data is convolved with the i'-th room impulse response to obtain the i-th target data. The i-th target data is the target label data used during speech enhancement model training.
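The following sketch illustrates why a target built this way stays aligned with the training data: the controlled impulse response keeps the direct-sound delay of the original one, so the target does not lead the input. The impulse response, control curve and signal lengths are synthetic values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
speech = rng.standard_normal(200)        # stand-in for the i-th pure speech data

# Synthetic i-th RIR: direct sound at sample 5, one early and one late reflection.
rir = np.zeros(50)
rir[5], rir[12], rir[30] = 1.0, 0.5, 0.3

# Illustrative control curve: keep samples 5..19, discard everything else.
curve = np.zeros(50)
curve[5:20] = 1.0
rir_prime = rir * curve                  # i'-th RIR: direct sound + early reflection

training_data = np.convolve(speech, rir)       # model input
target_data = np.convolve(speech, rir_prime)   # training label

# Expected target: speech delayed by 5 samples plus 0.5 * speech delayed by 12.
expected = np.zeros(training_data.size)
expected[5:205] += speech
expected[12:212] += 0.5 * speech
```

Both signals have the same length and share the 5-sample propagation delay; only the contribution of the late reflection at sample 30 is absent from the target.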

In an exemplary embodiment, the i-th pure speech data is convolved with the i-th room impulse response to obtain the i-th training data in the i-th group of training samples, which, together with the i-th target data, is input into the neural network as the i-th group of training samples, as shown in FIG. 2. The groups of training samples are thus used to train the speech enhancement model continuously until the output of the model achieves a good speech enhancement result.

In an exemplary embodiment, the i-th training data input to the neural network may contain noise in addition to the reverberant speech produced with the room impulse response. For example, the i-th training data in the i-th group of training samples may be obtained by convolving the i-th pure speech data with the i-th room impulse response and adding the result to noise data; or by adding the i-th pure speech data to the noise data and convolving the sum with the i-th room impulse response. Whether to add noise depends on whether the model needs noise reduction capability. In addition, to make the training samples more diverse, the amplitudes of the i-th pure speech data and the noise data can be randomly scaled.
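The two noise variants and the random amplitude scaling can be sketched as below. The function names, the gain ranges and the synthetic signals are assumptions for illustration; the disclosure does not specify how the noise is cropped or how the gains are drawn.

```python
import numpy as np

def reverb_then_noise(speech, rir, noise):
    """Convolve speech with the RIR, then add (cropped) noise."""
    reverberant = np.convolve(speech, rir)
    if noise.size < reverberant.size:
        raise ValueError("noise is shorter than the reverberant signal")
    return reverberant + noise[:reverberant.size]

def noise_then_reverb(speech, rir, noise):
    """Add (cropped) noise to the speech, then convolve the sum with the RIR."""
    if noise.size < speech.size:
        raise ValueError("noise is shorter than the speech signal")
    return np.convolve(speech + noise[:speech.size], rir)

rng = np.random.default_rng(0)
speech = rng.standard_normal(100) * rng.uniform(0.1, 1.0)   # random speech gain
noise = rng.standard_normal(200) * rng.uniform(0.01, 0.1)   # random noise gain
rir = np.exp(-np.arange(20) / 4.0)

variant_a = reverb_then_noise(speech, rir, noise)   # noise is not reverberated
variant_b = noise_then_reverb(speech, rir, noise)   # noise is reverberated too
```

In the second variant the noise passes through the same room response as the speech, which by linearity equals the reverberant speech plus reverberant noise.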

In an exemplary embodiment, the signal enhancement method used to remove the reverberation and noise is not limited; for example, it can be an ideal binary mask (Ideal Binary Mask, IBM), an ideal ratio mask (Ideal Ratio Mask, IRM), an ideal amplitude mask (Ideal Amplitude Mask, IAM), a phase-sensitive mask (Phase-Sensitive Mask, PSM), a complex ideal ratio mask (Complex Ideal Ratio Mask, CIRM), or the like.
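For orientation, two of the listed masks are sketched below in one commonly used formulation, computed from magnitude spectrograms of the desired signal and of the interference. The disclosure does not specify these formulas; the function names, the 0 dB decision threshold of the binary mask and the `eps` guard are assumptions.

```python
import numpy as np

def ideal_binary_mask(target_mag, interference_mag):
    """1 where the target dominates the interference, else 0 (0 dB threshold)."""
    return (target_mag > interference_mag).astype(float)

def ideal_ratio_mask(target_mag, interference_mag, eps=1e-12):
    """Square root of the target-to-total energy ratio per time-frequency bin."""
    t2 = target_mag ** 2
    i2 = interference_mag ** 2
    return np.sqrt(t2 / (t2 + i2 + eps))   # eps avoids division by zero

# Toy 2 x 2 magnitude "spectrograms" (frequency bins x frames).
target_mag = np.array([[2.0, 0.0], [1.0, 3.0]])
interference_mag = np.array([[1.0, 1.0], [1.0, 4.0]])

ibm = ideal_binary_mask(target_mag, interference_mag)
irm = ideal_ratio_mask(target_mag, interference_mag)
```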

In an exemplary embodiment, the above-mentioned neural network can be a deep neural network (Deep Neural Network, DNN), a convolutional neural network (Convolutional Neural Network, CNN), a recurrent neural network (Recurrent Neural Network, RNN), a long short-term memory network (Long Short-Term Memory, LSTM), etc., which is not limited here.

The following are device embodiments of the present disclosure, which can be used to implement the method embodiments of the present disclosure. For details not disclosed in the disclosed device embodiments, please refer to the disclosed method embodiments.

Wherein, FIG. 6 shows a structural diagram of an apparatus for training a speech enhancement model according to an exemplary embodiment of the present disclosure. Please refer to FIG. 6 , the speech enhancement model training device shown in this figure can be implemented as all or part of the terminal through software, hardware or a combination of the two, and can also be integrated on the server as an independent module.

In an exemplary embodiment, the training device 600 of the speech enhancement model includes: an acquisition module 601 and a training module 602, wherein:

An acquisition module 601, configured to: acquire N groups of training samples, wherein the i-th group of training samples includes: the i-th training data and the i-th target data, where N is a positive integer, and i is a positive integer not greater than N;

A training module 602, configured to: train the speech enhancement model with the N groups of training samples;

Wherein, the acquisition module 601 is specifically configured to: acquire the i-th room impulse response and the i-th pure speech data, and process the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples; determine the i-th control curve according to the i-th room impulse response, and multiply the i-th control curve by the i-th room impulse response to obtain the i'-th room impulse response, where i' is a positive integer not greater than N; and convolve the i-th pure speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples.
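The data flow handled by the acquisition module can be sketched as follows, assuming that the control curve has already been determined and that it is applied to the room impulse response sample by sample. The function name `make_training_pair` and the use of full-length convolution are assumptions made for this example only.

```python
import numpy as np

def make_training_pair(clean, rir, control_curve):
    """Return (training data, target data) for one group of training samples."""
    if len(control_curve) != len(rir):
        raise ValueError("control curve needs one control value per RIR sample")
    # Training data: pure speech convolved with the original RIR.
    training = np.convolve(clean, rir)
    # i'-th RIR: the original RIR multiplied point by point by the control curve.
    shaped_rir = control_curve * rir
    # Target data: pure speech convolved with the shaped RIR.
    target = np.convolve(clean, shaped_rir)
    return training, target

clean = np.array([1.0, 2.0, 3.0])
rir = np.array([0.0, 1.0, 0.5, 0.25])
curve = np.array([1.0, 1.0, 0.0, 0.0])   # keeps only the direct path
training, target = make_training_pair(clean, rir, curve)
```

Because the training data and the target data are derived from the same room impulse response, they stay aligned in time, which is what lets a network learn the mapping from one to the other.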

It should be noted that, when the speech enhancement model training device provided by the above-mentioned embodiments performs the speech enhancement model training method, the division of the above-mentioned functional modules is used only as an example for illustration. In practical applications, the above-mentioned functions can be assigned to different functional modules as needed; that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the speech enhancement model training device and the speech enhancement model training method provided by the above embodiments belong to the same concept, so for details not disclosed in the device embodiments, please refer to the embodiments of the speech enhancement model training method described above in this disclosure, which will not be repeated here.

The serial numbers of the above-mentioned embodiments of the present disclosure are for description only, and do not indicate the relative merits of the embodiments.

An embodiment of the present disclosure also provides a readable storage medium on which a computer program is stored, and when the program is executed by a processor, the steps of the method in any one of the foregoing embodiments are implemented. The readable storage medium may include, but is not limited to, any type of disk, including floppy disks, optical disks, DVDs, CD-ROMs, microdrives and magneto-optical disks, ROM, RAM, EPROM, EEPROM, DRAM, VRAM, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of medium or device suitable for storing instructions and/or data.

An embodiment of the present disclosure also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and operable on the processor. When the processor executes the program, the steps of the method in any of the foregoing embodiments are implemented.

FIG. 7 schematically shows a structural diagram of an electronic device according to an exemplary embodiment of the present disclosure. Referring to FIG. 7, an electronic device 700 includes: a processor 701 and a memory 702.

In the embodiment of the present disclosure, the processor 701 is a control center of the computer system, and may be a processor of a physical machine or a processor of a virtual machine. The processor 701 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 701 may be implemented in at least one hardware form among digital signal processing (Digital Signal Processing, DSP), a field-programmable gate array (Field-Programmable Gate Array, FPGA), and a programmable logic array (Programmable Logic Array, PLA). The processor 701 may also include a main processor and a coprocessor. The main processor is a processor for processing data in the wake-up state, also called a central processing unit (Central Processing Unit, CPU); the coprocessor is a low-power processor for processing data in the standby state.

In the embodiment of the present disclosure, the above-mentioned processor 701 is specifically used to:

Acquire N groups of training samples, wherein the i-th group of training samples includes: the i-th training data and the i-th target data, where N is a positive integer, and i is a positive integer not greater than N; and train a speech enhancement model with the N groups of training samples; wherein acquiring the i-th group of training samples includes: acquiring the i-th room impulse response and the i-th pure speech data, and processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples; determining the i-th control curve according to the i-th room impulse response, and multiplying the i-th control curve by the i-th room impulse response to obtain the i'-th room impulse response, where i' is a positive integer not greater than N; and convolving the i-th pure speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples.

Further, in an embodiment of the present disclosure, the i-th room impulse response includes a plurality of sampling points; the i-th control curve includes a plurality of control values, and the number of the control values is the same as the number of sampling points of the i-th room impulse response; for the i'-th room impulse response, when the values of the sampling points at the tail are zero or their absolute values are very small, tail truncation processing may optionally be performed.
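Tail truncation of the shaped room impulse response can be sketched as follows. The disclosure does not state what counts as "very small", so the `threshold` value and the function name `truncate_tail` are assumptions made for this example, as is the choice to keep one sample when every sample is below the threshold.

```python
import numpy as np

def truncate_tail(rir, threshold=1e-4):
    """Drop trailing samples whose absolute value does not exceed threshold."""
    significant = np.flatnonzero(np.abs(rir) > threshold)
    if significant.size == 0:
        # Every sample is negligible: keep a single sample (assumed choice)
        # so that the result can still be used in a convolution.
        return rir[:1]
    return rir[: significant[-1] + 1]

shaped_rir = np.array([0.0, 1.0, 0.5, 1e-6, 0.0])
short_rir = truncate_tail(shaped_rir)   # keeps [0.0, 1.0, 0.5]
```

A shorter impulse response makes the subsequent convolution with the pure speech data cheaper while changing the result only by the negligible discarded tail.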

Optionally, determining the i-th control curve according to the i-th room impulse response includes: determining the absolute value of each sampling point in the i-th room impulse response, wherein the absolute values include multiple maxima of equal value; in the i-th room impulse response, determining the sampling point corresponding to the first maximum among the absolute values as the peak position point; and determining the control value in the i-th control curve that corresponds to the peak position point of the i-th room impulse response as the main control value of the i-th control curve.

Optionally, determining the i-th control curve according to the i-th room impulse response further includes: adjusting the control values of the i-th control curve through parameters to determine the i-th control curve; wherein all control values other than the main control value are not greater than the main control value.
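The peak search and the constraints on the control curve can be sketched as follows. The exponential decay after the peak is only one possible shape that satisfies the stated constraints and is assumed here for illustration; the function name `build_control_curve` and the parameters `decay` and `main_value` are hypothetical.

```python
import numpy as np

def build_control_curve(rir, decay=0.01, main_value=1.0):
    """Return (control curve, peak index) for one room impulse response."""
    # np.argmax returns the first occurrence when several samples share
    # the maximum absolute value, matching the "first maximum" rule.
    peak = int(np.argmax(np.abs(rir)))
    # One control value per sampling point, initialised to the main value.
    curve = np.full(len(rir), main_value, dtype=float)
    # Assumed shape: exponential decay after the peak. With decay >= 0 no
    # control value exceeds the main control value at the peak.
    steps = np.arange(len(rir) - peak)
    curve[peak:] = main_value * np.exp(-decay * steps)
    return curve, peak

rir = np.array([0.1, -1.0, 0.5, 1.0, 0.2])   # |rir| has two equal maxima
curve, peak = build_control_curve(rir, decay=0.5)
```

Changing `decay` is one way of "adjusting the control values through parameters": a larger value suppresses the late reverberation in the target data more strongly, while a value of zero leaves the room impulse response unchanged.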

Optionally, processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples includes: convolving the i-th pure speech data with the i-th room impulse response to obtain the i-th training data in the i-th group of training samples.

Optionally, processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples includes: convolving the i-th pure speech data with the i-th room impulse response, and adding the result to noise data to obtain the i-th training data in the i-th group of training samples.

Optionally, processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples includes: adding the i-th pure speech data to noise data, and convolving the sum with the i-th room impulse response to obtain the i-th training data in the i-th group of training samples.

Memory 702 may include one or more readable storage media, which may be non-transitory. The memory 702 may also include high-speed random access memory, and non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices. In some embodiments of the present disclosure, the non-transitory readable storage medium in the memory 702 is used to store at least one instruction, and the at least one instruction is used to be executed by the processor 701 to implement the methods in the embodiments of the present disclosure.

In some embodiments, the electronic device 700 further includes: a peripheral device interface 703 and at least one peripheral device. The processor 701, the memory 702, and the peripheral device interface 703 may be connected through buses or signal lines. Each peripheral device can be connected to the peripheral device interface 703 through a bus, a signal line, or a circuit board. Specifically, the peripheral device includes: at least one of a display screen 704, a camera 707, and an audio circuit 706.

The peripheral device interface 703 may be used to connect at least one peripheral device related to input/output (Input/Output, I/O) to the processor 701 and the memory 702. In some embodiments of the present disclosure, the processor 701, the memory 702, and the peripheral device interface 703 are integrated on the same chip or circuit board; in some other embodiments of the present disclosure, any one or two of the processor 701, the memory 702, and the peripheral device interface 703 can be implemented on separate chips or circuit boards. Embodiments of the present disclosure do not specifically limit this.

The display screen 704 is used to display a user interface (User Interface, UI). The UI can include graphics, text, icons, video, and any combination thereof. When the display screen 704 is a touch display screen, the display screen 704 also has the ability to collect touch signals on or above the surface of the display screen 704. The touch signal can be input to the processor 701 as a control signal for processing. In this case, the display screen 704 can also be used to provide virtual buttons and/or virtual keyboards, also called soft buttons and/or soft keyboards. In some embodiments of the present disclosure, there may be one display screen 704, which is arranged on the front panel of the electronic device 700; in other embodiments of the present disclosure, there may be at least two display screens 704, which are respectively arranged on different surfaces of the electronic device 700 or are in a folded design; in some other embodiments of the present disclosure, the display screen 704 may be a flexible display screen, which is arranged on a curved surface or a folded surface of the electronic device 700. The display screen 704 can even be set as a non-rectangular irregular figure, that is, a special-shaped screen. The display screen 704 can be made of materials such as a liquid crystal display (Liquid Crystal Display, LCD) or an organic light-emitting diode (Organic Light-Emitting Diode, OLED).

The camera 707 is used to collect images or videos. Optionally, the camera 707 includes a front camera and a rear camera. Usually, the front camera is arranged on the front panel of the electronic device, and the rear camera is arranged on the back of the electronic device. In some embodiments, there are at least two rear cameras, each of which is any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blur function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and virtual reality (Virtual Reality, VR) shooting functions or other fusion shooting functions. In some embodiments of the present disclosure, the camera 707 may also include a flash. The flash can be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.

The audio circuit 706 may include a microphone and a speaker. The microphone is used to collect sound waves of the user and the environment, convert the sound waves into electrical signals, and input them to the processor 701 for processing. For the purpose of stereo sound collection or noise reduction, there may be multiple microphones, which are respectively arranged in different parts of the electronic device 700. The microphone can also be an array microphone or an omnidirectional collection microphone.

The power supply 707 is used to supply power to various components in the electronic device 700. The power supply 707 may be an alternating-current supply, a direct-current supply, a disposable battery, or a rechargeable battery. When the power supply 707 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is a battery charged through a wired line, and a wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery can also support fast charging technology.

The structural block diagram of the electronic device shown in the embodiment of the present disclosure does not constitute a limitation on the electronic device 700, and the electronic device 700 may include more or fewer components than shown in the figure, or combine some components, or adopt a different component arrangement.

In the present disclosure, the terms "first", "second", etc. are only used for the purpose of description, and cannot be understood as indicating or implying relative importance or order; the term "plurality" refers to two or more, unless expressly defined otherwise. Terms such as "installation", "connection", and "fixed" should be interpreted in a broad sense; for example, a "connection" can be a fixed connection, a detachable connection, or an integral connection, and can be direct or indirect through an intermediary. Those of ordinary skill in the art can understand the specific meanings of the above terms in the present disclosure according to specific situations.

In the description of the present disclosure, it should be understood that the orientation or positional relationship indicated by the terms "upper", "lower" and the like is based on the orientation or positional relationship shown in the drawings, and is only for the convenience of describing the present disclosure and simplifying the description. It does not indicate or imply that a referenced device or unit must have a particular orientation or be constructed and operated in a particular orientation, and therefore should not be construed as limiting the present disclosure.

The above is only a specific embodiment of the present disclosure, but the scope of protection of the present disclosure is not limited thereto. Any changes or substitutions that a person familiar with the technical field can easily think of within the technical scope of the present disclosure should fall within the protection scope of the present disclosure. Therefore, equivalent changes made according to the claims of the present disclosure still fall within the scope of the present disclosure.

Claims

A training method for a speech enhancement model, characterized in that it comprises:

Obtaining N groups of training samples, wherein the i-th group of training samples includes: the i-th training data and the i-th target data, where N is a positive integer, and i is a positive integer not greater than N;

Training a speech enhancement model by the N groups of training samples;

Wherein, obtaining the i-th group of training samples includes:

Obtaining the i-th room impulse response and the i-th pure speech data, and processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples;

Determining the i-th control curve according to the i-th room impulse response, and multiplying the i-th control curve by the i-th room impulse response to obtain the i'-th room impulse response, wherein i' is a positive integer not greater than N; and convolving the i-th pure speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples.

The training method of the speech enhancement model according to claim 1, characterized in that,

The i-th room impulse response includes a plurality of sampling points;

The i-th control curve includes a plurality of control values, and the number of the control values is the same as the number of sampling points of the i-th room impulse response;

For the i'-th room impulse response, when the values of the sampling points at the tail are zero or their absolute values are very small, tail truncation processing may optionally be performed.

The training method of the speech enhancement model according to claim 2, wherein said determining the i-th control curve according to the i-th room impulse response comprises:

determining the absolute value of each sampling point in the i-th room impulse response, wherein the absolute value includes multiple maximum values with equal values;

In the i-th room impulse response, determining the sampling point corresponding to the first maximum value in the absolute value as the peak position point;

Determining the control value in the i-th control curve that corresponds to the peak position point of the i-th room impulse response as the main control value of the i-th control curve.

The training method of the speech enhancement model according to claim 3, wherein said determining the i-th control curve according to the i-th room impulse response further comprises:

Adjusting the control values of the i-th control curve through parameters to determine the i-th control curve; wherein all control values other than the main control value are not greater than the main control value.

The training method of the speech enhancement model according to claim 1, wherein said processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples comprises:

Convolving the i-th pure speech data with the i-th room impulse response to obtain the i-th training data in the i-th group of training samples.

The training method of the speech enhancement model according to claim 5, wherein said processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples comprises:

Convolving the i-th pure speech data with the i-th room impulse response, and adding the result to noise data to obtain the i-th training data in the i-th group of training samples.

The training method of the speech enhancement model according to claim 5, wherein said processing the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples comprises:

Adding the i-th pure speech data to noise data, and convolving the sum with the i-th room impulse response to obtain the i-th training data in the i-th group of training samples.

A training device for a speech enhancement model, characterized in that it comprises:

An acquisition module, configured to: obtain N groups of training samples, wherein the i-th group of training samples includes: the i-th training data and the i-th target data, where N is a positive integer, and i is a positive integer not greater than N;

A training module, configured to: train a speech enhancement model through the N groups of training samples;

Wherein, the acquisition module is specifically configured to: acquire the i-th room impulse response and the i-th pure speech data, and process the i-th room impulse response and the i-th pure speech data to obtain the i-th training data in the i-th group of training samples; determine the i-th control curve according to the i-th room impulse response, and multiply the i-th control curve by the i-th room impulse response to obtain the i'-th room impulse response, wherein i' is a positive integer not greater than N; and convolve the i-th pure speech data with the i'-th room impulse response to obtain the i-th target data in the i-th group of training samples.

A terminal, comprising a memory, a processor, and a computer program stored in the memory and operable on the processor, characterized in that, when the processor executes the computer program, the training method of the speech enhancement model according to any one of claims 1 to 7 is implemented.

A readable storage medium on which a computer program is stored, wherein, when the computer program is executed by a processor, the training method of the speech enhancement model according to any one of claims 1 to 7 is implemented.