WO2022178970A1 - Speech noise reducer training method, apparatus, computer device and storage medium - Google Patents


Publication number
WO2022178970A1
WO2022178970A1 · PCT/CN2021/090177 (CN2021090177W)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
morphological
data
speech
difference
Prior art date
Application number
PCT/CN2021/090177
Other languages
English (en)
French (fr)
Inventor
Chen Hao (陈昊)
Original Assignee
Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Publication of WO2022178970A1 publication Critical patent/WO2022178970A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00 Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Definitions

  • the present application relates to the field of speech processing, and in particular, to a method, apparatus, computer equipment and storage medium for training a speech noise reducer.
  • ASR (automatic speech recognition): audio-to-text systems
  • the voice noise reduction method in the traditional technology is usually based on estimating the noise field and then setting a corresponding noise reduction method.
  • although this method of estimating the noise field can have a good effect in certain settings, it often underperforms in others; this is due to the limitations of the noise field estimation method.
  • more robust methods based on artificial intelligence have been developed.
  • GAN: Generative Adversarial Network
  • AE: Autoencoder
  • the first solution often has good convergence characteristics and can clearly converge in the direction of noise reduction.
  • its disadvantage is that it needs corresponding noise-free data as the target, which is very difficult to obtain in practice, so an approximation method is often required.
  • the approximation leads to insufficient diversity of the simulated noise conditions, so the model often does not have high generalization ability.
  • the second solution does not need noise-free data as the target, but such a network is often difficult to converge and it is difficult to obtain better results; moreover, because the discriminator in a generative adversarial network is set too freely, some aspects are not very controllable, and noise reduction can be poor in places.
  • the present application provides a training method, apparatus, computer equipment and storage medium for a speech noise reducer to solve the technical problems of slow network convergence and poor adaptability and robustness in the prior art.
  • a method for training a speech noise reducer comprising:
  • the morphological voice database includes a plurality of morphological voice data obtained by combining, according to the noise generation algorithm, the noise voice with the ontology voice (the clean original speech);
  • Noise reduction training is performed on the noise reducer to be trained according to the voice difference, and when the voice difference meets a preset difference value, the noise reducer obtained by this round of noise reduction training is used as the trained noise reducer.
  • a speech noise reducer training device includes:
  • a building module for constructing a morphological speech database, wherein the morphological speech database includes a plurality of morphological speech data obtained by combining noise speech obtained by a noise generation algorithm with the ontology speech;
  • a calculation module configured to calculate the voice output of the morphological voice data according to the noise reducer to be trained, and calculate the voice difference between the voice output and the morphological voice data in the morphological voice database by using a preset voice loss function;
  • a training module used to perform noise reduction training on the noise reducer to be trained according to the voice difference, and, when the voice difference meets the preset difference value, to use the noise reducer obtained in this round of noise reduction training as the trained noise reducer.
  • a computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the steps of the above-mentioned speech noise reducer training method.
  • a computer-readable storage medium storing computer-readable instructions, when the computer-readable instructions are executed by a processor, implement the steps of the above-mentioned speech noise reducer training method.
  • in the above-mentioned training method, device, computer equipment and storage medium for a speech noise reducer, voice data containing noise is input into the noise reducer to be trained for multiple rounds of noise reduction training; that is, the outputs for different voice data and the voice differences between different voice
  • data are calculated, and the network weights of the denoiser are adjusted according to the voice difference until the round of noise reduction training ends; the next round of noise reduction training is then carried out on the basis of the current training.
  • FIG. 1 is a schematic diagram of the application environment of the speech noise reducer training method
  • FIG. 2 is a schematic flowchart of the speech noise reducer training method
  • FIG. 3 is a schematic structural diagram of the noise reducer
  • FIG. 4 is a schematic diagram comparing network convergence speeds
  • FIG. 5 is a schematic diagram of the speech noise reducer training apparatus
  • FIG. 6 is a schematic diagram of a computer device in one embodiment.
  • the speech noise reducer training method provided in the embodiment of the present application can be applied to the application environment shown in FIG. 1 .
  • the application environment may include a terminal 102, a network, and a server 104.
  • the network is used to provide a communication link medium between the terminal 102 and the server 104.
  • the network may include various connection types, such as wired, wireless communication links or fiber optic cables, etc.
  • the user can use the terminal 102 to interact with the server 104 through the network to receive or send messages and the like.
  • Various communication client applications may be installed on the terminal 102, such as web browser applications, shopping applications, search applications, instant communication tools, email clients, social platform software, and the like.
  • the terminal 102 can be various electronic devices that have a display screen and support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptops, desktops, and the like.
  • the server 104 may be a server that provides various services, for example, a background server that provides support for pages displayed on the terminal 102 .
  • the voice noise reducer training method provided by the embodiments of the present application is generally performed by the server/terminal, and accordingly, the voice noise reducer training apparatus is generally set in the server/terminal device.
  • the present application may be used in numerous general purpose or special purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, and the like.
  • the application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • the application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • program modules may be located in both local and remote computer storage media including storage devices.
  • the present application can be applied in the field of smart cities, especially in the field of smart banks, so as to promote the construction of smart cities.
  • terminals, networks and servers in FIG. 1 are only illustrative. There can be any number of terminal devices, networks and servers according to implementation needs.
  • the terminal 102 communicates with the server 104 through the network.
  • the server 104 constructs a morphological voice database by combining the noise voice obtained from the terminal 102 with the ontology voice, and then performs multiple rounds of noise reduction training on the noise reducer to be trained according to the morphological voice data and the voice loss function, obtaining a trained noise reducer. In each round of noise reduction training, the voice output of the morphological voice data is calculated by the noise reducer to be trained, the voice difference is calculated from the voice output and the voice loss function, and noise reduction training is performed on the noise reducer to be trained according to that voice difference. If the voice difference in the current round of noise reduction training meets the preset difference value, the noise reducer obtained in this round is used as the trained denoiser.
  • the terminal 102 can use the trained noise reducer to perform speech noise reduction processing.
  • the terminal 102 and the server 104 are connected through a network
  • the network may be a wired network or a wireless network
  • the terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices
  • the server 104 can be implemented by an independent server or a server cluster composed of multiple servers.
  • a method for training a speech noise reducer is provided, which is described by taking the method applied to the server in FIG. 1 as an example, including the following steps:
  • Step 202 constructing a morphological speech database, wherein the morphological speech database includes a plurality of morphological speech data obtained by combining the noise speech and the ontology speech obtained according to the noise generation algorithm.
  • the solution of the present application is an intelligent speech noise reduction technology based only on noisy data, that is, within the framework of deep learning, noise reduction can be achieved by using only noisy data.
  • the conventional noise reduction method is usually based on estimating the noise field, and then setting the corresponding noise reduction method.
  • this method of estimating the noise field can have a good effect in some specific occasions, but it often does not perform well in other occasions. This is due to the limitation of the noise field estimation method.
  • GAN: generative adversarial networks
  • AE: autoencoders
  • the first solution often has better convergence characteristics and can clearly converge in the direction of noise reduction; its disadvantage is that it requires corresponding noise-free data as the target, which is very difficult to obtain in practice, so an approximation method is often required.
  • the second scheme does not require noise-free data as a target, but such a network is often difficult to converge and it is difficult to obtain better results; even statistically
  • good results are prone to an abnormal situation, namely poor noise reduction in some sections. This is mainly because the discriminator in the generative adversarial network is set too freely and is therefore not very controllable in some aspects.
  • the present application attempts to build on this robust encoder-decoder approach, but improve upon it to obtain an encoder-decoder based approach that does not require the use of noise-free speech as a target.
  • the main idea of this application is based on the statistical law of large numbers: a piece of speech contains noise, and while the noise is assumed to take various forms, the real signal is deterministic. If multiple voices are collected that contain the same deterministic real signal but noise of different forms, the expectation over the collected data should be close to the real signal, so that noise reduction of the voice signal is actually achieved.
  • the server needs to build a morphological voice database as training samples to train the noise reducer: obtain a noise database, which includes multiple noise voices recorded in different environments; generate the ontology voice; and combine the ontology voice with the noise
  • data to obtain a plurality of morphological speech data containing different noise voices.
  • the parts selected here can be complete speech recordings or segments from a complete recording. They can be obtained in the following way: the voices in the voice database are recorded in a standard quiet place (such as a recording studio), the noise field is recorded in a noisy place (such as on the road), and the two recordings are then synthesized to obtain speech data containing noise. For example, ask a person to read a piece of text or dialogue in a relatively quiet place such as a recording studio, and then combine it with the noise of various noise situations to obtain morphological speech data containing different noises.
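The synthesis described above (clean studio recording plus a separately recorded noise field) can be sketched in numpy as follows; the mixing coefficient `level` and the tiling of a short noise clip are assumptions, since the patent does not specify the combination weights:

```python
import numpy as np

def combine_with_noise(ontology, noise, level=0.1):
    """Synthesize one piece of morphological speech data by adding a recorded
    noise field to the clean (ontology) recording. `level` is a hypothetical
    mixing coefficient."""
    ontology = np.asarray(ontology, dtype=float)
    noise = np.asarray(noise, dtype=float)
    n = len(ontology)
    # Tile or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(n / len(noise)))
    noise = np.tile(noise, reps)[:n]
    return ontology + level * noise

def morphological_variants(ontology, noises, level=0.1):
    """One clean utterance combined with several noise fields yields several
    morphological speech data sharing the same deterministic real signal."""
    return [combine_with_noise(ontology, nz, level) for nz in noises]
```

Each variant carries the same underlying signal, which is the property the training method relies on.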
  • alternatively, voice data of people talking or reading aloud in different environments can be obtained directly as morphological voice data; the same conversation or text can also be read aloud by different people in different environments to obtain such voice data.
  • morphological speech data can also be obtained by other such methods, as long as the information expressed by the real signal is consistent, but the noise conditions are different.
  • the data input into the denoiser to be trained is extracted from the data set of the morphological speech database constructed in the above steps.
  • the speech data can also be normalized into the interval [-1, 1],
  • and a dictionary of entries <T, c> is formed to obtain the preprocessed morphological speech data,
  • where T represents the content number
  • and c identifies the specific morphological voice data; for example, <1, 3> represents the third morphological voice data of the content numbered 1.
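A minimal sketch of this preprocessing step (the normalization rule and the input layout `{content_id: [variants]}` are assumptions not fixed by the text):

```python
import numpy as np

def normalize(wave):
    """Scale a waveform into the interval [-1, 1] (assumed peak normalization)."""
    peak = np.max(np.abs(wave))
    return wave / peak if peak > 0 else wave

def build_dictionary(recordings):
    """Map <T, c> keys to normalized morphological speech data.

    `recordings` is {content_id: [variant_0, variant_1, ...]}, where each
    variant is the same utterance under a different noise condition.
    """
    dictionary = {}
    for content_id, variants in recordings.items():
        for c, wave in enumerate(variants, start=1):
            dictionary[(content_id, c)] = normalize(np.asarray(wave, dtype=float))
    return dictionary

# The key (1, 3) would then address the third morphological voice data
# of the content numbered 1, as in the example above.
```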
  • Step 204 Calculate the voice output of the morphological voice data according to the noise reducer to be trained, and use a preset voice loss function to calculate the voice difference between the voice output and the morphological voice data in the morphological voice database.
  • the technical solution of the present application performs multiple rounds of noise reduction training on the noise reducer to be trained according to the morphological speech data and the speech loss function to obtain the trained noise reducer. In each round,
  • the noise reducer to be trained calculates the voice output of the morphological voice data, the voice difference is calculated from the voice output and the voice loss function, and noise reduction training is performed on the noise reducer according to that voice difference. If the voice difference in the current round of noise reduction training meets the preset difference value, the denoiser obtained from this round of denoising training is used as the trained denoiser.
  • the noise reducer used in this application adopts a typical encoder-decoder structure, where each encoder
  • substructure consists of a convolution layer, a batch normalization layer, and a ReLU activation function (how many convolution kernels the convolution layer uses depends on the situation); the input information passes through each encoder from top to bottom, and starting from encoder 2,
  • the output data of each encoder (for example, encoder 3) is copied once, and the copied data is combined with the output of the following decoder stage to form the input of the preceding decoder stage.
  • to train the denoiser, import the dictionary constructed in the above process into the denoiser to be trained and train as follows: for example, if there are 5 voices in the dictionary with the content number 1, one voice is randomly selected from them, say number 1, as the input, and another voice, say number 3, is extracted as the training target.
  • the morphological voice data includes the first morphological voice data and the second morphological voice data.
  • the above-mentioned No. 1 is used as the first form of voice data
  • and the above-mentioned No. 3 is used as the second form of voice data; calculating
  • the voice output of the morphological voice data according to the noise reducer to be trained, and calculating the voice difference based on the voice output and the voice loss function, includes:
  • inputting the first form of speech data into the noise reducer to be trained to obtain the first voice output; comparing the voice difference between the first voice output and the second form of voice data to obtain the first voice difference; adjusting the network weights of the noise reducer to be trained according to the first voice difference to obtain the first state denoiser; updating the first form of speech data and the second form of speech data; inputting the updated first form of speech data into the noise reducer to be trained to obtain the updated first voice output; and comparing the voice difference between the updated first voice output and the updated second form of voice data to obtain the updated first voice difference.
  • the noise reducer to be trained performs the following processing on the input data: as shown in Figure 3 above, the array representing the voice data (the first form of voice data) is input into the noise reducer to be trained; encode1 encodes the data into a relevant information tensor, which is passed down in turn and further encoded by each subsequent encode, producing a new information tensor at each stage. It should be noted that from encode2 downward there is a short link: the new tensor at that point is copied, and the copy is passed to the corresponding decode rather than down the stack.
  • the new tensor is then sent to the next encode layer for encoding, proceeding in sequence until encode4; encode4 encodes a new information tensor and directly merges it with the tensor copied from encode3,
  • and the merged tensor is sent to decode4 for decoding to obtain an information tensor; this proceeds in sequence until the first voice output is obtained.
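The forward pass just described can be sketched as follows. This is a toy illustration of the skip-link topology only: the layer sizes, the stand-in `conv_block` (a linear map plus ReLU in place of conv + batch-norm + ReLU), and the weight layout are all assumptions, not the patent's actual network:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv_block(x, w):
    """Toy stand-in for the conv + batch-norm + ReLU substructure."""
    return relu(x @ w)

def denoiser_forward(x, enc_w, dec_w):
    """Sketch of the encoder-decoder flow in Figure 3.

    Encoders 1-4 run top-down; the tensors produced at encode2 and encode3
    are copied over short links and merged (here: concatenated) with the
    corresponding decoder inputs.
    """
    e1 = conv_block(x, enc_w[0])
    e2 = conv_block(e1, enc_w[1])          # copy passed to decode3's input
    e3 = conv_block(e2, enc_w[2])          # copy merged with encode4's output
    e4 = conv_block(e3, enc_w[3])
    d4 = conv_block(np.concatenate([e4, e3], axis=-1), dec_w[0])
    d3 = conv_block(np.concatenate([d4, e2], axis=-1), dec_w[1])
    d2 = conv_block(d3, dec_w[2])
    return d2 @ dec_w[3]                   # final layer: the first voice output
```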
  • the first voice difference is calculated by subtracting the second form of voice data from the first voice output, giving the voice difference between the output voice data and the voice data that has not been denoised.
  • the reason for calculating the difference between the two voices is that a key advantage of a neural network is that it is extremely good at capturing the common points between the input value and the target.
  • the training process of a neural network is analogous to repeated measurement. In this embodiment, repeated measurement is used to make the neural network approach a state in which, for any speech with the same meaning, a very close result is obtained no matter what form the noise takes. What, then, is this result? It is the set of common points, and these common points are the signal of the real meaning; in this way a real voice can be denoised without affecting its real meaning. This is how the idea is reflected in model training.
  • alternatively, the first voice difference can also be obtained by squaring the computed voice difference.
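The two variants of the voice difference (plain subtraction, or subtraction followed by squaring) can be written compactly; the mean reduction in `mean_speech_loss` is an assumption, since the text does not name a reduction:

```python
import numpy as np

def speech_difference(first_output, second_form, squared=True):
    """Per-sample difference between the denoiser output and the target
    morphological speech data; optionally squared, as the embodiment allows."""
    diff = first_output - second_form
    return diff ** 2 if squared else diff

def mean_speech_loss(first_output, second_form):
    """Scalar training signal: mean squared speech difference (assumed reduction)."""
    return float(np.mean(speech_difference(first_output, second_form)))
```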
  • the network weight of the noise reducer to be trained is adjusted according to the first speech difference to obtain the first state noise reducer.
  • the network weights are adjusted through the backpropagation mechanism of the neural network; this application specifically uses the Adam optimization method.
  • Step 206 Perform noise reduction training on the noise reducer to be trained according to the voice difference, and when the voice difference meets a preset difference value, use the noise reducer obtained from this round of noise reduction training as the trained noise reducer .
  • the voice difference also includes a second voice difference
  • the first state denoiser includes a first network weight
  • noise reduction training is performed on the denoiser to be trained according to the voice difference, including:
  • the first state denoiser may be updated according to the updated first voice difference to obtain the second state denoiser.
  • adjusting the network weights of the noise reducer to be trained according to the first speech difference yields, in addition to the first state noise reducer itself, the first network weight, which is the parameter set of the first state denoiser.
  • likewise, the second network weight is the parameter set of the noise reducer in the second state.
  • noise reduction training can be performed on the noise reducer to be trained through the speech loss function (1), which combines the terms listed below as L = L_n + λ_w·L_w + λ_r·L_r, where
  • L is the total loss,
  • L_n is the consistency loss function,
  • L_w is the weight loss function,
  • L_r is the reconstruction loss function,
  • and λ_w and λ_r are artificially set parameters.
  • the loss here represents the loss function under a certain noise distribution,
  • N is the total amount of speech data,
  • f represents the processing of the speech data by the noise reducer (network) to be trained,
  • x + n represents the speech data formed by the real signal (the ontology speech) plus noise,
  • and the subscripts j and k of n are used to distinguish different noises.
  • this embodiment first designs the following consistency loss function (4), written here in the form implied by the description: L_n = ‖f_{θ1}(x + n_1) − (x + n_2)‖² + ‖f_{θ2}(x + n_2) − (x + n_1)‖²
  • the calculation is divided into several passes: the first term takes x + n_1 as the input when the network state (the network weights, i.e. the parameters of the denoiser) is θ_1,
  • and the second term takes x + n_2 as the input under the network weight θ_2;
  • that is, when computing this scheme, the input and the target exchange positions.
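The swapped-role computation described above can be sketched as follows; the function name, the mean reduction, and the unweighted sum of the two terms are assumptions, since the text does not preserve the exact normalization of equation (4):

```python
import numpy as np

def consistency_loss(f, theta1, theta2, x_n1, x_n2):
    """Symmetric consistency loss: the same noisy pair is evaluated twice,
    with input and target exchanged, under the two network states
    theta1 and theta2."""
    term1 = np.mean((f(x_n1, theta1) - x_n2) ** 2)  # x+n1 as input, x+n2 as target
    term2 = np.mean((f(x_n2, theta2) - x_n1) ** 2)  # roles exchanged
    return float(term1 + term2)
```

Here `f(data, theta)` stands for the denoiser applied under a given weight state; any callable with that shape can be plugged in.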
  • the weight loss function is set here as follows, where θ denotes the network weights.
  • there is also the following function, called the reconstruction loss function (6):
  • FIG. 4 shows the convergence speed of the trained model.
  • FIG. 4 plots the loss functions of the above three models over 100 training epochs. It can be seen that wavenet and this application converge faster, while the GAN-based SEGAN has not yet converged.
  • the wavy line between 20 and 40 epochs is the convergence curve of the GAN-based SEGAN; the upper curve segment corresponds to wavenet, and the lower curve segment corresponds to this application.
  • updating the first morphological voice data and the second morphological voice data includes: selecting two morphological voice data containing different noise data from the morphological voice database, where the selected pair differs from the pre-update first form of voice data and second form of voice data in at least one member; the two selected morphological voice data are then taken as the updated first form of voice data and second form of voice data, respectively.
  • the voice data input to the noise reducer again does not repeat the same pair of morphological voice data as before the update. That is, if on the first input
  • the first form of voice data is No. 1
  • and the second form of voice data is No. 3,
  • then the first form of voice data input the second time can be any of No. 1, No. 2, No. 3, and so on; but when the first form
  • of voice data is again No. 1, the second form of voice data cannot be No. 3 and must instead be some morphological voice data other than No. 3.
  • alternatively, the morphological voice data in the morphological voice database can be randomly combined in sequence to obtain voice data groups each containing two morphological voice data, removing any group whose two members are the same.
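The pair enumeration just described can be sketched over the <T, c> dictionary; the `previous_pair` bookkeeping (excluding the exact pair used in the prior step, per the "at least one difference" rule) is a hypothetical helper:

```python
import itertools

def candidate_pairs(dictionary, previous_pair=None):
    """Ordered (input, target) pairs of variants for each content number,
    excluding a variant paired with itself and, optionally, the exact pair
    used in the previous training step."""
    by_content = {}
    for (content_id, c) in dictionary:
        by_content.setdefault(content_id, []).append((content_id, c))
    pairs = []
    for keys in by_content.values():
        for a, b in itertools.permutations(keys, 2):  # a != b by construction
            if (a, b) != previous_pair:
                pairs.append((a, b))
    return pairs
```

A random choice from the returned list then yields the updated first and second forms of voice data.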
  • the obtained output is used as the updated first voice output, which is then compared with the new second form of voice data; the comparison result is the updated first voice difference.
  • at inference time, the merged data to be denoised is sent to the pre-built denoiser to obtain an output result; then, according to the original merging method, the original signal format is restored from this result, yielding the noise-reduced voice.
  • regarding the merging mentioned in this embodiment: for example, if the original input is voice in mp3 format,
  • it needs to be converted into an array of numbers, and the data obtained after the model finishes processing is likewise such an array.
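The inference round-trip (decode to an array, denoise, restore the original signal format) can be sketched as follows; the fixed frame length and zero padding are assumptions about how the array is "merged" for the network:

```python
import numpy as np

def denoise_recording(wave, denoiser, frame_len=256):
    """Split the decoded audio array into frames, run each frame through the
    trained denoiser, then reverse the merge to restore the original layout."""
    wave = np.asarray(wave, dtype=float)
    n = len(wave)
    padded = np.pad(wave, (0, (-n) % frame_len))   # pad to a whole number of frames
    frames = padded.reshape(-1, frame_len)
    out = np.vstack([denoiser(f) for f in frames])
    return out.reshape(-1)[:n]                     # back to the original signal format
```

Decoding from mp3 to the numeric array (and re-encoding afterwards) would be handled by an audio codec outside this sketch.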
  • the above-mentioned voice data information can also be stored in a node of a blockchain.
  • the voice data containing noise is input into the noise reducer to be trained for multiple rounds of noise reduction training; that is, the outputs for different voice data and the voice differences between different voice data are calculated, the network weights of the denoiser are adjusted according to the voice difference until one round of denoising training ends, and the next round of denoising training is then performed on the basis of the current training.
  • the denoising neural network has fast convergence speed and strong adaptability and robustness.
  • although the steps in the flowchart of FIG. 2 are shown in sequence following the arrows, these steps are not necessarily executed in that sequence. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in FIG. 2 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be executed at different times; their execution order also need not be sequential, and they may be performed in turn or alternately with other steps, or with the sub-steps or at least part of the stages of other steps.
  • a voice noise reducer training apparatus is provided, and the voice noise reducer training apparatus corresponds one-to-one with the voice noise reducer training method in the above-mentioned embodiment.
  • the speech noise reducer training device includes:
  • the building module 502 is used to build a morphological speech database, wherein the morphological speech database includes a plurality of morphological speech data obtained by combining the noise speech and the ontology speech according to the noise generation algorithm;
  • the training module 504 is configured to perform multiple rounds of noise reduction training on the noise reducer to be trained according to the morphological speech data and the voice loss function, to obtain a trained noise reducer, wherein each round of noise reduction training calculates the voice output of the morphological voice data according to the noise reducer to be trained, calculates the voice difference from the voice output and the voice loss function, and performs noise reduction training on the noise reducer according to that voice difference; if the voice difference in the current round of noise reduction training meets the preset difference value, the denoiser obtained from this round of denoising training is used as the trained denoiser.
  • the morphological voice data includes the first morphological voice data and the second morphological voice data
  • the voice output includes the first voice output
  • the training module 504 includes:
  • a first noise reduction sub-module for inputting the first form of speech data into the noise reducer to be trained to obtain the first voice output
  • the first comparison submodule is used to compare the voice difference between the first voice output and the second form of voice data to obtain the first voice difference
  • a first adjustment sub-module configured to adjust the network weight of the noise reducer to be trained according to the first speech difference to obtain the first state noise reducer
  • a data update submodule for updating the voice data of the first form and the voice data of the second form
  • a noise reduction update sub-module used for inputting the updated voice data of the first form into the noise reducer to be trained to obtain the updated first voice output
  • the comparison and update sub-module is used for comparing the voice difference between the updated first voice output and the updated second form of voice data to obtain the updated first voice difference.
  • the speech difference includes a second speech difference
  • the first state denoiser includes a first network weight
  • the training module 504 further includes:
  • an adjustment update submodule for adjusting the network weights of the first-state noise reducer according to the updated first voice difference to obtain the second-state noise reducer and the second network weight;
  • a difference update submodule for computing, with the voice loss function and based on the first network weight and the second network weight, the second voice difference between the outputs of the first-state noise reducer and the second-state noise reducer;
  • a training result submodule for taking the second-state noise reducer as the trained noise reducer when the second voice difference is smaller than the preset difference threshold.
  • the data update submodule includes:
  • a selection unit for selecting, from the morphological voice database, two pieces of morphological voice data containing different noise data, at least one of which differs from the pre-update first and second morphological voice data;
  • a first update unit for updating the two selected pieces of morphological voice data as the first morphological voice data and the second morphological voice data, respectively.
  • the data update submodule also includes:
  • a combination unit for randomly combining the morphological voice data in the morphological voice database into ordered groups, obtaining voice data groups each containing two pieces of morphological voice data;
  • a second update unit for removing any voice data group in which the order and the morphological voice data are all the same, obtaining multiple voice data groups in which at least one of the order or the morphological voice data differs, wherein a voice data group includes the first morphological voice data and the second morphological voice data.
  • the construction module 502 includes:
  • an acquisition submodule for acquiring a noise database, wherein the noise database includes multiple noise speeches from different environments
  • a generation submodule for generating the original speech
  • a construction submodule for combining the original speech with the noise data to obtain multiple pieces of morphological voice data containing different noise speech.
  • the above-mentioned voice data information can also be stored in a node of a blockchain.
  • the voice data containing noise are input into the noise reducer to be trained for multiple rounds of noise-reduction training, that is, the outputs for different voice data and the voice differences between different voice data are computed, the network weights of the noise reducer are adjusted according to the voice differences until one round of noise-reduction training ends, and the next round of noise-reduction training is then performed on the basis of the current training.
  • the resulting denoising neural network converges quickly and has strong adaptability and robustness.
  • in one embodiment a computer device is provided; the computer device may be a server, and its internal structure may be as shown in FIG. 6.
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus, where the processor of the computer device provides computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions and a database.
  • the internal memory provides an environment for the execution of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the computer device's database is used to store speech data.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions, when executed by a processor, implement a speech noise reducer training method.
  • the voice data containing noise are input into the noise reducer to be trained for multiple rounds of noise-reduction training, that is, the outputs for different voice data and the voice differences between different voice data are computed, the network weights of the noise reducer are adjusted according to the voice differences until one round of noise-reduction training ends, and the next round of noise-reduction training is then performed on the basis of the current training.
  • the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions; its hardware includes but is not limited to microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, etc.
  • a computer-readable storage medium on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the steps of the speech noise reducer training method in the above embodiment are implemented, for example steps 202 to 204 shown in FIG. 2, or the functions of the modules/units of the speech noise reducer training apparatus in the above embodiment are realized, for example the functions of modules 502 to 504 shown in FIG. 5.
  • the voice data containing noise are input into the noise reducer to be trained for multiple rounds of noise-reduction training, that is, the outputs for different voice data and the voice differences between different voice data are computed, the network weights of the noise reducer are adjusted according to the voice differences until one round of noise-reduction training ends, and the next round of noise-reduction training is then performed on the basis of the current training.
  • Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms.
  • Blockchain is essentially a decentralized database, a chain of data blocks generated in association using cryptographic methods; each data block contains a batch of network transaction information used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A speech noise reducer training method, belonging to the field of speech processing within artificial intelligence and applied in the smart-city domain, comprising: constructing a morphological voice database; performing multiple rounds of noise-reduction training on a noise reducer to be trained according to the morphological voice data and a voice loss function to obtain a trained noise reducer, wherein the noise-reduction training computes the voice output of the morphological voice data with the noise reducer to be trained, computes a voice difference from the voice output and the voice loss function, and trains the noise reducer to be trained according to the voice difference; if the voice difference in the current round of noise-reduction training satisfies a preset difference value, the noise reducer obtained in that round is taken as the trained noise reducer. Blockchain technology is also involved: the voice data are stored in a blockchain. This solves the prior-art problems of slow network convergence and poor adaptability and robustness.

Description

Speech noise reducer training method and apparatus, computer device, and storage medium
This application claims priority to Chinese patent application No. 202110218925.3, filed with the China National Intellectual Property Administration on February 26, 2021 and entitled "Speech noise reducer training method and apparatus, computer device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of speech processing, and in particular to a speech noise reducer training method and apparatus, a computer device, and a storage medium.
Background
As usage scenarios for mobile devices such as smartphones grow richer, downstream applications such as audio-to-text (ASR) systems are used ever more frequently across business scenarios. This in turn places ever higher demands on the quality of the audio fed to ASR. Speech noise reduction (enhancement) is key to ensuring audio quality.
In conventional technology, speech noise reduction is usually based on estimating the noise field and then applying a corresponding noise-reduction measure. Although such noise-field estimation can work well in particular settings, it often performs poorly elsewhere, owing to the limitations of the estimation approach. With the rise of deep learning, more robust approaches based on artificial intelligence have developed. In practice, the inventor realized that these approaches are typically implemented with schemes such as generative adversarial networks (GAN) or autoencoders (AE) and fall into two broad categories: generative encoder-decoder methods that produce a denoised speech segment, and structure-learning methods such as generative adversarial networks. Each has its pros and cons. The first usually has good convergence properties and converges clearly toward the denoising objective, but it requires corresponding noise-free data as the target, which is very hard to satisfy in practice; approximation methods are therefore needed, yet, owing to limitations of existing techniques, the simulated noise lacks diversity and the resulting models often lack high generalization ability. The second needs no noise-free data as the target, but such networks are often hard to train to convergence and to good results, and because the discriminator in a generative adversarial network is too loosely constrained, the process is hard to control in some respects, leading to poor denoising.
Summary
In view of the above technical problems, this application provides a speech noise reducer training method and apparatus, a computer device, and a storage medium, to solve the prior-art problems of slow network convergence and poor adaptability and robustness.
A speech noise reducer training method, the method comprising:
constructing a morphological voice database, wherein the morphological voice database includes multiple pieces of morphological voice data combining original speech with noise speech obtained by a noise generation algorithm;
computing the voice output of the morphological voice data with a noise reducer to be trained, and computing, with a preset voice loss function, the voice difference between the voice output and the morphological voice data in the morphological voice database;
performing noise-reduction training on the noise reducer to be trained according to the voice difference, and, when the voice difference satisfies a preset difference value, taking the noise reducer obtained in the current round of noise-reduction training as the trained noise reducer.
A speech noise reducer training apparatus, the apparatus comprising:
a construction module for constructing a morphological voice database, wherein the morphological voice database includes multiple pieces of morphological voice data combining original speech with noise speech obtained by a noise generation algorithm;
a computation module for computing the voice output of the morphological voice data with a noise reducer to be trained, and computing, with a preset voice loss function, the voice difference between the voice output and the morphological voice data in the morphological voice database;
a training module for performing noise-reduction training on the noise reducer to be trained according to the voice difference and, when the voice difference satisfies a preset difference value, taking the noise reducer obtained in the current round of noise-reduction training as the trained noise reducer.
A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, the processor implementing the steps of the above speech noise reducer training method when executing the computer-readable instructions.
A computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the steps of the above speech noise reducer training method.
In the above speech noise reducer training method, apparatus, computer device, and storage medium, voice data containing noise are input into the noise reducer to be trained for multiple rounds of noise-reduction training: the outputs for different voice data and the voice differences between different voice data are computed, and the network weights of the noise reducer are adjusted according to the voice differences until a round of noise-reduction training ends, after which the next round proceeds on the basis of the current training. This solves the prior-art problems of slow network convergence and poor adaptability and robustness.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of the application environment of the speech noise reducer training method;
FIG. 2 is a schematic flowchart of the speech noise reducer training method;
FIG. 3 is a schematic diagram of the structure of the noise reducer;
FIG. 4 is a schematic comparison of network convergence speed;
FIG. 5 is a schematic diagram of the speech noise reducer training apparatus;
FIG. 6 is a schematic diagram of a computer device in one embodiment.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used in the specification are for describing specific embodiments only and are not intended to limit the application. The terms "comprise" and "have" and any variants thereof in the specification, claims, and drawing descriptions are intended to cover non-exclusive inclusion. The terms "first", "second", and the like in the specification, claims, or drawings are used to distinguish different objects, not to describe a particular order.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of this application. Occurrences of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor are they independent or alternative embodiments mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
To make the objectives, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only for explaining this application and are not intended to limit it. All other embodiments obtained by those of ordinary skill in the art from the embodiments herein without creative effort fall within the protection scope of this application.
The speech noise reducer training method provided by the embodiments of this application can be applied in the application environment shown in FIG. 1, which may include a terminal 102, a network, and a server 104. The network provides a communication link medium between terminal 102 and server 104 and may include various connection types, such as wired or wireless communication links or fiber-optic cables.
A user may use terminal 102 to interact with server 104 over the network to receive or send messages. Various communication client applications may be installed on terminal 102, such as web browsers, shopping applications, search applications, instant messaging tools, email clients, and social platform software.
Terminal 102 may be any electronic device with a display screen that supports web browsing, including but not limited to smartphones, tablets, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptops, and desktop computers.
Server 104 may be a server providing various services, for example a backend server supporting the pages displayed on terminal 102.
It should be noted that the speech noise reducer training method provided by the embodiments of this application is generally executed by the server/terminal, and accordingly the speech noise reducer training apparatus is generally arranged in the server/terminal device.
This application can be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframes, and distributed computing environments including any of the above systems or devices. This application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. This application may also be practiced in distributed computing environments in which tasks are performed by remote processing devices connected through a communication network; in such environments, program modules may be located in local and remote computer storage media, including storage devices.
This application is applicable in the smart-city domain, especially smart banking, thereby promoting the construction of smart cities.
It should be understood that the numbers of terminals, networks, and servers in FIG. 1 are merely illustrative; any number of terminal devices, networks, and servers may be present as needed.
Terminal 102 communicates with server 104 over the network. Server 104 combines the noise speech and original speech obtained from terminal 102 to construct the morphological voice database, and then performs multiple rounds of noise-reduction training on the noise reducer to be trained according to the morphological voice data and the voice loss function to obtain a trained noise reducer, wherein the noise-reduction training computes the voice output of the morphological voice data with the noise reducer to be trained, computes the voice difference from the voice output and the voice loss function, and trains the noise reducer accordingly; if the voice difference in the current round of training satisfies the preset difference value, the noise reducer obtained in that round is taken as the trained noise reducer. Terminal 102 can then use the trained noise reducer for speech noise reduction. Terminal 102 and server 104 are connected through a network, which may be wired or wireless; terminal 102 may be, but is not limited to, a personal computer, laptop, smartphone, tablet, or portable wearable device, and server 104 may be implemented as a standalone server or a cluster of multiple servers.
In one embodiment, as shown in FIG. 2, a speech noise reducer training method is provided. Taking its application to the server of FIG. 1 as an example, it comprises the following steps:
Step 202: construct a morphological voice database, wherein the morphological voice database includes multiple pieces of morphological voice data combining original speech with noise speech obtained by a noise generation algorithm.
In some embodiments, the solution of this application is an intelligent speech noise-reduction technique based solely on noisy data: within a deep-learning framework, noise reduction is achieved using only noisy data. Conventional noise reduction is usually based on estimating the noise field and applying corresponding measures; such estimation can work well in certain particular settings but often performs poorly elsewhere, owing to its limitations. With the rise of deep learning, more robust AI-based methods have developed, typically using schemes such as generative adversarial networks (GAN) or autoencoders (AE), in two broad categories: generative encoder-decoder methods that produce a denoised speech segment, and structure-learning methods such as generative adversarial networks. Both have their pros and cons: the former usually converges well and clearly toward the denoising objective but needs corresponding noise-free data as the target, which is very hard to satisfy in practice, so approximation is often needed; the latter needs no noise-free target but is hard to train to convergence and to good results, and even when statistically good results are obtained, an anomaly can easily appear in which denoising is poor in some segments, mainly because the discriminator in a generative adversarial network is too loosely constrained and hence hard to control in some respects.
This application builds on the robust encoder-decoder approach but improves it, so as to obtain an encoder-decoder-based method that needs no noise-free speech as the target. The main idea starts from the statistical law of large numbers: for a speech segment containing noise, assume the noise takes many forms while the true signal is fixed; if speech containing the same fixed true signal but differently shaped noise is measured many times, the expectation of the measured data should approach the true signal, which in effect denoises the speech signal.
Further, the server constructs the morphological voice database to serve as training samples for the noise reducer: obtain a noise database, wherein the noise database includes multiple noise speeches from different environments; generate original speech, and combine the original speech with the noise data to obtain multiple pieces of morphological voice data containing different noise speech.
To construct the morphological voice database, first collect a large amount of speech data and, under manual supervision, select the parts considered to contain substantial noise; a selected part may be a complete speech recording or a segment of a complete recording. One way to obtain the data: the speech in the database can come from speech recorded in a standard quiet place (such as a recording studio), with noise fields then recorded in noisy places (such as on a street); synthesizing the two yields speech data containing noise. For example, ask someone to read a text or a dialogue in a relatively quiet place, such as a recording studio, then merge it with noise from various noisy settings to obtain morphological voice data containing different noise.
Optionally, speech of people talking or reading in different environments can also be collected as morphological voice data; for example, a conversation or read text in a market, the same conversation or text in a street environment, or the same content spoken by different people in different environments. Morphological voice data can also be obtained in other similar ways, as long as the information expressed by the true signal is consistent while the noise conditions differ.
Further, after the morphological voice database is constructed, the voice data inside are preprocessed:
The data input into the noise reducer to be trained are drawn from the data set of the morphological voice database constructed in the above steps. Before the voice data are input into the model, they can be normalized into the interval [-1, 1]; multiple pieces of voice data with the same content form a dictionary <T, c>, yielding the preprocessed morphological voice data.
Here T denotes the content number and c denotes the specific piece of morphological voice data; for example, <1, 3> denotes the third piece of morphological voice data for content number 1.
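The preprocessing just described — normalizing each recording into [-1, 1] and indexing same-content recordings by the dictionary keys <T, c> — can be sketched as follows. This is an illustrative sketch only: the function names and the peak-normalization choice are assumptions, not the patent's specification.

```python
# Illustrative sketch (not from the patent): normalize each recording into
# [-1, 1] and index it by (content id T, variant id c), as described above.

def normalize(samples):
    """Scale a sequence of raw samples into the interval [-1, 1]."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)
    return [s / peak for s in samples]

def build_morphological_db(recordings):
    """recordings: iterable of (content_id, samples) pairs.
    Returns a dict keyed by (T, c): the c-th noisy variant of content T."""
    db = {}
    counters = {}
    for content_id, samples in recordings:
        c = counters.get(content_id, 0) + 1
        counters[content_id] = c
        db[(content_id, c)] = normalize(samples)
    return db

db = build_morphological_db([(1, [0, 16384, -32768]), (1, [100, -200])])
```

With this layout, `db[(1, 3)]` would be the third normalized variant of content 1, matching the <1, 3> notation above.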
Step 204: compute the voice output of the morphological voice data with the noise reducer to be trained, and compute, with a preset voice loss function, the voice difference between the voice output and the morphological voice data in the morphological voice database.
In some embodiments, the technical solution of this application performs multiple rounds of noise-reduction training on the noise reducer to be trained according to the morphological voice data and the voice loss function to obtain a trained noise reducer: the voice output of the morphological voice data is computed with the noise reducer to be trained, the voice difference is computed from the voice output and the voice loss function, the noise reducer is trained accordingly, and if the voice difference in the current round of noise-reduction training satisfies the preset difference value, the noise reducer obtained in that round is taken as the trained noise reducer.
As shown in FIG. 3, the noise reducer used in this application adopts a typical encoder-decoder structure. Each encoder is a substructure consisting of a convolutional layer, a batch-normalization layer, and a ReLU activation function; how many convolution kernels a convolutional layer uses depends on the situation. The input information passes through the encoders from top to bottom; the output data of encoder 2 and encoder 3 are each copied once, and each copy is combined with the output of the next-level decoder to form the input of the previous-level decoder.
Training the noise reducer: the dictionary constructed in the above process is imported into the noise reducer to be trained and training proceeds as follows. For example, the dictionary for content number 1 contains five utterances; one utterance, say No. 1, is randomly drawn as the input, and another, say No. 3, is drawn as the training target.
Further, the morphological voice data include first morphological voice data and second morphological voice data; for example, No. 1 above is taken as the first morphological voice data and No. 3 as the second. Computing the voice output of the morphological voice data with the noise reducer to be trained and computing the voice difference from the voice output and the voice loss function includes:
inputting the first morphological voice data into the noise reducer to be trained to obtain the first voice output; comparing the first voice output with the second morphological voice data to obtain the first voice difference; adjusting the network weights of the noise reducer to be trained according to the first voice difference to obtain the first-state noise reducer; updating the first morphological voice data and the second morphological voice data; inputting the updated first morphological voice data into the noise reducer to be trained to obtain the updated first voice output; and comparing the updated first voice output with the updated second morphological voice data to obtain the updated first voice difference.
Inputting morphological voice data No. 1 as the first morphological voice data into the noise reducer to be trained yields its output, the first voice output, which is the result of the noise reducer to be trained denoising the first morphological voice data.
Specifically, the noise reducer to be trained processes the input data as follows. As shown in FIG. 3 above, the array representing the voice data (the first morphological voice data) is input into the noise reducer; encode1 encodes the data into an information tensor and passes it downward in turn, with each subsequent encoder encoding it further into a new information tensor. Note that from encode2 downward, a short (skip) connection copies the new tensor at that stage and passes the copy to the corresponding decoder, without the copy itself being passed further down, while the new tensor is simultaneously sent to the next encoder; this proceeds until encode4. After encode4 produces its new information tensor, it is merged directly with the tensor copied from encode3 and fed into decode4 for decoding, producing an information tensor; decoding then proceeds upward in turn until the first voice output is produced.
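The skip-connected data flow just described (copies of the encode2 and encode3 outputs merged into the decoding path) can be sketched structurally as follows. The blocks here are stand-in placeholders that merely record the path taken, not actual convolution + batch-norm + ReLU layers; the merge-by-concatenation is likewise only illustrative.

```python
# Structural sketch (assumed stand-ins, not the patent's actual layers):
# each enc_k / dec_k is a placeholder for a conv + batch-norm + ReLU block
# that appends its tag to a trace list instead of transforming a tensor.

def make_block(tag):
    # A stand-in "layer" that records which blocks touched the data.
    return lambda x: x + [tag]

enc1, enc2, enc3, enc4 = (make_block(t) for t in ("e1", "e2", "e3", "e4"))
dec4, dec3, dec2, dec1 = (make_block(t) for t in ("d4", "d3", "d2", "d1"))

def denoiser_forward(x):
    h1 = enc1(x)
    h2 = enc2(h1)          # a copy of h2 is kept for a skip connection
    h3 = enc3(h2)          # a copy of h3 is kept for a skip connection
    h4 = enc4(h3)
    u4 = dec4(h4 + h3)     # merge the enc4 output with the enc3 copy
    u3 = dec3(u4 + h2)     # merge with the enc2 copy
    u2 = dec2(u3)
    return dec1(u2)        # topmost decoder emits the first voice output

trace = denoiser_forward([])
```

Running the forward pass on an empty trace shows each skip-connected encoder contributing twice: once on the downward path and once through its copy.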
The first voice output then needs to be compared with morphological voice data No. 3, i.e., the second morphological voice data, to obtain the first voice difference.
The first voice difference is computed by subtracting the second morphological voice data from the first voice output, giving the voice difference between the output voice data and the voice data that have not been denoised.
The reason for computing the difference between the two is that neural networks are extremely good at capturing what the input and the target have in common. The design idea of this application's technical solution is to perform repeated "measurements" of voice data with the same meaning; the training process of the neural network is analogous to such measurement. In this embodiment, through repeated measurement the neural network approaches a state in which, for any speech with the same meaning, whatever the noise form, very similar results are obtained. In this context, those results are the common part, and the common part is the signal with the true meaning; in this way a real utterance is denoised without affecting its true meaning. Reflected in model training, the goal is that speech inputs with the same meaning produce very similar outputs: the first voice input yields an estimate, called the first speech estimate, and in this embodiment this estimate should be very close to the second speech.
Further, the first voice difference can also be obtained by squaring the resulting difference.
Then, the network weights of the noise reducer to be trained are adjusted according to the first voice difference to obtain the first-state noise reducer.
Specifically, the network weights are adjusted through the backpropagation mechanism of the neural network; this application specifically uses the Adam optimization method.
Step 206: perform noise-reduction training on the noise reducer to be trained according to the voice difference; when the voice difference satisfies the preset difference value, take the noise reducer obtained in the current round of noise-reduction training as the trained noise reducer.
Further, the voice difference also includes a second voice difference, and the first-state noise reducer includes a first network weight. Performing noise-reduction training on the noise reducer to be trained according to the voice difference includes:
adjusting the network weights of the first-state noise reducer according to the updated first voice difference to obtain the second-state noise reducer and the second network weight; computing, with the voice loss function and based on the first network weight and the second network weight, the second voice difference between the outputs of the first-state noise reducer and the second-state noise reducer; and, when the second voice difference is smaller than a preset difference threshold, taking the second-state noise reducer as the trained noise reducer.
Specifically, the first-state noise reducer can be updated according to the updated first voice difference to obtain the second-state noise reducer.
Further, when the network weights of the noise reducer to be trained are adjusted according to the first voice difference to obtain the first-state noise reducer, the first network weight is obtained as well; the first network weight is a parameter of the first-state noise reducer.
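The round-based procedure above — adjust the weights, compute the inter-round "second voice difference", and stop once it falls below the preset threshold — can be sketched as follows. The quadratic toy objective, step size, and threshold are illustrative assumptions only; they stand in for the Adam-based weight updates and the loss-based difference described in the text.

```python
# Hedged sketch of round-based training with a preset difference threshold:
# each round adjusts the weight, then the change between successive weights
# (standing in for the "second voice difference") decides whether to stop.

def train_denoiser(weight, step=0.4, threshold=1e-3, max_rounds=100):
    for _ in range(max_rounds):
        gradient = 2 * weight                 # stand-in for backpropagation
        new_weight = weight - step * gradient # stand-in for an Adam update
        second_difference = abs(new_weight - weight)
        weight = new_weight
        if second_difference < threshold:     # preset difference threshold met
            return weight                     # "trained" noise reducer
    return weight

final = train_denoiser(weight=5.0)
```

Each round here corresponds to one weight adjustment; training terminates only when two successive states differ by less than the threshold, mirroring the comparison of the first-state and second-state noise reducers.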
The second network weight is a parameter of the second-state noise reducer. In this embodiment, the noise reducer to be trained can be trained with the voice loss function (1):

L = L_n + β_w · L_w + β_r · L_r        (1)

where L is the total loss, L_n is the consistency loss function, L_w is the weight loss function, L_r is the reconstruction loss function, and the β are manually set parameters.
Specifically, when training the noise reducer, the dictionary constructed in the above process is imported into the noise reducer and training proceeds as follows: for example, from the dictionary of content number 1 containing five utterances, one utterance, say No. 1, is randomly drawn as the input, and another, say No. 3, is drawn as the training target.
If training were performed directly at this point, convergence would often be difficult; to ensure convergence, new loss functions are designed here, derived as follows.
For this task, the training process can be abstracted as formula (2):

argmin_θ (1/N) Σ_i || f_θ(x_i + n_{j,i}) − (x_i + n_{k,i}) ||²        (2)

where θ relates to the loss under a certain noise distribution, N is the total amount of voice data, f denotes the processing applied to the voice data by the above noise reducer (network) to be trained, x + n denotes the voice data formed by the true signal (original speech) plus noise, and the subscripts j and k of n distinguish different noises. Formula (2) can be transformed by expanding the squared objective, which leads to formula (3):
(1/N) Σ_i || y_i − (x_i + n_{k,i}) ||² = (1/N) Σ_i || y_i − x_i ||² − (2/N) Σ_i n_{k,i}ᵀ (y_i − x_i) + (1/N) Σ_i || n_{k,i} ||²        (3)

where y denotes the computation inside f in formula (2). As can be seen from formula (3), the first term on the right of the equals sign is in fact normal training, the case in which a noise-free target exists. It follows that, for our training to achieve the effect of normal training, the expectation of the last two terms must sum to zero in the statistical sense; this observation is the key point of our loss-function design.
To this end, this embodiment first designs the following consistency loss function (4):

L_n = || f_{θ1}(x + n_1) − (x + n_2) ||² + || f_{θ2}(x + n_2) − (x + n_1) ||²        (4)

It should be noted that this consistency loss is computed in several passes: the first term is computed with input x + n_1 under network state (network weights, i.e., the noise reducer's parameters) θ_1, and the second term with input x + n_2 under network weights θ_2; that is, the input and target are swapped between the successive passes.
Meanwhile, a second loss function, the weight loss function (5), is formulated:

L_w = || θ_1 − θ_2 ||²        (5)

Since voice inputs with the same content but different noise should drive the network parameters to be very close, the weight loss function is set here; θ here denotes the network weights. In addition there is the following, called the reconstruction loss function (6):

L_r = (1/N) Σ_i || y_i − ȳ ||²        (6)

where ȳ denotes the average of the outputs of the voice data used in training.
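A minimal sketch of how the three losses might combine into the total loss (1) is given below. The concrete formulas are reconstructions from the descriptions of (4), (5), and (6) above — swapped-pair consistency, weight proximity, and deviation from the mean output — not the patent's verbatim equations, and all function names and β values are assumptions.

```python
# Hedged sketch of the combined loss L = L_n + beta_w*L_w + beta_r*L_r.
# The exact loss forms are reconstructions from the surrounding description.

def mse(a, b):
    # Mean squared difference between two equal-length sequences.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def consistency_loss(out1, target1, out2, target2):
    # (4): two passes with input and target swapped between them
    return mse(out1, target1) + mse(out2, target2)

def weight_loss(theta1, theta2):
    # (5): same content, different noise -> weights should stay close
    return mse(theta1, theta2)

def reconstruction_loss(outputs):
    # (6): penalize deviation of each output from the mean output y_bar
    mean = [sum(col) / len(col) for col in zip(*outputs)]
    return sum(mse(o, mean) for o in outputs) / len(outputs)

def total_loss(ln, lw, lr, beta_w=0.1, beta_r=0.1):
    # (1): total loss with manually set beta parameters (values assumed)
    return ln + beta_w * lw + beta_r * lr
```

In use, the two passes feeding `consistency_loss` would be the swapped input/target evaluations described above, and `outputs` would collect the denoiser outputs for same-content variants.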
The first network weight and the second network weight are input into the voice loss function (1) to obtain the loss value of the first-state and second-state noise reducers, which serves as the second voice difference between their outputs. It is then judged whether this second voice difference is smaller than the preset difference threshold: if so, the second-state noise reducer is taken as the trained noise reducer; if not, the network weights of the second-state noise reducer are adjusted according to the second voice difference to obtain a third-state noise reducer, and another round of training is performed with the morphological voice data updated as above, until the resulting second voice difference meets the requirement.
The technical solution of this application can greatly improve the denoising effect of the noise reducer. Further, to demonstrate the effect of this application, see Table 1:
Table 1 (table image not reproduced): SNR, SIG, and BAK scores of the noise reducer trained with this application's solution compared with SEGAN and WaveNet on synthesized noisy speech.
Using synthesized speech data containing noise, the noise reducer trained with this application's technical solution was used to denoise the speech and was compared with common denoising approaches (SEGAN, WaveNet); the denoised data were tested for SNR, SIG, and BAK (these are industry-recognized, commonly used standard metrics whose specific meanings are not repeated here). In Table 1, larger values are better.
As for the convergence speed of the trained models, see the network convergence comparison in FIG. 4, which shows the loss curves of the above three models over 100 training epochs: WaveNet and this application converge quickly, while the GAN-based SEGAN has not yet converged. In FIG. 4, the wavy line is the convergence curve of the GAN-based SEGAN; between epochs 20 and 40, the upper curve segment corresponds to WaveNet and the lower one to this application.
Further, updating the first morphological voice data and the second morphological voice data includes: selecting, from the morphological voice database, any two pieces of morphological voice data containing different noise data, at least one of which differs from the pre-update first and second morphological voice data; and updating the two pieces of morphological voice data as the first morphological voice data and the second morphological voice data, respectively.
The update must ensure that the voice data input into the noise reducer again are not the same pair of morphological voice data as before the update: if the first input used No. 1 as the first morphological voice data and No. 3 as the second, then the first morphological voice data of the second input may be No. 1, 2, or 3, but when the first morphological voice data is No. 1, the second morphological voice data may not be No. 3, though it may be any morphological voice data other than No. 3.
Optionally, the morphological voice data in the database can also be randomly combined in advance into ordered groups, obtaining voice data groups each containing two pieces of morphological voice data (a first and a second morphological voice data), and removing any group in which the order and the morphological voice data are all the same; each input to the noise reducer is then set to a different voice data group, thereby achieving the purpose of updating the first and second morphological voice data.
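The ordered-pairing scheme just described can be sketched as follows. The function name is illustrative, and the sketch assumes that "removing groups with the same order and morphological voice data" means dropping the pairs whose two members are identical, so that input and target always carry different noise.

```python
# Illustrative sketch of the pairing scheme: form ordered pairs of variants
# of the same content and drop pairs whose two members are identical.

from itertools import product

def make_pairs(variant_ids):
    """Ordered (first, second) pairs with the identical pairs removed."""
    return [(a, b) for a, b in product(variant_ids, repeat=2) if a != b]

pairs = make_pairs([1, 2, 3])
```

Note that order matters here: (1, 2) and (2, 1) are both kept, since swapping which variant serves as input and which as target gives a distinct training pair.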
After the new first morphological voice data are input into the noise reducer, the resulting output serves as the update of the first voice output and is then compared with the new second morphological voice data; the comparison result serves as the update of the first voice difference.
Further, after the trained noise reducer is obtained, the normalized data to be denoised are fed into the pre-built noise reducer to obtain an output result, which is then restored to the original signal format by the inverse of the original normalization, at which point the denoised speech is obtained.
Specifically, regarding the normalization mentioned in this embodiment, by way of example: the original input is speech in mp3 format; for the numerical processing above it must be converted into an array of numbers, and what the model produces after processing is likewise an array of numbers. Following the inverse of the earlier process, the array is converted back to mp3 format, and that mp3 segment is the denoised speech.
It should be emphasized that, to further ensure the privacy and security of the above user information, the above voice data information may also be stored in a node of a blockchain.
In the above speech noise reducer training method, voice data containing noise are input into the noise reducer to be trained for multiple rounds of noise-reduction training: the outputs for different voice data and the voice differences between different voice data are computed, and the network weights of the noise reducer are adjusted according to the voice differences until a round of noise-reduction training ends, after which the next round proceeds on the basis of the current training. In this way, a good denoising neural network can be trained from noisy data alone; it converges quickly and has strong adaptability and robustness.
It should be understood that, although the steps in the flowchart of FIG. 2 are displayed in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, their execution has no strict order restriction and they may be executed in other orders. Moreover, at least some of the steps in FIG. 2 may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments; their execution order is not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 5, a speech noise reducer training apparatus is provided, corresponding one-to-one to the speech noise reducer training method of the above embodiment. The apparatus includes:
a construction module 502 for constructing a morphological voice database, wherein the morphological voice database includes multiple pieces of morphological voice data combining original speech with noise speech obtained by a noise generation algorithm;
a training module 504 for performing multiple rounds of noise-reduction training on the noise reducer to be trained according to the morphological voice data and the voice loss function to obtain a trained noise reducer, wherein the noise-reduction training computes the voice output of the morphological voice data with the noise reducer to be trained, computes the voice difference from the voice output and the voice loss function, trains the noise reducer accordingly, and, if the voice difference in the current round of noise-reduction training satisfies the preset difference value, takes the noise reducer obtained in that round as the trained noise reducer.
Further, the morphological voice data include first morphological voice data and second morphological voice data, the voice output includes the first voice output, and the training module 504 includes:
a first noise-reduction submodule for inputting the first morphological voice data into the noise reducer to be trained to obtain the first voice output;
a first comparison submodule for comparing the first voice output with the second morphological voice data to obtain the first voice difference;
a first adjustment submodule for adjusting the network weights of the noise reducer to be trained according to the first voice difference to obtain the first-state noise reducer;
a data update submodule for updating the first morphological voice data and the second morphological voice data;
a noise-reduction update submodule for inputting the updated first morphological voice data into the noise reducer to be trained to obtain the updated first voice output;
a comparison update submodule for comparing the updated first voice output with the updated second morphological voice data to obtain the updated first voice difference.
Further, the voice difference includes the second voice difference, the first-state noise reducer includes the first network weight, and the training module 504 further includes:
an adjustment update submodule for adjusting the network weights of the first-state noise reducer according to the updated first voice difference to obtain the second-state noise reducer and the second network weight;
a difference update submodule for computing, with the voice loss function and based on the first network weight and the second network weight, the second voice difference between the outputs of the first-state noise reducer and the second-state noise reducer;
a training result submodule for taking the second-state noise reducer as the trained noise reducer when the second voice difference is smaller than the preset difference threshold.
Further, the data update submodule includes:
a selection unit for selecting, from the morphological voice database, any two pieces of morphological voice data containing different noise data, at least one of which differs from the pre-update first and second morphological voice data;
a first update unit for updating the two pieces of morphological voice data as the first morphological voice data and the second morphological voice data, respectively.
Further, the data update submodule also includes:
a combination unit for randomly combining the morphological voice data in the morphological voice database into ordered groups, obtaining voice data groups each containing two pieces of morphological voice data;
a second update unit for removing any voice data group in which the order and the morphological voice data are all the same, obtaining multiple voice data groups in which at least one of the order or the morphological voice data differs, wherein a voice data group includes first morphological voice data and second morphological voice data.
Further, the construction module 502 includes:
an acquisition submodule for acquiring a noise database, wherein the noise database includes multiple noise speeches from different environments;
a generation submodule for generating the original speech;
a construction submodule for combining the original speech with the noise data to obtain multiple pieces of morphological voice data containing different noise speech.
It should be emphasized that, to further ensure the privacy and security of the above user information, the above voice data information may also be stored in a node of a blockchain.
In the above speech noise reducer training apparatus, voice data containing noise are input into the noise reducer to be trained for multiple rounds of noise-reduction training: the outputs for different voice data and the voice differences between different voice data are computed, and the network weights of the noise reducer are adjusted according to the voice differences until a round of noise-reduction training ends, after which the next round proceeds on the basis of the current training. In this way, a good denoising neural network can be trained from noisy data alone; it converges quickly and has strong adaptability and robustness.
In one embodiment, a computer device is provided; the computer device may be a server, and its internal structure may be as shown in FIG. 6. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system, computer-readable instructions, and a database, and the internal memory provides an environment for running the operating system and computer-readable instructions in the non-volatile storage medium. The database of the computer device is used to store voice data. The network interface of the computer device communicates with external terminals over a network connection. The computer-readable instructions, when executed by the processor, implement a speech noise reducer training method.
In this embodiment, voice data containing noise are input into the noise reducer to be trained for multiple rounds of noise-reduction training: the outputs for different voice data and the voice differences between different voice data are computed, and the network weights of the noise reducer are adjusted according to the voice differences until a round of noise-reduction training ends, after which the next round proceeds on the basis of the current training. This solves the prior-art problems of slow network convergence and poor adaptability and robustness.
Those skilled in the art will understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions; its hardware includes but is not limited to microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, etc.
In one embodiment, a computer-readable storage medium is provided, on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the steps of the speech noise reducer training method of the above embodiment are implemented, for example steps 202 to 204 shown in FIG. 2, or the functions of the modules/units of the speech noise reducer training apparatus of the above embodiment are realized, for example the functions of modules 502 to 504 shown in FIG. 5.
In this embodiment, voice data containing noise are input into the noise reducer to be trained for multiple rounds of noise-reduction training: the outputs for different voice data and the voice differences between different voice data are computed, and the network weights of the noise reducer are adjusted according to the voice differences until a round of noise-reduction training ends, after which the next round proceeds on the basis of the current training. This solves the prior-art problems of slow network convergence and poor adaptability and robustness.
Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be completed by instructing relevant hardware through computer-readable instructions, which may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided by this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association using cryptographic methods; each data block contains a batch of network transaction information used to verify the validity of its information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, and an application service layer.
Those skilled in the art will clearly understand that, for convenience and brevity of description, only the division of the above functional units and modules is illustrated; in practical applications, the above functions may be assigned to different functional units and modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above.
The technical features of the above embodiments can be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as these combinations contain no contradiction, they should all be considered within the scope of this specification.
The above embodiments express only several implementations of this application; their description is relatively specific and detailed, but this should not be construed as limiting the scope of the invention patent. It should be pointed out that those of ordinary skill in the art can make several variations, improvements, or equivalent substitutions of some technical features without departing from the concept of this application, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application; they all fall within the protection scope of this application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (20)

  1. A speech noise reducer training method, wherein the method comprises:
    constructing a morphological voice database, wherein the morphological voice database includes multiple pieces of morphological voice data combining original speech with noise speech obtained by a noise generation algorithm;
    computing a voice output of the morphological voice data with a noise reducer to be trained, and computing, with a preset voice loss function, a voice difference between the voice output and the morphological voice data in the morphological voice database;
    performing noise-reduction training on the noise reducer to be trained according to the voice difference, and, when the voice difference satisfies a preset difference value, taking the noise reducer obtained in the current round of noise-reduction training as the trained noise reducer.
  2. The method according to claim 1, wherein the morphological voice data comprise first morphological voice data and second morphological voice data, and computing the voice output of the morphological voice data with the noise reducer to be trained and computing, with the preset voice loss function, the voice difference between the voice output and the morphological voice data in the morphological voice database comprises:
    inputting the first morphological voice data into the noise reducer to be trained to obtain a first voice output;
    comparing the first voice output with the second morphological voice data to obtain a first voice difference;
    adjusting network weights of the noise reducer to be trained according to the first voice difference to obtain a first-state noise reducer;
    updating the first morphological voice data and the second morphological voice data;
    inputting the updated first morphological voice data into the noise reducer to be trained to obtain an updated first voice output;
    comparing the updated first voice output with the updated second morphological voice data to obtain an updated first voice difference.
  3. The method according to claim 2, wherein comparing the first voice output with the second morphological voice data to obtain the first voice difference comprises:
    subtracting the second morphological voice data from the first voice output to obtain the first voice difference between the first voice output and the second morphological voice data.
  4. The method according to claim 3, wherein the voice difference includes a second voice difference, the first-state noise reducer includes a first network weight, and performing noise-reduction training on the noise reducer to be trained according to the voice difference comprises:
    adjusting the network weights of the first-state noise reducer according to the updated first voice difference to obtain a second-state noise reducer and a second network weight;
    computing, with the voice loss function and based on the first network weight and the second network weight, the second voice difference between the outputs of the first-state noise reducer and the second-state noise reducer;
    when the second voice difference is smaller than a preset difference threshold, taking the second-state noise reducer as the trained noise reducer.
  5. The method according to claim 4, wherein updating the first morphological voice data and the second morphological voice data comprises:
    selecting, from the morphological voice database, any two pieces of morphological voice data containing different noise data, wherein at least one of the selected pieces differs from the pre-update first morphological voice data and second morphological voice data;
    updating the two pieces of morphological voice data as the first morphological voice data and the second morphological voice data, respectively.
  6. The method according to claim 4, wherein updating the first morphological voice data and the second morphological voice data comprises:
    randomly combining the morphological voice data in the morphological voice database into ordered groups to obtain voice data groups each containing two pieces of morphological voice data;
    removing any voice data group in which the order and the morphological voice data are all the same, to obtain multiple voice data groups in which at least one of the order or the morphological voice data differs, wherein a voice data group includes first morphological voice data and second morphological voice data.
  7. The method according to claim 1, wherein constructing the morphological voice database comprises:
    acquiring a noise database, wherein the noise database includes multiple noise speeches from different environments;
    generating original speech;
    combining the original speech with the noise data to obtain multiple pieces of morphological voice data containing different noise speech.
  8. A speech noise reducer training apparatus, comprising:
    a construction module for constructing a morphological voice database, wherein the morphological voice database includes multiple pieces of morphological voice data combining original speech with noise speech obtained by a noise generation algorithm;
    a computation module for computing a voice output of the morphological voice data with a noise reducer to be trained, and computing, with a preset voice loss function, a voice difference between the voice output and the morphological voice data in the morphological voice database;
    a training module for performing noise-reduction training on the noise reducer to be trained according to the voice difference and, when the voice difference satisfies a preset difference value, taking the noise reducer obtained in the current round of noise-reduction training as the trained noise reducer.
  9. A computer device comprising a memory and a processor, the memory storing computer-readable instructions, wherein the processor implements the following steps when executing the computer-readable instructions:
    constructing a morphological voice database, wherein the morphological voice database includes multiple pieces of morphological voice data combining original speech with noise speech obtained by a noise generation algorithm;
    computing a voice output of the morphological voice data with a noise reducer to be trained, and computing, with a preset voice loss function, a voice difference between the voice output and the morphological voice data in the morphological voice database;
    performing noise-reduction training on the noise reducer to be trained according to the voice difference, and, when the voice difference satisfies a preset difference value, taking the noise reducer obtained in the current round of noise-reduction training as the trained noise reducer.
  10. The computer device according to claim 9, wherein the morphological voice data comprise first morphological voice data and second morphological voice data, and computing the voice output of the morphological voice data with the noise reducer to be trained and computing, with the preset voice loss function, the voice difference between the voice output and the morphological voice data in the morphological voice database comprises:
    inputting the first morphological voice data into the noise reducer to be trained to obtain a first voice output;
    comparing the first voice output with the second morphological voice data to obtain a first voice difference;
    adjusting network weights of the noise reducer to be trained according to the first voice difference to obtain a first-state noise reducer;
    updating the first morphological voice data and the second morphological voice data;
    inputting the updated first morphological voice data into the noise reducer to be trained to obtain an updated first voice output;
    comparing the updated first voice output with the updated second morphological voice data to obtain an updated first voice difference.
  11. The computer device according to claim 10, wherein comparing the first voice output with the second morphological voice data to obtain the first voice difference comprises:
    subtracting the second morphological voice data from the first voice output to obtain the first voice difference between the first voice output and the second morphological voice data.
  12. The computer device according to claim 11, wherein the voice difference includes a second voice difference, the first-state noise reducer includes a first network weight, and performing noise-reduction training on the noise reducer to be trained according to the voice difference comprises:
    adjusting the network weights of the first-state noise reducer according to the updated first voice difference to obtain a second-state noise reducer and a second network weight;
    computing, with the voice loss function and based on the first network weight and the second network weight, the second voice difference between the outputs of the first-state noise reducer and the second-state noise reducer;
    when the second voice difference is smaller than a preset difference threshold, taking the second-state noise reducer as the trained noise reducer.
  13. The computer device according to claim 12, wherein updating the first morphological voice data and the second morphological voice data comprises:
    selecting, from the morphological voice database, any two pieces of morphological voice data containing different noise data, wherein at least one of the selected pieces differs from the pre-update first morphological voice data and second morphological voice data;
    updating the two pieces of morphological voice data as the first morphological voice data and the second morphological voice data, respectively.
  14. The computer device according to claim 11, wherein updating the first morphological voice data and the second morphological voice data comprises:
    randomly combining the morphological voice data in the morphological voice database into ordered groups to obtain voice data groups each containing two pieces of morphological voice data;
    removing any voice data group in which the order and the morphological voice data are all the same, to obtain multiple voice data groups in which at least one of the order or the morphological voice data differs, wherein a voice data group includes first morphological voice data and second morphological voice data.
  15. The computer device according to claim 9, wherein constructing the morphological voice database comprises:
    acquiring a noise database, wherein the noise database includes multiple noise speeches from different environments;
    generating original speech;
    combining the original speech with the noise data to obtain multiple pieces of morphological voice data containing different noise speech.
  16. A computer-readable storage medium on which computer-readable instructions are stored, wherein the computer-readable instructions, when executed by a processor, implement the following steps:
    constructing a morphological voice database, wherein the morphological voice database includes multiple pieces of morphological voice data combining original speech with noise speech obtained by a noise generation algorithm;
    computing a voice output of the morphological voice data with a noise reducer to be trained, and computing, with a preset voice loss function, a voice difference between the voice output and the morphological voice data in the morphological voice database;
    performing noise-reduction training on the noise reducer to be trained according to the voice difference, and, when the voice difference satisfies a preset difference value, taking the noise reducer obtained in the current round of noise-reduction training as the trained noise reducer.
  17. The computer-readable storage medium according to claim 16, wherein the morphological voice data comprise first morphological voice data and second morphological voice data, and computing the voice output of the morphological voice data with the noise reducer to be trained and computing, with the preset voice loss function, the voice difference between the voice output and the morphological voice data in the morphological voice database comprises:
    inputting the first morphological voice data into the noise reducer to be trained to obtain a first voice output;
    comparing the first voice output with the second morphological voice data to obtain a first voice difference;
    adjusting network weights of the noise reducer to be trained according to the first voice difference to obtain a first-state noise reducer;
    updating the first morphological voice data and the second morphological voice data;
    inputting the updated first morphological voice data into the noise reducer to be trained to obtain an updated first voice output;
    comparing the updated first voice output with the updated second morphological voice data to obtain an updated first voice difference.
  18. The computer-readable storage medium according to claim 17, wherein comparing the first voice output with the second morphological voice data to obtain the first voice difference comprises:
    subtracting the second morphological voice data from the first voice output to obtain the first voice difference between the first voice output and the second morphological voice data.
  19. The computer-readable storage medium according to claim 18, wherein the voice difference includes a second voice difference, the first-state noise reducer includes a first network weight, and performing noise-reduction training on the noise reducer to be trained according to the voice difference comprises:
    adjusting the network weights of the first-state noise reducer according to the updated first voice difference to obtain a second-state noise reducer and a second network weight;
    computing, with the voice loss function and based on the first network weight and the second network weight, the second voice difference between the outputs of the first-state noise reducer and the second-state noise reducer;
    when the second voice difference is smaller than a preset difference threshold, taking the second-state noise reducer as the trained noise reducer.
  20. The computer-readable storage medium according to claim 19, wherein updating the first morphological voice data and the second morphological voice data comprises:
    selecting, from the morphological voice database, any two pieces of morphological voice data containing different noise data, wherein at least one of the selected pieces differs from the pre-update first morphological voice data and second morphological voice data;
    updating the two pieces of morphological voice data as the first morphological voice data and the second morphological voice data, respectively.
PCT/CN2021/090177 2021-02-26 2021-04-27 语音降噪器训练方法、装置、计算机设备和存储介质 WO2022178970A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110218925.3A CN112992168B (zh) 2021-02-26 2021-02-26 语音降噪器训练方法、装置、计算机设备和存储介质
CN202110218925.3 2021-02-26

Publications (1)

Publication Number Publication Date
WO2022178970A1 true WO2022178970A1 (zh) 2022-09-01

Family

ID=76351159

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/090177 WO2022178970A1 (zh) 2021-02-26 2021-04-27 语音降噪器训练方法、装置、计算机设备和存储介质

Country Status (2)

Country Link
CN (1) CN112992168B (zh)
WO (1) WO2022178970A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108922560A (zh) * 2018-05-02 2018-11-30 杭州电子科技大学 一种基于混合深度神经网络模型的城市噪声识别方法
CN109841226A (zh) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 一种基于卷积递归神经网络的单通道实时降噪方法
CN110503968A (zh) * 2018-05-18 2019-11-26 北京搜狗科技发展有限公司 一种音频处理方法、装置、设备及可读存储介质
US20210012767A1 (en) * 2020-09-25 2021-01-14 Intel Corporation Real-time dynamic noise reduction using convolutional networks
CN112309426A (zh) * 2020-11-24 2021-02-02 北京达佳互联信息技术有限公司 语音处理模型训练方法及装置和语音处理方法及装置

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147810B (zh) * 2018-09-30 2019-11-26 百度在线网络技术(北京)有限公司 建立语音增强网络的方法、装置、设备和计算机存储介质
CN109637525B (zh) * 2019-01-25 2020-06-09 百度在线网络技术(北京)有限公司 用于生成车载声学模型的方法和装置
CN110491404B (zh) * 2019-08-15 2020-12-22 广州华多网络科技有限公司 语音处理方法、装置、终端设备及存储介质
CN110600017B (zh) * 2019-09-12 2022-03-04 腾讯科技(深圳)有限公司 语音处理模型的训练方法、语音识别方法、系统及装置
CN110808058B (zh) * 2019-11-11 2022-06-21 广州国音智能科技有限公司 语音增强方法、装置、设备及可读存储介质
CN111583951A (zh) * 2020-04-29 2020-08-25 华中科技大学 一种基于深度特征损失的语音降噪方法及系统
CN111696532B (zh) * 2020-06-17 2023-08-18 北京达佳互联信息技术有限公司 语音识别方法、装置、电子设备以及存储介质
CN111863003B (zh) * 2020-07-24 2022-04-15 思必驰科技股份有限公司 语音数据增强方法和装置
CN112397057A (zh) * 2020-12-01 2021-02-23 平安科技(深圳)有限公司 基于生成对抗网络的语音处理方法、装置、设备及介质
CN112365885B (zh) * 2021-01-18 2021-05-07 深圳市友杰智新科技有限公司 唤醒模型的训练方法、装置和计算机设备

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108922560A (zh) * 2018-05-02 2018-11-30 杭州电子科技大学 一种基于混合深度神经网络模型的城市噪声识别方法
CN110503968A (zh) * 2018-05-18 2019-11-26 北京搜狗科技发展有限公司 一种音频处理方法、装置、设备及可读存储介质
CN109841226A (zh) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 一种基于卷积递归神经网络的单通道实时降噪方法
US20210012767A1 (en) * 2020-09-25 2021-01-14 Intel Corporation Real-time dynamic noise reduction using convolutional networks
CN112309426A (zh) * 2020-11-24 2021-02-02 北京达佳互联信息技术有限公司 语音处理模型训练方法及装置和语音处理方法及装置

Also Published As

Publication number Publication date
CN112992168B (zh) 2024-04-19
CN112992168A (zh) 2021-06-18

Similar Documents

Publication Publication Date Title
WO2021103698A1 (zh) 换脸方法、装置、电子设备及存储介质
US20200090682A1 (en) Voice activity detection method, method for establishing voice activity detection model, computer device, and storage medium
US20200106708A1 (en) Load Balancing Multimedia Conferencing System, Device, and Methods
US20210020160A1 (en) Sample-efficient adaptive text-to-speech
WO2019233364A1 (zh) 基于深度学习的音频音质增强
WO2023221674A1 (zh) 音频编解码方法及相关产品
WO2022105169A1 (zh) 一种欺诈行为识别方法、装置、计算机设备及存储介质
WO2022141868A1 (zh) 一种提取语音特征的方法、装置、终端及存储介质
WO2023226839A1 (zh) 音频增强方法、装置、电子设备及可读存储介质
CN112466314A (zh) 情感语音数据转换方法、装置、计算机设备及存储介质
CN114863229A (zh) 图像分类方法和图像分类模型的训练方法、装置
CN111696520A (zh) 智能配音方法、装置、介质及电子设备
CN110120228A (zh) 基于声谱图及深度残差网络的音频通用隐写分析方法及系统
WO2022141870A1 (zh) 基于人工智能的语音合成方法、装置、计算机设备和介质
CN112492312B (zh) 基于小波变换的图像压缩恢复方法、装置、设备和介质
WO2022178970A1 (zh) 语音降噪器训练方法、装置、计算机设备和存储介质
CN112669244A (zh) 人脸图像增强方法、装置、计算机设备以及可读存储介质
CN112885377A (zh) 语音质量评估方法、装置、计算机设备和存储介质
CN116959465A (zh) 语音转换模型训练方法、语音转换方法、装置及介质
CN113421554B (zh) 语音关键词检测模型处理方法、装置及计算机设备
WO2022126904A1 (zh) 语音转换方法、装置、计算机设备及存储介质
CN115294995A (zh) 语音转换方法、语音转换装置、电子设备、存储介质
JP7352243B2 (ja) コンピュータプログラム、サーバ装置、端末装置、学習済みモデル、プログラム生成方法、及び方法
CN115171666A (zh) 语音转换模型训练方法、语音转换方法、装置及介质
CN112950501A (zh) 基于噪声场的图像降噪方法、装置、设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21927404

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21927404

Country of ref document: EP

Kind code of ref document: A1