CN112992168B - Speech noise reducer training method, device, computer equipment and storage medium - Google Patents


Info

Publication number: CN112992168B
Application number: CN202110218925.3A
Authority: CN (China)
Prior art keywords: voice, noise, morphological, difference, data
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN112992168A
Inventor: 陈昊 (Chen Hao)
Current and original assignee: Ping An Technology Shenzhen Co Ltd (the listed assignees may be inaccurate)
Application filed by Ping An Technology Shenzhen Co Ltd
Priority: CN202110218925.3A; PCT/CN2021/090177 (WO2022178970A1)
Publication of CN112992168A; application granted; publication of CN112992168B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks

Abstract

The embodiments of this application belong to the field of voice processing and are applied in the field of smart cities. The application relates to a voice noise reducer training method comprising: constructing a morphological voice database; and performing multiple rounds of noise-reduction training on a noise reducer to be trained according to the morphological voice data and a voice loss function to obtain the trained noise reducer. In each round of noise-reduction training, the voice output of the morphological voice data is calculated by the noise reducer to be trained, the voice difference is calculated from the voice output and the voice loss function, and the noise reducer to be trained is trained according to the voice difference; if the voice difference in the current round satisfies a preset difference value, the noise reducer obtained in that round is taken as the trained noise reducer. Furthermore, the present application relates to blockchain technology, and the voice data may also be stored in a blockchain. The method solves the technical problems of slow network convergence and poor adaptability and robustness in the prior art.

Description

Speech noise reducer training method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of speech processing, and in particular, to a method and apparatus for training a speech noise reducer, a computer device, and a storage medium.
Background
As mobile terminals such as mobile phones become ever more widely used in daily life, downstream applications such as automatic speech recognition (ASR) systems are applied to more and more service scenarios. This places increasing demands on the quality of the audio fed to ASR, and voice noise reduction (speech enhancement) is critical to ensuring that quality.
Voice noise reduction in conventional technology is usually based on estimating the noise field and then applying a corresponding noise-reduction method. Such noise-field estimation can work well in specific settings but often performs poorly elsewhere, owing to the inherent limitations of the estimation. Since the advent of deep learning, more robust artificial-intelligence-based approaches have emerged, typically implemented with schemes such as generative adversarial networks (GANs) or autoencoders (AEs). These fall into two broad categories: generative methods, such as an encoder-decoder that produces a segment of denoised voice, and structure-learning methods, such as a GAN, that obtain the result through adversarial training. Each has advantages and drawbacks. The encoder-decoder scheme has good convergence characteristics and converges clearly in the noise-reduction direction, but it requires corresponding noise-free data as the training target, which is very difficult to satisfy in practice; approximation methods are therefore often needed, but, limited by the prior art, they do not simulate a sufficiently diverse range of noise conditions, so the model lacks strong generalization capability. The GAN scheme needs no noise-free data as the target, but such networks tend to be difficult to converge and to reach good results, and because the discriminator in a generative adversarial network is set up too freely, it is to some extent uncontrollable, leading to poor noise reduction.
Disclosure of Invention
Based on the above, the application provides a training method, a training device, a training computer device and a training storage medium for a voice noise reducer, so as to solve the technical problems of low network convergence speed and poor adaptability and robustness in the prior art.
A method of speech noise reducer training, the method comprising:
Constructing a morphological voice database, wherein the morphological voice database comprises multiple pieces of morphological voice data, each combining a noise voice obtained according to a noise generation algorithm with the original (body) voice;
Calculating the voice output of the morphological voice data with the noise reducer to be trained, and calculating the voice difference between the voice output and the morphological voice data in the morphological voice database using a preset voice loss function; and
Performing noise-reduction training on the noise reducer to be trained according to the voice difference, and taking the noise reducer obtained in the current round of training as the trained noise reducer when the voice difference satisfies a preset difference value.
A speech noise reducer training device, the device comprising:
The construction module is used for constructing a morphological voice database, wherein the morphological voice database comprises a plurality of morphological voice data of a combination of noise voice and body voice obtained according to a noise generation algorithm;
The calculation module is used for calculating the voice output of the morphological voice data according to the noise reducer to be trained and calculating the voice difference between the voice output and the morphological voice data in the morphological voice database by utilizing a preset voice loss function;
The training module is used for carrying out noise reduction training on the noise reducer to be trained according to the voice difference, and when the voice difference meets a preset difference value, the noise reducer obtained by the noise reduction training of the round is used as the noise reducer after training.
A computer device comprising a memory and a processor, and computer readable instructions stored in the memory and executable on the processor, which when executed by the processor implement the steps of the above-described speech noise reducer training method.
A computer readable storage medium storing computer readable instructions which when executed by a processor implement the steps of the above-described speech noise reducer training method.
According to the voice noise reducer training method, device, computer equipment and storage medium above, voice data containing noise is input into the noise reducer to be trained for multiple rounds of noise-reduction training: the voice difference between the network's output for one piece of voice data and a different piece of voice data is calculated, and the network weight of the noise reducer is adjusted according to that voice difference until the current round of noise-reduction training ends; the next round then proceeds on the basis of the current training. This solves the technical problems of slow network convergence and poor adaptability and robustness in the prior art.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an application environment for a speech noise reducer training method;
FIG. 2 is a flow chart of a method of training a speech noise reducer;
FIG. 3 is a schematic diagram of a noise reducer;
FIG. 4 is a diagram showing the comparison of network convergence rates;
FIG. 5 is a schematic diagram of a speech noise reducer training device;
FIG. 6 is a schematic diagram of a computer device in one embodiment.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to limit the application. The terms "comprising" and "having", and any variations thereof, in the description and claims of the application and in the description of the drawings above are intended to cover a non-exclusive inclusion. The terms "first", "second" and the like in the description, claims, or drawings are used to distinguish between different objects and not necessarily to describe a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The voice noise reducer training method provided by the embodiments of the application can be applied to the application environment shown in fig. 1. The application environment may include a terminal 102, a server 104, and a network providing a communication link medium between the terminal 102 and the server 104; the network may include various connection types, such as wired or wireless communication links, fiber optic cables, and so on.
A user may interact with the server 104 through a network using the terminal 102 to receive or send messages, etc. The terminal 102 may have installed thereon various communication client applications such as web browser applications, shopping class applications, search class applications, instant messaging tools, mailbox clients, social platform software, and the like.
The terminal 102 may be any of various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
The server 104 may be a server that provides various services, such as a background server that provides support for pages displayed on the terminal 102.
It should be noted that, the method for training the speech noise reducer provided by the embodiment of the application is generally executed by the server/terminal, and correspondingly, the device for training the speech noise reducer is generally arranged in the server/terminal equipment.
The application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The application can be applied to the field of smart cities, in particular to the field of smart banks, thereby promoting the construction of smart cities.
It should be understood that the number of terminals, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The terminal 102 communicates with the server 104 through a network. The server 104 constructs a morphological voice database by combining the noise voice and the body voice obtained from the terminal 102, and then performs multiple rounds of noise-reduction training on the noise reducer to be trained according to the morphological voice data and the voice loss function, obtaining the trained noise reducer. In each round, the voice output of the morphological voice data is calculated by the noise reducer to be trained, the voice difference is calculated from the voice output and the voice loss function, and the noise reducer is trained according to that difference; if the voice difference in the current round satisfies the preset difference value, the noise reducer obtained in that round is taken as the trained noise reducer. The terminal 102 may then perform voice noise-reduction processing using the trained noise reducer. The terminal 102 and the server 104 are connected through a network, which may be wired or wireless; the terminal 102 may be, but is not limited to, any of various personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices, and the server 104 may be implemented as an independent server or as a server cluster composed of multiple servers.
In one embodiment, as shown in fig. 2, a method for training a speech noise reducer is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
Step 202, a morphological voice database is constructed, wherein the morphological voice database comprises multiple pieces of morphological voice data, each a combination of a noise voice obtained according to a noise generation algorithm with the body voice.
In some embodiments, the scheme of this application is an intelligent voice noise-reduction technique based only on noise-containing data; that is, within a deep learning framework, noise reduction is achieved using noisy data alone. Conventional noise reduction is usually based on estimating the noise field and then applying a corresponding noise-reduction method; such estimation can work well in specific settings but often performs poorly elsewhere because of its inherent limitations. Since the rise of deep learning, more robust artificial-intelligence-based methods have been developed, typically implemented with schemes such as generative adversarial networks (GANs) or autoencoders (AEs). They fall into two broad categories: generative methods, such as an encoder-decoder that produces a segment of denoised voice, and structure-learning methods, such as a GAN, that obtain the result adversarially. Both have advantages and disadvantages. The encoder-decoder scheme has better convergence characteristics and can clearly converge in the noise-reduction direction, but requires corresponding noise-free data as the target, which is very difficult to satisfy in practice, so approximation methods are often needed. The GAN scheme removes the need for noise-free data as the target, but such networks tend to be difficult to converge and to obtain good results; even with statistically good results, an anomaly easily occurs in which certain segments are poorly denoised, mainly because the discriminator in the adversarial network is too free and thus hard to control.
The present application takes the encoder-decoder as its robust basis but improves it so that noise-free voice is no longer required as the target. The main idea, grounded in the law of large numbers, is to treat a segment of noisy voice as follows: assume the noise takes many forms while the real signal is fixed. By repeatedly observing voice that contains the same definite real signal but different noise forms, the expectation of the observed data approaches the real signal, which in effect denoises the voice signal.
Further, the server needs to construct a morphological voice database whose contents serve as training samples for the noise reducer: acquire a noise database containing noise voices from multiple different environments; generate the body (clean) voice; and combine the body voice with the noise data to obtain multiple pieces of morphological voice data containing different noise voices.
On this basis the morphological voice database is constructed. A large amount of voice data is first collected, and manual supervision is used to select the portions considered to have significant noise; a selected portion can be a complete voice recording or a segment of one. One way to obtain the data is as follows: the voices in the database may originate from recordings made in a standard quiet location (e.g., in a recording studio), together with recordings made in a noisy location (e.g., on a street); synthesizing the two yields noisy voice data. For example, a speaker may read a text or dialogue in a relatively quiet place such as a recording studio, and that recording is then combined with noise from various noise situations to obtain morphological voice data containing different noise.
Optionally, recordings of a person talking or reading the same material in different environments can be used directly as morphological voice data: for example, the same dialogue or passage spoken in a shopping mall, the same dialogue or passage in a road environment, or the same dialogue or passage spoken by different people in different environments. Morphological voice data can also be obtained in other ways, as long as the information expressed by the real signal stays consistent while the noise conditions differ.
Further, after the morphological voice database is built, the voice data inside it needs to be preprocessed:
The data input to the noise reducer to be trained is drawn from the dataset of the morphological voice database constructed above. Before being input to the model, each piece of voice data may be normalized into the interval [-1, 1], and the multiple pieces of voice data that share the same content are organized into a dictionary keyed by <T, c>, yielding the preprocessed morphological voice data.
Here T represents a content number and c indexes the specific voice data of each form; for example, <1, 3> denotes the 3rd-form voice data of content number 1.
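As a minimal sketch of this preprocessing (the function names and dictionary layout are illustrative assumptions, not the patent's actual implementation), the [-1, 1] normalization and the <T, c> dictionary could look like:

```python
from typing import Dict, List, Tuple

def normalize(samples: List[float]) -> List[float]:
    """Scale a waveform into [-1, 1] by its peak absolute amplitude."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)
    return [s / peak for s in samples]

def build_dictionary(recordings: List[Tuple[int, int, List[float]]]
                     ) -> Dict[Tuple[int, int], List[float]]:
    """Map <T, c> keys (content number, noise-form index) to normalized audio."""
    return {(t, c): normalize(wave) for t, c, wave in recordings}

# Example: content number 1 recorded under two different noise forms.
d = build_dictionary([
    (1, 1, [0.5, -2.0, 1.0]),
    (1, 2, [0.25, 1.0, -0.5]),
])
```

Keying by (content number, form index) makes it cheap to later draw two different noise forms of the same content as an input/target pair.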
Step 204, calculating the voice output of the morphological voice data according to the noise reducer to be trained, and calculating the voice difference between the voice output and the morphological voice data in the morphological voice database by using a preset voice loss function.
In some embodiments, the technical scheme of this application performs multiple rounds of noise-reduction training on the noise reducer to be trained, according to the morphological voice data and a voice loss function, to obtain the trained noise reducer. In each round, the voice output of the morphological voice data is calculated by the noise reducer to be trained, the voice difference is calculated from the voice output and the voice loss function, and the noise reducer is trained according to that difference; if the voice difference in the current round satisfies the preset difference value, the noise reducer obtained in that round is taken as the trained noise reducer.
As shown in fig. 3, the noise reducer used in the present application adopts a typical encoder-decoder structure. Each encoder is composed of a substructure of convolutional layer + batch normalization layer + ReLU activation function; how many convolution kernels each convolutional layer uses is optional. The input information passes through the encoders sequentially from top to bottom; in addition, each encoder's output is copied once through a skip connection, and the copy is combined with the output of the decoder one level below to form the input of the decoder at the corresponding level.
Training of the noise reducer: the dictionary constructed above is imported into the noise reducer to be trained, and training proceeds as follows. Suppose, for example, that content number 1 has 5 recordings in the dictionary; one recording, say number 1, is drawn at random as the input, and another recording, say number 3, is drawn as the training target.
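A minimal sketch of this input/target sampling from the <T, c> dictionary (function and variable names are illustrative assumptions, not the patent's code):

```python
import random
from typing import Dict, List, Tuple

def sample_pair(dictionary: Dict[Tuple[int, int], List[float]],
                content: int,
                rng: random.Random) -> Tuple[List[float], List[float]]:
    """Draw two distinct noise forms of the same content: (input, target)."""
    forms = [c for (t, c) in dictionary if t == content]
    c_in, c_target = rng.sample(forms, 2)  # two different form indices
    return dictionary[(content, c_in)], dictionary[(content, c_target)]
```

Because `random.Random.sample` draws without replacement, the input and target are always two different noise forms of the same underlying content.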
Further, the morphological voice data includes first-form voice data and second-form voice data (for example, recording number 1 serves as the first-form voice data and recording number 3 as the second-form voice data), and the method includes the following steps:
Inputting the first-form voice data into the noise reducer to be trained to obtain a first voice output; comparing the first voice output with the second-form voice data to obtain a first voice difference; adjusting the network weight of the noise reducer to be trained according to the first voice difference to obtain a first-state noise reducer; updating the first-form voice data and the second-form voice data; inputting the updated first-form voice data into the noise reducer to be trained to obtain an updated first voice output; and comparing the updated first voice output with the updated second-form voice data to obtain an updated first voice difference.
That is, the form-1 voice data is input as the first-form voice data into the noise reducer to be trained, yielding the first voice output; the first voice output is the result produced by the noise reducer to be trained after denoising the first-form voice data.
Specifically, the noise reducer to be trained processes the input data as follows. As shown in fig. 3, an array representing the voice data (the first-form voice data) is input into the noise reducer to be trained. Encode1 encodes the data into an information tensor and passes it downward; encode2 encodes it further into a new information tensor. Note that a skip connection branches off at encode2: a copy of the new tensor is carried straight across to the corresponding decoder rather than downward, while the tensor itself is also passed to the next encoder layer. Encoding continues in this manner until the bottom encoder produces its information tensor, which is combined directly with the tensor copied from the encoder above and passed to the corresponding decoder for decoding; decoding then proceeds upward in sequence until the output yields the first voice output.
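The data flow just described, with encoder outputs copied across skip connections and combined on the decoder path, can be sketched structurally (a toy with identity-like stand-ins for the real convolution + batch-normalization + ReLU blocks; all names are illustrative, not the patent's implementation):

```python
from typing import Callable, List

Tensor = List[float]

def relu(t: Tensor) -> Tensor:
    return [max(0.0, v) for v in t]

def make_encoder(weight: float) -> Callable[[Tensor], Tensor]:
    # Stand-in for conv + batch norm + ReLU: scale, then activate.
    return lambda t: relu([weight * v for v in t])

def make_decoder(weight: float) -> Callable[[Tensor], Tensor]:
    return lambda t: [weight * v for v in t]

def denoiser_forward(x: Tensor,
                     encoders: List[Callable[[Tensor], Tensor]],
                     decoders: List[Callable[[Tensor], Tensor]]) -> Tensor:
    """Encoder-decoder with skip connections: each encoder output is
    copied, and the copy is combined with the tensor flowing back up
    at the matching decoder level."""
    skips = []
    t = x
    for enc in encoders:
        t = enc(t)
        skips.append(t)          # copy kept for the skip connection
    for dec, skip in zip(decoders, reversed(skips)):
        t = dec(skip + t)        # concatenate the copy with the upward tensor
    return t
```

The real network would concatenate along the channel dimension of convolutional feature maps; list concatenation here only illustrates the routing of the copies.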
We then need to compare the first voice output with the form-3 voice data, i.e. the second-form voice data, to obtain the first voice difference.
The first voice difference is calculated by subtracting the first voice output from the second-form voice data, giving the voice difference between the output voice data and voice data that has not been denoised.
The reason for calculating the difference between the two voices is that a neural network is very good at capturing what the input and the target have in common. The design idea of this technical scheme is to liken the training process of the neural network to repeatedly measuring voice data with the same meaning. Through this repeated measurement the network is pushed toward a state in which, as long as the voice has the same meaning, it produces a very close result regardless of the noise form. What is that shared result? It is what all the recordings have in common, namely the signal of the true meaning; thus a piece of real voice is denoised without affecting its true meaning. Reflected in model training: the desire is that voice inputs with the same meaning yield very close outputs. That is, the first voice input produces an estimate (the first voice estimate) which, in this embodiment, should be very similar to the second-form voice data.
Further, the first voice difference may be obtained by squaring the subtraction result above.
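A minimal sketch of this squared voice difference (a mean-squared-error-style measure; the function name is an illustrative assumption):

```python
from typing import List

def voice_difference(output: List[float], target: List[float]) -> float:
    """Mean squared difference between the network's voice output and
    the target-form voice data."""
    assert len(output) == len(target)
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(output)
```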
And then, according to the first voice difference, adjusting the network weight of the noise reducer to be trained to obtain the first state noise reducer.
Specifically, the network weight is adjusted through the neural network's backpropagation mechanism; the application specifically uses the Adam optimization method.
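For illustration, a single Adam update step over a weight vector might look like this (a generic textbook Adam implementation with standard default hyperparameters, not code from the patent):

```python
import math
from typing import List

def adam_step(w: List[float], g: List[float],
              m: List[float], v: List[float], t: int,
              lr: float = 1e-3, b1: float = 0.9, b2: float = 0.999,
              eps: float = 1e-8) -> None:
    """One in-place Adam update: exponential moving averages of the
    gradient and squared gradient, bias correction, then the step."""
    for i in range(len(w)):
        m[i] = b1 * m[i] + (1 - b1) * g[i]
        v[i] = b2 * v[i] + (1 - b2) * g[i] ** 2
        m_hat = m[i] / (1 - b1 ** t)          # bias-corrected first moment
        v_hat = v[i] / (1 - b2 ** t)          # bias-corrected second moment
        w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
```

In practice the gradient `g` would come from backpropagating the voice difference through the network; here it is just an argument.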
Step 206, performing noise-reduction training on the noise reducer to be trained according to the voice difference, and taking the noise reducer obtained in the current round of noise-reduction training as the trained noise reducer when the voice difference meets a preset difference value.
Further, the voice difference also includes a second voice difference, and the first-state noise reducer includes a first network weight. Performing noise-reduction training on the noise reducer to be trained according to the voice difference includes:
Adjusting the network weight of the first-state noise reducer according to the updated first voice difference to obtain a second-state noise reducer and a second network weight; calculating, based on the first network weight and the second network weight, the second voice difference between the outputs of the first-state and second-state noise reducers through the voice loss function; and, when the second voice difference is smaller than a preset difference threshold, using the second-state noise reducer as the trained noise reducer.
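One possible reading of this convergence check, comparing the outputs of two consecutive network states against the preset difference threshold, can be sketched as follows (all names are hypothetical; this is an illustrative sketch, not the patent's implementation):

```python
from typing import Callable, List

def second_voice_difference(net1: Callable[[List[float]], List[float]],
                            net2: Callable[[List[float]], List[float]],
                            voice: List[float]) -> float:
    """Mean squared difference between the first-state and second-state
    noise reducers' outputs on the same voice input."""
    y1, y2 = net1(voice), net2(voice)
    return sum((a - b) ** 2 for a, b in zip(y1, y2)) / len(voice)

def converged(diff: float, threshold: float) -> bool:
    """Training stops once the second voice difference falls below the
    preset difference threshold."""
    return diff < threshold
```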
Specifically, the first-state noise reducer can be updated according to the updated first voice difference to obtain the second-state noise reducer.
Further, when the network weight of the noise reducer to be trained is adjusted according to the first voice difference, the first network weight is obtained in addition to the first-state noise reducer; the first network weight is the parameter of the first-state noise reducer.
The second network weight is the parameter of the noise reducer in the second state. In this embodiment, the noise reducer to be trained can be trained for noise reduction through the voice loss function (1):
L = Ln + αLw + βLr
(1)
where L is the total loss, Ln is the consistency loss function, Lw is the weight loss function, Lr is the reconstruction loss function, and α and β are manually set weighting parameters.
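A minimal sketch of combining the three loss terms of the voice loss function (1) (the weighting-parameter names and the function name are illustrative assumptions):

```python
def total_loss(l_n: float, l_w: float, l_r: float,
               alpha: float, beta: float) -> float:
    """Formula (1): L = Ln + alpha * Lw + beta * Lr, i.e. consistency
    loss plus weighted weight loss plus weighted reconstruction loss."""
    return l_n + alpha * l_w + beta * l_r
```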
Specifically, when the noise reducer is trained, the dictionary constructed in the above process is imported into the noise reducer for training in the following manner: for example, the dictionary entry with content number 1 contains 5 voices in total; one voice, say No. 1, is randomly extracted from that entry as the input, and another voice, say No. 3, is extracted as the training target.
If training is performed directly at this point, it is often difficult to converge. To ensure convergence, a new loss function is designed here, which we derive as follows.
For this task, the training process can be abstracted into equation (2):
argmin_θ (1/N) Σ_{j,k} ‖ f_θ(x + n_j) − (x + n_k) ‖²    (2)
Here, θ represents the network parameters minimizing the loss under a certain noise distribution, N is the amount of total voice data, f represents the processing of voice data by the above-mentioned noise reducer (network) to be trained, x + n represents voice data obtained by adding noise to the real signal (ontology voice), and the subscripts j and k of n are used to distinguish different noises. Equation (2) can be transformed by expanding the squared term inside the minimization, from which equation (3) is derived:
(1/N) Σ ‖ y − (x + n_k) ‖² = (1/N) Σ ‖ y − x ‖² − (2/N) Σ (y − x)·n_k + (1/N) Σ ‖ n_k ‖²    (3)
Here, y denotes the output of f in equation (2). As equation (3) shows, the first term on the right-hand side is exactly normal training, i.e. the case of a noise-free target. Therefore, for our training to achieve the effect of normal training, the last two terms must statistically sum to zero; this is the key point of our loss function design.
For this purpose, the present embodiment first designs the following consistency loss function (4):
Ln = ‖ f_θ1(x + n1) − (x + n2) ‖² + ‖ f_θ2(x + n2) − (x + n1) ‖²    (4)
It should be noted that this consistency loss is calculated in several passes: the first term is computed with the input x + n1 while the network state (the network weight, i.e. the parameters of the noise reducer) is θ1, and the second term with the input x + n2 after the network weight has been updated by gradient descent to θ2; that is, the input and the target are interchanged in sequence when the scheme is calculated.
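A sketch of this two-pass, input-and-target-swapped computation follows; `f(theta, x)` is a stand-in for the noise reducer network, which is assumed rather than implemented here:

```python
import numpy as np

def consistency_loss(f, theta1, theta2, x_n1, x_n2):
    """Consistency loss (4), sketched: the first term feeds x+n1 through the
    network in state theta1 with x+n2 as target; the second term swaps input
    and target and uses the updated state theta2."""
    term1 = np.mean((f(theta1, x_n1) - x_n2) ** 2)
    term2 = np.mean((f(theta2, x_n2) - x_n1) ** 2)
    return term1 + term2

# toy linear "network": f(theta, x) = theta * x
f = lambda theta, x: theta * x
x_n1 = np.array([1.0, 2.0])   # same ontology voice, noise n1
x_n2 = np.array([1.5, 1.5])   # same ontology voice, noise n2
ln = consistency_loss(f, 1.0, 1.0, x_n1, x_n2)
```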
Meanwhile, a second loss function, the weight loss function (5), is formulated:
Lw = ‖ θ1 − θ2 ‖²    (5)
Likewise, speech inputs carrying different noises should lead to very similar network parameters; the weight loss function enforces this, where θ denotes the network weights. In addition, the reconstruction loss function (6) is defined as follows:
where ȳ refers to the average of the outputs over the speech data used in training.
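Minimal sketches of these two auxiliary losses are given below. The squared-L2 form of the weight loss and the pull-toward-the-output-average form of the reconstruction loss are assumptions, since the patent's figures for equations (5) and (6) are not reproduced in this text:

```python
import numpy as np

def weight_loss(theta1, theta2):
    """Weight loss (5), assumed squared-L2 distance between the two
    network weight states."""
    return np.sum((np.asarray(theta1) - np.asarray(theta2)) ** 2)

def reconstruction_loss(outputs):
    """Reconstruction loss (6), assumed to pull each training output toward
    the output average y_bar of the speech data used in training."""
    outputs = np.asarray(outputs)
    y_bar = outputs.mean(axis=0)          # the output average
    return np.mean((outputs - y_bar) ** 2)

lw = weight_loss([1.0, 2.0], [1.0, 1.0])
```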
The first network weight and the second network weight are input into the voice loss function (1) to obtain the loss value of the first state noise reducer and the second state noise reducer, and this loss value is taken as the second voice difference output by the first state noise reducer and the second state noise reducer. Whether the second voice difference is smaller than the preset difference threshold is then judged: if so, the second state noise reducer is taken as the trained noise reducer; if not, the network weight of the second state noise reducer is adjusted according to the second voice difference to obtain a third state noise reducer, and another round of training is performed with the updated morphological voice data until the obtained second voice difference meets the requirement.
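The round structure just described — adjust weights to reach a new state, compute the second voice difference between consecutive states, compare it against the threshold, and otherwise continue with updated data — can be sketched as follows, with all component functions as placeholders for the machinery described in the text:

```python
def train_noise_reducer(step_fn, second_diff_fn, next_data_fn,
                        threshold, max_rounds=100):
    """Keep updating the noise reducer state until the second voice
    difference between consecutive states falls below `threshold`.

    step_fn(state, data)         -> new state (weight adjustment from the first voice difference)
    second_diff_fn(s_old, s_new) -> second voice difference (loss value from function (1))
    next_data_fn()               -> fresh (first, second) morphological voice data
    """
    state = step_fn(None, next_data_fn())           # first state noise reducer
    for _ in range(max_rounds):
        new_state = step_fn(state, next_data_fn())  # second / third / ... state
        if second_diff_fn(state, new_state) < threshold:
            return new_state                        # trained noise reducer
        state = new_state
    return state

# toy example: the "state" is a scalar weight converging geometrically to 0
trained = train_noise_reducer(
    step_fn=lambda s, d: 0.5 * (s if s is not None else 1.0),
    second_diff_fn=lambda a, b: abs(a - b),
    next_data_fn=lambda: None,
    threshold=1e-3,
)
```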
According to the technical scheme, the noise reduction effect of the noise reducer can be greatly improved, and further, in order to embody the effect of the noise reducer, the noise reduction effect is shown in the following table 1:
TABLE 1
Synthesized noise-containing voice data were denoised using the noise reducer obtained through the technical scheme of the application and compared with common noise reduction methods (SEGAN, WaveNet). The denoised data were tested for SNR, SIG and BAK (these are industry-accepted standard metrics in common use; their specific meanings are not repeated here); larger values in Table 1 are better.
The convergence rates of the trained models are shown in the network convergence rate comparison diagram of fig. 4. Fig. 4 shows the three models above over 100 training epochs: WaveNet and the present application converge quickly, while the GAN-based SEGAN has not yet converged. In fig. 4, the wavy line is the convergence curve of the GAN-based SEGAN; between epochs 20 and 40, the upper curve segment corresponds to WaveNet and the lower curve segment to the present application.
Further, updating the first modality voice data and the second modality voice data includes: optionally selecting two morphological speech data comprising different noise data from a morphological speech database, wherein the selected morphological speech data is different from at least one of the first morphological speech data and the second morphological speech data before updating; and respectively updating the two morphological voice data into the first morphological voice data and the second morphological voice data.
The updating of the first morphological voice data and the second morphological voice data must ensure that the voice data input into the noise reducer again and the morphological voice data before updating do not refer to the same pair of morphological voice data. That is, if the first morphological voice data input the first time is morphological voice data No. 1 and the second is No. 3, the first morphological voice data input the second time may be any of No. 1, No. 2 or No. 3; but when the first morphological voice data is again No. 1, the second morphological voice data may not be No. 3, and must instead be some other morphological voice data.
Optionally, the morphological voice data in the morphological voice database can be randomly combined in sequence in advance to obtain voice data sets each comprising two morphological voice data; any set whose two morphological voice data are identical is removed, and the first morphological voice data and the second morphological voice data are taken from different voice data sets, thereby achieving the purpose of updating the first morphological voice data and the second morphological voice data.
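One illustrative way to pre-build such pairs is sketched below; the helper is hypothetical and simply enumerates ordered (input, target) pairs of morphological voice data indices, dropping pairs whose two entries are the same variant:

```python
from itertools import permutations

def build_voice_pairs(num_variants):
    """Enumerate ordered (first, second) morphological voice data pairs,
    excluding pairs that use the same noisy variant as both input and target."""
    return list(permutations(range(num_variants), 2))

pairs = build_voice_pairs(5)   # 5 noisy variants of one ontology voice
# e.g. (0, 2) means variant 0 as input and variant 2 as training target
```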
After the new first morphological voice data is input into the noise reducer, the resulting output serves as the updated first voice output; it is compared with the new second morphological voice data, and the comparison result is obtained as the updated first voice difference.
Further, after the trained noise reducer is obtained, the merged data to be noise-reduced is sent into the pre-built noise reducer to obtain an output result; the original signal format is then restored from that result according to the original merging mode, at which point the noise-reduced voice is obtained.
Specifically, regarding the merging of the present embodiment: for example, if the original input is voice in mp3 format, then in order to perform the above numerical processing, this embodiment converts it into an array composed of digital data; after the model processing is completed, an array composed of data is obtained, which is then converted back into mp3 format by the inverse of the previous process, and this segment of mp3-format voice is the noise-reduced voice.
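The patent's example uses mp3; the same array-and-back round trip can be sketched with uncompressed WAV (Python's standard `wave` module) purely for illustration, since real mp3 handling would require an external codec library:

```python
import io
import wave
import numpy as np

def wav_bytes_to_array(wav_bytes):
    """Decode 16-bit mono WAV bytes into a numpy array plus its sample rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        rate = w.getframerate()
        samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    return samples, rate

def array_to_wav_bytes(samples, rate):
    """Inverse process: encode the (possibly denoised) array back to WAV bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)                 # 16-bit samples
        w.setframerate(rate)
        w.writeframes(np.asarray(samples, dtype=np.int16).tobytes())
    return buf.getvalue()

# round trip on a short synthetic 440 Hz tone at 16 kHz
tone = (1000 * np.sin(2 * np.pi * 440 * np.arange(1600) / 16000)).astype(np.int16)
blob = array_to_wav_bytes(tone, 16000)
back, rate = wav_bytes_to_array(blob)
```

In the patent's pipeline, the model processing would sit between the decode and re-encode steps.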
It should be emphasized that, to further ensure the privacy and security of the user information, the voice data information may also be stored in a blockchain node.
According to the voice noise reducer training method, voice data comprising noise are input into the noise reducer to be trained for multiple rounds of noise reduction training: the voice difference between the output for one piece of voice data and a different piece of voice data is calculated, and the network weight of the noise reducer is adjusted according to that voice difference until the round of noise reduction training ends; the next round of noise reduction training then proceeds on the basis of the current training. The method can train a good noise reduction neural network from noise-containing data alone, and the resulting network has a high convergence rate and strong adaptability and robustness.
It should be understood that, although the steps in the flowchart of fig. 2 are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the steps are not strictly limited to this order of execution and may be executed in other orders. Moreover, at least some of the steps in fig. 2 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; the order in which these sub-steps or stages are performed also need not be sequential, and they may be performed in turn or alternately with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, a speech noise reducer training device is provided, where the speech noise reducer training device corresponds to the speech noise reducer training method in the above embodiment one by one. The speech noise reducer training device comprises:
The construction module 502 is configured to construct a morphological voice database, where the morphological voice database includes a plurality of morphological voice data that are obtained by a noise generation algorithm and are combined by an ontology voice;
The training module 504 is configured to perform multiple rounds of noise reduction training on the noise reducer to be trained according to the morphological speech data and the speech loss function, so as to obtain a trained noise reducer, where the noise reduction training is to calculate speech output of the morphological speech data according to the noise reducer to be trained, calculate speech difference according to the speech output and the speech loss function, perform noise reduction training on the noise reducer to be trained according to the speech difference, and if the speech difference in the noise reduction training of the current round meets a preset difference value, use the noise reducer obtained in the noise reduction training of the current round as the trained noise reducer.
Further, the morphology speech data includes first morphology speech data and second morphology speech data, the speech output includes first speech data, and the training module 504 includes:
the first noise reduction sub-module is used for inputting the first form voice data into the noise reducer to be trained to obtain the first voice data;
the first comparison sub-module is used for comparing the voice difference between the first voice output and the second form voice data to obtain a first voice difference;
The first adjusting sub-module is used for adjusting the network weight of the noise reducer to be trained according to the first voice difference to obtain a first state noise reducer;
The data updating sub-module is used for updating the first form voice data and the second form voice data;
The noise reduction updating sub-module is used for inputting the updated first form voice data into a noise reducer to be trained to obtain updated first voice output;
And the comparison and update sub-module is used for comparing the voice difference between the updated first voice output and the updated second-form voice data to obtain the updated first voice difference.
Further, the speech differences include second speech differences, the first state noise reducer includes first network weights, and the training module 504 further includes:
an updating adjustment sub-module, configured to adjust a network weight of the first state noise reducer according to the updated first voice difference, so as to obtain a second state noise reducer and a second network weight;
the difference updating sub-module is used for calculating second voice differences output by the first state noise reducer and the second state noise reducer through a voice loss function based on the first network weight and the second network weight;
And the training result submodule is used for taking the second state noise reducer as a trained noise reducer when the second voice difference is smaller than a preset difference threshold value.
Further, the data updating sub-module includes:
A selection unit for selecting two morphological voice data including different noise data from the morphological voice database, wherein the selected morphological voice data is different from at least one of the first morphological voice data and the second morphological voice data before updating;
The first updating unit is used for updating the two morphological voice data into the first morphological voice data and the second morphological voice data respectively.
Further, the data updating sub-module further includes:
The combination unit is used for carrying out random combination on the morphological voice data in the morphological voice database in sequence to obtain a voice data group comprising two morphological voice data;
The second updating unit is used for removing a group of voice data groups with the same sequence and form voice data to obtain at least one voice data group with different sequence and form voice data, wherein the voice data group comprises first form voice data and second form voice data.
Further, the building module 502 includes:
The acquisition sub-module is used for acquiring a noise database, wherein the noise database comprises a plurality of noise voices in different environments;
The generating sub-module is used for generating the ontology voice;
and the construction submodule is used for combining the body voice and the noise data to obtain a plurality of morphological voice data comprising different noise voices.
It should be emphasized that, to further ensure the privacy and security of the user information, the voice data information may also be stored in a blockchain node.
According to the voice noise reducer training device, voice data comprising noise are input into the noise reducer to be trained for multiple rounds of noise reduction training: the voice difference between the output for one piece of voice data and a different piece of voice data is calculated, and the network weight of the noise reducer is adjusted according to that voice difference until the round of noise reduction training ends; the next round of noise reduction training then proceeds on the basis of the current training. The device can train a good noise reduction neural network from noise-containing data alone, and the resulting network has a high convergence rate and strong adaptability and robustness.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer readable instructions, and a database. The internal memory provides an environment for the execution of an operating system and computer-readable instructions in a non-volatile storage medium. The database of the computer device is for storing voice data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer readable instructions, when executed by a processor, implement a speech noise reducer training method.
According to this embodiment, voice data comprising noise are input into the noise reducer to be trained for multiple rounds of noise reduction training: the voice difference between the output for one piece of voice data and a different piece of voice data is calculated, and the network weight of the noise reducer is adjusted according to that voice difference until the round of noise reduction training ends; the next round of noise reduction training then proceeds on the basis of the current training. This solves the technical problems of low network convergence speed and poor adaptability and robustness in the prior art.
It will be appreciated by those skilled in the art that the computer device herein is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
In one embodiment, a computer readable storage medium is provided, on which computer readable instructions are stored, which when executed by a processor, implement the steps of the method for training a speech noise reducer of the above embodiment, such as steps 202 through 204 shown in fig. 2, or the processor, when executing the computer readable instructions, implement the functions of the modules/units of the apparatus for training a speech noise reducer of the above embodiment, such as the functions of modules 502 through 504 shown in fig. 5.
According to this embodiment, voice data comprising noise are input into the noise reducer to be trained for multiple rounds of noise reduction training: the voice difference between the output for one piece of voice data and a different piece of voice data is calculated, and the network weight of the noise reducer is adjusted according to that voice difference until the round of noise reduction training ends; the next round of noise reduction training then proceeds on the basis of the current training. This solves the technical problems of low network convergence speed and poor adaptability and robustness in the prior art.
Those skilled in the art will appreciate that implementing all or part of the processes of the methods of the embodiments described above may be accomplished by instructing the associated hardware through computer readable instructions stored on a non-transitory computer readable storage medium, which when executed may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include Read-Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The blockchain (Blockchain), essentially a de-centralized database, is a string of data blocks that are generated in association using cryptographic methods, each of which contains information from a batch of network transactions for verifying the validity (anti-counterfeit) of its information and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that, for those skilled in the art, it is possible to make several modifications, improvements or equivalent substitutions for some technical features without departing from the concept of the present application, and these modifications or substitutions do not make the essence of the same technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present application, and all the modifications or substitutions fall within the protection scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (8)

1. A method of training a speech noise reducer, the method comprising:
Constructing a morphological voice database, wherein the morphological voice database comprises a plurality of morphological voice data of a combination of noise voice and body voice obtained according to a noise generation algorithm;
Calculating the voice output of the morphological voice data according to a noise reducer to be trained, and calculating the voice difference between the voice output and the morphological voice data in the morphological voice database by using a preset voice loss function;
The specific steps of calculating the voice output of the morphological voice data according to the noise reducer to be trained and calculating the voice difference between the voice output and the morphological voice data in the morphological voice database by utilizing a preset voice loss function include:
Inputting the voice data of the first form into a noise reducer to be trained to obtain first voice output;
comparing the voice difference between the first voice output and the second form voice data to obtain a first voice difference;
According to the first voice difference, adjusting the network weight of the noise reducer to be trained to obtain a first state noise reducer;
updating the first form voice data and the second form voice data;
inputting the updated first form voice data into a noise reducer to be trained to obtain updated first voice output;
Comparing the voice difference between the updated first voice output and the updated second-form voice data to obtain an updated first voice difference;
Performing noise reduction training on the noise reducer to be trained according to the voice difference, and taking the noise reducer obtained by the noise reduction training of the round as the trained noise reducer when the voice difference meets a preset difference value;
The voice difference comprises a second voice difference, the first state noise reducer comprises a first network weight, the noise reducer to be trained is noise-reduced according to the voice difference, and the noise reducer to be trained comprises:
According to the updated first voice difference, adjusting the network weight of the first state noise reducer to obtain a second state noise reducer and a second network weight;
based on the first network weight and the second network weight, calculating a second voice difference output by the first state noise reducer and the second state noise reducer through a voice loss function;
When the second voice difference is smaller than a preset difference threshold value, the second state noise reducer is used as a trained noise reducer;
the speech loss function is: L = Ln + βw·Lw + βr·Lr
Where L is the total loss, Ln is the consistency loss function, Lw is the weight loss function, Lr is the reconstruction loss function, and βw and βr are artificially set parameters.
2. The method of claim 1, wherein comparing the first speech output to the speech difference of the second form of speech data to obtain a first speech difference comprises:
and subtracting the first voice output from the second-form voice data to obtain a first voice difference between the first voice output and the second-form voice data.
3. The method of claim 1, wherein the updating the first morphology speech data and the second morphology speech data comprises:
optionally selecting two morphological speech data comprising different noise data from the morphological speech database, wherein the selected morphological speech data is different from at least one of the first morphological speech data and the second morphological speech data before updating;
And respectively updating the two morphological voice data into first morphological voice data and second morphological voice data.
4. The method of claim 1, wherein the updating the first morphology speech data and the second morphology speech data comprises:
carrying out random combination on the morphological voice data in the morphological voice database in sequence to obtain a voice data group comprising two morphological voice data;
And removing a group of voice data groups with the same sequence and form voice data to obtain at least one voice data group with different sequence and form voice data, wherein the voice data group comprises first form voice data and second form voice data.
5. The method of claim 1, wherein said constructing a morphological speech database comprises:
Acquiring a noise database, wherein the noise database comprises a plurality of noise voices in different environments;
generating body voice;
and combining the body voice and the noise data to obtain a plurality of morphological voice data comprising different noise voices.
6. A speech noise reducer training device, comprising:
The construction module is used for constructing a morphological voice database, wherein the morphological voice database comprises a plurality of morphological voice data of a combination of noise voice and body voice obtained according to a noise generation algorithm;
The calculation module is used for calculating the voice output of the morphological voice data according to the noise reducer to be trained and calculating the voice difference between the voice output and the morphological voice data in the morphological voice database by utilizing a preset voice loss function;
the training module is used for carrying out noise reduction training on the noise reducer to be trained according to the voice difference, and when the voice difference meets a preset difference value, the noise reducer obtained by the noise reduction training of the round is used as the trained noise reducer;
the morphological voice data comprises first morphological voice data and second morphological voice data, and the voice difference comprises a second voice difference;
The training module comprises:
The first noise reduction sub-module is used for inputting the voice data of the first form into the noise reducer to be trained to obtain first voice output;
the first comparison sub-module is used for comparing the voice difference between the first voice output and the second-form voice data to obtain a first voice difference;
The first adjusting sub-module is used for adjusting the network weight of the noise reducer to be trained according to the first voice difference to obtain a first state noise reducer;
a data updating sub-module for updating the first form voice data and the second form voice data;
The noise reduction updating sub-module is used for inputting the updated first form voice data into a noise reducer to be trained to obtain updated first voice output;
The comparison and update sub-module is used for comparing the voice difference between the updated first voice output and the updated second-form voice data to obtain an updated first voice difference;
the speech differences include second speech differences, the first state noise reducer includes first network weights, and the training module further includes:
An updating adjustment sub-module, configured to adjust a network weight of the first state noise reducer according to the updated first voice difference, so as to obtain a second state noise reducer and a second network weight;
The difference updating sub-module is used for calculating second voice differences output by the first state noise reducer and the second state noise reducer through a voice loss function based on the first network weight and the second network weight;
The training result submodule is used for taking the second state noise reducer as a trained noise reducer when the second voice difference is smaller than a preset difference threshold value;
the speech loss function is: L = Ln + βw·Lw + βr·Lr
Where L is the total loss, Ln is the consistency loss function, Lw is the weight loss function, Lr is the reconstruction loss function, and βw and βr are artificially set parameters.
7. A computer device comprising a memory storing computer readable instructions and a processor, wherein the processor when executing the computer readable instructions performs the steps of the method of any one of claims 1 to 5.
8. A computer readable storage medium having stored thereon computer readable instructions, which when executed by a processor, implement the steps of the method of any of claims 1 to 5.
CN202110218925.3A 2021-02-26 2021-02-26 Speech noise reducer training method, device, computer equipment and storage medium Active CN112992168B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110218925.3A CN112992168B (en) 2021-02-26 2021-02-26 Speech noise reducer training method, device, computer equipment and storage medium
PCT/CN2021/090177 WO2022178970A1 (en) 2021-02-26 2021-04-27 Speech noise reducer training method and apparatus, and computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110218925.3A CN112992168B (en) 2021-02-26 2021-02-26 Speech noise reducer training method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112992168A CN112992168A (en) 2021-06-18
CN112992168B true CN112992168B (en) 2024-04-19

Family

ID=76351159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110218925.3A Active CN112992168B (en) 2021-02-26 2021-02-26 Speech noise reducer training method, device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112992168B (en)
WO (1) WO2022178970A1 (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147810A (en) * 2018-09-30 2019-01-04 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus, device, and computer storage medium for establishing a speech enhancement network
CN109637525A (en) * 2019-01-25 2019-04-16 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for generating an in-vehicle acoustic model
CN110491404A (en) * 2019-08-15 2019-11-22 Guangzhou Huaduo Network Technology Co., Ltd. Speech processing method, apparatus, terminal device, and storage medium
CN110600017A (en) * 2019-09-12 2019-12-20 Tencent Technology (Shenzhen) Co., Ltd. Training method for a speech processing model, and speech recognition method, system, and apparatus
CN110808058A (en) * 2019-11-11 2020-02-18 Guangzhou Guoyin Intelligent Technology Co., Ltd. Speech enhancement method, apparatus, device, and readable storage medium
CN111583951A (en) * 2020-04-29 2020-08-25 Huazhong University of Science and Technology Speech noise reduction method and system based on deep feature loss
CN111696532A (en) * 2020-06-17 2020-09-22 Beijing Dajia Internet Information Technology Co., Ltd. Speech recognition method, speech recognition apparatus, electronic device, and storage medium
CN111863003A (en) * 2020-07-24 2020-10-30 Suzhou AISpeech Information Technology Co., Ltd. Speech data augmentation method and apparatus
CN112365885A (en) * 2021-01-18 2021-02-12 Shenzhen Youjie Zhixin Technology Co., Ltd. Wake-up model training method and apparatus, and computer device
CN112397057A (en) * 2020-12-01 2021-02-23 Ping An Technology (Shenzhen) Co., Ltd. Speech processing method, apparatus, device, and medium based on a generative adversarial network

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108922560B (en) * 2018-05-02 2022-12-02 Hangzhou Dianzi University Urban noise identification method based on a hybrid deep neural network model
CN110503968A (en) * 2018-05-18 2019-11-26 Beijing Sogou Technology Development Co., Ltd. Audio processing method, apparatus, device, and readable storage medium
CN109841226B (en) * 2018-08-31 2020-10-16 Elevoc Technology Co., Ltd. (Shenzhen) Single-channel real-time noise reduction method based on a convolutional recurrent neural network
US20210012767A1 (en) * 2020-09-25 2021-01-14 Intel Corporation Real-time dynamic noise reduction using convolutional networks
CN112309426A (en) * 2020-11-24 2021-02-02 Beijing Dajia Internet Information Technology Co., Ltd. Speech processing model training method and apparatus, and speech processing method and apparatus


Also Published As

Publication number Publication date
CN112992168A (en) 2021-06-18
WO2022178970A1 (en) 2022-09-01

Similar Documents

Publication Publication Date Title
CN110930976B (en) Voice generation method and device
US10810993B2 (en) Sample-efficient adaptive text-to-speech
CN113539283A (en) Audio processing method and device based on artificial intelligence, electronic equipment and storage medium
CN112037800A (en) Voiceprint verification model training method and apparatus, medium, and electronic device
CN114863229A (en) Image classification method and training method and device of image classification model
CN114338623B (en) Audio processing method, device, equipment and medium
CN114783459A (en) Voice separation method and device, electronic equipment and storage medium
CN112751820B (en) Digital voice packet loss concealment using deep learning
CN112992168B (en) Speech noise reducer training method, device, computer equipment and storage medium
CN112885377A (en) Voice quality evaluation method and device, computer equipment and storage medium
CN113421554B (en) Voice keyword detection model processing method and device and computer equipment
WO2022178975A1 (en) Noise field-based image noise reduction method and apparatus, device, and storage medium
CN115294995A (en) Voice conversion method, voice conversion device, electronic apparatus, and storage medium
CN113990347A (en) Signal processing method, computer equipment and storage medium
CN114495901A (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN110958417B (en) Method for removing compression noise of video call video based on voice clue
CN109273010B (en) Voice data processing method and device, computer equipment and storage medium
CN117219107B (en) Training method, device, equipment and storage medium of echo cancellation model
CN111899747B (en) Method and apparatus for synthesizing audio
CN111901673B (en) Video prediction method, device, storage medium and terminal
CN117316160B (en) Silent speech recognition method, silent speech recognition apparatus, electronic device, and computer-readable medium
WO2024055751A1 (en) Audio data processing method and apparatus, device, storage medium, and program product
CN111292766A (en) Method, apparatus, electronic device, and medium for generating speech samples
Ronzhin et al. Multimodal Information Coding System for Wearable Devices of Advanced Uniform
Le et al. A Gaussian Distribution Labeling Method for Speech Quality Assessment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant