CN113611293A - Mongolian data set expansion method - Google Patents

Mongolian data set expansion method

Info

Publication number
CN113611293A
CN113611293A (application CN202110955831.4A)
Authority
CN
China
Prior art keywords
mongolian
audio
region
data set
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110955831.4A
Other languages
Chinese (zh)
Other versions
CN113611293B (en)
Inventor
Li Jinyi (李晋益)
Ma Zhiqiang (马志强)
Zhang Junpeng (张俊鹏)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University of Technology
Original Assignee
Inner Mongolia University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University of Technology filed Critical Inner Mongolia University of Technology
Priority to CN202110955831.4A priority Critical patent/CN113611293B/en
Publication of CN113611293A publication Critical patent/CN113611293A/en
Application granted granted Critical
Publication of CN113611293B publication Critical patent/CN113611293B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses an expansion method for generating Mongolian audio, applied to the technical field of speech recognition. The method first obtains Mongolian text containing specified regional characteristics, the specified regional characteristics themselves, and real Mongolian audio bearing those regional characteristics; it then constructs a generative adversarial network model for the designated region; finally, it adversarially trains that model and inputs Mongolian audio with the characteristics of the real audio into the trained model for processing, generating a Mongolian expansion data set. The Mongolian data of the designated region are thereby expanded, alleviating the high economic cost, large time consumption, and uneven regional coverage of Mongolian corpus collection.

Description

Mongolian data set expansion method
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a Mongolian data set expansion method.
Background
Data expansion refers to enlarging the capacity of an original data set through different methods to obtain a new data set better suited to the current application environment. Training a speech recognition model requires a sufficiently large data set, and data augmentation is one of the feasible ways to obtain a sufficiently large annotated Mongolian data set in a short time. In recent years, open-source annotated Mongolian data sets have remained very small, and researchers often need to collect data with the support of universities and enterprises. However, acquiring a data set is economically costly and time-consuming. To obtain a sufficient amount of data in a short time, a data expansion method is particularly important.
Currently, speech data expansion methods fall into two categories according to their implementation techniques.
(1) Modifying the original audio or speech features with an algorithm for expansion, such as speech-rate perturbation, vocal tract length normalization, or audio masking. Methods of this type can generate audio immediately, but usually need manual tuning to obtain high-quality generated audio.
(2) Synthesizing audio with a generative technique for expansion, such as noisy-audio generation and room-simulation audio generation. Methods of this type generate new audio through a synthesis technique. Research has mainly focused on adding the environmental information needed for a specific task to existing audio, but synthesis techniques usually require more raw data.
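Category (1) above can be illustrated with a minimal speed-perturbation sketch. The linear-interpolation resampling approach and the rate values below are illustrative assumptions, not details taken from the patent:

```python
import numpy as np

def speed_perturb(audio: np.ndarray, rate: float) -> np.ndarray:
    """Change playback speed by linear-interpolation resampling.

    rate > 1 shortens (speeds up) the signal; rate < 1 lengthens it.
    """
    n_out = int(round(len(audio) / rate))
    # Positions in the original signal to sample at the new rate
    src_idx = np.linspace(0, len(audio) - 1, n_out)
    return np.interp(src_idx, np.arange(len(audio)), audio)

# Typical augmentation uses a few fixed rates, e.g. 0.9 and 1.1
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
fast = speed_perturb(tone, 1.1)
slow = speed_perturb(tone, 0.9)
```

Each perturbed copy is a new training sample; as the text notes, such algorithmic methods are immediate but the rates usually have to be chosen by hand.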
In summary, annotated Mongolian audio in the existing Mongolian data sets is scarce and its regional distribution is unbalanced. Training a speech recognition model on the current Mongolian data sets may cause overfitting to the regions with more data, and attention-based sequence-to-sequence models are especially prone to such overfitting.
Therefore, how to provide an expansion method for Mongolian data sets is a problem that needs to be solved urgently by those skilled in the art.
Disclosure of Invention
In view of this, the present invention provides a method for expanding a Mongolian data set, which obtains a Mongolian expansion data set by means of a region-specific generative adversarial network model, balances the regional distribution of the data set, and improves the recognition accuracy of the Mongolian speech recognition model.
In order to achieve the above purpose, the invention provides the following technical scheme:
a method of augmenting a Mongolian data set, comprising:
acquiring Mongolian text containing specified regional characteristics, the specified regional characteristics, and real Mongolian audio with the specified regional characteristics;
constructing a generative adversarial network model for the designated region;
and adversarially training the generative adversarial network model of the specified region, inputting Mongolian audio with the characteristics of the real audio of the specified region into the trained model, and processing it to generate a Mongolian expansion data set.
Preferably, the region-specific generative adversarial network model comprises a conditional speech generator and a multi-item fusion discriminator, the conditional speech generator being connected to the multi-item fusion discriminator and consisting of a synthesizer and a vocoder;
wherein:
the synthesizer is used for obtaining a constructed Mongolian Mel spectrogram from the Mongolian text and the specified regional characteristics;
the vocoder is connected to the synthesizer and generates Mongolian audio of the designated region from the Mongolian Mel spectrogram;
the multi-item fusion discriminator judges, from the Mongolian Mel spectrogram and the specified regional characteristics, whether the Mongolian audio of the designated region is real data, and generates the Mongolian expansion data set.
Preferably, the synthesizer comprises a causal convolution layer, an encoding layer, an attention layer, a decoding layer and a deconvolution layer connected in sequence;
wherein:
the causal convolution layer is used for reducing the information-quantity difference between the Mongolian text plus specified regional features and the Mongolian audio Mel spectrogram;
the encoding layer, attention layer and decoding layer are used for mapping the relation between the input features and the output Mel spectrogram in the time dimension;
the deconvolution layer is used for improving the definition of the Mongolian audio Mel spectrogram.
Preferably, the generator obtains the distribution of the Mongolian Mel spectrogram from the specified regional features and the Mongolian text, with the formula:
x ∼ p(x | z·t)
wherein z is the specified regional feature, t is the Mongolian text, x is the Mongolian Mel spectrogram, and p(x | z·t) is the distribution of the Mongolian Mel spectrogram;
modeling the distribution of the Mongolian Mel spectrogram to obtain the Mongolian Mel spectrogram features, with the calculation formula:
x̂ = Ŵ_deconv * g(W_att·c, W_enc(W_conv * (z·t)))
wherein * denotes a convolution operation, W_conv denotes the convolution kernel parameters, W_enc denotes the LSTM encoding parameters, c denotes the attention context, W_att denotes the attention weights, g denotes the LSTM decoding operation, Ŵ_deconv denotes the deconvolution parameters, and x̂ denotes the Mongolian Mel spectrogram features obtained by model calculation.
Preferably, the multi-item fusion discriminator comprises a region classifier and a definition classifier, wherein the region classifier is used for discriminating the Mongolian audio pronunciation region and the definition classifier is used for discriminating the Mongolian audio definition to obtain a discrimination result, the discrimination specifically comprising:
using the region classifier and the definition classifier to judge, respectively, the pronunciation region and the definition of the real Mongolian audio and of the Mongolian audio with the specified regional characteristics; a sample judged to be true is added to the real Mongolian data set X, and a sample judged to be false is discarded, so forming the Mongolian expansion data set.
Preferably, the discrimination of the Mongolian audio pronunciation region by the region classifier includes:
performing two-dimensional convolution calculation on the Mongolian Mel frequency spectrogram to obtain convolution characteristics;
performing pooling processing on the convolution characteristics;
classifying according to the convolution characteristics;
calculating a probability value for each region class, and taking the region with the maximum probability as the Mongolian audio pronunciation region judgment result, with the calculation formula:
ĉ = softmax(W_fc · pool(W_conv * x))
wherein x denotes the Mongolian Mel spectrogram, * denotes a convolution operation, W_conv denotes the convolution kernel parameters, pool denotes the pooling operation, W_fc denotes the fully-connected layer parameters, and ĉ denotes the region identified by the region classifier.
Preferably, the judgment of the Mongolian audio definition by the definition classifier includes:
performing two-dimensional convolution calculation on the Mongolian Mel frequency spectrogram to obtain convolution characteristics;
performing pooling processing on the convolution characteristics;
classifying according to the convolution characteristics;
calculating a score for the Mongolian audio definition, the score range being [-1, 1]; when the score is higher than a set score limit, the definition requirement is considered to be met, otherwise it is not, with the calculation formula:
ŝ = sigmoid(W_fc · pool(W_conv * x))
wherein x denotes the Mongolian Mel spectrogram, * denotes a convolution operation, W_conv denotes the convolution kernel parameters, pool denotes the pooling operation, W_fc denotes the fully-connected layer parameters, and ŝ denotes the definition judged by the definition classifier.
Preferably, the specific adversarial training process of the multi-item fusion discriminator includes:
the multi-item fusion discriminator is trained using the real Mongolian data set and its random parameters W_D;
the conditional speech generator is trained using the Mongolian data set and its random parameters W_G;
the random parameters W_G of the conditional speech generator are updated by back-propagation according to the loss function of the conditional speech generator; the random parameters W_D of the multi-item fusion discriminator are updated by back-propagation according to its loss function; this is cycled n times.
Compared with the prior art, in the Mongolian data set expansion method the generator of the generative adversarial network model is a conditional speech generator and the discriminator is a multi-item fusion discriminator. The conditional speech generator generates Mongolian audio and Mel spectrograms from the Mongolian text and the specified regional features. The multi-item fusion discriminator discriminates the regional characteristics and the definition according to the Mel spectrogram and the specified regional features. After the conditional speech generator and the multi-item fusion discriminator learn against each other, the Mongolian audio of the designated region synthesized by the final conditional generator is judged to be real data by the multi-item fusion discriminator. The expansion data set consists of all generated Mongolian audio judged to be true by the multi-item fusion discriminator. The Mongolian data of the designated region are thereby expanded, alleviating the high economic cost, large time consumption and uneven regional coverage of Mongolian corpus collection.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flow chart of a Mongolian data set expansion method provided by the present invention;
fig. 2 is a schematic structural diagram of a countermeasure generation network model provided in this embodiment;
fig. 3 is a schematic diagram of the generation countermeasure network for a specific area according to the present embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, an embodiment of the present invention discloses a method for expanding a Mongolian data set, including:
acquiring Mongolian text containing specified regional characteristics, the specified regional characteristics, and real Mongolian audio with the specified regional characteristics;
constructing a generative adversarial network model for the designated region;
and adversarially training the generative adversarial network model of the designated region, inputting Mongolian audio with the characteristics of the real audio of the designated region into the trained model for processing, and generating a Mongolian expansion data set.
Specifically, the regional features in the Mongolian text are consistent with the regional features in the Mongolian audio, and both represent the target regional features to be expanded.
Referring to fig. 2, in one embodiment, the region-specific generative adversarial network model mainly includes a conditional speech generator and a multi-item fusion discriminator, connected to each other, wherein the conditional speech generator consists of a synthesizer and a vocoder, and the multi-item fusion discriminator consists of a region classifier and a definition classifier.
Referring to fig. 3, a schematic diagram of the generative adversarial network for a specified region provided in this embodiment: the conditional speech generator constructs a mapping between Mongolian text and Mongolian audio under the condition of the specified region. Considering the information-quantity difference between the Mongolian audio and the Mongolian text, the synthesizer constructs a mapping from the Mongolian text and regional features to the Mongolian Mel spectrogram, and the vocoder constructs a mapping from the Mongolian Mel spectrogram to the Mongolian audio. The multi-item fusion discriminator constructs a mapping between the Mongolian Mel spectrogram and the Mongolian audio classification of the designated region; to accomplish both the task of discriminating the Mongolian audio pronunciation region and the task of discriminating the Mongolian audio definition, it is divided into a region classifier and a definition classifier, realizing adversarial learning between the conditional speech generator and the multi-item fusion discriminator. The conditional speech generator synthesizes Mongolian audio having the regional characteristics, and the multi-item fusion discriminator removes Mongolian audio lacking the specified regional characteristics or sufficient definition, thereby generating the Mongolian expansion data set.
In one embodiment, the conditional speech generator consists of a synthesizer and a vocoder, the synthesizer comprising a causal convolution layer, an LSTM encoding layer, an attention layer, an LSTM decoding layer and a deconvolution layer. The vocoder converts the Mel spectrogram into Mongolian audio using the Griffin-Lim algorithm.
Specifically, in order to maximally restore audio, the mel-frequency spectrogram needs to be converted into a time-frequency spectrum.
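That Mel-to-linear conversion can be sketched with a standard triangular Mel filterbank and its least-squares pseudo-inverse; the filter count, FFT size, and sample rate below are illustrative assumptions, not values from the patent:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=40, n_freq=257, sr=16000):
    """Triangular filters mapping n_freq linear bins to n_mels Mel bins."""
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_freq - 1) * edges / (sr / 2)).astype(int)
    fb = np.zeros((n_mels, n_freq))
    for m in range(1, n_mels + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, ctr):          # rising slope of the triangle
            fb[m - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):          # falling slope of the triangle
            fb[m - 1, k] = (hi - k) / max(hi - ctr, 1)
    return fb

def mel_to_linear(mel_spec, fb):
    """Approximate linear-frequency magnitudes via the pseudo-inverse."""
    return np.maximum(np.linalg.pinv(fb) @ mel_spec, 0.0)

fb = mel_filterbank()
# Stand-in magnitude frames (257 frequency bins x 10 frames)
mag = np.abs(np.random.default_rng(0).normal(size=(257, 10)))
mel = fb @ mag                 # forward: linear -> Mel
linear = mel_to_linear(mel, fb)  # inverse: Mel -> approximate linear
```

The recovered linear spectrum is only approximate, since the Mel projection discards information; the Griffin-Lim step described below then supplies the missing phase.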
In one embodiment, the synthesizer of the conditional speech generator models the distribution of the Mongolian Mel spectrogram under the condition of the specified regional features and the Mongolian text. The formula is:
x ∼ p(x | z·t)
wherein z is the specified regional feature, t is the Mongolian text, x is the Mongolian Mel spectrogram, and p(x | z·t) is the distribution of the Mongolian Mel spectrogram x.
In a specific embodiment, the obtained Mongolian text containing the specified regional characteristics and the specified regional characteristics themselves are spliced into an encoding used as the input of the synthesizer, the synthesizer comprising a causal convolution layer, an LSTM encoding layer, an attention layer, an LSTM decoding layer and a deconvolution layer.
In particular, the causal convolution layer can reduce the information-quantity difference between the Mongolian text plus regional features and the Mongolian audio Mel spectrogram.
In particular, the LSTM encoding layer, the attention layer, and the LSTM decoding layer map the relationship between the input features and the output mel-frequency spectrogram in the time dimension.
Specifically, the deconvolution layer can improve the definition of the mel-frequency spectrum.
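Of the layers just described, the causal convolution step has a simple concrete form; a minimal 1-D sketch follows, where the kernel values are arbitrary (the patent's layer would learn its kernel), and only the left-padding pattern is the point:

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1-D causal convolution: the output at step t uses only inputs <= t."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])  # pad on the left only
    # Flip the kernel so this is true convolution rather than correlation
    return np.array([padded[t:t + k] @ kernel[::-1] for t in range(len(x))])

impulse = np.array([1.0, 0.0, 0.0, 0.0])
response = causal_conv1d(impulse, np.array([1.0, 2.0, 3.0]))
```

An impulse input replays the kernel forward in time with no leakage from future samples, which is exactly the property that lets the synthesizer align text-side features with audio-side features frame by frame.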
Specifically, the calculation formula is:
x̂ = Ŵ_deconv * g(W_att·c, W_enc(W_conv * (z·t)))
wherein * denotes a convolution operation, W_conv denotes the convolution kernel parameters, W_enc denotes the LSTM encoding parameters, c denotes the attention context, W_att denotes the attention weights, g denotes the LSTM decoding operation, Ŵ_deconv denotes the deconvolution parameters, and x̂ denotes the Mongolian Mel spectrogram features obtained by model calculation.
Specifically, the initial parameters of each layer are randomly generated, and a gradient descent algorithm is further required to correct the model parameters in order to obtain a better modeling effect. The loss function L required by the gradient descent algorithm measures the distance between the model output x̂ and the real Mel spectrogram x:
L = ‖x − x̂‖₂²
in one embodiment, the vocoder uses the Griffin-Lim algorithm to convert the Mel-spectral plot into Mongolian audio.
Specifically, in order to maximally restore the audio, the Mel spectrogram needs to be converted into a time-frequency spectrum (spectrogram). The time-frequency spectrum retains the frequency distribution of each frame but lacks phase information, i.e., information about the waveform variation of the signal. Let P be the phase spectrum, S the time-frequency spectrum, X the speech waveform information, f the Fourier transform, and f⁻¹ the inverse Fourier transform. The specific algorithm steps are as follows:
(1) randomly initialize a phase spectrum P;
(2) perform the inverse Fourier transform f⁻¹ using the time-frequency spectrum S and the phase spectrum P to synthesize new speech waveform information X;
(3) perform the Fourier transform f on the synthesized audio to obtain a new time-frequency spectrum S_new and a new phase spectrum P_new;
(4) discard the new time-frequency spectrum S_new, and use the original time-frequency spectrum S with the new phase spectrum P_new to synthesize new speech waveform information X;
(5) repeat steps (3) to (4) for several rounds, and output the audio waveform information X obtained in the last round.
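Steps (1) through (5) can be sketched with NumPy; the window type, FFT size, hop length, and iteration count below are illustrative assumptions rather than values from the patent:

```python
import numpy as np

N_FFT, HOP = 256, 64

def stft(x):
    win = np.hanning(N_FFT)
    frames = [x[i:i + N_FFT] * win for i in range(0, len(x) - N_FFT + 1, HOP)]
    return np.array([np.fft.rfft(f) for f in frames]).T        # (freq, time)

def istft(S):
    win = np.hanning(N_FFT)
    n = (S.shape[1] - 1) * HOP + N_FFT
    x, norm = np.zeros(n), np.zeros(n)
    for t in range(S.shape[1]):                                # overlap-add
        x[t * HOP:t * HOP + N_FFT] += np.fft.irfft(S[:, t], n=N_FFT) * win
        norm[t * HOP:t * HOP + N_FFT] += win ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=32):
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))         # step (1)
    for _ in range(n_iter):
        x = istft(mag * phase)                                 # steps (2)/(4)
        phase = np.exp(1j * np.angle(stft(x)))                 # step (3): keep P_new only
    return istft(mag * phase)                                  # step (5)

sine = np.sin(2 * np.pi * 440 * np.arange(4096) / 16000)
target = np.abs(stft(sine))
rebuilt = griffin_lim(target)
```

With enough iterations the magnitude spectrum of the rebuilt signal approaches the target spectrum, which is the fixed point that the discard-S_new / keep-P_new cycle of steps (3) and (4) seeks.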
in one embodiment, the multi-item fusion discriminator is divided into a region classifier and a definition classifier in order to accomplish the task of discriminating Mongolian audio pronunciation regions and the capability of discriminating Mongolian audio definition.
Specifically, the region classifier first performs a two-dimensional convolution calculation on the Mel spectrogram to obtain convolution features; after each convolution operation a rectified linear unit (ReLU) transformation is applied to introduce non-linearity into the model. A pooling operation then down-samples the convolution features, reducing the dimensionality of the feature map while still retaining the critical feature information. A fully-connected layer then classifies according to the features extracted by the convolution. Finally, the softmax activation function computes a probability value for each region class, and the region with the maximum probability is taken as the judgment result.
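A toy forward pass through that convolution, ReLU, pooling, fully-connected, softmax pipeline is sketched below; all weights are random and the input size, kernel size, and region count of four are illustrative assumptions, not values from the patent:

```python
import numpy as np

def conv2d(x, k):
    """Valid 2-D convolution of a single-channel input with one kernel."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def max_pool(x, s=2):
    H, W = (x.shape[0] // s) * s, (x.shape[1] // s) * s
    return x[:H, :W].reshape(H // s, s, W // s, s).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def region_classifier(mel, kernel, w_fc):
    feat = np.maximum(conv2d(mel, kernel), 0.0)  # convolution + ReLU
    pooled = max_pool(feat).ravel()              # down-sample the feature map
    probs = softmax(w_fc @ pooled)               # one probability per region
    return probs, int(np.argmax(probs))

rng = np.random.default_rng(1)
mel = rng.normal(size=(16, 16))                  # stand-in Mel spectrogram
probs, region = region_classifier(mel, rng.normal(size=(3, 3)),
                                  rng.normal(size=(4, 49)))
```

The argmax over the softmax output corresponds to "taking the region with the maximum probability as the judgment result" in the text.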
The specific calculation formula is:
ĉ = softmax(W_fc · pool(W_conv * x))
wherein x denotes the Mongolian Mel spectrogram, * denotes a convolution operation, W_conv denotes the convolution kernel parameters, pool denotes the pooling operation, W_fc denotes the fully-connected layer parameters, and ĉ denotes the region determined by the region classifier.
In one embodiment, the definition classifier is designed similarly to the region classifier, but the final activation function is changed to sigmoid to calculate a score for the definition of the Mongolian audio, the score range being [-1, 1]. When the score is higher than the set score limit, the definition requirement is considered met; otherwise it is not.
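The definition head differs from the region head only in its final activation and thresholding; a minimal sketch follows. Note this uses a sigmoid as the text states, whose output lies in (0, 1), so the 0.5 threshold is an assumed value (the [-1, 1] range quoted above would instead correspond to a tanh activation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def clarity_decision(pooled_feat, w_fc, threshold=0.5):
    """Score pooled convolution features and gate on a clarity threshold."""
    score = float(sigmoid(w_fc @ pooled_feat))
    return score, score > threshold

rng = np.random.default_rng(2)
score, passed = clarity_decision(rng.normal(size=8), rng.normal(size=8))
```

Only samples whose score clears the threshold survive into the expansion data set, matching the pass/fail behaviour described for the discriminator.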
The specific calculation formula is:
ŝ = sigmoid(W_fc · pool(W_conv * x))
wherein x denotes the Mongolian Mel spectrogram, * denotes a convolution operation, W_conv denotes the convolution kernel parameters, pool denotes the pooling operation, W_fc denotes the fully-connected layer parameters, and ŝ denotes the definition judged by the definition classifier.
Specifically, the multi-item fusion discriminator computes the region classifier first; the definition classifier is computed only if the region classification is correct, otherwise a fail is returned directly. If the definition classifier result ŝ is higher than the set requirement y, a pass is returned; otherwise a fail is still returned. The loss function of the multi-item fusion discriminator is then formulated as:
loss_D = −E_{x∼X}[log D(x)] − E_z[log(1 − D(G(z)))]
in one embodiment, the final goal of the countermeasure training for the specific regional countermeasure generation network model is:
Figure BDA0003220218420000098
wherein D is a multinomial fusion discriminator, G is a conditional speech generator, X is a real Mongolian audio, and X isCIndicating region information as conditions in a conditional speech generator, Z representing speech characteristics of a specified region, WDRepresenting a random initialization parameter, W, during training of a multi-item fusion discriminatorGRepresenting the random initialization parameters when training the conditional speech generator.
Specifically, the multi-item fusion discriminator D is trained using the real Mongolian data set X and the random parameters W_D.
The conditional speech generator G is trained using the Mongolian data set X and the random parameters W_G, obtaining a Mongolian expansion data set Y through the specified regional features z, which is marked as false. Back-propagation according to the conditional speech generator loss loss_G updates the parameters W_G, where the loss is formulated as:
loss_G = E_z[log(1 − D(G(z)))]
The multi-item fusion discriminator D then discriminates the Mongolian expansion data set Y; items judged to be true are added to the data set X, and the rest are discarded. Back-propagation according to the multi-item fusion discriminator loss loss_D updates the parameters W_D, and this is cycled n times. The generated data initially marked as false are thereby expanded into the Mongolian expansion data set Y_Z.
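The generate-discriminate-filter cycle above can be sketched structurally; `generator` and `discriminator` here are hypothetical stand-ins for G and D, and the parameter updates driven by loss_G and loss_D are elided to a comment:

```python
import numpy as np

rng = np.random.default_rng(3)

def generator(z, w_g):
    """Stand-in for G: map region-feature vectors z to fake 'Mel' vectors."""
    return np.tanh(z @ w_g)

def discriminator(x, w_d):
    """Stand-in for D: realness score in (0, 1) for each sample."""
    return 1.0 / (1.0 + np.exp(-(x @ w_d)))

def expand_dataset(z, w_g, w_d, threshold=0.5, n_rounds=5):
    expanded = []
    for _ in range(n_rounds):
        fake = generator(z, w_g)             # G synthesises candidate samples
        scores = discriminator(fake, w_d)    # D judges each candidate
        expanded.extend(x for x, s in zip(fake, scores) if s > threshold)
        # back-propagation updating w_g (via loss_G) and w_d (via loss_D)
        # would go here in a real adversarial training loop
    return expanded

z = rng.normal(size=(10, 6))                 # 10 region-feature vectors
Y = expand_dataset(z, rng.normal(size=(6, 16)), rng.normal(size=16))
```

The key structural point is that the expansion data set consists only of generated items the discriminator accepts, exactly as the patent describes for Y_Z.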
In one embodiment, because the existing data set contains little Mongolian audio for a given target region, the regional characteristics of that region are extracted from its existing Mongolian audio, reconstructed, and combined with Mongolian text to obtain text bearing the regional characteristics; this is fed to the synthesizer in the conditional speech generator to generate a Mongolian Mel spectrogram with the regional characteristics, which is then converted into speech by the vocoder. The multi-item fusion discriminator in the generative adversarial network uses the real Mongolian audio of the region to judge whether the generated Mongolian audio is clear and whether it carries the regional characteristics, and the conditional speech generator is continually adjusted by computing the adversarial loss to generate Mongolian audio with the regional characteristics, finally realizing the data set expansion.
Compared with the prior art, the data expansion method provided by the invention can balance the regional distribution of the data set, thereby improving the recognition accuracy of the Mongolian speech recognition model. It solves the problems that existing Mongolian data sets lack annotated Mongolian audio and that their regional distribution is unbalanced.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A Mongolian data set expansion method is characterized by comprising the following steps:
acquiring Mongolian text containing specified regional characteristics, the specified regional characteristics, and real Mongolian audio with the specified regional characteristics;
constructing a generative adversarial network model for the designated region;
and adversarially training the generative adversarial network model of the specified region, inputting Mongolian audio with the characteristics of the real audio of the specified region into the trained model, and processing it to generate a Mongolian expansion data set.
2. The Mongolian data set expansion method of claim 1, wherein the region-specific generative adversarial network model comprises a conditional speech generator and a multi-item fusion discriminator, the conditional speech generator being connected to the multi-item fusion discriminator and consisting of a synthesizer and a vocoder;
wherein:
the synthesizer is used for obtaining a constructed Mongolian Mel spectrogram from the Mongolian text and the specified regional characteristics;
the vocoder is connected to the synthesizer and generates Mongolian audio of the designated region from the Mongolian Mel spectrogram;
the multi-item fusion discriminator judges, from the Mongolian Mel spectrogram and the specified regional characteristics, whether the Mongolian audio of the designated region is real data, and generates the Mongolian expansion data set.
3. The Mongolian data set expansion method of claim 2, wherein the synthesizer comprises a causal convolution layer, an encoding layer, an attention layer, a decoding layer and a deconvolution layer connected in sequence;
wherein:
the causal convolution layer reduces the difference in information content among the Mongolian text, the specified regional features and the Mongolian audio Mel spectrogram;
the encoding layer, the attention layer and the decoding layer map the relation between the input features and the output Mel spectrogram in the time dimension;
the deconvolution layer improves the clarity of the Mongolian audio Mel spectrogram.
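Claim 3's causal convolution layer can be illustrated with a minimal NumPy sketch. The kernel width and the toy input sequence below are illustrative assumptions, not values from the patent; the defining property being demonstrated is that the output at time t depends only on inputs up to t:

```python
import numpy as np

def causal_conv1d(x, w):
    """1-D causal convolution: left-pad so output[t] never sees x[t+1:]."""
    k = len(w)
    xp = np.concatenate([np.zeros(k - 1), x])   # pad only on the past side
    return np.array([xp[t:t + k] @ w[::-1] for t in range(len(x))])

x = np.arange(5.0)            # toy feature sequence (e.g. embedded text/region)
w = np.array([0.5, 0.5])      # illustrative kernel of width 2
y = causal_conv1d(x, w)       # [0.0, 0.5, 1.5, 2.5, 3.5]
```

Because the padding is entirely on the past side, an impulse at position 0 can only influence outputs at positions 0 and later, never earlier ones.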
4. The method of claim 2, wherein the synthesizer obtains the distribution of the Mongolian Mel spectrogram from the specified regional features and the Mongolian text according to the following formula:
x ~ p(x | z·t)
wherein z denotes the specified regional features, t denotes the Mongolian text, x denotes the Mongolian Mel spectrogram, and p(x | z·t) denotes the distribution of the Mongolian Mel spectrogram;
modeling the distribution of the Mongolian Mel spectrogram to obtain the Mongolian Mel spectrogram features, calculated according to the following formula:
x̂ = W_deconv * g(c, W_att · W_enc(W_conv * (z·t)))
wherein * denotes a convolution operation, W_conv denotes the convolution kernel parameters, W_enc denotes the LSTM encoding parameters, c denotes the attention context, W_att denotes the attention weights, g denotes the LSTM decoding operation, W_deconv denotes the deconvolution parameters, and x̂ denotes the Mongolian Mel spectrogram features calculated by the model.
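The attention context c used by the decoding operation g in claim 4 can be sketched as simple dot-product attention over the encoder states. The shapes, the scoring function, and the toy values below are assumptions for illustration only, not details from the patent:

```python
import numpy as np

def attention_context(dec_state, enc_states, W_att):
    """Score each encoder state against the decoder state, softmax the
    scores into attention weights, and return the weighted sum as c."""
    scores = enc_states @ (W_att @ dec_state)   # one score per time step
    a = np.exp(scores - scores.max())
    a /= a.sum()                                # attention weights sum to 1
    return a @ enc_states                       # context vector c

enc = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 encoder states, dim 2
dec = np.array([1.0, 0.0])                            # current decoder state
c = attention_context(dec, enc, np.eye(2))            # context passed to g
```

Since the weights form a convex combination, c always lies inside the span of the encoder states, and states aligned with the decoder state receive larger weight.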
5. The Mongolian data set expansion method of claim 2, wherein the multi-item fusion discriminator consists of a region classifier and an intelligibility classifier, the region classifier discriminating the pronunciation region of the Mongolian audio and the intelligibility classifier discriminating the intelligibility of the Mongolian audio to obtain a discrimination result, specifically comprising:
using the region classifier and the intelligibility classifier to judge, respectively, the pronunciation region and the intelligibility of the real Mongolian audio and of the Mongolian audio having the specified regional features; if an audio sample is judged to be real it is added to the real Mongolian data set X, and if it is judged to be fake it is discarded, thereby forming the Mongolian expansion data set.
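The acceptance rule in claim 5 — keep a sample only when both classifiers judge it real — can be sketched as follows. The sample fields and the two predicate functions are hypothetical names chosen for illustration, not identifiers from the patent:

```python
def build_expansion_set(samples, region_ok, clear_ok):
    """Keep samples that pass BOTH the region check and the
    intelligibility check; everything else is discarded."""
    kept = []
    for s in samples:
        if region_ok(s) and clear_ok(s):
            kept.append(s)          # joins the real Mongolian data set X
    return kept

samples = [
    {"region_prob": 0.9, "clarity": 0.7},
    {"region_prob": 0.9, "clarity": -0.4},   # fails the intelligibility check
    {"region_prob": 0.2, "clarity": 0.8},    # fails the region check
]
kept = build_expansion_set(
    samples,
    region_ok=lambda s: s["region_prob"] > 0.5,
    clear_ok=lambda s: s["clarity"] > 0.0,
)
# kept contains only the first sample
```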
6. The method of claim 5, wherein the region classifier identifying the Mongolian audio pronunciation region comprises:
performing a two-dimensional convolution calculation on the Mongolian Mel spectrogram to obtain convolution features;
performing pooling on the convolution features;
classifying according to the pooled convolution features;
calculating a probability value for each region and taking the region with the maximum probability as the pronunciation region judgment result, according to the following formula:
r̂ = softmax(W_fc · pool(W_conv * x))
wherein x denotes the Mongolian Mel spectrogram, * denotes a convolution operation, W_conv denotes the convolution kernel parameters, pool denotes the pooling operation, W_fc denotes the fully connected layer parameters, and r̂ denotes the region identified by the region classifier.
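A minimal NumPy rendering of the claim 6 pipeline (convolution → pooling → fully connected layer → softmax → argmax). The kernel values, spectrogram size and region labels are illustrative assumptions, not data from the patent:

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 2-D valid cross-correlation, standing in for the conv layer."""
    kh, kw = k.shape
    return np.array([[(x[i:i + kh, j:j + kw] * k).sum()
                      for j in range(x.shape[1] - kw + 1)]
                     for i in range(x.shape[0] - kh + 1)])

def classify_region(mel, kernels, W_fc, regions):
    # global average pooling: one scalar per convolution kernel
    pooled = np.array([conv2d_valid(mel, k).mean() for k in kernels])
    logits = W_fc @ pooled
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax over regions
    return regions[int(np.argmax(probs))], probs

mel = np.arange(16.0).reshape(4, 4)            # toy Mel spectrogram
kernels = [np.ones((2, 2)), np.eye(2)]         # two illustrative kernels
W_fc = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # 3 candidate regions
region, probs = classify_region(mel, kernels, W_fc,
                                ["east", "central", "west"])
```

The softmax turns the fully connected outputs into a probability per region, and the argmax selects the judgment result, as the claim describes.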
7. The method of claim 5, wherein the intelligibility classifier discriminating the Mongolian audio intelligibility comprises:
performing a two-dimensional convolution calculation on the Mongolian Mel spectrogram to obtain convolution features;
performing pooling on the convolution features;
classifying according to the pooled convolution features;
calculating a Mongolian audio intelligibility score in the range [-1, 1]; when the score is higher than a set threshold, the intelligibility requirement is considered met, otherwise it is not, according to the following formula:
ŝ = tanh(W_fc · pool(W_conv * x))
wherein x denotes the Mongolian Mel spectrogram, * denotes a convolution operation, W_conv denotes the convolution kernel parameters, pool denotes the pooling operation, W_fc denotes the fully connected layer parameters, and ŝ denotes the intelligibility score produced by the intelligibility classifier.
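Claim 7 states that the score lies in [-1, 1], which is consistent with a tanh output head. The following sketch assumes tanh and an arbitrary threshold, both of which are inferences for illustration rather than details given in the patent:

```python
import numpy as np

def intelligibility(pooled_feat, w_fc, threshold=0.0):
    """Fully connected layer followed by tanh, so the score is in [-1, 1];
    the sample passes when the score exceeds the threshold."""
    score = float(np.tanh(w_fc @ pooled_feat))
    return score, score > threshold

score, ok = intelligibility(np.array([0.4, 0.2]), np.array([1.0, 1.0]))
# tanh(0.6) ≈ 0.537, above the default threshold, so ok is True
```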
8. The Mongolian data set expansion method of claim 5, wherein the adversarial training involving the multi-item fusion discriminator specifically comprises:
the multi-item fusion discriminator is trained using the real Mongolian data set and its random parameters W_D;
the conditional speech generator is trained using the Mongolian data set and its random parameters W_G;
according to the loss function of the conditional speech generator, back-propagation updates the random parameters W_G of the conditional speech generator; according to the loss function of the multi-item fusion discriminator, back-propagation updates the random parameters W_D of the multi-item fusion discriminator; and this cycle is repeated n times.
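The alternating update in claim 8 can be sketched as a training skeleton. ToyG, ToyD and their scalar parameters (standing in for W_G and W_D) are purely illustrative mocks, not the patent's networks or losses:

```python
class ToyG:
    """Stand-in generator; self.w plays the role of W_G."""
    def __init__(self): self.w = 0.0
    def generate(self, text, region): return self.w
    def update(self, d_score):            # "back-propagation" of the G loss
        self.w += 0.1 * (1.0 - d_score)

class ToyD:
    """Stand-in discriminator; self.w plays the role of W_D."""
    def __init__(self): self.w = 0.5
    def score(self, x): return max(0.0, min(1.0, x * self.w))
    def update(self, real, fake): pass    # real training would refit W_D here

def adversarial_train(G, D, batches, n):
    """Alternate discriminator and generator updates, repeated n times."""
    for _ in range(n):
        for b in batches:
            fake = G.generate(b["text"], b["region"])
            D.update(b["audio"], fake)     # discriminator step (updates W_D)
            G.update(D.score(fake))        # generator step (updates W_G)
    return G, D

G, D = adversarial_train(ToyG(), ToyD(),
                         [{"text": "t", "region": "r", "audio": 1.0}], n=3)
# G.w has moved away from its initial 0 after three rounds
```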
CN202110955831.4A 2021-08-19 2021-08-19 Mongolian data set expansion method Active CN113611293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110955831.4A CN113611293B (en) 2021-08-19 2021-08-19 Mongolian data set expansion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110955831.4A CN113611293B (en) 2021-08-19 2021-08-19 Mongolian data set expansion method

Publications (2)

Publication Number Publication Date
CN113611293A true CN113611293A (en) 2021-11-05
CN113611293B CN113611293B (en) 2022-10-11

Family

ID=78341361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110955831.4A Active CN113611293B (en) 2021-08-19 2021-08-19 Mongolian data set expansion method

Country Status (1)

Country Link
CN (1) CN113611293B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115171651A (en) * 2022-09-05 2022-10-11 中邮消费金融有限公司 Method and device for synthesizing infant voice, electronic equipment and storage medium
CN116564276A (en) * 2023-04-23 2023-08-08 内蒙古工业大学 Mongolian speech recognition method for generating countermeasure network based on double discriminators

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598221A (en) * 2019-08-29 2019-12-20 内蒙古工业大学 Method for improving translation quality of Mongolian Chinese by constructing Mongolian Chinese parallel corpus by using generated confrontation network
CN111986646A (en) * 2020-08-17 2020-11-24 云知声智能科技股份有限公司 Dialect synthesis method and system based on small corpus
CN112133326A (en) * 2020-09-08 2020-12-25 东南大学 Gunshot data amplification and detection method based on antagonistic neural network
CN112652309A (en) * 2020-12-21 2021-04-13 科大讯飞股份有限公司 Dialect voice conversion method, device, equipment and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WANG Haiwen, QIU Xiaohui: "An Image Data Augmentation Method Based on Generative Adversarial Networks", Computer Technology and Development *
GUO Jiaxing: Master's Thesis, 15 February 2021, Harbin Institute of Technology *


Also Published As

Publication number Publication date
CN113611293B (en) 2022-10-11

Similar Documents

Publication Publication Date Title
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
CN104424943B (en) Speech processing system and method
US20180061439A1 (en) Automatic audio captioning
CN110853680B (en) double-BiLSTM speech emotion recognition method with multi-input multi-fusion strategy
EP0847041B1 (en) Method and apparatus for speech recognition performing noise adaptation
EP0755046B1 (en) Speech recogniser using a hierarchically structured dictionary
US5903863A (en) Method of partitioning a sequence of data frames
CN103578462A (en) Speech processing system
CN113611293B (en) Mongolian data set expansion method
CN110751044A (en) Urban noise identification method based on deep network migration characteristics and augmented self-coding
CN112818861A (en) Emotion classification method and system based on multi-mode context semantic features
CN102810311B (en) Speaker estimation method and speaker estimation equipment
CN102663432A (en) Kernel fuzzy c-means speech emotion identification method combined with secondary identification of support vector machine
CN115662435B (en) Virtual teacher simulation voice generation method and terminal
WO1996008005A1 (en) System for recognizing spoken sounds from continuous speech and method of using same
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
CN116110405A (en) Land-air conversation speaker identification method and equipment based on semi-supervised learning
US6131089A (en) Pattern classifier with training system and methods of operation therefor
CN114863938A (en) Bird language identification method and system based on attention residual error and feature fusion
Yasmin et al. A rough set theory and deep learning-based predictive system for gender recognition using audio speech
CN111241820A (en) Bad phrase recognition method, device, electronic device, and storage medium
CN102063897B (en) Sound library compression for embedded type voice synthesis system and use method thereof
AU2362495A (en) Speech-recognition system utilizing neural networks and method of using same
CN108847251A (en) A kind of voice De-weight method, device, server and storage medium
CN111968669A (en) Multi-element mixed sound signal separation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant