CN113611293A - Mongolian data set expansion method - Google Patents

Mongolian data set expansion method

Info

Publication number
CN113611293A
CN113611293A (application CN202110955831.4A)
Authority
CN
China
Prior art keywords
mongolian
audio
region
data set
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110955831.4A
Other languages
Chinese (zh)
Other versions
CN113611293B (en)
Inventor
Li Jinyi (李晋益)
Ma Zhiqiang (马志强)
Zhang Junpeng (张俊鹏)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University of Technology
Original Assignee
Inner Mongolia University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University of Technology filed Critical Inner Mongolia University of Technology
Priority to CN202110955831.4A priority Critical patent/CN113611293B/en
Publication of CN113611293A publication Critical patent/CN113611293A/en
Application granted granted Critical
Publication of CN113611293B publication Critical patent/CN113611293B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses an expansion method for generating Mongolian audio, applied to the technical field of speech recognition. The method first obtains Mongolian text containing specified regional characteristics, the specified regional characteristics themselves, and real Mongolian audio bearing those regional characteristics; it then constructs a generative adversarial network model for the designated region; finally, it adversarially trains that model and inputs Mongolian audio with the characteristics of the real audio into the trained model for processing, generating a Mongolian expansion data set. The Mongolian data of the designated region are thereby expanded, alleviating the high economic cost, large time consumption, and uneven regional coverage of Mongolian corpus collection.

Description

Mongolian data set expansion method
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a Mongolian data set expansion method.
Background
Data expansion refers to enlarging the capacity of an original data set through different methods to obtain a new data set better suited to the current application environment. Training a speech recognition model requires a sufficiently large data set, and data augmentation is one of the feasible ways to obtain a sufficiently large annotated Mongolian data set in a short time. In recent years, open-source annotated Mongolian data sets have remained very small, and researchers often need to collect data with the support of universities and enterprises. However, acquiring a data set is economically costly and time-consuming. To obtain a sufficient amount of data in a short time, a data expansion method is particularly important.
Currently, speech data expansion methods fall into two categories according to their implementation techniques.
(1) Modifying the original audio or speech features with an algorithm for expansion, such as speech-rate perturbation, vocal tract length normalization, or audio masking. Methods of this type can generate audio immediately, but usually need manual tuning to obtain high-quality generated audio.
(2) Synthesizing audio with a generative technique for expansion, such as noisy-audio generation and room-simulation audio generation. Methods of this type generate new audio through a synthesis technique. Research has mainly focused on adding the environmental information needed for a specific task to existing audio, but synthesis techniques usually require more raw data.
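Category (1) above can be illustrated with a minimal speed-perturbation sketch. The linear-interpolation resampling approach and the rate values below are illustrative assumptions, not details taken from the patent:

```python
import numpy as np

def speed_perturb(audio: np.ndarray, rate: float) -> np.ndarray:
    """Change playback speed by linear-interpolation resampling.

    rate > 1 shortens (speeds up) the signal; rate < 1 lengthens it.
    """
    n_out = int(round(len(audio) / rate))
    # Positions in the original signal to sample at the new rate
    src_idx = np.linspace(0, len(audio) - 1, n_out)
    return np.interp(src_idx, np.arange(len(audio)), audio)

# Typical augmentation uses a few fixed rates, e.g. 0.9 and 1.1
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
fast = speed_perturb(tone, 1.1)
slow = speed_perturb(tone, 0.9)
```

Each perturbed copy is a new training sample; as the text notes, such algorithmic methods are immediate but the rates usually have to be chosen by hand.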
In summary, annotated Mongolian audio in the existing Mongolian data sets is scarce and its regional distribution is unbalanced. Training a speech recognition model on the current Mongolian data sets may cause overfitting to the regions with more data, and attention-based sequence-to-sequence models are especially prone to such overfitting.
Therefore, how to provide an expansion method for Mongolian data sets is a problem that needs to be solved urgently by those skilled in the art.
Disclosure of Invention
In view of this, the present invention provides a method for expanding a Mongolian data set, which obtains a Mongolian expansion data set by means of a region-specific generative adversarial network model, balances the regional distribution of the data set, and improves the recognition accuracy of the Mongolian speech recognition model.
In order to achieve the above purpose, the invention provides the following technical scheme:
a method of augmenting a Mongolian data set, comprising:
acquiring Mongolian text containing specified regional characteristics, the specified regional characteristics, and real Mongolian audio with the specified regional characteristics;
constructing a generative adversarial network model for the designated region;
and adversarially training the generative adversarial network model of the specified region, inputting Mongolian audio with the characteristics of the real audio of the specified region into the trained model, and processing it to generate a Mongolian expansion data set.
Preferably, the region-specific generative adversarial network model comprises a conditional speech generator and a multi-item fusion discriminator, the conditional speech generator being connected to the multi-item fusion discriminator and consisting of a synthesizer and a vocoder;
wherein:
the synthesizer is used for obtaining a constructed Mongolian Mel spectrogram from the Mongolian text and the specified regional characteristics;
the vocoder is connected to the synthesizer and generates Mongolian audio of the designated region from the Mongolian Mel spectrogram;
the multi-item fusion discriminator judges, from the Mongolian Mel spectrogram and the specified regional characteristics, whether the Mongolian audio of the designated region is real data, and generates the Mongolian expansion data set.
Preferably, the synthesizer comprises a causal convolution layer, an encoding layer, an attention layer, a decoding layer and a deconvolution layer connected in sequence;
wherein:
the causal convolution layer is used for reducing the information-quantity difference between the Mongolian text plus specified regional features and the Mongolian audio Mel spectrogram;
the encoding layer, attention layer and decoding layer are used for mapping the relation between the input features and the output Mel spectrogram in the time dimension;
the deconvolution layer is used for improving the definition of the Mongolian audio Mel spectrogram.
Preferably, the generator obtains the distribution of the Mongolian Mel spectrogram from the specified regional features and the Mongolian text, with the formula:
x ∼ p(x | z·t)
wherein z is the specified regional feature, t is the Mongolian text, x is the Mongolian Mel spectrogram, and p(x | z·t) is the distribution of the Mongolian Mel spectrogram;
modeling the distribution of the Mongolian Mel spectrogram to obtain the Mongolian Mel spectrogram features, with the calculation formula:
x̂ = Ŵ_deconv * g(W_att·c, W_enc(W_conv * (z·t)))
wherein * denotes a convolution operation, W_conv denotes the convolution kernel parameters, W_enc denotes the LSTM encoding parameters, c denotes the attention context, W_att denotes the attention weights, g denotes the LSTM decoding operation, Ŵ_deconv denotes the deconvolution parameters, and x̂ denotes the Mongolian Mel spectrogram features obtained by model calculation.
Preferably, the multi-item fusion discriminator comprises a region classifier and a definition classifier, wherein the region classifier is used for discriminating the Mongolian audio pronunciation region and the definition classifier is used for discriminating the Mongolian audio definition to obtain a discrimination result, the discrimination specifically comprising:
using the region classifier and the definition classifier to judge, respectively, the pronunciation region and the definition of the real Mongolian audio and of the Mongolian audio with the specified regional characteristics; a sample judged to be true is added to the real Mongolian data set X, and a sample judged to be false is discarded, so forming the Mongolian expansion data set.
Preferably, the discrimination of the Mongolian audio pronunciation region by the region classifier includes:
performing two-dimensional convolution calculation on the Mongolian Mel frequency spectrogram to obtain convolution characteristics;
performing pooling processing on the convolution characteristics;
classifying according to the convolution characteristics;
calculating a probability value for each region class, and taking the region with the maximum probability as the Mongolian audio pronunciation region judgment result, with the calculation formula:
ĉ = softmax(W_fc · pool(W_conv * x))
wherein x denotes the Mongolian Mel spectrogram, * denotes a convolution operation, W_conv denotes the convolution kernel parameters, pool denotes the pooling operation, W_fc denotes the fully-connected layer parameters, and ĉ denotes the region identified by the region classifier.
Preferably, the judgment of the Mongolian audio definition by the definition classifier includes:
performing two-dimensional convolution calculation on the Mongolian Mel frequency spectrogram to obtain convolution characteristics;
performing pooling processing on the convolution characteristics;
classifying according to the convolution characteristics;
calculating a score for the Mongolian audio definition, the score range being [-1, 1]; when the score is higher than a set score limit, the definition requirement is considered to be met, otherwise it is not, with the calculation formula:
ŝ = sigmoid(W_fc · pool(W_conv * x))
wherein x denotes the Mongolian Mel spectrogram, * denotes a convolution operation, W_conv denotes the convolution kernel parameters, pool denotes the pooling operation, W_fc denotes the fully-connected layer parameters, and ŝ denotes the definition judged by the definition classifier.
Preferably, the specific adversarial training process of the multi-item fusion discriminator includes:
the multi-item fusion discriminator is trained using the real Mongolian data set and its random parameters W_D;
the conditional speech generator is trained using the Mongolian data set and its random parameters W_G;
the random parameters W_G of the conditional speech generator are updated by back-propagation according to the loss function of the conditional speech generator; the random parameters W_D of the multi-item fusion discriminator are updated by back-propagation according to its loss function; this is cycled n times.
Compared with the prior art, in the Mongolian data set expansion method the generator of the generative adversarial network model is a conditional speech generator and the discriminator is a multi-item fusion discriminator. The conditional speech generator generates Mongolian audio and Mel spectrograms from the Mongolian text and the specified regional features. The multi-item fusion discriminator discriminates the regional characteristics and the definition according to the Mel spectrogram and the specified regional features. After the conditional speech generator and the multi-item fusion discriminator learn against each other, the Mongolian audio of the designated region synthesized by the final conditional generator is judged to be real data by the multi-item fusion discriminator. The expansion data set consists of all generated Mongolian audio judged to be true by the multi-item fusion discriminator. The Mongolian data of the designated region are thereby expanded, alleviating the high economic cost, large time consumption and uneven regional coverage of Mongolian corpus collection.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flow chart of a Mongolian data set expansion method provided by the present invention;
fig. 2 is a schematic structural diagram of a countermeasure generation network model provided in this embodiment;
fig. 3 is a schematic diagram of the generation countermeasure network for a specific area according to the present embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, an embodiment of the present invention discloses a method for expanding a Mongolian data set, including:
acquiring Mongolian text containing specified regional characteristics, the specified regional characteristics, and real Mongolian audio with the specified regional characteristics;
constructing a generative adversarial network model for the designated region;
and adversarially training the generative adversarial network model of the designated region, inputting Mongolian audio with the characteristics of the real audio of the designated region into the trained model for processing, and generating a Mongolian expansion data set.
Specifically, the regional features in the Mongolian text are consistent with the regional features in the Mongolian audio, and both represent the target regional features to be expanded.
Referring to fig. 2, in one embodiment, the region-specific generative adversarial network model mainly includes a conditional speech generator and a multi-item fusion discriminator, connected to each other, wherein the conditional speech generator consists of a synthesizer and a vocoder, and the multi-item fusion discriminator consists of a region classifier and a definition classifier.
Referring to fig. 3, a schematic diagram of the generative adversarial network for a specified region provided in this embodiment: the conditional speech generator constructs a mapping between Mongolian text and Mongolian audio under the condition of the specified region. Considering the information-quantity difference between the Mongolian audio and the Mongolian text, the synthesizer constructs a mapping from the Mongolian text and regional features to the Mongolian Mel spectrogram, and the vocoder constructs a mapping from the Mongolian Mel spectrogram to the Mongolian audio. The multi-item fusion discriminator constructs a mapping between the Mongolian Mel spectrogram and the Mongolian audio classification of the designated region; to accomplish both the task of discriminating the Mongolian audio pronunciation region and the task of discriminating the Mongolian audio definition, it is divided into a region classifier and a definition classifier, realizing adversarial learning between the conditional speech generator and the multi-item fusion discriminator. The conditional speech generator synthesizes Mongolian audio having the regional characteristics, and the multi-item fusion discriminator removes Mongolian audio lacking the specified regional characteristics or sufficient definition, thereby generating the Mongolian expansion data set.
In one embodiment, the conditional speech generator consists of a synthesizer and a vocoder, the synthesizer comprising a causal convolution layer, an LSTM encoding layer, an attention layer, an LSTM decoding layer and a deconvolution layer. The vocoder converts the Mel spectrogram into Mongolian audio using the Griffin-Lim algorithm.
Specifically, in order to maximally restore audio, the mel-frequency spectrogram needs to be converted into a time-frequency spectrum.
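That Mel-to-linear conversion can be sketched with a standard triangular Mel filterbank and its least-squares pseudo-inverse; the filter count, FFT size, and sample rate below are illustrative assumptions, not values from the patent:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=40, n_freq=257, sr=16000):
    """Triangular filters mapping n_freq linear bins to n_mels Mel bins."""
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_freq - 1) * edges / (sr / 2)).astype(int)
    fb = np.zeros((n_mels, n_freq))
    for m in range(1, n_mels + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, ctr):          # rising slope of the triangle
            fb[m - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):          # falling slope of the triangle
            fb[m - 1, k] = (hi - k) / max(hi - ctr, 1)
    return fb

def mel_to_linear(mel_spec, fb):
    """Approximate linear-frequency magnitudes via the pseudo-inverse."""
    return np.maximum(np.linalg.pinv(fb) @ mel_spec, 0.0)

fb = mel_filterbank()
# Stand-in magnitude frames (257 frequency bins x 10 frames)
mag = np.abs(np.random.default_rng(0).normal(size=(257, 10)))
mel = fb @ mag                 # forward: linear -> Mel
linear = mel_to_linear(mel, fb)  # inverse: Mel -> approximate linear
```

The recovered linear spectrum is only approximate, since the Mel projection discards information; the Griffin-Lim step described below then supplies the missing phase.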
In one embodiment, the synthesizer of the conditional speech generator models the distribution of the Mongolian Mel spectrogram under the condition of the specified regional features and the Mongolian text. The formula is:
x ∼ p(x | z·t)
wherein z is the specified regional feature, t is the Mongolian text, x is the Mongolian Mel spectrogram, and p(x | z·t) is the distribution of the Mongolian Mel spectrogram x.
In a specific embodiment, the obtained Mongolian text containing the specified regional characteristics and the specified regional characteristics themselves are spliced into an encoding used as the input of the synthesizer, the synthesizer comprising a causal convolution layer, an LSTM encoding layer, an attention layer, an LSTM decoding layer and a deconvolution layer.
In particular, the causal convolution layer can reduce the information-quantity difference between the Mongolian text plus regional features and the Mongolian audio Mel spectrogram.
In particular, the LSTM encoding layer, the attention layer, and the LSTM decoding layer map the relationship between the input features and the output mel-frequency spectrogram in the time dimension.
Specifically, the deconvolution layer can improve the definition of the mel-frequency spectrum.
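Of the layers just described, the causal convolution step has a simple concrete form; a minimal 1-D sketch follows, where the kernel values are arbitrary (the patent's layer would learn its kernel), and only the left-padding pattern is the point:

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1-D causal convolution: the output at step t uses only inputs <= t."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])  # pad on the left only
    # Flip the kernel so this is true convolution rather than correlation
    return np.array([padded[t:t + k] @ kernel[::-1] for t in range(len(x))])

impulse = np.array([1.0, 0.0, 0.0, 0.0])
response = causal_conv1d(impulse, np.array([1.0, 2.0, 3.0]))
```

An impulse input replays the kernel forward in time with no leakage from future samples, which is exactly the property that lets the synthesizer align text-side features with audio-side features frame by frame.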
Specifically, the calculation formula is:
x̂ = Ŵ_deconv * g(W_att·c, W_enc(W_conv * (z·t)))
wherein * denotes a convolution operation, W_conv denotes the convolution kernel parameters, W_enc denotes the LSTM encoding parameters, c denotes the attention context, W_att denotes the attention weights, g denotes the LSTM decoding operation, Ŵ_deconv denotes the deconvolution parameters, and x̂ denotes the Mongolian Mel spectrogram features obtained by model calculation.
Specifically, the initial parameters of each layer are randomly generated, and a gradient descent algorithm is further required to correct the model parameters in order to obtain a better modeling effect. The loss function L required by the gradient descent algorithm measures the distance between the model output x̂ and the real Mel spectrogram x:
L = ‖x − x̂‖₂²
in one embodiment, the vocoder uses the Griffin-Lim algorithm to convert the Mel-spectral plot into Mongolian audio.
Specifically, in order to maximally restore the audio, the Mel spectrogram needs to be converted into a time-frequency spectrum (spectrogram). The time-frequency spectrum retains the frequency distribution of each frame but lacks phase information, i.e., information about the waveform variation of the signal. Let P be the phase spectrum, S the time-frequency spectrum, X the speech waveform information, f the Fourier transform, and f⁻¹ the inverse Fourier transform. The specific algorithm steps are as follows:
(1) randomly initialize a phase spectrum P;
(2) perform the inverse Fourier transform f⁻¹ using the time-frequency spectrum S and the phase spectrum P to synthesize new speech waveform information X;
(3) perform the Fourier transform f on the synthesized audio to obtain a new time-frequency spectrum S_new and a new phase spectrum P_new;
(4) discard the new time-frequency spectrum S_new, and use the original time-frequency spectrum S with the new phase spectrum P_new to synthesize new speech waveform information X;
(5) repeat steps (3) to (4) for several rounds, and output the audio waveform information X obtained in the last round.
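Steps (1) through (5) can be sketched with NumPy; the window type, FFT size, hop length, and iteration count below are illustrative assumptions rather than values from the patent:

```python
import numpy as np

N_FFT, HOP = 256, 64

def stft(x):
    win = np.hanning(N_FFT)
    frames = [x[i:i + N_FFT] * win for i in range(0, len(x) - N_FFT + 1, HOP)]
    return np.array([np.fft.rfft(f) for f in frames]).T        # (freq, time)

def istft(S):
    win = np.hanning(N_FFT)
    n = (S.shape[1] - 1) * HOP + N_FFT
    x, norm = np.zeros(n), np.zeros(n)
    for t in range(S.shape[1]):                                # overlap-add
        x[t * HOP:t * HOP + N_FFT] += np.fft.irfft(S[:, t], n=N_FFT) * win
        norm[t * HOP:t * HOP + N_FFT] += win ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=32):
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))         # step (1)
    for _ in range(n_iter):
        x = istft(mag * phase)                                 # steps (2)/(4)
        phase = np.exp(1j * np.angle(stft(x)))                 # step (3): keep P_new only
    return istft(mag * phase)                                  # step (5)

sine = np.sin(2 * np.pi * 440 * np.arange(4096) / 16000)
target = np.abs(stft(sine))
rebuilt = griffin_lim(target)
```

With enough iterations the magnitude spectrum of the rebuilt signal approaches the target spectrum, which is the fixed point that the discard-S_new / keep-P_new cycle of steps (3) and (4) seeks.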
in one embodiment, the multi-item fusion discriminator is divided into a region classifier and a definition classifier in order to accomplish the task of discriminating Mongolian audio pronunciation regions and the capability of discriminating Mongolian audio definition.
Specifically, the region classifier first performs a two-dimensional convolution calculation on the Mel spectrogram to obtain convolution features; after each convolution operation a rectified linear unit (ReLU) transformation is applied to introduce non-linearity into the model. A pooling operation then down-samples the convolution features, reducing the dimensionality of the feature map while still retaining the critical feature information. A fully-connected layer then classifies according to the features extracted by the convolution. Finally, the softmax activation function computes a probability value for each region class, and the region with the maximum probability is taken as the judgment result.
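A toy forward pass through that convolution, ReLU, pooling, fully-connected, softmax pipeline is sketched below; all weights are random and the input size, kernel size, and region count of four are illustrative assumptions, not values from the patent:

```python
import numpy as np

def conv2d(x, k):
    """Valid 2-D convolution of a single-channel input with one kernel."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def max_pool(x, s=2):
    H, W = (x.shape[0] // s) * s, (x.shape[1] // s) * s
    return x[:H, :W].reshape(H // s, s, W // s, s).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def region_classifier(mel, kernel, w_fc):
    feat = np.maximum(conv2d(mel, kernel), 0.0)  # convolution + ReLU
    pooled = max_pool(feat).ravel()              # down-sample the feature map
    probs = softmax(w_fc @ pooled)               # one probability per region
    return probs, int(np.argmax(probs))

rng = np.random.default_rng(1)
mel = rng.normal(size=(16, 16))                  # stand-in Mel spectrogram
probs, region = region_classifier(mel, rng.normal(size=(3, 3)),
                                  rng.normal(size=(4, 49)))
```

The argmax over the softmax output corresponds to "taking the region with the maximum probability as the judgment result" in the text.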
The specific calculation formula is:
ĉ = softmax(W_fc · pool(W_conv * x))
wherein x denotes the Mongolian Mel spectrogram, * denotes a convolution operation, W_conv denotes the convolution kernel parameters, pool denotes the pooling operation, W_fc denotes the fully-connected layer parameters, and ĉ denotes the region determined by the region classifier.
In one embodiment, the definition classifier is designed similarly to the region classifier, but the final activation function is changed to sigmoid to calculate a score for the definition of the Mongolian audio, the score range being [-1, 1]. When the score is higher than the set score limit, the definition requirement is considered met; otherwise it is not.
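The definition head differs from the region head only in its final activation and thresholding; a minimal sketch follows. Note this uses a sigmoid as the text states, whose output lies in (0, 1), so the 0.5 threshold is an assumed value (the [-1, 1] range quoted above would instead correspond to a tanh activation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def clarity_decision(pooled_feat, w_fc, threshold=0.5):
    """Score pooled convolution features and gate on a clarity threshold."""
    score = float(sigmoid(w_fc @ pooled_feat))
    return score, score > threshold

rng = np.random.default_rng(2)
score, passed = clarity_decision(rng.normal(size=8), rng.normal(size=8))
```

Only samples whose score clears the threshold survive into the expansion data set, matching the pass/fail behaviour described for the discriminator.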
The specific calculation formula is:
ŝ = sigmoid(W_fc · pool(W_conv * x))
wherein x denotes the Mongolian Mel spectrogram, * denotes a convolution operation, W_conv denotes the convolution kernel parameters, pool denotes the pooling operation, W_fc denotes the fully-connected layer parameters, and ŝ denotes the definition judged by the definition classifier.
Specifically, the multi-item fusion discriminator computes the region classifier first; the definition classifier is computed only if the region classification is correct, otherwise a fail is returned directly. If the definition classifier result ŝ is higher than the set requirement y, a pass is returned; otherwise a fail is still returned. The loss function of the multi-item fusion discriminator is then formulated as:
loss_D = −E_{x∼X}[log D(x)] − E_z[log(1 − D(G(z)))]
in one embodiment, the final goal of the countermeasure training for the specific regional countermeasure generation network model is:
Figure BDA0003220218420000098
wherein D is a multinomial fusion discriminator, G is a conditional speech generator, X is a real Mongolian audio, and X isCIndicating region information as conditions in a conditional speech generator, Z representing speech characteristics of a specified region, WDRepresenting a random initialization parameter, W, during training of a multi-item fusion discriminatorGRepresenting the random initialization parameters when training the conditional speech generator.
Specifically, the multi-item fusion discriminator D is trained using the real Mongolian data set X and the random parameters W_D.
The conditional speech generator G is trained using the Mongolian data set X and the random parameters W_G, obtaining a Mongolian expansion data set Y through the specified regional features z, which is marked as false. Back-propagation according to the conditional speech generator loss loss_G updates the parameters W_G, where the loss is formulated as:
loss_G = E_z[log(1 − D(G(z)))]
The multi-item fusion discriminator D then discriminates the Mongolian expansion data set Y; items judged to be true are added to the data set X, and the rest are discarded. Back-propagation according to the multi-item fusion discriminator loss loss_D updates the parameters W_D, and this is cycled n times. The generated data initially marked as false are thereby expanded into the Mongolian expansion data set Y_Z.
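The generate-discriminate-filter cycle above can be sketched structurally; `generator` and `discriminator` here are hypothetical stand-ins for G and D, and the parameter updates driven by loss_G and loss_D are elided to a comment:

```python
import numpy as np

rng = np.random.default_rng(3)

def generator(z, w_g):
    """Stand-in for G: map region-feature vectors z to fake 'Mel' vectors."""
    return np.tanh(z @ w_g)

def discriminator(x, w_d):
    """Stand-in for D: realness score in (0, 1) for each sample."""
    return 1.0 / (1.0 + np.exp(-(x @ w_d)))

def expand_dataset(z, w_g, w_d, threshold=0.5, n_rounds=5):
    expanded = []
    for _ in range(n_rounds):
        fake = generator(z, w_g)             # G synthesises candidate samples
        scores = discriminator(fake, w_d)    # D judges each candidate
        expanded.extend(x for x, s in zip(fake, scores) if s > threshold)
        # back-propagation updating w_g (via loss_G) and w_d (via loss_D)
        # would go here in a real adversarial training loop
    return expanded

z = rng.normal(size=(10, 6))                 # 10 region-feature vectors
Y = expand_dataset(z, rng.normal(size=(6, 16)), rng.normal(size=16))
```

The key structural point is that the expansion data set consists only of generated items the discriminator accepts, exactly as the patent describes for Y_Z.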
In one embodiment, because the existing data set contains little Mongolian audio for a given target region, the regional characteristics of that region are extracted from its existing Mongolian audio, reconstructed, and combined with Mongolian text to obtain text bearing the regional characteristics; this is fed to the synthesizer in the conditional speech generator to generate a Mongolian Mel spectrogram with the regional characteristics, which is then converted into speech by the vocoder. The multi-item fusion discriminator in the generative adversarial network uses the real Mongolian audio of the region to judge whether the generated Mongolian audio is clear and whether it carries the regional characteristics, and the conditional speech generator is continually adjusted by computing the adversarial loss to generate Mongolian audio with the regional characteristics, finally realizing the data set expansion.
Compared with the prior art, the data expansion method provided by the invention can balance the regional distribution of the data set, thereby improving the recognition accuracy of the Mongolian speech recognition model. It solves the problems that existing Mongolian data sets lack annotated Mongolian audio and that their regional distribution is unbalanced.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A Mongolian data set expansion method is characterized by comprising the following steps:
acquiring Mongolian text containing specified regional characteristics, the specified regional characteristics, and real Mongolian audio with the specified regional characteristics;
constructing a generative adversarial network model for the designated region;
and adversarially training the generative adversarial network model of the specified region, inputting Mongolian audio with the characteristics of the real audio of the specified region into the trained model, and processing it to generate a Mongolian expansion data set.
2. The Mongolian data set expansion method of claim 1, wherein the region-specific generative adversarial network model comprises a conditional speech generator and a multi-item fusion discriminator, the conditional speech generator being connected to the multi-item fusion discriminator and consisting of a synthesizer and a vocoder;
wherein:
the synthesizer is used for obtaining a constructed Mongolian Mel spectrogram from the Mongolian text and the specified regional characteristics;
the vocoder is connected to the synthesizer and generates Mongolian audio of the designated region from the Mongolian Mel spectrogram;
the multi-item fusion discriminator judges, from the Mongolian Mel spectrogram and the specified regional characteristics, whether the Mongolian audio of the designated region is real data, and generates the Mongolian expansion data set.
3. The Mongolian data set expansion method of claim 2, wherein the synthesizer comprises a causal convolution layer, an encoding layer, an attention layer, a decoding layer and a deconvolution layer connected in sequence;
wherein:
the causal convolution layer reduces the difference in information content among the Mongolian text, the specified regional features and the Mongolian audio Mel spectrogram;
the encoding layer, the attention layer and the decoding layer map the relation between the input features and the output Mel spectrogram in the time dimension;
the deconvolution layer improves the clarity of the Mongolian audio Mel spectrogram.
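Claim 3's causal convolution layer can be illustrated with a minimal NumPy sketch. The kernel width and the toy input sequence below are illustrative assumptions, not values from the patent; the defining property being demonstrated is that the output at time t depends only on inputs up to t:

```python
import numpy as np

def causal_conv1d(x, w):
    """1-D causal convolution: left-pad so output[t] never sees x[t+1:]."""
    k = len(w)
    xp = np.concatenate([np.zeros(k - 1), x])   # pad only on the past side
    return np.array([xp[t:t + k] @ w[::-1] for t in range(len(x))])

x = np.arange(5.0)            # toy feature sequence (e.g. embedded text/region)
w = np.array([0.5, 0.5])      # illustrative kernel of width 2
y = causal_conv1d(x, w)       # [0.0, 0.5, 1.5, 2.5, 3.5]
```

Because the padding is entirely on the past side, an impulse at position 0 can only influence outputs at positions 0 and later, never earlier ones.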
4. The method of claim 2, wherein the synthesizer obtains the distribution of the Mongolian Mel spectrogram from the specified regional features and the Mongolian text according to the following formula:
x ~ p(x | z·t)
wherein z denotes the specified regional features, t denotes the Mongolian text, x denotes the Mongolian Mel spectrogram, and p(x | z·t) denotes the distribution of the Mongolian Mel spectrogram;
modeling the distribution of the Mongolian Mel spectrogram to obtain the Mongolian Mel spectrogram features, calculated according to the following formula:
x̂ = W_deconv * g(c, W_att · W_enc(W_conv * (z·t)))
wherein * denotes a convolution operation, W_conv denotes the convolution kernel parameters, W_enc denotes the LSTM encoding parameters, c denotes the attention context, W_att denotes the attention weights, g denotes the LSTM decoding operation, W_deconv denotes the deconvolution parameters, and x̂ denotes the Mongolian Mel spectrogram features calculated by the model.
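The attention context c used by the decoding operation g in claim 4 can be sketched as simple dot-product attention over the encoder states. The shapes, the scoring function, and the toy values below are assumptions for illustration only, not details from the patent:

```python
import numpy as np

def attention_context(dec_state, enc_states, W_att):
    """Score each encoder state against the decoder state, softmax the
    scores into attention weights, and return the weighted sum as c."""
    scores = enc_states @ (W_att @ dec_state)   # one score per time step
    a = np.exp(scores - scores.max())
    a /= a.sum()                                # attention weights sum to 1
    return a @ enc_states                       # context vector c

enc = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 encoder states, dim 2
dec = np.array([1.0, 0.0])                            # current decoder state
c = attention_context(dec, enc, np.eye(2))            # context passed to g
```

Since the weights form a convex combination, c always lies inside the span of the encoder states, and states aligned with the decoder state receive larger weight.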
5. The Mongolian data set expansion method of claim 2, wherein the multi-item fusion discriminator consists of a region classifier and an intelligibility classifier, the region classifier discriminating the pronunciation region of the Mongolian audio and the intelligibility classifier discriminating the intelligibility of the Mongolian audio to obtain a discrimination result, specifically comprising:
using the region classifier and the intelligibility classifier to judge, respectively, the pronunciation region and the intelligibility of the real Mongolian audio and of the Mongolian audio having the specified regional features; if an audio sample is judged to be real it is added to the real Mongolian data set X, and if it is judged to be fake it is discarded, thereby forming the Mongolian expansion data set.
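The acceptance rule in claim 5 — keep a sample only when both classifiers judge it real — can be sketched as follows. The sample fields and the two predicate functions are hypothetical names chosen for illustration, not identifiers from the patent:

```python
def build_expansion_set(samples, region_ok, clear_ok):
    """Keep samples that pass BOTH the region check and the
    intelligibility check; everything else is discarded."""
    kept = []
    for s in samples:
        if region_ok(s) and clear_ok(s):
            kept.append(s)          # joins the real Mongolian data set X
    return kept

samples = [
    {"region_prob": 0.9, "clarity": 0.7},
    {"region_prob": 0.9, "clarity": -0.4},   # fails the intelligibility check
    {"region_prob": 0.2, "clarity": 0.8},    # fails the region check
]
kept = build_expansion_set(
    samples,
    region_ok=lambda s: s["region_prob"] > 0.5,
    clear_ok=lambda s: s["clarity"] > 0.0,
)
# kept contains only the first sample
```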
6. The method of claim 5, wherein the region classifier identifying the Mongolian audio pronunciation region comprises:
performing a two-dimensional convolution calculation on the Mongolian Mel spectrogram to obtain convolution features;
performing pooling on the convolution features;
classifying according to the pooled convolution features;
calculating a probability value for each region and taking the region with the maximum probability as the pronunciation region judgment result, according to the following formula:
r̂ = softmax(W_fc · pool(W_conv * x))
wherein x denotes the Mongolian Mel spectrogram, * denotes a convolution operation, W_conv denotes the convolution kernel parameters, pool denotes the pooling operation, W_fc denotes the fully connected layer parameters, and r̂ denotes the region identified by the region classifier.
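A minimal NumPy rendering of the claim 6 pipeline (convolution → pooling → fully connected layer → softmax → argmax). The kernel values, spectrogram size and region labels are illustrative assumptions, not data from the patent:

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 2-D valid cross-correlation, standing in for the conv layer."""
    kh, kw = k.shape
    return np.array([[(x[i:i + kh, j:j + kw] * k).sum()
                      for j in range(x.shape[1] - kw + 1)]
                     for i in range(x.shape[0] - kh + 1)])

def classify_region(mel, kernels, W_fc, regions):
    # global average pooling: one scalar per convolution kernel
    pooled = np.array([conv2d_valid(mel, k).mean() for k in kernels])
    logits = W_fc @ pooled
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax over regions
    return regions[int(np.argmax(probs))], probs

mel = np.arange(16.0).reshape(4, 4)            # toy Mel spectrogram
kernels = [np.ones((2, 2)), np.eye(2)]         # two illustrative kernels
W_fc = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # 3 candidate regions
region, probs = classify_region(mel, kernels, W_fc,
                                ["east", "central", "west"])
```

The softmax turns the fully connected outputs into a probability per region, and the argmax selects the judgment result, as the claim describes.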
7. The method of claim 5, wherein the intelligibility classifier discriminating the Mongolian audio intelligibility comprises:
performing a two-dimensional convolution calculation on the Mongolian Mel spectrogram to obtain convolution features;
performing pooling on the convolution features;
classifying according to the pooled convolution features;
calculating a Mongolian audio intelligibility score in the range [-1, 1]; when the score is higher than a set threshold, the intelligibility requirement is considered met, otherwise it is not, according to the following formula:
ŝ = tanh(W_fc · pool(W_conv * x))
wherein x denotes the Mongolian Mel spectrogram, * denotes a convolution operation, W_conv denotes the convolution kernel parameters, pool denotes the pooling operation, W_fc denotes the fully connected layer parameters, and ŝ denotes the intelligibility score produced by the intelligibility classifier.
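Claim 7 states that the score lies in [-1, 1], which is consistent with a tanh output head. The following sketch assumes tanh and an arbitrary threshold, both of which are inferences for illustration rather than details given in the patent:

```python
import numpy as np

def intelligibility(pooled_feat, w_fc, threshold=0.0):
    """Fully connected layer followed by tanh, so the score is in [-1, 1];
    the sample passes when the score exceeds the threshold."""
    score = float(np.tanh(w_fc @ pooled_feat))
    return score, score > threshold

score, ok = intelligibility(np.array([0.4, 0.2]), np.array([1.0, 1.0]))
# tanh(0.6) ≈ 0.537, above the default threshold, so ok is True
```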
8. The Mongolian data set expansion method of claim 5, wherein the adversarial training involving the multi-item fusion discriminator specifically comprises:
the multi-item fusion discriminator is trained using the real Mongolian data set and its random parameters W_D;
the conditional speech generator is trained using the Mongolian data set and its random parameters W_G;
according to the loss function of the conditional speech generator, back-propagation updates the random parameters W_G of the conditional speech generator; according to the loss function of the multi-item fusion discriminator, back-propagation updates the random parameters W_D of the multi-item fusion discriminator; and this cycle is repeated n times.
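The alternating update in claim 8 can be sketched as a training skeleton. ToyG, ToyD and their scalar parameters (standing in for W_G and W_D) are purely illustrative mocks, not the patent's networks or losses:

```python
class ToyG:
    """Stand-in generator; self.w plays the role of W_G."""
    def __init__(self): self.w = 0.0
    def generate(self, text, region): return self.w
    def update(self, d_score):            # "back-propagation" of the G loss
        self.w += 0.1 * (1.0 - d_score)

class ToyD:
    """Stand-in discriminator; self.w plays the role of W_D."""
    def __init__(self): self.w = 0.5
    def score(self, x): return max(0.0, min(1.0, x * self.w))
    def update(self, real, fake): pass    # real training would refit W_D here

def adversarial_train(G, D, batches, n):
    """Alternate discriminator and generator updates, repeated n times."""
    for _ in range(n):
        for b in batches:
            fake = G.generate(b["text"], b["region"])
            D.update(b["audio"], fake)     # discriminator step (updates W_D)
            G.update(D.score(fake))        # generator step (updates W_G)
    return G, D

G, D = adversarial_train(ToyG(), ToyD(),
                         [{"text": "t", "region": "r", "audio": 1.0}], n=3)
# G.w has moved away from its initial 0 after three rounds
```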
CN202110955831.4A 2021-08-19 2021-08-19 Mongolian data set expansion method Active CN113611293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110955831.4A CN113611293B (en) 2021-08-19 2021-08-19 Mongolian data set expansion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110955831.4A CN113611293B (en) 2021-08-19 2021-08-19 Mongolian data set expansion method

Publications (2)

Publication Number Publication Date
CN113611293A true CN113611293A (en) 2021-11-05
CN113611293B CN113611293B (en) 2022-10-11

Family

ID=78341361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110955831.4A Active CN113611293B (en) 2021-08-19 2021-08-19 Mongolian data set expansion method

Country Status (1)

Country Link
CN (1) CN113611293B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115171651A (en) * 2022-09-05 2022-10-11 中邮消费金融有限公司 Method and device for synthesizing infant voice, electronic equipment and storage medium
CN116564276A (en) * 2023-04-23 2023-08-08 内蒙古工业大学 Mongolian speech recognition method for generating countermeasure network based on double discriminators

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598221A (en) * 2019-08-29 2019-12-20 内蒙古工业大学 Method for improving translation quality of Mongolian Chinese by constructing Mongolian Chinese parallel corpus by using generated confrontation network
CN111986646A (en) * 2020-08-17 2020-11-24 云知声智能科技股份有限公司 Dialect synthesis method and system based on small corpus
CN112133326A (en) * 2020-09-08 2020-12-25 东南大学 Gunshot data amplification and detection method based on antagonistic neural network
CN112652309A (en) * 2020-12-21 2021-04-13 科大讯飞股份有限公司 Dialect voice conversion method, device, equipment and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WANG Haiwen, QIU Xiaohui: "An Image Data Augmentation Method Based on Generative Adversarial Networks", Computer Technology and Development *
GUO Jiaxing: Master's Thesis, 15 February 2021, Harbin Institute of Technology *


Also Published As

Publication number Publication date
CN113611293B (en) 2022-10-11

Similar Documents

Publication Publication Date Title
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
CN104424943B (en) Speech processing system and method
US20180061439A1 (en) Automatic audio captioning
CN110853680B (en) double-BiLSTM speech emotion recognition method with multi-input multi-fusion strategy
EP0847041B1 (en) Method and apparatus for speech recognition performing noise adaptation
EP0755046B1 (en) Speech recogniser using a hierarchically structured dictionary
US5903863A (en) Method of partitioning a sequence of data frames
CN103578462A (en) Speech processing system
CN113611293B (en) Mongolian data set expansion method
CN110751044A (en) Urban noise identification method based on deep network migration characteristics and augmented self-coding
CN112818861A (en) Emotion classification method and system based on multi-mode context semantic features
CN102810311B (en) Speaker estimation method and speaker estimation equipment
CN102663432A (en) Kernel fuzzy c-means speech emotion identification method combined with secondary identification of support vector machine
CN115662435B (en) Virtual teacher simulation voice generation method and terminal
WO1996008005A1 (en) System for recognizing spoken sounds from continuous speech and method of using same
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
CN116110405A (en) Land-air conversation speaker identification method and equipment based on semi-supervised learning
US6131089A (en) Pattern classifier with training system and methods of operation therefor
CN114863938A (en) Bird language identification method and system based on attention residual error and feature fusion
Yasmin et al. A rough set theory and deep learning-based predictive system for gender recognition using audio speech
CN111241820A (en) Bad phrase recognition method, device, electronic device, and storage medium
CN102063897B (en) Sound library compression for embedded type voice synthesis system and use method thereof
AU2362495A (en) Speech-recognition system utilizing neural networks and method of using same
CN108847251A (en) A kind of voice De-weight method, device, server and storage medium
CN111968669A (en) Multi-element mixed sound signal separation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant