CN113241062A - Method, device and equipment for enhancing voice training data set and storage medium - Google Patents

Method, device and equipment for enhancing voice training data set and storage medium Download PDF

Info

Publication number
CN113241062A
CN113241062A (application CN202110610940.2A)
Authority
CN
China
Prior art keywords
training data
mel frequency
voice
data set
mel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110610940.2A
Other languages
Chinese (zh)
Other versions
CN113241062B (en)
Inventor
唐彦玺 (Tang Yanxi)
王健宗 (Wang Jianzong)
瞿晓阳 (Qu Xiaoyang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110610940.2A priority Critical patent/CN113241062B/en
Publication of CN113241062A publication Critical patent/CN113241062A/en
Application granted granted Critical
Publication of CN113241062B publication Critical patent/CN113241062B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a method, an apparatus, a device and a storage medium for enhancing a voice training data set. The method comprises: extracting the Mel spectrogram corresponding to each piece of voice training data and performing pixel rearrangement on it to obtain a temporary Mel spectrogram; setting an erasing area and the shape parameters of the erasing region for each temporary Mel spectrogram; changing a position parameter or the random erasing coefficient to obtain a plurality of extended Mel spectrograms; and converting each extended Mel spectrogram into corresponding target voice training data, thereby supplementing the voice training data. The beneficial effects of the invention are: it solves the problem that scarce voice training data makes the voice model prone to over-fitting during training, increases the robustness of the voice model, prevents the voice model from over-fitting, and greatly widens the voice model's range of application.

Description

Method, device and equipment for enhancing voice training data set and storage medium
Technical Field
The invention relates to the field of artificial intelligence, in particular to a method, a device, equipment and a storage medium for enhancing a voice training data set.
Background
Speech recognition is a multidisciplinary field closely connected with acoustics, phonetics, linguistics, digital signal processing theory, information theory, computer science and many other disciplines. Its aim is to enable a computer to take 'dictation' of continuous speech spoken by different people, the commonly called 'speech dictation machine', a technology that converts 'sound' into 'text'. Because voice training data are scarce, a voice model easily over-fits during training; once over-fitting occurs, the model performs well only on the training set, performs poorly on other data, and lacks generalization ability.
Disclosure of Invention
The invention mainly aims to provide a method, a device, equipment and a storage medium for enhancing a voice training data set, and aims to solve the problem that a voice model is easy to over-fit in the training process due to less voice training data.
The invention provides a method for enhancing a voice training data set, which comprises the following steps:
acquiring a voice training data set;
extracting each voice training data from the voice training data set, and converting each voice training data into a corresponding Mel frequency spectrogram;
performing pixel point rearrangement processing on each Mel frequency spectrum graph to obtain a temporary Mel frequency spectrum graph after the pixel point rearrangement processing;
setting an erasing area for each temporary Mel frequency spectrum according to the size of the image of the temporary Mel frequency spectrum;
introducing a random erasing coefficient, and setting shape parameters of an erasing area based on the area of the erasing area and the random erasing coefficient;
changing the position of the erasing area in the temporary Mel frequency spectrum or changing the random erasing coefficient to obtain a plurality of extended Mel frequency spectrums corresponding to the temporary Mel frequency spectrums;
converting each extended Mel frequency spectrogram into corresponding target voice training data;
and supplementing the target voice training data into the voice training data set to obtain an enhanced voice training data set.
Further, the step of performing pixel rearrangement processing on each mel frequency spectrum graph to obtain a temporary mel frequency spectrum graph after the pixel rearrangement processing includes:
dividing the Mel frequency spectrogram into a plurality of subset frequency spectrograms;
and randomly selecting a preset number of the subset frequency spectrum diagrams to carry out pixel point random arrangement to obtain a temporary Mel frequency spectrum diagram after pixel point rearrangement processing.
Further, the step of introducing a random erasure coefficient and setting a shape parameter of the erasure area based on the area of the erasure area and the random erasure coefficient includes:
arbitrarily selecting a random parameter r_e from a preset parameter range;
setting the width of the rectangular area according to the formula
W_e = √(S_e / r_e)
and setting the height of the rectangular area according to the formula
H_e = √(S_e · r_e)
wherein S_e is the area of the erasing region, W_e is the width, and H_e is the height.
Further, after the step of introducing a random erasure coefficient and setting a shape parameter of an erasure area based on the area of the erasure area and the random erasure coefficient, the method further includes:
judging whether a central point of the erasing region exists in the temporary Mel frequency spectrogram based on the shape parameter, so that the erasing region is completely contained by the temporary Mel frequency spectrogram;
and if the central point does not exist, replacing the random erasing coefficient until the central point exists.
Further, before the step of supplementing the target speech training data into the speech training data set to obtain an enhanced speech training data set, the method further includes:
inputting each piece of target voice training data into a preset vector machine to obtain a corresponding fixed-dimension target vector X = (x_1, x_2, …, x_i, …, x_n);
calculating, according to the formula
d(X, Y) = (Σ_i s_i |x_i - y_i|^p)^(1/p)
the difference value between each target vector and the voice vector corresponding to the original voice training data, wherein Y = (y_1, y_2, …, y_i, …, y_n) is the multidimensional coordinate corresponding to the original voice training data, x_i is the value of the i-th dimension of the target vector, y_i is the value of the i-th dimension of the corresponding voice vector, s_i is the coefficient of the i-th dimension, and p is a set parameter value;
and deleting the target voice training data with the difference value smaller than the preset difference value.
Further, after the step of supplementing the target speech training data into the speech training data set to obtain an enhanced speech training data set, the method further includes:
converting sample voice data in the enhanced voice training data set into a sample Mel frequency spectrogram;
inputting the sample Mel frequency spectrogram and a preset interference frequency spectrogram into a generation network to obtain a middle Mel frequency spectrogram;
inputting the intermediate Mel frequency spectrogram into a discrimination network to obtain type probability and a prediction label corresponding to the intermediate Mel frequency spectrogram;
and performing alternate iterative training on the generation network and the discrimination network according to the type probability of the middle Mel frequency spectrogram and the prediction label, and taking the trained generation network as a voice model.
Further, the step of converting each of the speech training data into a corresponding mel frequency spectrum diagram includes:
performing Fourier transformation on each frame of voice in each voice training data to obtain a voice result corresponding to each frame of voice;
stacking each voice result along one dimension to obtain a corresponding spectrogram;
and inputting the spectrogram into a Mel filter bank to obtain the Mel spectrogram.
The invention provides a device for enhancing a voice training data set, which comprises:
the acquisition module is used for acquiring a voice training data set;
the first conversion module is used for extracting each voice training data from the voice training data set and converting each voice training data into a corresponding Mel frequency spectrogram;
the rearrangement module is used for carrying out pixel point rearrangement processing on each Mel frequency spectrum map to obtain a temporary Mel frequency spectrum map after the pixel point rearrangement processing;
the setting module is used for setting the area of an erasing area for each temporary Mel frequency spectrum according to the picture size of the temporary Mel frequency spectrum;
the introduction module is used for introducing a random erasing coefficient and setting the shape parameter of an erasing area based on the area of the erasing area and the random erasing coefficient;
a changing module, configured to change a position of the erasure area in the temporary mel spectrum or change the random erasure coefficient, so as to obtain a plurality of extended mel frequency spectrograms corresponding to the temporary mel frequency spectrograms;
the second conversion module is used for converting each extended Mel frequency spectrogram into corresponding target voice training data;
and the supplement module is used for supplementing the target voice training data into the voice training data set to obtain an enhanced voice training data set.
The invention also provides a computer device comprising a memory storing a computer program and a processor implementing the steps of any of the above methods when the processor executes the computer program.
The invention also provides a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method of any of the above.
The invention has the beneficial effects that: the Mel spectrogram corresponding to each piece of voice training data is extracted and subjected to pixel rearrangement to obtain a temporary Mel spectrogram; an erasing area and its shape parameters are set for each temporary Mel spectrogram; changing a position parameter or the random erasing coefficient yields a plurality of extended Mel spectrograms; and each extended Mel spectrogram is converted into corresponding target voice training data. The voice training data are thereby supplemented, which solves the problem that scarce voice training data makes the voice model prone to over-fitting during training, increases the robustness of the voice model, prevents over-fitting, and greatly widens the voice model's range of application.
Drawings
FIG. 1 is a flow chart illustrating a method for enhancing a speech training data set according to an embodiment of the present invention;
FIG. 2 is a block diagram schematically illustrating an apparatus for enhancing a speech training data set according to an embodiment of the present invention;
FIG. 3 is a block diagram illustrating a structure of a computer device according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that all directional indicators (such as up, down, left, right, front, back, etc.) in the embodiments of the present invention are only used to explain the relative position relationship between the components, the motion situation, etc. in a specific posture (as shown in the drawings), and if the specific posture is changed, the directional indicator is changed accordingly, and the connection may be a direct connection or an indirect connection.
The term "and/or" herein merely describes an association between associated objects, meaning that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone.
In addition, the descriptions related to "first", "second", etc. in the present invention are only for descriptive purposes and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.
Referring to fig. 1, the present invention provides a method for enhancing a speech training data set, including:
s1: acquiring a voice training data set;
s2: extracting each voice training data from the voice training data set, and converting each voice training data into a corresponding Mel frequency spectrogram;
s3: performing pixel point rearrangement processing on each Mel frequency spectrum graph to obtain a temporary Mel frequency spectrum graph after the pixel point rearrangement processing;
s4: setting an erasing area for each temporary Mel frequency spectrum according to the size of the image of the temporary Mel frequency spectrum;
s5: introducing a random erasing coefficient, and setting shape parameters of an erasing area based on the area of the erasing area and the random erasing coefficient;
s6: changing the position of the erasing area in the temporary Mel frequency spectrum or changing the random erasing coefficient to obtain a plurality of extended Mel frequency spectrums corresponding to the temporary Mel frequency spectrums;
s7: converting each extended Mel frequency spectrogram into corresponding target voice training data;
s8: and supplementing the target voice training data into the voice training data set to obtain an enhanced voice training data set.
As described in step S1 above, a voice training data set is acquired, either through a microphone or from other data sources. It should be noted that a mobile terminal generally trains on its own user's speech data, so the acquired voice training data set is small.
As described in step S2 above, each piece of voice training data is extracted from the voice training data set and converted into a corresponding Mel spectrogram. The conversion applies a short-time Fourier transform (STFT) to each piece of voice training data: a Fourier transform is performed on each frame of the voice data, the per-frame results are stacked along another dimension to obtain a spectrogram, and the spectrogram is passed through a Mel-scale filter bank to obtain the Mel spectrogram.
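The STFT-plus-filter-bank conversion described above can be sketched in plain numpy. This is an illustrative sketch, not the patent's implementation; the frame length, hop size, FFT size and triangular filter construction are assumptions chosen for a 16 kHz signal.

```python
import numpy as np

def stft_magnitude(signal, frame_len=400, hop=160, n_fft=512):
    """Fourier-transform each windowed frame and stack the per-frame
    results along one dimension to form a spectrogram."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))  # (frames, freq bins)

def mel_filter_bank(n_mels=40, n_fft=512, sr=16000):
    """Triangular filters spaced evenly on the mel scale."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(signal, sr=16000, n_mels=40):
    spec = stft_magnitude(signal)                            # spectrogram
    return spec @ mel_filter_bank(n_mels=n_mels, sr=sr).T    # (frames, n_mels)

t = np.arange(16000) / 16000.0                        # one second at 16 kHz
mel = mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))  # tone stands in for speech
```

Here a pure tone stands in for speech; real use would load recorded audio instead.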
As described in step S3 above, pixel rearrangement is performed on each Mel spectrogram to obtain a temporary Mel spectrogram. Rearranging the pixels of the Mel spectrogram is a regularization method based on a kernel filter, i.e. it applies sharpening and blurring to the image. Accordingly, the Mel spectrogram can be divided into a plurality of subset spectrograms by region, and the pixels within each subset spectrogram can be rearranged.
As described in step S4 above, an erasing area is set for each temporary Mel spectrogram according to the picture size of the temporary Mel spectrogram. That is, the erasing area is tied to the picture size of the temporary Mel spectrogram, which prevents so much data from being erased that the training effect of the resulting training data suffers. An area ratio can be set, i.e.
α = S_e / S
wherein α is the area ratio, S_e is the erasing area, and S is the area of the corresponding temporary Mel spectrogram (i.e. the picture size of the temporary Mel spectrogram). This keeps the erasing area controllable, so the new training data does not lose too much information and the training effect is guaranteed.
As described in step S5 above, a random erasing coefficient is introduced, and the shape parameters of the erasing area are set based on the area of the erasing area and the random erasing coefficient. The random erasing coefficient determines the shape parameters of the erasing area, that is, parameters such as its size and shape are set. The region may be a rectangle, a polygon, a circle, and so on; the width and height of the rectangle are described in detail below. For a polygon, parameters such as the side lengths and the angles between the sides need to be set; for a circle, the corresponding radius needs to be set.
As described in step S6, the position of the erasure area in the temporary mel-frequency spectrum is changed or the random erasure coefficient is changed, so as to obtain a plurality of extended mel-frequency spectrums corresponding to the temporary mel-frequency spectrums. Namely, the parameters are changed continuously, and under the condition of meeting the requirements, a plurality of extended Mel frequency spectrograms can be obtained, namely, the data expansion is completed. Wherein, the condition of meeting the requirement means that the erasing area is completely contained in the temporary Mel frequency spectrogram. If the erased area is not completely contained in the temporary Mel-spectral diagram, it is determined that the obtained extended spectral diagram is not satisfactory, and it needs to be deleted.
As described in step S7 above, each extended Mel spectrogram is converted into corresponding target voice training data. Restoring each extended Mel spectrogram serves to expand the original voice training data set; the restoration inverts the operations used to generate the Mel spectrogram. In some embodiments, if the corresponding voice model is trained directly on Mel spectrograms, the extended spectrograms can be input to the model directly without this restoration.
As described in step S8, the target speech training data is supplemented into the speech training data set to obtain an enhanced speech training data set. Therefore, the voice training data are supplemented, the problem that overfitting of the voice model is easy to occur in the training process due to the fact that the voice training data are few is solved, the robustness of the voice model is improved, the voice model is prevented from being overfitting, and the application range of the voice model is greatly enlarged.
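Steps S4 to S6 above (setting the erase area from the picture size, deriving the rectangle's shape from a random coefficient, and varying the position or the coefficient to obtain several extended spectrograms) can be sketched as follows. The area ratio, the coefficient range and zero-filling of the erased region are illustrative assumptions, not values fixed by the patent.

```python
import numpy as np

def augment(mel, alpha=0.1, n_variants=4, rng=None):
    """Produce several extended spectrograms from one temporary mel
    spectrogram by random erasing; the erase area is alpha times the
    picture area, and each variant redraws coefficient and position."""
    rng = np.random.default_rng(rng)
    H, W = mel.shape
    S_e = alpha * H * W                      # erasing area from picture size (S4)
    variants = []
    while len(variants) < n_variants:
        r_e = rng.uniform(0.3, 3.0)          # random erasing coefficient (S5)
        w = int(np.sqrt(S_e / r_e))          # rectangle width
        h = int(np.sqrt(S_e * r_e))          # rectangle height
        if w < 1 or h < 1 or w > W or h > H:
            continue                         # coefficient rejected; redraw
        x = int(rng.integers(0, W - w + 1))  # position of the erase region (S6)
        y = int(rng.integers(0, H - h + 1))
        out = mel.copy()
        out[y:y + h, x:x + w] = 0.0          # erase (zero-fill is an assumption)
        variants.append(out)
    return variants

mels = augment(np.ones((64, 128)), rng=0)    # four extended spectrograms
```

Rejecting and redrawing the coefficient mirrors the requirement that the erase region stay inside the spectrogram.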
In an embodiment, the step S3 of performing pixel rearrangement processing on each mel frequency spectrum map to obtain a temporary mel frequency spectrum map after the pixel rearrangement processing includes:
s301: dividing the Mel frequency spectrogram into a plurality of subset frequency spectrograms;
s302: and randomly selecting a preset number of the subset frequency spectrum diagrams to carry out pixel point random arrangement to obtain a temporary Mel frequency spectrum diagram after pixel point rearrangement processing.
As described in the foregoing steps S301 to S302, pixel rearrangement of the Mel spectrogram is realized: the Mel spectrogram is divided into a plurality of subset spectrograms. Assuming the Mel spectrogram is a 4×4 picture, the subset spectrogram size can be set to 2×2, dividing it into 4 subset spectrograms, and whether each subset spectrogram undergoes pixel rearrangement is decided independently: a given subset spectrogram may or may not be rearranged. Rearrangement randomly permutes the pixels within the subset spectrogram, which effectively applies sharpening and blurring, so the subsequently generated target voice training data is better. The sizes of the subset spectrograms may be the same or different.
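The block-wise rearrangement of steps S301-S302 can be sketched as follows; the tile size, the number of shuffled tiles and the seeded generator are illustrative assumptions.

```python
import numpy as np

def shuffle_blocks(mel, block=2, n_blocks=2, rng=None):
    """Divide the spectrogram into block x block subset tiles and
    randomly permute the pixels inside n_blocks randomly chosen tiles,
    leaving the remaining tiles untouched."""
    rng = np.random.default_rng(rng)
    out = mel.copy()
    h, w = mel.shape
    tiles = [(r, c) for r in range(0, h, block) for c in range(0, w, block)]
    for i in rng.choice(len(tiles), size=n_blocks, replace=False):
        r, c = tiles[i]
        flat = out[r:r + block, c:c + block].flatten()  # copy of the tile
        rng.shuffle(flat)
        out[r:r + block, c:c + block] = flat.reshape(block, block)
    return out

mel = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 "mel spectrogram"
aug = shuffle_blocks(mel, block=2, n_blocks=2, rng=0)
```

Because pixels only move within their own tile, the multiset of pixel values is preserved; only the local arrangement changes.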
In one embodiment, the step S5 of introducing a random erasing coefficient and setting the shape parameters of the erasing area based on the area of the erasing area and the random erasing coefficient includes:
S501: arbitrarily selecting a random parameter r_e from a preset parameter range;
S502: setting the width of the rectangular area according to the formula
W_e = √(S_e / r_e)
and setting the height of the rectangular area according to the formula
H_e = √(S_e · r_e)
wherein S_e is the area of the erasing region, W_e is the width, and H_e is the height.
As described in the above steps S501 to S502, the parameters of the erasing area are set. A random parameter r_e is selected; it cannot be chosen entirely at will, because the resulting width W_e and height H_e must both be smaller than the width and height of the temporary Mel spectrogram. Since S_e is fixed, r_e can therefore be selected within a corresponding range. Setting the width and height of the rectangle according to the formulas fixes the parameters of the erasing area and helps ensure that the rectangular region can be contained by the temporary Mel spectrogram.
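Assuming the reconstructed formulas W_e = √(S_e / r_e) and H_e = √(S_e · r_e), i.e. r_e acting as an aspect-ratio coefficient for a fixed area S_e, the shape computation is a one-liner pair:

```python
import math

def erase_shape(S_e, r_e):
    """Width W_e and height H_e of a rectangular erase region with area
    S_e, where r_e plays the role of the aspect ratio H_e / W_e."""
    W_e = math.sqrt(S_e / r_e)
    H_e = math.sqrt(S_e * r_e)
    return W_e, H_e

W_e, H_e = erase_shape(100.0, 4.0)  # area 100, aspect-ratio coefficient 4
```

By construction W_e · H_e = S_e, so redrawing r_e changes only the shape, never the erased area.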
In one embodiment, after the step S5 of introducing a random erasure coefficient and setting a shape parameter of an erasure area based on the area of the erasure area and the random erasure coefficient, the method further includes:
s601: judging whether a central point of the erasing region exists in the temporary Mel frequency spectrogram based on the shape parameter, so that the erasing region is completely contained by the temporary Mel frequency spectrogram;
s602: and if the central point does not exist, replacing the random erasing coefficient until the central point exists.
As described in the above steps S601 to S602, the random erasing coefficient is replaced by judging, based on the shape parameters, whether a center point of the erasing region exists within the temporary Mel spectrogram such that the erasing region is completely contained by the temporary Mel spectrogram. For example, for a rectangular erasing region, assume the point P = (x_e, y_e) is any point in the temporary Mel spectrogram; the conditions x_e + W_e ≤ W and y_e + H_e ≤ H need to be satisfied, where W is the width of the temporary Mel spectrogram, H is its height, W_e is the width of the rectangular erasing region, and H_e is its height. If such a point exists, a corresponding extended Mel spectrogram can be obtained based on the random erasing coefficient.
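The containment check of steps S601-S602 can be sketched as follows: candidate points are drawn until the rectangle fits, and failure signals that the random erasing coefficient should be replaced. The retry budget and uniform sampling are assumptions.

```python
import random

def place_erase_region(W, H, W_e, H_e, max_tries=100, rng=None):
    """Draw candidate points P = (x_e, y_e) until the erase rectangle is
    completely contained in the W x H spectrogram, i.e. x_e + W_e <= W
    and y_e + H_e <= H; None means the coefficient must be replaced."""
    rng = rng or random.Random()
    for _ in range(max_tries):
        x_e, y_e = rng.uniform(0, W), rng.uniform(0, H)
        if x_e + W_e <= W and y_e + H_e <= H:
            return x_e, y_e
    return None

p = place_erase_region(100, 80, 10, 8, rng=random.Random(0))
```

A rectangle wider or taller than the spectrogram can never satisfy the conditions, so the function returns None and the caller redraws r_e.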
In an embodiment, before the step S8 of supplementing the target speech training data into the speech training data set to obtain an enhanced speech training data set, the method further includes:
S701: inputting each piece of target voice training data into a preset vector machine to obtain a corresponding fixed-dimension target vector X = (x_1, x_2, …, x_i, …, x_n);
S702: calculating, according to the formula
d(X, Y) = (Σ_i s_i |x_i - y_i|^p)^(1/p)
the difference value between each target vector and the voice vector corresponding to the original voice training data, wherein Y = (y_1, y_2, …, y_i, …, y_n) is the multidimensional coordinate corresponding to the original voice training data, x_i is the value of the i-th dimension of the target vector, y_i is the value of the i-th dimension of the corresponding voice vector, s_i is the coefficient of the i-th dimension, and p is a set parameter value;
s703: and deleting the target voice training data with the difference value smaller than the preset difference value.
As described in step S701 above, the vector machine may be a Support Vector Machine (SVM), which yields the corresponding target vector. The support vector machine is trained in advance with a number of voice training data and their corresponding expected vectors.
As stated in step S702 above, the difference value between each target vector and the voice vector corresponding to the original voice training data is calculated according to the formula d(X, Y) = (Σ_i s_i |x_i - y_i|^p)^(1/p). Because every dimension enters the calculation, the obtained difference value is more accurate. It should be noted that the voice vector is likewise computed in advance by the vector machine.
As described in step S703 above, the target voice training data that fails the difference-value check is deleted. A difference value is calculated between the voice vector and the target vector; if the difference value is too large, an important part of the obtained target voice training data has been erased, which would increase the training error of the subsequent voice model, so the data needs to be deleted. To train the voice model better, the target voice training data that do not meet the requirement must be deleted, which ensures the training effect of the subsequent voice model.
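A sketch of the filtering step, assuming the difference value takes the weighted Minkowski form d(X, Y) = (Σ_i s_i |x_i - y_i|^p)^(1/p), which is one reading of the patent's formula (the original formula image is not reproduced here). Following the claim wording, samples whose difference value is below the preset threshold are dropped.

```python
def difference(X, Y, s, p=2):
    """Weighted Minkowski difference value between a target vector X and
    the original voice vector Y; s holds the coefficients s_i."""
    return sum(si * abs(xi - yi) ** p
               for si, xi, yi in zip(s, X, Y)) ** (1.0 / p)

def filter_targets(targets, Y, s, threshold, p=2):
    """Keep only target samples whose difference value reaches the
    preset threshold; the rest are deleted (step S703)."""
    return [X for X in targets if difference(X, Y, s, p) >= threshold]
```

With p = 2 and unit coefficients this reduces to the ordinary Euclidean distance.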
In an embodiment, after the step S8 of supplementing the target speech training data into the speech training data set to obtain an enhanced speech training data set, the method further includes:
s901: converting sample voice data in the enhanced voice training data set into a sample Mel frequency spectrogram;
s902: inputting the sample Mel frequency spectrogram and a preset interference frequency spectrogram into a generation network to obtain a middle Mel frequency spectrogram;
s903: inputting the intermediate Mel frequency spectrogram into a discrimination network to obtain type probability and a prediction label corresponding to the intermediate Mel frequency spectrogram;
s904: and performing alternate iterative training on the generation network and the discrimination network according to the type probability of the middle Mel frequency spectrogram and the prediction label, and taking the trained generation network as a voice model.
As described in the above steps S901 to S904, the sample audio includes unlabeled audio and labeled audio. Labeled audio is audio that carries a definite label, for example labels such as man, woman, girl or boy. Unlabeled audio has no label of its own; for such audio the label is set to 'unknown', indicating that the audio has no determined label. The sample audio can be obtained in various ways, for example by using a web crawler to collect it from the network. Each obtained sample audio is converted into a sample Mel spectrogram with a Mel filter, and each sample Mel spectrogram carries its corresponding label. The sample Mel spectrogram and the interference spectrogram are then input into the generation network to obtain an intermediate Mel spectrogram.
In a specific implementation, the generation network may comprise a preprocessing layer, a down-sampling layer, a bottleneck layer and an up-sampling layer. The preprocessing layer consists of a convolution layer, a batch normalization layer and a nonlinear affine transformation layer; the down-sampling layer consists of several convolution layers and batch-processing layers; the bottleneck layer consists of convolutions with residual connections; the up-sampling layer consists of a dilated convolution and a batch normalization layer.
And inputting the intermediate Mel frequency spectrogram into a discrimination network to obtain the type probability and the prediction label corresponding to the intermediate Mel frequency spectrogram. The type of the output mel frequency spectrum comprises a sample mel frequency spectrum and an interference mel frequency spectrum, and the type probability of the output mel frequency spectrum specifically refers to the probability that the output mel frequency spectrum is the sample mel frequency spectrum. The judgment network is used for judging the probability that the input output Mel frequency spectrum is the sample Mel frequency spectrum and determining the prediction label corresponding to the output Mel frequency spectrum.
In a specific implementation, the backbone of the discrimination network may consist of several nonlinear affine transformations and convolution layers, and the last layer is a pair of linear mappings, one binary and one multi-class; the outputs of the discrimination network are, respectively, the probability that the input is a sample Mel frequency spectrum and the prediction label of the intermediate Mel frequency spectrogram.
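The two output heads described above can be illustrated with a small NumPy sketch. The feature dimension, the four-label setup, and the random weights below are illustrative assumptions, not the patent's actual configuration: the binary head produces the "sample Mel spectrum" probability via a sigmoid, and the multi-class head produces a predicted label via a softmax.

```python
import numpy as np

def discriminator_heads(features, w_bin, b_bin, w_cls, b_cls):
    """Map backbone features to the two outputs: a binary probability
    (sample vs. interference) and a predicted class label."""
    # Binary head: logistic probability that the input is a sample Mel spectrogram.
    p_sample = 1.0 / (1.0 + np.exp(-(features @ w_bin + b_bin)))
    # Multi-class head: softmax over label logits; argmax gives the predicted label.
    logits = features @ w_cls + b_cls
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    return float(p_sample), int(np.argmax(probs)), probs

rng = np.random.default_rng(0)
feat = rng.normal(size=8)  # stand-in for the backbone's output features
p, label, probs = discriminator_heads(
    feat, rng.normal(size=8), 0.0, rng.normal(size=(8, 4)), np.zeros(4))
```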
The intermediate Mel frequency spectrogram output by the generation network is taken as the input of the discrimination network, which yields the probability, predicted by the discrimination network, that the intermediate Mel frequency spectrogram is a sample Mel frequency spectrum, together with its prediction label.
The generation network and the discrimination network are trained by alternate iteration according to the type probability and the prediction label of the intermediate Mel frequency spectrogram. When the training of the two networks is finished, the discrimination network is discarded and the trained generation network is used as the voice model, which completes the model training.
In the process of alternately training the generation network and the discrimination network, the discrimination network is optimized first. At the start of training, the discrimination network can easily tell the interference (noise) Mel spectra and the sample Mel spectra apart among the intermediate Mel spectrograms, which shows that the output of the generation network still deviates greatly from the sample Mel spectra. The generation network is then optimized so that its loss function gradually decreases; during this process, the binary-classification ability of the discrimination network gradually improves, and so does its accuracy in discriminating the intermediate Mel spectrograms output by the generation network. The generation network tries to generate spectrograms as close to real data as possible in order to deceive the discrimination network, while the discrimination network tries as hard as possible to distinguish the sample Mel spectra from the spectrograms generated by the generation network, so the two networks form a dynamic game. Finally, when the discrimination network can no longer judge whether an output Mel spectrogram is a sample Mel spectrum or an interference spectrogram, the training of the generation network is complete, and the trained generation network is used as the voice model.
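The alternating optimization described above can be demonstrated on a deliberately tiny stand-in problem. The sketch below is a toy under heavy assumptions: scalar "spectra" instead of Mel spectrograms, a logistic discriminator, an affine generator, and hand-derived gradients. It only illustrates the discriminator-first alternation and the dynamic game; it is not the patent's network.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Discriminator D(x) = sigmoid(w*x + b); generator G(z) = a*z + c.
w, b = 0.1, 0.0   # discriminator parameters
a, c = 1.0, 0.0   # generator parameters
lr = 0.05

for step in range(2000):
    x = rng.normal(3.0, 1.0)   # a "sample" datum (real distribution has mean 3)
    z = rng.normal()           # interference/noise input
    g = a * z + c              # generated datum

    # 1) Optimize the discriminator first: push D(x) toward 1 and D(g) toward 0.
    d_real, d_fake = sigmoid(w * x + b), sigmoid(w * g + b)
    w += lr * ((1.0 - d_real) * x - d_fake * g)
    b += lr * ((1.0 - d_real) - d_fake)

    # 2) Then optimize the generator: push D(G(z)) toward 1 to deceive D.
    d_fake = sigmoid(w * (a * z + c) + b)
    a += lr * (1.0 - d_fake) * w * z
    c += lr * (1.0 - d_fake) * w

# After the game, the generator's output mean (= c, since E[z] = 0) has
# drifted toward the real mean, so D can no longer separate the two well.
```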
In one embodiment, the step S3 of converting each of the speech training data into a corresponding mel frequency spectrum diagram includes:
S301: performing a Fourier transform on each frame of voice in each voice training data to obtain a voice result corresponding to each frame of voice;

S302: stacking each voice result along one dimension to obtain a corresponding spectrogram;

S303: inputting the spectrogram into a Mel filter bank to obtain the Mel spectrogram.
As described in steps S301 to S303 above, the conversion to a Mel spectrogram is realized. The sound signal is originally a one-dimensional time-domain signal, from which the law of frequency variation is hard to see directly. Converting it to the frequency domain with a Fourier transform reveals the frequency distribution but loses the time-domain information, so the change of the frequency distribution over time cannot be seen. By performing a Fourier transform on each frame of speech, i.e., a short-time Fourier transform, and stacking the per-frame results along another dimension, a two-dimensional, image-like signal is obtained. Since the signal is sound, this two-dimensional signal is a spectrogram; but a spectrogram is often very large, so to obtain sound features of a suitable size it is converted into a Mel spectrogram through a Mel filter bank, i.e., a bank composed of several triangular filters. The filter bank itself is prior art and is not described further here; it realizes the conversion of the spectrogram.
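Steps S301-S303 can be sketched as follows in NumPy. The frame length, hop size, sampling rate, number of Mel bands, and the textbook triangular filter-bank construction below are illustrative assumptions; the patent does not fix any of them.

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    """Standard triangular Mel filter bank (an assumed construction;
    the patent only says 'several triangular filters')."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):                 # rising edge
            fb[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):                # falling edge
            fb[m - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    # S301: Fourier transform of each (windowed) frame of the signal.
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    # S302: stack the per-frame spectra along a second dimension -> spectrogram.
    spec = np.abs(np.array([np.fft.rfft(f) for f in frames])) ** 2
    # S303: project the spectrogram onto the Mel filter bank -> Mel spectrogram.
    return spec @ mel_filterbank(n_mels, n_fft, sr).T   # shape (frames, n_mels)

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
mel = mel_spectrogram(sig)
```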
Referring to fig. 2, the present invention further provides an apparatus for enhancing a speech training data set, including:
an obtaining module 10, configured to obtain a voice training data set;
a first conversion module 20, configured to extract each piece of speech training data from the set of speech training data, and convert each piece of speech training data into a corresponding mel frequency spectrogram;
a rearrangement module 30, configured to perform pixel rearrangement processing on each mel frequency spectrum map to obtain a temporary mel frequency spectrum map after the pixel rearrangement processing;
a setting module 40, configured to set an area of an erasure region for each of the temporary mel frequency spectrums according to the picture sizes of the temporary mel frequency spectrums;
the leading-in module 50 is used for leading in a random erasing coefficient and setting the shape parameter of an erasing area based on the area of the erasing area and the random erasing coefficient;
a changing module 60, configured to change a position of the erasure area in the temporary mel spectrum or change the random erasure coefficient, so as to obtain a plurality of extended mel frequency spectrograms corresponding to each of the temporary mel frequency spectrograms;
a second conversion module 70, configured to convert each extended mel frequency spectrum into corresponding target speech training data;
a supplement module 80, configured to supplement the target speech training data to the speech training data set, so as to obtain an enhanced speech training data set.
In one embodiment, the reordering module 30 includes:
a dividing sub-module for dividing the Mel frequency spectrogram into a plurality of subset frequency spectrograms;
and the random selection submodule is used for randomly selecting a preset number of the subset frequency spectrums to carry out random arrangement on the pixels so as to obtain a temporary Mel frequency spectrum after the pixels are rearranged.
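A minimal sketch of this pair of submodules, assuming a fixed 4x4 grid of sub-blocks and shuffling the pixels inside a preset number of randomly chosen blocks (the grid size and the number of shuffled blocks are illustrative parameters, not fixed by the text):

```python
import numpy as np

def rearrange_pixels(mel, grid=(4, 4), n_shuffle=3, rng=None):
    """Divide the Mel spectrogram into grid sub-blocks and randomly permute
    the pixels inside n_shuffle randomly selected sub-blocks."""
    rng = rng or np.random.default_rng()
    out = mel.copy()
    h, w = mel.shape
    bh, bw = h // grid[0], w // grid[1]
    blocks = [(r * bh, c * bw) for r in range(grid[0]) for c in range(grid[1])]
    for idx in rng.choice(len(blocks), size=n_shuffle, replace=False):
        r0, c0 = blocks[idx]
        block = out[r0:r0 + bh, c0:c0 + bw]
        # Shuffle only within the block, so the overall pixel multiset is kept.
        out[r0:r0 + bh, c0:c0 + bw] = (
            rng.permutation(block.ravel()).reshape(block.shape))
    return out

temp = rearrange_pixels(np.arange(64.0).reshape(8, 8),
                        rng=np.random.default_rng(0))
```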
In one embodiment, the lead-in module 50 includes:
a random parameter selection submodule, for arbitrarily selecting a random parameter r_e from a preset parameter range;

a width setting submodule, for setting the width of the rectangular area according to the formula W_e = √(S_e · r_e), and setting the height of the rectangular area according to the formula H_e = √(S_e / r_e), where S_e is the area of the erasing area, W_e is the width, and H_e is the height.
In one embodiment, the apparatus for enhancing a speech training data set further comprises:
a center point determining module, configured to determine whether a center point of the erased area exists in the temporary mel frequency spectrum based on the shape parameter, so that the erased area is completely contained in the temporary mel frequency spectrum;
and the coefficient replacing module is used for replacing the random erasing coefficient if the central point does not exist until the central point exists.
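Combining the shape-parameter setting with the centre-point check, a sketch might look like the following. The area fraction and the r_e range are illustrative assumptions; the formulas W_e = sqrt(S_e * r_e) and H_e = sqrt(S_e / r_e) are the reading consistent with S_e being the region's area (their product is S_e), reconstructed from the image-only formulas in the text.

```python
import math
import random

def erase_region(img_h, img_w, area_frac=0.1, r_range=(0.3, 3.3), rng=None):
    """Pick an erasing rectangle: sample r_e, derive W_e and H_e from the
    region area S_e, and replace r_e (and resample the centre) until the
    rectangle fits entirely inside the spectrogram."""
    rng = rng or random.Random()
    s_e = area_frac * img_h * img_w      # erasing-region area from picture size
    while True:
        r_e = rng.uniform(*r_range)      # random erasing coefficient
        w_e = math.sqrt(s_e * r_e)       # W_e = sqrt(S_e * r_e)
        h_e = math.sqrt(s_e / r_e)       # H_e = sqrt(S_e / r_e)
        if w_e >= img_w or h_e >= img_h:
            continue                     # no valid centre point: replace r_e
        # Any centre in this range keeps the region fully inside the image.
        cx = rng.uniform(w_e / 2, img_w - w_e / 2)
        cy = rng.uniform(h_e / 2, img_h - h_e / 2)
        return cx, cy, w_e, h_e

cx, cy, w_e, h_e = erase_region(64, 128, rng=random.Random(0))
```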
In one embodiment, the apparatus for enhancing a speech training data set further comprises:
an input module, configured to input each of the target speech training data into a preset vector machine to obtain a target vector of fixed dimension X = (x_1, x_2, …, x_i, …, x_n);

a calculation module, configured to calculate the difference value between each target vector and the speech vector corresponding to the original speech training data according to the formula d = (Σ_{i=1}^{n} s_i · |x_i − y_i|^p)^{1/p}; wherein Y is the multidimensional coordinate corresponding to the original speech training data, Y = (y_1, y_2, …, y_i, …, y_n), x_i represents the value of the i-th dimension in the target vector, y_i represents the value of the i-th dimension in the corresponding speech vector, s_i is the coefficient corresponding to the i-th dimension, and p is a set parameter value;
and the deleting module is used for deleting the target voice training data with the difference value smaller than the preset difference value.
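Reading the image-only formula as a weighted Minkowski distance (an assumption consistent with the symbols x_i, y_i, s_i, and p in the surrounding text), the filtering performed by these modules might be sketched as follows; the vector machine is omitted, and the vectors, weights, and threshold are illustrative.

```python
import numpy as np

def difference(x, y, s, p=2.0):
    """Weighted Minkowski distance: d = (sum_i s_i * |x_i - y_i|**p) ** (1/p)."""
    return float(np.sum(s * np.abs(np.asarray(x) - np.asarray(y)) ** p)
                 ** (1.0 / p))

def filter_targets(targets, original, s, p=2.0, min_diff=0.5):
    """Drop augmented vectors whose difference from the original sample is
    below the preset threshold (they add little diversity)."""
    return [t for t in targets if difference(t, original, s, p) >= min_diff]

orig = np.array([1.0, 2.0, 3.0])
weights = np.ones(3)                       # per-dimension coefficients s_i
kept = filter_targets([orig + 0.01, orig + 2.0], orig, weights)
```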
In one embodiment, the apparatus for enhancing a speech training data set further comprises:
a conversion module for converting the sample voice data in the enhanced voice training data set into a sample mel frequency spectrum diagram;
the generation network input module is used for inputting the sample Mel frequency spectrogram and a preset interference frequency spectrogram into a generation network to obtain a middle Mel frequency spectrogram;
the judgment network input module is used for inputting the middle Mel frequency spectrogram into a judgment network to obtain type probability and a prediction label corresponding to the middle Mel frequency spectrogram;
and the iterative training module is used for performing alternate iterative training on the generation network and the discrimination network according to the type probability of the middle Mel frequency spectrogram and the prediction label, and taking the trained generation network as a voice model.
In one embodiment, the first conversion module 20 includes:
the transform submodule is used for performing a Fourier transform on each frame of voice in each voice training data to obtain a voice result corresponding to each frame of voice;
the stacking submodule is used for stacking each voice result along one dimension to obtain a corresponding spectrogram;
and the spectrogram input sub-module is used for inputting the spectrogram into a Mel filter bank to obtain the Mel spectrogram.
The invention has the following beneficial effects: the Mel spectrogram corresponding to each piece of voice training data is extracted and subjected to pixel rearrangement to obtain temporary Mel spectrograms; the area and shape parameters of an erasing region are set for each temporary Mel spectrogram; by changing the position of the erasing region or the random erasing coefficient, several extended Mel spectrograms are obtained, and each extended Mel spectrogram is converted into corresponding target voice training data. The voice training data is thereby supplemented, which alleviates the over-fitting that scarce voice training data easily causes during training, increases the robustness of the voice model, prevents over-fitting of the voice model, and greatly widens the range of application of the voice model.
Referring to fig. 3, a computer device, which may be a server and whose internal structure may be as shown in fig. 3, is also provided in the embodiment of the present application. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computation and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The nonvolatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the nonvolatile storage medium. The database of the computer device is used for storing various voice training data and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, can implement the method of enhancing a speech training data set described in any of the above embodiments.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is only a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects may be applied.
The present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, can implement the method for enhancing a speech training data set according to any of the above embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by instructing the relevant hardware through a computer program, which may be stored on a nonvolatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other medium provided herein and used in the examples may include nonvolatile and/or volatile memory. Nonvolatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, apparatus, article, or method that includes that element.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism and an encryption algorithm. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
The blockchain underlying platform can comprise processing modules such as user management, basic services, smart contracts, and operation monitoring. The user management module is responsible for the identity management of all blockchain participants, including maintenance of public and private key generation (account management), key management, and maintenance of the correspondence between a user's real identity and blockchain address (authority management); with authorization, it can supervise and audit the transactions of certain real identities and provide rule configuration for risk control (risk-control audit). The basic service module is deployed on all blockchain node devices to verify the validity of service requests and, after consensus on a valid request is reached, record it to storage; for a new service request, the basic service first performs interface adaptation, parsing, and authentication (interface adaptation), then encrypts the service information through a consensus algorithm (consensus management), transmits it completely and consistently to the shared ledger (network communication), and records it for storage. The smart contract module is responsible for contract registration and issuance, contract triggering, and contract execution; developers can define contract logic through a programming language, publish it to the blockchain (contract registration), and, according to the logic of the contract terms, invoke keys or other events to trigger execution, complete the contract logic, and also provide functions for upgrading and canceling contracts. The operation monitoring module is mainly responsible for deployment during product release, configuration modification, contract setting, cloud adaptation, and visual output of real-time status in product operation, such as alarms, monitoring of network conditions, and monitoring of node device health status.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (10)

1. A method for enhancing a speech training data set, comprising:
acquiring a voice training data set;
extracting each voice training data from the voice training data set, and converting each voice training data into a corresponding Mel frequency spectrogram;
performing pixel point rearrangement processing on each Mel frequency spectrum graph to obtain a temporary Mel frequency spectrum graph after the pixel point rearrangement processing;
setting the area of an erasing area for each temporary Mel frequency spectrogram according to the picture size of the temporary Mel frequency spectrogram;
introducing a random erasing coefficient, and setting shape parameters of an erasing area based on the area of the erasing area and the random erasing coefficient;
changing the position of the erasing area in the temporary Mel frequency spectrum or changing the random erasing coefficient to obtain a plurality of extended Mel frequency spectrums corresponding to the temporary Mel frequency spectrums;
converting each extended Mel frequency spectrogram into corresponding target voice training data;
and supplementing the target voice training data into the voice training data set to obtain an enhanced voice training data set.
2. The method of enhancing a speech training data set according to claim 1, wherein said step of performing pixel rearrangement processing on each said mel frequency spectrum map to obtain a temporary mel frequency spectrum map after the pixel rearrangement processing comprises:
dividing the Mel frequency spectrogram into a plurality of subset frequency spectrograms;
and randomly selecting a preset number of the subset frequency spectrum diagrams to carry out pixel point random arrangement to obtain a temporary Mel frequency spectrum diagram after pixel point rearrangement processing.
3. The method of enhancing a speech training data set according to claim 1, wherein the erasing area is a rectangular area, and the step of introducing a random erasing coefficient and setting a shape parameter of the erasing area based on the area of the erasing area and the random erasing coefficient comprises:
random parameter r is arbitrarily selected from preset parameter rangee
According to the formula
Figure FDA0003095842690000021
Setting the width of the rectangular area, and according to the formula
Figure FDA0003095842690000022
Setting the height of the rectangular area, wherein SeIs the area of the erasing region, WeIs the width HeIs the height.
4. The method of enhancing a speech training data set according to claim 3, wherein said step of introducing a random erasure coefficient and setting a shape parameter of an erasure region based on said erasure region area and said random erasure coefficient further comprises:
judging whether a central point of the erasing region exists in the temporary Mel frequency spectrogram based on the shape parameter, so that the erasing region is completely contained by the temporary Mel frequency spectrogram;
and if the central point does not exist, replacing the random erasing coefficient until the central point exists.
5. The method of enhancing a speech training data set according to claim 1, wherein, before the step of supplementing the target speech training data into the speech training data set to obtain the enhanced speech training data set, the method further comprises:
inputting each of the target speech training data into a preset vector machine to obtain a target vector of fixed dimension X = (x_1, x_2, …, x_i, …, x_n);

calculating the difference value between each target vector and the speech vector corresponding to the original speech training data according to the formula d = (Σ_{i=1}^{n} s_i · |x_i − y_i|^p)^{1/p}; wherein Y is the multidimensional coordinate corresponding to the original speech training data, Y = (y_1, y_2, …, y_i, …, y_n), x_i represents the value of the i-th dimension in the target vector, y_i represents the value of the i-th dimension in the corresponding speech vector, s_i is the coefficient corresponding to the i-th dimension, and p is a set parameter value;
and deleting the target voice training data with the difference value smaller than the preset difference value.
6. The method of enhancing a speech training data set according to claim 1, wherein the step of supplementing the target speech training data into the speech training data set to obtain the enhanced speech training data set further comprises:
converting sample voice data in the enhanced voice training data set into a sample Mel frequency spectrogram;
inputting the sample Mel frequency spectrogram and a preset interference frequency spectrogram into a generation network to obtain a middle Mel frequency spectrogram;
inputting the intermediate Mel frequency spectrogram into a discrimination network to obtain type probability and a prediction label corresponding to the intermediate Mel frequency spectrogram;
and performing alternate iterative training on the generation network and the discrimination network according to the type probability of the middle Mel frequency spectrogram and the prediction label, and taking the trained generation network as a voice model.
7. The method of enhancing a set of speech training data according to claim 1, wherein said step of converting each of said speech training data into a corresponding mel spectrum comprises:
performing Fourier transformation on each frame of voice in each voice training data to obtain a voice result corresponding to each frame of voice;
stacking each voice result along one dimension to obtain a corresponding spectrogram;
and inputting the spectrogram into a Mel filter bank to obtain the Mel spectrogram.
8. An apparatus for enhancing a speech training data set, comprising:
the acquisition module is used for acquiring a voice training data set;
the first conversion module is used for extracting each voice training data from the voice training data set and converting each voice training data into a corresponding Mel frequency spectrogram;
the rearrangement module is used for carrying out pixel point rearrangement processing on each Mel frequency spectrum map to obtain a temporary Mel frequency spectrum map after the pixel point rearrangement processing;
the setting module is used for setting the area of an erasing area for each temporary Mel frequency spectrum according to the picture size of the temporary Mel frequency spectrum;
the introduction module is used for introducing a random erasing coefficient and setting the shape parameter of an erasing area based on the area of the erasing area and the random erasing coefficient;
a changing module, configured to change a position of the erasure area in the temporary mel spectrum or change the random erasure coefficient, so as to obtain a plurality of extended mel frequency spectrograms corresponding to the temporary mel frequency spectrograms;
the second conversion module is used for converting each extended Mel frequency spectrogram into corresponding target voice training data;
and the supplement module is used for supplementing the target voice training data into the voice training data set to obtain an enhanced voice training data set.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202110610940.2A 2021-06-01 2021-06-01 Enhancement method, device, equipment and storage medium for voice training data set Active CN113241062B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110610940.2A CN113241062B (en) 2021-06-01 2021-06-01 Enhancement method, device, equipment and storage medium for voice training data set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110610940.2A CN113241062B (en) 2021-06-01 2021-06-01 Enhancement method, device, equipment and storage medium for voice training data set

Publications (2)

Publication Number Publication Date
CN113241062A true CN113241062A (en) 2021-08-10
CN113241062B CN113241062B (en) 2023-12-26

Family

ID=77136176

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110610940.2A Active CN113241062B (en) 2021-06-01 2021-06-01 Enhancement method, device, equipment and storage medium for voice training data set

Country Status (1)

Country Link
CN (1) CN113241062B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113742775A (en) * 2021-09-08 2021-12-03 哈尔滨工业大学(深圳) Image data security detection method, system and storage medium
CN115294960A (en) * 2022-07-22 2022-11-04 网易有道信息技术(北京)有限公司 Vocoder training method, voice synthesis method and related products

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170040016A1 (en) * 2015-04-17 2017-02-09 International Business Machines Corporation Data augmentation method based on stochastic feature mapping for automatic speech recognition
CN111081259A (en) * 2019-12-18 2020-04-28 苏州思必驰信息科技有限公司 Speech recognition model training method and system based on speaker expansion
CN111161740A (en) * 2019-12-31 2020-05-15 中国建设银行股份有限公司 Intention recognition model training method, intention recognition method and related device
CN111370002A (en) * 2020-02-14 2020-07-03 平安科技(深圳)有限公司 Method and device for acquiring voice training sample, computer equipment and storage medium
US20200335086A1 (en) * 2019-04-19 2020-10-22 Behavioral Signal Technologies, Inc. Speech data augmentation
US20210035563A1 (en) * 2019-07-30 2021-02-04 Dolby Laboratories Licensing Corporation Per-epoch data augmentation for training acoustic models
CN112435656A (en) * 2020-12-11 2021-03-02 平安科技(深圳)有限公司 Model training method, voice recognition method, device, equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113742775A (en) * 2021-09-08 2021-12-03 哈尔滨工业大学(深圳) Image data security detection method, system and storage medium
CN113742775B (en) * 2021-09-08 2023-07-28 哈尔滨工业大学(深圳) Image data security detection method, system and storage medium
CN115294960A (en) * 2022-07-22 2022-11-04 网易有道信息技术(北京)有限公司 Vocoder training method, voice synthesis method and related products

Also Published As

Publication number Publication date
CN113241062B (en) 2023-12-26

Similar Documents

Publication Publication Date Title
CN111316352B (en) Speech synthesis method, device, computer equipment and storage medium
WO2020177380A1 (en) Voiceprint detection method, apparatus and device based on short text, and storage medium
Lan et al. An extreme learning machine approach for speaker recognition
CN113241062B (en) Enhancement method, device, equipment and storage medium for voice training data set
Dua et al. LSTM and CNN based ensemble approach for spoof detection task in automatic speaker verification systems
CN110853656B (en) Audio tampering identification method based on improved neural network
CN108922544A (en) General vector training method, voice clustering method, device, equipment and medium
JP7140317B2 (en) Method for learning data embedding network that generates marked data by synthesizing original data and mark data, method for testing, and learning device using the same
Ogunfunmi et al. A primer on deep learning architectures and applications in speech processing
Fong Using hierarchical time series clustering algorithm and wavelet classifier for biometric voice classification
CN115083435A (en) Audio data processing method and device, computer equipment and storage medium
CN117672176A (en) Rereading controllable voice synthesis method and device based on voice self-supervision learning characterization
CN113869212B (en) Multi-mode living body detection method, device, computer equipment and storage medium
CN113611315A (en) Voiceprint recognition method and device based on lightweight convolutional neural network
CN115171666A (en) Speech conversion model training method, speech conversion method, apparatus and medium
CN113554047A (en) Training method of image processing model, image processing method and corresponding device
Bhattacharyya et al. Normalizing flows with multi-scale autoregressive priors
CN111933154B (en) Method, equipment and computer readable storage medium for recognizing fake voice
Shah et al. Speech recognition using spectrogram-based visual features
Singh et al. Short duration voice data speaker recognition system using novel fuzzy vector quantization algorithm
CN116778946A (en) Separation method of vocal accompaniment, network training method, device and storage medium
CN112906637B (en) Fingerprint image identification method and device based on deep learning and electronic equipment
CN112992177B (en) Training method, device, equipment and storage medium of voice style migration model
CN112308150B (en) Target detection model training method and device, computer equipment and storage medium
CN114822497A (en) Method, apparatus, device and medium for training speech synthesis model and speech synthesis

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant