CN114512111A - Model training method and device, terminal equipment and computer readable storage medium - Google Patents

Model training method and device, terminal equipment and computer readable storage medium Download PDF

Info

Publication number
CN114512111A
Authority
CN
China
Prior art keywords
training
model
audio
preset
dynamic range
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111682661.3A
Other languages
Chinese (zh)
Inventor
丁万
黄东延
赵之源
杨志勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Ubtech Technology Co ltd
Original Assignee
Shenzhen Ubtech Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Ubtech Technology Co ltd filed Critical Shenzhen Ubtech Technology Co ltd
Priority to CN202111682661.3A priority Critical patent/CN114512111A/en
Publication of CN114512111A publication Critical patent/CN114512111A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 Vocoder architecture
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The application is applicable to the technical field of data processing, and provides a model training method, a device, a terminal device and a computer readable storage medium, comprising the following steps: training a preset model according to a first sample set which accords with a first audio dynamic range to obtain the preset model after the first training; if the first model precision of the preset model on the verification set after the first training is smaller than the target precision, expanding the first audio dynamic range to obtain a second audio dynamic range; and continuing training the preset model after the first training according to a second sample set which accords with the second audio dynamic range until a target model meeting the target precision is obtained. By the method, the convergence effect of the deep learning model can be effectively ensured, and the training difficulty of the deep learning model is reduced.

Description

Model training method and device, terminal equipment and computer readable storage medium
Technical Field
The present application belongs to the technical field of data processing, and in particular, to a model training method, an apparatus, a terminal device, and a computer-readable storage medium.
Background
With the development of deep learning, its fields of application have become increasingly broad. For example, in the field of audio processing, speech synthesis may be performed using a deep learning model: acoustic features are input into the deep learning model, and the corresponding synthesized speech is output. Before application, the deep learning model needs to be trained.
In the existing model training method, audio samples are generally selected at random, and the deep learning model is then trained using those audio samples. High-quality (high-sampling-rate) audio samples contain very rich dynamic details, to which the model parameters of the deep learning model are sensitive; as a result, the deep learning model is difficult to converge, which increases the training difficulty of the deep learning model.
Disclosure of Invention
The embodiment of the application provides a model training method, a model training device, terminal equipment and a computer readable storage medium, which can effectively ensure the convergence effect of a deep learning model and reduce the training difficulty of the deep learning model.
In a first aspect, an embodiment of the present application provides a model training method, including:
training a preset model according to a first sample set which accords with a first audio dynamic range to obtain the preset model after the first training;
if the first model precision of the preset model on the verification set after the first training is smaller than the target precision, expanding the first audio dynamic range to obtain a second audio dynamic range;
and continuing training the preset model after the first training according to a second sample set which accords with the second audio dynamic range until a target model meeting the target precision is obtained.
In the embodiment of the application, the preset model is trained according to the sample set with the smaller audio dynamic range, and then the preset model is trained continuously according to the sample set with the larger audio dynamic range. The audio samples with higher sampling rate correspond to a larger audio dynamic range, which contains richer dynamic details, while the audio samples with smaller audio dynamic range are equivalent to blurring the dynamic details. By the method, the preset model is equivalently made to learn simpler features first and then learn more complex features, so that the problem that the model is difficult to converge due to random selection of training samples is solved, and the training difficulty of the deep learning model is effectively reduced.
In a possible implementation manner of the first aspect, the training a preset model according to a first sample set conforming to a first audio dynamic range to obtain the preset model after the first training includes:
calculating a third audio dynamic range of the audio samples in the preset audio library;
and adding the audio samples meeting a preset condition in the preset audio library into the first sample set, wherein the preset condition is that the third audio dynamic range is smaller than the first audio dynamic range.
And training the preset model according to the first sample set to obtain the preset model after the first training.
In a possible implementation manner of the first aspect, the adding, to the first sample set, audio samples in the preset audio library that meet a preset condition includes:
if the audio samples meeting the preset conditions do not exist in the preset audio library, compressing the audio samples in the preset audio library into the audio samples meeting the preset conditions;
adding audio samples satisfying the preset condition to the first set of samples.
In a possible implementation manner of the first aspect, the adding, to the first sample set, audio samples in the preset audio library that meet a preset condition includes:
acquiring audio samples meeting the preset conditions from the preset audio library to obtain a first set;
calculating a training value for each of the audio samples in the first set;
and adding the audio samples with the training values larger than a preset value in the first set into the first sample set.
In a possible implementation manner of the first aspect, the calculating a training value of each audio sample in the first set includes:
obtaining the second model precision of the current preset model on the verification set;
and calculating the training value of each audio sample in the first set according to the second model precision and the third audio dynamic range of each audio sample in the first set.
In a possible implementation manner of the first aspect, the training the preset model according to the first sample set to obtain the preset model after the first training includes:
extracting the acoustic features of the audio samples in the first sample set and the actual values of the audio sampling points corresponding to the acoustic features;
inputting the acoustic features into the preset model, and outputting a predicted value of an audio sampling point;
calculating a first loss value of the preset model according to the actual value and the predicted value of the audio sampling point corresponding to the acoustic feature;
if the first loss value is larger than a first threshold value, updating the model parameters of the preset model according to the first loss value;
and if the first loss value is smaller than a first threshold value, determining the current preset model as the preset model after the first training.
In a possible implementation manner of the first aspect, after training a preset model according to a first sample set conforming to a first audio dynamic range to obtain the preset model after the first training, the method further includes:
and if the first model precision of the preset model after the first training on the verification set is equal to or greater than the target precision, determining the preset model after the first training as the target model.
In a second aspect, an embodiment of the present application provides a model training apparatus, including:
the first training unit is used for training a preset model according to a first sample set which accords with a first audio dynamic range to obtain the preset model after the first training;
the range expansion unit is used for expanding the first audio dynamic range to obtain a second audio dynamic range if the first model precision of the preset model on the verification set after the first training is smaller than the target precision;
and the second training unit is used for continuously training the preset model after the first training according to a second sample set which accords with the second audio dynamic range until a target model meeting the target precision is obtained.
In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the model training method according to any one of the above first aspects.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium in which a computer program is stored, where the computer program, when executed by a processor, implements the model training method according to any one of the above first aspects.
In a fifth aspect, an embodiment of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to execute the model training method according to any one of the above first aspects.
It is understood that the beneficial effects of the second aspect to the fifth aspect can be referred to the related description of the first aspect, and are not described herein again.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
FIG. 1 is a schematic flow chart diagram illustrating a model training method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a training process provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise.
The input to a vocoder is acoustic features, and the output is the corresponding synthesized speech. The relationship between acoustic features and speech is a complex non-linear mapping; therefore, the neural network vocoder has become one of the mainstream vocoder technologies owing to its strong non-linear function fitting capability. A neural network vocoder can produce more natural synthesized speech, and common neural network models include the recurrent neural network (RNN), the gated recurrent unit (GRU) network, and the long short-term memory (LSTM) network.
In the existing neural network vocoder training method, audio samples are collected in advance, data pairs of acoustic features and audio are generated, and a network model is then trained on the basis of a loss function such as cross entropy. The training method randomly selects audio samples from a training data set (a preset voice library) and adjusts the network weights according to the loss. Since the dynamic details of high-quality audio (e.g., with a sampling rate greater than 16k) are very rich, and commonly used models such as the RNN are mainly based on a recursive structure, a small weight adjustment may cause a large change in the model as a whole. Therefore, in practice, the neural network vocoder is difficult to converge, which increases the difficulty of model training.
In order to solve the above problem, an embodiment of the present application provides a model training method. Referring to fig. 1, which is a schematic flow chart of the model training method provided in an embodiment of the present application, by way of example and not limitation, the method may include the following steps:
s101, training a preset model according to a first sample set which accords with a first audio dynamic range, and obtaining the preset model after the first training.
Dynamic range is the ratio of the maximum and minimum values of a variable signal (e.g., sound or light, etc.). It can be expressed in base 10 logarithm (decibels) or base 2 logarithm. The audio dynamic range in the embodiment of the present application refers to the ratio of the maximum value and the minimum value of the audio. The first audio dynamic range may be an empirically preset audio dynamic range.
In one embodiment, S101 may include:
s1011, calculating a third audio dynamic range of the audio samples in the preset audio library.
The audio sample in the embodiment of the application may be an audio clip, represented as an audio sampling point sequence obtained by sampling the audio clip. In particular, the third audio dynamic range of an audio sample can be calculated according to a formula of the form

    r(z) = log( max(z) / min(z) )

where z is the sequence of audio sampling points to which the audio sample corresponds, max(z) and min(z) are the maximum and minimum values of that sequence, and the logarithm may be taken to base 10 (decibels) or base 2, as noted above.
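For illustration only, a minimal Python sketch of this per-sample dynamic range computation (the base-2 logarithm, the epsilon floor, and the function name are assumptions made for the example, not taken from the disclosure):

    import numpy as np

    def audio_dynamic_range(z: np.ndarray, eps: float = 1e-8) -> float:
        """Dynamic range of one audio sample, taken here as the base-2 log of the
        ratio between the largest and smallest non-negligible absolute amplitudes."""
        amp = np.abs(z.astype(np.float64))
        amp = amp[amp > eps]              # drop silent points so the ratio is well defined
        if amp.size == 0:
            return 0.0
        return float(np.log2(amp.max() / amp.min()))

An audio clip loaded as an array of sampling points can then be screened against the first audio dynamic range before being added to the first sample set.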
And S1012, adding the audio samples meeting preset conditions in the preset audio library into the first sample set.
Wherein the preset condition is that the third audio dynamic range is smaller than the first audio dynamic range. Specifically, comparing a third audio dynamic range of the audio samples in the preset audio library with the first audio dynamic range; and if the third audio dynamic range is smaller than the first audio dynamic range, adding the audio samples corresponding to the third audio dynamic range into the first sample set.
As can be seen from the above, in practical applications, there may be a case where there are no audio samples satisfying the preset condition in the preset audio library. To solve this problem, optionally, S1012 further includes:
if the audio samples meeting the preset conditions do not exist in the preset audio library, compressing the audio samples in the preset audio library into the audio samples meeting the preset conditions; adding audio samples satisfying the preset condition to the first set of samples.
The audio samples are compressed, for example, by reducing 16-bit audio data to 8-bit data; more generally, compression here means discretizing, i.e. quantizing, the amplitude of the sampled signal. Commonly used quantization methods are uniform quantization and non-uniform quantization. Uniform quantization applies the same quantization step to both small signals and large signals within the coding range, so that the signal-to-quantization-noise ratio is small for small signals and large for large signals, which is disadvantageous for small signals. To increase the signal-to-noise ratio of small signals, the quantization step can be subdivided, which also increases the signal-to-noise ratio of large signals; however, the digital code rate then increases as well, requiring transmission over a channel with a wider frequency band. Non-uniform quantization solves this problem of uniform quantization. Its basic idea is to compress large signals and amplify small signals substantially. Because the amplitudes of small signals are amplified more (i.e., more quantization levels are allocated to small signals and fewer to large signals), the signal-to-noise ratio of small signals is greatly improved.
In the embodiment of the present application, the purpose of compressing the audio samples is to reduce the amplitude of large signals and amplify the amplitude of small signals, thereby reducing the data precision of the audio samples and blurring their dynamic detail features; therefore, a non-uniform quantization method is optionally adopted. Existing non-uniform quantization methods include A-law compression and μ-law (MU-law) compression.
Preferably, a μ-law (MU-law) compression method is used, which can improve the signal-to-noise ratio without adding more data. The principle of the method is as follows:

    t_out = sign(t_in) · ln(1 + μ·|t_in|) / ln(1 + μ)

where t_in is the quantizer input (normalized to the range [-1, 1]) and t_out is the quantizer output. The larger the constant μ, the higher the companding benefit for small signals.
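As an illustration of this companding step, a minimal Python sketch (the 16-bit input, 8-bit output, and μ = 255 are assumed example values, not requirements taken from the disclosure):

    import numpy as np

    def mu_law_compress(pcm16: np.ndarray, mu: float = 255.0, bits: int = 8) -> np.ndarray:
        """Compand 16-bit PCM sampling points to `bits`-bit codes with mu-law."""
        x = pcm16.astype(np.float64) / 32768.0                  # normalize input to [-1, 1)
        y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
        levels = 2 ** bits
        codes = np.round((y + 1.0) / 2.0 * (levels - 1))        # map [-1, 1] onto 0 .. levels-1
        return np.clip(codes, 0, levels - 1).astype(np.int32)

Samples compressed in this way have a smaller audio dynamic range and can then be added to the first sample set when the preset audio library contains no samples that already satisfy the preset condition.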
In the embodiment of the application, if the audio samples meeting the preset conditions exist in the preset audio library, a plurality of audio samples meeting the preset conditions can be randomly selected and added into the first sample set.
In order to ensure the training effect of the model, optionally, the selection may be performed according to the training value of the audio sample, and specifically includes:
acquiring audio samples meeting the preset conditions from the preset audio library to obtain a first set; calculating a training value for each of the audio samples in the first set; and adding the audio samples with the training values larger than a preset value in the first set into the first sample set.
Further, the calculation method of the training value comprises the following steps:
obtaining the second model precision of the current preset model on the verification set; and calculating the training value of each audio sample in the first set according to the second model precision and the third audio dynamic range of each audio sample in the first set.
In particular, the training value of an audio sample can be calculated according to a formula of the form

    p_learn(x) = (1/Z) · r(x) · Loss(Val | M) + δ

where x is the audio sample, p_learn(x) is the probability of selecting sample x for model training (i.e., its training value), Z is a preset normalization coefficient, r(x) is the audio dynamic range of sample x, and Loss(Val | M) is the performance evaluation (such as a loss value) of the current preset model M on the verification set Val. The term δ is a random value drawn from the interval [-ε, ε], where ε is a value close to 0 chosen empirically; δ is introduced to act as a compensation mechanism for possible limitations of the probabilistic computation model. For a sample x, if p_learn(x) is larger than the preset value, the sample x is added to the first sample set.
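A minimal sketch of selecting samples by training value, assuming the multiplicative form given above and uniform noise for the compensation term (both assumptions; the helper names are illustrative):

    import numpy as np

    def training_value(dyn_range: float, val_loss: float, Z: float, eps: float = 1e-3) -> float:
        """p_learn for one candidate: dynamic range r(x) times the current
        validation loss Loss(Val|M), normalized by Z, plus noise from [-eps, eps]."""
        return dyn_range * val_loss / Z + np.random.uniform(-eps, eps)

    def select_for_first_sample_set(candidates, val_loss, Z, preset_value):
        """Keep candidates whose training value exceeds the preset value.
        `candidates` is an iterable of (audio, dyn_range) pairs."""
        return [audio for audio, r in candidates
                if training_value(r, val_loss, Z) > preset_value]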
By the method, the audio samples are selected according to the training value, so that the problem of difficulty in training caused by random selection can be avoided, the model training difficulty is further reduced, and the model training effect is enhanced.
S1013, training the preset model according to the first sample set to obtain the preset model after the first training.
Optionally, the training process in S1013 includes:
extracting the acoustic features of the audio samples in the first sample set and the actual values of the audio sampling points corresponding to the acoustic features; inputting the acoustic features into the preset model, and outputting a predicted value of an audio sampling point; calculating a first loss value of the preset model according to the actual value and the predicted value of the audio sampling point corresponding to the acoustic feature; if the first loss value is larger than a first threshold value, updating the model parameters of the preset model according to the first loss value; and if the first loss value is smaller than a first threshold value, determining the current preset model as the preset model after the first training.
Optionally, the acoustic feature may be a mel-frequency cepstrum. The preset model may be an RNN model, a GRU model, an LSTM model, or the like. The loss function used to calculate the loss value may be a mean square error function, a cross entropy function, or the like. The model parameters may be updated from the loss value using a steepest descent method, a gradient method, or the like. The specific model training process is not specifically limited herein.
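Purely as an illustrative sketch of this training step (the SGD optimizer, the mean square error loss, and the helper name `extract_features` are assumptions; the disclosure does not fix a framework or these names):

    import torch
    import torch.nn as nn

    def first_training(model: nn.Module, first_sample_set, extract_features,
                       first_threshold: float, lr: float = 1e-3) -> nn.Module:
        """First-stage training (S1013): stop once the loss falls below the first threshold."""
        criterion = nn.MSELoss()                       # mean square error, one of the options named above
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        for audio in first_sample_set:
            feats, target = extract_features(audio)    # acoustic features and actual sampling-point values
            pred = model(feats)                        # predicted sampling-point values
            loss = criterion(pred, target)
            if loss.item() < first_threshold:          # first training is considered complete
                break
            optimizer.zero_grad()
            loss.backward()                            # update model parameters from the first loss value
            optimizer.step()
        return model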
The first threshold in this step is a threshold of training accuracy that can be achieved by training the preset model based on the first sample set. As described above, the sample set corresponding to the larger audio dynamic range has more quantization levels of large signals and more detail features of audio samples, and accordingly, the model can learn some more complex feature relationships; the sample set corresponding to the smaller audio dynamic range has fewer quantization levels of large signals and fewer detail features of audio samples, and accordingly, the model can only learn some simpler feature relationships. Therefore, the preset model is trained according to the sample sets corresponding to different audio dynamic ranges, and the obtained training precision is different. The larger the audio dynamic range is, the higher the training precision of the preset model after the corresponding sample set is trained is.
S102, if the first model precision of the preset model after the first training on the verification set is smaller than the target precision, the first audio dynamic range is expanded, and a second audio dynamic range is obtained.
And if the first model precision of the preset model after the first training on the verification set is equal to or greater than the target precision, determining the preset model after the first training as the target model.
In the embodiment of the application, a training mode of sample set training and verification set verification is adopted.
The increment of each expansion of the audio dynamic range can be preset according to the requirement. The larger the increment is, the larger the learning span of the model on the characteristics is, and the training times of the model are relatively less; the smaller the increment, the smaller the learning span of the model to the features, the relatively more times the model is trained, and the easier the model is to converge.
Illustratively, the first audio dynamic range is 2^8 and the second audio dynamic range is 2^10.
S103, continuing training the preset model after the first training according to a second sample set which accords with the second audio dynamic range until a target model meeting the target precision is obtained.
The specific step of training the preset model according to the second sample set is the same as that in S101, and reference may be specifically made to the description in S101, which is not described herein again.
If the model precision of the preset model obtained after the second training (i.e., after continuing to train the first-trained preset model according to the second sample set) is still smaller than the target precision, S102 to S103 are repeated until the model precision of the trained preset model on the verification set meets the target precision.
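A minimal sketch of this overall loop of S101 to S103 (the helper names and the expansion schedule, here quadrupling so that 2^8 grows to 2^10, are assumptions for illustration only):

    def curriculum_train(model, audio_library, validation_set, target_precision,
                         build_sample_set, train_stage, evaluate_precision,
                         initial_range=2 ** 8, growth=4, max_stages=10):
        """Train with progressively expanded audio dynamic ranges until the
        model precision on the verification set meets the target precision."""
        audio_range = initial_range
        for _ in range(max_stages):
            sample_set = build_sample_set(audio_library, audio_range)   # samples conforming to the range
            model = train_stage(model, sample_set)                      # first / continued training
            if evaluate_precision(model, validation_set) >= target_precision:
                return model                                            # target model obtained
            audio_range *= growth                                       # expand the audio dynamic range
        return model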
In the embodiment of the application, the preset model is trained according to the sample set with the smaller audio dynamic range, and then the preset model is trained continuously according to the sample set with the larger audio dynamic range. The audio samples with higher sampling rate correspond to a larger audio dynamic range, which contains richer dynamic details, while the audio samples with smaller audio dynamic range are equivalent to blurring the dynamic details. By the method, the preset model is equivalently made to learn simpler features first and then learn more complex features, so that the problem that the model is difficult to converge due to random selection of training samples is solved, and the training difficulty of the deep learning model is effectively reduced.
In addition, in the embodiment of the application, the difficulty of model learning features is gradually increased, and when the model precision reaches the target precision, the training is stopped. By the method, when the target precision is low, the learning of the model to some detail characteristics can be reduced, so that the training difficulty of the model is reduced, and the training efficiency of the model is improved.
For example, refer to fig. 2, which is a schematic diagram of a training process provided in an embodiment of the present application. As shown in fig. 2, the training process includes the following steps:
1. The result of the model (i.e., the preset model) on the verification set is obtained.
2. The threshold value of the dynamic range (i.e., the audio dynamic range) is adjusted according to the result in step 1.
3. Audio (i.e., audio samples) meeting the dynamic range threshold is selected from the voice data set (i.e., the preset voice library) according to that threshold.
These audio samples are used to train the model, as shown in steps 4-8.
4. Acoustic features of the audio are extracted.
5. The acoustic features are resampled to align with the audio sampling points (a sketch of this alignment appears after this list).
6. The acoustic features are input into the model to obtain predicted values of the audio sampling points.
7. The loss value of the model is calculated according to the actual values and predicted values of the audio sampling points.
8. The gradient is calculated from the loss value, and the model parameters (hidden node weights) of the model are adjusted according to the gradient.
If the current model meets the stopping condition, the current model is output; otherwise, the process returns to step 1 and continues until the stopping condition is met.
The stopping condition in the embodiment of the present application may be that the number of iterations reaches a preset number, or that a loss value of the model on the verification set is smaller than a second threshold (i.e., the model precision reaches the target precision).
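For step 5 above, frame-level acoustic features must be brought to the audio sampling rate before being paired with sampling points. The following sketch uses simple frame repetition, which is only one possible alignment strategy and an assumption here, since the disclosure does not specify the resampling method:

    import numpy as np

    def align_features_to_samples(frame_features: np.ndarray, hop_length: int) -> np.ndarray:
        """Upsample frame-level acoustic features of shape (n_frames, n_dims) to sample
        level by repeating each frame hop_length times, so every audio sampling point
        has a corresponding feature vector."""
        return np.repeat(frame_features, hop_length, axis=0)

    # Example: 100 frames of 80-dim mel features with a hop of 256 samples
    # aligned = align_features_to_samples(np.zeros((100, 80)), 256)   # shape (25600, 80)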
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 3 is a block diagram of a model training apparatus provided in an embodiment of the present application, which corresponds to the model training method described in the above embodiments; for convenience of description, only the parts related to the embodiment of the present application are shown.
Referring to fig. 3, the apparatus includes:
the first training unit 31 is configured to train a preset model according to a first sample set that conforms to the first audio dynamic range, and obtain the preset model after the first training.
The range expanding unit 32 is configured to expand the first audio dynamic range to obtain a second audio dynamic range if the first model precision of the preset model after the first training on the verification set is smaller than the target precision.
And the second training unit 33 is configured to continue training the preset model after the first training according to the second sample set that conforms to the second audio dynamic range until a target model meeting the target accuracy is obtained.
Optionally, the first training unit 31 is further configured to:
calculating a third audio dynamic range of the audio samples in the preset audio library;
and adding the audio samples meeting a preset condition in the preset audio library into the first sample set, wherein the preset condition is that the third audio dynamic range is smaller than the first audio dynamic range.
And training the preset model according to the first sample set to obtain the preset model after the first training.
Optionally, the first training unit 31 is further configured to:
if the audio samples meeting the preset conditions do not exist in the preset audio library, compressing the audio samples in the preset audio library into the audio samples meeting the preset conditions; adding audio samples satisfying the preset condition to the first set of samples.
Optionally, the first training unit 31 is further configured to:
acquiring audio samples meeting the preset conditions from the preset audio library to obtain a first set;
calculating a training value for each of the audio samples in the first set;
and adding the audio samples with the training values larger than a preset value in the first set into the first sample set.
Optionally, the first training unit 31 is further configured to:
obtaining the second model precision of the current preset model on the verification set;
and calculating the training value of each audio sample in the first set according to the second model precision and the third audio dynamic range of each audio sample in the first set.
Optionally, the first training unit 31 is further configured to:
extracting the acoustic features of the audio samples in the first sample set and the actual values of the audio sampling points corresponding to the acoustic features;
inputting the acoustic features into the preset model, and outputting a predicted value of an audio sampling point;
calculating a first loss value of the preset model according to the actual value and the predicted value of the audio sampling point corresponding to the acoustic feature;
if the first loss value is larger than a first threshold value, updating the model parameters of the preset model according to the first loss value;
and if the first loss value is smaller than a first threshold value, determining the current preset model as the preset model after the first training.
Optionally, the apparatus 3 further comprises:
and the model determining unit 34 is configured to determine the preset model after the first training as the target model if the first model precision of the preset model after the first training on the verification set is equal to or greater than the target precision.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
The model training apparatus shown in fig. 3 may be a software unit, a hardware unit, or a combination of software and hardware built into an existing terminal device, may be integrated into the terminal device as an independent add-on, or may exist as an independent terminal device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Fig. 4 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 4, the terminal device 4 of this embodiment includes: at least one processor 40 (only one shown in fig. 4), a memory 41, and a computer program 42 stored in the memory 41 and executable on the at least one processor 40, the processor 40 implementing the steps in any of the various model method embodiments described above when executing the computer program 42.
The terminal device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The terminal device may include, but is not limited to, a processor, a memory. Those skilled in the art will appreciate that fig. 4 is merely an example of the terminal device 4, and does not constitute a limitation of the terminal device 4, and may include more or less components than those shown, or combine some components, or different components, such as an input-output device, a network access device, and the like.
The processor 40 may be a Central Processing Unit (CPU); the processor 40 may also be another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 41 may in some embodiments be an internal storage unit of the terminal device 4, such as a hard disk or a memory of the terminal device 4. In other embodiments, the memory 41 may also be an external storage device of the terminal device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the terminal device 4. Further, the memory 41 may also include both an internal storage unit and an external storage device of the terminal device 4. The memory 41 is used for storing an operating system, an application program, a boot loader (BootLoader), data, and other programs, such as the program code of the computer program. The memory 41 may also be used to temporarily store data that has been output or is to be output.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.
The embodiments of the present application provide a computer program product, which when running on a terminal device, enables the terminal device to implement the steps in the above method embodiments when executed.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and can implement the steps of the embodiments of the methods described above when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may include at least: any entity or device capable of carrying computer program code to an apparatus/terminal device, recording medium, computer Memory, Read-Only Memory (ROM), Random-Access Memory (RAM), electrical carrier wave signals, telecommunications signals, and software distribution medium. Such as a usb-disk, a removable hard disk, a magnetic or optical disk, etc. In certain jurisdictions, computer-readable media may not be an electrical carrier signal or a telecommunications signal in accordance with legislative and patent practice.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method of model training, comprising:
training a preset model according to a first sample set which accords with a first audio dynamic range to obtain the preset model after the first training;
if the first model precision of the preset model on the verification set after the first training is smaller than the target precision, expanding the first audio dynamic range to obtain a second audio dynamic range;
and continuing training the preset model after the first training according to a second sample set which accords with the second audio dynamic range until a target model meeting the target precision is obtained.
2. The model training method of claim 1, wherein the training a preset model according to a first set of samples conforming to a first audio dynamic range to obtain the preset model after a first training comprises:
calculating a third audio dynamic range of the audio samples in the preset audio library;
and adding the audio samples meeting a preset condition in the preset audio library into the first sample set, wherein the preset condition is that the third audio dynamic range is smaller than the first audio dynamic range.
And training the preset model according to the first sample set to obtain the preset model after the first training.
3. The model training method of claim 2, wherein the adding of the audio samples in the preset audio library that satisfy a preset condition to the first set of samples comprises:
if the audio samples meeting the preset conditions do not exist in the preset audio library, compressing the audio samples in the preset audio library into the audio samples meeting the preset conditions;
adding audio samples satisfying the preset condition to the first set of samples.
4. The model training method of claim 2, wherein the adding of the audio samples in the preset audio library that satisfy a preset condition to the first set of samples comprises:
acquiring audio samples meeting the preset conditions from the preset audio library to obtain a first set;
calculating a training value for each of the audio samples in the first set;
and adding the audio samples with the training values larger than a preset value in the first set into the first sample set.
5. The model training method of claim 4, wherein said calculating a training value for each of the audio samples in the first set comprises:
obtaining the second model precision of the current preset model on the verification set;
and calculating the training value of each audio sample in the first set according to the second model precision and the third audio dynamic range of each audio sample in the first set.
6. The model training method of claim 2, wherein the training the preset model according to the first sample set to obtain the preset model after the first training comprises:
extracting the acoustic features of the audio samples in the first sample set and the actual values of the audio sampling points corresponding to the acoustic features;
inputting the acoustic features into the preset model, and outputting a predicted value of an audio sampling point;
calculating a first loss value of the preset model according to the actual value and the predicted value of the audio sampling point corresponding to the acoustic feature;
if the first loss value is larger than a first threshold value, updating the model parameters of the preset model according to the first loss value;
and if the first loss value is smaller than a first threshold value, determining the current preset model as the preset model after the first training.
7. The model training method of claim 1, wherein after training a preset model according to a first set of samples conforming to a first audio dynamic range, obtaining the preset model after a first training, the method further comprises:
and if the first model precision of the preset model after the first training on the verification set is equal to or greater than the target precision, determining the preset model after the first training as the target model.
8. A model training apparatus, comprising:
the first training unit is used for training a preset model according to a first sample set which accords with a first audio dynamic range to obtain the preset model after the first training;
the range expansion unit is used for expanding the first audio dynamic range to obtain a second audio dynamic range if the first model precision of the preset model on the verification set after the first training is smaller than the target precision;
and the second training unit is used for continuously training the preset model after the first training according to a second sample set which accords with the second audio dynamic range until a target model meeting the target precision is obtained.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202111682661.3A 2021-12-29 2021-12-29 Model training method and device, terminal equipment and computer readable storage medium Pending CN114512111A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111682661.3A CN114512111A (en) 2021-12-29 2021-12-29 Model training method and device, terminal equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111682661.3A CN114512111A (en) 2021-12-29 2021-12-29 Model training method and device, terminal equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114512111A true CN114512111A (en) 2022-05-17

Family

ID=81548640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111682661.3A Pending CN114512111A (en) 2021-12-29 2021-12-29 Model training method and device, terminal equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114512111A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116072096A (en) * 2022-08-10 2023-05-05 荣耀终端有限公司 Model training method, acoustic model, voice synthesis system and electronic equipment
CN116072096B (en) * 2022-08-10 2023-10-20 荣耀终端有限公司 Model training method, acoustic model, voice synthesis system and electronic equipment

Similar Documents

Publication Publication Date Title
CN112786008B (en) Speech synthesis method and device, readable medium and electronic equipment
CN104347067A (en) Audio signal classification method and device
CN107705782B (en) Method and device for determining phoneme pronunciation duration
CN109256138A (en) Auth method, terminal device and computer readable storage medium
WO2023134549A1 (en) Encoder generation method, fingerprint extraction method, medium, and electronic device
CN110738980A (en) Singing voice synthesis model training method and system and singing voice synthesis method
CN113223536A (en) Voiceprint recognition method and device and terminal equipment
US7272557B2 (en) Method and apparatus for quantizing model parameters
WO2022213825A1 (en) Neural network-based end-to-end speech enhancement method and apparatus
CN114512111A (en) Model training method and device, terminal equipment and computer readable storage medium
CN108847251B (en) Voice duplicate removal method, device, server and storage medium
CN112837670B (en) Speech synthesis method and device and electronic equipment
CN116913258B (en) Speech signal recognition method, device, electronic equipment and computer readable medium
CN113421584A (en) Audio noise reduction method and device, computer equipment and storage medium
CN111653261A (en) Speech synthesis method, speech synthesis device, readable storage medium and electronic equipment
US10540990B2 (en) Processing of speech signals
CN113658581B (en) Acoustic model training method, acoustic model processing method, acoustic model training device, acoustic model processing equipment and storage medium
US20140140519A1 (en) Sound processing device, sound processing method, and program
CN111583945B (en) Method, apparatus, electronic device, and computer-readable medium for processing audio
CN116982111A (en) Audio characteristic compensation method, audio identification method and related products
CN117649846B (en) Speech recognition model generation method, speech recognition method, device and medium
WO2024008215A2 (en) Speech emotion recognition method and apparatus
Xu et al. An improved singer's formant extraction method based on LPC algorithm
JP7376896B2 (en) Learning device, learning method, learning program, generation device, generation method, and generation program
JP2005345599A (en) Speaker-recognizing device, program, and speaker-recognizing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination