WO2018161763A1 - Voice data set training method, computer device and computer readable storage medium - Google Patents

Voice data set training method, computer device and computer readable storage medium

Info

Publication number
WO2018161763A1
WO2018161763A1 (PCT/CN2018/075595)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
model
test set
training
error rate
Prior art date
Application number
PCT/CN2018/075595
Other languages
English (en)
French (fr)
Inventor
孙涛
康跃腾
张晓明
张力
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to EP18764634.4A (EP3594940B1)
Publication of WO2018161763A1
Priority to US16/436,479 (US11069342B2)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04: Segmentation; Word boundary detection
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142: Hidden Markov Models [HMMs]
    • G10L 15/144: Training of HMMs
    • G10L 15/148: Duration modelling in HMMs, e.g. semi HMM, segmental models or transition probabilities
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/025: Phonemes, fenemes or fenones being the recognition units
    • G10L 2015/0635: Training updating or merging of old and new templates; Mean values; Weighting

Definitions

  • the present application relates to the field of data processing, and in particular, to a voice data set training method, a computer device, and a computer readable storage medium.
  • HMM: Hidden Markov Model
  • GMM: Gaussian Mixture Model
  • DNN: Deep Neural Network
  • HMM+GMM and HMM+DNN need to be trained on the entire data set. As the data set grows, the total training time grows with it, resulting in very long training.
  • In accordance with various embodiments of the present application, a voice data set training method, a computer device, and a computer readable storage medium are provided.
  • a voice data set training method includes: reading a first test set generated by selecting data from a first voice data set, and first voice model parameters obtained by training the first voice data set; acquiring a second voice data set, and randomly selecting data from the second voice data set to generate a second test set; and, upon detecting that the second test set and the first test set satisfy a similar condition, performing second voice model training on the second voice data set using the first voice model parameters obtained by the training.
  • a computer device comprises a memory and a processor, the memory storing computer readable instructions that, when executed by the processor, cause the processor to perform the steps of: reading a first test set generated by selecting data from a first voice data set, and first voice model parameters obtained by training the first voice data set; acquiring a second voice data set, and randomly selecting data from the second voice data set to generate a second test set; and, upon detecting that the second test set and the first test set satisfy a similar condition, performing second voice model training on the second voice data set using the first voice model parameters obtained by the training.
  • a non-transitory computer readable storage medium stores computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of: reading a first test set generated by selecting data from a first voice data set, and first voice model parameters obtained by training the first voice data set; acquiring a second voice data set, and randomly selecting data from the second voice data set to generate a second test set; and, upon detecting that the second test set and the first test set satisfy a similar condition, performing second voice model training on the second voice data set using the first voice model parameters obtained by the training.
  • FIG. 1 is a schematic diagram showing the internal structure of a computer device in an embodiment
  • FIG. 2 is a flow chart of a method for training a voice data set in an embodiment
  • FIG. 3 is a flow chart of a voice data set training method in another embodiment
  • FIG. 4 is a flow chart of a voice data set training method in another embodiment
  • FIG. 5 is a schematic structural diagram of an HMM+GMM model in an embodiment
  • FIG. 6 is a schematic structural diagram of an HMM+DNN model in one embodiment
  • FIG. 7 is a structural block diagram of a voice data set training apparatus in an embodiment
  • FIG. 8 is a structural block diagram of a voice data set training apparatus in another embodiment
  • FIG. 9 is a structural block diagram of a speech data set training apparatus in another embodiment.
  • FIG. 1 is a schematic diagram showing the internal structure of a computer device in an embodiment.
  • the computer device includes a processor, a memory, and a network interface connected by a system bus.
  • the memory includes a nonvolatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operating system, a database, and a voice data set training device.
  • the database stores the HMM+GMM and HMM+DNN algorithm models, and the voice data set training device is used to implement a voice data set training method applicable to the computer device.
  • the processor of the computer device is used to provide computing and control capabilities to support the operation of the entire computer device.
  • the internal memory of the computer device provides an environment for the operation of the voice data set training device in the non-volatile storage medium, and the internal memory can store computer readable instructions that, when executed by the processor, cause the processor to perform a voice data set training method.
  • the network interface of the computer device is configured to communicate with an external device via a network connection, such as receiving a voice recognition request sent by the device and returning a voice recognition result to the device.
  • the computer device can be implemented as a stand-alone computer device or as a cluster consisting of a plurality of computer devices. It will be understood by those skilled in the art that the structure shown in FIG. 1 is only a block diagram of the part of the structure related to the solution of the present application, and does not limit the computer devices to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • a voice data set training method includes:
  • Step 202 Read a first test set generated by selecting data from the first voice data set, and first voice model parameters obtained by training the first voice data set.
  • the first voice data set refers to the voice data set used for the first training.
  • the computer device can select data from the first voice data set to generate a first test set.
  • the first test set is a data set for verifying the performance of the first speech model obtained by training the first speech data set.
  • the first speech model can be a hidden Markov model and a mixed Gaussian model.
  • the computer device can read the first test set generated by the data selected from the first voice data set.
  • for example, the computer device can read the first test set through its CPU (Central Processing Unit): the required data is fetched from the hard disk, and the CPU then processes the data and loads it into memory.
  • the hidden Markov model and mixed Gaussian model (i.e., HMM+GMM) parameters refer to the start and end times of each HMM state.
  • Each voice frame corresponds to an HMM state.
  • HMM (Hidden Markov Model): a statistical model used to describe a Markov process with hidden, unknown parameters. In an HMM the state is not directly visible, but certain variables affected by the state are visible.
  • the states in the HMM are its basic components; the transition probabilities of the HMM indicate the probability of switching between its states; and each state has a probability distribution over the possible output symbols, i.e., the output probabilities of the HMM.
  • a Markov process is a memoryless stochastic process: given the current state and all past states, the conditional probability distribution of its future states depends only on the current state.
  • GMM (Gaussian Mixture Model): a model that quantifies an entity precisely with Gaussian probability density functions (normal distribution curves), decomposing it into several component models based on Gaussian probability density functions.
  • the computer device may select data from the first voice data set in advance to generate a training set and the first test set, and may also train the training set of the first voice data set to obtain a hidden Markov model and mixed Gaussian model, thereby obtaining the hidden Markov model and mixed Gaussian model parameters.
  • Step 204 Acquire a second voice data set, and randomly select data from the second voice data set to generate a second test set.
  • the second voice data set refers to a voice data set for retraining.
  • the computer device can generate a second test set by randomly selecting data from the second voice data set.
  • the second test set is intended to be representative of the second voice data set.
  • the ratio of the amount of data in the second test set to the amount of data in the second voice data set is the same as the ratio of the amount of data in the first test set to the amount of data in the first voice data set.
  • Step 206 After detecting that the second test set and the first test set satisfy the similar condition, perform the second voice model training on the second voice data set by using the first voice model parameter obtained by the training.
  • the second speech model may be a hidden Markov model and a deep neural network model.
  • DNN (Deep Neural Network): a neural network with at least one hidden layer. Like a shallow neural network, a deep neural network can model complex nonlinear systems, but the additional layers provide higher levels of abstraction and thus greater modeling capacity.
  • a neural network joins many single neurons together, so that the output of one neuron can be the input of another.
  • a neuron is the basic computational unit of a neural network: through an activation function it converts multiple input values into one output, with the input values paired one-to-one with weights.
  • the similar condition means that the similarity exceeds a similarity threshold, or that the difference between word recognition error rates is less than or equal to a fault tolerance threshold. If either holds, the second test set and the first test set are highly similar, and the hidden Markov model and mixed Gaussian model parameters obtained by training the first voice data set are suitable for performing hidden Markov model and deep neural network model training on the second voice data set.
  • when the computer device detects that the second test set, generated by selecting data from the second voice data set, and the first test set, generated by selecting data from the first voice data set, satisfy the similar condition, the computer device uses the first voice model parameters obtained by training the first voice data set to perform second voice model training on the second voice data set. This skips the first voice model training on the second voice data set, saving total training time and improving training efficiency.
  • generating the second test set by randomly selecting data from the second voice data set includes: obtaining the ratio of the number of items in the first test set to the number of items in the first voice data set, and randomly selecting data in that ratio from the second voice data set to generate the second test set.
  • the number of items in the first test set TEST1 is denoted number(TEST1), and the number of items in the first voice data set is denoted number(dataset1); likewise number(TEST2) and number(dataset2) for the second test set and the second voice data set. The selection satisfies number(TEST1)/number(dataset1) = number(TEST2)/number(dataset2).
  • the voice data set training method further includes:
  • Step 302 Select a data generation training set and a first test set from the first voice data set.
  • the training set is the data set used to estimate the model.
  • Step 304 Perform a first speech model training on the training set to obtain a preset number of first speech models.
  • the preset number may be configured as needed, for example, 5, 10, and the like.
  • Step 306 The first test set is tested by using the preset number of first voice models to obtain a first voice model whose word recognition error rate is within a preset range.
  • the computer device tests the first test set with each of the preset number of first voice models, obtains the word recognition error rate of each first voice model, and filters, according to these word recognition error rates, the first voice models whose word recognition error rate is within a preset range.
  • the preset range can be set as needed.
  • Step 308 The parameter of the first speech model whose word recognition error rate is within a preset range is used as the first speech model parameter.
  • the parameters of a first speech model whose word recognition error rate is within the preset range are the start and end times of each HMM state obtained by that model.
  • the computer device selects data from the first voice data set to generate the training set and trains on it to obtain a plurality of first voice models; after testing with the first test set, the computer device obtains the first voice models whose word recognition error rate is within the preset range.
  • the computer device may take, as the first speech model parameters, the parameters of the first speech model with the smallest word recognition error rate within the preset range, which makes them more accurate when later used as shared first speech model parameters.
  • alternatively, the computer device may take, as the first speech model parameters, the parameters of any first speech model whose word recognition error rate is within the preset range.
  • in one embodiment, the voice data set training method further includes: performing second voice model training on the first voice data set using the parameters of the first voice model with the smallest word recognition error rate within the preset range.
  • in one embodiment, the voice data set training method further includes: performing second voice model training on the first voice data set using the parameters of any first voice model whose word recognition error rate is within the preset range.
  • performing first speech model training on the training set to obtain a preset number of first speech models includes: each time, randomly selecting a first preset ratio of data, or a first fixed quantity of data, from the training set and performing first speech model training on it, repeating this a preset number of times to obtain the preset number of first speech models.
  • the first preset ratio may be configured as needed: too high a ratio is time-consuming, while too low a ratio cannot represent the entire training set.
  • the first fixed quantity can be configured as needed.
  • the preset number of times is the number of times the first preset ratio of data, or the first fixed quantity of data, is randomly selected from the training set for first voice model training.
  • testing the first test set with the preset number of first voice models to obtain a first voice model whose word recognition error rate is within a preset range includes: testing the first test set separately with each of the preset number of first voice models to obtain the word recognition error rate of each first voice model; and filtering, according to these word recognition error rates, the first voice models whose word recognition error rate is within the preset range.
  • the Word Error Rate (WER) is the ratio of the number of words recognized incorrectly during testing to the total number of words in the test set.
  • by testing the first test set separately with the preset number of first voice models, the computer device obtains the word recognition error rate of each first voice model on the first test set; by comparing these rates with the preset range, the computer device obtains the first voice models whose word recognition error rate is within the preset range.
  • detecting that the second test set and the first test set satisfy the similar condition includes: testing the second test set with the first voice model corresponding to the smallest word recognition error rate within the preset range, to obtain the word recognition error rate of the second test set; and detecting that the difference between the word recognition error rate of the second test set and that smallest word recognition error rate is less than or equal to the fault tolerance threshold, which indicates that the second test set and the first test set satisfy the similar condition.
  • the fault tolerance threshold can be derived from repeated training in practice.
  • the voice data set training method further includes: separately selecting data from the first voice data set to generate a training set and a first test set; performing first voice model training on the training set to obtain a preset number of first voice models; testing the first test set with each of the preset number of first voice models to obtain the first voice model with the smallest word recognition error rate among them; and taking the parameters of that first voice model as the first voice model parameters.
  • by testing the first test set separately with the preset number of first voice models, the computer device obtains the word recognition error rate of each first voice model on the first test set, and may sort these rates to find the smallest word recognition error rate among the preset number.
  • detecting that the second test set and the first test set satisfy the similar condition then includes: testing the second test set with the first voice model corresponding to the smallest word recognition error rate among the preset number, to obtain the word recognition error rate of the second test set; and detecting that the difference between the word recognition error rate of the second test set and that smallest word recognition error rate is less than or equal to the fault tolerance threshold, which indicates that the second test set and the first test set satisfy the similar condition.
  • determining the start and end times of each HMM state with the HMM+GMM model includes: acquiring voice data, segmenting the voice data, and extracting the features of each segment; listing all texts that each segment may correspond to; converting the text into phonemes according to a pronunciation dictionary; converting the phonemes into HMM states according to the HMM model; obtaining the probability of each candidate text from the parameters of the HMM+GMM model; deriving the most likely HMM state sequence by comparing these probabilities; and obtaining the start and end time of each HMM state from that state sequence.
  • Feature extraction of speech may include sound intensity and intensity level, loudness, pitch, pitch period, pitch frequency, signal to noise ratio, harmonic to noise ratio, and the like.
  • sound intensity is the average sound energy per unit time passing through a unit area perpendicular to the direction of sound wave propagation; it is denoted I and measured in watts per square meter, and it is usually expressed as a sound intensity level, whose common unit is the decibel (dB).
  • loudness expresses how strong a sound is perceived to be and is given as a loudness level.
  • pitch is the human auditory system's perception of sound frequency; its unit is the mel.
  • the pitch period reflects the time interval between, or the frequency of, adjacent openings and closings of the glottis.
  • the signal-to-noise ratio is computed as the ratio between the power of the signal and that of the noise.
  • the harmonics-to-noise ratio is the ratio of harmonic components to noise components in speech.
  • a phoneme is the smallest unit of speech, divided according to the natural attributes of speech. Phonemes are obtained by labeling the voice data: labeling means processing raw data, and labeling speech means making explicit the real content the speech represents.
  • for example, the HMM state sequence obtained by the computer device may look like 112233345. Assuming it starts at time t, the start and end time of state 1 is t to t+2, and the start and end time of state 2 is t+3 to t+4.
  • a voice data set training method includes:
  • Step 402 Acquire a voice data set, determine whether the current training is the first training, if yes, execute step 404, and if no, perform step 410.
  • Step 404 Select a data generation training set and a first test set from the voice data set.
  • the voice data set may be referred to as a first voice data set.
  • Step 406: Randomly select a first preset ratio of data from the training set and perform hidden Markov model and mixed Gaussian model training; repeat this a preset number of times to obtain a preset number of HMM+GMM models.
  • Step 408: Test the first test set with each of the preset number of HMM+GMM models to obtain the smallest word recognition error rate, recorded as the first word recognition error rate; select the HMM+GMM model corresponding to the smallest word recognition error rate as the optimal HMM+GMM model; then perform step 416.
  • Step 410 Randomly select data from the voice data set to generate a second test set.
  • the voice data set may be referred to as a second voice data set.
  • Step 412: Test the second test set with the optimal hidden Markov model and mixed Gaussian model obtained in the first training, to obtain the word recognition error rate of the second test set, recorded as the second word recognition error rate.
  • Step 414: Determine whether the difference between the second word recognition error rate and the first word recognition error rate is less than or equal to the fault tolerance threshold; if yes, perform step 416; if no, end.
  • Step 416: Train the hidden Markov model and deep neural network model using the parameters of the optimal hidden Markov model and mixed Gaussian model.
  • with the above voice data set training method, when it is detected that the current training is not the first training, and the difference between the second word recognition error rate (obtained by testing the second test set with the optimal HMM+GMM model) and the first word recognition error rate (obtained by testing the first test set) is less than or equal to the fault tolerance threshold, the hidden Markov model and mixed Gaussian model parameters obtained by training the first voice data set are used to perform hidden Markov model and deep neural network model training on the second voice data set. This skips the hidden Markov model and mixed Gaussian model training on the second voice data set, saving total training time and improving training efficiency. If the current training is the first training, the optimal HMM+GMM model is selected and its parameters are used for HMM+DNN training.
  • FIG. 5 is a schematic structural diagram of an HMM+GMM model in one embodiment.
  • the first layer 52 consists of individual voice frames, the second layer 54 is the GMM model, and the third layer 56 is the HMM model.
  • the HMM model is paired with multiple GMM models that provide its output probabilities.
  • S represents an HMM state in the HMM model; a represents the transition probabilities in the HMM model, where a_{s_{k-1} s_{k-2}} denotes the transition probability from state s_{k-1} to state s_{k-2}.
  • each GMM corresponds to the output probability of one HMM state.
  • the computer device divides the voice data into individual voice frames, with each voice frame corresponding to one HMM state; the voice frames are the observations of the HMM.
  • FIG. 6 is a schematic structural diagram of an HMM+DNN model in one embodiment.
  • the first layer 62 consists of individual voice frames, the second layer 64 is the DNN model, and the third layer 66 is the HMM model.
  • S represents an HMM state in the HMM model; a represents the transition probabilities in the HMM model, where a_{s_{k-1} s_{k-2}} denotes the transition probability from state s_{k-1} to state s_{k-2}; h represents a neuron in the DNN model; W represents a weight in the DNN model; and M represents the number of layers of the DNN model.
  • h stands for a function: in the first layer, the inputs of h are one or several frames of data together with their corresponding weights; from the second layer to the last layer, the inputs of h are the outputs of the previous layer and the weight corresponding to each output.
  • each DNN output corresponds to the output probability of one HMM state, and each DNN output corresponds to one voice frame.
  • a single DNN model can be used, in the time domain, to take one voice frame as input and output the probability corresponding to an HMM state.
  • FIG. 7 is a structural block diagram of a voice data set training apparatus in an embodiment.
  • a voice data set training apparatus 700 includes a reading module 702, an obtaining module 704, and a training module 706, wherein:
  • the reading module 702 is configured to read a first test set generated by selecting data from the first voice data set, and a first voice model parameter obtained by training the first voice data set.
  • the first voice data set refers to the voice data set used for the first training.
  • the computer device can generate data from the first set of voice data to generate a first test set.
  • the first test set is a data set for verifying the performance of the first speech model obtained by training the first speech data set.
  • the first speech model parameter refers to the start and end time of each speech model state.
  • the first speech model parameter can be the start and end time of each HMM state.
  • Each voice frame corresponds to an HMM state.
  • the obtaining module 704 is configured to obtain a second voice data set, and randomly select data from the second voice data set to generate a second test set.
  • the training module 706 is configured to, upon detecting that the second test set and the first test set satisfy a similar condition, perform second voice model training on the second voice data set using the first voice model parameters obtained by the training.
  • the first speech model can be a hidden Markov model and a mixed Gaussian model.
  • the second speech model may be a hidden Markov model and a deep neural network model.
  • with the above voice data set training device, upon detecting that the second test set generated by selecting data from the second voice data set and the first test set generated by selecting data from the first voice data set satisfy the similar condition, the computer device can use the first voice model parameters obtained by training the first voice data set to perform second voice model training on the second voice data set. This skips the first voice model training on the second voice data set, saving total training time and improving training efficiency.
  • FIG. 8 is a structural block diagram of a voice data set training apparatus in another embodiment.
  • a voice data set training apparatus 700 includes a generation module 708, a model construction module 710, a screening module 712, and a parameter acquisition module 714 in addition to the reading module 702, the acquisition module 704, and the training module 706.
  • the voice data set training device 700 forms at least a part of the computer device, and the modules 702-714 perform their corresponding operations through the computer device.
  • the generating module 708 is configured to separately select a data generation training set and a first test set from the first voice data set.
  • the generating module 708 is further configured to obtain a ratio of the number of data in the first test set to the number of data in the first voice data set, and randomly select the ratio from the second voice data set. Data, generating the second test set.
  • the model building module 710 is configured to perform a first voice model training on the training set to obtain a preset number of first voice models.
  • the screening module 712 is configured to test the first test set by using the preset number of first voice models to obtain a first voice model whose word recognition error rate is within a preset range.
  • the parameter obtaining module 714 is configured to use, as the first voice model parameter, a parameter of the first voice model whose word recognition error rate is within a preset range.
  • the training module 706 is further configured to perform second speech model training on the first speech data set by using parameters of the first speech model whose word recognition error rate is within a preset range.
  • the computer device can select data from the first voice data set to generate a training set, and can train on it to obtain a plurality of first voice models; after testing with the first test set, the computer device can obtain an optimal first voice model.
  • the computer device may take, as the first speech model parameters, the parameters of any first speech model whose word recognition error rate is within the preset range, or the parameters of the first speech model with the smallest word recognition error rate within the preset range, which is more accurate when later used as shared first speech model parameters.
  • the model building module 710 is further configured to randomly select the first preset ratio of data or the first fixed amount of data from the training set to perform the first voice model training, repeating the preset number of times, A preset number of first speech models are obtained.
  • the screening module 712 is further configured to test the first test set separately with the preset number of first voice models to obtain the word recognition error rate of each first voice model, and to filter, according to these word recognition error rates, the first voice models whose word recognition error rate is within the preset range.
  • FIG. 9 is a structural block diagram of a speech data set training apparatus in another embodiment.
  • a voice data set training apparatus 700 includes, in addition to the reading module 702, the obtaining module 704, the training module 706, the generating module 708, the model building module 710, the screening module 712, and the parameter obtaining module 714, Detection module 716.
  • the detecting module 716 is configured to test the second test set with the first voice model corresponding to the smallest word recognition error rate within the preset range, to obtain the word recognition error rate of the second test set; and, upon detecting that the difference between the word recognition error rate of the second test set and that smallest word recognition error rate is less than or equal to the fault tolerance threshold, to indicate that the second test set and the first test set satisfy the similar condition.
  • the generating module 708 is further configured to separately select a data generation training set and a first test set from the first voice data set.
  • the model building module 710 is configured to perform a first voice model training on the training set to obtain a preset number of first voice models.
  • the screening module 712 is configured to test the first test set by using the preset number of first voice models to obtain a first voice model of a minimum word recognition error rate of the preset number;
  • the parameter obtaining module 714 is configured to use the parameter of the first speech model of the minimum word recognition error rate as the first speech model parameter.
  • the detecting module 716 is further configured to test the second test set with the first voice model corresponding to the smallest word recognition error rate among the preset number, to obtain the word recognition error rate of the second test set; and, upon detecting that the difference between the word recognition error rate of the second test set and that smallest word recognition error rate is less than or equal to the fault tolerance threshold, to indicate that the second test set and the first test set satisfy the similar condition.
  • the division of modules in the above voice data set training device is for illustration only; in other embodiments, the voice data set training device may be divided into different modules as needed to complete all or part of its functions.
  • Embodiments of the present invention also provide a computer device and a computer readable storage medium.
  • a computer device comprises a memory, a processor, and a computer program (instructions) stored on the memory and runnable on the processor; when executing the program, the processor performs the steps of: reading a first test set generated by selecting data from a first voice data set, and first voice model parameters obtained by training the first voice data set; acquiring a second voice data set, and randomly selecting data from the second voice data set to generate a second test set; and, upon detecting that the second test set and the first test set satisfy a similar condition, performing second voice model training on the second voice data set using the first voice model parameters obtained by the training.
  • the first speech model can be a hidden Markov model and a mixed Gaussian model.
  • the second speech model may be a hidden Markov model and a deep neural network model.
  • the processor is further configured, when executing the program, to perform the following steps: separately selecting data from the first voice data set to generate a training set and the first test set; performing first voice model training on the training set to obtain a preset number of first voice models; testing the first test set with each of the preset number of first voice models to obtain the first voice models whose word recognition error rate is within a preset range; and taking the parameters of a first voice model whose word recognition error rate is within the preset range as the first voice model parameters.
  • the processor is further configured to perform first voice model training on the training set to obtain the preset number of first voice models by: each time, randomly selecting a first preset ratio of data or a first fixed quantity of data from the training set for first voice model training, repeating this a preset number of times to obtain the preset number of first voice models.
  • the processor is further configured to test the first test set with the preset number of first voice models to obtain the first voice models whose word recognition error rate is within the preset range by: testing the first test set separately with each of the preset number of first voice models to obtain the word recognition error rate of each first voice model; and filtering, according to these word recognition error rates, the first voice models whose word recognition error rate is within the preset range.
  • the processor is further configured to detect that the second test set and the first test set satisfy the similar condition by: testing the second test set with the first voice model corresponding to the smallest word recognition error rate within the preset range, to obtain the word recognition error rate of the second test set; and detecting that the difference between the word recognition error rate of the second test set and that smallest word recognition error rate is less than or equal to the fault tolerance threshold, which indicates that the second test set and the first test set satisfy the similar condition.
  • the processor is further configured to: separately select data from the first voice data set to generate a training set and the first test set; perform first voice model training on the training set to obtain a preset number of first voice models; test the first test set with each of the preset number of first voice models to obtain the first voice model with the smallest word recognition error rate among them; and take the parameters of that first voice model as the first voice model parameters.
  • the processor is further configured to test the second test set with the first voice model corresponding to the smallest word recognition error rate among the preset number, to obtain the word recognition error rate of the second test set; and to detect that the difference between the word recognition error rate of the second test set and that smallest word recognition error rate is less than or equal to the fault tolerance threshold, which indicates that the second test set and the first test set satisfy the similar condition.
  • the processor is further configured to randomly select data from the second voice data set to generate the second test set by: obtaining the ratio of the number of items in the first test set to the number of items in the first voice data set, and randomly selecting data in that ratio from the second voice data set to generate the second test set.
  • a computer readable storage medium stores a computer program which, when executed by the processor, performs the steps of: reading a first test set generated by selecting data from a first voice data set, and first voice model parameters obtained by training the first voice data set; acquiring a second voice data set, and randomly selecting data from the second voice data set to generate a second test set; and, upon detecting that the second test set and the first test set satisfy a similar condition, performing second voice model training on the second voice data set using the first voice model parameters obtained by the training.
  • the first speech model can be a hidden Markov model and a mixed Gaussian model.
  • the second speech model may be a hidden Markov model and a deep neural network model.
  • the processor is further configured, when executing the program, to perform the following steps: separately selecting data from the first voice data set to generate a training set and the first test set; performing first voice model training on the training set to obtain a preset number of first voice models; testing the first test set separately with each of the preset number of first voice models to obtain an optimal first voice model; and taking the parameters of the optimal first voice model as the first voice model parameters.
  • the processor is further configured to perform first voice model training on the training set to obtain the preset number of first voice models by: each time, randomly selecting a first preset ratio of data or a first fixed quantity of data from the training set for first voice model training, repeating this a preset number of times to obtain the preset number of first voice models.
  • the processor is further configured to test the first test set with the preset number of first voice models to obtain the optimal first voice model by: testing the first test set separately with each of the preset number of first voice models to obtain the word recognition error rate of each first voice model; and filtering, according to these word recognition error rates, the first voice model whose word recognition error rate is within a preset range.
  • the processor is further configured to detect that the second test set and the first test set satisfy the similar condition by: testing the second test set with the first voice model corresponding to the smallest word recognition error rate within the preset range, to obtain the word recognition error rate of the second test set; and detecting that the difference between the word recognition error rate of the second test set and that smallest word recognition error rate is less than or equal to the fault tolerance threshold, which indicates that the second test set and the first test set satisfy the similar condition.
  • the processor is further configured to randomly select data from the second voice data set to generate the second test set by: obtaining the ratio of the number of items in the first test set to the number of items in the first voice data set, and randomly selecting data in that ratio from the second voice data set to generate the second test set.
  • the processor is further configured to: separately select data from the first voice data set to generate a training set and the first test set; perform first voice model training on the training set to obtain a preset number of first voice models; test the first test set with each of the preset number of first voice models to obtain the first voice model with the smallest word recognition error rate among them; and take the parameters of that first voice model as the first voice model parameters.
  • the processor is further configured to test the second test set with the first voice model corresponding to the smallest word recognition error rate among the preset number, to obtain the word recognition error rate of the second test set; and to detect that the difference between the word recognition error rate of the second test set and that smallest word recognition error rate is less than or equal to the fault tolerance threshold, which indicates that the second test set and the first test set satisfy the similar condition.
  • here, a computer readable medium refers to a non-volatile storage medium, excluding media such as energy signals and electromagnetic waves.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A voice data set training method includes: reading a first test set generated by selecting data from a first voice data set, and acquiring first voice model parameters obtained by training the first voice data set (202); acquiring a second voice data set, and randomly selecting data from the second voice data set to generate a second test set (204); and, when it is detected that the second test set and the first test set satisfy a similar condition, performing second voice model training on the second voice data set using the first voice model parameters obtained by the training (206).

Description

Voice data set training method, computer device and computer readable storage medium
This application claims priority to Chinese Patent Application No. 201710143053.2, filed with the Chinese Patent Office on March 10, 2017 and entitled "Voice data set training method and apparatus", which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the field of data processing, and in particular to a voice data set training method, a computer device, and a computer readable storage medium.
Background
Training of a conventional voice data set generally includes two parts: training of an HMM (Hidden Markov Model) + GMM (Gaussian Mixture Model), and training of an HMM + DNN (Deep Neural Network). HMM+GMM and HMM+DNN must be trained on the entire data set; as the data set keeps growing, the total training time increases, resulting in very long training.
Summary
According to various embodiments of the present application, a voice data set training method, a computer device, and a computer readable storage medium are provided.
A voice data set training method includes:
reading a first test set generated by selecting data from a first voice data set, and first voice model parameters obtained by training the first voice data set;
acquiring a second voice data set, and randomly selecting data from the second voice data set to generate a second test set; and
upon detecting that the second test set and the first test set satisfy a similar condition, performing second voice model training on the second voice data set using the first voice model parameters obtained by the training.
A computer device includes a memory and a processor, the memory storing computer readable instructions that, when executed by the processor, cause the processor to perform the following steps: reading a first test set generated by selecting data from a first voice data set, and first voice model parameters obtained by training the first voice data set; acquiring a second voice data set, and randomly selecting data from the second voice data set to generate a second test set; and, upon detecting that the second test set and the first test set satisfy a similar condition, performing second voice model training on the second voice data set using the first voice model parameters obtained by the training.
A non-transitory computer readable storage medium stores computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps: reading a first test set generated by selecting data from a first voice data set, and first voice model parameters obtained by training the first voice data set; acquiring a second voice data set, and randomly selecting data from the second voice data set to generate a second test set; and, upon detecting that the second test set and the first test set satisfy a similar condition, performing second voice model training on the second voice data set using the first voice model parameters obtained by the training.
The details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the present application will become apparent from the specification, the drawings, and the claims.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the accompanying drawings in the following description show merely some embodiments of the present application, and a person of ordinary skill in the art may still derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of the internal structure of a computer device according to an embodiment;
FIG. 2 is a flowchart of a voice data set training method according to an embodiment;
FIG. 3 is a flowchart of a voice data set training method according to another embodiment;
FIG. 4 is a flowchart of a voice data set training method according to another embodiment;
FIG. 5 is a schematic structural diagram of an HMM+GMM model according to an embodiment;
FIG. 6 is a schematic structural diagram of an HMM+DNN model according to an embodiment;
FIG. 7 is a structural block diagram of a voice data set training apparatus according to an embodiment;
FIG. 8 is a structural block diagram of a voice data set training apparatus according to another embodiment;
FIG. 9 is a structural block diagram of a voice data set training apparatus according to another embodiment.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely intended to explain the present application rather than to limit it.
FIG. 1 is a schematic diagram of the internal structure of a computer device according to an embodiment. As shown in FIG. 1, the computer device includes a processor, a memory, and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system, a database, and a voice data set training apparatus; the database stores the HMM+GMM and HMM+DNN algorithm models, and the voice data set training apparatus is used to implement a voice data set training method applicable to the computer device. The processor of the computer device provides computing and control capabilities and supports the operation of the entire computer device. The internal memory of the computer device provides an environment for running the voice data set training apparatus in the non-volatile storage medium; the internal memory may store computer readable instructions that, when executed by the processor, cause the processor to perform a voice data set training method. The network interface of the computer device is used to communicate with external devices over a network connection, for example to receive voice recognition requests sent by a device and return voice recognition results to it. The computer device may be implemented as a stand-alone computer device or as a cluster composed of multiple computer devices. A person skilled in the art will understand that the structure shown in FIG. 1 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer devices to which the solution applies; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.
FIG. 2 is a flowchart of a voice data set training method according to an embodiment. As shown in FIG. 2, a voice data set training method includes:
Step 202: Read a first test set generated by selecting data from a first voice data set, and first voice model parameters obtained by training the first voice data set.
In this embodiment, the first voice data set is the voice data set used for the first training. The computer device may select data from the first voice data set to generate the first test set. The first test set is a data set used to verify the performance of the first voice model obtained by training the first voice data set. The first voice model may be a hidden Markov model and mixed Gaussian model.
The computer device may read the first test set generated by selecting data from the first voice data set. For example, the computer device may read the first test set through its CPU (Central Processing Unit): the required data is fetched from the hard disk, and the CPU then processes the data and loads it into memory.
The hidden Markov model and mixed Gaussian model (i.e., HMM+GMM) parameters refer to the start and end times of each HMM state. Each voice frame corresponds to one HMM state.
An HMM (Hidden Markov Model) is a statistical model used to describe a Markov process with hidden, unknown parameters. In a hidden Markov model the state is not directly visible, but certain variables affected by the state are visible. The states in an HMM are its basic components; the transition probabilities of the HMM represent the probability of switching between its states; and each state has a probability distribution over the possible output symbols, i.e., the output probabilities of the HMM. A Markov process is a memoryless stochastic process: given the current state and all past states, the conditional probability distribution of its future states depends only on the current state.
A GMM (Gaussian Mixture Model) quantifies an entity precisely with Gaussian probability density functions (normal distribution curves), decomposing it into several component models based on Gaussian probability density functions.
The computer device may select data from the first voice data set in advance to generate a training set and the first test set, and may also train the training set of the first voice data set to obtain the hidden Markov model and mixed Gaussian model, thereby obtaining the HMM+GMM parameters.
Step 204: Acquire a second voice data set, and randomly select data from the second voice data set to generate a second test set.
In this embodiment, the second voice data set is the voice data set used for retraining. The computer device may randomly select data from the second voice data set to generate the second test set. The second test set is intended to be representative of the second voice data set. The proportion of the second test set within the second voice data set is the same as the proportion of the first test set within the first voice data set.
Step 206: Upon detecting that the second test set and the first test set satisfy a similar condition, perform second voice model training on the second voice data set using the first voice model parameters obtained by the training.
In this embodiment, the second voice model may be a hidden Markov model and deep neural network model. A DNN (Deep Neural Network) is a neural network with at least one hidden layer. Like a shallow neural network, a deep neural network can model complex nonlinear systems, but the additional layers provide higher levels of abstraction and thus greater modeling capacity. A neural network joins many single neurons together, so that the output of one neuron can be the input of another. A neuron is the basic computational unit of a neural network: through an activation function it converts multiple input values into one output, with the input values paired one-to-one with weights.
In this embodiment, the similar condition means that the similarity exceeds a similarity threshold, or that the difference between word recognition error rates is less than or equal to a fault tolerance threshold. If either holds, the second test set and the first test set are highly similar, and it is appropriate to use the HMM+GMM parameters obtained by training the first voice data set to perform hidden Markov model and deep neural network model training on the second voice data set.
With the above voice data set training method, when the computer device detects that the second test set generated by selecting data from the second voice data set and the first test set generated by selecting data from the first voice data set satisfy the similar condition, the computer device uses the first voice model parameters obtained by training the first voice data set to perform the second voice model training on the second voice data set. This skips the first voice model training on the second voice data set, saving total training time and improving training efficiency.
In one embodiment, randomly selecting data from the second voice data set to generate the second test set includes: obtaining the ratio of the number of items in the first test set to the number of items in the first voice data set, and randomly selecting data in that ratio from the second voice data set to generate the second test set.
In this embodiment, the number of items in the first test set TEST1 is denoted number(TEST1), and the number of items in the first voice data set is denoted number(dataset1); the number of items in the second test set TEST2 is denoted number(TEST2), and the number of items in the second voice data set is denoted number(dataset2). The selection satisfies number(TEST1)/number(dataset1) = number(TEST2)/number(dataset2).
Keeping the proportion of the second test set within the second voice data set equal to the proportion of the first test set within the first voice data set ensures that the similarity computation produces a more accurate result.
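The ratio-preserving selection is simple to implement. The following Python sketch is illustrative only and is not part of the patent; make_second_test_set and its arguments are hypothetical names:

```python
import random

def make_second_test_set(dataset2, num_test1, num_dataset1, seed=None):
    """Randomly select data from the second voice data set so that
    number(TEST2)/number(dataset2) == number(TEST1)/number(dataset1)."""
    ratio = num_test1 / num_dataset1
    k = round(len(dataset2) * ratio)  # test-set size that preserves the ratio
    rng = random.Random(seed)
    return rng.sample(dataset2, k)

# Example: TEST1 held 1,000 of 10,000 utterances (ratio 0.1), so a
# 50,000-utterance second data set yields a 5,000-utterance TEST2.
test2 = make_second_test_set(list(range(50000)), 1000, 10000, seed=0)
assert len(test2) == 5000
```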
FIG. 3 is a flowchart of a voice data set training method according to another embodiment. As shown in FIG. 3, in one embodiment, the voice data set training method further includes:
Step 302: Separately select data from the first voice data set to generate a training set and the first test set.
The training set is the data set used to estimate the model.
Step 304: Perform first voice model training on the training set to obtain a preset number of first voice models.
In this embodiment, the preset number may be configured as needed, for example 5, 10, and so on.
Step 306: Test the first test set with each of the preset number of first voice models to obtain the first voice models whose word recognition error rate is within a preset range.
In this embodiment, the computer device tests the first test set with each of the preset number of first voice models, obtains the word recognition error rate of each first voice model, and filters, according to these word recognition error rates, the first voice models whose word recognition error rate falls within the preset range. The preset range may be set as needed.
Step 308: Take the parameters of a first voice model whose word recognition error rate is within the preset range as the first voice model parameters.
In this embodiment, the parameters of a first voice model whose word recognition error rate is within the preset range are the start and end times of each HMM state obtained by that model.
The computer device selects data from the first voice data set to generate the training set and trains on it to obtain multiple first voice models; after testing with the first test set, the computer device obtains the first voice models whose word recognition error rate is within the preset range. The computer device may take the parameters of the first voice model with the smallest word recognition error rate within the preset range as the first voice model parameters, which makes them more accurate when later used as shared parameters; alternatively, the computer device may take the parameters of any first voice model whose word recognition error rate is within the preset range.
In one embodiment, the voice data set training method further includes: performing second voice model training on the first voice data set using the parameters of the first voice model with the smallest word recognition error rate within the preset range.
In one embodiment, the voice data set training method further includes: performing second voice model training on the first voice data set using the parameters of any first voice model whose word recognition error rate is within the preset range.
In one embodiment, performing first voice model training on the training set to obtain the preset number of first voice models includes: each time, randomly selecting a first preset ratio of data, or a first fixed quantity of data, from the training set and performing first voice model training on it, repeating this a preset number of times to obtain the preset number of first voice models.
In this embodiment, the first preset ratio may be configured as needed: too high a ratio is time-consuming, while too low a ratio cannot represent the entire training set. The first fixed quantity may also be configured as needed. The preset number of times is the number of times the first preset ratio of data, or the first fixed quantity of data, is randomly selected from the training set for first voice model training.
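As an illustration of this repeated random-subset training, here is a minimal Python sketch; train_hmm_gmm is a hypothetical placeholder for whatever HMM+GMM trainer is used, since the patent does not prescribe a toolkit:

```python
import random

def train_candidate_models(train_set, preset_count, subset_ratio,
                           train_hmm_gmm, seed=0):
    """Train `preset_count` HMM+GMM candidates, each on a freshly drawn
    random subset containing `subset_ratio` of the training set."""
    rng = random.Random(seed)
    subset_size = max(1, int(len(train_set) * subset_ratio))
    models = []
    for _ in range(preset_count):
        subset = rng.sample(train_set, subset_size)  # re-drawn every round
        models.append(train_hmm_gmm(subset))         # placeholder trainer
    return models
```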
In one embodiment, testing the first test set with the preset number of first voice models to obtain the first voice models whose word recognition error rate is within the preset range includes: testing the first test set separately with each of the preset number of first voice models to obtain the word recognition error rate of each first voice model; and filtering, according to these word recognition error rates, the first voice models whose word recognition error rate is within the preset range.
In this embodiment, the Word Error Rate (WER) is the ratio of the number of words recognized incorrectly during testing to the total number of words in the test set. By testing the first test set separately with the preset number of first voice models, the computer device obtains the word recognition error rate of each first voice model on the first test set; by comparing these rates with the preset range, the computer device obtains the first voice models whose word recognition error rate is within the preset range.
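A sketch of this evaluation step follows, assuming a hypothetical recognize(model, audio) decoder; for brevity it counts positional word mismatches rather than performing the full edit-distance alignment a production WER computation would use:

```python
def word_error_rate(model, test_set, recognize):
    """WER as defined above: incorrectly recognized words over the total
    number of words. Each test item is (audio, reference_words)."""
    errors = total = 0
    for audio, reference in test_set:
        hypothesis = recognize(model, audio)
        total += len(reference)
        # count positions where the recognized word differs from the reference
        errors += sum(h != r for h, r in zip(hypothesis, reference))
        errors += abs(len(hypothesis) - len(reference))  # length mismatch
    return errors / total

def best_model(models, test_set, recognize):
    """Pick the candidate model with the smallest WER on the first test set."""
    wers = [word_error_rate(m, test_set, recognize) for m in models]
    best = min(range(len(models)), key=wers.__getitem__)
    return models[best], wers[best]
```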
In one embodiment, detecting that the second test set and the first test set satisfy the similar condition includes: testing the second test set with the first voice model corresponding to the smallest word recognition error rate within the preset range, to obtain the word recognition error rate of the second test set; and detecting that the difference between the word recognition error rate of the second test set and that smallest word recognition error rate is less than or equal to the fault tolerance threshold, which indicates that the second test set and the first test set satisfy the similar condition.
In this embodiment, the fault tolerance threshold can be derived from repeated training in practice.
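The similarity test itself reduces to a single comparison. The sketch below assumes the difference is taken as an absolute value and that the error rates are expressed as fractions:

```python
def satisfies_similar_condition(wer_test2, wer_test1_min, fault_tolerance):
    """The retraining shortcut applies when the WER gap is within the
    fault tolerance threshold (a value tuned from repeated training)."""
    return abs(wer_test2 - wer_test1_min) <= fault_tolerance

# Example: first-training best WER 8.2%, second test set WER 8.9%,
# threshold 1.0 percentage point -> the parameters can be reused.
assert satisfies_similar_condition(0.089, 0.082, 0.01)
```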
In one embodiment, the voice data set training method further includes: separately selecting data from the first voice data set to generate a training set and the first test set; performing first voice model training on the training set to obtain a preset number of first voice models; testing the first test set with each of the preset number of first voice models to obtain the first voice model with the smallest word recognition error rate among them; and taking the parameters of that first voice model as the first voice model parameters.
In this embodiment, by testing the first test set separately with the preset number of first voice models, the computer device obtains the word recognition error rate of each first voice model on the first test set, and may sort these rates to find the smallest word recognition error rate among the preset number.
Detecting that the second test set and the first test set satisfy the similar condition then includes: testing the second test set with the first voice model corresponding to the smallest word recognition error rate among the preset number, to obtain the word recognition error rate of the second test set; and detecting that the difference between the word recognition error rate of the second test set and that smallest word recognition error rate is less than or equal to the fault tolerance threshold, which indicates that the second test set and the first test set satisfy the similar condition.
In one embodiment, determining the start and end times of each HMM state with the HMM+GMM model includes: acquiring voice data, segmenting the voice data, and extracting the features of each segment; listing all texts that each segment may correspond to; converting the text into phonemes according to a pronunciation dictionary; converting the phonemes into HMM states according to the HMM model; obtaining the probability of each candidate text from the parameters of the HMM+GMM model; deriving the most likely HMM state sequence by comparing these probabilities; and obtaining the start and end time of each HMM state from that state sequence.
Feature extraction for speech may cover sound intensity and sound intensity level, loudness, pitch, pitch period, fundamental frequency, signal-to-noise ratio, harmonics-to-noise ratio, and so on. Sound intensity is the average sound energy per unit time passing through a unit area perpendicular to the direction of sound wave propagation; it is denoted I and measured in watts per square meter, and it is usually expressed as a sound intensity level, whose common unit is the decibel (dB). Loudness expresses how strong a sound is perceived to be and is given as a loudness level. Pitch is the human auditory system's perception of sound frequency; its unit is the mel. The pitch period reflects the time interval between, or the frequency of, adjacent openings and closings of the glottis. The signal-to-noise ratio is computed as the ratio between the power of the signal and that of the noise. The harmonics-to-noise ratio is the ratio of harmonic components to noise components in speech.
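Two of the features listed above have direct formulas. The following sketch computes a sound intensity level in dB and a signal-to-noise ratio; the 1e-12 W/m² reference intensity is an assumption (the standard hearing-threshold reference), as the text only names the dB unit:

```python
import math

def intensity_level_db(intensity_w_per_m2, ref=1e-12):
    """Sound intensity level in dB relative to an assumed 1e-12 W/m^2
    hearing-threshold reference."""
    return 10.0 * math.log10(intensity_w_per_m2 / ref)

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio: signal power over noise power, in dB."""
    return 10.0 * math.log10(signal_power / noise_power)

print(intensity_level_db(1e-6))   # 60 dB
print(snr_db(2.0, 0.5))           # about 6.02 dB
```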
A phoneme is the smallest unit of speech, divided according to the natural attributes of speech. Phonemes are obtained by labeling the voice data. Labeling means processing raw, unprocessed data; labeling speech means making explicit the real content the speech represents.
The HMM state sequence obtained by the computer device looks like 112233345. Assuming it starts at time t, the start and end time of state 1 is t to t+2, and the start and end time of state 2 is t+3 to t+4.
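Recovering per-state start and end times from such a sequence is a run-length computation. The sketch below uses an illustrative sequence chosen to match the worked times above (state 1 occupying frames t to t+2, state 2 frames t+3 to t+4); one list element corresponds to one frame:

```python
from itertools import groupby

def state_spans(state_sequence, start_time=0):
    """Turn a per-frame HMM state sequence into (state, start, end) spans,
    assuming one frame per time step."""
    spans, t = [], start_time
    for state, run in groupby(state_sequence):
        n = len(list(run))
        spans.append((state, t, t + n - 1))
        t += n
    return spans

print(state_spans([1, 1, 1, 2, 2, 3, 3, 3, 4, 5], start_time=0))
# [(1, 0, 2), (2, 3, 4), (3, 5, 7), (4, 8, 8), (5, 9, 9)]
```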
FIG. 4 is a flowchart of a voice data set training method according to another embodiment. As shown in FIG. 4, a voice data set training method includes:
Step 402: Acquire a voice data set and determine whether the current training is the first training; if yes, perform step 404; if no, perform step 410.
Step 404: Separately select data from the voice data set to generate a training set and a first test set.
If the current training is the first training, the voice data set may be called the first voice data set.
Step 406: Randomly select a first preset ratio of data from the training set and perform hidden Markov model and mixed Gaussian model training; repeat this a preset number of times to obtain a preset number of HMM+GMM models.
Step 408: Test the first test set with each of the preset number of HMM+GMM models to obtain the smallest word recognition error rate, recorded as the first word recognition error rate; select the HMM+GMM model corresponding to the smallest word recognition error rate as the optimal HMM+GMM model; then perform step 416.
Step 410: Randomly select data from the voice data set to generate a second test set.
If the current training is not the first training, the voice data set may be called the second voice data set.
Step 412: Test the second test set with the optimal HMM+GMM model obtained in the first training to obtain the word recognition error rate of the second test set, recorded as the second word recognition error rate.
Step 414: Determine whether the difference between the second word recognition error rate and the first word recognition error rate is less than or equal to the fault tolerance threshold; if yes, perform step 416; if no, end.
Step 416: Perform hidden Markov model and deep neural network model training using the parameters of the optimal HMM+GMM model. With the above voice data set training method, when it is detected that the current training is not the first training, and the difference between the second word recognition error rate (obtained by testing the second test set with the optimal HMM+GMM model) and the first word recognition error rate (obtained by testing the first test set) is less than or equal to the fault tolerance threshold, the HMM+GMM parameters obtained by training the first voice data set are used to perform HMM+DNN training on the second voice data set. This skips the HMM+GMM training on the second voice data set, saving total training time and improving training efficiency. If the current training is the first training, the optimal HMM+GMM model is selected and its parameters are used for HMM+DNN training.
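Condensing steps 402-416, the control flow might be sketched as follows; every helper here (split, train_candidates, best_model, sample_test_set, wer, train_hmm_dnn) is a hypothetical stand-in for the corresponding step above:

```python
def run_training(data_set, is_first_training, state, fault_tolerance, h):
    """Condensed control flow of steps 402-416. `state` carries artifacts
    of the first training; `h` bundles hypothetical helper callables."""
    if is_first_training:                          # steps 404-408
        train_set, test1 = h["split"](data_set)
        models = h["train_candidates"](train_set)
        best_gmm, wer1 = h["best_model"](models, test1)
        state.update(best_gmm=best_gmm, wer1=wer1)
    else:                                          # steps 410-414
        test2 = h["sample_test_set"](data_set)
        wer2 = h["wer"](state["best_gmm"], test2)
        if abs(wer2 - state["wer1"]) > fault_tolerance:
            return None                            # not similar: end here
    # step 416: HMM+DNN training seeded with the optimal HMM+GMM parameters
    return h["train_hmm_dnn"](data_set, state["best_gmm"])
```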
FIG. 5 is a schematic structural diagram of an HMM+GMM model according to an embodiment. As shown in FIG. 5, the first layer 52 consists of individual voice frames, the second layer 54 is the GMM model, and the third layer 56 is the HMM model. The HMM model is paired with multiple GMM models that provide its output probabilities. Here, S represents an HMM state in the HMM model, and a represents the transition probabilities in the HMM model, where a_{s_{k-1} s_{k-2}} denotes the transition probability from state s_{k-1} to state s_{k-2}. Each GMM corresponds to the output probability of one HMM state. The computer device divides the voice data into individual voice frames, with each voice frame corresponding to one HMM state; the voice frames are the observations of the HMM.
FIG. 6 is a schematic structural diagram of an HMM+DNN model according to an embodiment. As shown in FIG. 6, the first layer 62 consists of individual voice frames, the second layer 64 is the DNN model, and the third layer 66 is the HMM model. Here, S represents an HMM state in the HMM model, and a represents the transition probabilities in the HMM model, where a_{s_{k-1} s_{k-2}} denotes the transition probability from state s_{k-1} to state s_{k-2}; h represents a neuron in the DNN model; W represents a weight in the DNN model; and M represents the number of layers of the DNN model. h stands for a function: in the first layer, the inputs of h are one or several frames of data together with their corresponding weights; from the second layer to the last layer, the inputs of h are the outputs of the previous layer and the weight corresponding to each output. Each DNN output corresponds to the output probability of one HMM state, and each DNN output corresponds to one voice frame.
In one embodiment, a single DNN model can be used, in the time domain, to take one voice frame as input and output the probability corresponding to an HMM state.
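As an illustration of that per-frame mapping, the sketch below runs one voice frame through a small feed-forward network and returns a probability per HMM state. The layer sizes, ReLU activation, and softmax output are assumptions made for the sake of a concrete example; the patent does not fix the DNN's internals:

```python
import numpy as np

def dnn_state_posteriors(frame, weights, biases):
    """Minimal feed-forward pass: one voice frame in, one probability
    per HMM state out. `weights`/`biases` are assumed pre-trained."""
    h = frame
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, W @ h + b)        # hidden layers (ReLU assumed)
    logits = weights[-1] @ h + biases[-1]
    e = np.exp(logits - logits.max())         # softmax over HMM states
    return e / e.sum()

# Toy shapes: a 39-dim feature frame, one hidden layer, 5 HMM states.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((64, 39)), rng.standard_normal((5, 64))]
bs = [np.zeros(64), np.zeros(5)]
print(dnn_state_posteriors(rng.standard_normal(39), Ws, bs).sum())  # -> 1.0
```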
FIG. 7 is a structural block diagram of a voice data set training apparatus according to an embodiment. As shown in FIG. 7, a voice data set training apparatus 700 includes a reading module 702, an acquisition module 704, and a training module 706, wherein:
The reading module 702 is configured to read a first test set generated by selecting data from a first voice data set, and first voice model parameters obtained by training the first voice data set.
In this embodiment, the first voice data set is the voice data set used for the first training. The computer device may select data from the first voice data set to generate the first test set. The first test set is a data set used to verify the performance of the first voice model obtained by training the first voice data set.
The first voice model parameters refer to the start and end times of each voice model state. For example, the first voice model parameters may be the start and end times of each HMM state, with each voice frame corresponding to one HMM state.
The acquisition module 704 is configured to acquire a second voice data set and randomly select data from the second voice data set to generate a second test set.
The training module 706 is configured to, upon detecting that the second test set and the first test set satisfy a similar condition, perform second voice model training on the second voice data set using the first voice model parameters obtained by the training.
The first voice model may be a hidden Markov model and mixed Gaussian model. The second voice model may be a hidden Markov model and deep neural network model.
With the above voice data set training apparatus, upon detecting that the second test set generated by selecting data from the second voice data set and the first test set generated by selecting data from the first voice data set satisfy the similar condition, the computer device can use the first voice model parameters obtained by training the first voice data set to perform second voice model training on the second voice data set. This skips the first voice model training on the second voice data set, saving total training time and improving training efficiency.
FIG. 8 is a structural block diagram of a voice data set training apparatus in another embodiment. As shown in FIG. 8, in addition to the reading module 702, the acquisition module 704, and the training module 706, a voice data set training apparatus 700 further includes a generation module 708, a model construction module 710, a screening module 712, and a parameter acquisition module 714. In this embodiment, the voice data set training apparatus 700 forms at least a part of the computer device, and the modules 702 to 714 may perform their corresponding operations through the computer device.

The generation module 708 is configured to separately select data from the first voice data set to generate a training set and a first test set.

In an embodiment, the generation module 708 is further configured to obtain the ratio of the quantity of data in the first test set to the quantity of data in the first voice data set, and to randomly select, from the second voice data set, data accounting for that ratio to generate the second test set, as in the sketch below.
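A sketch of that ratio-preserving sampling (the function name is hypothetical):

```python
import random

def make_second_test_set(first_test_set, first_dataset, second_dataset):
    """Randomly draw from the second voice data set at the same
    test-to-data ratio that the first test set had."""
    ratio = len(first_test_set) / len(first_dataset)
    return random.sample(second_dataset, int(len(second_dataset) * ratio))
```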
The model construction module 710 is configured to perform first voice model training on the training set to obtain a preset quantity of first voice models.

The screening module 712 is configured to test the first test set by using the preset quantity of first voice models to obtain first voice models whose word error rates are within a preset range.

The parameter acquisition module 714 is configured to use the parameters of the first voice models whose word error rates are within the preset range as the first voice model parameters.

The training module 706 is further configured to perform second voice model training on the first voice data set by using the parameters of a first voice model whose word error rate is within the preset range.

The computer device selects data from the first voice data set to generate a training set and may train the training set to obtain multiple first voice models; through testing on the first test set, the computer device can obtain the optimal first voice model. The computer device may use the parameters of any first voice model whose word error rate is within the preset range as the first voice model parameters, or it may use the parameters of the first voice model having the smallest word error rate within the preset range, which makes the shared first voice model parameters more accurate for subsequent use.
In an embodiment, the model construction module 710 is further configured to randomly select, each time, a first preset ratio of data or a first fixed quantity of data from the training set for first voice model training, repeating this a preset number of times to obtain the preset quantity of first voice models.

In an embodiment, the screening module 712 is further configured to test the first test set with each of the preset quantity of first voice models to obtain the word error rate of each first voice model, and to screen, according to the word error rate of each first voice model, the first voice models whose word error rates are within the preset range.
FIG. 9 is a structural block diagram of a voice data set training apparatus in another embodiment. As shown in FIG. 9, in addition to the reading module 702, the acquisition module 704, the training module 706, the generation module 708, the model construction module 710, the screening module 712, and the parameter acquisition module 714, a voice data set training apparatus 700 further includes a detection module 716.

The detection module 716 is configured to test the second test set by using the first voice model corresponding to the smallest word error rate within the preset range, to obtain the word error rate of the second test set; and, upon detecting that the difference between the word error rate of the second test set and the smallest word error rate within the preset range is less than or equal to the tolerance threshold, to determine that the second test set and the first test set satisfy the similarity condition.

In an embodiment, the generation module 708 is further configured to separately select data from the first voice data set to generate a training set and a first test set.

The model construction module 710 is configured to perform first voice model training on the training set to obtain a preset quantity of first voice models.

The screening module 712 is configured to test the first test set with each of the preset quantity of first voice models to obtain the first voice model having the smallest word error rate among the preset quantity.

The parameter acquisition module 714 is configured to use the parameters of the first voice model having the smallest word error rate as the first voice model parameters.

The detection module 716 is further configured to test the second test set by using the first voice model corresponding to the smallest word error rate among the preset quantity, to obtain the word error rate of the second test set; and, upon detecting that the difference between the word error rate of the second test set and the smallest word error rate among the preset quantity is less than or equal to the tolerance threshold, to determine that the second test set and the first test set satisfy the similarity condition.

The division of the modules in the voice data set training apparatus above is for illustration only; in other embodiments, the voice data set training apparatus may be divided into different modules as required to complete all or some of its functions.
An embodiment of the present invention further provides a computer device and a computer-readable storage medium.

A computer device includes a memory, a processor, and a computer program (instructions) stored in the memory and executable on the processor, the processor implementing the following steps when executing the program: reading a first test set generated from data selected from a first voice data set, as well as first voice model parameters obtained by training the first voice data set; acquiring a second voice data set, and randomly selecting data from the second voice data set to generate a second test set; and upon detecting that the second test set and the first test set satisfy the similarity condition, performing second voice model training on the second voice data set by using the first voice model parameters obtained through training. The first voice model may be a hidden Markov model and a Gaussian mixture model; the second voice model may be a hidden Markov model and a deep neural network model.
In an embodiment, the processor further implements the following steps when executing the program: separately selecting data from the first voice data set to generate a training set and a first test set; performing first voice model training on the training set to obtain a preset quantity of first voice models; testing the first test set with each of the preset quantity of first voice models to obtain first voice models whose word error rates are within the preset range; and using the parameters of the first voice models whose word error rates are within the preset range as the first voice model parameters.

In an embodiment, the processor performs first voice model training on the training set to obtain the preset quantity of first voice models by: randomly selecting, each time, a first preset ratio of data or a first fixed quantity of data from the training set for first voice model training, and repeating this a preset number of times to obtain the preset quantity of first voice models.

In an embodiment, the processor tests the first test set with the preset quantity of first voice models to obtain first voice models whose word error rates are within the preset range by: testing the first test set with each of the preset quantity of first voice models to obtain the word error rate of each first voice model, and screening according to those word error rates to obtain the first voice models whose word error rates are within the preset range.

In an embodiment, the processor detects that the second test set and the first test set satisfy the similarity condition by: testing the second test set with the first voice model corresponding to the smallest word error rate within the preset range, to obtain the word error rate of the second test set; and when the difference between the word error rate of the second test set and the smallest word error rate within the preset range is less than or equal to the tolerance threshold, determining that the second test set and the first test set satisfy the similarity condition.

In an embodiment, the processor is further configured to separately select data from the first voice data set to generate a training set and a first test set; perform first voice model training on the training set to obtain a preset quantity of first voice models; test the first test set with each of the preset quantity of first voice models to obtain the first voice model having the smallest word error rate among the preset quantity; and use the parameters of that first voice model as the first voice model parameters.

In an embodiment, the processor is further configured to test the second test set with the first voice model corresponding to the smallest word error rate among the preset quantity, to obtain the word error rate of the second test set; and, when the difference between the word error rate of the second test set and the smallest word error rate among the preset quantity is less than or equal to the tolerance threshold, to determine that the second test set and the first test set satisfy the similarity condition.

In an embodiment, the processor randomly selects data from the second voice data set to generate the second test set by: obtaining the ratio of the quantity of data in the first test set to the quantity of data in the first voice data set, and randomly selecting, from the second voice data set, data accounting for that ratio to generate the second test set.
A computer-readable storage medium stores a computer program that, when executed by a processor, implements the following steps: reading a first test set generated from data selected from a first voice data set, as well as first voice model parameters obtained by training the first voice data set; acquiring a second voice data set, and randomly selecting data from the second voice data set to generate a second test set; and upon detecting that the second test set and the first test set satisfy the similarity condition, performing second voice model training on the second voice data set by using the first voice model parameters obtained through training. The first voice model may be a hidden Markov model and a Gaussian mixture model; the second voice model may be a hidden Markov model and a deep neural network model.
In an embodiment, the processor further implements the following steps when executing the program: separately selecting data from the first voice data set to generate a training set and a first test set; performing first voice model training on the training set to obtain a preset quantity of first voice models; testing the first test set with each of the preset quantity of first voice models to obtain the optimal first voice model; and using the parameters of the optimal first voice model as the first voice model parameters.

In an embodiment, the processor performs first voice model training on the training set to obtain the preset quantity of first voice models by: randomly selecting, each time, a first preset ratio of data or a first fixed quantity of data from the training set for first voice model training, and repeating this a preset number of times to obtain the preset quantity of first voice models.

In an embodiment, the processor tests the first test set with the preset quantity of first voice models to obtain the optimal first voice model by: testing the first test set with each of the preset quantity of first voice models to obtain the word error rate of each first voice model, and screening according to those word error rates to obtain the first voice models whose word error rates are within the preset range.

In an embodiment, the processor detects that the second test set and the first test set satisfy the similarity condition by: testing the second test set with the first voice model corresponding to the smallest word error rate within the preset range, to obtain the word error rate of the second test set; and when the difference between the word error rate of the second test set and the smallest word error rate within the preset range is less than or equal to the tolerance threshold, determining that the second test set and the first test set satisfy the similarity condition.

In an embodiment, the processor randomly selects data from the second voice data set to generate the second test set by: obtaining the ratio of the quantity of data in the first test set to the quantity of data in the first voice data set, and randomly selecting, from the second voice data set, data accounting for that ratio to generate the second test set.

In an embodiment, the processor is further configured to separately select data from the first voice data set to generate a training set and a first test set; perform first voice model training on the training set to obtain a preset quantity of first voice models; test the first test set with each of the preset quantity of first voice models to obtain the first voice model having the smallest word error rate among the preset quantity; and use the parameters of that first voice model as the first voice model parameters.

In an embodiment, the processor is further configured to test the second test set with the first voice model corresponding to the smallest word error rate among the preset quantity, to obtain the word error rate of the second test set; and, when the difference between the word error rate of the second test set and the smallest word error rate among the preset quantity is less than or equal to the tolerance threshold, to determine that the second test set and the first test set satisfy the similarity condition.
In an embodiment, the computer-readable medium refers to a non-volatile storage medium, which excludes media such as energy and electromagnetic waves.

A person of ordinary skill in the art may understand that all or some of the procedures of the methods in the embodiments above may be implemented by a computer program instructing related hardware. The program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the procedures of the method embodiments above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), or the like.

The embodiments above express only several implementations of this application, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the patent scope of this application. It should be noted that a person of ordinary skill in the art may further make variations and improvements without departing from the concept of this application, all of which fall within the protection scope of this application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (21)

  1. A voice data set training method, performed by a computer device, the method comprising:
    reading a first test set generated from data selected from a first voice data set, and acquiring first voice model parameters obtained by training the first voice data set;
    acquiring a second voice data set, and randomly selecting data from the second voice data set to generate a second test set; and
    when it is detected that the second test set and the first test set satisfy a similarity condition, performing second voice model training on the second voice data set by using the first voice model parameters obtained through training.
  2. The method according to claim 1, wherein the step of acquiring first voice model parameters obtained by training the first voice data set comprises:
    separately selecting data from the first voice data set to generate a training set and the first test set;
    performing first voice model training on the training set to obtain a preset quantity of first voice models;
    testing the first test set with the preset quantity of first voice models respectively to obtain at least one first voice model whose word error rate is within a preset range; and
    using parameters of the at least one first voice model whose word error rate is within the preset range as the first voice model parameters.
  3. The method according to claim 2, wherein performing first voice model training on the training set to obtain a preset quantity of first voice models comprises:
    randomly selecting, each time, a first preset ratio of data or a first fixed quantity of data from the training set for first voice model training, and repeating the training to obtain the preset quantity of first voice models.
  4. The method according to claim 2 or 3, wherein testing the first test set with the preset quantity of first voice models to obtain at least one first voice model whose word error rate is within the preset range comprises:
    testing the first test set with the preset quantity of first voice models respectively to obtain a word error rate of each first voice model; and
    screening, according to the word error rate of each first voice model, to obtain the at least one first voice model whose word error rate is within the preset range.
  5. The method according to claim 2, wherein performing second voice model training on the second voice data set by using the first voice model parameters obtained through training comprises:
    performing second voice model training on the second voice data set by using parameters of the first voice model having the smallest word error rate within the preset range.
  6. The method according to claim 5, wherein detecting that the second test set and the first test set satisfy the similarity condition comprises:
    testing the second test set with the first voice model corresponding to the smallest word error rate within the preset range, to obtain a word error rate of the second test set; and
    when it is detected that a difference between the word error rate of the second test set and the smallest word error rate within the preset range is less than or equal to a tolerance threshold, determining that the second test set and the first test set satisfy the similarity condition.
  7. The method according to any one of claims 1 to 3, wherein randomly selecting data from the second voice data set to generate the second test set comprises:
    obtaining a ratio of the quantity of data in the first test set to the quantity of data in the first voice data set, and randomly selecting data accounting for the ratio from the second voice data set to generate the second test set.
  8. The method according to claim 1, further comprising:
    separately selecting data from the first voice data set to generate a training set and the first test set;
    performing first voice model training on the training set to obtain a preset quantity of first voice models;
    testing the first test set with the preset quantity of first voice models respectively to obtain the first voice model having the smallest word error rate among the preset quantity; and
    using parameters of the first voice model having the smallest word error rate as the updated first voice model parameters.
  9. The method according to claim 8, wherein detecting that the second test set and the first test set satisfy the similarity condition comprises:
    testing the second test set with the first voice model corresponding to the smallest word error rate among the preset quantity, to obtain a word error rate of the second test set; and
    when it is detected that a difference between the word error rate of the second test set and the smallest word error rate among the preset quantity is less than or equal to a tolerance threshold, determining that the second test set and the first test set satisfy the similarity condition.
  10. The method according to claim 1, further comprising:
    acquiring voice data, segmenting the voice data, and extracting features of each voice segment;
    listing texts corresponding to each voice segment;
    converting the texts into phonemes according to a pronunciation dictionary;
    converting the phonemes into hidden Markov model states according to a hidden Markov model;
    obtaining a probability corresponding to each text according to parameters of the hidden Markov model and a Gaussian mixture model;
    deriving a hidden Markov model state sequence by comparing the probabilities; and
    obtaining a start and end time of each hidden Markov model state according to the hidden Markov model state sequence.
  11. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions that, when executed by the processor, cause the processor to perform the following steps:
    reading a first test set generated from data selected from a first voice data set, and acquiring first voice model parameters obtained by training the first voice data set;
    acquiring a second voice data set, and randomly selecting data from the second voice data set to generate a second test set; and
    when it is detected that the second test set and the first test set satisfy a similarity condition, performing second voice model training on the second voice data set by using the first voice model parameters obtained through training.
  12. The computer device according to claim 11, wherein the computer-readable instructions, when executed by the processor, further cause the processor to perform the following steps:
    separately selecting data from the first voice data set to generate a training set and the first test set;
    performing first voice model training on the training set to obtain a preset quantity of first voice models;
    testing the first test set with the preset quantity of first voice models respectively to obtain at least one first voice model whose word error rate is within a preset range; and
    using parameters of the at least one first voice model whose word error rate is within the preset range as the first voice model parameters.
  13. The computer device according to claim 12, wherein the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of performing first voice model training on the training set to obtain a preset quantity of first voice models, to further perform the following step:
    randomly selecting, each time, a first preset ratio of data or a first fixed quantity of data from the training set for first voice model training, and repeating the training to obtain the preset quantity of first voice models.
  14. The computer device according to claim 12, wherein the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of testing the first test set with the preset quantity of first voice models to obtain at least one first voice model whose word error rate is within the preset range, to further perform the following steps:
    testing the first test set with the preset quantity of first voice models respectively to obtain a word error rate of each first voice model; and screening, according to the word error rate of each first voice model, to obtain the at least one first voice model whose word error rate is within the preset range.
  15. 根据权利要求14所述的计算机设备,其特征在于,所述计算机可读指令被所述处理器执行时,使得所述处理器在执行采用所述训练得到的第一语音模型参数对所述第二语音数据集进行第二语音模型训练,还执行以下步骤:
    采用所述字识别错误率在预设范围内中最小的字识别错误率的第一语音模型的参数对所述第二语音数据集进行第二语音模型训练。
  16. The computer device according to claim 15, wherein the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of detecting that the second test set and the first test set satisfy the similarity condition, to further perform the following steps:
    testing the second test set with the first voice model corresponding to the smallest word error rate within the preset range, to obtain a word error rate of the second test set; and when it is detected that a difference between the word error rate of the second test set and the smallest word error rate within the preset range is less than or equal to a tolerance threshold, determining that the second test set and the first test set satisfy the similarity condition.
  17. The computer device according to claim 12, wherein the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of randomly selecting data from the second voice data set to generate the second test set, to further perform the following step:
    obtaining a ratio of the quantity of data in the first test set to the quantity of data in the first voice data set, and randomly selecting data accounting for the ratio from the second voice data set to generate the second test set.
  18. The computer device according to claim 12, wherein the computer-readable instructions, when executed by the processor, further cause the processor to perform the following steps:
    separately selecting data from the first voice data set to generate a training set and the first test set; performing first voice model training on the training set to obtain a preset quantity of first voice models; testing the first test set with the preset quantity of first voice models respectively to obtain the first voice model having the smallest word error rate among the preset quantity; and using parameters of the first voice model having the smallest word error rate as the updated first voice model parameters.
  19. The computer device according to claim 18, wherein the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of detecting that the second test set and the first test set satisfy the similarity condition, to further perform the following steps:
    testing the second test set with the first voice model corresponding to the smallest word error rate among the preset quantity, to obtain a word error rate of the second test set; and when it is detected that a difference between the word error rate of the second test set and the smallest word error rate among the preset quantity is less than or equal to a tolerance threshold, determining that the second test set and the first test set satisfy the similarity condition.
  20. The computer device according to claim 11, wherein the computer-readable instructions, when executed by the processor, further cause the processor to perform the following steps:
    acquiring voice data, segmenting the voice data, and extracting features of each voice segment; listing texts corresponding to each voice segment; converting the texts into phonemes according to a pronunciation dictionary; converting the phonemes into hidden Markov model states according to a hidden Markov model; obtaining a probability corresponding to each text according to parameters of the hidden Markov model and a Gaussian mixture model; deriving a hidden Markov model state sequence by comparing the probabilities; and obtaining a start and end time of each hidden Markov model state according to the hidden Markov model state sequence.
  21. A non-volatile computer-readable storage medium storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
    reading a first test set generated from data selected from a first voice data set, and acquiring first voice model parameters obtained by training the first voice data set; acquiring a second voice data set, and randomly selecting data from the second voice data set to generate a second test set; and when it is detected that the second test set and the first test set satisfy a similarity condition, performing second voice model training on the second voice data set by using the first voice model parameters obtained through training.
PCT/CN2018/075595 2017-03-10 2018-02-07 Voice data set training method, computer device, and computer-readable storage medium WO2018161763A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP18764634.4A EP3594940B1 (en) 2017-03-10 2018-02-07 Training method for voice data set, computer device and computer readable storage medium
US16/436,479 US11069342B2 (en) 2017-03-10 2019-06-10 Method for training voice data set, computer device, and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710143053.2A 2017-03-10 2017-03-10 Voice data set training method and apparatus
CN201710143053.2 2017-03-10

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/436,479 Continuation US11069342B2 (en) 2017-03-10 2019-06-10 Method for training voice data set, computer device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2018161763A1 true WO2018161763A1 (zh) 2018-09-13

Family

ID=62872036

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/075595 WO2018161763A1 (zh) 2017-03-10 2018-02-07 Voice data set training method, computer device, and computer-readable storage medium

Country Status (4)

Country Link
US (1) US11069342B2 (zh)
EP (1) EP3594940B1 (zh)
CN (1) CN108305619B (zh)
WO (1) WO2018161763A1 (zh)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108305619B (zh) * 2017-03-10 2020-08-04 腾讯科技(深圳)有限公司 Voice data set training method and apparatus
CN109408660B (zh) * 2018-08-31 2021-08-10 安徽四创电子股份有限公司 Method for automatic music classification based on audio features
CN109378014A (zh) * 2018-10-22 2019-02-22 华中师范大学 Mobile device source identification method and system based on a convolutional neural network
EP3963474A4 (en) 2019-05-01 2022-12-14 Microsoft Technology Licensing, LLC METHOD AND SYSTEM FOR USING UNSUPERVISED LEARNING TO IMPROVE TEXT ON SUGGESTED CONTENT
CN110265001B (zh) * 2019-05-06 2023-06-23 平安科技(深圳)有限公司 Corpus screening method and apparatus for speech recognition training, and computer device
CN110379416B (zh) * 2019-08-15 2021-10-22 腾讯科技(深圳)有限公司 Neural network language model training method, apparatus, device, and storage medium
KR20210095431A (ko) * 2020-01-23 2021-08-02 삼성전자주식회사 Electronic device and control method therefor
US11727270B2 (en) * 2020-02-24 2023-08-15 Microsoft Technology Licensing, Llc Cross data set knowledge distillation for training machine learning models
CN112435230B (zh) * 2020-11-20 2021-07-16 哈尔滨市科佳通用机电股份有限公司 Deep-learning-based data set generation method and system
CN112786051B (zh) * 2020-12-28 2023-08-01 问问智能信息科技有限公司 Voice data recognition method and apparatus

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2048656A1 (en) * 2007-10-10 2009-04-15 Harman/Becker Automotive Systems GmbH Speaker recognition
CN103903613A * 2014-03-10 2014-07-02 联想(北京)有限公司 Information processing method and electronic device
CN104240699A * 2014-09-12 2014-12-24 浙江大学 Simple and effective phrase speech recognition method
CN106098059A * 2016-06-23 2016-11-09 上海交通大学 Customizable voice wake-up method and system
CN106228980A * 2016-07-21 2016-12-14 百度在线网络技术(北京)有限公司 Data processing method and apparatus

Family Cites Families (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5390278A (en) * 1991-10-08 1995-02-14 Bell Canada Phoneme based speech recognition
US6697769B1 (en) * 2000-01-21 2004-02-24 Microsoft Corporation Method and apparatus for fast machine training
JP4615385B2 (ja) 2005-07-12 2011-01-19 株式会社沖データ Image reading apparatus
JP2009086581A (ja) * 2007-10-03 2009-04-23 Toshiba Corp Apparatus and program for creating a speaker model for speech recognition
CN101866418B (zh) 2009-04-17 2013-02-27 株式会社理光 Method and device for determining document reading order
US8532994B2 (en) * 2010-08-27 2013-09-10 Cisco Technology, Inc. Speech recognition using a personal vocabulary and language model
JP6234060B2 (ja) * 2013-05-09 2017-11-22 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Method, apparatus, and program for generating training speech data for a target domain
CN104167206B (zh) * 2013-05-17 2017-05-31 佳能株式会社 Acoustic model merging method and device, and speech recognition method and system
US9508347B2 (en) * 2013-07-10 2016-11-29 Tencent Technology (Shenzhen) Company Limited Method and device for parallel processing in model training
US10438581B2 (en) * 2013-07-31 2019-10-08 Google Llc Speech recognition using neural networks
US9620145B2 (en) * 2013-11-01 2017-04-11 Google Inc. Context-dependent state tying using a neural network
US10019985B2 (en) * 2013-11-04 2018-07-10 Google Llc Asynchronous optimization for sequence training of neural networks
US9390712B2 (en) * 2014-03-24 2016-07-12 Microsoft Technology Licensing, Llc. Mixed speech recognition
CN104268127B (zh) 2014-09-22 2018-02-09 同方知网(北京)技术有限公司 Method for analyzing the reading order of electronic layout documents
CN104808799A (zh) 2015-05-20 2015-07-29 成都通甲优博科技有限责任公司 Unmanned aerial vehicle capable of gesture recognition and recognition method thereof
CN104941203A (zh) 2015-06-03 2015-09-30 赵旭 Toy based on gesture trajectory recognition and recognition and control methods thereof
CN105045819B (zh) * 2015-06-26 2018-04-20 深圳市腾讯计算机系统有限公司 Model training method and apparatus using training data
US10529318B2 (en) * 2015-07-31 2020-01-07 International Business Machines Corporation Implementing a classification model for recognition processing
CN105185372B (zh) * 2015-10-20 2017-03-22 百度在线网络技术(北京)有限公司 Training method for personalized multiple acoustic models, and speech synthesis method and apparatus
CN105955308B (zh) 2016-05-20 2018-06-29 腾讯科技(深圳)有限公司 Aircraft control method and apparatus
CN106020227B (zh) 2016-08-12 2019-02-26 北京奇虎科技有限公司 Control method and apparatus for an unmanned aerial vehicle
CN106339006B (zh) 2016-09-09 2018-10-23 腾讯科技(深圳)有限公司 Target tracking method and apparatus for an aircraft
CN106843489B (zh) 2017-01-24 2019-02-19 腾讯科技(深圳)有限公司 Flight route control method for an aircraft, and aircraft
CN108305619B (zh) * 2017-03-10 2020-08-04 腾讯科技(深圳)有限公司 Voice data set training method and apparatus
CN106774945A (zh) 2017-01-24 2017-05-31 腾讯科技(深圳)有限公司 Aircraft flight control method and apparatus, aircraft, and system
KR102399535B1 (ko) * 2017-03-23 2022-05-19 삼성전자주식회사 Learning method and apparatus for speech recognition


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3594940A4 *

Also Published As

Publication number Publication date
CN108305619B (zh) 2020-08-04
US20190318723A1 (en) 2019-10-17
CN108305619A (zh) 2018-07-20
EP3594940A1 (en) 2020-01-15
US11069342B2 (en) 2021-07-20
EP3594940B1 (en) 2023-07-26
EP3594940A4 (en) 2020-03-11

Similar Documents

Publication Publication Date Title
WO2018161763A1 (zh) Voice data set training method, computer device, and computer-readable storage medium
US10460721B2 (en) Dialogue act estimation method, dialogue act estimation apparatus, and storage medium
CN107680582B (zh) Acoustic model training method, speech recognition method, apparatus, device, and medium
US10332510B2 (en) Method and apparatus for training language model and recognizing speech
CN103400577B (zh) Acoustic model building method and apparatus for multilingual speech recognition
WO2021128741A1 (zh) Voice emotion fluctuation analysis method and apparatus, computer device, and storage medium
KR102399535B1 (ko) Learning method and apparatus for speech recognition
CN110349597B (zh) Voice detection method and apparatus
US11282501B2 (en) Speech recognition method and apparatus
CN106098059A (zh) Customizable voice wake-up method and system
CN104903954A (zh) Speaker verification and identification using sub-phonetic unit discrimination based on artificial neural networks
US11205419B2 (en) Low energy deep-learning networks for generating auditory features for audio processing pipelines
US8005674B2 (en) Data modeling of class independent recognition models
CN106340297A (zh) Speech recognition method and system based on cloud computing and confidence computation
US9519049B1 (en) Processing unknown radar emitters
CN113314100B (zh) Evaluation and result display method for spoken-language tests, apparatus, device, and storage medium
JPWO2017146073A1 (ja) Voice quality conversion device, voice quality conversion method, and program
Pace et al. Hidden Markov Modeling for humpback whale (Megaptera Novaeanglie) call classification
CN116114015A (zh) Chaos testing for voice-enabled devices
WO2020216286A1 (zh) Training method for a teacher-style prediction model, and computer storage medium
CN113555005B (zh) Model training and confidence determination methods and apparatuses, electronic device, and storage medium
KR101862352B1 (ko) Preprocessing apparatus for speech recognition, and speech recognition apparatus and method using the same
CN114639390A (zh) Voice noise analysis method and system
KR20110071742A (ko) Apparatus and method for utterance verification based on word-specific confidence thresholds
KR20160109942A (ko) Apparatus and method for utterance verification using real-time word-level duration modeling

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 18764634
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
WWE Wipo information: entry into national phase
    Ref document number: 2018764634
    Country of ref document: EP
ENP Entry into the national phase
    Ref document number: 2018764634
    Country of ref document: EP
    Effective date: 20191010