CN108172218A - Pronunciation modeling method and device - Google Patents
- Publication number: CN108172218A (application CN201611103738.6A)
- Authority
- CN
- China
- Prior art keywords
- layer
- input
- data
- probability
- output
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
Landscapes
- Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Machine Translation (AREA)
Abstract
The present invention provides a pronunciation modeling method and device in the technical field of speech recognition, for reducing the complexity of speech model building. The pronunciation modeling method of the present invention includes: taking standard Mandarin data and at least one set of Mandarin data with a dialectal accent as input data, and extracting speech feature vectors of the input data; training a deep neural network (DNN) acoustic model with the speech feature vectors, where the output layer of the acoustic model outputs a first probability; obtaining Mandarin data with a target dialectal accent; and retraining the output layer with the Mandarin data with the target dialectal accent and updating the first probability with a second probability output by the output layer. The present invention reduces the complexity of speech model building.
Description
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a pronunciation modeling method and device.
Background art
Speech recognition lets machines understand human speech by converting the speech signal into input a computer can recognize. Current speech recognition technology mainly comprises statistical pattern recognition and artificial neural network techniques.
The hidden Markov model (HMM) is one of the more mature models in current speech fields such as speech recognition; modeling the temporal structure of speech with hidden Markov models, using concepts from statistics, has yielded good results.
In recent years, speech recognition systems based on deep neural networks (DNNs) have drawn more and more attention from researchers. The concept of deep learning originated in research on artificial neural networks and was proposed by Hinton et al. in 2006. The essence of deep learning is to build machine learning models with many hidden layers and train them on massive data, so as to learn more useful features and ultimately improve the accuracy of classification or prediction. The main points are: (1) an artificial neural network with many hidden layers has excellent feature learning ability, and the features it learns describe the data more essentially, which benefits classification; (2) the difficulty of training deep neural networks can be effectively overcome by layer-wise initialization, which is realized through unsupervised learning.
To improve recognition accuracy for Mandarin with a dialectal-accent background, the prior art provides a variety of methods. Some improve the training procedure used in acoustic modeling; others improve the language model of the recognition system. However, in existing methods for recognizing Mandarin with a dialectal-accent background, training the model is highly complex.
Summary of the invention
In view of this, the present invention provides a pronunciation modeling method and device to reduce the complexity of speech model building.
To solve the above technical problem, the present invention provides a pronunciation modeling method, including:
taking standard Mandarin data and at least one set of Mandarin data with a dialectal accent as input data, and extracting speech feature vectors of the input data;
training a deep neural network (DNN) acoustic model with the speech feature vectors, where the output layer of the acoustic model outputs a first probability;
obtaining Mandarin data with a target dialectal accent;
retraining the output layer with the Mandarin data with the target dialectal accent, and updating the first probability with a second probability output by the output layer.
Wherein, the step of extracting the speech feature vectors of the input data includes:
performing windowed framing on the input data to obtain speech frames;
removing silent frames from the speech frames to obtain the speech feature vectors.
Wherein, the step of training the deep neural network (DNN) acoustic model with the speech feature vectors, where the output layer of the acoustic model outputs the first probability, includes:
feeding the speech feature vectors as the input signal to the input layer of the DNN acoustic model;
in each of the multiple hidden layers of the DNN acoustic model, processing the layer's input signal with the layer's corresponding first weights to obtain the layer's output signal;
in the output layer of the DNN acoustic model, processing the output signal of the topmost hidden layer to obtain the first probability.
Wherein, the step of retraining the output layer with the Mandarin data with the target dialectal accent and updating the first probability with the second probability output by the output layer includes:
feeding the Mandarin data with the target dialectal accent as the input signal to the input layer of the DNN acoustic model;
in each of the multiple hidden layers of the DNN acoustic model, processing the layer's input signal with the layer's corresponding second weights to obtain the layer's output signal;
in the output layer of the DNN acoustic model, processing the output signal of the topmost hidden layer to obtain the second probability;
updating the first probability with the second probability.
Wherein, before the step of retraining the output layer with the Mandarin data with the target dialectal accent and updating the first probability with the second probability output by the output layer, the method further includes:
removing silent frames from the Mandarin data with the target dialectal accent;
and the step of feeding the Mandarin data with the target dialectal accent as the input signal to the input layer of the DNN acoustic model includes:
feeding the Mandarin data with the target dialectal accent, with the silent frames removed, as the input signal to the input layer of the DNN acoustic model.
Wherein, the method further includes:
obtaining accented Mandarin data to be recognized;
recognizing the accented Mandarin data to be recognized according to the second probability.
In a second aspect, the present invention provides a pronunciation modeling device, including:
an extraction module, configured to take standard Mandarin data and at least one set of Mandarin data with a dialectal accent as input data and extract speech feature vectors of the input data;
a training module, configured to train a deep neural network (DNN) acoustic model with the speech feature vectors, where the output layer of the acoustic model outputs a first probability;
an acquisition module, configured to obtain Mandarin data with a target dialectal accent;
a modeling module, configured to retrain the output layer with the Mandarin data with the target dialectal accent and update the first probability with a second probability output by the output layer.
Wherein, the extraction module includes:
a first acquisition submodule, configured to perform windowed framing on the input data to obtain speech frames;
a second acquisition submodule, configured to remove silent frames from the speech frames to obtain the speech feature vectors.
Wherein, the training module includes:
a first input layer submodule, configured to feed the speech feature vectors as the input signal to the input layer of the DNN acoustic model;
a first hidden layer submodule, configured to process, in each of the multiple hidden layers of the DNN acoustic model, the layer's input signal with the layer's corresponding first weights to obtain the layer's output signal;
a first output layer submodule, configured to process, in the output layer of the DNN acoustic model, the output signal of the topmost hidden layer to obtain the first probability.
Wherein, the modeling module includes:
a second input layer submodule, configured to feed the Mandarin data with the target dialectal accent as the input signal to the input layer of the DNN acoustic model;
a second hidden layer submodule, configured to process, in each of the multiple hidden layers of the DNN acoustic model, the layer's input signal with the layer's corresponding second weights to obtain the layer's output signal;
a second output layer submodule, configured to process, in the output layer of the DNN acoustic model, the output signal of the topmost hidden layer to obtain the second probability;
an updating submodule, configured to update the first probability with the second probability.
Wherein, the device further includes:
a processing module, configured to remove silent frames from the Mandarin data with the target dialectal accent;
the second input layer submodule being specifically configured to feed the Mandarin data with the target dialectal accent, with the silent frames removed, as the input signal to the input layer of the DNN acoustic model.
Wherein, the device further includes:
a receiving module, configured to obtain accented Mandarin data to be recognized;
an identification module, configured to recognize the accented Mandarin data to be recognized according to the second probability.
The above technical solutions of the present invention have the following beneficial effects:
In embodiments of the present invention, an acoustic model is trained with deep neural network techniques on the basis of standard Mandarin data and at least one set of Mandarin data with a dialectal accent, yielding the first probability. For Mandarin data carrying a target dialectal accent, the output layer of the acoustic model is retrained, and the first probability is updated with the second probability output by the output layer. With the scheme of the embodiments of the present invention, the hidden-layer parameters of the trained acoustic model are reused when adapting to Mandarin data with a target dialectal accent, so no separate model needs to be built for the accented data of each dialect area. This simplifies model training and thus reduces the complexity of speech model building.
Description of the drawings
Fig. 1 is a flowchart of the pronunciation modeling method of Embodiment 1 of the present invention;
Fig. 2 is a structural diagram of the pronunciation modeling device of Embodiment 2 of the present invention;
Fig. 3 is a schematic diagram of the pronunciation modeling device of Embodiment 2 of the present invention;
Fig. 4 is a schematic diagram of the automatic speech recognition system of Embodiment 3 of the present invention.
Specific embodiments
The specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples are intended to illustrate the present invention, not to limit its scope.
Embodiment one
As shown in Fig. 1, the pronunciation modeling method of Embodiment 1 of the present invention includes:
Step 101: take standard Mandarin data and at least one set of Mandarin data with a dialectal accent as input data, and extract the speech feature vectors of the input data.
Chinese mainly comprises the officially announced standard Mandarin and Mandarin spoken with the dialectal accents of various regions. Chinese dialects can be broadly divided by region into eight major dialect areas. Standard Chinese is a single language, but the pronunciation of Mandarin is influenced by the local dialectal accent of each area, so compared with standard Mandarin it exhibits sound changes in some words. As a result, an acoustic model trained only on standard Mandarin data cannot correctly describe the acoustic features of accented speech; moreover, in engineering practice it is difficult to collect enough Mandarin data with a specific dialectal accent to build a database of sufficient size.
Therefore, in embodiments of the present invention, the input data are standard Mandarin data together with Mandarin data carrying the dialectal accent of at least one dialect area; acoustic feature vectors are extracted from them jointly to train a multi-hidden-layer DNN model. Preferably, the input data are standard Mandarin data together with accented Mandarin data from all eight major dialect areas.
To make the subsequently built acoustic model more accurate, the input data are first windowed and divided into frames to obtain speech frames. The short-time energy of each speech frame is then computed, and silent frames are removed according to it: the short-time energy of each frame is compared with a preset threshold, and any frame whose short-time energy falls below the threshold is treated as a silent frame and removed from the speech frames, yielding the speech feature vectors. The threshold can be set arbitrarily.
The speech feature vector may also be context-dependent, i.e. constructed from the feature vectors of several consecutive frames. The features may be, for example, Mel-frequency cepstral coefficients (MFCC) or perceptual linear prediction (PLP) features.
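The windowed framing and energy-based silence removal described above can be sketched as follows. The frame length, hop size, and Hamming window are illustrative assumptions (25 ms frames with a 10 ms hop at 16 kHz); the patent leaves the energy threshold arbitrary:

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a waveform into overlapping, Hamming-windowed frames."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    window = np.hamming(frame_len)
    return np.stack([signal[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])

def drop_silent_frames(frames, threshold):
    """Remove frames whose short-time energy falls below the threshold."""
    energy = np.sum(frames.astype(np.float64) ** 2, axis=1)
    return frames[energy >= threshold]
```

A real front end would then compute MFCC or PLP features from the retained frames; only the framing and silence-removal steps are shown here.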
Step 102: train the DNN acoustic model with the speech feature vectors, where the output layer of the acoustic model outputs the first probability.
In practical applications, the DNN acoustic model includes:
An input layer, which receives the speech feature vectors.
Multiple hidden layers (at least three). Each hidden layer contains multiple nodes (neurons), and each node is configured to apply a linear or nonlinear transformation to the outputs of at least one node in the adjacent lower layer of the DNN; that is, the input of a node in an upper hidden layer may be based on the outputs of one or several nodes in the adjacent lower layer. Each hidden layer has corresponding weights, obtained from the acoustic signals of the training data. When the model is trained, the initial weights of each hidden layer can be obtained by pre-training the model with a supervised or unsupervised learning process; the weights of each hidden layer can then be fine-tuned with the back-propagation (BP) algorithm.
An output layer, which receives the output signal of the topmost hidden layer. The nodes of the output layer process the received signal using modeling units composed of Mandarin pronunciation phonemes, and the output is a probability distribution over the modeling units, referred to here as a probability.
The output units of the output layer are the modeling units representing the phonetic elements used in standard Chinese. The modeling units may be phones (tied triphone states), and may be hidden Markov model (HMM) states or other suitable modeling units.
Specifically, in this step, the speech feature vectors are fed as the input signal to the input layer of the DNN acoustic model; in each of the multiple hidden layers of the DNN acoustic model, the layer's input signal is processed with the layer's corresponding first weights to obtain its output signal; and in the output layer of the DNN acoustic model, the output signal of the topmost hidden layer is processed to obtain the first probability.
Step 103: obtain Mandarin data with a target dialectal accent.
The Mandarin data with the target dialectal accent may be Mandarin data with any dialectal accent.
Step 104: retrain the output layer with the Mandarin data with the target dialectal accent, and update the first probability with the second probability output by the output layer.
In embodiments of the present invention, steps 103 and 104 may be regarded as adaptively adjusting the DNN acoustic model of step 102 with the Mandarin data carrying the target dialectal accent. In the model adaptation stage, the output layer is retrained with the Mandarin data with the target dialectal accent, and the newly learned probability values of the output layer directly replace the probability values output by the acoustic model trained in step 102 on standard Mandarin data and data with several dialectal accents.
Specifically, the Mandarin data with the target dialectal accent are fed as the input signal to the input layer of the DNN acoustic model; in each of the multiple hidden layers of the DNN acoustic model, the layer's input signal is processed with the layer's corresponding second weights to obtain its output signal; in the output layer of the DNN acoustic model, the output signal of the topmost hidden layer is processed to obtain the second probability; and the first probability is updated with the second probability.
It should be noted that if the training data in step 104 are relatively scarce, the weights of each hidden layer can be kept identical to the corresponding weights in step 102 during model adaptation. In this way, the scheme can adapt the obtained acoustic model to Mandarin data with the target dialectal accent without requiring a large amount of data, improving recognition accuracy on such data. If the training data in step 104 are relatively plentiful, the weights of the hidden layers can also be readjusted in this step with the Mandarin data with the target dialectal accent, and the output probability of the output layer updated at the topmost hidden layer, likewise improving the model's recognition rate.
Training the DNN acoustic model and adaptively adjusting it through steps 101-104 completes the construction of the DNN acoustic model.
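The data-scarce adaptation case above — hidden-layer weights reused unchanged, only the output layer relearned — might look like the following sketch. The cross-entropy objective, sigmoid hidden units, and gradient-descent update are assumptions for illustration, not details given in the patent:

```python
import numpy as np

def adapt_output_layer(feats, labels, hidden_weights, output_weights,
                       lr=0.1, epochs=20):
    """Reuse (freeze) the hidden layers trained on multi-accent data and
    retrain only the output layer on target-accent frames, so that the
    'second probability' replaces the 'first probability'."""
    h = feats
    for w, b in hidden_weights:           # frozen hidden layers: forward once
        h = 1.0 / (1.0 + np.exp(-(h @ w + b)))
    w_out, b_out = output_weights
    n = len(labels)
    for _ in range(epochs):
        logits = h @ w_out + b_out
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs = e / e.sum(axis=1, keepdims=True)   # the "second probability"
        grad = probs.copy()
        grad[np.arange(n), labels] -= 1.0          # d(cross-entropy)/d(logits)
        w_out = w_out - lr * (h.T @ grad) / n
        b_out = b_out - lr * grad.mean(axis=0)
    return w_out, b_out
```

Because the hidden-layer outputs never change, they need to be computed only once, which is what makes adaptation cheap on a small amount of target-accent data.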
As can be seen from the above, with the scheme of the embodiments of the present invention, the hidden-layer parameters of the trained acoustic model are reused when adapting with Mandarin data carrying a target dialectal accent, so no separate model needs to be built for the accented data of each dialect area. This simplifies model training and thus reduces the complexity of speech model building.
On the basis of Embodiment 1, after the Mandarin data with the target dialectal accent are obtained in step 103, silent frames may also be removed from them to improve recognition accuracy. Specifically, windowed framing is performed on the Mandarin data with the target dialectal accent to obtain speech frames. The short-time energy of each speech frame is then computed and silent frames are removed accordingly: the short-time energy of each frame is compared with a preset threshold, and any frame whose short-time energy falls below the threshold is treated as a silent frame and removed from the speech frames. The threshold can be set arbitrarily.
After the model has been trained and adaptively adjusted as above, speech can be recognized with the adjusted model. Accented Mandarin data to be recognized are obtained and recognized according to the second probability.
Specifically, the accented Mandarin data to be recognized are fed into the acoustic model obtained through steps 101-104, which outputs a third probability. The third probability is matched against the second probability, and the words in the accented Mandarin data to be recognized are identified according to the degree of matching.
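The patent does not spell out the matching procedure; a full system would normally run an HMM/Viterbi search over the per-frame probabilities. As a rough illustration only, one simple reading is to pick the most probable modeling unit for each frame and collapse consecutive repeats:

```python
import numpy as np

def greedy_decode(frame_posteriors, unit_names):
    """Rough decoding sketch (an assumption, not the patent's exact
    procedure): take the argmax modeling unit per frame and collapse
    consecutive repeats into a unit sequence."""
    best = np.argmax(frame_posteriors, axis=1)
    units = []
    for idx in best:
        if not units or units[-1] != unit_names[idx]:
            units.append(unit_names[idx])
    return units
```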
Through the above scheme, the deep-neural-network modeling techniques greatly improve the classification ability of the multiple hidden layers of the obtained acoustic model, thereby improving recognition accuracy. In the model adaptation stage, the hidden-layer parameters of the obtained acoustic model are reused, so no separate model needs to be built for the accented data of each dialect area, which simplifies model training. In addition, with the scheme of the embodiments of the present invention, there is no need to build a database of accented data for individual dialect areas: the probability values of the output layer are relearned with a small amount of data, and the acoustic model can adapt to accented data from different target dialect areas.
Embodiment two
As shown in Fig. 2, the pronunciation modeling device of Embodiment 2 of the present invention includes: an extraction module 201, configured to take standard Mandarin data and at least one set of Mandarin data with a dialectal accent as input data and extract speech feature vectors of the input data; a training module 202, configured to train a deep neural network (DNN) acoustic model with the speech feature vectors, where the output layer of the acoustic model outputs a first probability; an acquisition module 203, configured to obtain Mandarin data with a target dialectal accent; and a modeling module 204, configured to retrain the output layer with the Mandarin data with the target dialectal accent and update the first probability with a second probability output by the output layer.
Wherein, the extraction module 201 includes: a first acquisition submodule, configured to perform windowed framing on the input data to obtain speech frames; and a second acquisition submodule, configured to remove silent frames from the speech frames to obtain the speech feature vectors.
Wherein, the training module 202 includes: a first input layer submodule, configured to feed the speech feature vectors as the input signal to the input layer of the DNN acoustic model; a first hidden layer submodule, configured to process, in each of the multiple hidden layers of the DNN acoustic model, the layer's input signal with the layer's corresponding first weights to obtain the layer's output signal; and a first output layer submodule, configured to process, in the output layer of the DNN acoustic model, the output signal of the topmost hidden layer to obtain the first probability.
Wherein, the modeling module 204 includes: a second input layer submodule, configured to feed the Mandarin data with the target dialectal accent as the input signal to the input layer of the DNN acoustic model; a second hidden layer submodule, configured to process, in each of the multiple hidden layers of the DNN acoustic model, the layer's input signal with the layer's corresponding second weights to obtain the layer's output signal; a second output layer submodule, configured to process, in the output layer of the DNN acoustic model, the output signal of the topmost hidden layer to obtain the second probability; and an updating submodule, configured to update the first probability with the second probability.
As shown in Fig. 3, the device further includes a processing module 205, configured to remove silent frames from the Mandarin data with the target dialectal accent. The second input layer submodule is then specifically configured to feed the Mandarin data with the target dialectal accent, with the silent frames removed, as the input signal to the input layer of the DNN acoustic model.
Also as shown in Fig. 3, the device further includes: a receiving module 206, configured to obtain accented Mandarin data to be recognized; and an identification module 207, configured to recognize the accented Mandarin data to be recognized according to the second probability.
For the working principle of the device of the present invention, reference may be made to the description of the foregoing method embodiment.
As can be seen from the above, with the scheme of the embodiments of the present invention, the hidden-layer parameters of the trained acoustic model are reused when adapting with Mandarin data carrying a target dialectal accent, so no separate model needs to be built for the accented data of each dialect area. This simplifies model training and thus reduces the complexity of speech model building.
Embodiment three
As shown in Fig. 4, the automatic speech recognition system of Embodiment 3 of the present invention includes an extraction component 401, a training component 402, a decoder component 403, and so on.
The extraction component extracts the speech feature vectors of the input signal. When training the DNN acoustic model, standard Mandarin data are merged with Mandarin data carrying the dialectal accents of the major dialect areas as the input signal; when adapting the acoustic model, Mandarin data with the dialectal accent of the target dialect area are selected as the input signal.
The training component (DNN) trains the DNN acoustic model and adaptively adjusts the obtained acoustic model. It includes:
An input layer, which receives the speech feature vectors from the extraction component.
Multiple hidden layers (at least three). Each hidden layer contains multiple nodes (neurons), and each node is configured to apply a linear or nonlinear transformation to the outputs of at least one node in the adjacent lower layer of the DNN; the input of a node in an upper hidden layer may be based on the outputs of one or several nodes in the adjacent lower layer. Each hidden layer has corresponding weights, obtained from the acoustic signals of the training data. When the model is trained, the initial weights of each hidden layer can be obtained by pre-training the model with a supervised or unsupervised learning process; the weights of each hidden layer can then be fine-tuned with the back-propagation algorithm.
An output layer, which receives the output of the topmost hidden layer of the DNN. The nodes of the output layer process the received signal using modeling units composed of Mandarin pronunciation phonemes, and the output is a probability distribution over the modeling units, referred to here as the first probability.
The output units of the output layer are the modeling units representing the phonetic elements used in standard Chinese. The modeling units may be phones (tied triphone states), and may be hidden Markov model (HMM) states or other suitable modeling units.
Decoder component, for utilizing the common of the probability identification target dialect zone dialectal accent of training device assembly output
Talk about the word of data.
In an embodiment of the present invention, the training data consist of standard Mandarin data together with added data carrying the accents of the eight major dialect regions; acoustic feature vectors are extracted from them jointly to train a DNN model with multiple hidden layers. Furthermore, to improve the DNN model's adaptability to accented Mandarin data from the major dialect regions, in a recognition system for Mandarin data with the accent of a target dialect region, the hidden-layer parameters of the acquired DNN model are reused (multiplexed), and the output layer that produces the probability values is relearned using Mandarin data with the target dialect accent. An acoustic model obtained in this way achieves a higher recognition rate than a model trained only on accented Mandarin data from a single dialect region or only on standard Mandarin data.
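The adaptation scheme just described, reusing the hidden-layer parameters and relearning only the output layer on target-accent data, can be sketched roughly as follows. The layer sizes, learning rate, and single-frame training loop are illustrative assumptions, not details from the patent:

```python
import numpy as np

rng = np.random.default_rng(1)
HIDDEN_DIM, NUM_UNITS = 8, 10        # tiny hypothetical sizes

def frozen_hidden(feat):
    # Stand-in for the reused (multiplexed) hidden-layer stack of the
    # base model; its parameters are not touched during adaptation.
    return np.maximum(0.0, feat)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def adapt_step(W, b, feat, target_unit, lr=0.1):
    """One SGD step that relearns only the output layer from a single
    accented frame/label pair (cross-entropy gradient)."""
    h = frozen_hidden(feat)
    p = softmax(h @ W + b)               # the "second probability"
    grad = p.copy()
    grad[target_unit] -= 1.0             # d(cross-entropy)/d(logits)
    return W - lr * np.outer(h, grad), b - lr * grad

W = rng.normal(0.0, 0.01, (HIDDEN_DIM, NUM_UNITS))
b = np.zeros(NUM_UNITS)
feat, unit = rng.normal(size=HIDDEN_DIM), 3   # one synthetic accented frame

before = softmax(frozen_hidden(feat) @ W + b)[unit]
for _ in range(50):                           # a small amount of data suffices
    W, b = adapt_step(W, b, feat, unit)
after = softmax(frozen_hidden(feat) @ W + b)[unit]
assert after > before   # output-layer probabilities shift toward the accent
```

Because only the output layer's weights move, the adapted second probability replaces the first probability without rebuilding the hidden layers, which is the point of the multiplexing scheme.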
Through the above scheme, modeling techniques based on deep neural networks greatly improve the classification capacity of the multiple hidden layers of the acquired acoustic model, thereby improving recognition accuracy. In the model-adaptation stage, the hidden-layer parameters of the acquired acoustic model are reused, so there is no need to build a separate model for the data of each dialect region, which simplifies model training. In addition, with the scheme of the embodiments of the present invention, there is no need to build a database of accented data for each individual dialect region: the probability values of the output layer can be updated by learning from a small amount of data, so the acoustic model can adapt to accented data from different target dialect regions.
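The extraction steps recited later in claims 2 and 5 (a windowed framing operation, then removal of silent frames) could be sketched as below. The frame length, hop size, Hamming window, and energy-threshold silence detector are assumptions, since the patent does not specify them:

```python
import numpy as np

def frames_from_signal(signal, frame_len=400, hop=160):
    """Windowed framing: split the signal into overlapping frames,
    each multiplied by a Hamming window (parameters are assumptions)."""
    window = np.hamming(frame_len)
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop: i * hop + frame_len] * window
                     for i in range(n)])

def drop_silent_frames(frames, db_floor=-30.0):
    """Remove silent frames with a simple energy threshold relative to the
    loudest frame (the patent does not fix the silence detector)."""
    energy = (frames ** 2).mean(axis=1)
    ref = energy.max() + 1e-12
    keep = 10.0 * np.log10(energy / ref + 1e-12) > db_floor
    return frames[keep]

# 0.1 s of silence followed by 0.1 s of a 440 Hz tone at 16 kHz.
sr = 16000
sig = np.concatenate([np.zeros(1600),
                      np.sin(2 * np.pi * 440 * np.arange(1600) / sr)])
frames = frames_from_signal(sig)
kept = drop_silent_frames(frames)   # the purely silent frames are dropped
```

The surviving frames would then feed a feature front end (e.g. filterbank or MFCC analysis) to produce the speech feature vectors the input layer receives.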
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may exist physically as separate units, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
The integrated unit implemented as a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform some of the steps of the methods described in the embodiments of the present invention. The storage medium includes media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are preferred embodiments of the present invention. It should be noted that those skilled in the art may make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (12)
- 1. A pronunciation modeling method, characterized by comprising: taking standard Mandarin data and at least one set of Mandarin data with a dialect accent as input data, and extracting speech feature vectors of the input data; training a deep neural network (DNN) acoustic model using the speech feature vectors, wherein an output layer of the acoustic model outputs a first probability; obtaining Mandarin data with a target dialect accent; and learning the output layer using the Mandarin data with the target dialect accent, and updating the first probability with a second probability output by the output layer.
- 2. The method according to claim 1, characterized in that the step of extracting the speech feature vectors of the input data comprises: performing a windowed framing operation on the input data to obtain speech frames; and removing silent frames from the speech frames to obtain the speech feature vectors.
- 3. The method according to claim 1, characterized in that the step of training the deep neural network (DNN) acoustic model using the speech feature vectors, wherein the output layer of the acoustic model outputs the first probability, comprises: inputting the speech feature vectors as input signals to an input layer of the DNN acoustic model; in the multiple hidden layers of the DNN acoustic model, processing the input signal of each hidden layer using first weights corresponding to that hidden layer, to obtain the output signal of each hidden layer; and in the output layer of the DNN acoustic model, processing the output signal of the topmost hidden layer to obtain the first probability.
- 4. The method according to claim 1, characterized in that the step of learning the output layer using the Mandarin data with the target dialect accent and updating the first probability with the second probability output by the output layer comprises: inputting the Mandarin data with the target dialect accent as input signals to the input layer of the DNN acoustic model; in the multiple hidden layers of the DNN acoustic model, processing the input signal of each hidden layer using second weights corresponding to that hidden layer, to obtain the output signal of each hidden layer; in the output layer of the DNN acoustic model, processing the output signal of the topmost hidden layer to obtain the second probability; and updating the first probability with the second probability.
- 5. The method according to claim 4, characterized in that, before the step of learning the output layer using the Mandarin data with the target dialect accent and updating the first probability with the second probability output by the output layer, the method further comprises: removing silent frames from the Mandarin data with the target dialect accent; and the step of inputting the Mandarin data with the target dialect accent as input signals to the input layer of the DNN acoustic model comprises: inputting the Mandarin data with the target dialect accent, from which the silent frames have been removed, as input signals to the input layer of the DNN acoustic model.
- 6. The method according to any one of claims 1-5, characterized in that the method further comprises: obtaining accented Mandarin data to be recognized; and recognizing the accented Mandarin data to be recognized according to the second probability.
- 7. A pronunciation modeling device, characterized by comprising: an extraction module, configured to take standard Mandarin data and at least one set of Mandarin data with a dialect accent as input data, and to extract speech feature vectors of the input data; a training module, configured to train a deep neural network (DNN) acoustic model using the speech feature vectors, wherein an output layer of the acoustic model outputs a first probability; an acquisition module, configured to obtain Mandarin data with a target dialect accent; and a modeling module, configured to learn the output layer using the Mandarin data with the target dialect accent, and to update the first probability with a second probability output by the output layer.
- 8. The device according to claim 7, characterized in that the extraction module comprises: a first acquisition submodule, configured to perform a windowed framing operation on the input data to obtain speech frames; and a second acquisition submodule, configured to remove silent frames from the speech frames to obtain the speech feature vectors.
- 9. The device according to claim 7, characterized in that the training module comprises: a first input layer submodule, configured to input the speech feature vectors as input signals to an input layer of the DNN acoustic model; a first hidden layer submodule, configured to process, in the multiple hidden layers of the DNN acoustic model, the input signal of each hidden layer using first weights corresponding to that hidden layer, to obtain the output signal of each hidden layer; and a first output layer submodule, configured to process, in the output layer of the DNN acoustic model, the output signal of the topmost hidden layer to obtain the first probability.
- 10. The device according to claim 7, characterized in that the modeling module comprises: a second input layer submodule, configured to input the Mandarin data with the target dialect accent as input signals to the input layer of the DNN acoustic model; a second hidden layer submodule, configured to process, in the multiple hidden layers of the DNN acoustic model, the input signal of each hidden layer using second weights corresponding to that hidden layer, to obtain the output signal of each hidden layer; a second output layer submodule, configured to process, in the output layer of the DNN acoustic model, the output signal of the topmost hidden layer to obtain the second probability; and an updating submodule, configured to update the first probability with the second probability.
- 11. The device according to claim 10, characterized in that the device further comprises: a processing module, configured to remove silent frames from the Mandarin data with the target dialect accent; wherein the second input layer submodule is specifically configured to input the Mandarin data with the target dialect accent, from which the silent frames have been removed, as input signals to the input layer of the DNN acoustic model.
- 12. The device according to any one of claims 7-11, characterized in that the device further comprises: a receiving module, configured to obtain accented Mandarin data to be recognized; and an identification module, configured to recognize the accented Mandarin data to be recognized according to the second probability.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611103738.6A CN108172218B (en) | 2016-12-05 | 2016-12-05 | Voice modeling method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611103738.6A CN108172218B (en) | 2016-12-05 | 2016-12-05 | Voice modeling method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108172218A true CN108172218A (en) | 2018-06-15 |
CN108172218B CN108172218B (en) | 2021-01-12 |
Family
ID=62525918
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611103738.6A Active CN108172218B (en) | 2016-12-05 | 2016-12-05 | Voice modeling method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108172218B (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109887497A (en) * | 2019-04-12 | 2019-06-14 | 北京百度网讯科技有限公司 | Modeling method, device and the equipment of speech recognition |
CN110033760A (en) * | 2019-04-15 | 2019-07-19 | 北京百度网讯科技有限公司 | Modeling method, device and the equipment of speech recognition |
CN110211565A (en) * | 2019-05-06 | 2019-09-06 | 平安科技(深圳)有限公司 | Accent recognition method, apparatus and computer readable storage medium |
CN110738991A (en) * | 2019-10-11 | 2020-01-31 | 东南大学 | Speech recognition equipment based on flexible wearable sensor |
CN110930995A (en) * | 2019-11-26 | 2020-03-27 | 中国南方电网有限责任公司 | Voice recognition model applied to power industry |
CN111179938A (en) * | 2019-12-26 | 2020-05-19 | 安徽仁昊智能科技有限公司 | Speech recognition garbage classification system based on artificial intelligence |
CN111243574A (en) * | 2020-01-13 | 2020-06-05 | 苏州奇梦者网络科技有限公司 | Voice model adaptive training method, system, device and storage medium |
CN112528679A (en) * | 2020-12-17 | 2021-03-19 | 科大讯飞股份有限公司 | Intention understanding model training method and device and intention understanding method and device |
CN112770154A (en) * | 2021-01-19 | 2021-05-07 | 深圳西米通信有限公司 | Intelligent set top box with voice interaction function and interaction method thereof |
CN112967720A (en) * | 2021-01-29 | 2021-06-15 | 南京迪港科技有限责任公司 | End-to-end voice-to-text model optimization method under small amount of accent data |
WO2021135438A1 (en) * | 2020-07-31 | 2021-07-08 | 平安科技(深圳)有限公司 | Multilingual speech recognition model training method, apparatus, device, and storage medium |
CN113192492A (en) * | 2021-04-28 | 2021-07-30 | 平安科技(深圳)有限公司 | Speech recognition method, speech recognition device, computer equipment and storage medium |
CN113223542A (en) * | 2021-04-26 | 2021-08-06 | 北京搜狗科技发展有限公司 | Audio conversion method and device, storage medium and electronic equipment |
CN113345451A (en) * | 2021-04-26 | 2021-09-03 | 北京搜狗科技发展有限公司 | Sound changing method and device and electronic equipment |
WO2021213161A1 (en) * | 2020-11-25 | 2021-10-28 | 平安科技(深圳)有限公司 | Dialect speech recognition method, apparatus, medium, and electronic device |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103310788A (en) * | 2013-05-23 | 2013-09-18 | 北京云知声信息技术有限公司 | Voice information identification method and system |
CN104282300A (en) * | 2013-07-05 | 2015-01-14 | 中国移动通信集团公司 | Non-periodic component syllable model building and speech synthesizing method and device |
CN104575490A (en) * | 2014-12-30 | 2015-04-29 | 苏州驰声信息科技有限公司 | Spoken language pronunciation detecting and evaluating method based on deep neural network posterior probability algorithm |
EP2889804A1 (en) * | 2013-12-30 | 2015-07-01 | Alcatel Lucent | Systems and methods for contactless speech recognition |
CN105118498A (en) * | 2015-09-06 | 2015-12-02 | 百度在线网络技术(北京)有限公司 | Training method and apparatus of speech synthesis model |
CN105206258A (en) * | 2015-10-19 | 2015-12-30 | 百度在线网络技术(北京)有限公司 | Generation method and device of acoustic model as well as voice synthetic method and device |
CN105391873A (en) * | 2015-11-25 | 2016-03-09 | 上海新储集成电路有限公司 | Method for realizing local voice recognition in mobile device |
CN105578115A (en) * | 2015-12-22 | 2016-05-11 | 深圳市鹰硕音频科技有限公司 | Network teaching method and system with voice assessment function |
CN105632501A (en) * | 2015-12-30 | 2016-06-01 | 中国科学院自动化研究所 | Deep-learning-technology-based automatic accent classification method and apparatus |
US20160239476A1 (en) * | 2015-02-13 | 2016-08-18 | Facebook, Inc. | Machine learning dialect identification |
- 2016-12-05: CN application CN201611103738.6A filed; patent CN108172218B (en), status Active
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103310788A (en) * | 2013-05-23 | 2013-09-18 | 北京云知声信息技术有限公司 | Voice information identification method and system |
CN104282300A (en) * | 2013-07-05 | 2015-01-14 | 中国移动通信集团公司 | Non-periodic component syllable model building and speech synthesizing method and device |
EP2889804A1 (en) * | 2013-12-30 | 2015-07-01 | Alcatel Lucent | Systems and methods for contactless speech recognition |
CN104575490A (en) * | 2014-12-30 | 2015-04-29 | 苏州驰声信息科技有限公司 | Spoken language pronunciation detecting and evaluating method based on deep neural network posterior probability algorithm |
US20160239476A1 (en) * | 2015-02-13 | 2016-08-18 | Facebook, Inc. | Machine learning dialect identification |
CN105118498A (en) * | 2015-09-06 | 2015-12-02 | 百度在线网络技术(北京)有限公司 | Training method and apparatus of speech synthesis model |
CN105206258A (en) * | 2015-10-19 | 2015-12-30 | 百度在线网络技术(北京)有限公司 | Generation method and device of acoustic model as well as voice synthetic method and device |
CN105391873A (en) * | 2015-11-25 | 2016-03-09 | 上海新储集成电路有限公司 | Method for realizing local voice recognition in mobile device |
CN105578115A (en) * | 2015-12-22 | 2016-05-11 | 深圳市鹰硕音频科技有限公司 | Network teaching method and system with voice assessment function |
CN105632501A (en) * | 2015-12-30 | 2016-06-01 | 中国科学院自动化研究所 | Deep-learning-technology-based automatic accent classification method and apparatus |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109887497B (en) * | 2019-04-12 | 2021-01-29 | 北京百度网讯科技有限公司 | Modeling method, device and equipment for speech recognition |
CN109887497A (en) * | 2019-04-12 | 2019-06-14 | 北京百度网讯科技有限公司 | Modeling method, device and the equipment of speech recognition |
CN110033760A (en) * | 2019-04-15 | 2019-07-19 | 北京百度网讯科技有限公司 | Modeling method, device and the equipment of speech recognition |
US11688391B2 (en) | 2019-04-15 | 2023-06-27 | Beijing Baidu Netcom Science And Technology Co. | Mandarin and dialect mixed modeling and speech recognition |
CN110033760B (en) * | 2019-04-15 | 2021-01-29 | 北京百度网讯科技有限公司 | Modeling method, device and equipment for speech recognition |
CN110211565A (en) * | 2019-05-06 | 2019-09-06 | 平安科技(深圳)有限公司 | Accent recognition method, apparatus and computer readable storage medium |
CN110738991A (en) * | 2019-10-11 | 2020-01-31 | 东南大学 | Speech recognition equipment based on flexible wearable sensor |
CN110930995B (en) * | 2019-11-26 | 2022-02-11 | 中国南方电网有限责任公司 | Voice recognition model applied to power industry |
CN110930995A (en) * | 2019-11-26 | 2020-03-27 | 中国南方电网有限责任公司 | Voice recognition model applied to power industry |
CN111179938A (en) * | 2019-12-26 | 2020-05-19 | 安徽仁昊智能科技有限公司 | Speech recognition garbage classification system based on artificial intelligence |
CN111243574B (en) * | 2020-01-13 | 2023-01-03 | 苏州奇梦者网络科技有限公司 | Voice model adaptive training method, system, device and storage medium |
CN111243574A (en) * | 2020-01-13 | 2020-06-05 | 苏州奇梦者网络科技有限公司 | Voice model adaptive training method, system, device and storage medium |
WO2021135438A1 (en) * | 2020-07-31 | 2021-07-08 | 平安科技(深圳)有限公司 | Multilingual speech recognition model training method, apparatus, device, and storage medium |
WO2021213161A1 (en) * | 2020-11-25 | 2021-10-28 | 平安科技(深圳)有限公司 | Dialect speech recognition method, apparatus, medium, and electronic device |
CN112528679A (en) * | 2020-12-17 | 2021-03-19 | 科大讯飞股份有限公司 | Intention understanding model training method and device and intention understanding method and device |
CN112528679B (en) * | 2020-12-17 | 2024-02-13 | 科大讯飞股份有限公司 | Method and device for training intention understanding model, and method and device for intention understanding |
CN112770154A (en) * | 2021-01-19 | 2021-05-07 | 深圳西米通信有限公司 | Intelligent set top box with voice interaction function and interaction method thereof |
CN112967720A (en) * | 2021-01-29 | 2021-06-15 | 南京迪港科技有限责任公司 | End-to-end voice-to-text model optimization method under small amount of accent data |
CN113223542A (en) * | 2021-04-26 | 2021-08-06 | 北京搜狗科技发展有限公司 | Audio conversion method and device, storage medium and electronic equipment |
CN113345451A (en) * | 2021-04-26 | 2021-09-03 | 北京搜狗科技发展有限公司 | Sound changing method and device and electronic equipment |
CN113345451B (en) * | 2021-04-26 | 2023-08-22 | 北京搜狗科技发展有限公司 | Sound changing method and device and electronic equipment |
CN113223542B (en) * | 2021-04-26 | 2024-04-12 | 北京搜狗科技发展有限公司 | Audio conversion method and device, storage medium and electronic equipment |
CN113192492A (en) * | 2021-04-28 | 2021-07-30 | 平安科技(深圳)有限公司 | Speech recognition method, speech recognition device, computer equipment and storage medium |
CN113192492B (en) * | 2021-04-28 | 2024-05-28 | 平安科技(深圳)有限公司 | Speech recognition method, device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN108172218B (en) | 2021-01-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108172218A (en) | A kind of pronunciation modeling method and device | |
CN109326302B (en) | Voice enhancement method based on voiceprint comparison and generation of confrontation network | |
CN107545903B (en) | Voice conversion method based on deep learning | |
Abdel-Hamid et al. | Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code | |
CN103928023B (en) | A kind of speech assessment method and system | |
CN105632501A (en) | Deep-learning-technology-based automatic accent classification method and apparatus | |
CN107039036B (en) | High-quality speaker recognition method based on automatic coding depth confidence network | |
CN111210807B (en) | Speech recognition model training method, system, mobile terminal and storage medium | |
US20080208577A1 (en) | Multi-stage speech recognition apparatus and method | |
Xie et al. | Sequence error (SE) minimization training of neural network for voice conversion. | |
WO2007114605A1 (en) | Acoustic model adaptation methods based on pronunciation variability analysis for enhancing the recognition of voice of non-native speaker and apparatuses thereof | |
CN107731233A (en) | A kind of method for recognizing sound-groove based on RNN | |
CN109147774B (en) | Improved time-delay neural network acoustic model | |
CN102938252B (en) | System and method for recognizing Chinese tone based on rhythm and phonetics features | |
CN106898355B (en) | Speaker identification method based on secondary modeling | |
CN111091809B (en) | Regional accent recognition method and device based on depth feature fusion | |
CN109637526A (en) | The adaptive approach of DNN acoustic model based on personal identification feature | |
CN105575383A (en) | Apparatus and method for controlling target information voice output through using voice characteristics of user | |
CN110931045A (en) | Audio feature generation method based on convolutional neural network | |
CN106297769B (en) | A kind of distinctive feature extracting method applied to languages identification | |
Salam et al. | Malay isolated speech recognition using neural network: a work in finding number of hidden nodes and learning parameters. | |
CN109377986A (en) | A kind of non-parallel corpus voice personalization conversion method | |
CN108831486B (en) | Speaker recognition method based on DNN and GMM models | |
Yang et al. | Essence knowledge distillation for speech recognition | |
CN108182938A (en) | A kind of training method of the Mongol acoustic model based on DNN |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||