CN108172218A - Pronunciation modeling method and device - Google Patents


Info

Publication number: CN108172218A
Application number: CN201611103738.6A
Authority: CN (China)
Prior art keywords: layer, input, data, probability, output
Legal status: Granted
Other languages: Chinese (zh)
Other versions: CN108172218B (en)
Inventor: 徐衍瀚
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Communications Co Ltd
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Communications Co Ltd
Application filed by China Mobile Communications Group Co Ltd and China Mobile Communications Co Ltd
Priority to CN201611103738.6A (patent CN108172218B)
Publication of CN108172218A
Application granted
Publication of CN108172218B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a pronunciation modeling method and device in the field of speech recognition, intended to reduce the complexity of speech model building. The pronunciation modeling method of the present invention includes: taking standard Mandarin data and at least one set of Mandarin data with a dialectal accent as input data, and extracting speech feature vectors of the input data; training a deep neural network (DNN) acoustic model with the speech feature vectors, where the output layer of the acoustic model outputs a first probability; obtaining Mandarin data with a target dialectal accent; and retraining the output layer with the Mandarin data with the target dialectal accent, and updating the first probability with a second probability output by the output layer. The present invention reduces the complexity of speech model building.

Description

Pronunciation modeling method and device
Technical field
The present invention relates to the field of speech recognition, and in particular to a pronunciation modeling method and device.
Background art
Speech recognition enables a machine to understand human speech by converting a voice signal into input a computer can process. Current speech recognition technology rests mainly on statistical pattern recognition and artificial neural networks.
The hidden Markov model (HMM) is the most mature model in speech recognition and related speech technologies; modeling the temporal structure of speech statistically with hidden Markov models has yielded good results.
Speech recognition systems based on deep neural networks (DNNs) have attracted growing attention from researchers in recent years. The concept of deep learning grew out of research on artificial neural networks and was proposed by Hinton et al. in 2006. Its essence is to build machine learning models with many hidden layers and train them on massive amounts of data, so as to learn more useful features and ultimately improve the accuracy of classification or prediction. Two viewpoints are central: (1) an artificial neural network with many hidden layers has excellent feature learning ability, and the features it learns describe the data more essentially, which benefits classification; (2) the difficulty of training deep neural networks can be effectively overcome by layer-wise initialization, which is realized through unsupervised learning.
To improve recognition accuracy for Mandarin spoken with a dialectal accent, the prior art provides a variety of methods. Some improve the training procedure used in acoustic modeling; others improve the language model of the recognition system. However, in existing methods for recognizing Mandarin with a dialectal accent, the complexity of training the model is high.
Summary of the invention
In view of this, the present invention provides a pronunciation modeling method and device to reduce the complexity of speech model building.
To solve the above technical problem, the present invention provides a pronunciation modeling method, including:
taking standard Mandarin data and at least one set of Mandarin data with a dialectal accent as input data, and extracting speech feature vectors of the input data;
training a deep neural network (DNN) acoustic model with the speech feature vectors, where the output layer of the acoustic model outputs a first probability;
obtaining Mandarin data with a target dialectal accent; and
retraining the output layer with the Mandarin data with the target dialectal accent, and updating the first probability with a second probability output by the output layer.
The step of extracting the speech feature vectors of the input data includes:
windowing and framing the input data to obtain speech frames; and
removing silent frames from the speech frames to obtain the speech feature vectors.
The step of training the DNN acoustic model with the speech feature vectors, where the output layer of the acoustic model outputs the first probability, includes:
feeding the speech feature vectors as an input signal into the input layer of the DNN acoustic model;
in each of the multiple hidden layers of the DNN acoustic model, processing that hidden layer's input signal with the layer's first weights to obtain the layer's output signal; and
in the output layer of the DNN acoustic model, processing the output signal of the topmost hidden layer to obtain the first probability.
The step of retraining the output layer with the Mandarin data with the target dialectal accent and updating the first probability with the second probability output by the output layer includes:
feeding the Mandarin data with the target dialectal accent as an input signal into the input layer of the DNN acoustic model;
in each of the multiple hidden layers of the DNN acoustic model, processing that hidden layer's input signal with the layer's second weights to obtain the layer's output signal;
in the output layer of the DNN acoustic model, processing the output signal of the topmost hidden layer to obtain the second probability; and
updating the first probability with the second probability.
Before the step of retraining the output layer with the Mandarin data with the target dialectal accent and updating the first probability with the second probability output by the output layer, the method further includes:
removing silent frames from the Mandarin data with the target dialectal accent.
In this case, the step of feeding the Mandarin data with the target dialectal accent as an input signal into the input layer of the DNN acoustic model includes:
feeding the Mandarin data with the target dialectal accent, with the silent frames removed, as an input signal into the input layer of the DNN acoustic model.
The method further includes:
obtaining accented Mandarin data to be recognized; and
recognizing the accented Mandarin data to be recognized according to the second probability.
In a second aspect, the present invention provides a pronunciation modeling device, including:
an extraction module, configured to take standard Mandarin data and at least one set of Mandarin data with a dialectal accent as input data, and to extract speech feature vectors of the input data;
a training module, configured to train a deep neural network (DNN) acoustic model with the speech feature vectors, where the output layer of the acoustic model outputs a first probability;
an acquisition module, configured to obtain Mandarin data with a target dialectal accent; and
a modeling module, configured to retrain the output layer with the Mandarin data with the target dialectal accent, and to update the first probability with a second probability output by the output layer.
The extraction module includes:
a first acquisition submodule, configured to window and frame the input data to obtain speech frames; and
a second acquisition submodule, configured to remove silent frames from the speech frames to obtain the speech feature vectors.
The training module includes:
a first input layer submodule, configured to feed the speech feature vectors as an input signal into the input layer of the DNN acoustic model;
a first hidden layer submodule, configured to process, in each of the multiple hidden layers of the DNN acoustic model, that hidden layer's input signal with the layer's first weights to obtain the layer's output signal; and
a first output layer submodule, configured to process, in the output layer of the DNN acoustic model, the output signal of the topmost hidden layer to obtain the first probability.
The modeling module includes:
a second input layer submodule, configured to feed the Mandarin data with the target dialectal accent as an input signal into the input layer of the DNN acoustic model;
a second hidden layer submodule, configured to process, in each of the multiple hidden layers of the DNN acoustic model, that hidden layer's input signal with the layer's second weights to obtain the layer's output signal;
a second output layer submodule, configured to process, in the output layer of the DNN acoustic model, the output signal of the topmost hidden layer to obtain the second probability; and
an update submodule, configured to update the first probability with the second probability.
The device further includes:
a processing module, configured to remove silent frames from the Mandarin data with the target dialectal accent.
In this case, the second input layer submodule is specifically configured to feed the Mandarin data with the target dialectal accent, with the silent frames removed, as an input signal into the input layer of the DNN acoustic model.
The device further includes:
a receiving module, configured to obtain accented Mandarin data to be recognized; and
a recognition module, configured to recognize the accented Mandarin data to be recognized according to the second probability.
The above technical solutions of the present invention have the following beneficial effects:
In the embodiments of the present invention, an acoustic model is trained with deep neural network techniques on the basis of standard Mandarin data and at least one set of Mandarin data with a dialectal accent, yielding a first probability. For Mandarin data with a target dialectal accent, the output layer of the acoustic model is retrained, and the first probability is updated with the second probability output by the output layer. Thus, with the scheme of the embodiments of the present invention, the parameters of the trained acoustic model's hidden layers are reused during adaptation with the Mandarin data of the target dialectal accent, and no separate model needs to be built for the dialect-accented data of each dialect area. This simplifies model training and thereby reduces the complexity of speech model building.
Description of the drawings
Fig. 1 is a flow chart of the pronunciation modeling method of Embodiment 1 of the present invention;
Fig. 2 is a structural diagram of the pronunciation modeling device of Embodiment 2 of the present invention;
Fig. 3 is a schematic diagram of the pronunciation modeling device of Embodiment 2 of the present invention;
Fig. 4 is a schematic diagram of the automatic speech recognition system of Embodiment 3 of the present invention.
Detailed description
Specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples are intended to illustrate the present invention, not to limit its scope.
Embodiment 1
As shown in Fig. 1, the pronunciation modeling method of Embodiment 1 of the present invention includes:
Step 101: take standard Mandarin data and at least one set of Mandarin data with a dialectal accent as input data, and extract speech feature vectors of the input data.
Chinese as spoken mainly comprises the officially announced standard Mandarin and Mandarin with the dialectal accents of various regions. Chinese dialects can be broadly grouped by region into eight major dialect areas. Standard Mandarin is a single language, but its pronunciation is influenced by the dialectal accent of each area; compared with standard Mandarin, accented pronunciation exhibits sound changes on certain words. As a result, an acoustic model trained only on standard Mandarin data cannot correctly describe the acoustic features of these sound changes. Moreover, in engineering practice it is difficult to collect enough Mandarin data with the accent of each specific dialect and to build a database of sufficient size.
Therefore, in the embodiments of the present invention, standard Mandarin data and dialect-accented Mandarin data from at least one dialect area are selected together as input data, acoustic feature vectors are extracted from them jointly, and a DNN model with multiple hidden layers is trained. Preferably, the input data comprise standard Mandarin data and Mandarin data with the accents of all eight major dialect areas.
To make the subsequently built acoustic model more accurate, the input data are windowed and framed to obtain speech frames. The short-time energy of each speech frame is then computed, and silent frames are removed according to it: the short-time energy of each speech frame is compared with a predetermined threshold, and a frame whose short-time energy is below the threshold can be treated as a silent frame. Removing the silent frames from the speech frames yields the speech feature vectors. The threshold can be set arbitrarily.
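As a concrete illustration of the framing and silence-removal procedure above, the following sketch drops frames whose short-time energy falls below the threshold. The frame length, hop size, window choice, and energy threshold are all assumptions; the patent leaves them unspecified.

```python
import numpy as np

def remove_silent_frames(signal, frame_len=400, hop=160, threshold=0.01):
    """Window/frame a 1-D signal and drop frames whose short-time
    energy falls below a fixed threshold (treated as silence)."""
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        # Short-time energy: sum of squared windowed samples.
        if np.sum(frame ** 2) >= threshold:
            frames.append(frame)
    return np.array(frames)

# A burst of "speech" (noise) surrounded by near-silence.
rng = np.random.default_rng(0)
sig = np.concatenate([np.zeros(1600), rng.normal(0, 0.3, 1600), np.zeros(1600)])
kept = remove_silent_frames(sig)
print(kept.shape[0])  # number of frames kept after silence removal
```

Only the frames overlapping the noisy middle section survive; the purely silent frames at both ends are discarded before feature extraction.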
The speech feature vectors may also be context-dependent, i.e., configured to cover the features of multiple frames. The features may be, for example, Mel-frequency cepstral coefficients (MFCC), perceptual linear predictive (PLP) features, and the like.
Step 102: train a DNN acoustic model with the speech feature vectors, where the output layer of the acoustic model outputs a first probability.
In practical applications, the DNN acoustic model includes:
an input layer, for receiving the speech feature vectors;
multiple hidden layers (at least three). Each hidden layer contains multiple nodes (neurons), and each node in a hidden layer is configured to perform a linear or nonlinear transformation on the output of at least one node of the adjacent lower layer in the DNN; the input of a node in an upper hidden layer can be based on the output of one or several nodes in the adjacent lower layer. Each hidden layer has corresponding weights, obtained from the acoustic signals of the training data. When the model is trained, the initial weights of each hidden layer can be obtained by pre-training the model with a supervised or unsupervised learning process; the weights of each hidden layer can then be fine-tuned with the back-propagation (BP) algorithm;
an output layer, for receiving the output signal of the topmost hidden layer. The nodes of the output layer process the received signal according to modeling units composed of Mandarin pronunciation phonemes, and the output is a probability distribution over the modeling units, referred to here simply as a probability.
The output units of the output layer are modeling units representing the phonetic elements used in standard Mandarin. The modeling units may be phones (tied triphone states), and may be hidden Markov model (HMM) units or other suitable modeling units.
Specifically, in this step, the speech feature vectors are fed as an input signal into the input layer of the DNN acoustic model; in each of the multiple hidden layers of the DNN acoustic model, that hidden layer's input signal is processed with the layer's first weights to obtain the layer's output signal; and in the output layer of the DNN acoustic model, the output signal of the topmost hidden layer is processed to obtain the first probability.
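Read in the usual way (each hidden layer applies an affine transform followed by a nonlinearity, and the output layer normalizes its activations into a distribution over the modeling units), the forward pass of this step can be sketched as follows. The layer sizes, the sigmoid nonlinearity, and the softmax output are illustrative assumptions, not details fixed by the patent.

```python
import numpy as np

def forward(x, hidden_weights, output_weights):
    """Propagate a feature vector through the hidden layers
    (the "first weights") and the softmax output layer of a DNN."""
    h = x
    for W, b in hidden_weights:
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))  # sigmoid hidden layer
    W_out, b_out = output_weights
    z = W_out @ h + b_out
    z -= z.max()                       # numerical stability
    p = np.exp(z) / np.exp(z).sum()    # softmax: the "first probability"
    return p

rng = np.random.default_rng(1)
dims = [39, 64, 64, 64, 12]  # 39-dim features, 3 hidden layers, 12 modeling units
hidden = [(rng.normal(0, 0.1, (dims[i + 1], dims[i])), np.zeros(dims[i + 1]))
          for i in range(3)]
out_w = (rng.normal(0, 0.1, (dims[4], dims[3])), np.zeros(dims[4]))
p = forward(rng.normal(0, 1, 39), hidden, out_w)
print(p.sum())  # the output sums to 1: a distribution over the modeling units
```

The returned vector plays the role of the first probability: one posterior value per modeling unit for the given feature vector.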
Step 103: obtain Mandarin data with a target dialectal accent.
The Mandarin data with the target dialectal accent may be any Mandarin data with a dialectal accent.
Step 104: retrain the output layer with the Mandarin data with the target dialectal accent, and update the first probability with the second probability output by the output layer.
In the embodiments of the present invention, the process of steps 103 and 104 may be called adaptively adjusting the DNN acoustic model of step 102 with the Mandarin data with the target dialectal accent. In the model adaptation stage, the output layer is retrained with the Mandarin data with the target dialectal accent, and the newly learned probability values of the output layer directly replace the probability values that the output layer produced in the acoustic model trained in step 102 on standard Mandarin data and the various dialect-accented data.
Specifically, the Mandarin data with the target dialectal accent are fed as an input signal into the input layer of the DNN acoustic model; in each of the multiple hidden layers of the DNN acoustic model, that hidden layer's input signal is processed with the layer's second weights to obtain the layer's output signal; in the output layer of the DNN acoustic model, the output signal of the topmost hidden layer is processed to obtain the second probability; and the first probability is updated with the second probability.
It should be noted that if the training data in step 104 are relatively scarce, the same weights as the corresponding hidden layers of step 102 can be used during model adaptation. In this way, the scheme does not require a large amount of data: the acoustic model already obtained can adapt itself to the Mandarin data with the target dialectal accent, which improves the recognition accuracy on such data. If the training data in step 104 are relatively abundant, the weights of the hidden layers, and of the topmost hidden layer in particular, can also be readjusted with the Mandarin data with the target dialectal accent, and the output probability of the output layer updated, likewise improving the recognition rate of the model.
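The low-data adaptation case described above (hidden layer weights reused unchanged, only the output layer relearned on target-accent data) can be sketched as follows. The softmax output, cross-entropy criterion, gradient-descent update, and all sizes are assumptions, since the patent does not name a training criterion.

```python
import numpy as np

def adapt_output_layer(hidden_out, labels, W, b, lr=0.5, epochs=200):
    """Retrain only the output layer on target-accent data.
    hidden_out: topmost-hidden-layer activations (frozen, reused),
    labels: target modeling-unit indices for each frame."""
    n = len(labels)
    for _ in range(epochs):
        z = hidden_out @ W.T + b
        z -= z.max(axis=1, keepdims=True)
        p = np.exp(z); p /= p.sum(axis=1, keepdims=True)   # softmax
        grad = p.copy()
        grad[np.arange(n), labels] -= 1.0                  # dL/dz, cross-entropy
        W -= lr * grad.T @ hidden_out / n                  # output weights only
        b -= lr * grad.mean(axis=0)                        # hidden layers untouched
    return W, b

# Synthetic stand-in for frozen topmost-hidden activations of target-accent data:
# three well-separated clusters, one per modeling unit.
rng = np.random.default_rng(2)
means = rng.normal(0, 3, (3, 8))
y = rng.integers(0, 3, 40)
H = means[y] + rng.normal(0, 0.5, (40, 8))
W, b = np.zeros((3, 8)), np.zeros(3)
W, b = adapt_output_layer(H, y, W, b)
p = np.exp(H @ W.T + b); p /= p.sum(axis=1, keepdims=True)
acc = (p.argmax(axis=1) == y).mean()
print(acc)  # training accuracy after adapting only the output layer
```

Because only `W` and `b` of the output layer change, the hidden layer parameters trained in step 102 are shared across all target dialect areas, which is the source of the complexity reduction the patent claims.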
Training the DNN acoustic model and adaptively adjusting it through steps 101 to 104 above completes the building of the DNN acoustic model.
As can be seen from the above, with the scheme of this embodiment of the present invention, the parameters of the trained acoustic model's hidden layers are reused during adaptation with the Mandarin data of the target dialectal accent, and no separate model needs to be built for the dialect-accented data of each dialect area. This simplifies model training and thereby reduces the complexity of speech model building.
On the basis of Embodiment 1, after step 103 obtains the Mandarin data with the target dialectal accent, the silent frames in these data can also be removed to improve recognition accuracy. Specifically, the Mandarin data with the target dialectal accent are windowed and framed to obtain speech frames. The short-time energy of each speech frame is then computed, and silent frames are removed according to it: the short-time energy of each speech frame is compared with a predetermined threshold, and a frame whose short-time energy is below the threshold can be treated as a silent frame and removed from the speech frames. The threshold can be set arbitrarily.
After the model is trained and adaptively adjusted as above, speech can be recognized with the adapted model. At this point, accented Mandarin data to be recognized are obtained, and the accented Mandarin data to be recognized are recognized according to the second probability.
Specifically, the accented Mandarin data to be recognized are fed into the acoustic model obtained through steps 101 to 104 above, yielding a third probability at the output. The third probability is matched against the second probability, and the words and so on in the accented Mandarin data to be recognized are identified according to the degree of matching.
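The patent leaves the matching step abstract. One plausible reading (an assumption, not the patent's stated method) is to compare the observed output distribution, the third probability, against stored reference distributions, the second probability, and select the best-scoring modeling unit:

```python
import numpy as np

def best_match(third_prob, second_probs):
    """Match an observed output distribution against reference
    distributions and return the index of the closest one
    (cosine similarity here; the patent only says 'degree of matching')."""
    scores = [np.dot(third_prob, q) / (np.linalg.norm(third_prob) * np.linalg.norm(q))
              for q in second_probs]
    return int(np.argmax(scores))

refs = [np.array([0.8, 0.1, 0.1]),   # reference distributions ("second probability")
        np.array([0.1, 0.8, 0.1]),
        np.array([0.1, 0.1, 0.8])]
obs = np.array([0.15, 0.7, 0.15])    # observed distribution ("third probability")
print(best_match(obs, refs))  # → 1: index of the closest reference distribution
```

A real decoder would combine such per-frame scores with a language model over word sequences; this sketch shows only the distribution-matching idea.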
With the above scheme, the modeling techniques of deep neural networks give the obtained acoustic model a greatly improved classification capability in its multiple hidden layers, which improves recognition accuracy. In the model adaptation stage, the hidden layer parameters of the already-obtained acoustic model are reused, so no separate model needs to be built for the dialect-accented data of each dialect area, which simplifies model training. In addition, with the scheme of the embodiments of the present invention, there is no need to build a database of dialect-accented data for each individual dialect area: the probability values of the output layer are relearned and updated with a small amount of data, and the acoustic model can adapt itself to dialect-accented data from different target dialect areas.
Embodiment 2
As shown in Fig. 2, the pronunciation modeling device of Embodiment 2 of the present invention includes:
an extraction module 201, configured to take standard Mandarin data and at least one set of Mandarin data with a dialectal accent as input data, and to extract speech feature vectors of the input data; a training module 202, configured to train a deep neural network (DNN) acoustic model with the speech feature vectors, where the output layer of the acoustic model outputs a first probability; an acquisition module 203, configured to obtain Mandarin data with a target dialectal accent; and a modeling module 204, configured to retrain the output layer with the Mandarin data with the target dialectal accent and to update the first probability with a second probability output by the output layer.
The extraction module 201 includes: a first acquisition submodule, configured to window and frame the input data to obtain speech frames; and a second acquisition submodule, configured to remove silent frames from the speech frames to obtain the speech feature vectors.
The training module 202 includes: a first input layer submodule, configured to feed the speech feature vectors as an input signal into the input layer of the DNN acoustic model; a first hidden layer submodule, configured to process, in each of the multiple hidden layers of the DNN acoustic model, that hidden layer's input signal with the layer's first weights to obtain the layer's output signal; and a first output layer submodule, configured to process, in the output layer of the DNN acoustic model, the output signal of the topmost hidden layer to obtain the first probability.
The modeling module 204 includes: a second input layer submodule, configured to feed the Mandarin data with the target dialectal accent as an input signal into the input layer of the DNN acoustic model; a second hidden layer submodule, configured to process, in each of the multiple hidden layers of the DNN acoustic model, that hidden layer's input signal with the layer's second weights to obtain the layer's output signal; a second output layer submodule, configured to process, in the output layer of the DNN acoustic model, the output signal of the topmost hidden layer to obtain the second probability; and an update submodule, configured to update the first probability with the second probability.
As shown in Fig. 3, the device further includes a processing module 205, configured to remove silent frames from the Mandarin data with the target dialectal accent. In this case, the second input layer submodule is specifically configured to feed the Mandarin data with the target dialectal accent, with the silent frames removed, as an input signal into the input layer of the DNN acoustic model.
Again as shown in Fig. 3, the device further includes a receiving module 206, configured to obtain accented Mandarin data to be recognized, and a recognition module 207, configured to recognize the accented Mandarin data to be recognized according to the second probability.
For the operating principle of the device of the present invention, reference may be made to the description of the foregoing method embodiment.
As can be seen from the above, with the scheme of this embodiment of the present invention, the parameters of the trained acoustic model's hidden layers are reused during adaptation with the Mandarin data of the target dialectal accent, and no separate model needs to be built for the dialect-accented data of each dialect area. This simplifies model training and thereby reduces the complexity of speech model building.
Embodiment 3
As shown in figure 4, the automatic speech recognition system for the embodiment of the present invention three.The system includes:Extract device assembly 401st, training device assembly 402, decoder component 403 etc..
Wherein, device assembly is extracted, for extracting the speech feature vector of input signal.The process of training DNN acoustic models In, selection criteria mandarin data and the mandarin data merging data of major localism area band area's dialectal accent are believed as input Number;And acoustic model it is adaptive when, mandarin data of the selection target localism area with dialectal accent are as input signal.
Training device assembly (DNN), for training DNN acoustic models and adaptively being adjusted to acquired acoustic model It is whole.Including:
Input layer, for receiving the speech feature vector of extraction device assembly.
Multiple hidden layers (at least three).Wherein, each hidden layer includes corresponding multiple nodes (neuron), each to hide Each node in layer is configured to, and the output of at least one node of the adjacent lower in the DNN is performed linear Or nonlinear transformation.Wherein, the input of the node of upper strata hidden layer can be based on a node in adjacent lower or several sections The output of point.Each hidden layer has weights corresponding thereto, wherein the weights are the acoustic signals based on training data It obtains.It, can be by using being subjected to supervision or unsupervised learning process carries out the pre- of model when being trained to model Training, obtains the initial weight of each hidden layer.It, can be by using back-propagation to fine-tuning for the weights of each hidden layer Algorithm carries out.
Output layer, for receiving the output of the most upper hidden layer in the DNN.The node of output layer is utilized by general The modeling unit of call pronunciation phonemes composition handles the signal received, and output is the probability in the modeling unit Distribution, is referred to as the first probability herein.
Output unit in output layer is the modeling unit for the phonetic element for representing to use in standard Chinese.Modeling unit Morpheme (binding triphones state) can be used, and modeling unit can be Hidden Markov Model (HMM) or other are suitable Modeling unit.
Decoder component, for utilizing the common of the probability identification target dialect zone dialectal accent of training device assembly output Talk about the word of data.
In embodiments of the present invention, trained data selection standard mandarin data and eight big dialect zone dialect mouths of addition The data of sound, it is common to extract acoustic feature vector, the DNN models of the more hidden layers of training.In addition, to promote DNN models to major The adaptive ability of mandarin data of the localism area with dialectal accent carries the mandarin of dialectal accent under to target localism area It in the identifying system of data, to acquired DNN models, is multiplexed its and hides layer parameter, and using being carried under the target localism area The mandarin data of dialectal accent relearn and output probability value output layer.Finally, the acoustics obtained by such mode Model, compared to single localism area with dialectal accent mandarin data or standard mandarin data train model, Discrimination in identifying system can be promoted.
Through the above scheme, the modeling techniques of deep neural networks greatly improve the classification capacity of the multiple hidden layers of the obtained acoustic model, thereby improving recognition accuracy. In the model adaptation stage, the hidden-layer parameters of the obtained acoustic model are reused, so there is no need to build a separate model for the accented data of each dialect region, which simplifies model training. In addition, with the scheme of the embodiments of the present invention, there is no need to build a database of accented data for each individual dialect region; the probability values of the output layer are updated by learning from a small amount of data, and the acoustic model can adapt to the accented data of different target dialect regions.
In the several embodiments provided in this application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely exemplary. For instance, the division of the units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. Further, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
The above integrated unit, when implemented in the form of a software functional unit, may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to perform part of the steps of the receiving/transmitting methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are preferred embodiments of the present invention. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (12)

  1. A voice modeling method, characterized by comprising:
    using standard Mandarin data and at least one set of Mandarin data with a dialectal accent as input data, and extracting speech feature vectors of the input data;
    training a deep neural network (DNN) acoustic model using the speech feature vectors, wherein an output layer of the acoustic model outputs a first probability;
    obtaining Mandarin data with a target dialectal accent;
    learning the output layer using the Mandarin data with the target dialectal accent, and updating the first probability with a second probability output by the output layer.
  2. The method according to claim 1, characterized in that the step of extracting speech feature vectors of the input data comprises:
    performing windowing and framing operations on the input data to obtain speech frames;
    removing silent frames from the speech frames to obtain the speech feature vectors.
  3. The method according to claim 1, characterized in that the step of training a deep neural network (DNN) acoustic model using the speech feature vectors, wherein an output layer of the acoustic model outputs a first probability, comprises:
    inputting the speech feature vectors as an input signal to an input layer of the DNN acoustic model;
    in multiple hidden layers of the DNN acoustic model, processing the input signal of each hidden layer using first weights corresponding to that hidden layer, to obtain an output signal of each hidden layer;
    in the output layer of the DNN acoustic model, processing the output signal of the uppermost hidden layer to obtain the first probability.
  4. The method according to claim 1, characterized in that the step of learning the output layer using the Mandarin data with the target dialectal accent and updating the first probability with the second probability output by the output layer comprises:
    inputting the Mandarin data with the target dialectal accent as an input signal to the input layer of the DNN acoustic model;
    in multiple hidden layers of the DNN acoustic model, processing the input signal of each hidden layer using second weights corresponding to that hidden layer, to obtain an output signal of each hidden layer;
    in the output layer of the DNN acoustic model, processing the output signal of the uppermost hidden layer to obtain the second probability;
    updating the first probability with the second probability.
  5. The method according to claim 4, characterized in that before the step of learning the output layer using the Mandarin data with the target dialectal accent and updating the first probability with the second probability output by the output layer, the method further comprises:
    removing silent frames from the Mandarin data with the target dialectal accent;
    and the step of inputting the Mandarin data with the target dialectal accent as an input signal to the input layer of the DNN acoustic model comprises:
    inputting the Mandarin data with the target dialectal accent from which the silent frames have been removed, as an input signal, to the input layer of the DNN acoustic model.
  6. The method according to any one of claims 1-5, characterized in that the method further comprises:
    obtaining accented Mandarin data to be recognized;
    recognizing the accented Mandarin data to be recognized according to the second probability.
  7. A voice modeling device, characterized by comprising:
    an extraction module, for using standard Mandarin data and at least one set of Mandarin data with a dialectal accent as input data, and extracting speech feature vectors of the input data;
    a training module, for training a deep neural network (DNN) acoustic model using the speech feature vectors, wherein an output layer of the acoustic model outputs a first probability;
    an acquisition module, for obtaining Mandarin data with a target dialectal accent;
    a modeling module, for learning the output layer using the Mandarin data with the target dialectal accent, and updating the first probability with a second probability output by the output layer.
  8. The device according to claim 7, characterized in that the extraction module comprises:
    a first acquisition submodule, for performing windowing and framing operations on the input data to obtain speech frames;
    a second acquisition submodule, for removing silent frames from the speech frames to obtain the speech feature vectors.
  9. The device according to claim 7, characterized in that the training module comprises:
    a first input layer submodule, for inputting the speech feature vectors as an input signal to an input layer of the DNN acoustic model;
    a first hidden layer submodule, for processing, in multiple hidden layers of the DNN acoustic model, the input signal of each hidden layer using first weights corresponding to that hidden layer, to obtain an output signal of each hidden layer;
    a first output layer submodule, for processing, in the output layer of the DNN acoustic model, the output signal of the uppermost hidden layer to obtain the first probability.
  10. The device according to claim 7, characterized in that the modeling module comprises:
    a second input layer submodule, for inputting the Mandarin data with the target dialectal accent as an input signal to the input layer of the DNN acoustic model;
    a second hidden layer submodule, for processing, in multiple hidden layers of the DNN acoustic model, the input signal of each hidden layer using second weights corresponding to that hidden layer, to obtain an output signal of each hidden layer;
    a second output layer submodule, for processing, in the output layer of the DNN acoustic model, the output signal of the uppermost hidden layer to obtain the second probability;
    an updating submodule, for updating the first probability with the second probability.
  11. The device according to claim 10, characterized in that the device further comprises:
    a processing module, for removing silent frames from the Mandarin data with the target dialectal accent;
    and the second input layer submodule is specifically configured to input the Mandarin data with the target dialectal accent from which the silent frames have been removed, as an input signal, to the input layer of the DNN acoustic model.
  12. The device according to any one of claims 7-11, characterized in that the device further comprises:
    a receiving module, for obtaining accented Mandarin data to be recognized;
    a recognition module, for recognizing the accented Mandarin data to be recognized according to the second probability.
CN201611103738.6A 2016-12-05 2016-12-05 Voice modeling method and device Active CN108172218B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611103738.6A CN108172218B (en) 2016-12-05 2016-12-05 Voice modeling method and device

Publications (2)

Publication Number Publication Date
CN108172218A true CN108172218A (en) 2018-06-15
CN108172218B CN108172218B (en) 2021-01-12

Family

ID=62525918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611103738.6A Active CN108172218B (en) 2016-12-05 2016-12-05 Voice modeling method and device

Country Status (1)

Country Link
CN (1) CN108172218B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103310788A (en) * 2013-05-23 2013-09-18 北京云知声信息技术有限公司 Voice information identification method and system
CN104282300A (en) * 2013-07-05 2015-01-14 中国移动通信集团公司 Non-periodic component syllable model building and speech synthesizing method and device
CN104575490A (en) * 2014-12-30 2015-04-29 苏州驰声信息科技有限公司 Spoken language pronunciation detecting and evaluating method based on deep neural network posterior probability algorithm
EP2889804A1 (en) * 2013-12-30 2015-07-01 Alcatel Lucent Systems and methods for contactless speech recognition
CN105118498A (en) * 2015-09-06 2015-12-02 百度在线网络技术(北京)有限公司 Training method and apparatus of speech synthesis model
CN105206258A (en) * 2015-10-19 2015-12-30 百度在线网络技术(北京)有限公司 Generation method and device of acoustic model as well as voice synthetic method and device
CN105391873A (en) * 2015-11-25 2016-03-09 上海新储集成电路有限公司 Method for realizing local voice recognition in mobile device
CN105578115A (en) * 2015-12-22 2016-05-11 深圳市鹰硕音频科技有限公司 Network teaching method and system with voice assessment function
CN105632501A (en) * 2015-12-30 2016-06-01 中国科学院自动化研究所 Deep-learning-technology-based automatic accent classification method and apparatus
US20160239476A1 (en) * 2015-02-13 2016-08-18 Facebook, Inc. Machine learning dialect identification

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109887497B (en) * 2019-04-12 2021-01-29 北京百度网讯科技有限公司 Modeling method, device and equipment for speech recognition
CN109887497A (en) * 2019-04-12 2019-06-14 北京百度网讯科技有限公司 Modeling method, device and the equipment of speech recognition
CN110033760A (en) * 2019-04-15 2019-07-19 北京百度网讯科技有限公司 Modeling method, device and the equipment of speech recognition
US11688391B2 (en) 2019-04-15 2023-06-27 Beijing Baidu Netcom Science And Technology Co. Mandarin and dialect mixed modeling and speech recognition
CN110033760B (en) * 2019-04-15 2021-01-29 北京百度网讯科技有限公司 Modeling method, device and equipment for speech recognition
CN110211565A (en) * 2019-05-06 2019-09-06 平安科技(深圳)有限公司 Accent recognition method, apparatus and computer readable storage medium
CN110738991A (en) * 2019-10-11 2020-01-31 东南大学 Speech recognition equipment based on flexible wearable sensor
CN110930995B (en) * 2019-11-26 2022-02-11 中国南方电网有限责任公司 Voice recognition model applied to power industry
CN110930995A (en) * 2019-11-26 2020-03-27 中国南方电网有限责任公司 Voice recognition model applied to power industry
CN111179938A (en) * 2019-12-26 2020-05-19 安徽仁昊智能科技有限公司 Speech recognition garbage classification system based on artificial intelligence
CN111243574B (en) * 2020-01-13 2023-01-03 苏州奇梦者网络科技有限公司 Voice model adaptive training method, system, device and storage medium
CN111243574A (en) * 2020-01-13 2020-06-05 苏州奇梦者网络科技有限公司 Voice model adaptive training method, system, device and storage medium
WO2021135438A1 (en) * 2020-07-31 2021-07-08 平安科技(深圳)有限公司 Multilingual speech recognition model training method, apparatus, device, and storage medium
WO2021213161A1 (en) * 2020-11-25 2021-10-28 平安科技(深圳)有限公司 Dialect speech recognition method, apparatus, medium, and electronic device
CN112528679A (en) * 2020-12-17 2021-03-19 科大讯飞股份有限公司 Intention understanding model training method and device and intention understanding method and device
CN112528679B (en) * 2020-12-17 2024-02-13 科大讯飞股份有限公司 Method and device for training intention understanding model, and method and device for intention understanding
CN112770154A (en) * 2021-01-19 2021-05-07 深圳西米通信有限公司 Intelligent set top box with voice interaction function and interaction method thereof
CN112967720A (en) * 2021-01-29 2021-06-15 南京迪港科技有限责任公司 End-to-end voice-to-text model optimization method under small amount of accent data
CN113223542A (en) * 2021-04-26 2021-08-06 北京搜狗科技发展有限公司 Audio conversion method and device, storage medium and electronic equipment
CN113345451A (en) * 2021-04-26 2021-09-03 北京搜狗科技发展有限公司 Sound changing method and device and electronic equipment
CN113345451B (en) * 2021-04-26 2023-08-22 北京搜狗科技发展有限公司 Sound changing method and device and electronic equipment
CN113223542B (en) * 2021-04-26 2024-04-12 北京搜狗科技发展有限公司 Audio conversion method and device, storage medium and electronic equipment
CN113192492A (en) * 2021-04-28 2021-07-30 平安科技(深圳)有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN113192492B (en) * 2021-04-28 2024-05-28 平安科技(深圳)有限公司 Speech recognition method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN108172218B (en) 2021-01-12

Similar Documents

Publication Publication Date Title
CN108172218A (en) A kind of pronunciation modeling method and device
CN109326302B (en) Voice enhancement method based on voiceprint comparison and generation of confrontation network
CN107545903B (en) Voice conversion method based on deep learning
Abdel-Hamid et al. Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code
CN103928023B (en) A kind of speech assessment method and system
CN105632501A (en) Deep-learning-technology-based automatic accent classification method and apparatus
CN107039036B (en) High-quality speaker recognition method based on automatic coding depth confidence network
CN111210807B (en) Speech recognition model training method, system, mobile terminal and storage medium
US20080208577A1 (en) Multi-stage speech recognition apparatus and method
Xie et al. Sequence error (SE) minimization training of neural network for voice conversion.
WO2007114605A1 (en) Acoustic model adaptation methods based on pronunciation variability analysis for enhancing the recognition of voice of non-native speaker and apparatuses thereof
CN107731233A (en) A kind of method for recognizing sound-groove based on RNN
CN109147774B (en) Improved time-delay neural network acoustic model
CN102938252B (en) System and method for recognizing Chinese tone based on rhythm and phonetics features
CN106898355B (en) Speaker identification method based on secondary modeling
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
CN109637526A (en) The adaptive approach of DNN acoustic model based on personal identification feature
CN105575383A (en) Apparatus and method for controlling target information voice output through using voice characteristics of user
CN110931045A (en) Audio feature generation method based on convolutional neural network
CN106297769B (en) A kind of distinctive feature extracting method applied to languages identification
Salam et al. Malay isolated speech recognition using neural network: a work in finding number of hidden nodes and learning parameters.
CN109377986A (en) A kind of non-parallel corpus voice personalization conversion method
CN108831486B (en) Speaker recognition method based on DNN and GMM models
Yang et al. Essence knowledge distillation for speech recognition
CN108182938A (en) A kind of training method of the Mongol acoustic model based on DNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant