CN106875942A - Acoustic model adaptation method based on accent bottleneck features - Google Patents
- Publication number
- CN106875942A CN106875942A CN201611232996.4A CN201611232996A CN106875942A CN 106875942 A CN106875942 A CN 106875942A CN 201611232996 A CN201611232996 A CN 201611232996A CN 106875942 A CN106875942 A CN 106875942A
- Authority
- CN
- China
- Prior art keywords
- accent
- feature
- depth
- acoustic model
- bottleneck
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
Abstract
The invention belongs to the technical field of speech recognition, and in particular relates to an acoustic model adaptation method based on accent bottleneck features. In order to customize a personalized acoustic model for users with different accents, the method provided by the invention comprises the following steps: S1, based on a first deep neural network, taking the voiceprint-spliced features of multiple accented speech data samples as training samples to obtain a deep accent bottleneck network model; S2, based on the deep accent bottleneck network, obtaining the accent-spliced features of the accented speech data; S3, based on a second deep neural network, taking the accent-spliced features of multiple accented speech data samples as training samples to obtain an accent-independent baseline acoustic model; S4, adjusting the parameters of the accent-independent baseline acoustic model with the accent-spliced features of accent-specific speech data to generate an accent-dependent acoustic model. The method of the invention improves the accuracy of accented speech recognition.
Description
Technical field
The invention belongs to the technical field of speech recognition, and in particular relates to an acoustic model adaptation method based on accent bottleneck features.
Background technology
To date, speech recognition technology has become an important entry point for human-computer interaction, and the number of its users keeps growing. Because these users come from all over the country and their accents vary widely, a general-purpose acoustic model for speech recognition can hardly fit all users. Therefore, a corresponding acoustic model needs to be customized for users with each accent. At present, voiceprint feature extraction is widely used in the speaker recognition field, and a speaker's voiceprint features are closely tied to the speaker's accent. Although many researchers have used voiceprint extraction techniques to derive accent features, such techniques cannot characterize accent features at a high level, while a high-level characterization of accent is essential for personalized acoustic model customization.
Therefore, a new method is needed in the art to solve the above problems.
Summary of the invention
In order to solve the above problems in the prior art, namely to customize a personalized acoustic model for users with different accents, the invention provides an acoustic model adaptation method based on accent bottleneck features. The method comprises the following steps:
S1, based on a first deep neural network, taking the voiceprint-spliced features of multiple accented speech data samples as training samples to obtain a deep accent bottleneck network model;
S2, based on the deep accent bottleneck network, obtaining the accent-spliced features of the accented speech data;
S3, based on a second deep neural network, taking the accent-spliced features of multiple accented speech data samples as training samples to obtain an accent-independent baseline acoustic model;
S4, adjusting the parameters of the accent-independent baseline acoustic model with the accent-spliced features of accent-specific speech data to generate an accent-dependent acoustic model.
Preferably, in step S1, the step of obtaining the voiceprint-spliced features comprises:
S11, extracting acoustic features from the accented speech data;
S12, extracting the speaker's voiceprint feature vector using the acoustic features;
S13, fusing the voiceprint feature vector with the acoustic features to generate the voiceprint-spliced features.
Preferably, in step S1, the first neural network is a deep feedforward network model; the deep feedforward network model is trained with the voiceprint-spliced features of the multiple accented speech data samples to obtain the deep accent bottleneck network.
Preferably, step S2 further comprises:
S21, extracting the accent bottleneck features of the accented speech data with the deep accent bottleneck network model;
S22, fusing the accent bottleneck features with the acoustic features to obtain the accent-spliced features of the accented speech data.
Preferably, step S21 further comprises: taking the voiceprint-spliced features of the accented speech data as the input of the deep accent bottleneck network model, and obtaining the accent bottleneck features of the accented speech data by forward propagation.
Preferably, in step S3, the second neural network is a deep bidirectional long short-term memory (BLSTM) recurrent neural network; the deep BLSTM recurrent neural network is trained with multiple accent-spliced features to obtain an accent-independent deep BLSTM acoustic model, which serves as the accent-independent baseline acoustic model.
Preferably, in step S4, the parameters of the output layer of the accent-independent baseline acoustic model are adjusted with the accent-spliced features, generating the accent-dependent acoustic model.
Preferably, in step S4, the parameters of the last output layer of the accent-independent baseline acoustic model are adjusted.
Preferably, the parameters of the output layer of the accent-independent baseline acoustic model are adjusted by the back-propagation algorithm.
The acoustic model adaptation method based on accent bottleneck features of the invention has the following beneficial effects:
(1) The accent-spliced features extracted by the deep accent bottleneck network form a more abstract and more general representation, and can accurately capture a high-level characterization of the accent.
(2) Adapting only the output layer of the accent-independent baseline acoustic model with the accent-spliced features gives each accent its own output layer while the hidden-layer parameters are shared, which reduces the storage footprint of the model.
(3) The acoustic model adaptation method based on accent bottleneck features of the invention improves the accuracy of accented speech recognition.
Brief description of the drawings
Fig. 1 is a flow chart of the acoustic model adaptation method based on accent bottleneck features of the invention;
Fig. 2 is an overall flow chart of an embodiment of the invention;
Fig. 3 is a flow chart of generating the voiceprint-spliced features in an embodiment of the invention;
Fig. 4 is a flow chart of generating the accent-spliced features in an embodiment of the invention.
Detailed description of the embodiments
Preferred embodiments of the invention are described below with reference to the accompanying drawings. It will be apparent to those skilled in the art that these embodiments are only used to explain the technical principles of the invention and are not intended to limit the scope of the invention.
Referring to Fig. 1, which shows the flow chart of the acoustic model adaptation method based on accent bottleneck features of the invention, the method of the invention comprises the following steps:
S1, based on a first neural network model, taking the voiceprint-spliced features of multiple accented speech data samples as training samples to obtain the deep accent bottleneck network;
S2, based on the deep accent bottleneck network, obtaining the accent-spliced features of the accented speech data;
S3, based on a second neural network model, taking the accent-spliced features of multiple accented speech data samples as training samples to obtain an accent-independent baseline acoustic model;
S4, adjusting the parameters of the accent-independent baseline acoustic model with the accent-spliced features of accent-specific speech data to generate an accent-dependent acoustic model.
Fig. 2 shows the overall flow chart of an embodiment of the invention. The method of the invention is described in detail below with reference to Fig. 2.
In step S1, the step of obtaining the voiceprint-spliced features comprises:
S11, extracting acoustic features from the accented speech data. Specifically, this step mainly uses Mel spectrum features or Mel-frequency cepstral features. Taking Mel-frequency cepstral features as an example, the static parameters may be 13-dimensional; first-order and second-order differences are then computed, so that the final parameter vector is 39-dimensional, and this 39-dimensional feature is used in subsequent processing.
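The 13-to-39-dimension expansion described above can be sketched as appending frame-level first- and second-order differences to the static cepstral parameters. This is a minimal illustration only: the delta estimator (`np.gradient`) and the random stand-in input are assumptions, not part of the patent.

```python
import numpy as np

def add_deltas(static):
    """Append first- and second-order differences to per-frame features.

    static: (T, 13) matrix of static cepstral parameters.
    Returns a (T, 39) matrix: [static, delta, delta-delta].
    """
    # Central differences with one-sided differences at the edges.
    delta = np.gradient(static, axis=0)
    delta2 = np.gradient(delta, axis=0)
    return np.concatenate([static, delta, delta2], axis=1)

T = 200                               # number of frames (illustrative)
mfcc = np.random.randn(T, 13)         # stand-in for real 13-dim MFCCs
feats = add_deltas(mfcc)
print(feats.shape)                    # (200, 39)
```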
S12, extracting the speaker's voiceprint feature vector using the acoustic features. Specifically, a Gaussian mixture model-universal background model (GMM-UBM) is trained with the acoustic features, and the GMM-UBM is then used to extract each speaker's voiceprint feature vector from the acoustic features; the dimension of the voiceprint feature vector is 80.
S13, fusing the voiceprint feature vector with the acoustic features to generate the voiceprint-spliced features. As shown in Fig. 3, when generating the voiceprint-spliced features, the acoustic features extracted in S11 are fused with the voiceprint feature vector extracted in S12. Specifically, each speaker's voiceprint feature vector is spliced onto the acoustic features of every frame, thereby generating the voiceprint-spliced features.
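The per-frame splicing of S13 amounts to tiling one speaker-level 80-dimensional voiceprint vector across all frames of an utterance and concatenating it with each 39-dimensional acoustic frame. The dimensions follow the embodiment; the random stand-in data is an assumption for illustration.

```python
import numpy as np

def splice_voiceprint(acoustic, voiceprint):
    """Concatenate a per-speaker voiceprint vector onto every acoustic frame.

    acoustic:   (T, 39) frame-level acoustic features.
    voiceprint: (80,) speaker-level voiceprint vector.
    Returns (T, 119) voiceprint-spliced features.
    """
    tiled = np.tile(voiceprint, (acoustic.shape[0], 1))  # repeat for each frame
    return np.concatenate([acoustic, tiled], axis=1)

acoustic = np.random.randn(150, 39)   # stand-in 39-dim acoustic frames
voiceprint = np.random.randn(80)      # stand-in 80-dim voiceprint vector
spliced = splice_voiceprint(acoustic, voiceprint)
print(spliced.shape)                  # (150, 119)
```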
In step S1, the first neural network may be a deep feedforward network model; this deep feedforward network model is trained with the generated voiceprint-spliced features to obtain the deep accent bottleneck network. In this embodiment, the last hidden layer of the deep accent bottleneck network has 60 nodes, fewer than the other hidden layers, each of which may have 1024 or 2048 nodes. In this embodiment, the training criterion of the deep feedforward network model is cross entropy and the training method is the back-propagation algorithm. The activation function of the deep feedforward network model may be the sigmoid function or the hyperbolic tangent function, and the loss function of the network is cross entropy; these belong to techniques well known in the art and are not described in detail here.
In step S2, the step of obtaining the accent-spliced features comprises:
S21, extracting the accent bottleneck features of the accented speech data with the deep accent bottleneck network;
S22, fusing the accent bottleneck features with the acoustic features to obtain the accent-spliced features of the accented speech data.
Specifically, the deep accent bottleneck network obtained in step S1 is regarded as a feature extractor; the voiceprint-spliced features generated in step S13 are used as the input of the deep accent bottleneck network, and the accent bottleneck features of the accented speech data are obtained by forward propagation. In this embodiment, the accent bottleneck features are 60-dimensional. As shown in Fig. 4, when generating the accent-spliced features, the accent bottleneck features extracted in S21 are fused at the frame level with the acoustic features extracted in S11, thereby generating the accent-spliced features.
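The feature-extractor view of step S2 can be sketched as a plain feedforward pass whose last hidden (bottleneck) layer yields the 60-dimensional accent features, followed by frame-level concatenation with the 39-dimensional acoustic features. The layer widths below are scaled down from the 1024/2048 of the embodiment, and the random weights stand in for a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(d_in, d_out):
    """Random stand-in for one trained weight matrix and bias."""
    return rng.standard_normal((d_in, d_out)) * 0.1, np.zeros(d_out)

# Feedforward net: 119 -> 256 -> 256 -> 60 (bottleneck); widths are illustrative.
W1, b1 = layer(119, 256)
W2, b2 = layer(256, 256)
W3, b3 = layer(256, 60)   # last hidden layer: the 60-node bottleneck

def bottleneck_features(vp_spliced):
    """Forward-propagate voiceprint-spliced frames and return the
    activations of the bottleneck (last hidden) layer."""
    h = np.tanh(vp_spliced @ W1 + b1)
    h = np.tanh(h @ W2 + b2)
    return np.tanh(h @ W3 + b3)             # (T, 60) accent bottleneck features

T = 150
vp_spliced = rng.standard_normal((T, 119))  # stand-in voiceprint-spliced input
acoustic = rng.standard_normal((T, 39))     # stand-in 39-dim acoustic features
bn = bottleneck_features(vp_spliced)
accent_spliced = np.concatenate([bn, acoustic], axis=1)  # frame-level fusion
print(accent_spliced.shape)                 # (150, 99)
```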
In step S3, the second neural network may be a deep bidirectional long short-term memory (BLSTM) recurrent neural network, which is trained with the accent-spliced features obtained in step S2: the accent-spliced features obtained in S2 are fed into the deep BLSTM recurrent neural network, and the labels of its output layer are the initials and finals. This yields an accent-independent deep BLSTM acoustic model, which serves as the accent-independent baseline acoustic model. In this embodiment, the training criterion of the deep BLSTM recurrent neural network is the connectionist temporal classification (CTC) loss, and the training method is the back-propagation algorithm. The deep BLSTM recurrent neural network can both memorize the historical information of the input features and exploit their future context; it realizes its memory and prediction functions with three control gates, namely the input gate, the forget gate and the output gate. The deep BLSTM recurrent neural network belongs to techniques well known in the art and is not described in detail here.
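The three control gates mentioned above can be illustrated with a single LSTM cell step in numpy. The dimensions and random parameters are illustrative assumptions; a real deep bidirectional model stacks such cells in both time directions and trains them with CTC.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h = 99, 32                    # input (accent-spliced) and hidden sizes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One weight matrix per gate plus the cell candidate, acting on [x, h_prev].
Wi, Wf, Wo, Wc = (rng.standard_normal((d_in + d_h, d_h)) * 0.1 for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    i = sigmoid(z @ Wi)               # input gate: what to write to memory
    f = sigmoid(z @ Wf)               # forget gate: what history to keep
    o = sigmoid(z @ Wo)               # output gate: what memory to expose
    c = f * c_prev + i * np.tanh(z @ Wc)
    h = o * np.tanh(c)
    return h, c

h = np.zeros(d_h); c = np.zeros(d_h)
for x in rng.standard_normal((10, d_in)):   # 10 stand-in frames
    h, c = lstm_step(x, h, c)
print(h.shape)                              # (32,)
```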
In step S4, the parameters of the output layer (generally the last output layer) of the accent-independent baseline acoustic model obtained in step S3 are fine-tuned with the accent-spliced features obtained in step S2, generating the accent-dependent acoustic model. Specifically, the accent-spliced features of each accent are used as the input of the accent-independent baseline acoustic model; each accent corresponds to its own accent-dependent output layer, while the hidden layers are shared across accents. Further, the parameters are fine-tuned with the back-propagation algorithm. Because the accent-independent baseline acoustic model is based on the deep BLSTM recurrent neural network, the finally generated accent-dependent acoustic model is also based on the deep BLSTM recurrent neural network; the labels of its output layer are the initials and finals, and, combined with a pronunciation dictionary and a language model, it can recognize the text corresponding to the audio data.
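The adaptation scheme of S4 — shared, frozen hidden layers with one fine-tuned output layer per accent — can be sketched as a single softmax output-layer gradient step. The network sizes, the stand-in data, and the learning rate are illustrative assumptions, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(2)
d_h, n_labels = 32, 10                 # hidden size and number of output labels

W_hidden = rng.standard_normal((99, d_h)) * 0.1   # frozen, shared across accents
W_base = rng.standard_normal((d_h, n_labels)) * 0.1  # baseline output layer

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def adapt_output_layer(W_out, feats, labels, lr=0.1):
    """One back-propagation step on the output layer only;
    the shared hidden layer is never touched."""
    h = np.tanh(feats @ W_hidden)                 # shared hidden representation
    p = softmax(h @ W_out)
    p[np.arange(len(labels)), labels] -= 1.0      # softmax cross-entropy grad
    return W_out - lr * (h.T @ p) / len(labels)

# One accent-dependent output layer per accent, all starting from the baseline.
accent_layers = {}
for accent in ["north", "south"]:
    feats = rng.standard_normal((20, 99))         # stand-in accent-spliced data
    labels = rng.integers(0, n_labels, 20)        # stand-in frame labels
    accent_layers[accent] = adapt_output_layer(W_base.copy(), feats, labels)
```

Because only `W_out` changes, the hidden layers can be stored once while each accent keeps its own small output layer, matching the storage-saving argument made above.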
So far, the technical solution of the invention has been described with reference to the preferred embodiments shown in the drawings. However, those skilled in the art will readily understand that the scope of protection of the invention is obviously not limited to these specific embodiments. Without departing from the principles of the invention, those skilled in the art can make equivalent changes or substitutions to the relevant technical features, and the technical solutions after such changes or substitutions all fall within the scope of protection of the invention.
Claims (9)
1. An acoustic model adaptation method based on accent bottleneck features, characterized in that the method comprises the following steps:
S1, based on a first deep neural network, taking the voiceprint-spliced features of multiple accented speech data samples as training samples to obtain a deep accent bottleneck network model;
S2, based on the deep accent bottleneck network, obtaining the accent-spliced features of the accented speech data;
S3, based on a second deep neural network, taking the accent-spliced features of multiple accented speech data samples as training samples to obtain an accent-independent baseline acoustic model;
S4, adjusting the parameters of the accent-independent baseline acoustic model with the accent-spliced features of accent-specific speech data to generate an accent-dependent acoustic model.
2. The method according to claim 1, characterized in that, in step S1, the step of obtaining the voiceprint-spliced features comprises:
S11, extracting acoustic features from the accented speech data;
S12, extracting the speaker's voiceprint feature vector using the acoustic features;
S13, fusing the voiceprint feature vector with the acoustic features to generate the voiceprint-spliced features.
3. The method according to claim 2, characterized in that, in step S1, the first neural network is a deep feedforward neural network, which is trained with the voiceprint-spliced features of the multiple accented speech data samples to obtain the deep accent bottleneck network.
4. The method according to claim 3, characterized in that step S2 further comprises:
S21, extracting the accent bottleneck features of the accented speech data with the deep accent bottleneck network model;
S22, fusing the accent bottleneck features with the acoustic features to obtain the accent-spliced features of the accented speech data.
5. The method according to claim 4, characterized in that step S21 further comprises: taking the voiceprint-spliced features of the accented speech data as the input of the deep accent bottleneck network model, and obtaining the accent bottleneck features of the accented speech data by forward propagation.
6. The method according to claim 5, characterized in that, in step S3, the second neural network is a deep bidirectional long short-term memory recurrent neural network, which is trained with multiple accent-spliced features to obtain an accent-independent deep bidirectional long short-term memory acoustic model; the accent-independent deep bidirectional long short-term memory acoustic model serves as the accent-independent baseline acoustic model.
7. The method according to claim 6, characterized in that, in step S4, the parameters of the output layer of the accent-independent baseline acoustic model are adjusted with the accent-spliced features, generating the accent-dependent acoustic model.
8. The method according to claim 7, characterized in that, in step S4, the parameters of the last output layer of the accent-independent baseline acoustic model are adjusted.
9. The method according to claim 7 or 8, characterized in that the parameters of the output layer of the accent-independent baseline acoustic model are adjusted by the back-propagation algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611232996.4A CN106875942B (en) | 2016-12-28 | 2016-12-28 | Acoustic model self-adaption method based on accent bottleneck characteristics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611232996.4A CN106875942B (en) | 2016-12-28 | 2016-12-28 | Acoustic model self-adaption method based on accent bottleneck characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106875942A true CN106875942A (en) | 2017-06-20 |
CN106875942B CN106875942B (en) | 2021-01-22 |
Family
ID=59164199
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611232996.4A Active CN106875942B (en) | 2016-12-28 | 2016-12-28 | Acoustic model self-adaption method based on accent bottleneck characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106875942B (en) |
-
2016
- 2016-12-28 CN CN201611232996.4A patent/CN106875942B/en active Active
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108074575A (en) * | 2017-12-14 | 2018-05-25 | 广州势必可赢网络科技有限公司 | A kind of auth method and device based on Recognition with Recurrent Neural Network |
WO2019154107A1 (en) * | 2018-02-12 | 2019-08-15 | 阿里巴巴集团控股有限公司 | Voiceprint recognition method and device based on memorability bottleneck feature |
CN108538285A (en) * | 2018-03-05 | 2018-09-14 | 清华大学 | A kind of various keyword detection method based on multitask neural network |
CN108538285B (en) * | 2018-03-05 | 2021-05-04 | 清华大学 | Multi-instance keyword detection method based on multitask neural network |
CN108682416A (en) * | 2018-04-11 | 2018-10-19 | 深圳市卓翼科技股份有限公司 | local adaptive voice training method and system |
CN108682416B (en) * | 2018-04-11 | 2021-01-01 | 深圳市卓翼科技股份有限公司 | Local adaptive speech training method and system |
CN108682417A (en) * | 2018-05-14 | 2018-10-19 | 中国科学院自动化研究所 | Small data Speech acoustics modeling method in speech recognition |
CN108922559A (en) * | 2018-07-06 | 2018-11-30 | 华南理工大学 | Recording terminal clustering method based on voice time-frequency conversion feature and integral linear programming |
CN109147763A (en) * | 2018-07-10 | 2019-01-04 | 深圳市感动智能科技有限公司 | A kind of audio-video keyword recognition method and device based on neural network and inverse entropy weighting |
WO2020014890A1 (en) * | 2018-07-18 | 2020-01-23 | 深圳魔耳智能声学科技有限公司 | Accent-based voice recognition processing method, electronic device and storage medium |
CN109074804A (en) * | 2018-07-18 | 2018-12-21 | 深圳魔耳智能声学科技有限公司 | Voice recognition processing method, electronic equipment and storage medium based on accent |
CN109074804B (en) * | 2018-07-18 | 2021-04-06 | 深圳魔耳智能声学科技有限公司 | Accent-based speech recognition processing method, electronic device, and storage medium |
CN110890085B (en) * | 2018-09-10 | 2023-09-12 | 阿里巴巴集团控股有限公司 | Voice recognition method and system |
CN110890085A (en) * | 2018-09-10 | 2020-03-17 | 阿里巴巴集团控股有限公司 | Voice recognition method and system |
CN109887497A (en) * | 2019-04-12 | 2019-06-14 | 北京百度网讯科技有限公司 | Modeling method, device and the equipment of speech recognition |
CN109887497B (en) * | 2019-04-12 | 2021-01-29 | 北京百度网讯科技有限公司 | Modeling method, device and equipment for speech recognition |
CN111833847A (en) * | 2019-04-15 | 2020-10-27 | 北京百度网讯科技有限公司 | Speech processing model training method and device |
CN110033760A (en) * | 2019-04-15 | 2019-07-19 | 北京百度网讯科技有限公司 | Modeling method, device and the equipment of speech recognition |
CN110033760B (en) * | 2019-04-15 | 2021-01-29 | 北京百度网讯科技有限公司 | Modeling method, device and equipment for speech recognition |
US11688391B2 (en) | 2019-04-15 | 2023-06-27 | Beijing Baidu Netcom Science And Technology Co. | Mandarin and dialect mixed modeling and speech recognition |
CN110570858A (en) * | 2019-09-19 | 2019-12-13 | 芋头科技(杭州)有限公司 | Voice awakening method and device, intelligent sound box and computer readable storage medium |
CN110930982A (en) * | 2019-10-31 | 2020-03-27 | 国家计算机网络与信息安全管理中心 | Multi-accent acoustic model and multi-accent voice recognition method |
CN111370025A (en) * | 2020-02-25 | 2020-07-03 | 广州酷狗计算机科技有限公司 | Audio recognition method and device and computer storage medium |
CN111508501B (en) * | 2020-07-02 | 2020-09-29 | 成都晓多科技有限公司 | Voice recognition method and system with accent for telephone robot |
CN111508501A (en) * | 2020-07-02 | 2020-08-07 | 成都晓多科技有限公司 | Voice recognition method and system with accent for telephone robot |
CN112992126A (en) * | 2021-04-22 | 2021-06-18 | 北京远鉴信息技术有限公司 | Voice authenticity verification method and device, electronic equipment and readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN106875942B (en) | 2021-01-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106875942A (en) | Acoustic model adaptive approach based on accent bottleneck characteristic | |
JP7427723B2 (en) | Text-to-speech synthesis in target speaker's voice using neural networks | |
US11837216B2 (en) | Speech recognition using unspoken text and speech synthesis | |
CN106531157B (en) | Regularization accent adaptive approach in speech recognition | |
CN109003601A (en) | A kind of across language end-to-end speech recognition methods for low-resource Tujia language | |
US20200075024A1 (en) | Response method and apparatus thereof | |
CN110223714A (en) | A kind of voice-based Emotion identification method | |
CN107452379B (en) | Dialect language identification method and virtual reality teaching method and system | |
CN108806667A (en) | The method for synchronously recognizing of voice and mood based on neural network | |
CN110491393B (en) | Training method of voiceprint representation model and related device | |
JP2017040919A (en) | Speech recognition apparatus, speech recognition method, and speech recognition system | |
CN106688034A (en) | Text-to-speech with emotional content | |
CN103928023A (en) | Voice scoring method and system | |
CN105760852A (en) | Driver emotion real time identification method fusing facial expressions and voices | |
CN107945790A (en) | A kind of emotion identification method and emotion recognition system | |
CN108172218A (en) | A kind of pronunciation modeling method and device | |
CN107871496A (en) | Audio recognition method and device | |
CN110010136A (en) | The training and text analyzing method, apparatus, medium and equipment of prosody prediction model | |
CN108986798A (en) | Processing method, device and the equipment of voice data | |
CN109493846B (en) | English accent recognition system | |
Sreevidya et al. | Sentiment analysis by deep learning approaches | |
CN109377986A (en) | A kind of non-parallel corpus voice personalization conversion method | |
Peguda et al. | Speech to sign language translation for Indian languages | |
Wu et al. | Oral English Speech Recognition Based on Enhanced Temporal Convolutional Network. | |
CN113470622A (en) | Conversion method and device capable of converting any voice into multiple voices |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |