CN108172218A - Pronunciation modeling method and device - Google Patents


Info

Publication number: CN108172218A
Application number: CN201611103738.6A
Authority: CN (China)
Prior art keywords: layer, input, data, probability, output
Legal status: Granted
Other languages: Chinese (zh)
Other versions: CN108172218B (en)
Inventor: 徐衍瀚
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Communications Co Ltd
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Communications Co Ltd
Application filed by China Mobile Communications Group Co Ltd and China Mobile Communications Co Ltd
Priority to CN201611103738.6A (patent CN108172218B)
Publication of CN108172218A
Application granted
Publication of CN108172218B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a pronunciation modeling method and device in the field of speech recognition, intended to reduce the complexity of speech model building. The pronunciation modeling method of the present invention includes: taking standard Mandarin data and at least one set of Mandarin data with a dialectal accent as input data, and extracting speech feature vectors of the input data; training a deep neural network (DNN) acoustic model with the speech feature vectors, where the output layer of the acoustic model outputs a first probability; obtaining Mandarin data with a target dialectal accent; and retraining the output layer with the Mandarin data with the target dialectal accent, and updating the first probability with a second probability output by the output layer. The present invention reduces the complexity of speech model building.

Description

Pronunciation modeling method and device
Technical field
The present invention relates to the field of speech recognition, and in particular to a pronunciation modeling method and device.
Background art
Speech recognition enables a machine to understand human speech by converting a voice signal into input a computer can process. Current speech recognition technology rests mainly on statistical pattern recognition and artificial neural networks.
The hidden Markov model (HMM) is the most mature model in speech recognition and related speech technologies; modeling the temporal structure of speech statistically with hidden Markov models has yielded good results.
Speech recognition systems based on deep neural networks (DNNs) have attracted growing attention from researchers in recent years. The concept of deep learning grew out of research on artificial neural networks and was proposed by Hinton et al. in 2006. Its essence is to build machine learning models with many hidden layers and train them on massive amounts of data, so as to learn more useful features and ultimately improve the accuracy of classification or prediction. Two viewpoints are central: (1) an artificial neural network with many hidden layers has excellent feature learning ability, and the features it learns describe the data more essentially, which benefits classification; (2) the difficulty of training deep neural networks can be effectively overcome by layer-wise initialization, which is realized through unsupervised learning.
To improve recognition accuracy for Mandarin spoken with a dialectal accent, the prior art provides a variety of methods. Some improve the training procedure used in acoustic modeling; others improve the language model of the recognition system. However, in existing methods for recognizing Mandarin with a dialectal accent, the complexity of training the model is high.
Summary of the invention
In view of this, the present invention provides a pronunciation modeling method and device to reduce the complexity of speech model building.
To solve the above technical problem, the present invention provides a pronunciation modeling method, including:
taking standard Mandarin data and at least one set of Mandarin data with a dialectal accent as input data, and extracting speech feature vectors of the input data;
training a deep neural network (DNN) acoustic model with the speech feature vectors, where the output layer of the acoustic model outputs a first probability;
obtaining Mandarin data with a target dialectal accent; and
retraining the output layer with the Mandarin data with the target dialectal accent, and updating the first probability with a second probability output by the output layer.
The step of extracting the speech feature vectors of the input data includes:
windowing and framing the input data to obtain speech frames; and
removing silent frames from the speech frames to obtain the speech feature vectors.
The step of training the DNN acoustic model with the speech feature vectors, where the output layer of the acoustic model outputs the first probability, includes:
feeding the speech feature vectors as an input signal into the input layer of the DNN acoustic model;
in each of the multiple hidden layers of the DNN acoustic model, processing that hidden layer's input signal with the layer's first weights to obtain the layer's output signal; and
in the output layer of the DNN acoustic model, processing the output signal of the topmost hidden layer to obtain the first probability.
The step of retraining the output layer with the Mandarin data with the target dialectal accent and updating the first probability with the second probability output by the output layer includes:
feeding the Mandarin data with the target dialectal accent as an input signal into the input layer of the DNN acoustic model;
in each of the multiple hidden layers of the DNN acoustic model, processing that hidden layer's input signal with the layer's second weights to obtain the layer's output signal;
in the output layer of the DNN acoustic model, processing the output signal of the topmost hidden layer to obtain the second probability; and
updating the first probability with the second probability.
Before the step of retraining the output layer with the Mandarin data with the target dialectal accent and updating the first probability with the second probability output by the output layer, the method further includes:
removing silent frames from the Mandarin data with the target dialectal accent.
In this case, the step of feeding the Mandarin data with the target dialectal accent as an input signal into the input layer of the DNN acoustic model includes:
feeding the Mandarin data with the target dialectal accent, with the silent frames removed, as an input signal into the input layer of the DNN acoustic model.
The method further includes:
obtaining accented Mandarin data to be recognized; and
recognizing the accented Mandarin data to be recognized according to the second probability.
In a second aspect, the present invention provides a pronunciation modeling device, including:
an extraction module, configured to take standard Mandarin data and at least one set of Mandarin data with a dialectal accent as input data, and to extract speech feature vectors of the input data;
a training module, configured to train a deep neural network (DNN) acoustic model with the speech feature vectors, where the output layer of the acoustic model outputs a first probability;
an acquisition module, configured to obtain Mandarin data with a target dialectal accent; and
a modeling module, configured to retrain the output layer with the Mandarin data with the target dialectal accent, and to update the first probability with a second probability output by the output layer.
The extraction module includes:
a first acquisition submodule, configured to window and frame the input data to obtain speech frames; and
a second acquisition submodule, configured to remove silent frames from the speech frames to obtain the speech feature vectors.
The training module includes:
a first input layer submodule, configured to feed the speech feature vectors as an input signal into the input layer of the DNN acoustic model;
a first hidden layer submodule, configured to process, in each of the multiple hidden layers of the DNN acoustic model, that hidden layer's input signal with the layer's first weights to obtain the layer's output signal; and
a first output layer submodule, configured to process, in the output layer of the DNN acoustic model, the output signal of the topmost hidden layer to obtain the first probability.
The modeling module includes:
a second input layer submodule, configured to feed the Mandarin data with the target dialectal accent as an input signal into the input layer of the DNN acoustic model;
a second hidden layer submodule, configured to process, in each of the multiple hidden layers of the DNN acoustic model, that hidden layer's input signal with the layer's second weights to obtain the layer's output signal;
a second output layer submodule, configured to process, in the output layer of the DNN acoustic model, the output signal of the topmost hidden layer to obtain the second probability; and
an update submodule, configured to update the first probability with the second probability.
The device further includes:
a processing module, configured to remove silent frames from the Mandarin data with the target dialectal accent.
In this case, the second input layer submodule is specifically configured to feed the Mandarin data with the target dialectal accent, with the silent frames removed, as an input signal into the input layer of the DNN acoustic model.
The device further includes:
a receiving module, configured to obtain accented Mandarin data to be recognized; and
a recognition module, configured to recognize the accented Mandarin data to be recognized according to the second probability.
The above technical solutions of the present invention have the following beneficial effects:
In the embodiments of the present invention, an acoustic model is trained with deep neural network techniques on the basis of standard Mandarin data and at least one set of Mandarin data with a dialectal accent, yielding a first probability. For Mandarin data with a target dialectal accent, the output layer of the acoustic model is retrained, and the first probability is updated with the second probability output by the output layer. Thus, with the scheme of the embodiments of the present invention, the parameters of the trained acoustic model's hidden layers are reused during adaptation with the Mandarin data of the target dialectal accent, and no separate model needs to be built for the dialect-accented data of each dialect area. This simplifies model training and thereby reduces the complexity of speech model building.
Description of the drawings
Fig. 1 is a flow chart of the pronunciation modeling method of Embodiment 1 of the present invention;
Fig. 2 is a structural diagram of the pronunciation modeling device of Embodiment 2 of the present invention;
Fig. 3 is a schematic diagram of the pronunciation modeling device of Embodiment 2 of the present invention;
Fig. 4 is a schematic diagram of the automatic speech recognition system of Embodiment 3 of the present invention.
Detailed description
Specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples are intended to illustrate the present invention, not to limit its scope.
Embodiment 1
As shown in Fig. 1, the pronunciation modeling method of Embodiment 1 of the present invention includes:
Step 101: take standard Mandarin data and at least one set of Mandarin data with a dialectal accent as input data, and extract speech feature vectors of the input data.
Chinese as spoken mainly comprises the officially announced standard Mandarin and Mandarin with the dialectal accents of various regions. Chinese dialects can be broadly grouped by region into eight major dialect areas. Standard Mandarin is a single language, but its pronunciation is influenced by the dialectal accent of each area; compared with standard Mandarin, accented pronunciation exhibits sound changes on certain words. As a result, an acoustic model trained only on standard Mandarin data cannot correctly describe the acoustic features of these sound changes. Moreover, in engineering practice it is difficult to collect enough Mandarin data with the accent of each specific dialect and to build a database of sufficient size.
Therefore, in the embodiments of the present invention, standard Mandarin data and dialect-accented Mandarin data from at least one dialect area are selected together as input data, acoustic feature vectors are extracted from them jointly, and a DNN model with multiple hidden layers is trained. Preferably, the input data comprise standard Mandarin data and Mandarin data with the accents of all eight major dialect areas.
To make the subsequently built acoustic model more accurate, the input data are windowed and framed to obtain speech frames. The short-time energy of each speech frame is then computed, and silent frames are removed according to it: the short-time energy of each speech frame is compared with a predetermined threshold, and a frame whose short-time energy is below the threshold can be treated as a silent frame. Removing the silent frames from the speech frames yields the speech feature vectors. The threshold can be set arbitrarily.
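As a concrete illustration of the framing and silence-removal procedure above, the following sketch drops frames whose short-time energy falls below the threshold. The frame length, hop size, window choice, and energy threshold are all assumptions; the patent leaves them unspecified.

```python
import numpy as np

def remove_silent_frames(signal, frame_len=400, hop=160, threshold=0.01):
    """Window/frame a 1-D signal and drop frames whose short-time
    energy falls below a fixed threshold (treated as silence)."""
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        # Short-time energy: sum of squared windowed samples.
        if np.sum(frame ** 2) >= threshold:
            frames.append(frame)
    return np.array(frames)

# A burst of "speech" (noise) surrounded by near-silence.
rng = np.random.default_rng(0)
sig = np.concatenate([np.zeros(1600), rng.normal(0, 0.3, 1600), np.zeros(1600)])
kept = remove_silent_frames(sig)
print(kept.shape[0])  # number of frames kept after silence removal
```

Only the frames overlapping the noisy middle section survive; the purely silent frames at both ends are discarded before feature extraction.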
The speech feature vectors may also be context-dependent, i.e., configured to cover the features of multiple frames. The features may be, for example, Mel-frequency cepstral coefficients (MFCC), perceptual linear predictive (PLP) features, and the like.
Step 102: train a DNN acoustic model with the speech feature vectors, where the output layer of the acoustic model outputs a first probability.
In practical applications, the DNN acoustic model includes:
an input layer, for receiving the speech feature vectors;
multiple hidden layers (at least three). Each hidden layer contains multiple nodes (neurons), and each node in a hidden layer is configured to perform a linear or nonlinear transformation on the output of at least one node of the adjacent lower layer in the DNN; the input of a node in an upper hidden layer can be based on the output of one or several nodes in the adjacent lower layer. Each hidden layer has corresponding weights, obtained from the acoustic signals of the training data. When the model is trained, the initial weights of each hidden layer can be obtained by pre-training the model with a supervised or unsupervised learning process; the weights of each hidden layer can then be fine-tuned with the back-propagation (BP) algorithm;
an output layer, for receiving the output signal of the topmost hidden layer. The nodes of the output layer process the received signal according to modeling units composed of Mandarin pronunciation phonemes, and the output is a probability distribution over the modeling units, referred to here simply as a probability.
The output units of the output layer are modeling units representing the phonetic elements used in standard Mandarin. The modeling units may be phones (tied triphone states), and may be hidden Markov model (HMM) units or other suitable modeling units.
Specifically, in this step, the speech feature vectors are fed as an input signal into the input layer of the DNN acoustic model; in each of the multiple hidden layers of the DNN acoustic model, that hidden layer's input signal is processed with the layer's first weights to obtain the layer's output signal; and in the output layer of the DNN acoustic model, the output signal of the topmost hidden layer is processed to obtain the first probability.
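Read in the usual way (each hidden layer applies an affine transform followed by a nonlinearity, and the output layer normalizes its activations into a distribution over the modeling units), the forward pass of this step can be sketched as follows. The layer sizes, the sigmoid nonlinearity, and the softmax output are illustrative assumptions, not details fixed by the patent.

```python
import numpy as np

def forward(x, hidden_weights, output_weights):
    """Propagate a feature vector through the hidden layers
    (the "first weights") and the softmax output layer of a DNN."""
    h = x
    for W, b in hidden_weights:
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))  # sigmoid hidden layer
    W_out, b_out = output_weights
    z = W_out @ h + b_out
    z -= z.max()                       # numerical stability
    p = np.exp(z) / np.exp(z).sum()    # softmax: the "first probability"
    return p

rng = np.random.default_rng(1)
dims = [39, 64, 64, 64, 12]  # 39-dim features, 3 hidden layers, 12 modeling units
hidden = [(rng.normal(0, 0.1, (dims[i + 1], dims[i])), np.zeros(dims[i + 1]))
          for i in range(3)]
out_w = (rng.normal(0, 0.1, (dims[4], dims[3])), np.zeros(dims[4]))
p = forward(rng.normal(0, 1, 39), hidden, out_w)
print(p.sum())  # the output sums to 1: a distribution over the modeling units
```

The returned vector plays the role of the first probability: one posterior value per modeling unit for the given feature vector.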
Step 103: obtain Mandarin data with a target dialectal accent.
The Mandarin data with the target dialectal accent may be any Mandarin data with a dialectal accent.
Step 104: retrain the output layer with the Mandarin data with the target dialectal accent, and update the first probability with the second probability output by the output layer.
In the embodiments of the present invention, the process of steps 103 and 104 may be called adaptively adjusting the DNN acoustic model of step 102 with the Mandarin data with the target dialectal accent. In the model adaptation stage, the output layer is retrained with the Mandarin data with the target dialectal accent, and the newly learned probability values of the output layer directly replace the probability values that the output layer produced in the acoustic model trained in step 102 on standard Mandarin data and the various dialect-accented data.
Specifically, the Mandarin data with the target dialectal accent are fed as an input signal into the input layer of the DNN acoustic model; in each of the multiple hidden layers of the DNN acoustic model, that hidden layer's input signal is processed with the layer's second weights to obtain the layer's output signal; in the output layer of the DNN acoustic model, the output signal of the topmost hidden layer is processed to obtain the second probability; and the first probability is updated with the second probability.
It should be noted that if the training data in step 104 are relatively scarce, the same weights as the corresponding hidden layers of step 102 can be used during model adaptation. In this way, the scheme does not require a large amount of data: the acoustic model already obtained can adapt itself to the Mandarin data with the target dialectal accent, which improves the recognition accuracy on such data. If the training data in step 104 are relatively abundant, the weights of the hidden layers, and of the topmost hidden layer in particular, can also be readjusted with the Mandarin data with the target dialectal accent, and the output probability of the output layer updated, likewise improving the recognition rate of the model.
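The low-data adaptation case described above (hidden layer weights reused unchanged, only the output layer relearned on target-accent data) can be sketched as follows. The softmax output, cross-entropy criterion, gradient-descent update, and all sizes are assumptions, since the patent does not name a training criterion.

```python
import numpy as np

def adapt_output_layer(hidden_out, labels, W, b, lr=0.5, epochs=200):
    """Retrain only the output layer on target-accent data.
    hidden_out: topmost-hidden-layer activations (frozen, reused),
    labels: target modeling-unit indices for each frame."""
    n = len(labels)
    for _ in range(epochs):
        z = hidden_out @ W.T + b
        z -= z.max(axis=1, keepdims=True)
        p = np.exp(z); p /= p.sum(axis=1, keepdims=True)   # softmax
        grad = p.copy()
        grad[np.arange(n), labels] -= 1.0                  # dL/dz, cross-entropy
        W -= lr * grad.T @ hidden_out / n                  # output weights only
        b -= lr * grad.mean(axis=0)                        # hidden layers untouched
    return W, b

# Synthetic stand-in for frozen topmost-hidden activations of target-accent data:
# three well-separated clusters, one per modeling unit.
rng = np.random.default_rng(2)
means = rng.normal(0, 3, (3, 8))
y = rng.integers(0, 3, 40)
H = means[y] + rng.normal(0, 0.5, (40, 8))
W, b = np.zeros((3, 8)), np.zeros(3)
W, b = adapt_output_layer(H, y, W, b)
p = np.exp(H @ W.T + b); p /= p.sum(axis=1, keepdims=True)
acc = (p.argmax(axis=1) == y).mean()
print(acc)  # training accuracy after adapting only the output layer
```

Because only `W` and `b` of the output layer change, the hidden layer parameters trained in step 102 are shared across all target dialect areas, which is the source of the complexity reduction the patent claims.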
Training the DNN acoustic model and adaptively adjusting it through steps 101 to 104 above completes the building of the DNN acoustic model.
As can be seen from the above, with the scheme of this embodiment of the present invention, the parameters of the trained acoustic model's hidden layers are reused during adaptation with the Mandarin data of the target dialectal accent, and no separate model needs to be built for the dialect-accented data of each dialect area. This simplifies model training and thereby reduces the complexity of speech model building.
On the basis of Embodiment 1, after step 103 obtains the Mandarin data with the target dialectal accent, the silent frames in these data can also be removed to improve recognition accuracy. Specifically, the Mandarin data with the target dialectal accent are windowed and framed to obtain speech frames. The short-time energy of each speech frame is then computed, and silent frames are removed according to it: the short-time energy of each speech frame is compared with a predetermined threshold, and a frame whose short-time energy is below the threshold can be treated as a silent frame and removed from the speech frames. The threshold can be set arbitrarily.
After the model is trained and adaptively adjusted as above, speech can be recognized with the adapted model. At this point, accented Mandarin data to be recognized are obtained, and the accented Mandarin data to be recognized are recognized according to the second probability.
Specifically, the accented Mandarin data to be recognized are fed into the acoustic model obtained through steps 101 to 104 above, yielding a third probability at the output. The third probability is matched against the second probability, and the words and so on in the accented Mandarin data to be recognized are identified according to the degree of matching.
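The patent leaves the matching step abstract. One plausible reading (an assumption, not the patent's stated method) is to compare the observed output distribution, the third probability, against stored reference distributions, the second probability, and select the best-scoring modeling unit:

```python
import numpy as np

def best_match(third_prob, second_probs):
    """Match an observed output distribution against reference
    distributions and return the index of the closest one
    (cosine similarity here; the patent only says 'degree of matching')."""
    scores = [np.dot(third_prob, q) / (np.linalg.norm(third_prob) * np.linalg.norm(q))
              for q in second_probs]
    return int(np.argmax(scores))

refs = [np.array([0.8, 0.1, 0.1]),   # reference distributions ("second probability")
        np.array([0.1, 0.8, 0.1]),
        np.array([0.1, 0.1, 0.8])]
obs = np.array([0.15, 0.7, 0.15])    # observed distribution ("third probability")
print(best_match(obs, refs))  # → 1: index of the closest reference distribution
```

A real decoder would combine such per-frame scores with a language model over word sequences; this sketch shows only the distribution-matching idea.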
With the above scheme, the modeling techniques of deep neural networks give the obtained acoustic model a greatly improved classification capability in its multiple hidden layers, which improves recognition accuracy. In the model adaptation stage, the hidden layer parameters of the already-obtained acoustic model are reused, so no separate model needs to be built for the dialect-accented data of each dialect area, which simplifies model training. In addition, with the scheme of the embodiments of the present invention, there is no need to build a database of dialect-accented data for each individual dialect area: the probability values of the output layer are relearned and updated with a small amount of data, and the acoustic model can adapt itself to dialect-accented data from different target dialect areas.
Embodiment 2
As shown in Fig. 2, the pronunciation modeling device of Embodiment 2 of the present invention includes:
an extraction module 201, configured to take standard Mandarin data and at least one set of Mandarin data with a dialectal accent as input data, and to extract speech feature vectors of the input data; a training module 202, configured to train a deep neural network (DNN) acoustic model with the speech feature vectors, where the output layer of the acoustic model outputs a first probability; an acquisition module 203, configured to obtain Mandarin data with a target dialectal accent; and a modeling module 204, configured to retrain the output layer with the Mandarin data with the target dialectal accent and to update the first probability with a second probability output by the output layer.
The extraction module 201 includes: a first acquisition submodule, configured to window and frame the input data to obtain speech frames; and a second acquisition submodule, configured to remove silent frames from the speech frames to obtain the speech feature vectors.
The training module 202 includes: a first input layer submodule, configured to feed the speech feature vectors as an input signal into the input layer of the DNN acoustic model; a first hidden layer submodule, configured to process, in each of the multiple hidden layers of the DNN acoustic model, that hidden layer's input signal with the layer's first weights to obtain the layer's output signal; and a first output layer submodule, configured to process, in the output layer of the DNN acoustic model, the output signal of the topmost hidden layer to obtain the first probability.
The modeling module 204 includes: a second input layer submodule, configured to feed the Mandarin data with the target dialectal accent as an input signal into the input layer of the DNN acoustic model; a second hidden layer submodule, configured to process, in each of the multiple hidden layers of the DNN acoustic model, that hidden layer's input signal with the layer's second weights to obtain the layer's output signal; a second output layer submodule, configured to process, in the output layer of the DNN acoustic model, the output signal of the topmost hidden layer to obtain the second probability; and an update submodule, configured to update the first probability with the second probability.
As shown in Fig. 3, the device further includes a processing module 205, configured to remove silent frames from the Mandarin data with the target dialectal accent. In this case, the second input layer submodule is specifically configured to feed the Mandarin data with the target dialectal accent, with the silent frames removed, as an input signal into the input layer of the DNN acoustic model.
Again as shown in Fig. 3, the device further includes a receiving module 206, configured to obtain accented Mandarin data to be recognized, and a recognition module 207, configured to recognize the accented Mandarin data to be recognized according to the second probability.
For the operating principle of the device of the present invention, reference may be made to the description of the foregoing method embodiment.
As can be seen from the above, with the scheme of this embodiment of the present invention, the parameters of the trained acoustic model's hidden layers are reused during adaptation with the Mandarin data of the target dialectal accent, and no separate model needs to be built for the dialect-accented data of each dialect area. This simplifies model training and thereby reduces the complexity of speech model building.
Embodiment 3
As shown in figure 4, the automatic speech recognition system for the embodiment of the present invention three.The system includes:Extract device assembly 401st, training device assembly 402, decoder component 403 etc..
Wherein, device assembly is extracted, for extracting the speech feature vector of input signal.The process of training DNN acoustic models In, selection criteria mandarin data and the mandarin data merging data of major localism area band area's dialectal accent are believed as input Number;And acoustic model it is adaptive when, mandarin data of the selection target localism area with dialectal accent are as input signal.
Training device assembly (DNN), for training DNN acoustic models and adaptively being adjusted to acquired acoustic model It is whole.Including:
Input layer, for receiving the speech feature vector of extraction device assembly.
Multiple hidden layers (at least three).Wherein, each hidden layer includes corresponding multiple nodes (neuron), each to hide Each node in layer is configured to, and the output of at least one node of the adjacent lower in the DNN is performed linear Or nonlinear transformation.Wherein, the input of the node of upper strata hidden layer can be based on a node in adjacent lower or several sections The output of point.Each hidden layer has weights corresponding thereto, wherein the weights are the acoustic signals based on training data It obtains.It, can be by using being subjected to supervision or unsupervised learning process carries out the pre- of model when being trained to model Training, obtains the initial weight of each hidden layer.It, can be by using back-propagation to fine-tuning for the weights of each hidden layer Algorithm carries out.
Output layer, for receiving the output of the most upper hidden layer in the DNN.The node of output layer is utilized by general The modeling unit of call pronunciation phonemes composition handles the signal received, and output is the probability in the modeling unit Distribution, is referred to as the first probability herein.
Output unit in output layer is the modeling unit for the phonetic element for representing to use in standard Chinese.Modeling unit Morpheme (binding triphones state) can be used, and modeling unit can be Hidden Markov Model (HMM) or other are suitable Modeling unit.
Decoder component, for utilizing the common of the probability identification target dialect zone dialectal accent of training device assembly output Talk about the word of data.
In embodiments of the present invention, trained data selection standard mandarin data and eight big dialect zone dialect mouths of addition The data of sound, it is common to extract acoustic feature vector, the DNN models of the more hidden layers of training.In addition, to promote DNN models to major The adaptive ability of mandarin data of the localism area with dialectal accent carries the mandarin of dialectal accent under to target localism area It in the identifying system of data, to acquired DNN models, is multiplexed its and hides layer parameter, and using being carried under the target localism area The mandarin data of dialectal accent relearn and output probability value output layer.Finally, the acoustics obtained by such mode Model, compared to single localism area with dialectal accent mandarin data or standard mandarin data train model, Discrimination in identifying system can be promoted.
Through the above scheme, the modeling techniques of deep neural networks greatly improve the classification capacity of the multiple hidden layers of the obtained acoustic model, thereby improving recognition accuracy. In the model adaptation stage, the hidden-layer parameters of the obtained acoustic model are reused, so there is no need to build a separate model for the accented data of each dialect region, which simplifies model training. In addition, with the scheme of the embodiments of the present invention, there is no need to build a database of accented data for each individual dialect region; the probability values of the output layer are updated by learning from a small amount of data, and the acoustic model can adapt to the accented data of different target dialect regions.
In the several embodiments provided in this application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely exemplary. For instance, the division of the units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. Further, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
The above integrated unit, when implemented in the form of a software functional unit, may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to perform part of the steps of the receiving/transmitting methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are preferred embodiments of the present invention. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (12)

  1. A voice modeling method, characterized by comprising:
    using standard Mandarin data and at least one set of Mandarin data with a dialectal accent as input data, and extracting speech feature vectors of the input data;
    training a deep neural network (DNN) acoustic model using the speech feature vectors, wherein an output layer of the acoustic model outputs a first probability;
    obtaining Mandarin data with a target dialectal accent;
    learning the output layer using the Mandarin data with the target dialectal accent, and updating the first probability with a second probability output by the output layer.
  2. The method according to claim 1, characterized in that the step of extracting speech feature vectors of the input data comprises:
    performing windowing and framing operations on the input data to obtain speech frames;
    removing silent frames from the speech frames to obtain the speech feature vectors.
  3. The method according to claim 1, characterized in that the step of training a deep neural network (DNN) acoustic model using the speech feature vectors, wherein an output layer of the acoustic model outputs a first probability, comprises:
    inputting the speech feature vectors as an input signal to an input layer of the DNN acoustic model;
    in multiple hidden layers of the DNN acoustic model, processing the input signal of each hidden layer using first weights corresponding to that hidden layer, to obtain an output signal of each hidden layer;
    in the output layer of the DNN acoustic model, processing the output signal of the uppermost hidden layer to obtain the first probability.
  4. The method according to claim 1, characterized in that the step of learning the output layer using the Mandarin data with the target dialectal accent and updating the first probability with the second probability output by the output layer comprises:
    inputting the Mandarin data with the target dialectal accent as an input signal to the input layer of the DNN acoustic model;
    in multiple hidden layers of the DNN acoustic model, processing the input signal of each hidden layer using second weights corresponding to that hidden layer, to obtain an output signal of each hidden layer;
    in the output layer of the DNN acoustic model, processing the output signal of the uppermost hidden layer to obtain the second probability;
    updating the first probability with the second probability.
  5. The method according to claim 4, characterized in that before the step of learning the output layer using the Mandarin data with the target dialectal accent and updating the first probability with the second probability output by the output layer, the method further comprises:
    removing silent frames from the Mandarin data with the target dialectal accent;
    and the step of inputting the Mandarin data with the target dialectal accent as an input signal to the input layer of the DNN acoustic model comprises:
    inputting the Mandarin data with the target dialectal accent from which the silent frames have been removed, as an input signal, to the input layer of the DNN acoustic model.
  6. The method according to any one of claims 1-5, characterized in that the method further comprises:
    obtaining accented Mandarin data to be recognized;
    recognizing the accented Mandarin data to be recognized according to the second probability.
  7. A voice modeling device, characterized by comprising:
    an extraction module, for using standard Mandarin data and at least one set of Mandarin data with a dialectal accent as input data, and extracting speech feature vectors of the input data;
    a training module, for training a deep neural network (DNN) acoustic model using the speech feature vectors, wherein an output layer of the acoustic model outputs a first probability;
    an acquisition module, for obtaining Mandarin data with a target dialectal accent;
    a modeling module, for learning the output layer using the Mandarin data with the target dialectal accent, and updating the first probability with a second probability output by the output layer.
  8. The device according to claim 7, characterized in that the extraction module comprises:
    a first acquisition submodule, for performing windowing and framing operations on the input data to obtain speech frames;
    a second acquisition submodule, for removing silent frames from the speech frames to obtain the speech feature vectors.
  9. The device according to claim 7, characterized in that the training module comprises:
    a first input layer submodule, for inputting the speech feature vectors as an input signal to an input layer of the DNN acoustic model;
    a first hidden layer submodule, for processing, in multiple hidden layers of the DNN acoustic model, the input signal of each hidden layer using first weights corresponding to that hidden layer, to obtain an output signal of each hidden layer;
    a first output layer submodule, for processing, in the output layer of the DNN acoustic model, the output signal of the uppermost hidden layer to obtain the first probability.
  10. The device according to claim 7, characterized in that the modeling module comprises:
    a second input layer submodule, for inputting the Mandarin data with the target dialectal accent as an input signal to the input layer of the DNN acoustic model;
    a second hidden layer submodule, for processing, in multiple hidden layers of the DNN acoustic model, the input signal of each hidden layer using second weights corresponding to that hidden layer, to obtain an output signal of each hidden layer;
    a second output layer submodule, for processing, in the output layer of the DNN acoustic model, the output signal of the uppermost hidden layer to obtain the second probability;
    an updating submodule, for updating the first probability with the second probability.
  11. The device according to claim 10, characterized in that the device further comprises:
    a processing module, for removing silent frames from the Mandarin data with the target dialectal accent;
    and the second input layer submodule is specifically configured to input the Mandarin data with the target dialectal accent from which the silent frames have been removed, as an input signal, to the input layer of the DNN acoustic model.
  12. The device according to any one of claims 7-11, characterized in that the device further comprises:
    a receiving module, for obtaining accented Mandarin data to be recognized;
    a recognition module, for recognizing the accented Mandarin data to be recognized according to the second probability.
CN201611103738.6A 2016-12-05 2016-12-05 Voice modeling method and device Active CN108172218B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611103738.6A CN108172218B (en) 2016-12-05 2016-12-05 Voice modeling method and device

Publications (2)

Publication Number Publication Date
CN108172218A true CN108172218A (en) 2018-06-15
CN108172218B CN108172218B (en) 2021-01-12

Family

ID=62525918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611103738.6A Active CN108172218B (en) 2016-12-05 2016-12-05 Voice modeling method and device

Country Status (1)

Country Link
CN (1) CN108172218B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103310788A (en) * 2013-05-23 2013-09-18 北京云知声信息技术有限公司 Voice information identification method and system
CN104282300A (en) * 2013-07-05 2015-01-14 中国移动通信集团公司 Non-periodic component syllable model building and speech synthesizing method and device
CN104575490A (en) * 2014-12-30 2015-04-29 苏州驰声信息科技有限公司 Spoken language pronunciation detecting and evaluating method based on deep neural network posterior probability algorithm
EP2889804A1 (en) * 2013-12-30 2015-07-01 Alcatel Lucent Systems and methods for contactless speech recognition
CN105118498A (en) * 2015-09-06 2015-12-02 百度在线网络技术(北京)有限公司 Training method and apparatus of speech synthesis model
CN105206258A (en) * 2015-10-19 2015-12-30 百度在线网络技术(北京)有限公司 Generation method and device of acoustic model as well as voice synthetic method and device
CN105391873A (en) * 2015-11-25 2016-03-09 上海新储集成电路有限公司 Method for realizing local voice recognition in mobile device
CN105578115A (en) * 2015-12-22 2016-05-11 深圳市鹰硕音频科技有限公司 Network teaching method and system with voice assessment function
CN105632501A (en) * 2015-12-30 2016-06-01 中国科学院自动化研究所 Deep-learning-technology-based automatic accent classification method and apparatus
US20160239476A1 (en) * 2015-02-13 2016-08-18 Facebook, Inc. Machine learning dialect identification

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109887497B (en) * 2019-04-12 2021-01-29 北京百度网讯科技有限公司 Modeling method, device and equipment for speech recognition
CN109887497A (en) * 2019-04-12 2019-06-14 北京百度网讯科技有限公司 Modeling method, device and the equipment of speech recognition
CN110033760A (en) * 2019-04-15 2019-07-19 北京百度网讯科技有限公司 Modeling method, device and the equipment of speech recognition
US11688391B2 (en) 2019-04-15 2023-06-27 Beijing Baidu Netcom Science And Technology Co. Mandarin and dialect mixed modeling and speech recognition
CN110033760B (en) * 2019-04-15 2021-01-29 北京百度网讯科技有限公司 Modeling method, device and equipment for speech recognition
CN110211565A (en) * 2019-05-06 2019-09-06 平安科技(深圳)有限公司 Accent recognition method, apparatus and computer readable storage medium
CN110738991A (en) * 2019-10-11 2020-01-31 东南大学 Speech recognition equipment based on flexible wearable sensor
CN110930995B (en) * 2019-11-26 2022-02-11 中国南方电网有限责任公司 Voice recognition model applied to power industry
CN110930995A (en) * 2019-11-26 2020-03-27 中国南方电网有限责任公司 Voice recognition model applied to power industry
CN111179938A (en) * 2019-12-26 2020-05-19 安徽仁昊智能科技有限公司 Speech recognition garbage classification system based on artificial intelligence
CN111243574B (en) * 2020-01-13 2023-01-03 苏州奇梦者网络科技有限公司 Voice model adaptive training method, system, device and storage medium
CN111243574A (en) * 2020-01-13 2020-06-05 苏州奇梦者网络科技有限公司 Voice model adaptive training method, system, device and storage medium
WO2021135438A1 (en) * 2020-07-31 2021-07-08 平安科技(深圳)有限公司 Multilingual speech recognition model training method, apparatus, device, and storage medium
WO2021213161A1 (en) * 2020-11-25 2021-10-28 平安科技(深圳)有限公司 Dialect speech recognition method, apparatus, medium, and electronic device
CN112528679A (en) * 2020-12-17 2021-03-19 科大讯飞股份有限公司 Intention understanding model training method and device and intention understanding method and device
CN112528679B (en) * 2020-12-17 2024-02-13 科大讯飞股份有限公司 Method and device for training intention understanding model, and method and device for intention understanding
CN112770154A (en) * 2021-01-19 2021-05-07 深圳西米通信有限公司 Intelligent set top box with voice interaction function and interaction method thereof
CN112967720A (en) * 2021-01-29 2021-06-15 南京迪港科技有限责任公司 End-to-end voice-to-text model optimization method under small amount of accent data
CN113223542A (en) * 2021-04-26 2021-08-06 北京搜狗科技发展有限公司 Audio conversion method and device, storage medium and electronic equipment
CN113345451A (en) * 2021-04-26 2021-09-03 北京搜狗科技发展有限公司 Sound changing method and device and electronic equipment
CN113345451B (en) * 2021-04-26 2023-08-22 北京搜狗科技发展有限公司 Sound changing method and device and electronic equipment
CN113223542B (en) * 2021-04-26 2024-04-12 北京搜狗科技发展有限公司 Audio conversion method and device, storage medium and electronic equipment
CN113192492A (en) * 2021-04-28 2021-07-30 平安科技(深圳)有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN113192492B (en) * 2021-04-28 2024-05-28 平安科技(深圳)有限公司 Speech recognition method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN108172218B (en) 2021-01-12

Similar Documents

Publication Publication Date Title
CN108172218A (en) A kind of pronunciation modeling method and device
CN109326302B (en) Voice enhancement method based on voiceprint comparison and generation of confrontation network
CN107545903B (en) Voice conversion method based on deep learning
Abdel-Hamid et al. Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code
CN103928023B (en) A kind of speech assessment method and system
CN105632501A (en) Deep-learning-technology-based automatic accent classification method and apparatus
CN107039036B (en) High-quality speaker recognition method based on automatic coding depth confidence network
CN111210807B (en) Speech recognition model training method, system, mobile terminal and storage medium
US20080208577A1 (en) Multi-stage speech recognition apparatus and method
Xie et al. Sequence error (SE) minimization training of neural network for voice conversion.
WO2007114605A1 (en) Acoustic model adaptation methods based on pronunciation variability analysis for enhancing the recognition of voice of non-native speaker and apparatuses thereof
CN107731233A (en) A kind of method for recognizing sound-groove based on RNN
CN109147774B (en) Improved time-delay neural network acoustic model
CN102938252B (en) System and method for recognizing Chinese tone based on rhythm and phonetics features
CN106898355B (en) Speaker identification method based on secondary modeling
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
CN109637526A (en) The adaptive approach of DNN acoustic model based on personal identification feature
CN105575383A (en) Apparatus and method for controlling target information voice output through using voice characteristics of user
CN110931045A (en) Audio feature generation method based on convolutional neural network
CN106297769B (en) A kind of distinctive feature extracting method applied to languages identification
Salam et al. Malay isolated speech recognition using neural network: a work in finding number of hidden nodes and learning parameters.
CN109377986A (en) A kind of non-parallel corpus voice personalization conversion method
CN108831486B (en) Speaker recognition method based on DNN and GMM models
Yang et al. Essence knowledge distillation for speech recognition
CN108182938A (en) A kind of training method of the Mongol acoustic model based on DNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant