CN108172218B - Voice modeling method and device - Google Patents


Info

Publication number
CN108172218B
CN108172218B (application CN201611103738.6A)
Authority
CN
China
Prior art keywords
layer
data
probability
acoustic model
input
Prior art date
Legal status (assumed, not a legal conclusion; Google has not performed a legal analysis): Active
Application number
CN201611103738.6A
Other languages
Chinese (zh)
Other versions
CN108172218A (en)
Inventor
徐衍瀚
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Communications Ltd Research Institute
Priority to CN201611103738.6A
Publication of CN108172218A
Application granted
Publication of CN108172218B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering

Abstract

The invention provides a voice modeling method and device, relates to the technical field of speech recognition, and aims to reduce the complexity of speech-model modeling. The voice modeling method comprises the following steps: taking standard Mandarin data and Mandarin data with at least one dialect accent as input data, and extracting speech feature vectors from the input data; training a Deep Neural Network (DNN) acoustic model with the speech feature vectors, wherein an output layer of the acoustic model outputs a first probability; obtaining Mandarin data with a target dialect accent; and learning the output layer using the Mandarin data with the target dialect accent, and updating the first probability with the second probability output by the output layer. The invention can reduce the complexity of speech-model modeling.

Description

Voice modeling method and device
Technical Field
The invention relates to the technical field of voice recognition, in particular to a voice modeling method and a voice modeling device.
Background
Speech recognition is the process by which a machine converts speech signals into computer-recognizable input in order to understand what a person says. Current speech recognition technology mainly comprises statistical pattern recognition and artificial neural network techniques.
The Hidden Markov Model (HMM) is currently a well-developed, mature model in speech fields such as speech recognition; it models time-sequential speech statistically through a hidden Markov process and has achieved good results.
Speech recognition systems based on Deep Neural Networks (DNNs) have received increasing attention from researchers in recent years. The concept of deep learning stems from the study of artificial neural networks and was proposed by Hinton et al. in 2006. The essence of deep learning is to learn more useful features by constructing machine learning models with many hidden layers and massive amounts of training data, thereby improving the accuracy of classification or prediction. Its main findings are: (1) a multi-hidden-layer artificial neural network has excellent feature-learning capability, and the learned features describe the data more essentially, which facilitates classification; (2) the difficulty of training a deep neural network can be effectively overcome by layer-by-layer initialization, which is realized through unsupervised learning.
To improve the recognition accuracy of Mandarin with a dialect-accent background, the prior art provides various methods. Some improve the training method in the acoustic modeling process; others improve the language model in the recognition system. However, the conventional methods for recognizing Mandarin with a dialect-accent background involve high model-training complexity.
Disclosure of Invention
In view of this, the present invention provides a speech modeling method and apparatus for reducing complexity of speech model modeling.
In order to solve the above technical problem, the present invention provides a speech modeling method, including:
taking standard Mandarin data and Mandarin data with at least one dialect accent as input data, and extracting speech feature vectors from the input data;
training a Deep Neural Network (DNN) acoustic model by utilizing the voice feature vectors, wherein an output layer of the acoustic model outputs a first probability;
obtaining mandarin data with a target dialect accent;
and learning the output layer using the Mandarin data with the target dialect accent, and updating the first probability with the second probability output by the output layer.
Wherein the step of extracting the speech feature vector of the input data comprises:
performing windowing and framing on the input data to obtain speech frames;
and removing the silence frames from the speech frames to obtain the speech feature vectors.
Wherein the step of training a deep neural network DNN acoustic model by using the speech feature vectors, wherein an output layer of the acoustic model outputs a first probability comprises:
inputting the speech feature vectors as input signals to an input layer of the DNN acoustic model;
processing the input signal of each hidden layer in a plurality of hidden layers of the DNN acoustic model by using a first weight corresponding to each hidden layer to obtain an output signal of each hidden layer;
and processing the output signal of the uppermost hidden layer at the output layer of the DNN acoustic model to obtain a first probability.
Wherein the step of learning the output layer using the mandarin data with the target dialect accent and updating the first probability using the second probability output by the output layer comprises:
inputting the Mandarin data with a target dialect accent as an input signal to an input layer of the DNN acoustic model;
processing the input signal of each hidden layer in a plurality of hidden layers of the DNN acoustic model by using a second weight corresponding to each hidden layer to obtain an output signal of each hidden layer;
processing an output signal of the uppermost hidden layer on an output layer of the DNN acoustic model to obtain a second probability;
updating the first probability with the second probability.
Wherein, before the step of learning the output layer using the Mandarin data with the target dialect accent and updating the first probability using the second probability output by the output layer, the method further comprises:
removing silence frames in the mandarin data with the target dialect accent;
the step of inputting the Mandarin data with the target dialect accent as an input signal to an input layer of the DNN acoustic model comprises:
the Mandarin data with the target dialect accent, after the silence frames are removed, is input as an input signal to an input layer of the DNN acoustic model.
Wherein the method further comprises:
acquiring mandarin data with accents to be identified;
and recognizing the Mandarin data with the accent to be recognized according to the second probability.
In a second aspect, the present invention provides a speech modeling apparatus, comprising:
the extraction module is used for taking standard Mandarin data and Mandarin data with at least one dialect accent as input data and extracting the speech feature vectors of the input data;
a training module for training a Deep Neural Network (DNN) acoustic model by using the voice feature vector, wherein an output layer of the acoustic model outputs a first probability;
the obtaining module is used for obtaining Mandarin data with the target dialect accent;
and the modeling module is used for learning the output layer by utilizing the Mandarin data with the target dialect accent and updating the first probability by utilizing the second probability output by the output layer.
Wherein the extraction module comprises:
the first obtaining submodule is used for carrying out windowing and framing operation on the input data to obtain a voice frame;
and the second obtaining submodule is used for removing the mute frame in the voice frame to obtain the voice characteristic vector.
Wherein the training module comprises:
a first input layer sub-module for inputting the speech feature vectors as input signals to an input layer of the DNN acoustic model;
a first hidden layer submodule, configured to process, in a plurality of hidden layers of the DNN acoustic model, an input signal of each hidden layer by using a first weight corresponding to the hidden layer, so as to obtain an output signal of each hidden layer;
and the first output layer submodule is used for processing the output signal of the uppermost hidden layer in the output layer of the DNN acoustic model to obtain a first probability.
Wherein the modeling module comprises:
a second input layer submodule for inputting the Mandarin data with the target dialect accent as an input signal to an input layer of the DNN acoustic model;
a second hidden layer submodule, configured to process, in a plurality of hidden layers of the DNN acoustic model, an input signal of each hidden layer by using a second weight corresponding to the hidden layer, so as to obtain an output signal of each hidden layer;
the second output layer submodule is used for processing an output signal of the uppermost hidden layer on an output layer of the DNN acoustic model to obtain a second probability;
an update submodule for updating the first probability with the second probability.
Wherein the apparatus further comprises:
a processing module for removing a mute frame from the mandarin data with the target dialect accent;
the second input layer sub-module is specifically configured to input mandarin data with a target dialect accent after the silence frame is removed as an input signal to an input layer of the DNN acoustic model.
Wherein the apparatus further comprises:
the receiving module is used for acquiring mandarin data with accents to be identified;
and the recognition module is used for recognizing the Mandarin data with the accent to be recognized according to the second probability.
The technical scheme of the invention has the following beneficial effects:
in an embodiment of the invention, an acoustic model is trained with deep neural network techniques on standard Mandarin data and Mandarin data with at least one dialect accent, obtaining a first probability. The output layer of the acoustic model is then learned with Mandarin data with a target dialect accent, and the first probability is updated with the second probability output by the output layer. Therefore, with the scheme of the embodiment of the invention, when Mandarin data with the target dialect accent is used for adaptive adjustment, the hidden-layer parameters of the trained acoustic model are reused, and no separate model needs to be built for the dialect-accent data of each dialect zone, which simplifies model training and reduces the complexity of speech-model modeling.
Drawings
FIG. 1 is a flowchart of a speech modeling method according to a first embodiment of the present invention;
FIG. 2 is a diagram illustrating a voice modeling apparatus according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of a speech modeling apparatus according to a second embodiment of the present invention;
fig. 4 is a schematic diagram of an automatic speech recognition system according to a third embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention will be made with reference to the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
Example one
As shown in fig. 1, a speech modeling method according to a first embodiment of the present invention includes:
step 101, taking standard Mandarin data and at least one Mandarin data with dialect accent as input data, and extracting a voice feature vector of the input data.
The Chinese language mainly comprises standard Mandarin, as published by the authorities, and Mandarin with dialect accents in the various regions. Chinese dialects can be roughly divided into eight dialect zones by region. Mandarin Chinese is a single language, but its pronunciation is affected by the spoken dialect accents of each region, so some words are pronounced differently from standard Mandarin. Therefore, an acoustic model trained on standard Mandarin data alone cannot effectively and correctly describe acoustic characteristics that carry such pronunciation variations; it is also difficult, in engineering terms, to collect Mandarin data with the dialect accent of a particular dialect and build a database of sufficient size.
Therefore, in the embodiment of the invention, standard Mandarin data and Mandarin data with the dialect accents of at least one dialect zone are selected as the input data, from which the acoustic feature vectors are jointly extracted to train a multi-hidden-layer DNN model. Preferably, standard Mandarin data and Mandarin data with the dialect accents of all eight dialect zones are selected as the input data.
For the input data, to make the subsequently built acoustic model more accurate, windowing and framing are performed on the input data to obtain speech frames. The short-time energy of each speech frame is then calculated, and silence frames are removed accordingly: the short-time energy of each speech frame is compared with a preset threshold, and any speech frame whose short-time energy is below the threshold is treated as a silence frame and removed. The remaining speech frames yield the speech feature vectors. The threshold can be set arbitrarily.
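The framing, windowing, and energy-based silence removal described above can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation; the 25 ms/10 ms frame sizes and the energy threshold are illustrative assumptions, since the text leaves the threshold arbitrary:

```python
import numpy as np

def remove_silence_frames(signal, frame_len=400, hop=160, threshold=1e-4):
    """Window-and-frame a waveform, then drop frames whose short-time
    energy falls below a preset threshold (treated as silence).
    frame_len/hop correspond to 25 ms / 10 ms at 16 kHz; the threshold
    value is arbitrary, as the text notes."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    energy = np.mean(frames ** 2, axis=1)   # short-time energy per frame
    return frames[energy >= threshold]

# Toy input: 0.1 s of silence followed by 0.1 s of a 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr // 10) / sr
signal = np.concatenate([np.zeros(sr // 10), 0.5 * np.sin(2 * np.pi * 440 * t)])
frames = remove_silence_frames(signal)      # silence-only frames are dropped
```

Frames lying entirely in the silent half have zero energy and are discarded; only the windowed tone frames survive.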
The speech feature vectors may also be context-dependent, i.e. configured to receive the feature vectors of several adjacent frames. The speech feature vectors may be, for example, Mel-Frequency Cepstral Coefficients (MFCCs), Perceptual Linear Prediction (PLP) features, and the like.
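A context-dependent input of the kind just mentioned can be formed by stacking each frame with its neighbours. The sketch below assumes a hypothetical window of 5 frames on each side over 13-dimensional features; both numbers are illustrative choices, not taken from the patent:

```python
import numpy as np

def stack_context(features, left=5, right=5):
    """Build context-dependent input vectors by concatenating each frame
    with its neighbours; edge frames are padded by repeating the first
    and last frames so every output vector has the same dimension."""
    padded = np.concatenate([np.repeat(features[:1], left, axis=0),
                             features,
                             np.repeat(features[-1:], right, axis=0)])
    n, dim = features.shape
    return np.stack([padded[i : i + left + 1 + right].ravel()
                     for i in range(n)])

feats = np.random.randn(100, 13)   # stand-in for 13-dimensional MFCC frames
inputs = stack_context(feats)      # each row is an 11-frame, 143-dim vector
```

Each stacked row then serves as one input vector to the DNN's input layer.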
Step 102, training a DNN acoustic model by utilizing the voice feature vector, wherein an output layer of the acoustic model outputs a first probability.
In practical application, the DNN acoustic model includes:
and the input layer is used for receiving the voice feature vectors.
A plurality of hidden layers (at least three). Each hidden layer comprises a plurality of nodes (neurons), and each node is configured to perform a linear or non-linear transformation on the output from at least one node of the adjacent lower layer in the DNN; the input of a node in an upper hidden layer may be based on the outputs of one or several nodes in the adjacent lower layer. Each hidden layer has a corresponding weight, obtained from the acoustic signals of the training data. When training the model, it can be pre-trained with a supervised or unsupervised learning process to obtain the initial weight of each hidden layer, and the weight of each hidden layer can then be fine-tuned with the Back-Propagation (BP) algorithm.
And the output layer, which receives the output signal from the uppermost hidden layer. The nodes of the output layer process the received signal with modeling units composed of Mandarin Chinese pronunciation phonemes; their output is a probability distribution over the modeling units, referred to herein as a probability.
The output units of the output layer are modeling units that represent the phonetic elements used in Mandarin Chinese. The modeling units may use tied triphone states (senones) and may be based on a Hidden Markov Model (HMM) or another suitable modeling unit.
Specifically, in this step, the speech feature vector is input as an input signal to an input layer of the DNN acoustic model; processing the input signal of each hidden layer in a plurality of hidden layers of the DNN acoustic model by using a first weight corresponding to each hidden layer to obtain an output signal of each hidden layer; and processing the output signal of the uppermost hidden layer at the output layer of the DNN acoustic model to obtain a first probability.
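The forward pass just described (hidden layers applying their weights, then the output layer producing the first probability) can be sketched as follows. The layer sizes (143-dimensional input, three 256-unit hidden layers, 3000 output modeling units), the ReLU non-linearity, and the omission of biases are all illustrative assumptions, not taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dnn_forward(x, hidden_weights, output_weight):
    """Each hidden layer transforms its input with its own weight matrix;
    the output layer turns the topmost hidden activation into a probability
    distribution over modeling units (the 'first probability')."""
    h = x
    for W in hidden_weights:
        h = relu(h @ W)
    return softmax(h @ output_weight)

dim_in, dim_h, n_units = 143, 256, 3000   # hypothetical sizes
hidden = [rng.normal(scale=0.05, size=(dim_in, dim_h)),
          rng.normal(scale=0.05, size=(dim_h, dim_h)),
          rng.normal(scale=0.05, size=(dim_h, dim_h))]
out_W = rng.normal(scale=0.05, size=(dim_h, n_units))

first_probability = dnn_forward(rng.normal(size=(1, dim_in)), hidden, out_W)
```

The result is a valid probability distribution: non-negative entries summing to one for each input frame.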
Step 103, obtaining Mandarin data with the target dialect accent.
The Mandarin data with the target dialect accent may be Mandarin data with any dialect accent.
Step 104, learning the output layer using the Mandarin data with the target dialect accent, and updating the first probability with the second probability output by the output layer.
In the embodiment of the present invention, steps 103 and 104 may be regarded as adaptively adjusting the DNN acoustic model of step 102 using Mandarin data with a target dialect accent. In the model-adaptation stage, the output layer is learned with the Mandarin data with the target dialect accent, and the probability value of the newly learned output layer directly replaces the probability value output by the output layer of the acoustic model that was trained in step 102 on the standard Mandarin data and the dialect-accent data.
Specifically, the mandarin data with the target dialect accent is input into an input layer of the DNN acoustic model as an input signal; processing the input signal of each hidden layer in a plurality of hidden layers of the DNN acoustic model by using a second weight corresponding to each hidden layer to obtain an output signal of each hidden layer; processing an output signal of the uppermost hidden layer on an output layer of the DNN acoustic model to obtain a second probability; updating the first probability with the second probability.
It should be noted that if the training data in step 104 is relatively scarce, the same weight values as those of each corresponding hidden layer in step 102 may be kept when performing model adaptation. The scheme therefore needs no large amount of data, and the resulting acoustic model can still adapt to the Mandarin data with the target dialect accent, improving recognition accuracy on it. If the training data in step 104 is relatively plentiful, the weights of the hidden layers can also be readjusted in this step according to the Mandarin data with the target dialect accent, and the output probability of the output layer updated above the top hidden layer, which likewise improves the recognition rate.
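The scarce-data case above, where the hidden-layer weights are reused as-is and only the output layer is relearned, can be sketched as follows. This is a hypothetical minimal setup (one hidden layer, no biases, full-batch gradient descent on cross-entropy, toy sizes); none of these specifics come from the patent:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def adapt_output_layer(hidden_weights, out_W, x, labels, lr=0.05, steps=200):
    """Relearn only the output layer on accented data: the hidden-layer
    weights are reused frozen (forward pass only), and the output weights
    are updated by full-batch gradient descent on the cross-entropy loss."""
    h = x
    for W in hidden_weights:                    # frozen hidden layers
        h = relu(h @ W)
    W = out_W.copy()
    onehot = np.eye(W.shape[1])[labels]
    for _ in range(steps):
        p = softmax(h @ W)                      # current output probabilities
        W -= lr * h.T @ (p - onehot) / len(x)   # softmax cross-entropy gradient
    return W

# Toy stand-ins for accented-Mandarin features and their target-unit labels.
dim_in, dim_h, n_units = 20, 32, 5
hidden = [rng.normal(scale=0.1, size=(dim_in, dim_h))]
out_W = rng.normal(scale=0.1, size=(dim_h, n_units))
x = rng.normal(size=(64, dim_in))
labels = rng.integers(0, n_units, size=64)

new_W = adapt_output_layer(hidden, out_W, x, labels)
```

Only `out_W` changes; probabilities computed with `new_W` (the "second probability") then replace those from the original output layer.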
The DNN acoustic model is thus established by training it and then adaptively adjusting it in steps 101 to 104.
It can be seen from the above that, by using the scheme of the embodiment of the present invention, when the mandarin data of the dialect accent of the target dialect is used for adaptive adjustment, the parameters of the hidden layer of the trained acoustic model are multiplexed, and a model does not need to be established separately for the data of the dialect accent of each dialect zone, thereby simplifying the complexity of model training and reducing the complexity of speech model modeling.
In the first embodiment, after the Mandarin data with the target dialect accent is obtained in step 103, the silence frames in it may be removed to improve recognition accuracy. Specifically, windowing and framing are performed on the Mandarin data with the target dialect accent to obtain speech frames; the short-time energy of each speech frame is then calculated and compared with a preset threshold, and any speech frame whose short-time energy is below the threshold is treated as a silence frame and removed. The threshold can be set arbitrarily.
After the model has been trained and adaptively adjusted, speech can be recognized with the adapted model: Mandarin data with an accent to be recognized is obtained and recognized according to the second probability.
Specifically, the Mandarin data with the accent to be recognized is input to the acoustic model obtained through steps 101 to 104, and the output layer outputs a third probability. The third probability is matched against the second probability, and the words and other content in the Mandarin data with the accent to be recognized are identified according to the degree of matching.
Through the above scheme, the modeling techniques of deep neural networks greatly improve the classification capability of the obtained acoustic model in its multiple hidden layers, thereby improving recognition accuracy. In the model-adaptation stage, the hidden-layer parameters of the obtained acoustic model are reused, and no separate model needs to be built for the dialect-accent data of each dialect zone, which simplifies model training. In addition, with the scheme of the embodiment of the invention, the probability values of the output layer are learned and updated with a small amount of data, without building a dialect-accent database for each individual dialect zone, and the acoustic model can adapt to the dialect-accent data of different target dialect zones.
Example two
As shown in fig. 2, the speech modeling apparatus according to the second embodiment of the present invention includes:
an extracting module 201, configured to take standard mandarin data and at least one mandarin data with dialect accent as input data, and extract a voice feature vector of the input data; a training module 202, configured to train a deep neural network DNN acoustic model using the speech feature vectors, wherein an output layer of the acoustic model outputs a first probability; an obtaining module 203, configured to obtain mandarin data with a dialect accent of the target dialect; a modeling module 204 for learning the output layer using the Mandarin data with the target dialect accent and updating the first probability using a second probability output by the output layer.
Wherein the extracting module 201 comprises: the first obtaining submodule is used for carrying out windowing and framing operation on the input data to obtain a voice frame; and the second obtaining submodule is used for removing the mute frame in the voice frame to obtain the voice characteristic vector.
Wherein the training module 202 comprises: a first input layer sub-module for inputting the speech feature vectors as input signals to an input layer of the DNN acoustic model; a first hidden layer submodule, configured to process, in a plurality of hidden layers of the DNN acoustic model, an input signal of each hidden layer by using a first weight corresponding to the hidden layer, so as to obtain an output signal of each hidden layer; and the first output layer submodule is used for processing the output signal of the uppermost hidden layer in the output layer of the DNN acoustic model to obtain a first probability.
Wherein the modeling module 204 comprises: a second input layer submodule for inputting the Mandarin data with the target dialect accent as an input signal to an input layer of the DNN acoustic model; a second hidden layer submodule, configured to process, in a plurality of hidden layers of the DNN acoustic model, an input signal of each hidden layer by using a second weight corresponding to the hidden layer, so as to obtain an output signal of each hidden layer; the second output layer submodule is used for processing an output signal of the uppermost hidden layer on an output layer of the DNN acoustic model to obtain a second probability; an update submodule for updating the first probability with the second probability.
As shown in fig. 3, the apparatus further comprises: a processing module 205 for removing the silence frames in the mandarin data with the accent of the target dialect. At this time, the second input layer sub-module is specifically configured to input mandarin data with a target dialect accent after the mute frame is removed as an input signal to the input layer of the DNN acoustic model.
As further shown in fig. 3, the apparatus further comprises: a receiving module 206, configured to obtain Mandarin data with an accent to be recognized; and a recognition module 207, configured to recognize the Mandarin data with the accent to be recognized according to the second probability.
The working principle of the device according to the invention can be referred to the description of the method embodiment described above.
It can be seen from the above that, by using the scheme of the embodiment of the present invention, when the mandarin data of the dialect accent of the target dialect is used for adaptive adjustment, the parameters of the hidden layer of the trained acoustic model are multiplexed, and a model does not need to be established separately for the data of the dialect accent of each dialect zone, thereby simplifying the complexity of model training and reducing the complexity of speech model modeling.
EXAMPLE III
Fig. 4 shows an automatic speech recognition system according to a third embodiment of the present invention. The system comprises: an extractor component 401, a trainer component 402, a decoder component 403, and the like.
The extractor component extracts the speech feature vectors of the input signal. When training the DNN acoustic model, standard Mandarin data combined with Mandarin data of the dialect accents of each major dialect zone is selected as the input signal; when adapting the acoustic model, Mandarin data with the dialect accent of the target dialect zone is selected as the input signal.
A trainer component (DNN) for training the DNN acoustic model and adaptively adjusting the obtained acoustic model. The method comprises the following steps:
and the input layer is used for receiving the voice feature vectors of the extractor component.
A plurality of hidden layers (at least three). Wherein each hidden layer comprises a respective plurality of nodes (neurons), each node in each hidden layer being configured to perform a linear or non-linear transformation on an output from at least one node of an adjacent lower layer in the DNN. Wherein the input of a node of an upper hidden layer may be based on the output of one or several nodes in an adjacent lower layer. Each hidden layer has a weight value corresponding thereto, wherein the weight value is obtained based on an acoustic signal of training data. When the model is trained, the model can be pre-trained by utilizing a supervised or unsupervised learning process to obtain the initial weight of each hidden layer. The fine adjustment of the weight of each hidden layer can be performed by adopting a backward propagation algorithm.
An output layer to receive an output from a top hidden layer in the DNN. The nodes of the output layer process the received signal with a modeling unit consisting of mandarin chinese pronunciation phonemes, the output of which is a probability distribution over said modeling unit, herein referred to as a first probability.
The output units of the output layer are modeling units that represent the phonetic elements used in Mandarin Chinese. The modeling units may use tied triphone states (senones) and may be based on a Hidden Markov Model (HMM) or another suitable modeling unit.
A decoder component for identifying words of Mandarin data of the dialect accent of the target dialect zone using the probabilities output by the trainer component.
In the embodiment of the invention, standard Mandarin data is selected as training data, and dialect-accent data from the eight dialect zones is added, so that acoustic feature vectors are extracted jointly to train a multi-hidden-layer DNN model. In addition, to improve the DNN model's adaptability to Mandarin data with the dialect accents of each major dialect zone, the recognition system for Mandarin data with a dialect accent of the target dialect zone reuses the hidden-layer parameters of the obtained DNN model and relearns the probability values output by the output layer using Mandarin data with the dialect accent of the target dialect zone. The acoustic model obtained in this way can improve recognition rates compared with models trained only on standard Mandarin data or on Mandarin data with the dialect accent of a single dialect zone.
Through the above scheme, the modeling techniques of deep neural networks greatly improve the classification capability of the obtained acoustic model in its multiple hidden layers, thereby improving recognition accuracy. In the model-adaptation stage, the hidden-layer parameters of the obtained acoustic model are reused, and no separate model needs to be built for the dialect-accent data of each dialect zone, which simplifies model training. In addition, with the scheme of the embodiment of the invention, the probability values of the output layer are learned and updated with a small amount of data, without building a dialect-accent database for each individual dialect zone, and the acoustic model can adapt to the dialect-accent data of different target dialect zones.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only one logical division, and other divisions are possible in practice; a plurality of units or components may be combined or integrated into another system; and some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, devices, or units, and may be electrical, mechanical, or of another form.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be physically included alone, or two or more units may be integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute some of the steps of the method according to the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (12)

1. A method of speech modeling, comprising:
taking standard Mandarin data and at least one set of Mandarin data with a dialect accent as input data, and extracting voice feature vectors of the input data;
training a Deep Neural Network (DNN) acoustic model by utilizing the voice feature vectors, wherein an output layer of the DNN acoustic model outputs a first probability;
obtaining Mandarin data with a target dialect accent;
and learning the output layer by using the Mandarin data with the target dialect accent, reusing the parameters of the hidden layers of the trained deep neural network DNN acoustic model, and updating the first probability with the second probability output by the output layer.
2. The method of claim 1, wherein the step of extracting the speech feature vector of the input data comprises:
performing windowing and framing operations on the input data to obtain voice frames;
and removing silence frames from the voice frames to obtain the voice feature vectors.
3. The method of claim 1, wherein the step of training a Deep Neural Network (DNN) acoustic model using the speech feature vectors, wherein an output layer of the acoustic model outputs a first probability comprises:
inputting the speech feature vectors as input signals to an input layer of the DNN acoustic model;
processing the input signal of each hidden layer in a plurality of hidden layers of the DNN acoustic model by using a first weight corresponding to each hidden layer to obtain an output signal of each hidden layer;
and processing the output signal of the uppermost hidden layer at the output layer of the DNN acoustic model to obtain a first probability.
4. The method of claim 1, wherein the step of learning the output layer using the mandarin chinese data with the target dialect accent and updating the first probability using the second probability output by the output layer comprises:
inputting the Mandarin data with a target dialect accent as an input signal to an input layer of the DNN acoustic model;
processing the input signal of each hidden layer in a plurality of hidden layers of the DNN acoustic model by using a second weight corresponding to each hidden layer to obtain an output signal of each hidden layer;
processing an output signal of the uppermost hidden layer on an output layer of the DNN acoustic model to obtain a second probability;
updating the first probability with the second probability.
5. The method of claim 4, wherein prior to the step of learning the output layer using the Mandarin data with the target dialect accent and updating the first probability with a second probability output by the output layer, the method further comprises:
removing silence frames from the Mandarin data with the target dialect accent;
the step of inputting the Mandarin data with the target dialect accent as an input signal to an input layer of the DNN acoustic model comprises:
the Mandarin data with the target dialect accent, after the silence frames are removed, is input to the input layer of the DNN acoustic model as an input signal.
6. The method according to any one of claims 1-5, further comprising:
acquiring Mandarin data with an accent to be recognized;
and recognizing the Mandarin data with the accent to be recognized according to the second probability.
7. A speech modeling apparatus, comprising:
the extraction module is used for taking standard Mandarin data and at least one set of Mandarin data with a dialect accent as input data, and extracting voice feature vectors of the input data;
a training module for training a Deep Neural Network (DNN) acoustic model using the speech feature vectors, wherein an output layer of the DNN acoustic model outputs a first probability;
the obtaining module is used for obtaining Mandarin data with the dialect accent of the target dialect zone;
and the modeling module is used for learning the output layer by using the Mandarin data with the target dialect accent, reusing the parameters of the hidden layers of the trained deep neural network DNN acoustic model, and updating the first probability with the second probability output by the output layer.
8. The apparatus of claim 7, wherein the extraction module comprises:
the first obtaining submodule is used for performing windowing and framing operations on the input data to obtain voice frames;
and the second obtaining submodule is used for removing silence frames from the voice frames to obtain the voice feature vectors.
9. The apparatus of claim 7, wherein the training module comprises:
a first input layer sub-module for inputting the speech feature vectors as input signals to an input layer of the DNN acoustic model;
a first hidden layer submodule, configured to process, in a plurality of hidden layers of the DNN acoustic model, an input signal of each hidden layer by using a first weight corresponding to the hidden layer, so as to obtain an output signal of each hidden layer;
and the first output layer submodule is used for processing the output signal of the uppermost hidden layer in the output layer of the DNN acoustic model to obtain a first probability.
10. The apparatus of claim 7, wherein the modeling module comprises:
a second input layer submodule for inputting the Mandarin data with the target dialect accent as an input signal to an input layer of the DNN acoustic model;
a second hidden layer submodule, configured to process, in a plurality of hidden layers of the DNN acoustic model, an input signal of each hidden layer by using a second weight corresponding to the hidden layer, so as to obtain an output signal of each hidden layer;
the second output layer submodule is used for processing an output signal of the uppermost hidden layer on an output layer of the DNN acoustic model to obtain a second probability;
an update submodule for updating the first probability with the second probability.
11. The apparatus of claim 10, further comprising:
a processing module for removing silence frames from the Mandarin data with the target dialect accent;
the second input layer submodule is specifically configured to input the Mandarin data with the target dialect accent, after the silence frames are removed, as an input signal to the input layer of the DNN acoustic model.
12. The apparatus according to any one of claims 7-11, further comprising:
the receiving module is used for acquiring Mandarin data with an accent to be recognized;
and the recognition module is used for recognizing the Mandarin data with the accent to be recognized according to the second probability.
CN201611103738.6A 2016-12-05 2016-12-05 Voice modeling method and device Active CN108172218B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611103738.6A CN108172218B (en) 2016-12-05 2016-12-05 Voice modeling method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611103738.6A CN108172218B (en) 2016-12-05 2016-12-05 Voice modeling method and device

Publications (2)

Publication Number Publication Date
CN108172218A CN108172218A (en) 2018-06-15
CN108172218B true CN108172218B (en) 2021-01-12

Family

ID=62525918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611103738.6A Active CN108172218B (en) 2016-12-05 2016-12-05 Voice modeling method and device

Country Status (1)

Country Link
CN (1) CN108172218B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109887497B (en) * 2019-04-12 2021-01-29 北京百度网讯科技有限公司 Modeling method, device and equipment for speech recognition
CN110033760B (en) * 2019-04-15 2021-01-29 北京百度网讯科技有限公司 Modeling method, device and equipment for speech recognition
CN110211565B (en) * 2019-05-06 2023-04-04 平安科技(深圳)有限公司 Dialect identification method and device and computer readable storage medium
CN110738991A (en) * 2019-10-11 2020-01-31 东南大学 Speech recognition equipment based on flexible wearable sensor
CN110930995B (en) * 2019-11-26 2022-02-11 中国南方电网有限责任公司 Voice recognition model applied to power industry
CN111179938A (en) * 2019-12-26 2020-05-19 安徽仁昊智能科技有限公司 Speech recognition garbage classification system based on artificial intelligence
CN111243574B (en) * 2020-01-13 2023-01-03 苏州奇梦者网络科技有限公司 Voice model adaptive training method, system, device and storage medium
CN111833845B (en) * 2020-07-31 2023-11-24 平安科技(深圳)有限公司 Multilingual speech recognition model training method, device, equipment and storage medium
CN112509555B (en) * 2020-11-25 2023-05-23 平安科技(深圳)有限公司 Dialect voice recognition method, device, medium and electronic equipment
CN112528679B (en) * 2020-12-17 2024-02-13 科大讯飞股份有限公司 Method and device for training intention understanding model, and method and device for intention understanding
CN112770154A (en) * 2021-01-19 2021-05-07 深圳西米通信有限公司 Intelligent set top box with voice interaction function and interaction method thereof
CN112967720B (en) * 2021-01-29 2022-12-30 南京迪港科技有限责任公司 End-to-end voice-to-text model optimization method under small amount of accent data
CN113345451B (en) * 2021-04-26 2023-08-22 北京搜狗科技发展有限公司 Sound changing method and device and electronic equipment
CN113192492A (en) * 2021-04-28 2021-07-30 平安科技(深圳)有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103310788B (en) * 2013-05-23 2016-03-16 北京云知声信息技术有限公司 A kind of voice information identification method and system
CN104282300A (en) * 2013-07-05 2015-01-14 中国移动通信集团公司 Non-periodic component syllable model building and speech synthesizing method and device
EP2889804A1 (en) * 2013-12-30 2015-07-01 Alcatel Lucent Systems and methods for contactless speech recognition
CN104575490B (en) * 2014-12-30 2017-11-07 苏州驰声信息科技有限公司 Spoken language pronunciation evaluating method based on deep neural network posterior probability algorithm
US9477652B2 (en) * 2015-02-13 2016-10-25 Facebook, Inc. Machine learning dialect identification
CN105118498B (en) * 2015-09-06 2018-07-31 百度在线网络技术(北京)有限公司 The training method and device of phonetic synthesis model
CN105206258B (en) * 2015-10-19 2018-05-04 百度在线网络技术(北京)有限公司 The generation method and device and phoneme synthesizing method and device of acoustic model
CN105391873A (en) * 2015-11-25 2016-03-09 上海新储集成电路有限公司 Method for realizing local voice recognition in mobile device
CN105578115B (en) * 2015-12-22 2016-10-26 深圳市鹰硕音频科技有限公司 A kind of Network teaching method with Speech Assessment function and system
CN105632501B (en) * 2015-12-30 2019-09-03 中国科学院自动化研究所 A kind of automatic accent classification method and device based on depth learning technology

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A General Framework for Multi-Accent Mandarin Speech Recognition Using Adaptive Neural Networks; Xiang Sui et al.; IEEE; Dec. 31, 2014; pp. 118-122 *
Automatic Identification of Chinese Dialects Based on Phonetic Arrangement; Gu Mingliang et al.; Journal of Chinese Information Processing; Dec. 31, 2006; Vol. 20, No. 5; pp. 77-82 *

Also Published As

Publication number Publication date
CN108172218A (en) 2018-06-15

Similar Documents

Publication Publication Date Title
CN108172218B (en) Voice modeling method and device
Abdel-Hamid et al. Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code
Miao et al. Speaker adaptive training of deep neural network acoustic models using i-vectors
Saon et al. Speaker adaptation of neural network acoustic models using i-vectors
Qian et al. On the training aspects of deep neural network (DNN) for parametric TTS synthesis
CN105741832B (en) Spoken language evaluation method and system based on deep learning
CN105632501A (en) Deep-learning-technology-based automatic accent classification method and apparatus
Cai et al. From speaker verification to multispeaker speech synthesis, deep transfer with feedback constraint
JP6506074B2 (en) Acoustic model learning device, speech recognition device, acoustic model learning method, speech recognition method and program
CN111916111B (en) Intelligent voice outbound method and device with emotion, server and storage medium
WO2007114605A1 (en) Acoustic model adaptation methods based on pronunciation variability analysis for enhancing the recognition of voice of non-native speaker and apparatuses thereof
KR102311922B1 (en) Apparatus and method for controlling outputting target information to voice using characteristic of user voice
Abro et al. Qur'an recognition for the purpose of memorisation using Speech Recognition technique
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
Salam et al. Malay isolated speech recognition using neural network: a work in finding number of hidden nodes and learning parameters.
Wang et al. Speech augmentation using wavenet in speech recognition
Rabiee et al. Persian accents identification using an adaptive neural network
WO2019212375A1 (en) Method for obtaining speaker-dependent small high-level acoustic speech attributes
KR20150001191A (en) Apparatus and method for recognizing continuous speech
Ons et al. A self learning vocal interface for speech-impaired users
Ponting Computational Models of Speech Pattern Processing
Tan et al. Denoised senone i-vectors for robust speaker verification
Karanasou et al. I-vector estimation using informative priors for adaptation of deep neural networks
Abraham et al. Articulatory Feature Extraction Using CTC to Build Articulatory Classifiers Without Forced Frame Alignments for Speech Recognition.
CN108182938A (en) A kind of training method of the Mongol acoustic model based on DNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant