CN108172218B - Voice modeling method and device - Google Patents


Info

Publication number
CN108172218B
CN108172218B (application CN201611103738.6A)
Authority
CN
China
Prior art keywords
layer
data
probability
acoustic model
input
Prior art date
Legal status (assumed, not a legal conclusion; Google has not performed a legal analysis): Active
Application number
CN201611103738.6A
Other languages
Chinese (zh)
Other versions
CN108172218A (en)
Inventor
徐衍瀚
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Communications Ltd Research Institute
Priority to CN201611103738.6A
Publication of CN108172218A
Application granted
Publication of CN108172218B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering

Abstract

The invention provides a voice modeling method and device, relates to the technical field of speech recognition, and aims to reduce the complexity of speech-model modeling. The voice modeling method comprises the following steps: taking standard Mandarin data and Mandarin data with at least one dialect accent as input data, and extracting speech feature vectors from the input data; training a Deep Neural Network (DNN) acoustic model with the speech feature vectors, wherein an output layer of the acoustic model outputs a first probability; obtaining Mandarin data with a target dialect accent; and learning the output layer using the Mandarin data with the target dialect accent, and updating the first probability with the second probability output by the output layer. The invention can reduce the complexity of speech-model modeling.

Description

Voice modeling method and device
Technical Field
The invention relates to the technical field of voice recognition, in particular to a voice modeling method and a voice modeling device.
Background
Speech recognition is the process by which a machine converts speech signals into computer-recognizable input in order to understand what a person says. Current speech recognition technology mainly comprises statistical pattern recognition and artificial neural network techniques.
The Hidden Markov Model (HMM) is currently a well-developed, mature model in speech fields such as speech recognition; it models time-sequential speech statistically through a hidden Markov process and has achieved good results.
Speech recognition systems based on Deep Neural Networks (DNNs) have received increasing attention from researchers in recent years. The concept of deep learning stems from the study of artificial neural networks and was proposed by Hinton et al. in 2006. The essence of deep learning is to learn more useful features by constructing machine learning models with many hidden layers and massive amounts of training data, thereby improving the accuracy of classification or prediction. Its main findings are: (1) a multi-hidden-layer artificial neural network has excellent feature-learning capability, and the learned features describe the data more essentially, which facilitates classification; (2) the difficulty of training a deep neural network can be effectively overcome by layer-by-layer initialization, which is realized through unsupervised learning.
To improve the recognition accuracy of Mandarin with a dialect-accent background, the prior art provides various methods. Some improve the training method in the acoustic modeling process; others improve the language model in the recognition system. However, the conventional methods for recognizing Mandarin with a dialect-accent background involve high model-training complexity.
Disclosure of Invention
In view of this, the present invention provides a speech modeling method and apparatus for reducing complexity of speech model modeling.
In order to solve the above technical problem, the present invention provides a speech modeling method, including:
taking standard Mandarin data and Mandarin data with at least one dialect accent as input data, and extracting speech feature vectors from the input data;
training a Deep Neural Network (DNN) acoustic model by utilizing the voice feature vectors, wherein an output layer of the acoustic model outputs a first probability;
obtaining mandarin data with a target dialect accent;
and learning the output layer using the Mandarin data with the target dialect accent, and updating the first probability with the second probability output by the output layer.
Wherein the step of extracting the speech feature vector of the input data comprises:
performing windowing and framing on the input data to obtain speech frames;
and removing the silence frames from the speech frames to obtain the speech feature vectors.
Wherein the step of training a deep neural network DNN acoustic model by using the speech feature vectors, wherein an output layer of the acoustic model outputs a first probability comprises:
inputting the speech feature vectors as input signals to an input layer of the DNN acoustic model;
processing the input signal of each hidden layer in a plurality of hidden layers of the DNN acoustic model by using a first weight corresponding to each hidden layer to obtain an output signal of each hidden layer;
and processing the output signal of the uppermost hidden layer at the output layer of the DNN acoustic model to obtain a first probability.
Wherein the step of learning the output layer using the mandarin data with the target dialect accent and updating the first probability using the second probability output by the output layer comprises:
inputting the Mandarin data with a target dialect accent as an input signal to an input layer of the DNN acoustic model;
processing the input signal of each hidden layer in a plurality of hidden layers of the DNN acoustic model by using a second weight corresponding to each hidden layer to obtain an output signal of each hidden layer;
processing an output signal of the uppermost hidden layer on an output layer of the DNN acoustic model to obtain a second probability;
updating the first probability with the second probability.
Wherein, before the step of learning the output layer using the Mandarin data with the target dialect accent and updating the first probability using the second probability output by the output layer, the method further comprises:
removing silence frames in the mandarin data with the target dialect accent;
the step of inputting the Mandarin data with the target dialect accent as an input signal to an input layer of the DNN acoustic model comprises:
the Mandarin data with the target dialect accent, after the silence frames are removed, is input as an input signal to an input layer of the DNN acoustic model.
Wherein the method further comprises:
acquiring mandarin data with accents to be identified;
and recognizing the Mandarin data with the accent to be recognized according to the second probability.
In a second aspect, the present invention provides a speech modeling apparatus, comprising:
the extraction module is used for taking standard Mandarin data and Mandarin data with at least one dialect accent as input data and extracting the speech feature vectors of the input data;
a training module for training a Deep Neural Network (DNN) acoustic model by using the voice feature vector, wherein an output layer of the acoustic model outputs a first probability;
the obtaining module is used for obtaining Mandarin data with the target dialect accent;
and the modeling module is used for learning the output layer by utilizing the Mandarin data with the target dialect accent and updating the first probability by utilizing the second probability output by the output layer.
Wherein the extraction module comprises:
the first obtaining submodule is used for carrying out windowing and framing operation on the input data to obtain a voice frame;
and the second obtaining submodule is used for removing the mute frame in the voice frame to obtain the voice characteristic vector.
Wherein the training module comprises:
a first input layer sub-module for inputting the speech feature vectors as input signals to an input layer of the DNN acoustic model;
a first hidden layer submodule, configured to process, in a plurality of hidden layers of the DNN acoustic model, an input signal of each hidden layer by using a first weight corresponding to the hidden layer, so as to obtain an output signal of each hidden layer;
and the first output layer submodule is used for processing the output signal of the uppermost hidden layer in the output layer of the DNN acoustic model to obtain a first probability.
Wherein the modeling module comprises:
a second input layer submodule for inputting the Mandarin data with the target dialect accent as an input signal to an input layer of the DNN acoustic model;
a second hidden layer submodule, configured to process, in a plurality of hidden layers of the DNN acoustic model, an input signal of each hidden layer by using a second weight corresponding to the hidden layer, so as to obtain an output signal of each hidden layer;
the second output layer submodule is used for processing an output signal of the uppermost hidden layer on an output layer of the DNN acoustic model to obtain a second probability;
an update submodule for updating the first probability with the second probability.
Wherein the apparatus further comprises:
a processing module for removing a mute frame from the mandarin data with the target dialect accent;
the second input layer sub-module is specifically configured to input mandarin data with a target dialect accent after the silence frame is removed as an input signal to an input layer of the DNN acoustic model.
Wherein the apparatus further comprises:
the receiving module is used for acquiring mandarin data with accents to be identified;
and the recognition module is used for recognizing the Mandarin data with the accent to be recognized according to the second probability.
The technical scheme of the invention has the following beneficial effects:
in an embodiment of the invention, an acoustic model is trained with deep neural network techniques on standard Mandarin data and Mandarin data with at least one dialect accent, obtaining a first probability. The output layer of the acoustic model is then learned with Mandarin data with a target dialect accent, and the first probability is updated with the second probability output by the output layer. Therefore, with the scheme of the embodiment of the invention, when Mandarin data with the target dialect accent is used for adaptive adjustment, the hidden-layer parameters of the trained acoustic model are reused, and no separate model needs to be built for the dialect-accent data of each dialect zone, which simplifies model training and reduces the complexity of speech-model modeling.
Drawings
FIG. 1 is a flowchart of a speech modeling method according to a first embodiment of the present invention;
FIG. 2 is a diagram illustrating a voice modeling apparatus according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of a speech modeling apparatus according to a second embodiment of the present invention;
fig. 4 is a schematic diagram of an automatic speech recognition system according to a third embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention will be made with reference to the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
Example one
As shown in fig. 1, a speech modeling method according to a first embodiment of the present invention includes:
step 101, taking standard Mandarin data and at least one Mandarin data with dialect accent as input data, and extracting a voice feature vector of the input data.
The Chinese language mainly comprises standard Mandarin, as published by the authorities, and Mandarin with dialect accents in the various regions. Chinese dialects can be roughly divided into eight dialect zones by region. Mandarin Chinese is a single language, but its pronunciation is affected by the spoken dialect accents of each region, so some words are pronounced differently from standard Mandarin. Therefore, an acoustic model trained on standard Mandarin data alone cannot effectively and correctly describe acoustic characteristics that carry such pronunciation variations; it is also difficult, in engineering terms, to collect Mandarin data with the dialect accent of a particular dialect and build a database of sufficient size.
Therefore, in the embodiment of the invention, standard Mandarin data and Mandarin data with the dialect accents of at least one dialect zone are selected as the input data, from which the acoustic feature vectors are jointly extracted to train a multi-hidden-layer DNN model. Preferably, standard Mandarin data and Mandarin data with the dialect accents of all eight dialect zones are selected as the input data.
For the input data, to make the subsequently built acoustic model more accurate, windowing and framing are performed on the input data to obtain speech frames. The short-time energy of each speech frame is then calculated, and silence frames are removed accordingly: the short-time energy of each speech frame is compared with a preset threshold, and any speech frame whose short-time energy is below the threshold is treated as a silence frame and removed. The remaining speech frames yield the speech feature vectors. The threshold can be set arbitrarily.
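The framing, windowing, and energy-based silence removal described above can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation; the 25 ms/10 ms frame sizes and the energy threshold are illustrative assumptions, since the text leaves the threshold arbitrary:

```python
import numpy as np

def remove_silence_frames(signal, frame_len=400, hop=160, threshold=1e-4):
    """Window-and-frame a waveform, then drop frames whose short-time
    energy falls below a preset threshold (treated as silence).
    frame_len/hop correspond to 25 ms / 10 ms at 16 kHz; the threshold
    value is arbitrary, as the text notes."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    energy = np.mean(frames ** 2, axis=1)   # short-time energy per frame
    return frames[energy >= threshold]

# Toy input: 0.1 s of silence followed by 0.1 s of a 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr // 10) / sr
signal = np.concatenate([np.zeros(sr // 10), 0.5 * np.sin(2 * np.pi * 440 * t)])
frames = remove_silence_frames(signal)      # silence-only frames are dropped
```

Frames lying entirely in the silent half have zero energy and are discarded; only the windowed tone frames survive.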
The speech feature vectors may also be context-dependent, i.e. configured to receive the feature vectors of several adjacent frames. The speech feature vectors may be, for example, Mel-Frequency Cepstral Coefficients (MFCCs), Perceptual Linear Prediction (PLP) features, and the like.
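A context-dependent input of the kind just mentioned can be formed by stacking each frame with its neighbours. The sketch below assumes a hypothetical window of 5 frames on each side over 13-dimensional features; both numbers are illustrative choices, not taken from the patent:

```python
import numpy as np

def stack_context(features, left=5, right=5):
    """Build context-dependent input vectors by concatenating each frame
    with its neighbours; edge frames are padded by repeating the first
    and last frames so every output vector has the same dimension."""
    padded = np.concatenate([np.repeat(features[:1], left, axis=0),
                             features,
                             np.repeat(features[-1:], right, axis=0)])
    n, dim = features.shape
    return np.stack([padded[i : i + left + 1 + right].ravel()
                     for i in range(n)])

feats = np.random.randn(100, 13)   # stand-in for 13-dimensional MFCC frames
inputs = stack_context(feats)      # each row is an 11-frame, 143-dim vector
```

Each stacked row then serves as one input vector to the DNN's input layer.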
Step 102, training a DNN acoustic model by utilizing the voice feature vector, wherein an output layer of the acoustic model outputs a first probability.
In practical application, the DNN acoustic model includes:
and the input layer is used for receiving the voice feature vectors.
A plurality of hidden layers (at least three). Each hidden layer comprises a plurality of nodes (neurons), and each node is configured to perform a linear or non-linear transformation on the output from at least one node of the adjacent lower layer in the DNN; the input of a node in an upper hidden layer may be based on the outputs of one or several nodes in the adjacent lower layer. Each hidden layer has a corresponding weight, obtained from the acoustic signals of the training data. When training the model, it can be pre-trained with a supervised or unsupervised learning process to obtain the initial weight of each hidden layer, and the weight of each hidden layer can then be fine-tuned with the Back-Propagation (BP) algorithm.
And the output layer, which receives the output signal from the uppermost hidden layer. The nodes of the output layer process the received signal with modeling units composed of Mandarin Chinese pronunciation phonemes; their output is a probability distribution over the modeling units, referred to herein as a probability.
The output units of the output layer are modeling units that represent the phonetic elements used in Mandarin Chinese. The modeling units may use tied triphone states (senones) and may be based on a Hidden Markov Model (HMM) or another suitable modeling unit.
Specifically, in this step, the speech feature vector is input as an input signal to an input layer of the DNN acoustic model; processing the input signal of each hidden layer in a plurality of hidden layers of the DNN acoustic model by using a first weight corresponding to each hidden layer to obtain an output signal of each hidden layer; and processing the output signal of the uppermost hidden layer at the output layer of the DNN acoustic model to obtain a first probability.
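The forward pass just described (hidden layers applying their weights, then the output layer producing the first probability) can be sketched as follows. The layer sizes (143-dimensional input, three 256-unit hidden layers, 3000 output modeling units), the ReLU non-linearity, and the omission of biases are all illustrative assumptions, not taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dnn_forward(x, hidden_weights, output_weight):
    """Each hidden layer transforms its input with its own weight matrix;
    the output layer turns the topmost hidden activation into a probability
    distribution over modeling units (the 'first probability')."""
    h = x
    for W in hidden_weights:
        h = relu(h @ W)
    return softmax(h @ output_weight)

dim_in, dim_h, n_units = 143, 256, 3000   # hypothetical sizes
hidden = [rng.normal(scale=0.05, size=(dim_in, dim_h)),
          rng.normal(scale=0.05, size=(dim_h, dim_h)),
          rng.normal(scale=0.05, size=(dim_h, dim_h))]
out_W = rng.normal(scale=0.05, size=(dim_h, n_units))

first_probability = dnn_forward(rng.normal(size=(1, dim_in)), hidden, out_W)
```

The result is a valid probability distribution: non-negative entries summing to one for each input frame.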
Step 103, obtaining Mandarin data with the target dialect accent.
The Mandarin data with the target dialect accent may be Mandarin data with any dialect accent.
Step 104, learning the output layer using the Mandarin data with the target dialect accent, and updating the first probability with the second probability output by the output layer.
In the embodiment of the present invention, steps 103 and 104 may be regarded as adaptively adjusting the DNN acoustic model of step 102 using Mandarin data with a target dialect accent. In the model-adaptation stage, the output layer is learned with the Mandarin data with the target dialect accent, and the probability value of the newly learned output layer directly replaces the probability value output by the output layer of the acoustic model that was trained in step 102 on the standard Mandarin data and the dialect-accent data.
Specifically, the mandarin data with the target dialect accent is input into an input layer of the DNN acoustic model as an input signal; processing the input signal of each hidden layer in a plurality of hidden layers of the DNN acoustic model by using a second weight corresponding to each hidden layer to obtain an output signal of each hidden layer; processing an output signal of the uppermost hidden layer on an output layer of the DNN acoustic model to obtain a second probability; updating the first probability with the second probability.
It should be noted that if the training data in step 104 is relatively scarce, the same weight values as those of each corresponding hidden layer in step 102 may be kept when performing model adaptation. The scheme therefore needs no large amount of data, and the resulting acoustic model can still adapt to the Mandarin data with the target dialect accent, improving recognition accuracy on it. If the training data in step 104 is relatively plentiful, the weights of the hidden layers can also be readjusted in this step according to the Mandarin data with the target dialect accent, and the output probability of the output layer updated above the top hidden layer, which likewise improves the recognition rate.
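The scarce-data case above, where the hidden-layer weights are reused as-is and only the output layer is relearned, can be sketched as follows. This is a hypothetical minimal setup (one hidden layer, no biases, full-batch gradient descent on cross-entropy, toy sizes); none of these specifics come from the patent:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def adapt_output_layer(hidden_weights, out_W, x, labels, lr=0.05, steps=200):
    """Relearn only the output layer on accented data: the hidden-layer
    weights are reused frozen (forward pass only), and the output weights
    are updated by full-batch gradient descent on the cross-entropy loss."""
    h = x
    for W in hidden_weights:                    # frozen hidden layers
        h = relu(h @ W)
    W = out_W.copy()
    onehot = np.eye(W.shape[1])[labels]
    for _ in range(steps):
        p = softmax(h @ W)                      # current output probabilities
        W -= lr * h.T @ (p - onehot) / len(x)   # softmax cross-entropy gradient
    return W

# Toy stand-ins for accented-Mandarin features and their target-unit labels.
dim_in, dim_h, n_units = 20, 32, 5
hidden = [rng.normal(scale=0.1, size=(dim_in, dim_h))]
out_W = rng.normal(scale=0.1, size=(dim_h, n_units))
x = rng.normal(size=(64, dim_in))
labels = rng.integers(0, n_units, size=64)

new_W = adapt_output_layer(hidden, out_W, x, labels)
```

Only `out_W` changes; probabilities computed with `new_W` (the "second probability") then replace those from the original output layer.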
The DNN acoustic model is thus established by training it and then adaptively adjusting it in steps 101 to 104.
It can be seen from the above that, by using the scheme of the embodiment of the present invention, when the mandarin data of the dialect accent of the target dialect is used for adaptive adjustment, the parameters of the hidden layer of the trained acoustic model are multiplexed, and a model does not need to be established separately for the data of the dialect accent of each dialect zone, thereby simplifying the complexity of model training and reducing the complexity of speech model modeling.
In the first embodiment, after the Mandarin data with the target dialect accent is obtained in step 103, the silence frames in it may be removed to improve recognition accuracy. Specifically, windowing and framing are performed on the Mandarin data with the target dialect accent to obtain speech frames; the short-time energy of each speech frame is then calculated and compared with a preset threshold, and any speech frame whose short-time energy is below the threshold is treated as a silence frame and removed. The threshold can be set arbitrarily.
After the model has been trained and adaptively adjusted, speech can be recognized with the adapted model: Mandarin data with an accent to be recognized is obtained and recognized according to the second probability.
Specifically, the Mandarin data with the accent to be recognized is input to the acoustic model obtained through steps 101 to 104, and the output layer outputs a third probability. The third probability is matched against the second probability, and the words and other content in the Mandarin data with the accent to be recognized are identified according to the degree of matching.
Through the above scheme, the modeling techniques of deep neural networks greatly improve the classification capability of the obtained acoustic model in its multiple hidden layers, thereby improving recognition accuracy. In the model-adaptation stage, the hidden-layer parameters of the obtained acoustic model are reused, and no separate model needs to be built for the dialect-accent data of each dialect zone, which simplifies model training. In addition, with the scheme of the embodiment of the invention, the probability values of the output layer are learned and updated with a small amount of data, without building a dialect-accent database for each individual dialect zone, and the acoustic model can adapt to the dialect-accent data of different target dialect zones.
Example two
As shown in fig. 2, the speech modeling apparatus according to the second embodiment of the present invention includes:
an extracting module 201, configured to take standard mandarin data and at least one mandarin data with dialect accent as input data, and extract a voice feature vector of the input data; a training module 202, configured to train a deep neural network DNN acoustic model using the speech feature vectors, wherein an output layer of the acoustic model outputs a first probability; an obtaining module 203, configured to obtain mandarin data with a dialect accent of the target dialect; a modeling module 204 for learning the output layer using the Mandarin data with the target dialect accent and updating the first probability using a second probability output by the output layer.
Wherein the extracting module 201 comprises: the first obtaining submodule is used for carrying out windowing and framing operation on the input data to obtain a voice frame; and the second obtaining submodule is used for removing the mute frame in the voice frame to obtain the voice characteristic vector.
Wherein the training module 202 comprises: a first input layer sub-module for inputting the speech feature vectors as input signals to an input layer of the DNN acoustic model; a first hidden layer submodule, configured to process, in a plurality of hidden layers of the DNN acoustic model, an input signal of each hidden layer by using a first weight corresponding to the hidden layer, so as to obtain an output signal of each hidden layer; and the first output layer submodule is used for processing the output signal of the uppermost hidden layer in the output layer of the DNN acoustic model to obtain a first probability.
Wherein the modeling module 204 comprises: a second input layer submodule for inputting the Mandarin data with the target dialect accent as an input signal to an input layer of the DNN acoustic model; a second hidden layer submodule, configured to process, in a plurality of hidden layers of the DNN acoustic model, an input signal of each hidden layer by using a second weight corresponding to the hidden layer, so as to obtain an output signal of each hidden layer; the second output layer submodule is used for processing an output signal of the uppermost hidden layer on an output layer of the DNN acoustic model to obtain a second probability; an update submodule for updating the first probability with the second probability.
As shown in fig. 3, the apparatus further comprises: a processing module 205 for removing the silence frames in the mandarin data with the accent of the target dialect. At this time, the second input layer sub-module is specifically configured to input mandarin data with a target dialect accent after the mute frame is removed as an input signal to the input layer of the DNN acoustic model.
As further shown in fig. 3, the apparatus further comprises: a receiving module 206, configured to obtain Mandarin data with an accent to be recognized; and a recognition module 207, configured to recognize the Mandarin data with the accent to be recognized according to the second probability.
The working principle of the device according to the invention can be referred to the description of the method embodiment described above.
It can be seen from the above that, by using the scheme of the embodiment of the present invention, when the mandarin data of the dialect accent of the target dialect is used for adaptive adjustment, the parameters of the hidden layer of the trained acoustic model are multiplexed, and a model does not need to be established separately for the data of the dialect accent of each dialect zone, thereby simplifying the complexity of model training and reducing the complexity of speech model modeling.
EXAMPLE III
Fig. 4 shows an automatic speech recognition system according to a third embodiment of the present invention. The system comprises: an extractor component 401, a trainer component 402, a decoder component 403, and the like.
The extractor component extracts the speech feature vectors of the input signal. When training the DNN acoustic model, standard Mandarin data combined with Mandarin data of the dialect accents of each major dialect zone is selected as the input signal; when adapting the acoustic model, Mandarin data with the dialect accent of the target dialect zone is selected as the input signal.
A trainer component (DNN) for training the DNN acoustic model and adaptively adjusting the obtained acoustic model. The method comprises the following steps:
and the input layer is used for receiving the voice feature vectors of the extractor component.
A plurality of hidden layers (at least three). Wherein each hidden layer comprises a respective plurality of nodes (neurons), each node in each hidden layer being configured to perform a linear or non-linear transformation on an output from at least one node of an adjacent lower layer in the DNN. Wherein the input of a node of an upper hidden layer may be based on the output of one or several nodes in an adjacent lower layer. Each hidden layer has a weight value corresponding thereto, wherein the weight value is obtained based on an acoustic signal of training data. When the model is trained, the model can be pre-trained by utilizing a supervised or unsupervised learning process to obtain the initial weight of each hidden layer. The fine adjustment of the weight of each hidden layer can be performed by adopting a backward propagation algorithm.
An output layer to receive an output from a top hidden layer in the DNN. The nodes of the output layer process the received signal with a modeling unit consisting of mandarin chinese pronunciation phonemes, the output of which is a probability distribution over said modeling unit, herein referred to as a first probability.
The output units of the output layer are modeling units that represent the phonetic elements used in Mandarin Chinese. The modeling units may use tied triphone states (senones) and may be based on a Hidden Markov Model (HMM) or another suitable modeling unit.
A decoder component for identifying words of Mandarin data of the dialect accent of the target dialect zone using the probabilities output by the trainer component.
In the embodiment of the invention, standard Mandarin data is selected as training data, and dialect-accent data from the eight dialect zones is added, so that acoustic feature vectors are extracted jointly to train a multi-hidden-layer DNN model. In addition, to improve the DNN model's adaptability to Mandarin data with the dialect accents of each major dialect zone, the recognition system for Mandarin data with a dialect accent of the target dialect zone reuses the hidden-layer parameters of the obtained DNN model and relearns the probability values output by the output layer using Mandarin data with the dialect accent of the target dialect zone. The acoustic model obtained in this way can improve recognition rates compared with models trained only on standard Mandarin data or on Mandarin data with the dialect accent of a single dialect zone.
Through the above scheme, the modeling techniques of deep neural networks greatly improve the classification capability of the obtained acoustic model in its multiple hidden layers, thereby improving recognition accuracy. In the model-adaptation stage, the hidden-layer parameters of the obtained acoustic model are reused, and no separate model needs to be built for the dialect-accent data of each dialect zone, which simplifies model training. In addition, with the scheme of the embodiment of the invention, the probability values of the output layer are learned and updated with a small amount of data, without building a dialect-accent database for each individual dialect zone, and the acoustic model can adapt to the dialect-accent data of different target dialect zones.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only one logical division, and other divisions are possible in practice; a plurality of units or components may be combined or integrated into another system; and some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, devices, or units, and may be electrical, mechanical, or of another form.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be physically included alone, or two or more units may be integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute some of the steps of the method according to the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (12)

1. A method of speech modeling, comprising:
taking standard Mandarin data and at least one set of Mandarin data with a dialect accent as input data, and extracting voice feature vectors of the input data;
training a Deep Neural Network (DNN) acoustic model by utilizing the voice feature vectors, wherein an output layer of the DNN acoustic model outputs a first probability;
obtaining Mandarin data with a target dialect accent;
and learning the output layer by using the Mandarin data with the target dialect accent, reusing the parameters of the hidden layers of the trained deep neural network DNN acoustic model, and updating the first probability with the second probability output by the output layer.
2. The method of claim 1, wherein the step of extracting the speech feature vector of the input data comprises:
performing windowing and framing operations on the input data to obtain voice frames;
and removing silence frames from the voice frames to obtain the voice feature vectors.
3. The method of claim 1, wherein the step of training a Deep Neural Network (DNN) acoustic model using the speech feature vectors, wherein an output layer of the acoustic model outputs a first probability comprises:
inputting the speech feature vectors as input signals to an input layer of the DNN acoustic model;
processing the input signal of each hidden layer in a plurality of hidden layers of the DNN acoustic model by using a first weight corresponding to each hidden layer to obtain an output signal of each hidden layer;
and processing the output signal of the uppermost hidden layer at the output layer of the DNN acoustic model to obtain a first probability.
4. The method of claim 1, wherein the step of learning the output layer using the mandarin chinese data with the target dialect accent and updating the first probability using the second probability output by the output layer comprises:
inputting the Mandarin data with a target dialect accent as an input signal to an input layer of the DNN acoustic model;
processing the input signal of each hidden layer in a plurality of hidden layers of the DNN acoustic model by using a second weight corresponding to each hidden layer to obtain an output signal of each hidden layer;
processing an output signal of the uppermost hidden layer on an output layer of the DNN acoustic model to obtain a second probability;
updating the first probability with the second probability.
5. The method of claim 4, wherein prior to the step of learning the output layer using the Mandarin data with the target dialect accent and updating the first probability with a second probability output by the output layer, the method further comprises:
removing silence frames from the Mandarin data with the target dialect accent;
the step of inputting the Mandarin data with the target dialect accent as an input signal to an input layer of the DNN acoustic model comprises:
the Mandarin data with the target dialect accent, after the silence frames are removed, is input to the input layer of the DNN acoustic model as an input signal.
6. The method according to any one of claims 1-5, further comprising:
acquiring Mandarin data with an accent to be recognized;
and recognizing the Mandarin data with the accent to be recognized according to the second probability.
7. A speech modeling apparatus, comprising:
the extraction module is used for taking standard Mandarin data and at least one set of Mandarin data with a dialect accent as input data, and extracting voice feature vectors of the input data;
a training module for training a Deep Neural Network (DNN) acoustic model using the speech feature vectors, wherein an output layer of the DNN acoustic model outputs a first probability;
the obtaining module is used for obtaining Mandarin data with the dialect accent of the target dialect zone;
and the modeling module is used for learning the output layer by using the Mandarin data with the target dialect accent, reusing the parameters of the hidden layers of the trained deep neural network DNN acoustic model, and updating the first probability with the second probability output by the output layer.
8. The apparatus of claim 7, wherein the extraction module comprises:
the first obtaining submodule is used for performing windowing and framing operations on the input data to obtain voice frames;
and the second obtaining submodule is used for removing silence frames from the voice frames to obtain the voice feature vectors.
9. The apparatus of claim 7, wherein the training module comprises:
a first input layer sub-module for inputting the speech feature vectors as input signals to an input layer of the DNN acoustic model;
a first hidden layer submodule, configured to process, in a plurality of hidden layers of the DNN acoustic model, an input signal of each hidden layer by using a first weight corresponding to the hidden layer, so as to obtain an output signal of each hidden layer;
and the first output layer submodule is used for processing the output signal of the uppermost hidden layer in the output layer of the DNN acoustic model to obtain a first probability.
10. The apparatus of claim 7, wherein the modeling module comprises:
a second input layer submodule for inputting the Mandarin data with the target dialect accent as an input signal to an input layer of the DNN acoustic model;
a second hidden layer submodule, configured to process, in a plurality of hidden layers of the DNN acoustic model, an input signal of each hidden layer by using a second weight corresponding to the hidden layer, so as to obtain an output signal of each hidden layer;
the second output layer submodule is used for processing an output signal of the uppermost hidden layer on an output layer of the DNN acoustic model to obtain a second probability;
an update submodule for updating the first probability with the second probability.
11. The apparatus of claim 10, further comprising:
a processing module for removing silence frames from the Mandarin data with the target dialect accent;
the second input layer submodule is specifically configured to input the Mandarin data with the target dialect accent, after the silence frames are removed, as an input signal to the input layer of the DNN acoustic model.
12. The apparatus according to any one of claims 7-11, further comprising:
the receiving module is used for acquiring Mandarin data with an accent to be recognized;
and the recognition module is used for recognizing the Mandarin data with the accent to be recognized according to the second probability.
CN201611103738.6A 2016-12-05 2016-12-05 Voice modeling method and device Active CN108172218B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611103738.6A CN108172218B (en) 2016-12-05 2016-12-05 Voice modeling method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611103738.6A CN108172218B (en) 2016-12-05 2016-12-05 Voice modeling method and device

Publications (2)

Publication Number Publication Date
CN108172218A CN108172218A (en) 2018-06-15
CN108172218B true CN108172218B (en) 2021-01-12

Family

ID=62525918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611103738.6A Active CN108172218B (en) 2016-12-05 2016-12-05 Voice modeling method and device

Country Status (1)

Country Link
CN (1) CN108172218B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109887497B (en) * 2019-04-12 2021-01-29 北京百度网讯科技有限公司 Modeling method, device and equipment for speech recognition
CN110033760B (en) * 2019-04-15 2021-01-29 北京百度网讯科技有限公司 Modeling method, device and equipment for speech recognition
CN110211565B (en) * 2019-05-06 2023-04-04 平安科技(深圳)有限公司 Dialect identification method and device and computer readable storage medium
CN110738991A (en) * 2019-10-11 2020-01-31 东南大学 Speech recognition equipment based on flexible wearable sensor
CN110930995B (en) * 2019-11-26 2022-02-11 中国南方电网有限责任公司 Voice recognition model applied to power industry
CN111179938A (en) * 2019-12-26 2020-05-19 安徽仁昊智能科技有限公司 Speech recognition garbage classification system based on artificial intelligence
CN111243574B (en) * 2020-01-13 2023-01-03 苏州奇梦者网络科技有限公司 Voice model adaptive training method, system, device and storage medium
CN111833845B (en) * 2020-07-31 2023-11-24 平安科技(深圳)有限公司 Multilingual speech recognition model training method, device, equipment and storage medium
CN112509555B (en) * 2020-11-25 2023-05-23 平安科技(深圳)有限公司 Dialect voice recognition method, device, medium and electronic equipment
CN112528679B (en) * 2020-12-17 2024-02-13 科大讯飞股份有限公司 Method and device for training intention understanding model, and method and device for intention understanding
CN112770154A (en) * 2021-01-19 2021-05-07 深圳西米通信有限公司 Intelligent set top box with voice interaction function and interaction method thereof
CN112967720B (en) * 2021-01-29 2022-12-30 南京迪港科技有限责任公司 End-to-end voice-to-text model optimization method under small amount of accent data
CN113345451B (en) * 2021-04-26 2023-08-22 北京搜狗科技发展有限公司 Sound changing method and device and electronic equipment
CN113192492A (en) * 2021-04-28 2021-07-30 平安科技(深圳)有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103310788B (en) * 2013-05-23 2016-03-16 北京云知声信息技术有限公司 A kind of voice information identification method and system
CN104282300A (en) * 2013-07-05 2015-01-14 中国移动通信集团公司 Non-periodic component syllable model building and speech synthesizing method and device
EP2889804A1 (en) * 2013-12-30 2015-07-01 Alcatel Lucent Systems and methods for contactless speech recognition
CN104575490B (en) * 2014-12-30 2017-11-07 苏州驰声信息科技有限公司 Spoken language pronunciation evaluating method based on deep neural network posterior probability algorithm
US9477652B2 (en) * 2015-02-13 2016-10-25 Facebook, Inc. Machine learning dialect identification
CN105118498B (en) * 2015-09-06 2018-07-31 百度在线网络技术(北京)有限公司 The training method and device of phonetic synthesis model
CN105206258B (en) * 2015-10-19 2018-05-04 百度在线网络技术(北京)有限公司 The generation method and device and phoneme synthesizing method and device of acoustic model
CN105391873A (en) * 2015-11-25 2016-03-09 上海新储集成电路有限公司 Method for realizing local voice recognition in mobile device
CN105578115B (en) * 2015-12-22 2016-10-26 深圳市鹰硕音频科技有限公司 A kind of Network teaching method with Speech Assessment function and system
CN105632501B (en) * 2015-12-30 2019-09-03 中国科学院自动化研究所 A kind of automatic accent classification method and device based on depth learning technology

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A General Framework for Multi-Accent Mandarin Speech Recognition Using Adaptive Neural Networks; Xiang Sui et al.; IEEE; Dec. 31, 2014; pp. 118-122 *
Automatic Identification of Chinese Dialects Based on Phonetic Arrangement; Gu Mingliang et al.; Journal of Chinese Information Processing; Dec. 31, 2006; Vol. 20, No. 5; pp. 77-82 *

Also Published As

Publication number Publication date
CN108172218A (en) 2018-06-15

Similar Documents

Publication Publication Date Title
CN108172218B (en) Voice modeling method and device
Abdel-Hamid et al. Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code
Miao et al. Speaker adaptive training of deep neural network acoustic models using i-vectors
Saon et al. Speaker adaptation of neural network acoustic models using i-vectors
Qian et al. On the training aspects of deep neural network (DNN) for parametric TTS synthesis
CN105741832B (en) Spoken language evaluation method and system based on deep learning
CN105632501A (en) Deep-learning-technology-based automatic accent classification method and apparatus
Cai et al. From speaker verification to multispeaker speech synthesis, deep transfer with feedback constraint
JP6506074B2 (en) Acoustic model learning device, speech recognition device, acoustic model learning method, speech recognition method and program
CN111916111B (en) Intelligent voice outbound method and device with emotion, server and storage medium
WO2007114605A1 (en) Acoustic model adaptation methods based on pronunciation variability analysis for enhancing the recognition of voice of non-native speaker and apparatuses thereof
KR102311922B1 (en) Apparatus and method for controlling outputting target information to voice using characteristic of user voice
Abro et al. Qur'an recognition for the purpose of memorisation using Speech Recognition technique
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
Salam et al. Malay isolated speech recognition using neural network: a work in finding number of hidden nodes and learning parameters.
Wang et al. Speech augmentation using wavenet in speech recognition
Rabiee et al. Persian accents identification using an adaptive neural network
WO2019212375A1 (en) Method for obtaining speaker-dependent small high-level acoustic speech attributes
KR20150001191A (en) Apparatus and method for recognizing continuous speech
Ons et al. A self learning vocal interface for speech-impaired users
Ponting Computational Models of Speech Pattern Processing
Tan et al. Denoised senone i-vectors for robust speaker verification
Karanasou et al. I-vector estimation using informative priors for adaptation of deep neural networks
Abraham et al. Articulatory Feature Extraction Using CTC to Build Articulatory Classifiers Without Forced Frame Alignments for Speech Recognition.
CN108182938A (en) A kind of training method of the Mongol acoustic model based on DNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant