CN111667818A - Method and device for training a wake-up model

Info

Publication number
CN111667818A
Authority
CN
China
Prior art keywords: acoustic model, model, voice, awakening, initial
Prior art date
Legal status
Granted
Application number
CN202010461982.XA
Other languages
Chinese (zh)
Other versions
CN111667818B (en)
Inventor
靳源
冯大航
陈孝良
Current Assignee
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN202010461982.XA
Publication of CN111667818A
Application granted
Publication of CN111667818B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0635 Training updating or merging of old and new templates; Mean values; Weighting
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

The invention provides a method and a device for training a wake-up model. The method comprises the following steps: when model training is triggered, obtaining a first training set and a second training set; inputting the first training set into an initial acoustic model and a current acoustic model respectively, and determining a first difference parameter by comparing the output results of the two models; inputting the second training set into the current acoustic model, and determining a second difference parameter by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model; and adjusting the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter. The method keeps the current acoustic model compatible with the initial speech while ensuring adaptation to the current scene, reduces the risk that updating destabilizes the performance of the wake-up model, and ensures that the trained acoustic model remains well compatible with the previous initial wake-up scene while adapting to more complex scenes.

Description

Method and device for training a wake-up model
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method and an apparatus for training a wake-up model.
Background
With the continuous development of technology, smart devices have become increasingly popular. An existing smart device can decide whether to wake itself up by receiving a voice signal input by a user. In one application scenario, a smart device in a to-be-woken state recognizes the user's voice signal through an acoustic model inside the device; when the acoustic model recognizes the input voice signal, the smart device is woken up, so that the user can control other smart home devices through it. Typically, the user records wake-up speech on the smart device in advance, and the device performs model training on the recorded wake-up speech to generate an acoustic model. When the smart device later receives a voice signal input by the user, and the acoustic model recognizes that signal, the signal is determined to be wake-up speech, the device is woken up, and further operations can be performed on it.
The quality of the acoustic model in the smart device directly determines the wake-up performance. In practice, however, the wake-up speech samples pre-recorded when the acoustic model is first built are too few: background noise in the user's environment or the speaker's accent can prevent the model from recognizing the corresponding wake-up speech, leading to false wake-ups or missed wake-ups. Existing wake-up devices therefore continuously collect the new wake-up speech input at each wake-up attempt, label it manually, or filter noise from the user's input through data cleaning, in order to determine whether each utterance should have woken the device, was a false wake-up, or was a missed wake-up. The labeled speech data is then mixed with the previously recorded wake-up speech and the model is retrained, yielding a better acoustic model that replaces the previous one.
However, every time the wake-up device retrains on the mixed wake-up speech, the wake-up word data has to be rearranged for training. Retraining occupies substantial computing resources on the wake-up device, computing the final acoustic model takes a very long time, and because the wake-up model must be rebuilt, the proportions of the training data are hard to control, so the actual wake-up performance of the retrained model is unstable.
Disclosure of Invention
The invention provides a method and a device for training a wake-up model, to solve the problems in the prior art that continuously collecting new wake-up word data, mixing it with previously recorded wake-up word data, retraining a new acoustic model, and replacing the old model with the better-performing one entails a large training workload and a long retraining time, and that the actual wake-up performance of the retrained model is unstable because rebuilding the wake-up model makes the proportions of the training data hard to control.
A first aspect of the present invention provides a method for training a wake-up model, the method comprising:
when model training is triggered, obtaining a first training set and a second training set, wherein the first training set comprises initial speech feature data used to train an initial acoustic model, and the second training set comprises new speech feature data of missed wake-up/false wake-up speech encountered by the acoustic model during wake-up speech recognition;
inputting the first training set into an initial acoustic model and a current acoustic model respectively, and determining a first difference parameter by comparing the output results of the initial acoustic model and the current acoustic model;
inputting the second training set into the current acoustic model, and determining a second difference parameter by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model;
and adjusting the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter.
Optionally, determining the first difference parameter comprises:
obtaining a first probability distribution corresponding to the wake-up speech recognition result output by the initial acoustic model and a second probability distribution corresponding to the wake-up speech recognition result output by the current acoustic model;
and determining a relative entropy from the difference between the first and second probability distributions.
Optionally, determining the second difference parameter comprises:
obtaining a third probability distribution corresponding to the wake-up speech recognition result output by the current acoustic model and a fourth probability distribution corresponding to the one-hot code;
and determining a cross entropy from the difference between the third and fourth probability distributions.
Optionally, after determining the second difference parameter, the method further comprises:
when the current acoustic model is determined to be the initial acoustic model, adjusting the model parameters of the initial acoustic model according to the second difference parameter.
Optionally, adjusting the model parameters of the initial acoustic model according to the second difference parameter comprises:
determining a loss function for adjusting the initial acoustic model according to the second difference parameter, and adjusting the parameters of each network layer in the initial acoustic model using the loss function, to obtain the current acoustic model.
Optionally, adjusting the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter comprises:
determining a loss function for adjusting the current acoustic model according to the first difference parameter and the second difference parameter, and adjusting the parameters of each network layer in the current acoustic model using the loss function.
Optionally, the initial speech feature data or the new speech feature data comprises at least one of:
Mel-frequency cepstral coefficient (MFCC) feature data;
perceptual linear prediction (PLP) feature data;
Mel-scale filter bank (FBANK) feature data.
Optionally, obtaining the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model comprises:
obtaining preset wake-up speech information recognizable by the current acoustic model;
and inputting the preset wake-up speech information into an ASR speech model, and determining, from the recognition result of the ASR speech model, the one-hot code corresponding to the speech unit recognizable by the current acoustic model, wherein the speech unit comprises at least one of a phoneme state, a phoneme, and a word.
A second aspect of the present invention provides an apparatus for training a wake-up model, the apparatus comprising a memory for storing instructions and a processor for reading the instructions in the memory and carrying out a method comprising the following steps:
when model training is triggered, obtaining a first training set and a second training set, wherein the first training set comprises initial speech feature data used to train an initial acoustic model, and the second training set comprises new speech feature data of missed wake-up/false wake-up speech encountered by the acoustic model during wake-up speech recognition;
inputting the first training set into an initial acoustic model and a current acoustic model respectively, and determining a first difference parameter by comparing the output results of the initial acoustic model and the current acoustic model;
inputting the second training set into the current acoustic model, and determining a second difference parameter by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model;
and adjusting the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter.
Optionally, the processor is configured to determine the first difference parameter by:
obtaining a first probability distribution corresponding to the wake-up speech recognition result output by the initial acoustic model and a second probability distribution corresponding to the wake-up speech recognition result output by the current acoustic model;
and determining a relative entropy from the difference between the first and second probability distributions.
Optionally, the processor is configured to determine the second difference parameter by:
obtaining a third probability distribution corresponding to the wake-up speech recognition result output by the current acoustic model and a fourth probability distribution corresponding to the one-hot code;
and determining a cross entropy from the difference between the third and fourth probability distributions.
Optionally, after determining the second difference parameter, the processor is further configured to:
when the current acoustic model is determined to be the initial acoustic model, adjust the model parameters of the initial acoustic model according to the second difference parameter.
Optionally, the processor is configured to adjust the model parameters of the initial acoustic model according to the second difference parameter by:
determining a loss function for adjusting the initial acoustic model according to the second difference parameter, and adjusting the parameters of each network layer in the initial acoustic model using the loss function, to obtain the current acoustic model.
Optionally, the processor is configured to adjust the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter by:
determining a loss function for adjusting the current acoustic model according to the first difference parameter and the second difference parameter, and adjusting the parameters of each network layer in the current acoustic model using the loss function.
Optionally, the initial speech feature data or the new speech feature data comprises at least one of:
Mel-frequency cepstral coefficient (MFCC) feature data;
perceptual linear prediction (PLP) feature data;
Mel-scale filter bank (FBANK) feature data.
Optionally, the processor is configured to obtain the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model by:
obtaining preset wake-up speech information recognizable by the current acoustic model;
and inputting the preset wake-up speech information into an ASR speech model, and determining, from the recognition result of the ASR speech model, the one-hot code corresponding to the speech unit recognizable by the current acoustic model, wherein the speech unit comprises at least one of a phoneme state, a phoneme, and a word.
A third aspect of the present invention provides an apparatus for training a wake-up model, the apparatus comprising the following modules:
a training set acquisition module, configured to obtain a first training set and a second training set when model training is triggered, wherein the first training set comprises initial speech feature data used to train an initial acoustic model, and the second training set comprises new speech feature data of missed wake-up/false wake-up speech encountered by the acoustic model during wake-up speech recognition;
a first difference parameter determining module, configured to input the first training set into an initial acoustic model and a current acoustic model respectively, and determine a first difference parameter by comparing the output results of the initial acoustic model and the current acoustic model;
a second difference parameter determining module, configured to input the second training set into the current acoustic model, and determine a second difference parameter by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model;
and a model adjusting module, configured to adjust the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter.
Optionally, the first difference parameter determining module is configured to determine the first difference parameter by:
obtaining a first probability distribution corresponding to the wake-up speech recognition result output by the initial acoustic model and a second probability distribution corresponding to the wake-up speech recognition result output by the current acoustic model;
and determining a relative entropy from the difference between the first and second probability distributions.
Optionally, the second difference parameter determining module is configured to determine the second difference parameter by:
obtaining a third probability distribution corresponding to the wake-up speech recognition result output by the current acoustic model and a fourth probability distribution corresponding to the one-hot code;
and determining a cross entropy from the difference between the third and fourth probability distributions.
Optionally, the apparatus is further configured to, after the second difference parameter is determined:
when the current acoustic model is determined to be the initial acoustic model, adjust the model parameters of the initial acoustic model according to the second difference parameter.
Optionally, the model adjusting module is configured to adjust the model parameters of the initial acoustic model according to the second difference parameter by:
determining a loss function for adjusting the initial acoustic model according to the second difference parameter, and adjusting the parameters of each network layer in the initial acoustic model using the loss function, to obtain the current acoustic model.
Optionally, the model adjusting module is configured to adjust the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter by:
determining a loss function for adjusting the current acoustic model according to the first difference parameter and the second difference parameter, and adjusting the parameters of each network layer in the current acoustic model using the loss function.
Optionally, the initial speech feature data or the new speech feature data comprises at least one of:
Mel-frequency cepstral coefficient (MFCC) feature data;
perceptual linear prediction (PLP) feature data;
Mel-scale filter bank (FBANK) feature data.
Optionally, the second difference parameter determining module is configured to obtain the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model by:
obtaining preset wake-up speech information recognizable by the current acoustic model;
and inputting the preset wake-up speech information into an ASR speech model, and determining, from the recognition result of the ASR speech model, the one-hot code corresponding to the speech unit recognizable by the current acoustic model, wherein the speech unit comprises at least one of a phoneme state, a phoneme, and a word.
A fourth aspect of the present invention provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the method of training a wake-up model according to any one of the implementations provided in the first aspect of the present invention.
The method for training a wake-up model divides the training set into two parts: initial speech feature data used to train the initial acoustic model, and missed wake-up/false wake-up speech data used to train the current acoustic model. When training the current acoustic model, the one-hot codes from the ASR model ensure adaptation to the actual scene, while the difference parameters computed by running the initial speech feature data through both the initial and the current acoustic model keep the current acoustic model compatible with the initial speech data while adapting to the current scene. This reduces the risk that updating destabilizes wake-up performance, effectively improves the wake-up success rate for speech in both the current scene and the initially trained scene, and makes the trained acoustic model well compatible with the previous initial wake-up scene while adapting to more complex scenes. Compared with the existing approach of continuously collecting new wake-up word data, mixing it with the previously recorded wake-up word data, retraining a new acoustic model, and replacing the old model with the better-performing one, the training workload is reduced.
Drawings
FIG. 1 is a schematic diagram of a wake-up device system;
FIG. 2 is a flow diagram of a method of training a wake-up model;
FIG. 3 is a complete flow diagram of a method of training a wake-up model;
FIG. 4 is a schematic diagram of an apparatus for training a wake-up model;
fig. 5 is a block diagram of an apparatus for training a wake-up model.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For convenience of understanding, terms referred to in the embodiments of the present invention are explained below:
(1) Cross entropy is a common concept in deep learning, generally used to measure the difference between a target and a predicted value. It is an important concept in Shannon's information theory, mainly used to measure the difference between two probability distributions. The performance of a language model is usually measured by cross entropy and perplexity. Cross entropy can be used as a loss function in a neural network: with p denoting the distribution of the true labels and q the distribution predicted by the trained model, the cross-entropy loss function measures the similarity of p and q;
(2) Relative entropy, also known as Kullback-Leibler (KL) divergence or information divergence, is an asymmetric measure of the difference between two probability distributions. In information theory, the relative entropy of p with respect to q equals the cross entropy H(p, q) minus the Shannon entropy H(p), and it serves as the loss function of some optimization algorithms, such as expectation-maximization (EM), which uses relative entropy to represent the information lost when a theoretical distribution is used to fit the true distribution.
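To make the two measures concrete, the following minimal NumPy sketch (not part of the patent; the clipping constant is an assumption to avoid log(0)) computes the cross entropy and relative entropy of two discrete distributions:

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q): cost of predicting distribution q when the target is p."""
    q = np.clip(q, 1e-12, 1.0)  # numerical floor to avoid log(0)
    return -np.sum(p * np.log(q))

def relative_entropy(p, q):
    """D_kl(p || q): asymmetric divergence of q from p."""
    p_safe = np.clip(p, 1e-12, 1.0)
    q_safe = np.clip(q, 1e-12, 1.0)
    return np.sum(p * np.log(p_safe / q_safe))

p_onehot = np.array([1.0, 0.0, 0.0])    # one-hot target distribution
q_model  = np.array([0.8, 0.15, 0.05])  # model posterior
print(cross_entropy(p_onehot, q_model))     # ~0.223
print(relative_entropy(p_onehot, q_model))  # ~0.223, since H(p) = 0 for a one-hot p
```

The example also illustrates the relation stated above: for a one-hot target the entropy H(p) is zero, so cross entropy and relative entropy coincide.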
(3) Mel-frequency cepstral coefficients (MFCC). The Mel frequency scale is derived from human auditory characteristics and has a nonlinear correspondence with frequency in Hz; MFCCs are spectral features computed by exploiting this relationship. MFCCs have been widely used as recognition features in the speech recognition field.
(4) FBANK. FBANK features are computed by the same steps as MFCC features except that the final DCT (cepstrum) step of MFCC extraction is omitted. FBANK features are closer to the response characteristics of the human ear, but adjacent FBANK features are highly correlated (adjacent filter banks overlap), so when phonemes are modeled with an HMM, a cepstral transform is almost always applied first. Since MFCC is computed on top of FBANK, MFCC requires more computation, while FBANK features retain more correlation.
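By way of illustration only, the sketch below extracts FBANK and MFCC features with the librosa library. The patent does not prescribe a toolkit, and the file name, sampling rate, and filter/coefficient counts (40 mel filters, 13 MFCCs) are assumptions; note how the MFCC adds a decorrelating DCT step on top of the mel (FBANK) representation:

```python
import librosa

# Load a (hypothetical) recording of the wake word at 16 kHz.
y, sr = librosa.load("wake_word.wav", sr=16000)

# FBANK: log mel-scale filter bank energies (no DCT step).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)
fbank = librosa.power_to_db(mel)                     # shape: (40, n_frames)

# MFCC: mel filter bank followed by a DCT that decorrelates the features.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)

print(fbank.shape, mfcc.shape)
```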
Considering the false wake-up and missed wake-up problems of existing wake-up word recognition technology, accuracy can be improved by continuously training the wake-up model. Therefore, in this embodiment, the device is controlled to collect wake-up speech data locally or in the cloud, and the wake-up model is updated and optimized based on this data, reducing the device's false wake-up and missed wake-up probabilities and improving the accuracy of wake-up word recognition.
Based on this, in the device wake-up system provided by the embodiment of the present invention, a wake-up device wakes up an awakened device. The wake-up device may be any electronic device capable of receiving speech, such as a smart speaker, a smart phone, or a smart home appliance; the type of device is not limited.
As shown in fig. 1, the system includes a wake-up device 101 and an awakened device 102, which may be the same device or different devices. In an optional implementation, the system may further include a server in communication with the wake-up device. The wake-up device is used to acquire speech feature data, recognize the speech feature data using the acoustic model, and, when the input speech feature data can be recognized, determine whether to wake the awakened device according to the recognition probability. The acoustic model may be generated by the wake-up device training on speech data it has collected, or the collected speech data may be sent to the server, which trains the model and returns the resulting acoustic model data to the wake-up device. Further, the acoustic model may also be generated by training on devices other than the server.
The awakened device 102 may include, but is not limited to, devices that are fixedly installed or move within a small range, such as a smart speaker, a smart television, a smart robot, a smart refrigerator, a smart air conditioner, a smart rice cooker, a smart sensor (such as an infrared sensor, a light sensor, a vibration sensor, or a sound sensor), or a smart water purifier. Alternatively, the awakened device 102 may be a mobile device such as an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), or a smart Bluetooth headset.
The various awakened devices 102 may also be connected to each other via a wired or wireless network, optionally using standard communication techniques and/or protocols. The network is typically the Internet, but may be any network, including but not limited to any combination of local area networks (LANs), metropolitan area networks (MANs), wide area networks (WANs), mobile, wired or wireless networks, private networks, or virtual private networks. In some embodiments, data exchanged over the network is represented using techniques and/or formats including Hypertext Markup Language (HTML), Extensible Markup Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec). In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the techniques described above.
The wake-up device 101 may be connected to the awakened device 102 through the wired or wireless network, and the user may control the awakened device 102 so that the corresponding smart home device performs the desired operation. Optionally, the wake-up device 101 may be a smart terminal, such as a smart phone, a tablet computer, an e-book reader, smart glasses, or a smart watch. For example, the user may use a smart phone to make device A among the smart home devices send data or a signal to device B, or control the temperature of a smart refrigerator among the smart home devices.
When any of these devices trains and generates the acoustic model, the network model can first be trained on initial speech feature data to obtain the initial acoustic model, and then trained on new speech feature data to obtain an adaptive acoustic model. Wake-up speech can be captured through a microphone in the wake-up device to obtain speech feature data, and the speech data transmitted to the wake-up device may be speech data already processed by data cleaning. Generally speaking, speech data contains noise. There are three common methods for cleaning speech data: binning, clustering, and regression. Binning is frequently used: the data to be processed is put into bins according to a certain rule, the data in each bin is then examined, and the speech data is processed with a method chosen according to the actual situation of each bin.
As mentioned above, the acoustic model may be trained either on the server side or on the wake-up device side. The wake-up device or server trains the network model on the initial speech feature data to obtain the initial acoustic model, and the current acoustic model is then continuously trained and optimized on new speech feature data to adapt to the current wake-up environment and further improve wake-up accuracy.
For example, in the embodiment of the present application, after the user inputs wake-up speech to the awakened device, the speech is detected and analyzed by the acoustic model, which yields a posterior probability distribution over the input speech feature data. A decoder computes a wake-up confidence from the posterior probability distribution, and the confidence is compared with a set threshold. If the confidence is greater than or equal to the threshold, the speech is considered to contain the wake-up word text, or to match the current acoustic model, and the device can be woken up; if the confidence is below the threshold, the speech is considered not to contain the wake-up word and the device is not woken up.
Specifically, this embodiment does not limit the preset wake-up word, which may be, for example, "Xiaodu" or "Siri". The wake-up words include wake-up words preset in the server and/or user-defined wake-up words, and the user can delete or add wake-up words later.
The embodiment of the invention provides a method for training a wake-up model, applied to the training process of the wake-up model in a wake-up word detection module. As shown in figure 2, the method comprises the following steps:
Step S201, when model training is triggered, obtaining a first training set and a second training set, wherein the first training set comprises initial speech feature data used to train an initial acoustic model, and the second training set comprises new speech feature data of missed wake-up/false wake-up speech encountered by the acoustic model during wake-up speech recognition.
The initial speech feature data, labeled with wake-up words and non-wake-up words, is input into a preset deep neural network model, and the preset deep neural network model is trained to obtain the initial acoustic model.
The initial speech feature data for wake-up words and non-wake-up words in the first training set can be extracted from manually recorded audio segments specific to the wake-up words or non-wake-up words, or extracted from audio segments collected while the user wakes the device in normal use. The existing acoustic model is used to determine which initial speech feature data should be judged as wake-up speech by the current acoustic model and which should not, and this data is stored on the wake-up device or on the server; alternatively, the received speech is screened manually under human monitoring, and the initial speech feature data that the current acoustic model should judge as wake-up speech, and that which it should judge as non-wake-up speech, are stored on the awakened device or on the server.
The extraction of speech features can be implemented using conventional techniques in the art, and this application does not specifically limit the method adopted in this step. For example, speech feature data can be extracted using any one of the Mel-frequency cepstral coefficient (MFCC) method, the perceptual linear prediction (PLP) method, or the Mel-scale filter bank (FBANK) method.
The preset deep neural network model may be a deep neural network model or a deep neural network-hidden Markov model. The deep neural network comprises a plurality of network layers, each consisting of a fully connected layer and an activation function (usually ReLU or sigmoid), with the last network layer usually consisting of a fully connected layer plus a softmax activation function, as sketched below.
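A minimal PyTorch sketch of this layer structure (stacked fully connected layers with ReLU and a final fully connected layer with softmax) follows; the framework choice and all layer sizes are illustrative assumptions, not part of the patent:

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Stacked fully connected layers with ReLU activations and a final
    fully connected layer followed by softmax, as described above.
    All dimensions are illustrative assumptions."""
    def __init__(self, feat_dim=40, hidden_dim=256, num_units=3, num_layers=4):
        super().__init__()
        layers = []
        in_dim = feat_dim
        for _ in range(num_layers):
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
            in_dim = hidden_dim
        layers.append(nn.Linear(in_dim, num_units))  # last fully connected layer
        self.net = nn.Sequential(*layers)

    def forward(self, frames):
        # frames: (batch, feat_dim); softmax yields a posterior distribution
        # over the modeling units (words, phonemes, or phoneme states).
        return torch.softmax(self.net(frames), dim=-1)

model = AcousticModel()
posteriors = model(torch.randn(5, 40))  # one posterior per input speech frame
```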
The initial acoustic model and the current acoustic model are deep neural network models whose modeling units are any one of words, phonemes, or phoneme states; a speech frame from the speech feature data is input into the current acoustic model, and the current acoustic model outputs a posterior probability distribution.
The modeling process comprises the following steps:
the speech feature data is segmented and aligned using a baseline neural network of the deep neural network model, to obtain the word-level, phoneme (phone)-level, and phoneme state (state)-level labels corresponding to each frame of speech feature data, which form the input and output of the training network of the deep neural network model.
A deep neural network model whose modeling unit is the word takes the feature vector of each frame of the speech feature data as input and the word-level label of each frame as word output, and segments and aligns the input and word output.
A deep neural network model whose modeling unit is the phoneme takes the feature vector of each frame of the speech feature data as input and the phoneme-level label of each frame as phoneme output, and segments and aligns the input and phoneme output.
A deep neural network model whose modeling unit is the phoneme state takes the feature vector of each frame of the speech feature data as input and the phoneme-state-level label of each frame as phoneme-state output, and segments and aligns the input and phoneme-state output.
A phoneme-level label gives, at a certain moment such as time t, the phoneme pronunciation corresponding to each speech feature; a phoneme-state-level label is context-dependent, representing the phoneme state corresponding to the feature at time t by a clustered phoneme-state unit.
As an optional implementation, the output posterior probability distribution is input to a decoder to obtain a wake-up confidence score, and whether the input speech feature data triggers a wake-up is determined by comparing the wake-up confidence score with a wake-up threshold.
specifically, the method for inputting the speech after extracting the features into the current acoustic model to obtain the awakening result mainly includes the following steps: 1. extracting features by using a voice data extraction method; 2. inputting each voice frame with the extracted characteristics into a current acoustic model to obtain the posterior probability distribution of each voice frame; 3. calculating a wake-up confidence corresponding to the posterior probability distribution by using a decoder, judging whether the current acoustic model can be wakened according to input voice characteristic data when the wake-up confidence exceeds a certain threshold value based on the above contents to obtain new voice characteristic data of missed wake-up voice and false wake-up voice, wherein the decoding method of the decoder can be selecting a path with the highest score value in all paths in the decoder or selecting a path meeting a preset rule in the path searching process, and the preset rule is according to a Viterbi (Viterbi Algorithm) decoding Algorithm and the like;
the method for acquiring the new voice characteristic data of missed awakening and mistaken awakening voice in the second training set comprises the following steps:
determining the actual semantics of the received voice as awakening words but not the voice awakened by the current acoustic model as missed awakening voice by receiving a voice judgment instruction determined by the voice screening side according to the received voice; and determining the received voice with the actual semantic meaning of a non-awakening word but awakened by the current acoustic model as a false awakening voice, and obtaining new voice characteristic data in a voice data extraction mode.
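A simple sketch of this screening rule, assuming the screening side supplies the true semantics of each utterance (the function and its names are illustrative, not from the patent):

```python
def label_wake_result(is_wake_word: bool, model_woke: bool) -> str:
    """Classify one screened utterance. `is_wake_word` is the actual
    semantics reported by the speech screening side; `model_woke` is
    whether the current acoustic model triggered a wake-up."""
    if is_wake_word and not model_woke:
        return "missed wake-up"   # feature data goes into the second training set
    if not is_wake_word and model_woke:
        return "false wake-up"    # feature data goes into the second training set
    return "correct"              # not added to the second training set
```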
Step S202, inputting the first training set into the initial acoustic model and the current acoustic model respectively, and determining a first difference parameter by comparing the output results of the two models.
Specifically, determining the first difference parameter includes:
obtaining a first probability distribution corresponding to the wake-up speech recognition result output by the initial acoustic model and a second probability distribution corresponding to the wake-up speech recognition result output by the current acoustic model;
and determining a relative entropy from the difference between the first and second probability distributions.
The first training set is input into the initial acoustic model, which outputs the first probability distribution; the first training set is input into the current acoustic model, which outputs the second probability distribution.
The relative entropy, also called KL (Kullback-Leibler) divergence, measures the difference between two probability distributions and is determined here from the difference between the first and second probability distributions. Computing the relative entropy from two probability distributions is known to those skilled in the art and is not described again here.
Step S203, inputting the second training set into the current acoustic model, and determining a second difference parameter by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model.
Obtaining the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model includes:
obtaining preset wake-up speech information recognizable by the current acoustic model;
and inputting the preset wake-up speech information into an ASR speech model, and determining, from the recognition result of the ASR speech model, the one-hot code corresponding to the speech unit recognizable by the current acoustic model, wherein the speech unit comprises at least one of a phoneme state, a phoneme, and a word.
The ASR speech model comprises an acoustic model, a pronunciation dictionary, and text labels corresponding to the speech feature data. The preset wake-up speech information recognizable by the current acoustic model is input into the ASR speech model, where a decoder obtains the optimal path and its score, and the optimal path is processed into a probability distribution in one-hot form.
For example, suppose a wake-up speech for the wake-up word "xiao" has 3 frames. Table 1 shows the probability distribution in one-hot form for speech units at the phoneme level, where "xiao" consists of the three phonemes "x", "i", and "ao". Each speech frame of the preset wake-up speech information is input into the ASR speech model to obtain the corresponding posterior probability distribution, and a decoder selects an optimal path from the posterior probability distribution. The optimal path may be the highest-scoring path among all paths in the decoder, or a path satisfying a preset rule during the path search, for example according to the Viterbi decoding algorithm; the optimal path is then processed into the probability distribution in one-hot form. The one-hot probability distribution may also be pre-constructed from the wake-up speech recognizable by the current acoustic model; the way of pre-constructing it is known to those skilled in the art and is not described again here.
TABLE 1

                      Phoneme "x"   Phoneme "i"   Phoneme "ao"
Phoneme "x" frame          1             0             0
Phoneme "i" frame          0             1             0
Phoneme "ao" frame         0             0             1
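For illustration, the one-hot targets of Table 1 can be constructed from a per-frame phoneme alignment as in the minimal NumPy sketch below; the phoneme inventory and alignment are taken from the example above, and in practice the alignment would come from the ASR model's decoding:

```python
import numpy as np

phonemes = ["x", "i", "ao"]   # modeling units in the "xiao" example
alignment = ["x", "i", "ao"]  # phoneme label of each of the 3 speech frames

one_hot = np.zeros((len(alignment), len(phonemes)))
for t, phone in enumerate(alignment):
    one_hot[t, phonemes.index(phone)] = 1.0  # 1 for the aligned unit, 0 elsewhere

print(one_hot)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```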
The new speech feature data in the second training set is input into the current acoustic model frame by frame to obtain the output result: each speech frame of the speech feature data is input into the current acoustic model, which outputs a posterior probability distribution.
Table 2 shows the posterior probability distribution obtained by inputting the new speech feature data into the current acoustic model. For example, the new speech feature data also contains 3 frames, and the wake-up word corresponding to the current acoustic model is "xiao", consisting of the three phonemes "x", "i", and "ao"; inputting each speech frame of the new speech feature data into the current acoustic model yields the posterior probability distribution of each frame.
TABLE 2

                   Phoneme "x"   Phoneme "i"   Phoneme "ao"
First frame             0.8           0.3           0.1
Second frame            0.4           0.8           0.6
Third frame             0.1           0.4           0.9

A third probability distribution corresponding to the wake-up speech recognition result output by the current acoustic model and a fourth probability distribution corresponding to the one-hot code are obtained, and the cross entropy is determined from the difference between the third and fourth probability distributions; computing the cross entropy from two probability distributions is known to those skilled in the art and is not described again here.
Step S204, adjusting the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter.
A loss function for adjusting the current acoustic model is determined from the first difference parameter and the second difference parameter, and the parameters of each network layer in the current acoustic model are adjusted using the loss function.
Specifically, the loss function for adjusting the current acoustic model is obtained as a weighted sum of the cross entropy and the relative entropy. Gradient vectors are determined by layer-by-layer differentiation following the chain rule; the gradient vector is the direction in which the loss function increases fastest, so to make the loss function as small as possible, the network layer parameters of the current acoustic model are adjusted in the direction opposite to the gradient. In practice, a learning rate is set manually for each network layer to control the size of each update, and the parameters of each network layer are updated continuously; for a fully connected layer y = wx + b, the parameters w and b of each fully connected layer are updated, as sketched below.
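A minimal sketch of this update rule (not from the patent; the gradients are assumed to have already been obtained by backpropagation):

```python
def sgd_update(w, b, grad_w, grad_b, learning_rate=0.01):
    """Update the parameters of one fully connected layer y = w*x + b.
    The gradient points where the loss grows fastest, so we step the
    opposite way, scaled by the manually set learning rate."""
    w = w - learning_rate * grad_w
    b = b - learning_rate * grad_b
    return w, b

w, b = sgd_update(w=0.5, b=0.1, grad_w=0.2, grad_b=-0.05)
print(w, b)  # 0.498 0.1005
```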
As an optional implementation, after determining the second difference parameter, the method further includes:
when the current acoustic model is determined to be the initial acoustic model, adjusting the model parameters of the initial acoustic model according to the second difference parameter.
Adjusting the model parameters of the initial acoustic model according to the second difference parameter includes:
determining a loss function for adjusting the initial acoustic model according to the second difference parameter, and adjusting the parameters of each network layer in the initial acoustic model using the loss function, to obtain the current acoustic model.
When the current acoustic model is determined to be the initial acoustic model, the loss function for adjusting the initial acoustic model is determined from the second difference parameter, neural network learning is performed with the constructed loss function through the back-propagation algorithm, and the model parameters of the initial acoustic model are adjusted using the loss function. On the first adjustment, because the current acoustic model is the initial acoustic model, the relative entropy computed by inputting the initial speech feature data into both the initial acoustic model and the current acoustic model is 0, and the loss function for adjusting the initial acoustic model is determined from the cross entropy alone.
As stated above, splitting the training set in this way, using the one-hot codes from the ASR model when training the current acoustic model, and computing the difference parameters of the initial speech feature data in both the initial and the current acoustic model keeps the current acoustic model compatible with the initial speech data while adapting to the current scene, reduces the risk that updating destabilizes wake-up performance, and reduces the training workload compared with retraining a new acoustic model on all of the mixed wake-up word data.
As shown in fig. 3, the complete flow of the method for training a wake-up model includes the following steps:
Step S301, when model training is triggered, obtaining a first training set and a second training set, wherein the first training set comprises initial speech feature data used to train an initial acoustic model, and the second training set comprises new speech feature data of missed wake-up/false wake-up speech encountered by the acoustic model during wake-up speech recognition;
Step S302, inputting the second training set into the initial acoustic model, determining a loss function for adjusting the initial acoustic model by comparing the output result of the initial acoustic model with the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model, and adjusting the parameters of each network layer in the initial acoustic model using the loss function, to obtain the current acoustic model.
In particular, the loss function is determined by the cross entropy computed from the difference between the probability distribution in one-hot form and the posterior probability distribution of the new speech feature data input into the initial acoustic model. With p(x) denoting that posterior probability distribution and p_emp(x) denoting the probability distribution in one-hot form, the cross entropy is:

H(p_emp, p) = -Σ_x p_emp(x) log p(x)
The loss function is determined from this cross entropy, and the parameters of each network layer in the initial acoustic model are adjusted according to it, to obtain the current acoustic model.
Step S303, inputting the first training set into the initial acoustic model and the current acoustic model respectively, and determining the relative entropy by comparing the output results of the two models:
a first probability distribution corresponding to the wake-up speech recognition result output by the initial acoustic model and a second probability distribution corresponding to the wake-up speech recognition result output by the current acoustic model are obtained, and the relative entropy is determined from their difference.
With p_si(x) denoting the first probability distribution of the initial speech feature data in the initial acoustic model and p1(x) denoting the second probability distribution of the initial speech feature data in the current acoustic model, the relative entropy is:

D_kl(p_si, p1) = Σ_x p_si(x) log(p_si(x) / p1(x))

Step S304, inputting the second training set into the current acoustic model, and determining the cross entropy by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model.
The loss function again uses the cross entropy computed from the difference between the probability distribution in one-hot form and the posterior probability distribution of the new speech feature data input into the current acoustic model. With p2(x) denoting the third probability distribution of the new speech feature data in the current acoustic model and p_emp(x) denoting the fourth probability distribution in one-hot form, the cross entropy is:

H(p_emp, p2) = -Σ_x p_emp(x) log p2(x)

Step S305, determining the loss function for adjusting the current acoustic model from the relative entropy and the cross entropy, and adjusting the parameters of each network layer in the current acoustic model using the loss function.
The loss function defined from the cross entropy and the relative entropy is:

J_kld = (1 - α) H(p_emp, p2) + α D_kl(p_si, p1)

Here α is a weight coefficient balancing the cross entropy and the KL divergence; it is generally set empirically to a fixed value of 0.25, and increased when the amount of feature data is large. The gradient vectors are determined by layer-by-layer differentiation following the chain rule, and the network layer parameters of the current acoustic model are adjusted in the direction opposite to the gradient. A sketch of this combined loss follows.
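For illustration, a PyTorch-style sketch of the combined loss defined above; the framework and tensor shapes are assumptions, the patent specifies only the formula:

```python
import torch

def wake_model_loss(p_emp, p2, p_si, p1, alpha=0.25):
    """J_kld = (1 - alpha) * H(p_emp, p2) + alpha * D_kl(p_si, p1).
    p_emp: one-hot targets for the new data; p2: current-model posteriors
    on the new data; p_si / p1: initial- and current-model posteriors on
    the initial data. All tensors are (frames, units), averaged over frames."""
    eps = 1e-12  # numerical floor to avoid log(0)
    cross_entropy = -(p_emp * torch.log(p2 + eps)).sum(dim=-1).mean()
    kl_divergence = (p_si * torch.log((p_si + eps) / (p1 + eps))).sum(dim=-1).mean()
    return (1 - alpha) * cross_entropy + alpha * kl_divergence
```

On the very first pass the current model is still the initial model, so p1 equals p_si, the KL term vanishes, and the loss reduces to the cross entropy, matching the initial-model case in step S302.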
Finally, after the current acoustic model is obtained, the speech feature data of the wake-up speech is input into the current acoustic model for recognition, yielding the posterior probability distribution of each speech frame in the speech feature data. A decoder computes the wake-up confidence corresponding to the posterior probability distribution, and whether to send a wake-up command to the awakened device is determined according to the wake-up confidence and the wake-up threshold.
The embodiment of the invention provides an apparatus for training a wake-up model, the apparatus comprising a memory for storing instructions and a processor.
Fig. 4 shows an apparatus for training a wake-up model according to an embodiment of the present invention. The apparatus 400 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 401 and memory 402, as well as one or more storage media 403 (e.g., one or more mass storage devices) storing applications 404 or data 406. The memory 402 and the storage medium 403 may be transient or persistent storage. The program stored in the storage medium 403 may include one or more modules (not shown), each of which may include a series of instruction operations for the apparatus. Further, the processor 401 may be configured to communicate with the storage medium 403 and execute on the apparatus 400 the series of instruction operations in the storage medium 403.
The apparatus 400 may also include one or more power supplies 409, one or more wired or wireless network interfaces 407, one or more input/output interfaces 408, and/or one or more operating systems 405, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD.
The processor is used for reading the instructions in the memory to implement the following method steps:
when model training is triggered, acquiring a first training set and a second training set, wherein the first training set comprises initial voice feature data used for training an initial acoustic model, and the second training set comprises new voice feature data of missed-wake-up/false-wake-up voice encountered by the acoustic model during wake-up voice recognition;
inputting the first training set into the initial acoustic model and the current acoustic model respectively, and determining a first difference parameter by comparing the output results of the two models;
inputting the second training set into the current acoustic model, and determining a second difference parameter by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up voice recognizable by the current acoustic model;
and adjusting the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter.
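Tying these four steps together, a hedged sketch of one adaptation pass might look as follows, reusing the j_kld function sketched above; the loader, optimizer, and model objects are assumptions of this sketch and are not specified by the patent:

    import torch

    def adaptation_epoch(initial_model, current_model, loader, optimizer):
        # One pass over paired batches from the first and second training sets.
        initial_model.eval()                             # frozen reference model
        for feats_1, feats_2, labels_2 in loader:        # hypothetical loader
            with torch.no_grad():
                logits_init_1 = initial_model(feats_1)   # reference outputs
            logits_cur_1 = current_model(feats_1)
            logits_cur_2 = current_model(feats_2)
            loss = j_kld(logits_init_1, logits_cur_1, logits_cur_2, labels_2)
            optimizer.zero_grad()
            loss.backward()                              # chain-rule gradients
            optimizer.step()                             # step against the gradient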
Optionally, the processor is configured to determine the first difference parameter by:
acquiring a first probability distribution corresponding to an awakening voice recognition result output by the initial acoustic model and a second probability distribution corresponding to an awakening voice recognition result output by the current acoustic model;
determining a relative entropy from a difference of the first and second probability distributions.
Optionally, the processor is configured to determine the second difference parameter by:
acquiring a third probability distribution corresponding to the awakening voice recognition result output by the current acoustic model and a fourth probability distribution corresponding to the one-hot code;
and determining the cross entropy according to the difference of the third probability distribution and the fourth probability distribution.
Optionally, the processor is further configured to, after the second difference parameter is determined:
and when the current acoustic model is determined to be the initial acoustic model, adjusting the model parameters of the initial acoustic model according to the second difference parameters.
Optionally, the processor is configured to adjust the model parameters of the initial acoustic model according to the second difference parameter by:
and determining a loss function for adjusting the initial acoustic model according to the second difference parameter, and adjusting parameters of each network layer in the initial acoustic model by using the loss function to obtain the current acoustic model.
Optionally, the processor is configured to adjust the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter by:
and determining a loss function for adjusting the current acoustic model according to the first difference parameter and the second difference parameter, and adjusting parameters of each network layer in the current acoustic model by using the loss function.
Optionally, the initial speech feature data or the new speech feature data determined by the processor comprises at least one of the following (a brief feature-extraction sketch follows this list):
Mel-frequency cepstral coefficient (MFCC) feature data;
perceptual linear prediction (PLP) feature data;
Mel-scale filter bank (FBANK) feature data.
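As a hedged illustration of the first and third feature types, the snippet below uses the open-source librosa library (PLP extraction is not part of librosa and would need another toolkit); the file name and parameter values are arbitrary choices for this sketch:

    import numpy as np
    import librosa

    y, sr = librosa.load("wake_word.wav", sr=16000)      # hypothetical file
    # MFCC feature data.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # Mel-scale filter bank (FBANK) feature data: log mel spectrogram.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)
    fbank = np.log(mel + 1e-6)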
Optionally, the processor is configured to obtain the one-hot code corresponding to the wake-up voice recognizable by the current acoustic model by:
acquiring preset wake-up voice information recognizable by the current acoustic model;
and inputting the preset wake-up voice information into an ASR speech model, and determining, according to the recognition result of the ASR speech model, the one-hot code corresponding to each speech unit recognizable by the current acoustic model, wherein a speech unit comprises at least one of a phoneme state, a phoneme, and a word.
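A minimal sketch of the one-hot coding itself, assuming a hypothetical inventory of speech units returned by the ASR model; the unit names here are invented for illustration:

    import numpy as np

    units = ["sil", "ni", "hao", "xiao", "ai"]     # hypothetical unit inventory
    unit_to_id = {u: i for i, u in enumerate(units)}

    def one_hot(unit):
        # p_emp(x): probability 1 at the labeled unit, 0 elsewhere.
        vec = np.zeros(len(units), dtype=np.float32)
        vec[unit_to_id[unit]] = 1.0
        return vec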
An embodiment of the present invention provides a device for training a wake-up model; as shown in fig. 5, the device includes the following modules:
a training set obtaining module 501, configured to acquire, when model training is triggered, a first training set and a second training set, wherein the first training set comprises initial voice feature data used for training an initial acoustic model, and the second training set comprises new voice feature data of missed-wake-up/false-wake-up voice encountered by the acoustic model during wake-up voice recognition;
a first difference parameter determining module 502, configured to input the first training set into the initial acoustic model and the current acoustic model respectively, and determine a first difference parameter by comparing the output results of the two models;
a second difference parameter determining module 503, configured to input the second training set into the current acoustic model, and determine a second difference parameter by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up voice recognizable by the current acoustic model;
a model adjusting module 504, configured to adjust the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter.
Optionally, the first difference parameter determining module 502 is configured to determine the first difference parameter by:
acquiring a first probability distribution corresponding to an awakening voice recognition result output by the initial acoustic model and a second probability distribution corresponding to an awakening voice recognition result output by the current acoustic model;
determining a relative entropy from a difference of the first and second probability distributions.
Optionally, the second difference parameter determining module 503 is configured to determine the second difference parameter by:
acquiring a third probability distribution corresponding to the awakening voice recognition result output by the current acoustic model and a fourth probability distribution corresponding to the one-hot code;
and determining the cross entropy according to the difference of the third probability distribution and the fourth probability distribution.
Optionally, the device further comprises a current acoustic model determining module 505, configured to, after the second difference parameter is determined:
and when the current acoustic model is determined to be the initial acoustic model, adjusting the model parameters of the initial acoustic model according to the second difference parameters.
Optionally, the current acoustic model determining module 505 is configured to adjust the model parameters of the initial acoustic model according to the second difference parameter by:
and determining a loss function for adjusting the initial acoustic model according to the second difference parameter, and adjusting parameters of each network layer in the initial acoustic model by using the loss function to obtain the current acoustic model.
Optionally, the model adjusting module 504 is configured to adjust the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter by:
and determining a loss function for adjusting the current acoustic model according to the first difference parameter and the second difference parameter, and adjusting parameters of each network layer in the current acoustic model by using the loss function.
Optionally, the initial speech feature data or the new speech feature data comprises at least one of:
Mel-frequency cepstral coefficient (MFCC) feature data;
perceptual linear prediction (PLP) feature data;
Mel-scale filter bank (FBANK) feature data.
Optionally, the second difference parameter determining module 503 is configured to obtain the one-hot code corresponding to the wake-up voice recognizable by the current acoustic model by:
acquiring preset wake-up voice information recognizable by the current acoustic model;
and inputting the preset wake-up voice information into an ASR speech model, and determining, according to the recognition result of the ASR speech model, the one-hot code corresponding to each speech unit recognizable by the current acoustic model, wherein a speech unit comprises at least one of a phoneme state, a phoneme, and a word.
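For orientation only, here is a skeleton of how the four modules of fig. 5 might be wired together in code; the class and method names are invented for this sketch and are not part of the patent:

    class WakeModelTrainer:
        def __init__(self, initial_model, current_model):
            self.initial_model = initial_model   # frozen reference
            self.current_model = current_model   # model being adapted

        def acquire_training_sets(self):
            # module 501: return (first_set, second_set) when training triggers
            raise NotImplementedError

        def first_difference(self, logits_init, logits_cur):
            # module 502: relative entropy between the two output distributions
            raise NotImplementedError

        def second_difference(self, logits_cur, one_hot_labels):
            # module 503: cross entropy against the one-hot codes
            raise NotImplementedError

        def adjust(self, kl, ce, alpha=0.25):
            # module 504: combine the two difference parameters into J_kld
            return (1 - alpha) * ce + alpha * kl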
An embodiment of the present invention provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the method for training a wake-up model according to any one of the above embodiments.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (11)

1. A method of training a wake-up model, the method comprising:
when model training is triggered, acquiring a first training set and a second training set, wherein the first training set comprises initial voice feature data used for training an initial acoustic model, and the second training set comprises new voice feature data of missed-wake-up/false-wake-up voice encountered by the acoustic model during wake-up voice recognition;
inputting the first training set into the initial acoustic model and the current acoustic model respectively, and determining a first difference parameter by comparing the output results of the two models;
inputting the second training set into the current acoustic model, and determining a second difference parameter by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up voice recognizable by the current acoustic model;
and adjusting the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter.
2. The method of claim 1, wherein determining a first difference parameter comprises:
acquiring a first probability distribution corresponding to an awakening voice recognition result output by the initial acoustic model and a second probability distribution corresponding to an awakening voice recognition result output by the current acoustic model;
determining a relative entropy from a difference of the first and second probability distributions.
3. The method of claim 1, wherein determining a second difference parameter comprises:
acquiring a third probability distribution corresponding to the awakening voice recognition result output by the current acoustic model and a fourth probability distribution corresponding to the one-hot code;
and determining the cross entropy according to the difference of the third probability distribution and the fourth probability distribution.
4. The method of claim 1, wherein after the second difference parameter is determined, the method further comprises:
and when the current acoustic model is determined to be the initial acoustic model, adjusting the model parameters of the initial acoustic model according to the second difference parameters.
5. The method of claim 4, wherein adjusting model parameters of an initial acoustic model based on the second difference parameter comprises:
and determining a loss function for adjusting the initial acoustic model according to the second difference parameter, and adjusting parameters of each network layer in the initial acoustic model by using the loss function to obtain the current acoustic model.
6. The method of claim 1, wherein adjusting the model parameters of the current acoustic model based on the first difference parameter and the second difference parameter comprises:
and determining a loss function for adjusting the current acoustic model according to the first difference parameter and the second difference parameter, and adjusting parameters of each network layer in the current acoustic model by using the loss function.
7. The method of claim 1, wherein the initial speech feature data or the new speech feature data comprises at least one of:
Mel-frequency cepstral coefficient (MFCC) feature data;
perceptual linear prediction (PLP) feature data;
Mel-scale filter bank (FBANK) feature data.
8. The method of claim 1, wherein obtaining the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model comprises:
acquiring preset wake-up voice information recognizable by the current acoustic model;
and inputting the preset wake-up voice information into an ASR speech model, and determining, according to the recognition result of the ASR speech model, the one-hot code corresponding to each speech unit recognizable by the current acoustic model, wherein a speech unit comprises at least one of a phoneme state, a phoneme, and a word.
9. An apparatus for training a wake-up model, the apparatus comprising: a memory to store instructions;
a processor for reading the instructions in the memory to implement a method for training a wake-up model according to any one of claims 1 to 8.
10. An apparatus for training a wake-up model, the apparatus comprising:
the training set acquisition module is used for acquiring, when model training is triggered, a first training set and a second training set, wherein the first training set comprises initial voice feature data used for training an initial acoustic model, and the second training set comprises new voice feature data of missed-wake-up/false-wake-up voice encountered by the acoustic model during wake-up voice recognition;
the first difference parameter determining module is used for inputting the first training set into the initial acoustic model and the current acoustic model respectively, and determining a first difference parameter by comparing the output results of the two models;
the second difference parameter determining module is used for inputting the second training set into the current acoustic model, and determining a second difference parameter by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up voice recognizable by the current acoustic model;
and the model adjusting module is used for adjusting the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter.
11. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement a method of training a wake-up model as claimed in any one of claims 1 to 8.
CN202010461982.XA 2020-05-27 2020-05-27 Method and device for training wake-up model Active CN111667818B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010461982.XA CN111667818B (en) 2020-05-27 2020-05-27 Method and device for training wake-up model

Publications (2)

Publication Number Publication Date
CN111667818A true CN111667818A (en) 2020-09-15
CN111667818B CN111667818B (en) 2023-10-10

Family

ID=72384785

Country Status (1)

CN (1): CN111667818B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10332513B1 (en) * 2016-06-27 2019-06-25 Amazon Technologies, Inc. Voice enablement and disablement of speech processing functionality
CN107221326A (en) * 2017-05-16 2017-09-29 百度在线网络技术(北京)有限公司 Voice awakening method, device and computer equipment based on artificial intelligence
CN107610702A (en) * 2017-09-22 2018-01-19 百度在线网络技术(北京)有限公司 Terminal device standby wakeup method, apparatus and computer equipment
CN108335696A (en) * 2018-02-09 2018-07-27 百度在线网络技术(北京)有限公司 Voice awakening method and device
CN110459204A (en) * 2018-05-02 2019-11-15 Oppo广东移动通信有限公司 Audio recognition method, device, storage medium and electronic equipment
CN109545194A (en) * 2018-12-26 2019-03-29 出门问问信息科技有限公司 Wake up word pre-training method, apparatus, equipment and storage medium
CN109801636A (en) * 2019-01-29 2019-05-24 北京猎户星空科技有限公司 Training method, device, electronic equipment and the storage medium of Application on Voiceprint Recognition model
CN110534099A (en) * 2019-09-03 2019-12-03 腾讯科技(深圳)有限公司 Voice wakes up processing method, device, storage medium and electronic equipment
CN110808027A (en) * 2019-11-05 2020-02-18 腾讯科技(深圳)有限公司 Voice synthesis method and device and news broadcasting method and system

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185382A (en) * 2020-09-30 2021-01-05 北京猎户星空科技有限公司 Method, device, equipment and medium for generating and updating wake-up model
CN112185382B (en) * 2020-09-30 2024-03-08 北京猎户星空科技有限公司 Method, device, equipment and medium for generating and updating wake-up model
CN112435656A (en) * 2020-12-11 2021-03-02 平安科技(深圳)有限公司 Model training method, voice recognition method, device, equipment and storage medium
CN112435656B (en) * 2020-12-11 2024-03-01 平安科技(深圳)有限公司 Model training method, voice recognition method, device, equipment and storage medium
CN112712801B (en) * 2020-12-14 2024-02-02 北京有竹居网络技术有限公司 Voice wakeup method and device, electronic equipment and storage medium
CN112712801A (en) * 2020-12-14 2021-04-27 北京有竹居网络技术有限公司 Voice wake-up method and device, electronic equipment and storage medium
WO2022127620A1 (en) * 2020-12-14 2022-06-23 北京有竹居网络技术有限公司 Voice wake-up method and apparatus, electronic device, and storage medium
CN113096647A (en) * 2021-04-08 2021-07-09 北京声智科技有限公司 Voice model training method and device and electronic equipment
CN113608664A (en) * 2021-07-26 2021-11-05 京东科技控股股份有限公司 Intelligent voice robot interaction effect optimization method and device and intelligent robot
CN113782016A (en) * 2021-08-06 2021-12-10 佛山市顺德区美的电子科技有限公司 Wake-up processing method, device, equipment and computer storage medium
CN113782016B (en) * 2021-08-06 2023-05-05 佛山市顺德区美的电子科技有限公司 Wakeup processing method, wakeup processing device, equipment and computer storage medium
CN113436629A (en) * 2021-08-27 2021-09-24 中国科学院自动化研究所 Voice control method and device, electronic equipment and storage medium
CN113436629B (en) * 2021-08-27 2024-06-04 中国科学院自动化研究所 Voice control method, voice control device, electronic equipment and storage medium
CN113782012B (en) * 2021-09-10 2024-03-08 北京声智科技有限公司 Awakening model training method, awakening method and electronic equipment
CN113782012A (en) * 2021-09-10 2021-12-10 北京声智科技有限公司 Wake-up model training method, wake-up method and electronic equipment
CN114565807A (en) * 2022-03-03 2022-05-31 腾讯科技(深圳)有限公司 Method and device for training target image retrieval model

Also Published As

Publication number Publication date
CN111667818B (en) 2023-10-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant