CN111667818A - Method and device for training a wake-up model

Info

Publication number
CN111667818A
Authority
CN
China
Prior art keywords: acoustic model, model, voice, awakening, initial
Prior art date
Legal status
Granted
Application number
CN202010461982.XA
Other languages
Chinese (zh)
Other versions
CN111667818B (en)
Inventor
靳源
冯大航
陈孝良
Current Assignee
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN202010461982.XA
Publication of CN111667818A
Application granted
Publication of CN111667818B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0635 Training updating or merging of old and new templates; Mean values; Weighting
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

The invention provides a method and a device for training a wake-up model. The method comprises the following steps: when model training is triggered, obtaining a first training set and a second training set; inputting the first training set into an initial acoustic model and a current acoustic model respectively, and determining a first difference parameter by comparing the output results of the two models; inputting the second training set into the current acoustic model, and determining a second difference parameter by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model; and adjusting the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter. The method keeps the current acoustic model compatible with the initial speech while ensuring adaptation to the current scene, reduces the risk that updating destabilizes the performance of the wake-up model, and ensures that the trained acoustic model remains well compatible with the previous initial wake-up scene while adapting to more complex scenes.

Description

Method and device for training a wake-up model
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method and an apparatus for training a wake-up model.
Background
With the continuous development of technology, smart devices have become increasingly popular. An existing smart device can decide whether to wake itself up by receiving a voice signal input by a user. In one application scenario, a smart device in a to-be-woken state recognizes the user's voice signal through an acoustic model inside the device; when the acoustic model recognizes the input voice signal, the smart device is woken up, so that the user can control other smart home devices through it. Typically, the user records wake-up speech on the smart device in advance, and the device performs model training on the recorded wake-up speech to generate an acoustic model. When the smart device later receives a voice signal input by the user, and the acoustic model recognizes that signal, the signal is determined to be wake-up speech, the device is woken up, and further operations can be performed on it.
The quality of the acoustic model in the smart device directly determines the wake-up performance. In practice, however, the wake-up speech samples pre-recorded when the acoustic model is first built are too few: background noise in the user's environment or the speaker's accent can prevent the model from recognizing the corresponding wake-up speech, leading to false wake-ups or missed wake-ups. Existing wake-up devices therefore continuously collect the new wake-up speech input at each wake-up attempt, label it manually, or filter noise from the user's input through data cleaning, in order to determine whether each utterance should have woken the device, was a false wake-up, or was a missed wake-up. The labeled speech data is then mixed with the previously recorded wake-up speech and the model is retrained, yielding a better acoustic model that replaces the previous one.
However, every time the wake-up device retrains on the mixed wake-up speech, the wake-up word data has to be rearranged for training. Retraining occupies substantial computing resources on the wake-up device, computing the final acoustic model takes a very long time, and because the wake-up model must be rebuilt, the proportions of the training data are hard to control, so the actual wake-up performance of the retrained model is unstable.
Disclosure of Invention
The invention provides a method and a device for training a wake-up model, to solve the problems in the prior art that continuously collecting new wake-up word data, mixing it with previously recorded wake-up word data, retraining a new acoustic model, and replacing the old model with the better-performing one entails a large training workload and a long retraining time, and that the actual wake-up performance of the retrained model is unstable because rebuilding the wake-up model makes the proportions of the training data hard to control.
A first aspect of the present invention provides a method for training a wake-up model, the method comprising:
when model training is triggered, obtaining a first training set and a second training set, wherein the first training set comprises initial speech feature data used to train an initial acoustic model, and the second training set comprises new speech feature data of missed wake-up/false wake-up speech encountered by the acoustic model during wake-up speech recognition;
inputting the first training set into an initial acoustic model and a current acoustic model respectively, and determining a first difference parameter by comparing the output results of the initial acoustic model and the current acoustic model;
inputting the second training set into the current acoustic model, and determining a second difference parameter by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model;
and adjusting the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter.
Optionally, determining the first difference parameter comprises:
obtaining a first probability distribution corresponding to the wake-up speech recognition result output by the initial acoustic model and a second probability distribution corresponding to the wake-up speech recognition result output by the current acoustic model;
and determining a relative entropy from the difference between the first and second probability distributions.
Optionally, determining the second difference parameter comprises:
obtaining a third probability distribution corresponding to the wake-up speech recognition result output by the current acoustic model and a fourth probability distribution corresponding to the one-hot code;
and determining a cross entropy from the difference between the third and fourth probability distributions.
Optionally, after determining the second difference parameter, the method further comprises:
when the current acoustic model is determined to be the initial acoustic model, adjusting the model parameters of the initial acoustic model according to the second difference parameter.
Optionally, adjusting the model parameters of the initial acoustic model according to the second difference parameter comprises:
determining a loss function for adjusting the initial acoustic model according to the second difference parameter, and adjusting the parameters of each network layer in the initial acoustic model using the loss function, to obtain the current acoustic model.
Optionally, adjusting the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter comprises:
determining a loss function for adjusting the current acoustic model according to the first difference parameter and the second difference parameter, and adjusting the parameters of each network layer in the current acoustic model using the loss function.
Optionally, the initial speech feature data or the new speech feature data comprises at least one of:
Mel-frequency cepstral coefficient (MFCC) feature data;
perceptual linear prediction (PLP) feature data;
Mel-scale filter bank (FBANK) feature data.
Optionally, obtaining the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model comprises:
obtaining preset wake-up speech information recognizable by the current acoustic model;
and inputting the preset wake-up speech information into an ASR speech model, and determining, from the recognition result of the ASR speech model, the one-hot code corresponding to the speech unit recognizable by the current acoustic model, wherein the speech unit comprises at least one of a phoneme state, a phoneme, and a word.
A second aspect of the present invention provides an apparatus for training a wake-up model, the apparatus comprising a memory for storing instructions and a processor for reading the instructions in the memory and carrying out a method comprising the following steps:
when model training is triggered, obtaining a first training set and a second training set, wherein the first training set comprises initial speech feature data used to train an initial acoustic model, and the second training set comprises new speech feature data of missed wake-up/false wake-up speech encountered by the acoustic model during wake-up speech recognition;
inputting the first training set into an initial acoustic model and a current acoustic model respectively, and determining a first difference parameter by comparing the output results of the initial acoustic model and the current acoustic model;
inputting the second training set into the current acoustic model, and determining a second difference parameter by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model;
and adjusting the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter.
Optionally, the processor is configured to determine the first difference parameter by:
obtaining a first probability distribution corresponding to the wake-up speech recognition result output by the initial acoustic model and a second probability distribution corresponding to the wake-up speech recognition result output by the current acoustic model;
and determining a relative entropy from the difference between the first and second probability distributions.
Optionally, the processor is configured to determine the second difference parameter by:
obtaining a third probability distribution corresponding to the wake-up speech recognition result output by the current acoustic model and a fourth probability distribution corresponding to the one-hot code;
and determining a cross entropy from the difference between the third and fourth probability distributions.
Optionally, after determining the second difference parameter, the processor is further configured to:
when the current acoustic model is determined to be the initial acoustic model, adjust the model parameters of the initial acoustic model according to the second difference parameter.
Optionally, the processor is configured to adjust the model parameters of the initial acoustic model according to the second difference parameter by:
determining a loss function for adjusting the initial acoustic model according to the second difference parameter, and adjusting the parameters of each network layer in the initial acoustic model using the loss function, to obtain the current acoustic model.
Optionally, the processor is configured to adjust the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter by:
determining a loss function for adjusting the current acoustic model according to the first difference parameter and the second difference parameter, and adjusting the parameters of each network layer in the current acoustic model using the loss function.
Optionally, the initial speech feature data or the new speech feature data comprises at least one of:
Mel-frequency cepstral coefficient (MFCC) feature data;
perceptual linear prediction (PLP) feature data;
Mel-scale filter bank (FBANK) feature data.
Optionally, the processor is configured to obtain the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model by:
obtaining preset wake-up speech information recognizable by the current acoustic model;
and inputting the preset wake-up speech information into an ASR speech model, and determining, from the recognition result of the ASR speech model, the one-hot code corresponding to the speech unit recognizable by the current acoustic model, wherein the speech unit comprises at least one of a phoneme state, a phoneme, and a word.
A third aspect of the present invention provides an apparatus for training a wake-up model, the apparatus comprising the following modules:
a training set acquisition module, configured to obtain a first training set and a second training set when model training is triggered, wherein the first training set comprises initial speech feature data used to train an initial acoustic model, and the second training set comprises new speech feature data of missed wake-up/false wake-up speech encountered by the acoustic model during wake-up speech recognition;
a first difference parameter determining module, configured to input the first training set into an initial acoustic model and a current acoustic model respectively, and determine a first difference parameter by comparing the output results of the initial acoustic model and the current acoustic model;
a second difference parameter determining module, configured to input the second training set into the current acoustic model, and determine a second difference parameter by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model;
and a model adjusting module, configured to adjust the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter.
Optionally, the first difference parameter determining module is configured to determine the first difference parameter by:
obtaining a first probability distribution corresponding to the wake-up speech recognition result output by the initial acoustic model and a second probability distribution corresponding to the wake-up speech recognition result output by the current acoustic model;
and determining a relative entropy from the difference between the first and second probability distributions.
Optionally, the second difference parameter determining module is configured to determine the second difference parameter by:
obtaining a third probability distribution corresponding to the wake-up speech recognition result output by the current acoustic model and a fourth probability distribution corresponding to the one-hot code;
and determining a cross entropy from the difference between the third and fourth probability distributions.
Optionally, the apparatus is further configured to, after the second difference parameter is determined:
when the current acoustic model is determined to be the initial acoustic model, adjust the model parameters of the initial acoustic model according to the second difference parameter.
Optionally, the model adjusting module is configured to adjust the model parameters of the initial acoustic model according to the second difference parameter by:
determining a loss function for adjusting the initial acoustic model according to the second difference parameter, and adjusting the parameters of each network layer in the initial acoustic model using the loss function, to obtain the current acoustic model.
Optionally, the model adjusting module is configured to adjust the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter by:
determining a loss function for adjusting the current acoustic model according to the first difference parameter and the second difference parameter, and adjusting the parameters of each network layer in the current acoustic model using the loss function.
Optionally, the initial speech feature data or the new speech feature data comprises at least one of:
Mel-frequency cepstral coefficient (MFCC) feature data;
perceptual linear prediction (PLP) feature data;
Mel-scale filter bank (FBANK) feature data.
Optionally, the second difference parameter determining module is configured to obtain the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model by:
obtaining preset wake-up speech information recognizable by the current acoustic model;
and inputting the preset wake-up speech information into an ASR speech model, and determining, from the recognition result of the ASR speech model, the one-hot code corresponding to the speech unit recognizable by the current acoustic model, wherein the speech unit comprises at least one of a phoneme state, a phoneme, and a word.
A fourth aspect of the present invention provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the method of training a wake-up model according to any one of the implementations provided in the first aspect of the present invention.
The method for training a wake-up model divides the training set into two parts: initial speech feature data used to train the initial acoustic model, and missed wake-up/false wake-up speech data used to train the current acoustic model. When training the current acoustic model, the one-hot codes from the ASR model ensure adaptation to the actual scene, while the difference parameters computed by running the initial speech feature data through both the initial and the current acoustic model keep the current acoustic model compatible with the initial speech data while adapting to the current scene. This reduces the risk that updating destabilizes wake-up performance, effectively improves the wake-up success rate for speech in both the current scene and the initially trained scene, and makes the trained acoustic model well compatible with the previous initial wake-up scene while adapting to more complex scenes. Compared with the existing approach of continuously collecting new wake-up word data, mixing it with the previously recorded wake-up word data, retraining a new acoustic model, and replacing the old model with the better-performing one, the training workload is reduced.
Drawings
FIG. 1 is a schematic diagram of a wake-up device system;
FIG. 2 is a flow diagram of a method of training a wake-up model;
FIG. 3 is a complete flow diagram of a method of training a wake-up model;
FIG. 4 is a schematic diagram of an apparatus for training a wake-up model;
fig. 5 is a block diagram of an apparatus for training a wake-up model.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For convenience of understanding, terms referred to in the embodiments of the present invention are explained below:
(1) Cross entropy is a common concept in deep learning, generally used to measure the difference between a target and a predicted value. It is an important concept in Shannon's information theory, mainly used to measure the difference between two probability distributions. The performance of a language model is usually measured by cross entropy and perplexity. Cross entropy can be used as a loss function in a neural network: with p denoting the distribution of the true labels and q the distribution predicted by the trained model, the cross-entropy loss function measures the similarity of p and q;
(2) Relative entropy, also known as Kullback-Leibler (KL) divergence or information divergence, is an asymmetric measure of the difference between two probability distributions. In information theory, the relative entropy of p with respect to q equals the cross entropy H(p, q) minus the Shannon entropy H(p), and it serves as the loss function of some optimization algorithms, such as expectation-maximization (EM), which uses relative entropy to represent the information lost when a theoretical distribution is used to fit the true distribution.
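To make the two measures concrete, the following minimal NumPy sketch (not part of the patent; the clipping constant is an assumption to avoid log(0)) computes the cross entropy and relative entropy of two discrete distributions:

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q): cost of predicting distribution q when the target is p."""
    q = np.clip(q, 1e-12, 1.0)  # numerical floor to avoid log(0)
    return -np.sum(p * np.log(q))

def relative_entropy(p, q):
    """D_kl(p || q): asymmetric divergence of q from p."""
    p_safe = np.clip(p, 1e-12, 1.0)
    q_safe = np.clip(q, 1e-12, 1.0)
    return np.sum(p * np.log(p_safe / q_safe))

p_onehot = np.array([1.0, 0.0, 0.0])    # one-hot target distribution
q_model  = np.array([0.8, 0.15, 0.05])  # model posterior
print(cross_entropy(p_onehot, q_model))     # ~0.223
print(relative_entropy(p_onehot, q_model))  # ~0.223, since H(p) = 0 for a one-hot p
```

The example also illustrates the relation stated above: for a one-hot target the entropy H(p) is zero, so cross entropy and relative entropy coincide.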
(3) Mel-frequency cepstral coefficients (MFCC). The Mel frequency scale is derived from human auditory characteristics and has a nonlinear correspondence with frequency in Hz; MFCCs are spectral features computed by exploiting this relationship. MFCCs have been widely used as recognition features in the speech recognition field.
(4) FBANK. FBANK features are computed by the same steps as MFCC features except that the final DCT (cepstrum) step of MFCC extraction is omitted. FBANK features are closer to the response characteristics of the human ear, but adjacent FBANK features are highly correlated (adjacent filter banks overlap), so when phonemes are modeled with an HMM, a cepstral transform is almost always applied first. Since MFCC is computed on top of FBANK, MFCC requires more computation, while FBANK features retain more correlation.
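By way of illustration only, the sketch below extracts FBANK and MFCC features with the librosa library. The patent does not prescribe a toolkit, and the file name, sampling rate, and filter/coefficient counts (40 mel filters, 13 MFCCs) are assumptions; note how the MFCC adds a decorrelating DCT step on top of the mel (FBANK) representation:

```python
import librosa

# Load a (hypothetical) recording of the wake word at 16 kHz.
y, sr = librosa.load("wake_word.wav", sr=16000)

# FBANK: log mel-scale filter bank energies (no DCT step).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)
fbank = librosa.power_to_db(mel)                     # shape: (40, n_frames)

# MFCC: mel filter bank followed by a DCT that decorrelates the features.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)

print(fbank.shape, mfcc.shape)
```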
Considering the false wake-up and missed wake-up problems of existing wake-up word recognition technology, accuracy can be improved by continuously training the wake-up model. Therefore, in this embodiment, the device is controlled to collect wake-up speech data locally or in the cloud, and the wake-up model is updated and optimized based on this data, reducing the device's false wake-up and missed wake-up probabilities and improving the accuracy of wake-up word recognition.
Based on this, in the device wake-up system provided by the embodiment of the present invention, a wake-up device wakes up an awakened device. The wake-up device may be any electronic device capable of receiving speech, such as a smart speaker, a smart phone, or a smart home appliance; the type of device is not limited.
As shown in fig. 1, the system includes a wake-up device 101 and an awakened device 102, which may be the same device or different devices. In an optional implementation, the system may further include a server in communication with the wake-up device. The wake-up device is used to acquire speech feature data, recognize the speech feature data using the acoustic model, and, when the input speech feature data can be recognized, determine whether to wake the awakened device according to the recognition probability. The acoustic model may be generated by the wake-up device training on speech data it has collected, or the collected speech data may be sent to the server, which trains the model and returns the resulting acoustic model data to the wake-up device. Further, the acoustic model may also be generated by training on devices other than the server.
The awakened device 102 may include, but is not limited to, devices that are fixedly installed or move within a small range, such as a smart speaker, a smart television, a smart robot, a smart refrigerator, a smart air conditioner, a smart rice cooker, a smart sensor (such as an infrared sensor, a light sensor, a vibration sensor, or a sound sensor), or a smart water purifier. Alternatively, the awakened device 102 may be a mobile device such as an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), or a smart Bluetooth headset.
The various awakened devices 102 may also be connected to each other via a wired or wireless network, optionally using standard communication techniques and/or protocols. The network is typically the Internet, but may be any network, including but not limited to any combination of local area networks (LANs), metropolitan area networks (MANs), wide area networks (WANs), mobile, wired or wireless networks, private networks, or virtual private networks. In some embodiments, data exchanged over the network is represented using techniques and/or formats including Hypertext Markup Language (HTML), Extensible Markup Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec). In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the techniques described above.
The wake-up device 101 may be connected to the awakened device 102 through the wired or wireless network, and the user may control the awakened device 102 so that the corresponding smart home device performs the desired operation. Optionally, the wake-up device 101 may be a smart terminal, such as a smart phone, a tablet computer, an e-book reader, smart glasses, or a smart watch. For example, the user may use a smart phone to make device A among the smart home devices send data or a signal to device B, or control the temperature of a smart refrigerator among the smart home devices.
When any of these devices trains and generates the acoustic model, the network model can first be trained on initial speech feature data to obtain the initial acoustic model, and then trained on new speech feature data to obtain an adaptive acoustic model. Wake-up speech can be captured through a microphone in the wake-up device to obtain speech feature data, and the speech data transmitted to the wake-up device may be speech data already processed by data cleaning. Generally speaking, speech data contains noise. There are three common methods for cleaning speech data: binning, clustering, and regression. Binning is frequently used: the data to be processed is put into bins according to a certain rule, the data in each bin is then examined, and the speech data is processed with a method chosen according to the actual situation of each bin.
As mentioned above, the acoustic model may be trained either on the server side or on the wake-up device side. The wake-up device or server trains the network model on the initial speech feature data to obtain the initial acoustic model, and the current acoustic model is then continuously trained and optimized on new speech feature data to adapt to the current wake-up environment and further improve wake-up accuracy.
For example, in the embodiment of the present application, after the user inputs wake-up speech to the awakened device, the speech is detected and analyzed by the acoustic model, which yields a posterior probability distribution over the input speech feature data. A decoder computes a wake-up confidence from the posterior probability distribution, and the confidence is compared with a set threshold. If the confidence is greater than or equal to the threshold, the speech is considered to contain the wake-up word text, or to match the current acoustic model, and the device can be woken up; if the confidence is below the threshold, the speech is considered not to contain the wake-up word and the device is not woken up.
Specifically, this embodiment does not limit the preset wake-up word, which may be, for example, "Xiaodu" or "Siri". The wake-up words include wake-up words preset in the server and/or user-defined wake-up words, and the user can delete or add wake-up words later.
The embodiment of the invention provides a method for training a wake-up model, applied to the training process of the wake-up model in a wake-up word detection module. As shown in figure 2, the method comprises the following steps:
Step S201, when model training is triggered, obtaining a first training set and a second training set, wherein the first training set comprises initial speech feature data used to train an initial acoustic model, and the second training set comprises new speech feature data of missed wake-up/false wake-up speech encountered by the acoustic model during wake-up speech recognition.
The initial speech feature data, labeled with wake-up words and non-wake-up words, is input into a preset deep neural network model, and the preset deep neural network model is trained to obtain the initial acoustic model.
The initial speech feature data for wake-up words and non-wake-up words in the first training set can be extracted from manually recorded audio segments specific to the wake-up words or non-wake-up words, or extracted from audio segments collected while the user wakes the device in normal use. The existing acoustic model is used to determine which initial speech feature data should be judged as wake-up speech by the current acoustic model and which should not, and this data is stored on the wake-up device or on the server; alternatively, the received speech is screened manually under human monitoring, and the initial speech feature data that the current acoustic model should judge as wake-up speech, and that which it should judge as non-wake-up speech, are stored on the awakened device or on the server.
The extraction of speech features can be implemented using conventional techniques in the art, and this application does not specifically limit the method adopted in this step. For example, speech feature data can be extracted using any one of the Mel-frequency cepstral coefficient (MFCC) method, the perceptual linear prediction (PLP) method, or the Mel-scale filter bank (FBANK) method.
The preset deep neural network model may be a deep neural network model or a deep neural network-hidden Markov model. The deep neural network comprises a plurality of network layers, each consisting of a fully connected layer and an activation function (usually ReLU or sigmoid), with the last network layer usually consisting of a fully connected layer plus a softmax activation function, as sketched below.
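A minimal PyTorch sketch of this layer structure (stacked fully connected layers with ReLU and a final fully connected layer with softmax) follows; the framework choice and all layer sizes are illustrative assumptions, not part of the patent:

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Stacked fully connected layers with ReLU activations and a final
    fully connected layer followed by softmax, as described above.
    All dimensions are illustrative assumptions."""
    def __init__(self, feat_dim=40, hidden_dim=256, num_units=3, num_layers=4):
        super().__init__()
        layers = []
        in_dim = feat_dim
        for _ in range(num_layers):
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
            in_dim = hidden_dim
        layers.append(nn.Linear(in_dim, num_units))  # last fully connected layer
        self.net = nn.Sequential(*layers)

    def forward(self, frames):
        # frames: (batch, feat_dim); softmax yields a posterior distribution
        # over the modeling units (words, phonemes, or phoneme states).
        return torch.softmax(self.net(frames), dim=-1)

model = AcousticModel()
posteriors = model(torch.randn(5, 40))  # one posterior per input speech frame
```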
The initial acoustic model and the current acoustic model are deep neural network models whose modeling units are any one of words, phonemes, or phoneme states; a speech frame from the speech feature data is input into the current acoustic model, and the current acoustic model outputs a posterior probability distribution.
The modeling process comprises the following steps:
the speech feature data is segmented and aligned using a baseline neural network of the deep neural network model, to obtain the word-level, phoneme (phone)-level, and phoneme state (state)-level labels corresponding to each frame of speech feature data, which form the input and output of the training network of the deep neural network model.
A deep neural network model whose modeling unit is the word takes the feature vector of each frame of the speech feature data as input and the word-level label of each frame as word output, and segments and aligns the input and word output.
A deep neural network model whose modeling unit is the phoneme takes the feature vector of each frame of the speech feature data as input and the phoneme-level label of each frame as phoneme output, and segments and aligns the input and phoneme output.
A deep neural network model whose modeling unit is the phoneme state takes the feature vector of each frame of the speech feature data as input and the phoneme-state-level label of each frame as phoneme-state output, and segments and aligns the input and phoneme-state output.
A phoneme-level label gives, at a certain moment such as time t, the phoneme pronunciation corresponding to each speech feature; a phoneme-state-level label is context-dependent, representing the phoneme state corresponding to the feature at time t by a clustered phoneme-state unit.
As an optional implementation, the output posterior probability distribution is input to a decoder to obtain a wake-up confidence score, and whether the input speech feature data triggers a wake-up is determined by comparing the wake-up confidence score with a wake-up threshold.
specifically, the method for inputting the speech after extracting the features into the current acoustic model to obtain the awakening result mainly includes the following steps: 1. extracting features by using a voice data extraction method; 2. inputting each voice frame with the extracted characteristics into a current acoustic model to obtain the posterior probability distribution of each voice frame; 3. calculating a wake-up confidence corresponding to the posterior probability distribution by using a decoder, judging whether the current acoustic model can be wakened according to input voice characteristic data when the wake-up confidence exceeds a certain threshold value based on the above contents to obtain new voice characteristic data of missed wake-up voice and false wake-up voice, wherein the decoding method of the decoder can be selecting a path with the highest score value in all paths in the decoder or selecting a path meeting a preset rule in the path searching process, and the preset rule is according to a Viterbi (Viterbi Algorithm) decoding Algorithm and the like;
the method for acquiring the new voice characteristic data of missed awakening and mistaken awakening voice in the second training set comprises the following steps:
determining the actual semantics of the received voice as awakening words but not the voice awakened by the current acoustic model as missed awakening voice by receiving a voice judgment instruction determined by the voice screening side according to the received voice; and determining the received voice with the actual semantic meaning of a non-awakening word but awakened by the current acoustic model as a false awakening voice, and obtaining new voice characteristic data in a voice data extraction mode.
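A simple sketch of this screening rule, assuming the screening side supplies the true semantics of each utterance (the function and its names are illustrative, not from the patent):

```python
def label_wake_result(is_wake_word: bool, model_woke: bool) -> str:
    """Classify one screened utterance. `is_wake_word` is the actual
    semantics reported by the speech screening side; `model_woke` is
    whether the current acoustic model triggered a wake-up."""
    if is_wake_word and not model_woke:
        return "missed wake-up"   # feature data goes into the second training set
    if not is_wake_word and model_woke:
        return "false wake-up"    # feature data goes into the second training set
    return "correct"              # not added to the second training set
```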
Step S202, inputting the first training set into the initial acoustic model and the current acoustic model respectively, and determining a first difference parameter by comparing the output results of the two models.
Specifically, determining the first difference parameter includes:
obtaining a first probability distribution corresponding to the wake-up speech recognition result output by the initial acoustic model and a second probability distribution corresponding to the wake-up speech recognition result output by the current acoustic model;
and determining a relative entropy from the difference between the first and second probability distributions.
The first training set is input into the initial acoustic model, which outputs the first probability distribution; the first training set is input into the current acoustic model, which outputs the second probability distribution.
The relative entropy, also called KL (Kullback-Leibler) divergence, measures the difference between two probability distributions and is determined here from the difference between the first and second probability distributions. Computing the relative entropy from two probability distributions is known to those skilled in the art and is not described again here.
Step S203, inputting the second training set into the current acoustic model, and determining a second difference parameter by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model.
Obtaining the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model includes:
obtaining preset wake-up speech information recognizable by the current acoustic model;
and inputting the preset wake-up speech information into an ASR speech model, and determining, from the recognition result of the ASR speech model, the one-hot code corresponding to the speech unit recognizable by the current acoustic model, wherein the speech unit comprises at least one of a phoneme state, a phoneme, and a word.
The ASR speech model comprises an acoustic model, a pronunciation dictionary, and text labels corresponding to the speech feature data. The preset wake-up speech information recognizable by the current acoustic model is input into the ASR speech model, where a decoder obtains the optimal path and its score, and the optimal path is processed into a probability distribution in one-hot form.
For example, suppose a wake-up speech for the wake-up word "xiao" has 3 frames. Table 1 shows the probability distribution in one-hot form for speech units at the phoneme level, where "xiao" consists of the three phonemes "x", "i", and "ao". Each speech frame of the preset wake-up speech information is input into the ASR speech model to obtain the corresponding posterior probability distribution, and a decoder selects an optimal path from the posterior probability distribution. The optimal path may be the highest-scoring path among all paths in the decoder, or a path satisfying a preset rule during the path search, for example according to the Viterbi decoding algorithm; the optimal path is then processed into the probability distribution in one-hot form. The one-hot probability distribution may also be pre-constructed from the wake-up speech recognizable by the current acoustic model; the way of pre-constructing it is known to those skilled in the art and is not described again here.
TABLE 1

                      Phoneme "x"   Phoneme "i"   Phoneme "ao"
Phoneme "x" frame          1             0             0
Phoneme "i" frame          0             1             0
Phoneme "ao" frame         0             0             1
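For illustration, the one-hot targets of Table 1 can be constructed from a per-frame phoneme alignment as in the minimal NumPy sketch below; the phoneme inventory and alignment are taken from the example above, and in practice the alignment would come from the ASR model's decoding:

```python
import numpy as np

phonemes = ["x", "i", "ao"]   # modeling units in the "xiao" example
alignment = ["x", "i", "ao"]  # phoneme label of each of the 3 speech frames

one_hot = np.zeros((len(alignment), len(phonemes)))
for t, phone in enumerate(alignment):
    one_hot[t, phonemes.index(phone)] = 1.0  # 1 for the aligned unit, 0 elsewhere

print(one_hot)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```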
The new speech feature data in the second training set is input into the current acoustic model frame by frame to obtain the output result: each speech frame of the speech feature data is input into the current acoustic model, which outputs a posterior probability distribution.
Table 2 shows the posterior probability distribution obtained by inputting the new speech feature data into the current acoustic model. For example, the new speech feature data also contains 3 frames, and the wake-up word corresponding to the current acoustic model is "xiao", consisting of the three phonemes "x", "i", and "ao"; inputting each speech frame of the new speech feature data into the current acoustic model yields the posterior probability distribution of each frame.
TABLE 2

                   Phoneme "x"   Phoneme "i"   Phoneme "ao"
First frame             0.8           0.3           0.1
Second frame            0.4           0.8           0.6
Third frame             0.1           0.4           0.9

A third probability distribution corresponding to the wake-up speech recognition result output by the current acoustic model and a fourth probability distribution corresponding to the one-hot code are obtained, and the cross entropy is determined from the difference between the third and fourth probability distributions; computing the cross entropy from two probability distributions is known to those skilled in the art and is not described again here.
Step S204, adjusting the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter.
A loss function for adjusting the current acoustic model is determined from the first difference parameter and the second difference parameter, and the parameters of each network layer in the current acoustic model are adjusted using the loss function.
Specifically, the loss function for adjusting the current acoustic model is obtained as a weighted sum of the cross entropy and the relative entropy. Gradient vectors are determined by layer-by-layer differentiation following the chain rule; the gradient vector is the direction in which the loss function increases fastest, so to make the loss function as small as possible, the network layer parameters of the current acoustic model are adjusted in the direction opposite to the gradient. In practice, a learning rate is set manually for each network layer to control the size of each update, and the parameters of each network layer are updated continuously; for a fully connected layer y = wx + b, the parameters w and b of each fully connected layer are updated, as sketched below.
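A minimal sketch of this update rule (not from the patent; the gradients are assumed to have already been obtained by backpropagation):

```python
def sgd_update(w, b, grad_w, grad_b, learning_rate=0.01):
    """Update the parameters of one fully connected layer y = w*x + b.
    The gradient points where the loss grows fastest, so we step the
    opposite way, scaled by the manually set learning rate."""
    w = w - learning_rate * grad_w
    b = b - learning_rate * grad_b
    return w, b

w, b = sgd_update(w=0.5, b=0.1, grad_w=0.2, grad_b=-0.05)
print(w, b)  # 0.498 0.1005
```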
As an optional implementation, after determining the second difference parameter, the method further includes:
when the current acoustic model is determined to be the initial acoustic model, adjusting the model parameters of the initial acoustic model according to the second difference parameter.
Adjusting the model parameters of the initial acoustic model according to the second difference parameter includes:
determining a loss function for adjusting the initial acoustic model according to the second difference parameter, and adjusting the parameters of each network layer in the initial acoustic model using the loss function, to obtain the current acoustic model.
When the current acoustic model is determined to be the initial acoustic model, the loss function for adjusting the initial acoustic model is determined from the second difference parameter, neural network learning is performed with the constructed loss function through the back-propagation algorithm, and the model parameters of the initial acoustic model are adjusted using the loss function. On the first adjustment, because the current acoustic model is the initial acoustic model, the relative entropy computed by inputting the initial speech feature data into both the initial acoustic model and the current acoustic model is 0, and the loss function for adjusting the initial acoustic model is determined from the cross entropy alone.
As stated above, splitting the training set in this way, using the one-hot codes from the ASR model when training the current acoustic model, and computing the difference parameters of the initial speech feature data in both the initial and the current acoustic model keeps the current acoustic model compatible with the initial speech data while adapting to the current scene, reduces the risk that updating destabilizes wake-up performance, and reduces the training workload compared with retraining a new acoustic model on all of the mixed wake-up word data.
As shown in fig. 3, the complete flow of the method for training a wake-up model includes the following steps:
Step S301, when model training is triggered, obtaining a first training set and a second training set, wherein the first training set comprises initial speech feature data used to train an initial acoustic model, and the second training set comprises new speech feature data of missed wake-up/false wake-up speech encountered by the acoustic model during wake-up speech recognition;
Step S302, inputting the second training set into the initial acoustic model, determining a loss function for adjusting the initial acoustic model by comparing the output result of the initial acoustic model with the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model, and adjusting the parameters of each network layer in the initial acoustic model using the loss function, to obtain the current acoustic model.
In particular, the loss function is determined by the cross entropy computed from the difference between the probability distribution in one-hot form and the posterior probability distribution of the new speech feature data input into the initial acoustic model. With p(x) denoting that posterior probability distribution and p_emp(x) denoting the probability distribution in one-hot form, the cross entropy is:

H(p_emp, p) = -Σ_x p_emp(x) log p(x)
The loss function is determined from this cross entropy, and the parameters of each network layer in the initial acoustic model are adjusted according to it, to obtain the current acoustic model.
Step S303, inputting the first training set into the initial acoustic model and the current acoustic model respectively, and determining the relative entropy by comparing the output results of the two models:
a first probability distribution corresponding to the wake-up speech recognition result output by the initial acoustic model and a second probability distribution corresponding to the wake-up speech recognition result output by the current acoustic model are obtained, and the relative entropy is determined from their difference.
With p_si(x) denoting the first probability distribution of the initial speech feature data in the initial acoustic model and p1(x) denoting the second probability distribution of the initial speech feature data in the current acoustic model, the relative entropy is:

D_kl(p_si, p1) = Σ_x p_si(x) log(p_si(x) / p1(x))

Step S304, inputting the second training set into the current acoustic model, and determining the cross entropy by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model.
The loss function again uses the cross entropy computed from the difference between the probability distribution in one-hot form and the posterior probability distribution of the new speech feature data input into the current acoustic model. With p2(x) denoting the third probability distribution of the new speech feature data in the current acoustic model and p_emp(x) denoting the fourth probability distribution in one-hot form, the cross entropy is:

H(p_emp, p2) = -Σ_x p_emp(x) log p2(x)

Step S305, determining the loss function for adjusting the current acoustic model from the relative entropy and the cross entropy, and adjusting the parameters of each network layer in the current acoustic model using the loss function.
The loss function defined from the cross entropy and the relative entropy is:

J_kld = (1 - α) H(p_emp, p2) + α D_kl(p_si, p1)

Here α is a weight coefficient balancing the cross entropy and the KL divergence; it is generally set empirically to a fixed value of 0.25, and increased when the amount of feature data is large. The gradient vectors are determined by layer-by-layer differentiation following the chain rule, and the network layer parameters of the current acoustic model are adjusted in the direction opposite to the gradient. A sketch of this combined loss follows.
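For illustration, a PyTorch-style sketch of the combined loss defined above; the framework and tensor shapes are assumptions, the patent specifies only the formula:

```python
import torch

def wake_model_loss(p_emp, p2, p_si, p1, alpha=0.25):
    """J_kld = (1 - alpha) * H(p_emp, p2) + alpha * D_kl(p_si, p1).
    p_emp: one-hot targets for the new data; p2: current-model posteriors
    on the new data; p_si / p1: initial- and current-model posteriors on
    the initial data. All tensors are (frames, units), averaged over frames."""
    eps = 1e-12  # numerical floor to avoid log(0)
    cross_entropy = -(p_emp * torch.log(p2 + eps)).sum(dim=-1).mean()
    kl_divergence = (p_si * torch.log((p_si + eps) / (p1 + eps))).sum(dim=-1).mean()
    return (1 - alpha) * cross_entropy + alpha * kl_divergence
```

On the very first pass the current model is still the initial model, so p1 equals p_si, the KL term vanishes, and the loss reduces to the cross entropy, matching the initial-model case in step S302.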
Finally, after the current acoustic model is obtained, the speech feature data of the wake-up speech is input into the current acoustic model for recognition, yielding the posterior probability distribution of each speech frame in the speech feature data. A decoder computes the wake-up confidence corresponding to the posterior probability distribution, and whether to send a wake-up command to the awakened device is determined according to the wake-up confidence and the wake-up threshold.
The embodiment of the invention provides an apparatus for training a wake-up model, the apparatus comprising a memory for storing instructions and a processor.
Fig. 4 shows an apparatus for training a wake-up model according to an embodiment of the present invention. The apparatus 400 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 401 and memory 402, as well as one or more storage media 403 (e.g., one or more mass storage devices) storing applications 404 or data 406. The memory 402 and the storage medium 403 may be transient or persistent storage. The program stored in the storage medium 403 may include one or more modules (not shown), each of which may include a series of instruction operations for the apparatus. Further, the processor 401 may be configured to communicate with the storage medium 403 and execute on the apparatus 400 the series of instruction operations in the storage medium 403.
The apparatus 400 may also include one or more power supplies 409, one or more wired or wireless network interfaces 407, one or more input/output interfaces 408, and/or one or more operating systems 405, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD.
The processor is used for reading the instructions in the memory to implement the following method steps:
when model training is triggered, acquiring a first training set and a second training set, wherein the first training set comprises initial voice feature data used for training an initial acoustic model, and the second training set comprises new voice feature data of missed-wake-up/false-wake-up voice encountered by the acoustic model during wake-up voice recognition;
inputting the first training set into the initial acoustic model and the current acoustic model respectively, and determining a first difference parameter by comparing the output results of the two models;
inputting the second training set into the current acoustic model, and determining a second difference parameter by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up voice recognizable by the current acoustic model;
and adjusting the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter.
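Tying these four steps together, a hedged sketch of one adaptation pass might look as follows, reusing the j_kld function sketched above; the loader, optimizer, and model objects are assumptions of this sketch and are not specified by the patent:

    import torch

    def adaptation_epoch(initial_model, current_model, loader, optimizer):
        # One pass over paired batches from the first and second training sets.
        initial_model.eval()                             # frozen reference model
        for feats_1, feats_2, labels_2 in loader:        # hypothetical loader
            with torch.no_grad():
                logits_init_1 = initial_model(feats_1)   # reference outputs
            logits_cur_1 = current_model(feats_1)
            logits_cur_2 = current_model(feats_2)
            loss = j_kld(logits_init_1, logits_cur_1, logits_cur_2, labels_2)
            optimizer.zero_grad()
            loss.backward()                              # chain-rule gradients
            optimizer.step()                             # step against the gradient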
Optionally, the processor is configured to determine the first difference parameter by:
acquiring a first probability distribution corresponding to an awakening voice recognition result output by the initial acoustic model and a second probability distribution corresponding to an awakening voice recognition result output by the current acoustic model;
determining a relative entropy from a difference of the first and second probability distributions.
Optionally, the processor is configured to determine the second difference parameter by:
acquiring a third probability distribution corresponding to the awakening voice recognition result output by the current acoustic model and a fourth probability distribution corresponding to the one-hot code;
and determining the cross entropy according to the difference of the third probability distribution and the fourth probability distribution.
Optionally, the processor is further configured to, after the second difference parameter is determined:
and when the current acoustic model is determined to be the initial acoustic model, adjusting the model parameters of the initial acoustic model according to the second difference parameters.
Optionally, the processor is configured to adjust the model parameters of the initial acoustic model according to the second difference parameter by:
and determining a loss function for adjusting the initial acoustic model according to the second difference parameter, and adjusting parameters of each network layer in the initial acoustic model by using the loss function to obtain the current acoustic model.
Optionally, the processor is configured to adjust the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter by:
and determining a loss function for adjusting the current acoustic model according to the first difference parameter and the second difference parameter, and adjusting parameters of each network layer in the current acoustic model by using the loss function.
Optionally, the initial speech feature data or the new speech feature data determined by the processor comprises at least one of the following (a brief feature-extraction sketch follows this list):
Mel-frequency cepstral coefficient (MFCC) feature data;
perceptual linear prediction (PLP) feature data;
Mel-scale filter bank (FBANK) feature data.
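As a hedged illustration of the first and third feature types, the snippet below uses the open-source librosa library (PLP extraction is not part of librosa and would need another toolkit); the file name and parameter values are arbitrary choices for this sketch:

    import numpy as np
    import librosa

    y, sr = librosa.load("wake_word.wav", sr=16000)      # hypothetical file
    # MFCC feature data.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # Mel-scale filter bank (FBANK) feature data: log mel spectrogram.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)
    fbank = np.log(mel + 1e-6)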
Optionally, the processor is configured to obtain the one-hot code corresponding to the wake-up voice recognizable by the current acoustic model by:
acquiring preset wake-up voice information recognizable by the current acoustic model;
and inputting the preset wake-up voice information into an ASR speech model, and determining, according to the recognition result of the ASR speech model, the one-hot code corresponding to each speech unit recognizable by the current acoustic model, wherein a speech unit comprises at least one of a phoneme state, a phoneme, and a word.
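A minimal sketch of the one-hot coding itself, assuming a hypothetical inventory of speech units returned by the ASR model; the unit names here are invented for illustration:

    import numpy as np

    units = ["sil", "ni", "hao", "xiao", "ai"]     # hypothetical unit inventory
    unit_to_id = {u: i for i, u in enumerate(units)}

    def one_hot(unit):
        # p_emp(x): probability 1 at the labeled unit, 0 elsewhere.
        vec = np.zeros(len(units), dtype=np.float32)
        vec[unit_to_id[unit]] = 1.0
        return vec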
An embodiment of the present invention provides a device for training a wake-up model; as shown in fig. 5, the device includes the following modules:
a training set obtaining module 501, configured to acquire, when model training is triggered, a first training set and a second training set, wherein the first training set comprises initial voice feature data used for training an initial acoustic model, and the second training set comprises new voice feature data of missed-wake-up/false-wake-up voice encountered by the acoustic model during wake-up voice recognition;
a first difference parameter determining module 502, configured to input the first training set into the initial acoustic model and the current acoustic model respectively, and determine a first difference parameter by comparing the output results of the two models;
a second difference parameter determining module 503, configured to input the second training set into the current acoustic model, and determine a second difference parameter by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up voice recognizable by the current acoustic model;
a model adjusting module 504, configured to adjust the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter.
Optionally, the first difference parameter determining module 502 is configured to determine the first difference parameter by:
acquiring a first probability distribution corresponding to an awakening voice recognition result output by the initial acoustic model and a second probability distribution corresponding to an awakening voice recognition result output by the current acoustic model;
determining a relative entropy from a difference of the first and second probability distributions.
Optionally, the second difference parameter determining module 503 is configured to determine the second difference parameter by:
acquiring a third probability distribution corresponding to the awakening voice recognition result output by the current acoustic model and a fourth probability distribution corresponding to the one-hot code;
and determining the cross entropy according to the difference of the third probability distribution and the fourth probability distribution.
Optionally, the device further comprises a current acoustic model determining module 505, configured to, after the second difference parameter is determined:
and when the current acoustic model is determined to be the initial acoustic model, adjusting the model parameters of the initial acoustic model according to the second difference parameters.
Optionally, the current acoustic model determining module 505 is configured to adjust the model parameters of the initial acoustic model according to the second difference parameter by:
and determining a loss function for adjusting the initial acoustic model according to the second difference parameter, and adjusting parameters of each network layer in the initial acoustic model by using the loss function to obtain the current acoustic model.
Optionally, the model adjusting module 504 is configured to adjust the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter by:
and determining a loss function for adjusting the current acoustic model according to the first difference parameter and the second difference parameter, and adjusting parameters of each network layer in the current acoustic model by using the loss function.
Optionally, the initial speech feature data or the new speech feature data comprises at least one of:
Mel-frequency cepstral coefficient (MFCC) feature data;
perceptual linear prediction (PLP) feature data;
Mel-scale filter bank (FBANK) feature data.
Optionally, the second difference parameter determining module 503 is configured to obtain the one-hot code corresponding to the wake-up voice recognizable by the current acoustic model by:
acquiring preset wake-up voice information recognizable by the current acoustic model;
and inputting the preset wake-up voice information into an ASR speech model, and determining, according to the recognition result of the ASR speech model, the one-hot code corresponding to each speech unit recognizable by the current acoustic model, wherein a speech unit comprises at least one of a phoneme state, a phoneme, and a word.
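For orientation only, here is a skeleton of how the four modules of fig. 5 might be wired together in code; the class and method names are invented for this sketch and are not part of the patent:

    class WakeModelTrainer:
        def __init__(self, initial_model, current_model):
            self.initial_model = initial_model   # frozen reference
            self.current_model = current_model   # model being adapted

        def acquire_training_sets(self):
            # module 501: return (first_set, second_set) when training triggers
            raise NotImplementedError

        def first_difference(self, logits_init, logits_cur):
            # module 502: relative entropy between the two output distributions
            raise NotImplementedError

        def second_difference(self, logits_cur, one_hot_labels):
            # module 503: cross entropy against the one-hot codes
            raise NotImplementedError

        def adjust(self, kl, ce, alpha=0.25):
            # module 504: combine the two difference parameters into J_kld
            return (1 - alpha) * ce + alpha * kl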
An embodiment of the present invention provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the method for training a wake-up model according to any one of the above embodiments.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (11)

1. A method of training a wake-up model, the method comprising:
when model training is triggered, acquiring a first training set and a second training set, wherein the first training set comprises initial voice feature data used for training an initial acoustic model, and the second training set comprises new voice feature data of missed-wake-up/false-wake-up voice encountered by the acoustic model during wake-up voice recognition;
inputting the first training set into the initial acoustic model and the current acoustic model respectively, and determining a first difference parameter by comparing the output results of the two models;
inputting the second training set into the current acoustic model, and determining a second difference parameter by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up voice recognizable by the current acoustic model;
and adjusting the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter.
2. The method of claim 1, wherein determining a first difference parameter comprises:
acquiring a first probability distribution corresponding to an awakening voice recognition result output by the initial acoustic model and a second probability distribution corresponding to an awakening voice recognition result output by the current acoustic model;
determining a relative entropy from a difference of the first and second probability distributions.
3. The method of claim 1, wherein determining a second difference parameter comprises:
acquiring a third probability distribution corresponding to the awakening voice recognition result output by the current acoustic model and a fourth probability distribution corresponding to the one-hot code;
and determining the cross entropy according to the difference of the third probability distribution and the fourth probability distribution.
4. The method of claim 1, wherein after the second difference parameter is determined, the method further comprises:
and when the current acoustic model is determined to be the initial acoustic model, adjusting the model parameters of the initial acoustic model according to the second difference parameters.
5. The method of claim 4, wherein adjusting model parameters of an initial acoustic model based on the second difference parameter comprises:
and determining a loss function for adjusting the initial acoustic model according to the second difference parameter, and adjusting parameters of each network layer in the initial acoustic model by using the loss function to obtain the current acoustic model.
6. The method of claim 1, wherein adjusting the model parameters of the current acoustic model based on the first difference parameter and the second difference parameter comprises:
and determining a loss function for adjusting the current acoustic model according to the first difference parameter and the second difference parameter, and adjusting parameters of each network layer in the current acoustic model by using the loss function.
7. The method of claim 1, wherein the initial speech feature data or the new speech feature data comprises at least one of:
Mel-frequency cepstral coefficient (MFCC) feature data;
perceptual linear prediction (PLP) feature data;
Mel-scale filter bank (FBANK) feature data.
8. The method of claim 1, wherein obtaining the one-hot code corresponding to the wake-up speech recognizable by the current acoustic model comprises:
acquiring preset wake-up voice information recognizable by the current acoustic model;
and inputting the preset wake-up voice information into an ASR speech model, and determining, according to the recognition result of the ASR speech model, the one-hot code corresponding to each speech unit recognizable by the current acoustic model, wherein a speech unit comprises at least one of a phoneme state, a phoneme, and a word.
9. An apparatus for training a wake-up model, the apparatus comprising: a memory to store instructions;
a processor for reading the instructions in the memory to implement a method for training a wake-up model according to any one of claims 1 to 8.
10. An apparatus for training a wake-up model, the apparatus comprising:
the training set acquisition module is used for acquiring, when model training is triggered, a first training set and a second training set, wherein the first training set comprises initial voice feature data used for training an initial acoustic model, and the second training set comprises new voice feature data of missed-wake-up/false-wake-up voice encountered by the acoustic model during wake-up voice recognition;
the first difference parameter determining module is used for inputting the first training set into the initial acoustic model and the current acoustic model respectively, and determining a first difference parameter by comparing the output results of the two models;
the second difference parameter determining module is used for inputting the second training set into the current acoustic model, and determining a second difference parameter by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up voice recognizable by the current acoustic model;
and the model adjusting module is used for adjusting the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter.
11. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement a method of training a wake-up model as claimed in any one of claims 1 to 8.
CN202010461982.XA 2020-05-27 2020-05-27 Method and device for training wake-up model Active CN111667818B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010461982.XA CN111667818B (en) 2020-05-27 2020-05-27 Method and device for training wake-up model

Publications (2)

Publication Number Publication Date
CN111667818A true CN111667818A (en) 2020-09-15
CN111667818B CN111667818B (en) 2023-10-10

Family

ID=72384785

Country Status (1)

CN (1): CN111667818B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10332513B1 (en) * 2016-06-27 2019-06-25 Amazon Technologies, Inc. Voice enablement and disablement of speech processing functionality
CN107221326A (en) * 2017-05-16 2017-09-29 百度在线网络技术(北京)有限公司 Voice awakening method, device and computer equipment based on artificial intelligence
CN107610702A (en) * 2017-09-22 2018-01-19 百度在线网络技术(北京)有限公司 Terminal device standby wakeup method, apparatus and computer equipment
CN108335696A (en) * 2018-02-09 2018-07-27 百度在线网络技术(北京)有限公司 Voice awakening method and device
CN110459204A (en) * 2018-05-02 2019-11-15 Oppo广东移动通信有限公司 Audio recognition method, device, storage medium and electronic equipment
CN109545194A (en) * 2018-12-26 2019-03-29 出门问问信息科技有限公司 Wake up word pre-training method, apparatus, equipment and storage medium
CN109801636A (en) * 2019-01-29 2019-05-24 北京猎户星空科技有限公司 Training method, device, electronic equipment and the storage medium of Application on Voiceprint Recognition model
CN110534099A (en) * 2019-09-03 2019-12-03 腾讯科技(深圳)有限公司 Voice wakes up processing method, device, storage medium and electronic equipment
CN110808027A (en) * 2019-11-05 2020-02-18 腾讯科技(深圳)有限公司 Voice synthesis method and device and news broadcasting method and system

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185382A (en) * 2020-09-30 2021-01-05 北京猎户星空科技有限公司 Method, device, equipment and medium for generating and updating wake-up model
CN112185382B (en) * 2020-09-30 2024-03-08 北京猎户星空科技有限公司 Method, device, equipment and medium for generating and updating wake-up model
CN112435656A (en) * 2020-12-11 2021-03-02 平安科技(深圳)有限公司 Model training method, voice recognition method, device, equipment and storage medium
CN112435656B (en) * 2020-12-11 2024-03-01 平安科技(深圳)有限公司 Model training method, voice recognition method, device, equipment and storage medium
CN112712801B (en) * 2020-12-14 2024-02-02 北京有竹居网络技术有限公司 Voice wakeup method and device, electronic equipment and storage medium
CN112712801A (en) * 2020-12-14 2021-04-27 北京有竹居网络技术有限公司 Voice wake-up method and device, electronic equipment and storage medium
WO2022127620A1 (en) * 2020-12-14 2022-06-23 北京有竹居网络技术有限公司 Voice wake-up method and apparatus, electronic device, and storage medium
CN113096647A (en) * 2021-04-08 2021-07-09 北京声智科技有限公司 Voice model training method and device and electronic equipment
CN113608664A (en) * 2021-07-26 2021-11-05 京东科技控股股份有限公司 Intelligent voice robot interaction effect optimization method and device and intelligent robot
CN113782016A (en) * 2021-08-06 2021-12-10 佛山市顺德区美的电子科技有限公司 Wake-up processing method, device, equipment and computer storage medium
CN113782016B (en) * 2021-08-06 2023-05-05 佛山市顺德区美的电子科技有限公司 Wakeup processing method, wakeup processing device, equipment and computer storage medium
CN113436629A (en) * 2021-08-27 2021-09-24 中国科学院自动化研究所 Voice control method and device, electronic equipment and storage medium
CN113436629B (en) * 2021-08-27 2024-06-04 中国科学院自动化研究所 Voice control method, voice control device, electronic equipment and storage medium
CN113782012B (en) * 2021-09-10 2024-03-08 北京声智科技有限公司 Awakening model training method, awakening method and electronic equipment
CN113782012A (en) * 2021-09-10 2021-12-10 北京声智科技有限公司 Wake-up model training method, wake-up method and electronic equipment
CN114565807A (en) * 2022-03-03 2022-05-31 腾讯科技(深圳)有限公司 Method and device for training target image retrieval model

Also Published As

Publication number Publication date
CN111667818B (en) 2023-10-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant