Disclosure of Invention
The invention aims to solve at least one technical problem in the prior art or the related art, and provides a wake-up model generation method and apparatus, and an intelligent terminal wake-up method and apparatus.
The embodiment of the invention provides the following specific technical scheme:
In a first aspect, a method for generating a wake-up model is provided, the method including:
labeling the start and end times of each wake-up word contained in the wake-up word audio in a sample audio set to obtain labeled wake-up word audio, where the duration of the wake-up word audio is not fixed;
adding noise to the labeled wake-up word audio using negative sample audio containing background noise to obtain positive sample audio;
extracting a plurality of audio frame features from the positive sample audio and the negative sample audio respectively, and labeling frames of the positive sample audio and the negative sample audio to obtain a plurality of audio training samples;
and training a recurrent neural network using the plurality of audio training samples to generate a wake-up model.
Further, labeling the start and end times of each wake-up word contained in the wake-up word audio in the sample audio set to obtain the labeled wake-up word audio includes:
identifying at least one key audio segment in the wake-up word audio that contains only the wake-up word;
and labeling the start and end times of each wake-up word according to the start and end times of the corresponding key audio segment to obtain the labeled wake-up word audio.
Further, adding noise to the labeled wake-up word audio using negative sample audio containing background noise to obtain positive sample audio includes:
intercepting, from the negative sample audio, a negative sample audio segment with the same duration as the labeled wake-up word audio;
and adjusting the mean amplitude of the negative sample audio segment, and mixing the adjusted negative sample audio segment into the labeled wake-up word audio to obtain the positive sample audio.
Further, the frame labels include a positive label, a negative label, and a middle label, and labeling frames of the positive sample audio and the negative sample audio to obtain a plurality of audio training samples includes:
for each audio frame of the positive sample audio, judging whether part or all of the audio frame falls within the start-stop period of any wake-up word, and if so, labeling the audio frame with the middle label;
if not, judging whether the preceding audio frame falls within the start-stop period of any wake-up word; if so, labeling the audio frame with the positive label (the audio frame is the first that no longer contains the end time of the wake-up word); otherwise, labeling the audio frame with the negative label;
and labeling each audio frame of the negative sample audio with the negative label.
In a second aspect, a method for waking up an intelligent terminal is provided, the method including:
acquiring, by the intelligent terminal, real-time audio at the current moment;
extracting a plurality of audio frame features from the real-time audio;
sequentially inputting the extracted audio frame features into a pre-deployed wake-up model, and performing calculation in combination with the state saved by the wake-up model at the previous moment to obtain a wake-up result indicating whether the real-time audio contains a wake-up word;
where the wake-up model is generated using the wake-up model generation method of the first aspect.
In a third aspect, an apparatus for generating a wake-up model is provided, the apparatus including:
a first labeling module, configured to label the start and end times of each wake-up word contained in the wake-up word audio in the sample audio set to obtain labeled wake-up word audio, where the duration of the wake-up word audio is not fixed;
a noise-adding module, configured to add noise to the labeled wake-up word audio using negative sample audio containing background noise to obtain positive sample audio;
a feature extraction module, configured to extract a plurality of audio frame features from the positive sample audio and the negative sample audio respectively;
a second labeling module, configured to label frames of the positive sample audio and the negative sample audio to obtain a plurality of audio training samples;
and a model generation module, configured to train a recurrent neural network using the plurality of audio training samples to generate a wake-up model.
Further, the first labeling module is specifically configured to:
identify at least one key audio segment in the wake-up word audio that contains only the wake-up word;
and label the start and end times of each wake-up word according to the start and end times of the corresponding key audio segment to obtain the labeled wake-up word audio.
Further, the noise-adding module is specifically configured to:
intercept, from the negative sample audio, a negative sample audio segment with the same duration as the labeled wake-up word audio;
and adjust the mean amplitude of the negative sample audio segment, and mix the adjusted negative sample audio segment into the labeled wake-up word audio to obtain the positive sample audio.
Further, the frame labels include a positive label, a negative label, and a middle label, and the second labeling module is specifically configured to:
for each audio frame of the positive sample audio, judge whether part or all of the audio frame falls within the start-stop period of any wake-up word, and if so, label the audio frame with the middle label;
if not, judge whether the preceding audio frame falls within the start-stop period of any wake-up word; if so, label the audio frame with the positive label (the audio frame is the first that no longer contains the end time of the wake-up word); otherwise, label the audio frame with the negative label;
and label each audio frame of the negative sample audio with the negative label.
In a fourth aspect, an intelligent terminal wake-up apparatus is provided, the apparatus including:
an audio acquisition module, configured to enable the intelligent terminal to acquire real-time audio at the current moment;
a feature extraction module, configured to extract a plurality of audio frame features from the real-time audio;
a model identification module, configured to sequentially input the extracted audio frame features into a pre-deployed wake-up model and perform calculation in combination with the state saved by the wake-up model at the previous moment to obtain a wake-up result indicating whether the real-time audio contains a wake-up word;
where the wake-up model is generated using the wake-up model generation method of the first aspect.
The technical solutions provided by the embodiments of the invention have the following beneficial effects:
1. because the duration of the wake-up word audio is not fixed, the wake-up word audio serves as variable-length input data for training the recurrent neural network (RNN), which avoids manual data interception, shortens the manual data processing flow, saves labor cost, and enables recognition of wake-up speech spoken at a slower speed;
2. the sample audio set can contain long audio, so the RNN can be trained continuously, which improves the recognition precision of wake-up words and the wake-up effect of the intelligent terminal;
3. during terminal wake-up, for each new frame of audio added to the terminal memory, old data need not be recalculated, which reduces the calculation time and power consumption of the terminal.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
In the description of the present application, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present application, "a plurality" means two or more unless otherwise specified.
Example one
An embodiment of the present invention provides a method for generating a wake-up model, where the method may be applied to a server, and as shown in fig. 1, the method may include:
101, labeling the start and end times of each wake-up word contained in the wake-up word audio in the sample audio set to obtain the labeled wake-up word audio, where the duration of the wake-up word audio is not fixed.
The sample audio set comprises a plurality of wake-up word audios, each of which contains at least one wake-up word. In a specific implementation, the wake-up word audios may be recorded in a quiet environment; when recording one wake-up word audio, a certain time interval is reserved between adjacent wake-up words, and the content of each wake-up word is the same, for example, "little biu little biu". In this embodiment, the duration of each wake-up word audio ranges from several seconds to several minutes, and the duration of a wake-up word is about 1 second.
Specifically, at least one key audio segment containing only the wake-up word is identified in the wake-up word audio, and the start and end times of each wake-up word are labeled according to the start and end times of the corresponding key audio segment to obtain the labeled wake-up word audio. In a specific implementation, the start and end times of each wake-up word may also be labeled manually on the server to obtain the labeled wake-up word audio.
For example, startN and endN may denote the start time and the end time of the Nth wake-up word, respectively. Fig. 2 shows a schematic diagram of start-stop time labeling of wake-up words according to an embodiment of the invention, where the black parts represent wake-up words.
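The embodiment permits manual labeling; as one hypothetical automatic sketch (not part of the claims), key segments can be located from short-time energy, assuming quiet recordings with silent gaps between wake-up words. The frame sizes, threshold, and merge gap below are all assumptions.

```python
import numpy as np

def find_wake_word_segments(audio, sr, energy_thresh=0.01, min_gap_s=0.5):
    """Return candidate wake-word (start_s, end_s) pairs as runs of
    above-threshold short-time energy in a quiet recording."""
    frame = int(0.025 * sr)            # 25 ms analysis frames (assumption)
    hop = int(0.010 * sr)              # 10 ms hop (assumption)
    n = 1 + max(0, (len(audio) - frame) // hop)
    energy = np.array([np.mean(audio[i*hop:i*hop+frame] ** 2) for i in range(n)])
    active = energy > energy_thresh
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            segments.append((start * hop / sr, (i * hop + frame) / sr))
            start = None
    if start is not None:
        segments.append((start * hop / sr, len(audio) / sr))
    # merge runs separated by less than min_gap_s (short pauses inside a word)
    merged = []
    for s, e in segments:
        if merged and s - merged[-1][1] < min_gap_s:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged
```

Each returned pair then serves as one (startN, endN) annotation for a human to verify.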
102, adding noise to the labeled wake-up word audio using the negative sample audio containing background noise to obtain positive sample audio.
Background noise in different scenes may be prerecorded to obtain the negative sample audio, where the different scenes may include, for example, a scene with a television playing, a cooking scene, or other scenes.
Specifically, a negative sample audio segment with the same duration as the labeled wake-up word audio is intercepted from the negative sample audio, the mean amplitude of the negative sample audio segment is adjusted, and the adjusted negative sample audio segment is mixed into the labeled wake-up word audio to obtain the positive sample audio.
In a specific implementation, the mean amplitude of the negative sample audio segment may first be adjusted to equal that of the labeled wake-up word audio, and then reduced to a preset percentage of that value, where the preset percentage may be between 5% and 10%.
In this embodiment, to amplify the positive sample audio data set, each of M wake-up word audios may be mixed with each of N negative sample audios to obtain N × M positive sample audios.
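Step 102 and the N × M amplification can be sketched as follows. The 8% noise level, the uniform random offset into the noise recording, and the helper names are assumptions for illustration; the embodiment only specifies a preset percentage between 5% and 10%.

```python
import numpy as np

def mix_noise(wake_audio, noise_audio, noise_level=0.08, rng=None):
    """Cut a noise segment of equal length, scale its mean absolute
    amplitude to noise_level times the wake audio's, then add it."""
    if rng is None:
        rng = np.random.default_rng(0)
    if len(noise_audio) < len(wake_audio):
        raise ValueError("noise recording shorter than wake-word audio")
    off = rng.integers(0, len(noise_audio) - len(wake_audio) + 1)
    seg = noise_audio[off:off + len(wake_audio)].astype(float)
    target = noise_level * np.mean(np.abs(wake_audio))  # preset percentage
    cur = np.mean(np.abs(seg))
    if cur > 0:
        seg = seg * (target / cur)
    return wake_audio + seg

def amplify(wake_set, noise_set, **kw):
    """M wake recordings x N noise recordings -> N*M positive samples."""
    return [mix_noise(w, n, **kw) for w in wake_set for n in noise_set]
```

Because the noise is quiet relative to the speech, the labeled start and end times of the wake words remain valid for the noisy positives.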
103, extracting a plurality of audio frame features from the positive sample audio and the negative sample audio respectively, and labeling frames of the positive sample audio and the negative sample audio to obtain a plurality of audio training samples.
Specifically, the process of extracting a plurality of audio frame features from the positive sample audio and the negative sample audio may include:
extracting an audio frame feature from each audio frame of the positive sample audio and each audio frame of the negative sample audio, and generating a feature spectrogram of the positive sample audio and a feature spectrogram of the negative sample audio. The audio frame features may specifically be mel-frequency cepstral coefficient (MFCC) features, and the feature spectrogram is a mel-frequency cepstrogram, that is, a spectrogram of MFCCs in which each feature vector is the MFCC feature vector of one audio frame.
Fig. 3 is a schematic diagram illustrating MFCC feature vector acquisition according to an embodiment of the invention. As shown in fig. 3, for each positive sample audio and each negative sample audio, the mel-frequency cepstral coefficients may be calculated with a preset window width W, moving step S, and number of cepstral coefficients CMel to generate the mel-frequency cepstrogram.
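As one illustrative sketch of this extraction (not the embodiment's exact pipeline), a compact MFCC-style computation with window width W and step S can be written as follows. The filter-bank size, coefficient count, and log epsilon are assumptions; production code would typically use an audio library.

```python
import numpy as np

def mfcc_like_features(audio, sr, win_w=0.025, step_s=0.010, n_mel=26, n_ceps=13):
    """One n_ceps feature vector per frame: window -> power spectrum
    -> triangular mel filter bank -> log -> DCT-II (simplified)."""
    W, S = int(win_w * sr), int(step_s * sr)
    frames = np.array([audio[i:i + W] for i in range(0, len(audio) - W + 1, S)])
    frames = frames * np.hamming(W)                     # taper each frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2    # per-frame power spectrum
    def hz2mel(f): return 2595 * np.log10(1 + f / 700)
    def mel2hz(m): return 700 * (10 ** (m / 2595) - 1)
    mels = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mel + 2)
    bins = np.floor((W + 1) * mel2hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mel, power.shape[1]))           # triangular filters
    for j in range(n_mel):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        for k in range(l, c):
            fbank[j, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[j, k] = (r - k) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    m = np.arange(n_mel)                                # DCT-II basis
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * m + 1) / (2 * n_mel)))
    return logmel @ dct.T                               # (n_frames, n_ceps)
```

Stacking the per-frame vectors column by column yields the mel-frequency cepstrogram described above.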
Specifically, the frame labels include a positive label, a negative label, and a middle label, and the process of labeling the positive sample audio and the negative sample audio may include:
for each audio frame of the positive sample audio, judging whether part or all of the audio frame falls within the start-stop period of any wake-up word, and if so, labeling the audio frame with the middle label; if not, judging whether the preceding audio frame falls within the start-stop period of any wake-up word, and if so, labeling the audio frame with the positive label, otherwise labeling it with the negative label; and labeling each audio frame of the negative sample audio with the negative label.
In this embodiment, the positive, negative, and middle labels may be denoted as "Positive", "Negative", and "Middle", or as "1", "-1", and "0", respectively.
Fig. 4 is a schematic diagram illustrating frame labeling according to an embodiment of the invention. As shown in fig. 4, let t denote the start time of the window and w the window width. For each audio frame of the positive sample audio: if the window falls entirely outside the start-stop period of every wake-up word, that is, (endN-1 < t) & (t + w < startN), the frame is labeled "Negative"; if part or all of the window falls within the start-stop period of a wake-up word, that is, (startN < t + w) & (t < endN), the frame is labeled "Middle"; and if the preceding window fell within the start-stop period of a wake-up word and the current window is the first that no longer contains the word's end time, that is, (endN ≤ t) & (t - 1 < endN), the frame is labeled "Positive".
It will be appreciated that each audio frame of the negative sample audio is labeled "Negative".
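The three-way rule above can be sketched as a small function over labeled (start, end) spans; the `prev_step` argument standing in for the hop between consecutive windows is an assumption.

```python
def frame_label(t, w, words, prev_step=1):
    """Label one analysis window [t, t+w) against labeled wake words:
    'Middle' if the window overlaps any (start, end) span, 'Positive'
    for the first window past a word's end time (the previous window,
    starting at t - prev_step, still overlapped it), else 'Negative'."""
    for start, end in words:
        if t < end and start < t + w:          # partial or full overlap
            return "Middle"
    for start, end in words:
        if end <= t and t - prev_step < end:   # first window past the end
            return "Positive"
    return "Negative"
```

Passing an empty word list reproduces the negative-sample case: every frame is "Negative".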
104, training the recurrent neural network using the plurality of audio training samples to generate the wake-up model.
Specifically, for the Nth audio frame of each audio training sample, the frame feature of the audio frame serves as the input of the recurrent neural network's input layer at time t, and the frame label serves as the expected output of the output layer at time t. The state value St of the hidden layer at time t is calculated in combination with the state value St-1 at the previous moment, the state values at all moments of the hidden layer are calculated in sequence, and the wake-up model is generated.
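The recurrence St = f(xt, St-1) can be sketched as a minimal Elman-style forward pass. Layer sizes, the tanh nonlinearity, and the random weights are assumptions; actual training of U, W, V against the frame labels (backpropagation through time) is left to a framework.

```python
import numpy as np

class TinyRNN:
    """Minimal recurrent network illustrating the step-104 recurrence:
    the hidden state S_t depends on the frame feature x_t and S_{t-1}."""
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.U = rng.normal(0, 0.1, (n_hidden, n_in))      # input -> hidden
        self.W = rng.normal(0, 0.1, (n_hidden, n_hidden))  # hidden -> hidden
        self.V = rng.normal(0, 0.1, (n_out, n_hidden))     # hidden -> output

    def step(self, x, s_prev):
        s = np.tanh(self.U @ x + self.W @ s_prev)  # S_t from x_t and S_{t-1}
        return s, self.V @ s                       # state and frame logits

    def forward(self, xs):
        s = np.zeros(self.W.shape[0])              # S_0 = 0
        outs = []
        for x in xs:                               # one frame per time step
            s, logits = self.step(x, s)
            outs.append(logits)
        return s, np.array(outs)
```

Because the loop simply runs for as many frames as the sample contains, variable-length wake-up word audio needs no manual interception.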
It should be noted that, after the wake-up model is generated, it may be deployed to the intelligent terminal so that the intelligent terminal can be woken up using the wake-up model.
The embodiment of the invention provides a wake-up model generation method. Because the duration of the wake-up word audio is not fixed, the wake-up word audio serves as variable-length input data for training the recurrent neural network (RNN), which avoids manual data interception, saves labor cost, and enables recognition of speech spoken at a slower speed. Meanwhile, the sample audio set can contain long audio, so the RNN can be trained continuously, which improves the recognition precision of wake-up words and the wake-up effect of the intelligent terminal.
Example two
An embodiment of the invention provides an intelligent terminal wake-up method, which may be applied to an intelligent terminal on which a wake-up model generated by the wake-up model generation method of the first embodiment is pre-deployed. As shown in fig. 5, the method may include the following steps:
501, the intelligent terminal obtains the real-time audio at the current moment.
Specifically, the intelligent terminal may use a microphone to capture real-time audio of the current scene. The intelligent terminal includes, but is not limited to, a robot, a smartphone, a wearable device, a smart home appliance, a vehicle-mounted terminal, and the like.
502, extracting a plurality of audio frame features from the real-time audio.
Specifically, with the preset window width W, moving step S, and number of cepstral coefficients CMel, mel-frequency cepstral coefficient features are extracted from each audio frame of the real-time audio to obtain the plurality of audio frame features.
Further, to improve the recognition accuracy of the wake-up word and the wake-up effect, before step 502 is executed, the method provided in the embodiment of the invention may further include:
preprocessing the real-time audio at the current moment, where the preprocessing includes, but is not limited to, echo cancellation and noise reduction.
503, sequentially inputting the extracted audio frame features into the pre-deployed wake-up model, and performing calculation in combination with the state saved by the wake-up model at the previous moment to obtain a wake-up result indicating whether the real-time audio contains a wake-up word.
Specifically, the audio frame features are input into the wake-up model in the time order in which they were extracted from the real-time audio, and calculation is performed in combination with the state saved by the wake-up model at the previous moment. From the model's output, the frame labels corresponding to the audio frames of the real-time audio at the current moment and the model's state at the current moment are obtained; the current state is saved, and the wake-up result is determined from the frame labels: when the frame labels contain a positive label, the real-time audio is determined to contain a wake-up word.
The intelligent terminal wake-up method according to the embodiment of the invention is described below with reference to figs. 6a and 6b.
Assume the memory of the intelligent terminal can store only N frames of data at a time. As shown in fig. 6a, when the intelligent terminal is powered on for the first time, the real-time audio at time t = 1 is loaded into memory, the previous-moment state S0 of the RNN in the wake-up model is 0, and the real-time audio feature at time t = 1 is input into the RNN to obtain its state S1 at time t = 1 and output the recognition result. As shown in fig. 6b, at any later time t = M (M > 1), only the real-time audio frame feature newly added to memory at time t = M needs to be input into the RNN, and the calculation is performed in combination with the state SM-1 saved at the previous moment, without recalculating all the data in memory.
In this embodiment, because current intelligent terminals mostly use low-end chips, terminal memory capacity is limited. In the prior art, during the terminal wake-up stage, the neural network must process the audio of duration t held in terminal memory each time, so a large amount of duplicated data is processed between two adjacent windows, which increases the terminal's calculation time and power consumption. The invention judges whether the real-time audio contains a wake-up word using a variable-length-input RNN wake-up model, without recalculating old data, thereby reducing the amount of calculation, speeding up processing, and lowering power consumption.
Example three
As an implementation of the wake-up model generation method provided in the first embodiment, an embodiment of the present invention provides a wake-up model generation apparatus, as shown in fig. 7, the apparatus includes:
a first labeling module 71, configured to label the start and end times of each wake-up word contained in the wake-up word audio in the sample audio set to obtain labeled wake-up word audio, where the duration of the wake-up word audio is not fixed;
a noise-adding module 72, configured to add noise to the labeled wake-up word audio using negative sample audio containing background noise to obtain positive sample audio;
a feature extraction module 73, configured to extract a plurality of audio frame features from the positive sample audio and the negative sample audio respectively;
a second labeling module 74, configured to label frames of the positive sample audio and the negative sample audio to obtain a plurality of audio training samples;
and a model generation module 75, configured to train the recurrent neural network using the plurality of audio training samples to generate the wake-up model.
Further, the first labeling module 71 is specifically configured to:
identify at least one key audio segment in the wake-up word audio that contains only the wake-up word;
and label the start and end times of each wake-up word according to the start and end times of the corresponding key audio segment to obtain the labeled wake-up word audio.
Further, the noise-adding module 72 is specifically configured to:
intercept, from the negative sample audio, a negative sample audio segment with the same duration as the labeled wake-up word audio;
and adjust the mean amplitude of the negative sample audio segment, and mix the adjusted negative sample audio segment into the labeled wake-up word audio to obtain the positive sample audio.
Further, the frame labels include a positive label, a negative label, and a middle label, and the second labeling module 74 is specifically configured to:
for each audio frame of the positive sample audio, judge whether part or all of the audio frame falls within the start-stop period of any wake-up word, and if so, label the audio frame with the middle label;
if not, judge whether the preceding audio frame falls within the start-stop period of any wake-up word, and if so, label the audio frame with the positive label, otherwise label it with the negative label;
and label each audio frame of the negative sample audio with the negative label.
The wake-up model generation apparatus provided by the embodiment of the invention shares the same inventive concept as the wake-up model generation method provided by the first embodiment, can execute the wake-up model generation method provided by any embodiment of the invention, and has the functional modules and beneficial effects corresponding to executing that method. For technical details not described in this embodiment, reference may be made to the wake-up model generation method provided by the embodiments of the invention, which are not repeated here.
Example four
As an implementation of the intelligent terminal wake-up method provided in the second embodiment, an embodiment of the invention provides an intelligent terminal wake-up apparatus. As shown in fig. 8, the apparatus includes:
an audio acquisition module 81, configured to enable the intelligent terminal to acquire real-time audio at the current moment;
a feature extraction module 82, configured to extract a plurality of audio frame features from the real-time audio;
a model identification module 83, configured to sequentially input the extracted audio frame features into a pre-deployed wake-up model and perform calculation in combination with the state saved by the wake-up model at the previous moment to obtain a wake-up result indicating whether the real-time audio contains a wake-up word;
the wake-up model is generated by using the wake-up model generation method in the first embodiment.
Further, to improve the recognition accuracy of the wake-up word and the wake-up effect, the apparatus may further include:
a preprocessing module, configured to preprocess the real-time audio at the current moment, where the preprocessing includes, but is not limited to, echo cancellation and noise reduction.
Correspondingly, the feature extraction module 82 is further configured to extract the plurality of audio frame features from the preprocessed real-time audio.
The intelligent terminal wake-up apparatus provided by the embodiment of the invention shares the same inventive concept as the intelligent terminal wake-up method provided by the second embodiment, can execute the intelligent terminal wake-up method provided by any embodiment of the invention, and has the functional modules and beneficial effects corresponding to executing that method. For technical details not described in this embodiment, reference may be made to the intelligent terminal wake-up method provided by the embodiments of the invention, which are not repeated here.
In addition, another embodiment of the present invention further provides a computer device, including:
one or more processors;
a memory;
a program stored in the memory, which when executed by the one or more processors, causes the processors to perform the steps of the wake model generation method as described in the above embodiments.
In addition, another embodiment of the present invention further provides a computer device, including:
one or more processors;
a memory;
a program stored in the memory, which when executed by the one or more processors, causes the processors to perform the steps of the intelligent terminal wake-up method as described in the embodiments above.
Furthermore, another embodiment of the present invention further provides a computer-readable storage medium, which stores a program, and when the program is executed by a processor, the program causes the processor to execute the steps of the wake model generation method according to the above embodiment.
In addition, another embodiment of the present invention further provides a computer-readable storage medium, which stores a program, and when the program is executed by a processor, the program causes the processor to execute the steps of the method for waking up an intelligent terminal according to the above embodiment.
As will be appreciated by one of skill in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the true spirit and scope of the embodiments of the present invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.