CN110970016B - Wake-up model generation method, intelligent terminal wake-up method and device

Wake-up model generation method, intelligent terminal wake-up method and device

Info

Publication number
CN110970016B
CN110970016B (application CN201911028892.5A)
Authority
CN
China
Prior art keywords
audio
awakening
wake
word
frame
Prior art date
Legal status
Active
Application number
CN201911028892.5A
Other languages
Chinese (zh)
Other versions
CN110970016A (en)
Inventor
白二伟
倪合强
宋志�
姚寿柏
Current Assignee
Jiangsu Biying Technology Co ltd
Jiangsu Suning Cloud Computing Co ltd
Original Assignee
Suning Cloud Computing Co Ltd
Priority date
Filing date
Publication date
Application filed by Suning Cloud Computing Co Ltd
Priority to CN201911028892.5A (granted as CN110970016B)
Publication of CN110970016A
Priority to PCT/CN2020/105998 (published as WO2021082572A1)
Priority to CA3158930A (published as CA3158930A1)
Application granted
Publication of CN110970016B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/225: Feedback of the input speech
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03: characterised by the type of extracted parameters
    • G10L 25/18: the extracted parameters being spectral information of each sub-band
    • G10L 25/27: characterised by the analysis technique
    • G10L 25/30: using neural networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/70: Reducing energy consumption in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Telephone Function (AREA)

Abstract

The invention discloses a wake-up model generation method, an intelligent terminal wake-up method, and corresponding devices, belonging to the technical field of voice wake-up. The wake-up model generation method comprises the following steps: marking the start and end times of each wake-up word contained in the wake-up word audio of a sample audio set to obtain labeled wake-up word audio, where the duration of the wake-up word audio is not fixed; adding noise to the labeled wake-up word audio using negative sample audio containing background noise to obtain positive sample audio; extracting a plurality of audio frame features from the positive sample audio and the negative sample audio respectively, and labeling the frames of both to obtain a plurality of audio training samples; and training a recurrent neural network with the audio training samples to generate a wake-up model. By training the model with a recurrent neural network that accepts variable-length input, embodiments of the invention avoid manually truncating samples and improve the wake-up performance of the intelligent terminal.

Description

Wake-up model generation method, intelligent terminal wake-up method and device
Technical Field
The invention relates to the technical field of voice wake-up, and in particular to a wake-up model generation method, an intelligent terminal wake-up method, and an intelligent terminal wake-up device.
Background
Voice wake-up is now widely applied in robots, mobile phones, wearable devices, smart homes, vehicles, and the like. Different intelligent terminals may use different wake-up words; when a user speaks the specific wake-up word, the intelligent terminal switches from standby to working state. Only if this switch completes quickly and accurately can the user go on to use the terminal's other functions almost without perceiving a delay, so improving the wake-up performance is very important.
In the prior art, intelligent terminals are mainly woken up with neural-network-based techniques. In the data preparation stage, positive sample data must be manually and uniformly truncated to a fixed duration t, and a recorded wake-up word cannot exceed that duration; this greatly increases labor cost, and wake-up speech spoken slowly cannot be recognized. Moreover, because the wake-up word may be short relative to t, the neural network may be insufficiently trained, which ultimately degrades the wake-up performance of the intelligent terminal. In addition, during the terminal wake-up stage, the neural network must process an audio window of duration t from the terminal's memory each time, so a large amount of data repeated between two adjacent windows is reprocessed, increasing the terminal's computation time and power consumption.
Disclosure of Invention
The invention aims to solve at least one of the technical problems in the prior art or related art, and provides a wake-up model generation method, an intelligent terminal wake-up method, and an intelligent terminal wake-up device.
Embodiments of the invention provide the following specific technical solutions:
In a first aspect, a wake-up model generation method is provided, where the method includes:
marking the start and end times of each wake-up word contained in the wake-up word audio in a sample audio set to obtain labeled wake-up word audio, wherein the duration of the wake-up word audio is not fixed;
adding noise to the labeled wake-up word audio using negative sample audio containing background noise to obtain positive sample audio;
extracting a plurality of audio frame features from the positive sample audio and the negative sample audio respectively, and labeling frames of the positive sample audio and the negative sample audio to obtain a plurality of audio training samples;
and training a recurrent neural network with the plurality of audio training samples to generate a wake-up model.
Further, marking the start and end times of each wake-up word contained in the wake-up word audio in the sample audio set to obtain the labeled wake-up word audio includes:
identifying at least one key audio segment in the wake-up word audio that contains only a wake-up word;
and labeling the start and end times of each wake-up word according to the start and end times of the corresponding key audio segment, to obtain the labeled wake-up word audio.
Further, adding noise to the labeled wake-up word audio using negative sample audio containing background noise to obtain positive sample audio includes:
cutting, from the negative sample audio, a negative sample audio segment of the same duration as the labeled wake-up word audio;
and adjusting the mean amplitude of the negative sample audio segment, then mixing the adjusted segment into the labeled wake-up word audio to obtain the positive sample audio.
Further, the frame labels include a positive label, a negative label, and a middle label, and labeling the frames of the positive sample audio and the negative sample audio to obtain a plurality of audio training samples includes:
for each audio frame of the positive sample audio, judging whether part or all of the audio frame falls within the start-stop period of any wake-up word, and if so, labeling the audio frame with the middle label;
if not, judging whether the previous audio frame falls within the start-stop period of any wake-up word; if it does, i.e. the current frame is the first frame not containing the wake-up word's end time, labeling the audio frame with the positive label, and otherwise with the negative label;
and for each audio frame of the negative sample audio, labeling the audio frame with the negative label.
In a second aspect, an intelligent terminal wake-up method is provided, where the method includes:
acquiring, by the intelligent terminal, real-time audio at the current moment;
extracting a plurality of audio frame features from the real-time audio;
and sequentially inputting the extracted audio frame features into a pre-deployed wake-up model, and computing with the state saved by the wake-up model at the previous moment, to obtain a wake-up result indicating whether the real-time audio contains the wake-up word;
wherein the wake-up model is generated by the wake-up model generation method of the first aspect.
In a third aspect, a wake-up model generation apparatus is provided, the apparatus including:
a first labeling module, configured to mark the start and end times of each wake-up word contained in the wake-up word audio in a sample audio set to obtain labeled wake-up word audio, where the duration of the wake-up word audio is not fixed;
a noise-adding module, configured to add noise to the labeled wake-up word audio using negative sample audio containing background noise to obtain positive sample audio;
a feature extraction module, configured to extract a plurality of audio frame features from the positive sample audio and the negative sample audio respectively;
a second labeling module, configured to label frames of the positive sample audio and the negative sample audio to obtain a plurality of audio training samples;
and a model generation module, configured to train a recurrent neural network with the plurality of audio training samples to generate a wake-up model.
Further, the first labeling module is specifically configured to:
identify at least one key audio segment in the wake-up word audio that contains only a wake-up word;
and label the start and end times of each wake-up word according to the start and end times of the corresponding key audio segment, to obtain the labeled wake-up word audio.
Further, the noise-adding module is specifically configured to:
cut, from the negative sample audio, a negative sample audio segment of the same duration as the labeled wake-up word audio;
and adjust the mean amplitude of the negative sample audio segment, then mix the adjusted segment into the labeled wake-up word audio to obtain the positive sample audio.
Further, the frame labels include a positive label, a negative label, and a middle label, and the second labeling module is specifically configured to:
for each audio frame of the positive sample audio, judge whether part or all of the audio frame falls within the start-stop period of any wake-up word, and if so, label the audio frame with the middle label;
if not, judge whether the previous audio frame falls within the start-stop period of any wake-up word; if it does, i.e. the current frame is the first frame not containing the wake-up word's end time, label the audio frame with the positive label, and otherwise with the negative label;
and for each audio frame of the negative sample audio, label the audio frame with the negative label.
In a fourth aspect, an intelligent terminal wake-up apparatus is provided, the apparatus including:
an audio acquisition module, configured for the intelligent terminal to acquire real-time audio at the current moment;
a feature extraction module, configured to extract a plurality of audio frame features from the real-time audio;
a model recognition module, configured to sequentially input the extracted audio frame features into a pre-deployed wake-up model and compute with the state saved by the wake-up model at the previous moment, to obtain a wake-up result indicating whether the real-time audio contains the wake-up word;
wherein the wake-up model is generated by the wake-up model generation method of the first aspect.
The technical solutions provided by the embodiments of the invention have the following beneficial effects:
1. Because the duration of the wake-up word audio is not fixed, the wake-up word audio is used as variable-length input to train the recurrent neural network (RNN). This avoids manual truncation of data, shortens the manual data processing flow, saves labor cost, and allows wake-up speech spoken slowly to be recognized.
2. The sample audio set may contain long audio, so the RNN can be trained without interruption, improving the recognition accuracy of the wake-up word and the wake-up performance of the intelligent terminal.
3. During terminal wake-up, for each audio frame newly added to the terminal's memory, old data need not be recomputed, reducing the terminal's computation time and power consumption.
Drawings
To illustrate the technical solutions in the embodiments of the invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below represent only some embodiments of the invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a wake-up model generation method according to an embodiment of the invention;
Fig. 2 is a schematic diagram of labeling the start and end times of a wake-up word according to an embodiment of the invention;
Fig. 3 is a schematic diagram of MFCC feature vector extraction according to an embodiment of the invention;
Fig. 4 is a schematic diagram of frame labeling according to an embodiment of the invention;
Fig. 5 is a flowchart of an intelligent terminal wake-up method according to an embodiment of the invention;
Fig. 6a is a schematic diagram of the wake-up process in the terminal memory at time t = 1 according to an embodiment of the invention;
Fig. 6b is a schematic diagram of the wake-up process in the terminal memory at time t = M according to an embodiment of the invention;
Fig. 7 is a schematic structural diagram of a wake-up model generation apparatus according to an embodiment of the invention;
Fig. 8 is a schematic structural diagram of an intelligent terminal wake-up apparatus according to an embodiment of the invention.
Detailed Description
To make the objects, technical solutions, and advantages of the invention clearer, the technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art from these embodiments without creative effort fall within the protection scope of the invention.
In the description of the present application, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present application, "a plurality" means two or more unless otherwise specified.
Example One
An embodiment of the present invention provides a wake-up model generation method. The method may be applied to a server and, as shown in Fig. 1, may include the following steps:
Step 101: mark the start and end times of each wake-up word contained in the wake-up word audio in the sample audio set to obtain labeled wake-up word audio, where the duration of the wake-up word audio is not fixed.
The sample audio set comprises a plurality of wake-up word audio clips, each containing at least one wake-up word. In a specific implementation, multiple clips containing the wake-up word may be recorded in a quiet environment. When recording a clip, a certain time interval must be left between adjacent wake-up words, and every wake-up word has the same content, for example "little biu little biu". In this embodiment, each clip lasts roughly several seconds to several minutes, and a single wake-up word lasts roughly one second.
Specifically, at least one key audio segment containing only the wake-up word is identified in the wake-up word audio, and the start and end times of each wake-up word are labeled according to the start and end times of the corresponding key audio segment, yielding the labeled wake-up word audio. In a specific implementation, the start and end times of each wake-up word may be labeled manually on the server.
For example, startN and endN may denote the start time and end time of the N-th wake-up word, as shown in Fig. 2, which illustrates the start-stop time labeling of wake-up words provided by an embodiment of the invention; the black segments represent the wake-up words.
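For illustration only, the labeled result of this step can be represented as a per-clip list of (startN, endN) intervals. The structure and file names below are hypothetical, not taken from the patent; all code sketches in this description are Python.

```python
# Hypothetical in-memory form of the labeled wake-up word audio:
# for each recorded clip, one (startN, endN) pair per wake-up word,
# in seconds from the start of the clip.
annotations = {
    "clip_001.wav": [(1.20, 2.05), (4.80, 5.70), (9.15, 10.02)],
    "clip_002.wav": [(0.95, 1.88)],
}
```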
Step 102: add noise to the labeled wake-up word audio using negative sample audio containing background noise, to obtain positive sample audio.
Background noise in different scenes may be prerecorded to obtain the negative sample audio, where the scenes may be of any kind, such as a room with a television playing, a kitchen while cooking, or other scenes.
Specifically, a negative sample audio segment with the same duration as the labeled wake-up word audio is cut from the negative sample audio, the mean amplitude of the segment is adjusted, and the adjusted segment is mixed into the labeled wake-up word audio to obtain the positive sample audio.
In a specific implementation, the mean amplitude of the negative sample audio segment may first be adjusted to equal the mean amplitude of the labeled wake-up word audio and then reduced to a preset percentage of that value, where the preset percentage may be between 5% and 10%.
In this embodiment, to amplify the positive-sample data set, each of the M wake-up word audio clips may be mixed with each of the N negative sample audio clips, yielding N × M positive sample audio clips.
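A minimal sketch of this noise-mixing step, assuming mono float arrays and a 5%-10% noise level; the function name and the use of np.resize to loop short noise clips are illustrative choices, not specified by the patent.

```python
import numpy as np

def mix_noise(wake_audio: np.ndarray, noise_audio: np.ndarray,
              noise_level: float = 0.08) -> np.ndarray:
    """Mix a background-noise segment into one labeled wake-word clip.

    1. Cut (or loop) a noise segment of the same duration as the clip.
    2. Scale it so its mean absolute amplitude is `noise_level` (5%-10%)
       of the clip's mean absolute amplitude.
    3. Add the two signals; the start/end labels carry over unchanged,
       since mixing does not shift the wake-up words in time.
    """
    seg = np.resize(noise_audio, len(wake_audio))       # step 1: same duration
    target = noise_level * np.abs(wake_audio).mean()    # step 2: target amplitude
    seg = seg * (target / (np.abs(seg).mean() + 1e-12))
    return wake_audio + seg                             # step 3: mix

# Amplifying the positive set: N noise clips x M wake-word clips -> N*M samples.
# (wake_clips and noise_clips are assumed lists of numpy arrays.)
positives = [mix_noise(w, n) for w in wake_clips for n in noise_clips]
```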
Step 103: extract a plurality of audio frame features from the positive sample audio and the negative sample audio respectively, and label the frames of both to obtain a plurality of audio training samples.
Specifically, extracting a plurality of audio frame features from the positive sample audio and the negative sample audio may include:
extracting audio frame features from each audio frame of the positive sample audio and each audio frame of the negative sample audio, and generating a feature spectrogram for each. The audio frame features may specifically be Mel-frequency cepstral coefficient (MFCC) features; the feature spectrogram is then a Mel-frequency cepstrogram, i.e. a spectrogram of MFCCs in which each feature vector is the MFCC feature vector of one audio frame.
Fig. 3 is a schematic diagram of MFCC feature vector extraction according to an embodiment of the invention. As shown in Fig. 3, for each positive sample audio clip and each negative sample audio clip, the MFCC features may be computed with a preset window width W, moving step S, and number of cepstral coefficients CMel, generating the Mel-frequency cepstrogram.
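As a sketch of this extraction, librosa can compute one MFCC vector per frame; the library choice and the concrete values of W, S, and CMel below are assumptions, since the patent leaves them unspecified.

```python
import numpy as np
import librosa

SR = 16000   # assumed sample rate
W = 400      # window width W: 25 ms at 16 kHz
S = 160      # moving step S: 10 ms at 16 kHz
C_MEL = 13   # number of Mel-frequency cepstral coefficients CMel

def mfcc_frames(audio: np.ndarray) -> np.ndarray:
    """Return a (num_frames, C_MEL) array: one MFCC feature vector per
    audio frame, i.e. the Mel-frequency cepstrogram described above."""
    feats = librosa.feature.mfcc(y=audio, sr=SR, n_mfcc=C_MEL,
                                 n_fft=W, hop_length=S)
    return feats.T   # librosa returns (n_mfcc, num_frames)
```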
Specifically, the frames of the positive sample audio and the negative sample audio are labeled, where the frame labels include a positive label, a negative label, and a middle label; the process may include:
for each audio frame of the positive sample audio, judging whether part or all of the frame falls within the start-stop period of any wake-up word, and if so, labeling the frame with the middle label; if not, judging whether the previous frame falls within the start-stop period of any wake-up word: if it does, i.e. the current frame is the first frame not containing the wake-up word's end time, labeling the frame with the positive label, and otherwise with the negative label; and for each audio frame of the negative sample audio, labeling the frame with the negative label.
In this embodiment, the positive, negative, and middle labels may be denoted "Positive", "Negative", and "Middle", or "1", "-1", and "0", respectively.
Fig. 4 is a schematic diagram of frame labeling according to an embodiment of the invention. As shown in Fig. 4, let t denote the start time of the window and w the window width. For each audio frame of the positive sample audio: if the frame falls entirely outside the start-stop period of every wake-up word, it is labeled "Negative", i.e. (endN-1 < t) & (t + w < startN); if part or all of the frame falls within the start-stop period of some wake-up word, it is labeled "Middle", i.e. (startN < t + w) & (t < endN); and if the previous frame falls within the start-stop period of some wake-up word while the current frame is the first that no longer contains that wake-up word's end time, i.e. (endN ≤ t) & (t - 1 < endN), the frame is labeled "Positive".
It will be appreciated that every audio frame of the negative sample audio is labeled "Negative".
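The three rules can be written down directly. In Fig. 4 the window advances by one time unit, hence the (t - 1 < endN) term; the sketch below generalizes that to an arbitrary hop `step`, which is an assumption on our part.

```python
POSITIVE, NEGATIVE, MIDDLE = 1, -1, 0

def label_frame(t: float, w: float, step: float, words) -> int:
    """Label one frame of positive-sample audio.

    t: start time of the frame's window, w: window width, step: hop
    between consecutive frames, words: list of (startN, endN) intervals.
    """
    for start, end in words:
        if start < t + w and t < end:      # (startN < t+w) & (t < endN)
            return MIDDLE                  # frame overlaps a wake-up word
        if end <= t and t - step < end:    # (endN <= t) & (t-step < endN)
            return POSITIVE                # first frame past the word's end
    return NEGATIVE                        # outside every wake-up word

# Every frame of a negative sample is labeled NEGATIVE unconditionally.
```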
Step 104: train the recurrent neural network with the plurality of audio training samples to generate the wake-up model.
Specifically, for the N-th audio frame of each audio training sample, the frame's features serve as the input to the recurrent neural network's input layer at time t, and the frame's label as the expected output of the output layer at time t. The state value St of the hidden layer at time t is computed from the hidden layer's state value St-1 at the previous moment; the state values of the hidden layer at all moments are computed in sequence, and the wake-up model is generated.
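A minimal PyTorch sketch of such frame-level training; the GRU cell, layer sizes, and the class-index encoding of the three labels are all assumptions, since the patent fixes only the recurrent structure St = f(xt, St-1).

```python
import torch
import torch.nn as nn

class WakeRNN(nn.Module):
    """One recurrent layer plus a per-frame classifier over the three
    labels (assumed class indices: Negative=0, Middle=1, Positive=2)."""
    def __init__(self, feat_dim=13, hidden=64, num_labels=3):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_labels)

    def forward(self, x, state=None):
        # x: (batch, num_frames, feat_dim); num_frames may differ from
        # clip to clip, which is what "variable-length input" means here.
        h, state = self.rnn(x, state)   # S_t is computed from S_{t-1}
        return self.out(h), state       # per-frame logits and final state

model = WakeRNN()
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(feats: torch.Tensor, labels: torch.Tensor) -> float:
    """feats: (1, T, 13) MFCCs of one clip; labels: (1, T) class indices."""
    logits, _ = model(feats)
    loss = loss_fn(logits.reshape(-1, 3), labels.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```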
It should be noted that, after the wake-up model is generated, embodiments of the invention may deploy it to the intelligent terminal so that the terminal can be woken up with it.
An embodiment of the invention thus provides a wake-up model generation method. Because the duration of the wake-up word audio is not fixed, the wake-up word audio is used as variable-length input to train the recurrent neural network (RNN), which avoids manual truncation of data, saves labor cost, and allows slowly spoken wake-up speech to be recognized. Meanwhile, the sample audio set may contain long audio, so the RNN can be trained without interruption, improving the recognition accuracy of the wake-up word and the wake-up performance of the intelligent terminal.
Example Two
An embodiment of the present invention provides an intelligent terminal wake-up method, applicable to an intelligent terminal on which a wake-up model generated by the method of Example One is pre-deployed. As shown in Fig. 5, the method may include the following steps:
Step 501: the intelligent terminal acquires real-time audio at the current moment.
Specifically, the intelligent terminal may capture real-time audio of the current scene with a microphone. Intelligent terminals include, but are not limited to, robots, smartphones, wearable devices, smart home appliances, vehicle-mounted terminals, and the like.
Step 502: extract a plurality of audio frame features from the real-time audio.
Specifically, with the preset window width W, moving step S, and number of cepstral coefficients CMel, MFCC features are extracted from each audio frame of the real-time audio, yielding a plurality of audio frame features.
Further, to improve the recognition accuracy of the wake-up word and the wake-up effect, before step 502 the method provided by the embodiment of the invention may further include:
preprocessing the real-time audio at the current moment, where the preprocessing includes, but is not limited to, echo cancellation and noise reduction.
Step 503: sequentially input the extracted audio frame features into the pre-deployed wake-up model and compute with the state saved by the wake-up model at the previous moment, to obtain a wake-up result indicating whether the real-time audio contains the wake-up word.
Specifically, the audio frame features are input into the wake-up model in the time order in which they were extracted from the real-time audio, and the computation uses the state the wake-up model saved at the previous moment. From the model's output, the frame labels of the real-time audio's frames at the current moment and the model's state at the current moment are obtained; that state is saved, and the wake-up result is derived from the frame labels: when the frame labels contain a positive label, the real-time audio is determined to contain the wake-up word.
The intelligent terminal wake-up method of the embodiment is described below with reference to Figs. 6a and 6b.
Assume the intelligent terminal's memory can hold only N frames of data at a time. As shown in Fig. 6a, when the terminal is first powered on, the real-time audio at time t = 1 is loaded into memory; the previous state S0 of the RNN in the wake-up model is 0, the real-time audio features at t = 1 are input into the RNN of the wake-up model, the RNN state S1 at t = 1 is obtained, and the recognition result is output. As shown in Fig. 6b, at any later time t = M, where M is greater than 1, only the real-time audio frame features newly added to memory at t = M need to be input into the RNN, and the computation uses the state SM-1 saved by the RNN at the previous moment, without recomputing all the data in memory.
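A sketch of this streaming use, reusing the hypothetical WakeRNN from Example One: the hidden state is kept between calls, so each newly captured frame is pushed through the network exactly once. The class and method names are illustrative.

```python
import torch

class StreamingWakeDetector:
    """Wrap a trained frame-level model for streaming inference."""
    def __init__(self, model, positive_idx=2):
        self.model = model.eval()
        self.state = None            # None == all-zero state S0 at power-up
        self.positive_idx = positive_idx

    @torch.no_grad()
    def push_frame(self, frame_feat: torch.Tensor) -> bool:
        """frame_feat: (feat_dim,) MFCC vector of the newest frame.
        Returns True when the frame is labeled Positive, i.e. a
        wake-up word has just ended."""
        x = frame_feat.view(1, 1, -1)                    # (batch=1, T=1, feat)
        logits, self.state = self.model(x, self.state)   # reuse S_{M-1}
        return logits.argmax(dim=-1).item() == self.positive_idx
```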
In this embodiment, because current intelligent terminals mostly use low-end chips, terminal memory is limited. In the prior art, during the wake-up stage the neural network must process an audio window of duration t from the terminal's memory each time, so a large amount of data repeated between two adjacent windows is reprocessed, increasing the terminal's computation time and power consumption. The invention instead judges whether the real-time audio contains the wake-up word with a variable-length-input RNN wake-up model, so old data need not be recomputed, which reduces the amount of computation, speeds up processing, and lowers power consumption.
Example Three
As an implementation of the wake-up model generation method provided in Example One, an embodiment of the invention provides a wake-up model generation apparatus. As shown in Fig. 7, the apparatus includes:
a first labeling module 71, configured to mark the start and end times of each wake-up word contained in the wake-up word audio in the sample audio set to obtain labeled wake-up word audio, where the duration of the wake-up word audio is not fixed;
a noise-adding module 72, configured to add noise to the labeled wake-up word audio using negative sample audio containing background noise, to obtain positive sample audio;
a feature extraction module 73, configured to extract a plurality of audio frame features from the positive sample audio and the negative sample audio respectively;
a second labeling module 74, configured to label frames of the positive sample audio and the negative sample audio to obtain a plurality of audio training samples;
and a model generation module 75, configured to train the recurrent neural network with the plurality of audio training samples to generate the wake-up model.
Further, the first labeling module 71 is specifically configured to:
identify at least one key audio segment in the wake-up word audio that contains only a wake-up word;
and label the start and end times of each wake-up word according to the start and end times of the corresponding key audio segment, to obtain the labeled wake-up word audio.
Further, the noise-adding module 72 is specifically configured to:
cut, from the negative sample audio, a negative sample audio segment of the same duration as the labeled wake-up word audio;
and adjust the mean amplitude of the negative sample audio segment, then mix the adjusted segment into the labeled wake-up word audio to obtain the positive sample audio.
Further, the frame labels include a positive label, a negative label, and a middle label, and the second labeling module 74 is specifically configured to:
judge, for each audio frame of the positive sample audio, whether part or all of the audio frame falls within the start-stop period of any wake-up word, and if so, label the audio frame with the middle label;
if not, judge whether the previous audio frame falls within the start-stop period of any wake-up word; if it does, i.e. the current frame is the first frame not containing the wake-up word's end time, label the audio frame with the positive label, and otherwise with the negative label;
and for each audio frame of the negative sample audio, label the audio frame with the negative label.
The wake-up model generation apparatus provided by this embodiment of the invention shares the same inventive concept as the wake-up model generation method of Example One, can execute the wake-up model generation method provided by any embodiment of the invention, and has the functional modules and beneficial effects corresponding to that method. For technical details not described in detail in this embodiment, reference may be made to the wake-up model generation method provided by the embodiments of the invention, and they are not repeated here.
Example Four
As an implementation of the intelligent terminal wake-up method provided in Example Two, an embodiment of the invention provides an intelligent terminal wake-up apparatus. As shown in Fig. 8, the apparatus includes:
an audio acquisition module 81, configured for the intelligent terminal to acquire real-time audio at the current moment;
a feature extraction module 82, configured to extract a plurality of audio frame features from the real-time audio;
and a model recognition module 83, configured to sequentially input the extracted audio frame features into the pre-deployed wake-up model and compute with the state saved by the wake-up model at the previous moment, to obtain a wake-up result indicating whether the real-time audio contains the wake-up word;
wherein the wake-up model is generated by the wake-up model generation method of Example One.
Further, to improve the recognition accuracy of the wake-up word and the wake-up effect, the apparatus may further include:
a preprocessing module, configured to preprocess the real-time audio at the current moment, where the preprocessing includes, but is not limited to, echo cancellation and noise reduction.
The feature extraction module 82 is then further configured to extract the plurality of audio frame features from the preprocessed real-time audio.
The intelligent terminal wake-up apparatus provided by this embodiment of the invention shares the same inventive concept as the intelligent terminal wake-up method of Example Two, can execute the intelligent terminal wake-up method provided by any embodiment of the invention, and has the functional modules and beneficial effects corresponding to that method. For technical details not described in detail in this embodiment, reference may be made to the intelligent terminal wake-up method provided by the embodiments of the invention, and they are not repeated here.
In addition, another embodiment of the present invention further provides a computer device, including:
one or more processors;
a memory;
a program stored in the memory, which, when executed by the one or more processors, causes the processors to perform the steps of the wake-up model generation method described in the above embodiments.
In addition, another embodiment of the present invention further provides a computer device, including:
one or more processors;
a memory;
a program stored in the memory, which when executed by the one or more processors, causes the processors to perform the steps of the intelligent terminal wake-up method as described in the embodiments above.
Furthermore, another embodiment of the present invention provides a computer-readable storage medium storing a program which, when executed by a processor, causes the processor to execute the steps of the wake-up model generation method according to the above embodiment.
In addition, another embodiment of the present invention further provides a computer-readable storage medium, which stores a program, and when the program is executed by a processor, the program causes the processor to execute the steps of the method for waking up an intelligent terminal according to the above embodiment.
As will be appreciated by one of skill in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the true spirit and scope of the embodiments of the present invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (8)

1. A wake-up model generation method, the method comprising:
marking the start and end times of each wake-up word contained in the wake-up word audio in a sample audio set to obtain labeled wake-up word audio, wherein the duration of the wake-up word audio is not fixed;
adding noise to the labeled wake-up word audio using negative sample audio containing background noise to obtain positive sample audio;
extracting a plurality of audio frame features from the positive sample audio and the negative sample audio respectively, and labeling frames of the positive sample audio and the negative sample audio to obtain a plurality of audio training samples, wherein the frame labels comprise a positive label, a negative label, and a middle label, and labeling the frames of the positive sample audio and the negative sample audio to obtain the plurality of audio training samples comprises:
for each audio frame of the positive sample audio, judging whether part or all of the audio frame falls within the start-stop period of any wake-up word, and if so, labeling the audio frame with the middle label;
if not, judging whether the previous audio frame falls within the start-stop period of any wake-up word; if it does, i.e. the current audio frame is the first frame not containing the wake-up word's end time, labeling the audio frame with the positive label, and otherwise with the negative label;
for each audio frame of the negative sample audio, labeling the audio frame with the negative label;
and training a recurrent neural network with the plurality of audio training samples to generate a wake-up model.
2. The method according to claim 1, wherein marking the start and end times of each wake-up word contained in the wake-up word audio in the sample audio set to obtain the labeled wake-up word audio comprises:
identifying at least one key audio segment in the wake-up word audio that contains only a wake-up word;
and labeling the start and end times of each wake-up word according to the start and end times of the corresponding key audio segment, to obtain the labeled wake-up word audio.
3. The method of claim 1, wherein adding noise to the labeled wake-up word audio using negative sample audio containing background noise to obtain positive sample audio comprises:
cutting, from the negative sample audio, a negative sample audio segment of the same duration as the labeled wake-up word audio;
and adjusting the mean amplitude of the negative sample audio segment, then mixing the adjusted segment into the labeled wake-up word audio to obtain the positive sample audio.
4. An intelligent terminal wake-up method, characterized by comprising:
acquiring, by the intelligent terminal, real-time audio at the current moment;
extracting a plurality of audio frame features from the real-time audio;
sequentially inputting the extracted audio frame features into a pre-deployed wake-up model, and computing with the state saved by the wake-up model at the previous moment, to obtain a wake-up result indicating whether the real-time audio contains the wake-up word;
wherein the wake-up model is generated by the wake-up model generation method of any one of claims 1 to 3.
5. A wake-up model generation apparatus, the apparatus comprising:
a first labeling module, configured to mark the start and end times of each wake-up word contained in the wake-up word audio in a sample audio set to obtain labeled wake-up word audio, wherein the duration of the wake-up word audio is not fixed;
a noise-adding module, configured to add noise to the labeled wake-up word audio using negative sample audio containing background noise to obtain positive sample audio;
a feature extraction module, configured to extract a plurality of audio frame features from the positive sample audio and the negative sample audio respectively;
a second labeling module, configured to label frames of the positive sample audio and the negative sample audio to obtain a plurality of audio training samples, wherein the frame labels comprise a positive label, a negative label, and a middle label, and the second labeling module is specifically configured to:
for each audio frame of the positive sample audio, judge whether part or all of the audio frame falls within the start-stop period of any wake-up word, and if so, label the audio frame with the middle label;
if not, judge whether the previous audio frame falls within the start-stop period of any wake-up word; if it does, i.e. the current audio frame is the first frame not containing the wake-up word's end time, label the audio frame with the positive label, and otherwise with the negative label;
for each audio frame of the negative sample audio, label the audio frame with the negative label;
and a model generation module, configured to train a recurrent neural network with the plurality of audio training samples to generate a wake-up model.
6. The apparatus of claim 5, wherein the first labeling module is specifically configured to:
identify at least one key audio segment in the wake-up word audio that contains only a wake-up word;
and label the start and end times of each wake-up word according to the start and end times of the corresponding key audio segment, to obtain the labeled wake-up word audio.
7. The apparatus of claim 5, wherein the noise-adding module is specifically configured to:
cut, from the negative sample audio, a negative sample audio segment of the same duration as the labeled wake-up word audio;
and adjust the mean amplitude of the negative sample audio segment, then mix the adjusted segment into the labeled wake-up word audio to obtain the positive sample audio.
8. An intelligent terminal wake-up apparatus, the apparatus comprising:
an audio acquisition module, configured for the intelligent terminal to acquire real-time audio at the current moment;
a feature extraction module, configured to extract a plurality of audio frame features from the real-time audio;
and a model recognition module, configured to sequentially input the extracted audio frame features into a pre-deployed wake-up model and compute with the state saved by the wake-up model at the previous moment, to obtain a wake-up result indicating whether the real-time audio contains the wake-up word;
wherein the wake-up model is generated by the wake-up model generation method of any one of claims 1 to 3.
CN201911028892.5A 2019-10-28 2019-10-28 Wake-up model generation method, intelligent terminal wake-up method and device Active CN110970016B (en)

Priority Applications (3)

• CN201911028892.5A (CN110970016B), priority 2019-10-28, filed 2019-10-28: Wake-up model generation method, intelligent terminal wake-up method and device
• PCT/CN2020/105998 (WO2021082572A1), priority 2019-10-28, filed 2020-07-30: Wake-up model generation method, smart terminal wake-up method, and devices
• CA3158930A (CA3158930A1), priority 2019-10-28, filed 2020-07-30: Arousal model generating method, intelligent terminal arousing method, and corresponding devices

Applications Claiming Priority (1)

• CN201911028892.5A (CN110970016B), priority 2019-10-28, filed 2019-10-28: Wake-up model generation method, intelligent terminal wake-up method and device

Publications (2)

• CN110970016A, published 2020-04-07
• CN110970016B, published 2022-08-19

Family

ID=70029890

Family Applications (1)

• CN201911028892.5A (CN110970016B, Active), priority 2019-10-28, filed 2019-10-28: Wake-up model generation method, intelligent terminal wake-up method and device

Country Status (3)

• CN: CN110970016B
• CA: CA3158930A1
• WO: WO2021082572A1

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110970016B (en) * 2019-10-28 2022-08-19 苏宁云计算有限公司 Awakening model generation method, intelligent terminal awakening method and device
CN111653274B (en) * 2020-04-17 2023-08-04 北京声智科技有限公司 Wake-up word recognition method, device and storage medium
CN111833902A (en) * 2020-07-07 2020-10-27 Oppo广东移动通信有限公司 Awakening model training method, awakening word recognition device and electronic equipment
CN112201239B (en) * 2020-09-25 2024-05-24 海尔优家智能科技(北京)有限公司 Determination method and device of target equipment, storage medium and electronic device
CN112259085A (en) * 2020-09-28 2021-01-22 上海声瀚信息科技有限公司 Two-stage voice awakening algorithm based on model fusion framework
CN113223499B (en) * 2021-04-12 2022-11-04 青岛信芯微电子科技股份有限公司 Method and device for generating audio negative sample
CN113903334B (en) * 2021-09-13 2022-09-23 北京百度网讯科技有限公司 Method and device for training sound source positioning model and sound source positioning
CN116110112B (en) * 2023-04-12 2023-06-16 广东浩博特科技股份有限公司 Self-adaptive adjustment method and device of intelligent switch based on face recognition

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10719115B2 (en) * 2014-12-30 2020-07-21 Avago Technologies International Sales Pte. Limited Isolated word training and detection using generated phoneme concatenation models of audio inputs
EP3754653A1 (en) * 2016-06-15 2020-12-23 Cerence Operating Company Techniques for wake-up word recognition and related systems and methods
CN108281137A (en) * 2017-01-03 2018-07-13 中国科学院声学研究所 A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN108694940B (en) * 2017-04-10 2020-07-03 北京猎户星空科技有限公司 Voice recognition method and device and electronic equipment
CN107358951A (en) * 2017-06-29 2017-11-17 阿里巴巴集团控股有限公司 A kind of voice awakening method, device and electronic equipment
CN110097876A (en) * 2018-01-30 2019-08-06 阿里巴巴集团控股有限公司 Voice wakes up processing method and is waken up equipment
CN109036393A (en) * 2018-06-19 2018-12-18 广东美的厨房电器制造有限公司 Wake-up word training method, device and the household appliance of household appliance
CN109215647A (en) * 2018-08-30 2019-01-15 出门问问信息科技有限公司 Voice awakening method, electronic equipment and non-transient computer readable storage medium
CN110176226B (en) * 2018-10-25 2024-02-02 腾讯科技(深圳)有限公司 Speech recognition and speech recognition model training method and device
CN109448725A (en) * 2019-01-11 2019-03-08 百度在线网络技术(北京)有限公司 A kind of interactive voice equipment awakening method, device, equipment and storage medium
CN109785850A (en) * 2019-01-18 2019-05-21 腾讯音乐娱乐科技(深圳)有限公司 A kind of noise detecting method, device and storage medium
CN110364147B (en) * 2019-08-29 2021-08-20 厦门市思芯微科技有限公司 Awakening training word acquisition system and method
CN110970016B (en) * 2019-10-28 2022-08-19 苏宁云计算有限公司 Awakening model generation method, intelligent terminal awakening method and device

Also Published As

Publication number Publication date
CN110970016A (en) 2020-04-07
CA3158930A1 (en) 2021-05-06
WO2021082572A1 (en) 2021-05-06


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
CP03: Change of name, title or address
Address after: No.1-1 Suning Avenue, Xuzhuang Software Park, Xuanwu District, Nanjing, Jiangsu Province, 210000
Patentee after: Jiangsu Suning cloud computing Co.,Ltd. (China)
Address before: No.1-1 Suning Avenue, Xuzhuang Software Park, Xuanwu District, Nanjing, Jiangsu Province, 210000
Patentee before: Suning Cloud Computing Co.,Ltd. (China)
TR01: Transfer of patent right (effective date of registration: 2024-01-31)
Address after: Room 3104, Building A5, No. 3 Gutan Avenue, Economic Development Zone, Gaochun District, Nanjing City, Jiangsu Province, 210000
Patentee after: Jiangsu Biying Technology Co.,Ltd. (China)
Address before: No.1-1 Suning Avenue, Xuzhuang Software Park, Xuanwu District, Nanjing, Jiangsu Province, 210000
Patentee before: Jiangsu Suning cloud computing Co.,Ltd. (China)