WO2021136054A1 - Voice wake-up method, apparatus, device, and storage medium - Google Patents

Voice wake-up method, apparatus, device, and storage medium

Info

Publication number
WO2021136054A1
WO2021136054A1 (PCT/CN2020/138922)
Authority
WO
WIPO (PCT)
Prior art keywords
probability
wake
feature
layer
word
Prior art date
Application number
PCT/CN2020/138922
Other languages
English (en)
French (fr)
Inventor
宋天龙
Original Assignee
Oppo广东移动通信有限公司
上海瑾盛通信科技有限公司
Priority date
Filing date
Publication date
Application filed by Oppo广东移动通信有限公司 and 上海瑾盛通信科技有限公司
Publication of WO2021136054A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G10L17/22 Interactive procedures; Man-machine interfaces

Definitions

  • the embodiments of the present application relate to the field of human-computer interaction, and in particular, to a voice wake-up method, apparatus, device, and storage medium.
  • Voice wake-up refers to a technology that wakes up a device through a specific wake-up word while the device is in a sleep state, switching the device from the sleep state to the working state so that it can begin serving the user.
  • the electronic device can continuously acquire external voice data in a sleep state, and then preprocess the voice data, and perform feature extraction on the preprocessed voice data to obtain voice features.
  • the electronic device uses the voice feature as the input of the Gaussian mixture model, predicts the probability of the wake-up word through the Gaussian mixture model, and determines whether to wake up the electronic device according to the probability of the wake-up word.
  • the probability of the wake word is used to indicate the probability that the voice data contains the preset wake word.
  • the embodiments of the present application provide a voice wake-up method, device, equipment, and storage medium.
  • the technical solution is as follows:
  • an embodiment of the present application provides a voice wake-up method, and the method includes:
  • the first output feature is used as the input of the attention model; attention calculation is performed on the features of each channel of the first output feature through the attention model to obtain an attention weight vector; the attention weight vector is scaled; and a second output feature is determined according to the processed attention weight vector and the first output feature;
  • in another aspect, a voice wake-up device includes:
  • the feature extraction module is used to perform feature extraction on the collected voice data to obtain voice features
  • the first processing module is configured to use the voice feature as the input of the U-shaped convolutional neural network model, and perform feature extraction and feature fusion on the voice feature through the U-shaped convolutional neural network model to obtain the first output feature ;
  • the second processing module is configured to use the first output feature as the input of the attention model, and perform attention calculation on the features of each channel of the first output feature through the attention model to obtain an attention weight vector, Performing scaling processing on the attention weight vector, and determining a second output feature according to the processed attention weight vector and the first output feature;
  • a third processing module configured to perform probability conversion on the second output feature to obtain a first wake-up word probability, where the first wake-up word probability is used to indicate the probability that a preset wake-up word is included in the voice data;
  • the wake-up module is used to wake up the electronic device based on the probability of the first wake-up word.
  • in another aspect, an electronic device includes a processor and a memory; the memory stores at least one instruction, and the at least one instruction is used to be executed by the processor to implement the above voice wake-up method.
  • a computer-readable storage medium stores at least one instruction, and the at least one instruction is used to be executed by a processor to implement the above-mentioned voice wake-up method.
  • a computer program product stores at least one instruction, and the at least one instruction is used to be executed by a processor to realize the above-mentioned voice wake-up method.
  • FIG. 1 is a flowchart of a voice wake-up method provided by an embodiment of the present application
  • FIG. 2 is a flowchart of feature extraction on voice data provided by an embodiment of the present application
  • Fig. 3 is a model structure diagram of a U-shaped convolutional neural network model provided by an embodiment of the present application.
  • Fig. 4 is an attention feature extraction process provided by an embodiment of the present application.
  • FIG. 5 is a model structure diagram of an attention model provided by an embodiment of the present application.
  • FIG. 6 is a flowchart of attention scaling provided by an embodiment of the present application.
  • FIG. 7 is a model structure diagram of a historical window memory model and a memory fusion processing model provided by an embodiment of the present application.
  • FIG. 8 is a flowchart of another wake-up method provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of the logical structure of a first-level wake-up algorithm provided by an embodiment of the present application.
  • FIG. 10 is a flowchart of yet another voice wake-up method provided by an embodiment of the present application.
  • FIG. 11 is a flowchart of another voice wake-up method provided by an embodiment of the present application.
  • FIG. 12 is a structural block diagram of a voice wake-up device provided by an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • the "plurality” mentioned herein means two or more.
  • “And/or” describes the association relationship of the associated objects, indicating that there can be three types of relationships, for example, A and/or B, which can mean: A alone exists, A and B exist at the same time, and B exists alone.
  • the character “/” generally indicates that the associated objects before and after are in an "or” relationship.
  • the Gaussian mixture model has insufficient capability to process the extracted speech features and poor generalization ability; moreover, because the Gaussian mixture model is mainly used to identify isolated wake-up words, its recognition of wake-up words in continuous speech is not good. This leads to lower accuracy in predicting the wake-up word probability, which in turn leads to false wake-ups.
  • the present application provides a voice wake-up method, which can solve the above problem that the Gaussian mixture model predicts the wake-up word probability with low accuracy and may thereby cause false wake-ups.
  • This solution is introduced as follows:
  • a voice wake-up method, wherein the method includes: performing feature extraction on collected voice data to obtain voice features; using the voice features as the input of a U-shaped convolutional neural network model, and performing feature extraction and feature fusion on the voice features through the U-shaped convolutional neural network model to obtain a first output feature; using the first output feature as the input of an attention model, performing attention calculation on the features of each channel of the first output feature through the attention model to obtain an attention weight vector, scaling the attention weight vector, and determining a second output feature according to the processed attention weight vector and the first output feature; performing probability conversion on the second output feature to obtain a first wake-up word probability, where the first wake-up word probability is used to indicate the probability that the voice data includes a preset wake-up word; and waking up the electronic device based on the first wake-up word probability.
  • the U-shaped convolutional neural network includes N network layer groups, where each network layer group includes a convolutional neural network layer, a batch normalization layer, and a linear activation layer; among the N network layer groups, the output feature of a designated shallow network layer flows to a designated deep network layer, so as to perform feature fusion between the shallow network and the deep network in the N network layers.
  • the attention model includes a pooling layer, a convolutional layer, a first fully connected layer, and a first nonlinear activation layer;
  • performing attention calculation on the features of each channel of the first output feature through the attention model to obtain an attention weight vector includes: performing a pooling operation on the features of each channel of the first output feature through the pooling layer to obtain the output feature of the pooling layer;
  • the output feature of the pooling layer is used as the input of the convolutional layer, and the output feature of the pooling layer is convolved through the convolutional layer to obtain the output feature of the convolutional layer;
  • the output feature of the convolutional layer is used as the input of the first fully connected layer, and the output feature of the convolutional layer is processed through the first fully connected layer to obtain the output feature of the first fully connected layer;
  • the output feature of the first fully connected layer is used as the input of the first non-linear activation layer, and the output feature of the first fully connected layer is non-linearly processed by the first non-linear activation layer to obtain the attention weight vector.
  • the attention model further includes an attention scaling layer, and the input of the attention scaling layer includes the first output feature and the attention weight vector; determining the second output feature according to the processed attention weight vector and the first output feature includes: scaling the attention weight vector through the attention scaling layer to obtain a first scaling weight vector; normalizing the first scaling weight vector through the attention scaling layer to obtain a second scaling weight vector; weighting the first output feature according to the second scaling weight vector through the attention scaling layer to obtain a third output feature; and determining the second output feature according to the third output feature.
  • the input of the attention model further includes the voice feature; the determining the second output feature according to the third output feature includes: merging the voice feature with the third output feature to obtain the second output feature.
  • the performing probability conversion on the second output feature to obtain the first wake word probability includes: performing a global pooling operation on the second output feature to obtain a global pooling feature; and performing global normalization processing on the global pooling feature to obtain the first wake-up word probability.
  • the waking up the electronic device based on the first wake word probability includes: determining M historical wake word probabilities, where the M historical wake word probabilities are obtained by predicting historical voice data; fusing the M historical wake word probabilities with the first wake word probability to obtain a second wake word probability; and waking up the electronic device based on the second wake word probability.
  • the fusion processing of the M historical wake word probabilities and the first wake word probability to obtain the second wake word probability includes: using the M historical wake word probabilities and the first wake word probability as the input of a historical window memory model; performing feature extraction on the M historical wake word probabilities through the historical window memory model, and multiplying the extracted features point by point with the first wake word probability to obtain a fusion feature; using the first wake word probability as the input of a feature extraction model, and performing feature extraction on the first wake word probability through the feature extraction model to obtain a first probability feature; and determining the second wake word probability according to the first probability feature and the fusion feature.
  • the historical window memory model includes a bidirectional recurrent neural network (RNN) layer, a first point-wise multiplication layer, a normalization processing layer, and a second point-wise multiplication layer, where the bidirectional RNN layer includes a first RNN layer and a second RNN layer;
  • using the M historical wake word probabilities and the first wake word probability as the input of the historical window memory model includes: performing feature extraction on the M historical wake word probabilities through the first RNN layer and the second RNN layer respectively to obtain a second probability feature and a third probability feature; using the first wake word probability and the second probability feature as the input of the first point-wise multiplication layer, and multiplying the first wake word probability and the second probability feature point by point through the first point-wise multiplication layer to obtain the output feature of the first point-wise multiplication layer; using the output feature of the first point-wise multiplication layer as the input of the normalization processing layer, and normalizing the output feature of the first point-wise multiplication layer through the normalization processing layer to obtain the output feature of the normalization processing layer; and using the output feature of the normalization processing layer and the third probability feature as the input of the second point-wise multiplication layer, and multiplying the output feature of the normalization processing layer and the third probability feature point by point through the second point-wise multiplication layer to obtain the fusion feature.
  • the feature extraction model includes a second fully connected layer and a second nonlinear activation layer; performing feature extraction on the first wake-up word probability through the feature extraction model to obtain the first probability feature includes: processing the first wake-up word probability through the second fully connected layer to obtain the output feature of the second fully connected layer; and using the output feature of the second fully connected layer as the input of the second non-linear activation layer, where the second non-linear activation layer performs non-linear processing on the output feature of the second fully connected layer to obtain the first probability feature.
  • the determining the second wake word probability according to the first probability feature and the fusion feature includes: updating the first probability feature based on a probability threshold to obtain an updated first probability feature, where if the first probability feature is greater than the probability threshold, the updated first probability feature is 1, and if the first probability feature is less than or equal to the probability threshold, the updated first probability feature is 0; and adding a first product and a second product to obtain the second wake word probability, where the first product is the product of the updated first probability feature and the first wake word probability, the second product is the product of a designated difference and the fusion feature, and the designated difference refers to the difference between 1 and the updated first probability feature.
  • the waking up the electronic device based on the second wake word probability includes: if the second wake word probability is greater than a probability threshold, using the voice feature as the input of the RNN model, and predicting through the RNN model the probability that the voice data includes the preset wake-up word to obtain a third wake-up word probability; and if the third wake-up word probability is greater than the probability threshold, waking up the electronic device.
  • the electronic device is configured with a first processor and a second processor, and the power consumption of the first processor is less than that of the second processor. Before performing feature extraction on the collected voice data, the method further includes: collecting voice data through the first processor. If the second wake-up word probability is greater than the probability threshold, using the voice feature as the input of the RNN model and predicting through the RNN model the probability that the voice data includes the preset wake-up word to obtain the third wake-up word probability includes: switching the first processor from the working state to the sleep state, starting the second processor, using the voice feature as the input of the RNN model through the second processor, and predicting through the RNN model the probability that the voice data includes the preset wake-up word to obtain the third wake-up word probability. After the third wake-up word probability is obtained, the second processor can in turn be switched from the working state to the sleep state and the first processor restarted, as described below.
  • the waking up the electronic device includes: performing voiceprint recognition on the voice data to identify whether the voiceprint feature of the voice data matches the stored voiceprint feature; and if the voiceprint feature of the voice data matches the stored voiceprint feature, waking up the electronic device.
  • the waking up the electronic device includes: if the electronic device is in the off-screen state, triggering the electronic device to turn on the screen, or triggering the electronic device to turn on and unlock the screen, or waking up the voice assistant; if the electronic device is in the on-screen state, triggering the electronic device to unlock, or waking up the voice assistant.
  • the method further includes: in response to the electronic device having established a wireless communication connection with a voice collection device, receiving the voice data sent by the voice collection device, where a microphone is provided in the voice collection device.
  • the waking up the electronic device based on the first wake-up word probability includes: in response to the voice collection device being a vehicle-mounted terminal and the moving speed of the electronic device being greater than a speed threshold, waking up the electronic device based on the first wake-up word probability.
  • After feature extraction on the collected voice data, this application performs feature extraction and feature fusion on the voice features through the U-shaped convolutional neural network model, which fuses low-level features and high-level features to obtain the first output feature. Attention is then calculated on the features of each channel of the first output feature through the attention model to obtain the attention weight vector, and the attention weight vector is scaled so that the first output feature can be weighted according to the processed attention weight vector. In this way, useful features are enhanced and useless features are weakened. Because the extracted features undergo full feature fusion and attention calculation, the predicted wake-up word probability is more accurate and the generalization ability is stronger; moreover, attention calculation focuses speech recognition on the wake-up words, so recognition is better when wake-up words appear in continuous speech, thereby reducing the probability of false wake-ups.
  • the voice wake-up method provided in the embodiments of the present application is applied to an electronic device.
  • the electronic device may be a smart speaker, a smart TV, a wearable device, or a terminal
  • the terminal may be a mobile phone, a tablet, or a computer.
  • the terminal may use the method provided in the embodiments of the present application to collect voice data from the outside world, recognize whether the voice data contains a specific wake-up word, and wake up the terminal according to the recognition result.
  • FIG. 1 is a flowchart of a voice wake-up method provided by an embodiment of the present application. The method is applied to an electronic device. As shown in FIG. 1, the method includes the following steps:
  • Step 101 Perform feature extraction on the collected voice data to obtain voice features.
  • the electronic device can continuously collect voice data from the outside world, and then perform feature extraction on the collected voice data.
  • the electronic device is equipped with a microphone, and the electronic device can collect voice data through the microphone.
  • the voice feature can be MFCC (Mel-scale Frequency Cepstral Coefficients), or another voice feature.
  • Figure 2 is a flow chart of feature extraction on voice data provided by an embodiment of the present application.
  • the process of feature extraction on voice data may include preprocessing, smoothing, Fourier transform, and MFCC extraction.
  • the voice signal corresponding to the voice data is filtered through a Gaussian filter; the filtered voice signal is then smoothed to smooth the edges of each frame signal; the smoothed voice signal is Fourier transformed; and the MFCC is extracted from the Fourier transform result and used as the voice feature.
  • The filtering can be expressed as H(z) = 1 − a·z⁻¹, where z is the speech signal, a is the correction coefficient (generally 0.95 to 0.97), and H(z) is the filter processing result.
  • the mathematical expression to extract MFCC from the Fourier transform result is F_mel(f) = 2595·log₁₀(1 + f/700), where f is the frequency after the Fourier transform and F_mel(f) is the MFCC.
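  • For illustration only, the following NumPy sketch shows this front end end-to-end; the 16 kHz sample rate, the Hamming window used for edge smoothing, and the helper names are assumptions of this sketch, not details from the application.

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, a: float = 0.97) -> np.ndarray:
    # Time-domain form of H(z) = 1 - a*z^(-1): y[n] = x[n] - a*x[n-1]
    return np.append(signal[0], signal[1:] - a * signal[:-1])

def hz_to_mel(f: np.ndarray) -> np.ndarray:
    # F_mel(f) = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def extract_features(signal: np.ndarray, frame_len: int = 400, fs: int = 16000):
    """Pre-emphasize, window (edge smoothing), Fourier transform, mel mapping."""
    emphasized = pre_emphasis(signal)
    n_frames = len(emphasized) // frame_len
    frames = emphasized[: n_frames * frame_len].reshape(n_frames, frame_len)
    frames = frames * np.hamming(frame_len)            # smooth the frame edges
    spectrum = np.abs(np.fft.rfft(frames, axis=1))     # Fourier transform
    mel_freqs = hz_to_mel(np.fft.rfftfreq(frame_len, d=1.0 / fs))
    return mel_freqs, spectrum
```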
  • Step 102 Use the voice feature as the input of the U-shaped convolutional neural network model, and perform feature extraction and feature fusion on the voice feature through the U-shaped convolutional neural network model to obtain the first output feature.
  • the input of the U-shaped convolutional neural network model is the speech feature extracted in step 101, and the output is the first output feature.
  • the U-shaped convolutional neural network model may be a U-shaped residual convolutional neural network model.
  • the U-shaped convolutional neural network model includes N network layer groups, where each network layer group includes a convolutional neural network layer, a batch normalization layer, and a linear activation layer; among the N network layer groups, the output features of a designated shallow network layer flow to a designated deep network layer, so that feature fusion is performed between the shallow network and the deep network in the N network layers.
  • FIG. 3 is a model structure diagram of a U-shaped convolutional neural network model provided by an embodiment of the present application.
  • the U-shaped convolutional neural network model includes N network layer groups.
  • The first network layer group includes convolutional neural network layer 1, batch normalization layer 1, and linear activation layer 1
  • the second network layer group includes convolutional neural network layer 2, batch normalization layer 2 and linear activation layer 2
  • the N-1th network layer group includes convolutional neural network layer N-1, batch normalization layer N-1, and linear activation layer N-1
  • the Nth network layer group includes convolutional neural network layer N, batch normalization layer N, and linear activation layer N.
  • the U-shaped convolutional neural network model also includes a U-shaped structure, which is used to stream the output features of the shallow network to the deep network to perform feature fusion between the shallow network and the deep network.
  • the convolutional neural network layer is a neural network layer that uses convolution as the main calculation method. It is used to arrange the voice features into a data format of C*R*1, where C is the number of feature columns, R is the number of feature rows, and the number of channels is 1. By sequentially inputting the extracted voice features into the convolutional neural network layer, the local features of the voice features can be calculated through the convolutional neural network layer.
  • the calculation formula of the convolutional neural network layer can be as shown in the following formula (1): Out = W ⊗ I + bias (1), where I represents the input of the convolutional neural network layer, W represents the weight corresponding to the convolution, ⊗ denotes the convolution operation, and bias represents the bias.
  • the result calculated by the convolutional neural network layer is a 3D feature with a size of c*r*1.
  • the batch normalization layer refers to the batch normalization neural network layer, which is a network layer that effectively performs adaptive normalization on the output of each layer.
  • the calculation formula of the batch normalization layer can be as shown in the following formulas (2)-(5): μ = (1/m)·Σᵢ xᵢ (2); σ² = (1/m)·Σᵢ (xᵢ − μ)² (3); x̂ = (x − μ)/√(σ² + ε) (4); y = γ·x̂ + β (5), where x is the input of the batch normalization layer, the mean μ and variance σ² of x are calculated through the batch normalization layer, the adaptive factors γ and β are learned, and the calculated adaptive parameters are then applied during model inference.
  • the linear activation layer is used to linearly transform the output features of the previous layer, and has the function of linearly improving the output features.
  • the calculation formula of the linear activation layer is shown in the following formula (6): y = α·x if x > 0, and y = 0 otherwise (6), where x is the input of the linear activation layer, y is the output of the linear activation layer, and α is the factor. The positive part of the feature x is multiplied by the factor α as a linear enhancement, while the part of x that is negative or 0 is set to 0.
  • the U-shaped structure is a layered structure that separates and merges the features of each layer. It can make the output features of the specified shallow network flow to the specified deep network, and feature fusion with the output features of the specified deep network.
  • the designated shallow network and the designated deep network can be set in advance.
  • the output feature of convolutional neural network layer 1 in the first network layer group can flow to linear activation layer N of the last network layer group, so that the output feature of convolutional neural network layer 1 is fused with the output feature of linear activation layer N;
  • the output feature of convolutional neural network layer 2 in the second network layer group can flow to linear activation layer N-1 of the penultimate network layer group, so that the output feature of convolutional neural network layer 2 is fused with the output feature of linear activation layer N-1.
  • multi-scale fusion is required during feature fusion, that is, if the scales of the two output features before fusion are different, the scales of the two output features need to be adjusted to be consistent, and then feature fusion is performed.
  • Feature fusion in this way can improve the final result by 3%.
  • the U-shaped convolutional neural network model repeatedly applies the convolutional neural network layer, batch normalization layer, linear activation layer, and U-shaped structure to deepen the model in the longitudinal dimension, effectively abstracting and extracting features for classification while continuously reducing the dimensionality of the model output; after multiple such stackings, the model produces the final output of the U-shaped convolutional neural network model.
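  • As a concrete but non-authoritative illustration, the following PyTorch sketch stacks conv + batch-normalization + linear-activation groups and fuses a shallow group's output into its mirrored deep group; the channel counts, kernel size, and the use of addition for fusion are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvGroup(nn.Module):
    """One network layer group: conv layer + batch normalization + linear activation."""
    def __init__(self, c_in: int, c_out: int, alpha: float = 1.0):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(c_out)
        self.alpha = alpha  # factor of the linear activation y = alpha*x (x > 0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.bn(self.conv(x))
        return torch.where(x > 0, self.alpha * x, torch.zeros_like(x))

class UShapedCNN(nn.Module):
    """Designated shallow-group outputs flow to mirrored deep groups and are fused."""
    def __init__(self, channels=(1, 16, 32, 32, 16)):
        super().__init__()
        self.groups = nn.ModuleList(
            ConvGroup(channels[i], channels[i + 1]) for i in range(len(channels) - 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs = []
        n = len(self.groups)
        for i, group in enumerate(self.groups):
            x = group(x)
            j = n - 1 - i  # index of the mirrored shallow group
            if j < i and outs[j].shape[1] == x.shape[1]:
                # multi-scale fusion: align scales, then fuse shallow and deep features
                x = x + F.interpolate(outs[j], size=x.shape[2:])
            outs.append(x)
        return x
```

  • A forward pass on a C*R*1-style input, e.g. UShapedCNN()(torch.randn(1, 1, 40, 100)), returns the first output feature.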
  • Step 103 Use the first output feature as the input of the attention model, perform attention calculation on the features of each channel of the first output feature through the attention model to obtain the attention weight vector, scale the attention weight vector, and determine the second output feature according to the processed attention weight vector and the first output feature.
  • the input of the attention model is the first output feature
  • the output is the second output feature
  • the attention model can extract the attention features of input features channel by channel.
  • the purpose of attention feature extraction is to scale the model's ability to represent the information of each channel in the high-dimensional features, so that different scalings are obtained for the voice wake-up deep learning task.
  • the attention of each channel is extracted and the information flow is divided: the original input features are scaled for attention channel by channel while the original input features are also retained.
  • FIG. 4 is an attention feature extraction process provided by an embodiment of the present application. As shown in FIG. 4, the first output feature can be subjected to channel-by-channel attention feature extraction.
  • Figure 5 is a model structure diagram of an attention model provided by an embodiment of the present application.
  • the attention model includes a pooling layer, a convolutional layer, a first fully connected layer, and a first non-linear activation layer.
  • the attention calculation is performed on the features of each channel of the first output feature through the attention model, and the operation of obtaining the attention weight vector includes the following steps 1) to 4):
  • Step 1) Perform a pooling operation on the features of each channel of the first output feature through the pooling layer to obtain the output features of the pooling layer.
  • the input of the pooling layer is the first output feature.
  • the pooling layer can perform pooling operations on the features of each channel of the first output feature respectively.
  • the pooling layer is a TopN pooling layer, which is used to perform TopN-dimensional feature extraction on each channel of the first output feature. That is, for each channel of the first output feature, the TopN pooling layer sorts all the features of the channel in descending order and extracts the top N features as the pooling result of that channel. Performing this operation on all channels in turn yields the output features.
  • the size of the first output feature is C*H*W, where C is the number of channels, H is the height, W is the width, and the pooling layer is the TopN pooling layer.
  • the TopN pooling layer sorts all the features of each channel in descending order and extracts the top N features as the pooling value of the channel. Performing the above operation on all channels in turn yields an output feature with a size of C*N*1.
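  • A possible realization of this TopN pooling is sketched below in PyTorch; treating N as an argument and operating on a single, batch-free C*H*W tensor are assumptions of the sketch.

```python
import torch

def top_n_pool(features: torch.Tensor, n: int) -> torch.Tensor:
    """TopN pooling: keep the N largest values of each channel, in descending order.

    features: a C*H*W first output feature; returns a C*N*1 tensor.
    """
    c = features.shape[0]
    flat = features.reshape(c, -1)              # C x (H*W)
    top_vals, _ = torch.topk(flat, k=n, dim=1)  # top-N per channel, descending
    return top_vals.reshape(c, n, 1)
```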
  • Step 2) Use the output feature of the pooling layer as the input of the convolutional layer, and perform convolution processing on the output feature of the pooling layer through the convolutional layer to obtain the output feature of the convolutional layer.
  • the convolutional layer is a convolutional neural network layer, which is used to perform convolution processing on the output features of the pooling layer. For example, after the pooling layer obtains an output feature of size C*N*1, it can input that feature to the convolutional layer for convolution processing to obtain an output feature of size C/N*1*1.
  • the calculation formula of the convolutional layer is as follows: Out = W ⊗ I + bias, where I represents the input of the convolutional layer, W represents the weight corresponding to the convolution, and bias represents the bias.
  • Step 3) Use the output feature of the convolutional layer as the input of the first fully connected layer, and process the output feature of the convolutional layer through the first fully connected layer to obtain the output feature of the first fully connected layer.
  • the first fully connected layer is a neural network layer that uses weights as a calculation method to calculate local features for the input features. For example, if the size of the output feature of the convolutional layer is C/N*1*1, the size of the output feature of the first fully connected layer obtained by the calculation of the first fully connected layer is C*1*1.
  • the attention model may include one or more first fully connected layers, where each first fully connected layer processes the output features of the previous network layer and then inputs its output features to the next network layer.
  • the attention model includes two first fully connected layers.
  • Step 4) Use the output feature of the first fully connected layer as the input of the nonlinear activation layer, and perform nonlinear processing on the output feature of the first fully connected layer through the nonlinear activation layer to obtain the attention weight vector.
  • the non-linear activation layer is used to perform non-linear transformation on the output feature of the first fully connected layer, and has the function of non-linear enhancement of the output feature.
  • the size of the attention weight vector is C*1*1.
  • the calculation formula of the non-linear activation layer can take the form of a sigmoid function: y = 1/(1 + e⁻ˣ), where y is the output of the non-linear activation layer, that is, the attention weight vector, and x is the input of the non-linear activation layer.
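  • Putting steps 1) to 4) together, a hedged PyTorch sketch of the attention weight computation might look as follows; the reduction ratio N (with C assumed divisible by N) and the sigmoid as the first non-linear activation are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class AttentionWeights(nn.Module):
    """Pooling layer -> convolutional layer -> first fully connected layer -> nonlinearity.

    Maps a C*H*W first output feature to a C*1*1 attention weight vector.
    """
    def __init__(self, c: int, n: int = 4):
        super().__init__()
        self.n = n
        # convolution maps the C*N*1 pooled feature to C/N*1*1
        self.conv = nn.Conv2d(c, c // n, kernel_size=(n, 1))
        # first fully connected layer maps C/N values back to C values
        self.fc = nn.Linear(c // n, c)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: C*H*W
        c = x.shape[0]
        pooled, _ = torch.topk(x.reshape(c, -1), k=self.n, dim=1)   # TopN pooling
        z = self.conv(pooled.reshape(1, c, self.n, 1)).reshape(-1)  # C/N values
        weights = torch.sigmoid(self.fc(z))   # first non-linear activation layer
        return weights.reshape(c, 1, 1)       # attention weight vector, C*1*1
```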
  • the attention model further includes an attention scaling layer, and the input of the attention scaling layer includes the first output feature and the attention weight vector. That is, the U-shaped convolutional neural network model can input the first output feature to both the pooling layer and the attention scaling layer of the attention model; after the attention weight vector is calculated by the non-linear activation layer, the attention weight vector is also input to the attention scaling layer, and the attention scaling layer processes the first output feature and the attention weight vector to obtain the second output feature.
  • the operations of scaling the attention weight vector through the attention model and determining the second output feature may include the following steps:
  • the attention weight vector is scaled to obtain the first scaling weight vector.
  • the attention weight vector can be scaled by any one of five scaling formulas to obtain the first scaling weight vector, where a_t is the first scaling weight vector, h_t is the attention weight vector, and b is a preset parameter.
  • In the process of scaling, the attention weight vector can also be scaled separately through the five scaling methods to obtain five first scaling weight vectors, and the mean value of the five first scaling weight vectors is then used as the final first scaling weight vector.
  • the first scaled weight vector may be normalized to obtain the second scaled weight vector.
  • the calculation formula for normalization is as follows: k_t = e^{a_t} / Σ_j e^{a_j}, where k_t is the t-th dimension of the second scaling weight vector and a_t is the t-th dimension of the first scaling weight vector.
  • the first output feature is weighted according to the second scaling weight vector to obtain the third output feature.
  • the first output feature can be weighted according to the second scaling weight vector through the following formula: u = k ⊙ j, where u is the third output feature, k is the second scaling weight vector, j is the first output feature, and ⊙ denotes channel-wise multiplication.
  • For example, if the size of the first output feature is C*H*W and the size of the second scaling weight vector is C*1*1, the size of the third output feature is C*H*W.
  • the third output feature can be directly determined as the second output feature.
  • the input of the attention model may also include voice features, and the voice features and the third output feature may be combined to obtain the second output feature.
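  • The attention scaling layer can be sketched as below; since the application's five scaling formulas are not reproduced here, a tanh-based scaling with parameter b stands in for them, which is an assumption of the sketch.

```python
import torch

def attention_scale(weights: torch.Tensor, first_output: torch.Tensor,
                    b: float = 1.0) -> torch.Tensor:
    """Scale the attention weight vector, normalize it, and weight the first output.

    weights: C*1*1 attention weight vector; first_output: C*H*W feature.
    """
    a = torch.tanh(b * weights.reshape(-1))   # first scaling weight vector (stand-in)
    k = torch.softmax(a, dim=0)               # second (normalized) scaling weight vector
    c = first_output.shape[0]
    return first_output * k.reshape(c, 1, 1)  # third output feature, C*H*W
```

  • The second output feature is then either this third output feature itself or its merge with the original voice features, as described above.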
  • the processing flow of the attention scaling layer may be as shown in FIG. 6, which is an attention scaling flowchart provided by an embodiment of the present application.
  • the attention weight vector is scaled through the attention model, and the first output feature is weighted according to the processed attention weight vector to obtain the second output feature. In this way, low-dimensional features and high-dimensional features can be combined, giving the model better generalization ability in a variety of scenarios.
  • Step 104 Perform probability conversion on the second output feature to obtain the probability of the first wake-up word, which is used to indicate the probability that the preset wake-up word is included in the voice data.
  • Probability conversion is performed on the second output feature, that is, feature mapping is performed between the second output feature and the wake-up word probability to obtain the first wake-up word probability.
  • the first wake-up word probability is a probability estimate for the category, and its range is generally [0, 1].
  • the operation of performing probability conversion on the second output feature to obtain the probability of the first wake-up word includes: performing a global pooling operation on the second output feature to obtain a global pooling feature; performing a global normalization on the global pooling feature Processing, get the probability of the first wake word.
  • the calculation formula for global pooling can be as follows: I = (1/(H·W))·Σᵢ ωᵢ, where I is the global pooling feature and ωᵢ ranges over the elements of each channel of the second output feature; the size of the global pooling feature is C*1*1.
  • the calculation formula for normalization is as follows: g_t = e^{I_t} / Σ_j e^{I_j}, where g_t is the first wake-up word probability and I_t is the t-th dimension of the global pooling feature.
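  • A minimal sketch of this probability conversion, assuming average pooling for the global pooling operation and softmax for the global normalization:

```python
import torch

def wake_word_probability(second_output: torch.Tensor) -> torch.Tensor:
    """Global pooling over each channel of the C*H*W second output feature,
    followed by global normalization to probabilities in [0, 1]."""
    c = second_output.shape[0]
    pooled = second_output.reshape(c, -1).mean(dim=1)  # global pooling: C values
    return torch.softmax(pooled, dim=0)                # first wake-up word probability
```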
  • After the first wake-up word probability is obtained, the electronic device can be awakened based on it. For example, if the first wake-up word probability is greater than the probability threshold, it is determined that the voice recognition is passed and the electronic device is triggered to wake up; if the first wake-up word probability is less than or equal to the probability threshold, it is determined that the voice recognition is not passed, the electronic device is not triggered to wake up, and voice data continues to be collected and the above steps repeated to recognize it.
  • the probability threshold may be the probability threshold at which the EER (Equal Error Rate) over the data set samples is minimized, so that the false wake-up rate and false rejection rate of the model are balanced.
  • the target wake-up word probability may also be determined based on the first wake-up word probability, so as to wake up the electronic device based on the target wake-up word probability. For example, if the target wake-up word probability is greater than the probability threshold, it is judged that the voice recognition is passed and the electronic device is triggered to wake up; if the target wake-up word probability is less than or equal to the probability threshold, it is judged that the voice recognition is not passed, the electronic device is not triggered to wake up, and voice data continues to be collected and the above steps repeated to recognize it.
  • Step 105 Determine the target arousal word probability based on the first arousal word probability.
  • the operation of determining the target wake-up word probability may include the following two implementation manners:
  • In the first implementation manner, the first wake-up word probability is determined as the target wake-up word probability.
  • the electronic device can be awakened based on the probability of the first awakening word.
  • In the second implementation manner, the first wake-up word probability and the historical wake-up word probabilities are fused to obtain the second wake-up word probability, and the second wake-up word probability is determined as the target wake-up word probability.
  • In this way, the prediction accuracy of the wake-up word probability can be further improved, thereby reducing the false wake-up rate.
  • Specifically, M historical wake-up word probabilities can be determined, where the M historical wake-up word probabilities are obtained by predicting historical voice data; the M historical wake-up word probabilities and the first wake-up word probability are then fused to obtain the second wake-up word probability.
  • the M historical wake word probabilities and the first wake word probability can be fused, and the operation of obtaining the second wake word probability includes the following steps:
  • Step 1051 Use the M historical wake-up word probabilities and the first wake-up word probability as the input of the historical window memory model, perform feature extraction on the M historical wake-up word probabilities through the historical window memory model, and multiply the extracted features point by point with the first wake-up word probability to obtain the fusion feature.
  • the historical window memory model stores the wake-up word probabilities output for the M historical windows in turn, and performs secondary feature extraction on the retained historical wake-up word probabilities, giving the model's probability estimate a memory ability.
  • the data size of the probability of M historical wake-up words is M*C.
  • FIG. 7 is a model structure diagram of a historical window memory model and a memory fusion processing model provided by an embodiment of the present application.
  • the historical window memory model includes a bidirectional RNN (Recurrent Neural Network) layer, a first point-wise multiplication layer, a normalization processing layer, and a second point-wise multiplication layer.
  • the bidirectional RNN layer includes a first RNN layer and a second RNN layer.
  • step 1051 may include the following steps:
  • the bidirectional RNN layer can effectively extract and process the features of sequence information.
  • the bidirectional RNN layer may be an N-node bidirectional RNN layer.
  • the M historical wake word probabilities are used as the input of the first RNN layer and the second RNN layer respectively.
  • the first RNN layer performs feature extraction on the M historical wake-up word probabilities to obtain the second probability feature, and the second RNN layer performs feature extraction on the M historical wake-up word probabilities to obtain the third probability feature.
  • the next network layer of the first RNN layer is the first point-wise multiplication layer, whose input includes not only the second probability feature output by the first RNN layer but also the first wake-up word probability.
  • the feature size of the first wake word probability is the same as the feature size of the second probability feature.
  • the output feature of the first point-by-point multiplication layer may be a one-dimensional feature vector of size C.
  • the next network layer of the first point-by-point multiplication layer is the normalized processing layer.
  • the normalization processing layer may be a softmax layer.
  • the calculation formula of the normalization processing layer can be as follows: h_t = e^{c_t} / Σ_j e^{c_j}, where h_t is the output feature of the normalization processing layer and c_t is the output feature of the first point-wise multiplication layer.
  • the output feature of the normalized processing layer and the third probability feature are used as the input of the second point-by-point multiplication layer, and the output features of the normalized processing layer and the third probability feature are processed through the second point-wise multiplication layer. Multiply point by point to get the fusion feature.
  • the output feature of the normalized processing layer can be multiplied point by point with the output feature of another bidirectional RNN layer to obtain the fusion feature.
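  • A hedged PyTorch sketch of the historical window memory model follows; using two independent nn.RNN layers to emulate the bidirectional RNN layer and a hidden size equal to C are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class HistoryWindowMemory(nn.Module):
    """Bidirectional RNN layer + two point-wise multiplications + normalization."""
    def __init__(self, c: int):
        super().__init__()
        self.fwd = nn.RNN(input_size=c, hidden_size=c, batch_first=True)  # first RNN layer
        self.bwd = nn.RNN(input_size=c, hidden_size=c, batch_first=True)  # second RNN layer

    def forward(self, history: torch.Tensor, current: torch.Tensor) -> torch.Tensor:
        """history: (M, C) historical wake-up word probabilities; current: (C,)."""
        h = history.unsqueeze(0)                   # (1, M, C)
        _, p2 = self.fwd(h)                        # second probability feature
        _, p3 = self.bwd(torch.flip(h, dims=[1]))  # third probability feature
        x = current * p2.reshape(-1)               # first point-wise multiplication
        x = torch.softmax(x, dim=0)                # normalization processing layer
        return x * p3.reshape(-1)                  # second point-wise multiplication -> fusion feature
```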
  • Step 1052 Use the first wake-up word probability as the input of the memory fusion processing model, perform feature extraction on the first wake-up word probability through the memory fusion processing model to obtain the first probability feature, and determine the second wake-up word probability according to the first probability feature and the fusion feature.
  • the memory fusion processing model includes a feature extraction model. The first wake-up word probability can be used as the input of the feature extraction model, and feature extraction is performed on the first wake-up word probability through the feature extraction model to obtain the first probability feature.
  • the feature extraction model includes a second fully connected layer and a second non-linear activation layer. When feature extraction is performed on the first wake-up word probability through the feature extraction model, the second fully connected layer processes the first wake-up word probability to obtain the output feature of the second fully connected layer; the output feature of the second fully connected layer is then used as the input of the second non-linear activation layer, which performs non-linear processing on it to obtain the first probability feature.
  • determining the second wake-up word probability according to the first probability feature and the fusion feature may include the following steps:
  • First, the first probability feature is updated based on the probability threshold to obtain the updated first probability feature: if the first probability feature is greater than the probability threshold, the updated first probability feature is 1; if the first probability feature is less than or equal to the probability threshold, the updated first probability feature is 0.
  • the first probability feature can be updated by the following formula: G = 1 if G > thre, and G = 0 otherwise, where G is the first probability feature and thre is the probability threshold.
  • Then, the first product and the second product are added to obtain the second wake-up word probability, where the first product is the product of the updated first probability feature and the first wake-up word probability, the second product is the product of the designated difference and the fusion feature, and the designated difference refers to the difference between 1 and the updated first probability feature.
  • the second wake-up word probability can be determined by the following formula: output = G·input + (1 − G)·memory, where G is the updated first probability feature, input is the first wake-up word probability, and memory is the fusion feature.
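  • In code form, this gated fusion is nearly a one-liner; the sketch below assumes tensor inputs:

```python
import torch

def second_wake_word_probability(first_prob_feature: torch.Tensor,
                                 wake_prob: torch.Tensor,
                                 memory: torch.Tensor,
                                 thre: float) -> torch.Tensor:
    """output = G * input + (1 - G) * memory, with G thresholded to 0 or 1."""
    g = (first_prob_feature > thre).float()  # updated first probability feature
    return g * wake_prob + (1.0 - g) * memory
```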
  • In summary, feature extraction and feature fusion are first performed on the voice features through the U-shaped convolutional neural network model, fusing low-level and high-level features to obtain the first output feature; the attention model then performs attention calculation on the features of each channel of the first output feature to obtain the attention weight vector, which is scaled so that the first output feature can be weighted according to the processed attention weight vector. In this way, useful features are enhanced and useless features are weakened. Because the extracted features undergo full feature fusion and attention calculation, the predicted wake-up word probability is more accurate and the generalization ability is stronger.
  • Through attention calculation, the attention of speech recognition can be focused on the wake-up words, so the recognition effect is better when wake-up words are included in continuous speech, thereby reducing the probability of false wake-ups.
  • In addition, the historical wake-up word probabilities can be used to suppress jitter and false triggering in wake-up word detection.
  • the embodiment of the present application may also adopt a multi-level wake-up algorithm to recognize the voice data.
  • For convenience of description, the wake-up algorithm in the embodiment of FIG. 1 is called the first-level wake-up algorithm.
  • Next, the method of recognizing voice data through the multi-level wake-up algorithm is introduced in detail.
  • FIG. 8 is a flowchart of another wake-up method provided by an embodiment of the present application. The method is applied to an electronic device. As shown in FIG. 8, the method includes the following steps:
  • Step 801 Collect voice data.
  • the electronic device can continuously collect voice data from the outside world, so as to predict the probability of wake-up words on the collected voice data.
  • the electronic device is equipped with a microphone, and the electronic device can collect voice data through the microphone.
  • Step 802 Recognize the collected voice data through the first-level wake-up algorithm to obtain the target wake-up word probability.
  • the target wake-up word probability can be the first wake-up word probability predicted by the U-shaped convolutional neural network model and the attention model, or the second wake-up word probability predicted by the U-shaped convolutional neural network model, the attention model, the historical window memory model, and the memory fusion processing model.
  • FIG. 9 is a schematic diagram of the logical structure of a first-level wake-up algorithm provided by an embodiment of the present application.
  • the first-level wake-up algorithm includes a voice feature extraction module 901, a U-shaped convolutional neural network module 902, an attention feature extraction module 903, a wake-up word probability prediction module 904, a historical window memory module 905, and a memory fusion processing module 906.
  • the voice feature extraction module 901 is used to perform feature extraction on voice data to obtain voice features.
  • the U-shaped convolutional neural network module 902 is used to perform feature extraction and feature fusion on voice features through the U-shaped convolutional neural network model to obtain the first output feature.
  • the attention feature extraction module 903 is used to perform attention calculation on the features of each channel of the first output feature through the attention model to obtain the attention weight vector, scale the attention weight vector, and weight the first output feature according to the processed attention weight vector to obtain the second output feature.
  • the wake word probability prediction module 904 is configured to perform probability conversion on the second output feature to obtain the first wake word probability.
  • the historical window memory module 905 is used to extract features from the probability of M historical wake-up words through the historical window memory model, and multiply the extracted features and the probability of the first wake-up word point by point to obtain the fusion feature.
  • the memory fusion processing module 906 is configured to perform feature extraction on the probability of the first arousal word through the memory fusion processing model to obtain the first probability feature, and determine the probability of the second arousal word according to the first probability feature and the fusion feature.
  • Step 803 Determine whether the probability of the target wake word is greater than the probability threshold.
  • the probability threshold can be preset or calculated.
  • the probability threshold may be the probability threshold at which the EER (Equal Error Rate) over the data set samples is minimized, so that the false wake-up rate and false rejection rate of the model are balanced.
  • If the target wake-up word probability is greater than the probability threshold, it is determined that the first-level wake-up algorithm's speech recognition is passed, and the process jumps to step 804. If the target wake-up word probability is less than or equal to the probability threshold, it is determined that the first-level wake-up algorithm's speech recognition fails, and the process returns to step 801 to continue collecting voice data and recognizing it through the first-level wake-up algorithm.
  • Step 804 Start the second-level wake-up algorithm, identify the collected voice data through the second-level wake-up algorithm, and obtain the third wake-up word probability.
  • the second-level wake-up algorithm is a wake-up algorithm with higher recognition accuracy than the first-level wake-up algorithm, so on the basis of the first-level algorithm's speech recognition, the voice data can be further recognized and verified through the second-level wake-up algorithm. In this way, the accuracy of speech recognition can be further improved, bringing a better speech recognition effect and reducing the false wake-up rate.
  • the second-level wake-up algorithm may be a wake-up algorithm based on the RNN model.
  • the RNN model may be a sequence-based LSTM RNN model.
  • the operation of predicting the wake-up word probability of the collected voice data through the second-level wake-up algorithm includes: using the voice features as the input of the RNN model, and predicting through the RNN model the probability that the voice data includes the preset wake-up word to obtain the third wake-up word probability.
  • Step 805 If the probability of the third wake word is greater than the probability threshold, wake up the electronic device.
  • the third wake-up word probability can be compared with the probability threshold. If the third wake-up word probability is less than or equal to the probability threshold, it is determined that the second-level wake-up algorithm's speech recognition fails, and voice data continues to be collected and its wake-up word probability predicted through the first-level wake-up algorithm. If the third wake-up word probability is greater than the probability threshold, it is determined that the second-level wake-up algorithm's speech recognition is passed and the electronic device is triggered to wake up; alternatively, the voice data is further subjected to voice recognition and the electronic device is awakened according to the voice recognition result.
  • in addition, to reduce the power consumption of the electronic device, when the speech recognition of the first-level wake-up algorithm passes and the second-level wake-up algorithm is started, the first-level wake-up algorithm can be stopped; when the speech recognition of the second-level wake-up algorithm fails, the first-level wake-up algorithm is restarted and the second-level wake-up algorithm is stopped. In this way, the high power consumption caused by running both algorithms simultaneously is avoided, and the first-level and second-level wake-up algorithms run alternately.
  • in addition, to further reduce the power consumption of the electronic device, the electronic device can also be improved at the hardware level.
  • the first processor and the second processor are configured in the electronic device, and the power consumption of the first processor is less than that of the second processor.
  • the first processor is used to collect voice data, and recognize the voice data through a first-level wake-up algorithm.
  • the second processor is used for recognizing voice data through the second-level wake-up algorithm.
  • the first processor is a DSP (Digital Signal Processor)
  • the second processor is an ARM (Advanced RISC Machine, a reduced-instruction-set microprocessor).
  • the workflow of the first processor and the second processor is as follows: the first processor continuously collects voice data and recognizes it through the first-level wake-up algorithm to obtain the target wake-up word probability. If the target wake-up word probability is less than or equal to the probability threshold, the first processor continues to collect voice data and recognize it through the first-level wake-up algorithm. If the target wake-up word probability is greater than the probability threshold, the first processor is switched from the working state to the dormant state, the second processor is started, and the second processor recognizes the voice data through the second-level wake-up algorithm to obtain the third wake-up word probability.
  • if the third wake-up word probability is less than or equal to the probability threshold, the second processor is switched from the working state to the dormant state, the first processor is started, and the first processor continues to collect voice data, which is recognized through the first-level wake-up algorithm. If the third wake-up word probability is greater than the probability threshold, the electronic device is triggered to wake up, or speech recognition is further performed on the voice data and the electronic device is awakened according to the speech recognition result, as sketched below.
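  • a minimal Python sketch of this alternating workflow; the dsp and arm objects and their method names are illustrative placeholders:

```python
THRESHOLD = 0.5  # assumed to be the EER-derived probability threshold

def wake_loop(dsp, arm):
    # dsp: low-power first processor; arm: higher-power second processor
    while True:
        voice = dsp.collect_voice()
        if dsp.first_level_wake(voice) <= THRESHOLD:
            continue                          # keep only the DSP running
        dsp.sleep()
        arm.start()                           # hand the voice data to the ARM core
        if arm.second_level_wake(voice) > THRESHOLD:
            return arm.trigger_wakeup(voice)  # or run further speech recognition
        arm.sleep()
        dsp.start()                           # fall back to low-power listening
```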
  • the preset wake-up words described in the embodiments of the present application may be set by default by the electronic device, or may be set by the user.
  • when the preset wake-up word is set by the user, the user can register it in the electronic device, for example, by pre-recording the preset wake-up word through a microphone.
  • different wake-up methods can be set for the electronic device to improve flexibility and meet the diverse needs of users. For example, for the off-screen state and the on-screen state of an electronic device, different wake-up methods can be set.
  • the operation of triggering the wake-up of the electronic device includes: if the electronic device is in the off-screen state, triggering the electronic device to turn on the screen, or to turn on and unlock the screen, or invoking the voice assistant; if the electronic device is in the on-screen state, triggering the electronic device to unlock, or invoking the voice assistant. One possible dispatch is sketched below.
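  • a minimal Python sketch of one possible dispatch over these alternatives; the device methods are illustrative assumptions:

```python
def on_wake(device, unlock_on_wake: bool = False):
    # one policy among the alternative actions listed above
    if device.screen_is_off():
        device.light_screen()
        if unlock_on_wake:
            device.unlock()
    else:
        device.unlock()
    device.invoke_voice_assistant()
```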
  • the embodiment of the application provides a multi-level wake-up algorithm speech recognition method.
  • the first-level wake-up algorithm can comprehensively recognize the wake-up words in the voice data, and the second-level wake-up algorithm can accurately recognize them, which improves the accuracy of predicting the wake-up word probability and reduces the probability of false wake-up.
  • the embodiment of the present application improves the hardware of the electronic device.
  • the electronic device is equipped with a first processor and a second processor, and power consumption can be reduced by switching the working states of the two.
  • the embodiment of the present application adopts different wake-up schemes for the screen on and off of the electronic device, which improves the user recognition rate and reduces the false wake-up rate while helping to reduce power consumption.
  • to protect device security and user privacy, the embodiment of the application may also adopt a multi-level wake-up algorithm + voiceprint recognition scheme to wake up the electronic device. The method of waking up electronic devices through multi-level wake-up algorithms and voiceprint recognition is introduced in detail below.
  • FIG. 10 is a flowchart of another voice wake-up method provided by an embodiment of the present application. The method is applied to an electronic device. As shown in FIG. 10, the method includes the following steps:
  • Step 1001 Collect voice data.
  • the electronic device can continuously collect voice data from the outside world, so as to predict the probability of wake-up words on the collected voice data.
  • the electronic device is equipped with a microphone, and the electronic device can collect voice data through the microphone.
  • Step 1002 Recognize the collected voice data through the first-level wake-up algorithm to obtain the target wake-up word probability.
  • the target wake-up word probability can be the first wake-up word probability predicted by the U-shaped convolutional neural network model and the attention model, or the second wake-up word probability predicted by the U-shaped convolutional neural network model, the attention model, the historical window memory model, and the memory fusion processing model.
  • Step 1003 Determine whether the probability of the target wake word is greater than the probability threshold.
  • if the target wake-up word probability is greater than the probability threshold, it is determined that the speech recognition of the first-level wake-up algorithm has passed, and the process jumps to step 1004. If the target wake-up word probability is less than or equal to the probability threshold, it is determined that the speech recognition of the first-level wake-up algorithm has failed, and the process returns to step 1001 to continue collecting voice data, which is then recognized through the first-level wake-up algorithm.
  • Step 1004 Start the second-level wake-up algorithm, recognize the collected voice data through the second-level wake-up algorithm, and obtain the third wake-up word probability.
  • Step 1005 Determine whether the probability of the third wake word is greater than the probability threshold.
  • if the third wake-up word probability is greater than the probability threshold, it is determined that the speech recognition of the second-level wake-up algorithm has passed, and the process jumps to step 1006. If the third wake-up word probability is less than or equal to the probability threshold, it is determined that the speech recognition of the second-level wake-up algorithm has failed, and the process returns to step 1001 to continue collecting voice data, whose wake-up word probability is predicted through the first-level wake-up algorithm.
  • Step 1006 Perform voiceprint recognition on the voice data to identify whether the voiceprint feature of the voice data matches the stored voiceprint feature.
  • if it is determined that the voiceprint feature of the voice data matches the stored voiceprint feature, the process jumps to step 1007. If it is determined that the voiceprint feature of the voice data does not match the stored voiceprint feature, the process returns to step 1001 to continue collecting voice data, which is then recognized through the first-level wake-up algorithm.
  • the stored voiceprint feature may be a voiceprint feature registered by the user in advance.
  • to improve registration accuracy, the user can pre-register the voiceprint feature N times, where N is an integer greater than 1. A minimal matching sketch follows.
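  • a minimal Python sketch of matching a query voiceprint against the N registered features, assuming fixed-length voiceprint embeddings and an illustrative acceptance threshold of 0.8:

```python
import numpy as np

def voiceprint_match(query: np.ndarray, enrolled: np.ndarray, thr: float = 0.8) -> bool:
    # query: (d,) embedding of the current utterance
    # enrolled: (N, d) embeddings registered by the user over N sessions
    q = query / np.linalg.norm(query)
    e = enrolled / np.linalg.norm(enrolled, axis=1, keepdims=True)
    return float(np.mean(e @ q)) > thr  # average cosine similarity vs. threshold
```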
  • Step 1007 Wake up the electronic device.
  • the operation of waking up the electronic device includes: if the electronic device is in the off-screen state, trigger the electronic device to turn on the screen, or trigger the electronic device to turn on and unlock the screen, or awaken the voice assistant; if the electronic device is in the on-screen state, trigger The electronic device is unlocked, or the voice assistant is invoked.
  • in addition, to reduce the power consumption of the electronic device, when the speech recognition of the first-level wake-up algorithm passes and the second-level wake-up algorithm is started, the first-level wake-up algorithm can be stopped; when the speech recognition of the second-level wake-up algorithm fails, the first-level wake-up algorithm is restarted and the second-level wake-up algorithm is stopped; when the second-level wake-up algorithm passes, the voiceprint recognition algorithm is started and the second-level wake-up algorithm is stopped; when the voiceprint recognition passes, the electronic device is triggered to wake up; when the voiceprint recognition fails, the first-level wake-up algorithm is started and the voiceprint recognition algorithm is stopped.
  • in this way, the high power consumption caused by the simultaneous operation of the first-level wake-up algorithm, the second-level wake-up algorithm, and the voiceprint recognition algorithm is avoided, so that the three algorithms run alternately.
  • in addition, to further reduce the power consumption of the electronic device, the electronic device can also be improved at the hardware level.
  • the first processor and the second processor are configured in the electronic device, and the power consumption of the first processor is less than that of the second processor.
  • the first processor is used to collect voice data, and recognize the voice data through a first-level wake-up algorithm.
  • the second processor is used for recognizing voice data through the second-level wake-up algorithm and, when the speech recognition of the second-level wake-up algorithm passes, performing voiceprint recognition on the voice data.
  • the first processor is a DSP
  • the second processor is an ARM.
  • the workflow of the first processor and the second processor is as follows: the first processor continuously collects voice data and recognizes it through the first-level wake-up algorithm to obtain the target wake-up word probability. If the target wake-up word probability is less than or equal to the probability threshold, the first processor continues to collect voice data and recognize it through the first-level wake-up algorithm. If the target wake-up word probability is greater than the probability threshold, the first processor is switched from the working state to the dormant state, the second processor is started, and the second processor recognizes the voice data through the second-level wake-up algorithm to obtain the third wake-up word probability.
  • if the third wake-up word probability is less than or equal to the probability threshold, the second processor is switched from the working state to the dormant state, the first processor is started, and the first processor continues to collect voice data, which is recognized through the first-level wake-up algorithm. If the third wake-up word probability is greater than the probability threshold, voiceprint recognition is performed on the voice data; if the voiceprint recognition passes, the electronic device is triggered to wake up; if the voiceprint recognition fails, the second processor is switched from the working state to the dormant state, the first processor is started, voice data continues to be collected through the first processor, and the collected voice data is recognized through the first-level wake-up algorithm.
  • the preset wake-up words described in the embodiments of the present application may be set by default by the electronic device, or may be set by the user.
  • when the preset wake-up word is set by the user, the user can register it in the electronic device, for example, by pre-recording the preset wake-up word through a microphone.
  • the first-level wake-up model corresponding to the first-level wake-up algorithm, the second-level wake-up model corresponding to the second-level wake-up algorithm, and the voiceprint recognition model corresponding to the voiceprint recognition algorithm can be obtained by pre-training, for example, by training on multiple samples of voice data, where sample voice data refers to voice data that includes the preset wake-up word.
  • the embodiment of the application provides a multi-level wake-up algorithm + voiceprint recognition voice wake-up method.
  • the first-level wake-up algorithm can comprehensively recognize the wake-up words in the voice data, the second-level wake-up algorithm can accurately recognize them, and voiceprint recognition can identify whether the person waking the device is the user, protecting device security and user privacy.
  • the embodiment of the present application improves the hardware of the electronic device.
  • the electronic device is equipped with a first processor and a second processor, and power consumption can be reduced by switching the working states of the two.
  • the embodiment of the present application adopts different wake-up schemes for the screen on and off of the electronic device, which improves the user recognition rate and reduces the false wake-up rate while helping to reduce power consumption.
  • FIG. 11 is a flowchart of another voice wake-up method provided by an embodiment of the present application. As shown in FIG. 11, the method includes the following steps:
  • Step (1) Open the voice wake-up application, and determine whether there are voiceprint features registered N times in the electronic device.
  • Step (2) When the electronic device has not stored the N voiceprint registrations: continuously collect voice data through the microphone, send the collected voice data to the second-level wake-up algorithm of the second-level wake-up module, and perform keyword detection and storage.
  • Step (3) When the first-level wake-up module does not detect voice data, the first processor is still in a sleep state.
  • Step (4) When the voice data is monitored by the first-level wake-up module, but the voice data does not pass the voice recognition of the first-level wake-up algorithm, the second processor is still in a sleep state.
  • Step (5) When the first-level wake-up module detects a voice signal and the voice data passes the speech recognition of the first-level wake-up algorithm, the first processor sends an interrupt signal and the second processor switches from the sleep state to the working state; at the same time, the first-level wake-up module transmits the voice data containing the wake-up word to the second-level wake-up module, the first processor switches from the working state to the dormant state, and the second-level wake-up module performs speech recognition on the voice data through the second-level wake-up algorithm and outputs a judgment signal, which indicates whether the voice data passes the speech recognition of the second-level wake-up algorithm.
  • Step (6) If the judgment signal indicates that the voice data has passed the speech recognition of the second-level wake-up algorithm, the voice data is sent to the voiceprint recognition module for voiceprint recognition. If the voiceprint recognition passes, the electronic device is triggered to wake up; if the speech recognition of the second-level wake-up algorithm or the voiceprint recognition fails, the second processor switches from the working state to the sleep state, the first processor switches from the sleep state to the working state, and voice data is again continuously collected through the microphone and sent to the first-level wake-up module for speech recognition.
  • Step (7) When the electronic device has stored the N voiceprint registrations: the first processor is always in the working state, continuously collecting voice data through the microphone and sending it to the first-level wake-up algorithm of the first-level wake-up module for speech recognition.
  • Step (8) When the first-level wake-up module does not detect voice data, or detects voice data that does not pass the speech recognition of the first-level wake-up algorithm, the first-level wake-up module remains in the working state, and the microphone continues to collect voice data and send it to the first-level wake-up module for speech recognition.
  • Step (9) When the first-level wake-up module detects voice data and the voice data passes the speech recognition of the first-level wake-up algorithm, the voice data containing the wake-up word is sent to the second-level wake-up module, the first processor switches from the working state to the dormant state, the microphone stops collecting audio data, and the voice data is recognized through the second-level wake-up algorithm, which outputs a judgment signal.
  • Step (10) If the judgment signal indicates that the voice data has passed the speech recognition of the second-level wake-up algorithm, the voice data is sent to the voiceprint recognition module for voiceprint recognition. If the voiceprint recognition passes, the electronic device is triggered to wake up; if the speech recognition of the second-level wake-up algorithm or the voiceprint recognition fails, the second processor switches from the working state to the sleep state, the first processor switches from the sleep state to the working state, and voice data is again continuously collected through the microphone and sent to the first-level wake-up module for speech recognition.
  • the voice wake-up method provided in the embodiments of the present application can be applied to a mobile terminal.
  • a scenario applied to a mobile terminal will be used as an example for description.
  • the voice wake-up process of the mobile terminal may include the following steps:
  • S1 Open the voice wake-up application. For example, the voice wake-up application can be found via: Settings - Security - Smart Unlock - Set a digital password - Voice wake-up application.
  • S2 The voice wake-up application reminds the user to record a wake-up word.
  • S3 The user says the wake-up word once, such as "Xiaobu Xiaobu".
  • S4 After the above step is repeated N times, the voice data recorded by the user is fed into the voice wake-up model as training data to train the model.
  • S5 A prompt indicates that training has succeeded and is complete.
  • S6 When the screen is off, the off-screen wake-up scheme of first-level wake-up algorithm + second-level wake-up algorithm + voiceprint recognition is used; when the correct user is recognized from the collected voice data, the mobile terminal is triggered to turn on the screen or the voice assistant is invoked. The user may also configure unlocking on wake-up.
  • S7 When the screen is on, the bright-screen wake-up scheme of second-level wake-up algorithm + voiceprint recognition is used; when the correct user is recognized from the collected voice data, the mobile terminal is triggered to unlock or the voice assistant is invoked.
  • S8 When making a payment, voiceprint detection is turned on: the user must be authenticated based on the voice data, and the payment can proceed only after identity authentication passes. In this case the bright-screen wake-up scheme of second-level wake-up algorithm + voiceprint recognition can be used.
  • the electronic device can also establish a wireless communication connection with other voice collection devices.
  • the voice collection device may be a wearable device or a vehicle-mounted terminal.
  • the vehicle-mounted terminal is a terminal installed in a car, and it usually has a display screen of 7 inches, 9 inches, or a similar size.
  • the voice collection device of the vehicle-mounted terminal may be a microphone, which is installed on the steering wheel or other locations inside the vehicle.
  • the technology based on the wireless communication may be Bluetooth technology, Wi-Fi technology or ZigBee technology, which is not limited in the embodiment of the present application.
  • after the electronic device establishes a wireless communication connection with the voice collection device, the electronic device can receive the voice data sent by the voice collection device. Based on this application scenario, the user can control the electronic device via the voice collection device; when the electronic device is far away, the user can still wake it up by voice through a nearby voice collection device.
  • when the voice collection device is a vehicle-mounted terminal, the electronic device can wake itself based on the first wake-up word probability when its own speed is greater than the speed threshold. The electronic device can obtain its own speed through a navigation satellite.
  • the speed threshold may be a preset value, such as 20 km/h, 30 km/h, or 40 km/h; a minimal gating sketch follows.
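  • a minimal Python sketch of this speed-gated wake-up check; the concrete threshold values are the examples given above:

```python
SPEED_THRESHOLD_KMH = 30.0  # assumed preset value (e.g., 20, 30, or 40 km/h)

def should_wake_in_vehicle(first_wake_prob: float, speed_kmh: float,
                           prob_threshold: float = 0.5) -> bool:
    # wake only when the device itself moves faster than the speed threshold
    return speed_kmh > SPEED_THRESHOLD_KMH and first_wake_prob > prob_threshold
```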
  • the embodiment of this application only takes the above voice wake-up method in the above application scenarios as an example; in other embodiments it can also be applied in other scenarios, or other voice wake-up methods can be used, which is not limited in the embodiments of the present application.
  • Fig. 12 is a structural block diagram of a voice wake-up device provided by an embodiment of the present application.
  • the device can be integrated into an electronic device.
  • the device can include a feature extraction module 1201, a first processing module 1202, a second processing module 1203, a third processing module 1204, and a wake-up module 1205.
  • the feature extraction module 1201 is used to perform feature extraction on the collected voice data to obtain voice features
  • the first processing module 1202 is configured to use the voice feature as the input of the U-shaped convolutional neural network model, and perform feature extraction and feature fusion on the voice feature through the U-shaped convolutional neural network model to obtain the first output feature;
  • the second processing module is used to take the first output feature as the input of the attention model, perform attention calculation on the features of each channel of the first output feature through the attention model to obtain the attention weight vector, scale the attention weight vector, and determine the second output feature according to the processed attention weight vector and the first output feature;
  • the third processing module 1204 is configured to perform probability conversion on the second output feature to obtain a first wake-up word probability, where the first wake-up word probability is used to indicate the probability that the voice data includes a preset wake-up word;
  • the wake-up module 1205 is used to wake up the electronic device based on the probability of the first wake-up word.
  • the U-shaped convolutional neural network includes N network layer groups, each including a convolutional neural network layer, a batch normalization layer, and a linear activation layer, and in the N network layer groups the output features of designated shallow network layers flow to designated deep network layers, so as to perform feature fusion between the shallow and deep networks. A minimal sketch of one layer group with such a skip connection follows.
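  • a minimal Python sketch of one network layer group and a shallow-to-deep skip connection, with illustrative channel counts and input sizes:

```python
import torch
import torch.nn as nn

class LayerGroup(nn.Module):
    # one group: convolutional layer + batch normalization layer + activation
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

shallow, deep = LayerGroup(1, 16), LayerGroup(16, 16)
x = torch.randn(1, 1, 40, 100)  # (batch, channel, feature rows, feature columns)
s = shallow(x)
fused = deep(s) + s             # shallow output flows to and fuses with the deep layer
```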
  • the attention model includes a pooling layer, a convolutional layer, a first fully connected layer, and a first nonlinear activation layer;
  • the second processing module is used to: perform a pooling operation on the features of each channel of the first output feature through the pooling layer to obtain the output feature of the pooling layer; use the output feature of the pooling layer as the input of the convolutional layer, which convolves it to obtain the output feature of the convolutional layer; use the output feature of the convolutional layer as the input of the first fully connected layer, which processes it to obtain the output feature of the first fully connected layer; and use the output feature of the first fully connected layer as the input of the first nonlinear activation layer, which processes it nonlinearly to obtain the attention weight vector. A minimal sketch follows.
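  • a minimal Python sketch of this pooling -> convolution -> fully connected -> sigmoid chain, assuming Top-N pooling and illustrative sizes C=16, H=40, W=100, N=4:

```python
import torch
import torch.nn as nn

C, H, W, N = 16, 40, 100, 4
first_output = torch.randn(1, C, H, W)

top_n = first_output.flatten(2).topk(N, dim=2).values  # Top-N pooling: (1, C, N)
conv = nn.Conv1d(C, C // N, kernel_size=N)             # squeeze: (1, C // N, 1)
fc = nn.Linear(C // N, C)

squeezed = conv(top_n).flatten(1)                      # (1, C // N)
attention = torch.sigmoid(fc(squeezed))                # (1, C) attention weight vector
second_output = first_output * attention.view(1, C, 1, 1)  # per-channel scaling
```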
  • the attention model further includes an attention scaling layer, and the input of the attention scaling layer includes the first output feature and the attention weight vector;
  • the second processing module is used to: scale the attention weight vector through the attention scaling layer to obtain the first scaled weight vector; normalize the first scaled weight vector to obtain the second scaled weight vector; weight the first output feature according to the second scaled weight vector to obtain the third output feature; and determine the second output feature according to the third output feature.
  • the input of the attention model further includes the speech feature; the second processing module is used to merge the speech feature with the third output feature to obtain the second output feature.
  • the third processing module is used to: perform a global pooling operation on the second output feature to obtain a global pooling feature; and perform global normalization on the global pooling feature to obtain the first wake-up word probability.
  • the wake-up module includes:
  • the determining unit is used to determine the probability of M historical wake-up words, and the probability of the M historical wake-up words is obtained by predicting historical voice data;
  • the fusion unit is used to perform fusion processing on the M historical wake word probabilities and the first wake word probability to obtain the second wake word probability;
  • the wake-up unit is used to wake up the electronic device based on the second wake word probability.
  • the fusion unit is used for:
  • the M historical wake-up word probabilities and the first wake-up word probability are used as the input of the historical window memory model, through which feature extraction is performed on the M historical wake-up word probabilities and the extracted features are multiplied point by point with the first wake-up word probability to obtain the fusion feature;
  • the first wake-up word probability is used as the input of the feature extraction model, through which feature extraction is performed on the first wake-up word probability to obtain the first probability feature;
  • the second wake-up word probability is determined according to the first probability feature and the fusion feature.
  • the historical window memory model includes a bidirectional recurrent neural network (RNN) layer, a first point-wise multiplication layer, a normalization processing layer, and a second point-wise multiplication layer; the bidirectional RNN layer includes a first RNN layer and a second RNN layer;
  • the fusion unit is used for:
  • the M historical wake-up word probabilities are used as the input of the bidirectional RNN layer, and feature extraction is performed on them through the first RNN layer and the second RNN layer respectively to obtain the second probability feature and the third probability feature;
  • the first wake-up word probability and the second probability feature are used as the input of the first point-wise multiplication layer, which multiplies them point by point to obtain its output feature;
  • the output feature of the first point-wise multiplication layer is used as the input of the normalization processing layer, which normalizes it to obtain the output feature of the normalization processing layer;
  • the output feature of the normalization processing layer and the third probability feature are used as the input of the second point-wise multiplication layer, which multiplies them point by point to obtain the fusion feature. A minimal sketch of this pipeline follows.
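  • a minimal Python sketch of this memory-fusion pipeline, using two one-directional RNN passes to stand in for the bidirectional RNN layer; all dimensions are illustrative:

```python
import torch
import torch.nn as nn

M, C = 8, 16
history = torch.rand(1, M, 1)   # M historical wake-up word probabilities
first_prob = torch.rand(1, 1)   # first wake-up word probability

rnn_fwd = nn.RNN(input_size=1, hidden_size=C, batch_first=True)
rnn_bwd = nn.RNN(input_size=1, hidden_size=C, batch_first=True)
second_feat = rnn_fwd(history)[1][-1]         # second probability feature, (1, C)
third_feat = rnn_bwd(history.flip(1))[1][-1]  # third probability feature, (1, C)

step1 = first_prob * second_feat              # first point-wise multiplication layer
step2 = torch.softmax(step1, dim=1)           # normalization processing layer
fusion_feature = step2 * third_feat           # second point-wise multiplication layer
```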
  • the feature extraction model includes a second fully connected layer and a second nonlinear activation layer
  • the fusion unit is used for:
  • the first wake-up word probability is processed through the second fully connected layer to obtain the output feature of the second fully connected layer; that output feature is then used as the input of the second nonlinear activation layer, which processes it nonlinearly to obtain the first probability feature.
  • the fusion unit is used for:
  • the first probability feature is updated based on the probability threshold to obtain the updated first probability feature, where the updated first probability feature is 1 if the first probability feature is greater than the probability threshold, and 0 if the first probability feature is less than or equal to the probability threshold;
  • the first product and the second product are added to obtain the second wake-up word probability, where the first product is the product of the updated first probability feature and the first wake-up word probability, the second product is the product of a specified difference and the fusion feature, and the specified difference refers to the difference between 1 and the updated first probability feature. A numeric sketch follows.
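  • a numeric Python sketch of this gated combination, result = G * input + (1 - G) * memory, with illustrative values:

```python
threshold = 0.5            # assumed probability threshold
first_prob_feature = 0.73  # output of the feature extraction model
first_wake_prob = 0.81     # first wake-up word probability
fusion_feature = 0.64      # output of the historical window memory model

G = 1.0 if first_prob_feature > threshold else 0.0  # binarized gate
second_wake_prob = G * first_wake_prob + (1.0 - G) * fusion_feature  # 0.81 here
```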
  • the wake-up unit is used for:
  • if the second wake-up word probability is greater than the probability threshold, the speech feature is used as the input of the RNN model, and the probability that the voice data includes the preset wake-up word is predicted through the RNN model to obtain the third wake-up word probability; if the third wake-up word probability is greater than the probability threshold, the electronic device is awakened;
  • the electronic device is configured with a first processor and a second processor, and the power consumption of the first processor is less than that of the second processor; the apparatus further includes:
  • An acquisition module configured to collect voice data through the first processor
  • the wake-up module is used to:
  • if the second wake-up word probability is greater than the probability threshold, the first processor is switched from the working state to the dormant state, the second processor is started, and through the second processor the speech feature is used as the input of the RNN model, and the probability that the voice data includes the preset wake-up word is predicted through the RNN model to obtain the third wake-up word probability;
  • the fourth processing module is configured to, if the third wake-up word probability is less than or equal to the probability threshold, switch the second processor from the working state to the sleep state, start the first processor, and continue to collect voice data through the first processor.
  • the wake-up unit is used for:
  • if the third wake-up word probability is greater than the probability threshold, voiceprint recognition is performed on the voice data to identify whether the voiceprint feature of the voice data matches the stored voiceprint feature; if a match is determined, the electronic device is awakened.
  • the wake-up module is used to:
  • if the electronic device is in the off-screen state, the electronic device is triggered to turn on the screen, or to turn on and unlock the screen, or the voice assistant is invoked; if the electronic device is in the on-screen state, the electronic device is triggered to unlock, or the voice assistant is invoked.
  • the device further includes a voice collection module, configured to receive the voice data sent by the voice collection device in response to the electronic device having established a wireless communication connection with the voice collection device, where the voice collection device is provided with a microphone.
  • the wake-up module is further configured to wake up the electronic device based on the first wake-up word probability in response to the voice collection device being a vehicle-mounted terminal and the speed of the electronic device is greater than a speed threshold.
  • it should be noted that when the voice wake-up device provided in the foregoing embodiment performs voice wake-up, the division into the above functional modules is only used as an example for illustration; in practical applications, the above functions can be allocated to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.
  • the voice wake-up device provided in the foregoing embodiment and the voice wake-up method embodiment belong to the same concept, and the specific implementation process is detailed in the method embodiment, and will not be repeated here.
  • FIG. 13 is a schematic structural diagram of an electronic device 1300 provided by an embodiment of the present application.
  • the electronic device may be a smart speaker, a smart TV, a smart wearable device, or a terminal, and the terminal may be a mobile phone, a tablet, or a computer.
  • the electronic device may vary considerably in configuration or performance, and may include one or more processors 1301 and one or more memories 1302, where at least one instruction is stored in the memory 1302 and is loaded and executed by the processor 1301 to implement the voice wake-up method provided in the foregoing method embodiments.
  • the electronic device includes a first processor and a second processor, where the power consumption of the first processor is less than that of the second processor; the first processor is used to execute the first-level wake-up algorithm, and the second processor is used to execute the second-level wake-up algorithm, or the second-level wake-up algorithm and voiceprint recognition.
  • the first processor is a DSP
  • the second processor is an ARM.
  • the electronic device may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface, and may include other components for implementing device functions, which are not repeated here.
  • a computer-readable storage medium is also provided, and instructions are stored on the computer-readable storage medium, and when the instructions are executed by a processor, the above voice wake-up method is implemented.
  • a computer program product is also provided, and when the computer program product is executed, it is used to implement the above voice wake-up method.
  • the functions described in the embodiments of the present application may be implemented by hardware, software, firmware, or any combination thereof. When implemented by software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or codes on the computer-readable medium.
  • the computer-readable medium includes a computer storage medium and a communication medium, where the communication medium includes any medium that facilitates the transfer of a computer program from one place to another.
  • the storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A voice wake-up method, apparatus, device, and storage medium, belonging to the field of human-computer interaction. The voice wake-up method includes: performing feature extraction on collected voice data to obtain a speech feature (101); performing feature extraction and feature fusion on the speech feature through a U-shaped convolutional neural network model to obtain a first output feature (102); performing attention calculation on the features of each channel of the first output feature through an attention model to obtain an attention weight vector, scaling the attention weight vector, and weighting the first output feature according to the processed attention weight vector to obtain a second output feature (103); performing probability conversion on the second output feature to obtain a first wake-up word probability (104); and determining a target wake-up word probability based on the first wake-up word probability (105). Since sufficient feature fusion and attention calculation are performed on the extracted features, the probability of false wake-up is reduced.

Description

语音唤醒方法、装置、设备及存储介质
本申请要求于2019年12月30日提交的申请号为201911392963.X、发明名称为“语音唤醒方法、装置、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及人机交互领域,特别涉及一种语音唤醒方法、装置、设备及存储介质。
背景技术
在人机交互领域,为了便于用户对设备进行控制,用户可以通过语音唤醒技术唤醒具有语音功能的设备。语音唤醒是指在设备处于休眠状态时,通过特定的唤醒词唤醒设备的技术。语音唤醒能使设备从休眠状态切换为工作状态,开始为用户进行服务。
相关技术中,电子设备可以在休眠状态下不断获取外界的语音数据,然后对语音数据进行预处理,对预处理后的语音数据进行特征提取,得到语音特征。电子设备再将语音特征作为高斯混合模型的输入,通过高斯混合模型来预测唤醒词概率,根据唤醒词概率确定是否唤醒电子设备。其中,唤醒词概率用于指示语音数据中包含预设唤醒词的概率。
发明内容
本申请实施例提供了一种语音唤醒方法、装置、设备及存储介质。所述技术方案如下:
一方面,本申请实施例提供了一种语音唤醒方法,所述方法包括:
对采集的语音数据进行特征提取,得到语音特征;
将所述语音特征作为U型卷积神经网络模型的输入,通过所述U型卷积神经网络模型对所述语音特征进行特征提取和特征融合,得到第一输出特征;
将所述第一输出特征作为注意力模型的输入,通过所述注意力模型对所述第一输出特征的各个通道的特征进行注意力计算,得到注意力权重向量,对所述注意力权重向量进行尺度化处理,根据处理后的注意力权重向量和所述第一输出特征,确定第二输出特征;
对所述第二输出特征进行概率转换,得到第一唤醒词概率,所述第一唤醒词概率用于指示所述语音数据中包括预设唤醒词的概率;
基于所述第一唤醒词概率,对电子设备进行唤醒。
另一方面,提供了一种语音唤醒装置,所述装置包括:
特征提取模块,用于对采集的语音数据进行特征提取,得到语音特征;
第一处理模块,用于将所述语音特征作为U型卷积神经网络模型的输入,通过所述U型卷积神经网络模型对所述语音特征进行特征提取和特征融合,得到第一输出特征;
第二处理模块,用于将所述第一输出特征作为注意力模型的输入,通过所述注意力模型对所述第一输出特征的各个通道的特征进行注意力计算,得到注意力权重向量,对所述注意力权重向量进行尺度化处理,根据处理后的注意力权重向量和所述第一输出特征,确定第二输出特征;
第三处理模块,用于对所述第二输出特征进行概率转换,得到第一唤醒词概率,所述第一唤醒词概率用于指示所述语音数据中包括预设唤醒词的概率;
唤醒模块,用于基于所述第一唤醒词概率,对电子设备进行唤醒。
另一方面,提供了一种电子设备,所述电子设备包括处理器和存储器;所述存储器存储有至少一条指令,所述至少一条指令用于被所述处理器执行以实现上述消息合并方法。
另一方面,提供了计算机可读存储介质,所述存储介质存储有至少一条指令,所述至少一条指令用于被处理器执行以实现上述语音唤醒方法。
另一方面,还提供了一种计算机程序产品,该计算机程序产品存储有至少一条指令,所述至少一条指令用于被处理器执行以实现上述语音唤醒方法。
附图说明
图1是本申请实施例提供的一种语音唤醒方法的流程图;
图2是本申请实施例提供的一种对语音数据进行特征提取的流程图;
图3是本申请实施例提供的一种U型卷积神经网络模型的模型结构图;
图4是本申请实施例提供的一种注意力特征提取流程;
图5是本申请实施例提供的一种注意力模型的模型结构图;
图6是本申请实施例提供的一种注意力尺度化流程图;
图7是本申请实施例提供的一种历史窗口记忆模型和记忆融合处理模型的模型结构图;
图8是本申请实施例提供的另一种唤醒方法的流程图;
图9是本申请实施例提供的一种一级唤醒算法的逻辑结构示意图;
图10是本申请实施例提供的又一种语音唤醒方法的流程图;
图11是本申请实施例提供的又一种语音唤醒方法的流程图;
图12是本申请实施例提供的一种语音唤醒装置的结构框图;
图13是本申请实施例提供的一种电子设备的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
在本文中提及的“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。字符“/”一般表示前后关联对象是一种“或”的关系。
在本申请实施例中,由于高斯混合模型对提取的语音特征的处理能力不足,泛化能力较差,且高斯混合模型主要用于识别孤立的唤醒词,对于连续语音中的唤醒词识别效果不佳,这将导致对唤醒词概率的预测准确率较低,进而导致容易出现误唤醒的情况。
本申请提供了一种语音唤醒方法,能够解决上述应用高斯混合模型对唤醒词概率的预测准确率较低,进而导致容易出现误唤醒的问题。本方案介绍如下:
一种语音唤醒方法,其中,所述方法包括:对采集的语音数据进行特征提取,得到语音特征;将所述语音特征作为U型卷积神经网络模型的输入,通过所述U型卷积神经网络模型对所述语音特征进行特征提取和特征融合,得到第一输出特征;将所述第一输出特征作为注意力模型的输入,通过所述注意力模型对所述第一输出特征的各个通道的特征进行注意力计算,得到注意力权重向量,对所述注意力权重向量进行尺度化处理,根据处理后的注意力权重向量和所述第一输出特征,确定第二输出特征;对所述第二输出特征进行概率转换,得到第一唤醒词概率,所述第一唤醒词概率用于指示所述语音数据中包括预设唤醒词的概率;基于所述第一唤醒词概率,对电子设备进行唤醒。
可选地,所述U型卷积神经网络包括N个网络层组,每个网络层组包括卷积神经网络层、批归一化层和线性激活层,且所述N个网络层组中指定浅层网络层的输出特征流向指定深层网络层,以对所述N个网络层中的浅层网络与深层网络进行特征融合。
可选地,所述注意力模型包括池化层、卷积层、第一全连接层和第一非线性激活层;所述通过所述注意力模型对所述第一输出特征的各个通道的特征进行注意力计算,得到注意力权重向量,包括:通过所述池化层对所述第一输出特征的各个通道的特征分别进行池化操作,得到所述池化层的输出特征;将所述池化层的输出特征作为所述卷积层的输入,通过所述卷积层对所述池化层的输出特征进行卷积处理,得到所述卷积层的输出特征;将所述卷积层的输出特征作为所述第一全连接层的输入,通过所述第一全连接层对所述卷积层的输出特征进行处理,得到所述第一全连接层的输出特征;将所述第一全连接层的输出特征作为所述非线性激活层的输入,通过所述非线性激活层对所述第一全连接层的输出特征进行非线性处理,得到所述注意力权重向量。
可选地,所述注意力模型还包括注意力尺度化层,所述注意力尺度化层的输入包括所述第一输出特征和所述注意力权重向量;所述根据处理后的注意力权重向量和所述第一输出特征,确定第二输出特征,包括:通过所述注意力尺度化层,对所述注意力权重向量进行尺度化处理,得到第一尺度化权重向量;通过所述注意力尺度化层,对所述第一尺度化权重向量进行归一化处理,得到第二尺度化权重向量;通过所述注意力尺度化层,根据所述第二尺度化权重向量对所述第一输出特征进行加权处理,得到第三输出特征;根据所述第三输出特征,确定所述第二输出特征。
可选地,所述注意力模型的输入还包括所述语音特征;所述根据所述第三输出特征,确定所述第二输出特征,包括:将所述语音特征与所述第三输出特征进行合并,得到所述第二输出特征。
可选的,所述对所述第二输出特征进行概率转换,得到第一唤醒词概率,包括:对所述第二输出特征进行全局池化操作,得到全局池化特征;对所述全局池化特征进行全局归一化处理,得到所述第一唤醒词概率。
可选地,所述基于所述第一唤醒词概率,对电子设备进行唤醒,包括:确定M个历史唤醒词概率,所述M个历史唤醒词概率是对历史语音数据进行预测得到;对所述M个历史唤醒词概率和所述第一唤醒词 概率进行融合处理,得到第二唤醒词概率;基于所述第二唤醒词概率,对所述电子设备进行唤醒。
可选地,所述对所述M个历史唤醒词概率和所述第一唤醒词概率进行融合处理,得到第二唤醒词概率,包括:将所述M个历史唤醒词概率和所述第一唤醒词概率作为历史窗口记忆模型的输入,通过所述历史窗口记忆模型,对所述M个历史唤醒词概率进行特征提取,将提取的特征与所述第一唤醒词概率进行逐点相乘,得到融合特征;将所述第一唤醒词概率作为特征提取模型的输入,通过所述特征提取模型,对所述第一唤醒词概率进行特征提取,得到第一概率特征;根据所述第一概率特征和所述融合特征,确定所述第二唤醒词概率。
可选地,所述历史窗口记忆模型包括双向循环神经网络RNN层、第一逐点相乘层、归一化处理层和第二逐点相乘层,所述双向RNN层包括第一RNN层和第二RNN层;所述将所述M个历史唤醒词概率和所述第一唤醒词概率作为历史窗口记忆模型的输入,通过所述历史窗口记忆模型,对所述M个历史唤醒词概率进行特征提取,将提取的特征与所述第一唤醒词概率进行逐点相乘,得到融合特征,包括:将所述M个历史唤醒词概率作为所述双向RNN层的输入,通过所述第一RNN层和所述第二RNN层分别对所述M个历史唤醒词概率进行特征提取,得到第二概率特征和第三概率特征;将所述第一唤醒词概率和所述第二概率特征作为第一逐点相乘层的输入,通过所述第一逐点相乘层对所述第一唤醒词概率和所述第二概率特征进行逐点相乘,得到所述第一逐点相乘层的输出特征;将所述第一逐点相乘层的输出特征作为所述归一化处理层的输入,通过所述归一化处理层对所述第一逐点相乘层的输出特征进行归一化处理,得到所述归一化处理层的输出特征;将所述归一化处理层的输出特征和所述第三概率特征作为所述第二逐点相乘层的输入,通过所述第二逐点相乘层对所述归一化处理层的输出特征和所述第三概率特征进行逐点相乘,得到所述融合特征。
可选地,所述特征提取模型包括第二全连接层和第二非线性激活层;所述通过所述特征提取模型,对所述第一唤醒词概率进行特征提取,得到第一概率特征,包括:通过所述第二全连接层对所述第一唤醒词概率进行处理,得到所述第二全连接层的输出特征;将所述第二全连接层的输出特征作为所述第二非线性激活层的输入,通过所述第二非线性激活层对所述第二全连接层的输出特征进行非线性处理,得到所述第一概率特征。
可选地,所述根据所述第一概率特征和所述融合特征,确定所述第二唤醒词概率,包括:基于概率阈值对所述第一概率特征进行更新,得到更新后的第一概率特征,其中,若所述第一概率特征大于所述概率阈值,则所述更新后的第一概率特征为1,若所述第一概率特征小于或等于所述概率阈值,则所述更新后的第一概率特征为0;对第一乘积和第二乘积进行相加,得到所述第二唤醒词概率,所述第一乘积为所述更新后的第一概率特征与所述第一唤醒词概率的乘积,所述第二乘积为指定差值与所述融合特征的乘积,所述指定差值是指1与所述更新后的第一概率特征之间的差值。
可选地,所述基于所述第二唤醒词概率,对所述电子设备进行唤醒,包括:若所述第二唤醒词概率大于概率阈值,则将所述语音特征作为RNN模型的输入,通过所述RNN模型对所述语音数据中包括所述预设唤醒词的概率进行预测,得到第三唤醒词概率;若所述第三唤醒词概率大于所述概率阈值,则对所述电子设备进行唤醒。
可选地,所述电子设备配置有第一处理器和第二处理器,且所述第一处理器的功耗小于所述第二处理器;在对获取的语音数据进行特征提取之前,所述方法还包括:通过所述第一处理器采集语音数据;所述若所述第二唤醒词概率大于概率阈值,则将所述语音特征作为RNN模型的输入,通过所述RNN模型对所述语音数据中包括所述预设唤醒词的概率进行预测,得到第三唤醒词概率,包括:若所述第二唤醒词概率大于概率阈值,则将所述第一处理器从工作状态切换为休眠状态,启动第二处理器,通过所述第二处理器将所述语音特征作为RNN模型的输入,通过所述RNN模型对所述语音数据中包括所述预设唤醒词的概率进行预测,得到第三唤醒词概率;所述将所述语音特征作为RNN模型的输入,通过所述RNN模型对所述语音数据中包括所述预设唤醒词的概率进行预测,得到第三唤醒词概率之后,还包括:若所述第三唤醒词概率小于或等于所述概率阈值,则将所述第二处理器从工作状态切换为休眠状态,启动所述第一处理器,通过所述第一处理器继续采集语音数据。
可选地,所述对所述电子设备进行唤醒,包括:对所述语音数据进行声纹识别,以识别所述语音数据的声纹特征与已存储的声纹特征是否匹配;若确定所述语音数据的声纹特征与已存储的声纹特征匹配,则唤醒所述电子设备。
可选地,所述唤醒所述电子设备,包括:若所述电子设备处于熄屏状态,则触发所述电子设备亮屏,或者触发所述电子设备亮屏并解锁,或者唤起语音助手;若所述电子设备处于亮屏状态,则触发所述电子设备解锁,或者唤起语音助手。
可选地,所述方法还包括:响应于所述电子设备已与语音采集设备建立无线通信连接,所述语音采集设备中设置有麦克风;接收所述语音采集设备发送的所述语音数据。
可选地,所述基于所述第一唤醒词概率,对电子设备进行唤醒,包括:响应于所述语音采集设备是车 载终端且所述电子设备的速度大于速度阈值,基于所述第一唤醒词概率,对电子设备进行唤醒。
本申请提供的技术方案至少可以带来以下有益效果:
本申请通过在对获取的语音数据进行特征提取之后,先通过U型卷积神经网络模型对语音特征进行特征提取和特征融合,可以将低级特征和高级特征进行融合,得到第一输出特征,之后通过注意力模型对第一输出特征的各个通道的特征进行注意力计算,得到注意力权重向量,并对注意力权重向量进行尺度化处理,以便根据处理后的注意力权重向量对第一输出特征进行加权处理,如此,可以增强有用特征,削弱无用特征,由于对提取的特征进行了充分的特征融合和注意力计算,因此预测得到的唤醒词概率更加准确,泛化能力更强,而且通过注意力计算能够将语音识别的注意力集中在唤醒词上,对连续语音中包含唤醒词的情况识别效果较好,从而减小了误唤醒概率。
在对本申请实施例进行详细介绍之前,先对本申请实施例的实施环境进行介绍。本申请实施例提供的语音唤醒方法应用于电子设备中,该电子设备可以为智能音箱、智能电视、可穿戴设备或终端等,终端可以为手机、平板电脑或计算机等。以该电子设备为终端为例,终端可以采用本申请实施例提供的方法采集外界的语音数据,识别语音数据中是否包含特定的唤醒词,根据识别结果对终端进行唤醒。
图1是本申请实施例提供的一种语音唤醒方法的流程图,该方法应用于电子设备中,如图1所示,该方法包括如下步骤:
步骤101:对采集的语音数据进行特征提取,得到语音特征。
电子设备可以不断采集外界的语音数据,然后对采集的语音数据进行特征提取。示例的,电子设备中配置有麦克风,电子设备可以通过麦克风采集语音数据。
其中,语音特征可以为MFCC(Mel-scale Frequency Cepstral Coefficients,梅尔倒谱系数),或其他语音特征。
作为一个示例,请参考图2,图2是本申请实施例提供的一种对语音数据进行特征提取的流程图,如图2所示,对语音数据进行特征提取的过程可以包括预处理、平滑、傅里叶变换和MFCC提取这几个过程。
比如,先通过高斯滤波器对语音数据对应的语音信号进行滤波处理,然后对滤波后的语音信号进行平滑处理,以平滑帧信号的边缘,再对平滑后的语音信号进行傅里叶变换,从傅里叶变换结果中提取MFCC,将MFCC作为语音特征。
作为一个示例,滤波处理的数学表达式可以为:H(z)=1-az -1。其中,z为语音信号,a为修正系数,一般取0.95-0.97,H(z)为滤波处理结果。
作为一个示例,平滑处理时可以采用汉明窗进行平滑处理,平滑处理的数学表达式可以为:
Figure PCTCN2020138922-appb-000001
其中,n为正整数,n=0,1,2,3....M;M为傅里叶变换的点数,比如M可以为512;ω(n)为平滑处理结果。
作为一个示例,从傅里叶变换结果中提取MFCC的数学表达式为:
Figure PCTCN2020138922-appb-000002
其中,f为傅里叶变换后的频点,F mel(f)为MFCC。
步骤102:将语音特征作为U型卷积神经网络模型的输入,通过U型卷积神经网络模型对语音特征进行特征提取和特征融合,得到第一输出特征。
其中,该U型卷积神经网络模型的输入为步骤101提取得到的语音特征,输出为第一输出特征。
作为一个示例,U型卷积神经网络模型可以为U型残差卷积神经网络模型。
作为一个示例,U型卷积神经网络模型包括N个网络层组,每个网络层组包括卷积神经网络层、批归一化层和线性激活层,且N个网络层组中指定浅层网络层的输出特征流向指定深层网络层,以对N个网络层中的浅层网络与深层网络进行特征融合。
请参考图3,图3是本申请实施例提供的一种U型卷积神经网络模型的模型结构图,如图3所示,该U型卷积神经网络模型包括N个网络层组,第一个网络层组包括卷积神经网络层1、批归一化层1和线性激活层1,第二个网络层组包括卷积神经网络层2、批归一化层2和线性激活层2,...,第N-1个网络层组包括卷积神经网络层N-1、批归一化层N-1和线性激活层N-1,第N个网络层组包括卷积神经网络层N、批归一化层N和线性激活层N。另外,该U型卷积神经网络模型还包括U型结构,用于将浅层网络的输出特征流向深层网络,以对浅层网络与深层网络进行特征融合。
卷积神经网络层是一种以卷积作为主要计算方式的神经网络层,用于将语音特征提取为大小为 C*R*1的数据形式,其中,C为特征列数、R为特征行数,通道数为1。通过将提取得到的语音特征依次输入到卷积神经网络层中,可以通过卷积神经网络层计算语音特征的局部特征。
作为一个示例,卷积神经网络层的计算公式可以如以下公式(1)所示:
Figure PCTCN2020138922-appb-000003
其中,I表示卷积神经网络层的输入,W表示卷积对应的权重,bias表示偏置,经过卷积神经网络层计算得到的结果是尺寸为c*r*l的3D特征。
批归一化层是指批归一化神经网络层,其是一种有效对各层输出进行自适应归一化的网络层。作为一个示例,批归一化层的计算公式可以如以下公式(2)-(5)所示:
Figure PCTCN2020138922-appb-000004
Figure PCTCN2020138922-appb-000005
Figure PCTCN2020138922-appb-000006
β (k)=E[x (k)]         (5)
其中,x为批归一化层的输入,通过批归一化层对x进行方差和均值计算,之后计算自适应因子β,γ,再将计算得到的自适应参数在模型推理过程中进行计算。
线性激活层用于对上一层的输出特征进行线性变换,具有对输出特征进行线性提升的功能。作为一个示例,线性激活层的计算公式如以下公式(6)所示:
y=f(x),f=max(λ*x,0)        (6)
其中,x为线性激活层的输入,y为线性激活层的输出,λ为因子。
公式(6)中,对于输出为正值的部分特征x需乘以因子λ作为线性增强手段,对于输出为负值或0的部分特征x则为0。
U型结构是一种以各层特征进行分离和合并的层状结构,能够令指定浅层网络的输出特征流向指定深层网络,与指定深层网络的输出特征进行特征融合。
其中,指定浅层网络和指定深层网络可以预先设置。比如,第一个网络层组中的卷积神经网络层1的输出特征可以流向最后一个网络层组的线性激活层N,使得卷积神经网络层1的输出特征与线性激活层N的输出特征进行融合;第二个网络层组中的卷积神经网络层2的输出特征可以流向倒数第二个网络层组的线性激活层N-1,使得卷积神经网络层2的输出特征与线性激活层N-1的输出特征进行融合。另外,特征融合时需要进行多尺度融合,也即是,若融合前的两个输出特征的尺度不同,则需要将这两个输出特征的尺度调整为一致,再进行特征融合。
通过将浅层网络与深层网络进行特征融合这种形式,将全部特征信息流予以保留和计算,对推理过程中的低级特征和高级特征进行融合,提升预测结果。比如,通过实际验证,可以将最终将结果提升3%。
如图3所示,U型卷积神经网络模型反复应用卷积神经网络层、批归一化层、线性激活层和U型结构进行模型纵向维度的加深,对于模型特征抽象和提取进行有效分类并不断降低模型输出的维度,最终模型在多次叠加后得到U型卷积神经网络模型的最终输出。
步骤103:将第一输出特征作为注意力模型的输入,通过注意力模型对第一输出特征的各个通道的特征进行注意力计算,得到注意力权重向量,对注意力权重向量进行尺度化处理,根据处理后的注意力权重向量和第一输出特征,确定第二输出特征。
其中,该注意力模型的输入为第一输出特征,输出为第二输出特征。
注意力模型能够对输入特征进行逐通道的注意力特征提取,注意力特征提取的目的是将模型在高维特征上的各通道信息表征能力进行尺度化,进而得到基于语音唤醒等深度学习任务的不同尺度。各通道注意力提取在U型卷积神经网络模型之后进行信息流的分流,将原有的输入特征分别进行各通道注意力尺度化和原始输入特征的保留。作为一个示例,请参考图4,图4是本申请实施例提供的一种注意力特征提取流程,如图4所示,可以对第一输出特征进行逐通道的注意力特征提取。
通过对注意力模型对第一输出特征的各个通道的特征进行注意力计算,并根据注意力权重向量对第一输出特征进行加权处理,可以增强有用特征,削弱无用特征,将语音识别的注意力集中在唤醒词上,提高 识别效果,对连续语音中包含唤醒词的情况识别效果较好,减小了误唤醒概率。
作为一个示例,请参考图5,图5是本申请实施例提供的一种注意力模型的模型结构图,如图5所示,该注意力模型包括池化层、卷积层、第一全连接层和第一非线性激活层。相应地,通过注意力模型对第一输出特征的各个通道的特征进行注意力计算,得到注意力权重向量的操作包括如下步骤1)至步骤4):
步骤1)通过池化层对第一输出特征的各个通道的特征分别进行池化操作,得到池化层的输出特征。
其中,该池化层的输入为第一输出特征。将第一输出特征输入至池化层后,池化层可以对第一输出特征的各个通道的特征分别进行池化操作。
作为一个示例,池化层为TopN池化层,用于对第一输出特征的各个通道进行TopN维的特征提取。也即是,对于第一输出特征的每个通道,TopN池化层可以对每个通道的全部特征按照从大到小的顺序进行排序,并提取排序在前的N位特征作为该通道的池化结果。依次对所有通道进行如上操作,即可得到输出特征。
作为一个示例,第一输出特征的尺寸为C*H*W,其中C为通道数,H为高度,W为宽度,池化层为TopN池化层。对于每一个通道c,c∈C,TopN池化层对该通道的全部特征按照从大到小的顺序进行排序,提取排序在前的N位特征作为该通道的池化值。依次对所有通道进行如上操作,即可得到尺寸为C*N*1的输出特征。
步骤2)将池化层的输出特征作为卷积层的输入,通过卷积层对池化层的输出特征进行卷积处理,得到卷积层的输出特征。
其中,该卷积层为卷积神经网络层,用于对池化层的输出特征进行卷积处理。比如,在池化层输出尺寸为C*N*1的输出特征之后,池化层可以将得到的输出特征输入至卷积层进行卷积处理,得到尺寸为C/N*1*1的一维向量输出特征。
作为一个示例,卷积层的计算公式如下:
Figure PCTCN2020138922-appb-000007
其中,I表示卷积层的输入,W表示卷积对应的权重,bias表示偏置。
步骤3)将卷积层的输出特征作为第一全连接层的输入,通过第一全连接层对卷积层的输出特征进行处理,得到第一全连接层的输出特征。
第一全连接层是一种以权重作为计算方式的神经网络层,用于对输入的特征计算局部特征。比如,若卷积层的输出特征的尺寸为C/N*1*1,则通过第一全连击层的计算得到的第一全连接层的输出特征的尺寸为C*1*1。
需要说明的是,该注意力模型可以包括一个或多个第一全连接层,每个第一全连接层用于对上一个网络层的输出特征进行处理,在将输出特征输入至下一个网络层。如图5所示,该注意力模型包括两个第一全连接层。
步骤4)将第一全连接层的输出特征作为非线性激活层的输入,通过非线性激活层对第一全连接层的输出特征进行非线性处理,得到注意力权重向量。
非线性激活层用于对第一全连接层的输出特征进行非线性变换,具有对输出特征进行非线性提升的功能。示例的,注意力权重向量的尺寸为C*1*1。
作为一个示例,非线性激活层的计算公式如下所示:
y=sigmoid(x)        (8)
其中,y为非线性激活层的输出,即注意力权重向量,x为非线性激活层的输入。
另外,如图4所示,该注意力模型还包括注意力尺度化层,注意力尺度化层的输入包括第一输出特征和注意力权重向量。也即是,U型卷积神经网络模型可以将第一输出特征分别输入至该注意力模型的池化层和注意力尺度化层,在非线性激活层计算得到注意力权重向量之后,可以将注意力权重向量也输入至注意力尺度化层,由注意力尺度化层对第一输出特征和注意力权重向量进行处理,得到第二输出特征。
作为一个示例,通过注意力模型对注意力权重向量进行尺度化处理,根据处理后的注意力权重向量和第一输出特征,确定第二输出特征的操作可以包括如下步骤:
1)通过注意力尺度化层,对注意力权重向量进行尺度化处理,得到第一尺度化权重向量。
作为一个示例,可以通过以下公式中的任一种对注意力权重向量进行尺度化处理,得到第一尺度化权重向量:
a t=g BO(h t)=b t        (9)
a t=g L(h t)=w t Th t+b t        (10)
a t=g SL(h t)=w Th t+b         (11)
Figure PCTCN2020138922-appb-000008
a t=g SNL(h t)=V Ttanh(w Th t+b)         (13)
其中,a t为第一尺度化权重向量,h t为注意力权重向量,b为预设参数。
上述5种尺度化处理方式都可以通过端到端的训练达到收敛的结果,同时针对不同特征分布的模型有各自的优势。
在另一种实施例中,还可以通过上述5种尺度化处理方式分别对注意力权重向量进行尺度化处理,得到5种第一尺度化权重向量,然后确定这5种第一尺度化权重向量的均值作为最终的第一尺度化权重向量。
2)通过注意力尺度化层,对第一尺度化权重向量进行归一化处理,得到第二尺度化权重向量。
在得到第一尺度化权重向量之后,还可以对第一尺度化权重向量进行归一化处理,得到第二尺度化权重向量。
作为一个示例,归一化处理的计算公式如下:
Figure PCTCN2020138922-appb-000009
其中,k t为第二尺度化权重向量,a t为第二尺度化权重向量。
3)通过注意力尺度化层,根据第二尺度化权重向量对第一输出特征进行加权处理,得到第三输出特征。
作为一个示例,可以通过以下公式,根据第二尺度化权重向量对第一输出特征进行加权处理:
Figure PCTCN2020138922-appb-000010
其中,ω为第三输出特征,k为第二尺度化权重向量,j为第一输出特征。
示例的,第一输出特征的尺寸为C*H*W,第二尺度化权重向量的尺寸为C*1*1,第三输出特征的尺寸为C*H*W。
4)根据第三输出特征,确定第二输出特征。
第一种实现方式中,可以直接将第三输出特征确定为第二输出特征。
第二种实现方式中,注意力模型的输入还可以包括语音特征,可以将语音特征和第三输出特征进行合并,得到第二输出特征。
作为一个示例,注意力尺度化层的处理流程可以如图6所示,图6是本申请实施例提供的一种注意力尺度化流程图。
通过注意力模型对注意力权重向量进行尺度化处理,根据处理后的注意力权重向量对第一输出特征进行加权处理,得到第二输出特征,可以融合低维特征和高纬特征,使得模型在多种场景下有更好的泛化能力。
步骤104:对第二输出特征进行概率转换,得到第一唤醒词概率,第一唤醒词概率用于指示语音数据中包括预设唤醒词的概率。
对第二输出特征进行概率转换,也即是,将第二输出特征与唤醒词概率进行特征映射,得到第一唤醒词概率。第一唤醒词概率为对于类别的概率估计,范围一般在[0,1]之间。
作为一个示例,对第二输出特征进行概率转换,得到第一唤醒词概率的操作包括:对第二输出特征进行全局池化操作,得到全局池化特征;对全局池化特征进行全局归一化处理,得到第一唤醒词概率。
通过全局池化可以对第二输出特征进行特征降维,对第二输出特征进行高度和宽度方向上的池化。比如,全局池化的计算公式可以如下所示:
Figure PCTCN2020138922-appb-000011
其中,
Figure PCTCN2020138922-appb-000012
为全局池化特征,β i为第二输出特征。
作为一个示例,全局池化特征的尺寸为C*1*1。
作为一个示例,归一化处理的计算公式如下所示:
Figure PCTCN2020138922-appb-000013
其中,g t为第一唤醒词概率,
Figure PCTCN2020138922-appb-000014
为全局池化特征。
在得到第一唤醒词概率之后,可以基于第一唤醒词概率,对电子设备进行唤醒。比如,若第一唤醒词概率大于概率阈值,则判断语音识别通过,触发唤醒电子设备,若第一唤醒词概率小于或等于概率阈值,则判断语音识别未通过,不触发唤醒电子设备,并继续采集语音数据,重复上述步骤对语音数据进行识别。
作为一个示例,概率阈值可以为令数据集样本中的EER(Equal Error Rate,等错误率)最小时的概率阈值,这样可以使得模型的误唤醒率和误拒绝率达到平衡。
在另一示例中,在计算得到第一唤醒词概率之后,还可以基于第一唤醒词概率,确定目标唤醒词概率,以便基于目标唤醒词概率对电子设备进行唤醒。比如,若目标唤醒词概率大于概率阈值,则判断语音识别通过,触发唤醒电子设备,若目标唤醒词概率小于或等于概率阈值,则判断语音识别未通过,不触发唤醒电子设备,并继续采集语音数据,重复上述步骤对语音数据进行识别。
步骤105:基于第一唤醒词概率,确定目标唤醒词概率。
作为一个示例,基于第一唤醒词概率,确定第四唤醒词概率的操作可以包括以下两种实现方式:
第一种实现方式:将第一唤醒词概率确定为目标唤醒词概率。
也即是,可以基于第一唤醒词概率,对电子设备进行唤醒。
第二种实现方式:将第一唤醒词概率与历史唤醒词概率进行融合处理,得到第二唤醒词概率,将第二唤醒词概率确定为目标唤醒词概率。
通过将第一唤醒词概率与历史唤醒词概率进行融合处理,可以进一步提高唤醒词概率的预测准确度,进而减小误唤醒率。
作为一个示例,可以确定M个历史唤醒词概率,M个历史唤醒词概率是对历史语音数据进行预测得到;然后对M个历史唤醒词概率和第一唤醒词概率进行融合处理,得到第二唤醒词概率。
作为一个示例,可以对M个历史唤醒词概率和第一唤醒词概率进行融合处理,得到第二唤醒词概率的操作包括以下步骤:
步骤1051:将M个历史唤醒词概率和第一唤醒词概率作为历史窗口记忆模型的输入,通过历史窗口记忆模型,对M个历史唤醒词概率进行特征提取,将提取的特征与第一唤醒词概率进行逐点相乘,得到融合特征。
其中,历史窗口记忆模型能够将已输出的M个历史唤醒词概率依次保存在历史记忆模型中,并将历史保留的唤醒词概率进行二次特征提取,进行含有记忆能力的模型概率估计。示例的,M个历史唤醒词概率的数据大小为M*C。
作为一个示例,请参考图7,图7是本申请实施例提供的一种历史窗口记忆模型和记忆融合处理模型的模型结构图,如图7所示,历史窗口记忆模型包括双向RNN(Recurrent Neural Network,循环卷积神经网络)层、第一逐点相乘层、归一化处理层和第二逐点相乘层,双向RNN层包括第一RNN层和第二RNN层。相应的,步骤1051可以包括如下步骤:
1)将M个历史唤醒词概率作为双向RNN层的输入,通过第一RNN层和第二RNN层分别对M个历史唤醒词概率进行特征提取,得到第二概率特征和第三概率特征。
双向RNN层可以对有效地对序列信息特征进行特征提取和处理。作为一个示例,该双向RNN层可以为N节点的双向RNN层。
将M个历史唤醒词概率分别作为第一RNN层和第二RNN层的输入,通过第一RNN层对M个历史唤醒词概率进行特征提取得到第二概率特征,通过第二RNN层对M个历史唤醒词概率进行特征提取得到第三概率特征。
2)将第一唤醒词概率和第二概率特征作为第一逐点相乘层的输入,通过第一逐点相乘层对第一唤醒词概率和第二概率特征进行逐点相乘,得到第一逐点相乘层的输出特征。
也即是,第一RNN层的下一个网络层为第一逐点相乘层,且第一逐点相乘层的输入不仅包括第一RNN层输出的第二概率特征,还包括第一唤醒词概率。
作为一个示例,第一唤醒词概率与第二概率特征的特征尺寸相同。第一逐点相乘层的输出特征可以为尺寸为C的一维特征向量。
3)将第一逐点相乘层的输出特征作为归一化处理层的输入,通过归一化处理层对第一逐点相乘层的输出特征进行归一化处理,得到归一化处理层的输出特征。
也即是,第一逐点相乘层的下一个网络层为为归一化处理层。示例的,归一化处理层可以为softmax层。
作为一个示例,归一化处理层的计算公式可以如下所示:
Figure PCTCN2020138922-appb-000015
其中,h t为归一化处理层的输出特征,c t为第一逐点相乘层的输出特征。
4)将归一化处理层的输出特征和第三概率特征作为第二逐点相乘层的输入,通过第二逐点相乘层对归一化处理层的输出特征和第三概率特征进行逐点相乘,得到该融合特征。
也即是,得到归一化处理层的输出特征后,可以将归一化处理层的输出特征与另一路双向RNN层的输出特征进行逐点相乘,得到融合特征。
步骤1052:将第一唤醒词概率作为记忆融合处理模型的输入,通过记忆融合处理模型对第一唤醒词概率进行特征提取,得到第一概率特征,根据第一概率特征和融合特征,确定第二唤醒词概率。
作为一个示例,如图7所示,记忆融合处理模型包括特征提取模型,可以将第一唤醒词概率作为特征提取模型的输入,通过特征提取模型,对第一唤醒词概率进行特征提取,得到第一概率特征。
作为一个示例,如图7所示,特征提取模型包括第二全连接层和第二非线性激活层;通过特征提取模型,对第一唤醒词概率进行特征提取时,可以先通过第二全连接层对第一唤醒词概率进行处理,得到第二全连接层的输出特征,然后将第二全连接层的输出特征作为第二非线性激活层的输入,通过第二非线性激活层对第二全连接层的输出特征进行非线性处理,得到第一概率特征。
作为一个示例,对第一唤醒词概率进行特征提取,得到第一概率特征的操作可以包括如下步骤:
1)基于概率阈值对第一概率特征进行更新,得到更新后的第一概率特征。
其中,若第一概率特征大于概率阈值,则更新后的第一概率特征为1,若第二概率特征小于或等于概率阈值,则更新后的第一概率特征为0。
作为一个示例,可以通过如下公式对第一概率特征进行更新:
Figure PCTCN2020138922-appb-000016
其中,G为第一概率特征,thre为概率阈值。
2)基于更新后的第一概率特征、第一唤醒词概率和融合特征,确定第二唤醒词概率。
作为一个示例,可以对第一乘积和第二乘积进行相加,得到第二唤醒词概率。其中,第一乘积为所述更新后的第一概率特征与第一唤醒词概率的乘积,第二乘积为指定差值与融合特征的乘积,指定差值是指1与更新后的第一概率特征之间的差值。
作为一个示例,可以通过如下公式确定第二唤醒词概率:
result=G*input+(1-G)*memory           (20)
其中,G为更新后的第一概率特征,input为第一唤醒词概率,memory为融合特征。
本申请实施例中,通过在对获取的语音数据进行特征提取之后,先通过U型卷积神经网络模型对语音特征进行特征提取和特征融合,可以将低级特征和高级特征进行融合,得到第一输出特征,之后通过注意力模型对第一输出特征的各个通道的特征进行注意力计算,得到注意力权重向量,并对注意力权重向量进行尺度化处理,以便根据处理后的注意力权重向量对第一输出特征进行加权处理,如此,可以增强有用特征,削弱无用特征,由于对提取的特征进行了充分的特征融合和注意力计算,因此预测得到的唤醒词概率更加准确,泛化能力更强,而且通过注意力计算能够将语音识别的注意力集中在唤醒词上,对连续语音中包含唤醒词的情况识别效果较好,从而减小了误唤醒概率。另外,通过将历史唤醒词概率与当前唤醒词概率进行融合处理,对于唤醒词检测的跳变和误唤醒能够进行有效抑制。
需要说明的是,为了提高唤醒词概率预测的准确度,进一步减少误唤醒概率,本申请实施例还可以采用多级唤醒算法对语音数据信息识别,为了便于说明,将上述图1实施例的唤醒算法称为一级唤醒算法。接下来将对通过多级唤醒算法对语音数据信息识别的方式进行详细介绍。
图8是本申请实施例提供的另一种唤醒方法的流程图,该方法应用于电子设备中,如图8所示,该方法包括如下步骤:
步骤801:采集语音数据。
电子设备可以不断采集外界的语音数据,以便对采集的语音数据进行唤醒词概率预测。示例的,电子设备中配置有麦克风,电子设备可以通过麦克风采集语音数据。
步骤802:通过一级唤醒算法,对采集的语音数据进行识别,得到目标唤醒词概率。
其中,该目标唤醒概率可以为通过U型卷积神经网络模型和注意力模型预测得到的第一唤醒词概率,也可以为通过U型卷积神经网络模型、注意力模型、历史窗口记忆模型和记忆融合处理模型预测得到的第二唤醒词概率。
作为一个示例,请参考图9,图9是本申请实施例提供的一种一级唤醒算法的逻辑结构示意图,如图9所示,一级唤醒算法包括语音特征提取模块901、U型卷积神经网络模块902、注意力特征提取模块903、唤醒词概率预测模块904、历史窗口记忆模块905和记忆融合处理模块906。
语音特征提取模块901用于对语音数据进行特征提取,得到语音特征。
U型卷积神经网络模块902用于通过U型卷积神经网络模型对语音特征进行特征提取和特征融合,得到第一输出特征。
注意力特征提取模块903用于通过注意力模型对第一输出特征的各个通道的特征进行注意力计算,得到注意力权重向量,对注意力权重向量进行尺度化处理,根据处理后的注意力权重向量对第一输出特征进行加权处理,得到第二输出特征。
唤醒词概率预测模块904用于对第二输出特征进行概率转换,得到第一唤醒词概率。
历史窗口记忆模块905用于通过历史窗口记忆模型,对M个历史唤醒词概率进行特征提取,将提取的特征与第一唤醒词概率进行逐点相乘,得到融合特征。
记忆融合处理模块906用于通过记忆融合处理模型对第一唤醒词概率进行特征提取,得到第一概率特征,根据第一概率特征和融合特征,确定第二唤醒词概率。
步骤803:判断目标唤醒词概率是否大于概率阈值。
其中,该概率阈值可以预先设置,也可以计算得到。比如,该概率阈值可以为令数据集样本中的EER(Equal Error Rate)最小时的概率阈值,这样可以使得模型的误唤醒率和误拒绝率达到平衡。
若目标唤醒词概率大于概率阈值,则确定一级唤醒算法语音识别通过,跳转至步骤804。若目标唤醒词概率小于或等于概率阈值,则确定一级唤醒算法语音识别未通过,并返回至步骤801,继续采集语音数据,通过一级唤醒算法,对采集的语音数据进行识别。
步骤804:启动二级唤醒算法,通过二级唤醒算法,对采集的语音数据进行识别,得到第三唤醒词概率。
需要说明的是,二级唤醒算法是比一级唤醒算法的识别准确度更高的唤醒算法,如此可以在一级唤醒算法语音识别通过的基础上,通过二级唤醒算法对语音数据进一步进行识别和校验,如此,可以进一步提高语音识别的准确度,带来更好的语音识别效果,减小误唤醒率。
作为一个示例,二级唤醒算法可以为基于RNN模型的唤醒算法。示例的,该RNN模型可以为基于序列的LSTM的RNN模型。
作为一个示例,通过二级唤醒算法,对采集的语音数据进行唤醒词概率预测的操作包括:将语音特征作为RNN模型的输入,通过RNN模型对语音数据中包括预设唤醒词的概率进行预测,得到第三唤醒词概率。
步骤805:若第三唤醒词概率大于概率阈值,则对电子设备进行唤醒。
作为一个示例,可以将第三唤醒词概率与概率阈值进行比较,若第三唤醒词概率小于或等于概率阈值,则确定二级唤醒算法语音识别未通过,并继续采集语音数据,以及通过一级唤醒算法对语音数据进行唤醒词概率预测。若第三唤醒词概率大于概率阈值,则确定二级唤醒算法语音识别通过,触发唤醒电子设备,或者,若第三唤醒词大于概率阈值,则进一步对语音数据进行语音识别,根据语音识别结果对电子设备进行唤醒。
另外,为了降低电子设备的功耗,当第一唤醒算法语音识别通过,并启动二级唤醒算法时,还可以停止运行一级唤醒算法,当二级唤醒算法语音识别未通过时,再启动一级唤醒算法,并停止运行二级唤醒算法。如此,可以降低一级唤醒算法和二级唤醒算法同时运行导致的高功耗,使得一级唤醒算法和二级唤醒算法可以交替运行。
另外,为了进一步降低电子设备的功耗,还可以在硬件上对电子设备进行改进。比如,在电子设备中配置第一处理器和第二处理器,且第一处理器的功耗小于第二处理器。第一处理器用于采集语音数据,通过一级唤醒算法对语音数据进行识别。第二处理器用于通过二级唤醒算法对语音数据进行识别。示例的,第一处理器为DSP(Digital Signal Processor,数字信号处理器),第二处理器为ARM(Advanced RISC Machine,精简指令集微处理器)。
作为一个示例,第一处理器和第二处理器的工作流程为:通过第一处理器不断采集语音数据,通过一级唤醒算法对采集的语音数据进行识别,得到目标唤醒词概率。若目标唤醒词概率小于或等于概率阈值,则通过第一处理器继续采集语音数据,通过一级唤醒算法对采集的语音数据进行识别。若目标唤醒词概率大于概率阈值,则将第一处理器从工作状态切换为休眠状态,启动第二处理器,通过第二处理器采用二级唤醒算法对语音数据进行识别,得到第三唤醒词概率。若第三唤醒词概率小于或等于概率阈值,则将第二 处理器从工作状态切换为休眠状态,并启动第一处理器,通过第一处理器继续采集语音数据,通过一级唤醒算法对采集的语音数据进行识别。若第三唤醒词概率大于概率阈值,则对触发唤醒电子设备,或者进一步对语音数据进行语音识别,根据语音识别结果对电子设备进行唤醒。
需要说明的是,本申请实施例所述的预设唤醒词可以由电子设备默认设置,也可以由用户设置。当由用户设置预设唤醒词时,用户可以在电子设备中注册预设唤醒词,比如通过麦克风预先录入预设唤醒词。
作为一个示例,针对电子设备的不同状态,可以为电子设备设置不同的唤醒方式,以提高灵活性,满足用户的多样化需求。比如,针对电子设备的熄屏状态和亮屏状态,可以设置不同的唤醒方式。
作为一个示例,在语音识别通过后,触发唤醒电子设备的操作包括:若电子设备处于熄屏状态,则触发电子设备亮屏,或者触发电子设备亮屏并解锁,或者唤起语音助手;若电子设备处于亮屏状态,则触发电子设备解锁,或者唤起语音助手。
本申请实施例提供了一种多级唤醒算法的语音识别方法,通过一级唤醒算法可以全面识别语音数据中的唤醒词,通过二级唤醒算法可以精准识别语音数据中的唤醒词,如此可以提高唤醒词概率预测的准确度,减少误唤醒概率。另外,本申请实施例在硬件上对电子设备进行了改进,为电子设备配置第一处理器和第二处理器,通过对两者的工作状态进行切换,可以降低功耗。另外,本申请实施例针对电子设备的亮屏和熄屏采用了不同的唤醒方案,在提高用户识别率,降低误唤醒率的同时,有利于降低功耗。
需要说明的是,在提高唤醒词概率预测的准确度,减少误唤醒概率的基础上,为了保护设备安全和用户隐私,本申请实施例还可以采用多级唤醒算法+声纹识别的方案对电子设备进行唤醒。接下来将对通过多级唤醒算法和声纹识别,对电子设备进行唤醒的方式进行详细介绍。
图10是本申请实施例提供的又一种语音唤醒方法的流程图,该方法应用于电子设备中,如图10所示,该方法包括如下步骤:
步骤1001:采集语音数据。
电子设备可以不断采集外界的语音数据,以便对采集的语音数据进行唤醒词概率预测。示例的,电子设备中配置有麦克风,电子设备可以通过麦克风采集语音数据。
步骤1002:通过一级唤醒算法,对采集的语音数据进行识别,得到目标唤醒词概率。
其中,该目标唤醒概率可以为通过U型卷积神经网络模型和注意力模型预测得到的第一唤醒词概率,也可以为通过U型卷积神经网络模型、注意力模型、历史窗口记忆模型和记忆融合处理模型预测得到的第二唤醒词概率。
步骤1003:判断目标唤醒词概率是否大于概率阈值。
若目标唤醒词概率大于概率阈值,则确定一级唤醒算法语音识别通过,跳转至步骤1004。若目标唤醒词概率小于或等于概率阈值,则确定一级唤醒算法语音识别未通过,并返回至步骤1001,继续采集语音数据,通过一级唤醒算法,对采集的语音数据进行识别。
步骤1004:启动二级唤醒算法,通过二级唤醒算法,对采集的语音数据进行识别,得到第三唤醒词概率。
步骤1005:判断第三唤醒词概率是否大于概率阈值。
若第三唤醒词概率大于概率阈值,则确定二级唤醒算法语音识别通过,跳转至步骤1006。若第三唤醒词概率小于或等于概率阈值,则确定二级唤醒算法语音识别未通过,并返回至步骤1001,继续采集语音数据,通过一级唤醒算法,对采集的语音数据进行唤醒词概率预测。
步骤1006:对语音数据进行声纹识别,以识别该语音数据的声纹特征与已存储的声纹特征是否匹配。
若确定语音数据的声纹特征与已存储的声纹特征匹配,则跳转至步骤1007。若确定语音数据的声纹特征与已存储的声纹特征不匹配,则返回步骤1001,继续采集语音数据,通过一级唤醒算法,对采集的语音数据进行识别。
其中,已存储的声纹特征可以为用户预先注册的声纹特征。为了提高注册准确度,用户可以预先注册N次声纹特征,N为大于1的整数。
步骤1007:唤醒电子设备。
作为一个示例,唤醒电子设备的操作包括:若电子设备处于熄屏状态,则触发电子设备亮屏,或者触发电子设备亮屏并解锁,或者唤起语音助手;若电子设备处于亮屏状态,则触发电子设备解锁,或者唤起语音助手。
另外,为了降低电子设备的功耗,当第一唤醒算法语音识别通过,并启动二级唤醒算法时,还可以停止运行一级唤醒算法;当二级唤醒算法语音识别未通过时,再启动一级唤醒算法,并停止运行二级唤醒算法;当二级唤醒算法通过时,启动声纹识别算法,停止运行二级唤醒算法;当声纹识别通过时,触发唤醒电子设备;当声纹识别未通过时,启动一级唤醒算法,并停止运行声纹识别算法。如此,可以降低一级唤醒算法、二级唤醒算法和声纹识别算法同时运行导致的高功耗,使得一级唤醒算法、二级唤醒算法和声纹 识别算法交替运行。
另外,为了进一步降低电子设备的功耗,还可以在硬件上对电子设备进行改进。比如,在电子设备中配置第一处理器和第二处理器,且第一处理器的功耗小于第二处理器。第一处理器用于采集语音数据,通过一级唤醒算法对语音数据进行识别。第二处理器用于通过二级唤醒算法对语音数据进行识别,当二级唤醒算法语音识别通过时,对语音数据进行声纹识别。示例的,第一处理器为DSP,第二处理器为ARM。
作为一个示例,第一处理器和第二处理器的工作流程为:通过第一处理器不断采集语音数据,通过一级唤醒算法对采集的语音数据进行识别,得到目标唤醒词概率。若目标唤醒词概率小于或等于概率阈值,则通过第一处理器继续采集语音数据,通过一级唤醒算法对采集的语音数据进行识别。若目标唤醒词概率大于概率阈值,则将第一处理器从工作状态切换为休眠状态,启动第二处理器,通过第二处理器采用二级唤醒算法对语音数据进行识别,得到第三唤醒词概率。若第三唤醒词概率小于或等于概率阈值,则将第二处理器从工作状态切换为休眠状态,并启动第一处理器,通过第一处理器继续采集语音数据,通过一级唤醒算法对采集的语音数据进行识别。若第三唤醒词概率大于概率阈值,则对语音数据进行声纹识别,若声纹识别通过,则触发唤醒电子设备,若声纹识别未通过,则将第二处理器从工作状态切换为休眠状态,并启动第一处理器,通过第一处理器继续采集语音数据,通过一级唤醒算法对采集的语音数据进行识别。
需要说明的是,本申请实施例所述的预设唤醒词可以由电子设备默认设置,也可以由用户设置。当由用户设置预设唤醒词时,用户可以在电子设备中注册预设唤醒词,比如通过麦克风预先录入预设唤醒词。
还需要说明的是,本申请实施例中的一级唤醒算法对应的一级唤醒模型,二级唤醒算法对应的二级唤醒模型,一级声纹识别算法对应的声纹识别模型可以预先训练得到,比如可以通过多个样本语音数据进行训练得到,样本语音数据是指包括预设唤醒词的语音数据。
本申请实施例提供了一种多级唤醒算法+声纹识别的语音唤醒方法,通过一级唤醒算法可以全面识别语音数据中的唤醒词,通过二级唤醒算法可以精准识别语音数据中的唤醒词,通过声纹识别可以识别唤醒人是否为用户本人,保护设备安全和用户隐私。另外,本申请实施例在硬件上对电子设备进行了改进,为电子设备配置第一处理器和第二处理器,通过对两者的工作状态进行切换,可以降低功耗。另外,本申请实施例针对电子设备的亮屏和熄屏采用了不同的唤醒方案,在提高用户识别率,降低误唤醒率的同时,有利于降低功耗。
接下来基于多级唤醒算法+声纹识别的方案,对电子设备的语音唤醒过程进行举例说明,该电子设备配置有第一处理器和第二处理器,且第一处理器的功耗小于第二处理器。请参考图11,图11是本申请实施例提供的又一种语音唤醒方法的流程图,如图11所示,该方法包括如下步骤:
步骤(1):打开语音唤醒应用,判断电子设备中是否存储有已注册N次的声纹特征。
当电子设备未存储有N次声纹注册信息时:
步骤(2):通过麦克风不断采集语音数据,将采集的语音数据送入二级唤醒模块的二级唤醒算法,并进行关键词检测和保存。
步骤(3):当一级唤醒模块未监测到语音数据时,第一处理器仍处于休眠状态。
步骤(4):当一级唤醒模块监测到语音数据,但语音数据未通过一级唤醒算法的语音识别时,第二处理器仍处于休眠状态。
步骤(5):当一级唤醒模块监测到语音信号,且语音数据通过一级唤醒算法的语音识别时,第一处理器发送中断信号,第二处理器由休眠状态转换为工作状态,同时,一级唤醒模块将包含唤醒词的语音数据传送给第二唤醒模块,第一处理器由工作状态切换到休眠状态,二级唤醒模块通过二级唤醒算法对语音数据进行语音识别,并给出判断信号,该判断信号用于指示语音数据是否通过二级唤醒算法的语音识别。
步骤(6):若判断信号指示语音数据通过二级唤醒算法的语音识别,则将语音数据传送给声纹识别模块,通过声纹识别模块对语音数据进行声纹识别,若通过声纹识别,则触发唤醒电子设备;若未通过二级唤醒算法的语音识别或未通过声纹识别,则第二处理器由工作状态切换到休眠状态,第一处理器由休眠状态切换为工作状态,重新通过麦克风不断采集语音数据,送入第一唤醒模块进行语音识别。
当电子设备存储有N次声纹注册信息时:
步骤(7):第一处理器一直处于工作状态,通过麦克风不断采集语音数据,送入第一唤醒模块的一级唤醒算法进行语音识别。
步骤(8):当一级唤醒模块未监测到语音数据,或者监测到语音数据但语音数据未通过一级唤醒算法的语音识别时,一级唤醒模块仍然处于工作状态,麦克风仍然不断采集语音数据,并送入到一级唤醒模块进行语音识别。
步骤(9):当一级唤醒模块监测到语音数据,且语音数据通过一级唤醒算法的语音识别时,则将包含唤醒词的语音数据发送给二级唤醒模块,同时第一处理器由工作状态切换为休眠状态,麦克风停止采集音频数据,通过二级唤醒算法对语音数据进行语音识别,并给出判断信号。
步骤(10):若判断信号指示语音数据通过二级唤醒算法的语音识别,则将语音数据发送给声纹识别模块,通过声纹识别模块对语音数据进行声纹识别,若通过声纹识别,则触发唤醒电子设备;若未通过二级唤醒算法的语音识别或未通过声纹识别,则第二处理器由工作状态切换到休眠状态,第一处理器由休眠状态切换为工作状态,重新通过麦克风不断采集语音数据,送入第一唤醒模块进行语音识别。
本申请实施例提供的语音唤醒方法可以应用于移动终端中,接下来将以应用于移动终端的场景为例进行说明。作为一个示例,移动终端的语音唤醒过程可以包括如下步骤:
S1:打开语音唤醒应用。
作为一个示例,可以从设置界面中找到语音唤醒应用,打开语音唤醒应用。比如,语音唤醒应用的查找方式可以为:设置-安全-智能解锁-设定数字密码-语音唤醒应用。
S2:语音唤醒应用提醒用户录入唤醒词。
S3:用户先说一遍唤醒词,例如“小布小布”。
S4:重复上述步骤N遍后,将用户录入的语音数据作为训练数据送入语音唤醒模型,对语音唤醒模型进行训练。
S5:训练成功提示,训练完成。
S6:熄屏时,采用一级唤醒算法+二级唤醒算法+声纹识别的熄屏唤醒方案,当基于采集的语音数据识别到正确用户时,触发移动终端亮屏或者唤起语音助手,当然,用户也可以定义唤醒时解锁。
S7:亮屏时,采用二级唤醒算法+声纹识别的亮屏唤醒方案,当基于采集的语音数据识别到正确用户时,触发移动终端解锁或者唤起语音助手。
S8:当进行支付时,开启声纹检测,需要基于语音数据对用户进行身份认证,且身份认证通过后才可支付,此时可以采用二级唤醒算法+声纹识别的亮屏唤醒方案。
在另一种可能的应用场景中,电子设备还能够与其它的语音采集设备建立无线通信连接。
其中,语音采集设备可以是可穿戴设备,也可以是车载终端。其中,车载终端是固定的汽车中的终端,该车载终端通常具有7英寸或9英寸等尺寸的显示屏。车载终端的语音采集设备可以是麦克风,该麦克风安装在方向盘或者车辆内部的其它位置上。可选地,无线通信所基于的技术可以是蓝牙技术、Wi-Fi技术或者ZigBee技术,本申请实施例对此不作限定。
在电子设备与语音采集设备建立无线通信连接后,电子设备能够接收语音采集设备发送的语音数据。基于该应用场景,用户可以经由语音采集设备来控制电子设备。当电子设备离用户距离较远时,用户也能够通过身边的语音采集设备来语音唤醒电子设备。
当语音采集设备是车载终端时,电子设备能够在自身的速度大于速度阈值时,基于第一唤醒词概率,对所述电子设备进行唤醒。其中,电子设备能够通过导航卫星获得自身的速度。速度阈值可以是预设的一个数值,例如20km/h、30km/h或40km/h等数值。
需要说明的是,本申请实施例仅是在上述应用场景采用上述语音唤醒方式为例进行说明,而在其他实施例中,还可以应用在其他场景中,或者也可以采用其他语音唤醒方式,本申请实施例对此不做限定。
FIG. 12 is a structural block diagram of a voice wake-up apparatus provided by an embodiment of the present application. The apparatus can be integrated into an electronic device and may include a feature extraction module 1201, a first processing module 1202, a second processing module 1203, a third processing module 1204, and a wake-up module 1205.
The feature extraction module 1201 is configured to perform feature extraction on collected voice data to obtain a voice feature.
The first processing module 1202 is configured to take the voice feature as the input of a U-shaped convolutional neural network model, and to perform feature extraction and feature fusion on the voice feature through the U-shaped convolutional neural network model to obtain a first output feature.
The second processing module 1203 is configured to take the first output feature as the input of an attention model, perform attention computation on the features of each channel of the first output feature through the attention model to obtain an attention weight vector, perform scaling processing on the attention weight vector, and determine a second output feature according to the processed attention weight vector and the first output feature.
The third processing module 1204 is configured to perform probability conversion on the second output feature to obtain a first wake-up word probability, where the first wake-up word probability indicates the probability that the voice data includes the preset wake-up word.
The wake-up module 1205 is configured to wake up the electronic device based on the first wake-up word probability.
Optionally, the U-shaped convolutional neural network includes N network layer groups, each of which includes a convolutional neural network layer, a batch normalization layer, and a linear activation layer; the output features of designated shallow network layers among the N network layer groups flow to designated deep network layers, so as to perform feature fusion between the shallow and deep networks among the N network layer groups.
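As a concrete illustration, the following is a minimal PyTorch sketch (an assumed framework) of such a layer group and of a shallow-to-deep skip connection. The channel sizes, the choice of ReLU as the activation, and concatenation as the fusion operation are assumptions; the embodiments only specify the layer types and the shallow-to-deep feature flow.

    # Minimal sketch of a U-shaped CNN with layer groups of
    # convolution -> batch normalization -> activation, where a shallow
    # group's output is fused into a deeper group. Sizes are illustrative.
    import torch
    import torch.nn as nn

    class LayerGroup(nn.Module):
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
            self.bn = nn.BatchNorm2d(out_ch)
            self.act = nn.ReLU()  # "linear activation layer" read here as rectified linear

        def forward(self, x):
            return self.act(self.bn(self.conv(x)))

    class UShapedCNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.g1 = LayerGroup(1, 16)        # shallow group
            self.g2 = LayerGroup(16, 32)
            self.g3 = LayerGroup(32 + 16, 32)  # deep group receives fused features

        def forward(self, voice_feature):      # (batch, 1, H, W)
            s = self.g1(voice_feature)         # shallow output feature
            d = self.g2(s)
            d = torch.cat([d, s], dim=1)       # shallow output flows to the deep layer
            return self.g3(d)                  # first output feature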
Optionally, the attention model includes a pooling layer, a convolutional layer, a first fully connected layer, and a first non-linear activation layer.
The second processing module is configured to:
perform a pooling operation on the features of each channel of the first output feature through the pooling layer to obtain the output feature of the pooling layer;
take the output feature of the pooling layer as the input of the convolutional layer, and perform convolution processing on it through the convolutional layer to obtain the output feature of the convolutional layer;
take the output feature of the convolutional layer as the input of the first fully connected layer, and process it through the first fully connected layer to obtain the output feature of the first fully connected layer;
and take the output feature of the first fully connected layer as the input of the first non-linear activation layer, and perform non-linear processing on it through the first non-linear activation layer to obtain the attention weight vector, as sketched below.
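This channel-wise attention computation can be sketched as follows, under the same assumed framework; average pooling, a 1x1 convolution, and a sigmoid activation are illustrative choices not fixed by the text.

    # Minimal sketch of the attention weight computation: per-channel pooling,
    # then convolution, then a fully connected layer, then a non-linear
    # activation, yielding one weight per channel.
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)              # pool each channel to one value
            self.conv = nn.Conv2d(channels, channels, kernel_size=1)
            self.fc = nn.Linear(channels, channels)          # first fully connected layer
            self.act = nn.Sigmoid()                          # first non-linear activation layer

        def forward(self, first_output):                     # (batch, C, H, W)
            w = self.pool(first_output)                      # (batch, C, 1, 1)
            w = self.conv(w).flatten(1)                      # (batch, C)
            return self.act(self.fc(w))                      # attention weight vector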
Optionally, the attention model further includes an attention scaling layer whose inputs include the first output feature and the attention weight vector.
The second processing module is configured to:
perform scaling processing on the attention weight vector through the attention scaling layer to obtain a first scaled weight vector;
perform normalization processing on the first scaled weight vector through the attention scaling layer to obtain a second scaled weight vector;
perform weighting processing on the first output feature according to the second scaled weight vector through the attention scaling layer to obtain a third output feature;
and determine the second output feature according to the third output feature.
Optionally, the input of the attention model further includes the voice feature, and the second processing module is configured to merge the voice feature with the third output feature to obtain the second output feature, as in the sketch below.
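A minimal sketch of the scaling and merging steps, under the same assumed framework; the scaling constant, the softmax normalization, and the concatenation-based merge are illustrative assumptions.

    # Minimal sketch of the attention scaling layer and the final merge: scale
    # the attention weight vector, normalize it, weight the first output
    # feature with it, and merge the voice feature back in. The merge assumes
    # the voice feature shares the spatial size of the weighted feature.
    import torch
    import torch.nn.functional as F

    def attention_scale_and_merge(first_output, attn_weights, voice_feature, scale=2.0):
        w1 = attn_weights * scale                          # first scaled weight vector
        w2 = F.softmax(w1, dim=1)                          # second scaled weight vector
        third = first_output * w2[:, :, None, None]        # third output feature
        return torch.cat([third, voice_feature], dim=1)    # second output feature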
Optionally, the third processing module is configured to:
perform a global pooling operation on the second output feature to obtain a global pooled feature;
and perform global normalization processing on the global pooled feature to obtain the first wake-up word probability, as sketched below.
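A minimal sketch of this probability conversion; mean pooling, the scalar projection, and the sigmoid are assumptions standing in for the unspecified global pooling and global normalization operations.

    # Minimal sketch of the probability conversion: global pooling over the
    # second output feature, then normalization to a probability in [0, 1].
    import torch
    import torch.nn as nn

    class ProbabilityHead(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.fc = nn.Linear(channels, 1)  # assumed projection to a scalar

        def forward(self, second_output):                     # (batch, C, H, W)
            pooled = second_output.mean(dim=(2, 3))           # global pooled feature
            return torch.sigmoid(self.fc(pooled)).squeeze(1)  # first wake-up word probability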
Optionally, the wake-up module includes:
a determining unit configured to determine M historical wake-up word probabilities, where the M historical wake-up word probabilities are obtained by performing prediction on historical voice data;
a fusion unit configured to perform fusion processing on the M historical wake-up word probabilities and the first wake-up word probability to obtain a second wake-up word probability;
and a wake-up unit configured to wake up the electronic device based on the second wake-up word probability.
Optionally, the fusion unit is configured to:
take the M historical wake-up word probabilities and the first wake-up word probability as the input of a history window memory model, perform feature extraction on the M historical wake-up word probabilities through the history window memory model, and multiply the extracted features point-wise with the first wake-up word probability to obtain a fusion feature;
take the first wake-up word probability as the input of a feature extraction model, and perform feature extraction on it through the feature extraction model to obtain a first probability feature;
and determine the second wake-up word probability according to the first probability feature and the fusion feature.
Optionally, the history window memory model includes a bidirectional recurrent neural network (RNN) layer, a first point-wise multiplication layer, a normalization processing layer, and a second point-wise multiplication layer, where the bidirectional RNN layer includes a first RNN layer and a second RNN layer.
The fusion unit is configured to:
take the M historical wake-up word probabilities as the input of the bidirectional RNN layer, and perform feature extraction on them through the first RNN layer and the second RNN layer respectively to obtain a second probability feature and a third probability feature;
take the first wake-up word probability and the second probability feature as the input of the first point-wise multiplication layer, and multiply them point-wise through the first point-wise multiplication layer to obtain the output feature of the first point-wise multiplication layer;
take the output feature of the first point-wise multiplication layer as the input of the normalization processing layer, and normalize it through the normalization processing layer to obtain the output feature of the normalization processing layer;
and take the output feature of the normalization processing layer and the third probability feature as the input of the second point-wise multiplication layer, and multiply them point-wise through the second point-wise multiplication layer to obtain the fusion feature, as sketched below.
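A minimal sketch of this history window memory model follows; the GRU cells, the hidden size, the use of the final time step, and the softmax normalization are assumptions, as the text fixes only the layer arrangement.

    # Minimal sketch of the history window memory model: a forward RNN and a
    # backward RNN over the M historical probabilities, a first point-wise
    # multiplication with the current probability, a normalization step, and
    # a second point-wise multiplication yielding the fusion feature.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HistoryWindowMemory(nn.Module):
        def __init__(self, hidden=8):
            super().__init__()
            self.rnn_fwd = nn.GRU(1, hidden, batch_first=True)  # first RNN layer
            self.rnn_bwd = nn.GRU(1, hidden, batch_first=True)  # second RNN layer

        def forward(self, history, p_first):        # history: (batch, M), p_first: (batch,)
            h = history.unsqueeze(-1)                # (batch, M, 1)
            f2, _ = self.rnn_fwd(h)                  # second probability feature
            f3, _ = self.rnn_bwd(torch.flip(h, dims=[1]))  # third probability feature
            m1 = f2[:, -1] * p_first.unsqueeze(-1)   # first point-wise multiplication
            norm = F.softmax(m1, dim=-1)             # normalization processing layer
            return norm * f3[:, -1]                  # second point-wise multiplication -> fusion feature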
Optionally, the feature extraction model includes a second fully connected layer and a second non-linear activation layer.
The fusion unit is configured to:
process the first wake-up word probability through the second fully connected layer to obtain the output feature of the second fully connected layer;
and take the output feature of the second fully connected layer as the input of the second non-linear activation layer, and perform non-linear processing on it through the second non-linear activation layer to obtain the first probability feature.
Optionally, the fusion unit is configured to:
update the first probability feature based on a probability threshold to obtain an updated first probability feature, where the updated first probability feature is 1 if the first probability feature is greater than the probability threshold, and 0 if the first probability feature is less than or equal to the probability threshold;
and add a first product and a second product to obtain the second wake-up word probability, where the first product is the product of the updated first probability feature and the first wake-up word probability, the second product is the product of a designated difference and the fusion feature, and the designated difference is the difference between 1 and the updated first probability feature; a worked sketch follows.
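Written out, this gating reduces to: with u the thresholded (0 or 1) first probability feature, p1 the first wake-up word probability, and f the fusion feature, the second wake-up word probability is u * p1 + (1 - u) * f. A minimal sketch, treating the quantities as scalars for illustration:

    # Minimal sketch of the gated combination: when the first probability
    # feature clears the threshold, keep the current probability; otherwise
    # fall back to the fused history-based feature.

    def second_wake_word_probability(first_prob_feature, p_first, fusion,
                                     threshold=0.5):
        u = 1.0 if first_prob_feature > threshold else 0.0  # updated first probability feature
        return u * p_first + (1.0 - u) * fusion             # first product + second product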
Optionally, the wake-up unit is configured to:
if the second wake-up word probability is greater than the probability threshold, take the voice feature as the input of an RNN model, and predict, through the RNN model, the probability that the voice data includes the preset wake-up word to obtain a third wake-up word probability;
and if the third wake-up word probability is greater than the probability threshold, wake up the electronic device.
Optionally, the electronic device is provided with a first processor and a second processor, where the power consumption of the first processor is lower than that of the second processor; the apparatus further includes:
an acquisition module configured to collect voice data through the first processor;
the wake-up module is configured to:
if the second wake-up word probability is greater than the probability threshold, switch the first processor from the working state to the sleep state, start the second processor, and, through the second processor, take the voice feature as the input of the RNN model and predict, through the RNN model, the probability that the voice data includes the preset wake-up word to obtain the third wake-up word probability;
and a fourth processing module configured to, if the third wake-up word probability is less than or equal to the probability threshold, switch the second processor from the working state to the sleep state, start the first processor, and continue to collect voice data through the first processor.
Optionally, the wake-up unit is configured to:
if the third wake-up word probability is greater than the probability threshold, perform voiceprint recognition on the voice data to identify whether the voiceprint feature of the voice data matches a stored voiceprint feature;
and if it is determined that the voiceprint feature of the voice data matches the stored voiceprint feature, wake up the electronic device.
Optionally, the wake-up module is configured to:
if the electronic device is in the screen-off state, trigger the electronic device to light up the screen, or to light up the screen and unlock, or to invoke the voice assistant;
and if the electronic device is in the screen-on state, trigger the electronic device to unlock, or to invoke the voice assistant.
Optionally, the apparatus further includes a voice collection module configured to, in response to the electronic device having established a wireless communication connection with a voice collection device, receive the voice data sent by the voice collection device, where the voice collection device is provided with a microphone.
Optionally, the wake-up module is further configured to, in response to the voice collection device being an in-vehicle terminal and the speed of the electronic device being greater than a speed threshold, wake up the electronic device based on the first wake-up word probability.
It should be noted that when the voice wake-up apparatus provided by the foregoing embodiments performs voice wake-up, the division into the above functional modules is merely an example; in practical applications, the above functions may be assigned to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the voice wake-up apparatus provided by the foregoing embodiments belongs to the same concept as the voice wake-up method embodiments; for its specific implementation, refer to the method embodiments, which will not be repeated here.
FIG. 13 is a schematic structural diagram of an electronic device 1300 provided by an embodiment of the present application. The electronic device may be a smart speaker, a smart TV, a smart wearable device, or a terminal such as a mobile phone, a tablet computer, or a computer. The electronic device may vary greatly in configuration or performance, and may include one or more processors 1301 and one or more memories 1302, where the memory 1302 stores at least one instruction that is loaded and executed by the processor 1301 to implement the voice wake-up method provided by each of the foregoing method embodiments. For example, the electronic device includes a first processor and a second processor, where the power consumption of the first processor is lower than that of the second processor; the first processor is configured to execute the first-level wake-up algorithm, and the second processor is configured to execute the second-level wake-up algorithm, or the second-level wake-up algorithm and voiceprint recognition. For example, the first processor is a DSP and the second processor is an ARM processor. Of course, the electronic device may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface, and may further include other components for implementing device functions, which are not described in detail here.
In an exemplary embodiment, a computer-readable storage medium is further provided, where the computer-readable storage medium stores instructions that, when executed by a processor, implement the above voice wake-up method.
In an exemplary embodiment, a computer program product is further provided, which, when executed, is used to implement the above voice wake-up method.
Those skilled in the art should appreciate that, in one or more of the above examples, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include computer storage media and communication media, where communication media include any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium accessible to a general-purpose or special-purpose computer.
The above are merely optional embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (20)

  1. A voice wake-up method, wherein the method comprises:
    performing feature extraction on collected voice data to obtain a voice feature;
    taking the voice feature as an input of a U-shaped convolutional neural network model, and performing feature extraction and feature fusion on the voice feature through the U-shaped convolutional neural network model to obtain a first output feature;
    taking the first output feature as an input of an attention model, performing attention computation on features of each channel of the first output feature through the attention model to obtain an attention weight vector, performing scaling processing on the attention weight vector, and determining a second output feature according to the processed attention weight vector and the first output feature;
    performing probability conversion on the second output feature to obtain a first wake-up word probability, wherein the first wake-up word probability indicates a probability that the voice data comprises a preset wake-up word; and
    waking up an electronic device based on the first wake-up word probability.
  2. The method according to claim 1, wherein the U-shaped convolutional neural network comprises N network layer groups, each network layer group comprises a convolutional neural network layer, a batch normalization layer, and a linear activation layer, and output features of designated shallow network layers among the N network layer groups flow to designated deep network layers, so as to perform feature fusion between the shallow networks and the deep networks among the N network layer groups.
  3. The method according to claim 1, wherein the attention model comprises a pooling layer, a convolutional layer, a first fully connected layer, and a first non-linear activation layer;
    the performing attention computation on features of each channel of the first output feature through the attention model to obtain an attention weight vector comprises:
    performing a pooling operation on the features of each channel of the first output feature respectively through the pooling layer to obtain an output feature of the pooling layer;
    taking the output feature of the pooling layer as an input of the convolutional layer, and performing convolution processing on the output feature of the pooling layer through the convolutional layer to obtain an output feature of the convolutional layer;
    taking the output feature of the convolutional layer as an input of the first fully connected layer, and processing the output feature of the convolutional layer through the first fully connected layer to obtain an output feature of the first fully connected layer; and
    taking the output feature of the first fully connected layer as an input of the first non-linear activation layer, and performing non-linear processing on the output feature of the first fully connected layer through the first non-linear activation layer to obtain the attention weight vector.
  4. The method according to claim 3, wherein the attention model further comprises an attention scaling layer, and inputs of the attention scaling layer comprise the first output feature and the attention weight vector;
    the determining a second output feature according to the processed attention weight vector and the first output feature comprises:
    performing scaling processing on the attention weight vector through the attention scaling layer to obtain a first scaled weight vector;
    performing normalization processing on the first scaled weight vector through the attention scaling layer to obtain a second scaled weight vector;
    performing weighting processing on the first output feature according to the second scaled weight vector through the attention scaling layer to obtain a third output feature; and
    determining the second output feature according to the third output feature.
  5. The method according to claim 4, wherein the input of the attention model further comprises the voice feature;
    the determining the second output feature according to the third output feature comprises:
    merging the voice feature with the third output feature to obtain the second output feature.
  6. The method according to claim 1, wherein the performing probability conversion on the second output feature to obtain a first wake-up word probability comprises:
    performing a global pooling operation on the second output feature to obtain a global pooled feature; and
    performing global normalization processing on the global pooled feature to obtain the first wake-up word probability.
  7. The method according to any one of claims 1 to 6, wherein the waking up an electronic device based on the first wake-up word probability comprises:
    determining M historical wake-up word probabilities, wherein the M historical wake-up word probabilities are obtained by performing prediction on historical voice data;
    performing fusion processing on the M historical wake-up word probabilities and the first wake-up word probability to obtain a second wake-up word probability; and
    waking up the electronic device based on the second wake-up word probability.
  8. The method according to claim 7, wherein the performing fusion processing on the M historical wake-up word probabilities and the first wake-up word probability to obtain a second wake-up word probability comprises:
    taking the M historical wake-up word probabilities and the first wake-up word probability as inputs of a history window memory model, performing feature extraction on the M historical wake-up word probabilities through the history window memory model, and performing point-wise multiplication of the extracted features with the first wake-up word probability to obtain a fusion feature;
    taking the first wake-up word probability as an input of a feature extraction model, and performing feature extraction on the first wake-up word probability through the feature extraction model to obtain a first probability feature; and
    determining the second wake-up word probability according to the first probability feature and the fusion feature.
  9. The method according to claim 8, wherein the history window memory model comprises a bidirectional recurrent neural network (RNN) layer, a first point-wise multiplication layer, a normalization processing layer, and a second point-wise multiplication layer, and the bidirectional RNN layer comprises a first RNN layer and a second RNN layer;
    the taking the M historical wake-up word probabilities and the first wake-up word probability as inputs of a history window memory model, performing feature extraction on the M historical wake-up word probabilities through the history window memory model, and performing point-wise multiplication of the extracted features with the first wake-up word probability to obtain a fusion feature comprises:
    taking the M historical wake-up word probabilities as an input of the bidirectional RNN layer, and performing feature extraction on the M historical wake-up word probabilities through the first RNN layer and the second RNN layer respectively to obtain a second probability feature and a third probability feature;
    taking the first wake-up word probability and the second probability feature as inputs of the first point-wise multiplication layer, and performing point-wise multiplication on the first wake-up word probability and the second probability feature through the first point-wise multiplication layer to obtain an output feature of the first point-wise multiplication layer;
    taking the output feature of the first point-wise multiplication layer as an input of the normalization processing layer, and performing normalization processing on the output feature of the first point-wise multiplication layer through the normalization processing layer to obtain an output feature of the normalization processing layer; and
    taking the output feature of the normalization processing layer and the third probability feature as inputs of the second point-wise multiplication layer, and performing point-wise multiplication on the output feature of the normalization processing layer and the third probability feature through the second point-wise multiplication layer to obtain the fusion feature.
  10. The method according to claim 8, wherein the feature extraction model comprises a second fully connected layer and a second non-linear activation layer;
    the performing feature extraction on the first wake-up word probability through the feature extraction model to obtain a first probability feature comprises:
    processing the first wake-up word probability through the second fully connected layer to obtain an output feature of the second fully connected layer; and
    taking the output feature of the second fully connected layer as an input of the second non-linear activation layer, and performing non-linear processing on the output feature of the second fully connected layer through the second non-linear activation layer to obtain the first probability feature.
  11. The method according to claim 8, wherein the determining the second wake-up word probability according to the first probability feature and the fusion feature comprises:
    updating the first probability feature based on a probability threshold to obtain an updated first probability feature, wherein the updated first probability feature is 1 if the first probability feature is greater than the probability threshold, and the updated first probability feature is 0 if the first probability feature is less than or equal to the probability threshold; and
    adding a first product and a second product to obtain the second wake-up word probability, wherein the first product is a product of the updated first probability feature and the first wake-up word probability, the second product is a product of a designated difference and the fusion feature, and the designated difference refers to a difference between 1 and the updated first probability feature.
  12. The method according to claim 7, wherein the waking up the electronic device based on the second wake-up word probability comprises:
    if the second wake-up word probability is greater than a probability threshold, taking the voice feature as an input of an RNN model, and predicting, through the RNN model, the probability that the voice data comprises the preset wake-up word to obtain a third wake-up word probability; and
    if the third wake-up word probability is greater than the probability threshold, waking up the electronic device.
  13. The method according to claim 12, wherein the electronic device is provided with a first processor and a second processor, and power consumption of the first processor is lower than that of the second processor;
    before the performing feature extraction on the collected voice data, the method further comprises:
    collecting the voice data through the first processor;
    the if the second wake-up word probability is greater than a probability threshold, taking the voice feature as an input of an RNN model, and predicting, through the RNN model, the probability that the voice data comprises the preset wake-up word to obtain a third wake-up word probability comprises:
    if the second wake-up word probability is greater than the probability threshold, switching the first processor from a working state to a sleep state, starting the second processor, taking the voice feature as the input of the RNN model through the second processor, and predicting, through the RNN model, the probability that the voice data comprises the preset wake-up word to obtain the third wake-up word probability; and
    after the taking the voice feature as the input of the RNN model, and predicting, through the RNN model, the probability that the voice data comprises the preset wake-up word to obtain the third wake-up word probability, the method further comprises:
    if the third wake-up word probability is less than or equal to the probability threshold, switching the second processor from the working state to the sleep state, starting the first processor, and continuing to collect voice data through the first processor.
  14. The method according to claim 12, wherein the waking up the electronic device comprises:
    performing voiceprint recognition on the voice data to identify whether a voiceprint feature of the voice data matches a stored voiceprint feature; and
    if it is determined that the voiceprint feature of the voice data matches the stored voiceprint feature, waking up the electronic device.
  15. The method according to claim 14, wherein the waking up the electronic device comprises:
    if the electronic device is in a screen-off state, triggering the electronic device to light up the screen, or triggering the electronic device to light up the screen and unlock, or invoking a voice assistant; and
    if the electronic device is in a screen-on state, triggering the electronic device to unlock, or invoking the voice assistant.
  16. The method according to claim 1, wherein the method further comprises:
    in response to the electronic device having established a wireless communication connection with a voice collection device, receiving the voice data sent by the voice collection device, wherein the voice collection device is provided with a microphone.
  17. The method according to claim 16, wherein the waking up an electronic device based on the first wake-up word probability comprises:
    in response to the voice collection device being an in-vehicle terminal and a speed of the electronic device being greater than a speed threshold, waking up the electronic device based on the first wake-up word probability.
  18. A voice wake-up apparatus, wherein the apparatus comprises:
    a feature extraction module configured to perform feature extraction on collected voice data to obtain a voice feature;
    a first processing module configured to take the voice feature as an input of a U-shaped convolutional neural network model, and perform feature extraction and feature fusion on the voice feature through the U-shaped convolutional neural network model to obtain a first output feature;
    a second processing module configured to take the first output feature as an input of an attention model, perform attention computation on features of each channel of the first output feature through the attention model to obtain an attention weight vector, perform scaling processing on the attention weight vector, and determine a second output feature according to the processed attention weight vector and the first output feature;
    a third processing module configured to perform probability conversion on the second output feature to obtain a first wake-up word probability, wherein the first wake-up word probability indicates a probability that the voice data comprises a preset wake-up word; and
    a wake-up module configured to wake up an electronic device based on the first wake-up word probability.
  19. An electronic device, wherein the electronic device comprises a processor and a memory; the memory stores at least one instruction, and the at least one instruction is configured to be executed by the processor to implement the voice wake-up method according to any one of claims 1 to 17.
  20. A computer-readable storage medium, wherein the storage medium stores at least one instruction, and the at least one instruction is configured to be executed by a processor to implement the voice wake-up method according to any one of claims 1 to 17.
PCT/CN2020/138922 2019-12-30 2020-12-24 Voice wake-up method, apparatus, device, and storage medium WO2021136054A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911392963.X 2019-12-30
CN201911392963.XA CN111223488B (zh) 2019-12-30 2019-12-30 Voice wake-up method, apparatus, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021136054A1 (zh)

Family

ID=70829179

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/138922 WO2021136054A1 (zh) 2019-12-30 2020-12-24 Voice wake-up method, apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN111223488B (zh)
WO (1) WO2021136054A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111933111A (zh) * 2020-08-12 2020-11-13 北京猎户星空科技有限公司 Voice wake-up method, apparatus, electronic device, and storage medium

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111223488B (zh) * 2019-12-30 2023-01-17 Oppo广东移动通信有限公司 Voice wake-up method, apparatus, device, and storage medium
CN112466327B (zh) * 2020-10-23 2022-02-22 北京百度网讯科技有限公司 Voice processing method, apparatus, and electronic device
CN112669818B (zh) * 2020-12-08 2022-12-02 北京地平线机器人技术研发有限公司 Voice wake-up method and apparatus, readable storage medium, and electronic device
CN112530410A (zh) * 2020-12-24 2021-03-19 北京地平线机器人技术研发有限公司 Command word recognition method and device
CN112951235B (zh) * 2021-01-27 2022-08-16 北京云迹科技股份有限公司 Speech recognition method and apparatus
CN113450800A (zh) * 2021-07-05 2021-09-28 上海汽车集团股份有限公司 Method and apparatus for determining wake-up word activation probability, and intelligent voice product
CN115312068B (zh) * 2022-07-14 2023-05-09 荣耀终端有限公司 Voice control method, device, and storage medium


Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10839790B2 (en) * 2017-02-06 2020-11-17 Facebook, Inc. Sequence-to-sequence convolutional architecture
US10347244B2 (en) * 2017-04-21 2019-07-09 Go-Vivace Inc. Dialogue system incorporating unique speech to text conversion method for meaningful dialogue response
US10354656B2 (en) * 2017-06-23 2019-07-16 Microsoft Technology Licensing, Llc Speaker recognition
JP6987378B2 (ja) * 2017-07-18 2021-12-22 国立研究開発法人情報通信研究機構 Neural network training method and computer program
CN108010514B (zh) * 2017-11-20 2021-09-10 四川大学 Speech classification method based on a deep neural network
CN108876792B (zh) * 2018-04-13 2020-11-10 北京迈格威科技有限公司 Semantic segmentation method, apparatus, system, and storage medium
CN109509178B (zh) * 2018-10-24 2021-09-10 苏州大学 Choroid segmentation method for OCT images based on an improved U-net network
CN109448719B (zh) * 2018-12-11 2022-09-09 杭州易现先进科技有限公司 Neural network model building method, and voice wake-up method, apparatus, medium, and device
KR102013777B1 (ko) * 2018-12-12 2019-10-21 한국과학기술정보연구원 Method for restoring video distortion and apparatus applying the same
CN109712203B (zh) * 2018-12-29 2020-11-17 福建帝视信息科技有限公司 Image colorization method based on a self-attention generative adversarial network
CN109886243B (zh) * 2019-03-01 2021-03-26 腾讯医疗健康(深圳)有限公司 Image processing method, apparatus, storage medium, device, and system
CN110246490B (zh) * 2019-06-26 2022-04-19 合肥讯飞数码科技有限公司 Voice keyword detection method and related apparatus
CN110502610A (zh) * 2019-07-24 2019-11-26 深圳壹账通智能科技有限公司 Intelligent voice signature method, apparatus, and medium based on text semantic similarity

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107221326A * 2017-05-16 2017-09-29 百度在线网络技术(北京)有限公司 Artificial-intelligence-based voice wake-up method, apparatus, and computer device
CN109903750A * 2019-02-21 2019-06-18 科大讯飞股份有限公司 Speech recognition method and apparatus
US20190221206A1 * 2019-03-27 2019-07-18 Intel Corporation Spoken keyword detection based utterance-level wake on intent system
CN110473554A * 2019-08-08 2019-11-19 Oppo广东移动通信有限公司 Audio verification method, apparatus, storage medium, and electronic device
CN110570858A * 2019-09-19 2019-12-13 芋头科技(杭州)有限公司 Voice wake-up method, apparatus, smart speaker, and computer-readable storage medium
CN111223488A * 2019-12-30 2020-06-02 Oppo广东移动通信有限公司 Voice wake-up method, apparatus, device, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GUO, YU: "Study on Wake Up Word Recognition Based on Deep Learning", CHINESE MASTER'S THESES FULL-TEXT DATABASE, 1 May 2019 (2019-05-01), pages 1 - 59, XP055825860 *


Also Published As

Publication number Publication date
CN111223488B (zh) 2023-01-17
CN111223488A (zh) 2020-06-02


Legal Events

121 EP: the EPO has been informed by WIPO that EP was designated in this application (ref document number: 20911217; country of ref document: EP; kind code of ref document: A1)
NENP Non-entry into the national phase (ref country code: DE)
122 EP: PCT application non-entry in European phase (ref document number: 20911217; country of ref document: EP; kind code of ref document: A1)