CN112530418A - Voice wake-up method, device and related equipment - Google Patents

Voice wake-up method, device and related equipment

Info

Publication number
CN112530418A
Authority
CN
China
Prior art keywords: voice, model, crowd, awakening, wake
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910800728.5A
Other languages
Chinese (zh)
Inventor
陈孝良
靳源
冯大航
常乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN201910800728.5A
Publication of CN112530418A
Legal status: Pending

Classifications

    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04: Segmentation; Word boundary detection
    • G10L 15/063: Training (under G10L 15/06, Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 17/26: Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G10L 2015/223: Execution procedure of a spoken command (under G10L 15/22)

Abstract

The present application provides a voice wake-up method, apparatus and related device. The method includes the following steps: acquiring an initial voice signal and extracting voice features from it; inputting the voice features into a preset crowd attribute classification model to obtain the crowd attribute classification result output by the model; determining, according to the classification result, the crowd attribute corresponding to the initial voice signal; selecting the wake-up model corresponding to that crowd attribute from a wake-up model group as the target wake-up model; and inputting the initial voice signal into the target wake-up model so that the target wake-up model performs voice wake-up. In this way, the reliability of voice wake-up can be improved.

Description

Voice wake-up method, device and related equipment
Technical Field
The present application relates to the field of voice processing technologies, and in particular, to a voice wake-up method, apparatus, and related device.
Background
The voice wake-up means that a user wakes up the electronic device by speaking a wake-up word, so that the electronic device enters a state of waiting for a voice instruction or directly executes a predetermined voice instruction.
Voice wake-up technology is increasingly applied to voice interaction devices, but how to improve the reliability of voice wake-up has become a problem to be solved.
Disclosure of Invention
In order to solve the foregoing technical problem, embodiments of the present application provide a voice wake-up method, apparatus, and related device, so as to achieve the purpose of improving reliability of voice wake-up, and the technical solution is as follows:
a voice wake-up method, comprising:
acquiring an initial voice signal, and extracting voice features from the initial voice signal;
inputting the voice features into a preset crowd attribute classification model to obtain a crowd attribute classification result output by the crowd attribute classification model, wherein the crowd attribute classification model is obtained by utilizing a voice training sample marked with crowd attributes through training;
according to the crowd attribute classification result, determining the crowd attribute corresponding to the initial voice signal;
selecting a wake-up model corresponding to the crowd attributes from a wake-up model group as a target wake-up model, wherein the wake-up model group comprises a plurality of different types of wake-up models, and each type of wake-up model is obtained by training with a voice training sample corresponding to the crowd attributes;
and inputting the initial voice signal into the target awakening model so as to enable the target awakening model to carry out voice awakening.
Preferably, the extracting the speech feature from the initial speech signal includes:
performing VAD interception processing on the initial voice signal to obtain an effective voice signal;
and extracting voice features from the effective voice signal.
Preferably, the crowd attribute classification model is a convolutional neural network model;
the training process of the convolutional neural network model comprises the following steps:
initializing parameters of each layer in the convolutional neural network model;
selecting an unused voice training sample from the voice training sample set as a target voice training sample;
performing VAD interception processing on the target voice training sample to obtain an effective voice training signal;
extracting voice features from the effective voice training signal, and inputting the voice features into the convolutional neural network model to obtain a classification result output by the convolutional neural network model;
calculating the cross entropy of the classification result output by the convolutional neural network model and the crowd attribute labeled by the target voice training sample;
updating parameters of each layer in the convolutional neural network model, returning to the step of selecting an unused voice training sample from the voice training sample set until a cross entropy is obtained, and judging whether the cross entropy is converged according to the cross entropy obtained by the current calculation and the cross entropy obtained by the previous calculation;
if yes, ending the training;
if not, returning to the step of updating the parameters of each layer in the convolutional neural network model.
Preferably, the updating the parameters of each layer in the convolutional neural network model includes:
taking the cross entropy as a loss function result;
and respectively transmitting the loss function result to each layer in the convolutional neural network model according to the sequence from the output layer to the input layer in the convolutional neural network model, and updating the parameters of each layer in the convolutional neural network model.
Preferably, the voice features are voice features with non-fixed frame length;
the crowd-sourcing attribute classification model includes at least: an input layer, a convolutional layer, a pooling layer, a global pooling layer, and an output layer.
Preferably, the training process of the wake-up model includes:
respectively determining the crowd attributes corresponding to the voice training samples containing the awakening words by using the crowd attribute classification model;
dividing voice training samples containing awakening words with the same crowd attributes into a group as an awakening word training sample group according to the crowd attributes corresponding to the voice training samples containing the awakening words;
and training a wake-up model according to the wake-up word training sample group to obtain the wake-up models corresponding to various crowd attributes.
A voice wake-up apparatus comprising:
the extraction module is used for acquiring an initial voice signal and extracting voice features from the initial voice signal;
the classification module is used for inputting the voice features into a preset crowd attribute classification model to obtain a crowd attribute classification result output by the crowd attribute classification model, and the crowd attribute classification model is obtained by utilizing a voice training sample marked with crowd attributes for training;
the determining module is used for determining the crowd attribute corresponding to the initial voice signal according to the crowd attribute classification result;
the selection module is used for selecting the awakening model corresponding to the crowd attribute from the awakening model group as a target awakening model, wherein the awakening model group comprises a plurality of awakening models of different types, and each type of awakening model is obtained by training with the voice training samples of the corresponding crowd attribute;
and the awakening module is used for inputting the initial voice signal into the target awakening model, so that the target awakening model performs voice awakening.
Preferably, the extraction module includes:
the VAD interception processing submodule is used for carrying out VAD interception processing on the initial voice signal to obtain an effective voice signal;
and the extraction sub-module is used for extracting the voice features from the effective voice signals.
Preferably, the crowd attribute classification model is a convolutional neural network model;
the device further comprises:
a convolutional neural network model training module to:
initializing parameters of each layer in the convolutional neural network model;
selecting an unused voice training sample from the voice training sample set as a target voice training sample;
performing VAD interception processing on the target voice training sample to obtain an effective voice training signal;
extracting voice features from the effective voice training signal, and inputting the voice features into the convolutional neural network model to obtain a classification result output by the convolutional neural network model;
calculating the cross entropy of the classification result output by the convolutional neural network model and the crowd attribute labeled by the target voice training sample;
updating parameters of each layer in the convolutional neural network model, returning to the step of selecting an unused voice training sample from the voice training sample set until a cross entropy is obtained, and judging whether the cross entropy is converged according to the cross entropy obtained by the current calculation and the cross entropy obtained by the previous calculation;
if yes, ending the training;
if not, returning to the step of updating the parameters of each layer in the convolutional neural network model.
Preferably, the convolutional neural network model training module is specifically configured to:
taking the cross entropy as a loss function result;
and respectively transmitting the loss function result to each layer in the convolutional neural network model according to the sequence from the output layer to the input layer in the convolutional neural network model, and updating the parameters of each layer in the convolutional neural network model.
Preferably, the voice features are voice features with non-fixed frame length;
the crowd-sourcing attribute classification model includes at least: an input layer, a convolutional layer, a pooling layer, a global pooling layer, and an output layer.
Preferably, the apparatus further comprises:
a wake-up module training module to:
respectively determining the crowd attributes corresponding to the voice training samples containing the awakening words by using the crowd attribute classification model;
dividing voice training samples containing awakening words with the same crowd attributes into a group as an awakening word training sample group according to the crowd attributes corresponding to the voice training samples containing the awakening words;
and training a wake-up model according to the wake-up word training sample group to obtain the wake-up models corresponding to various crowd attributes.
A voice wake-up device comprising:
a memory for storing a program;
the processor is configured to run the program, and when the processor runs the program, the processor implements the voice wake-up method according to any one of the above items.
A storage medium storing a program which, when executed, implements the voice wake-up method according to any one of the above.
Compared with the prior art, the beneficial effects of the present application are as follows:
in the application, the voice features extracted from the initial voice signals are input into the crowd attribute classification model to obtain the crowd attribute classification result output by the crowd attribute classification model, the crowd attributes corresponding to the initial voice signals are determined according to the crowd attribute classification result, the awakening model corresponding to the crowd attributes is selected from the awakening model group to serve as the target awakening model, and voice awakening is performed by using the target awakening model.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive labor.
Fig. 1 is a flowchart of an embodiment 1 of a voice wake-up method provided in the present application;
fig. 2 is a flowchart of an embodiment 2 of a voice wake-up method provided in the present application;
FIG. 3 is a flow chart of a convolutional neural network model training process provided herein;
FIG. 4 is a flow chart of a training process of a wake-up model provided herein;
fig. 5 is a schematic logic structure diagram of a voice wake-up apparatus provided in the present application.
Detailed Description
During application of voice wake-up technology, the inventors of the present application noticed that the voice wake-up rate of existing wake-up models is not high for some groups of people (e.g., the elderly and children). To alleviate this problem and improve the voice wake-up rate, the present application proposes a voice wake-up method, which is introduced in detail below.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
As shown in fig. 1, a flowchart of an embodiment 1 of a voice wake-up method provided by the present application is provided, where the method includes the following steps:
and step S11, acquiring an initial voice signal and extracting voice features from the initial voice signal.
In this embodiment, the voice features may include, but are not limited to: MFCC (Mel-scale Frequency Cepstral Coefficient) features, or MFCC features together with pitch (PITCH) features.
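For illustration only, the feature extraction of step S11 could be sketched as follows; the use of librosa, the 13-dimensional MFCC and the YIN pitch tracker are assumptions of this sketch and are not prescribed by the present application.

```python
# Illustrative sketch of MFCC + pitch feature extraction (library choice,
# dimensionality and pitch range are assumptions, not patent requirements).
import librosa
import numpy as np

def extract_features(signal, sr=16000, n_mfcc=13):
    """Return a (num_frames, n_mfcc + 1) matrix of MFCC plus pitch features.

    The number of frames depends on the utterance length, so the result
    has a non-fixed frame length.
    """
    # librosa returns (n_mfcc, frames); transpose to (frames, n_mfcc)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T
    # Fundamental frequency (pitch) track, one value per frame
    f0 = librosa.yin(signal, fmin=50, fmax=400, sr=sr)
    n = min(mfcc.shape[0], len(f0))            # align frame counts
    return np.concatenate([mfcc[:n], f0[:n, None]], axis=1)

# signal, sr = librosa.load("utterance.wav", sr=16000)
# feats = extract_features(signal, sr)         # shape: (num_frames, 14)
```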
It should be noted that, in human-computer interaction, the length of the valid speech signal input by the user is generally not fixed, and therefore, the speech feature extracted from the initial speech signal is a speech feature with a non-fixed frame length.
And step S12, inputting the voice characteristics into a preset crowd attribute classification model to obtain a crowd attribute classification result output by the crowd attribute classification model.
The crowd attribute classification model is obtained by training with a voice training sample marked with crowd attributes.
The crowd attribute classification result can be understood as the probability value of each type of crowd attribute corresponding to the voice features, for example, the probability that the voice features correspond to an adult man, the probability that they correspond to an adult woman, the probability that they correspond to an old person, and the probability that they correspond to a child.
In this embodiment, the crowd attribute classification model may include, but is not limited to: CNN (Convolutional Neural Networks) model.
When the crowd attribute classification model is a CNN model and the voice features have a non-fixed frame length, a global pooling layer is used in place of the fully connected layer in the CNN model, so that voice features of non-fixed frame length can be input and processed. Specifically, the CNN model at least includes: an input layer, a pooling layer, a convolutional layer, a global pooling layer, and an output layer. The connection relationship of the layers may be, but is not limited to, the following: the input layer is connected to the pooling layer, the pooling layer to the convolutional layer, the convolutional layer to the global pooling layer, and the global pooling layer to the output layer.
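A minimal PyTorch sketch of such a structure is given below; the channel sizes, kernel sizes and the four attribute classes are illustrative assumptions, and only the layer order (input, pooling, convolution, global pooling, output) follows the description above.

```python
# Minimal sketch of the described CNN: pooling -> convolution ->
# global pooling -> output. Layer sizes and the four-class output
# (adult man / adult woman / old person / child) are assumptions.
import torch
import torch.nn as nn

class CrowdAttributeCNN(nn.Module):
    def __init__(self, feat_dim=14, num_classes=4):   # 14 = 13 MFCC + 1 pitch
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=(2, 1))              # pooling layer
        self.conv = nn.Conv2d(1, 32, kernel_size=3, padding=1)    # convolutional layer
        # Global pooling collapses the time axis, so utterances of any frame
        # length map to a fixed-size vector (replacing a fully connected
        # layer that would need a fixed input size).
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        self.out = nn.Linear(32, num_classes)                     # output layer

    def forward(self, x):
        # x: (batch, 1, num_frames, feat_dim); num_frames may vary
        x = self.pool(x)
        x = torch.relu(self.conv(x))
        x = self.global_pool(x).flatten(1)    # (batch, 32)
        return self.out(x)                    # class logits

# Utterances of different lengths produce the same output shape:
model = CrowdAttributeCNN()
print(model(torch.randn(1, 1, 120, 14)).shape)   # torch.Size([1, 4])
print(model(torch.randn(1, 1, 305, 14)).shape)   # torch.Size([1, 4])
```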
And step S13, determining the crowd attribute corresponding to the initial voice signal according to the crowd attribute classification result.
Since the crowd attribute classification result comprises the probability value of each type of crowd attribute corresponding to the voice features, these probability values can be compared, and the crowd attribute corresponding to the initial voice signal is determined according to the comparison result. Specifically, the crowd attribute with the maximum probability value may be taken as the crowd attribute corresponding to the initial voice signal. For example, if the probability is 0.3 for adult men, 0.2 for adult women, 0.4 for old people and 0.1 for children, then old people is taken as the crowd attribute corresponding to the initial voice signal.
And step S14, selecting the awakening model corresponding to the crowd attribute from the awakening model group as the target awakening model.
The awakening model group comprises a plurality of awakening models of different types, and the awakening models of the different types are obtained by training with voice training samples corresponding to the crowd attributes respectively.
In this embodiment, different wake-up models are set for different crowd attributes: for adult men, an adult man wake-up model is set; for adult women, an adult woman wake-up model is set; for old people, an old person wake-up model is set; and for children, a child wake-up model is set.
It can be understood that the different types of wake-up models need to be trained with voice training samples of different crowd attributes: the adult man wake-up model needs to be trained with voice training samples of adult men; the adult woman wake-up model with voice training samples of adult women; the old person wake-up model with voice training samples of old people; and the child wake-up model with voice training samples of children.
And step S15, inputting the initial voice signal into the target awakening model so as to make the target awakening model perform voice awakening.
The target wake-up model is trained with voice training samples of the crowd attribute corresponding to the initial voice signal, so the target wake-up model can accurately recognize the wake-up word in the initial voice signal, which improves the reliability of voice wake-up.
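Steps S13 to S15 may be sketched as follows; the attribute names reuse the example above, and each wake-up model is represented abstractly as a callable that reports whether the wake-up word was detected, which is an assumption of this sketch rather than an interface defined by the patent.

```python
# Sketch of steps S13-S15: pick the crowd attribute with the maximum
# probability, then route the initial signal to the matching wake-up model.
from typing import Callable, Dict
import numpy as np

WakeModel = Callable[[np.ndarray], bool]   # assumed wake-up model interface

def voice_wake_up(initial_signal: np.ndarray,
                  class_probabilities: Dict[str, float],
                  wake_model_group: Dict[str, WakeModel]) -> bool:
    # Step S13: attribute with the maximum probability, e.g.
    # {"adult_man": 0.3, "adult_woman": 0.2, "old_people": 0.4, "child": 0.1} -> "old_people"
    crowd_attribute = max(class_probabilities, key=class_probabilities.get)
    # Step S14: select the matching wake-up model as the target wake-up model
    target_model = wake_model_group[crowd_attribute]
    # Step S15: feed the initial voice signal to the target wake-up model
    return target_model(initial_signal)
```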
In summary, in the present application, the voice features extracted from the initial voice signal are input into the crowd attribute classification model to obtain the crowd attribute classification result; the crowd attribute corresponding to the initial voice signal is determined according to that result; the wake-up model corresponding to the crowd attribute is selected from the wake-up model group as the target wake-up model; and voice wake-up is performed with the target wake-up model.
As another alternative embodiment of the present application, referring to fig. 2, a flowchart of an embodiment 2 of a voice wakeup method provided by the present application is shown, where this embodiment mainly describes a refinement scheme of the voice wakeup method described in the above embodiment 1, and as shown in fig. 2, the method may include, but is not limited to, the following steps:
step S21, obtaining an initial voice signal, and performing VAD truncation processing on the initial voice signal to obtain an effective voice signal.
Performing VAD (Voice Activity Detection) interception on the initial voice signal can be understood as: detecting the starting point and the end point of the speech in the initial voice signal, and separating the effective voice signal from the initial voice signal according to the starting point and the end point.
Preferably, the VAD algorithm may be one based on short-time energy and zero-crossing rate.
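A minimal numpy sketch of such an energy/zero-crossing-rate VAD is given below; the frame length, the thresholds and the simple first/last-active-frame end-pointing rule are assumptions of the sketch, not values taken from this application.

```python
# Illustrative short-time-energy / zero-crossing-rate VAD sketch.
import numpy as np

def vad_truncate(signal, frame_len=400, energy_thr=1e-3, zcr_thr=0.3):
    """Return the effective (speech) segment of `signal` (1-D float array)."""
    n_frames = len(signal) // frame_len
    active = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = float(np.mean(frame ** 2))                          # short-time energy
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))    # zero-crossing rate
        # A real detector combines these statistics more carefully (dual
        # thresholds, hangover smoothing); this simple rule only illustrates the idea.
        active.append(energy > energy_thr or zcr > zcr_thr)
    if not any(active):
        return signal[:0]                                  # no speech detected
    start = active.index(True) * frame_len                 # starting point
    end = (len(active) - active[::-1].index(True)) * frame_len   # end point
    return signal[start:end]
```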
And step S22, extracting voice features from the effective voice signal.
The voice features are extracted from the effective voice signals, and the efficiency and the accuracy of extracting the voice features can be improved.
Steps S21-S22 are a specific implementation of step S11 in example 1.
And step S23, inputting the voice features into a preset crowd attribute classification model to obtain a crowd attribute classification result output by the crowd attribute classification model, wherein the crowd attribute classification model is obtained by utilizing a voice training sample marked with crowd attributes for training.
And step S24, determining the crowd attribute corresponding to the initial voice signal according to the crowd attribute classification result.
And step S25, selecting an awakening model corresponding to the crowd attributes from an awakening model group as a target awakening model, wherein the awakening model group comprises a plurality of awakening models of different types, and each type of awakening model is obtained by utilizing a voice training sample corresponding to the crowd attributes for training.
And step S26, inputting the initial voice signal into the target awakening model so as to make the target awakening model perform voice awakening.
The detailed procedures of steps S23-S26 can be found in the related descriptions of steps S12-S15 in embodiment 1, and are not repeated herein.
In another embodiment of the present application, a training process of the convolutional neural network model described in embodiment 1 is described, please refer to fig. 3, which may include the following steps:
and step S31, initializing parameters of each layer in the convolutional neural network model.
The initialization process may refer to the process of initializing parameters of each layer in the convolutional neural network model in the prior art, and is not described herein again.
Step S32, selecting an unused voice training sample from the voice training sample set as a target voice training sample.
In order to ensure the training precision, a voice training sample set comprising a plurality of voice training samples can be set, and the comprehensiveness and integrity of the voice training samples are ensured.
And step S33, performing VAD truncation processing on the target voice training sample to obtain an effective voice training signal.
Performing VAD truncation processing on the target speech training sample to obtain an effective speech training signal, which can be understood as: and detecting a starting point and an end point of a voice signal in the target voice training sample, and separating an effective voice training signal from the target voice training sample according to the starting point and the end point.
And step S34, extracting voice features from the effective voice training signal, and inputting the voice features into the convolutional neural network model to obtain a classification result output by the convolutional neural network model.
The voice features are extracted from the effective voice training signals, so that the efficiency of extracting the voice features can be improved, a convolutional neural network model can be trained more effectively, and the interference of non-voice sections on training is avoided.
And step S35, calculating the cross entropy of the classification result output by the convolutional neural network model and the crowd attribute labeled by the target voice training sample.
And step S36, updating the parameters of each layer in the convolutional neural network model, and returning to step S32 until a new cross entropy is obtained.
Preferably, the process of updating the parameters of each layer in the convolutional neural network model may include:
a11, taking the cross entropy as a loss function result;
a12, respectively transmitting the loss function result to each layer in the convolutional neural network model according to the sequence from the output layer to the input layer in the convolutional neural network model, and updating the parameters of each layer in the convolutional neural network model.
According to the sequence from the output layer to the input layer in the convolutional neural network model, the loss function result is respectively transmitted to each layer in the convolutional neural network model, and the parameters of each layer in the convolutional neural network model are updated, which can be understood as: and updating parameters of each layer in the convolutional neural network model by using a back propagation principle.
Of course, the process of updating the parameters of each layer in the convolutional neural network model is not limited to the process shown in this embodiment, and other updating manners that can implement the parameters of each layer in the convolutional neural network model are also possible.
And step S37, judging whether the cross entropy is converged according to the cross entropy obtained by the current calculation and the cross entropy obtained by the previous calculation.
Judging whether the cross entropy converges can be understood as: judging whether the change trend of the cross entropy is a reduction trend;
or, judging whether the variation trend of the cross entropy is a reduction trend, if so, judging whether the difference value between the cross entropy obtained by the calculation and the cross entropy obtained by the last calculation is smaller than a preset value.
If yes, the training of the convolutional neural network model is considered to have reached the set requirement, and step S38 is executed; if not, the training is considered not to have met the set requirement, and the process returns to the step of updating the parameters of each layer in the convolutional neural network model.
And step S38, finishing training.
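The training process of steps S31 to S38 may be sketched compactly as follows; the optimizer, learning rate and convergence threshold are assumptions, vad_truncate() and extract_features() refer to the earlier sketches, the model could be the CrowdAttributeCNN sketched above, and each training sample is assumed to be a (waveform, attribute label index) pair.

```python
# Compact PyTorch sketch of steps S31-S38 (choices below are assumptions).
import torch
import torch.nn.functional as F

def train_classifier(model, samples, lr=1e-3, eps=1e-4):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)   # S31: layer parameters initialised with the model
    prev_ce = None
    for waveform, label in samples:                           # S32: next unused training sample
        effective = vad_truncate(waveform)                    # S33: VAD truncation
        feats = torch.tensor(extract_features(effective),
                             dtype=torch.float32)[None, None] # (1, 1, frames, dims)
        logits = model(feats)                                 # S34: classification result
        ce = F.cross_entropy(logits, torch.tensor([label]))   # S35: cross entropy vs. labelled attribute
        optimizer.zero_grad()
        ce.backward()                                         # S36: propagate the loss from output to input
        optimizer.step()                                      #      and update each layer's parameters
        # S37: convergence test against the previously computed cross entropy
        if prev_ce is not None and ce.item() < prev_ce and prev_ce - ce.item() < eps:
            break                                             # S38: end training
        prev_ce = ce.item()
    return model
```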
In another embodiment of the present application, a training process of the wake-up model described in embodiment 1 is described, please refer to fig. 4, which may include the following steps:
and step S41, respectively determining the crowd attributes corresponding to the voice training samples containing the awakening words by using the crowd attribute classification model.
The process of determining the crowd attributes corresponding to the voice training samples containing the awakening words by using the crowd attribute classification model may include:
and B11, respectively extracting voice characteristics from the voice training samples containing the awakening words.
And B12, respectively inputting the voice features extracted from the voice training samples containing the awakening words into the crowd attribute classification model to obtain the crowd attribute classification result output by the crowd attribute classification model.
And B13, determining the crowd attributes corresponding to the voice training samples containing the awakening words according to the crowd attribute classification results.
Step S42, dividing the voice training samples containing the wake-up word with the same crowd attributes into a group as a wake-up word training sample group according to the crowd attributes corresponding to the voice training samples containing the wake-up word.
The voice training samples containing the awakening words with the same crowd attributes are divided into a group, so that the awakening models corresponding to different crowd attributes can be trained conveniently.
The number of awakening word training sample groups is the same as the number of crowd attribute types; for example, if there are 4 types of crowd attributes, there are 4 awakening word training sample groups.
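Step S42 can be sketched as grouping the wake-up word samples into a dictionary keyed by crowd attribute; classify_crowd_attribute() below is a hypothetical placeholder for the classification pipeline of step S41 (B11-B13).

```python
# Sketch of step S42: group wake-up word training samples by crowd attribute.
from collections import defaultdict

def group_wake_word_samples(samples, classify_crowd_attribute):
    """`samples`: iterable of wake-up word utterances; returns one group per attribute.

    `classify_crowd_attribute` stands in for the classification pipeline of
    step S41 and is supplied by the caller.
    """
    groups = defaultdict(list)
    for sample in samples:
        attribute = classify_crowd_attribute(sample)   # e.g. "old_people"
        groups[attribute].append(sample)
    # One wake-up word training sample group per crowd attribute,
    # i.e. as many groups as there are attribute types (4 types -> 4 groups).
    return dict(groups)
```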
And step S43, training a wake-up model according to the wake-up word training sample group to obtain wake-up models corresponding to various crowd attributes.
Training a wake-up model according to the wake-up word training sample group to obtain a specific process of the wake-up model corresponding to various crowd attributes, which can include:
and C11, respectively extracting the voice characteristics of the voice training samples containing the awakening words in each awakening word training sample group, and forcibly aligning each frame of the voice training samples by using a voice recognition model according to the voice characteristics to obtain the phoneme labels corresponding to each frame of the voice training samples.
Preferably, the speech recognition model may be an ASR (Automatic Speech Recognition) model.
And C12, obtaining the voice features of the current frame of the voice training sample and the voice features of its context frames to form the feature vector of the current frame, and training the awakening model frame by frame with the feature vector of each frame and the corresponding phoneme label, so as to obtain the awakening models corresponding to the various crowd attributes.
The process of training the wake-up model frame by using the feature vectors and the corresponding phoneme labels of each frame to obtain the wake-up models corresponding to various types of crowd attributes may include:
d11, initializing parameters of each layer in the awakening model;
d12, training the wake-up model starting from the first frame of the voice training sample; since the first frame has no preceding context, copies of the first frame are used as its context. For example, if the context is defined as 20 frames, that is, the 10 frames preceding the current frame and the 10 frames following it, then wherever fewer than 10 preceding frames exist, the first frame is copied to fill the gap, and wherever fewer than 10 following frames exist, the last frame is copied to fill the gap; the window of the current frame is thus 21 frames.
The feature vector of each frame is the vector formed by the voice features of the current frame and of its context frames, that is, the voice features of all frames in the window. The voice feature of each frame may be 30-dimensional, or of another dimensionality. A sketch of this windowing is given after these steps.
D13, inputting the feature vector of the first frame into the awakening model for training, and obtaining the cross entropy of the current frame according to the training result and by combining the phoneme label corresponding to the current first frame;
d14, updating parameters of each layer in the awakening model according to the cross entropy;
d15, training the awakening model according to the second frame of the voice training sample, inputting the feature vector of the second frame into the awakening model for training, and obtaining the cross entropy of the second frame according to the training result and the phoneme label corresponding to the current second frame, and so on; after the cross entropy of the current frame is obtained, judging whether the cross entropy is converged or not according to the cross entropy of the current frame and the cross entropy calculated before;
if the convergence is achieved, ending the model training; if not, continuing to train the model according to the next frame until convergence.
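The context window described in step d12 may be sketched as follows; the 30-dimensional per-frame feature is the example dimensionality mentioned above, and the 10-frame left and right context follows the 21-frame window example.

```python
# Sketch of the 21-frame context window: each frame plus 10 preceding and
# 10 following frames, with edge gaps filled by copying the first/last frame.
import numpy as np

def frame_windows(features, left=10, right=10):
    """Turn (num_frames, dim) features into per-frame windowed vectors."""
    padded = np.concatenate([
        np.repeat(features[:1], left, axis=0),    # copies of the first frame
        features,
        np.repeat(features[-1:], right, axis=0),  # copies of the last frame
    ])
    win = left + 1 + right
    return np.stack([padded[i:i + win].reshape(-1)
                     for i in range(features.shape[0])])

feats = np.random.randn(50, 30)       # e.g. 50 frames of 30-dimensional features
print(frame_windows(feats).shape)     # (50, 630): 21 frames x 30 dims per vector
```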
The above training of the wake-up model is performed frame by frame. Optionally, batches of frames may also be used to train the wake-up model, and the process of training the wake-up model with batches of frames may include:
e11, initializing parameters of each layer in the awakening model;
e12, dividing the voice frames of the voice training samples into a plurality of batches of a certain size; preferably, the size may be several thousand, that is, each batch of frames includes several thousand frames;
e13, training the awakening model from the first batch of frames, inputting the feature vectors of the first batch of frames into the awakening model for training, and obtaining the overall cross entropy of the current batch of frames according to the training result and by combining the phoneme labels corresponding to the current first batch of frames;
e14, updating parameters of each layer in the awakening model according to the overall cross entropy;
e15, training the awakening model according to the second batch of frames, inputting the feature vectors of the second batch of frames into the awakening model for training, and obtaining the overall cross entropy of the second batch of frames according to the training result and by combining the phoneme labels corresponding to the current second batch of frames, and so on; after the integral cross entropy of the current batch frame is obtained, judging whether the integral cross entropy is converged or not according to the integral cross entropy of the current batch frame and the integral cross entropy calculated before;
if the convergence is achieved, ending the model training; if not, continuing to train the awakening model according to the next batch of frames until convergence.
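The batch-frame variant of steps e11 to e15 may be sketched as follows; the batch size, optimizer and convergence threshold are assumptions, and the frame-level feature vectors could, for example, be the windows produced by the frame_windows() sketch above, paired with the phoneme labels from the forced alignment.

```python
# Sketch of batch-frame training (e11-e15); sizes and thresholds are assumptions.
import torch
import torch.nn.functional as F

def train_wake_model_batched(model, frame_vectors, phoneme_labels,
                             batch_frames=2000, lr=1e-3, eps=1e-4):
    x = torch.tensor(frame_vectors, dtype=torch.float32)  # (num_frames, window_dim)
    y = torch.tensor(phoneme_labels, dtype=torch.long)    # (num_frames,) phoneme labels
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    prev_ce = None
    for xb, yb in zip(torch.split(x, batch_frames), torch.split(y, batch_frames)):
        logits = model(xb)                     # e13/e15: train on the current batch of frames
        ce = F.cross_entropy(logits, yb)       # overall cross entropy of the batch
        optimizer.zero_grad()
        ce.backward()
        optimizer.step()                       # e14: update the parameters of each layer
        # convergence test against the previous batch's overall cross entropy
        if prev_ce is not None and abs(prev_ce - ce.item()) < eps:
            break                              # convergence reached: end training
        prev_ce = ce.item()
    return model
```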
The following describes the voice wake-up apparatus provided in the present application, and the voice wake-up apparatus described below and the voice wake-up method described above may be referred to correspondingly.
Referring to fig. 5, the voice wake-up apparatus includes: an extraction module 11, a classification module 12, a determining module 13, a selection module 14 and a wake-up module 15.
The extraction module 11 is configured to acquire an initial voice signal and extract a voice feature from the initial voice signal;
the classification module 12 is configured to input the voice features into a preset crowd attribute classification model, so as to obtain a crowd attribute classification result output by the crowd attribute classification model, where the crowd attribute classification model is obtained by training with a voice training sample labeled with crowd attributes;
a determining module 13, configured to determine, according to the crowd attribute classification result, a crowd attribute corresponding to the initial voice signal;
a selecting module 14, configured to select a wake-up model corresponding to the crowd attribute from a wake-up model group as a target wake-up model, where the wake-up model group includes multiple different types of wake-up models, and each type of wake-up model is obtained by training with a speech training sample corresponding to the crowd attribute;
and the awakening module 15 is configured to input the initial voice signal into the target awakening model, so that the target awakening model performs voice awakening.
In this embodiment, the extracting module 11 may include:
the VAD interception processing submodule is used for carrying out VAD interception processing on the initial voice signal to obtain an effective voice signal;
and the extraction sub-module is used for extracting the voice features from the effective voice signals.
In this embodiment, the crowd attribute classification model may be a convolutional neural network model.
Accordingly, the voice wake-up apparatus may further include:
a convolutional neural network model training module to:
initializing parameters of each layer in the convolutional neural network model;
selecting an unused voice training sample from the voice training sample set as a target voice training sample;
performing VAD interception processing on the target voice training sample to obtain an effective voice training signal;
extracting voice features from the effective voice training signal, and inputting the voice features into the convolutional neural network model to obtain a classification result output by the convolutional neural network model;
calculating the cross entropy of the classification result output by the convolutional neural network model and the crowd attribute labeled by the target voice training sample;
updating parameters of each layer in the convolutional neural network model, returning to the step of selecting an unused voice training sample from the voice training sample set until a cross entropy is obtained, and judging whether the cross entropy is converged according to the cross entropy obtained by the current calculation and the cross entropy obtained by the previous calculation;
if yes, ending the training;
if not, returning to the step of updating the parameters of each layer in the convolutional neural network model.
In this embodiment, the convolutional neural network model training module may be specifically configured to:
taking the cross entropy as a loss function result;
and respectively transmitting the loss function result to each layer in the convolutional neural network model according to the sequence from the output layer to the input layer in the convolutional neural network model, and updating the parameters of each layer in the convolutional neural network model.
In this embodiment, the speech feature may be a speech feature with a non-fixed frame length;
accordingly, the crowd attribute classification model includes at least: an input layer, a convolutional layer, a pooling layer, a global pooling layer, and an output layer.
In this embodiment, the voice wake-up apparatus may further include:
a wake-up model training module to:
respectively determining the crowd attributes corresponding to the voice training samples containing the awakening words by using the crowd attribute classification model;
dividing voice training samples containing awakening words with the same crowd attributes into a group as an awakening word training sample group according to the crowd attributes corresponding to the voice training samples containing the awakening words;
and training a wake-up model according to the wake-up word training sample group to obtain the wake-up models corresponding to various crowd attributes.
In another embodiment of the present application, a voice wake-up apparatus is presented, which may include:
a memory for storing a program;
the processor is configured to run the program, and when the processor runs the program, the processor implements the voice wake-up method described in the above embodiments.
In another embodiment of the present application, a storage medium is described that may be used to store a program that, when executed, is used to implement the voice wake-up method as described in the various embodiments above.
It should be noted that each embodiment is mainly described as a difference from the other embodiments, and the same and similar parts between the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The voice wake-up method and device provided by the present application are introduced in detail above, and a specific example is applied in the text to explain the principle and the implementation of the present application, and the description of the above embodiment is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A voice wake-up method, comprising:
acquiring an initial voice signal, and extracting voice features from the initial voice signal;
inputting the voice features into a preset crowd attribute classification model to obtain a crowd attribute classification result output by the crowd attribute classification model, wherein the crowd attribute classification model is obtained by utilizing a voice training sample marked with crowd attributes through training;
according to the crowd attribute classification result, determining the crowd attribute corresponding to the initial voice signal;
selecting a wake-up model corresponding to the crowd attributes from a wake-up model group as a target wake-up model, wherein the wake-up model group comprises a plurality of different types of wake-up models, and each type of wake-up model is obtained by training with a voice training sample corresponding to the crowd attributes;
and inputting the initial voice signal into the target awakening model so as to enable the target awakening model to carry out voice awakening.
2. The method of claim 1, wherein said extracting speech features from said initial speech signal comprises:
performing VAD interception processing on the initial voice signal to obtain an effective voice signal;
and extracting voice features from the effective voice signal.
3. The method of claim 1, wherein the crowd attribute classification model is a convolutional neural network model;
the training process of the convolutional neural network model comprises the following steps:
initializing parameters of each layer in the convolutional neural network model;
selecting an unused voice training sample from the voice training sample set as a target voice training sample;
performing VAD interception processing on the target voice training sample to obtain an effective voice training signal;
extracting voice features from the effective voice training signal, and inputting the voice features into the convolutional neural network model to obtain a classification result output by the convolutional neural network model;
calculating the cross entropy of the classification result output by the convolutional neural network model and the crowd attribute labeled by the target voice training sample;
updating parameters of each layer in the convolutional neural network model, returning to the step of selecting an unused voice training sample from the voice training sample set until a cross entropy is obtained, and judging whether the cross entropy is converged according to the cross entropy obtained by the current calculation and the cross entropy obtained by the previous calculation;
if yes, ending the training;
if not, returning to the step of updating the parameters of each layer in the convolutional neural network model.
4. The method of claim 3, wherein updating parameters of the layers in the convolutional neural network model comprises:
taking the cross entropy as a loss function result;
and respectively transmitting the loss function result to each layer in the convolutional neural network model according to the sequence from the output layer to the input layer in the convolutional neural network model, and updating the parameters of each layer in the convolutional neural network model.
5. The method of claim 3, wherein the speech feature is a non-fixed frame length speech feature;
the crowd-sourcing attribute classification model includes at least: an input layer, a convolutional layer, a pooling layer, a global pooling layer, and an output layer.
6. The method of claim 1, wherein the training process of the wake-up model comprises:
respectively determining the crowd attributes corresponding to the voice training samples containing the awakening words by using the crowd attribute classification model;
dividing voice training samples containing awakening words with the same crowd attributes into a group as an awakening word training sample group according to the crowd attributes corresponding to the voice training samples containing the awakening words;
and training a wake-up model according to the wake-up word training sample group to obtain the wake-up models corresponding to various crowd attributes.
7. A voice wake-up apparatus, comprising:
the extraction module is used for acquiring an initial voice signal and extracting voice features from the initial voice signal;
the classification module is used for inputting the voice features into a preset crowd attribute classification model to obtain a crowd attribute classification result output by the crowd attribute classification model, and the crowd attribute classification model is obtained by utilizing a voice training sample marked with crowd attributes for training;
the determining module is used for determining the crowd attribute corresponding to the initial voice signal according to the crowd attribute classification result;
the selection module is used for selecting the awakening model corresponding to the crowd attribute from the awakening model group as a target awakening model, wherein the awakening model group comprises a plurality of awakening models of different types, and each type of awakening model is obtained by training with the voice training samples of the corresponding crowd attribute;
and the awakening module is used for inputting the initial voice signal into the target awakening model, so that the target awakening model performs voice awakening.
8. The apparatus of claim 7, wherein the extraction module comprises:
the VAD interception processing submodule is used for carrying out VAD interception processing on the initial voice signal to obtain an effective voice signal;
and the extraction sub-module is used for extracting the voice features from the effective voice signals.
9. A voice wake-up device, comprising:
a memory for storing a program;
the processor, configured to execute the program, and when the processor executes the program, the processor implements the voice wake-up method according to any one of claims 1 to 6.
10. A storage medium storing a program which, when executed, implements the voice wake-up method of any of claims 1 to 6.
CN201910800728.5A 2019-08-28 2019-08-28 Voice wake-up method, device and related equipment Pending CN112530418A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910800728.5A CN112530418A (en) 2019-08-28 2019-08-28 Voice wake-up method, device and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910800728.5A CN112530418A (en) 2019-08-28 2019-08-28 Voice wake-up method, device and related equipment

Publications (1)

Publication Number Publication Date
CN112530418A true CN112530418A (en) 2021-03-19

Family

ID=74973928

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910800728.5A Pending CN112530418A (en) 2019-08-28 2019-08-28 Voice wake-up method, device and related equipment

Country Status (1)

Country Link
CN (1) CN112530418A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103714812A (en) * 2013-12-23 2014-04-09 百度在线网络技术(北京)有限公司 Voice identification method and voice identification device
CN105761720A (en) * 2016-04-19 2016-07-13 北京地平线机器人技术研发有限公司 Interaction system based on voice attribute classification, and method thereof
CN106548773A (en) * 2016-11-04 2017-03-29 百度在线网络技术(北京)有限公司 Child user searching method and device based on artificial intelligence
CN107507612A (en) * 2017-06-30 2017-12-22 百度在线网络技术(北京)有限公司 A kind of method for recognizing sound-groove and device
US20190027129A1 (en) * 2017-07-18 2019-01-24 Baidu Online Network Technology (Beijing) Co., Ltd Method, apparatus, device and storage medium for switching voice role
CN107705793A (en) * 2017-09-22 2018-02-16 百度在线网络技术(北京)有限公司 Information-pushing method, system and its equipment based on Application on Voiceprint Recognition
CN109189980A (en) * 2018-09-26 2019-01-11 三星电子(中国)研发中心 The method and electronic equipment of interactive voice are carried out with user
CN109903750A (en) * 2019-02-21 2019-06-18 科大讯飞股份有限公司 A kind of audio recognition method and device
CN109947984A (en) * 2019-02-28 2019-06-28 北京奇艺世纪科技有限公司 A kind of content delivery method and driving means for children
CN109976703A (en) * 2019-04-04 2019-07-05 广东美的厨房电器制造有限公司 Guide illustration method, computer readable storage medium and cooking equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113012682A (en) * 2021-03-24 2021-06-22 北京百度网讯科技有限公司 False wake-up rate determination method, device, apparatus, storage medium, and program product
CN113012682B (en) * 2021-03-24 2022-10-14 北京百度网讯科技有限公司 False wake-up rate determination method, device, apparatus, storage medium, and program product

Similar Documents

Publication Publication Date Title
CN110838289B (en) Wake-up word detection method, device, equipment and medium based on artificial intelligence
CN109918680B (en) Entity identification method and device and computer equipment
JP7005099B2 (en) Voice keyword recognition methods, devices, computer-readable storage media, and computer devices
CN109241524B (en) Semantic analysis method and device, computer-readable storage medium and electronic equipment
CN108346436B (en) Voice emotion detection method and device, computer equipment and storage medium
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
CN108831439B (en) Voice recognition method, device, equipment and system
CN108899013B (en) Voice search method and device and voice recognition system
Tong et al. A comparative study of robustness of deep learning approaches for VAD
CN108062954B (en) Speech recognition method and device
CN109360572B (en) Call separation method and device, computer equipment and storage medium
CN111444329A (en) Intelligent conversation method and device and electronic equipment
CN110070859B (en) Voice recognition method and device
CN113314119B (en) Voice recognition intelligent household control method and device
CN112673421A (en) Training and/or using language selection models to automatically determine a language for voice recognition of spoken utterances
JP2020004382A (en) Method and device for voice interaction
CN111462751A (en) Method, apparatus, computer device and storage medium for decoding voice data
CN110491375B (en) Target language detection method and device
CN113450771A (en) Awakening method, model training method and device
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN116978368B (en) Wake-up word detection method and related device
CN114155854A (en) Voice data processing method and device
CN111640423B (en) Word boundary estimation method and device and electronic equipment
CN112530418A (en) Voice wake-up method, device and related equipment
JP6910002B2 (en) Dialogue estimation method, dialogue activity estimation device and program

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination