CN112530418A - Voice wake-up method, device and related equipment - Google Patents

Voice wake-up method, device and related equipment

Info

Publication number
CN112530418A
Authority
CN
China
Prior art keywords: voice, model, crowd, awakening, wake
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910800728.5A
Other languages
Chinese (zh)
Inventor
陈孝良
靳源
冯大航
常乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN201910800728.5A
Publication of CN112530418A
Legal status: Pending

Classifications

    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04: Segmentation; Word boundary detection
    • G10L 15/063: Training (under G10L 15/06, Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 17/26: Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G10L 2015/223: Execution procedure of a spoken command (under G10L 15/22)

Abstract

The present application provides a voice wake-up method, apparatus and related device. The method includes the following steps: acquiring an initial voice signal and extracting voice features from it; inputting the voice features into a preset crowd attribute classification model to obtain the crowd attribute classification result output by the model; determining, according to the classification result, the crowd attribute corresponding to the initial voice signal; selecting the wake-up model corresponding to that crowd attribute from a wake-up model group as the target wake-up model; and inputting the initial voice signal into the target wake-up model so that the target wake-up model performs voice wake-up. In this way, the reliability of voice wake-up can be improved.

Description

Voice wake-up method, device and related equipment
Technical Field
The present application relates to the field of voice processing technologies, and in particular, to a voice wake-up method, apparatus, and related device.
Background
The voice wake-up means that a user wakes up the electronic device by speaking a wake-up word, so that the electronic device enters a state of waiting for a voice instruction or directly executes a predetermined voice instruction.
Voice wake-up technology is increasingly applied to voice interaction devices, but how to improve the reliability of voice wake-up has become a problem to be solved.
Disclosure of Invention
In order to solve the foregoing technical problem, embodiments of the present application provide a voice wake-up method, apparatus, and related device, so as to achieve the purpose of improving reliability of voice wake-up, and the technical solution is as follows:
a voice wake-up method, comprising:
acquiring an initial voice signal, and extracting voice features from the initial voice signal;
inputting the voice features into a preset crowd attribute classification model to obtain a crowd attribute classification result output by the crowd attribute classification model, wherein the crowd attribute classification model is obtained by utilizing a voice training sample marked with crowd attributes through training;
according to the crowd attribute classification result, determining the crowd attribute corresponding to the initial voice signal;
selecting a wake-up model corresponding to the crowd attributes from a wake-up model group as a target wake-up model, wherein the wake-up model group comprises a plurality of different types of wake-up models, and each type of wake-up model is obtained by training with a voice training sample corresponding to the crowd attributes;
and inputting the initial voice signal into the target awakening model so as to enable the target awakening model to carry out voice awakening.
Preferably, the extracting the speech feature from the initial speech signal includes:
performing VAD interception processing on the initial voice signal to obtain an effective voice signal;
and extracting voice features from the effective voice signal.
Preferably, the crowd attribute classification model is a convolutional neural network model;
the training process of the convolutional neural network model comprises the following steps:
initializing parameters of each layer in the convolutional neural network model;
selecting an unused voice training sample from the voice training sample set as a target voice training sample;
performing VAD interception processing on the target voice training sample to obtain an effective voice training signal;
extracting voice features from the effective voice training signal, and inputting the voice features into the convolutional neural network model to obtain a classification result output by the convolutional neural network model;
calculating the cross entropy of the classification result output by the convolutional neural network model and the crowd attribute labeled by the target voice training sample;
updating parameters of each layer in the convolutional neural network model, returning to the step of selecting an unused voice training sample from the voice training sample set until a cross entropy is obtained, and judging whether the cross entropy is converged according to the cross entropy obtained by the current calculation and the cross entropy obtained by the previous calculation;
if yes, ending the training;
if not, returning to the step of updating the parameters of each layer in the convolutional neural network model.
Preferably, the updating the parameters of each layer in the convolutional neural network model includes:
taking the cross entropy as a loss function result;
and respectively transmitting the loss function result to each layer in the convolutional neural network model according to the sequence from the output layer to the input layer in the convolutional neural network model, and updating the parameters of each layer in the convolutional neural network model.
Preferably, the voice features are voice features with non-fixed frame length;
the crowd-sourcing attribute classification model includes at least: an input layer, a convolutional layer, a pooling layer, a global pooling layer, and an output layer.
Preferably, the training process of the wake-up model includes:
respectively determining the crowd attributes corresponding to the voice training samples containing the awakening words by using the crowd attribute classification model;
dividing voice training samples containing awakening words with the same crowd attributes into a group as an awakening word training sample group according to the crowd attributes corresponding to the voice training samples containing the awakening words;
and training a wake-up model according to the wake-up word training sample group to obtain the wake-up models corresponding to various crowd attributes.
A voice wake-up apparatus comprising:
the extraction module is used for acquiring an initial voice signal and extracting voice features from the initial voice signal;
the classification module is used for inputting the voice features into a preset crowd attribute classification model to obtain a crowd attribute classification result output by the crowd attribute classification model, and the crowd attribute classification model is obtained by utilizing a voice training sample marked with crowd attributes for training;
the determining module is used for determining the crowd attribute corresponding to the initial voice signal according to the crowd attribute classification result;
the selection module is used for selecting the awakening model corresponding to the crowd attribute from the awakening model group as a target awakening model, wherein the awakening model group comprises a plurality of awakening models of different types, and each type of awakening model is obtained by training with the voice training samples of the corresponding crowd attribute;
and the awakening module is used for inputting the initial voice signal into the target awakening model, so that the target awakening model performs voice awakening.
Preferably, the extraction module includes:
the VAD interception processing submodule is used for carrying out VAD interception processing on the initial voice signal to obtain an effective voice signal;
and the extraction sub-module is used for extracting the voice features from the effective voice signals.
Preferably, the crowd attribute classification model is a convolutional neural network model;
the device further comprises:
a convolutional neural network model training module to:
initializing parameters of each layer in the convolutional neural network model;
selecting an unused voice training sample from the voice training sample set as a target voice training sample;
performing VAD interception processing on the target voice training sample to obtain an effective voice training signal;
extracting voice features from the effective voice training signal, and inputting the voice features into the convolutional neural network model to obtain a classification result output by the convolutional neural network model;
calculating the cross entropy of the classification result output by the convolutional neural network model and the crowd attribute labeled by the target voice training sample;
updating parameters of each layer in the convolutional neural network model, returning to the step of selecting an unused voice training sample from the voice training sample set until a cross entropy is obtained, and judging whether the cross entropy is converged according to the cross entropy obtained by the current calculation and the cross entropy obtained by the previous calculation;
if yes, ending the training;
if not, returning to the step of updating the parameters of each layer in the convolutional neural network model.
Preferably, the convolutional neural network model training module is specifically configured to:
taking the cross entropy as a loss function result;
and respectively transmitting the loss function result to each layer in the convolutional neural network model according to the sequence from the output layer to the input layer in the convolutional neural network model, and updating the parameters of each layer in the convolutional neural network model.
Preferably, the voice features are voice features with non-fixed frame length;
the crowd-sourcing attribute classification model includes at least: an input layer, a convolutional layer, a pooling layer, a global pooling layer, and an output layer.
Preferably, the apparatus further comprises:
a wake-up module training module to:
respectively determining the crowd attributes corresponding to the voice training samples containing the awakening words by using the crowd attribute classification model;
dividing voice training samples containing awakening words with the same crowd attributes into a group as an awakening word training sample group according to the crowd attributes corresponding to the voice training samples containing the awakening words;
and training a wake-up model according to the wake-up word training sample group to obtain the wake-up models corresponding to various crowd attributes.
A voice wake-up device comprising:
a memory for storing a program;
the processor is configured to run the program, and when the processor runs the program, the processor implements the voice wake-up method according to any one of the above items.
A storage medium storing a program which, when executed, implements the voice wake-up method according to any one of the above.
Compared with the prior art, the beneficial effects of the present application are as follows:
in the application, the voice features extracted from the initial voice signals are input into the crowd attribute classification model to obtain the crowd attribute classification result output by the crowd attribute classification model, the crowd attributes corresponding to the initial voice signals are determined according to the crowd attribute classification result, the awakening model corresponding to the crowd attributes is selected from the awakening model group to serve as the target awakening model, and voice awakening is performed by using the target awakening model.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive labor.
Fig. 1 is a flowchart of an embodiment 1 of a voice wake-up method provided in the present application;
fig. 2 is a flowchart of an embodiment 2 of a voice wake-up method provided in the present application;
FIG. 3 is a flow chart of a convolutional neural network model training process provided herein;
FIG. 4 is a flow chart of a training process of a wake-up model provided herein;
fig. 5 is a schematic logic structure diagram of a voice wake-up apparatus provided in the present application.
Detailed Description
During application of voice wake-up technology, the inventors of the present application noticed that the voice wake-up rate of existing wake-up models is not high for some groups of people (e.g., the elderly and children). To alleviate this problem and improve the voice wake-up rate, the present application proposes a voice wake-up method, which is introduced in detail below.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
As shown in fig. 1, a flowchart of an embodiment 1 of a voice wake-up method provided by the present application is provided, where the method includes the following steps:
and step S11, acquiring an initial voice signal and extracting voice features from the initial voice signal.
In this embodiment, the voice features may include, but are not limited to: MFCC (Mel-scale Frequency Cepstral Coefficient) features, or MFCC features together with pitch (PITCH) features.
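For illustration only, the feature extraction of step S11 could be sketched as follows; the use of librosa, the 13-dimensional MFCC and the YIN pitch tracker are assumptions of this sketch and are not prescribed by the present application.

```python
# Illustrative sketch of MFCC + pitch feature extraction (library choice,
# dimensionality and pitch range are assumptions, not patent requirements).
import librosa
import numpy as np

def extract_features(signal, sr=16000, n_mfcc=13):
    """Return a (num_frames, n_mfcc + 1) matrix of MFCC plus pitch features.

    The number of frames depends on the utterance length, so the result
    has a non-fixed frame length.
    """
    # librosa returns (n_mfcc, frames); transpose to (frames, n_mfcc)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T
    # Fundamental frequency (pitch) track, one value per frame
    f0 = librosa.yin(signal, fmin=50, fmax=400, sr=sr)
    n = min(mfcc.shape[0], len(f0))            # align frame counts
    return np.concatenate([mfcc[:n], f0[:n, None]], axis=1)

# signal, sr = librosa.load("utterance.wav", sr=16000)
# feats = extract_features(signal, sr)         # shape: (num_frames, 14)
```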
It should be noted that, in human-computer interaction, the length of the valid speech signal input by the user is generally not fixed, and therefore, the speech feature extracted from the initial speech signal is a speech feature with a non-fixed frame length.
And step S12, inputting the voice characteristics into a preset crowd attribute classification model to obtain a crowd attribute classification result output by the crowd attribute classification model.
The crowd attribute classification model is obtained by training with a voice training sample marked with crowd attributes.
The crowd attribute classification result can be understood as the probability value of each type of crowd attribute corresponding to the voice features, for example, the probability that the voice features correspond to an adult man, the probability that they correspond to an adult woman, the probability that they correspond to an old person, and the probability that they correspond to a child.
In this embodiment, the crowd attribute classification model may include, but is not limited to: CNN (Convolutional Neural Networks) model.
When the crowd attribute classification model is a CNN model and the voice features have a non-fixed frame length, a global pooling layer is used in place of the fully connected layer in the CNN model, so that voice features of non-fixed frame length can be input and processed. Specifically, the CNN model at least includes: an input layer, a pooling layer, a convolutional layer, a global pooling layer, and an output layer. The connection relationship of the layers may be, but is not limited to, the following: the input layer is connected to the pooling layer, the pooling layer to the convolutional layer, the convolutional layer to the global pooling layer, and the global pooling layer to the output layer.
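A minimal PyTorch sketch of such a structure is given below; the channel sizes, kernel sizes and the four attribute classes are illustrative assumptions, and only the layer order (input, pooling, convolution, global pooling, output) follows the description above.

```python
# Minimal sketch of the described CNN: pooling -> convolution ->
# global pooling -> output. Layer sizes and the four-class output
# (adult man / adult woman / old person / child) are assumptions.
import torch
import torch.nn as nn

class CrowdAttributeCNN(nn.Module):
    def __init__(self, feat_dim=14, num_classes=4):   # 14 = 13 MFCC + 1 pitch
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=(2, 1))              # pooling layer
        self.conv = nn.Conv2d(1, 32, kernel_size=3, padding=1)    # convolutional layer
        # Global pooling collapses the time axis, so utterances of any frame
        # length map to a fixed-size vector (replacing a fully connected
        # layer that would need a fixed input size).
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        self.out = nn.Linear(32, num_classes)                     # output layer

    def forward(self, x):
        # x: (batch, 1, num_frames, feat_dim); num_frames may vary
        x = self.pool(x)
        x = torch.relu(self.conv(x))
        x = self.global_pool(x).flatten(1)    # (batch, 32)
        return self.out(x)                    # class logits

# Utterances of different lengths produce the same output shape:
model = CrowdAttributeCNN()
print(model(torch.randn(1, 1, 120, 14)).shape)   # torch.Size([1, 4])
print(model(torch.randn(1, 1, 305, 14)).shape)   # torch.Size([1, 4])
```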
And step S13, determining the crowd attribute corresponding to the initial voice signal according to the crowd attribute classification result.
Since the crowd attribute classification result comprises the probability value of each type of crowd attribute corresponding to the voice features, these probability values can be compared, and the crowd attribute corresponding to the initial voice signal is determined according to the comparison result. Specifically, the crowd attribute with the maximum probability value may be taken as the crowd attribute corresponding to the initial voice signal. For example, if the probability is 0.3 for adult men, 0.2 for adult women, 0.4 for old people and 0.1 for children, then old people is taken as the crowd attribute corresponding to the initial voice signal.
And step S14, selecting the awakening model corresponding to the crowd attribute from the awakening model group as the target awakening model.
The awakening model group comprises a plurality of awakening models of different types, and the awakening models of the different types are obtained by training with voice training samples corresponding to the crowd attributes respectively.
In this embodiment, different wake-up models are set for different crowd attributes: for adult men, an adult man wake-up model is set; for adult women, an adult woman wake-up model is set; for old people, an old person wake-up model is set; and for children, a child wake-up model is set.
It can be understood that the different types of wake-up models need to be trained with voice training samples of different crowd attributes: the adult man wake-up model needs to be trained with voice training samples of adult men; the adult woman wake-up model with voice training samples of adult women; the old person wake-up model with voice training samples of old people; and the child wake-up model with voice training samples of children.
And step S15, inputting the initial voice signal into the target awakening model so as to make the target awakening model perform voice awakening.
The target wake-up model is trained with voice training samples of the crowd attribute corresponding to the initial voice signal, so the target wake-up model can accurately recognize the wake-up word in the initial voice signal, which improves the reliability of voice wake-up.
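Steps S13 to S15 may be sketched as follows; the attribute names reuse the example above, and each wake-up model is represented abstractly as a callable that reports whether the wake-up word was detected, which is an assumption of this sketch rather than an interface defined by the patent.

```python
# Sketch of steps S13-S15: pick the crowd attribute with the maximum
# probability, then route the initial signal to the matching wake-up model.
from typing import Callable, Dict
import numpy as np

WakeModel = Callable[[np.ndarray], bool]   # assumed wake-up model interface

def voice_wake_up(initial_signal: np.ndarray,
                  class_probabilities: Dict[str, float],
                  wake_model_group: Dict[str, WakeModel]) -> bool:
    # Step S13: attribute with the maximum probability, e.g.
    # {"adult_man": 0.3, "adult_woman": 0.2, "old_people": 0.4, "child": 0.1} -> "old_people"
    crowd_attribute = max(class_probabilities, key=class_probabilities.get)
    # Step S14: select the matching wake-up model as the target wake-up model
    target_model = wake_model_group[crowd_attribute]
    # Step S15: feed the initial voice signal to the target wake-up model
    return target_model(initial_signal)
```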
In summary, in the present application, the voice features extracted from the initial voice signal are input into the crowd attribute classification model to obtain the crowd attribute classification result; the crowd attribute corresponding to the initial voice signal is determined according to that result; the wake-up model corresponding to the crowd attribute is selected from the wake-up model group as the target wake-up model; and voice wake-up is performed with the target wake-up model.
As another alternative embodiment of the present application, referring to fig. 2, a flowchart of an embodiment 2 of a voice wakeup method provided by the present application is shown, where this embodiment mainly describes a refinement scheme of the voice wakeup method described in the above embodiment 1, and as shown in fig. 2, the method may include, but is not limited to, the following steps:
step S21, obtaining an initial voice signal, and performing VAD truncation processing on the initial voice signal to obtain an effective voice signal.
Performing VAD (Voice Activity Detection) interception on the initial voice signal can be understood as: detecting the starting point and the end point of the speech in the initial voice signal, and separating the effective voice signal from the initial voice signal according to the starting point and the end point.
Preferably, the VAD algorithm may be one based on short-time energy and zero-crossing rate.
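A minimal numpy sketch of such an energy/zero-crossing-rate VAD is given below; the frame length, the thresholds and the simple first/last-active-frame end-pointing rule are assumptions of the sketch, not values taken from this application.

```python
# Illustrative short-time-energy / zero-crossing-rate VAD sketch.
import numpy as np

def vad_truncate(signal, frame_len=400, energy_thr=1e-3, zcr_thr=0.3):
    """Return the effective (speech) segment of `signal` (1-D float array)."""
    n_frames = len(signal) // frame_len
    active = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = float(np.mean(frame ** 2))                          # short-time energy
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))    # zero-crossing rate
        # A real detector combines these statistics more carefully (dual
        # thresholds, hangover smoothing); this simple rule only illustrates the idea.
        active.append(energy > energy_thr or zcr > zcr_thr)
    if not any(active):
        return signal[:0]                                  # no speech detected
    start = active.index(True) * frame_len                 # starting point
    end = (len(active) - active[::-1].index(True)) * frame_len   # end point
    return signal[start:end]
```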
And step S22, extracting voice features from the effective voice signal.
The voice features are extracted from the effective voice signals, and the efficiency and the accuracy of extracting the voice features can be improved.
Steps S21-S22 are a specific implementation of step S11 in example 1.
And step S23, inputting the voice features into a preset crowd attribute classification model to obtain a crowd attribute classification result output by the crowd attribute classification model, wherein the crowd attribute classification model is obtained by utilizing a voice training sample marked with crowd attributes for training.
And step S24, determining the crowd attribute corresponding to the initial voice signal according to the crowd attribute classification result.
And step S25, selecting an awakening model corresponding to the crowd attributes from an awakening model group as a target awakening model, wherein the awakening model group comprises a plurality of awakening models of different types, and each type of awakening model is obtained by utilizing a voice training sample corresponding to the crowd attributes for training.
And step S26, inputting the initial voice signal into the target awakening model so as to make the target awakening model perform voice awakening.
The detailed procedures of steps S23-S26 can be found in the related descriptions of steps S12-S15 in embodiment 1, and are not repeated herein.
In another embodiment of the present application, a training process of the convolutional neural network model described in embodiment 1 is described, please refer to fig. 3, which may include the following steps:
and step S31, initializing parameters of each layer in the convolutional neural network model.
The initialization process may refer to the process of initializing parameters of each layer in the convolutional neural network model in the prior art, and is not described herein again.
Step S32, selecting an unused voice training sample from the voice training sample set as a target voice training sample.
In order to ensure the training precision, a voice training sample set comprising a plurality of voice training samples can be set, and the comprehensiveness and integrity of the voice training samples are ensured.
And step S33, performing VAD truncation processing on the target voice training sample to obtain an effective voice training signal.
Performing VAD truncation processing on the target speech training sample to obtain an effective speech training signal, which can be understood as: and detecting a starting point and an end point of a voice signal in the target voice training sample, and separating an effective voice training signal from the target voice training sample according to the starting point and the end point.
And step S34, extracting voice features from the effective voice training signal, and inputting the voice features into the convolutional neural network model to obtain a classification result output by the convolutional neural network model.
The voice features are extracted from the effective voice training signals, so that the efficiency of extracting the voice features can be improved, a convolutional neural network model can be trained more effectively, and the interference of non-voice sections on training is avoided.
And step S35, calculating the cross entropy of the classification result output by the convolutional neural network model and the crowd attribute labeled by the target voice training sample.
And step S36, updating the parameters of each layer in the convolutional neural network model, and returning to step S32 until a new cross entropy is obtained.
Preferably, the process of updating the parameters of each layer in the convolutional neural network model may include:
a11, taking the cross entropy as a loss function result;
a12, respectively transmitting the loss function result to each layer in the convolutional neural network model according to the sequence from the output layer to the input layer in the convolutional neural network model, and updating the parameters of each layer in the convolutional neural network model.
According to the sequence from the output layer to the input layer in the convolutional neural network model, the loss function result is respectively transmitted to each layer in the convolutional neural network model, and the parameters of each layer in the convolutional neural network model are updated, which can be understood as: and updating parameters of each layer in the convolutional neural network model by using a back propagation principle.
Of course, the process of updating the parameters of each layer in the convolutional neural network model is not limited to the process shown in this embodiment, and other updating manners that can implement the parameters of each layer in the convolutional neural network model are also possible.
And step S37, judging whether the cross entropy is converged according to the cross entropy obtained by the current calculation and the cross entropy obtained by the previous calculation.
Judging whether the cross entropy converges can be understood as: judging whether the change trend of the cross entropy is a reduction trend;
or, judging whether the variation trend of the cross entropy is a reduction trend, if so, judging whether the difference value between the cross entropy obtained by the calculation and the cross entropy obtained by the last calculation is smaller than a preset value.
If yes, the training of the convolutional neural network model is considered to have reached the set requirement, and step S38 is executed; if not, the training is considered not to have met the set requirement, and the process returns to the step of updating the parameters of each layer in the convolutional neural network model.
And step S38, finishing training.
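The training process of steps S31 to S38 may be sketched compactly as follows; the optimizer, learning rate and convergence threshold are assumptions, vad_truncate() and extract_features() refer to the earlier sketches, the model could be the CrowdAttributeCNN sketched above, and each training sample is assumed to be a (waveform, attribute label index) pair.

```python
# Compact PyTorch sketch of steps S31-S38 (choices below are assumptions).
import torch
import torch.nn.functional as F

def train_classifier(model, samples, lr=1e-3, eps=1e-4):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)   # S31: layer parameters initialised with the model
    prev_ce = None
    for waveform, label in samples:                           # S32: next unused training sample
        effective = vad_truncate(waveform)                    # S33: VAD truncation
        feats = torch.tensor(extract_features(effective),
                             dtype=torch.float32)[None, None] # (1, 1, frames, dims)
        logits = model(feats)                                 # S34: classification result
        ce = F.cross_entropy(logits, torch.tensor([label]))   # S35: cross entropy vs. labelled attribute
        optimizer.zero_grad()
        ce.backward()                                         # S36: propagate the loss from output to input
        optimizer.step()                                      #      and update each layer's parameters
        # S37: convergence test against the previously computed cross entropy
        if prev_ce is not None and ce.item() < prev_ce and prev_ce - ce.item() < eps:
            break                                             # S38: end training
        prev_ce = ce.item()
    return model
```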
In another embodiment of the present application, a training process of the wake-up model described in embodiment 1 is described, please refer to fig. 4, which may include the following steps:
and step S41, respectively determining the crowd attributes corresponding to the voice training samples containing the awakening words by using the crowd attribute classification model.
The process of determining the crowd attributes corresponding to the voice training samples containing the awakening words by using the crowd attribute classification model may include:
and B11, respectively extracting voice characteristics from the voice training samples containing the awakening words.
And B12, respectively inputting the voice features extracted from the voice training samples containing the awakening words into the crowd attribute classification model to obtain the crowd attribute classification result output by the crowd attribute classification model.
And B13, determining the crowd attributes corresponding to the voice training samples containing the awakening words according to the crowd attribute classification results.
Step S42, dividing the voice training samples containing the wake-up word with the same crowd attributes into a group as a wake-up word training sample group according to the crowd attributes corresponding to the voice training samples containing the wake-up word.
The voice training samples containing the awakening words with the same crowd attributes are divided into a group, so that the awakening models corresponding to different crowd attributes can be trained conveniently.
The number of awakening word training sample groups is the same as the number of crowd attribute types; for example, if there are 4 types of crowd attributes, there are 4 awakening word training sample groups.
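Step S42 can be sketched as grouping the wake-up word samples into a dictionary keyed by crowd attribute; classify_crowd_attribute() below is a hypothetical placeholder for the classification pipeline of step S41 (B11-B13).

```python
# Sketch of step S42: group wake-up word training samples by crowd attribute.
from collections import defaultdict

def group_wake_word_samples(samples, classify_crowd_attribute):
    """`samples`: iterable of wake-up word utterances; returns one group per attribute.

    `classify_crowd_attribute` stands in for the classification pipeline of
    step S41 and is supplied by the caller.
    """
    groups = defaultdict(list)
    for sample in samples:
        attribute = classify_crowd_attribute(sample)   # e.g. "old_people"
        groups[attribute].append(sample)
    # One wake-up word training sample group per crowd attribute,
    # i.e. as many groups as there are attribute types (4 types -> 4 groups).
    return dict(groups)
```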
And step S43, training a wake-up model according to the wake-up word training sample group to obtain wake-up models corresponding to various crowd attributes.
Training a wake-up model according to the wake-up word training sample group to obtain a specific process of the wake-up model corresponding to various crowd attributes, which can include:
and C11, respectively extracting the voice characteristics of the voice training samples containing the awakening words in each awakening word training sample group, and forcibly aligning each frame of the voice training samples by using a voice recognition model according to the voice characteristics to obtain the phoneme labels corresponding to each frame of the voice training samples.
Preferably, the speech recognition model may be an ASR (Automatic Speech Recognition) model.
And C12, obtaining the voice features of the current frame of the voice training sample and the voice features of its context frames to form the feature vector of the current frame, and training the awakening model frame by frame with the feature vector of each frame and the corresponding phoneme label, so as to obtain the awakening models corresponding to the various crowd attributes.
The process of training the wake-up model frame by using the feature vectors and the corresponding phoneme labels of each frame to obtain the wake-up models corresponding to various types of crowd attributes may include:
d11, initializing parameters of each layer in the awakening model;
d12, training the wake-up model starting from the first frame of the voice training sample; since the first frame has no preceding context, copies of the first frame are used as its context. For example, if the context is defined as 20 frames, that is, the 10 frames preceding the current frame and the 10 frames following it, then wherever fewer than 10 preceding frames exist, the first frame is copied to fill the gap, and wherever fewer than 10 following frames exist, the last frame is copied to fill the gap; the window of the current frame is thus 21 frames.
The feature vector of each frame is the vector formed by the voice features of the current frame and of its context frames, that is, the voice features of all frames in the window. The voice feature of each frame may be 30-dimensional, or of another dimensionality. A sketch of this windowing is given after these steps.
D13, inputting the feature vector of the first frame into the awakening model for training, and obtaining the cross entropy of the current frame according to the training result and by combining the phoneme label corresponding to the current first frame;
d14, updating parameters of each layer in the awakening model according to the cross entropy;
d15, training the awakening model according to the second frame of the voice training sample, inputting the feature vector of the second frame into the awakening model for training, and obtaining the cross entropy of the second frame according to the training result and the phoneme label corresponding to the current second frame, and so on; after the cross entropy of the current frame is obtained, judging whether the cross entropy is converged or not according to the cross entropy of the current frame and the cross entropy calculated before;
if the convergence is achieved, ending the model training; if not, continuing to train the model according to the next frame until convergence.
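The context window described in step d12 may be sketched as follows; the 30-dimensional per-frame feature is the example dimensionality mentioned above, and the 10-frame left and right context follows the 21-frame window example.

```python
# Sketch of the 21-frame context window: each frame plus 10 preceding and
# 10 following frames, with edge gaps filled by copying the first/last frame.
import numpy as np

def frame_windows(features, left=10, right=10):
    """Turn (num_frames, dim) features into per-frame windowed vectors."""
    padded = np.concatenate([
        np.repeat(features[:1], left, axis=0),    # copies of the first frame
        features,
        np.repeat(features[-1:], right, axis=0),  # copies of the last frame
    ])
    win = left + 1 + right
    return np.stack([padded[i:i + win].reshape(-1)
                     for i in range(features.shape[0])])

feats = np.random.randn(50, 30)       # e.g. 50 frames of 30-dimensional features
print(frame_windows(feats).shape)     # (50, 630): 21 frames x 30 dims per vector
```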
The above training of the wake-up model is performed frame by frame. Optionally, batches of frames may also be used to train the wake-up model, and the process of training the wake-up model with batches of frames may include:
e11, initializing parameters of each layer in the awakening model;
e12, dividing the voice frames of the voice training samples into a plurality of batches of a certain size; preferably, the size may be several thousand, that is, each batch of frames includes several thousand frames;
e13, training the awakening model from the first batch of frames, inputting the feature vectors of the first batch of frames into the awakening model for training, and obtaining the overall cross entropy of the current batch of frames according to the training result and by combining the phoneme labels corresponding to the current first batch of frames;
e14, updating parameters of each layer in the awakening model according to the overall cross entropy;
e15, training the awakening model according to the second batch of frames, inputting the feature vectors of the second batch of frames into the awakening model for training, and obtaining the overall cross entropy of the second batch of frames according to the training result and by combining the phoneme labels corresponding to the current second batch of frames, and so on; after the integral cross entropy of the current batch frame is obtained, judging whether the integral cross entropy is converged or not according to the integral cross entropy of the current batch frame and the integral cross entropy calculated before;
if the convergence is achieved, ending the model training; if not, continuing to train the awakening model according to the next batch of frames until convergence.
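The batch-frame variant of steps e11 to e15 may be sketched as follows; the batch size, optimizer and convergence threshold are assumptions, and the frame-level feature vectors could, for example, be the windows produced by the frame_windows() sketch above, paired with the phoneme labels from the forced alignment.

```python
# Sketch of batch-frame training (e11-e15); sizes and thresholds are assumptions.
import torch
import torch.nn.functional as F

def train_wake_model_batched(model, frame_vectors, phoneme_labels,
                             batch_frames=2000, lr=1e-3, eps=1e-4):
    x = torch.tensor(frame_vectors, dtype=torch.float32)  # (num_frames, window_dim)
    y = torch.tensor(phoneme_labels, dtype=torch.long)    # (num_frames,) phoneme labels
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    prev_ce = None
    for xb, yb in zip(torch.split(x, batch_frames), torch.split(y, batch_frames)):
        logits = model(xb)                     # e13/e15: train on the current batch of frames
        ce = F.cross_entropy(logits, yb)       # overall cross entropy of the batch
        optimizer.zero_grad()
        ce.backward()
        optimizer.step()                       # e14: update the parameters of each layer
        # convergence test against the previous batch's overall cross entropy
        if prev_ce is not None and abs(prev_ce - ce.item()) < eps:
            break                              # convergence reached: end training
        prev_ce = ce.item()
    return model
```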
The following describes the voice wake-up apparatus provided in the present application, and the voice wake-up apparatus described below and the voice wake-up method described above may be referred to correspondingly.
Referring to fig. 5, the voice wake-up apparatus includes: an extraction module 11, a classification module 12, a determining module 13, a selection module 14 and a wake-up module 15.
The extraction module 11 is configured to acquire an initial voice signal and extract a voice feature from the initial voice signal;
the classification module 12 is configured to input the voice features into a preset crowd attribute classification model, so as to obtain a crowd attribute classification result output by the crowd attribute classification model, where the crowd attribute classification model is obtained by training with a voice training sample labeled with crowd attributes;
a determining module 13, configured to determine, according to the crowd attribute classification result, a crowd attribute corresponding to the initial voice signal;
a selecting module 14, configured to select a wake-up model corresponding to the crowd attribute from a wake-up model group as a target wake-up model, where the wake-up model group includes multiple different types of wake-up models, and each type of wake-up model is obtained by training with a speech training sample corresponding to the crowd attribute;
and the awakening module 15 is configured to input the initial voice signal into the target awakening model, so that the target awakening model performs voice awakening.
In this embodiment, the extracting module 11 may include:
the VAD interception processing submodule is used for carrying out VAD interception processing on the initial voice signal to obtain an effective voice signal;
and the extraction sub-module is used for extracting the voice features from the effective voice signals.
In this embodiment, the crowd attribute classification model may be a convolutional neural network model.
Accordingly, the voice wake-up apparatus may further include:
a convolutional neural network model training module to:
initializing parameters of each layer in the convolutional neural network model;
selecting an unused voice training sample from the voice training sample set as a target voice training sample;
performing VAD interception processing on the target voice training sample to obtain an effective voice training signal;
extracting voice features from the effective voice training signal, and inputting the voice features into the convolutional neural network model to obtain a classification result output by the convolutional neural network model;
calculating the cross entropy of the classification result output by the convolutional neural network model and the crowd attribute labeled by the target voice training sample;
updating parameters of each layer in the convolutional neural network model, returning to the step of selecting an unused voice training sample from the voice training sample set until a cross entropy is obtained, and judging whether the cross entropy is converged according to the cross entropy obtained by the current calculation and the cross entropy obtained by the previous calculation;
if yes, ending the training;
if not, returning to the step of updating the parameters of each layer in the convolutional neural network model.
In this embodiment, the convolutional neural network model training module may be specifically configured to:
taking the cross entropy as a loss function result;
and respectively transmitting the loss function result to each layer in the convolutional neural network model according to the sequence from the output layer to the input layer in the convolutional neural network model, and updating the parameters of each layer in the convolutional neural network model.
In this embodiment, the speech feature may be a speech feature with a non-fixed frame length;
accordingly, the crowd attribute classification model includes at least: an input layer, a convolutional layer, a pooling layer, a global pooling layer, and an output layer.
In this embodiment, the voice wake-up apparatus may further include:
a wake-up model training module to:
respectively determining the crowd attributes corresponding to the voice training samples containing the awakening words by using the crowd attribute classification model;
dividing voice training samples containing awakening words with the same crowd attributes into a group as an awakening word training sample group according to the crowd attributes corresponding to the voice training samples containing the awakening words;
and training a wake-up model according to the wake-up word training sample group to obtain the wake-up models corresponding to various crowd attributes.
In another embodiment of the present application, a voice wake-up apparatus is presented, which may include:
a memory for storing a program;
the processor is configured to run the program, and when the processor runs the program, the processor implements the voice wake-up method described in the above embodiments.
In another embodiment of the present application, a storage medium is described that may be used to store a program that, when executed, is used to implement the voice wake-up method as described in the various embodiments above.
It should be noted that each embodiment is mainly described as a difference from the other embodiments, and the same and similar parts between the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The voice wake-up method and device provided by the present application are introduced in detail above, and a specific example is applied in the text to explain the principle and the implementation of the present application, and the description of the above embodiment is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A voice wake-up method, comprising:
acquiring an initial voice signal, and extracting voice features from the initial voice signal;
inputting the voice features into a preset crowd attribute classification model to obtain a crowd attribute classification result output by the crowd attribute classification model, wherein the crowd attribute classification model is obtained by utilizing a voice training sample marked with crowd attributes through training;
according to the crowd attribute classification result, determining the crowd attribute corresponding to the initial voice signal;
selecting a wake-up model corresponding to the crowd attributes from a wake-up model group as a target wake-up model, wherein the wake-up model group comprises a plurality of different types of wake-up models, and each type of wake-up model is obtained by training with a voice training sample corresponding to the crowd attributes;
and inputting the initial voice signal into the target awakening model so as to enable the target awakening model to carry out voice awakening.
2. The method of claim 1, wherein said extracting speech features from said initial speech signal comprises:
performing VAD interception processing on the initial voice signal to obtain an effective voice signal;
and extracting voice features from the effective voice signal.
3. The method of claim 1, wherein the crowd attribute classification model is a convolutional neural network model;
the training process of the convolutional neural network model comprises the following steps:
initializing parameters of each layer in the convolutional neural network model;
selecting an unused voice training sample from the voice training sample set as a target voice training sample;
performing VAD interception processing on the target voice training sample to obtain an effective voice training signal;
extracting voice features from the effective voice training signal, and inputting the voice features into the convolutional neural network model to obtain a classification result output by the convolutional neural network model;
calculating the cross entropy of the classification result output by the convolutional neural network model and the crowd attribute labeled by the target voice training sample;
updating parameters of each layer in the convolutional neural network model, returning to the step of selecting an unused voice training sample from the voice training sample set until a cross entropy is obtained, and judging whether the cross entropy is converged according to the cross entropy obtained by the current calculation and the cross entropy obtained by the previous calculation;
if yes, ending the training;
if not, returning to the step of updating the parameters of each layer in the convolutional neural network model.
4. The method of claim 3, wherein updating parameters of the layers in the convolutional neural network model comprises:
taking the cross entropy as a loss function result;
and respectively transmitting the loss function result to each layer in the convolutional neural network model according to the sequence from the output layer to the input layer in the convolutional neural network model, and updating the parameters of each layer in the convolutional neural network model.
5. The method of claim 3, wherein the speech feature is a non-fixed frame length speech feature;
the crowd-sourcing attribute classification model includes at least: an input layer, a convolutional layer, a pooling layer, a global pooling layer, and an output layer.
6. The method of claim 1, wherein the training process of the wake-up model comprises:
respectively determining the crowd attributes corresponding to the voice training samples containing the awakening words by using the crowd attribute classification model;
dividing voice training samples containing awakening words with the same crowd attributes into a group as an awakening word training sample group according to the crowd attributes corresponding to the voice training samples containing the awakening words;
and training a wake-up model according to the wake-up word training sample group to obtain the wake-up models corresponding to various crowd attributes.
7. A voice wake-up apparatus, comprising:
the extraction module is used for acquiring an initial voice signal and extracting voice features from the initial voice signal;
the classification module is used for inputting the voice features into a preset crowd attribute classification model to obtain a crowd attribute classification result output by the crowd attribute classification model, and the crowd attribute classification model is obtained by utilizing a voice training sample marked with crowd attributes for training;
the determining module is used for determining the crowd attribute corresponding to the initial voice signal according to the crowd attribute classification result;
the selection module is used for selecting the awakening model corresponding to the crowd attribute from the awakening model group as a target awakening model, wherein the awakening model group comprises a plurality of awakening models of different types, and each type of awakening model is obtained by training with the voice training samples of the corresponding crowd attribute;
and the awakening module is used for inputting the initial voice signal into the target awakening model, so that the target awakening model performs voice awakening.
8. The apparatus of claim 7, wherein the extraction module comprises:
the VAD interception processing submodule is used for carrying out VAD interception processing on the initial voice signal to obtain an effective voice signal;
and the extraction sub-module is used for extracting the voice features from the effective voice signals.
9. A voice wake-up device, comprising:
a memory for storing a program;
the processor, configured to execute the program, and when the processor executes the program, the processor implements the voice wake-up method according to any one of claims 1 to 6.
10. A storage medium storing a program which, when executed, implements the voice wake-up method of any of claims 1 to 6.
CN201910800728.5A 2019-08-28 2019-08-28 Voice wake-up method, device and related equipment Pending CN112530418A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910800728.5A CN112530418A (en) 2019-08-28 2019-08-28 Voice wake-up method, device and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910800728.5A CN112530418A (en) 2019-08-28 2019-08-28 Voice wake-up method, device and related equipment

Publications (1)

Publication Number Publication Date
CN112530418A true CN112530418A (en) 2021-03-19

Family

ID=74973928

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910800728.5A Pending CN112530418A (en) 2019-08-28 2019-08-28 Voice wake-up method, device and related equipment

Country Status (1)

Country Link
CN (1) CN112530418A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103714812A (en) * 2013-12-23 2014-04-09 百度在线网络技术(北京)有限公司 Voice identification method and voice identification device
CN105761720A (en) * 2016-04-19 2016-07-13 北京地平线机器人技术研发有限公司 Interaction system based on voice attribute classification, and method thereof
CN106548773A (en) * 2016-11-04 2017-03-29 百度在线网络技术(北京)有限公司 Child user searching method and device based on artificial intelligence
CN107507612A (en) * 2017-06-30 2017-12-22 百度在线网络技术(北京)有限公司 A kind of method for recognizing sound-groove and device
US20190027129A1 (en) * 2017-07-18 2019-01-24 Baidu Online Network Technology (Beijing) Co., Ltd Method, apparatus, device and storage medium for switching voice role
CN107705793A (en) * 2017-09-22 2018-02-16 百度在线网络技术(北京)有限公司 Information-pushing method, system and its equipment based on Application on Voiceprint Recognition
CN109189980A (en) * 2018-09-26 2019-01-11 三星电子(中国)研发中心 The method and electronic equipment of interactive voice are carried out with user
CN109903750A (en) * 2019-02-21 2019-06-18 科大讯飞股份有限公司 A kind of audio recognition method and device
CN109947984A (en) * 2019-02-28 2019-06-28 北京奇艺世纪科技有限公司 A kind of content delivery method and driving means for children
CN109976703A (en) * 2019-04-04 2019-07-05 广东美的厨房电器制造有限公司 Guide illustration method, computer readable storage medium and cooking equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113012682A (en) * 2021-03-24 2021-06-22 北京百度网讯科技有限公司 False wake-up rate determination method, device, apparatus, storage medium, and program product
CN113012682B (en) * 2021-03-24 2022-10-14 北京百度网讯科技有限公司 False wake-up rate determination method, device, apparatus, storage medium, and program product

Similar Documents

Publication Publication Date Title
CN110838289B (en) Wake-up word detection method, device, equipment and medium based on artificial intelligence
CN109918680B (en) Entity identification method and device and computer equipment
JP7005099B2 (en) Voice keyword recognition methods, devices, computer-readable storage media, and computer devices
CN109241524B (en) Semantic analysis method and device, computer-readable storage medium and electronic equipment
CN108346436B (en) Voice emotion detection method and device, computer equipment and storage medium
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
CN108831439B (en) Voice recognition method, device, equipment and system
CN108899013B (en) Voice search method and device and voice recognition system
Tong et al. A comparative study of robustness of deep learning approaches for VAD
CN108062954B (en) Speech recognition method and device
CN109360572B (en) Call separation method and device, computer equipment and storage medium
CN111444329A (en) Intelligent conversation method and device and electronic equipment
CN110070859B (en) Voice recognition method and device
CN113314119B (en) Voice recognition intelligent household control method and device
CN112673421A (en) Training and/or using language selection models to automatically determine a language for voice recognition of spoken utterances
JP2020004382A (en) Method and device for voice interaction
CN111462751A (en) Method, apparatus, computer device and storage medium for decoding voice data
CN110491375B (en) Target language detection method and device
CN113450771A (en) Awakening method, model training method and device
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN116978368B (en) Wake-up word detection method and related device
CN114155854A (en) Voice data processing method and device
CN111640423B (en) Word boundary estimation method and device and electronic equipment
CN112530418A (en) Voice wake-up method, device and related equipment
JP6910002B2 (en) Dialogue estimation method, dialogue activity estimation device and program

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination