CN113782009A - Voice awakening system based on Savitzky-Golay filter smoothing method

Info

Publication number: CN113782009A
Application number: CN202111322974.8A
Authority: CN
Inventors: 李郡; 付冠宇; 乔树山; 尚德龙; 周玉梅
Original assignee: Zhongke Nanjing Intelligent Technology Research Institute
Current assignee: Zhongke Nanjing Intelligent Technology Research Institute
Priority date: 2021-11-10
Filing date: 2021-11-10
Publication date: 2021-12-10

Abstract

The invention relates to a wake-up method and system for a voice wake-up system. The method comprises: acquiring continuous acoustic feature frames of a voice stream; determining, from the continuous acoustic feature frames of the voice stream and by means of a voice wake-up system neural network, the continuous probability of the non-keyword and probability of the keyword; smoothing the probability of the non-keyword and the probability of the keyword with a Savitzky-Golay filter; and determining the output of the voice wake-up system for the current frame from the smoothed probability of the non-keyword and the smoothed probability of the keyword. The invention can improve the overall stability and accuracy of the voice wake-up system.

Description

Voice awakening system based on Savitzky-Golay filter smoothing method

Technical Field

The present invention relates to the field of voice wake-up, and in particular, to a wake-up method and system for a voice wake-up system.

Background

With the development of intelligent devices, voice interaction is widely used, and the voice wake-up system is the key to enabling it. The goal of a voice wake-up system is to detect preset keywords in continuous voice input without manual operation. Because the system usually runs on an edge device with little memory and limited computing power, it must simultaneously achieve high accuracy, few false wake-ups and false rejections, a small memory footprint, and a low computational load.

However, when the output of the neural network model of the voice wake-up system is used directly as the basis for the wake-up decision, the output is too noisy and the system is prone to erratic false wake-ups, so the overall stability and accuracy of the voice wake-up system are low.

Disclosure of Invention

The invention aims to provide a method and a system for waking up a voice wake-up system, which can improve the overall stability and accuracy of the voice wake-up system.

In order to achieve the purpose, the invention provides the following scheme:

a wake-up method of a voice wake-up system includes:

acquiring continuous acoustic feature frames of a voice stream;

determining, according to the continuous acoustic feature frames of the voice stream and by using a voice wake-up system neural network, the continuous probability of the non-keyword and probability of the keyword; the voice wake-up system neural network takes acoustic feature frames as input and outputs the probability of the non-keyword and the probability of the keyword;

smoothing the probability of the non-keywords and the probability of the keywords by using a Savitzky-Golay filter;

and determining the output of the current frame of the voice awakening system by using the probability of the non-keyword after the smoothing processing and the probability of the keyword after the smoothing processing.

Optionally, the acquiring the continuous acoustic feature frames of the voice stream specifically includes:

acquiring a continuous voice stream by using a microphone;

and performing feature extraction on the continuous voice stream to determine continuous acoustic feature frames.

Optionally, the determining, according to the continuous acoustic feature frames of the voice stream, the probability of the continuous non-keyword and the probability of the keyword by using a voice wake-up system neural network specifically includes:

acquiring keywords and non-keywords of a voice awakening system;

marking keywords and non-keywords; the labels of different keywords are different, and the labels of different non-keywords are the same;

acquiring continuous acoustic feature frames extracted by keywords and non-keywords;

taking continuous acoustic characteristic frames as input and the probability of a label as output, and constructing and training a voice wake-up system neural network; the probability of the label corresponds to the probability of the non-keyword and the probability of the keyword;

and taking continuous acoustic feature frames as input, and enabling the trained voice wake-up system neural network to generate continuous non-keyword probability and keyword probability.

Optionally, the determining the output of the current frame of the voice wake-up system by using the probability of the non-keyword after the smoothing processing and the probability of the keyword after the smoothing processing specifically includes:

determining the maximum probability according to the probability of the non-keyword after the smoothing processing and the probability of the keyword after the smoothing processing;

if the maximum probability belongs to the non-keyword, not waking up the voice wake-up system;

if the maximum probability belongs to a keyword, the maximum probability is greater than or equal to a set wake-up threshold, and the time since the last wake-up exceeds a set time limit, waking up the voice wake-up system according to the corresponding keyword; otherwise, not waking up the voice wake-up system.
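
The decision rule above can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the function name `decide_wakeup`, the default threshold of 0.8 and the default time limit of 2.0 seconds are hypothetical values chosen for the example, and index 0 is assumed to hold the non-keyword probability.

```python
def decide_wakeup(smoothed_probs, now, last_wake_time,
                  threshold=0.8, min_interval=2.0):
    """Return the label of the keyword to wake on, or None for no wake-up.

    smoothed_probs[0] is the smoothed non-keyword probability and
    smoothed_probs[1..n] are the smoothed keyword probabilities.
    threshold and min_interval are hypothetical example values.
    """
    # Label with the maximum smoothed probability.
    best = max(range(len(smoothed_probs)), key=lambda i: smoothed_probs[i])
    if best == 0:
        return None  # the non-keyword wins: do not wake up
    if smoothed_probs[best] < threshold:
        return None  # keyword wins but is below the wake-up threshold
    if last_wake_time is not None and now - last_wake_time <= min_interval:
        return None  # too soon after the previous wake-up
    return best
```

Passing `last_wake_time=None` stands for a system that has not been woken up yet; how the first wake-up is treated is not stated in the text and is an assumption of this sketch.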

A wake-up system for a voice wake-up system, comprising:

the acoustic feature frame acquisition module is used for acquiring continuous acoustic feature frames of the voice stream;

the probability determining module is used for determining, according to the continuous acoustic feature frames of the voice stream and by using a voice wake-up system neural network, the continuous probability of the non-keyword and probability of the keyword; the voice wake-up system neural network takes acoustic feature frames as input and outputs the probability of the non-keyword and the probability of the keyword;

the probability smoothing processing module is used for smoothing the probability of the non-keyword and the probability of the keyword by utilizing a Savitzky-Golay filter;

and the output determining module of the current frame of the voice awakening system is used for determining the output of the current frame of the voice awakening system by utilizing the probability of the non-keyword after the smoothing processing and the probability of the keyword after the smoothing processing.

Optionally, the acoustic feature frame obtaining module specifically includes:

a voice stream acquiring unit for acquiring a continuous voice stream by using a microphone;

and the acoustic feature frame determining unit is used for extracting features of the continuous voice stream and determining continuous acoustic feature frames.

Optionally, the probability determining module specifically includes:

the data acquisition unit is used for acquiring keywords and non-keywords of the voice wake-up system;

a data marking unit for marking keywords and non-keywords; the labels of different keywords are different, and the labels of different non-keywords are the same;

the feature acquisition unit is used for acquiring continuous acoustic feature frames extracted from the keywords and non-keywords;

the voice wake-up system neural network construction unit is used for constructing and training a voice wake-up system neural network by taking continuous acoustic characteristic frames as input and taking the probability of a label as output; the probability of the label corresponds to the probability of the non-keyword and the probability of the keyword;

and the non-keyword and keyword probability generating unit is used for taking the continuous acoustic characteristic frames as input so as to enable the trained voice wake-up system neural network to generate continuous non-keyword probability and keyword probability.

Optionally, the output determining module of the current frame of the voice wake-up system specifically includes:

the maximum probability determining unit is used for determining the maximum probability according to the probability of the non-keyword after the smoothing processing and the probability of the keyword after the smoothing processing;

the first determining unit of the output of the voice awakening system is used for not awakening the voice awakening system if the maximum probability is the non-keyword;

the second determining unit of the output of the voice wake-up system is used for waking up the voice wake-up system according to the corresponding keyword if the maximum probability belongs to a keyword, the maximum probability is greater than or equal to a set wake-up threshold, and the time since the last wake-up exceeds a set time limit; otherwise, the voice wake-up system is not woken up.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

The invention provides a wake-up method and system for a voice wake-up system, which smooth the probability of the non-keyword and the probability of the keyword with a Savitzky-Golay filter, filtering out output noise so that the system is no longer frequently woken up by mistake. Moreover, using the Savitzky-Golay filter avoids the drawback of an average smoothing filter, which filters out local detail in the trend and thereby increases the number of falsely rejected keywords. The Savitzky-Golay filter preserves the original trend of the probability output, reduces the number of falsely rejected keywords, and improves the overall stability and accuracy of the voice wake-up system.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a schematic flow chart of a wake-up method of a voice wake-up system according to the present invention;

Fig. 2 is a schematic structural diagram of a wake-up system of a voice wake-up system according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Fig. 1 is a schematic flow chart of a wake-up method of a voice wake-up system according to the present invention. As shown in fig. 1, the wake-up method for a voice wake-up system provided by the present invention includes:

S101, acquiring continuous acoustic feature frames of a voice stream;

S101 specifically comprises the following steps:

acquiring a continuous voice stream by using a microphone;

S102, determining, according to the continuous acoustic feature frames of the voice stream and by using a voice wake-up system neural network, the continuous probability of the non-keyword and probability of the keyword; the voice wake-up system neural network takes acoustic feature frames as input and outputs the probability of the non-keyword and the probability of the keyword;

S102 specifically comprises:

acquiring keywords and non-keywords of a voice awakening system;

marking keywords and non-keywords; the labels of different keywords are different, and the labels of different non-keywords are the same; for example: the label of the non-keyword is noted as 0, and the label of the keyword is noted as 1, 2, 3, …, n.

Acquiring continuous acoustic feature frames extracted from the keywords and non-keywords; for example: Mel-frequency cepstral coefficients (MFCCs).

Constructing and training the voice wake-up system neural network with continuous acoustic feature frames as input and the probability of each label as output; the label probabilities correspond to the probability of the non-keyword and the probability of the keyword. The intermediate network layers, which may be linear layers, convolutional layers or the like, serve as feature extraction layers; the last layer, a combination of a linear layer and a softmax layer, serves as the classification layer. With n keywords selected, the input passes through the feature extraction layers and the classification layer, and the network finally outputs a vector of length n + 1, in which each value is the probability of hitting the label corresponding to its index.
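
The classification layer described above can be sketched in a few lines. This is an illustrative sketch only: the feature extraction layers are omitted, the input is assumed to be an already extracted feature vector, and the names `linear`, `softmax` and `classify` as well as the toy weights in the usage note are hypothetical.

```python
import math


def linear(x, weights, bias):
    """Linear layer: one output per row of weights."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax: the outputs are positive and sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, bias):
    """Linear layer followed by softmax.

    With n keywords, weights has n + 1 rows, so the result has length
    n + 1: index 0 is the non-keyword label, indices 1..n the keywords.
    """
    return softmax(linear(features, weights, bias))
```

For example, with two keywords, `classify([2.0, 0.5], [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [0.0, 0.0, 0.0])` returns three probabilities, the largest of which is at index 1.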

Fixing all parameters of the neural network of the voice awakening system, completing acoustic modeling of non-keywords and keywords, and deploying the voice awakening model to hardware equipment.

Because the input of the voice wake-up system is a fixed number of acoustic feature frames, each newly obtained acoustic feature frame can be spliced onto the previously generated feature frames to form the next input of the voice wake-up system neural network.
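
One way to realize this splicing is a fixed-length sliding window over the most recent frames. The sketch below is an assumption about how the buffer could be organized, not a detail given in the text; the class name `FrameWindow` and its `push` method are hypothetical.

```python
from collections import deque


class FrameWindow:
    """Keeps the most recent num_frames acoustic feature frames."""

    def __init__(self, num_frames):
        # A deque with maxlen drops the oldest frame automatically.
        self._frames = deque(maxlen=num_frames)

    def push(self, frame):
        """Add a new frame; return the spliced network input once the
        window is full, or None while it is still filling up."""
        self._frames.append(frame)
        if len(self._frames) < self._frames.maxlen:
            return None
        return list(self._frames)
```

Each call to `push` after the window has filled yields a new network input that shares all but one frame with the previous one.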

S103, smoothing the probability of the non-keywords and the probability of the keywords by using a Savitzky-Golay filter;

The Savitzky-Golay filter is based on least-squares polynomial fitting: the fitted value replaces the original probability output, which removes high-frequency noise and smooths the probability output of each label. The fitted value can be obtained by convolving the Savitzky-Golay coefficients with the original probability output of each label. The coefficients are determined solely by the length of the smoothing window and the order of the polynomial, so once they have been computed they can be reused, which reduces the amount of computation. Applying Savitzky-Golay smoothing to the probability output of the neural network is therefore equivalent to applying a finite impulse response (FIR) filter to it; the smoothing is simple to compute, and it filters out noise while leaving the trend of the probability output, that is, the keyword wake-up trend, unchanged.
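
The coefficient computation and the convolution can be sketched as follows. This is a minimal illustration under stated assumptions: it computes the centre-point smoothing coefficients by solving the least-squares normal equations exactly with rational arithmetic, the function names `savgol_coeffs` and `smooth` are hypothetical, and the window length and polynomial order are free parameters that the text does not fix. Edge handling and the latency of a centred window in a streaming system are not addressed.

```python
from fractions import Fraction


def savgol_coeffs(window_length, polyorder):
    """Centre-point Savitzky-Golay smoothing coefficients (exact)."""
    if window_length % 2 == 0 or polyorder >= window_length:
        raise ValueError("window_length must be odd and exceed polyorder")
    half = window_length // 2
    offsets = range(-half, half + 1)
    n = polyorder + 1
    # Normal-equation matrix A^T A, where A[k][j] = offset_k ** j.
    ata = [[Fraction(sum(k ** (i + j) for k in offsets)) for j in range(n)]
           for i in range(n)]
    # Solving (A^T A) x = e0 gives the first row of the pseudo-inverse.
    rhs = [Fraction(1)] + [Fraction(0)] * polyorder
    for col in range(n):  # Gauss-Jordan elimination
        pivot = next(r for r in range(col, n) if ata[r][col] != 0)
        ata[col], ata[pivot] = ata[pivot], ata[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(n):
            if r != col:
                factor = ata[r][col] / ata[col][col]
                ata[r] = [a - factor * b for a, b in zip(ata[r], ata[col])]
                rhs[r] -= factor * rhs[col]
    x = [rhs[i] / ata[i][i] for i in range(n)]
    # Coefficient for offset k is the fitted polynomial basis weight at k.
    return [sum(x[j] * k ** j for j in range(n)) for k in offsets]


def smooth(track, coeffs):
    """Filter one label's probability track with precomputed coefficients.

    Only positions where the whole window fits are returned. The
    coefficients are symmetric, so convolution equals correlation.
    """
    w = len(coeffs)
    return [sum(c * p for c, p in zip(coeffs, track[i:i + w]))
            for i in range(len(track) - w + 1)]
```

For a window of 5 and a quadratic polynomial, `savgol_coeffs(5, 2)` yields the well-known coefficients (-3, 12, 17, 12, -3) / 35. The coefficients are computed once and then reused for every label and every frame; with floating-point probabilities they can be converted to `float` beforehand.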

S104, determining the output of the current frame of the voice wake-up system by using the probability of the non-keyword after the smoothing processing and the probability of the keyword after the smoothing processing.

S104 specifically comprises the following steps:

Compared with existing voice wake-up techniques, the invention reduces the noise in the output of the voice wake-up system neural network at a low computational cost, makes the voice wake-up system more stable, and lowers its false wake-up rate.

This advantage comes from smoothing the probability output of each label separately with a Savitzky-Golay filter. If the model probability is used directly as the wake-up basis without Savitzky-Golay smoothing, the output is too noisy and the system suffers erratic false wake-ups. If an average smoothing filter is used instead, local detail in the trend is filtered out, which increases the number of falsely rejected keywords. The Savitzky-Golay filter both removes the output noise, so that the system is no longer frequently woken up by mistake, and preserves the original trend of the probability output, which reduces the number of falsely rejected keywords and improves the overall stability and accuracy of the voice wake-up system.
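
The difference between the two filters can be seen on a small synthetic example. The probability values and the threshold of 0.8 below are made up for illustration and are not data from the invention; the 5-point quadratic Savitzky-Golay coefficients (-3, 12, 17, 12, -3) / 35 are the standard ones for that window and order.

```python
# Standard 5-point quadratic Savitzky-Golay coefficients.
SG_5_2 = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]
# 5-point average (moving mean) coefficients.
MEAN_5 = [1 / 5] * 5


def filter_centre(window, coeffs):
    """Filtered value at the centre of a 5-frame window."""
    return sum(c * p for c, p in zip(coeffs, window))


# Synthetic keyword probability: a short, sharp peak (hypothetical data).
peak = [0.1, 0.5, 1.0, 0.5, 0.1]

sg_value = filter_centre(peak, SG_5_2)    # about 0.81
mean_value = filter_centre(peak, MEAN_5)  # about 0.44

THRESHOLD = 0.8  # hypothetical wake-up threshold
print("Savitzky-Golay wakes up:", sg_value >= THRESHOLD)
print("Average filter wakes up:", mean_value >= THRESHOLD)
```

On this peak the average filter flattens the maximum to about 0.44, which would be rejected at the example threshold, while the Savitzky-Golay filter keeps it at about 0.81.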

Fig. 2 is a schematic structural diagram of a wake-up system of a voice wake-up system according to the present invention, and as shown in fig. 2, the wake-up system of the voice wake-up system according to the present invention includes:

an acoustic feature frame obtaining module 201, configured to obtain continuous acoustic feature frames of a voice stream;

a probability determination module 202, configured to determine, according to the continuous acoustic feature frames of the voice stream and by using a voice wake-up system neural network, the continuous probability of the non-keyword and probability of the keyword; the voice wake-up system neural network takes acoustic feature frames as input and outputs the probability of the non-keyword and the probability of the keyword;

a probability smoothing module 203, configured to smooth the probability of the non-keyword and the probability of the keyword by using a Savitzky-Golay filter;

and an output determining module 204 for the current frame of the voice wake-up system, configured to determine the output of the current frame of the voice wake-up system by using the probability of the non-keyword after the smoothing processing and the probability of the keyword after the smoothing processing.

The acoustic feature frame obtaining module 201 specifically includes:

The probability determination module 202 specifically includes:

The output determining module 204 of the current frame of the voice wake-up system specifically includes:

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A method for waking up a voice wake-up system, comprising:

acquiring continuous acoustic feature frames of a voice stream;

2. The method for waking up a voice wake-up system according to claim 1, wherein the obtaining of the continuous acoustic feature frames of the voice stream specifically includes:

acquiring a continuous voice stream by using a microphone;

3. The method according to claim 1, wherein the determining the probability of continuous non-keywords and the probability of keywords by using a voice wake-up system neural network according to continuous acoustic feature frames of a voice stream specifically comprises:

acquiring keywords and non-keywords of a voice awakening system;

4. The method for waking up a voice wake-up system according to claim 1, wherein the determining the output of the current frame of the voice wake-up system by using the probability of the smoothed non-keyword and the probability of the smoothed keyword specifically comprises:

5. A wake-up system for a voice wake-up system, comprising:

6. The wake-up system of a voice wake-up system according to claim 5, wherein the acoustic feature frame acquiring module specifically includes:

7. The wake-up system of a voice wake-up system according to claim 5, wherein the probability determination module specifically comprises:

and the non-keyword and keyword probability generating unit is used for generating continuous non-keyword probability and keyword probability by using the trained voice wake-up system neural network by taking the continuous acoustic characteristic frames as input.

8. The wake-up system of a voice wake-up system according to claim 5, wherein the output determining module of the current frame of the voice wake-up system specifically comprises: