CN112397086A - Voice keyword detection method and device, terminal equipment and storage medium - Google Patents
Voice keyword detection method and device, terminal equipment and storage medium
- Publication number
- CN112397086A (application number CN202011225861.1A)
- Authority
- CN
- China
- Prior art keywords
- audio
- signal
- voice
- audio stream
- detected
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, using predictive techniques
- G10L19/26—Pre-filtering or post-filtering
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Telephonic Communication Services (AREA)
Abstract
The application relates to the technical field of voice recognition, and provides a voice keyword detection method and apparatus, a terminal device, and a computer storage medium. After the audio stream to be detected is obtained, its audio features are first extracted, and the voice endpoint detection module can detect, from these audio features, whether each frame of audio signal of the audio stream contains a voice signal. The voice endpoint detection module triggers the keyword detection module only when it detects that more than two consecutive frames of audio signals in the audio stream all contain voice signals, which reduces the system power consumption caused by misjudgments of the voice endpoint detection module.
Description
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method and an apparatus for detecting a speech keyword, a terminal device, and a storage medium.
Background
The main function of voice keyword detection is to detect whether a segment of audio signal contains predetermined keywords. With the advent of the mobile internet era, voice keyword detection technology has been applied more and more widely. For example, devices such as smart home appliances and smart phones can use voice keyword detection to continuously monitor for specific keywords, and a user can wake the device and make it start working simply by speaking the preset keyword, which provides the user with a hands-free voice recognition experience.
At present, a conventional voice keyword detection method generally comprises: detecting, with a voice endpoint detection module, whether the acquired audio stream contains a voice signal; if a voice signal is detected, extracting the audio features of the voice signal; and, after a certain number of audio features have been extracted, detecting keywords in the audio features with a keyword detection module.
However, the voice endpoint detection module may misjudge, and when a misjudgment occurs the keyword detection module may be started by mistake, causing unnecessary system power consumption.
Disclosure of Invention
In view of this, embodiments of the present application provide a voice keyword detection method and apparatus, a terminal device, and a storage medium, which can reduce the system power consumption caused by misjudgments of the voice endpoint detection module.
A first aspect of an embodiment of the present application provides a method for detecting a voice keyword, including:
acquiring an audio stream to be detected;
extracting audio features of the audio stream to be detected;
detecting, according to the audio features, whether more than two consecutive frames of audio signals in the audio stream to be detected all contain voice signals;
and if more than two consecutive frames of audio signals in the audio stream to be detected all contain voice signals, performing a keyword detection operation on the audio stream to be detected according to the audio features.
After the audio stream to be detected is obtained, its audio features are first extracted, and the voice endpoint detection module can detect, from these audio features, whether each frame of audio signal of the audio stream contains a voice signal. The voice endpoint detection module triggers the keyword detection module only when it detects that more than two consecutive frames of audio signals in the audio stream all contain voice signals, which reduces the system power consumption caused by misjudgments of the voice endpoint detection module.
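For illustration only, the overall control flow described above can be sketched as follows. The helper callables frame_is_speech and detect_keyword are hypothetical stand-ins for the voice endpoint detection model and the keyword detection model, and the threshold of 3 frames simply follows the example value mentioned elsewhere in this application.

```python
def detect_voice_keyword(frame_features, frame_is_speech, detect_keyword,
                         min_speech_frames=3):
    """Run VAD frame by frame; start KWS only after enough consecutive
    speech frames, which suppresses spurious VAD activations."""
    consecutive = 0
    for feat in frame_features:
        consecutive = consecutive + 1 if frame_is_speech(feat) else 0
        if consecutive >= min_speech_frames:
            return detect_keyword(frame_features)   # keyword detection operation
    return None                                     # KWS is never started

# Toy usage with stand-in callables in place of the real VAD/KWS models:
result = detect_voice_keyword(
    [0.1, 0.9, 0.8, 0.95, 0.2],
    frame_is_speech=lambda f: f > 0.5,
    detect_keyword=lambda feats: "preset keyword")
print(result)   # "preset keyword" (three consecutive frames were speech)
```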
In an embodiment of the present application, extracting the audio features of the audio stream to be detected may include:
performing framing processing on the audio stream to be detected to obtain a plurality of audio segment frames;
performing discrete cosine transform processing on the plurality of audio segment frames to obtain frequency domain signals of the plurality of audio segment frames;
and filtering the frequency domain signals with a filter to obtain the audio features of the audio stream to be detected.
Framing of the audio stream to be detected can be implemented by applying a window function in combination with a storage device. A discrete cosine transform module can then process the audio segment frames obtained by framing, completing the conversion from time-domain signals to frequency-domain signals; the energy of the signal computed after the discrete cosine transform is more concentrated, so the voice-related features are easier to extract. After the discrete cosine transform, the time-domain signal has been converted into a frequency-domain signal, which can then be filtered with a filter to obtain the audio features of the audio stream to be detected.
Further, performing framing processing on the audio stream to be detected to obtain a plurality of audio segment frames may include:
writing the audio stream to be detected into a storage device of a preset size, wherein once the storage device is full, newly written data overwrites the old data;
when the storage device becomes full for the first time, extracting the data currently stored in the storage device as a first audio segment frame;
when, after the storage device was first full, data with a length of half the preset size has been written again, extracting the data currently stored in the storage device as a second audio segment frame;
when, after the storage device was first full, data with a length equal to the preset size has been written again, extracting the data currently stored in the storage device as a third audio segment frame;
and repeating this operation until all data of the audio stream to be detected has been traversed, so as to obtain the plurality of audio segment frames.
Assuming the storage device is an SRAM with a 512 x 12-bit storage area, the windowing operation only requires monitoring the SRAM write address. When the SRAM is filled with data for the first time, the data currently stored at addresses 0-511 forms the first audio segment frame. After the SRAM is full, newly written data overwrites the old data; when data of half the SRAM size (i.e. 256 x 12 bits) has been written again, the current contents of the SRAM are extracted as the second audio segment frame, which runs from SRAM address 256 around to address 255 of the next cycle. After another 256 x 12 bits of data have been written into the SRAM, the current contents are extracted as the third audio segment frame, and so on, until all data of the audio stream to be detected have been traversed.
In an embodiment of the present application, whether any given frame of the audio stream to be detected (the target audio signal) contains a voice signal may be detected in the following manner:
processing the audio features of the target audio signal with a pre-constructed voice signal detection model to obtain a first probability that the target audio signal contains a voice signal and a second probability that it does not;
if the first probability is greater than the second probability, determining that the target audio signal contains a voice signal, and otherwise determining that it does not;
wherein the bias values, the activation values and the weight precision of each neural network layer of the voice signal detection model are configurable.
The filtered feature data can be stored in a dual-port SRAM, which can also hold the intermediate values generated during the neural network computation. Processing the audio features with the pre-constructed voice signal detection model yields the probability of whether the audio signal contains a voice signal. Specifically, the voice signal detection model may be a neural network model: the audio feature data stored in the corresponding SRAM is read into the processing unit of the neural network for computation, the input layer of the model takes the voice feature values, and the output layer uses a softmax function to output two probabilities, speech and non-speech; when the speech probability is greater than the non-speech probability, the current audio signal is judged to contain a voice signal.
Further, before processing the audio feature of the target audio signal by using the pre-constructed speech signal detection model, the method may further include:
detecting the signal-to-noise ratio of the audio stream to be detected;
if the signal-to-noise ratio is larger than or equal to a first threshold value, setting the weight precision of each layer of neural network of the voice signal detection model as a first numerical value;
and if the signal-to-noise ratio is smaller than the first threshold value, setting the weight precision of each layer of neural network of the voice signal detection model as a second numerical value, wherein the second numerical value is larger than the first numerical value.
Because the parameters of the voice signal detection model are configurable, they can be configured for the actual environment, according to its signal-to-noise ratio, before the model is used. Specifically, in an environment with a low signal-to-noise ratio, the weight precision of the model can be configured to a higher value, which guarantees the precision and accuracy of the detection; in an environment with a high signal-to-noise ratio, the weight precision can be configured to a lower value, which saves system power while avoiding a large drop in detection accuracy.
In an embodiment of the present application, the performing, according to the audio feature, a keyword detection operation on the audio stream to be detected may include:
for each frame of audio signal in the audio stream to be detected, processing the audio signal with a pre-constructed keyword detection model to obtain the probability that the audio signal contains each preset voice keyword;
determining the keyword corresponding to the largest of these probabilities as the keyword of the audio signal;
wherein the bias values, the activation values and the weight precision of each neural network layer of the keyword detection model are configurable.
The keyword detection module is started when it is detected that more than two consecutive frames of audio signals in the audio stream to be detected all contain voice signals. Processing the audio signal with the pre-constructed keyword detection model yields the probability of each preset voice keyword being contained in the audio signal, and the keyword with the highest probability is determined to be the keyword contained in the audio signal.
Further, before processing the audio signal by using the pre-constructed keyword detection model, the method may further include:
detecting the signal-to-noise ratio of the audio stream to be detected;
if the signal-to-noise ratio is larger than or equal to a second threshold value, setting the weight precision of each layer of neural network of the keyword detection model as a third numerical value;
and if the signal-to-noise ratio is smaller than the second threshold value, setting the weight precision of each layer of neural network of the keyword detection model as a fourth numerical value, wherein the fourth numerical value is larger than the third numerical value.
Because the parameters of the keyword detection model are configurable, they can be configured for the actual environment, according to its signal-to-noise ratio, before the model is used. Specifically, in an environment with a low signal-to-noise ratio, the weight precision of the model can be configured to a higher value, which guarantees the precision and accuracy of the detection; in an environment with a high signal-to-noise ratio, the weight precision can be configured to a lower value, which saves system power while avoiding a large drop in detection accuracy.
A second aspect of the embodiments of the present application provides a device for detecting a speech keyword, including:
an audio acquiring module, configured to acquire an audio stream to be detected;
an audio feature extraction module, configured to extract audio features of the audio stream to be detected;
a voice signal detection module, configured to detect, according to the audio features, whether more than two consecutive frames of audio signals in the audio stream to be detected all contain voice signals;
and a keyword detection module, configured to perform a keyword detection operation on the audio stream to be detected according to the audio features if more than two consecutive frames of audio signals in the audio stream to be detected all contain voice signals.
A third aspect of an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the voice keyword detection method provided in the first aspect of the embodiment of the present application when executing the computer program.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the steps of the voice keyword detection method provided in the first aspect of the embodiments of the present application.
A fifth aspect of the embodiments of the present application provides a computer program product, which, when running on a terminal device, causes the terminal device to execute the steps of the method for detecting a voice keyword according to the first aspect of the embodiments of the present application.
It is understood that the beneficial effects of the second aspect to the fifth aspect can be referred to the related description of the first aspect, and are not described herein again.
Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can derive other drawings from them without inventive effort.
Fig. 1 is a flowchart of an embodiment of a method for detecting a speech keyword according to an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating a process of windowing and framing audio data in a speech keyword detection method according to an embodiment of the present application;
FIG. 3 is a flowchart of an embodiment of another method for detecting a speech keyword according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a speech keyword detection system according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a speech keyword detection chip according to an embodiment of the present application;
FIG. 6 is a graph of the filter characteristics of the Mel-filter with 4bit accuracy provided by the embodiments of the present application;
FIG. 7 is a diagram illustrating the operation of the Mel-filter with 4-bit precision according to an embodiment of the present application;
fig. 8 is a schematic diagram of a neural network structure adopted by a VAD module provided in the embodiment of the present application;
fig. 9 is a schematic diagram of a neural network structure adopted by the KWS module according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a neural network processing unit provided in an embodiment of the present application;
fig. 11 is a block diagram of an embodiment of a speech keyword detection apparatus according to an embodiment of the present application;
fig. 12 is a schematic diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail. Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
In recent years, with the rise of deep learning, researchers have begun to use neural networks for voice activity detection and keyword detection, with very good results. Although a full-precision neural network performs very well, its inference requires a large amount of computation, and implementing full-precision arithmetic in hardware requires large resources and power. How to implement voice activity detection and keyword detection with lower power consumption, without significantly reducing the inference performance of the neural network, has therefore become an exploration target for researchers.
At present, a conventional voice keyword detection system generally includes a voice endpoint detection module, a voice feature extraction module and a keyword detection module. In operation, the voice endpoint detection module detects whether a voice signal is present in the audio stream; if a voice signal is detected, the subsequent voice feature extraction module is activated to process the following audio stream frame by frame and extract voice features such as MFCCs (Mel-frequency cepstral coefficients), which are temporarily stored in a storage device such as an SRAM (static random access memory); once the audio data required for one keyword detection pass (about 500 ms) has been stored, the keyword detection module is started to judge whether the audio contains a preset keyword. However, the voice endpoint detection module may misjudge, and when a misjudgment occurs the keyword detection module may be started by mistake, causing unnecessary system power consumption. Moreover, the computation of the CNN network can only start after all the data required for keyword detection has been prepared, which wastes time and increases latency; most importantly, this approach increases the number of intermediate values generated during the operation of the convolutional network, so a large SRAM is needed to store these intermediate values.
In view of the above problems, embodiments of the present application provide a method and an apparatus for detecting a voice keyword, a terminal device, and a storage medium, which can reduce system power consumption caused by misjudgment generated by a voice endpoint detection module. It should be understood that the execution subject of the method embodiments of the present application may be various types of terminal devices or servers, such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a wearable device, and the like.
Referring to fig. 1, a method for detecting a speech keyword in an embodiment of the present application is shown, including:
101. acquiring an audio stream to be tested;
firstly, an audio stream to be detected is obtained, wherein the audio stream to be detected is a certain section of audio signal needing to execute voice keyword detection.
102. Extracting the audio features of the audio stream to be detected;
after the audio stream to be detected is obtained, the audio stream to be detected can be input into the voice feature extraction module to extract the audio features of the audio stream to be detected, and the extracted audio features can be used for subsequent voice signal detection and keyword detection.
Further, extracting the audio features of the audio stream to be detected may include:
(1) performing framing processing on the audio stream to be detected to obtain a plurality of audio segment frames;
(2) performing discrete cosine transform processing on the plurality of audio segment frames to obtain frequency domain signals of the plurality of audio segment frames;
(3) filtering the frequency domain signals with a filter to obtain the audio features of the audio stream to be detected.
For step (1), framing of the audio stream to be detected can be implemented by applying a window function in combination with a storage device, and may specifically include:
(1.1) writing the audio stream to be detected into a storage device of a preset size, wherein once the storage device is full, newly written data overwrites the old data;
(1.2) when the storage device becomes full for the first time, extracting the data currently stored in the storage device as a first audio segment frame;
(1.3) when, after the storage device was first full, data with a length of half the preset size has been written again, extracting the data currently stored in the storage device as a second audio segment frame;
(1.4) when, after the storage device was first full, data with a length equal to the preset size has been written again, extracting the data currently stored in the storage device as a third audio segment frame;
(1.5) and repeating this operation until all data of the audio stream to be detected has been traversed, so as to obtain the plurality of audio segment frames.
To reduce the amount of computation, a rectangular window of value 1 can be used as the window function, with a frame shift of 16 ms, i.e. 256 audio sample values. Assuming the storage device is an SRAM with a 512 x 12-bit storage area, the windowing operation only requires monitoring the SRAM write address. When the SRAM is filled with data for the first time, the data currently stored at addresses 0-511 forms the first audio segment frame. After the SRAM is full, newly written data overwrites the old data; when data of half the SRAM size (i.e. 256 x 12 bits) has been written again, the current contents of the SRAM are extracted as the second audio segment frame, which runs from SRAM address 256 around to address 255 of the next cycle. After another 256 x 12 bits of data have been written into the SRAM, the current contents are extracted as the third audio segment frame, and so on, until all data of the audio stream to be detected have been traversed. For ease of understanding, the windowing and framing of the audio data is illustrated in fig. 2, which shows 5 audio segment frames, i.e. frames 1-5.
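A minimal Python model of this circular-buffer framing scheme is sketched below for illustration. The 512-sample buffer and 256-sample frame shift follow the example above; the function name and the integer sample values are assumptions made purely for the demonstration.

```python
def frame_stream(samples, buf_size=512, shift=256):
    """Emit one audio segment frame each time 'shift' new samples arrive,
    once the circular buffer (modelling the 512 x 12-bit SRAM) is full."""
    buf = [0] * buf_size
    frames = []
    written = 0
    for s in samples:
        buf[written % buf_size] = s          # new data overwrites old data
        written += 1
        if written >= buf_size and (written - buf_size) % shift == 0:
            start = written % buf_size        # current write position
            # One frame = buffer contents read circularly starting at 'start'
            frames.append(buf[start:] + buf[:start])
    return frames

# Example: frame 1 covers samples 0-511, frame 2 covers samples 256-767,
# frame 3 covers samples 512-1023, and so on (50% overlap between frames).
frames = frame_stream(list(range(2048)))
print(len(frames), frames[1][0], frames[1][-1])   # 7 256 767
```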
For step (2), a discrete cosine transform module may be used to process the plurality of audio segment frames obtained by framing, completing the conversion from time-domain signals to frequency-domain signals. The discrete cosine transform is a transform defined on real signals, and the result of the transform is a real signal in the frequency domain. For most natural signals (sound and images), the energy is concentrated in the low-frequency part after the discrete cosine transform, so the energy features of the signal computed after the transform are more concentrated and the voice-related features are easier to extract.
For step (3), after the discrete cosine transform the time-domain signal has been converted into a frequency-domain signal, which can then be filtered with a filter. Specifically, a Mel-filter can be used. Because full-precision floating-point operations are relatively complex and consume considerable power, and verification shows that using 4-bit Mel-filter parameters does not noticeably reduce the performance of the whole voice keyword detection system while still meeting the keyword detection accuracy requirement, this application preferably uses a 4-bit Mel-filter. The audio feature data of the audio stream to be detected is obtained after the frequency-domain signal has passed through the filter.
103. Detecting whether more than two continuous frames of audio signals in the audio stream to be detected contain voice signals or not according to the audio characteristics;
After the audio features are obtained, they can be used to detect whether each frame of audio signal in the audio stream to be detected contains a voice signal. Whether any given frame of the audio stream to be detected (the target audio signal) contains a voice signal can be detected in the following manner:
(1) processing the audio characteristics of the target audio signal by adopting a pre-constructed voice signal detection model to obtain a first probability that the target audio signal contains a voice signal and a second probability that the target audio signal does not contain the voice signal;
(2) if the first probability is greater than the second probability, determining that the target audio signal contains a voice signal, otherwise determining that the target audio signal does not contain a voice signal; and the bias value and the activation value of the voice signal detection model and the weight precision of each layer of neural network can be configured.
The filtered feature data can be stored in a dual-port SRAM, which can also hold the intermediate values generated during the neural network computation. Processing the audio features with the pre-constructed voice signal detection model yields the probability of whether the audio signal contains a voice signal. Specifically, the voice signal detection model may be a neural network model: the audio feature data stored in the corresponding SRAM is read into the processing unit of the neural network for computation, the input layer of the model takes the voice feature values, and the output layer uses a softmax function to output two probabilities, speech and non-speech; when the speech probability is greater than the non-speech probability, the current audio signal is judged to contain a voice signal. In addition, the bias values, the activation values and the weight precision of each neural network layer of the voice signal detection model are configurable; that is, the weight precision of different network layers can differ.
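As a hedged illustration of this per-frame decision (not the trained model itself), the sketch below runs a single linear layer followed by a two-way softmax and compares the speech and non-speech probabilities; all weights, biases and feature values are made up.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def vad_decision(feature_vec, weights, biases):
    # Single linear layer as a stand-in for the configurable fixed-point network.
    logits = [sum(w * f for w, f in zip(row, feature_vec)) + b
              for row, b in zip(weights, biases)]
    p_speech, p_nonspeech = softmax(logits)
    return p_speech > p_nonspeech          # speech frame if P(speech) is larger

# Toy usage with made-up parameters and a made-up 3-value feature frame:
print(vad_decision([0.4, 1.2, -0.3],
                   weights=[[0.5, 0.8, 0.1], [-0.2, -0.6, 0.3]],
                   biases=[0.0, 0.1]))      # True
```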
Further, before processing the audio feature of the target audio signal by using the pre-constructed speech signal detection model, the method may further include:
(1) detecting the signal-to-noise ratio of the audio stream to be detected;
(2) if the signal-to-noise ratio is larger than or equal to a first threshold value, setting the weight precision of each layer of neural network of the voice signal detection model as a first numerical value;
(3) and if the signal-to-noise ratio is smaller than the first threshold value, setting the weight precision of each layer of neural network of the voice signal detection model as a second numerical value, wherein the second numerical value is larger than the first numerical value.
Because the parameters of the voice signal detection model are configurable, they can be configured for the actual environment, according to its signal-to-noise ratio, before the model is used. Specifically, in an environment with a low signal-to-noise ratio, the weight precision of the model can be configured to a higher value, which guarantees the precision and accuracy of the detection; in an environment with a high signal-to-noise ratio, the weight precision can be configured to a lower value, which saves system power while avoiding a large drop in detection accuracy.
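The configuration step above might be sketched as follows; the threshold and the 2-bit/4-bit choices are assumptions used only for illustration (the application itself only requires the second numerical value to be larger than the first).

```python
def choose_weight_precision(snr_db, threshold_db=20.0,
                            low_precision_bits=2, high_precision_bits=4):
    """Pick the network weight precision before running the detection model.
    High SNR: lower precision is enough and saves power.
    Low SNR: raise precision to keep detection accuracy."""
    if snr_db >= threshold_db:
        return low_precision_bits
    return high_precision_bits

for snr in (5.0, 25.0):
    print(snr, "dB ->", choose_weight_precision(snr), "bit weights")
```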
If more than two consecutive frames of audio signals in the audio stream to be detected all contain voice signals, step 104 is executed; otherwise step 105 is executed.
104. Performing keyword detection operation on the audio stream to be detected according to the audio characteristics;
If more than two consecutive frames of audio signals in the audio stream to be detected all contain voice signals, this indicates that a voice signal is currently being detected, so the keyword detection module can be triggered to perform the subsequent keyword detection operation; for the specific keyword detection manner, reference may be made to the following embodiments.
105. Not performing the keyword detection operation on the audio stream to be detected.
If it is not detected that more than two consecutive frames of audio signals in the audio stream to be detected all contain voice signals, no voice signal is currently detected; in this case the keyword detection module does not need to be started, which saves system power.
After the audio stream to be detected is obtained, its audio features are first extracted, and the voice endpoint detection module can detect, from these audio features, whether each frame of audio signal of the audio stream contains a voice signal. The voice endpoint detection module triggers the keyword detection module only when it detects that more than two consecutive frames of audio signals in the audio stream all contain voice signals, which reduces the system power consumption caused by misjudgments of the voice endpoint detection module.
Referring to fig. 3, another method for detecting a speech keyword in an embodiment of the present application is shown, including:
301. acquiring an audio stream to be tested;
302. extracting the audio features of the audio stream to be detected;
303. detecting whether more than two continuous frames of audio signals in the audio stream to be detected contain voice signals or not according to the audio characteristics;
the steps 301-303 are the same as the steps 101-103, and the related description of the steps 101-103 can be referred to. If the audio signal of more than two consecutive frames in the audio stream to be tested contains a speech signal, step 304-305 is executed, otherwise step 306 is executed.
304. For each frame of audio signal in the audio stream to be detected, processing the audio signal with a pre-constructed keyword detection model to obtain the probability that the audio signal contains each preset voice keyword;
305. determining the keyword corresponding to the largest of these probabilities as the keyword of the audio signal;
The keyword detection module is started when it is detected that more than two consecutive frames of audio signals in the audio stream to be detected all contain voice signals. Processing the audio signal with the pre-constructed keyword detection model yields the probability of each preset voice keyword being contained in the audio signal, and the keyword with the highest probability is determined to be the keyword contained in the audio signal. Specifically, the keyword detection model may be a neural network model: the audio feature data stored in the corresponding SRAM is read into the processing unit of the neural network for computation, the input layer of the model is a matrix of voice feature values, and the activation function of the output layer is a softmax function that outputs a vector of N probabilities, where N is the number of preset voice keywords. By comparing the N probabilities, the keyword with the highest probability is taken as the keyword corresponding to the current audio stream. In addition, the bias values, the activation values and the weight precision of each neural network layer of the keyword detection model are configurable; that is, the weight precision of different network layers can differ.
Further, before processing the audio signal by using the pre-constructed keyword detection model, the method may further include:
(1) detecting the signal-to-noise ratio of the audio stream to be detected;
(2) if the signal-to-noise ratio is larger than or equal to a second threshold value, setting the weight precision of each layer of neural network of the keyword detection model as a third numerical value;
(3) and if the signal-to-noise ratio is smaller than the second threshold value, setting the weight precision of each layer of neural network of the keyword detection model as a fourth numerical value, wherein the fourth numerical value is larger than the third numerical value.
Because the parameters of the keyword detection model are configurable, they can be configured for the actual environment, according to its signal-to-noise ratio, before the model is used. Specifically, in an environment with a low signal-to-noise ratio, the weight precision of the model can be configured to a higher value, which guarantees the precision and accuracy of the detection; in an environment with a high signal-to-noise ratio, the weight precision can be configured to a lower value, which saves system power while avoiding a large drop in detection accuracy.
306. Not performing the keyword detection operation on the audio stream to be detected.
If it is not detected that more than two consecutive frames of audio signals in the audio stream to be detected all contain voice signals, no voice signal is currently detected; in this case the keyword detection module does not need to be started, which saves system power.
After the audio stream to be detected is obtained, its audio features are first extracted, and the voice endpoint detection module can detect, from these audio features, whether each frame of audio signal of the audio stream contains a voice signal. The voice endpoint detection module triggers the keyword detection module only when it detects that more than two consecutive frames of audio signals in the audio stream all contain voice signals, which reduces the system power consumption caused by misjudgments of the voice endpoint detection module. Compared with the first embodiment of the present application, this embodiment provides a specific implementation of performing the keyword detection operation on the audio stream to be detected according to the audio features.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
For ease of understanding, the speech keyword detection method proposed in the present application is described below in a practical application scenario.
The voice keyword detection method provided by the application can be applied to a voice keyword detection system shown in fig. 4, and the keyword detection system comprises a voice feature extraction module, a voice endpoint detection module, a keyword detection module and other functional modules. The voice feature extraction module comprises a discrete cosine transform module and a Mel-filter.
The working principle of the voice keyword detection system is as follows: the obtained audio stream to be detected is first input into the voice feature extraction module for feature extraction; the extracted audio features are then sent to the voice endpoint detection module for voice detection; if the voice endpoint detection module detects that more than two consecutive frames (such as 3 frames) of audio signals are all voice signals, the keyword detection module is started to detect keywords, and the voice keywords in the audio stream to be detected are finally obtained.
An embodiment of the present application further provides a chip for implementing the voice keyword detection method. A schematic structural diagram of the chip is shown in fig. 5. The chip mainly includes an I2C module, an I2S module, a framing module, an fdct module, a Mel-filter module, a VAD module (voice endpoint detection module), a KWS module (keyword detection module), a PE, Max-pool, Relu, a softmax function, and a Memory System module (SRAM, for storing data). Each module is described below:
(1) I2C module
I2C is a bidirectional two-wire synchronous serial bus. The I2C module is mainly used to transfer the weights and biases of the neural network parameters in the VAD module and the KWS module, as well as the parameters of the fdct module and the Mel-filter module; in addition, the various configuration registers in the chip and the enable bits of some modules are also written through I2C. After the VAD module and the KWS module make their decisions, the decision results can also be read out of the chip through I2C.
(2) I2S module
The chip internally contains an I2S receiving module, which operates with the timing of the standard I2S format and is used to receive raw audio data and write it into a 512 x 12-bit SRAM; once more than 512 samples have been written, the old data is overwritten and writing continues cyclically.
(3) framing module
The framing module performs the windowing and framing; for its specific working principle, reference may be made to the related content above.
(4) fdct module
The fdct module is a fast discrete cosine transform module that transforms the audio signal from the time domain to the frequency domain by performing a discrete cosine transform (DCT) on the plurality of audio segment frames obtained after framing. For an N-point frame x(n), the DCT is defined (up to normalization) as
X(k) = Σ_{n=0}^{N-1} x(n)·cos[(2n+1)kπ/(2N)], k = 0, 1, …, N-1.
In order to reduce the computational complexity of the transform, a recursive butterfly fast algorithm can be adopted, reducing the complexity from the order of N² for the direct computation to the order of N·log₂N. For N = 2^t (t > 0), the transform coefficients X(k) can be derived separately for the even-indexed part and the odd-indexed part (with k = 1, 2, …, N/2-1 in the recursion), each of which reduces to an N/2-point DCT. By this step-by-step decomposition, an N-point DCT is computed as two N/2-point DCTs, each N/2-point DCT as two N/4-point DCTs, and so on, until only 2-point DCTs remain, which greatly reduces the computational complexity.
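For concreteness, a plain (non-fast) reference implementation of the DCT defined above is sketched below; the chip uses the recursive butterfly algorithm instead of this O(N²) form, and the test frame here is an arbitrary example chosen so the transform output is easy to check.

```python
import math

def dct_ii(x):
    """Direct DCT-II of one frame: X(k) = sum_n x(n)*cos((2n+1)k*pi/(2N))."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
            for k in range(n)]

# Toy frame equal to the k = 10 DCT basis function, so the energy of the
# transformed signal concentrates in a single low-index coefficient.
n = 64
frame = [math.cos(math.pi * (2 * i + 1) * 10 / (2 * n)) for i in range(n)]
spectrum = dct_ii(frame)
print(max(range(n), key=lambda k: abs(spectrum[k])))   # 10
```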
(5) Mel-filter module
After the DCT, the time-domain signal has been transformed into a frequency-domain signal, which the Mel-filter module then filters. Because full-precision floating-point operations are relatively complex and consume considerable power, and verification shows that using 4-bit Mel-filter parameters does not significantly reduce the performance of the whole KWS system and still meets the requirements, a 4-bit Mel-filter is used in this application; its characteristic curves are shown in fig. 6. The Mel-filter operation is mainly a multiply-accumulate operation whose intermediate value is stored in the register reg; the operation is illustrated in fig. 7, and each output accumulates the products of the filter coefficients and the transformed values over the non-zero coefficients of the channel:
m(l) = Σ_k mel_para_l(k)·X(k), with k running over the num(l) non-zero coefficients starting at N_min(l),
where l denotes the filter-bank channel, mel_para_l(k) is the k-th coefficient of channel l, N_min(l) is the index of the first non-zero mel_para_l(k) of channel l, and num(l) is the number of non-zero mel_para_l(k) of channel l; mel_para_l, N_min(l) and num(l) are filter parameters with fixed values that can be written into the internal SRAM through I2C, X(k) is the value after the discrete cosine transform, and m(l) is the audio feature data required by the VAD module and the KWS module.
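The multiply-accumulate filtering above can be illustrated with the following sketch; the coefficient values, channel layout and spectrum are invented for the example and do not correspond to the actual 4-bit Mel-filter parameters.

```python
def mel_filter(X, mel_para, n_min, num):
    """Accumulate mel_para_l(k) * X(k) over the num(l) non-zero coefficients
    of each channel l, starting at index N_min(l)."""
    features = []
    for l in range(len(mel_para)):
        acc = 0                                  # intermediate value (register reg)
        for j in range(num[l]):
            k = n_min[l] + j
            acc += mel_para[l][j] * X[k]         # multiply and accumulate
        features.append(acc)
    return features

# Toy usage: 2 channels over an 8-point spectrum with small integer weights.
X = [3, 5, 2, 7, 1, 0, 4, 6]
print(mel_filter(X, mel_para=[[1, 2, 3], [2, 4, 1]], n_min=[0, 3], num=[3, 3]))
# [19, 18]
```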
(6) VAD module
The filtered audio feature data can be stored in a 3184 x 8-bit dual-port SRAM, which also stores the intermediate values generated during the neural network computation. After the filtering is completed, the Mel-filter module sets the VAD activation signal, thereby activating the VAD module. The main function of the VAD module is to generate the data of each layer of the neural network and the corresponding weight and bias addresses, read the data stored at those addresses in the corresponding SRAM into the processing element (PE) of the neural network for computation, and pass each layer through the activation function Relu to obtain its hidden values, which are stored back in the SRAM. The VAD module also contains the fully-connected neural network structure used by VAD, which is a configurable fixed-point fully-connected network: the input layer is one 32 x 1 frame of voice feature values, and the output layer is a softmax function that outputs the probabilities of speech and non-speech respectively; when the speech probability is greater than the non-speech probability, the current frame is judged to be a speech frame. A schematic diagram of the neural network structure adopted by the VAD module is shown in fig. 8.
(7) Relu function
The activation function adopted in this application is the Relu function, whose expression is:
f(x)=max(0,x)
When x is less than 0 the result is 0, and when x is greater than or equal to 0 the result is x itself. Since this application uses a fixed-point neural network, every x is mapped, after passing through the activation function, to the fixed-point value closest to it.
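A small sketch of this fixed-point activation is given below; the quantization step and saturation level are assumed values chosen only to illustrate the rounding to the nearest representable fixed-point level.

```python
def fixed_point_relu(x, step=0.25, max_level=255):
    """Apply max(0, x), then round to the nearest fixed-point code
    (step and max_level are illustrative assumptions)."""
    y = max(0.0, x)
    q = round(y / step)                 # nearest fixed-point code
    return min(q, max_level) * step

print([fixed_point_relu(v) for v in (-1.3, 0.1, 0.37, 5.0)])
# [0.0, 0.0, 0.25, 5.0]
```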
(8) KWS module
When voice signals are detected more than two times in a row (for example, 3 times), the KWS module is started to perform keyword detection. The main function of the KWS module is to generate the data of each layer of the KWS neural network and the corresponding weight and bias addresses, read the data stored at those addresses in the corresponding SRAM into the processing element (PE) of the neural network for computation, and temporarily store the computed result in the data SRAM after it passes through a max-pooling layer or directly through the Relu function. The neural network of the KWS module consists of a fixed-point convolutional neural network and a fully-connected neural network; a schematic diagram of the structure adopted by the KWS module is shown in fig. 9.
In fig. 9, the convolution operations of layers 2, 3 and 5 of the convolutional network each include a max-pooling layer. The input layer is a 32 x 32 feature-value matrix: 32 frames of voice feature data with 32 feature values per frame. The activation function of the output layer is a softmax function that outputs an 11 x 1 probability vector; the first 10 values represent the probabilities of 10 different preset keywords, the 11th value is the probability of a non-keyword, and the keyword corresponding to the current audio stream is found by comparing the 11 probabilities and taking the largest one. The activation function of the other layers is the Relu function, the convolution kernel size is 3 x 3, the numbers of convolution kernels in the layers are 8, 16, 32 and 64, and the stride of each layer is 1. In order to reduce the SRAM space required for temporarily storing data during the convolutional-network computation, a time-shared serial computation mode may be adopted in this application. For a convolution layer that does not need pooling, only three rows of data need to be buffered: computation is performed once three rows have been stored, and after the computation the lower two rows move up one row, which corresponds to the convolution kernel moving down by one row; the first row is then discarded so that the vacated third row can store a new row of data. For a convolution layer that needs pooling, four rows of data are buffered: computation is performed only when the four rows are full, after which the data of two rows moves up and the 3rd and 4th rows are vacated to store new data.
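The row-buffering scheme for a non-pooled 3 x 3 convolution layer can be modelled as below; the kernel, the feature rows and the single-channel, valid-padding simplifications are assumptions made for the illustration.

```python
def stream_conv3x3(rows, kernel):
    """Keep only three feature rows resident; each new row triggers one row
    of 3x3 convolution output, then the buffer shifts up by one row."""
    buf, outputs = [], []
    for row in rows:
        buf.append(row)
        if len(buf) == 3:                      # three rows stored -> compute
            width = len(row)
            out_row = []
            for c in range(width - 2):
                acc = sum(kernel[i][j] * buf[i][c + j]
                          for i in range(3) for j in range(3))
                out_row.append(acc)
            outputs.append(out_row)
            buf = buf[1:]                      # lower rows move up; slot freed
    return outputs

rows = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
identity_center = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(stream_conv3x3(rows, identity_center))   # [[6, 7], [10, 11]]
```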
(9) Neural network processing unit PE
The PE is a processing unit of the neural network and is responsible for calculating various data in the neural network. In order to reduce the area of the chip, only two PEs are used in the present application, and are respectively applied to the calculation of data with a weight precision of 2 bits and a weight precision of 4 bits, and a schematic structural diagram thereof is shown in fig. 10.
(10) Max-pooling maximum pooling unit
The pooling layers used in the convolutional network of this application have a size of 2 x 2; a 2 x 2 Max-pooling operation returns the maximum of the 4 numbers in the 2 x 2 matrix.
(11) softmax function
The activation functions of the VAD output layer and the KWS output layer are softmax functions, whose expression is
y_i = exp(x_i) / Σ_{j=1}^{M} exp(x_j), i = 1, 2, …, M,
where x_1, x_2, …, x_M are the output values of the VAD or KWS neural network; M is 2 for VAD and 11 for KWS.
In summary, compared with the prior art, the voice keyword detection method provided by the application has the following advantages:
1. the voice characteristics are simplified, the calculated amount is reduced, and the power consumption is reduced.
2. The KWS is triggered only when the voice signal is detected for more than two continuous frames, so that the power consumption caused by triggering the KWS due to VAD misjudgment can be reduced.
3. After the KWS module is triggered, each frame of new data is generated, the corresponding convolutional layer is calculated, calculation is not needed after all data are prepared, so that delay can be reduced, and above all, the size of a temporary storage data SRAM of the convolutional layer can be greatly reduced by about 76.66%.
4. The weight precision supporting each layer of neural network can be configured, and the bias and activation values of the neural network can be configured, that is, the weight precision of different layers of neural networks can be different, such as 4bit and 2 bit. When the network is configured to be 2bit, the corresponding neural network becomes a three-valued neural network, and the calculation of the network can be completed only by an adder without a multiplier. The bias and activation values of the whole VAD or the KWS can be configured to be 4 bits and 8 bits, so that the high-precision VAD can be configured to be high-precision in a low signal-to-noise ratio environment, and the low-precision VAD can be configured to be low-precision in a high signal-to-noise ratio environment, so that the power consumption can be saved in the high signal-to-noise ratio environment, and the method can also be suitable for various complex environments.
5. Because the weight precision of each layer of the neural network is configurable, layers with a large amount of computation can be configured with low-precision 2-bit weights and layers with little computation with 4-bit weights; compared with a network whose layers are all 4-bit, the overall performance does not drop much, yet most of the multiplication operations are eliminated.
6. The Discrete Cosine Transform (DCT) is a transform defined on real signals, and its result is likewise a real frequency-domain signal. After a discrete cosine transform, the energy of most natural signals (sound and images) is concentrated in the low-frequency part, so the signal energy features computed after the DCT are more concentrated and speech-related features are easier to extract.
7. VAD and KWS make their decisions from the same audio features, so they can share a single feature extraction module.
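To make advantages 4 and 5 concrete, the sketch below computes a dot product with ternary (2-bit) weights restricted to {-1, 0, +1} using only additions and subtractions; the quantization threshold and all names are assumptions for illustration, not the circuit of the application:

```python
import numpy as np

def ternarize(weights, threshold=0.05):
    """Map full-precision weights to {-1, 0, +1} (a simple assumed rule)."""
    w = np.asarray(weights)
    t = np.zeros(len(w), dtype=np.int8)
    t[w > threshold] = 1
    t[w < -threshold] = -1
    return t

def ternary_dot(x, w_ternary):
    """Dot product with ternary weights: additions and subtractions only,
    which is why an adder-based PE suffices when a layer is configured as 2-bit."""
    acc = 0.0
    for xi, wi in zip(x, w_ternary):
        if wi == 1:
            acc += xi
        elif wi == -1:
            acc -= xi                 # wi == 0 contributes nothing
    return acc

x = np.array([0.3, -1.2, 0.7, 0.05])
w = np.array([0.4, -0.01, -0.6, 0.2])
print(ternary_dot(x, ternarize(w)))   # same value as np.dot(x, ternarize(w))
```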
The above mainly describes a voice keyword detection method, and a voice keyword detection apparatus will be described below.
Referring to fig. 11, an embodiment of a speech keyword detection apparatus in an embodiment of the present application includes:
a to-be-tested-audio acquiring module 401, configured to acquire an audio stream to be tested;
an audio feature extraction module 402, configured to extract an audio feature of the audio stream to be detected;
a voice signal detection module 403, configured to detect, according to the audio features, whether the audio signals of more than two consecutive frames in the audio stream to be detected all contain voice signals;
and a keyword detection module 404, configured to perform a keyword detection operation on the audio stream to be detected according to the audio feature if the audio signals of more than two consecutive frames in the audio stream to be detected all include a voice signal.
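How the four modules cooperate, including the consecutive-frame gating of KWS by VAD, can be sketched as follows; `extract_features`, `vad_model` and `kws_model` are placeholders for the feature extraction, voice signal detection and keyword detection models rather than implementations from the application:

```python
def detect_keywords(frames, extract_features, vad_model, kws_model,
                    required_voice_frames=2):
    """Run VAD on every frame and call KWS only once more than
    `required_voice_frames` consecutive frames contain a voice signal."""
    consecutive_voice = 0
    for frame in frames:
        feats = extract_features(frame)           # features shared by VAD and KWS
        p_voice, p_no_voice = vad_model(feats)
        consecutive_voice = consecutive_voice + 1 if p_voice > p_no_voice else 0
        if consecutive_voice > required_voice_frames:
            keyword = kws_model(feats)            # KWS reuses the same features
            if keyword is not None:
                yield keyword
```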
Further, the audio feature extraction module may include:
the framing unit is used for framing the audio stream to be tested to obtain a plurality of audio segment frames;
a cosine transform unit, configured to perform discrete cosine transform processing on the multiple audio segment frames to obtain frequency domain signals of the multiple audio segment frames;
and the filtering unit is used for filtering the frequency domain signal by adopting a filter to obtain the audio characteristics of the audio stream to be detected.
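A minimal sketch of the frame → DCT → filter pipeline just described, using SciPy's DCT. The log-energy rectangular bands stand in for whatever filter the application actually uses, and the number of features is an assumption:

```python
import numpy as np
from scipy.fft import dct

def frame_features(frame, n_features=32):
    """DCT one audio frame, then summarize its (low-frequency-concentrated)
    energy with a crude band filter, one log-energy value per band."""
    spectrum = dct(np.asarray(frame, dtype=float), type=2, norm='ortho')
    energy = spectrum ** 2
    bands = np.array_split(energy, n_features)     # simple rectangular filter bank
    return np.log(np.array([band.sum() for band in bands]) + 1e-10)
```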
Further, the framing unit may include:
the audio stream writing subunit is used for writing the audio stream to be tested into a storage device with a preset size, wherein when the storage device is full, the newly written data covers the old data;
the first audio frame extracting subunit is used for extracting data which is written currently by the storage device when the storage device is full for the first time to serve as a first audio fragment frame;
a second audio frame extraction subunit, configured to, when the storage device is full of data for the first time and data with a data length half of the preset size is written again, extract data that has been currently written in the storage device, and use the data as a second audio segment frame;
a third audio frame extraction subunit, configured to, when the storage device is full of data written for the first time and data with the data length equal to the preset size is written again, extract data that has been currently written in the storage device, and use the data as a third audio segment frame;
and the audio clip frame extraction subunit is used for continuously repeating the operation until all data of the audio stream to be tested are traversed to obtain the plurality of audio clip frames.
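A minimal Python sketch of the framing scheme above: a ring buffer of a preset size is filled, the first frame is taken when it is first full, and a new frame is taken every time a further half-buffer of samples has been written, so consecutive frames overlap by 50%. The 512-sample buffer size is an assumed value for illustration:

```python
import numpy as np

def frame_stream(samples, frame_size=512):
    """Yield overlapping audio segment frames from a ring buffer: the first
    frame when the buffer is first full, then one new frame for every
    additional frame_size // 2 samples written (newest data overwrites oldest)."""
    hop = frame_size // 2
    buf = np.zeros(frame_size)
    written = 0                              # total samples written so far
    for s in samples:
        buf[written % frame_size] = s        # overwrite the oldest sample
        written += 1
        if written >= frame_size and written % hop == 0:
            start = written % frame_size     # position of the oldest sample
            yield np.concatenate([buf[start:], buf[:start]])
```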
Further, the voice signal detection module may include:
the voice detection unit is used for processing the audio characteristics of the target audio signal by adopting a pre-constructed voice signal detection model to obtain a first probability that the target audio signal contains the voice signal and a second probability that the target audio signal does not contain the voice signal;
a voice determination unit, configured to determine that the target audio signal contains a voice signal if the first probability is greater than the second probability, and otherwise determine that the target audio signal does not contain a voice signal; and the bias value and the activation value of the voice signal detection model and the weight precision of each layer of neural network can be configured.
Further, the voice signal detection module may further include:
the first signal-to-noise ratio detection unit is used for detecting the signal-to-noise ratio of the audio stream to be detected;
the first model weight setting unit is used for setting the weight precision of each layer of neural network of the voice signal detection model as a first numerical value if the signal-to-noise ratio is greater than or equal to a first threshold value;
and the second model weight setting unit is used for setting the weight precision of each layer of neural network of the voice signal detection model as a second numerical value if the signal-to-noise ratio is smaller than the first threshold value, wherein the second numerical value is larger than the first numerical value.
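The SNR-driven precision choice above might be sketched as follows. The noise-floor-based SNR estimate, the 20 dB threshold, and the concrete 2-bit / 4-bit values are assumptions used only for illustration (the application merely requires the second value to be larger than the first):

```python
import numpy as np

def estimate_snr_db(signal, noise_floor=1e-3):
    """Crude SNR estimate: mean signal power against an assumed noise floor."""
    power = np.mean(np.square(np.asarray(signal, dtype=float)))
    return 10.0 * np.log10(power / noise_floor + 1e-12)

def select_weight_precision(snr_db, threshold_db=20.0):
    """Low weight precision (first value) at or above the threshold,
    higher precision (second value) below it, as described above."""
    return 2 if snr_db >= threshold_db else 4
```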
Further, the keyword detection module may include:
the keyword detection unit is used for processing each frame of audio signal in the audio stream to be detected by adopting a pre-established keyword detection model to obtain the probability that each preset voice keyword is contained in the audio signal;
a keyword determining unit, configured to determine the preset speech keyword with the largest of those probabilities as the keyword contained in the audio signal; and the bias value, the activation value and the weight precision of each layer of the neural network of the keyword detection model can be configured.
Further, the keyword detection module may further include:
the second signal-to-noise ratio detection unit is used for detecting the signal-to-noise ratio of the audio stream to be detected;
the third model weight setting unit is used for setting the weight precision of each layer of neural network of the keyword detection model as a third numerical value if the signal-to-noise ratio is greater than or equal to a second threshold value;
and the fourth model weight setting unit is used for setting the weight precision of each layer of the neural network of the keyword detection model as a fourth numerical value if the signal-to-noise ratio is smaller than the second threshold, wherein the fourth numerical value is larger than the third numerical value.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the steps of any one of the voice keyword detection methods shown in fig. 1 or fig. 3 are implemented.
An embodiment of the present application further provides a computer program product, which, when running on a terminal device, causes the terminal device to execute steps for implementing any one of the voice keyword detection methods shown in fig. 1 or fig. 3.
Fig. 12 is a schematic diagram of a terminal device according to an embodiment of the present application. As shown in fig. 12, the terminal device 5 of this embodiment includes: a processor 50, a memory 51 and a computer program 52 stored in said memory 51 and executable on said processor 50. The processor 50, when executing the computer program 52, implements the steps of the above-mentioned embodiments of the neural network-based speech keyword detection method, such as the steps 101 to 105 shown in fig. 1. Alternatively, the processor 50, when executing the computer program 52, implements the functions of each module/unit in the above-mentioned device embodiments, for example, the functions of the modules 401 to 404 shown in fig. 11.
The computer program 52 may be divided into one or more modules/units, which are stored in the memory 51 and executed by the processor 50 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program 52 in the terminal device 5.
The Processor 50 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 51 may be an internal storage unit of the terminal device 5, such as a hard disk or a memory of the terminal device 5. The memory 51 may also be an external storage device of the terminal device 5, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 5. Further, the memory 51 may also include both an internal storage unit and an external storage device of the terminal device 5. The memory 51 is used for storing the computer program and other programs and data required by the terminal device. The memory 51 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the methods in the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and, when executed by a processor, realizes the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be suitably increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.
Claims (10)
1. A method for detecting a voice keyword is characterized by comprising the following steps:
acquiring an audio stream to be tested;
extracting the audio features of the audio stream to be detected;
detecting whether more than two continuous frames of audio signals in the audio stream to be detected contain voice signals or not according to the audio characteristics;
and if the audio signals of more than two continuous frames in the audio stream to be detected contain voice signals, executing keyword detection operation on the audio stream to be detected according to the audio characteristics.
2. The method for detecting speech keywords according to claim 1, wherein extracting the audio features of the audio stream to be detected comprises:
performing framing processing on the audio stream to be tested to obtain a plurality of audio clip frames;
performing discrete cosine transform processing on the plurality of audio segment frames to obtain frequency domain signals of the plurality of audio segment frames;
and filtering the frequency domain signal by adopting a filter to obtain the audio characteristics of the audio stream to be detected.
3. The method of claim 2, wherein the step of framing the audio stream to be tested to obtain a plurality of audio segment frames comprises:
writing the audio stream to be tested into a storage device with a preset size, wherein when the storage device is full, the newly written data covers the old data;
when the storage device is full for the first time, extracting the data written by the storage device currently to serve as a first audio clip frame;
when the storage device writes data with the data length half of the preset size again after the storage device is full for the first time, extracting the currently written data of the storage device to be used as a second audio clip frame;
when the storage device writes data with the data length of the preset size again after the storage device is full for the first time, extracting the currently written data of the storage device to serve as a third audio clip frame;
and continuously repeating the operation until all data of the audio stream to be tested are traversed to obtain the plurality of audio fragment frames.
4. The method for detecting speech keywords according to claim 1, wherein whether any frame of target audio signal in the audio stream to be detected contains speech signals is detected by:
processing the audio characteristics of the target audio signal by adopting a pre-constructed voice signal detection model to obtain a first probability that the target audio signal contains a voice signal and a second probability that the target audio signal does not contain the voice signal;
if the first probability is greater than the second probability, determining that the target audio signal contains a voice signal, otherwise determining that the target audio signal does not contain a voice signal;
and the bias value and the activation value of the voice signal detection model and the weight precision of each layer of neural network can be configured.
5. The method of claim 4, wherein before the processing the audio features of the target audio signal using the pre-constructed speech signal detection model, the method further comprises:
detecting the signal-to-noise ratio of the audio stream to be detected;
if the signal-to-noise ratio is larger than or equal to a first threshold value, setting the weight precision of each layer of neural network of the voice signal detection model as a first numerical value;
and if the signal-to-noise ratio is smaller than the first threshold value, setting the weight precision of each layer of neural network of the voice signal detection model as a second numerical value, wherein the second numerical value is larger than the first numerical value.
6. The method according to any one of claims 1 to 5, wherein the performing keyword detection on the audio stream to be detected according to the audio features comprises:
processing the audio signal by adopting a pre-established keyword detection model aiming at each frame of audio signal in the audio stream to be detected to obtain the probability that the audio signal contains each preset voice keyword;
determining the keyword corresponding to the maximum probability in the probabilities containing all preset voice keywords as the keyword of the audio signal;
and the bias value, the activation value and the weight precision of each layer of neural network of the keyword detection model can be configured.
7. The method of claim 6, wherein before the audio signal is processed using the pre-constructed keyword detection model, the method further comprises:
detecting the signal-to-noise ratio of the audio stream to be detected;
if the signal-to-noise ratio is larger than or equal to a second threshold value, setting the weight precision of each layer of neural network of the keyword detection model as a third numerical value;
and if the signal-to-noise ratio is smaller than the second threshold value, setting the weight precision of each layer of neural network of the keyword detection model as a fourth numerical value, wherein the fourth numerical value is larger than the third numerical value.
8. A speech keyword detection apparatus, comprising:
the audio acquisition module to be tested is used for acquiring an audio stream to be tested;
the audio characteristic extraction module is used for extracting the audio characteristics of the audio stream to be detected;
the voice signal detection module is used for detecting whether the audio signals of more than two continuous frames in the audio stream to be detected contain voice signals or not according to the audio features;
and the keyword detection module is used for executing keyword detection operation on the audio stream to be detected according to the audio characteristics if the audio signals of more than two continuous frames in the audio stream to be detected contain voice signals.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the speech keyword detection method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method for detecting a speech keyword as claimed in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011225861.1A CN112397086A (en) | 2020-11-05 | 2020-11-05 | Voice keyword detection method and device, terminal equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112397086A true CN112397086A (en) | 2021-02-23 |
Family
ID=74598237
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011225861.1A Pending CN112397086A (en) | 2020-11-05 | 2020-11-05 | Voice keyword detection method and device, terminal equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112397086A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101197130A (en) * | 2006-12-07 | 2008-06-11 | 华为技术有限公司 | Sound activity detecting method and detector thereof |
CN102831895A (en) * | 2012-08-29 | 2012-12-19 | 山东大学 | Method for achieving MFCC (Mel Frequency Cepstrum Coefficient) parameter extraction by field-programmable gate array |
KR101704925B1 (en) * | 2015-10-22 | 2017-02-09 | 한양대학교 산학협력단 | Voice Activity Detection based on Deep Neural Network Using EVS Codec Parameter and Voice Activity Detection Method thereof |
CN110600019A (en) * | 2019-09-12 | 2019-12-20 | 东南大学 | Convolution neural network computing circuit based on speech signal-to-noise ratio pre-grading in real-time scene |
CN111276124A (en) * | 2020-01-22 | 2020-06-12 | 苏州科达科技股份有限公司 | Keyword identification method, device and equipment and readable storage medium |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114817456A (en) * | 2022-03-10 | 2022-07-29 | 马上消费金融股份有限公司 | Keyword detection method and device, computer equipment and storage medium |
CN114817456B (en) * | 2022-03-10 | 2023-09-05 | 马上消费金融股份有限公司 | Keyword detection method, keyword detection device, computer equipment and storage medium |
CN116705017A (en) * | 2022-09-14 | 2023-09-05 | 荣耀终端有限公司 | Voice detection method and electronic equipment |
CN116705017B (en) * | 2022-09-14 | 2024-07-05 | 荣耀终端有限公司 | Voice detection method and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110718211B (en) | Keyword recognition system based on hybrid compressed convolutional neural network | |
CN109658943B (en) | Audio noise detection method and device, storage medium and mobile terminal | |
CN112397086A (en) | Voice keyword detection method and device, terminal equipment and storage medium | |
CN107240396B (en) | Speaker self-adaptation method, device, equipment and storage medium | |
CN111667843B (en) | Voice wake-up method and system for terminal equipment, electronic equipment and storage medium | |
CN111357051A (en) | Speech emotion recognition method, intelligent device and computer readable storage medium | |
CN111144457A (en) | Image processing method, device, equipment and storage medium | |
JP7566969B2 (en) | Lightweight model training method, image processing method, lightweight model training device, image processing device, electronic device, storage medium, and computer program | |
CN112669819A (en) | Extremely-low-power-consumption voice feature extraction circuit based on non-overlapping framing and serial FFT (fast Fourier transform) | |
CN117574970A (en) | Inference acceleration method, system, terminal and medium for large-scale language model | |
WO2019128248A1 (en) | Signal processing method and apparatus | |
CN115544227A (en) | Multi-modal data emotion analysis method, device, equipment and storage medium | |
CN108847251B (en) | Voice duplicate removal method, device, server and storage medium | |
CN113012689B (en) | Electronic equipment and deep learning hardware acceleration method | |
CN114420135A (en) | Attention mechanism-based voiceprint recognition method and device | |
CN112735466A (en) | Audio detection method and device | |
CN116312494A (en) | Voice activity detection method, voice activity detection device, electronic equipment and readable storage medium | |
CN116597814A (en) | Voice wake-up method and system based on time domain binary neural network | |
CN114299975A (en) | Voice noise reduction method and device, computer equipment and storage medium | |
CN114155868A (en) | Voice enhancement method, device, equipment and storage medium | |
CN112908307A (en) | Audio feature extraction method, system, device and medium | |
CN110852202A (en) | Video segmentation method and device, computing equipment and storage medium | |
CN114900730B (en) | Method and device for acquiring delay estimation steady state value, electronic equipment and storage medium | |
CN115035897B (en) | Keyword detection method and system | |
CN117894306B (en) | Voice processing method, device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |