CN111091839B - Voice awakening method and device, storage medium and intelligent device - Google Patents


Info

Publication number
CN111091839B
CN111091839B (application CN202010198736.XA; published as CN111091839A)
Authority
CN
China
Prior art keywords
matrix
category
voice
probability
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010198736.XA
Other languages
Chinese (zh)
Other versions
CN111091839A (en)
Inventor
徐泓洋
王广新
杨汉丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Youjie Zhixin Technology Co ltd
Original Assignee
Shenzhen Youjie Zhixin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Youjie Zhixin Technology Co., Ltd.
Priority to CN202010198736.XA
Publication of CN111091839A
Application granted
Publication of CN111091839B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/22: Interactive procedures; Man-machine interfaces
    • G10L 17/06: Decision making techniques; Pattern matching strategies
    • G10L 17/18: Artificial neural networks; Connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a voice wake-up method and device, a storage medium, and an intelligent device. The method comprises the following steps: performing encoding calculation on an input voice sequence through an Encoder, and outputting a first matrix with the same row and column number as the voice sequence; performing linear re-expression on the first matrix through a feedforward neural network, and outputting a second matrix with the same row and column number as the first matrix; performing dimension compression on the second matrix through soft-attention to obtain an attention vector; identifying the probabilities of a plurality of categories from the attention vector; and judging whether to execute the wake-up function according to the probability result of the categories. By using structures such as the Encoder, the feedforward neural network, and soft-attention, and by drawing on the internal structure of the Transformer, an attention vector containing both local and global information is finally generated to obtain the category probabilities, and whether to execute the wake-up function is judged according to the probability result. The parameter quantity is small, end-to-end voice wake-up judgment is realized, and the response speed of the method is higher, so the method is well suited to voice wake-up.

Description

Voice awakening method and device, storage medium and intelligent device
Technical Field
The present invention relates to the field of voice wake-up, and in particular, to a voice wake-up method, apparatus, storage medium, and intelligent device.
Background
In existing fields such as translation and speech recognition, the basic network structures currently available for constructing a speech recognition model include CNN, RNN/LSTM, the multi-head attention mechanism, and the like, and each manufacturer selects the network suited to its own application requirements. The Transformer, which is based on the multi-head attention mechanism, achieves a better effect than prediction models based on CNN/LSTM combined with a CTC (Connectionist Temporal Classification) structure, which shows that the multi-head attention mechanism has particular advantages in feature extraction; however, the Transformer structure is more complex and its model is larger, so the Transformer is not suitable for a voice wake-up scene.
Disclosure of Invention
The invention mainly aims to provide a voice awakening method, a voice awakening device, a storage medium and intelligent equipment, and can solve the problem that the existing Transformer structure is not suitable for voice awakening.
The invention provides a voice awakening method, which comprises the following steps:
coding and calculating an input voice sequence through an Encoder, and outputting a first matrix with the same row and column number as the voice sequence;
performing linear re-expression on the first matrix through a feedforward neural network, and outputting a second matrix, wherein the row and column number of the second matrix is the same as the row and column number of the first matrix;
performing dimension compression on the second matrix through soft-attention to obtain an attention vector;
identifying probabilities of a plurality of classes from the attention vector;
and judging whether to execute the awakening function or not according to the probability result of the category.
Further, the step of determining whether to execute the wake-up function according to the probability result of the category includes:
extracting the category with the highest probability as the identified category;
judging whether the identified category is a target category:
if yes, judging whether the probability of the identified category reaches a threshold value;
if so, executing the awakening function corresponding to the category;
if not, the identification result is ignored, and the awakening function is not executed.
Further, the step of performing encoding calculation on the input voice sequence through the Encoder and outputting a first matrix with the same row and column number as the voice sequence includes:
and performing coding calculation on the input voice sequence through N layers of superposed encoders, and outputting a first matrix with the same row number and column number as the voice sequence, wherein N is a positive integer.
Further, the step of identifying probabilities of the plurality of classes based on the attention vector includes:
inputting the attention vector into a fully-connected layer for classification to obtain a plurality of categories;
the probability of belonging to each category is calculated according to the softmax function.
Further, the row and column number of the voice sequence and of the first matrix is 1 × 99 × 40.
Further, the row and column number of the attention vector is 40 × 1.
Further, before the step of performing encoding calculation on the input speech sequence by the Encoder and outputting the first matrix with the same number of rows and columns as the speech sequence, the method includes:
and extracting FBANK features from the recorded original audio frame to obtain a voice sequence, wherein the voice sequence is a matrix with the row and column number of 1 × 99 × 40.
The present application further provides a voice wake-up apparatus, including:
the encoding module is used for encoding and calculating the input voice sequence through the Encoder and outputting a first matrix with the same row and column number as the voice sequence;
the re-expression module is used for performing linear re-expression on the first matrix through a feedforward neural network and outputting a second matrix, wherein the row and column number of the second matrix is the same as that of the first matrix;
the attention vector acquisition module is used for carrying out dimension compression on the second matrix through soft-attention to obtain an attention vector;
the probability acquisition module is used for identifying the probabilities of a plurality of categories according to the attention vector;
and the awakening judgment module is used for judging whether to execute the awakening function according to the probability result of the category.
The present application also proposes a storage medium, which is a computer-readable storage medium, on which a computer program is stored, which when executed implements the above-mentioned voice wake-up method.
The application also provides an intelligent device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor implements the voice wake-up method when executing the computer program.
According to the voice wake-up method, by using structures such as the Encoder, the feedforward neural network, and soft-attention, and by drawing on the internal structure of the Transformer, an Encoder is constructed based on the multi-head attention mechanism; the input data is encoded by the Encoder, calculated by the feedforward neural network, and then dimension-compressed, finally generating an attention vector containing both local and global information to obtain the category probabilities. Whether to execute the wake-up function is judged according to the probability result of the category, and end-to-end voice wake-up judgment is realized, so the response speed of the voice wake-up method is higher and the method is suitable for voice wake-up.
Drawings
FIG. 1 is a schematic diagram of a step structure of an embodiment of a voice wake-up method according to the present invention;
FIG. 2 is a schematic structural diagram of a voice wake-up apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an embodiment of a storage medium according to the present invention;
fig. 4 is a schematic structural diagram of an embodiment of the smart device of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As used herein, the singular forms "a", "an", and "the" include plural referents unless the content clearly dictates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, units, modules, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, units, modules, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Referring to fig. 1, the voice wake-up method of the present invention includes the following steps:
s1, carrying out coding calculation on the input voice sequence through an Encoder, and outputting a first matrix with the same row and column number as the voice sequence;
s2, performing linear re-expression on the first matrix through a feedforward neural network, and outputting a second matrix, wherein the row and column number of the second matrix is the same as the row and column number of the first matrix;
s3, performing dimension compression on the second matrix through soft-attribute to obtain an attention vector;
s4, identifying the probability of a plurality of categories according to the attention vector;
and S5, judging whether to execute the awakening function according to the probability result of the category.
In step S1, in some embodiments, for an Encoder, the input feature sequence (e.g., a 1 × 99 × 40 matrix) is linearly transformed (dense layer) and then evenly divided into n parts, where n is the number of heads. Each head performs a matrix operation (self-attention) to obtain a weighted small matrix (e.g., 1 × 99 × 8); the n small matrices are concatenated back into a large matrix (i.e., a 1 × 99 × 40 matrix); the result of the linear transformation is then added to the large matrix through a residual connection, and the first matrix, still 1 × 99 × 40, is output after normalization. The Encoder may be followed by a next Encoder, or there may be only one Encoder; the Encoder learns the feature information from a local perspective.
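The encoder calculation described above can be sketched as follows. This is an illustrative numpy sketch, not the patent's implementation: the random weights are stand-ins for learned parameters, n = 5 heads follows from the 1 × 99 × 40 input split into 1 × 99 × 8 slices, and the leading batch dimension of 1 is dropped.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_block(x, n_heads=5, seed=0):
    """One Encoder: dense layer, multi-head self-attention over equal slices,
    residual connection with the linear transformation, then normalization.
    Shapes follow the patent's example: x is (99, 40), each head a (99, 8) slice."""
    rng = np.random.default_rng(seed)
    t, d = x.shape                        # t = 99 frames, d = 40 features
    w = rng.standard_normal((d, d)) / np.sqrt(d)
    h = x @ w                             # linear transformation (dense layer)
    head_dim = d // n_heads               # 40 / 5 = 8
    outs = []
    for i in range(n_heads):
        q = h[:, i * head_dim:(i + 1) * head_dim]        # (99, 8) slice per head
        attn = softmax(q @ q.T / np.sqrt(head_dim))      # (99, 99) attention weights
        outs.append(attn @ q)                            # weighted small matrix (99, 8)
    concat = np.concatenate(outs, axis=1)                # spliced back to (99, 40)
    y = concat + h                                       # residual connection
    y = (y - y.mean(-1, keepdims=True)) / (y.std(-1, keepdims=True) + 1e-6)
    return y                                             # first matrix, still (99, 40)

print(encoder_block(np.ones((99, 40))).shape)   # (99, 40)
```

The row and column number is preserved end to end, which is what lets Encoders be stacked.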
In the above step S2, the feedforward neural network is a point-wise (position-wise) feedforward neural network, which applies the same linear re-expression to every position of the first matrix independently.
In the above step S3, soft-attention learns the feature information from a global perspective, and the dimension compression simplifies the data processed in the next step, thereby increasing the processing speed.
In the above step S4, in some embodiments, a plurality of categories are obtained by inputting the attention vector into the fully-connected layer for classification; and calculating the probability of belonging to each category according to a softmax function, wherein the maximum value of the probability of each category is 1, and the minimum value of the probability of each category is 0.
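The classification in step S4 can be sketched as a fully-connected layer followed by softmax. The three categories and random weights here are illustrative assumptions; the output probabilities each lie in [0, 1] as the text states, and sum to 1.

```python
import numpy as np

def classify(attn_vec, n_classes=3, seed=0):
    """Fully-connected layer + softmax: maps the 40-dim attention vector to
    one probability per category."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((attn_vec.shape[0], n_classes))
    logits = attn_vec @ w                 # fully-connected layer
    e = np.exp(logits - logits.max())
    return e / e.sum()                    # softmax probabilities

probs = classify(np.ones(40))
print(round(float(probs.sum()), 6))   # 1.0
```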
In the above step S5, the category with the highest probability may be extracted as the identified category; then judging whether the identified category is a target category: if yes, judging whether the probability of the identified category reaches a threshold value; if so, executing the awakening function corresponding to the category; if not, the identification result is ignored, and the awakening function is not executed.
According to the voice wake-up method, the matrix output by the Encoder is linearly re-expressed through the feedforward neural network, and the matrix output by the feedforward neural network is dimension-compressed through soft-attention to obtain the attention vector, from which the probabilities of a plurality of categories are obtained. By using structures such as the Encoder, the feedforward neural network, and soft-attention, and by drawing on the internal structure of the Transformer, the Encoder is constructed based on the multi-head attention mechanism; the input data is encoded by the Encoder, calculated by the feedforward neural network, and then dimension-compressed by the soft-attention mechanism, finally generating an attention vector containing both local and global information. The category probabilities are obtained from this vector, and whether to execute the wake-up function is judged according to the probability result. The parameter quantity is small, end-to-end voice wake-up judgment is realized, and the response speed of the method is higher. Moreover, on the basis of the local feature learning of the attention inside the Encoder, the global feature learning of soft-attention is added, so the learning capability is stronger, the recognition effect is better, and the method is well suited to voice wake-up.
Further, in some embodiments, the step S5 of determining whether to execute the wake-up function according to the probability result of the category includes:
s51, extracting the category with the highest probability as the identified category;
s52, judging whether the identified category is a target category:
s53, if yes, judging whether the probability of the identified category reaches a threshold value;
s54, if yes, executing the awakening function corresponding to the category;
and S55, if not, ignoring the identification result and not executing the awakening function.
In the above steps S51-S55, the threshold is a predetermined probability reference value, for example, 80%, when the probability of the identified category is greater than or equal to 80%, the probability of the identified category is determined to reach the threshold, and when the probability of the identified category is less than 80%, the probability of the identified category is determined not to reach the threshold, wherein the wake-up functions corresponding to different categories may be different.
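Steps S51 to S55 amount to a simple decision rule, sketched below with illustrative category names (the patent does not name its categories) and the 80% reference value from the text as the default threshold.

```python
def should_wake(probs, target_categories, threshold=0.8):
    """S51-S55: pick the category with the highest probability; wake only if
    it is a target category and its probability reaches the threshold."""
    category = max(probs, key=probs.get)       # S51: most probable category
    if category not in target_categories:      # S52: is it a target category?
        return False                           # S55: ignore the result, do not wake
    return probs[category] >= threshold        # S53/S54: threshold check

print(should_wake({"wake_word": 0.85, "noise": 0.15}, {"wake_word"}))  # True
print(should_wake({"wake_word": 0.60, "noise": 0.40}, {"wake_word"}))  # False
```

Different target categories could map to different wake-up functions by returning the category instead of a boolean.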
Further, the step S1 of performing encoding calculation on the input speech sequence by the Encoder and outputting the first matrix having the same number of rows and columns as the speech sequence includes:
and S11, coding and calculating the input voice sequence through N layers of superposed encoders, and outputting a first matrix with the same row and column number as the voice sequence, wherein N is a positive integer.
In step S11, the first Encoder performs encoding calculation on the input speech sequence, outputs a matrix having the same row and column number as the speech sequence, and then sequentially inputs the matrix output by the previous Encoder for the next Encoder, and outputs the matrix having the same row and column number until the last Encoder outputs the first matrix; wherein, N is preferably 6, which not only ensures higher accuracy of the awakening result, but also can ensure higher processing speed.
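The stacking described in step S11 reduces to a loop in which each layer's output feeds the next, with the row and column number preserved throughout. The sketch below uses a shape-preserving placeholder in place of a full encoder block; N = 6 follows the preferred embodiment.

```python
import numpy as np

def stacked_encoders(x, n_layers=6, layer_fn=None):
    """N stacked Encoders: each layer consumes the previous layer's output
    and keeps the row and column number unchanged, so the last layer emits
    the first matrix.  `layer_fn` stands in for one encoder block."""
    if layer_fn is None:
        layer_fn = np.tanh   # placeholder layer, preserves shape
    for _ in range(n_layers):
        x = layer_fn(x)      # output of one Encoder is the input of the next
    return x

print(stacked_encoders(np.ones((99, 40))).shape)   # (99, 40)
```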
Further, the step S4 of identifying probabilities of the plurality of categories according to the attention vector includes:
s41, inputting the attention vector into the full-connection layer for classification to obtain a plurality of classes;
and S42, calculating the probability of belonging to each category according to the softmax function.
In step S41, only the predetermined categories need to be recognized; because the number of recognizable categories is small, the recognition speed is high, unlike full speech recognition, where the large number of recognizable categories results in a slower operation speed.
In the above step S42, the maximum value of the probability for each category is 1, i.e., 100%, and the minimum value of the probability for each category is 0.
By using structures such as the Encoder, the feedforward neural network, and soft-attention, and by drawing on the internal structure of the Transformer, the Encoder is constructed based on the multi-head attention mechanism. The input data is encoded by the Encoder, calculated by the feedforward neural network, and then dimension-compressed by the soft-attention mechanism, finally generating an attention vector containing both local and global information; the category probabilities are obtained, and whether to execute the wake-up function is judged according to the probability result. The parameter quantity is small, end-to-end voice wake-up judgment is realized, the response speed of the voice wake-up method is higher, and the method is suitable for voice wake-up.
In some embodiments, the categories may simply be "yes" and "no"; the category with the highest probability is selected, and whether to execute the wake-up function is determined directly from the selected category: if the selected category is "yes", the wake-up function is executed, and if the selected category is "no", the wake-up function is not executed.
Further, preferably, the row and column number of the voice sequence and of the first matrix is 1 × 99 × 40, which ensures both a higher processing speed and higher accuracy; in some embodiments, the voice sequence may adopt other row and column numbers, determined according to different requirements.
Further, the row and column number of the attention vector is 40 × 1: the second matrix with the row and column number of 1 × 99 × 40 is compressed along its 99-frame dimension, yielding an attention vector with the row and column number of 40 × 1.
Further, before step S1 of performing encoding calculation on the input speech sequence by the Encoder and outputting the first matrix having the same number of rows and columns as the speech sequence, the method includes:
and S1a, extracting FBANK features from the recorded original audio frame to obtain a voice sequence, wherein the voice sequence is a matrix with the row and column number of 1 x 99 x 40.
In the above step S1a, the FBANK feature has 40 dimensions; the sampling rate of the original audio is 16000 Hz and the audio duration is 1 second; the audio is framed with twenty milliseconds as the window length and ten milliseconds as the step length, so the row and column number of the voice sequence is 1 × 99 × 40. In some embodiments, parameters such as the sampling rate, window length, and step length may be adjusted appropriately according to the application requirements.
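The frame count of 99 follows directly from the stated framing parameters, as this small arithmetic sketch shows:

```python
def num_frames(n_samples=16000, sample_rate=16000, win_ms=20, hop_ms=10):
    """Frame count for the framing described above: 1 second of 16 kHz audio
    with a 20 ms window and a 10 ms step gives 99 frames, which with 40 FBANK
    dimensions yields the 1 x 99 x 40 voice sequence."""
    win = sample_rate * win_ms // 1000   # 320 samples per window
    hop = sample_rate * hop_ms // 1000   # 160 samples per step
    return (n_samples - win) // hop + 1

print(num_frames())   # 99
```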
Referring to fig. 2, the present application further provides a voice wake-up apparatus, including:
the encoding module 1 is used for encoding and calculating an input voice sequence through an Encoder and outputting a first matrix with the same row and column number as the voice sequence;
the re-expression module 2 is used for performing linear re-expression on the first matrix through a feedforward neural network and outputting a second matrix, and the row number and the column number of the second matrix are the same as those of the first matrix;
the attention vector acquisition module 3 is used for carrying out dimension compression on the second matrix through soft-attention to obtain an attention vector;
a probability obtaining module 4, configured to identify probabilities of multiple categories according to the attention vector;
and the awakening judgment module 5 is used for judging whether to execute the awakening function according to the probability result of the category.
In the encoding module 1, in some embodiments, for an Encoder, the input feature sequence (e.g., a 1 × 99 × 40 matrix) is linearly transformed (dense layer) and then evenly divided into n parts, where n is the number of heads. Each head performs a matrix operation (self-attention) to obtain a weighted small matrix (e.g., 1 × 99 × 8); the n small matrices are concatenated back into a large matrix (i.e., a 1 × 99 × 40 matrix); the result of the linear transformation is then added to the large matrix through a residual connection, and the first matrix, still 1 × 99 × 40, is output after normalization. The Encoder may be followed by a next Encoder, or there may be only one Encoder; the Encoder learns the feature information from a local perspective.
In the re-expression module 2, the feedforward neural network is a point-wise (position-wise) feedforward neural network, which applies the same linear re-expression to every position of the first matrix independently.
In the attention vector acquisition module 3, soft-attention learns the feature information from a global perspective, and the dimension compression simplifies the data processed in the next step, thereby increasing the processing speed.
In the probability obtaining module 4, in some embodiments, a plurality of categories are obtained by inputting the attention vector into the fully-connected layer for classification, and the probability of belonging to each category is calculated according to a softmax function, wherein the maximum value of the probability of each category is 1 and the minimum value is 0.
In the wake-up judging module 5, the category with the highest probability can be extracted as the identified category; then judging whether the identified category is a target category: if yes, judging whether the probability of the identified category reaches a threshold value; if so, executing the awakening function corresponding to the category; if not, the identification result is ignored, and the awakening function is not executed.
Referring to fig. 3, a storage medium 100, which is a computer-readable storage medium, is further provided, and a computer program 200 is stored on the storage medium, and when executed, the computer program 200 implements the voice wake-up method in any of the embodiments.
Referring to fig. 4, an embodiment of the present application further provides an intelligent device 300, which includes a memory 400, a processor 500, and a computer program 200 stored on the memory 400 and executable on the processor 500, wherein the processor 500 implements the voice wake-up method in any of the above embodiments when executing the computer program 200.
Those skilled in the art will appreciate that the smart device 300 of the embodiments of the present application is a device referred to above for performing one or more of the methods of the present application. These devices may be specially designed and manufactured for the required purposes, or they may comprise known devices in general-purpose computers. These devices have stored therein computer programs 200 or application programs, which computer programs 200 are selectively activated or reconfigured. Such a computer program 200 may be stored in a device (e.g., computer) readable medium, including, but not limited to, any type of disk including floppy disks, hard disks, optical disks, CD-ROMs, and magneto-optical disks, ROMs (Read-Only Memories), RAMs (Random Access Memories), EPROMs (Erasable Programmable Read-Only Memories), EEPROMs (Electrically Erasable Programmable Read-Only Memories), flash memories, magnetic cards, or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a bus. That is, a readable medium includes any medium that stores or transmits information in a form readable by a device (e.g., a computer).
According to the voice wake-up method, by using structures such as the Encoder, the feedforward neural network, and soft-attention, and by drawing on the internal structure of the Transformer, an Encoder is constructed based on the multi-head attention mechanism; the input data is encoded by the Encoder, calculated by the feedforward neural network, and then dimension-compressed, finally generating an attention vector containing both local and global information to obtain the category probabilities. Whether to execute the wake-up function is judged according to the probability result of the category, and end-to-end voice wake-up judgment is realized, so the response speed of the voice wake-up method is higher and the method is suitable for voice wake-up.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (8)

1. A voice wake-up method, comprising the steps of:
coding and calculating an input voice sequence through an Encoder, and outputting a first matrix with the same row and column number as the voice sequence;
performing linear re-expression on the first matrix through a feedforward neural network, and outputting a second matrix, wherein the row and column number of the second matrix is the same as the row and column number of the first matrix;
performing dimension compression on the second matrix through soft-attention to obtain an attention vector;
identifying probabilities of a plurality of classes from the attention vector;
and judging whether to execute the awakening function or not according to the probability result of the category.
2. The voice wake-up method according to claim 1, wherein the step of determining whether to perform the wake-up function according to the probability result of the class comprises:
extracting the category with the highest probability as the identified category;
judging whether the identified category is a target category:
if so, judging whether the probability of the identified category reaches a threshold value;
if so, executing the awakening function corresponding to the category;
if not, the identification result is ignored, and the awakening function is not executed.
3. The voice wake-up method according to claim 1, wherein the step of performing encoding calculation on the input voice sequence by an Encoder and outputting a first matrix with the same number of rows and columns as the voice sequence comprises:
and carrying out coding calculation on the input voice sequence through N layers of superposed encoders, and outputting a first matrix with the same row number and column number as the voice sequence, wherein N is a positive integer.
4. The voice wake-up method according to claim 1, wherein the step of identifying the probabilities of a plurality of categories from the attention vector comprises:
inputting the attention vector into a fully-connected layer for classification to obtain a plurality of categories; and
computing the probability of belonging to each category with the softmax function.
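The classification step of claim 4 is a fully-connected projection followed by softmax; the layer sizes and the bias term here are illustrative assumptions:

```python
import numpy as np

def classify(attention_vec, W_fc, b_fc):
    logits = attention_vec @ W_fc + b_fc   # fully-connected layer
    e = np.exp(logits - logits.max())      # subtract max for numerical stability
    return e / e.sum()                     # softmax: probability per category

rng = np.random.default_rng(2)
attn = rng.standard_normal(40)             # 40 x 1 attention vector (claim 5)
W_fc, b_fc = rng.standard_normal((40, 3)), np.zeros(3)
probs = classify(attn, W_fc, b_fc)         # non-negative, sums to 1
```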
5. The voice wake-up method according to claim 1, wherein the attention vector has 40 rows and 1 column (40 × 1).
6. A voice wake-up apparatus, comprising:
an encoding module, configured to perform an encoding computation on an input speech sequence with an Encoder and output a first matrix having the same number of rows and columns as the speech sequence;
a re-expression module, configured to perform a linear re-expression of the first matrix through a feedforward neural network and output a second matrix having the same number of rows and columns as the first matrix;
an attention vector acquisition module, configured to perform dimension compression on the second matrix through soft-attention to obtain an attention vector;
a probability acquisition module, configured to identify the probabilities of a plurality of categories from the attention vector; and
a wake-up determination module, configured to determine whether to execute the wake-up function according to the probability results of the categories.
7. A storage medium, characterized in that it is a computer-readable storage medium having a computer program stored thereon, the computer program, when executed, implementing the voice wake-up method according to any one of claims 1 to 5.
8. An intelligent device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the voice wake-up method according to any one of claims 1 to 5.
CN202010198736.XA 2020-03-20 2020-03-20 Voice awakening method and device, storage medium and intelligent device Active CN111091839B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010198736.XA CN111091839B (en) 2020-03-20 2020-03-20 Voice awakening method and device, storage medium and intelligent device

Publications (2)

Publication Number Publication Date
CN111091839A CN111091839A (en) 2020-05-01
CN111091839B true CN111091839B (en) 2020-06-26

Family

ID=70400576

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010198736.XA Active CN111091839B (en) 2020-03-20 2020-03-20 Voice awakening method and device, storage medium and intelligent device

Country Status (1)

Country Link
CN (1) CN111091839B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112669830A (en) * 2020-12-18 2021-04-16 上海容大数字技术有限公司 End-to-end multi-awakening-word recognition system
CN113051897B (en) * 2021-05-25 2021-09-10 中国电子科技集团公司第三十研究所 GPT2 text automatic generation method based on Performer structure
CN113282707B (en) * 2021-05-31 2024-01-26 平安国际智慧城市科技股份有限公司 Data prediction method and device based on transducer model, server and storage medium
CN113642319B (en) * 2021-07-29 2022-11-29 北京百度网讯科技有限公司 Text processing method and device, electronic equipment and storage medium
CN113762251B (en) * 2021-08-17 2024-05-10 慧影医疗科技(北京)股份有限公司 Attention mechanism-based target classification method and system

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
CN107221326B (en) * 2017-05-16 2021-05-28 百度在线网络技术(北京)有限公司 Voice awakening method and device based on artificial intelligence and computer equipment
US11210475B2 (en) * 2018-07-23 2021-12-28 Google Llc Enhanced attention mechanisms
CN109872713A (en) * 2019-03-05 2019-06-11 深圳市友杰智新科技有限公司 A kind of voice awakening method and device
CN110619034A (en) * 2019-06-27 2019-12-27 中山大学 Text keyword generation method based on Transformer model
CN110534102B (en) * 2019-09-19 2020-10-30 北京声智科技有限公司 Voice wake-up method, device, equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant