CN111798840B - Voice keyword recognition method and device - Google Patents

Voice keyword recognition method and device

Info

Publication number
CN111798840B
CN111798840B CN202010688457.1A
Authority
CN
China
Prior art keywords
model
acoustic feature
acoustic
feature
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010688457.1A
Other languages
Chinese (zh)
Other versions
CN111798840A (en)
Inventor
赵江江
李昭奇
任玉玲
李青龙
黎塔
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
China Mobile Online Services Co Ltd
Original Assignee
Institute of Acoustics CAS
China Mobile Online Services Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, China Mobile Online Services Co Ltd filed Critical Institute of Acoustics CAS
Priority to CN202010688457.1A priority Critical patent/CN111798840B/en
Publication of CN111798840A publication Critical patent/CN111798840A/en
Application granted granted Critical
Publication of CN111798840B publication Critical patent/CN111798840B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G10L15/08 Speech classification or search
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/142 Hidden Markov Models [HMMs] (speech classification or search using statistical models)
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the extracted parameters being the cepstrum
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks
    • G10L25/45 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of analysis window
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L2015/088 Word spotting
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a voice keyword recognition method and device. A first acoustic feature and a second acoustic feature are extracted from the audio of a target keyword and spliced into a first acoustic feature sequence; a third acoustic feature and a fourth acoustic feature are extracted from the audio of the target voice to be recognized and spliced into a second acoustic feature sequence; the first and second acoustic feature sequences are respectively input into a pre-trained first model and second model, which output a first embedded vector and a second embedded vector; the similarity between the first embedded vector and the second embedded vector is calculated, and whether the target voice contains the target keyword is determined based on the similarity. The method enables the output embedded vectors to contain more context information and improves the effectiveness of sample-based keyword recognition.

Description

Voice keyword recognition method and device
Technical Field
The embodiment of the application relates to the technical field of audio signal processing, in particular to a voice keyword recognition method and device.
Background
Keyword detection (spoken keyword spotting or spoken term detection) is a sub-field of speech recognition whose purpose is to detect every position at which a specified word occurs in a speech signal; it is one of the important research topics in human-computer interaction. Traditional keyword recognition technology requires building a speech recognition system, which generally comprises an acoustic model, a pronunciation dictionary and a language model; a complex decoding network must be constructed by means of a weighted finite-state transducer to convert the acoustic feature sequence into a text sequence, which is then searched. The computational complexity is therefore high and considerable resources are occupied.
A sample-based keyword recognition scheme avoids building a full recognition system and compares keywords only through acoustic similarity; it performs well in low-resource scenarios where an effective speech recognition system cannot be built. However, the features extracted by such schemes contain little context information and cannot fully characterize the semantic association between a keyword and its context within a sentence, which limits the performance of sample-based keyword recognition, so the recognition effectiveness still needs to be improved.
Disclosure of Invention
To solve the above problems, the application describes a voice keyword recognition method and device.
In a first aspect, an embodiment of the present application provides a method for identifying a voice keyword, where the method includes:
acquiring the audio of a target keyword, extracting a first acoustic feature and a second acoustic feature from the audio of the target keyword, and splicing the first acoustic feature and the second acoustic feature into a first acoustic feature sequence; the second acoustic features carry context semantic association information; sequentially inputting a first acoustic feature sequence into a first model and a second model which are trained in advance, and outputting a first embedded vector with a specified dimension through the first model and the second model; the first model is a long-short-time memory network LSTM model; the second model comprises at least an attention network model, wherein the attention network model is used for executing semantic feature aggregation; extracting a third acoustic feature and a fourth acoustic feature from the audio of the target voice to be recognized, and splicing the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence; sequentially inputting a second acoustic feature sequence into a first model and a second model which are trained in advance, and outputting a second embedded vector with a specified dimension through the first model and the second model; and calculating the similarity between the first embedded vector and the second embedded vector, and determining whether the target keyword is contained in the target voice or not based on the similarity.
In one embodiment, the first acoustic feature comprises any one of a logarithmic mel-cepstrum feature, an acoustic posterior probability feature, and a neural network bottleneck feature extracted from audio of the target keyword;
the third acoustic feature includes any one of a logarithmic mel-cepstrum feature, an acoustic posterior probability feature, and a neural network bottleneck feature extracted from the audio of the target voice to be recognized.
In one embodiment, the first acoustic feature is a log-mel-cepstral feature, and extracting the first acoustic feature from the audio of the target keyword comprises:
inputting the audio signal of the target keyword into a high-pass filter; framing the audio signal output by the high-pass filter according to a preset frame length and frame shift; windowing each frame, wherein the window function is a Hamming window; performing fast discrete Fourier transform on each frame to obtain the frequency spectrum corresponding to each frame, and calculating the power spectrum energy at each frequency point; inputting the power spectrum energy of each frame into a mel filter bank and taking the logarithm to obtain a logarithmic mel spectrum; performing discrete cosine transform on the logarithmic mel spectrum to obtain a cepstrum, and selecting the first n orders together with the energy value to form (n+1)-dimensional features; and performing first-order and second-order difference operations on the (n+1)-dimensional features to obtain logarithmic mel cepstrum features of dimension 3(n+1).
In one embodiment, extracting a second acoustic feature from audio of a target keyword includes:
inputting an audio signal of a target keyword into a pre-trained convolutional neural network model, and outputting a second acoustic feature; the convolutional neural network model comprises a first convolutional neural network and a second convolutional neural network, wherein the first convolutional neural network is used for carrying out coding processing on the audio signal, and the second convolutional neural network is used for extracting context correlation characteristics in the audio signal.
In one embodiment, the first convolutional neural network comprises seven convolutional layers, the number of channels is 512, the convolution kernel sizes are 10, 8, 4, 4, 4, 1, 1, and the convolution step sizes are 5, 4, 2, 2, 2, 1, 1; and/or,
the second convolutional neural network comprises twelve convolutional layers, the number of channels is 512, the convolution kernel sizes are 1,2,3,4,5,6,7,8,9, 10, 11 and 12, and the convolution step sizes are 1.
In one embodiment, the first model is a bidirectional long and short time memory network BLSTM model, comprising three bidirectional long and short time memory layers, each layer comprising a plurality of hidden units; the bidirectional long-short-time memory network model is used for carrying out forward processing and reverse processing on each frame in the audio signal, and splicing forward output and reverse output as a first embedded vector or a second embedded vector corresponding to the frame.
In one embodiment, the attention network model includes two fully connected layers, and the activation function between the two fully connected layers is a normalized exponential function.
In one embodiment, a first layer of the two fully connected layers contains 1024 neurons and a second layer contains 1 neuron.
In a second aspect, an embodiment of the present application provides a voice keyword recognition apparatus, including:
the first extraction unit is configured to acquire the audio of the target keyword, extract first acoustic features and second acoustic features from the audio of the target keyword, and splice the first acoustic features and the second acoustic features into a first acoustic feature sequence; a first embedding unit configured to sequentially input a first acoustic feature sequence into a first model and a second model trained in advance, and output a first embedding vector of a specified dimension through the first model and the second model; the first model is a long-short-time memory network LSTM model; the second model includes at least an attention network model; a second extraction unit configured to extract a third acoustic feature and a fourth acoustic feature from audio of a target voice to be recognized, and splice the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence; a second embedding unit configured to sequentially input a second acoustic feature sequence into the first model and the second model trained in advance, and output a second embedded vector of a specified dimension through the first model and the second model; and the recognition unit is configured to calculate the similarity between the first embedded vector and the second embedded vector, and determine whether the target keyword is contained in the target voice or not based on the similarity.
In a third aspect, embodiments of the present application also provide a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the methods of the first to second aspects.
In a fourth aspect, embodiments of the present application further provide a computing device, including a memory and a processor, where the memory stores executable code, and the processor implements the methods of the first to second aspects when executing the executable code.
By adopting the voice keyword recognition method and device provided by the embodiments of the application, both the first acoustic feature sequence extracted from the audio of the target keyword and the second acoustic feature sequence extracted from the audio of the target voice to be recognized contain context semantic association information. The first and second acoustic feature sequences are respectively input into a first model and a second model trained in advance: the first model is a long-short-time memory network LSTM model, which can extract the context-related features in the feature sequences, and the second model is provided with an attention network model that further performs semantic feature aggregation, so that the output embedded vectors are rich in context semantic information. Similarity calculation is performed based on these embedded vectors, and the recognition accuracy is therefore higher.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments disclosed herein, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only embodiments disclosed herein, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a system architecture diagram of a speech keyword recognition method according to an embodiment of the present application;
FIG. 2 is a flowchart of one embodiment of a method for recognizing a voice keyword according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an embodiment of a voice keyword recognition apparatus provided in an embodiment of the present application.
Detailed Description
Various embodiments disclosed herein are described below with reference to the accompanying drawings.
The voice keyword recognition method and device provided by the embodiment of the application are suitable for various voice recognition scenes.
Referring to fig. 1, fig. 1 is a system architecture diagram implemented by the voice keyword recognition method according to an embodiment of the present application. First, a first acoustic feature and a second acoustic feature are extracted from the audio of the target keyword; for example, the first acoustic feature may be a logarithmic mel cepstrum feature, and the second acoustic feature should carry context semantic association information. The first acoustic feature and the second acoustic feature are then spliced into a first acoustic feature sequence. A third acoustic feature and a fourth acoustic feature are extracted from the audio of the target voice to be recognized. The third acoustic feature is an acoustic feature of the same type as the first acoustic feature, for example a logarithmic mel cepstrum feature; the difference is that the first acoustic feature is extracted from the keyword audio while the third acoustic feature is extracted from the target voice audio. Correspondingly, the fourth acoustic feature is of the same type as the second acoustic feature, except that the second acoustic feature is extracted from the keyword audio and the fourth acoustic feature is extracted from the target voice audio. The third acoustic feature and the fourth acoustic feature are then spliced into a second acoustic feature sequence.
And then sequentially inputting the first acoustic feature sequence and the second acoustic feature sequence into a first model and a second model which are trained in advance, and outputting a first embedded vector or a second embedded vector with specified dimensionality through the first model and the second model. And calculating the similarity between the first embedded vector and the second embedded vector, determining whether the target speech contains the target keyword based on the similarity, and judging that the target speech contains the target keyword if the similarity is higher than a preset threshold.
Specifically, referring to fig. 2, the voice keyword recognition method provided in the embodiment of the present application includes the following steps:
s201, acquiring the audio of the target keyword, extracting a first acoustic feature and a second acoustic feature from the audio of the target keyword, and splicing the first acoustic feature and the second acoustic feature into a first acoustic feature sequence.
S202, sequentially inputting the first acoustic feature sequence into a first model and a second model which are trained in advance, and outputting a first embedded vector with a specified dimension through the first model and the second model.
And S203, extracting a third acoustic feature and a fourth acoustic feature from the audio of the target voice to be recognized, and splicing the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence.
S204, sequentially inputting the second acoustic feature sequence into a first model and a second model which are trained in advance, and outputting a second embedded vector with a specified dimension through the first model and the second model.
S205, calculating the similarity between the first embedded vector and the second embedded vector, and determining whether the target keyword is contained in the target voice based on the similarity.
The audio of the target keyword is sample keyword audio, the audio of the target voice is a complete audio to be detected, and the audio of the target keyword is equivalent to a segment in the target voice audio.
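As an illustrative aid (not part of the original patent text), steps S201-S204 can be summarized by the following Python sketch. The feature extractors and the two models are passed in as callables, and all names here are hypothetical placeholders for the components described in the remainder of this section; the similarity comparison of S205 is discussed further below.

```python
import numpy as np

def embed(audio, extract_first_feature, extract_second_feature, first_model, second_model):
    """S201/S202 (keyword audio) or S203/S204 (target-voice audio): features -> embedding."""
    f1 = extract_first_feature(audio)        # e.g. frame-level logarithmic mel cepstrum features
    f2 = extract_second_feature(audio)       # frame-level features carrying context information
    seq = np.concatenate([f1, f2], axis=-1)  # splice the two features frame by frame
    hidden = first_model(seq)                # first model: (B)LSTM over the frame sequence
    return second_model(hidden)              # second model: attention pooling -> fixed-dim vector
```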
Alternatively, in one embodiment, the first acoustic feature extracted in S201 is a logarithmic mel cepstrum feature, which is a frame-level acoustic feature. Specifically, the extraction may be performed as follows:
Firstly, pre-emphasis is performed on the audio to boost the high-frequency part: the audio signal of the target keyword is passed through a high-pass filter that attenuates the low-frequency components. The high-pass filter adopted is:
H(z) = 1 - μz⁻¹
where μ takes a value between 0.9 and 1.0, for example 0.97, and z⁻¹ corresponds to a one-sample delay of the audio signal.
The audio is then divided into frames: N sampling points of the audio signal are grouped into one observation unit, defined as a frame. N is typically 256 or 512, corresponding to a frame length of about 20-30 ms. To avoid excessive change between two adjacent frames, an overlap region is set between them; the overlap region contains M sampling points, where M is about 1/2 or 1/3 of N. As one implementation, in the embodiment of the present application the frame length is set to 25 ms and the frame shift to 10 ms.
Each frame is then windowed by multiplying it by a Hamming window function, which increases the continuity between the left and right ends of the frame. After multiplication by the Hamming window, each frame is subjected to a fast Fourier transform to obtain its energy distribution over the frequency spectrum; different energy distributions in the frequency domain can represent the characteristics of different speech sounds. A fast discrete Fourier transform is performed on each frame to obtain the spectrum of the frame, and the power spectrum energy at each frequency point is calculated by taking the squared modulus of the spectrum.
The power spectrum energy of each frame of the audio signal is then passed through a mel filter bank, and the logarithm of the filter-bank energies is taken to obtain a logarithmic mel spectrum. A discrete cosine transform is performed on the logarithmic mel spectrum to obtain a cepstrum, and the first n orders together with the energy value are selected to form an (n+1)-dimensional feature vector, where n is a positive integer. For example, n may take the value 12, i.e., the first 12 orders and the energy value together form a 13-dimensional feature vector.
Next, first-order and second-order differences are calculated on the obtained (n+1)-dimensional feature vector, yielding an (n+1)-dimensional feature after the first-order difference and an (n+1)-dimensional feature after the second-order difference, for a total of 3(n+1) dimensions. For example, computing first-order and second-order differences of the 13-dimensional feature vector yields 39-dimensional logarithmic mel cepstrum features in total.
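As an illustrative aid (not part of the original patent text), the 39-dimensional logarithmic mel cepstrum feature described above can be computed roughly as follows. This is a minimal sketch using librosa; the 0th cepstral coefficient stands in for the frame energy value, and parameters such as the number of mel filters are assumptions.

```python
import librosa
import numpy as np

def log_mel_cepstrum(path, sr=16000, n_ceps=13):
    """Pre-emphasis, 25 ms frames / 10 ms shift, Hamming window, mel filter bank, DCT, deltas."""
    y, sr = librosa.load(path, sr=sr)
    y = librosa.effects.preemphasis(y, coef=0.97)           # high-pass pre-emphasis filter
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_ceps,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),  # 25 ms frame length, 10 ms frame shift
        window="hamming", n_mels=40,
    )
    d1 = librosa.feature.delta(mfcc, order=1)               # first-order difference
    d2 = librosa.feature.delta(mfcc, order=2)               # second-order difference
    return np.concatenate([mfcc, d1, d2], axis=0).T         # shape (frames, 3*(n+1)) = (frames, 39)
```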
In addition to the log mel-cepstrum feature, the first acoustic feature may be an acoustic posterior probability feature or a neural network bottleneck feature.
The acoustic posterior probability feature is the probability distribution, computed under known prior knowledge and observation conditions, over the instances of each unit for a frame of speech. A unit is a modeling unit, the smallest structure represented in the constructed system; for example, if the modeling unit is the character, then phonemes and phonetic symbols below the character level are no longer considered, and only how characters form words and sentences needs to be considered. The known prior knowledge may be the pronunciation dictionary of a language, text, or audio annotated with the text corresponding to each time point, and is used to train a Gaussian mixture model or a neural network model.
The observation conditions may be the waveform and spectrum of the current frame, of historical frames, and of future frames of the audio signal.
In one embodiment, the known prior knowledge of each unit in the speech frame together with the waveform and spectrum data corresponding to the speech frame are taken as input, the hidden state of each unit is taken as output, and a Gaussian mixture model or another neural network model is trained to obtain optimized parameters; the hidden state of each unit is then output by the trained Gaussian mixture model or neural network model. The units are taken as nodes in a hidden Markov model: the prior knowledge is used as the observation value of each observable node, the hidden states output by the Gaussian mixture model or other neural network model are used as the hidden-state values of the hidden nodes corresponding to the observable nodes, and the hidden Markov model outputs the occurrence probability and transition probability of the unit represented by each node as the acoustic posterior probability feature of the corresponding speech frame. Units may include monophones, polyphones, initials, letters, words, and the like.
In another embodiment, the neural network model directly outputs the acoustic posterior probability features corresponding to the speech frames without going through the hidden Markov model. For example, the unit is an english letter, the neural network model directly outputs a 26-dimensional vector representing the probability that the frame is each letter, and the 26-dimensional vector that is output is the acoustic posterior probability feature.
In another embodiment, the acoustic posterior probability features are obtained by inputting the logarithmic mel cepstrum features extracted from each frame into a neural network acoustic model trained with prior knowledge, and computing the probability distribution over all monophones for that frame; this probability distribution is the acoustic posterior probability feature.
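As an illustrative aid (not part of the original patent text), the simplest posterior-feature route just described can be sketched as a per-frame classifier whose softmax output over monophone units is taken as the acoustic posterior probability feature; the hidden size and the number of units below are assumptions.

```python
import torch
import torch.nn as nn

class PosteriorFeature(nn.Module):
    """Frame-level classifier over monophone units; its softmax output is the posterior feature."""
    def __init__(self, input_dim=39, n_units=40):            # e.g. 39-dim cepstral input, 40 monophones
        super().__init__()
        self.net = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_units))

    def forward(self, frames):                                # frames: (batch, n_frames, input_dim)
        return torch.softmax(self.net(frames), dim=-1)        # per-frame posterior distribution
```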
The neural network bottleneck feature takes the output of an intermediate layer of a neural network as the feature. For example, a neural network with a two-part structure is trained: the first part is a three-layer BLSTM (bidirectional LSTM) serving as an encoding network, and the second part is a decoding network composed of four LSTM layers, which respectively output word-level probabilities for Mandarin/English/Spanish/Bos. Training is performed sequentially with the four languages, each language training the first part and only the LSTM layer of the decoding network corresponding to that language in the second part. After training is completed, the output of the first part of the network is taken as the feature of each frame, which is defined as the neural network bottleneck feature in the embodiments of the present application.
In S201, after the first acoustic feature is obtained, a second acoustic feature needs to be extracted. The second acoustic feature is also a frame-level acoustic feature, i.e., extracted in frames. The audio of the target keyword is input into a pre-trained neural network model, the output of which serves as a second acoustic feature.
Specifically, the pre-trained neural network model includes an encoding network and a context network. For example, the coding network and the context network are two one-dimensional convolutional neural networks. For convenience of description, the encoding network is defined as a first convolutional neural network and the context network is defined as a second convolutional neural network. The first convolutional neural network is used for encoding the audio signal, and the second convolutional neural network is used for extracting context-related features from the audio signal.
In one embodiment, the first convolutional neural network comprises seven convolutional layers, the number of channels is 512, the convolution kernel sizes for performing the convolution operations are (10,8,4,4,4,1,1) respectively, and the step sizes of the convolution operations are (5,4,2,2,2,1,1) respectively; the second convolutional neural network comprises twelve convolutional layers, the number of channels is 512, the convolution kernel sizes are (1, 2,3,4,5,6,7,8,9, 10, 11, 12), and the convolution step sizes are all set to be 1, that is, the step size of convolution operation of each convolutional layer is (1,1,1,1,1,1,1,1,1,1,1,1).
The second acoustic features extracted by the convolutional neural network model carry context semantic association information.
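As an illustrative aid (not part of the original patent text), the two convolutional stacks described above can be sketched in PyTorch as follows; the padding, activation functions and normalization are assumptions, since the text only specifies channel counts, kernel sizes and strides.

```python
import torch
import torch.nn as nn

def conv_stack(in_ch, kernels, strides):
    """Build a 1-D convolutional stack with 512 channels per layer."""
    layers, ch = [], in_ch
    for k, s in zip(kernels, strides):
        layers += [nn.Conv1d(ch, 512, kernel_size=k, stride=s, padding=k // 2), nn.ReLU()]
        ch = 512
    return nn.Sequential(*layers)

class PretrainedFeatureExtractor(nn.Module):
    """Encoding network (7 layers) followed by context network (12 layers)."""
    def __init__(self):
        super().__init__()
        self.encoder = conv_stack(1, [10, 8, 4, 4, 4, 1, 1], [5, 4, 2, 2, 2, 1, 1])
        self.context = conv_stack(512, list(range(1, 13)), [1] * 12)

    def forward(self, wav):                  # wav: (batch, samples)
        z = self.encoder(wav.unsqueeze(1))   # (batch, 512, frames): encoded audio
        c = self.context(z)                  # context-related features
        return c.transpose(1, 2)             # (batch, frames, 512) = second acoustic feature
```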
Then, the first acoustic feature and the second acoustic feature are spliced into the first acoustic feature sequence: for each frame, the two feature vectors are concatenated into a single feature vector of higher dimension, and the per-frame feature vectors are arranged in order to form the first acoustic feature sequence.
For extracting the third acoustic feature and the fourth acoustic feature from the audio of the target voice to be recognized and splicing them into the second acoustic feature sequence in S203, reference may be made to the detailed description of S201; the difference is that the processing object in S201 is the audio of the target keyword, while the processing object in S203 is the audio of the target voice.
In the embodiment of the present application, no execution order is required between S202 and S204, or between S201 and S203; the step numbering is only for convenience of description and is not to be construed as limiting the execution order.
In S202 and S204, the first acoustic feature sequence and the second acoustic feature sequence obtained by splicing are each input sequentially into a first model and a second model trained in advance. In some embodiments, the first model and the second model are both pre-built and trained, and the first model may be a Long Short-Term Memory network (LSTM) model. For example, in one particular embodiment, the first model comprises a three-layer bidirectional long-short-time memory network, each layer comprising 256 neurons; for each frame, the forward and backward outputs are concatenated as the embedded vector of that frame, giving a fixed dimension of 512.
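As an illustrative aid (not part of the original patent text), this first model can be sketched in PyTorch as follows; the input feature dimension is an assumption (e.g. 39-dimensional cepstral features concatenated with 512-dimensional pretrained features).

```python
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Three-layer bidirectional LSTM; forward and backward outputs are concatenated per frame."""
    def __init__(self, input_dim=551):                       # 39 + 512, an assumed input size
        super().__init__()
        self.blstm = nn.LSTM(input_dim, 256, num_layers=3,
                             bidirectional=True, batch_first=True)

    def forward(self, feats):                                # feats: (batch, frames, input_dim)
        out, _ = self.blstm(feats)                           # (batch, frames, 512): fwd || bwd
        return out
```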
The second model includes at least an attention network model, which is used for performing semantic feature aggregation. In some embodiments, the second model includes an attention network, an average structure, and a splicing structure.
In one embodiment, the attention network comprises two fully connected layers, the activation function between the fully connected layers being a normalized exponential function. For example, the first fully-connected layer contains 1024 neurons, and the second layer (corresponding to the output layer) contains 1 neuron.
The overall calculation process of the first model and the second model is as follows:
Y = FC(Z)
where FC(·) denotes the fully connected layers, Z is the sequence of feature vectors corresponding to the audio signal that is input to the fully connected layers, z_t is the t-th element of Z, and y_t is the t-th element of the output sequence Y. The splicing structure concatenates the feature vectors output by the attention network and by the average structure into the final output embedded vector E. The first of the two spliced parts is the output of the attention network, which performs a weighted average of the z_t according to the weights output by SoftMax:
e_att = Σ_{t=1..T} SoftMax(Y)_t · z_t
The second part is the average structure, which averages the x_t:
e_avg = (1/T) Σ_{t=1..T} x_t
The two output vectors are spliced together as the final output vector:
E = [e_att ; e_avg]
SoftMax(·) is the normalized exponential function, with the expression:
SoftMax(Y)_t = exp(y_t) / Σ_{j=1..T} exp(y_j)
Here x_t and y_t respectively denote the output feature vector of the three-layer bidirectional long-short-time memory network and the output of the fully connected layers for the t-th frame; T denotes the number of input and output feature vectors; and exp denotes the exponential function with base e (the natural constant).
In S202 and S204, the process of outputting the first and second embedded vectors of specified dimension through the first model and the second model follows the computation above: when the input to the first model is the first acoustic feature sequence corresponding to the target keyword audio, the output embedded vector E is the first embedded vector; when the input is the second acoustic feature sequence corresponding to the target voice audio, the output embedded vector E is the second embedded vector. The dimensions of the first and second embedded vectors are both predetermined values, for example 512 in one embodiment. The dimensions of the input features and of the output embedded vectors are determined at training time; if they are modified, the network needs to be retrained.
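As an illustrative aid (not part of the original patent text), the second model as reconstructed from the formulas above can be sketched in PyTorch as follows. The placement of the normalized exponential function is interpreted here as normalizing the per-frame attention scores over the frame axis, and the attention branch is applied directly to the BLSTM outputs (i.e. Z = X), which is one plausible reading; note that concatenating the two pooled vectors yields twice the frame dimension, so reaching the fixed embedding size mentioned earlier may require a different choice of Z or an extra projection that the text does not fully specify.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Attention branch + average branch + splicing, as sketched from the description."""
    def __init__(self, frame_dim=512):
        super().__init__()
        self.fc1 = nn.Linear(frame_dim, 1024)        # first fully connected layer (1024 neurons)
        self.fc2 = nn.Linear(1024, 1)                # second fully connected layer (1 neuron)

    def forward(self, x):                            # x: (batch, frames, frame_dim), BLSTM outputs
        y = self.fc2(self.fc1(x))                    # Y = FC(Z): one score per frame
        alpha = torch.softmax(y, dim=1)              # normalized exponential over the T frames
        e_att = (alpha * x).sum(dim=1)               # weighted average of the frame vectors
        e_avg = x.mean(dim=1)                        # average structure
        return torch.cat([e_att, e_avg], dim=-1)     # spliced embedded vector E
```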
Next, at S205, the similarity between the first embedded vector and the second embedded vector is calculated. There are various ways to calculate the similarity, such as manhattan distance (Manhattan Distance), euclidean distance (Euclidean Distance), pearson correlation coefficient (Pearson Correlation), cosine similarity (Cosine Similarity), and so on.
In one embodiment, the similarity is calculated using the cosine distance, and whether the keyword represented by the keyword audio is present in the target audio is determined based on the cosine distance. When the cosine distance between the embedded vectors of the keyword and of the target voice is smaller than a preset threshold, or, equivalently, when the similarity is higher than a preset threshold, the keyword is judged to be hit. The cosine distance is computed as:
d(A, B) = 1 - (A · B) / (‖A‖ ‖B‖)
wherein A and B are the two embedded vectors to be compared, and d(·) is the calculated cosine distance.
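As an illustrative aid (not part of the original patent text), the cosine-distance decision of S205 can be sketched as follows; the threshold value is an assumption.

```python
import numpy as np

def cosine_distance(a, b):
    """d(A, B) = 1 - (A·B) / (||A|| ||B||)."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def keyword_hit(keyword_vec, target_vec, threshold=0.3):
    """Judge a hit when the cosine distance between the embedded vectors is below the threshold."""
    return cosine_distance(keyword_vec, target_vec) < threshold
```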
In summary, in the recognition method provided by the embodiments of the present application, during keyword query the target keyword audio to be queried is input; the logarithmic mel cepstrum feature (i.e., the first acoustic feature) and the pre-training feature (i.e., the second acoustic feature) are extracted in step S201 and spliced together as a frame-level acoustic feature sequence; the acoustic feature sequence is sequentially input into the first model and the second model, and an embedded vector of a preset fixed length is output. The input target keyword audio requires no language or text information, so no model for extracting text information needs to be built, which reduces the resource space occupied by such a model during data computation; and the embedding network structure is improved so that the fixed-length embedded vector can contain more context information.
In a second aspect, referring to fig. 3, the embodiment of the present application further provides a voice keyword recognition apparatus 310, where the apparatus includes:
the first extraction unit 3101 is configured to obtain the audio of the target keyword, extract the first acoustic feature and the second acoustic feature from the audio of the target keyword, and splice the first acoustic feature and the second acoustic feature into a first acoustic feature sequence.
A first embedding unit 3102 configured to sequentially input a first acoustic feature sequence into a first model and a second model trained in advance, and output a first embedding vector of a specified dimension through the first model and the second model; the first model is a long-short-time memory network LSTM model; the second model includes at least an attention network model.
The second extraction unit 3103 is configured to extract a third acoustic feature and a fourth acoustic feature from the audio of the target voice to be recognized, and splice the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence.
The second embedding unit 3104 is configured to sequentially input the second acoustic feature sequence into a first model and a second model trained in advance, and output a second embedded vector of a specified dimension through the first model and the second model.
The recognition unit 3105 is configured to calculate a similarity between the first embedded vector and the second embedded vector, and determine whether the target keyword is included in the target speech based on the similarity.
In a third aspect, embodiments of the present application further provide a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in any of the embodiments above.
In a fourth aspect, embodiments of the present application further provide a computing device, including a memory and a processor, where the memory stores executable code, and the processor implements the method described in any of the foregoing embodiments when executing the executable code.
Those of skill in the art will appreciate that in one or more of the above examples, the functions described in the various embodiments disclosed herein may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
While the foregoing detailed description has set forth the objects, aspects and advantages of the various embodiments disclosed herein in further detail, it should be understood that the foregoing description is only illustrative of the various embodiments disclosed herein and is not intended to limit the scope of the various embodiments disclosed herein, and that any modifications, equivalents, improvements or the like that are based on the technical aspects of the various embodiments disclosed herein are intended to be included within the scope of the various embodiments disclosed herein.

Claims (10)

1. A method for recognizing a voice keyword, the method comprising:
acquiring the audio of a target keyword, extracting a first acoustic feature and a second acoustic feature from the audio of the target keyword, and splicing the first acoustic feature and the second acoustic feature into a first acoustic feature sequence; the second acoustic features carry context semantic association information;
sequentially inputting the first acoustic feature sequence into a first model and a second model which are trained in advance, and outputting a first embedded vector with a specified dimension through the first model and the second model; the first model is a long-short-time memory network LSTM model; the second model comprises at least an attention network model, wherein the attention network model is used for executing semantic feature aggregation;
extracting a third acoustic feature and a fourth acoustic feature from the audio of the target voice to be recognized, and splicing the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence;
sequentially inputting the second acoustic feature sequence into a first model and a second model which are trained in advance, and outputting a second embedded vector with a specified dimension through the first model and the second model;
and calculating the similarity between the first embedded vector and the second embedded vector, and determining whether the target keyword is contained in the target voice or not based on the similarity.
2. The method of claim 1, wherein the first acoustic feature comprises any one of a log mel-cepstral feature, an acoustic posterior probability feature, a neural network bottleneck feature extracted from audio of the target keyword;
the third acoustic feature includes any one of a log mel cepstrum feature, an acoustic posterior probability feature, and a neural network bottleneck feature extracted from the audio of the target voice to be recognized.
3. The method of claim 2, wherein the first acoustic feature is a log mel-cepstral feature, the extracting the first acoustic feature from the audio of the target keyword comprising:
inputting the audio signal of the target keyword into a high-pass filter;
framing the audio signal output by the high-pass filter according to a preset frame length and frame shift;
windowing each frame respectively, wherein a window function is a Hamming window;
performing fast discrete Fourier transform on each frame to obtain a frequency spectrum corresponding to each frame, and calculating power spectrum energy corresponding to each frequency point;
respectively inputting the power spectrum energy of each frame into a Mel filter, and taking the logarithm to obtain a logarithmic Mel spectrum;
performing discrete cosine transform on the logarithmic mel spectrum to obtain a cepstrum, and selecting the first n orders together with the energy value to form (n+1)-dimensional features;
and performing first-order and second-order difference operations on the (n+1)-dimensional features to obtain logarithmic mel cepstrum features of dimension 3(n+1).
4. The method of claim 1, wherein extracting a second acoustic feature from the audio of the target keyword comprises:
inputting the audio signal of the target keyword into a pre-trained convolutional neural network model, and outputting a second acoustic feature; the convolutional neural network model comprises a first convolutional neural network and a second convolutional neural network, wherein the first convolutional neural network is used for carrying out coding processing on the audio signal, and the second convolutional neural network is used for extracting context correlation characteristics in the audio signal.
5. The method of claim 4, wherein the first convolutional neural network comprises seven convolutional layers, the number of channels is 512, the convolution kernel sizes are 10, 8, 4, 4, 4, 1, 1, and the convolution step sizes are 5, 4, 2, 2, 2, 1, 1; and/or,
the second convolutional neural network comprises twelve convolutional layers, the number of channels is 512, the convolution kernel sizes are 1,2,3,4,5,6,7,8,9, 10, 11 and 12, and the convolution step sizes are 1.
6. The method of claim 1, wherein the first model is a bidirectional long and short time memory network BLSTM model comprising three bidirectional long and short time memory layers, each layer comprising a plurality of hidden units;
the bidirectional long-short-time memory network model is used for carrying out forward processing and reverse processing on each frame in the audio signal, and splicing the forward output and the reverse output as a first embedded vector or a second embedded vector corresponding to the frame.
7. The method of claim 1, wherein the attention network model comprises two fully connected layers, a first layer of the two fully connected layers comprising 1024 neurons and a second layer comprising 1 neuron, the activation function between the two fully connected layers being a normalized exponential function.
8. A voice keyword recognition apparatus, comprising:
the first extraction unit is configured to acquire the audio of the target keyword, extract first acoustic features and second acoustic features from the audio of the target keyword, and splice the first acoustic features and the second acoustic features into a first acoustic feature sequence;
a first embedding unit configured to sequentially input the first acoustic feature sequence into a first model and a second model trained in advance, and output a first embedding vector of a specified dimension through the first model and the second model; the first model is a long-short-time memory network LSTM model; the second model includes at least an attention network model;
a second extraction unit configured to extract a third acoustic feature and a fourth acoustic feature from audio of a target voice to be recognized, and splice the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence;
a second embedding unit configured to sequentially input the second acoustic feature sequence into a first model and a second model trained in advance, and output a second embedding vector of a specified dimension through the first model and the second model;
and the recognition unit is configured to calculate the similarity between the first embedded vector and the second embedded vector, and determine whether the target keyword is contained in the target voice or not based on the similarity.
9. A computing device comprising a memory and a processor, wherein the memory has executable code stored therein, and wherein the processor, when executing the executable code, implements the method of any of claims 1-7.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed in a computer, causes the computer to perform the method according to any of claims 1-7.
CN202010688457.1A 2020-07-16 2020-07-16 Voice keyword recognition method and device Active CN111798840B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010688457.1A CN111798840B (en) 2020-07-16 2020-07-16 Voice keyword recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010688457.1A CN111798840B (en) 2020-07-16 2020-07-16 Voice keyword recognition method and device

Publications (2)

Publication Number Publication Date
CN111798840A CN111798840A (en) 2020-10-20
CN111798840B true CN111798840B (en) 2023-08-08

Family

ID=72807488

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010688457.1A Active CN111798840B (en) 2020-07-16 2020-07-16 Voice keyword recognition method and device

Country Status (1)

Country Link
CN (1) CN111798840B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634870B (en) * 2020-12-11 2023-05-30 平安科技(深圳)有限公司 Keyword detection method, device, equipment and storage medium
CN112685594B (en) * 2020-12-24 2022-10-04 中国人民解放军战略支援部队信息工程大学 Attention-based weak supervision voice retrieval method and system
CN112530410A (en) * 2020-12-24 2021-03-19 北京地平线机器人技术研发有限公司 Command word recognition method and device
CN113470693A (en) * 2021-07-07 2021-10-01 杭州网易云音乐科技有限公司 Method and device for detecting singing, electronic equipment and computer readable storage medium
CN113671031A (en) * 2021-08-20 2021-11-19 北京房江湖科技有限公司 Wall hollowing detection method and device
CN114817456B (en) * 2022-03-10 2023-09-05 马上消费金融股份有限公司 Keyword detection method, keyword detection device, computer equipment and storage medium
CN116453514B (en) * 2023-06-08 2023-08-25 四川大学 Multi-view-based voice keyword detection and positioning method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001290496A (en) * 2000-04-07 2001-10-19 Ricoh Co Ltd Speech retrieval device, speech retrieval method and recording medium
CN103559881A (en) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Language-irrelevant key word recognition method and system
WO2018019116A1 (en) * 2016-07-28 2018-02-01 上海未来伙伴机器人有限公司 Natural language-based man-machine interaction method and system
WO2018107810A1 (en) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus, and electronic device and medium
CN110349572A (en) * 2017-05-27 2019-10-18 腾讯科技(深圳)有限公司 A kind of voice keyword recognition method, device, terminal and server
CN110444193A (en) * 2018-01-31 2019-11-12 腾讯科技(深圳)有限公司 The recognition methods of voice keyword and device
CN110610707A (en) * 2019-09-20 2019-12-24 科大讯飞股份有限公司 Voice keyword recognition method and device, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101154379B (en) * 2006-09-27 2011-11-23 夏普株式会社 Method and device for locating keywords in voice and voice recognition system
JP5142769B2 (en) * 2008-03-11 2013-02-13 株式会社日立製作所 Voice data search system and voice data search method
CA2690174C (en) * 2009-01-13 2014-10-14 Crim (Centre De Recherche Informatique De Montreal) Identifying keyword occurrences in audio data
US9378733B1 (en) * 2012-12-19 2016-06-28 Google Inc. Keyword detection without decoding
US9953632B2 (en) * 2014-04-17 2018-04-24 Qualcomm Incorporated Keyword model generation for detecting user-defined keyword

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001290496A (en) * 2000-04-07 2001-10-19 Ricoh Co Ltd Speech retrieval device, speech retrieval method and recording medium
CN103559881A (en) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Language-irrelevant key word recognition method and system
WO2018019116A1 (en) * 2016-07-28 2018-02-01 上海未来伙伴机器人有限公司 Natural language-based man-machine interaction method and system
WO2018107810A1 (en) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus, and electronic device and medium
CN110349572A (en) * 2017-05-27 2019-10-18 腾讯科技(深圳)有限公司 A kind of voice keyword recognition method, device, terminal and server
CN110444193A (en) * 2018-01-31 2019-11-12 腾讯科技(深圳)有限公司 The recognition methods of voice keyword and device
CN110610707A (en) * 2019-09-20 2019-12-24 科大讯飞股份有限公司 Voice keyword recognition method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Keyword detection method based on restricting model size and acoustic confidence; 郑铁然; 张战; 韩纪庆; Computer Science (01); full text *

Also Published As

Publication number Publication date
CN111798840A (en) 2020-10-20

Similar Documents

Publication Publication Date Title
CN111798840B (en) Voice keyword recognition method and device
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
Collobert et al. Wav2letter: an end-to-end convnet-based speech recognition system
Ferrer et al. Study of senone-based deep neural network approaches for spoken language recognition
US6845357B2 (en) Pattern recognition using an observable operator model
EP4018437B1 (en) Optimizing a keyword spotting system
CN113223506B (en) Speech recognition model training method and speech recognition method
CN112397056B (en) Voice evaluation method and computer storage medium
CN115019776A (en) Voice recognition model, training method thereof, voice recognition method and device
Das et al. Best of both worlds: Robust accented speech recognition with adversarial transfer learning
JP2020042257A (en) Voice recognition method and device
CN112074903A (en) System and method for tone recognition in spoken language
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
Das et al. Bottleneck feature-based hybrid deep autoencoder approach for Indian language identification
Radha et al. Speech and speaker recognition using raw waveform modeling for adult and children’s speech: a comprehensive review
CN114333762B (en) Expressive force-based speech synthesis method, expressive force-based speech synthesis system, electronic device and storage medium
CN115240712A (en) Multi-mode-based emotion classification method, device, equipment and storage medium
CN115512692A (en) Voice recognition method, device, equipment and storage medium
CN113066510B (en) Vowel weak reading detection method and device
CN115132170A (en) Language classification method and device and computer readable storage medium
CN115376547A (en) Pronunciation evaluation method and device, computer equipment and storage medium
CN114974310A (en) Emotion recognition method and device based on artificial intelligence, computer equipment and medium
Gündogdu Keyword search for low resource languages

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant