CN111798840A - Voice keyword recognition method and device - Google Patents

Voice keyword recognition method and device

Info

Publication number
CN111798840A
CN111798840A
Authority
CN
China
Prior art keywords
acoustic feature
model
feature
acoustic
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010688457.1A
Other languages
Chinese (zh)
Other versions
CN111798840B (en)
Inventor
赵江江
李昭奇
任玉玲
李青龙
黎塔
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
China Mobile Online Services Co Ltd
Original Assignee
Institute of Acoustics CAS
China Mobile Online Services Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, China Mobile Online Services Co Ltd filed Critical Institute of Acoustics CAS
Priority to CN202010688457.1A priority Critical patent/CN111798840B/en
Publication of CN111798840A publication Critical patent/CN111798840A/en
Application granted granted Critical
Publication of CN111798840B publication Critical patent/CN111798840B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L2015/088 Word spotting
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques using neural networks
    • G10L25/45 Speech or voice analysis techniques characterised by the type of analysis window
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a voice keyword recognition method and device. A first acoustic feature and a second acoustic feature are extracted from the audio of a target keyword and spliced into a first acoustic feature sequence; a third acoustic feature and a fourth acoustic feature are extracted from the audio of a target speech to be recognized and spliced into a second acoustic feature sequence; the first acoustic feature sequence and the second acoustic feature sequence are each fed in turn through a pre-trained first model and second model, which output a first embedding vector and a second embedding vector respectively; the similarity between the first embedding vector and the second embedding vector is calculated, and whether the target speech contains the target keyword is determined on the basis of the similarity. The method allows the output embedding vectors to carry more context information and improves the effectiveness of sample-based keyword recognition.

Description

Voice keyword recognition method and device
Technical Field
The embodiments of the present application relate to the technical field of audio signal processing, and in particular to a voice keyword recognition method and device.
Background
Keyword detection (spoken keyword spotting, or spoken term detection) is a sub-field of speech recognition that aims to detect all positions at which specified words appear in a speech signal, and it is one of the important research topics in human-computer interaction. Traditional keyword recognition requires building a speech recognition system, which generally comprises an acoustic model, a pronunciation dictionary and a language model; a complex decoding network must be constructed with weighted finite-state transducers to convert the acoustic feature sequence into a text sequence, which is then searched. The computational complexity is high and considerable resources are occupied.
Sample-based keyword recognition schemes avoid building a recognition system and compare keywords only through acoustic similarity, so they perform well in low-resource scenarios where an effective speech recognition system cannot be built. However, the features extracted by such schemes contain little context information and cannot fully represent the semantic association between a keyword and its sentence context, which limits the performance of sample-based keyword recognition; the recognition effectiveness therefore needs further improvement.
Disclosure of Invention
The present application describes a voice keyword recognition method and device that address the above problems.
In a first aspect, an embodiment of the present application provides a method for recognizing a speech keyword, where the method includes:
acquiring audio of a target keyword, extracting a first acoustic feature and a second acoustic feature from the audio of the target keyword, and splicing the first acoustic feature and the second acoustic feature into a first acoustic feature sequence, the second acoustic feature carrying context semantic association information; sequentially inputting the first acoustic feature sequence into a first model and a second model which are trained in advance, and outputting a first embedding vector of a specified dimension through the first model and the second model, the first model being a long short-term memory (LSTM) network model and the second model comprising at least an attention network model used for performing semantic feature aggregation; extracting a third acoustic feature and a fourth acoustic feature from the audio of the target speech to be recognized, and splicing the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence; sequentially inputting the second acoustic feature sequence into the pre-trained first model and second model, and outputting a second embedding vector of the specified dimension through the first model and the second model; and calculating the similarity between the first embedding vector and the second embedding vector, and determining whether the target speech contains the target keyword on the basis of the similarity.
In one embodiment, the first acoustic feature includes any one of a logarithmic mel-frequency cepstrum feature, an acoustic posterior probability feature and a neural network bottleneck feature extracted from the audio frequency of the target keyword;
the third acoustic feature comprises any one of a logarithmic Mel cepstrum feature, an acoustic posterior probability feature and a neural network bottleneck feature extracted from the audio of the target speech to be recognized.
In one embodiment, the first acoustic feature is a logarithmic Mel cepstrum feature, and extracting the first acoustic feature from the audio of the target keyword includes:
inputting the audio signal of the target keyword into a high-pass filter; framing the audio signal output by the high-pass filter according to a preset frame length and frame shift; windowing each frame, the window function being a Hamming window; performing a fast discrete Fourier transform on each frame to obtain the spectrum corresponding to each frame, and calculating the power spectrum energy at each frequency point; passing the power spectrum energy of each frame through a Mel filter bank and taking the logarithm to obtain a logarithmic Mel spectrum; performing a discrete cosine transform on the logarithmic Mel spectrum to obtain a cepstrum, and selecting the first n cepstral coefficients together with the energy value as an (n+1)-dimensional feature; and performing first-order and second-order difference operations on the (n+1)-dimensional feature to obtain a logarithmic Mel cepstrum feature of dimension 3(n+1).
In one embodiment, extracting the second acoustic feature from the audio of the target keyword comprises:
inputting the audio signal of the target keyword into a pre-trained convolutional neural network model, and outputting the second acoustic feature; the convolutional neural network model comprises a first convolutional neural network and a second convolutional neural network, wherein the first convolutional neural network is used for encoding the audio signal, and the second convolutional neural network is used for extracting context-related features of the audio signal.
In one embodiment, the first convolutional neural network comprises seven convolutional layers, the number of channels is 512, the convolution kernel sizes are 10, 8, 4, 4, 4, 1 and 1 respectively, and the convolution strides are 5, 4, 2, 2, 2, 1 and 1 respectively; and/or,
the second convolutional neural network comprises twelve convolutional layers, the number of channels is 512, the convolution kernel sizes are 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12 respectively, and the convolution strides are all 1.
In one embodiment, the first model is a bidirectional long short-term memory (BLSTM) network model comprising three bidirectional LSTM layers, each layer comprising a plurality of hidden units; the bidirectional LSTM network model performs forward processing and backward processing on each frame of the audio signal, and the forward output and the backward output are spliced as the first embedding vector or the second embedding vector corresponding to the frame.
In one embodiment, the attention network model includes two fully-connected layers, and the activation function between the two fully-connected layers is a normalized exponential function.
In one embodiment, the first of the two fully-connected layers contains 1024 neurons and the second layer contains 1 neuron.
In a second aspect, an embodiment of the present application provides a speech keyword recognition apparatus, including:
the first extraction unit is configured to acquire the audio of a target keyword, extract a first acoustic feature and a second acoustic feature from the audio of the target keyword, and splice the first acoustic feature and the second acoustic feature into a first acoustic feature sequence; the first embedding unit is configured to sequentially input the first acoustic feature sequence into a first model and a second model trained in advance, and output a first embedding vector of a specified dimension through the first model and the second model, the first model being a long short-term memory (LSTM) network model and the second model comprising at least an attention network model; the second extraction unit is configured to extract a third acoustic feature and a fourth acoustic feature from the audio of the target speech to be recognized, and splice the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence; the second embedding unit is configured to sequentially input the second acoustic feature sequence into the pre-trained first model and second model, and output a second embedding vector of the specified dimension through the first model and the second model; and the recognition unit is configured to calculate the similarity between the first embedding vector and the second embedding vector and determine whether the target speech contains the target keyword on the basis of the similarity.
In a third aspect, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed in a computer, it causes the computer to perform the method of the first aspect.
In a fourth aspect, an embodiment of the present application further provides a computing device comprising a memory and a processor, the memory storing executable code; when the processor executes the executable code, the method of the first aspect is implemented.
With the voice keyword recognition method and device provided by the embodiments of the present application, both the first acoustic feature sequence extracted from the audio of the target keyword and the second acoustic feature sequence extracted from the audio of the target speech to be recognized contain context semantic association information. The two feature sequences are each input into a pre-trained first model and second model: the first model is a long short-term memory (LSTM) network model that can extract context-related features from the feature sequences, and the second model contains an attention network model that further performs semantic feature aggregation. Embedding vectors carrying context semantic information are therefore output, the similarity is computed on these embedding vectors, and the recognition accuracy is higher.
Drawings
To explain the technical solutions of the embodiments disclosed in the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments disclosed in the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a system architecture diagram of an implementation of a speech keyword recognition method according to an embodiment of the present application;
FIG. 2 is a flowchart of an embodiment of a method for recognizing a speech keyword according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a speech keyword recognition apparatus according to an embodiment of the present application.
Detailed Description
Embodiments disclosed in the present application are described below with reference to the accompanying drawings.
The voice keyword recognition method and device are applicable to various speech recognition scenarios.
Referring to fig. 1, fig. 1 is a system architecture diagram of an implementation of the speech keyword recognition method provided by an embodiment of the present application. First, a first acoustic feature and a second acoustic feature are extracted from the audio of the target keyword; for example, the first acoustic feature may be a logarithmic Mel cepstrum feature, and the second acoustic feature should carry context semantic association information. The first acoustic feature and the second acoustic feature are then spliced into a first acoustic feature sequence. A third acoustic feature and a fourth acoustic feature are extracted from the audio of the target speech to be recognized. The third acoustic feature is of the same type as the first acoustic feature, for example a logarithmic Mel cepstrum feature, except that the first acoustic feature is extracted from the keyword audio while the third acoustic feature is extracted from the target speech audio. Correspondingly, the fourth acoustic feature is of the same type as the second acoustic feature, except that the second acoustic feature is extracted from the keyword audio and the fourth acoustic feature from the target speech audio. The third and fourth acoustic features are then spliced into a second acoustic feature sequence.
The first acoustic feature sequence and the second acoustic feature sequence are then each input, in turn, into a first model and a second model trained in advance, and the first embedding vector or the second embedding vector of a specified dimension is output by the first model and the second model. The similarity between the first embedding vector and the second embedding vector is calculated, and whether the target speech contains the target keyword is determined based on the similarity; for example, if the similarity is higher than a predetermined threshold, the target speech is determined to contain the target keyword.
Specifically, referring to fig. 2, the speech keyword recognition method provided in the embodiment of the present application includes the following steps:
s201, obtaining the audio frequency of the target keyword, extracting a first acoustic feature and a second acoustic feature from the audio frequency of the target keyword, and splicing the first acoustic feature and the second acoustic feature into a first acoustic feature sequence.
S202, the first acoustic feature sequence is sequentially input into a first model and a second model which are trained in advance, and a first embedding vector with a specified dimension is output through the first model and the second model.
S203, extracting a third acoustic feature and a fourth acoustic feature from the audio of the target speech to be recognized, and splicing the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence.
And S204, sequentially inputting the second acoustic feature sequence into a first model and a second model which are trained in advance, and outputting a second embedded vector with a specified dimension through the first model and the second model.
S205, calculating the similarity between the first embedded vector and the second embedded vector, and determining whether the target speech contains the target keyword based on the similarity.
The audio of the target keyword is a sample keyword recording, and the audio of the target speech is a complete piece of audio to be examined; the audio of the target keyword corresponds to a segment within the target speech audio.
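For illustration only (not part of the claimed method), the five steps can be sketched as a single Python function in which the feature extractors, the two models and the decision threshold are passed in as hypothetical placeholders for whatever concrete implementations are used:

import torch
import torch.nn.functional as F

def detect_keyword(keyword_audio, speech_audio,
                   extract_first_feat, extract_second_feat,
                   first_model, second_model, threshold=0.5):
    # Sketch of S201-S205; all callables are hypothetical placeholders.
    # S201: extract the two frame-level features of the keyword audio and splice them per frame
    kw_seq = torch.cat([extract_first_feat(keyword_audio),
                        extract_second_feat(keyword_audio)], dim=-1)    # (T1, D)
    # S202: first model (frame encoder) then second model (pooling) -> fixed-size embedding
    kw_emb = second_model(first_model(kw_seq.unsqueeze(0)))             # (1, E)
    # S203: same two features for the target speech, spliced per frame
    sp_seq = torch.cat([extract_first_feat(speech_audio),
                        extract_second_feat(speech_audio)], dim=-1)     # (T2, D)
    # S204: embed the target speech with the same two models
    sp_emb = second_model(first_model(sp_seq.unsqueeze(0)))             # (1, E)
    # S205: similarity between the two embeddings, thresholded
    sim = F.cosine_similarity(kw_emb, sp_emb, dim=-1).item()
    return sim >= threshold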
Optionally, in one embodiment, the first acoustic feature extracted in S201 is a logarithmic Mel cepstrum feature, which is a frame-level acoustic feature and can be extracted as follows:
First, the audio is pre-emphasized to boost its high-frequency part: the audio signal of the target keyword is passed through a high-pass filter, which attenuates the low-frequency components. The high-pass filter used is:
H(z) = 1 - μz^(-1)
where μ takes a value between 0.9 and 1.0 (for example, 0.97 may be taken) and z is the variable of the z-transform.
Next, the audio is framed: every N sampling points of the audio signal are grouped into one observation unit, defined as a frame, where N is typically 256 or 512 and the frame length is about 20-30 ms. To avoid excessive change between two adjacent frames, an overlap region of M sampling points is set between them, with M about 1/2 or 1/3 of N. In this embodiment, the frame length is set to 25 ms and the frame shift to 10 ms.
Each frame is then windowed with a Hamming window; multiplying each frame by the Hamming window improves the continuity at the two ends of the frame. After windowing, a fast discrete Fourier transform is applied to each frame to obtain its spectrum, because the energy distribution in the frequency domain characterizes different speech sounds. The power spectrum energy at each frequency point is then obtained by taking the squared modulus of the spectrum.
The power spectrum energy of each frame is then passed through a Mel filter bank, and the logarithm of the filter-bank energies is taken to obtain a logarithmic Mel spectrum. A discrete cosine transform (DCT) is applied to the logarithmic Mel spectrum to obtain a cepstrum, from which the first n cepstral coefficients together with the energy value are selected as an (n+1)-dimensional feature vector, where n is a positive integer. For example, n may be 12, i.e. the first 12 cepstral coefficients and the energy value form a 13-dimensional feature vector.
First-order and second-order differences are then calculated for the (n+1)-dimensional feature vector, yielding (n+1) first-order difference features and (n+1) second-order difference features, for a total of 3(n+1) dimensions. For example, computing the first- and second-order differences of the 13-dimensional feature vector gives a 39-dimensional logarithmic Mel cepstrum feature.
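For reference, a minimal sketch of this 39-dimensional feature using librosa (an assumption on tooling; the embodiment does not prescribe any library, and librosa's 0th coefficient only approximates the energy term):

import librosa
import numpy as np

def log_mel_cepstral_features(wav_path, n=12):
    # Approximates the 3*(n+1)-dimensional feature of the embodiment (n=12 -> 39 dims).
    y, sr = librosa.load(wav_path, sr=16000)
    y = librosa.effects.preemphasis(y, coef=0.97)           # high-pass pre-emphasis, mu = 0.97
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n + 1,   # first n coefficients + energy-like C0
                                n_fft=int(0.025 * sr),      # 25 ms frame length
                                hop_length=int(0.010 * sr), # 10 ms frame shift
                                window="hamming")
    d1 = librosa.feature.delta(mfcc, order=1)               # first-order difference
    d2 = librosa.feature.delta(mfcc, order=2)               # second-order difference
    return np.concatenate([mfcc, d1, d2], axis=0).T         # (num_frames, 3*(n+1))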
In addition to the log-mel-frequency cepstral feature, the first acoustic feature may be an acoustic posterior probability feature or a neural network bottleneck feature.
An acoustic posterior probability feature is the probability distribution, computed from known prior knowledge and observation conditions, over the possible instances of the modeling unit for a frame of speech. The unit is the modeling unit, i.e. the smallest structure in the system being built; for example, if the modeling unit is the word, then phonemes and phonetic symbols below the word level are not considered, and only how words form phrases and sentences is considered. The known prior knowledge, which may be a pronunciation dictionary of the language, text, or audio annotated with text at corresponding time points, is used to train a Gaussian mixture model or a neural network model.
The observation condition may be the waveform and spectrum of the current frame, of historical frames, and of future frames in the audio signal.
In one embodiment, the known prior knowledge of each unit in a speech frame and the waveform and spectrum data of that frame are taken as input, and a Gaussian mixture model or another neural network model is trained, with the hidden states of the units as output, to obtain optimized parameters; the trained model then outputs the hidden state of each unit. Each unit is taken as a node in a hidden Markov model: the prior knowledge serves as the observed value of each observable node, the hidden state output by the Gaussian mixture model or neural network model serves as the hidden-state value of the hidden node corresponding to each observable node, and the hidden Markov model outputs the occurrence probability and transition probability of the unit represented by each node, which serve as the acoustic posterior probability feature of the corresponding speech frame. The unit may be, for instance, a monophone, a polyphone, an initial, a letter or a word.
In another embodiment, the neural network model directly outputs the acoustic posterior probability characteristics corresponding to the speech frame without passing through the hidden Markov model. For example, if the unit is an english letter, the neural network model directly outputs a 26-dimensional vector representing the probability that the frame is each letter, and the output 26-dimensional vector is the acoustic posterior probability feature.
In another embodiment, the acoustic posterior probability feature is obtained by inputting the extracted logarithmic Mel cepstrum feature of each frame into a neural network acoustic model trained with the prior knowledge and computing the probability distribution over all monophones for that frame; this probability distribution is the acoustic posterior probability feature.
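A minimal sketch of this last variant, assuming PyTorch, an already trained frame-level classifier, and a hypothetical inventory of 40 monophones (none of these choices are fixed by the embodiment):

import torch
import torch.nn as nn

NUM_PHONES = 40   # hypothetical number of monophones

class PhonePosteriorNet(nn.Module):
    # Maps a 39-dim logarithmic Mel cepstrum frame to a posterior distribution over monophones.
    def __init__(self, feat_dim=39, hidden=256, num_phones=NUM_PHONES):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_phones),
        )

    def forward(self, frames):                 # frames: (num_frames, feat_dim)
        logits = self.net(frames)
        return torch.softmax(logits, dim=-1)   # acoustic posterior probability features

# usage: posteriors = PhonePosteriorNet()(torch.randn(200, 39))  # (200, NUM_PHONES)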
The neural network bottleneck feature takes the output of an intermediate layer of a neural network as the feature. For example, a network with a two-part structure is trained: the first part is a three-layer BLSTM used as an encoding network, and the second part is a decoding network consisting of four LSTM layers that output word-level probabilities for Mandarin, English, Spanish and Persian respectively. The four languages are used for training in turn, each language training the first part together with only its corresponding LSTM layer of the decoding network. After training is completed, the output of the first part is taken as the per-frame feature, which is defined as the neural network bottleneck feature in the embodiments of the present application.
In S201, after obtaining the first acoustic feature, a second acoustic feature needs to be extracted. The second acoustic feature is also a frame-level acoustic feature, i.e., extracted in units of frames. The audio of the target keyword is input to a pre-trained neural network model, the output of which is taken as the second acoustic feature.
Specifically, the pre-trained neural network model includes a coding network and a context network. For example, the coding network and the context network are two one-dimensional convolutional neural networks. For convenience of description, the coding network is defined as a first convolutional neural network, and the context network is defined as a second convolutional neural network. The first convolutional neural network is used for coding the audio signal, and the second convolutional neural network is used for extracting context-related features in the audio signal.
In one embodiment, the first convolutional neural network comprises seven convolutional layers with 512 channels; the convolution kernel sizes are (10, 8, 4, 4, 4, 1, 1) and the convolution strides are (5, 4, 2, 2, 2, 1, 1). The second convolutional neural network comprises twelve convolutional layers with 512 channels; the convolution kernel sizes are (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) and the convolution strides are all set to 1.
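Assuming the one-dimensional convolutions operate directly on the raw waveform and that each layer is followed by a ReLU activation (both assumptions; the embodiment only specifies channels, kernel sizes and strides), the two networks could be sketched in PyTorch as:

import torch
import torch.nn as nn

def conv_stack(in_ch, channels, kernels, strides):
    layers, prev = [], in_ch
    for k, s in zip(kernels, strides):
        layers += [nn.Conv1d(prev, channels, kernel_size=k, stride=s), nn.ReLU()]
        prev = channels
    return nn.Sequential(*layers)

# first convolutional neural network: encodes the audio signal (7 layers, 512 channels)
encoder = conv_stack(in_ch=1, channels=512,
                     kernels=[10, 8, 4, 4, 4, 1, 1],
                     strides=[5, 4, 2, 2, 2, 1, 1])

# second convolutional neural network: extracts context-related features (12 layers, 512 channels)
context = conv_stack(in_ch=512, channels=512,
                     kernels=list(range(1, 13)),
                     strides=[1] * 12)

# usage:
# wav = torch.randn(1, 1, 16000)   # 1 s of 16 kHz audio (batch, channel, samples)
# z = encoder(wav)                 # frame-level encoded representation
# c = context(z)                   # second acoustic feature carrying context information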
The second acoustic feature extracted by this convolutional neural network model carries context semantic association information.
The first acoustic feature and the second acoustic feature are then spliced into the first acoustic feature sequence: for each frame, the two feature vectors are concatenated into one higher-dimensional feature vector, and the per-frame feature vectors are arranged in order to form the first acoustic feature sequence.
In S203, a third acoustic feature and a fourth acoustic feature are extracted from the audio of the target speech to be recognized and spliced into a second acoustic feature sequence; the details are the same as in S201, except that the object processed in S201 is the target keyword audio while the object processed in S203 is the target speech audio.
It should be noted that in the embodiments of the present application no execution order is required between S202 and S204, or between S201 and S203; the step numbers are only for convenience of description and should not be construed as limiting the execution order.
In S202 and S204, the first acoustic feature sequence and the second acoustic feature sequence obtained by splicing are each input into the first model and the second model trained in advance. In some embodiments, the first model and the second model are both constructed and trained in advance, and the first model may be a long short-term memory (LSTM) network model. For example, in one specific embodiment, the first model comprises three bidirectional LSTM layers, each layer containing 256 neurons; for each frame, the forward output and the backward output are concatenated as the frame's embedding vector, which has a fixed dimension of 512.
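A sketch of such a first model in PyTorch (a non-authoritative rendering; nn.LSTM with bidirectional=True concatenates the forward and backward outputs per frame, giving 2 x 256 = 512 dimensions):

import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    # Three bidirectional LSTM layers, 256 units each; per-frame output is 512-dimensional.
    def __init__(self, input_dim, hidden=256, layers=3):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden, num_layers=layers,
                             batch_first=True, bidirectional=True)

    def forward(self, feats):            # feats: (batch, frames, input_dim)
        out, _ = self.blstm(feats)       # out: (batch, frames, 2 * 256) = (batch, frames, 512)
        return out

# usage: per-frame 512-dim outputs for a batch of acoustic feature sequences
# x = torch.randn(2, 100, 551)           # e.g. 39-dim cepstral + 512-dim pre-trained features
# frame_vectors = FrameEncoder(551)(x)   # (2, 100, 512)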
The second model includes at least an attention network model for performing semantic feature aggregation. In some embodiments, the second model comprises an attention network, an averaging structure and a splicing (concatenation) structure.
In one embodiment, the attention network includes two fully-connected layers, and the activation function between the fully-connected layers is a normalized exponential function. For example, the first fully-connected layer contains 1024 neurons, and the second layer (corresponding to the output layer) contains 1 neuron.
The overall computation of the first model and the second model is as follows:
Y = FC(Z)
E = [ Σ_{t=1..T} SoftMax(y_t)·z_t ; (1/T)·Σ_{t=1..T} x_t ]
where FC(·) denotes the fully-connected layers, Z denotes the sequence of feature vectors fed into the fully-connected layers, and z_t denotes the t-th element of Z. The splicing structure concatenates the feature vector output by the attention network and the feature vector output by the averaging structure into the final output embedding vector E. The first part, before the semicolon, is the output of the attention network: the z_t are weighted and summed according to the SoftMax weights. The second part, (1/T)·Σ_{t=1..T} x_t, is the averaging structure, which averages the x_t over the frames. The two parts are spliced together as the final output vector.
SoftMax(·) is the normalized exponential function:
SoftMax(y_t) = exp(y_t) / Σ_{τ=1..T} exp(y_τ)
Y denotes the output sequence of the fully-connected layers; x_t and y_t denote, respectively, the t-th output feature vector of the three-layer bidirectional LSTM network and the t-th output of the fully-connected layers; T denotes the number of input and output feature vectors; and exp denotes the exponential function with the natural constant e as its base.
In S202 and S204, the processes of outputting the first embedding vector and the second embedding vector of the specified dimension through the first model and the second model both follow the computation above: when the input of the first model is the first acoustic feature sequence corresponding to the target keyword audio, the output embedding vector E is the first embedding vector; when the input is the second acoustic feature sequence corresponding to the target speech audio, the output embedding vector E is the second embedding vector. The dimensions of the first embedding vector and the second embedding vector are both predetermined; for example, in one embodiment both are 512. The dimensions of the input features and of the output embedding vectors are fixed at training time; if they are modified, the network must be retrained.
Next, in S205, the similarity between the first embedding vector and the second embedding vector is calculated. The similarity can be computed in various ways, such as the Manhattan distance, the Euclidean distance, the Pearson correlation coefficient, or the cosine similarity.
In one embodiment, the cosine distance is used: whether the keyword represented by the keyword audio appears in the target audio is decided according to the cosine distance between the two embedding vectors. When the cosine distance between the keyword embedding and the target speech embedding is smaller than a preset threshold (equivalently, when the similarity is higher than a preset threshold), the keyword is judged to be hit. The cosine distance is:
d(A, B) = 1 - (A·B) / (||A|| ||B||)
where A and B are the two embedding vectors to be compared and d(·,·) is the cosine distance.
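A minimal numeric illustration of this decision rule (the threshold value is arbitrary and would in practice be tuned on development data):

import numpy as np

def cosine_distance(a, b):
    # d(A, B) = 1 - (A . B) / (||A|| * ||B||)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def keyword_hit(keyword_emb, speech_emb, distance_threshold=0.4):
    # Judge a hit when the cosine distance between the two embeddings is below the threshold.
    return cosine_distance(keyword_emb, speech_emb) < distance_threshold

# usage with random 512-dimensional embeddings (illustration only)
# kw, sp = np.random.randn(512), np.random.randn(512)
# print(cosine_distance(kw, sp), keyword_hit(kw, sp))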
To sum up, in the recognition method provided by the embodiments of the present application, during keyword query the target keyword audio to be queried is input; the logarithmic Mel cepstrum feature (the first acoustic feature) and the pre-trained feature (the second acoustic feature) are extracted in step S201 and spliced into a frame-level acoustic feature sequence; the feature sequence is fed through the first model and the second model in turn, and an embedding vector of a preset fixed length is output. The input target keyword audio requires no language or text information, so no model for extracting text information needs to be built, which reduces the resource space such models would occupy during computation.
In a second aspect, referring to fig. 3, an embodiment of the present application further provides a speech keyword recognition apparatus 310, which includes:
the first extraction unit 3101 is configured to acquire an audio of a target keyword, extract a first acoustic feature and a second acoustic feature from the audio of the target keyword, and concatenate the first acoustic feature and the second acoustic feature into a first acoustic feature sequence.
A first embedding unit 3102 configured to sequentially input the first acoustic feature sequence into a first model and a second model trained in advance, and output a first embedding vector of a specified dimension through the first model and the second model; the first model is a long short-term memory (LSTM) network model; the second model includes at least an attention network model.
A second extraction unit 3103 configured to extract a third acoustic feature and a fourth acoustic feature from the audio of the target voice to be recognized, and concatenate the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence.
A second embedding unit 3104 configured to input the second acoustic feature sequence into the first model and the second model trained in advance in order, and output a second embedding vector of a specified dimension through the first model and the second model.
A recognition unit 3105 configured to calculate a similarity between the first embedded vector and the second embedded vector, and determine whether the target keyword is contained in the target speech based on the similarity.
In a third aspect, the present application further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed in a computer, the computer program causes the computer to perform the method described in any one of the above embodiments.
In a fourth aspect, an embodiment of the present application further provides a computing device, including a memory and a processor, where the memory stores executable codes, and the processor executes the executable codes to implement the method described in any of the foregoing embodiments.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments disclosed herein may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above embodiments further describe the objects, technical solutions and advantages of the embodiments disclosed in the present application in detail. It should be understood that they are only specific embodiments of the present disclosure and are not intended to limit its scope; any modification, equivalent substitution or improvement made on the basis of the technical solutions of the embodiments disclosed in the present application shall fall within the scope of the embodiments disclosed in the present application.

Claims (10)

1. A speech keyword recognition method, characterized in that the method comprises:
acquiring audio of a target keyword, extracting a first acoustic feature and a second acoustic feature from the audio of the target keyword, and splicing the first acoustic feature and the second acoustic feature into a first acoustic feature sequence; the second acoustic feature carries context semantic association information;
sequentially inputting the first acoustic feature sequence into a first model and a second model which are trained in advance, and outputting a first embedding vector with a specified dimension through the first model and the second model; the first model is a long short-term memory (LSTM) network model; the second model comprises at least an attention network model for performing semantic feature aggregation;
extracting a third acoustic feature and a fourth acoustic feature from the audio of the target voice to be recognized, and splicing the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence;
sequentially inputting the second acoustic feature sequence into a first model and a second model which are trained in advance, and outputting a second embedded vector with a specified dimension through the first model and the second model;
calculating the similarity between the first embedded vector and the second embedded vector, and determining whether the target keyword is contained in the target voice or not based on the similarity.
2. The method according to claim 1, wherein the first acoustic feature comprises any one of a logarithmic mel-frequency cepstrum feature, an acoustic posterior probability feature and a neural network bottleneck feature extracted from the audio of the target keyword;
the third acoustic feature comprises any one of a logarithmic Mel cepstrum feature, an acoustic posterior probability feature and a neural network bottleneck feature extracted from the audio of the target voice to be recognized.
3. The method of claim 2, wherein the first acoustic feature is a logarithmic mel-frequency cepstrum feature, and wherein extracting the first acoustic feature from the audio of the target keyword comprises:
inputting the audio signal of the target keyword into a high-pass filter;
framing the audio signal output by the high-pass filter according to a preset frame length and a frame shift;
windowing each frame respectively, wherein the window function is a Hamming window;
respectively carrying out fast discrete Fourier transform on each frame to obtain a frequency spectrum corresponding to each frame, and calculating power spectrum energy corresponding to each frequency point;
respectively inputting the power spectrum energy of each frame into a Mel filter, and taking the logarithm to obtain a logarithmic Mel spectrum;
performing a discrete cosine transform on the logarithmic Mel spectrum to obtain a cepstrum, and selecting the first n cepstral coefficients and the energy value as an (n+1)-dimensional feature;
and performing first-order and second-order difference operations on the (n+1)-dimensional feature to obtain a logarithmic Mel cepstrum feature with a dimension of 3(n+1).
4. The method of claim 1, wherein extracting a second acoustic feature from the audio of the target keyword comprises:
inputting the audio signal of the target keyword into a pre-trained convolutional neural network model, and outputting a second acoustic feature; the convolutional neural network model comprises a first convolutional neural network and a second convolutional neural network, wherein the first convolutional neural network is used for coding the audio signal, and the second convolutional neural network is used for extracting context-related features in the audio signal.
5. The method of claim 4, wherein the first convolutional neural network comprises seven convolutional layers, the number of channels is 512, the convolution kernel sizes are 10, 8, 4, 4, 4, 1 and 1 respectively, and the convolution strides are 5, 4, 2, 2, 2, 1 and 1 respectively; and/or,
the second convolutional neural network comprises twelve convolutional layers, the number of channels is 512, the convolution kernel sizes are 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12 respectively, and the convolution strides are all 1.
6. The method according to claim 1, wherein the first model is a bidirectional long short-term memory (BLSTM) network model comprising three bidirectional LSTM layers, each layer comprising a plurality of hidden units;
the bidirectional long-time and short-time memory network model is used for carrying out forward processing and backward processing on each frame in the audio signal, and splicing the forward output and the backward output to be used as a first embedded vector or a second embedded vector corresponding to the frame.
7. The method of claim 1, wherein the attention network model comprises two fully-connected layers, a first layer of the two fully-connected layers containing 1024 neurons and a second layer containing 1 neuron, the activation function between the two fully-connected layers being a normalized exponential function.
8. A speech keyword recognition apparatus, comprising:
the first extraction unit is configured to acquire the audio of a target keyword, extract a first acoustic feature and a second acoustic feature from the audio of the target keyword, and splice the first acoustic feature and the second acoustic feature into a first acoustic feature sequence;
the first embedding unit is configured to sequentially input the first acoustic feature sequence into a first model and a second model which are trained in advance, and output a first embedding vector with a specified dimension through the first model and the second model; the first model is a long short-term memory (LSTM) network model; the second model comprises at least an attention network model;
the second extraction unit is configured to extract a third acoustic feature and a fourth acoustic feature from the audio of the target voice to be recognized, and splice the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence;
a second embedding unit configured to input the second acoustic feature sequence into a first model and a second model trained in advance in sequence, and output a second embedding vector of a specified dimension through the first model and the second model;
and the recognition unit is configured to calculate the similarity between the first embedded vector and the second embedded vector and determine whether the target keyword is contained in the target voice or not based on the similarity.
9. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that, when executed by the processor, implements the method of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, which, when the computer program is executed in a computer, causes the computer to carry out the method according to any one of claims 1-7.
CN202010688457.1A 2020-07-16 2020-07-16 Voice keyword recognition method and device Active CN111798840B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010688457.1A CN111798840B (en) 2020-07-16 2020-07-16 Voice keyword recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010688457.1A CN111798840B (en) 2020-07-16 2020-07-16 Voice keyword recognition method and device

Publications (2)

Publication Number Publication Date
CN111798840A true CN111798840A (en) 2020-10-20
CN111798840B CN111798840B (en) 2023-08-08

Family

ID=72807488

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010688457.1A Active CN111798840B (en) 2020-07-16 2020-07-16 Voice keyword recognition method and device

Country Status (1)

Country Link
CN (1) CN111798840B (en)



Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001290496A (en) * 2000-04-07 2001-10-19 Ricoh Co Ltd Speech retrieval device, speech retrieval method and recording medium
US20100094626A1 (en) * 2006-09-27 2010-04-15 Fengqin Li Method and apparatus for locating speech keyword and speech recognition system
US20090234854A1 (en) * 2008-03-11 2009-09-17 Hitachi, Ltd. Search system and search method for speech database
US20100179811A1 (en) * 2009-01-13 2010-07-15 Crim Identifying keyword occurrences in audio data
US20150279351A1 (en) * 2012-12-19 2015-10-01 Google Inc. Keyword detection based on acoustic alignment
CN103559881A (en) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Language-irrelevant key word recognition method and system
US20150302847A1 (en) * 2014-04-17 2015-10-22 Qualcomm Incorporated Keyword model generation for detecting user-defined keyword
WO2018019116A1 (en) * 2016-07-28 2018-02-01 上海未来伙伴机器人有限公司 Natural language-based man-machine interaction method and system
WO2018107810A1 (en) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus, and electronic device and medium
CN110349572A (en) * 2017-05-27 2019-10-18 腾讯科技(深圳)有限公司 A kind of voice keyword recognition method, device, terminal and server
CN110444193A (en) * 2018-01-31 2019-11-12 腾讯科技(深圳)有限公司 The recognition methods of voice keyword and device
CN110610707A (en) * 2019-09-20 2019-12-24 科大讯飞股份有限公司 Voice keyword recognition method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
赵晓群; 张扬: "A survey of acoustic model construction for speech keyword recognition systems" (语音关键词识别系统声学模型构建综述), Journal of Yanshan University (燕山大学学报), no. 06 *
郑铁然; 张战; 韩纪庆: "A keyword spotting method based on restricted model size and acoustic confidence" (基于限制模型规模和声学置信度的关键词检出方法), Computer Science (计算机科学), no. 01 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634870A (en) * 2020-12-11 2021-04-09 平安科技(深圳)有限公司 Keyword detection method, device, equipment and storage medium
WO2022121188A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Keyword detection method and apparatus, device and storage medium
CN112634870B (en) * 2020-12-11 2023-05-30 平安科技(深圳)有限公司 Keyword detection method, device, equipment and storage medium
CN112530410A (en) * 2020-12-24 2021-03-19 北京地平线机器人技术研发有限公司 Command word recognition method and device
CN112685594A (en) * 2020-12-24 2021-04-20 中国人民解放军战略支援部队信息工程大学 Attention-based weak supervision voice retrieval method and system
CN112685594B (en) * 2020-12-24 2022-10-04 中国人民解放军战略支援部队信息工程大学 Attention-based weak supervision voice retrieval method and system
CN113470693A (en) * 2021-07-07 2021-10-01 杭州网易云音乐科技有限公司 Method and device for detecting singing, electronic equipment and computer readable storage medium
CN113671031A (en) * 2021-08-20 2021-11-19 北京房江湖科技有限公司 Wall hollowing detection method and device
CN114817456A (en) * 2022-03-10 2022-07-29 马上消费金融股份有限公司 Keyword detection method and device, computer equipment and storage medium
CN114817456B (en) * 2022-03-10 2023-09-05 马上消费金融股份有限公司 Keyword detection method, keyword detection device, computer equipment and storage medium
CN116453514A (en) * 2023-06-08 2023-07-18 四川大学 Multi-view-based voice keyword detection and positioning method and device
CN116453514B (en) * 2023-06-08 2023-08-25 四川大学 Multi-view-based voice keyword detection and positioning method and device

Also Published As

Publication number Publication date
CN111798840B (en) 2023-08-08

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN111798840B (en) Voice keyword recognition method and device
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
Collobert et al. Wav2letter: an end-to-end convnet-based speech recognition system
Ferrer et al. Study of senone-based deep neural network approaches for spoken language recognition
KR100755677B1 (en) Apparatus and method for dialogue speech recognition using topic detection
US8321218B2 (en) Searching in audio speech
US6845357B2 (en) Pattern recognition using an observable operator model
EP4018437B1 (en) Optimizing a keyword spotting system
CN107093422B (en) Voice recognition method and voice recognition system
CN112397056B (en) Voice evaluation method and computer storage medium
CN112331229B (en) Voice detection method, device, medium and computing equipment
Nasereddin et al. Classification techniques for automatic speech recognition (ASR) algorithms used with real time speech translation
CN115019776A (en) Voice recognition model, training method thereof, voice recognition method and device
Das et al. Best of both worlds: Robust accented speech recognition with adversarial transfer learning
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN114783418A (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
Radha et al. Speech and speaker recognition using raw waveform modeling for adult and children’s speech: a comprehensive review
Biswas et al. Speech Recognition using Weighted Finite-State Transducers
CN114333762B (en) Expressive force-based speech synthesis method, expressive force-based speech synthesis system, electronic device and storage medium
KR100480790B1 (en) Method and apparatus for continous speech recognition using bi-directional n-gram language model
Saha Development of a bangla speech to text conversion system using deep learning
CN115132170A (en) Language classification method and device and computer readable storage medium
Tabibian A survey on structured discriminative spoken keyword spotting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant