CN111798840A - Voice keyword recognition method and device - Google Patents

Voice keyword recognition method and device

Info

Publication number
CN111798840A
CN111798840A
Authority
CN
China
Prior art keywords
acoustic feature
model
feature
acoustic
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010688457.1A
Other languages
Chinese (zh)
Other versions
CN111798840B (en)
Inventor
赵江江
李昭奇
任玉玲
李青龙
黎塔
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
China Mobile Online Services Co Ltd
Original Assignee
Institute of Acoustics CAS
China Mobile Online Services Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, China Mobile Online Services Co Ltd filed Critical Institute of Acoustics CAS
Priority to CN202010688457.1A priority Critical patent/CN111798840B/en
Publication of CN111798840A publication Critical patent/CN111798840A/en
Application granted granted Critical
Publication of CN111798840B publication Critical patent/CN111798840B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L2015/088 Word spotting
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques using neural networks
    • G10L25/45 Speech or voice analysis techniques characterised by the type of analysis window
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a voice keyword recognition method and device. A first acoustic feature and a second acoustic feature are extracted from the audio of a target keyword and spliced into a first acoustic feature sequence; a third acoustic feature and a fourth acoustic feature are extracted from the audio of a target speech to be recognized and spliced into a second acoustic feature sequence; the first acoustic feature sequence and the second acoustic feature sequence are each fed in turn through a pre-trained first model and second model, which output a first embedding vector and a second embedding vector respectively; the similarity between the first embedding vector and the second embedding vector is calculated, and whether the target speech contains the target keyword is determined on the basis of the similarity. The method allows the output embedding vectors to carry more context information and improves the effectiveness of sample-based keyword recognition.

Description

Voice keyword recognition method and device
Technical Field
The embodiments of the present application relate to the technical field of audio signal processing, and in particular to a voice keyword recognition method and device.
Background
Keyword detection (spoken keyword spotting, or spoken term detection) is a sub-field of speech recognition that aims to detect all positions at which specified words appear in a speech signal, and it is one of the important research topics in human-computer interaction. Traditional keyword recognition requires building a speech recognition system, which generally comprises an acoustic model, a pronunciation dictionary and a language model; a complex decoding network must be constructed with weighted finite-state transducers to convert the acoustic feature sequence into a text sequence, which is then searched. The computational complexity is high and considerable resources are occupied.
Sample-based keyword recognition schemes avoid building a recognition system and compare keywords only through acoustic similarity, so they perform well in low-resource scenarios where an effective speech recognition system cannot be built. However, the features extracted by such schemes contain little context information and cannot fully represent the semantic association between a keyword and its sentence context, which limits the performance of sample-based keyword recognition; the recognition effectiveness therefore needs further improvement.
Disclosure of Invention
The present application describes a voice keyword recognition method and device that address the above problems.
In a first aspect, an embodiment of the present application provides a method for recognizing a speech keyword, where the method includes:
acquiring audio of a target keyword, extracting a first acoustic feature and a second acoustic feature from the audio of the target keyword, and splicing the first acoustic feature and the second acoustic feature into a first acoustic feature sequence, the second acoustic feature carrying context semantic association information; sequentially inputting the first acoustic feature sequence into a first model and a second model which are trained in advance, and outputting a first embedding vector of a specified dimension through the first model and the second model, the first model being a long short-term memory (LSTM) network model and the second model comprising at least an attention network model used for performing semantic feature aggregation; extracting a third acoustic feature and a fourth acoustic feature from the audio of the target speech to be recognized, and splicing the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence; sequentially inputting the second acoustic feature sequence into the pre-trained first model and second model, and outputting a second embedding vector of the specified dimension through the first model and the second model; and calculating the similarity between the first embedding vector and the second embedding vector, and determining whether the target speech contains the target keyword on the basis of the similarity.
In one embodiment, the first acoustic feature includes any one of a logarithmic mel-frequency cepstrum feature, an acoustic posterior probability feature and a neural network bottleneck feature extracted from the audio frequency of the target keyword;
the third acoustic feature comprises any one of a logarithmic Mel cepstrum feature, an acoustic posterior probability feature and a neural network bottleneck feature extracted from the audio of the target speech to be recognized.
In one embodiment, the first acoustic feature is a logarithmic Mel cepstrum feature, and extracting the first acoustic feature from the audio of the target keyword includes:
inputting the audio signal of the target keyword into a high-pass filter; framing the audio signal output by the high-pass filter according to a preset frame length and frame shift; windowing each frame, the window function being a Hamming window; performing a fast discrete Fourier transform on each frame to obtain the spectrum corresponding to each frame, and calculating the power spectrum energy at each frequency point; passing the power spectrum energy of each frame through a Mel filter bank and taking the logarithm to obtain a logarithmic Mel spectrum; performing a discrete cosine transform on the logarithmic Mel spectrum to obtain a cepstrum, and selecting the first n cepstral coefficients together with the energy value as an (n+1)-dimensional feature; and performing first-order and second-order difference operations on the (n+1)-dimensional feature to obtain a logarithmic Mel cepstrum feature of dimension 3(n+1).
In one embodiment, extracting the second acoustic feature from the audio of the target keyword comprises:
inputting the audio signal of the target keyword into a pre-trained convolutional neural network model, and outputting the second acoustic feature; the convolutional neural network model comprises a first convolutional neural network and a second convolutional neural network, wherein the first convolutional neural network is used for encoding the audio signal, and the second convolutional neural network is used for extracting context-related features of the audio signal.
In one embodiment, the first convolutional neural network comprises seven convolutional layers, the number of channels is 512, the convolution kernel sizes are 10, 8, 4, 4, 4, 1 and 1 respectively, and the convolution strides are 5, 4, 2, 2, 2, 1 and 1 respectively; and/or,
the second convolutional neural network comprises twelve convolutional layers, the number of channels is 512, the convolution kernel sizes are 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12 respectively, and the convolution strides are all 1.
In one embodiment, the first model is a bidirectional long short-term memory (BLSTM) network model comprising three bidirectional LSTM layers, each layer comprising a plurality of hidden units; the bidirectional LSTM network model performs forward processing and backward processing on each frame of the audio signal, and the forward output and the backward output are spliced as the first embedding vector or the second embedding vector corresponding to the frame.
In one embodiment, the attention network model includes two fully-connected layers, and the activation function between the two fully-connected layers is a normalized exponential function.
In one embodiment, the first of the two fully-connected layers contains 1024 neurons and the second layer contains 1 neuron.
In a second aspect, an embodiment of the present application provides a speech keyword recognition apparatus, including:
the first extraction unit is configured to acquire the audio of a target keyword, extract a first acoustic feature and a second acoustic feature from the audio of the target keyword, and splice the first acoustic feature and the second acoustic feature into a first acoustic feature sequence; the first embedding unit is configured to sequentially input the first acoustic feature sequence into a first model and a second model trained in advance, and output a first embedding vector of a specified dimension through the first model and the second model, the first model being a long short-term memory (LSTM) network model and the second model comprising at least an attention network model; the second extraction unit is configured to extract a third acoustic feature and a fourth acoustic feature from the audio of the target speech to be recognized, and splice the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence; the second embedding unit is configured to sequentially input the second acoustic feature sequence into the pre-trained first model and second model, and output a second embedding vector of the specified dimension through the first model and the second model; and the recognition unit is configured to calculate the similarity between the first embedding vector and the second embedding vector and determine whether the target speech contains the target keyword on the basis of the similarity.
In a third aspect, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed in a computer, it causes the computer to perform the method of the first aspect.
In a fourth aspect, an embodiment of the present application further provides a computing device comprising a memory and a processor, the memory storing executable code; when the processor executes the executable code, the method of the first aspect is implemented.
With the voice keyword recognition method and device provided by the embodiments of the present application, both the first acoustic feature sequence extracted from the audio of the target keyword and the second acoustic feature sequence extracted from the audio of the target speech to be recognized contain context semantic association information. The two feature sequences are each input into a pre-trained first model and second model: the first model is a long short-term memory (LSTM) network model that can extract context-related features from the feature sequences, and the second model contains an attention network model that further performs semantic feature aggregation. Embedding vectors carrying context semantic information are therefore output, the similarity is computed on these embedding vectors, and the recognition accuracy is higher.
Drawings
To explain the technical solutions of the embodiments disclosed in the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments disclosed in the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a system architecture diagram of an implementation of a speech keyword recognition method according to an embodiment of the present application;
FIG. 2 is a flowchart of an embodiment of a method for recognizing a speech keyword according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a speech keyword recognition apparatus according to an embodiment of the present application.
Detailed Description
Embodiments disclosed in the present application are described below with reference to the accompanying drawings.
The voice keyword recognition method and device are applicable to various speech recognition scenarios.
Referring to fig. 1, fig. 1 is a system architecture diagram of an implementation of the speech keyword recognition method provided by an embodiment of the present application. First, a first acoustic feature and a second acoustic feature are extracted from the audio of the target keyword; for example, the first acoustic feature may be a logarithmic Mel cepstrum feature, and the second acoustic feature should carry context semantic association information. The first acoustic feature and the second acoustic feature are then spliced into a first acoustic feature sequence. A third acoustic feature and a fourth acoustic feature are extracted from the audio of the target speech to be recognized. The third acoustic feature is of the same type as the first acoustic feature, for example a logarithmic Mel cepstrum feature, except that the first acoustic feature is extracted from the keyword audio while the third acoustic feature is extracted from the target speech audio. Correspondingly, the fourth acoustic feature is of the same type as the second acoustic feature, except that the second acoustic feature is extracted from the keyword audio and the fourth acoustic feature from the target speech audio. The third and fourth acoustic features are then spliced into a second acoustic feature sequence.
The first acoustic feature sequence and the second acoustic feature sequence are then each input, in turn, into a first model and a second model trained in advance, and the first embedding vector or the second embedding vector of a specified dimension is output by the first model and the second model. The similarity between the first embedding vector and the second embedding vector is calculated, and whether the target speech contains the target keyword is determined based on the similarity; for example, if the similarity is higher than a predetermined threshold, the target speech is determined to contain the target keyword.
Specifically, referring to fig. 2, the speech keyword recognition method provided in the embodiment of the present application includes the following steps:
s201, obtaining the audio frequency of the target keyword, extracting a first acoustic feature and a second acoustic feature from the audio frequency of the target keyword, and splicing the first acoustic feature and the second acoustic feature into a first acoustic feature sequence.
S202, the first acoustic feature sequence is sequentially input into a first model and a second model which are trained in advance, and a first embedding vector with a specified dimension is output through the first model and the second model.
S203, extracting a third acoustic feature and a fourth acoustic feature from the audio of the target speech to be recognized, and splicing the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence.
And S204, sequentially inputting the second acoustic feature sequence into a first model and a second model which are trained in advance, and outputting a second embedded vector with a specified dimension through the first model and the second model.
S205, calculating the similarity between the first embedded vector and the second embedded vector, and determining whether the target speech contains the target keyword based on the similarity.
The audio of the target keyword is a sample keyword recording, and the audio of the target speech is a complete piece of audio to be examined; the audio of the target keyword corresponds to a segment within the target speech audio.
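For illustration only (not part of the claimed method), the five steps can be sketched as a single Python function in which the feature extractors, the two models and the decision threshold are passed in as hypothetical placeholders for whatever concrete implementations are used:

import torch
import torch.nn.functional as F

def detect_keyword(keyword_audio, speech_audio,
                   extract_first_feat, extract_second_feat,
                   first_model, second_model, threshold=0.5):
    # Sketch of S201-S205; all callables are hypothetical placeholders.
    # S201: extract the two frame-level features of the keyword audio and splice them per frame
    kw_seq = torch.cat([extract_first_feat(keyword_audio),
                        extract_second_feat(keyword_audio)], dim=-1)    # (T1, D)
    # S202: first model (frame encoder) then second model (pooling) -> fixed-size embedding
    kw_emb = second_model(first_model(kw_seq.unsqueeze(0)))             # (1, E)
    # S203: same two features for the target speech, spliced per frame
    sp_seq = torch.cat([extract_first_feat(speech_audio),
                        extract_second_feat(speech_audio)], dim=-1)     # (T2, D)
    # S204: embed the target speech with the same two models
    sp_emb = second_model(first_model(sp_seq.unsqueeze(0)))             # (1, E)
    # S205: similarity between the two embeddings, thresholded
    sim = F.cosine_similarity(kw_emb, sp_emb, dim=-1).item()
    return sim >= threshold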
Optionally, in one embodiment, the first acoustic feature extracted in S201 is a logarithmic Mel cepstrum feature, which is a frame-level acoustic feature and can be extracted as follows:
First, the audio is pre-emphasized to boost its high-frequency part: the audio signal of the target keyword is passed through a high-pass filter, which attenuates the low-frequency components. The high-pass filter used is:
H(z) = 1 - μz^(-1)
where μ takes a value between 0.9 and 1.0 (for example, 0.97 may be taken) and z is the variable of the z-transform.
Next, the audio is framed: every N sampling points of the audio signal are grouped into one observation unit, defined as a frame, where N is typically 256 or 512 and the frame length is about 20-30 ms. To avoid excessive change between two adjacent frames, an overlap region of M sampling points is set between them, with M about 1/2 or 1/3 of N. In this embodiment, the frame length is set to 25 ms and the frame shift to 10 ms.
Each frame is then windowed with a Hamming window; multiplying each frame by the Hamming window improves the continuity at the two ends of the frame. After windowing, a fast discrete Fourier transform is applied to each frame to obtain its spectrum, because the energy distribution in the frequency domain characterizes different speech sounds. The power spectrum energy at each frequency point is then obtained by taking the squared modulus of the spectrum.
The power spectrum energy of each frame is then passed through a Mel filter bank, and the logarithm of the filter-bank energies is taken to obtain a logarithmic Mel spectrum. A discrete cosine transform (DCT) is applied to the logarithmic Mel spectrum to obtain a cepstrum, from which the first n cepstral coefficients together with the energy value are selected as an (n+1)-dimensional feature vector, where n is a positive integer. For example, n may be 12, i.e. the first 12 cepstral coefficients and the energy value form a 13-dimensional feature vector.
First-order and second-order differences are then calculated for the (n+1)-dimensional feature vector, yielding (n+1) first-order difference features and (n+1) second-order difference features, for a total of 3(n+1) dimensions. For example, computing the first- and second-order differences of the 13-dimensional feature vector gives a 39-dimensional logarithmic Mel cepstrum feature.
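For reference, a minimal sketch of this 39-dimensional feature using librosa (an assumption on tooling; the embodiment does not prescribe any library, and librosa's 0th coefficient only approximates the energy term):

import librosa
import numpy as np

def log_mel_cepstral_features(wav_path, n=12):
    # Approximates the 3*(n+1)-dimensional feature of the embodiment (n=12 -> 39 dims).
    y, sr = librosa.load(wav_path, sr=16000)
    y = librosa.effects.preemphasis(y, coef=0.97)           # high-pass pre-emphasis, mu = 0.97
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n + 1,   # first n coefficients + energy-like C0
                                n_fft=int(0.025 * sr),      # 25 ms frame length
                                hop_length=int(0.010 * sr), # 10 ms frame shift
                                window="hamming")
    d1 = librosa.feature.delta(mfcc, order=1)               # first-order difference
    d2 = librosa.feature.delta(mfcc, order=2)               # second-order difference
    return np.concatenate([mfcc, d1, d2], axis=0).T         # (num_frames, 3*(n+1))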
In addition to the log-mel-frequency cepstral feature, the first acoustic feature may be an acoustic posterior probability feature or a neural network bottleneck feature.
An acoustic posterior probability feature is the probability distribution, computed from known prior knowledge and observation conditions, over the possible instances of the modeling unit for a frame of speech. The unit is the modeling unit, i.e. the smallest structure in the system being built; for example, if the modeling unit is the word, then phonemes and phonetic symbols below the word level are not considered, and only how words form phrases and sentences is considered. The known prior knowledge, which may be a pronunciation dictionary of the language, text, or audio annotated with text at corresponding time points, is used to train a Gaussian mixture model or a neural network model.
The observation condition may be the waveform and spectrum of the current frame, of historical frames, and of future frames in the audio signal.
In one embodiment, the known prior knowledge of each unit in a speech frame and the waveform and spectrum data of that frame are taken as input, and a Gaussian mixture model or another neural network model is trained, with the hidden states of the units as output, to obtain optimized parameters; the trained model then outputs the hidden state of each unit. Each unit is taken as a node in a hidden Markov model: the prior knowledge serves as the observed value of each observable node, the hidden state output by the Gaussian mixture model or neural network model serves as the hidden-state value of the hidden node corresponding to each observable node, and the hidden Markov model outputs the occurrence probability and transition probability of the unit represented by each node, which serve as the acoustic posterior probability feature of the corresponding speech frame. The unit may be, for instance, a monophone, a polyphone, an initial, a letter or a word.
In another embodiment, the neural network model directly outputs the acoustic posterior probability characteristics corresponding to the speech frame without passing through the hidden Markov model. For example, if the unit is an english letter, the neural network model directly outputs a 26-dimensional vector representing the probability that the frame is each letter, and the output 26-dimensional vector is the acoustic posterior probability feature.
In another embodiment, the acoustic posterior probability feature is obtained by inputting the extracted logarithmic Mel cepstrum feature of each frame into a neural network acoustic model trained with the prior knowledge and computing the probability distribution over all monophones for that frame; this probability distribution is the acoustic posterior probability feature.
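A minimal sketch of this last variant, assuming PyTorch, an already trained frame-level classifier, and a hypothetical inventory of 40 monophones (none of these choices are fixed by the embodiment):

import torch
import torch.nn as nn

NUM_PHONES = 40   # hypothetical number of monophones

class PhonePosteriorNet(nn.Module):
    # Maps a 39-dim logarithmic Mel cepstrum frame to a posterior distribution over monophones.
    def __init__(self, feat_dim=39, hidden=256, num_phones=NUM_PHONES):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_phones),
        )

    def forward(self, frames):                 # frames: (num_frames, feat_dim)
        logits = self.net(frames)
        return torch.softmax(logits, dim=-1)   # acoustic posterior probability features

# usage: posteriors = PhonePosteriorNet()(torch.randn(200, 39))  # (200, NUM_PHONES)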
The neural network bottleneck feature takes the output of an intermediate layer of a neural network as the feature. For example, a network with a two-part structure is trained: the first part is a three-layer BLSTM used as an encoding network, and the second part is a decoding network consisting of four LSTM layers that output word-level probabilities for Mandarin, English, Spanish and Persian respectively. The four languages are used for training in turn, each language training the first part together with only its corresponding LSTM layer of the decoding network. After training is completed, the output of the first part is taken as the per-frame feature, which is defined as the neural network bottleneck feature in the embodiments of the present application.
In S201, after obtaining the first acoustic feature, a second acoustic feature needs to be extracted. The second acoustic feature is also a frame-level acoustic feature, i.e., extracted in units of frames. The audio of the target keyword is input to a pre-trained neural network model, the output of which is taken as the second acoustic feature.
Specifically, the pre-trained neural network model includes a coding network and a context network. For example, the coding network and the context network are two one-dimensional convolutional neural networks. For convenience of description, the coding network is defined as a first convolutional neural network, and the context network is defined as a second convolutional neural network. The first convolutional neural network is used for coding the audio signal, and the second convolutional neural network is used for extracting context-related features in the audio signal.
In one embodiment, the first convolutional neural network comprises seven convolutional layers with 512 channels; the convolution kernel sizes are (10, 8, 4, 4, 4, 1, 1) and the convolution strides are (5, 4, 2, 2, 2, 1, 1). The second convolutional neural network comprises twelve convolutional layers with 512 channels; the convolution kernel sizes are (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) and the convolution strides are all set to 1.
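Assuming the one-dimensional convolutions operate directly on the raw waveform and that each layer is followed by a ReLU activation (both assumptions; the embodiment only specifies channels, kernel sizes and strides), the two networks could be sketched in PyTorch as:

import torch
import torch.nn as nn

def conv_stack(in_ch, channels, kernels, strides):
    layers, prev = [], in_ch
    for k, s in zip(kernels, strides):
        layers += [nn.Conv1d(prev, channels, kernel_size=k, stride=s), nn.ReLU()]
        prev = channels
    return nn.Sequential(*layers)

# first convolutional neural network: encodes the audio signal (7 layers, 512 channels)
encoder = conv_stack(in_ch=1, channels=512,
                     kernels=[10, 8, 4, 4, 4, 1, 1],
                     strides=[5, 4, 2, 2, 2, 1, 1])

# second convolutional neural network: extracts context-related features (12 layers, 512 channels)
context = conv_stack(in_ch=512, channels=512,
                     kernels=list(range(1, 13)),
                     strides=[1] * 12)

# usage:
# wav = torch.randn(1, 1, 16000)   # 1 s of 16 kHz audio (batch, channel, samples)
# z = encoder(wav)                 # frame-level encoded representation
# c = context(z)                   # second acoustic feature carrying context information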
The second acoustic feature extracted by this convolutional neural network model carries context semantic association information.
The first acoustic feature and the second acoustic feature are then spliced into the first acoustic feature sequence: for each frame, the two feature vectors are concatenated into one higher-dimensional feature vector, and the per-frame feature vectors are arranged in order to form the first acoustic feature sequence.
In S203, a third acoustic feature and a fourth acoustic feature are extracted from the audio of the target speech to be recognized and spliced into a second acoustic feature sequence; the details are the same as in S201, except that the object processed in S201 is the target keyword audio while the object processed in S203 is the target speech audio.
It should be noted that in the embodiments of the present application no execution order is required between S202 and S204, or between S201 and S203; the step numbers are only for convenience of description and should not be construed as limiting the execution order.
In S202 and S204, the first acoustic feature sequence and the second acoustic feature sequence obtained by splicing are each input into the first model and the second model trained in advance. In some embodiments, the first model and the second model are both constructed and trained in advance, and the first model may be a long short-term memory (LSTM) network model. For example, in one specific embodiment, the first model comprises three bidirectional LSTM layers, each layer containing 256 neurons; for each frame, the forward output and the backward output are concatenated as the frame's embedding vector, which has a fixed dimension of 512.
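A sketch of such a first model in PyTorch (a non-authoritative rendering; nn.LSTM with bidirectional=True concatenates the forward and backward outputs per frame, giving 2 x 256 = 512 dimensions):

import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    # Three bidirectional LSTM layers, 256 units each; per-frame output is 512-dimensional.
    def __init__(self, input_dim, hidden=256, layers=3):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden, num_layers=layers,
                             batch_first=True, bidirectional=True)

    def forward(self, feats):            # feats: (batch, frames, input_dim)
        out, _ = self.blstm(feats)       # out: (batch, frames, 2 * 256) = (batch, frames, 512)
        return out

# usage: per-frame 512-dim outputs for a batch of acoustic feature sequences
# x = torch.randn(2, 100, 551)           # e.g. 39-dim cepstral + 512-dim pre-trained features
# frame_vectors = FrameEncoder(551)(x)   # (2, 100, 512)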
The second model includes at least an attention network model for performing semantic feature aggregation. In some embodiments, the second model comprises an attention network, an averaging structure and a splicing (concatenation) structure.
In one embodiment, the attention network includes two fully-connected layers, and the activation function between the fully-connected layers is a normalized exponential function. For example, the first fully-connected layer contains 1024 neurons, and the second layer (corresponding to the output layer) contains 1 neuron.
The overall computation of the first model and the second model is as follows:
Y = FC(Z)
E = [ Σ_{t=1..T} SoftMax(y_t)·z_t ; (1/T)·Σ_{t=1..T} x_t ]
where FC(·) denotes the fully-connected layers, Z denotes the sequence of feature vectors fed into the fully-connected layers, and z_t denotes the t-th element of Z. The splicing structure concatenates the feature vector output by the attention network and the feature vector output by the averaging structure into the final output embedding vector E. The first part, before the semicolon, is the output of the attention network: the z_t are weighted and summed according to the SoftMax weights. The second part, (1/T)·Σ_{t=1..T} x_t, is the averaging structure, which averages the x_t over the frames. The two parts are spliced together as the final output vector.
SoftMax(·) is the normalized exponential function:
SoftMax(y_t) = exp(y_t) / Σ_{τ=1..T} exp(y_τ)
Y denotes the output sequence of the fully-connected layers; x_t and y_t denote, respectively, the t-th output feature vector of the three-layer bidirectional LSTM network and the t-th output of the fully-connected layers; T denotes the number of input and output feature vectors; and exp denotes the exponential function with the natural constant e as its base.
In S202 and S204, the processes of outputting the first embedding vector and the second embedding vector of the specified dimension through the first model and the second model both follow the computation above: when the input of the first model is the first acoustic feature sequence corresponding to the target keyword audio, the output embedding vector E is the first embedding vector; when the input is the second acoustic feature sequence corresponding to the target speech audio, the output embedding vector E is the second embedding vector. The dimensions of the first embedding vector and the second embedding vector are both predetermined; for example, in one embodiment both are 512. The dimensions of the input features and of the output embedding vectors are fixed at training time; if they are modified, the network must be retrained.
Next, in S205, the similarity between the first embedding vector and the second embedding vector is calculated. The similarity can be computed in various ways, such as the Manhattan distance, the Euclidean distance, the Pearson correlation coefficient, or the cosine similarity.
In one embodiment, the cosine distance is used: whether the keyword represented by the keyword audio appears in the target audio is decided according to the cosine distance between the two embedding vectors. When the cosine distance between the keyword embedding and the target speech embedding is smaller than a preset threshold (equivalently, when the similarity is higher than a preset threshold), the keyword is judged to be hit. The cosine distance is:
d(A, B) = 1 - (A·B) / (||A|| ||B||)
where A and B are the two embedding vectors to be compared and d(·,·) is the cosine distance.
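A minimal numeric illustration of this decision rule (the threshold value is arbitrary and would in practice be tuned on development data):

import numpy as np

def cosine_distance(a, b):
    # d(A, B) = 1 - (A . B) / (||A|| * ||B||)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def keyword_hit(keyword_emb, speech_emb, distance_threshold=0.4):
    # Judge a hit when the cosine distance between the two embeddings is below the threshold.
    return cosine_distance(keyword_emb, speech_emb) < distance_threshold

# usage with random 512-dimensional embeddings (illustration only)
# kw, sp = np.random.randn(512), np.random.randn(512)
# print(cosine_distance(kw, sp), keyword_hit(kw, sp))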
To sum up, in the recognition method provided by the embodiments of the present application, during keyword query the target keyword audio to be queried is input; the logarithmic Mel cepstrum feature (the first acoustic feature) and the pre-trained feature (the second acoustic feature) are extracted in step S201 and spliced into a frame-level acoustic feature sequence; the feature sequence is fed through the first model and the second model in turn, and an embedding vector of a preset fixed length is output. The input target keyword audio requires no language or text information, so no model for extracting text information needs to be built, which reduces the resource space such models would occupy during computation.
In a second aspect, referring to fig. 3, an embodiment of the present application further provides a speech keyword recognition apparatus 310, which includes:
the first extraction unit 3101 is configured to acquire an audio of a target keyword, extract a first acoustic feature and a second acoustic feature from the audio of the target keyword, and concatenate the first acoustic feature and the second acoustic feature into a first acoustic feature sequence.
A first embedding unit 3102 configured to sequentially input the first acoustic feature sequence into a first model and a second model trained in advance, and output a first embedding vector of a specified dimension through the first model and the second model; the first model is a long short-term memory (LSTM) network model; the second model includes at least an attention network model.
A second extraction unit 3103 configured to extract a third acoustic feature and a fourth acoustic feature from the audio of the target voice to be recognized, and concatenate the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence.
A second embedding unit 3104 configured to input the second acoustic feature sequence into the first model and the second model trained in advance in order, and output a second embedding vector of a specified dimension through the first model and the second model.
A recognition unit 3105 configured to calculate a similarity between the first embedded vector and the second embedded vector, and determine whether the target keyword is contained in the target speech based on the similarity.
In a third aspect, the present application further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed in a computer, the computer program causes the computer to perform the method described in any one of the above embodiments.
In a fourth aspect, an embodiment of the present application further provides a computing device, including a memory and a processor, where the memory stores executable codes, and the processor executes the executable codes to implement the method described in any of the foregoing embodiments.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments disclosed herein may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above embodiments further describe the objects, technical solutions and advantages of the embodiments disclosed in the present application in detail. It should be understood that they are only specific embodiments of the present disclosure and are not intended to limit its scope; any modification, equivalent substitution or improvement made on the basis of the technical solutions of the embodiments disclosed in the present application shall fall within the scope of the embodiments disclosed in the present application.

Claims (10)

1. A speech keyword recognition method, characterized in that the method comprises:
acquiring audio of a target keyword, extracting a first acoustic feature and a second acoustic feature from the audio of the target keyword, and splicing the first acoustic feature and the second acoustic feature into a first acoustic feature sequence; the second acoustic feature carries context semantic association information;
sequentially inputting the first acoustic feature sequence into a first model and a second model which are trained in advance, and outputting a first embedding vector with a specified dimension through the first model and the second model; the first model is a long short-term memory (LSTM) network model; the second model comprises at least an attention network model for performing semantic feature aggregation;
extracting a third acoustic feature and a fourth acoustic feature from the audio of the target voice to be recognized, and splicing the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence;
sequentially inputting the second acoustic feature sequence into a first model and a second model which are trained in advance, and outputting a second embedded vector with a specified dimension through the first model and the second model;
calculating the similarity between the first embedded vector and the second embedded vector, and determining whether the target keyword is contained in the target voice or not based on the similarity.
2. The method according to claim 1, wherein the first acoustic feature comprises any one of a logarithmic mel-frequency cepstrum feature, an acoustic posterior probability feature and a neural network bottleneck feature extracted from the audio of the target keyword;
the third acoustic feature comprises any one of a logarithmic Mel cepstrum feature, an acoustic posterior probability feature and a neural network bottleneck feature extracted from the audio of the target voice to be recognized.
3. The method of claim 2, wherein the first acoustic feature is a logarithmic mel-frequency cepstrum feature, and wherein extracting the first acoustic feature from the audio of the target keyword comprises:
inputting the audio signal of the target keyword into a high-pass filter;
framing the audio signal output by the high-pass filter according to a preset frame length and a frame shift;
windowing each frame respectively, wherein the window function is a Hamming window;
respectively carrying out fast discrete Fourier transform on each frame to obtain a frequency spectrum corresponding to each frame, and calculating power spectrum energy corresponding to each frequency point;
respectively inputting the power spectrum energy of each frame into a Mel filter, and taking the logarithm to obtain a logarithmic Mel spectrum;
performing a discrete cosine transform on the logarithmic Mel spectrum to obtain a cepstrum, and selecting the first n cepstral coefficients and the energy value as an (n+1)-dimensional feature;
and performing first-order and second-order difference operations on the (n+1)-dimensional feature to obtain a logarithmic Mel cepstrum feature with a dimension of 3(n+1).
4. The method of claim 1, wherein extracting a second acoustic feature from the audio of the target keyword comprises:
inputting the audio signal of the target keyword into a pre-trained convolutional neural network model, and outputting a second acoustic feature; the convolutional neural network model comprises a first convolutional neural network and a second convolutional neural network, wherein the first convolutional neural network is used for coding the audio signal, and the second convolutional neural network is used for extracting context-related features in the audio signal.
5. The method of claim 4, wherein the first convolutional neural network comprises seven convolutional layers, the number of channels is 512, the convolution kernel sizes are 10, 8, 4, 4, 4, 1 and 1 respectively, and the convolution strides are 5, 4, 2, 2, 2, 1 and 1 respectively; and/or,
the second convolutional neural network comprises twelve convolutional layers, the number of channels is 512, the convolution kernel sizes are 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12 respectively, and the convolution strides are all 1.
6. The method according to claim 1, wherein the first model is a bidirectional long short-term memory (BLSTM) network model comprising three bidirectional LSTM layers, each layer comprising a plurality of hidden units;
the bidirectional long-time and short-time memory network model is used for carrying out forward processing and backward processing on each frame in the audio signal, and splicing the forward output and the backward output to be used as a first embedded vector or a second embedded vector corresponding to the frame.
7. The method of claim 1, wherein the attention network model comprises two fully-connected layers, a first layer of the two fully-connected layers containing 1024 neurons and a second layer containing 1 neuron, the activation function between the two fully-connected layers being a normalized exponential function.
8. A speech keyword recognition apparatus, comprising:
the first extraction unit is configured to acquire the audio of a target keyword, extract a first acoustic feature and a second acoustic feature from the audio of the target keyword, and splice the first acoustic feature and the second acoustic feature into a first acoustic feature sequence;
the first embedding unit is configured to sequentially input the first acoustic feature sequence into a first model and a second model which are trained in advance, and output a first embedding vector with a specified dimension through the first model and the second model; the first model is a long short-term memory (LSTM) network model; the second model comprises at least an attention network model;
the second extraction unit is configured to extract a third acoustic feature and a fourth acoustic feature from the audio of the target voice to be recognized, and splice the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence;
a second embedding unit configured to input the second acoustic feature sequence into a first model and a second model trained in advance in sequence, and output a second embedding vector of a specified dimension through the first model and the second model;
and the recognition unit is configured to calculate the similarity between the first embedded vector and the second embedded vector and determine whether the target keyword is contained in the target voice or not based on the similarity.
9. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that, when executed by the processor, implements the method of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, which, when the computer program is executed in a computer, causes the computer to carry out the method according to any one of claims 1-7.
CN202010688457.1A 2020-07-16 2020-07-16 Voice keyword recognition method and device Active CN111798840B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010688457.1A CN111798840B (en) 2020-07-16 2020-07-16 Voice keyword recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010688457.1A CN111798840B (en) 2020-07-16 2020-07-16 Voice keyword recognition method and device

Publications (2)

Publication Number Publication Date
CN111798840A true CN111798840A (en) 2020-10-20
CN111798840B CN111798840B (en) 2023-08-08

Family

ID=72807488

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010688457.1A Active CN111798840B (en) 2020-07-16 2020-07-16 Voice keyword recognition method and device

Country Status (1)

Country Link
CN (1) CN111798840B (en)



Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001290496A (en) * 2000-04-07 2001-10-19 Ricoh Co Ltd Speech retrieval device, speech retrieval method and recording medium
US20100094626A1 (en) * 2006-09-27 2010-04-15 Fengqin Li Method and apparatus for locating speech keyword and speech recognition system
US20090234854A1 (en) * 2008-03-11 2009-09-17 Hitachi, Ltd. Search system and search method for speech database
US20100179811A1 (en) * 2009-01-13 2010-07-15 Crim Identifying keyword occurrences in audio data
US20150279351A1 (en) * 2012-12-19 2015-10-01 Google Inc. Keyword detection based on acoustic alignment
CN103559881A (en) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Language-irrelevant key word recognition method and system
US20150302847A1 (en) * 2014-04-17 2015-10-22 Qualcomm Incorporated Keyword model generation for detecting user-defined keyword
WO2018019116A1 (en) * 2016-07-28 2018-02-01 上海未来伙伴机器人有限公司 Natural language-based man-machine interaction method and system
WO2018107810A1 (en) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus, and electronic device and medium
CN110349572A (en) * 2017-05-27 2019-10-18 腾讯科技(深圳)有限公司 A kind of voice keyword recognition method, device, terminal and server
CN110444193A (en) * 2018-01-31 2019-11-12 腾讯科技(深圳)有限公司 The recognition methods of voice keyword and device
CN110610707A (en) * 2019-09-20 2019-12-24 科大讯飞股份有限公司 Voice keyword recognition method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
赵晓群; 张扬: "A survey of acoustic model construction for speech keyword recognition systems" (语音关键词识别系统声学模型构建综述), Journal of Yanshan University (燕山大学学报), no. 06 *
郑铁然; 张战; 韩纪庆: "A keyword spotting method based on restricted model size and acoustic confidence" (基于限制模型规模和声学置信度的关键词检出方法), Computer Science (计算机科学), no. 01 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634870A (en) * 2020-12-11 2021-04-09 平安科技(深圳)有限公司 Keyword detection method, device, equipment and storage medium
WO2022121188A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Keyword detection method and apparatus, device and storage medium
CN112634870B (en) * 2020-12-11 2023-05-30 平安科技(深圳)有限公司 Keyword detection method, device, equipment and storage medium
CN112530410A (en) * 2020-12-24 2021-03-19 北京地平线机器人技术研发有限公司 Command word recognition method and device
CN112685594A (en) * 2020-12-24 2021-04-20 中国人民解放军战略支援部队信息工程大学 Attention-based weak supervision voice retrieval method and system
CN112685594B (en) * 2020-12-24 2022-10-04 中国人民解放军战略支援部队信息工程大学 Attention-based weak supervision voice retrieval method and system
CN113470693A (en) * 2021-07-07 2021-10-01 杭州网易云音乐科技有限公司 Method and device for detecting singing, electronic equipment and computer readable storage medium
CN113671031A (en) * 2021-08-20 2021-11-19 北京房江湖科技有限公司 Wall hollowing detection method and device
CN114817456A (en) * 2022-03-10 2022-07-29 马上消费金融股份有限公司 Keyword detection method and device, computer equipment and storage medium
CN114817456B (en) * 2022-03-10 2023-09-05 马上消费金融股份有限公司 Keyword detection method, keyword detection device, computer equipment and storage medium
CN116453514A (en) * 2023-06-08 2023-07-18 四川大学 Multi-view-based voice keyword detection and positioning method and device
CN116453514B (en) * 2023-06-08 2023-08-25 四川大学 Multi-view-based voice keyword detection and positioning method and device

Also Published As

Publication number Publication date
CN111798840B (en) 2023-08-08

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN111798840B (en) Voice keyword recognition method and device
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
Collobert et al. Wav2letter: an end-to-end convnet-based speech recognition system
Ferrer et al. Study of senone-based deep neural network approaches for spoken language recognition
KR100755677B1 (en) Apparatus and method for dialogue speech recognition using topic detection
US8321218B2 (en) Searching in audio speech
US6845357B2 (en) Pattern recognition using an observable operator model
EP4018437B1 (en) Optimizing a keyword spotting system
CN107093422B (en) Voice recognition method and voice recognition system
CN112397056B (en) Voice evaluation method and computer storage medium
CN112331229B (en) Voice detection method, device, medium and computing equipment
Nasereddin et al. Classification techniques for automatic speech recognition (ASR) algorithms used with real time speech translation
CN115019776A (en) Voice recognition model, training method thereof, voice recognition method and device
Das et al. Best of both worlds: Robust accented speech recognition with adversarial transfer learning
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN114783418A (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
Radha et al. Speech and speaker recognition using raw waveform modeling for adult and children’s speech: a comprehensive review
Biswas et al. Speech Recognition using Weighted Finite-State Transducers
CN114333762B (en) Expressive force-based speech synthesis method, expressive force-based speech synthesis system, electronic device and storage medium
KR100480790B1 (en) Method and apparatus for continous speech recognition using bi-directional n-gram language model
Saha Development of a bangla speech to text conversion system using deep learning
CN115132170A (en) Language classification method and device and computer readable storage medium
Tabibian A survey on structured discriminative spoken keyword spotting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant