CN111798840B - Voice keyword recognition method and device - Google Patents

Voice keyword recognition method and device

Info

Publication number
CN111798840B
CN111798840B CN202010688457.1A
Authority
CN
China
Prior art keywords
model
acoustic feature
acoustic
feature
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010688457.1A
Other languages
Chinese (zh)
Other versions
CN111798840A (en)
Inventor
赵江江
李昭奇
任玉玲
李青龙
黎塔
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
China Mobile Online Services Co Ltd
Original Assignee
Institute of Acoustics CAS
China Mobile Online Services Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, China Mobile Online Services Co Ltd filed Critical Institute of Acoustics CAS
Priority to CN202010688457.1A priority Critical patent/CN111798840B/en
Publication of CN111798840A publication Critical patent/CN111798840A/en
Application granted granted Critical
Publication of CN111798840B publication Critical patent/CN111798840B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G10L15/08 Speech classification or search
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/142 Hidden Markov Models [HMMs] (speech classification or search using statistical models)
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the extracted parameters being the cepstrum
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks
    • G10L25/45 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of analysis window
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L2015/088 Word spotting
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a voice keyword recognition method and device. A first acoustic feature and a second acoustic feature are extracted from the audio of a target keyword and spliced into a first acoustic feature sequence; a third acoustic feature and a fourth acoustic feature are extracted from the audio of the target voice to be recognized and spliced into a second acoustic feature sequence; the first and second acoustic feature sequences are respectively input into a pre-trained first model and second model, which output a first embedded vector and a second embedded vector; the similarity between the first embedded vector and the second embedded vector is calculated, and whether the target voice contains the target keyword is determined based on the similarity. The method enables the output embedded vectors to contain more context information and improves the effectiveness of sample-based keyword recognition.

Description

Voice keyword recognition method and device
Technical Field
The embodiment of the application relates to the technical field of audio signal processing, in particular to a voice keyword recognition method and device.
Background
Keyword detection (spoken keyword spotting or spoken term detection) is a sub-field of speech recognition whose purpose is to detect every position at which a specified word occurs in a speech signal; it is one of the important research topics in human-computer interaction. Traditional keyword recognition technology requires building a speech recognition system, which generally comprises an acoustic model, a pronunciation dictionary and a language model; a complex decoding network must be constructed by means of a weighted finite-state transducer to convert the acoustic feature sequence into a text sequence, which is then searched. The computational complexity is therefore high and considerable resources are occupied.
A sample-based keyword recognition scheme avoids building a full recognition system and compares keywords only through acoustic similarity; it performs well in low-resource scenarios where an effective speech recognition system cannot be built. However, the features extracted by such schemes contain little context information and cannot fully characterize the semantic association between a keyword and its context within a sentence, which limits the performance of sample-based keyword recognition, so the recognition effectiveness still needs to be improved.
Disclosure of Invention
To solve the above problems, the application describes a voice keyword recognition method and device.
In a first aspect, an embodiment of the present application provides a method for identifying a voice keyword, where the method includes:
acquiring the audio of a target keyword, extracting a first acoustic feature and a second acoustic feature from the audio of the target keyword, and splicing the first acoustic feature and the second acoustic feature into a first acoustic feature sequence; the second acoustic features carry context semantic association information; sequentially inputting a first acoustic feature sequence into a first model and a second model which are trained in advance, and outputting a first embedded vector with a specified dimension through the first model and the second model; the first model is a long-short-time memory network LSTM model; the second model comprises at least an attention network model, wherein the attention network model is used for executing semantic feature aggregation; extracting a third acoustic feature and a fourth acoustic feature from the audio of the target voice to be recognized, and splicing the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence; sequentially inputting a second acoustic feature sequence into a first model and a second model which are trained in advance, and outputting a second embedded vector with a specified dimension through the first model and the second model; and calculating the similarity between the first embedded vector and the second embedded vector, and determining whether the target keyword is contained in the target voice or not based on the similarity.
In one embodiment, the first acoustic feature comprises any one of a logarithmic mel-cepstrum feature, an acoustic posterior probability feature, and a neural network bottleneck feature extracted from audio of the target keyword;
the third acoustic feature includes any one of a logarithmic mel-cepstrum feature, an acoustic posterior probability feature, and a neural network bottleneck feature extracted from the audio of the target voice to be recognized.
In one embodiment, the first acoustic feature is a log-mel-cepstral feature, and extracting the first acoustic feature from the audio of the target keyword comprises:
inputting the audio signal of the target keyword into a high-pass filter; framing the audio signal output by the high-pass filter according to a preset frame length and frame shift; windowing each frame, wherein the window function is a Hamming window; performing fast discrete Fourier transform on each frame to obtain the frequency spectrum corresponding to each frame, and calculating the power spectrum energy at each frequency point; inputting the power spectrum energy of each frame into a mel filter bank and taking the logarithm to obtain a logarithmic mel spectrum; performing discrete cosine transform on the logarithmic mel spectrum to obtain a cepstrum, and selecting the first n orders together with the energy value to form (n+1)-dimensional features; and performing first-order and second-order difference operations on the (n+1)-dimensional features to obtain logarithmic mel cepstrum features of dimension 3(n+1).
In one embodiment, extracting a second acoustic feature from audio of a target keyword includes:
inputting an audio signal of a target keyword into a pre-trained convolutional neural network model, and outputting a second acoustic feature; the convolutional neural network model comprises a first convolutional neural network and a second convolutional neural network, wherein the first convolutional neural network is used for carrying out coding processing on the audio signal, and the second convolutional neural network is used for extracting context correlation characteristics in the audio signal.
In one embodiment, the first convolutional neural network comprises seven convolutional layers, the number of channels is 512, the convolution kernel sizes are 10, 8, 4, 4, 4, 1, 1, and the convolution step sizes are 5, 4, 2, 2, 2, 1, 1; and/or,
the second convolutional neural network comprises twelve convolutional layers, the number of channels is 512, the convolution kernel sizes are 1,2,3,4,5,6,7,8,9, 10, 11 and 12, and the convolution step sizes are 1.
In one embodiment, the first model is a bidirectional long and short time memory network BLSTM model, comprising three bidirectional long and short time memory layers, each layer comprising a plurality of hidden units; the bidirectional long-short-time memory network model is used for carrying out forward processing and reverse processing on each frame in the audio signal, and splicing forward output and reverse output as a first embedded vector or a second embedded vector corresponding to the frame.
In one embodiment, the attention network model includes two fully connected layers, and the activation function between the two fully connected layers is a normalized exponential function.
In one embodiment, a first layer of the two fully connected layers contains 1024 neurons and a second layer contains 1 neuron.
In a second aspect, an embodiment of the present application provides a voice keyword recognition apparatus, including:
the first extraction unit is configured to acquire the audio of the target keyword, extract first acoustic features and second acoustic features from the audio of the target keyword, and splice the first acoustic features and the second acoustic features into a first acoustic feature sequence; a first embedding unit configured to sequentially input a first acoustic feature sequence into a first model and a second model trained in advance, and output a first embedding vector of a specified dimension through the first model and the second model; the first model is a long-short-time memory network LSTM model; the second model includes at least an attention network model; a second extraction unit configured to extract a third acoustic feature and a fourth acoustic feature from audio of a target voice to be recognized, and splice the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence; a second embedding unit configured to sequentially input a second acoustic feature sequence into the first model and the second model trained in advance, and output a second embedded vector of a specified dimension through the first model and the second model; and the recognition unit is configured to calculate the similarity between the first embedded vector and the second embedded vector, and determine whether the target keyword is contained in the target voice or not based on the similarity.
In a third aspect, embodiments of the present application also provide a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the methods of the first to second aspects.
In a fourth aspect, embodiments of the present application further provide a computing device, including a memory and a processor, where the memory stores executable code, and the processor implements the methods of the first to second aspects when executing the executable code.
By adopting the voice keyword recognition method and device provided by the embodiments of the application, both the first acoustic feature sequence extracted from the audio of the target keyword and the second acoustic feature sequence extracted from the audio of the target voice to be recognized contain context semantic association information. The first and second acoustic feature sequences are respectively input into a first model and a second model trained in advance: the first model is a long-short-time memory network LSTM model, which can extract the context-related features in the feature sequences, and the second model is provided with an attention network model that further performs semantic feature aggregation, so that the output embedded vectors are rich in context semantic information. Similarity calculation is performed based on these embedded vectors, and the recognition accuracy is therefore higher.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments disclosed herein, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only embodiments disclosed herein, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a system architecture diagram of a speech keyword recognition method according to an embodiment of the present application;
FIG. 2 is a flowchart of one embodiment of a method for recognizing a voice keyword according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an embodiment of a voice keyword recognition apparatus provided in an embodiment of the present application.
Detailed Description
Various embodiments disclosed herein are described below with reference to the accompanying drawings.
The voice keyword recognition method and device provided by the embodiment of the application are suitable for various voice recognition scenes.
Referring to fig. 1, fig. 1 is a system architecture diagram implemented by the voice keyword recognition method according to an embodiment of the present application. First, a first acoustic feature and a second acoustic feature are extracted from the audio of the target keyword; for example, the first acoustic feature may be a logarithmic mel cepstrum feature, and the second acoustic feature should carry context semantic association information. The first acoustic feature and the second acoustic feature are then spliced into a first acoustic feature sequence. A third acoustic feature and a fourth acoustic feature are extracted from the audio of the target voice to be recognized. The third acoustic feature is an acoustic feature of the same type as the first acoustic feature, for example a logarithmic mel cepstrum feature; the difference is that the first acoustic feature is extracted from the keyword audio while the third acoustic feature is extracted from the target voice audio. Correspondingly, the fourth acoustic feature is of the same type as the second acoustic feature, except that the second acoustic feature is extracted from the keyword audio and the fourth acoustic feature is extracted from the target voice audio. The third acoustic feature and the fourth acoustic feature are then spliced into a second acoustic feature sequence.
And then sequentially inputting the first acoustic feature sequence and the second acoustic feature sequence into a first model and a second model which are trained in advance, and outputting a first embedded vector or a second embedded vector with specified dimensionality through the first model and the second model. And calculating the similarity between the first embedded vector and the second embedded vector, determining whether the target speech contains the target keyword based on the similarity, and judging that the target speech contains the target keyword if the similarity is higher than a preset threshold.
Specifically, referring to fig. 2, the voice keyword recognition method provided in the embodiment of the present application includes the following steps:
s201, acquiring the audio of the target keyword, extracting a first acoustic feature and a second acoustic feature from the audio of the target keyword, and splicing the first acoustic feature and the second acoustic feature into a first acoustic feature sequence.
S202, sequentially inputting the first acoustic feature sequence into a first model and a second model which are trained in advance, and outputting a first embedded vector with a specified dimension through the first model and the second model.
And S203, extracting a third acoustic feature and a fourth acoustic feature from the audio of the target voice to be recognized, and splicing the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence.
S204, sequentially inputting the second acoustic feature sequence into a first model and a second model which are trained in advance, and outputting a second embedded vector with a specified dimension through the first model and the second model.
S205, calculating the similarity between the first embedded vector and the second embedded vector, and determining whether the target keyword is contained in the target voice based on the similarity.
The audio of the target keyword is sample keyword audio, the audio of the target voice is a complete audio to be detected, and the audio of the target keyword is equivalent to a segment in the target voice audio.
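As an illustrative aid (not part of the original patent text), steps S201-S204 can be summarized by the following Python sketch. The feature extractors and the two models are passed in as callables, and all names here are hypothetical placeholders for the components described in the remainder of this section; the similarity comparison of S205 is discussed further below.

```python
import numpy as np

def embed(audio, extract_first_feature, extract_second_feature, first_model, second_model):
    """S201/S202 (keyword audio) or S203/S204 (target-voice audio): features -> embedding."""
    f1 = extract_first_feature(audio)        # e.g. frame-level logarithmic mel cepstrum features
    f2 = extract_second_feature(audio)       # frame-level features carrying context information
    seq = np.concatenate([f1, f2], axis=-1)  # splice the two features frame by frame
    hidden = first_model(seq)                # first model: (B)LSTM over the frame sequence
    return second_model(hidden)              # second model: attention pooling -> fixed-dim vector
```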
Alternatively, in one embodiment, the first acoustic feature extracted in S201 is a logarithmic mel cepstrum feature, which is a frame-level acoustic feature. Specifically, the extraction may be performed as follows:
Firstly, pre-emphasis is performed on the audio to boost the high-frequency part: the audio signal of the target keyword is passed through a high-pass filter that attenuates the low-frequency components. The high-pass filter adopted is:
H(z) = 1 - μz⁻¹
where μ takes a value between 0.9 and 1.0, for example 0.97, and z⁻¹ corresponds to a one-sample delay of the audio signal.
The audio is then divided into frames: N sampling points of the audio signal are grouped into one observation unit, defined as a frame. N is typically 256 or 512, corresponding to a frame length of about 20-30 ms. To avoid excessive change between two adjacent frames, an overlap region is set between them; the overlap region contains M sampling points, where M is about 1/2 or 1/3 of N. As one implementation, in the embodiment of the present application the frame length is set to 25 ms and the frame shift to 10 ms.
Each frame is then windowed by multiplying it by a Hamming window function, which increases the continuity between the left and right ends of the frame. After multiplication by the Hamming window, each frame is subjected to a fast Fourier transform to obtain its energy distribution over the frequency spectrum; different energy distributions in the frequency domain can represent the characteristics of different speech sounds. A fast discrete Fourier transform is performed on each frame to obtain the spectrum of the frame, and the power spectrum energy at each frequency point is calculated by taking the squared modulus of the spectrum.
The power spectrum energy of each frame of the audio signal is then passed through a mel filter bank, and the logarithm of the filter-bank energies is taken to obtain a logarithmic mel spectrum. A discrete cosine transform is performed on the logarithmic mel spectrum to obtain a cepstrum, and the first n orders together with the energy value are selected to form an (n+1)-dimensional feature vector, where n is a positive integer. For example, n may take the value 12, i.e., the first 12 orders and the energy value together form a 13-dimensional feature vector.
Next, first-order and second-order differences are calculated on the obtained (n+1)-dimensional feature vector, yielding an (n+1)-dimensional feature after the first-order difference and an (n+1)-dimensional feature after the second-order difference, for a total of 3(n+1) dimensions. For example, computing first-order and second-order differences of the 13-dimensional feature vector yields 39-dimensional logarithmic mel cepstrum features in total.
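As an illustrative aid (not part of the original patent text), the 39-dimensional logarithmic mel cepstrum feature described above can be computed roughly as follows. This is a minimal sketch using librosa; the 0th cepstral coefficient stands in for the frame energy value, and parameters such as the number of mel filters are assumptions.

```python
import librosa
import numpy as np

def log_mel_cepstrum(path, sr=16000, n_ceps=13):
    """Pre-emphasis, 25 ms frames / 10 ms shift, Hamming window, mel filter bank, DCT, deltas."""
    y, sr = librosa.load(path, sr=sr)
    y = librosa.effects.preemphasis(y, coef=0.97)           # high-pass pre-emphasis filter
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_ceps,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),  # 25 ms frame length, 10 ms frame shift
        window="hamming", n_mels=40,
    )
    d1 = librosa.feature.delta(mfcc, order=1)               # first-order difference
    d2 = librosa.feature.delta(mfcc, order=2)               # second-order difference
    return np.concatenate([mfcc, d1, d2], axis=0).T         # shape (frames, 3*(n+1)) = (frames, 39)
```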
In addition to the log mel-cepstrum feature, the first acoustic feature may be an acoustic posterior probability feature or a neural network bottleneck feature.
The acoustic posterior probability feature is the probability distribution, computed under known prior knowledge and observation conditions, over the instances of each unit for a frame of speech. A unit is a modeling unit, the smallest structure represented in the constructed system; for example, if the modeling unit is the character, then phonemes and phonetic symbols below the character level are no longer considered, and only how characters form words and sentences needs to be considered. The known prior knowledge may be the pronunciation dictionary of a language, text, or audio annotated with the text corresponding to each time point, and is used to train a Gaussian mixture model or a neural network model.
The observation conditions may be the waveform and spectrum of the current frame, of historical frames, and of future frames of the audio signal.
In one embodiment, the known prior knowledge of each unit in the speech frame together with the waveform and spectrum data corresponding to the speech frame are taken as input, the hidden state of each unit is taken as output, and a Gaussian mixture model or another neural network model is trained to obtain optimized parameters; the hidden state of each unit is then output by the trained Gaussian mixture model or neural network model. The units are taken as nodes in a hidden Markov model: the prior knowledge is used as the observation value of each observable node, the hidden states output by the Gaussian mixture model or other neural network model are used as the hidden-state values of the hidden nodes corresponding to the observable nodes, and the hidden Markov model outputs the occurrence probability and transition probability of the unit represented by each node as the acoustic posterior probability feature of the corresponding speech frame. Units may include monophones, polyphones, initials, letters, words, and the like.
In another embodiment, the neural network model directly outputs the acoustic posterior probability features corresponding to the speech frames without going through the hidden Markov model. For example, the unit is an english letter, the neural network model directly outputs a 26-dimensional vector representing the probability that the frame is each letter, and the 26-dimensional vector that is output is the acoustic posterior probability feature.
In another embodiment, the acoustic posterior probability features are obtained by inputting the logarithmic mel cepstrum features extracted from each frame into a neural network acoustic model trained with prior knowledge, and computing the probability distribution over all monophones for that frame; this probability distribution is the acoustic posterior probability feature.
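As an illustrative aid (not part of the original patent text), the simplest posterior-feature route just described can be sketched as a per-frame classifier whose softmax output over monophone units is taken as the acoustic posterior probability feature; the hidden size and the number of units below are assumptions.

```python
import torch
import torch.nn as nn

class PosteriorFeature(nn.Module):
    """Frame-level classifier over monophone units; its softmax output is the posterior feature."""
    def __init__(self, input_dim=39, n_units=40):            # e.g. 39-dim cepstral input, 40 monophones
        super().__init__()
        self.net = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_units))

    def forward(self, frames):                                # frames: (batch, n_frames, input_dim)
        return torch.softmax(self.net(frames), dim=-1)        # per-frame posterior distribution
```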
The neural network bottleneck feature takes the output of an intermediate layer of a neural network as the feature. For example, a neural network with a two-part structure is trained: the first part is a three-layer BLSTM (bidirectional LSTM) serving as an encoding network, and the second part is a decoding network composed of four LSTM layers, which respectively output word-level probabilities for Mandarin/English/Spanish/Bos. Training is performed sequentially with the four languages, each language training the first part and only the LSTM layer of the decoding network corresponding to that language in the second part. After training is completed, the output of the first part of the network is taken as the feature of each frame, which is defined as the neural network bottleneck feature in the embodiments of the present application.
In S201, after the first acoustic feature is obtained, a second acoustic feature needs to be extracted. The second acoustic feature is also a frame-level acoustic feature, i.e., extracted in frames. The audio of the target keyword is input into a pre-trained neural network model, the output of which serves as a second acoustic feature.
Specifically, the pre-trained neural network model includes an encoding network and a context network. For example, the coding network and the context network are two one-dimensional convolutional neural networks. For convenience of description, the encoding network is defined as a first convolutional neural network and the context network is defined as a second convolutional neural network. The first convolutional neural network is used for encoding the audio signal, and the second convolutional neural network is used for extracting context-related features from the audio signal.
In one embodiment, the first convolutional neural network comprises seven convolutional layers, the number of channels is 512, the convolution kernel sizes for performing the convolution operations are (10,8,4,4,4,1,1) respectively, and the step sizes of the convolution operations are (5,4,2,2,2,1,1) respectively; the second convolutional neural network comprises twelve convolutional layers, the number of channels is 512, the convolution kernel sizes are (1, 2,3,4,5,6,7,8,9, 10, 11, 12), and the convolution step sizes are all set to be 1, that is, the step size of convolution operation of each convolutional layer is (1,1,1,1,1,1,1,1,1,1,1,1).
The second acoustic features extracted by the convolutional neural network model carry context semantic association information.
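As an illustrative aid (not part of the original patent text), the two convolutional stacks described above can be sketched in PyTorch as follows; the padding, activation functions and normalization are assumptions, since the text only specifies channel counts, kernel sizes and strides.

```python
import torch
import torch.nn as nn

def conv_stack(in_ch, kernels, strides):
    """Build a 1-D convolutional stack with 512 channels per layer."""
    layers, ch = [], in_ch
    for k, s in zip(kernels, strides):
        layers += [nn.Conv1d(ch, 512, kernel_size=k, stride=s, padding=k // 2), nn.ReLU()]
        ch = 512
    return nn.Sequential(*layers)

class PretrainedFeatureExtractor(nn.Module):
    """Encoding network (7 layers) followed by context network (12 layers)."""
    def __init__(self):
        super().__init__()
        self.encoder = conv_stack(1, [10, 8, 4, 4, 4, 1, 1], [5, 4, 2, 2, 2, 1, 1])
        self.context = conv_stack(512, list(range(1, 13)), [1] * 12)

    def forward(self, wav):                  # wav: (batch, samples)
        z = self.encoder(wav.unsqueeze(1))   # (batch, 512, frames): encoded audio
        c = self.context(z)                  # context-related features
        return c.transpose(1, 2)             # (batch, frames, 512) = second acoustic feature
```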
Then, the first acoustic feature and the second acoustic feature are spliced into the first acoustic feature sequence: for each frame, the two feature vectors are concatenated into a single feature vector of higher dimension, and the per-frame feature vectors are arranged in order to form the first acoustic feature sequence.
For extracting the third acoustic feature and the fourth acoustic feature from the audio of the target voice to be recognized and splicing them into the second acoustic feature sequence in S203, reference may be made to the detailed description of S201; the difference is that the processing object in S201 is the audio of the target keyword, while the processing object in S203 is the audio of the target voice.
In the embodiment of the present application, no execution order is required between S202 and S204, or between S201 and S203; the step numbering is only for convenience of description and is not to be construed as limiting the execution order.
In S202 and S204, the first acoustic feature sequence and the second acoustic feature sequence obtained by splicing are each input sequentially into a first model and a second model trained in advance. In some embodiments, the first model and the second model are both pre-built and trained, and the first model may be a Long Short-Term Memory network (LSTM) model. For example, in one particular embodiment, the first model comprises a three-layer bidirectional long-short-time memory network, each layer comprising 256 neurons; for each frame, the forward and backward outputs are concatenated as the embedded vector of that frame, giving a fixed dimension of 512.
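As an illustrative aid (not part of the original patent text), this first model can be sketched in PyTorch as follows; the input feature dimension is an assumption (e.g. 39-dimensional cepstral features concatenated with 512-dimensional pretrained features).

```python
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Three-layer bidirectional LSTM; forward and backward outputs are concatenated per frame."""
    def __init__(self, input_dim=551):                       # 39 + 512, an assumed input size
        super().__init__()
        self.blstm = nn.LSTM(input_dim, 256, num_layers=3,
                             bidirectional=True, batch_first=True)

    def forward(self, feats):                                # feats: (batch, frames, input_dim)
        out, _ = self.blstm(feats)                           # (batch, frames, 512): fwd || bwd
        return out
```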
The second model includes at least an attention network model, which is used for performing semantic feature aggregation. In some embodiments, the second model includes an attention network, an average structure, and a splicing structure.
In one embodiment, the attention network comprises two fully connected layers, the activation function between the fully connected layers being a normalized exponential function. For example, the first fully-connected layer contains 1024 neurons, and the second layer (corresponding to the output layer) contains 1 neuron.
The overall calculation process of the first model and the second model is as follows:
Y = FC(Z)
where FC(·) denotes the fully connected layers, Z is the sequence of feature vectors corresponding to the audio signal that is input to the fully connected layers, z_t is the t-th element of Z, and y_t is the t-th element of the output sequence Y. The splicing structure concatenates the feature vectors output by the attention network and by the average structure into the final output embedded vector E. The first of the two spliced parts is the output of the attention network, which performs a weighted average of the z_t according to the weights output by SoftMax:
e_att = Σ_{t=1..T} SoftMax(Y)_t · z_t
The second part is the average structure, which averages the x_t:
e_avg = (1/T) Σ_{t=1..T} x_t
The two output vectors are spliced together as the final output vector:
E = [e_att ; e_avg]
SoftMax(·) is the normalized exponential function, with the expression:
SoftMax(Y)_t = exp(y_t) / Σ_{j=1..T} exp(y_j)
Here x_t and y_t respectively denote the output feature vector of the three-layer bidirectional long-short-time memory network and the output of the fully connected layers for the t-th frame; T denotes the number of input and output feature vectors; and exp denotes the exponential function with base e (the natural constant).
In S202 and S204, the process of outputting the first and second embedded vectors of specified dimension through the first model and the second model follows the computation above: when the input to the first model is the first acoustic feature sequence corresponding to the target keyword audio, the output embedded vector E is the first embedded vector; when the input is the second acoustic feature sequence corresponding to the target voice audio, the output embedded vector E is the second embedded vector. The dimensions of the first and second embedded vectors are both predetermined values, for example 512 in one embodiment. The dimensions of the input features and of the output embedded vectors are determined at training time; if they are modified, the network needs to be retrained.
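As an illustrative aid (not part of the original patent text), the second model as reconstructed from the formulas above can be sketched in PyTorch as follows. The placement of the normalized exponential function is interpreted here as normalizing the per-frame attention scores over the frame axis, and the attention branch is applied directly to the BLSTM outputs (i.e. Z = X), which is one plausible reading; note that concatenating the two pooled vectors yields twice the frame dimension, so reaching the fixed embedding size mentioned earlier may require a different choice of Z or an extra projection that the text does not fully specify.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Attention branch + average branch + splicing, as sketched from the description."""
    def __init__(self, frame_dim=512):
        super().__init__()
        self.fc1 = nn.Linear(frame_dim, 1024)        # first fully connected layer (1024 neurons)
        self.fc2 = nn.Linear(1024, 1)                # second fully connected layer (1 neuron)

    def forward(self, x):                            # x: (batch, frames, frame_dim), BLSTM outputs
        y = self.fc2(self.fc1(x))                    # Y = FC(Z): one score per frame
        alpha = torch.softmax(y, dim=1)              # normalized exponential over the T frames
        e_att = (alpha * x).sum(dim=1)               # weighted average of the frame vectors
        e_avg = x.mean(dim=1)                        # average structure
        return torch.cat([e_att, e_avg], dim=-1)     # spliced embedded vector E
```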
Next, at S205, the similarity between the first embedded vector and the second embedded vector is calculated. There are various ways to calculate the similarity, such as manhattan distance (Manhattan Distance), euclidean distance (Euclidean Distance), pearson correlation coefficient (Pearson Correlation), cosine similarity (Cosine Similarity), and so on.
In one embodiment, the similarity is calculated using the cosine distance, and whether the keyword represented by the keyword audio is present in the target audio is determined based on the cosine distance. When the cosine distance between the embedded vectors of the keyword and of the target voice is smaller than a preset threshold, or, equivalently, when the similarity is higher than a preset threshold, the keyword is judged to be hit. The cosine distance is computed as:
d(A, B) = 1 - (A · B) / (‖A‖ ‖B‖)
wherein A and B are the two embedded vectors to be compared, and d(·) is the calculated cosine distance.
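As an illustrative aid (not part of the original patent text), the cosine-distance decision of S205 can be sketched as follows; the threshold value is an assumption.

```python
import numpy as np

def cosine_distance(a, b):
    """d(A, B) = 1 - (A·B) / (||A|| ||B||)."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def keyword_hit(keyword_vec, target_vec, threshold=0.3):
    """Judge a hit when the cosine distance between the embedded vectors is below the threshold."""
    return cosine_distance(keyword_vec, target_vec) < threshold
```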
In summary, in the recognition method provided by the embodiments of the present application, during keyword query the target keyword audio to be queried is input; the logarithmic mel cepstrum feature (i.e., the first acoustic feature) and the pre-training feature (i.e., the second acoustic feature) are extracted in step S201 and spliced together as a frame-level acoustic feature sequence; the acoustic feature sequence is sequentially input into the first model and the second model, and an embedded vector of a preset fixed length is output. The input target keyword audio requires no language or text information, so no model for extracting text information needs to be built, which reduces the resource space occupied by such a model during data computation; and the embedding network structure is improved so that the fixed-length embedded vector can contain more context information.
In a second aspect, referring to fig. 3, the embodiment of the present application further provides a voice keyword recognition apparatus 310, where the apparatus includes:
the first extraction unit 3101 is configured to obtain the audio of the target keyword, extract the first acoustic feature and the second acoustic feature from the audio of the target keyword, and splice the first acoustic feature and the second acoustic feature into a first acoustic feature sequence.
A first embedding unit 3102 configured to sequentially input a first acoustic feature sequence into a first model and a second model trained in advance, and output a first embedding vector of a specified dimension through the first model and the second model; the first model is a long-short-time memory network LSTM model; the second model includes at least an attention network model.
The second extraction unit 3103 is configured to extract a third acoustic feature and a fourth acoustic feature from the audio of the target voice to be recognized, and splice the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence.
The second embedding unit 3104 is configured to sequentially input the second acoustic feature sequence into a first model and a second model trained in advance, and output a second embedded vector of a specified dimension through the first model and the second model.
The recognition unit 3105 is configured to calculate a similarity between the first embedded vector and the second embedded vector, and determine whether the target keyword is included in the target speech based on the similarity.
In a third aspect, embodiments of the present application further provide a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in any of the embodiments above.
In a fourth aspect, embodiments of the present application further provide a computing device, including a memory and a processor, where the memory stores executable code, and the processor implements the method described in any of the foregoing embodiments when executing the executable code.
Those of skill in the art will appreciate that in one or more of the above examples, the functions described in the various embodiments disclosed herein may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
While the foregoing detailed description has set forth the objects, aspects and advantages of the various embodiments disclosed herein in further detail, it should be understood that the foregoing description is only illustrative of the various embodiments disclosed herein and is not intended to limit the scope of the various embodiments disclosed herein, and that any modifications, equivalents, improvements or the like that are based on the technical aspects of the various embodiments disclosed herein are intended to be included within the scope of the various embodiments disclosed herein.

Claims (10)

1. A method for recognizing a voice keyword, the method comprising:
acquiring the audio of a target keyword, extracting a first acoustic feature and a second acoustic feature from the audio of the target keyword, and splicing the first acoustic feature and the second acoustic feature into a first acoustic feature sequence; the second acoustic features carry context semantic association information;
sequentially inputting the first acoustic feature sequence into a first model and a second model which are trained in advance, and outputting a first embedded vector with a specified dimension through the first model and the second model; the first model is a long-short-time memory network LSTM model; the second model comprises at least an attention network model, wherein the attention network model is used for executing semantic feature aggregation;
extracting a third acoustic feature and a fourth acoustic feature from the audio of the target voice to be recognized, and splicing the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence;
sequentially inputting the second acoustic feature sequence into a first model and a second model which are trained in advance, and outputting a second embedded vector with a specified dimension through the first model and the second model;
and calculating the similarity between the first embedded vector and the second embedded vector, and determining whether the target keyword is contained in the target voice or not based on the similarity.
2. The method of claim 1, wherein the first acoustic feature comprises any one of a log mel-cepstral feature, an acoustic posterior probability feature, a neural network bottleneck feature extracted from audio of the target keyword;
the third acoustic feature includes any one of a log mel cepstrum feature, an acoustic posterior probability feature, and a neural network bottleneck feature extracted from the audio of the target voice to be recognized.
3. The method of claim 2, wherein the first acoustic feature is a log mel-cepstral feature, the extracting the first acoustic feature from the audio of the target keyword comprising:
inputting the audio signal of the target keyword into a high-pass filter;
framing the audio signal output by the high-pass filter according to a preset frame length and frame shift;
windowing each frame respectively, wherein a window function is a Hamming window;
performing fast discrete Fourier transform on each frame to obtain a frequency spectrum corresponding to each frame, and calculating power spectrum energy corresponding to each frequency point;
respectively inputting the power spectrum energy of each frame into a Mel filter, and taking the logarithm to obtain a logarithmic Mel spectrum;
performing discrete cosine transform on the logarithmic mel spectrum to obtain a cepstrum, and selecting the first n orders together with the energy value to form (n+1)-dimensional features;
and performing first-order and second-order difference operations on the (n+1)-dimensional features to obtain logarithmic mel cepstrum features of dimension 3(n+1).
4. The method of claim 1, wherein extracting a second acoustic feature from the audio of the target keyword comprises:
inputting the audio signal of the target keyword into a pre-trained convolutional neural network model, and outputting a second acoustic feature; the convolutional neural network model comprises a first convolutional neural network and a second convolutional neural network, wherein the first convolutional neural network is used for carrying out coding processing on the audio signal, and the second convolutional neural network is used for extracting context correlation characteristics in the audio signal.
5. The method of claim 4, wherein the first convolutional neural network comprises seven convolutional layers, the number of channels is 512, the convolution kernel sizes are 10, 8, 4, 4, 4, 1, 1, and the convolution step sizes are 5, 4, 2, 2, 2, 1, 1; and/or,
the second convolutional neural network comprises twelve convolutional layers, the number of channels is 512, the convolution kernel sizes are 1,2,3,4,5,6,7,8,9, 10, 11 and 12, and the convolution step sizes are 1.
6. The method of claim 1, wherein the first model is a bidirectional long and short time memory network BLSTM model comprising three bidirectional long and short time memory layers, each layer comprising a plurality of hidden units;
the bidirectional long-short-time memory network model is used for carrying out forward processing and reverse processing on each frame in the audio signal, and splicing the forward output and the reverse output as a first embedded vector or a second embedded vector corresponding to the frame.
7. The method of claim 1, wherein the attention network model comprises two fully connected layers, a first layer of the two fully connected layers comprising 1024 neurons and a second layer comprising 1 neuron, the activation function between the two fully connected layers being a normalized exponential function.
8. A voice keyword recognition apparatus, comprising:
the first extraction unit is configured to acquire the audio of the target keyword, extract first acoustic features and second acoustic features from the audio of the target keyword, and splice the first acoustic features and the second acoustic features into a first acoustic feature sequence;
a first embedding unit configured to sequentially input the first acoustic feature sequence into a first model and a second model trained in advance, and output a first embedding vector of a specified dimension through the first model and the second model; the first model is a long-short-time memory network LSTM model; the second model includes at least an attention network model;
a second extraction unit configured to extract a third acoustic feature and a fourth acoustic feature from audio of a target voice to be recognized, and splice the third acoustic feature and the fourth acoustic feature into a second acoustic feature sequence;
a second embedding unit configured to sequentially input the second acoustic feature sequence into a first model and a second model trained in advance, and output a second embedding vector of a specified dimension through the first model and the second model;
and the recognition unit is configured to calculate the similarity between the first embedded vector and the second embedded vector, and determine whether the target keyword is contained in the target voice or not based on the similarity.
9. A computing device comprising a memory and a processor, wherein the memory has executable code stored therein, and wherein the processor, when executing the executable code, implements the method of any of claims 1-7.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed in a computer, causes the computer to perform the method according to any of claims 1-7.
CN202010688457.1A 2020-07-16 2020-07-16 Voice keyword recognition method and device Active CN111798840B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010688457.1A CN111798840B (en) 2020-07-16 2020-07-16 Voice keyword recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010688457.1A CN111798840B (en) 2020-07-16 2020-07-16 Voice keyword recognition method and device

Publications (2)

Publication Number Publication Date
CN111798840A CN111798840A (en) 2020-10-20
CN111798840B true CN111798840B (en) 2023-08-08

Family

ID=72807488

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010688457.1A Active CN111798840B (en) 2020-07-16 2020-07-16 Voice keyword recognition method and device

Country Status (1)

Country Link
CN (1) CN111798840B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634870B (en) * 2020-12-11 2023-05-30 平安科技(深圳)有限公司 Keyword detection method, device, equipment and storage medium
CN112685594B (en) * 2020-12-24 2022-10-04 中国人民解放军战略支援部队信息工程大学 Attention-based weak supervision voice retrieval method and system
CN112530410A (en) * 2020-12-24 2021-03-19 北京地平线机器人技术研发有限公司 Command word recognition method and device
CN113470693A (en) * 2021-07-07 2021-10-01 杭州网易云音乐科技有限公司 Method and device for detecting singing, electronic equipment and computer readable storage medium
CN113671031A (en) * 2021-08-20 2021-11-19 北京房江湖科技有限公司 Wall hollowing detection method and device
CN114817456B (en) * 2022-03-10 2023-09-05 马上消费金融股份有限公司 Keyword detection method, keyword detection device, computer equipment and storage medium
CN116453514B (en) * 2023-06-08 2023-08-25 四川大学 Multi-view-based voice keyword detection and positioning method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001290496A (en) * 2000-04-07 2001-10-19 Ricoh Co Ltd Speech retrieval device, speech retrieval method and recording medium
CN103559881A (en) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Language-irrelevant key word recognition method and system
WO2018019116A1 (en) * 2016-07-28 2018-02-01 上海未来伙伴机器人有限公司 Natural language-based man-machine interaction method and system
WO2018107810A1 (en) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus, and electronic device and medium
CN110349572A (en) * 2017-05-27 2019-10-18 腾讯科技(深圳)有限公司 A kind of voice keyword recognition method, device, terminal and server
CN110444193A (en) * 2018-01-31 2019-11-12 腾讯科技(深圳)有限公司 The recognition methods of voice keyword and device
CN110610707A (en) * 2019-09-20 2019-12-24 科大讯飞股份有限公司 Voice keyword recognition method and device, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101154379B (en) * 2006-09-27 2011-11-23 夏普株式会社 Method and device for locating keywords in voice and voice recognition system
JP5142769B2 (en) * 2008-03-11 2013-02-13 株式会社日立製作所 Voice data search system and voice data search method
CA2690174C (en) * 2009-01-13 2014-10-14 Crim (Centre De Recherche Informatique De Montreal) Identifying keyword occurrences in audio data
US9378733B1 (en) * 2012-12-19 2016-06-28 Google Inc. Keyword detection without decoding
US9953632B2 (en) * 2014-04-17 2018-04-24 Qualcomm Incorporated Keyword model generation for detecting user-defined keyword

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001290496A (en) * 2000-04-07 2001-10-19 Ricoh Co Ltd Speech retrieval device, speech retrieval method and recording medium
CN103559881A (en) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Language-irrelevant key word recognition method and system
WO2018019116A1 (en) * 2016-07-28 2018-02-01 上海未来伙伴机器人有限公司 Natural language-based man-machine interaction method and system
WO2018107810A1 (en) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus, and electronic device and medium
CN110349572A (en) * 2017-05-27 2019-10-18 腾讯科技(深圳)有限公司 A kind of voice keyword recognition method, device, terminal and server
CN110444193A (en) * 2018-01-31 2019-11-12 腾讯科技(深圳)有限公司 The recognition methods of voice keyword and device
CN110610707A (en) * 2019-09-20 2019-12-24 科大讯飞股份有限公司 Voice keyword recognition method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Keyword detection method based on restricting model size and acoustic confidence; 郑铁然; 张战; 韩纪庆; Computer Science (01); full text *

Also Published As

Publication number Publication date
CN111798840A (en) 2020-10-20

Similar Documents

Publication Publication Date Title
CN111798840B (en) Voice keyword recognition method and device
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
Collobert et al. Wav2letter: an end-to-end convnet-based speech recognition system
Ferrer et al. Study of senone-based deep neural network approaches for spoken language recognition
US6845357B2 (en) Pattern recognition using an observable operator model
EP4018437B1 (en) Optimizing a keyword spotting system
CN113223506B (en) Speech recognition model training method and speech recognition method
CN112397056B (en) Voice evaluation method and computer storage medium
CN115019776A (en) Voice recognition model, training method thereof, voice recognition method and device
Das et al. Best of both worlds: Robust accented speech recognition with adversarial transfer learning
JP2020042257A (en) Voice recognition method and device
CN112074903A (en) System and method for tone recognition in spoken language
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
Das et al. Bottleneck feature-based hybrid deep autoencoder approach for Indian language identification
Radha et al. Speech and speaker recognition using raw waveform modeling for adult and children’s speech: a comprehensive review
CN114333762B (en) Expressive force-based speech synthesis method, expressive force-based speech synthesis system, electronic device and storage medium
CN115240712A (en) Multi-mode-based emotion classification method, device, equipment and storage medium
CN115512692A (en) Voice recognition method, device, equipment and storage medium
CN113066510B (en) Vowel weak reading detection method and device
CN115132170A (en) Language classification method and device and computer readable storage medium
CN115376547A (en) Pronunciation evaluation method and device, computer equipment and storage medium
CN114974310A (en) Emotion recognition method and device based on artificial intelligence, computer equipment and medium
Gündogdu Keyword search for low resource languages

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant