CN110827806A - Voice keyword detection method and system - Google Patents

Voice keyword detection method and system

Info

Publication number
CN110827806A
CN110827806A (application CN201910990230.XA)
Authority
CN
China
Prior art keywords
voice
speech
keyword
detected
vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910990230.XA
Other languages
Chinese (zh)
Other versions
CN110827806B (en)
Inventor
吴志勇
张坤
Current Assignee
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University filed Critical Shenzhen International Graduate School of Tsinghua University
Priority to CN201910990230.XA priority Critical patent/CN110827806B/en
Publication of CN110827806A publication Critical patent/CN110827806A/en
Application granted granted Critical
Publication of CN110827806B publication Critical patent/CN110827806B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques using neural networks
    • G10L2015/088 Word spotting

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a voice keyword detection method and system. The method comprises the following steps: converting the keyword voice example and the voice to be detected into hidden-state vector sequences with a parameter-sharing long short-term memory (LSTM) network; computing an attention matrix between the two hidden-state vector sequences using a trainable metric matrix; taking the row-wise and column-wise maxima of the attention matrix to obtain the attention weight vectors of the keyword voice example and the voice to be detected respectively, and then computing the weighted sum of each corresponding hidden-state vector sequence with its attention weight vector to obtain the final fixed-length vectors; and computing detection scores with a similarity measure, ranking all voices to be detected by detection score, and outputting the voices to be detected with the highest scores as the result. Because the two fixed-length vectors influence each other during voice encoding, mutually related semantic information is effectively retained, while the introduction of the attention mechanism removes the position bias of information encoding.

Description

Voice keyword detection method and system
Technical Field
The invention relates to the technical field of voice keyword detection, in particular to a voice keyword detection method and system.
Background
In the big-data era, internet services generate a large amount of voice data, and retrieving the required corpus from these data has become an urgent problem. Query-by-example voice keyword detection needs only a voice example of the keyword and the voice to be detected as input, and outputs the detection result directly, without using speech recognition technology. An existing query-by-example voice keyword detection system comprises two parts: voice encoding and a similarity measure. The voice encoding part consists of a long short-term memory (LSTM) network whose purpose is to encode a voice into a fixed-length vector. The similarity measure is generally cosine similarity. First, the voice encoding part encodes the input keyword voice example and the voice to be detected into two fixed-length vectors; then the similarity measure part computes the similarity between the two vectors; finally, all voices to be detected in the corpus are ranked by similarity, and the voices with the highest similarity are output. The key to the whole detection system is the design of the voice encoding part, so that the encoder extracts the semantic information in the voice effectively while discarding task-irrelevant information such as speaker identity, environmental noise and emotion. An LSTM-based voice encoder converts the acoustic feature sequence of a voice into a hidden-state vector sequence and then takes the hidden-state vector at the last moment as the fixed-length vector of the voice. This encoding keeps more semantic information from later time periods in the fixed-length vector and loses much semantic information from earlier time periods, a phenomenon called the position bias of information encoding.
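The prior-art encoding described above (run the feature sequence through a recurrent network and keep only the last hidden state) can be sketched as follows; a plain tanh recurrence stands in for the LSTM cell, and all weights are random illustrative values, not parameters from the patent:

```python
import numpy as np

def last_state_encode(features, W_h, W_x):
    """Encode an acoustic feature sequence with a toy recurrent cell and
    return only the LAST hidden state as the fixed-length vector.
    Because early frames are overwritten step by step, the result favours
    late time periods: the position bias discussed above."""
    h = np.zeros(W_h.shape[0])
    for x in features:                  # one acoustic frame per step
        h = np.tanh(W_h @ h + W_x @ x)  # recurrent state update
    return h

rng = np.random.default_rng(0)
T, d_in, d_h = 8, 4, 5                  # frames, feature dim, hidden dim
feats = rng.normal(size=(T, d_in))
W_h = rng.normal(scale=0.1, size=(d_h, d_h))
W_x = rng.normal(scale=0.1, size=(d_h, d_in))
v = last_state_encode(feats, W_h, W_x)  # fixed length regardless of T
```

The vector `v` has the same length whatever the number of frames `T`, which is what makes utterances of different durations comparable.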
Moreover, the encoding processes of the keyword voice example and the voice to be detected are independent of each other, so the semantic information that correlates the two cannot be extracted effectively.
In the prior art, a long short-term memory network serves as the voice encoder: it converts the acoustic feature sequence of a voice into a hidden-state vector sequence and then takes the hidden-state vector at the last moment as the fixed-length vector of the voice. Finally, the similarity between the two fixed-length vectors is computed, all voices to be detected in the corpus are ranked by similarity, and the voices with the highest similarity are output.
The existing scheme has the following defects:
(1) An LSTM-based voice encoder keeps more semantic information from later time periods in the fixed-length vector and loses much semantic information from earlier time periods; this phenomenon is called the position bias of information encoding.
(2) The encoding processes of the keyword voice example and the voice to be detected are independent of each other, so the semantic information that correlates the two cannot be extracted effectively.
Disclosure of Invention
The invention provides a voice keyword detection method and system for solving the existing problems.
In order to solve the above problems, the technical solution adopted by the present invention is as follows:
A method for detecting a voice keyword comprises the following steps: S1: converting the keyword voice example and the voice to be detected into hidden-state vector sequences with a parameter-sharing long short-term memory network; S2: computing an attention matrix between the two hidden-state vector sequences using a trainable metric matrix; S3: taking the row-wise and column-wise maxima of the attention matrix to obtain the attention weight vectors of the keyword voice example and the voice to be detected respectively, and then computing the weighted sum of each corresponding hidden-state vector sequence with its attention weight vector to obtain the final fixed-length vectors; S4: computing detection scores with the similarity measure, ranking all voices to be detected by detection score, and outputting the voices to be detected with the highest scores as the result.
Preferably, the following steps are further included after step S3: reconstructing the voice sequences of the keyword voice example and of the voice to be detected by decoding the fixed-length vectors, and comparing each reconstructed voice sequence with the original one to obtain a reconstruction loss; the reconstruction loss is added to the final training loss, and the model is trained by the back-propagation algorithm so that the fixed-length vectors retain the reconstruction information of the voice sequences.
Preferably, cosine similarity is used as the similarity measure between the fixed-length vectors.
Preferably, a convolutional neural network, a bidirectional recurrent neural network, or a time-delay neural network is used to compute the hidden-state vector sequences of the keyword voice example and the voice to be detected.
Preferably, the detection score is computed with a feedforward neural network.
Preferably, the long short-term memory network and the metric matrix are trained simultaneously.
Preferably, the training data is a speech recognition data set comprising voice data and corresponding text annotation data; voice segments of specific keywords are cut out by forced alignment, segments with the same semantics serve as positive sample pairs, and segments with different semantics serve as negative sample pairs.
Preferably, the training objective function is designed so that the distance between voice fixed-length vectors with the same semantics becomes closer and the distance between voice fixed-length vectors with different semantics becomes farther, where distance means cosine distance; the closer the distance, the larger the detection score, and the farther the distance, the smaller the detection score.
The invention also provides a system for detecting the voice keywords, which is characterized in that the method is adopted to detect the voice keywords.
The invention further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method as set forth in any of the above.
The beneficial effects of the invention are as follows: the voice keyword detection method and system effectively retain the mutually related semantic information by letting the two fixed-length vectors influence each other during voice encoding, while the introduction of the attention mechanism removes the position bias of information encoding.
Drawings
Fig. 1 is a schematic diagram of a method for detecting a speech keyword according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a speech coding structure based on a bidirectional attention mechanism in an embodiment of the present invention.
Fig. 3 is a schematic diagram of another method for detecting a speech keyword according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a system for detecting a speech keyword according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the embodiments of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element. In addition, the connection may be for either a fixing function or a circuit connection function.
It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in an orientation or positional relationship indicated in the drawings for convenience in describing the embodiments of the present invention and to simplify the description, and are not intended to indicate or imply that the referenced device or element must have a particular orientation, be constructed in a particular orientation, and be in any way limiting of the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present invention, "a plurality" means two or more unless specifically limited otherwise.
Example 1
As shown in fig. 1, the present invention provides a method for detecting a speech keyword, comprising the following steps:
S1: converting the keyword voice example and the voice to be detected into hidden-state vector sequences with a parameter-sharing long short-term memory network;
S2: computing an attention matrix between the two hidden-state vector sequences using a trainable metric matrix;
S3: taking the row-wise and column-wise maxima of the attention matrix to obtain the attention weight vectors of the keyword voice example and the voice to be detected respectively, and then computing the weighted sum of each corresponding hidden-state vector sequence with its attention weight vector to obtain the final fixed-length vectors;
S4: computing detection scores with the similarity measure, ranking all voices to be detected by detection score, and outputting the voices to be detected with the highest scores as the result.
As shown in FIG. 2, the keyword voice example and the voice to be detected are converted into hidden-state vector sequences H_Q and H_S by a parameter-sharing long short-term memory network; an attention matrix G between the two hidden-state vector sequences is then computed using a trainable metric matrix U. Taking the row-wise and column-wise maxima of the attention matrix yields the attention weight vectors σ_Q and σ_S of the keyword voice example and the voice to be detected respectively; the attention weight vectors are then used to compute weighted sums of the hidden-state vector sequences, giving the final fixed-length vectors V_Q and V_S. The trainable metric matrix U lets the encoding processes of the two voice inputs influence each other, so the correlated semantic information is extracted more effectively. Because the attention weights form a weighted sum over the whole hidden-state vector sequence, the position bias of information extraction is removed, and semantic information from earlier positions is no longer lost.
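The bidirectional attention pooling just described can be sketched in NumPy. The softmax normalisation of the pooled maxima is an assumption made for this sketch; the text specifies only a weighted summation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bidirectional_attention_pool(H_Q, H_S, U):
    """H_Q: (T_q, d) hidden states of the keyword example.
    H_S: (T_s, d) hidden states of the voice to be detected.
    U:   (d, d) trainable metric matrix."""
    G = H_Q @ U @ H_S.T               # attention matrix, shape (T_q, T_s)
    sigma_Q = softmax(G.max(axis=1))  # row-wise maxima -> query weights
    sigma_S = softmax(G.max(axis=0))  # column-wise maxima -> search weights
    V_Q = sigma_Q @ H_Q               # weighted sums give the
    V_S = sigma_S @ H_S               # fixed-length vectors
    return V_Q, V_S

rng = np.random.default_rng(1)
H_Q = rng.normal(size=(3, 4))         # 3 query frames, hidden dim 4
H_S = rng.normal(size=(6, 4))         # 6 search frames
U = rng.normal(size=(4, 4))
V_Q, V_S = bidirectional_attention_pool(H_Q, H_S, U)
```

Note that both outputs have the hidden dimension `d` regardless of the two sequence lengths, and each sequence's weights depend on the other sequence through G.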
The fixed-length vectors extracted by the bidirectional attention mechanism use cosine similarity as the similarity measure. The encoding processes of the keyword example and the voice to be detected are completely symmetric and share parameters, so the extracted fixed-length vectors lie in the same vector space. The trainable metric matrix can learn from data a specific mapping that maps inputs from different domains (e.g., speech in different languages) into a vector space in which similarity is comparable. These properties greatly improve the comparability of the extracted fixed-length vectors. For more complex data distributions, the similarity measure can be upgraded to a feedforward neural network.
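Scoring and ranking with cosine similarity, as described, might look like this minimal sketch:

```python
import numpy as np

def cosine_score(v_q, v_s):
    """Cosine similarity between the two fixed-length vectors,
    used directly as the detection score."""
    denom = np.linalg.norm(v_q) * np.linalg.norm(v_s) + 1e-12
    return float(v_q @ v_s / denom)

def rank_corpus(v_q, corpus_vectors):
    """Rank voices to be detected by descending detection score."""
    scores = [cosine_score(v_q, v) for v in corpus_vectors]
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order, scores

query = np.array([1.0, 0.0])
corpus = [np.array([0.0, 1.0]),   # orthogonal: score near 0
          np.array([2.0, 0.0])]   # same direction: score near 1
order, scores = rank_corpus(query, corpus)
```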
Each set of training data consists of a positive sample pair and a negative sample pair: a positive sample pair is two voices with the same semantics, and a negative sample pair is two voices with different semantics. The data source is a speech recognition data set comprising voice data and corresponding text annotation data. Voice segments with specific semantics (keywords) are cut out by forced alignment; segments with the same semantics then serve as positive sample pairs, and segments with different semantics as negative sample pairs.
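Pair construction from the force-aligned keyword segments can be sketched as follows; the `(label, features)` tuple layout is an illustrative assumption:

```python
from itertools import combinations

def build_pairs(segments):
    """segments: list of (keyword_label, features) for the voice
    segments cut out by forced alignment. Segments with the same label
    form positive pairs; segments with different labels form negative
    pairs."""
    pos, neg = [], []
    for (la, fa), (lb, fb) in combinations(segments, 2):
        (pos if la == lb else neg).append((fa, fb))
    return pos, neg

segs = [("apple", "feat_a1"), ("apple", "feat_a2"), ("banana", "feat_b1")]
pos, neg = build_pairs(segs)
```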
The objective function is designed so that the distance between voice fixed-length vectors with the same semantics becomes closer and the distance between voice fixed-length vectors with different semantics becomes farther, where distance means cosine distance. The long short-term memory network and metric matrix parameters learned through this supervised process map voice inputs with the same semantics to two nearby vectors and voice inputs with different semantics to two distant vectors. The closer the distance, the larger the detection score; the farther the distance, the smaller the detection score. Detection is finally realised by ranking all voices to be detected in the corpus by detection score.
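The text does not name the exact loss function, so the sketch below uses a margin-based hinge loss on cosine distance, one standard choice that matches the stated goal (same semantics pulled closer, different semantics pushed farther); the margin value is an assumption:

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def pair_loss(v1, v2, same_semantics, margin=0.5):
    """Pull positive pairs together; push negative pairs apart until
    their cosine distance reaches the margin."""
    d = cosine_distance(v1, v2)
    return d if same_semantics else max(0.0, margin - d)

a = np.array([1.0, 0.0])
b = np.array([1.0, 0.0])   # same semantics, already close: loss ~ 0
c = np.array([-1.0, 0.0])  # different semantics, already far: loss 0
```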
The long short-term memory networks (RNNs) and the metric matrix U are trained simultaneously. First the parameter values of the RNNs and of U are obtained by training on the training set; keyword detection is then performed with the trained parameter values. The training process uses the back-propagation algorithm: all values in the metric matrix are free parameters that are initialised, updated along the gradient of the objective function returned by back-propagation, and finally converge. The same applies to the training of the RNNs.
As shown in fig. 3, in an embodiment of the present invention, after step S3 and before step S4, the method further includes:
reconstructing the voice sequences of the keyword voice example and of the voice to be detected by decoding the fixed-length vectors, and comparing each reconstructed voice sequence with the original one to obtain a reconstruction loss; the reconstruction loss is added to the final training loss, and the model is trained by the back-propagation algorithm so that the fixed-length vectors retain the reconstruction information of the voice sequences.
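The reconstruction term can be sketched as below; a frame-independent linear decoder is an illustrative assumption (the text only requires decoding the fixed-length vector back into a voice sequence and measuring the discrepancy):

```python
import numpy as np

def reconstruction_loss(original_seq, fixed_vec, W_dec):
    """original_seq: (T, d) acoustic features; fixed_vec: (k,) fixed-length
    vector; W_dec: (d, k) toy linear decoder. Returns the mean squared
    error between the decoded frames and the original sequence, which
    would be added to the final training loss."""
    T = original_seq.shape[0]
    decoded = np.tile(W_dec @ fixed_vec, (T, 1))  # same decode per frame
    return float(np.mean((decoded - original_seq) ** 2))

rng = np.random.default_rng(2)
seq = rng.normal(size=(6, 4))
vec = rng.normal(size=(5,))
W_dec = rng.normal(size=(4, 5))
loss = reconstruction_loss(seq, vec, W_dec)
```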
That is, the invention adds an auto-encoder structure so that the fixed-length vector of the voice retains the reconstruction information of the voice.
In another embodiment of the present invention, a convolutional neural network, a bidirectional recurrent neural network, or a time-delay neural network may also be used to compute the hidden-state vector sequences of the keyword voice example and the voice to be detected, and a feedforward neural network may be used to compute the detection score.
Taking the keyword "Apple" as an example, the test is to retrieve the 50 voice segments containing the keyword "Apple" from a corpus of 10000 voice segments to be detected. The 10000 segments in the corpus are ranked by detection score from high to low; among the top 20 returned results, the prior art finds 2 segments containing "Apple" while the invention finds 7, more than tripling the hit rate.
As shown in fig. 4, the present invention further provides a system for detecting voice keywords, which uses the method of the present invention. In the detection system, the user inputs a keyword voice example, a voice to be detected is taken from the corpus, the voice encoder produces the fixed-length vector of each, and the similarity measure part computes the detection score. All voices to be detected in the corpus are ranked by detection score, and the voices with the highest scores are output as the result.
All or part of the flow of the method in the embodiments of the present invention may be realized by instructing related hardware through a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments are realized. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier signals, telecommunication signals, a software distribution medium, etc. It should be noted that the content contained in the computer-readable medium may be increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunication signals.
The beneficial effects brought by the technical scheme of the invention can be summarized as follows:
1) The two fixed-length vectors influence each other during voice encoding, so mutually related semantic information is effectively retained. Compared with the scheme that uses only a long short-term memory network as the encoder, over 200 keyword detection samples counted on an English corpus, the average hit rate among the top-20 returned results improves by more than 30% in relative terms.
2) The introduction of the attention mechanism removes the position bias of information encoding: the final fixed-length vector is obtained as an attention-weighted sum over the whole hidden-state vector sequence of the voice. Compared with the scheme that uses only a long short-term memory network as the encoder, phoneme suffixes have little influence on the fixed-length vectors extracted from voice segments; the minimum-edit-distance change of the fixed-length vector caused by modifying phoneme suffixes is reduced by 86%.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments, and the invention is not to be considered limited to these details. For those skilled in the art to which the invention pertains, several equivalent substitutions or obvious modifications can be made without departing from the spirit of the invention, and all such substitutions and modifications are considered to fall within the scope of the invention.

Claims (10)

1. A method for detecting a voice keyword is characterized by comprising the following steps:
S1: converting the keyword voice example and the voice to be detected into hidden-state vector sequences with a parameter-sharing long short-term memory network;
S2: computing an attention matrix between the two hidden-state vector sequences using a trainable metric matrix;
S3: taking the row-wise and column-wise maxima of the attention matrix to obtain the attention weight vectors of the keyword voice example and the voice to be detected respectively, and then computing the weighted sum of each corresponding hidden-state vector sequence with its attention weight vector to obtain the final fixed-length vectors;
S4: computing detection scores with the similarity measure, ranking all voices to be detected by detection score, and outputting the voices to be detected with the highest scores as the result.
2. The method for detecting a speech keyword according to claim 1, further comprising, after the step S3, the steps of:
reconstructing the voice sequences of the keyword voice example and of the voice to be detected by decoding the fixed-length vectors, and comparing each reconstructed voice sequence with the original one to obtain a reconstruction loss; the reconstruction loss is added to the final training loss, and the model is trained by the back-propagation algorithm so that the fixed-length vectors retain the reconstruction information of the voice sequences.
3. The method according to claim 1, wherein the fixed-length vector uses cosine similarity as a measure of similarity.
4. The method according to claim 1, wherein the hidden state vector sequences of the keyword speech instance and the speech to be detected are calculated by using a convolutional neural network, a bidirectional cyclic neural network, and a time-delay neural network.
5. The method of detecting a keyword in speech according to claim 1, wherein the detection score is calculated using a feedforward neural network.
6. The method for detecting a voice keyword according to claim 1, wherein the long short-term memory network and the metric matrix are trained simultaneously.
7. The method of claim 6, wherein the training data is a speech recognition data set comprising voice data and corresponding text label data; voice segments of specific keywords are cut out by forced alignment, segments with the same semantics serve as positive sample pairs, and segments with different semantics serve as negative sample pairs.
8. The method for detecting a voice keyword according to claim 6, wherein the training objective function is designed so that the distance between voice fixed-length vectors with the same semantics becomes closer and the distance between voice fixed-length vectors with different semantics becomes farther, where distance means cosine distance; the closer the distance, the larger the detection score; the farther the distance, the smaller the detection score.
9. A system for detecting speech keywords, characterized in that the method according to any of claims 1-8 is used for detecting speech keywords.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN201910990230.XA 2019-10-17 2019-10-17 Voice keyword detection method and system Active CN110827806B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910990230.XA CN110827806B (en) 2019-10-17 2019-10-17 Voice keyword detection method and system


Publications (2)

Publication Number Publication Date
CN110827806A true CN110827806A (en) 2020-02-21
CN110827806B CN110827806B (en) 2022-01-28

Family

ID=69549466

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910990230.XA Active CN110827806B (en) 2019-10-17 2019-10-17 Voice keyword detection method and system

Country Status (1)

Country Link
CN (1) CN110827806B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112259083A (en) * 2020-10-16 2021-01-22 北京猿力未来科技有限公司 Audio processing method and device
CN112685594A (en) * 2020-12-24 2021-04-20 中国人民解放军战略支援部队信息工程大学 Attention-based weak supervision voice retrieval method and system
CN113035231A (en) * 2021-03-18 2021-06-25 三星(中国)半导体有限公司 Keyword detection method and device
CN113823274A (en) * 2021-08-16 2021-12-21 华南理工大学 Voice keyword sample screening method based on detection error weighted editing distance
CN114051075A (en) * 2021-10-28 2022-02-15 重庆川南环保科技有限公司 Voice quality inspection method and device and terminal equipment
CN116453514A (en) * 2023-06-08 2023-07-18 四川大学 Multi-view-based voice keyword detection and positioning method and device

Citations (11)

Publication number Priority date Publication date Assignee Title
US6026358A (en) * 1994-12-22 2000-02-15 Justsystem Corporation Neural network, a method of learning of a neural network and phoneme recognition apparatus utilizing a neural network
EP2881939A1 (en) * 2013-12-09 2015-06-10 MediaTek, Inc System for speech keyword detection and associated method
US20170148429A1 (en) * 2015-11-24 2017-05-25 Fujitsu Limited Keyword detector and keyword detection method
US20170192956A1 (en) * 2015-12-31 2017-07-06 Google Inc. Generating parse trees of text segments using neural networks
CN107230475A (en) * 2017-05-27 2017-10-03 腾讯科技(深圳)有限公司 A kind of voice keyword recognition method, device, terminal and server
CN107293296A (en) * 2017-06-28 2017-10-24 百度在线网络技术(北京)有限公司 Voice identification result correcting method, device, equipment and storage medium
EP3312777A1 (en) * 2015-02-06 2018-04-25 Google LLC Recurrent neural network system for data item generation
CN108388554A (en) * 2018-01-04 2018-08-10 中国科学院自动化研究所 Text emotion identifying system based on collaborative filtering attention mechanism
CN109817233A (en) * 2019-01-25 2019-05-28 清华大学 Voice flow steganalysis method and system based on level attention network model
CN110168575A (en) * 2016-12-14 2019-08-23 微软技术许可有限责任公司 Dynamic tensor attention for information retrieval scoring
US20190267023A1 (en) * 2018-02-28 2019-08-29 Microsoft Technology Licensing, Llc Speech recognition using connectionist temporal classification


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Ethan R. Duni et al.: "High-Rate Optimized Recursive Vector Quantization Structures Using Hidden Markov Models", IEEE Transactions on Audio, Speech, and Language Processing *
Kartik Audhkhasi et al.: "End-to-End ASR-Free Keyword Search From Speech", IEEE Journal of Selected Topics in Signal Processing *
Xixin Wu et al.: "Automatic speech data clustering with human perception based weighted distance", The 9th International Symposium on Chinese Spoken Language Processing *
Zhu Zhangli: "Research Progress on Attention Mechanism in Deep Learning", Journal of Chinese Information Processing *
Li Yeliang et al.: "Research on Speech Recognition Based on Hybrid Attention Mechanism", Application Research of Computers *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112259083A (en) * 2020-10-16 2021-01-22 北京猿力未来科技有限公司 Audio processing method and device
CN112259083B (en) * 2020-10-16 2024-02-13 北京猿力未来科技有限公司 Audio processing method and device
CN112685594A (en) * 2020-12-24 2021-04-20 中国人民解放军战略支援部队信息工程大学 Attention-based weak supervision voice retrieval method and system
CN112685594B (en) * 2020-12-24 2022-10-04 中国人民解放军战略支援部队信息工程大学 Attention-based weak supervision voice retrieval method and system
CN113035231A (en) * 2021-03-18 2021-06-25 三星(中国)半导体有限公司 Keyword detection method and device
CN113035231B (en) * 2021-03-18 2024-01-09 三星(中国)半导体有限公司 Keyword detection method and device
CN113823274A (en) * 2021-08-16 2021-12-21 华南理工大学 Voice keyword sample screening method based on detection error weighted editing distance
CN113823274B (en) * 2021-08-16 2023-10-27 华南理工大学 Voice keyword sample screening method based on detection error weighted editing distance
CN114051075A (en) * 2021-10-28 2022-02-15 重庆川南环保科技有限公司 Voice quality inspection method and device and terminal equipment
CN116453514A (en) * 2023-06-08 2023-07-18 四川大学 Multi-view-based voice keyword detection and positioning method and device
CN116453514B (en) * 2023-06-08 2023-08-25 四川大学 Multi-view-based voice keyword detection and positioning method and device

Also Published As

Publication number Publication date
CN110827806B (en) 2022-01-28

Similar Documents

Publication Publication Date Title
CN110827806B (en) Voice keyword detection method and system
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
CN109597876B (en) Reinforcement learning-based multi-turn dialogue reply selection model and method
CN111881260B (en) Emotion analysis method and device based on aspect attention and convolutional memory neural network
CN109840287A (en) Cross-modal information retrieval method and device based on neural network
CN109800434B (en) Method for generating abstract text title based on eye movement attention
CN110457718B (en) Text generation method and device, computer equipment and storage medium
CN107229610A (en) Emotion data analysis method and device
Rana et al. Emotion based hate speech detection using multimodal learning
Ragni et al. Confidence estimation and deletion prediction using bidirectional recurrent neural networks
CN111382573A (en) Method, apparatus, device and storage medium for answer quality assessment
Wang et al. Dynamically disentangling social bias from task-oriented representations with adversarial attack
CN112036705A (en) Quality inspection result data acquisition method, device and equipment
CN114003682A (en) Text classification method, device, equipment and storage medium
Xie et al. Language-based audio retrieval task in DCASE 2022 challenge
Hohenecker et al. Systematic comparison of neural architectures and training approaches for open information extraction
Xu et al. A comprehensive survey of automated audio captioning
CN115905487A (en) Document question and answer method, system, electronic equipment and storage medium
CN117056494B (en) Open domain question and answer method, device, electronic equipment and computer storage medium
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN116821339A (en) Misuse language detection method, device and storage medium
CN114333762B (en) Expressiveness-based speech synthesis method, system, electronic device and storage medium
Kongyoung et al. monoQA: Multi-task learning of reranking and answer extraction for open-retrieval conversational question answering
Mei et al. Towards generating diverse audio captions via adversarial training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant