CN110827806A - Voice keyword detection method and system - Google Patents

Voice keyword detection method and system

Info

Publication number
CN110827806A
CN110827806A (application CN201910990230.XA)
Authority
CN
China
Prior art keywords
voice
speech
keyword
detected
vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910990230.XA
Other languages
Chinese (zh)
Other versions
CN110827806B (en)
Inventor
吴志勇
张坤
Current Assignee
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University filed Critical Shenzhen International Graduate School of Tsinghua University
Priority to CN201910990230.XA priority Critical patent/CN110827806B/en
Publication of CN110827806A publication Critical patent/CN110827806A/en
Application granted granted Critical
Publication of CN110827806B publication Critical patent/CN110827806B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques using neural networks
    • G10L2015/088 Word spotting

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a voice keyword detection method and system. The method comprises the following steps: converting the keyword voice example and the voice to be detected into hidden-state vector sequences with a parameter-sharing long short-term memory (LSTM) network; computing an attention matrix between the two hidden-state vector sequences using a trainable metric matrix; taking the row-wise and column-wise maxima of the attention matrix to obtain the attention weight vectors of the keyword voice example and the voice to be detected respectively, and then computing the weighted sum of each corresponding hidden-state vector sequence with its attention weight vector to obtain the final fixed-length vectors; and computing detection scores with a similarity measure, ranking all voices to be detected by detection score, and outputting the voices to be detected with the highest scores as the result. Because the two fixed-length vectors influence each other during voice encoding, mutually related semantic information is effectively retained, while the introduction of the attention mechanism removes the position bias of information encoding.

Description

Voice keyword detection method and system
Technical Field
The invention relates to the technical field of voice keyword detection, in particular to a voice keyword detection method and system.
Background
In the big-data era, internet services generate a large amount of voice data, and retrieving the required corpus from these data has become an urgent problem. Query-by-example voice keyword detection needs only a voice example of the keyword and the voice to be detected as input, and outputs the detection result directly, without using speech recognition technology. An existing query-by-example voice keyword detection system comprises two parts: voice encoding and a similarity measure. The voice encoding part consists of a long short-term memory (LSTM) network whose purpose is to encode a voice into a fixed-length vector. The similarity measure is generally cosine similarity. First, the voice encoding part encodes the input keyword voice example and the voice to be detected into two fixed-length vectors; then the similarity measure part computes the similarity between the two vectors; finally, all voices to be detected in the corpus are ranked by similarity, and the voices with the highest similarity are output. The key to the whole detection system is the design of the voice encoding part, so that the encoder extracts the semantic information in the voice effectively while discarding task-irrelevant information such as speaker identity, environmental noise and emotion. An LSTM-based voice encoder converts the acoustic feature sequence of a voice into a hidden-state vector sequence and then takes the hidden-state vector at the last moment as the fixed-length vector of the voice. This encoding keeps more semantic information from later time periods in the fixed-length vector and loses much semantic information from earlier time periods, a phenomenon called the position bias of information encoding.
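The prior-art encoding described above (run the feature sequence through a recurrent network and keep only the last hidden state) can be sketched as follows; a plain tanh recurrence stands in for the LSTM cell, and all weights are random illustrative values, not parameters from the patent:

```python
import numpy as np

def last_state_encode(features, W_h, W_x):
    """Encode an acoustic feature sequence with a toy recurrent cell and
    return only the LAST hidden state as the fixed-length vector.
    Because early frames are overwritten step by step, the result favours
    late time periods: the position bias discussed above."""
    h = np.zeros(W_h.shape[0])
    for x in features:                  # one acoustic frame per step
        h = np.tanh(W_h @ h + W_x @ x)  # recurrent state update
    return h

rng = np.random.default_rng(0)
T, d_in, d_h = 8, 4, 5                  # frames, feature dim, hidden dim
feats = rng.normal(size=(T, d_in))
W_h = rng.normal(scale=0.1, size=(d_h, d_h))
W_x = rng.normal(scale=0.1, size=(d_h, d_in))
v = last_state_encode(feats, W_h, W_x)  # fixed length regardless of T
```

The vector `v` has the same length whatever the number of frames `T`, which is what makes utterances of different durations comparable.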
Moreover, the encoding processes of the keyword voice example and the voice to be detected are independent of each other, so the semantic information that correlates the two cannot be extracted effectively.
In the prior art, a long short-term memory network serves as the voice encoder: it converts the acoustic feature sequence of a voice into a hidden-state vector sequence and then takes the hidden-state vector at the last moment as the fixed-length vector of the voice. Finally, the similarity between the two fixed-length vectors is computed, all voices to be detected in the corpus are ranked by similarity, and the voices with the highest similarity are output.
The existing scheme has the following defects:
(1) An LSTM-based voice encoder keeps more semantic information from later time periods in the fixed-length vector and loses much semantic information from earlier time periods; this phenomenon is called the position bias of information encoding.
(2) The encoding processes of the keyword voice example and the voice to be detected are independent of each other, so the semantic information that correlates the two cannot be extracted effectively.
Disclosure of Invention
The invention provides a voice keyword detection method and system for solving the existing problems.
In order to solve the above problems, the technical solution adopted by the present invention is as follows:
A method for detecting a voice keyword comprises the following steps: S1: converting the keyword voice example and the voice to be detected into hidden-state vector sequences with a parameter-sharing long short-term memory network; S2: computing an attention matrix between the two hidden-state vector sequences using a trainable metric matrix; S3: taking the row-wise and column-wise maxima of the attention matrix to obtain the attention weight vectors of the keyword voice example and the voice to be detected respectively, and then computing the weighted sum of each corresponding hidden-state vector sequence with its attention weight vector to obtain the final fixed-length vectors; S4: computing detection scores with the similarity measure, ranking all voices to be detected by detection score, and outputting the voices to be detected with the highest scores as the result.
Preferably, the following steps are further included after step S3: reconstructing the voice sequences of the keyword voice example and of the voice to be detected by decoding the fixed-length vectors, and comparing each reconstructed voice sequence with the original one to obtain a reconstruction loss; the reconstruction loss is added to the final training loss, and the model is trained by the back-propagation algorithm so that the fixed-length vectors retain the reconstruction information of the voice sequences.
Preferably, cosine similarity is used as the similarity measure between the fixed-length vectors.
Preferably, a convolutional neural network, a bidirectional recurrent neural network, or a time-delay neural network is used to compute the hidden-state vector sequences of the keyword voice example and the voice to be detected.
Preferably, the detection score is computed with a feedforward neural network.
Preferably, the long short-term memory network and the metric matrix are trained simultaneously.
Preferably, the training data is a speech recognition data set comprising voice data and corresponding text annotation data; voice segments of specific keywords are cut out by forced alignment, segments with the same semantics serve as positive sample pairs, and segments with different semantics serve as negative sample pairs.
Preferably, the training objective function is designed so that the distance between voice fixed-length vectors with the same semantics becomes closer and the distance between voice fixed-length vectors with different semantics becomes farther, where distance means cosine distance; the closer the distance, the larger the detection score, and the farther the distance, the smaller the detection score.
The invention also provides a system for detecting the voice keywords, which is characterized in that the method is adopted to detect the voice keywords.
The invention further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method as set forth in any of the above.
The beneficial effects of the invention are as follows: the voice keyword detection method and system effectively retain the mutually related semantic information by letting the two fixed-length vectors influence each other during voice encoding, while the introduction of the attention mechanism removes the position bias of information encoding.
Drawings
Fig. 1 is a schematic diagram of a method for detecting a speech keyword according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a speech coding structure based on a bidirectional attention mechanism in an embodiment of the present invention.
Fig. 3 is a schematic diagram of another method for detecting a speech keyword according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a system for detecting a speech keyword according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the embodiments of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element. In addition, the connection may be for either a fixing function or a circuit connection function.
It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in an orientation or positional relationship indicated in the drawings for convenience in describing the embodiments of the present invention and to simplify the description, and are not intended to indicate or imply that the referenced device or element must have a particular orientation, be constructed in a particular orientation, and be in any way limiting of the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present invention, "a plurality" means two or more unless specifically limited otherwise.
Example 1
As shown in fig. 1, the present invention provides a method for detecting a speech keyword, comprising the following steps:
S1: converting the keyword voice example and the voice to be detected into hidden-state vector sequences with a parameter-sharing long short-term memory network;
S2: computing an attention matrix between the two hidden-state vector sequences using a trainable metric matrix;
S3: taking the row-wise and column-wise maxima of the attention matrix to obtain the attention weight vectors of the keyword voice example and the voice to be detected respectively, and then computing the weighted sum of each corresponding hidden-state vector sequence with its attention weight vector to obtain the final fixed-length vectors;
S4: computing detection scores with the similarity measure, ranking all voices to be detected by detection score, and outputting the voices to be detected with the highest scores as the result.
As shown in FIG. 2, the keyword voice example and the voice to be detected are converted into hidden-state vector sequences H_Q and H_S by a parameter-sharing long short-term memory network; an attention matrix G between the two hidden-state vector sequences is then computed using a trainable metric matrix U. Taking the row-wise and column-wise maxima of the attention matrix yields the attention weight vectors σ_Q and σ_S of the keyword voice example and the voice to be detected respectively; the attention weight vectors are then used to compute weighted sums of the hidden-state vector sequences, giving the final fixed-length vectors V_Q and V_S. The trainable metric matrix U lets the encoding processes of the two voice inputs influence each other, so the correlated semantic information is extracted more effectively. Because the attention weights form a weighted sum over the whole hidden-state vector sequence, the position bias of information extraction is removed, and semantic information from earlier positions is no longer lost.
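The bidirectional attention pooling just described can be sketched in NumPy. The softmax normalisation of the pooled maxima is an assumption made for this sketch; the text specifies only a weighted summation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bidirectional_attention_pool(H_Q, H_S, U):
    """H_Q: (T_q, d) hidden states of the keyword example.
    H_S: (T_s, d) hidden states of the voice to be detected.
    U:   (d, d) trainable metric matrix."""
    G = H_Q @ U @ H_S.T               # attention matrix, shape (T_q, T_s)
    sigma_Q = softmax(G.max(axis=1))  # row-wise maxima -> query weights
    sigma_S = softmax(G.max(axis=0))  # column-wise maxima -> search weights
    V_Q = sigma_Q @ H_Q               # weighted sums give the
    V_S = sigma_S @ H_S               # fixed-length vectors
    return V_Q, V_S

rng = np.random.default_rng(1)
H_Q = rng.normal(size=(3, 4))         # 3 query frames, hidden dim 4
H_S = rng.normal(size=(6, 4))         # 6 search frames
U = rng.normal(size=(4, 4))
V_Q, V_S = bidirectional_attention_pool(H_Q, H_S, U)
```

Note that both outputs have the hidden dimension `d` regardless of the two sequence lengths, and each sequence's weights depend on the other sequence through G.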
The fixed-length vectors extracted by the bidirectional attention mechanism use cosine similarity as the similarity measure. The encoding processes of the keyword example and the voice to be detected are completely symmetric and share parameters, so the extracted fixed-length vectors lie in the same vector space. The trainable metric matrix can learn from data a specific mapping that maps inputs from different domains (e.g., speech in different languages) into a vector space in which similarity is comparable. These properties greatly improve the comparability of the extracted fixed-length vectors. For more complex data distributions, the similarity measure can be upgraded to a feedforward neural network.
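Scoring and ranking with cosine similarity, as described, might look like this minimal sketch:

```python
import numpy as np

def cosine_score(v_q, v_s):
    """Cosine similarity between the two fixed-length vectors,
    used directly as the detection score."""
    denom = np.linalg.norm(v_q) * np.linalg.norm(v_s) + 1e-12
    return float(v_q @ v_s / denom)

def rank_corpus(v_q, corpus_vectors):
    """Rank voices to be detected by descending detection score."""
    scores = [cosine_score(v_q, v) for v in corpus_vectors]
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order, scores

query = np.array([1.0, 0.0])
corpus = [np.array([0.0, 1.0]),   # orthogonal: score near 0
          np.array([2.0, 0.0])]   # same direction: score near 1
order, scores = rank_corpus(query, corpus)
```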
Each set of training data consists of a positive sample pair and a negative sample pair: a positive sample pair is two voices with the same semantics, and a negative sample pair is two voices with different semantics. The data source is a speech recognition data set comprising voice data and corresponding text annotation data. Voice segments with specific semantics (keywords) are cut out by forced alignment; segments with the same semantics then serve as positive sample pairs, and segments with different semantics as negative sample pairs.
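Pair construction from the force-aligned keyword segments can be sketched as follows; the `(label, features)` tuple layout is an illustrative assumption:

```python
from itertools import combinations

def build_pairs(segments):
    """segments: list of (keyword_label, features) for the voice
    segments cut out by forced alignment. Segments with the same label
    form positive pairs; segments with different labels form negative
    pairs."""
    pos, neg = [], []
    for (la, fa), (lb, fb) in combinations(segments, 2):
        (pos if la == lb else neg).append((fa, fb))
    return pos, neg

segs = [("apple", "feat_a1"), ("apple", "feat_a2"), ("banana", "feat_b1")]
pos, neg = build_pairs(segs)
```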
The objective function is designed so that the distance between voice fixed-length vectors with the same semantics becomes closer and the distance between voice fixed-length vectors with different semantics becomes farther, where distance means cosine distance. The long short-term memory network and metric matrix parameters learned through this supervised process map voice inputs with the same semantics to two nearby vectors and voice inputs with different semantics to two distant vectors. The closer the distance, the larger the detection score; the farther the distance, the smaller the detection score. Detection is finally realised by ranking all voices to be detected in the corpus by detection score.
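The text does not name the exact loss function, so the sketch below uses a margin-based hinge loss on cosine distance, one standard choice that matches the stated goal (same semantics pulled closer, different semantics pushed farther); the margin value is an assumption:

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def pair_loss(v1, v2, same_semantics, margin=0.5):
    """Pull positive pairs together; push negative pairs apart until
    their cosine distance reaches the margin."""
    d = cosine_distance(v1, v2)
    return d if same_semantics else max(0.0, margin - d)

a = np.array([1.0, 0.0])
b = np.array([1.0, 0.0])   # same semantics, already close: loss ~ 0
c = np.array([-1.0, 0.0])  # different semantics, already far: loss 0
```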
The long short-term memory networks (RNNs) and the metric matrix U are trained simultaneously. First the parameter values of the RNNs and of U are obtained by training on the training set; keyword detection is then performed with the trained parameter values. The training process uses the back-propagation algorithm: all values in the metric matrix are free parameters that are initialised, updated along the gradient of the objective function returned by back-propagation, and finally converge. The same applies to the training of the RNNs.
As shown in fig. 3, in an embodiment of the present invention, after step S3 and before step S4, the method further includes:
reconstructing the voice sequences of the keyword voice example and of the voice to be detected by decoding the fixed-length vectors, and comparing each reconstructed voice sequence with the original one to obtain a reconstruction loss; the reconstruction loss is added to the final training loss, and the model is trained by the back-propagation algorithm so that the fixed-length vectors retain the reconstruction information of the voice sequences.
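The reconstruction term can be sketched as below; a frame-independent linear decoder is an illustrative assumption (the text only requires decoding the fixed-length vector back into a voice sequence and measuring the discrepancy):

```python
import numpy as np

def reconstruction_loss(original_seq, fixed_vec, W_dec):
    """original_seq: (T, d) acoustic features; fixed_vec: (k,) fixed-length
    vector; W_dec: (d, k) toy linear decoder. Returns the mean squared
    error between the decoded frames and the original sequence, which
    would be added to the final training loss."""
    T = original_seq.shape[0]
    decoded = np.tile(W_dec @ fixed_vec, (T, 1))  # same decode per frame
    return float(np.mean((decoded - original_seq) ** 2))

rng = np.random.default_rng(2)
seq = rng.normal(size=(6, 4))
vec = rng.normal(size=(5,))
W_dec = rng.normal(size=(4, 5))
loss = reconstruction_loss(seq, vec, W_dec)
```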
That is, the invention adds an auto-encoder structure so that the fixed-length vector of the voice retains the reconstruction information of the voice.
In another embodiment of the present invention, a convolutional neural network, a bidirectional recurrent neural network, or a time-delay neural network may also be used to compute the hidden-state vector sequences of the keyword voice example and the voice to be detected, and a feedforward neural network may be used to compute the detection score.
Taking the keyword "Apple" as an example, the test is to retrieve the 50 voice segments containing the keyword "Apple" from a corpus of 10000 voice segments to be detected. The 10000 segments in the corpus are ranked by detection score from high to low; among the top 20 returned results, the prior art finds 2 segments containing "Apple" while the invention finds 7, more than tripling the hit rate.
As shown in fig. 4, the present invention further provides a system for detecting voice keywords, which uses the method of the present invention. In the detection system, the user inputs a keyword voice example, a voice to be detected is taken from the corpus, the voice encoder produces the fixed-length vector of each, and the similarity measure part computes the detection score. All voices to be detected in the corpus are ranked by detection score, and the voices with the highest scores are output as the result.
All or part of the flow of the method in the embodiments of the present invention may be realized by instructing related hardware through a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments are realized. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier signals, telecommunication signals, a software distribution medium, etc. It should be noted that the content contained in the computer-readable medium may be increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunication signals.
The beneficial effects brought by the technical scheme of the invention can be summarized as follows:
1) The two fixed-length vectors influence each other during voice encoding, so mutually related semantic information is effectively retained. Compared with the scheme that uses only a long short-term memory network as the encoder, over 200 keyword detection samples counted on an English corpus, the average hit rate among the top-20 returned results improves by more than 30% in relative terms.
2) The introduction of the attention mechanism removes the position bias of information encoding: the final fixed-length vector is obtained as an attention-weighted sum over the whole hidden-state vector sequence of the voice. Compared with the scheme that uses only a long short-term memory network as the encoder, phoneme suffixes have little influence on the fixed-length vectors extracted from voice segments; the minimum-edit-distance change of the fixed-length vector caused by modifying phoneme suffixes is reduced by 86%.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments, and the invention is not to be considered limited to these details. For those skilled in the art to which the invention pertains, several equivalent substitutions or obvious modifications can be made without departing from the spirit of the invention, and all such substitutions and modifications are considered to fall within the scope of the invention.

Claims (10)

1. A method for detecting a voice keyword is characterized by comprising the following steps:
S1: converting the keyword voice example and the voice to be detected into hidden-state vector sequences with a parameter-sharing long short-term memory network;
S2: computing an attention matrix between the two hidden-state vector sequences using a trainable metric matrix;
S3: taking the row-wise and column-wise maxima of the attention matrix to obtain the attention weight vectors of the keyword voice example and the voice to be detected respectively, and then computing the weighted sum of each corresponding hidden-state vector sequence with its attention weight vector to obtain the final fixed-length vectors;
S4: computing detection scores with the similarity measure, ranking all voices to be detected by detection score, and outputting the voices to be detected with the highest scores as the result.
2. The method for detecting a speech keyword according to claim 1, further comprising, after the step S3, the steps of:
reconstructing the voice sequences of the keyword voice example and of the voice to be detected by decoding the fixed-length vectors, and comparing each reconstructed voice sequence with the original one to obtain a reconstruction loss; the reconstruction loss is added to the final training loss, and the model is trained by the back-propagation algorithm so that the fixed-length vectors retain the reconstruction information of the voice sequences.
3. The method according to claim 1, wherein the fixed-length vector uses cosine similarity as a measure of similarity.
4. The method according to claim 1, wherein the hidden state vector sequences of the keyword speech instance and the speech to be detected are calculated by using a convolutional neural network, a bidirectional cyclic neural network, and a time-delay neural network.
5. The method of detecting a keyword in speech according to claim 1, wherein the detection score is calculated using a feedforward neural network.
6. The method for detecting a voice keyword according to claim 1, wherein the long short-term memory network and the metric matrix are trained simultaneously.
7. The method of claim 6, wherein the training data is a speech recognition data set comprising voice data and corresponding text label data; voice segments of specific keywords are cut out by forced alignment, segments with the same semantics serve as positive sample pairs, and segments with different semantics serve as negative sample pairs.
8. The method for detecting a voice keyword according to claim 6, wherein the training objective function is designed so that the distance between voice fixed-length vectors with the same semantics becomes closer and the distance between voice fixed-length vectors with different semantics becomes farther, where distance means cosine distance; the closer the distance, the larger the detection score; the farther the distance, the smaller the detection score.
9. A system for detecting speech keywords, characterized in that the method according to any of claims 1-8 is used for detecting speech keywords.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN201910990230.XA 2019-10-17 2019-10-17 Voice keyword detection method and system Active CN110827806B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910990230.XA CN110827806B (en) 2019-10-17 2019-10-17 Voice keyword detection method and system


Publications (2)

Publication Number Publication Date
CN110827806A true CN110827806A (en) 2020-02-21
CN110827806B CN110827806B (en) 2022-01-28

Family

ID=69549466

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910990230.XA Active CN110827806B (en) 2019-10-17 2019-10-17 Voice keyword detection method and system

Country Status (1)

Country Link
CN (1) CN110827806B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112259083A (en) * 2020-10-16 2021-01-22 北京猿力未来科技有限公司 Audio processing method and device
CN112685594A (en) * 2020-12-24 2021-04-20 中国人民解放军战略支援部队信息工程大学 Attention-based weak supervision voice retrieval method and system
CN113035231A (en) * 2021-03-18 2021-06-25 三星(中国)半导体有限公司 Keyword detection method and device
CN113823274A (en) * 2021-08-16 2021-12-21 华南理工大学 Voice keyword sample screening method based on detection error weighted editing distance
CN114051075A (en) * 2021-10-28 2022-02-15 重庆川南环保科技有限公司 Voice quality inspection method and device and terminal equipment
CN116453514A (en) * 2023-06-08 2023-07-18 四川大学 Multi-view-based voice keyword detection and positioning method and device

Citations (11)

Publication number Priority date Publication date Assignee Title
US6026358A (en) * 1994-12-22 2000-02-15 Justsystem Corporation Neural network, a method of learning of a neural network and phoneme recognition apparatus utilizing a neural network
EP2881939A1 (en) * 2013-12-09 2015-06-10 MediaTek, Inc System for speech keyword detection and associated method
US20170148429A1 (en) * 2015-11-24 2017-05-25 Fujitsu Limited Keyword detector and keyword detection method
US20170192956A1 (en) * 2015-12-31 2017-07-06 Google Inc. Generating parse trees of text segments using neural networks
CN107230475A (en) * 2017-05-27 2017-10-03 腾讯科技(深圳)有限公司 A kind of voice keyword recognition method, device, terminal and server
CN107293296A (en) * 2017-06-28 2017-10-24 百度在线网络技术(北京)有限公司 Voice identification result correcting method, device, equipment and storage medium
EP3312777A1 (en) * 2015-02-06 2018-04-25 Google LLC Recurrent neural network system for data item generation
CN108388554A (en) * 2018-01-04 2018-08-10 中国科学院自动化研究所 Text emotion identifying system based on collaborative filtering attention mechanism
CN109817233A (en) * 2019-01-25 2019-05-28 清华大学 Voice flow steganalysis method and system based on level attention network model
CN110168575A (en) * 2016-12-14 2019-08-23 微软技术许可有限责任公司 Dynamic tensor attention for information retrieval scoring
US20190267023A1 (en) * 2018-02-28 2019-08-29 Microsoft Technology Licensing, Llc Speech recognition using connectionist temporal classification


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Ethan R. Duni et al.: "High-Rate Optimized Recursive Vector Quantization Structures Using Hidden Markov Models", IEEE Transactions on Audio, Speech, and Language Processing *
Kartik Audhkhasi et al.: "End-to-End ASR-Free Keyword Search From Speech", IEEE Journal of Selected Topics in Signal Processing *
Xixin Wu et al.: "Automatic speech data clustering with human perception based weighted distance", The 9th International Symposium on Chinese Spoken Language Processing *
Zhu Zhangli: "Research Progress on Attention Mechanism in Deep Learning", Journal of Chinese Information Processing *
Li Yeliang et al.: "Research on Speech Recognition Based on Hybrid Attention Mechanism", Application Research of Computers *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112259083A (en) * 2020-10-16 2021-01-22 北京猿力未来科技有限公司 Audio processing method and device
CN112259083B (en) * 2020-10-16 2024-02-13 北京猿力未来科技有限公司 Audio processing method and device
CN112685594A (en) * 2020-12-24 2021-04-20 中国人民解放军战略支援部队信息工程大学 Attention-based weak supervision voice retrieval method and system
CN112685594B (en) * 2020-12-24 2022-10-04 中国人民解放军战略支援部队信息工程大学 Attention-based weak supervision voice retrieval method and system
CN113035231A (en) * 2021-03-18 2021-06-25 三星(中国)半导体有限公司 Keyword detection method and device
CN113035231B (en) * 2021-03-18 2024-01-09 三星(中国)半导体有限公司 Keyword detection method and device
CN113823274A (en) * 2021-08-16 2021-12-21 华南理工大学 Voice keyword sample screening method based on detection error weighted editing distance
CN113823274B (en) * 2021-08-16 2023-10-27 华南理工大学 Voice keyword sample screening method based on detection error weighted editing distance
CN114051075A (en) * 2021-10-28 2022-02-15 重庆川南环保科技有限公司 Voice quality inspection method and device and terminal equipment
CN116453514A (en) * 2023-06-08 2023-07-18 四川大学 Multi-view-based voice keyword detection and positioning method and device
CN116453514B (en) * 2023-06-08 2023-08-25 四川大学 Multi-view-based voice keyword detection and positioning method and device

Also Published As

Publication number Publication date
CN110827806B (en) 2022-01-28

Similar Documents

Publication Publication Date Title
CN110827806B (en) Voice keyword detection method and system
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
CN109597876B (en) Reinforcement learning-based multi-turn dialogue reply selection model and method
CN111881260B (en) Emotion analysis method and device based on aspect attention and convolutional memory neural network
CN109840287A (en) Cross-modal information retrieval method and device based on neural network
CN109800434B (en) Method for generating abstract text title based on eye movement attention
CN110457718B (en) Text generation method and device, computer equipment and storage medium
CN107229610A (en) Emotion data analysis method and device
Rana et al. Emotion based hate speech detection using multimodal learning
Ragni et al. Confidence estimation and deletion prediction using bidirectional recurrent neural networks
CN111382573A (en) Method, apparatus, device and storage medium for answer quality assessment
Wang et al. Dynamically disentangling social bias from task-oriented representations with adversarial attack
CN112036705A (en) Quality inspection result data acquisition method, device and equipment
CN114003682A (en) Text classification method, device, equipment and storage medium
Xie et al. Language-based audio retrieval task in DCASE 2022 challenge
Hohenecker et al. Systematic comparison of neural architectures and training approaches for open information extraction
Xu et al. A comprehensive survey of automated audio captioning
CN115905487A (en) Document question and answer method, system, electronic equipment and storage medium
CN117056494B (en) Open domain question and answer method, device, electronic equipment and computer storage medium
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN116821339A (en) Misuse language detection method, device and storage medium
CN114333762B (en) Expressiveness-based speech synthesis method, system, electronic device and storage medium
Kongyoung et al. monoQA: Multi-task learning of reranking and answer extraction for open-retrieval conversational question answering
Mei et al. Towards generating diverse audio captions via adversarial training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant