CN110827806A - Voice keyword detection method and system - Google Patents
- Publication number
- CN110827806A (application CN201910990230.XA)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G — Physics
- G10 — Musical instruments; Acoustics
- G10L — Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
- G10L15/08 — Speech recognition: speech classification or search
- G10L15/26 — Speech recognition: speech-to-text systems
- G10L25/30 — Speech or voice analysis techniques characterised by the analysis technique using neural networks
- G10L2015/088 — Word spotting
Abstract
The invention provides a voice keyword detection method and system. The method comprises the following steps: converting the keyword speech example and the speech to be detected into hidden-state vector sequences with a parameter-sharing long short-term memory (LSTM) network; calculating an attention matrix between the two hidden-state vector sequences using a trainable metric matrix; taking the row-wise and column-wise maxima of the attention matrix to obtain the attention weight vectors of the keyword speech example and the speech to be detected respectively, and then computing a weighted sum of the corresponding hidden-state vector sequence with each attention weight vector to obtain the final fixed-length vectors; and calculating detection scores with a similarity measure, ranking all the speech to be detected by detection score, and outputting the highest-scoring speech as the result. By letting the two fixed-length vectors influence each other during speech encoding, mutually correlated semantic information is effectively retained, while the introduction of the attention mechanism removes the position bias of information encoding.
Description
Technical Field
The invention relates to the technical field of voice keyword detection, in particular to a voice keyword detection method and system.
Background
In the big-data era, internet services generate a large amount of speech data, and retrieving the required material from these data has become an urgent problem. Query-by-example voice keyword detection needs only a spoken example of the keyword and the speech to be detected as input, and can output detection results directly without resorting to speech recognition. An existing query-by-example detection system consists of two parts: a speech encoder and a similarity measure. The speech encoder is built from a long short-term memory (LSTM) network and aims to encode speech into a fixed-length vector; the similarity measure is usually cosine similarity. First, the encoder converts the input keyword speech example and the speech to be detected into two fixed-length vectors; then the similarity between the two vectors is computed; finally, all the speech to be detected in the corpus is ranked by similarity, and the most similar speech is output. The key to the whole system is the design of the speech encoder, which should extract the semantic information in the speech effectively while discarding task-irrelevant information such as speaker identity, environmental noise and emotion. An LSTM-based speech encoder converts the acoustic feature sequence of an utterance into a sequence of hidden-state vectors and then takes the hidden state at the last time step as the utterance's fixed-length vector. This encoding lets the fixed-length vector retain more semantic information from later time periods while losing much of the semantic information from earlier time periods, a phenomenon called the position bias of information encoding.
Moreover, the keyword speech example and the speech to be detected are encoded independently of each other, so the semantic information they share cannot be extracted effectively.
In the prior art, a long short-term memory network serves as the speech encoder: the acoustic feature sequence of the speech is converted into a sequence of hidden-state vectors, and the hidden state at the last time step is taken as the speech's fixed-length vector. Finally, the similarity between the two fixed-length vectors is calculated, all the speech to be detected in the corpus is ranked by similarity, and the most similar speech is output.
This existing scheme has the following defects:
(1) An LSTM-based speech encoder makes the fixed-length vector retain more semantic information from later time periods and lose much semantic information from earlier time periods, a phenomenon called the position bias of information encoding.
(2) The keyword speech example and the speech to be detected are encoded independently of each other, so the semantic information they share cannot be extracted effectively.
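For concreteness, the prior-art pipeline just described — last-hidden-state readout followed by cosine-similarity ranking — can be sketched as follows. This is a minimal illustrative sketch with stubbed (random) hidden states standing in for LSTM outputs; the function names are ours, not the patent's:

```python
import numpy as np

def encode_last_state(hidden_states):
    """Baseline readout: take the final hidden state as the fixed-length
    vector. This is exactly the readout that causes the position bias."""
    return hidden_states[-1]

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_corpus(query_vec, corpus_vecs):
    """Return corpus indices sorted by similarity to the query, highest first."""
    scores = [cosine_similarity(query_vec, v) for v in corpus_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

rng = np.random.default_rng(0)
H_query = rng.standard_normal((20, 8))      # 20 frames, 8-dim hidden states
v_q = encode_last_state(H_query)
corpus = [rng.standard_normal((30, 8)) for _ in range(5)] + [H_query]
ranking = rank_corpus(v_q, [encode_last_state(H) for H in corpus])
# the utterance identical to the query (index 5) ranks first
```

Note how only the last frame of each hidden-state sequence survives into the ranking; the method of the invention replaces this readout with an attention-weighted sum over all frames.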
Disclosure of Invention
The invention provides a voice keyword detection method and system to solve the above problems.
In order to solve the above problems, the technical solution adopted by the present invention is as follows:
A method for detecting a voice keyword comprises the following steps: S1: converting the keyword speech example and the speech to be detected into hidden-state vector sequences with a parameter-sharing long short-term memory network; S2: calculating an attention matrix between the two hidden-state vector sequences using a trainable metric matrix; S3: taking the row-wise and column-wise maxima of the attention matrix to obtain the attention weight vectors of the keyword speech example and the speech to be detected respectively, and then computing a weighted sum of the corresponding hidden-state vector sequence with each attention weight vector to obtain the final fixed-length vectors; S4: calculating detection scores with a similarity measure, ranking all the speech to be detected by detection score, and outputting the highest-scoring speech to be detected as the result.
Preferably, the following steps are further included after step S3: reconstructing the speech sequences of the keyword speech example and the speech to be detected by decoding the fixed-length vectors, and comparing each reconstructed sequence with the original to obtain a reconstruction loss; and adding the reconstruction loss to the final training loss and training the model by back-propagation, so that the fixed-length vectors retain the reconstruction information of the speech sequences.
Preferably, cosine similarity is used as the similarity measure between the fixed-length vectors.
Preferably, a convolutional neural network, a bidirectional recurrent neural network or a time-delay neural network is adopted to compute the hidden-state vector sequences of the keyword speech example and the speech to be detected.
Preferably, the detection score is calculated using a feed-forward neural network.
Preferably, the long short-term memory network and the metric matrix are trained simultaneously.
Preferably, the training data is a speech recognition data set comprising speech data and corresponding text annotations; speech segments of specific keyword semantics are cut out by forced alignment, segments with the same semantics are used as positive sample pairs, and segments with different semantics are used as negative sample pairs.
Preferably, the training objective function is designed so that the distance between fixed-length vectors of speech with the same semantics becomes smaller and the distance between fixed-length vectors of speech with different semantics becomes larger, where the distance is a cosine distance; the smaller the distance, the larger the detection score, and the larger the distance, the smaller the detection score.
The invention also provides a system for detecting voice keywords, characterized in that the above method is used to detect the voice keywords.
The invention further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method as set forth in any of the above.
The beneficial effects of the invention are as follows: the method and system effectively retain mutually correlated semantic information by letting the two fixed-length vectors influence each other during speech encoding, and at the same time remove the position bias of information encoding by introducing an attention mechanism.
Drawings
Fig. 1 is a schematic diagram of a method for detecting a speech keyword according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a speech coding structure based on a bidirectional attention mechanism in an embodiment of the present invention.
Fig. 3 is a schematic diagram of another method for detecting a speech keyword according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a system for detecting a speech keyword according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems to be solved, the technical solutions, and the advantageous effects of the embodiments of the present invention clearer, the invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are merely illustrative of the invention and are not intended to limit it.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element. In addition, the connection may be for either a fixing function or a circuit connection function.
It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in an orientation or positional relationship indicated in the drawings for convenience in describing the embodiments of the present invention and to simplify the description, and are not intended to indicate or imply that the referenced device or element must have a particular orientation, be constructed in a particular orientation, and be in any way limiting of the present invention.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present invention, "a plurality" means two or more unless specifically limited otherwise.
Example 1
As shown in fig. 1, the present invention provides a method for detecting a speech keyword, comprising the following steps:
S1: converting the keyword speech example and the speech to be detected into hidden-state vector sequences with a parameter-sharing long short-term memory network;
S2: calculating an attention matrix between the two hidden-state vector sequences using a trainable metric matrix;
S3: taking the row-wise and column-wise maxima of the attention matrix to obtain the attention weight vectors of the keyword speech example and the speech to be detected respectively, and then computing a weighted sum of the corresponding hidden-state vector sequence with each attention weight vector to obtain the final fixed-length vectors;
S4: calculating detection scores with a similarity measure, ranking all the speech to be detected by detection score, and outputting the highest-scoring speech to be detected as the result.
As shown in FIG. 2, the keyword speech example and the speech to be detected are converted into hidden-state vector sequences H_Q and H_S by a parameter-sharing long short-term memory network, and an attention matrix G between the two sequences is then calculated using a trainable metric matrix U. Taking the row-wise and column-wise maxima of the attention matrix yields the attention weight vectors σ_Q and σ_S of the keyword speech example and the speech to be detected respectively; the hidden-state vector sequences are then weighted and summed with these attention weight vectors to obtain the final fixed-length vectors V_Q and V_S. The trainable metric matrix U lets the encoding processes of the two speech inputs influence each other, so that correlated semantic information is extracted more effectively. Weighting and summing the hidden-state sequence with the attention weights removes the position bias of information extraction and prevents semantic information at earlier positions from being lost.
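The bidirectional attention step can be sketched numerically as follows. One caveat: the patent states only that the row/column maxima become attention weights; the softmax normalization here is our assumption, as is the random stub standing in for the trained metric matrix U:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bidirectional_attention(H_Q, H_S, U):
    """G[i, j] scores the affinity between frame i of the keyword example
    and frame j of the speech to be detected through the metric matrix U."""
    G = H_Q @ U @ H_S.T                  # attention matrix, shape (T_Q, T_S)
    sigma_Q = softmax(G.max(axis=1))     # row-wise maxima -> keyword weights
    sigma_S = softmax(G.max(axis=0))     # column-wise maxima -> utterance weights
    V_Q = sigma_Q @ H_Q                  # attention-weighted sums give the
    V_S = sigma_S @ H_S                  # final fixed-length vectors
    return V_Q, V_S

rng = np.random.default_rng(1)
H_Q = rng.standard_normal((12, 16))      # 12 keyword frames, 16-dim states
H_S = rng.standard_normal((40, 16))      # 40 frames of speech to be detected
U = rng.standard_normal((16, 16))        # trainable metric matrix (random stub)
V_Q, V_S = bidirectional_attention(H_Q, H_S, U)
```

Because every frame contributes to the weighted sum, information from early frames survives into V_Q and V_S, which is how the position bias is removed.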
The fixed-length vectors extracted by the bidirectional attention mechanism use cosine similarity as the similarity measure. The encoding processes of the keyword example and the speech to be detected are completely symmetric and share parameters, so the extracted fixed-length vectors lie in the same vector space. The trainable metric matrix can learn from the data a particular mapping that projects inputs from different domains (e.g., speech in different languages) into a vector space in which similarity is comparable. These properties greatly improve the comparability of the extracted fixed-length vectors. For more complex data distributions, the similarity measure may be upgraded to a feed-forward neural network.
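The feed-forward alternative mentioned above could look like the following sketch. The patent names only "a feed-forward neural network"; the specific architecture here (one hidden tanh layer over the concatenated vectors, sigmoid output) and all parameter shapes are illustrative assumptions:

```python
import numpy as np

def ffnn_score(v_q, v_s, W1, b1, w2, b2):
    """Score a (keyword, utterance) vector pair with a tiny feed-forward
    network; the sigmoid squashes the score into (0, 1)."""
    x = np.concatenate([v_q, v_s])       # concatenated fixed-length vectors
    h = np.tanh(W1 @ x + b1)             # one hidden layer (assumed)
    return float(1.0 / (1.0 + np.exp(-(w2 @ h + b2))))

rng = np.random.default_rng(2)
d = 8                                    # fixed-length vector dimension (stub)
W1, b1 = rng.standard_normal((16, 2 * d)), rng.standard_normal(16)
w2, b2 = rng.standard_normal(16), 0.0
score = ffnn_score(rng.standard_normal(d), rng.standard_normal(d), W1, b1, w2, b2)
```

Unlike cosine similarity, such a scorer can learn non-linear notions of similarity, at the cost of extra parameters that must be trained jointly with the encoder.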
Each set of training data consists of a positive sample pair and a negative sample pair: a positive pair is two utterances with the same semantics, and a negative pair is two utterances with different semantics. The data source is a speech recognition data set comprising speech data and corresponding text annotations. Speech segments with specific semantics (keywords) are cut out by forced alignment; segments with the same semantics are then used as positive pairs, and segments with different semantics as negative pairs.
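The pair construction above reduces to grouping force-aligned segments by their keyword label, as in this sketch (the label/id representation and the optional negative-sampling step are our assumptions; the patent specifies only the pairing rule):

```python
from itertools import combinations
import random

def build_pairs(segments, n_negative=None, seed=0):
    """segments: list of (keyword_label, segment_id) cut out by forced
    alignment. Positive pairs share a label; negative pairs do not."""
    positives = [(a, b) for (la, a), (lb, b) in combinations(segments, 2) if la == lb]
    negatives = [(a, b) for (la, a), (lb, b) in combinations(segments, 2) if la != lb]
    if n_negative is not None:           # optionally subsample negatives,
        rng = random.Random(seed)        # which usually vastly outnumber positives
        negatives = rng.sample(negatives, min(n_negative, len(negatives)))
    return positives, negatives

segs = [("apple", "s1"), ("apple", "s2"), ("orange", "s3"), ("orange", "s4")]
pos, neg = build_pairs(segs)
# pos pairs share a keyword; the remaining cross-keyword pairs are negatives
```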
The objective function is designed so that fixed-length vectors of speech with the same semantics are drawn closer together while those with different semantics are pushed farther apart, where the distance is a cosine distance. The long short-term memory network and metric-matrix parameters learned in this supervised process map speech inputs with the same semantics to two nearby vectors and speech inputs with different semantics to two distant vectors. The smaller the distance, the larger the detection score; the larger the distance, the smaller the detection score. Detection is finally realized by ranking all the speech to be detected in the corpus by detection score.
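One common way to realize such an objective is a hinge-style contrastive loss on cosine distance. The exact loss form and the margin value are assumptions on our part; the patent states only the objective's intent (same semantics closer, different semantics farther):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pair_loss(v_a, v_b, same_semantics, margin=0.5):
    """Contrastive loss on cosine distance d = 1 - cos(v_a, v_b):
    positive pairs pay d itself (pull together); negative pairs pay only
    when closer than `margin` (push apart). Margin value is an assumption."""
    d = 1.0 - cosine(v_a, v_b)
    return d if same_semantics else max(0.0, margin - d)

v = np.array([1.0, 0.0])
w = np.array([0.0, 1.0])
loss_pos = pair_loss(v, v, same_semantics=True)    # identical vectors
loss_neg = pair_loss(v, -v, same_semantics=False)  # opposite vectors
```

Summing this loss over all positive and negative pairs and back-propagating updates both the LSTM and the metric matrix U, matching the joint-training scheme described next.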
The long short-term memory network and the metric matrix U are trained simultaneously. The parameter values of the network and of U are first obtained by training on a training set, and keyword detection is then performed with the trained parameters. Training uses the back-propagation algorithm: every entry of the metric matrix is a free parameter, initialized and then updated along the objective-function gradient returned by back-propagation until convergence. The same applies to the training of the network.
As shown in fig. 3, in an embodiment of the present invention, after step S3 and before step S4, the method further includes:
reconstructing the speech sequences of the keyword speech example and the speech to be detected by decoding the fixed-length vectors, and comparing each reconstructed sequence with the original to obtain a reconstruction loss; and adding the reconstruction loss to the final training loss and training the model by back-propagation, so that the fixed-length vectors retain the reconstruction information of the speech sequences.
That is, the invention adds an autoencoder structure so that the fixed-length vector of the speech retains the speech's reconstruction information.
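A sketch of the reconstruction-loss term, under stated assumptions: the patent does not specify the decoder or the comparison metric, so we use a linear per-frame decoder and mean squared error purely for illustration:

```python
import numpy as np

def reconstruction_loss(original, decoder, fixed_vec):
    """Decode the fixed-length vector back into a frame sequence and
    compare it with the original frames by mean squared error."""
    reconstructed = decoder(fixed_vec, original.shape[0])
    return float(np.mean((original - reconstructed) ** 2))

def linear_decoder(W):
    # maps the fixed-length vector to T frames via per-frame projections
    def decode(v, T):
        return np.stack([W[t] @ v for t in range(T)])
    return decode

rng = np.random.default_rng(3)
T, d, k = 10, 6, 4                      # frames, feature dim, vector dim (stubs)
original = rng.standard_normal((T, d))
W = rng.standard_normal((T, d, k))
loss = reconstruction_loss(original, linear_decoder(W), rng.standard_normal(k))
# total training loss = detection loss + weight * reconstruction loss
```

The reconstruction term is simply added to the detection loss before back-propagation, which is what keeps reconstruction information in the fixed-length vector.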
In another embodiment of the present invention, a convolutional neural network, a bidirectional recurrent neural network or a time-delay neural network may also be used to compute the hidden-state vector sequences of the keyword speech example and the speech to be detected, and a feed-forward neural network may be used to calculate the detection score.
Taking the keyword "Apple" as an example, the test task is to retrieve the 50 speech segments containing "Apple" from a corpus of 10,000 segments to be detected. The 10,000 segments are ranked from high to low by detection score; among the top-20 returned results, the prior art returns 2 segments containing "Apple" while the invention returns 7, more than tripling the hit rate.
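The hit-rate figure above is a hits-at-k count over the ranked corpus, which can be computed with a small helper (illustrative, not from the patent):

```python
def hits_at_k(ranked_ids, relevant_ids, k=20):
    """Count how many of the top-k returned segments actually contain
    the keyword (i.e., appear in the relevant set)."""
    return sum(1 for seg in ranked_ids[:k] if seg in relevant_ids)

# toy ranking: segments seg0..seg99 in score order, 3 of which are relevant
ranked = [f"seg{i}" for i in range(100)]
relevant = {"seg0", "seg3", "seg25"}
top20_hits = hits_at_k(ranked, relevant, k=20)
```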
As shown in fig. 4, the present invention further provides a system that detects voice keywords with the above method. In the system, the user inputs a keyword speech example, a segment of speech to be detected is taken from the corpus, the respective fixed-length vectors are obtained with the speech encoder, and the detection score is calculated by the similarity measure part. All the speech to be detected in the corpus is ranked by detection score, and the highest-scoring speech is output as the result.
All or part of the flow of the method in the embodiments of the present invention may be realized by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium, and when executed by a processor it realizes the steps of the method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so on. It should be noted that the content of the computer-readable medium may be increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media exclude electrical carrier signals and telecommunications signals.
The beneficial effects brought by the technical scheme of the invention can be summarized as follows:
1) The two fixed-length vectors influence each other during speech encoding, effectively retaining mutually correlated semantic information. Compared with the scheme that uses only a long short-term memory network as the encoder, over 200 keyword detection samples on an English corpus, the average hit rate in the top-20 results by detection score shows a relative improvement of more than 30%.
2) The introduction of the attention mechanism removes the position bias of information encoding: the final fixed-length vector is obtained as an attention-weighted sum over the whole sequence of hidden-state vectors. Compared with the scheme that uses only a long short-term memory network as the encoder, the extracted fixed-length vector of a speech segment is less affected by its trailing phonemes; the minimum edit-distance change of the fixed-length vector caused by modifying the trailing phonemes is reduced by 86%.
The foregoing is a further detailed description of the invention in connection with specific preferred embodiments, and the invention is not to be considered limited to these details. For those skilled in the art to which the invention pertains, several equivalent substitutions or obvious modifications may be made without departing from the spirit of the invention, and all of these shall be considered to fall within the protection scope of the invention.
Claims (10)
1. A method for detecting a voice keyword, characterized by comprising the following steps:
S1: converting the keyword speech example and the speech to be detected into hidden-state vector sequences with a parameter-sharing long short-term memory network;
S2: calculating an attention matrix between the two hidden-state vector sequences using a trainable metric matrix;
S3: taking the row-wise and column-wise maxima of the attention matrix to obtain the attention weight vectors of the keyword speech example and the speech to be detected respectively, and then computing a weighted sum of the corresponding hidden-state vector sequence with each attention weight vector to obtain the final fixed-length vectors;
S4: calculating detection scores with a similarity measure, ranking all the speech to be detected by detection score, and outputting the highest-scoring speech to be detected as the result.
2. The method for detecting a voice keyword according to claim 1, further comprising, after step S3:
reconstructing the speech sequences of the keyword speech example and the speech to be detected by decoding the fixed-length vectors, and comparing each reconstructed sequence with the original to obtain a reconstruction loss; and adding the reconstruction loss to the final training loss and training the model by back-propagation, so that the fixed-length vectors retain the reconstruction information of the speech sequences.
3. The method according to claim 1, wherein cosine similarity is used as the similarity measure between the fixed-length vectors.
4. The method according to claim 1, wherein the hidden-state vector sequences of the keyword speech example and the speech to be detected are computed using a convolutional neural network, a bidirectional recurrent neural network or a time-delay neural network.
5. The method for detecting a voice keyword according to claim 1, wherein the detection score is calculated using a feed-forward neural network.
6. The method for detecting a voice keyword according to claim 1, wherein the long short-term memory network and the metric matrix are trained simultaneously.
7. The method of claim 6, wherein the training data is a speech recognition data set comprising speech data and corresponding text annotations; speech segments of specific keyword semantics are cut out by forced alignment, segments with the same semantics are used as positive sample pairs, and segments with different semantics are used as negative sample pairs.
8. The method for detecting a voice keyword according to claim 6, wherein the training objective function is designed so that the distance between fixed-length vectors of speech with the same semantics is smaller and the distance between fixed-length vectors of speech with different semantics is larger, where the distance is a cosine distance; the smaller the distance, the larger the detection score, and the larger the distance, the smaller the detection score.
9. A system for detecting voice keywords, characterized in that the method according to any one of claims 1-8 is used to detect the voice keywords.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910990230.XA CN110827806B (en) | 2019-10-17 | 2019-10-17 | Voice keyword detection method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110827806A true CN110827806A (en) | 2020-02-21 |
CN110827806B CN110827806B (en) | 2022-01-28 |
Family
ID=69549466
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910990230.XA Active CN110827806B (en) | 2019-10-17 | 2019-10-17 | Voice keyword detection method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110827806B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112259083A (en) * | 2020-10-16 | 2021-01-22 | 北京猿力未来科技有限公司 | Audio processing method and device |
CN112685594A (en) * | 2020-12-24 | 2021-04-20 | 中国人民解放军战略支援部队信息工程大学 | Attention-based weak supervision voice retrieval method and system |
CN113035231A (en) * | 2021-03-18 | 2021-06-25 | 三星(中国)半导体有限公司 | Keyword detection method and device |
CN113823274A (en) * | 2021-08-16 | 2021-12-21 | 华南理工大学 | Voice keyword sample screening method based on detection error weighted editing distance |
CN114051075A (en) * | 2021-10-28 | 2022-02-15 | 重庆川南环保科技有限公司 | Voice quality inspection method and device and terminal equipment |
CN116453514A (en) * | 2023-06-08 | 2023-07-18 | 四川大学 | Multi-view-based voice keyword detection and positioning method and device |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6026358A (en) * | 1994-12-22 | 2000-02-15 | Justsystem Corporation | Neural network, a method of learning of a neural network and phoneme recognition apparatus utilizing a neural network |
EP2881939A1 (en) * | 2013-12-09 | 2015-06-10 | MediaTek, Inc | System for speech keyword detection and associated method |
US20170148429A1 (en) * | 2015-11-24 | 2017-05-25 | Fujitsu Limited | Keyword detector and keyword detection method |
US20170192956A1 (en) * | 2015-12-31 | 2017-07-06 | Google Inc. | Generating parse trees of text segments using neural networks |
CN107230475A (en) * | 2017-05-27 | 2017-10-03 | Tencent Technology (Shenzhen) Co., Ltd. | Voice keyword recognition method, apparatus, terminal and server |
CN107293296A (en) * | 2017-06-28 | 2017-10-24 | Baidu Online Network Technology (Beijing) Co., Ltd. | Speech recognition result correction method, apparatus, device and storage medium |
EP3312777A1 (en) * | 2015-02-06 | 2018-04-25 | Google LLC | Recurrent neural network system for data item generation |
CN108388554A (en) * | 2018-01-04 | 2018-08-10 | Institute of Automation, Chinese Academy of Sciences | Text emotion recognition system based on collaborative filtering attention mechanism |
CN109817233A (en) * | 2019-01-25 | 2019-05-28 | Tsinghua University | Voice stream steganalysis method and system based on hierarchical attention network model |
CN110168575A (en) * | 2016-12-14 | 2019-08-23 | Microsoft Technology Licensing, LLC | Dynamic tensor attention for information retrieval scoring |
US20190267023A1 (en) * | 2018-02-28 | 2019-08-29 | Microsoft Technology Licensing, Llc | Speech recognition using connectionist temporal classification |
Non-Patent Citations (5)
Title |
---|
ETHAN R. DUNI ET AL.: "High-Rate Optimized Recursive Vector Quantization Structures Using Hidden Markov Models", IEEE Transactions on Audio, Speech, and Language Processing * |
KARTIK AUDHKHASI ET AL.: "End-to-End ASR-Free Keyword Search From Speech", IEEE Journal of Selected Topics in Signal Processing * |
XIXIN WU ET AL.: "Automatic speech data clustering with human perception based weighted distance", The 9th International Symposium on Chinese Spoken Language Processing * |
ZHU ZHANGLI: "Research Progress of Attention Mechanism in Deep Learning", Journal of Chinese Information Processing * |
LI YELIANG ET AL.: "Research on Speech Recognition Based on Hybrid Attention Mechanism", Application Research of Computers * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112259083A (en) * | 2020-10-16 | 2021-01-22 | 北京猿力未来科技有限公司 | Audio processing method and device |
CN112259083B (en) * | 2020-10-16 | 2024-02-13 | 北京猿力未来科技有限公司 | Audio processing method and device |
CN112685594A (en) * | 2020-12-24 | 2021-04-20 | PLA Strategic Support Force Information Engineering University | Attention-based weakly supervised speech retrieval method and system |
CN112685594B (en) * | 2020-12-24 | 2022-10-04 | PLA Strategic Support Force Information Engineering University | Attention-based weakly supervised speech retrieval method and system |
CN113035231A (en) * | 2021-03-18 | 2021-06-25 | 三星(中国)半导体有限公司 | Keyword detection method and device |
CN113035231B (en) * | 2021-03-18 | 2024-01-09 | 三星(中国)半导体有限公司 | Keyword detection method and device |
CN113823274A (en) * | 2021-08-16 | 2021-12-21 | 华南理工大学 | Voice keyword sample screening method based on detection error weighted editing distance |
CN113823274B (en) * | 2021-08-16 | 2023-10-27 | 华南理工大学 | Voice keyword sample screening method based on detection error weighted editing distance |
CN114051075A (en) * | 2021-10-28 | 2022-02-15 | 重庆川南环保科技有限公司 | Voice quality inspection method and device and terminal equipment |
CN116453514A (en) * | 2023-06-08 | 2023-07-18 | 四川大学 | Multi-view-based voice keyword detection and positioning method and device |
CN116453514B (en) * | 2023-06-08 | 2023-08-25 | 四川大学 | Multi-view-based voice keyword detection and positioning method and device |
Also Published As
Publication number | Publication date |
---|---|
CN110827806B (en) | 2022-01-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110827806B (en) | Voice keyword detection method and system | |
CN110491416B (en) | Telephone voice emotion analysis and identification method based on LSTM and SAE | |
CN107480143B (en) | Method and system for segmenting conversation topics based on context correlation | |
CN109597876B (en) | Multi-round dialogue reply selection model based on reinforcement learning and method thereof | |
CN111881260B (en) | Emotion analysis method and device based on aspect attention and convolutional memory neural network | |
CN109840287A (en) | Neural-network-based cross-modal information retrieval method and device | |
CN109800434B (en) | Method for generating abstract text title based on eye movement attention | |
CN110457718B (en) | Text generation method and device, computer equipment and storage medium | |
CN107229610A (en) | Sentiment data analysis method and device | |
Rana et al. | Emotion based hate speech detection using multimodal learning | |
Ragni et al. | Confidence estimation and deletion prediction using bidirectional recurrent neural networks | |
CN111382573A (en) | Method, apparatus, device and storage medium for answer quality assessment | |
Wang et al. | Dynamically disentangling social bias from task-oriented representations with adversarial attack | |
CN112036705A (en) | Quality inspection result data acquisition method, device and equipment | |
CN114003682A (en) | Text classification method, device, equipment and storage medium | |
Xie et al. | Language-based audio retrieval task in DCASE 2022 challenge | |
Hohenecker et al. | Systematic comparison of neural architectures and training approaches for open information extraction | |
Xu et al. | A comprehensive survey of automated audio captioning | |
CN115905487A (en) | Document question and answer method, system, electronic equipment and storage medium | |
CN117056494B (en) | Open domain question and answer method, device, electronic equipment and computer storage medium | |
CN111653270B (en) | Voice processing method and device, computer readable storage medium and electronic equipment | |
CN116821339A (en) | Misuse language detection method, device and storage medium | |
CN114333762B (en) | Expressiveness-based speech synthesis method and system, electronic device and storage medium | |
Kongyoung et al. | monoQA: Multi-task learning of reranking and answer extraction for open-retrieval conversational question answering | |
Mei et al. | Towards generating diverse audio captions via adversarial training |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||