CN110415705B - Hot word recognition method, system, device and storage medium - Google Patents
- Publication number
- CN110415705B (application CN201910706314.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a hot word recognition method, system, device and storage medium, aiming to solve the prior-art problem that a correct speech recognition result may be modified by mistake. The hot word recognition method comprises the following steps: step 1, sending the user audio into a general recognition engine to obtain a speech recognition result, together with the position on the audio and the confidence of each recognized word Wi; step 2, sending the user audio into a hotword detection engine for hotword retrieval, obtaining the highest-scoring hotword W, its corresponding audio position P and score S, denoted (W, P, S); step 3, judging the score S of the highest-scoring hotword (W, P, S): if S is larger than a given threshold, replacing the words Wi to Wj at the corresponding audio position in the speech recognition result with the hotword W, and executing step 4; otherwise, ending; step 4, if the position of the hotword overlaps with words in the current recognition result, correcting the words before and after the hotword.
Description
Technical Field
The invention relates to the technical field of voice recognition, in particular to a method, a system, a device and a storage medium for identifying hot words.
Background
Speech recognition technology has become a dominant technology in current applications of artificial intelligence. Typical speech recognition techniques rely on a particular vocabulary, i.e., only words within the given vocabulary range are recognized; if out-of-vocabulary words appear in the speech, recognition performance is usually poor, and such words may not be recognized at all. Some solutions have been proposed to address this problem. The main approach, called recognition-result post-processing, analyzes the text of the recognition result and then corrects it using a language model or the given hot words' pronunciations. This type of method has a fatal disadvantage: a correct recognition result is often mistakenly modified.
Disclosure of Invention
In view of the above problems, the present invention provides a method, system, device and storage medium for hot word recognition, so as to solve the problem in the prior art that a correct speech recognition result is modified by mistake.
The technical scheme is as follows: a hotword recognition method is characterized by comprising the following steps:
step 1, sending the user audio to a general recognition engine to obtain a speech recognition result, expressed as W1, W2, ..., Wn, where n is a natural number, and obtaining, for each speech recognition result Wi, its corresponding position on the audio and its confidence, where 1 ≤ i ≤ n and i is a natural number;
step 2, sending the user audio into a hotword detection engine for hotword retrieval, obtaining the highest-scoring hotword W, its corresponding audio position P and score S, denoted (W, P, S);
step 3, judging the score S of the highest-scoring hotword (W, P, S): if S is larger than a given threshold, replacing the words Wi to Wj at the corresponding audio position in the speech recognition result with the hotword W, and executing step 4; otherwise, ending;
and 4, if the position of the hot word is overlapped with the word in the current recognition result, correcting the words before and after the hot word.
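As a minimal illustration only (not from the patent), steps 3 and 4 above amount to a threshold-gated replacement of time-overlapping words; the `RecWord` representation and all names below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class RecWord:
    text: str        # recognized word
    start: float     # start time on the audio (seconds)
    end: float       # end time on the audio (seconds)
    conf: float      # confidence in [0, 1]

def apply_hotword(result, hotword, pos, score, threshold=0.5):
    """Step 3 sketch: if the best hotword's score S clears the threshold,
    replace the recognized words whose time spans overlap the hotword's
    audio position pos = (start, end) with the hotword itself."""
    if score <= threshold:
        return result            # below threshold: keep the result as-is
    h_start, h_end = pos
    kept, replaced = [], False
    for w in result:
        # two intervals overlap iff each starts before the other ends
        if w.start < h_end and w.end > h_start:
            if not replaced:     # insert the hotword once, in place
                kept.append(RecWord(hotword, h_start, h_end, score))
                replaced = True
        else:
            kept.append(w)
    return kept
```

Boundary words that only partially overlap the hotword span are what step 4 then re-examines with the language model.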
Further, between the step 1 and the step 2, a step 1.5 is included: if there exist speech recognition results Wi to Wj, where i < j and i, j are natural numbers, whose confidence is below a given threshold, the audio segment corresponding to Wi to Wj is extracted and step 2 is executed on it.
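Step 1.5 can be sketched as a scan for low-confidence runs; this is a hypothetical sketch in which recognized words are represented as (start, end, conf) triples, and `min_run=2` mirrors the patent's "Wi to Wj, i < j" formulation (runs of at least two words):

```python
def low_conf_spans(words, threshold=0.5, min_run=2):
    """Step 1.5 sketch: find maximal runs of consecutive recognized words
    whose confidence falls below `threshold`, and return each run's audio
    span (start of its first word, end of its last word), so that only
    these segments need to be sent to the hotword detection engine."""
    spans, run = [], []
    # append a sentinel word with confidence 1.0 to flush a trailing run
    for start, end, conf in list(words) + [(None, None, 1.0)]:
        if conf < threshold:
            run.append((start, end))
        else:
            if len(run) >= min_run:
                spans.append((run[0][0], run[-1][1]))
            run = []
    return spans
```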
Further, step 1 and step 2 are performed simultaneously (in parallel).
Further, the step 2 specifically comprises the following steps:
step 2-1, adding a filler word according to the hot word list, wherein the filler word is configured to be connected with all the acoustic modeling units to construct a parallel grammar recognition network;
step 2-2, adopting a beam-search Viterbi algorithm to perform a decoding search on the extracted input voice segment;
step 2-3, backtracking to obtain the hotword with the highest score and the audio position corresponding to the hotword;
step 2-4, calculating the average posterior probability over the speech frames corresponding to the hot word, and outputting it as the hot word's score.
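The scoring in step 2-4 reduces to a frame-posterior average once the decoder's traceback has aligned a frame range to the hotword. The sketch below is a stand-in for that final step only (the Viterbi/beam-search decoder itself is not shown), and the per-frame posterior values are toy numbers:

```python
def hotword_score(aligned_posteriors):
    """Step 2-4 sketch: the hotword's score S is the average posterior
    probability over the speech frames that the traceback aligned to the
    hotword's acoustic units (one posterior value per aligned frame)."""
    if not aligned_posteriors:
        return 0.0
    return sum(aligned_posteriors) / len(aligned_posteriors)

# Hypothetical traceback output: the decoder aligned four frames to the
# hotword's phone sequence, with these per-frame posteriors.
S = hotword_score([0.91, 0.85, 0.88, 0.93])
assert 0.0 <= S <= 1.0   # S is then compared against the step-3 threshold
```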
Further, in step 2, the posterior probability score output by the universal recognition acoustic model is adopted in the grammar recognition network.
Further, in step 4, the overlap between the hot word's position and the words in the current recognition result includes overlap at the start position and overlap at the end position.
Further, when the hot word's position overlaps a word in the current recognition result at the starting position, the step 4 specifically includes the following steps:
step 4-1, determining the word in the recognition result located at the starting position of the hot word, and calculating the position difference between that word's starting position and the hot word's starting position;
step 4-2, if the position difference is larger than the duration of one word, selecting from the word list, as candidate words, words whose pronunciation is similar to the part of the word that does not overlap with the hot word;
step 4-3, using a pre-trained language model to predict the probability of each candidate word at the current position, conditioned on the words from the beginning of the sentence up to the word before the current word and on the word after the current word, and taking the probability as the candidate word's score;
step 4-4, if the score of the highest-scoring candidate word is larger than a given threshold, replacing the current word with that candidate word; otherwise, keeping the current word unchanged;
when the hot word's position overlaps a word in the current recognition result at the ending position, the step 4 specifically includes the following steps:
step 4.1, determining the word in the recognition result located at the ending position of the hot word, and calculating the position difference between that word's ending position and the hot word's ending position;
step 4.2, if the position difference is larger than the duration of one word, selecting from the word list, as candidate words, words whose pronunciation is similar to the part of the word that does not overlap with the hot word;
step 4.3, using a pre-trained language model to predict the probability of each candidate word at the current position, conditioned on the words from the end of the sentence back to the word after the current word and on the word before the current word, and taking the probability as the candidate word's score;
step 4.4, if the score of the highest-scoring candidate word is larger than a given threshold, replacing the current word with that candidate word; otherwise, keeping the current word unchanged.
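The candidate scoring and thresholded replacement of steps 4-3/4-4 (and symmetrically 4.3/4.4) can be sketched as follows. The language model here is a stand-in: `lm_score` is any callable that returns the context-conditioned probability of a word, and the example vocabulary is invented for illustration:

```python
def correct_boundary_word(current, candidates, lm_score, threshold=0.5):
    """Steps 4-3 / 4-4 sketch: score each candidate replacement for the
    word adjoining the hot word with a language model, and replace the
    current word only when the best candidate clears the threshold.
    lm_score(word) stands in for the pre-trained LM conditioned on the
    surrounding context (sentence start .. previous word, and next word)."""
    if not candidates:
        return current           # step 4-2 produced no candidates
    best = max(candidates, key=lm_score)
    if lm_score(best) > threshold:
        return best              # step 4-4: confident enough to replace
    return current               # otherwise keep the recognized word
```

A toy usage, with invented LM probabilities: `correct_boundary_word("begging", ["beijing"], lambda w: {"beijing": 0.9}.get(w, 0.0))` would return `"beijing"`.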
Further, between step 4-3 and step 4-4, and between step 4.3 and step 4.4, the following step is respectively included: incorporating the acoustic confidence information of each word when predicting the probability of each candidate word at the current position, and taking the result as the candidate word's score.
Further, the hot word detection engine is configured to correspond to a user ID; when a hot word is added, the hot word and the user ID are uploaded together, and a pronunciation dictionary is queried to obtain the hot word's pronunciation and its corresponding phoneme sequence; the hot word is then added into the grammar network, hot word detection resources are generated, and the hot word is added to the hot word detection engine corresponding to the user ID.
A hotword recognition system, comprising:
a general speech recognition engine configured to output speech recognition results and a temporal position and confidence of each word in the audio;
a hot word detection engine configured to detect whether a hot word exists, and output an ID, an audio position, and a score thereof;
the hot word result correction module is configured to replace words at corresponding positions in the voice recognition result output by the general voice recognition engine with hot words;
and the language model result correction module is configured to correct words before and after the hot word when the hot word appearance position is overlapped with the word in the current recognition result.
Further, the system comprises a hotword adding module configured to add hotwords to the hotword detection engine.
A hotword recognition device, comprising a processor, a memory, and a program;
the program is stored in the memory, and the processor calls the program stored in the memory to execute the hot word identification method.
A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store a program configured to execute the above-described hotword recognition method.
The hot word recognition method adopts a wake-up word detection scheme to recognize hot words, which can be customized by the user. After a hot word is recognized, the hot word result is corrected; on this basis, other recognition errors caused by the hot word can be further corrected, namely low-confidence words overlapping with or adjacent to the hot word.
Drawings
FIG. 1 is a flowchart of a hotword identification method according to embodiment 1;
fig. 2 is a system block diagram of a hotword recognition system of embodiment 1;
FIG. 3 is a flowchart of a hotword identification method according to embodiment 2;
fig. 4 is a system block diagram of a hotword recognition system according to embodiment 2.
Detailed Description
Specific example 1: referring to fig. 1, a hotword recognition method includes the following steps:
step 1, sending the user audio to a general recognition engine to obtain a speech recognition result, expressed as W1, W2, ..., Wn, where n is a natural number, and obtaining, for each speech recognition result Wi, its corresponding position on the audio and its confidence, where 1 ≤ i ≤ n and i is a natural number;
step 1.5, if there exist speech recognition results Wi to Wj, where i < j and i, j are natural numbers, whose confidence is below a given threshold (taken as 0.5), extracting the audio segment corresponding to Wi to Wj and executing step 2; otherwise, ending;
step 2, sending the user audio into a hotword detection engine for hotword retrieval, obtaining the highest-scoring hotword W, its corresponding audio position P and score S, denoted (W, P, S);
step 3, judging the score S of the highest-scoring hotword (W, P, S): if S is larger than a given threshold (taken as 0.5), replacing the words Wi to Wj at the corresponding audio position in the speech recognition result with the hotword W, and executing step 4; otherwise, ending;
step 4, if the position of the hotword overlaps with words in the current recognition result, correcting the words before and after the hotword.
Specifically, the step 2 includes the following steps:
step 2-1, adding a filler word according to the hot word list, wherein the filler word is configured to be connected with all the acoustic modeling units to construct a parallel grammar recognition network;
step 2-2, adopting a beam-search Viterbi algorithm to perform a decoding search on the extracted input voice segment;
step 2-3, backtracking to obtain the hot word with the highest score and the audio position corresponding to the hot word;
step 2-4, calculating the average posterior probability over the speech frames corresponding to the hot word, and outputting it as the hot word's score.
In this embodiment, in step 2, the posterior probability scores output by the generic recognition acoustic model are used in the grammar recognition network.
Specifically, in step 4, the overlap between the hot word's position and the words in the current recognition result includes overlap at the start position and overlap at the end position.
When the hot word's position overlaps a word in the current recognition result at the starting position, the step 4 specifically includes the following steps:
step 4-1, determining the word in the recognition result located at the starting position of the hot word, and calculating the position difference between that word's starting position and the hot word's starting position;
step 4-2, if the position difference is larger than the duration of one word, selecting from the word list, as candidate words, words whose pronunciation is similar to the part of the word that does not overlap with the hot word;
step 4-3, using a pre-trained language model to predict the probability of each candidate word at the current position, conditioned on the words from the beginning of the sentence up to the word before the current word and on the word after the current word, and taking the probability as the candidate word's score;
incorporating the acoustic confidence information of each word, predicting the probability of each candidate word at the current position, and taking the result as the candidate word's score;
step 4-4, if the score of the highest-scoring candidate word is larger than a given threshold (taken as 0.5), replacing the current word with that candidate word; otherwise, keeping the current word unchanged;
when the hot word's position overlaps a word in the current recognition result at the ending position, the step 4 specifically includes the following steps:
step 4.1, determining the word in the recognition result located at the ending position of the hot word, and calculating the position difference between that word's ending position and the hot word's ending position;
step 4.2, if the position difference is larger than the duration of one word, selecting from the word list, as candidate words, words whose pronunciation is similar to the part of the word that does not overlap with the hot word;
step 4.3, using a pre-trained language model to predict the probability of each candidate word at the current position, conditioned on the words from the end of the sentence back to the word after the current word and on the word before the current word, and taking the probability as the candidate word's score;
incorporating the acoustic confidence information of each word, predicting the probability of each candidate word at the current position, and taking the result as the candidate word's score;
step 4.4, if the score of the highest-scoring candidate word is larger than a given threshold (taken as 0.5), replacing the current word with that candidate word; otherwise, keeping the current word unchanged.
Specifically, in this embodiment, the language model used in step 4 is a recurrent neural network, specifically, an LSTM/GRU type language model.
In this embodiment, the hot word detection engine is configured to correspond to the user ID, so that hot words are user-dependent. When a hot word is added, the hot word and the user ID are uploaded together, and a pronunciation dictionary is queried to obtain the hot word's pronunciation and its corresponding phoneme sequence; the hot word is then added into the grammar network, hot word detection resources are generated, and the hot word is added to the hot word detection engine corresponding to the user ID, so that hot words can be conveniently added to the hot word detection engine.
Specifically, the user may transmit a triplet to the system to tell it to add or delete a given hot word. The triplet is defined as (ID, HotWord, OPT), where ID identifies the user, HotWord identifies the hot word, and OPT marks the action, defined as add or delete.
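A minimal sketch of handling the (ID, HotWord, OPT) triplet, assuming a per-user in-memory store; the pronunciation lookup is a mocked stand-in for the patent's pronunciation-dictionary query (one "phoneme" per character, purely for illustration):

```python
# Per-user hot word stores, keyed by user ID.
hotword_engines = {}

def lookup_pronunciation(word):
    """Hypothetical stand-in for the pronunciation-dictionary query:
    returns a phoneme sequence (here, one symbol per character)."""
    return list(word)

def handle_triplet(user_id, hotword, opt):
    """Apply an (ID, HotWord, OPT) triplet: OPT is 'add' or 'delete'."""
    store = hotword_engines.setdefault(user_id, {})
    if opt == "add":
        # add the hot word with its phoneme sequence to this user's engine
        store[hotword] = lookup_pronunciation(hotword)
    elif opt == "delete":
        store.pop(hotword, None)
    else:
        raise ValueError(f"unknown OPT: {opt!r}")
```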
A hotword recognition system corresponding to embodiment 1 is shown in fig. 2, and includes:
a general speech recognition engine 1 configured to output a speech recognition result and a temporal position and a confidence of each word in audio;
a hotword detection engine 2 configured to detect whether a hotword exists, and output an ID, an audio position, and a score thereof;
the hot word result correction module 3 is configured to replace words at corresponding positions in the voice recognition result output by the general voice recognition engine with hot words;
and the language model result correction module 4 is configured to correct the words before and after the hot word when the hot word appearance position is overlapped with the word in the current recognition result.
Also included is a hotword adding module 5 configured to add hotwords to the hotword detection engine.
Embodiment 1, as shown in fig. 1, provides a sequential recognition mode: general recognition is performed first, and hot word detection is then performed according to the confidence of the recognition result. No additional computing resources are required, but the system delay is increased.
Specific example 2: referring to fig. 3, a hotword recognition method includes the following steps:
step 1, sending the user audio to a general recognition engine to obtain a speech recognition result, expressed as W1, W2, ..., Wn, where n is a natural number, and obtaining, for each speech recognition result Wi, its corresponding position on the audio and its confidence, where 1 ≤ i ≤ n and i is a natural number;
step 2, sending the user audio into a hotword detection engine for hotword retrieval, obtaining the highest-scoring hotword W, its corresponding audio position P and score S, denoted (W, P, S);
step 3, judging the score S of the highest-scoring hotword (W, P, S): if S is larger than a given threshold (taken as 0.5), replacing the words Wi to Wj at the corresponding audio position in the speech recognition result with the hotword W, and executing step 4; otherwise, ending;
step 4, if the position of the hotword overlaps with words in the current recognition result, correcting the words before and after the hotword.
In the present embodiment, step 1 and step 2 are performed simultaneously (in parallel).
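Running step 1 and step 2 concurrently can be sketched with a thread pool; both engine functions below are hypothetical stand-ins that return canned results, the point being only the orchestration:

```python
from concurrent.futures import ThreadPoolExecutor

def recognize(audio):
    """Stand-in for the general recognition engine (step 1)."""
    return ["hello", "foo"]

def detect_hotword(audio):
    """Stand-in for the hotword detection engine (step 2): (W, P, S)."""
    return ("hotword", (0.4, 0.9), 0.8)

def parallel_recognize(audio):
    """Embodiment 2 sketch: steps 1 and 2 run concurrently, so total
    latency is roughly max(engine latencies) rather than their sum."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        rec = pool.submit(recognize, audio)
        hot = pool.submit(detect_hotword, audio)
        return rec.result(), hot.result()
```

This is the trade-off the embodiments describe: the parallel mode spends more compute to keep latency flat, while the sequential mode of embodiment 1 saves compute at the cost of added delay.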
Specifically, the step 2 includes the following steps:
step 2-1, adding a filler word according to the hot word list, wherein the filler word is configured to be connected with all the acoustic modeling units to construct a parallel grammar recognition network;
step 2-2, adopting a beam-search Viterbi algorithm to perform a decoding search on the extracted input voice segment;
step 2-3, backtracking to obtain the hot word with the highest score and the audio position corresponding to the hot word;
step 2-4, calculating the average posterior probability over the speech frames corresponding to the hot word, and outputting it as the hot word's score.
In the present embodiment, in step 2, the acoustic model in the grammar recognition network is a CLDNN model.
Specifically, in step 4, the overlap between the hot word's position and the words in the current recognition result includes overlap at the start position and overlap at the end position.
When the hot word's position overlaps a word in the current recognition result at the starting position, the step 4 specifically includes the following steps:
step 4-1, determining the word in the recognition result located at the starting position of the hot word, and calculating the position difference between that word's starting position and the hot word's starting position;
step 4-2, if the position difference is larger than the duration of one word, selecting from the word list, as candidate words, words whose pronunciation is similar to the part of the word that does not overlap with the hot word;
step 4-3, using a pre-trained language model to predict the probability of each candidate word at the current position, conditioned on the words from the beginning of the sentence up to the word before the current word and on the word after the current word, and taking the probability as the candidate word's score;
incorporating the acoustic confidence information of each word, predicting the probability of each candidate word at the current position, and taking the result as the candidate word's score;
step 4-4, if the score of the highest-scoring candidate word is larger than a given threshold (taken as 0.5), replacing the current word with that candidate word; otherwise, keeping the current word unchanged;
when the hot word's position overlaps a word in the current recognition result at the ending position, the step 4 specifically includes the following steps:
step 4.1, determining the word in the recognition result located at the ending position of the hot word, and calculating the position difference between that word's ending position and the hot word's ending position;
step 4.2, if the position difference is larger than the duration of one word, selecting from the word list, as candidate words, words whose pronunciation is similar to the part of the word that does not overlap with the hot word;
step 4.3, using a pre-trained language model to predict the probability of each candidate word at the current position, conditioned on the words from the end of the sentence back to the word after the current word and on the word before the current word, and taking the probability as the candidate word's score;
incorporating the acoustic confidence information of each word, predicting the probability of each candidate word at the current position, and taking the result as the candidate word's score;
step 4.4, if the score of the highest-scoring candidate word is larger than a given threshold (taken as 0.5), replacing the current word with that candidate word; otherwise, keeping the current word unchanged.
Specifically, in this embodiment, the language model used in step 4 is a recurrent neural network, specifically, an LSTM/GRU type language model.
In this embodiment, the hot word detection engine is configured to correspond to the user ID, so that hot words are user-dependent. When a hot word is added, the hot word and the user ID are uploaded together, and a pronunciation dictionary is queried to obtain the hot word's pronunciation and its corresponding phoneme sequence; the hot word is then added into the grammar network, hot word detection resources are generated, and the hot word is added to the hot word detection engine corresponding to the user ID, so that hot words can be conveniently added to the hot word detection engine.
Specifically, the user may transmit a triplet to the system to tell it to add or delete a given hot word. The triplet is defined as (ID, HotWord, OPT), where ID identifies the user, HotWord identifies the hot word, and OPT marks the action, defined as add or delete.
A hotword recognition system corresponding to embodiment 2 is shown in fig. 4, and includes:
a general speech recognition engine 1 configured to output a speech recognition result and a temporal position and a confidence of each word in audio;
a hotword detection engine 2 configured to detect whether a hotword exists, and output an ID, an audio position, and a score thereof;
the hot word result correction module 3 is configured to replace words at corresponding positions in the voice recognition result output by the general voice recognition engine with hot words;
and the language model result correction module 4 is configured to correct the words before and after the hot word when the hot word appearance position is overlapped with the word in the current recognition result.
Also included is a hotword adding module 5 configured to add hotwords to the hotword detection engine.
Embodiment 2, as shown in fig. 3, provides a parallel recognition mode: general recognition and hot word detection are performed simultaneously. More computing resources are required, but the system delay is basically unchanged and the response is faster.
The hot word recognition method adopts a wake-up word detection scheme to recognize hot words, which can be customized by the user; after a hot word is recognized, the hot word result is corrected, and on this basis other recognition errors caused by the hot word can be further corrected.
In an embodiment of the present invention, there is also provided a hotword recognition apparatus comprising a processor, a memory, and a program; the program is stored in the memory, and the processor calls the program stored in the memory to perform the hotword recognition method described above.
In the above hot word recognition apparatus, the memory and the processor are electrically connected, directly or indirectly, to enable data transmission or interaction; for example, the components may be electrically connected to each other via one or more communication buses or signal lines. The memory stores computer-executable instructions for implementing the hot word recognition method, including at least one software functional module that may be stored in the memory in the form of software or firmware, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory.
The Memory may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory is used for storing programs, and the processor executes a program after receiving an execution instruction.
The processor may be an integrated circuit chip having signal processing capabilities. The Processor may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
In an embodiment of the present invention, there is also provided a computer-readable storage medium configured to store a program configured to execute the above-described hotword recognition method.
Those of ordinary skill in the art will understand that all or a portion of the steps of the above method embodiments may be performed by hardware associated with program instructions. The foregoing program may be stored in a computer-readable storage medium; when executed by a processor, the program performs the steps of the above method embodiments. The aforementioned computer-readable storage media include various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks, containing instructions for causing a computing device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the various embodiments or portions thereof.
Claims (12)
1. A hotword recognition method is characterized by comprising the following steps:
step 1, sending the user audio to a general recognition engine to obtain a speech recognition result, expressed as W1, W2, ..., Wn, where n is a natural number, and obtaining, for each speech recognition result Wi, its corresponding position on the audio and its confidence, where 1 ≤ i ≤ n and i is a natural number;
step 2, sending the user audio into a hotword detection engine for hotword retrieval, obtaining the highest-scoring hotword W, its corresponding audio position P and score S, denoted (W, P, S);
step 3, judging the score S of the highest-scoring hotword (W, P, S): if the score S is larger than a given threshold, replacing the word at the corresponding audio position in the speech recognition result with the hotword W, and executing step 4; otherwise, ending;
and step 4, if the position of the hotword overlaps with a word in the current recognition result, correcting the words before and after the hotword.
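The replacement decision in steps 3 and 4 above can be sketched as follows; the data layout (word/start/end tuples and a (W, (start, end), S) hotword triple) is an illustrative assumption, not the patent's actual representation:

```python
# Illustrative sketch of steps 3-4: replace the word(s) at the hotword's
# audio position when the hotword score exceeds a given threshold.
# The tuple layouts below are assumptions for demonstration only.

def apply_hotword(result, hotword, threshold=0.6):
    """result: list of (word, start, end) from the general recognition engine.
    hotword: (W, (start, end), S) from the hotword detection engine."""
    w, (h_start, h_end), score = hotword
    if score <= threshold:            # step 3: score too low, keep result as-is
        return result
    corrected = []
    for word, start, end in result:
        if start < h_end and end > h_start:            # span overlaps hotword
            if not corrected or corrected[-1][0] != w:
                corrected.append((w, h_start, h_end))  # substitute hotword once
        else:
            corrected.append((word, start, end))
    return corrected

# e.g. "hot" + "work" mis-recognized across 0.4s-1.1s becomes "hotword"
print(apply_hotword([("beijing", 0.0, 0.4), ("hot", 0.4, 0.7), ("work", 0.7, 1.1)],
                    ("hotword", (0.4, 1.1), 0.9)))
# → [('beijing', 0.0, 0.4), ('hotword', 0.4, 1.1)]
```

Note that when the hotword span only partially covers a neighboring word, the boundary repair described in step 4 (and detailed in claim 6) is still needed afterwards.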
2. The hotword recognition method of claim 1, wherein between step 1 and step 2, a step 1.5 is further included: if there exist speech recognition results Wi ~ Wj, where i < j and i, j are natural numbers, whose confidence is below a given threshold, extracting the audio segment corresponding to Wi ~ Wj and executing step 2 on that segment.
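Step 1.5 of claim 2 amounts to locating a run of consecutive low-confidence words and handing their audio span to the hotword detection engine. A minimal sketch, assuming a hypothetical (word, start, end, confidence) layout:

```python
# Sketch of step 1.5: find consecutive words Wi ~ Wj whose confidence falls
# below a threshold and return their audio span to re-search for hotwords.
# The (word, start, end, confidence) layout is an assumption.

def low_confidence_span(words, threshold=0.5):
    for i, (_, _, _, conf) in enumerate(words):
        if conf < threshold:
            j = i
            while j + 1 < len(words) and words[j + 1][3] < threshold:
                j += 1
            return (words[i][1], words[j][2])  # (segment start, segment end)
    return None  # every word is confident enough; skip hotword retrieval
```

Restricting hotword retrieval to such spans keeps the detection engine from re-scoring audio the general engine already recognized confidently.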
3. The hotword recognition method of claim 1, wherein: step 1 and step 2 are performed synchronously.
4. The hotword recognition method of claim 1, wherein: the step 2 specifically comprises the following steps:
step 2-1, adding a filler word according to the hotword list, wherein the filler word is connected with all the acoustic modeling units to construct a parallel grammar recognition network;
step 2-2, adopting a beam-search Viterbi algorithm to perform a decoding search on the extracted input speech segment;
step 2-3, backtracking to obtain the hotword with the highest score and the audio position corresponding to the hotword;
and step 2-4, calculating the average posterior probability of the speech frames corresponding to the hotword, and outputting it as the hotword's score.
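Step 2-4's score is the average posterior probability over the hotword's aligned frames; a plain arithmetic mean is assumed here, since the claim does not specify the averaging scheme:

```python
# Sketch of step 2-4: average the per-frame posterior probabilities of the
# hotword's acoustic units and output the mean as the hotword score.
# Averaging method (arithmetic mean) is an assumption, not from the patent.

def hotword_score(frame_posteriors):
    """frame_posteriors: posterior probabilities for the frames that the
    backtracking step (2-3) aligned to the hotword."""
    if not frame_posteriors:
        return 0.0
    return sum(frame_posteriors) / len(frame_posteriors)
```

This score is what step 3 of claim 1 compares against the given threshold before substituting the hotword.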
5. The hotword recognition method of claim 4, wherein in step 2, the grammar recognition network adopts the posterior probability scores output by the acoustic model of the general recognition engine.
6. The hotword recognition method of claim 1, wherein in step 4, the overlap between the hotword position and a word in the current recognition result includes overlap at the starting position and overlap at the ending position;
when the hotword position overlaps a word at the starting position, step 4 specifically comprises the following steps:
step 4-1, determining the word located at the hotword's starting position in the recognition result, and calculating the position difference between that word's starting position and the hotword's starting position;
step 4-2, if the position difference is larger than the duration of one word, selecting from the word list, as candidate words, words whose pronunciation is similar to the part of the word that does not overlap the hotword;
step 4-3, adopting a pre-trained language model, predicting the probability of each candidate word given the words from the first word of the sentence up to the word preceding the current word and the words following the current word, and taking that probability as the candidate word's score;
step 4-4, if the score of the highest-scoring candidate word is larger than a given threshold, replacing the current word with that candidate word; otherwise, keeping the current word unchanged;
when the hotword position overlaps a word at the ending position, step 4 specifically comprises the following steps:
step 4.1, determining the word located at the hotword's ending position in the recognition result, and calculating the position difference between that word's ending position and the hotword's ending position;
step 4.2, if the position difference is larger than the duration of one word, selecting from the word list, as candidate words, words whose pronunciation is similar to the part of the word that does not overlap the hotword;
step 4.3, adopting a pre-trained language model, predicting the probability of each candidate word given the words from the end of the sentence back to the word following the current word and the words preceding the current word, and taking that probability as the candidate word's score;
step 4.4, if the score of the highest-scoring candidate word is larger than a given threshold, replacing the current word with that candidate word; otherwise, keeping the current word unchanged.
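Steps 4-2 through 4-4 (and their mirror-image steps 4.2 through 4.4) reduce to scoring each pronunciation-similar candidate with a language model conditioned on the surrounding words and keeping the current word unless the best candidate clears a threshold. In this sketch `lm_prob` and its toy probability table are purely illustrative stand-ins for the pre-trained language model:

```python
# Sketch of steps 4-2 to 4-4: pick the best LM-scored candidate for the
# current word, or keep the current word if no candidate clears the threshold.
# TOY_LM is an illustrative stand-in for a pre-trained language model.

TOY_LM = {
    ("went", "to", "beijing"): 0.05,
    ("went", "two", "beijing"): 0.001,
}

def lm_prob(left, candidate, right):
    """Probability of `candidate` given its neighboring context words."""
    prev = left[-1] if left else "<s>"
    nxt = right[0] if right else "</s>"
    return TOY_LM.get((prev, candidate, nxt), 1e-6)

def correct_word(left, current, right, candidates, threshold=0.01):
    best_score, best = max((lm_prob(left, c, right), c) for c in candidates)
    return best if best_score > threshold else current  # step 4-4 decision

print(correct_word(["i", "went"], "two", ["beijing"], ["to", "two"]))  # → to
```

A real implementation would condition on the full left (or right) sentence context as the claim states; the single-neighbor lookup here only keeps the sketch small.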
7. The hotword recognition method of claim 6, wherein between step 4-3 and step 4-4, and between step 4.3 and step 4.4, the following step is further included, respectively: adding acoustic confidence information for each word, and predicting the probability of occurrence of each candidate word of the current word as that candidate word's score.
8. The hotword recognition method of claim 1, wherein the hotword detection engine corresponds to a user ID; when a hotword is added, the hotword and the user ID are uploaded together, and a pronunciation dictionary is queried to obtain the hotword's pronunciation and the corresponding phoneme sequence; the hotword is then added to the grammar network, hotword detection resources are generated, and the hotword is added to the hotword detection engine corresponding to the user ID.
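Claim 8's per-user registration flow can be sketched as below; the pronunciation dictionary and the engine registry keyed by user ID are hypothetical stand-ins, not the patent's actual data structures:

```python
# Sketch of claim 8: register a hotword for a specific user by looking up
# its phoneme sequence and adding it to that user's detection resources.
# PRONUNCIATION_DICT and `engines` are illustrative, not the patent's API.

PRONUNCIATION_DICT = {"hotword": ["HH", "AA", "T", "W", "ER", "D"]}
engines = {}  # user ID -> list of (hotword, phoneme sequence) entries

def add_hotword(user_id, hotword):
    phonemes = PRONUNCIATION_DICT.get(hotword)
    if phonemes is None:
        raise KeyError(f"no pronunciation found for {hotword!r}")
    # add the hotword to this user's grammar network / detection resources
    engines.setdefault(user_id, []).append((hotword, phonemes))
    return phonemes
```

Keying the detection resources by user ID is what lets each user carry a personal hotword list without affecting the shared general recognition engine.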
9. A hotword recognition system, comprising:
a general speech recognition engine configured to receive user audio and output a speech recognition result together with each word's time position in the audio and confidence, the speech recognition result being represented as W1, W2, ..., Wn, where n is a natural number;
a hotword detection engine configured to receive the user audio, detect whether a hotword exists, and output the hotword W with the highest score, the audio position P corresponding to the hotword, and the score S, represented as (W, P, S);
a hotword result correction module configured to judge the score S of the highest-scoring hotword (W, P, S), and if the score S is larger than a given threshold, replace the word at the corresponding audio position in the speech recognition result output by the general speech recognition engine with the hotword W;
and a language model result correction module configured to correct the words before and after the hotword when the hotword position overlaps with a word in the current recognition result.
10. The hotword recognition system of claim 9, further comprising a hotword addition module configured to add or update hotwords in the hotword detection engine.
11. A hotword recognition device, comprising a processor, a memory, and a program;
wherein the program is stored in the memory, and the processor invokes the program stored in the memory to perform the hotword recognition method of claim 1.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store a program configured to execute the hotword recognition method of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910706314.6A CN110415705B (en) | 2019-08-01 | 2019-08-01 | Hot word recognition method, system, device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110415705A CN110415705A (en) | 2019-11-05 |
CN110415705B true CN110415705B (en) | 2022-03-01 |
Family
ID=68365126
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910706314.6A Active CN110415705B (en) | 2019-08-01 | 2019-08-01 | Hot word recognition method, system, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110415705B (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110689881B (en) * | 2018-06-20 | 2022-07-12 | 深圳市北科瑞声科技股份有限公司 | Speech recognition method, speech recognition device, computer equipment and storage medium |
CN111090720B (en) * | 2019-11-22 | 2023-09-12 | 北京捷通华声科技股份有限公司 | Hot word adding method and device |
CN110879839A (en) * | 2019-11-27 | 2020-03-13 | 北京声智科技有限公司 | Hot word recognition method, device and system |
CN111028830B (en) * | 2019-12-26 | 2022-07-15 | 大众问问(北京)信息科技有限公司 | Local hot word bank updating method, device and equipment |
CN111161739B (en) * | 2019-12-28 | 2023-01-17 | 科大讯飞股份有限公司 | Speech recognition method and related product |
CN113178194B (en) * | 2020-01-08 | 2024-03-22 | 上海依图信息技术有限公司 | Voice recognition method and system for interactive hotword updating |
CN111583909B (en) * | 2020-05-18 | 2024-04-12 | 科大讯飞股份有限公司 | Voice recognition method, device, equipment and storage medium |
CN112599114B (en) * | 2020-11-11 | 2024-06-18 | 联想(北京)有限公司 | Voice recognition method and device |
CN112349278A (en) * | 2020-11-12 | 2021-02-09 | 苏州思必驰信息科技有限公司 | Local hot word training and recognition method and device |
CN112489651B (en) * | 2020-11-30 | 2023-02-17 | 科大讯飞股份有限公司 | Voice recognition method, electronic device and storage device |
CN112735428A (en) * | 2020-12-27 | 2021-04-30 | 科大讯飞(上海)科技有限公司 | Hot word acquisition method, voice recognition method and related equipment |
CN113836270A (en) * | 2021-09-28 | 2021-12-24 | 深圳格隆汇信息科技有限公司 | Big data processing method and related product |
CN114185511A (en) * | 2021-11-29 | 2022-03-15 | 北京百度网讯科技有限公司 | Audio data processing method and device and electronic equipment |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5559925A (en) * | 1994-06-24 | 1996-09-24 | Apple Computer, Inc. | Determining the useability of input signals in a data recognition system |
CN101510222A (en) * | 2009-02-20 | 2009-08-19 | 北京大学 | Multilayer index voice document searching method and system thereof |
US20160104480A1 (en) * | 2014-10-09 | 2016-04-14 | Google Inc. | Hotword detection on multiple devices |
CN106782607A (en) * | 2012-07-03 | 2017-05-31 | 谷歌公司 | Determine hot word grade of fit |
US20180182390A1 (en) * | 2016-12-27 | 2018-06-28 | Google Inc. | Contextual hotwords |
US20180330717A1 (en) * | 2017-05-11 | 2018-11-15 | International Business Machines Corporation | Speech recognition by selecting and refining hot words |
CN108984529A (en) * | 2018-07-16 | 2018-12-11 | 北京华宇信息技术有限公司 | Real-time court's trial speech recognition automatic error correction method, storage medium and computing device |
CN109271495A (en) * | 2018-08-14 | 2019-01-25 | 阿里巴巴集团控股有限公司 | Question and answer recognition effect detection method, device, equipment and readable storage medium storing program for executing |
CN109523991A (en) * | 2017-09-15 | 2019-03-26 | 阿里巴巴集团控股有限公司 | Method and device, the equipment of speech recognition |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013000136A1 (en) * | 2011-06-29 | 2013-01-03 | 宇龙计算机通信科技(深圳)有限公司 | Mobile terminal and method, system for inputting network hot words into mobile terminal |
US9263042B1 (en) * | 2014-07-25 | 2016-02-16 | Google Inc. | Providing pre-computed hotword models |
CN106326484A (en) * | 2016-08-31 | 2017-01-11 | 北京奇艺世纪科技有限公司 | Error correction method and device for search terms |
US10134396B2 (en) * | 2016-12-07 | 2018-11-20 | Google Llc | Preventing of audio attacks |
CN110689881B (en) * | 2018-06-20 | 2022-07-12 | 深圳市北科瑞声科技股份有限公司 | Speech recognition method, speech recognition device, computer equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
Heuristic Chinese sentence compression algorithm based on word popularity; Han Jing et al.; Computer Engineering and Applications (《计算机工程与应用》); 2014-12-31; pp. 132-139 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110415705B (en) | Hot word recognition method, system, device and storage medium | |
US10937448B2 (en) | Voice activity detection method and apparatus | |
CN111797632B (en) | Information processing method and device and electronic equipment | |
KR20220035222A (en) | Speech recognition error correction method, related devices, and readable storage medium | |
CN105632499B (en) | Method and apparatus for optimizing speech recognition results | |
CN107844481B (en) | Text recognition error detection method and device | |
CN111429887B (en) | Speech keyword recognition method, device and equipment based on end-to-end | |
CN109559735B (en) | Voice recognition method, terminal equipment and medium based on neural network | |
CN112257437B (en) | Speech recognition error correction method, device, electronic equipment and storage medium | |
CN110503943B (en) | Voice interaction method and voice interaction system | |
CN114999463B (en) | Voice recognition method, device, equipment and medium | |
CN110751234A (en) | OCR recognition error correction method, device and equipment | |
CN111862963B (en) | Voice wakeup method, device and equipment | |
CN111128174A (en) | Voice information processing method, device, equipment and medium | |
CN110956958A (en) | Searching method, searching device, terminal equipment and storage medium | |
US10468031B2 (en) | Diarization driven by meta-information identified in discussion content | |
US10553205B2 (en) | Speech recognition device, speech recognition method, and computer program product | |
US20180158456A1 (en) | Speech recognition device and method thereof | |
CN115858776B (en) | Variant text classification recognition method, system, storage medium and electronic equipment | |
CN114255761A (en) | Speech recognition method, apparatus, device, storage medium and computer program product | |
JP6527000B2 (en) | Pronunciation error detection device, method and program | |
CN113838456A (en) | Phoneme extraction method, voice recognition method, device, equipment and storage medium | |
CN111883109A (en) | Voice information processing and verification model training method, device, equipment and medium | |
CN111048098B (en) | Voice correction system and voice correction method | |
CN111785259A (en) | Information processing method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||