CN110415705B - Hot word recognition method, system, device and storage medium - Google Patents

Info

Publication number
CN110415705B
Authority
CN
China
Prior art keywords
word
hotword
hot
score
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910706314.6A
Other languages
Chinese (zh)
Other versions
CN110415705A (en
Inventor
王欢良
唐浩元
王佳珺
鄢戈
张李
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Qdreamer Network Technology Co ltd
Original Assignee
Suzhou Qdreamer Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Qdreamer Network Technology Co ltd
Priority to CN201910706314.6A
Publication of CN110415705A
Application granted
Publication of CN110415705B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a hot word recognition method, system, device and storage medium, which aim to solve the prior-art problem that a correct speech recognition result may be modified by mistake. The hot word recognition method comprises the following steps: step 1, sending user audio to a general recognition engine to obtain a speech recognition result, and simultaneously obtaining the position on the audio and the confidence corresponding to each recognized word W_i; step 2, sending the user audio to a hotword detection engine to perform hotword retrieval, and obtaining the hotword W with the highest score, the audio position P corresponding to the hotword, and the score S, represented as the triple (W, P, S); step 3, evaluating the score S of the highest-scoring hotword (W, P, S): if the score S is greater than a given threshold, replacing the words W_i~W_j at the corresponding audio position in the speech recognition result with the hotword W, and executing step 4; otherwise, ending; step 4, if the position where the hot word appears overlaps a word in the current recognition result, correcting the words before and after the hot word.

Description

Hot word recognition method, system, device and storage medium
Technical Field
The invention relates to the technical field of voice recognition, in particular to a method, a system, a device and a storage medium for identifying hot words.
Background
Speech recognition has become a mainstream technology in current artificial intelligence applications. Typical speech recognition techniques rely on a particular vocabulary, i.e., only words within a given vocabulary can be recognized; if out-of-vocabulary words appear in the speech, recognition performance is usually poor, or the words are not recognized at all. Some solutions have been proposed to address this problem. The main approach, known as recognition result post-processing, analyzes the text of the recognition result and then corrects it using a language model or the pronunciations of given hot words. This type of method has a serious drawback: correct recognition results are often modified by mistake.
Disclosure of Invention
In view of the above problems, the present invention provides a method, system, device and storage medium for hot word recognition, so as to solve the problem in the prior art that a correct speech recognition result is modified by mistake.
The technical scheme is as follows: a hotword recognition method is characterized by comprising the following steps:
step 1, sending the user audio to a general recognition engine to obtain a speech recognition result, expressed as W_1, W_2, ..., W_n, where n is a natural number, and simultaneously obtaining the position on the audio and the confidence corresponding to each recognized word W_i, where 1 ≤ i ≤ n and i is a natural number;
step 2, sending the user audio to a hotword detection engine to perform hotword retrieval, and obtaining the hotword W with the highest score, the audio position P corresponding to the hotword, and the score S, represented as the triple (W, P, S);
step 3, evaluating the score S of the highest-scoring hotword (W, P, S): if the score S is greater than a given threshold, replacing the words W_i~W_j at the corresponding audio position in the speech recognition result with the hotword W, and executing step 4; otherwise, ending;
step 4, if the position where the hot word appears overlaps a word in the current recognition result, correcting the words before and after the hot word.
Further, a step 1.5 is included between step 1 and step 2: if the speech recognition result contains words W_i~W_j, where i < j and i, j are natural numbers, whose confidence is below a given threshold, the audio segment corresponding to W_i~W_j is extracted and step 2 is executed on it.
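Step 1.5 can be illustrated with a minimal Python sketch. It is only one possible reading of the text, not the patented implementation: the `Word` record, the function name `low_confidence_spans`, and the `min_len` parameter (which encodes the condition i < j as "at least two consecutive words") are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """One recognized word with its audio position and confidence (assumed layout)."""
    text: str
    start: float  # start time in seconds
    end: float    # end time in seconds
    conf: float   # confidence in [0, 1]


def low_confidence_spans(
    words: List[Word], threshold: float = 0.5, min_len: int = 2
) -> List[Tuple[int, int, float, float]]:
    """Return (i, j, start_time, end_time) for each run of consecutive words
    W_i..W_j whose confidence is below `threshold`."""
    spans = []
    run_start = None
    # A trailing None acts as a sentinel that closes a run ending at the last word.
    for idx, word in enumerate(list(words) + [None]):
        is_low = word is not None and word.conf < threshold
        if is_low and run_start is None:
            run_start = idx
        elif not is_low and run_start is not None:
            if idx - run_start >= min_len:
                spans.append(
                    (run_start, idx - 1, words[run_start].start, words[idx - 1].end)
                )
            run_start = None
    return spans
```

Each returned time span identifies the audio segment that would be passed to the hotword detection engine in step 2.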
Further, step 1 and step 2 are performed in parallel.
Further, step 2 specifically comprises the following steps:
step 2-1, constructing a parallel grammar recognition network from the hot word list and an added filler word, wherein the filler word is configured to connect to all acoustic modeling units;
step 2-2, performing a decoding search on the extracted input speech segment using the Viterbi algorithm with beam search;
step 2-3, backtracking to obtain the hotword with the highest score and the audio position corresponding to the hotword;
step 2-4, calculating the average posterior probability of the speech frames corresponding to the hot word, and outputting it as the score of the hot word.
Further, in step 2, the posterior probability score output by the universal recognition acoustic model is adopted in the grammar recognition network.
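The scoring in steps 2-3 and 2-4 can be sketched as follows. This is a simplified illustration under stated assumptions: the decoding search itself is not shown, each candidate is assumed to arrive already backtraced with one acoustic-model posterior per speech frame, and the names `Candidate`, `average_posterior` and `best_hotword` are hypothetical.

```python
from typing import List, Sequence, Tuple

# A detection candidate: (hotword, (start_frame, end_frame), per-frame posteriors).
# The posteriors are assumed to be the acoustic-model probabilities of the
# modeling units on the hotword's backtraced path, one value per frame.
Candidate = Tuple[str, Tuple[int, int], Sequence[float]]


def average_posterior(frame_posteriors: Sequence[float]) -> float:
    """Step 2-4: mean posterior probability over the hotword's speech frames."""
    if not frame_posteriors:
        raise ValueError("a hotword must cover at least one frame")
    return sum(frame_posteriors) / len(frame_posteriors)


def best_hotword(candidates: List[Candidate]) -> Tuple[str, Tuple[int, int], float]:
    """Return the triple (W, P, S) for the highest-scoring candidate."""
    scored = [(word, pos, average_posterior(post)) for word, pos, post in candidates]
    return max(scored, key=lambda triple: triple[2])
```

Averaging over frames makes the score independent of hotword length, so a single threshold can be applied in step 3 to both short and long hot words.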
Further, in step 4, the overlap between the position where the hot word appears and a word in the current recognition result includes overlap at the start position and overlap at the end position.
Further, when the hot word overlaps a word in the current recognition result at the start position, step 4 specifically comprises the following steps:
step 4-1, determining the word in the recognition result located at the start position of the hot word, and calculating the position difference between the start position of that word and the start position of the hot word;
step 4-2, if the position difference is greater than the duration of one word, selecting from the vocabulary words whose pronunciation is similar to the part of the word that does not overlap the hot word, as candidate words;
step 4-3, using a pre-trained language model to predict the probability of each candidate word for the current word, given the words from the first word of the sentence up to the word before the current word and the word after the current word, and taking that probability as the score of the candidate word;
step 4-4, if the score of the highest-scoring candidate word is greater than a given threshold, replacing the current word with that candidate word; otherwise, keeping the current word unchanged;
when the hot word overlaps a word in the current recognition result at the end position, step 4 specifically comprises the following steps:
step 4.1, determining the word in the recognition result located at the end position of the hot word, and calculating the position difference between the end position of that word and the end position of the hot word;
step 4.2, if the position difference is greater than the duration of one word, selecting from the vocabulary words whose pronunciation is similar to the part of the word that does not overlap the hot word, as candidate words;
step 4.3, using a pre-trained language model to predict the probability of each candidate word for the current word, given the words from the last word of the sentence back to the word after the current word and the word before the current word, and taking that probability as the score of the candidate word;
step 4.4, if the score of the highest-scoring candidate word is greater than a given threshold, replacing the current word with that candidate word; otherwise, keeping the current word unchanged.
Further, between step 4-3 and step 4-4, and between step 4.3 and step 4.4, the following step is respectively included: adding the acoustic confidence information of each word, predicting the probability of occurrence of each candidate word for the current word, and taking that probability as the score of the candidate word.
Further, the hot word detection engine is configured to correspond to a user ID. When a hot word is added, the hot word and the user ID are uploaded together, and a pronunciation dictionary is queried to obtain the pronunciation of the hot word and its corresponding phoneme sequence; the hot word is then added to the grammar network; hot word detection resources are generated, and the hot word is added to the hot word detection engine corresponding to the user ID.
A hotword recognition system, comprising:
a general speech recognition engine configured to output speech recognition results and a temporal position and confidence of each word in the audio;
a hot word detection engine configured to detect whether a hot word exists, and output an ID, an audio position, and a score thereof;
the hot word result correction module is configured to replace words at corresponding positions in the voice recognition result output by the general voice recognition engine with hot words;
and the language model result correction module is configured to correct words before and after the hot word when the hot word appearance position is overlapped with the word in the current recognition result.
Further, the system comprises a hotword adding module configured to add hotwords to the hotword detection engine.
A hotword recognition device, comprising a processor, a memory, and a program;
the program is stored in the memory, and the processor calls the program stored in the memory to execute the hot word identification method.
A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store a program configured to execute the above-described hotword recognition method.
The hot word recognition method uses a wake-word detection scheme to recognize hot words, and the hot words can be customized by the user. After a hot word is recognized, the recognition result is corrected with the hot word. In addition, on the basis of detecting and correcting the hot word, other recognition errors caused by the hot word recognition error can be further corrected, namely low-confidence words that overlap or are adjacent to the hot word.
Drawings
FIG. 1 is a flowchart of a hotword recognition method according to embodiment 1;
FIG. 2 is a system block diagram of a hotword recognition system according to embodiment 1;
FIG. 3 is a flowchart of a hotword recognition method according to embodiment 2;
FIG. 4 is a system block diagram of a hotword recognition system according to embodiment 2.
Detailed Description
Embodiment 1: referring to FIG. 1, a hotword recognition method includes the following steps:
step 1, sending the user audio to a general recognition engine to obtain a speech recognition result, expressed as W_1, W_2, ..., W_n, where n is a natural number, and simultaneously obtaining the position on the audio and the confidence corresponding to each recognized word W_i, where 1 ≤ i ≤ n and i is a natural number;
step 1.5, if the speech recognition result contains words W_i~W_j, where i < j and i, j are natural numbers, whose confidence is below a given threshold, here taken as 0.5, extracting the audio segment corresponding to W_i~W_j and executing step 2; otherwise, ending;
step 2, sending the user audio to a hotword detection engine to perform hotword retrieval, and obtaining the hotword W with the highest score, the audio position P corresponding to the hotword, and the score S, represented as the triple (W, P, S);
step 3, evaluating the score S of the highest-scoring hotword (W, P, S): if the score S is greater than a given threshold, here taken as 0.5, replacing the words W_i~W_j at the corresponding audio position in the speech recognition result with the hotword W, and executing step 4; otherwise, ending;
step 4, if the position where the hot word appears overlaps a word in the current recognition result, correcting the words before and after the hot word.
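The replacement in step 3 can be sketched in Python as follows. This is a hedged illustration, not the patented implementation: words are represented as hypothetical `(text, start, end)` tuples with times in seconds, and the sketch assumes that every word overlapping the hotword's audio position P is replaced, leaving any partially covered boundary word to be handled by step 4.

```python
from typing import List, Tuple

TimedWord = Tuple[str, float, float]  # (text, start time, end time), an assumed layout


def replace_with_hotword(
    words: List[TimedWord],
    hotword: str,
    position: Tuple[float, float],
    score: float,
    threshold: float = 0.5,
) -> List[TimedWord]:
    """Step 3: if the score S exceeds the threshold, replace the words that
    overlap the hotword's audio position P with the hotword W."""
    if score <= threshold:
        return list(words)  # detection rejected: leave the result untouched
    hot_start, hot_end = position
    before = [w for w in words if w[2] <= hot_start]  # words ending before the hotword
    after = [w for w in words if w[1] >= hot_end]     # words starting after the hotword
    return before + [(hotword, hot_start, hot_end)] + after
```

For example, with the result `call / key / dreamer / now` and a detection of `Qdreamer` spanning the two middle words at score 0.8, the sketch yields `call / Qdreamer / now`; at score 0.3 the result is returned unchanged, which is how the threshold protects correct recognition results from being overwritten.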
Specifically, step 2 includes the following steps:
step 2-1, constructing a parallel grammar recognition network from the hot word list and an added filler word, wherein the filler word is configured to connect to all acoustic modeling units;
step 2-2, performing a decoding search on the extracted input speech segment using the Viterbi algorithm with beam search;
step 2-3, backtracking to obtain the hotword with the highest score and the audio position corresponding to the hotword;
step 2-4, calculating the average posterior probability of the speech frames corresponding to the hot word, and outputting it as the score of the hot word.
In this embodiment, in step 2, the posterior probability scores output by the generic recognition acoustic model are used in the grammar recognition network.
Specifically, in step 4, the overlap between the position where the hot word appears and a word in the current recognition result includes overlap at the start position and overlap at the end position.
When the hot word overlaps a word in the current recognition result at the start position, step 4 specifically comprises the following steps:
step 4-1, determining the word in the recognition result located at the start position of the hot word, and calculating the position difference between the start position of that word and the start position of the hot word;
step 4-2, if the position difference is greater than the duration of one word, selecting from the vocabulary words whose pronunciation is similar to the part of the word that does not overlap the hot word, as candidate words;
step 4-3, using a pre-trained language model to predict the probability of each candidate word for the current word, given the words from the first word of the sentence up to the word before the current word and the word after the current word, and taking that probability as the score of the candidate word;
adding the acoustic confidence information of each word, predicting the probability of occurrence of each candidate word for the current word, and taking that probability as the score of the candidate word;
step 4-4, if the score of the highest-scoring candidate word is greater than a given threshold, here taken as 0.5, replacing the current word with that candidate word; otherwise, keeping the current word unchanged;
when the hot word overlaps a word in the current recognition result at the end position, step 4 specifically comprises the following steps:
step 4.1, determining the word in the recognition result located at the end position of the hot word, and calculating the position difference between the end position of that word and the end position of the hot word;
step 4.2, if the position difference is greater than the duration of one word, selecting from the vocabulary words whose pronunciation is similar to the part of the word that does not overlap the hot word, as candidate words;
step 4.3, using a pre-trained language model to predict the probability of each candidate word for the current word, given the words from the last word of the sentence back to the word after the current word and the word before the current word, and taking that probability as the score of the candidate word;
adding the acoustic confidence information of each word, predicting the probability of occurrence of each candidate word for the current word, and taking that probability as the score of the candidate word;
step 4.4, if the score of the highest-scoring candidate word is greater than a given threshold, here taken as 0.5, replacing the current word with that candidate word; otherwise, keeping the current word unchanged.
Specifically, in this embodiment, the language model used in step 4 is a recurrent neural network, specifically, an LSTM/GRU type language model.
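The start-position branch of step 4 (steps 4-1 to 4-4) can be sketched as below. Several pieces are assumptions made only for illustration: `lm_prob` is a hypothetical callable standing in for the LSTM/GRU language model and returns the probability of a candidate in its sentence context; `candidates` is assumed to come from a pronunciation-similarity lookup that is not shown; and `min_word_duration` is one interpretation of "the duration of one word". The step that adds acoustic confidence is omitted.

```python
from typing import Callable, Optional, Sequence, Tuple

TimedWord = Tuple[str, float, float]  # (text, start time, end time), an assumed layout


def correct_start_boundary(
    word: TimedWord,
    hot_start: float,
    candidates: Sequence[str],
    lm_prob: Callable[[str], float],
    min_word_duration: float = 0.2,
    threshold: float = 0.5,
) -> Optional[str]:
    """Return the text to use for the word straddling the hotword's start,
    or None when the uncovered part is too short to hold a word."""
    text, start, _end = word
    # Step 4-1: distance from the word's start to the hotword's start.
    position_diff = hot_start - start
    # Step 4-2: only attempt a correction if the uncovered part can hold a word.
    if position_diff <= min_word_duration or not candidates:
        return None
    # Step 4-3: score every candidate with the language model.
    best = max(candidates, key=lm_prob)
    # Step 4-4: accept the best candidate only above the threshold.
    return best if lm_prob(best) > threshold else text
```

The end-position branch (steps 4.1 to 4.4) would mirror this, measuring the difference between the end of the boundary word and the end of the hotword instead.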
In this embodiment, the hotword detection engine is configured to correspond to a user ID, so that detection is user-dependent. When a hotword is added, the hotword and the user ID are uploaded together, and a pronunciation dictionary is queried to obtain the pronunciation of the hotword and its corresponding phoneme sequence; the hotword is then added to the grammar network; hotword detection resources are generated, and the hotword is added to the hotword detection engine corresponding to the user ID. In this way, hotwords can be conveniently added to the hotword detection engine.
Specifically, the user may transmit a triplet to the system, telling the system to add or delete a given hotword. The triplet is defined as (ID, HotWord, OPT), where ID identifies the user, HotWord identifies the hot word, and OPT identifies the action, defined as add or delete.
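Handling of the (ID, HotWord, OPT) triplet can be sketched with a small per-user registry. This is an illustrative sketch only: the class name `HotwordRegistry`, the string values `"add"` and `"delete"` for OPT, and the dictionary-based pronunciation lexicon are all assumptions, and building the grammar network and detection resources is not shown.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class HotwordRegistry:
    """Keeps a separate hotword list, with phoneme sequences, for each user ID."""

    def __init__(self, lexicon: Dict[str, List[str]]) -> None:
        self._lexicon = lexicon          # pronunciation dictionary: word -> phonemes
        self._by_user = defaultdict(dict)  # user ID -> {hotword: phonemes}

    def apply(self, triplet: Tuple[str, str, str]) -> None:
        """Process one (ID, HotWord, OPT) request."""
        user_id, hotword, opt = triplet
        if opt == "add":
            phonemes = self._lexicon.get(hotword)
            if phonemes is None:
                raise KeyError(f"no pronunciation found for {hotword!r}")
            self._by_user[user_id][hotword] = phonemes
        elif opt == "delete":
            self._by_user[user_id].pop(hotword, None)
        else:
            raise ValueError(f"unknown operation {opt!r}")

    def hotwords(self, user_id: str) -> Dict[str, List[str]]:
        """Return a copy of the hotwords registered for one user."""
        return dict(self._by_user.get(user_id, {}))
```

Keying the store by user ID is what makes detection user-dependent: a hotword added by one user does not affect recognition for any other user.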
A hotword recognition system corresponding to embodiment 1 is shown in fig. 2, and includes:
a general speech recognition engine 1 configured to output a speech recognition result and a temporal position and a confidence of each word in audio;
a hotword detection engine 2 configured to detect whether a hotword exists, and output an ID, an audio position, and a score thereof;
the hot word result correction module 3 is configured to replace words at corresponding positions in the voice recognition result output by the general voice recognition engine with hot words;
and the language model result correction module 4 is configured to correct the words before and after the hot word when the hot word appearance position is overlapped with the word in the current recognition result.
Also included is a hotword adding module 5 configured to add hotwords to the hotword detection engine.
In embodiment 1, as shown in FIG. 2, a sequential recognition mode is provided, in which general recognition is performed first and hot word detection is then performed according to the confidence of the recognition result; this requires no additional computing resources, but increases the system delay.
Embodiment 2: referring to FIG. 3, a hotword recognition method includes the following steps:
step 1, sending the user audio to a general recognition engine to obtain a speech recognition result, expressed as W_1, W_2, ..., W_n, where n is a natural number, and simultaneously obtaining the position on the audio and the confidence corresponding to each recognized word W_i, where 1 ≤ i ≤ n and i is a natural number;
step 2, sending the user audio to a hotword detection engine to perform hotword retrieval, and obtaining the hotword W with the highest score, the audio position P corresponding to the hotword, and the score S, represented as the triple (W, P, S);
step 3, evaluating the score S of the highest-scoring hotword (W, P, S): if the score S is greater than a given threshold, here taken as 0.5, replacing the words W_i~W_j at the corresponding audio position in the speech recognition result with the hotword W, and executing step 4; otherwise, ending;
step 4, if the position where the hot word appears overlaps a word in the current recognition result, correcting the words before and after the hot word.
In the present embodiment, step 1 and step 2 are performed in parallel.
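Running the two engines side by side can be sketched with Python's standard `concurrent.futures` module. The engines are passed in as plain callables, which is an assumption for illustration; threads give real concurrency here only if the engines release the interpreter lock (for example native decoders or remote services).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Tuple


def recognize_parallel(
    audio: bytes,
    general_engine: Callable[[bytes], Any],
    hotword_engine: Callable[[bytes], Any],
) -> Tuple[Any, Any]:
    """Run general recognition (step 1) and hotword detection (step 2)
    on the same audio concurrently and return both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        asr_future = pool.submit(general_engine, audio)
        hot_future = pool.submit(hotword_engine, audio)
        # result() blocks until the corresponding engine has finished.
        return asr_future.result(), hot_future.result()
```

Because both engines start at the same time, the total latency is roughly that of the slower engine rather than the sum of the two, at the cost of needing enough compute to run both at once.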
Specifically, step 2 includes the following steps:
step 2-1, constructing a parallel grammar recognition network from the hot word list and an added filler word, wherein the filler word is configured to connect to all acoustic modeling units;
step 2-2, performing a decoding search on the extracted input speech segment using the Viterbi algorithm with beam search;
step 2-3, backtracking to obtain the hotword with the highest score and the audio position corresponding to the hotword;
step 2-4, calculating the average posterior probability of the speech frames corresponding to the hot word, and outputting it as the score of the hot word.
In the present embodiment, in step 2, the acoustic model in the grammar recognition network is a CLDNN model.
Specifically, in step 4, the overlap between the position where the hot word appears and a word in the current recognition result includes overlap at the start position and overlap at the end position.
When the hot word overlaps a word in the current recognition result at the start position, step 4 specifically comprises the following steps:
step 4-1, determining the word in the recognition result located at the start position of the hot word, and calculating the position difference between the start position of that word and the start position of the hot word;
step 4-2, if the position difference is greater than the duration of one word, selecting from the vocabulary words whose pronunciation is similar to the part of the word that does not overlap the hot word, as candidate words;
step 4-3, using a pre-trained language model to predict the probability of each candidate word for the current word, given the words from the first word of the sentence up to the word before the current word and the word after the current word, and taking that probability as the score of the candidate word;
adding the acoustic confidence information of each word, predicting the probability of occurrence of each candidate word for the current word, and taking that probability as the score of the candidate word;
step 4-4, if the score of the highest-scoring candidate word is greater than a given threshold, here taken as 0.5, replacing the current word with that candidate word; otherwise, keeping the current word unchanged;
when the hot word overlaps a word in the current recognition result at the end position, step 4 specifically comprises the following steps:
step 4.1, determining the word in the recognition result located at the end position of the hot word, and calculating the position difference between the end position of that word and the end position of the hot word;
step 4.2, if the position difference is greater than the duration of one word, selecting from the vocabulary words whose pronunciation is similar to the part of the word that does not overlap the hot word, as candidate words;
step 4.3, using a pre-trained language model to predict the probability of each candidate word for the current word, given the words from the last word of the sentence back to the word after the current word and the word before the current word, and taking that probability as the score of the candidate word;
adding the acoustic confidence information of each word, predicting the probability of occurrence of each candidate word for the current word, and taking that probability as the score of the candidate word;
step 4.4, if the score of the highest-scoring candidate word is greater than a given threshold, here taken as 0.5, replacing the current word with that candidate word; otherwise, keeping the current word unchanged.
Specifically, in this embodiment, the language model used in step 4 is a recurrent neural network, specifically, an LSTM/GRU type language model.
In this embodiment, the hotword detection engine is configured to correspond to a user ID, so that detection is user-dependent. When a hotword is added, the hotword and the user ID are uploaded together, and a pronunciation dictionary is queried to obtain the pronunciation of the hotword and its corresponding phoneme sequence; the hotword is then added to the grammar network; hotword detection resources are generated, and the hotword is added to the hotword detection engine corresponding to the user ID. In this way, hotwords can be conveniently added to the hotword detection engine.
Specifically, the user may transmit a triplet to the system, telling the system to add or delete a given hotword. The triplet is defined as (ID, HotWord, OPT), where ID identifies the user, HotWord identifies the hot word, and OPT identifies the action, defined as add or delete.
A hotword recognition system corresponding to embodiment 2 is shown in fig. 4, and includes:
a general speech recognition engine 1 configured to output a speech recognition result and a temporal position and a confidence of each word in audio;
a hotword detection engine 2 configured to detect whether a hotword exists, and output an ID, an audio position, and a score thereof;
the hot word result correction module 3 is configured to replace words at corresponding positions in the voice recognition result output by the general voice recognition engine with hot words;
and the language model result correction module 4 is configured to correct the words before and after the hot word when the hot word appearance position is overlapped with the word in the current recognition result.
Also included is a hotword adding module 5 configured to add hotwords to the hotword detection engine.
In embodiment 2, as shown in FIG. 4, a parallel recognition mode is provided, in which general recognition and hot word detection are performed simultaneously; this requires relatively abundant computing resources, but the system delay is essentially unchanged and the response is faster.
The hot word recognition method uses a wake-word detection scheme to recognize hot words, and the hot words can be customized by the user. After a hot word is recognized, the recognition result is corrected with the hot word. In addition, on the basis of detecting and correcting the hot word, other recognition errors caused by the hot word recognition error can be further corrected.
In an embodiment of the present invention, there is also provided a hotword recognition device comprising a processor, a memory, and a program; the program is stored in the memory, and the processor calls the program stored in the memory to perform the hotword recognition method described above.
In the implementation of the above hot word recognition device, the memory and the processor are directly or indirectly electrically connected to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The memory stores computer-executable instructions for implementing the hot word recognition method, including at least one software functional module that can be stored in the memory in the form of software or firmware; the processor executes various functional applications and data processing by running the software programs and modules stored in the memory.
The memory may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory is used for storing programs, and the processor executes the programs after receiving an execution instruction.
The processor may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like, and may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor.
In an embodiment of the present invention, there is also provided a computer-readable storage medium configured to store a program configured to execute the above-described hotword recognition method.
Those of ordinary skill in the art will understand that all or a portion of the steps of the above-described method embodiments may be performed by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium; when executed by a processor, the program performs the steps of the above-described method embodiments. The aforementioned computer-readable storage medium comprises various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks, and includes instructions for causing a computing device (which can be a personal computer, a server, a network device, etc.) to perform the methods described in the embodiments or in portions of the embodiments.

Claims (12)

1. A hotword recognition method, characterized by comprising the following steps:
step 1, sending the user audio to a general recognition engine to obtain a speech recognition result, the speech recognition result being expressed as W_1, W_2, ..., W_n, where n is a natural number, and obtaining the position on the audio and the confidence corresponding to each word W_i of the speech recognition result, where 1 ≤ i ≤ n and i is a natural number;
step 2, sending the user audio to a hotword detection engine to perform hotword retrieval, and obtaining the hotword W with the highest score, the audio position P corresponding to the hotword, and the score S, represented as (W, P, S);
step 3, checking the score S of the highest-scoring hotword (W, P, S): if the score S is greater than a given threshold, replacing the word at the corresponding audio position in the speech recognition result with the hotword W and executing step 4; otherwise, ending;
step 4, if the position of the hotword overlaps a word in the current recognition result, correcting the words before and after the hotword.
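The thresholding and replacement in step 3 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Word` record, the `apply_hotword` function, the 0.5 threshold, and the rule that a word is replaced when its midpoint falls inside the hot word's span are all assumptions made for the example. Words that only partly overlap the hot word are left in place so that step 4 can correct them.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float       # start time on the audio, in seconds
    end: float         # end time on the audio, in seconds
    confidence: float  # recognizer confidence for this word


def apply_hotword(words, hotword, position, score, threshold=0.5):
    """Replace the words under the hot word's audio span (step 3).

    `position` is a (start, end) pair in seconds. A word counts as being
    "at the corresponding audio position" when its midpoint lies inside
    that span (an assumption of this sketch).
    Returns (new_words, index_of_hotword), or (copy_of_words, None) when
    the score does not exceed the threshold or no word is covered.
    """
    if score <= threshold:
        return list(words), None
    start, end = position
    covered = [
        i for i, w in enumerate(words)
        if start <= (w.start + w.end) / 2 <= end
    ]
    if not covered:
        return list(words), None
    first, last = covered[0], covered[-1]
    replacement = Word(hotword, start, end, score)
    return list(words[:first]) + [replacement] + list(words[last + 1:]), first
```

The returned index tells a later correction stage where the hot word sits, so that it can inspect the neighbouring words on either side.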
2. The hotword recognition method of claim 1, wherein: between step 1 and step 2, a step 1.5 is further included: if there exist speech recognition results W_i to W_j, where i < j and i, j are natural numbers, and the confidence of W_i to W_j is below a given threshold, the audio segment corresponding to W_i to W_j is extracted and step 2 is executed on that segment.
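A possible way to locate the low-confidence segments of step 1.5 is sketched below. The tuple layout `(text, start, end, confidence)`, the function name, and the 0.6 threshold are hypothetical; the `min_words=2` default reflects the condition i < j, which implies a run of at least two words.

```python
def low_confidence_spans(words, threshold=0.6, min_words=2):
    """Find audio spans covering runs of consecutive low-confidence words.

    `words` is a list of (text, start, end, confidence) tuples in time order.
    Returns a list of (start, end) spans, one per qualifying run.
    """
    spans = []
    run = []  # words in the run currently being collected

    def flush():
        # Emit the current run if it is long enough, then reset it.
        if len(run) >= min_words:
            spans.append((run[0][1], run[-1][2]))
        run.clear()

    for word in words:
        if word[3] < threshold:
            run.append(word)
        else:
            flush()
    flush()
    return spans
```

Only the returned spans would then be passed to the hot word detection engine, which keeps the second pass short when most of the utterance was recognized confidently.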
3. The hotword recognition method of claim 1, wherein: step 1 and step 2 are performed synchronously.
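Reading "synchronously" as "at the same time on the same audio", the two engines could be run side by side as in the sketch below. The engine callables are placeholders supplied by the caller, and the use of a thread pool is an assumption of this example rather than something the claim specifies.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize_in_parallel(audio, general_engine, hotword_engine):
    """Run general recognition (step 1) and hot word detection (step 2) concurrently.

    Both engines are plain callables taking the audio and returning a result.
    Returns (general_result, hotword_result).
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        general_future = pool.submit(general_engine, audio)
        hotword_future = pool.submit(hotword_engine, audio)
        # result() blocks until each engine finishes and re-raises its errors.
        return general_future.result(), hotword_future.result()
```

Threads help here only if the engines release the interpreter lock (for example native decoders or remote services); for pure-Python engines a process pool would be the closer equivalent.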
4. The hotword recognition method of claim 1, wherein: the step 2 specifically comprises the following steps:
step 2-1, adding a filler word according to the hot word list, wherein the filler word is configured to be connected with all the acoustic modeling units, so as to construct a parallel grammar recognition network;
step 2-2, performing a decoding search on the extracted input speech segment using a Viterbi algorithm with beam search;
step 2-3, backtracking to obtain the hot word with the highest score and the audio position corresponding to the hot word;
step 2-4, calculating the average posterior probability of the speech frames corresponding to the hot word, and outputting the average posterior probability as the score of the hot word.
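The score of step 2-4 is a plain average over the frames aligned to the hot word. A minimal sketch, assuming the per-frame posteriors are available as a list of floats and the backtracked position is given as a half-open frame range (both assumptions of this example):

```python
def hotword_score(frame_posteriors, start_frame, end_frame):
    """Average the per-frame posterior probabilities over [start_frame, end_frame)."""
    frames = frame_posteriors[start_frame:end_frame]
    if not frames:
        raise ValueError("hot word span contains no frames")
    return sum(frames) / len(frames)
```

Averaging normalizes for duration, so long and short hot words can be compared against the same threshold in step 3.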
5. The hotword recognition method of claim 4, wherein: in step 2, the posterior probability scores output by the universal recognition acoustic model are adopted in the grammar recognition network.
6. The hotword recognition method of claim 1, wherein: in step 4, the overlap between the position where the hot word appears and a word in the current recognition result includes overlap at the start position and overlap at the end position;
when the hot word overlaps a word in the current recognition result at its start position, step 4 specifically comprises the following steps:
step 4-1, determining the word in the recognition result located at the start position of the hot word, and calculating the position difference between the start position of that word and the start position of the hot word;
step 4-2, if the position difference is greater than the duration of one word, selecting from the word list, as candidate words, words whose pronunciation is similar to the part of the word that does not overlap the hot word;
step 4-3, using a pre-trained language model to predict the probability of each candidate word for the current word, given the words from the first word of the sentence up to the word preceding the current word and the words following the current word, and taking that probability as the score of the candidate word;
step 4-4, if the score of the highest-scoring candidate word is greater than a given threshold, replacing the current word with that candidate word; otherwise, keeping the current word unchanged;
when the hot word overlaps a word in the current recognition result at its end position, step 4 specifically comprises the following steps:
step 4.1, determining the word in the recognition result located at the end position of the hot word, and calculating the position difference between the end position of that word and the end position of the hot word;
step 4.2, if the position difference is greater than the duration of one word, selecting from the word list, as candidate words, words whose pronunciation is similar to the part of the word that does not overlap the hot word;
step 4.3, using a pre-trained language model to predict the probability of each candidate word for the current word, given the words from the last word of the sentence back to the word following the current word and the words preceding the current word, and taking that probability as the score of the candidate word;
step 4.4, if the score of the highest-scoring candidate word is greater than a given threshold, replacing the current word with that candidate word; otherwise, keeping the current word unchanged.
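The start-position branch (steps 4-1 to 4-4) might look like the sketch below; the end-position branch mirrors it with end times. Everything named here is hypothetical: `lm_score` stands in for the pre-trained language model already conditioned on the surrounding words, `candidates` is assumed to have been selected by pronunciation similarity beforehand, and the 0.2 s word duration and 0.5 threshold are placeholder values.

```python
def correct_boundary_word(current_word, word_start, hot_start, candidates,
                          lm_score, one_word_duration=0.2, threshold=0.5):
    """Correct the word that overlaps the start of a hot word.

    `lm_score(candidate)` returns the language model probability of the
    candidate in the current context. Returns the word to keep at this position.
    """
    # Step 4-1: how much of the word lies before the hot word begins.
    position_difference = hot_start - word_start
    # Step 4-2: only correct when the leftover audio could hold a word.
    if position_difference <= one_word_duration or not candidates:
        return current_word
    # Step 4-3: score every candidate with the language model.
    scored = [(lm_score(candidate), candidate) for candidate in candidates]
    best_score, best_candidate = max(scored, key=lambda pair: pair[0])
    # Step 4-4: accept the best candidate only if it clears the threshold.
    return best_candidate if best_score > threshold else current_word
```

Keeping the original word whenever no candidate clears the threshold means the correction can only act when the language model is reasonably sure.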
7. The hotword recognition method of claim 6, wherein: between step 4-3 and step 4-5, and between step 4.3 and step 4.5, the following step is further included, respectively: adding the acoustic confidence information of each word, and predicting the probability of occurrence of each candidate word of the current word to serve as the score of the candidate word.
8. The hotword recognition method of claim 1, wherein: the hot word detection engine is configured to correspond to a user ID; when a hot word is added, the hot word and the user ID are uploaded together, and a pronunciation dictionary is queried to obtain the pronunciation of the hot word and its corresponding phoneme sequence; the hot word is then added to the grammar network; and hot word detection resources are generated and added to the hot word detection engine corresponding to the user ID.
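The per-user bookkeeping described above can be sketched with a small registry. The class name, its methods, and the dictionary-based "grammar" are illustrative assumptions; a real engine would compile the phoneme sequences into its own decoding resources.

```python
class HotwordRegistry:
    """Keeps one hot word grammar per user ID."""

    def __init__(self, pronunciation_dict):
        # Maps a word to its phoneme sequence, e.g. {"Joan": ["JH", "OW", "N"]}.
        self._pronunciations = pronunciation_dict
        self._grammars = {}  # user_id -> {hot word: phoneme sequence}

    def add_hotword(self, user_id, hotword):
        """Look up the pronunciation and add the hot word to the user's grammar."""
        phonemes = self._pronunciations.get(hotword)
        if phonemes is None:
            raise KeyError(f"no pronunciation found for {hotword!r}")
        self._grammars.setdefault(user_id, {})[hotword] = list(phonemes)

    def grammar_for(self, user_id):
        """Return a copy of the grammar used for this user's detection engine."""
        return dict(self._grammars.get(user_id, {}))
```

Because each user ID has its own grammar, adding a hot word for one user leaves every other user's detection resources untouched.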
9. A hotword recognition system, comprising:
a general speech recognition engine configured to take the user audio as input and output a speech recognition result together with the time position and confidence of each word in the audio, the speech recognition result being represented as W_1, W_2, ..., W_n, where n is a natural number;
a hot word detection engine configured to take the user audio as input, detect whether a hot word is present, and output the hot word W with the highest score, the audio position P corresponding to the hot word, and the score S, represented as (W, P, S);
a hot word result correction module configured to check the score S of the highest-scoring hot word (W, P, S) and, if the score S is greater than a given threshold, replace the word at the corresponding audio position in the speech recognition result output by the general speech recognition engine with the hot word W;
and the language model result correction module is configured to correct words before and after the hot word when the hot word appearance position is overlapped with the word in the current recognition result.
10. A hotword recognition system as recited in claim 9, wherein: also included is a hotword addition module configured to add or update a hotword to the hotword detection engine.
11. A hotword recognition device, characterized by comprising a processor, a memory, and a program;
the program is stored in the memory and the processor invokes the memory-stored program to perform the hotword recognition method of claim 1.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store a program configured to execute the hotword recognition method of claim 1.
CN201910706314.6A 2019-08-01 2019-08-01 Hot word recognition method, system, device and storage medium Active CN110415705B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910706314.6A CN110415705B (en) 2019-08-01 2019-08-01 Hot word recognition method, system, device and storage medium

Publications (2)

Publication Number Publication Date
CN110415705A CN110415705A (en) 2019-11-05
CN110415705B true CN110415705B (en) 2022-03-01

Family

ID=68365126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910706314.6A Active CN110415705B (en) 2019-08-01 2019-08-01 Hot word recognition method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN110415705B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110689881B (en) * 2018-06-20 2022-07-12 深圳市北科瑞声科技股份有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN111090720B (en) * 2019-11-22 2023-09-12 北京捷通华声科技股份有限公司 Hot word adding method and device
CN110879839A (en) * 2019-11-27 2020-03-13 北京声智科技有限公司 Hot word recognition method, device and system
CN111028830B (en) * 2019-12-26 2022-07-15 大众问问(北京)信息科技有限公司 Local hot word bank updating method, device and equipment
CN111161739B (en) * 2019-12-28 2023-01-17 科大讯飞股份有限公司 Speech recognition method and related product
CN113178194B (en) * 2020-01-08 2024-03-22 上海依图信息技术有限公司 Voice recognition method and system for interactive hotword updating
CN111583909B (en) * 2020-05-18 2024-04-12 科大讯飞股份有限公司 Voice recognition method, device, equipment and storage medium
CN112599114B (en) * 2020-11-11 2024-06-18 联想(北京)有限公司 Voice recognition method and device
CN112349278A (en) * 2020-11-12 2021-02-09 苏州思必驰信息科技有限公司 Local hot word training and recognition method and device
CN112489651B (en) * 2020-11-30 2023-02-17 科大讯飞股份有限公司 Voice recognition method, electronic device and storage device
CN112735428A (en) * 2020-12-27 2021-04-30 科大讯飞(上海)科技有限公司 Hot word acquisition method, voice recognition method and related equipment
CN113836270A (en) * 2021-09-28 2021-12-24 深圳格隆汇信息科技有限公司 Big data processing method and related product
CN114185511A (en) * 2021-11-29 2022-03-15 北京百度网讯科技有限公司 Audio data processing method and device and electronic equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5559925A (en) * 1994-06-24 1996-09-24 Apple Computer, Inc. Determining the useability of input signals in a data recognition system
CN101510222A (en) * 2009-02-20 2009-08-19 北京大学 Multilayer index voice document searching method and system thereof
US20160104480A1 (en) * 2014-10-09 2016-04-14 Google Inc. Hotword detection on multiple devices
CN106782607A (en) * 2012-07-03 2017-05-31 谷歌公司 Determine hot word grade of fit
US20180182390A1 (en) * 2016-12-27 2018-06-28 Google Inc. Contextual hotwords
US20180330717A1 (en) * 2017-05-11 2018-11-15 International Business Machines Corporation Speech recognition by selecting and refining hot words
CN108984529A (en) * 2018-07-16 2018-12-11 北京华宇信息技术有限公司 Real-time court's trial speech recognition automatic error correction method, storage medium and computing device
CN109271495A (en) * 2018-08-14 2019-01-25 阿里巴巴集团控股有限公司 Question and answer recognition effect detection method, device, equipment and readable storage medium storing program for executing
CN109523991A (en) * 2017-09-15 2019-03-26 阿里巴巴集团控股有限公司 Method and device, the equipment of speech recognition

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013000136A1 (en) * 2011-06-29 2013-01-03 宇龙计算机通信科技(深圳)有限公司 Mobile terminal and method, system for inputting network hot words into mobile terminal
US9263042B1 (en) * 2014-07-25 2016-02-16 Google Inc. Providing pre-computed hotword models
CN106326484A (en) * 2016-08-31 2017-01-11 北京奇艺世纪科技有限公司 Error correction method and device for search terms
US10134396B2 (en) * 2016-12-07 2018-11-20 Google Llc Preventing of audio attacks
CN110689881B (en) * 2018-06-20 2022-07-12 深圳市北科瑞声科技股份有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于词语热度的启发式中文句子压缩算法;韩静 等;《计算机工程与应用》;20141231;第132-139页 *

Also Published As

Publication number Publication date
CN110415705A (en) 2019-11-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant