CN110415705B - Hot word recognition method, system, device and storage medium - Google Patents
- Publication number
- CN110415705B (application CN201910706314.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a hot word recognition method, system, device and storage medium, aiming to solve the prior-art problem that a correct speech recognition result may be modified by mistake. The hot word recognition method comprises the following steps: step 1, sending the user audio into a general recognition engine to obtain a speech recognition result, together with the position on the audio and the confidence of each recognized word Wi; step 2, sending the user audio into a hotword detection engine for hotword retrieval, obtaining the highest-scoring hotword W, its corresponding audio position P and score S, denoted (W, P, S); step 3, judging the score S of the highest-scoring hotword (W, P, S): if S is larger than a given threshold, replacing the words Wi to Wj at the corresponding audio position in the speech recognition result with the hotword W, and executing step 4; otherwise, ending; step 4, if the position of the hotword overlaps with words in the current recognition result, correcting the words before and after the hotword.
Description
Technical Field
The invention relates to the technical field of voice recognition, in particular to a method, a system, a device and a storage medium for identifying hot words.
Background
Speech recognition technology has become a dominant technology in current applications of artificial intelligence. Typical speech recognition techniques rely on a particular vocabulary, i.e., only words within the given vocabulary range are recognized; if out-of-vocabulary words appear in the speech, recognition performance is usually poor, and such words may not be recognized at all. Some solutions have been proposed to address this problem. The main approach, called recognition-result post-processing, analyzes the text of the recognition result and then corrects it using a language model or the given hot words' pronunciations. This type of method has a fatal disadvantage: a correct recognition result is often mistakenly modified.
Disclosure of Invention
In view of the above problems, the present invention provides a method, system, device and storage medium for hot word recognition, so as to solve the problem in the prior art that a correct speech recognition result is modified by mistake.
The technical scheme is as follows: a hotword recognition method is characterized by comprising the following steps:
step 1, sending the user audio to a general recognition engine to obtain a speech recognition result, expressed as W1, W2, ..., Wn, where n is a natural number, and obtaining, for each speech recognition result Wi, its corresponding position on the audio and its confidence, where 1 ≤ i ≤ n and i is a natural number;
step 2, sending the user audio into a hotword detection engine for hotword retrieval, obtaining the highest-scoring hotword W, its corresponding audio position P and score S, denoted (W, P, S);
step 3, judging the score S of the highest-scoring hotword (W, P, S): if S is larger than a given threshold, replacing the words Wi to Wj at the corresponding audio position in the speech recognition result with the hotword W, and executing step 4; otherwise, ending;
and 4, if the position of the hot word is overlapped with the word in the current recognition result, correcting the words before and after the hot word.
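As a minimal illustration only (not from the patent), steps 3 and 4 above amount to a threshold-gated replacement of time-overlapping words; the `RecWord` representation and all names below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class RecWord:
    text: str        # recognized word
    start: float     # start time on the audio (seconds)
    end: float       # end time on the audio (seconds)
    conf: float      # confidence in [0, 1]

def apply_hotword(result, hotword, pos, score, threshold=0.5):
    """Step 3 sketch: if the best hotword's score S clears the threshold,
    replace the recognized words whose time spans overlap the hotword's
    audio position pos = (start, end) with the hotword itself."""
    if score <= threshold:
        return result            # below threshold: keep the result as-is
    h_start, h_end = pos
    kept, replaced = [], False
    for w in result:
        # two intervals overlap iff each starts before the other ends
        if w.start < h_end and w.end > h_start:
            if not replaced:     # insert the hotword once, in place
                kept.append(RecWord(hotword, h_start, h_end, score))
                replaced = True
        else:
            kept.append(w)
    return kept
```

Boundary words that only partially overlap the hotword span are what step 4 then re-examines with the language model.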
Further, between the step 1 and the step 2, a step 1.5 is included: if there exist speech recognition results Wi to Wj, where i < j and i, j are natural numbers, whose confidence is below a given threshold, the audio segment corresponding to Wi to Wj is extracted and step 2 is executed on it.
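Step 1.5 can be sketched as a scan for low-confidence runs; this is a hypothetical sketch in which recognized words are represented as (start, end, conf) triples, and `min_run=2` mirrors the patent's "Wi to Wj, i < j" formulation (runs of at least two words):

```python
def low_conf_spans(words, threshold=0.5, min_run=2):
    """Step 1.5 sketch: find maximal runs of consecutive recognized words
    whose confidence falls below `threshold`, and return each run's audio
    span (start of its first word, end of its last word), so that only
    these segments need to be sent to the hotword detection engine."""
    spans, run = [], []
    # append a sentinel word with confidence 1.0 to flush a trailing run
    for start, end, conf in list(words) + [(None, None, 1.0)]:
        if conf < threshold:
            run.append((start, end))
        else:
            if len(run) >= min_run:
                spans.append((run[0][0], run[-1][1]))
            run = []
    return spans
```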
Further, step 1 and step 2 are performed simultaneously (in parallel).
Further, the step 2 specifically comprises the following steps:
step 2-1, adding a filler word according to the hot word list, wherein the filler word is configured to be connected with all the acoustic modeling units to construct a parallel grammar recognition network;
step 2-2, adopting a beam-search Viterbi algorithm to perform a decoding search on the extracted input voice segment;
step 2-3, backtracking to obtain the hotword with the highest score and the audio position corresponding to the hotword;
step 2-4, calculating the average posterior probability over the speech frames corresponding to the hot word, and outputting it as the hot word's score.
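The scoring in step 2-4 reduces to a frame-posterior average once the decoder's traceback has aligned a frame range to the hotword. The sketch below is a stand-in for that final step only (the Viterbi/beam-search decoder itself is not shown), and the per-frame posterior values are toy numbers:

```python
def hotword_score(aligned_posteriors):
    """Step 2-4 sketch: the hotword's score S is the average posterior
    probability over the speech frames that the traceback aligned to the
    hotword's acoustic units (one posterior value per aligned frame)."""
    if not aligned_posteriors:
        return 0.0
    return sum(aligned_posteriors) / len(aligned_posteriors)

# Hypothetical traceback output: the decoder aligned four frames to the
# hotword's phone sequence, with these per-frame posteriors.
S = hotword_score([0.91, 0.85, 0.88, 0.93])
assert 0.0 <= S <= 1.0   # S is then compared against the step-3 threshold
```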
Further, in step 2, the posterior probability score output by the universal recognition acoustic model is adopted in the grammar recognition network.
Further, in step 4, the overlap between the hot word's position and the words in the current recognition result includes overlap at the start position and overlap at the end position.
Further, when the hot word's position overlaps a word in the current recognition result at the starting position, the step 4 specifically includes the following steps:
step 4-1, determining the word in the recognition result located at the starting position of the hot word, and calculating the position difference between that word's starting position and the hot word's starting position;
step 4-2, if the position difference is larger than the duration of one word, selecting from the word list, as candidate words, words whose pronunciation is similar to the part of the word that does not overlap with the hot word;
step 4-3, using a pre-trained language model to predict the probability of each candidate word at the current position, conditioned on the words from the beginning of the sentence up to the word before the current word and on the word after the current word, and taking the probability as the candidate word's score;
step 4-4, if the score of the highest-scoring candidate word is larger than a given threshold, replacing the current word with that candidate word; otherwise, keeping the current word unchanged;
when the hot word's position overlaps a word in the current recognition result at the ending position, the step 4 specifically includes the following steps:
step 4.1, determining the word in the recognition result located at the ending position of the hot word, and calculating the position difference between that word's ending position and the hot word's ending position;
step 4.2, if the position difference is larger than the duration of one word, selecting from the word list, as candidate words, words whose pronunciation is similar to the part of the word that does not overlap with the hot word;
step 4.3, using a pre-trained language model to predict the probability of each candidate word at the current position, conditioned on the words from the end of the sentence back to the word after the current word and on the word before the current word, and taking the probability as the candidate word's score;
step 4.4, if the score of the highest-scoring candidate word is larger than a given threshold, replacing the current word with that candidate word; otherwise, keeping the current word unchanged.
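The candidate scoring and thresholded replacement of steps 4-3/4-4 (and symmetrically 4.3/4.4) can be sketched as follows. The language model here is a stand-in: `lm_score` is any callable that returns the context-conditioned probability of a word, and the example vocabulary is invented for illustration:

```python
def correct_boundary_word(current, candidates, lm_score, threshold=0.5):
    """Steps 4-3 / 4-4 sketch: score each candidate replacement for the
    word adjoining the hot word with a language model, and replace the
    current word only when the best candidate clears the threshold.
    lm_score(word) stands in for the pre-trained LM conditioned on the
    surrounding context (sentence start .. previous word, and next word)."""
    if not candidates:
        return current           # step 4-2 produced no candidates
    best = max(candidates, key=lm_score)
    if lm_score(best) > threshold:
        return best              # step 4-4: confident enough to replace
    return current               # otherwise keep the recognized word
```

A toy usage, with invented LM probabilities: `correct_boundary_word("begging", ["beijing"], lambda w: {"beijing": 0.9}.get(w, 0.0))` would return `"beijing"`.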
Further, between step 4-3 and step 4-4, and between step 4.3 and step 4.4, the following step is respectively included: incorporating the acoustic confidence information of each word when predicting the probability of each candidate word at the current position, and taking the result as the candidate word's score.
Further, the hot word detection engine is configured to correspond to a user ID; when a hot word is added, the hot word and the user ID are uploaded together, and a pronunciation dictionary is queried to obtain the hot word's pronunciation and its corresponding phoneme sequence; the hot word is then added into the grammar network, hot word detection resources are generated, and the hot word is added to the hot word detection engine corresponding to the user ID.
A hotword recognition system, comprising:
a general speech recognition engine configured to output speech recognition results and a temporal position and confidence of each word in the audio;
a hot word detection engine configured to detect whether a hot word exists, and output an ID, an audio position, and a score thereof;
the hot word result correction module is configured to replace words at corresponding positions in the voice recognition result output by the general voice recognition engine with hot words;
and the language model result correction module is configured to correct words before and after the hot word when the hot word appearance position is overlapped with the word in the current recognition result.
Further, the system comprises a hotword adding module configured to add hotwords to the hotword detection engine.
A hotword recognition device, comprising a processor, a memory, and a program;
the program is stored in the memory, and the processor calls the program stored in the memory to execute the hot word identification method.
A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store a program configured to execute the above-described hotword recognition method.
The hot word recognition method adopts a wake-up word detection scheme to recognize hot words, which can be customized by the user. After a hot word is recognized, the hot word result is corrected; on this basis, other recognition errors caused by the hot word can be further corrected, namely low-confidence words overlapping with or adjacent to the hot word.
Drawings
FIG. 1 is a flowchart of a hotword identification method according to embodiment 1;
fig. 2 is a system block diagram of a hotword recognition system of embodiment 1;
FIG. 3 is a flowchart of a hotword identification method according to embodiment 2;
fig. 4 is a system block diagram of a hotword recognition system according to embodiment 2.
Detailed Description
Specific example 1: referring to fig. 1, a hotword recognition method includes the following steps:
step 1, sending the user audio to a general recognition engine to obtain a speech recognition result, expressed as W1, W2, ..., Wn, where n is a natural number, and obtaining, for each speech recognition result Wi, its corresponding position on the audio and its confidence, where 1 ≤ i ≤ n and i is a natural number;
step 1.5, if there exist speech recognition results Wi to Wj, where i < j and i, j are natural numbers, whose confidence is below a given threshold (taken as 0.5), extracting the audio segment corresponding to Wi to Wj and executing step 2; otherwise, ending;
step 2, sending the user audio into a hotword detection engine for hotword retrieval, obtaining the highest-scoring hotword W, its corresponding audio position P and score S, denoted (W, P, S);
step 3, judging the score S of the highest-scoring hotword (W, P, S): if S is larger than a given threshold (taken as 0.5), replacing the words Wi to Wj at the corresponding audio position in the speech recognition result with the hotword W, and executing step 4; otherwise, ending;
step 4, if the position of the hotword overlaps with words in the current recognition result, correcting the words before and after the hotword.
Specifically, the step 2 includes the following steps:
step 2-1, adding a filler word according to the hot word list, wherein the filler word is configured to be connected with all the acoustic modeling units to construct a parallel grammar recognition network;
step 2-2, adopting a beam-search Viterbi algorithm to perform a decoding search on the extracted input voice segment;
step 2-3, backtracking to obtain the hot word with the highest score and the audio position corresponding to the hot word;
step 2-4, calculating the average posterior probability over the speech frames corresponding to the hot word, and outputting it as the hot word's score.
In this embodiment, in step 2, the posterior probability scores output by the generic recognition acoustic model are used in the grammar recognition network.
Specifically, in step 4, the overlap between the hot word's position and the words in the current recognition result includes overlap at the start position and overlap at the end position.
When the hot word's position overlaps a word in the current recognition result at the starting position, the step 4 specifically includes the following steps:
step 4-1, determining the word in the recognition result located at the starting position of the hot word, and calculating the position difference between that word's starting position and the hot word's starting position;
step 4-2, if the position difference is larger than the duration of one word, selecting from the word list, as candidate words, words whose pronunciation is similar to the part of the word that does not overlap with the hot word;
step 4-3, using a pre-trained language model to predict the probability of each candidate word at the current position, conditioned on the words from the beginning of the sentence up to the word before the current word and on the word after the current word, and taking the probability as the candidate word's score;
incorporating the acoustic confidence information of each word, predicting the probability of each candidate word at the current position, and taking the result as the candidate word's score;
step 4-4, if the score of the highest-scoring candidate word is larger than a given threshold (taken as 0.5), replacing the current word with that candidate word; otherwise, keeping the current word unchanged;
when the hot word's position overlaps a word in the current recognition result at the ending position, the step 4 specifically includes the following steps:
step 4.1, determining the word in the recognition result located at the ending position of the hot word, and calculating the position difference between that word's ending position and the hot word's ending position;
step 4.2, if the position difference is larger than the duration of one word, selecting from the word list, as candidate words, words whose pronunciation is similar to the part of the word that does not overlap with the hot word;
step 4.3, using a pre-trained language model to predict the probability of each candidate word at the current position, conditioned on the words from the end of the sentence back to the word after the current word and on the word before the current word, and taking the probability as the candidate word's score;
incorporating the acoustic confidence information of each word, predicting the probability of each candidate word at the current position, and taking the result as the candidate word's score;
step 4.4, if the score of the highest-scoring candidate word is larger than a given threshold (taken as 0.5), replacing the current word with that candidate word; otherwise, keeping the current word unchanged.
Specifically, in this embodiment, the language model used in step 4 is a recurrent neural network, specifically, an LSTM/GRU type language model.
In this embodiment, the hot word detection engine is configured to correspond to the user ID, so that hot words are user-dependent. When a hot word is added, the hot word and the user ID are uploaded together, and a pronunciation dictionary is queried to obtain the hot word's pronunciation and its corresponding phoneme sequence; the hot word is then added into the grammar network, hot word detection resources are generated, and the hot word is added to the hot word detection engine corresponding to the user ID, so that hot words can be conveniently added to the hot word detection engine.
Specifically, the user may transmit a triplet to the system to tell it to add or delete a given hot word. The triplet is defined as (ID, HotWord, OPT), where ID identifies the user, HotWord identifies the hot word, and OPT marks the action, defined as add or delete.
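A minimal sketch of handling the (ID, HotWord, OPT) triplet, assuming a per-user in-memory store; the pronunciation lookup is a mocked stand-in for the patent's pronunciation-dictionary query (one "phoneme" per character, purely for illustration):

```python
# Per-user hot word stores, keyed by user ID.
hotword_engines = {}

def lookup_pronunciation(word):
    """Hypothetical stand-in for the pronunciation-dictionary query:
    returns a phoneme sequence (here, one symbol per character)."""
    return list(word)

def handle_triplet(user_id, hotword, opt):
    """Apply an (ID, HotWord, OPT) triplet: OPT is 'add' or 'delete'."""
    store = hotword_engines.setdefault(user_id, {})
    if opt == "add":
        # add the hot word with its phoneme sequence to this user's engine
        store[hotword] = lookup_pronunciation(hotword)
    elif opt == "delete":
        store.pop(hotword, None)
    else:
        raise ValueError(f"unknown OPT: {opt!r}")
```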
A hotword recognition system corresponding to embodiment 1 is shown in fig. 2, and includes:
a general speech recognition engine 1 configured to output a speech recognition result and a temporal position and a confidence of each word in audio;
a hotword detection engine 2 configured to detect whether a hotword exists, and output an ID, an audio position, and a score thereof;
the hot word result correction module 3 is configured to replace words at corresponding positions in the voice recognition result output by the general voice recognition engine with hot words;
and the language model result correction module 4 is configured to correct the words before and after the hot word when the hot word appearance position is overlapped with the word in the current recognition result.
Also included is a hotword adding module 5 configured to add hotwords to the hotword detection engine.
Embodiment 1, as shown in fig. 1, provides a sequential recognition mode: general recognition is performed first, and hot word detection is then performed according to the confidence of the recognition result. No additional computing resources are required, but the system delay is increased.
Specific example 2: referring to fig. 3, a hotword recognition method includes the following steps:
step 1, sending the user audio to a general recognition engine to obtain a speech recognition result, expressed as W1, W2, ..., Wn, where n is a natural number, and obtaining, for each speech recognition result Wi, its corresponding position on the audio and its confidence, where 1 ≤ i ≤ n and i is a natural number;
step 2, sending the user audio into a hotword detection engine for hotword retrieval, obtaining the highest-scoring hotword W, its corresponding audio position P and score S, denoted (W, P, S);
step 3, judging the score S of the highest-scoring hotword (W, P, S): if S is larger than a given threshold (taken as 0.5), replacing the words Wi to Wj at the corresponding audio position in the speech recognition result with the hotword W, and executing step 4; otherwise, ending;
step 4, if the position of the hotword overlaps with words in the current recognition result, correcting the words before and after the hotword.
In the present embodiment, step 1 and step 2 are performed simultaneously (in parallel).
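Running step 1 and step 2 concurrently can be sketched with a thread pool; both engine functions below are hypothetical stand-ins that return canned results, the point being only the orchestration:

```python
from concurrent.futures import ThreadPoolExecutor

def recognize(audio):
    """Stand-in for the general recognition engine (step 1)."""
    return ["hello", "foo"]

def detect_hotword(audio):
    """Stand-in for the hotword detection engine (step 2): (W, P, S)."""
    return ("hotword", (0.4, 0.9), 0.8)

def parallel_recognize(audio):
    """Embodiment 2 sketch: steps 1 and 2 run concurrently, so total
    latency is roughly max(engine latencies) rather than their sum."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        rec = pool.submit(recognize, audio)
        hot = pool.submit(detect_hotword, audio)
        return rec.result(), hot.result()
```

This is the trade-off the embodiments describe: the parallel mode spends more compute to keep latency flat, while the sequential mode of embodiment 1 saves compute at the cost of added delay.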
Specifically, the step 2 includes the following steps:
step 2-1, adding a filler word according to the hot word list, wherein the filler word is configured to be connected with all the acoustic modeling units to construct a parallel grammar recognition network;
step 2-2, adopting a beam-search Viterbi algorithm to perform a decoding search on the extracted input voice segment;
step 2-3, backtracking to obtain the hot word with the highest score and the audio position corresponding to the hot word;
step 2-4, calculating the average posterior probability over the speech frames corresponding to the hot word, and outputting it as the hot word's score.
In the present embodiment, in step 2, the acoustic model in the grammar recognition network is a CLDNN model.
Specifically, in step 4, the overlap between the hot word's position and the words in the current recognition result includes overlap at the start position and overlap at the end position.
When the hot word's position overlaps a word in the current recognition result at the starting position, the step 4 specifically includes the following steps:
step 4-1, determining the word in the recognition result located at the starting position of the hot word, and calculating the position difference between that word's starting position and the hot word's starting position;
step 4-2, if the position difference is larger than the duration of one word, selecting from the word list, as candidate words, words whose pronunciation is similar to the part of the word that does not overlap with the hot word;
step 4-3, using a pre-trained language model to predict the probability of each candidate word at the current position, conditioned on the words from the beginning of the sentence up to the word before the current word and on the word after the current word, and taking the probability as the candidate word's score;
incorporating the acoustic confidence information of each word, predicting the probability of each candidate word at the current position, and taking the result as the candidate word's score;
step 4-4, if the score of the highest-scoring candidate word is larger than a given threshold (taken as 0.5), replacing the current word with that candidate word; otherwise, keeping the current word unchanged;
when the hot word's position overlaps a word in the current recognition result at the ending position, the step 4 specifically includes the following steps:
step 4.1, determining the word in the recognition result located at the ending position of the hot word, and calculating the position difference between that word's ending position and the hot word's ending position;
step 4.2, if the position difference is larger than the duration of one word, selecting from the word list, as candidate words, words whose pronunciation is similar to the part of the word that does not overlap with the hot word;
step 4.3, using a pre-trained language model to predict the probability of each candidate word at the current position, conditioned on the words from the end of the sentence back to the word after the current word and on the word before the current word, and taking the probability as the candidate word's score;
incorporating the acoustic confidence information of each word, predicting the probability of each candidate word at the current position, and taking the result as the candidate word's score;
step 4.4, if the score of the highest-scoring candidate word is larger than a given threshold (taken as 0.5), replacing the current word with that candidate word; otherwise, keeping the current word unchanged.
Specifically, in this embodiment, the language model used in step 4 is a recurrent neural network, specifically, an LSTM/GRU type language model.
In this embodiment, the hot word detection engine is configured to correspond to the user ID, so that hot words are user-dependent. When a hot word is added, the hot word and the user ID are uploaded together, and a pronunciation dictionary is queried to obtain the hot word's pronunciation and its corresponding phoneme sequence; the hot word is then added into the grammar network, hot word detection resources are generated, and the hot word is added to the hot word detection engine corresponding to the user ID, so that hot words can be conveniently added to the hot word detection engine.
Specifically, the user may transmit a triplet to the system to tell it to add or delete a given hot word. The triplet is defined as (ID, HotWord, OPT), where ID identifies the user, HotWord identifies the hot word, and OPT marks the action, defined as add or delete.
A hotword recognition system corresponding to embodiment 2 is shown in fig. 4, and includes:
a general speech recognition engine 1 configured to output a speech recognition result and a temporal position and a confidence of each word in audio;
a hotword detection engine 2 configured to detect whether a hotword exists, and output an ID, an audio position, and a score thereof;
the hot word result correction module 3 is configured to replace words at corresponding positions in the voice recognition result output by the general voice recognition engine with hot words;
and the language model result correction module 4 is configured to correct the words before and after the hot word when the hot word appearance position is overlapped with the word in the current recognition result.
Also included is a hotword adding module 5 configured to add hotwords to the hotword detection engine.
Embodiment 2, as shown in fig. 3, provides a parallel recognition mode: general recognition and hot word detection are performed simultaneously. More computing resources are required, but the system delay is basically unchanged and the response is faster.
The hot word recognition method adopts a wake-up word detection scheme to recognize hot words, which can be customized by the user; after a hot word is recognized, the hot word result is corrected, and on this basis other recognition errors caused by the hot word can be further corrected.
In an embodiment of the present invention, there is also provided a hotword recognition apparatus comprising a processor, a memory, and a program; the program is stored in the memory, and the processor calls the program stored in the memory to perform the hotword recognition method described above.
In the above hot word recognition apparatus, the memory and the processor are electrically connected, directly or indirectly, to enable data transmission or interaction; for example, the components may be electrically connected to each other via one or more communication buses or signal lines. The memory stores computer-executable instructions for implementing the hot word recognition method, including at least one software functional module that may be stored in the memory in the form of software or firmware, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory.
The Memory may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory is used for storing programs, and the processor executes a program after receiving an execution instruction.
The processor may be an integrated circuit chip having signal processing capabilities. The Processor may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
In an embodiment of the present invention, there is also provided a computer-readable storage medium configured to store a program configured to execute the above-described hotword recognition method.
Those of ordinary skill in the art will understand that all or a portion of the steps of the above method embodiments may be performed by hardware associated with program instructions. The foregoing program may be stored in a computer-readable storage medium; when executed by a processor, the program performs the steps of the above method embodiments. The aforementioned computer-readable storage media include various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks, containing instructions for causing a computing device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the various embodiments or portions thereof.
Claims (12)
1. A hotword recognition method is characterized by comprising the following steps:
step 1, sending the user audio to a general recognition engine to obtain a speech recognition result, expressed as W1, W2, ..., Wn, where n is a natural number, and obtaining, for each speech recognition result Wi, its corresponding position on the audio and its confidence, where 1 ≤ i ≤ n and i is a natural number;
step 2, sending the user audio into a hotword detection engine for hotword retrieval, obtaining the highest-scoring hotword W, its corresponding audio position P and score S, denoted (W, P, S);
step 3, judging the score S of the highest-scoring hotword (W, P, S): if the score S is larger than a given threshold, replacing the word at the corresponding audio position in the speech recognition result with the hotword W, and executing step 4; otherwise, ending;
and step 4, if the position of the hotword overlaps with a word in the current recognition result, correcting the words before and after the hotword.
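The replacement decision in steps 3 and 4 above can be sketched as follows; the data layout (word/start/end tuples and a (W, (start, end), S) hotword triple) is an illustrative assumption, not the patent's actual representation:

```python
# Illustrative sketch of steps 3-4: replace the word(s) at the hotword's
# audio position when the hotword score exceeds a given threshold.
# The tuple layouts below are assumptions for demonstration only.

def apply_hotword(result, hotword, threshold=0.6):
    """result: list of (word, start, end) from the general recognition engine.
    hotword: (W, (start, end), S) from the hotword detection engine."""
    w, (h_start, h_end), score = hotword
    if score <= threshold:            # step 3: score too low, keep result as-is
        return result
    corrected = []
    for word, start, end in result:
        if start < h_end and end > h_start:            # span overlaps hotword
            if not corrected or corrected[-1][0] != w:
                corrected.append((w, h_start, h_end))  # substitute hotword once
        else:
            corrected.append((word, start, end))
    return corrected

# e.g. "hot" + "work" mis-recognized across 0.4s-1.1s becomes "hotword"
print(apply_hotword([("beijing", 0.0, 0.4), ("hot", 0.4, 0.7), ("work", 0.7, 1.1)],
                    ("hotword", (0.4, 1.1), 0.9)))
# → [('beijing', 0.0, 0.4), ('hotword', 0.4, 1.1)]
```

Note that when the hotword span only partially covers a neighboring word, the boundary repair described in step 4 (and detailed in claim 6) is still needed afterwards.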
2. The hotword recognition method of claim 1, wherein between step 1 and step 2, a step 1.5 is further included: if there exist speech recognition results Wi ~ Wj, where i < j and i, j are natural numbers, whose confidence is below a given threshold, extracting the audio segment corresponding to Wi ~ Wj and executing step 2 on that segment.
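Step 1.5 of claim 2 amounts to locating a run of consecutive low-confidence words and handing their audio span to the hotword detection engine. A minimal sketch, assuming a hypothetical (word, start, end, confidence) layout:

```python
# Sketch of step 1.5: find consecutive words Wi ~ Wj whose confidence falls
# below a threshold and return their audio span to re-search for hotwords.
# The (word, start, end, confidence) layout is an assumption.

def low_confidence_span(words, threshold=0.5):
    for i, (_, _, _, conf) in enumerate(words):
        if conf < threshold:
            j = i
            while j + 1 < len(words) and words[j + 1][3] < threshold:
                j += 1
            return (words[i][1], words[j][2])  # (segment start, segment end)
    return None  # every word is confident enough; skip hotword retrieval
```

Restricting hotword retrieval to such spans keeps the detection engine from re-scoring audio the general engine already recognized confidently.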
3. The hotword recognition method of claim 1, wherein: step 1 and step 2 are performed synchronously.
4. The hotword recognition method of claim 1, wherein: the step 2 specifically comprises the following steps:
step 2-1, adding a filler word according to the hotword list, wherein the filler word is connected with all the acoustic modeling units to construct a parallel grammar recognition network;
step 2-2, adopting a beam-search Viterbi algorithm to perform a decoding search on the extracted input speech segment;
step 2-3, backtracking to obtain the hotword with the highest score and the audio position corresponding to the hotword;
and step 2-4, calculating the average posterior probability of the speech frames corresponding to the hotword, and outputting it as the hotword's score.
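Step 2-4's score is the average posterior probability over the hotword's aligned frames; a plain arithmetic mean is assumed here, since the claim does not specify the averaging scheme:

```python
# Sketch of step 2-4: average the per-frame posterior probabilities of the
# hotword's acoustic units and output the mean as the hotword score.
# Averaging method (arithmetic mean) is an assumption, not from the patent.

def hotword_score(frame_posteriors):
    """frame_posteriors: posterior probabilities for the frames that the
    backtracking step (2-3) aligned to the hotword."""
    if not frame_posteriors:
        return 0.0
    return sum(frame_posteriors) / len(frame_posteriors)
```

This score is what step 3 of claim 1 compares against the given threshold before substituting the hotword.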
5. The hotword recognition method of claim 4, wherein in step 2, the grammar recognition network adopts the posterior probability scores output by the acoustic model of the general recognition engine.
6. The hotword recognition method of claim 1, wherein in step 4, the overlap between the hotword position and a word in the current recognition result includes overlap at the starting position and overlap at the ending position;
when the hotword position overlaps a word at the starting position, step 4 specifically comprises the following steps:
step 4-1, determining the word located at the hotword's starting position in the recognition result, and calculating the position difference between that word's starting position and the hotword's starting position;
step 4-2, if the position difference is larger than the duration of one word, selecting from the word list, as candidate words, words whose pronunciation is similar to the part of the word that does not overlap the hotword;
step 4-3, adopting a pre-trained language model, predicting the probability of each candidate word given the words from the first word of the sentence up to the word preceding the current word and the words following the current word, and taking that probability as the candidate word's score;
step 4-4, if the score of the highest-scoring candidate word is larger than a given threshold, replacing the current word with that candidate word; otherwise, keeping the current word unchanged;
when the hotword position overlaps a word at the ending position, step 4 specifically comprises the following steps:
step 4.1, determining the word located at the hotword's ending position in the recognition result, and calculating the position difference between that word's ending position and the hotword's ending position;
step 4.2, if the position difference is larger than the duration of one word, selecting from the word list, as candidate words, words whose pronunciation is similar to the part of the word that does not overlap the hotword;
step 4.3, adopting a pre-trained language model, predicting the probability of each candidate word given the words from the end of the sentence back to the word following the current word and the words preceding the current word, and taking that probability as the candidate word's score;
step 4.4, if the score of the highest-scoring candidate word is larger than a given threshold, replacing the current word with that candidate word; otherwise, keeping the current word unchanged.
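Steps 4-2 through 4-4 (and their mirror-image steps 4.2 through 4.4) reduce to scoring each pronunciation-similar candidate with a language model conditioned on the surrounding words and keeping the current word unless the best candidate clears a threshold. In this sketch `lm_prob` and its toy probability table are purely illustrative stand-ins for the pre-trained language model:

```python
# Sketch of steps 4-2 to 4-4: pick the best LM-scored candidate for the
# current word, or keep the current word if no candidate clears the threshold.
# TOY_LM is an illustrative stand-in for a pre-trained language model.

TOY_LM = {
    ("went", "to", "beijing"): 0.05,
    ("went", "two", "beijing"): 0.001,
}

def lm_prob(left, candidate, right):
    """Probability of `candidate` given its neighboring context words."""
    prev = left[-1] if left else "<s>"
    nxt = right[0] if right else "</s>"
    return TOY_LM.get((prev, candidate, nxt), 1e-6)

def correct_word(left, current, right, candidates, threshold=0.01):
    best_score, best = max((lm_prob(left, c, right), c) for c in candidates)
    return best if best_score > threshold else current  # step 4-4 decision

print(correct_word(["i", "went"], "two", ["beijing"], ["to", "two"]))  # → to
```

A real implementation would condition on the full left (or right) sentence context as the claim states; the single-neighbor lookup here only keeps the sketch small.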
7. The hotword recognition method of claim 6, wherein between step 4-3 and step 4-4, and between step 4.3 and step 4.4, the following step is further included, respectively: adding acoustic confidence information for each word, and predicting the probability of occurrence of each candidate word of the current word as that candidate word's score.
8. The hotword recognition method of claim 1, wherein the hotword detection engine corresponds to a user ID; when a hotword is added, the hotword and the user ID are uploaded together, and a pronunciation dictionary is queried to obtain the hotword's pronunciation and the corresponding phoneme sequence; the hotword is then added to the grammar network, hotword detection resources are generated, and the hotword is added to the hotword detection engine corresponding to the user ID.
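Claim 8's per-user registration flow can be sketched as below; the pronunciation dictionary and the engine registry keyed by user ID are hypothetical stand-ins, not the patent's actual data structures:

```python
# Sketch of claim 8: register a hotword for a specific user by looking up
# its phoneme sequence and adding it to that user's detection resources.
# PRONUNCIATION_DICT and `engines` are illustrative, not the patent's API.

PRONUNCIATION_DICT = {"hotword": ["HH", "AA", "T", "W", "ER", "D"]}
engines = {}  # user ID -> list of (hotword, phoneme sequence) entries

def add_hotword(user_id, hotword):
    phonemes = PRONUNCIATION_DICT.get(hotword)
    if phonemes is None:
        raise KeyError(f"no pronunciation found for {hotword!r}")
    # add the hotword to this user's grammar network / detection resources
    engines.setdefault(user_id, []).append((hotword, phonemes))
    return phonemes
```

Keying the detection resources by user ID is what lets each user carry a personal hotword list without affecting the shared general recognition engine.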
9. A hotword recognition system, comprising:
a general speech recognition engine configured to receive user audio and output a speech recognition result together with each word's time position in the audio and confidence, the speech recognition result being represented as W1, W2, ..., Wn, where n is a natural number;
a hotword detection engine configured to receive the user audio, detect whether a hotword exists, and output the hotword W with the highest score, the audio position P corresponding to the hotword, and the score S, represented as (W, P, S);
a hotword result correction module configured to judge the score S of the highest-scoring hotword (W, P, S), and if the score S is larger than a given threshold, replace the word at the corresponding audio position in the speech recognition result output by the general speech recognition engine with the hotword W;
and a language model result correction module configured to correct the words before and after the hotword when the hotword position overlaps with a word in the current recognition result.
10. The hotword recognition system of claim 9, further comprising a hotword addition module configured to add or update hotwords in the hotword detection engine.
11. A hotword recognition device, comprising a processor, a memory, and a program;
wherein the program is stored in the memory, and the processor invokes the program stored in the memory to perform the hotword recognition method of claim 1.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store a program configured to execute the hotword recognition method of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910706314.6A CN110415705B (en) | 2019-08-01 | 2019-08-01 | Hot word recognition method, system, device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110415705A CN110415705A (en) | 2019-11-05 |
CN110415705B true CN110415705B (en) | 2022-03-01 |
Family
ID=68365126
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910706314.6A Active CN110415705B (en) | 2019-08-01 | 2019-08-01 | Hot word recognition method, system, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110415705B (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110689881B (en) * | 2018-06-20 | 2022-07-12 | 深圳市北科瑞声科技股份有限公司 | Speech recognition method, speech recognition device, computer equipment and storage medium |
CN111090720B (en) * | 2019-11-22 | 2023-09-12 | 北京捷通华声科技股份有限公司 | Hot word adding method and device |
CN110879839A (en) * | 2019-11-27 | 2020-03-13 | 北京声智科技有限公司 | Hot word recognition method, device and system |
CN111028830B (en) * | 2019-12-26 | 2022-07-15 | 大众问问(北京)信息科技有限公司 | Local hot word bank updating method, device and equipment |
CN111161739B (en) * | 2019-12-28 | 2023-01-17 | 科大讯飞股份有限公司 | Speech recognition method and related product |
CN113178194B (en) * | 2020-01-08 | 2024-03-22 | 上海依图信息技术有限公司 | Voice recognition method and system for interactive hotword updating |
CN111583909B (en) * | 2020-05-18 | 2024-04-12 | 科大讯飞股份有限公司 | Voice recognition method, device, equipment and storage medium |
CN112599114B (en) * | 2020-11-11 | 2024-06-18 | 联想(北京)有限公司 | Voice recognition method and device |
CN112349278A (en) * | 2020-11-12 | 2021-02-09 | 苏州思必驰信息科技有限公司 | Local hot word training and recognition method and device |
CN112489651B (en) * | 2020-11-30 | 2023-02-17 | 科大讯飞股份有限公司 | Voice recognition method, electronic device and storage device |
CN112735428A (en) * | 2020-12-27 | 2021-04-30 | 科大讯飞(上海)科技有限公司 | Hot word acquisition method, voice recognition method and related equipment |
CN113836270A (en) * | 2021-09-28 | 2021-12-24 | 深圳格隆汇信息科技有限公司 | Big data processing method and related product |
CN114185511A (en) * | 2021-11-29 | 2022-03-15 | 北京百度网讯科技有限公司 | Audio data processing method and device and electronic equipment |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5559925A (en) * | 1994-06-24 | 1996-09-24 | Apple Computer, Inc. | Determining the useability of input signals in a data recognition system |
CN101510222A (en) * | 2009-02-20 | 2009-08-19 | 北京大学 | Multilayer index voice document searching method and system thereof |
US20160104480A1 (en) * | 2014-10-09 | 2016-04-14 | Google Inc. | Hotword detection on multiple devices |
CN106782607A (en) * | 2012-07-03 | 2017-05-31 | 谷歌公司 | Determine hot word grade of fit |
US20180182390A1 (en) * | 2016-12-27 | 2018-06-28 | Google Inc. | Contextual hotwords |
US20180330717A1 (en) * | 2017-05-11 | 2018-11-15 | International Business Machines Corporation | Speech recognition by selecting and refining hot words |
CN108984529A (en) * | 2018-07-16 | 2018-12-11 | 北京华宇信息技术有限公司 | Real-time court's trial speech recognition automatic error correction method, storage medium and computing device |
CN109271495A (en) * | 2018-08-14 | 2019-01-25 | 阿里巴巴集团控股有限公司 | Question and answer recognition effect detection method, device, equipment and readable storage medium storing program for executing |
CN109523991A (en) * | 2017-09-15 | 2019-03-26 | 阿里巴巴集团控股有限公司 | Method and device, the equipment of speech recognition |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013000136A1 (en) * | 2011-06-29 | 2013-01-03 | 宇龙计算机通信科技(深圳)有限公司 | Mobile terminal and method, system for inputting network hot words into mobile terminal |
US9263042B1 (en) * | 2014-07-25 | 2016-02-16 | Google Inc. | Providing pre-computed hotword models |
CN106326484A (en) * | 2016-08-31 | 2017-01-11 | 北京奇艺世纪科技有限公司 | Error correction method and device for search terms |
US10134396B2 (en) * | 2016-12-07 | 2018-11-20 | Google Llc | Preventing of audio attacks |
CN110689881B (en) * | 2018-06-20 | 2022-07-12 | 深圳市北科瑞声科技股份有限公司 | Speech recognition method, speech recognition device, computer equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
Heuristic Chinese sentence compression algorithm based on word popularity; Han Jing et al.; Computer Engineering and Applications (《计算机工程与应用》); 2014-12-31; pp. 132-139 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110415705B (en) | Hot word recognition method, system, device and storage medium | |
US10937448B2 (en) | Voice activity detection method and apparatus | |
CN111797632B (en) | Information processing method and device and electronic equipment | |
KR20220035222A (en) | Speech recognition error correction method, related devices, and readable storage medium | |
CN105632499B (en) | Method and apparatus for optimizing speech recognition results | |
CN107844481B (en) | Text recognition error detection method and device | |
CN111429887B (en) | Speech keyword recognition method, device and equipment based on end-to-end | |
CN109559735B (en) | Voice recognition method, terminal equipment and medium based on neural network | |
CN112257437B (en) | Speech recognition error correction method, device, electronic equipment and storage medium | |
CN110503943B (en) | Voice interaction method and voice interaction system | |
CN114999463B (en) | Voice recognition method, device, equipment and medium | |
CN110751234A (en) | OCR recognition error correction method, device and equipment | |
CN111862963B (en) | Voice wakeup method, device and equipment | |
CN111128174A (en) | Voice information processing method, device, equipment and medium | |
CN110956958A (en) | Searching method, searching device, terminal equipment and storage medium | |
US10468031B2 (en) | Diarization driven by meta-information identified in discussion content | |
US10553205B2 (en) | Speech recognition device, speech recognition method, and computer program product | |
US20180158456A1 (en) | Speech recognition device and method thereof | |
CN115858776B (en) | Variant text classification recognition method, system, storage medium and electronic equipment | |
CN114255761A (en) | Speech recognition method, apparatus, device, storage medium and computer program product | |
JP6527000B2 (en) | Pronunciation error detection device, method and program | |
CN113838456A (en) | Phoneme extraction method, voice recognition method, device, equipment and storage medium | |
CN111883109A (en) | Voice information processing and verification model training method, device, equipment and medium | |
CN111048098B (en) | Voice correction system and voice correction method | |
CN111785259A (en) | Information processing method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||