CN112102812B - Anti-false wake-up method based on multiple acoustic models - Google Patents

Anti-false wake-up method based on multiple acoustic models

Info

Publication number
CN112102812B
CN112102812B
Authority
CN
China
Prior art keywords
voice recognition
firmware
model
language
acoustic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011300127.7A
Other languages
Chinese (zh)
Other versions
CN112102812A (en)
Inventor
舒畅
何云鹏
许兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chipintelli Technology Co Ltd
Original Assignee
Chipintelli Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chipintelli Technology Co Ltd filed Critical Chipintelli Technology Co Ltd
Priority to CN202011300127.7A priority Critical patent/CN112102812B/en
Publication of CN112102812A publication Critical patent/CN112102812A/en
Application granted granted Critical
Publication of CN112102812B publication Critical patent/CN112102812B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Abstract

An anti-false-wake-up method based on multiple acoustic models and a voice recognition module are provided. The method comprises the following steps: S1, selecting the corpora required to train acoustic models of several different languages; S2, processing the corpora; S3, training each acoustic model with its corpus; S4, packaging each trained acoustic model with its language model to form speech recognition firmware for the different languages and burning the firmware into a voice recognition module; and S5, the voice recognition module feeds the audio to be recognized into all of the speech recognition firmware simultaneously, and only when every firmware recognizes a command word does the module accept the command word and execute the command. By training speech recognition firmware for two different languages and running them simultaneously, the invention effectively prevents non-command words from being misrecognized as command words because of similar pronunciation, without affecting normal recognition of command words.

Description

Anti-false wake-up method based on multiple acoustic models
Technical Field
The invention belongs to the technical field of speech recognition, and particularly relates to an anti-false-wake-up method based on multiple acoustic models.
Background
With the growing need for human-computer interaction, speech-recognition applications are becoming more common in daily life. As speech interaction matures, users increasingly expect both comfortable interaction and accurate recognition. Speech recognition is achieved by matching a language model with an acoustic model.
For example, with an English model, because only part of the English corpus is labeled as garbage words during training, in a Chinese environment some Chinese words whose pronunciation resembles an English command word are misrecognized by the English model as that command word, and recognition is triggered incorrectly. The prior-art solution is to expand the garbage lexicon used to train the English acoustic model. This can reduce false wake-ups of the English model in an English environment and improve its misrecognition in a Chinese environment, but it still does not solve the problem effectively.
Disclosure of Invention
In order to overcome the technical defects in the prior art, the invention discloses an anti-false-wake-up method based on multiple acoustic models.
The invention discloses an anti-false-wake-up method based on multiple acoustic models, comprising the following steps:
S1, selecting, for each of several different languages, the corpora required to train that language's acoustic model; the corpora for every acoustic model contain the command-word corpus;
S2, processing the corpora so that the command words have the same or similar pronunciation under all of the language models;
S3, training each acoustic model with its processed corpus;
S4, packaging each trained acoustic model with its language model to form speech recognition firmware for the different languages, and burning the firmware into a voice recognition module;
S5, during recognition, the voice recognition module feeds the audio to be recognized into all of the speech recognition firmware simultaneously; only when every firmware recognizes the command word does the module accept it as a command word and execute the command;
if the firmware do not all recognize the command word simultaneously, the audio is not treated as a command word;
the language models and the speech recognition firmware differ in that each targets a different language; in step S2, the pronunciation of the command word in all of the language models is defined by the pronunciation rules of a single language, which is any one of the different languages of step S1.
Specifically, for two different languages, the method comprises the following steps:
S11, selecting the corpora required to train the first and second acoustic models respectively; the corpora for both acoustic models contain the command-word corpus;
S21, processing the corpora so that the command words have the same or similar pronunciation under the first and second language models;
S31, training the first and second acoustic models with the two processed corpora respectively;
S41, packaging each trained acoustic model with its language model to form first and second speech recognition firmware, and burning both into a voice recognition module;
S51, during recognition, the voice recognition module feeds the audio to be recognized into the first and second speech recognition firmware simultaneously; only when both firmware recognize the command word does the module accept it as a command word and execute the command;
if the firmware do not both recognize the command word simultaneously, the audio is not treated as a command word;
the first acoustic model, first language model, and first speech recognition firmware target one of the two different languages; the second acoustic model, second language model, and second speech recognition firmware target the other.
Preferably, the corpora required for training the first and second acoustic models in step S11 are different.
Preferably, the training in step S3 is performed with the Kaldi toolkit.
The invention also discloses a voice recognition module based on multiple acoustic models, comprising speech recognition firmware for different languages; in every firmware, each command word is given a pronunciation and marked as a command word. The voice recognition module further comprises a command-word judgment module, whose judgment method is: if all of the speech recognition firmware recognize the command word, it is judged to be a command word; otherwise it is not considered a command word.
With the anti-false-wake-up method and voice recognition module based on multiple acoustic models, speech recognition is performed simultaneously by speech recognition firmware trained for two different languages. This effectively prevents non-command words from being misrecognized as command words because of similar pronunciation, without affecting normal recognition of command words.
Drawings
FIG. 1 is a diagram illustrating an embodiment of a speech recognition firmware according to the present invention;
FIG. 2 is a diagram illustrating an embodiment of speech signal recognition according to the present invention.
Detailed Description
The following provides a more detailed description of the present invention.
The invention discloses an anti-false-wake-up method based on multiple acoustic models, comprising the following steps:
S1, selecting, for each of several different languages, the corpora required to train that language's acoustic model; the corpora for every acoustic model contain the command-word corpus;
S2, processing the corpora so that the command words have the same or similar pronunciation under all of the language models;
S3, training each acoustic model with its processed corpus;
S4, packaging each trained acoustic model with its language model to form speech recognition firmware for the different languages, and burning the firmware into a voice recognition module;
S5, during recognition, the voice recognition module feeds the audio to be recognized into all of the speech recognition firmware simultaneously; only when every firmware recognizes the command word does the module accept it as a command word and execute the command;
if the firmware do not all recognize the command word simultaneously, the audio is not treated as a command word.
In the simplest form, for two languages, the method includes the following steps.
S11, selecting the corpora required to train the first and second acoustic models respectively; the corpora for both acoustic models contain the command-word corpus;
S21, processing the corpora so that the command words have the same or similar pronunciation under the first and second language models;
S31, training the first and second acoustic models with the two processed corpora respectively;
S41, packaging each trained acoustic model with its language model to form first and second speech recognition firmware, and burning both into a voice recognition module;
the first acoustic model, first language model, and first speech recognition firmware target one of the two different languages; the second acoustic model, second language model, and second speech recognition firmware target the other;
as shown in fig. 1, the above steps complete the construction of the voice recognition module, and step S51 completes the specific recognition process.
In the specific recognition process of step S51, the voice recognition module feeds the audio to be recognized into the first and second speech recognition firmware simultaneously; only when both firmware recognize the command word does the module accept it as a command word and execute the command;
if the firmware do not both recognize the command word simultaneously, the audio is not treated as a command word.
Taking the two most common languages, Chinese and English, as an example: the language of the first acoustic model, first language model, and first speech recognition firmware is Chinese; the language of the second acoustic model, second language model, and second speech recognition firmware is English.
Other languages may also be selected, and the number of languages may be increased: as long as there are two or more languages containing words with similar pronunciation to some meaningful word, recognition can be performed by reference to the invention.
In step S11, the corpora required to train the English and Chinese acoustic models are selected according to the application context;
the corpora for training the English and Chinese acoustic models may be the same, but are preferably different: training with different corpora expands the garbage lexicon and better reduces the false-wake-up rate;
the corpora used to train the different models all include the command-word corpus.
In step S21, the corpora are processed, and the corresponding Chinese and English language models are generated from the Chinese and English lexicons, such that the pronunciation of each command word is the same or similar under both language models;
for example, for the English command word START, the pronunciation of the corresponding Chinese command word is marked in pinyin as SI DA TE in the Chinese acoustic model; tone marks may be added, the three syllables taking the first tone, fourth tone, and neutral tone respectively.
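Cross-language pronunciation entries of this kind might be organized as in the following sketch. The dictionary layout, the English phone symbols, and the function name are illustrative assumptions, not part of the patent.

```python
# Hypothetical cross-language lexicon for the command word START.
# In the Chinese entry, the English pronunciation is transliterated
# into pinyin syllables with tone numbers (1st, 4th, neutral), as
# described in the text above.
lexicon = {
    "en": {"START": ["S", "T", "AA1", "R", "T"]},  # illustrative English phones
    "zh": {"START": ["SI1", "DA4", "TE5"]},        # pinyin with tone numbers
}

def pronunciations(word):
    # Collect the word's pronunciation under every language model
    # (None where the word has no entry in that language).
    return {lang: entries.get(word) for lang, entries in lexicon.items()}
```

A non-command word such as a common Chinese word would have an entry only in the Chinese lexicon, so its English pronunciation would come back as None, which is what lets the dual-firmware check reject it.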
In step S31, the English and Chinese acoustic models are trained with the processed corpora;
the training in step S3 or S31 may be performed with Kaldi, which trains the required English and Chinese models. Kaldi is a completely free speech recognition toolkit, written mainly in C++ with Shell, Python, and Perl as glue scripts, and can train models quickly.
In step S41, each trained acoustic model is packaged with its language model to form the first and second speech recognition firmware, which are burned into the voice recognition module.
Through this process, a voice recognition module based on dual acoustic models is obtained, containing the first and second speech recognition firmware. The module further comprises a command-word judgment module, whose judgment method is: if both the first and second speech recognition firmware recognize the command word, it is judged to be a command word; otherwise it is not considered a command word. The specific judgment logic is shown in fig. 2.
In the specific recognition process, the voice recognition module feeds the audio to be recognized into the Chinese and English speech recognition firmware simultaneously;
when the Chinese and English firmware recognize a command word at the same time, the command-word judgment module of the voice recognition module accepts the command word and the command is executed;
if the word is not recognized simultaneously (neither firmware recognizes it, or only one firmware recognizes it and the other does not), it is not considered a command word.
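A minimal sketch of this judgment rule, with the per-firmware outputs simulated as plain values; the names COMMAND_WORDS and judge are illustrative, not from the patent.

```python
# Each firmware reports either the command word it recognized or None.
COMMAND_WORDS = {"WARMER", "START"}

def judge(results):
    # Accept only when every firmware recognized the SAME command word;
    # if neither recognizes it, or only one does, it is not a command.
    first = results[0]
    if first in COMMAND_WORDS and all(r == first for r in results):
        return first
    return None

# User says WARMER: both the Chinese and English firmware recognize it.
print(judge(["WARMER", "WARMER"]))  # WARMER
# User says the Chinese word for "us": only the English firmware
# misrecognizes it as WARMER, so the word is rejected.
print(judge([None, "WARMER"]))      # None
```

The same AND-gate extends unchanged to more than two languages: the list simply gains one entry per firmware.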
The invention mainly addresses misrecognition caused by Chinese words whose pronunciation resembles an English command word, or English words whose pronunciation resembles a Chinese command word.
For example, the English command word WARMER, used as a heating command, sounds similar to the common Chinese word for "us" (pinyin WO MEN); when a user says "us", the English speech recognition firmware may easily misrecognize it as the command word WARMER.
During training, WARMER is defined as a command word and its pronunciation is stored in the English language model; in the Chinese language model, the English word WARMER is also assigned a similar pronunciation under Chinese pronunciation rules, such as WO MEN ER, so that it approximates the English pronunciation of WARMER.
"Us", by contrast, is not a command word: it has a pronunciation in the Chinese language model, but is assigned no similar pronunciation in the English language model.
When the user utters WARMER, both the Chinese and English language models contain the same or similar pronunciation, so the Chinese and English firmware recognize it simultaneously and it is judged to be a command word.
When the user utters "us", although the English firmware may misrecognize it as the command word WARMER, the Chinese firmware recognizes that it is not a command word. Since "us" is originally a Chinese pronunciation, the Chinese firmware recognizes it more easily and more accurately; once the Chinese firmware determines it is not a command word, the module does not judge it a command word, or defines it as a garbage word.
In the invention, the corpora used to train the different language models are not completely the same: only the command-word portion is identical, while the non-command-word (garbage-word) portions differ. For example, some words may not be covered during training in an English environment, or may simply never occur there; when the model is then placed in a Chinese environment, the pronunciation of some non-command words resembles a command word, causing false wake-ups. With two language models recognizing simultaneously, the words falsely woken in the Chinese environment can be defined as garbage words, and the Chinese model's judgment can differ from the English model's under the two language models, avoiding the false wake-ups that occur with single-model recognition.
The invention discloses a voice recognition module based on multiple acoustic models, comprising speech recognition firmware for different languages formed by the above method; in every firmware, each command word is given a pronunciation and marked as a command word. The module further comprises a command-word judgment module for judging command words. The judgment module is a software module that collects the recognition results of all of the speech recognition firmware: if every firmware recognizes the current speech signal as a command word, the judgment module considers it a command word; otherwise it is not considered a command word.
With the anti-false-wake-up method and voice recognition module based on multiple acoustic models, speech recognition is performed simultaneously by speech recognition firmware trained for two different languages. This effectively prevents non-command words from being misrecognized as command words because of similar pronunciation, without affecting normal recognition of command words.
The foregoing describes preferred embodiments of the present invention. Provided they are not contradictory, the preferred embodiments may be combined arbitrarily. The specific parameters in the embodiments and examples serve only to illustrate the inventors' verification process and are not intended to limit the scope of the invention, which is defined by the claims; equivalent structural changes made using the description and drawings of the present invention likewise fall within the scope of the invention.

Claims (4)

1. An anti-false-wake-up method based on multiple acoustic models, characterized by comprising the following steps:
s1, respectively selecting language corpora required by training a plurality of acoustic models of different languages; wherein, the language corpora corresponding to the training of different acoustic models all contain command word corpora;
s2, processing the material, wherein pronunciations of command words corresponding to the command word material are the same or similar under a plurality of language models;
s3, training the corresponding acoustic models by using the processed corpora respectively;
s4, packaging the trained acoustic model and the language model to form voice recognition firmware of different languages and burning the voice recognition firmware into a voice recognition module;
s5, in the specific recognition process, the voice recognition module simultaneously inputs the audio to be recognized into a plurality of voice recognition firmware, and when the plurality of voice recognition firmware are recognized as command words simultaneously, the voice recognition module judges the command words and executes commands;
if the command words are not recognized at the same time, the command words are not considered;
the differences between the plurality of language models and the plurality of speech recognition firmware are all for different languages;
in the step S2, the pronunciation of the command word in the plurality of language models is defined by pronunciation rules of the same language, which is any one of different languages in the step S1.
2. The multi-acoustic-model-based anti-false-wake-up method according to claim 1, wherein, for two different languages, the method specifically comprises the following steps:
s11, selecting corpora required by training the first acoustic model and the second acoustic model respectively; wherein, the language corpora corresponding to the training of different acoustic models all contain command word corpora;
s21, processing the linguistic data, wherein pronunciations of the command words under the first language model and the second language model are the same or similar;
s31, training a first acoustic model and a second acoustic model respectively by utilizing the processed two language corpora;
s41, packaging the trained acoustic model and the language model to form a first voice recognition firmware and a second voice recognition firmware, and burning the first voice recognition firmware and the second voice recognition firmware into a voice recognition module;
s51, in the specific recognition process, the voice recognition module simultaneously inputs the audio to be recognized into a first voice recognition firmware and a second voice recognition firmware, and when the first voice recognition firmware and the second voice recognition firmware are recognized as command words simultaneously, the voice recognition module judges the command words and executes commands;
if the command words are not recognized at the same time, the command words are not considered;
the first acoustic model, the first language model and the first speech recognition firmware are specific to one of two different languages, and the second acoustic model, the second language model and the second speech recognition firmware are specific to the other of the two different languages.
3. The multi-acoustic-model-based anti-false-wake-up method according to claim 2, wherein the corpora required to train the first and second acoustic models in step S11 are different.
4. The multi-acoustic-model-based anti-false-wake-up method according to claim 1, wherein the training in step S3 is performed with Kaldi.
CN202011300127.7A 2020-11-19 2020-11-19 Anti-false wake-up method based on multiple acoustic models Active CN112102812B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011300127.7A CN112102812B (en) 2020-11-19 2020-11-19 Anti-false wake-up method based on multiple acoustic models

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011300127.7A CN112102812B (en) 2020-11-19 2020-11-19 Anti-false wake-up method based on multiple acoustic models

Publications (2)

Publication Number Publication Date
CN112102812A CN112102812A (en) 2020-12-18
CN112102812B true CN112102812B (en) 2021-02-05

Family

ID=73785894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011300127.7A Active CN112102812B (en) 2020-11-19 2020-11-19 Anti-false wake-up method based on multiple acoustic models

Country Status (1)

Country Link
CN (1) CN112102812B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130238336A1 (en) * 2012-03-08 2013-09-12 Google Inc. Recognizing speech in multiple languages
WO2014197303A1 (en) * 2013-06-07 2014-12-11 Microsoft Corporation Language model adaptation using result selection
US20200160838A1 (en) * 2018-11-21 2020-05-21 Samsung Electronics Co., Ltd. Speech recognition method and apparatus
CN111292750A (en) * 2020-03-09 2020-06-16 成都启英泰伦科技有限公司 Local voice recognition method based on cloud improvement
CN111710337A (en) * 2020-06-16 2020-09-25 睿云联(厦门)网络通讯技术有限公司 Voice data processing method and device, computer readable medium and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10923117B2 (en) * 2019-02-19 2021-02-16 Tencent America LLC Best path change rate for unsupervised language model weight selection
CN111933129B (en) * 2020-09-11 2021-01-05 腾讯科技(深圳)有限公司 Audio processing method, language model training method and device and computer equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130238336A1 (en) * 2012-03-08 2013-09-12 Google Inc. Recognizing speech in multiple languages
WO2014197303A1 (en) * 2013-06-07 2014-12-11 Microsoft Corporation Language model adaptation using result selection
US20200160838A1 (en) * 2018-11-21 2020-05-21 Samsung Electronics Co., Ltd. Speech recognition method and apparatus
CN111292728A (en) * 2018-11-21 2020-06-16 三星电子株式会社 Speech recognition method and apparatus
CN111292750A (en) * 2020-03-09 2020-06-16 成都启英泰伦科技有限公司 Local voice recognition method based on cloud improvement
CN111710337A (en) * 2020-06-16 2020-09-25 睿云联(厦门)网络通讯技术有限公司 Voice data processing method and device, computer readable medium and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Dual Language Models for Code Switched Speech Recognition";Saurabh Garg 等;《https://arxiv.org/abs/1711.01048》;20180803;全文 *
"Multilingual Speech Recognition With A Single End-To-End Model";Shubham Toshniwal 等;《https://arxiv.org/abs/1711.01694》;20180215;全文 *
"语音识别准确率与检索性能的关联性研究";周梁 等;《第二届全国信息检索与内容安全学术会议(NCIRCS-2005)论文集》;20051031;全文 *

Also Published As

Publication number Publication date
CN112102812A (en) 2020-12-18

Similar Documents

Publication Publication Date Title
WO2017071182A1 (en) Voice wakeup method, apparatus and system
US10319250B2 (en) Pronunciation guided by automatic speech recognition
US5983186A (en) Voice-activated interactive speech recognition device and method
JP5330450B2 (en) Topic-specific models for text formatting and speech recognition
US5787230A (en) System and method of intelligent Mandarin speech input for Chinese computers
US9613621B2 (en) Speech recognition method and electronic apparatus
CN103578467B (en) Acoustic model building method, voice recognition method and electronic device
EP1430474B1 (en) Correcting a text recognized by speech recognition through comparison of phonetic sequences in the recognized text with a phonetic transcription of a manually input correction word
JP4680714B2 (en) Speech recognition apparatus and speech recognition method
CN104575490A (en) Spoken language pronunciation detecting and evaluating method based on deep neural network posterior probability algorithm
JP2011002656A (en) Device for detection of voice recognition result correction candidate, voice transcribing support device, method, and program
Howell et al. Development of a two-stage procedure for the automatic recognition of dysfluencies in the speech of children who stutter: I. Psychometric procedures appropriate for selection of training material for lexical dysfluency classifiers
US11263198B2 (en) System and method for detection and correction of a query
JP2002132287A (en) Speech recording method and speech recorder as well as memory medium
JP2016521383A (en) Method, apparatus and computer readable recording medium for improving a set of at least one semantic unit
US20020184016A1 (en) Method of speech recognition using empirically determined word candidates
CN100578613C (en) System and method for speech recognition utilizing a merged dictionary
CN112102812B (en) Anti-false wake-up method based on multiple acoustic models
CN110265018B (en) Method for recognizing continuously-sent repeated command words
JP2000250593A (en) Device and method for speaker recognition
CN110808050A (en) Voice recognition method and intelligent equipment
EP3790000A1 (en) System and method for detection and correction of a speech query
JP2009031328A (en) Speech recognition device
JP2001318915A (en) Font conversion device
JP2000099070A (en) Voice recognition device and method therefore, computer readable memory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant