CN111429912B - Keyword detection method, system, mobile terminal and storage medium - Google Patents


Info

Publication number
CN111429912B
CN111429912B (application CN202010184549.6A)
Authority
CN
China
Prior art keywords
model
keyword
training
acoustic
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010184549.6A
Other languages
Chinese (zh)
Other versions
CN111429912A (en)
Inventor
徐敏
肖龙源
李稀敏
蔡振华
刘晓葳
谭玉坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202010184549.6A priority Critical patent/CN111429912B/en
Publication of CN111429912A publication Critical patent/CN111429912A/en
Application granted granted Critical
Publication of CN111429912B publication Critical patent/CN111429912B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/61 Information retrieval of audio data; Indexing; Data structures therefor; Storage structures
    • G06N20/00 Machine learning
    • G10L15/04 Speech recognition; Segmentation; Word boundary detection
    • G10L15/063 Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/08 Speech classification or search
    • G10L2015/0631 Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a keyword detection method, a system, a mobile terminal and a storage medium, wherein the method comprises the following steps: obtaining a text corpus and a transcription text and performing model training on a language model; performing model training on a chain model according to acoustic features in a training set, and combining the chain model with the language model to obtain a speech recognition model; inputting a speech segment to be detected into the speech recognition model for analysis to obtain a word lattice, and building an inverted index of the word lattice; converting the index result into a factor transducer, and inputting preset keywords into the factor transducer for retrieval to obtain a keyword retrieval result; and calculating the occurrence probability of each preset keyword according to the keyword retrieval result, and judging that a preset keyword appears in the speech segment to be detected when its occurrence probability is greater than a probability threshold. By controlling the speech recognition model to decode the speech segment to be detected into a word lattice, the method avoids keyword detection errors caused by speech recognition errors and improves the accuracy of keyword detection.

Description

Keyword detection method, system, mobile terminal and storage medium
Technical Field
The invention belongs to the technical field of keyword detection, and particularly relates to a keyword detection method, a keyword detection system, a mobile terminal and a storage medium.
Background
Keyword detection is a technology for detecting keywords in continuous speech of interest, and has important applications in fields such as smart homes, telephone monitoring and speech data mining. Keyword detection has been studied for over 40 years, but keyword detection under low-resource, low-power and low-computational-complexity conditions remains a research hotspot. By application scenario, keyword detection falls into two categories. In the first, the number of keywords is small and fixed, and a continuous speech stream is monitored for the presence of any keyword from a keyword list; a typical application is wake-word recognition in smart homes. In the second, the number of keywords is large and not fixed, but the speech to be detected already exists, and an algorithm locates the speech segments in which the keywords occur; a typical application is speech data mining.
However, existing keyword detection methods mainly extract a large number of speech features from targeted keyword data, normalize them, and feed them into a neural network for machine-learning model training. The resulting models have poor robustness: when the deployment scenario differs from the training scenario, the recognition rate drops sharply, which in turn reduces the accuracy of keyword detection.
Disclosure of Invention
The embodiment of the invention aims to provide a keyword detection method, a keyword detection system, a mobile terminal and a storage medium, and aims to solve the problem that the existing keyword detection method is low in detection accuracy.
The embodiment of the invention is realized in such a way that a keyword detection method comprises the following steps:
acquiring a text corpus and a transcription text corresponding to the text corpus in a training set, and performing model training on a language model according to the text corpus and the transcription text;
performing model training on a chain model according to acoustic features in the training set, and combining the chain model with the language model to obtain a speech recognition model;
inputting a speech segment to be detected into the speech recognition model for analysis to obtain a word lattice, and building an inverted index of the word lattice;
converting the index result into a factor transducer, and inputting preset keywords in a keyword table into the factor transducer for retrieval to obtain a keyword retrieval result;
and respectively calculating the occurrence probability of each preset keyword according to the keyword retrieval result, and judging that the preset keyword appears in the speech segment to be detected when the occurrence probability is greater than a probability threshold.
Further, the step of performing model training on the chain model according to the acoustic features in the training set includes:
training a monophone acoustic model according to the acoustic features, and performing difference processing on the acoustic features to obtain difference features;
performing triphone model training on the training set according to the difference features to obtain a triphone acoustic model, and aligning the phonemes according to the triphone acoustic model;
performing vector transformation on the acoustic features to obtain feature vectors, and training the triphone acoustic model according to the feature vectors;
and training the chain model according to the triphone acoustic model.
Further, after the step of training the monophone acoustic model according to the acoustic features, the method further includes:
constructing a pronunciation dictionary according to the text corpus and the transcription text, and controlling the monophone acoustic model, the language model and the pronunciation dictionary to decode a verification set to obtain a verification decoding result;
and querying model adjustment parameters according to the verification decoding result, and updating the parameters of the monophone acoustic model and the language model according to the model adjustment parameters.
Further, the occurrence probability of each preset keyword is calculated from the keyword retrieval result as:

ATWV(s) = N_correct(s) / N_true(s) - beta * N_spurious(s) / (T - N_true(s))

wherein s is the preset keyword to be calculated; N_true(s) is the actual number of occurrences of the preset keyword in the speech segment to be detected; N_correct(s) is the number of occurrences of the preset keyword correctly found in the keyword retrieval result; N_spurious(s) is the number of times the preset keyword is judged to occur although it does not occur in the speech segment to be detected, i.e. the number of false detections of the preset keyword; T is the total duration of the speech segment to be detected; beta is a parameter that trades off the false-detection rate against the missed-detection rate; and ATWV is the occurrence probability.
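As an illustration of this formula, the following is a minimal Python sketch of the per-keyword ATWV computation; the function name and the default beta value of 999.9 (the convention used in the NIST spoken-term-detection evaluation) are assumptions for illustration, not part of the patent.

```python
def atwv(n_true, n_correct, n_spurious, total_duration, beta=999.9):
    """Per-keyword actual term-weighted value (ATWV).

    n_true: actual occurrences of the keyword in the audio
    n_correct: occurrences correctly found by the search
    n_spurious: false detections
    total_duration: total audio duration in seconds (the number of
        false-alarm trials is approximated as total_duration - n_true)
    beta: trade-off between miss rate and false-alarm rate
    """
    p_miss = 1.0 - n_correct / n_true
    p_fa = n_spurious / (total_duration - n_true)
    return 1.0 - p_miss - beta * p_fa
```

A perfect result (every occurrence found, no false detections) yields an ATWV of 1.0; each false detection is heavily penalized through beta.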
Further, the keyword retrieval result stores, for each preset keyword, the ID of the speech segment, the start time, the end time, and the posterior probability of the keyword appearing in the speech segment to be detected.
Further, after the step of inputting the preset keywords in the keyword table into the factor transducer for retrieval to obtain the keyword retrieval result, the method further comprises:
sorting the preset keywords according to their posterior probabilities, and calculating the occurrence probability of each preset keyword in turn according to the sorted order.
Further, the step of training the monophone acoustic model according to the acoustic features comprises:
obtaining the usage frequency of the acoustic features, and sorting the acoustic features according to the usage frequency;
acquiring a locally pre-stored feature-count value, and selecting that number of the sorted acoustic features;
and training the monophone acoustic model according to the selected acoustic features.
Another object of an embodiment of the present invention is to provide a keyword detection system, including:
a language model training module, used for acquiring a text corpus and a transcription text corresponding to the text corpus in a training set and performing model training on a language model according to the text corpus and the transcription text;
a model combination module, used for performing model training on a chain model according to the acoustic features in the training set and combining the chain model with the language model to obtain a speech recognition model;
a word lattice indexing module, used for inputting the speech segment to be detected into the speech recognition model for analysis to obtain a word lattice and building an inverted index of the word lattice;
a keyword retrieval module, used for converting the index result into a factor transducer and inputting preset keywords in a keyword table into the factor transducer for retrieval to obtain a keyword retrieval result;
and an occurrence probability calculation module, used for respectively calculating the occurrence probability of each preset keyword according to the keyword retrieval result, and judging that the corresponding preset keyword appears in the speech segment to be detected when the occurrence probability is greater than a probability threshold.
Another object of an embodiment of the present invention is to provide a mobile terminal, including a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to make the mobile terminal execute the above keyword detection method.
Another object of an embodiment of the present invention is to provide a storage medium storing the computer program used in the above mobile terminal, wherein the computer program, when executed by a processor, implements the steps of the above keyword detection method.
According to the embodiment of the invention, the speech recognition model is controlled to decode the speech segment to be detected into a word lattice, which effectively avoids keyword detection errors caused by speech recognition errors and improves the accuracy of keyword detection. Because the word lattice allows acoustic modeling units smaller than a word, out-of-vocabulary keywords can also be detected. Building an inverted index of the word lattice of the speech segment to be detected and converting the index into a factor transducer effectively increases the speed and efficiency of keyword detection.
Drawings
Fig. 1 is a flowchart of a keyword detection method according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a keyword detection method according to a second embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a keyword detection system according to a third embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a mobile terminal according to a fourth embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless otherwise specifically stated.
Example one
Please refer to fig. 1, which is a flowchart illustrating a keyword detection method according to a first embodiment of the present invention, including the steps of:
step S10, obtaining a text corpus and a transcription text corresponding to the text corpus in a training set, and performing model training on a language model according to the text corpus and the transcription text;
the text corpus is a language to be recognized by the speech recognition model, the text corpus can be selected according to requirements, for example, the text corpus can be a language such as cantonese or Minnan, in the step, an expression mode of Mandarin is adopted in the transcription text, and the text corpus and the transcription text adopt a one-to-one corresponding relation;
preferably, the data is divided into a training set, a verification set and a test set by dividing a locally pre-stored data set, wherein the training set is used for providing training data for a language model and an acoustic model in the speech recognition model, and the verification set and the test set are used for verifying and testing the language model and the acoustic model, and in particular, in the step, the data of the training set, the verification set and the test set accounts for 70%, 10% and 20%;
step S20, performing model training on a chain model according to the acoustic features in the training set, and combining the chain model and the language model to obtain a voice recognition model;
when the training of the chain model is finished, the chain model, the language model and the pronunciation dictionary are controlled to decode the verification set and the test set so as to judge whether the chain model and the language model meet the training requirements;
preferably, when the test results of the chain model and the language model are judged not to meet the training requirements, parameter adjustment is carried out on the chain model and the language model, so that the accuracy of parameters in the voice recognition model is effectively guaranteed, and the accuracy of subsequent voice recognition is improved;
step S30, inputting the voice segment to be detected into the voice recognition model for analysis to obtain a word graph, and performing reverse indexing on the word graph;
the speech recognition model is controlled to decode the speech segment to be detected to generate a word graph (lattice), so that the condition that the keyword detection is wrong due to the speech recognition error is effectively avoided, and the accuracy of the keyword detection is improved;
step S40, converting the index result into a factor converter, and inputting preset keywords in a keyword table into the factor converter for retrieval to obtain a keyword retrieval result;
the method comprises the steps that the number and the vocabulary of preset keywords can be selected according to requirements, in the step, inverted indexing is carried out on a word graph of a voice fragment to be detected, and the indexing is converted into a factor converter, so that the detection speed and the detection efficiency of the keyword detection are effectively improved, and particularly, in the step, an indexing result can be converted into a factor converter (factor converter) by adopting WFST (WFST), and the factor converter is of a three-dimensional data structure and comprises the starting time, the ending time and the posterior probability of the preset keywords in the voice fragment;
therefore, in the step, the preset keywords in the keyword table are input into the factor converter for searching to obtain the design of the keyword search result, so that the voice segment ID, the starting time, the ending time and the posterior probability of the keywords appearing in the voice segment to be detected of each preset keyword are obtained;
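The per-keyword retrieval record described above, together with the threshold decision of step S60, can be sketched as a small data structure. All names here are illustrative assumptions, not identifiers from the patent.

```python
from dataclasses import dataclass

@dataclass
class KeywordHit:
    """One entry of the keyword retrieval result."""
    segment_id: str    # ID of the speech segment to be detected
    start_time: float  # start of the keyword occurrence, in seconds
    end_time: float    # end of the keyword occurrence, in seconds
    posterior: float   # posterior probability of the keyword occurring

def filter_hits(hits, threshold):
    """Keep only hits whose posterior exceeds the probability threshold,
    i.e. the occurrences judged to actually appear in the segment."""
    return [h for h in hits if h.posterior > threshold]
```

In practice the hits would come from searching the factor transducer; here they are constructed by hand to show the decision step.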
s50, respectively calculating the occurrence probability of each preset keyword according to the keyword retrieval result;
the probability value of each preset keyword which possibly appears in the voice segment to be detected is analyzed by calculating the occurrence probability of each preset keyword; preferably, the occurrence probability of the preset keyword can be calculated by adopting a preset function or a preset calculation formula;
step S60, when the occurrence probability is larger than a probability threshold value, judging that the preset keywords appear in the voice segment to be detected;
the probability threshold value can be set according to the requirement, the occurrence probability and the probability threshold value can be judged in a comparator mode, preferably, when the occurrence probability is judged to be larger than the probability threshold value, the corresponding preset keywords are marked, and a user is prompted that the corresponding preset keywords appear in the voice segment to be detected;
in the embodiment, the voice fragment to be detected is decoded by controlling the voice recognition model to generate the word graph, the condition of keyword detection error caused by voice recognition error is effectively avoided, the accuracy of keyword detection is improved, the word graph allows an acoustic modeling unit smaller than a word, out-of-collection words can be detected, inverted indexing is carried out on the word graph of the voice fragment to be detected, and the index is converted into the design of a factor converter, so that the detection speed and the detection efficiency of keyword detection are effectively accelerated.
Example two
Please refer to fig. 2, which is a flowchart illustrating a keyword detection method according to a second embodiment of the present invention, including the steps of:
s11, acquiring a text corpus and a transcription text corresponding to the text corpus in a training set, and performing model training on a language model according to the text corpus and the transcription text;
after the text corpus is acquired, noise and reverberation processing can be performed on the text corpus, so that data can be effectively expanded, the robustness of a language model is improved, and the model can adapt to more complex environments;
preferably, the data is divided into a training set, a verification set and a test set by dividing a locally pre-stored data set, wherein the training set is used for providing training data for a language model and an acoustic model in the speech recognition model, and the verification set and the test set are used for verifying and testing the language model and the acoustic model, and in particular, in the step, the data of the training set, the verification set and the test set accounts for 70%, 10% and 20%;
step S21, training a single-phone acoustic model according to the acoustic features, and constructing a pronunciation dictionary according to the text corpus and the transcription text;
in this step, the training of the monophonic acoustic model according to the acoustic features includes:
obtaining the use frequency of the acoustic features, and sequencing the acoustic features according to the use frequency;
acquiring a locally pre-stored characteristic quantity value, and acquiring the sorted acoustic characteristics according to the characteristic quantity value;
training the single-phone acoustic model according to the acquired acoustic features;
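The frequency-based selection in the sub-steps above can be sketched as a toy illustration; representing the features as hashable IDs and calling the locally pre-stored feature-count value `k` are assumptions made here for clarity.

```python
from collections import Counter

def select_features(feature_ids, k):
    """Rank features by usage frequency and keep the k most frequent,
    mirroring the sort-then-select step described above."""
    counts = Counter(feature_ids)
    ranked = [feature for feature, _ in counts.most_common()]
    return ranked[:k]
```

For example, selecting the two most frequently used features from a stream of observations keeps the top two by count and discards the rest.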
step S31, controlling the single-phone acoustic model, the language model and the pronunciation dictionary to decode a verification set to obtain a verification decoding result, and inquiring model adjusting parameters according to the verification decoding result;
step S41, updating parameters of the single-phone acoustic model and the language model according to the model adjusting parameters and carrying out differential processing on the acoustic features to obtain differential features;
the accuracy of the phoneme acoustic model and the language model identification is effectively improved by designing the parameter updating of the single-phoneme acoustic model and the language model according to the model adjusting parameters, and the overall identification efficiency of the voice identification model is further guaranteed;
specifically, in this step, the difference feature is obtained by performing first-order difference and second-order difference on the acoustic feature;
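A minimal sketch of the first-order difference (delta) computation mentioned above, using the common windowed-regression formula over +/-2 frames; that particular formula is the usual convention and is assumed here rather than specified by the patent. Second-order differences are obtained by applying the same function to its own output.

```python
import numpy as np

def delta(feat, window=2):
    """First-order dynamic (delta) features via windowed regression.

    feat: array of shape (num_frames, num_coeffs).
    Returns an array of the same shape; frame edges are handled by
    repeating the first and last frames.
    """
    num = np.zeros_like(feat, dtype=float)
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    padded = np.pad(feat, ((window, window), (0, 0)), mode="edge")
    for t in range(feat.shape[0]):
        for n in range(1, window + 1):
            # padded index t + window corresponds to original frame t
            num[t] += n * (padded[t + window + n] - padded[t + window - n])
    return num / denom

# second-order (delta-delta) features: delta(delta(feat))
```

On a linearly increasing feature track the interior deltas come out as the slope, and on a constant track they are zero, as expected of a derivative estimate.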
step S51, carrying out triphone model training on the training set according to the difference characteristics to obtain a triphone acoustic model, and aligning the phonemes according to the triphone acoustic model;
the phoneme is subjected to the design of initial and final alignment by controlling the triphone acoustic model, so that the training of a subsequent chain model is effectively facilitated;
s61, carrying out vector transformation on the acoustic features to obtain feature vectors, and training the triphone acoustic model according to the feature vectors;
the feature vector may be an MFCC feature vector or an FBank feature vector, etc., in this embodiment, the MFCC feature vector is used, and in terms of voice recognition and speaker recognition, the most commonly used voice feature is Mel-scale frequency cepstral Coefficients (MFCC for short);
specifically, in the step, fast fourier transform is performed through the acoustic features, the transform structure is input into a triangular band-pass filter, logarithmic energy output by each filter bank is calculated, and Discrete Cosine Transform (DCT) is performed on the logarithmic energy to obtain MFCC coefficient features;
because the standard cepstrum parameter MFCC only reflects the static characteristics of the voice parameters, and the dynamic characteristics of the voice can be described by using the difference spectrum of the static characteristics, the feature vector is obtained by extracting the dynamic difference parameters of the MFCC coefficient characteristics;
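The final DCT step of the pipeline above can be sketched as a direct type-II DCT applied to the log filter-bank energies; the function name and the choice of 13 coefficients are illustrative assumptions.

```python
import numpy as np

def cepstral_coefficients(log_energies, num_ceps=13):
    """Type-II DCT of per-filter log energies -> cepstral coefficients.

    log_energies: array of shape (..., num_filters), the logarithmic
    energy output by each triangular band-pass filter.
    """
    m = log_energies.shape[-1]
    n = np.arange(m)
    # DCT-II basis: cos(pi * k * (2n + 1) / (2m)) for k = 0 .. num_ceps-1
    basis = np.cos(np.pi * np.outer(np.arange(num_ceps), 2 * n + 1) / (2 * m))
    return log_energies @ basis.T
```

A flat log-energy spectrum produces energy only in the zeroth coefficient, which is why the DCT concentrates the spectral-envelope information in the first few cepstral coefficients.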
step S71, training the chain model according to the triphone acoustic model, and combining the chain model and the language model to obtain a voice recognition model;
s81, inputting a voice segment to be detected into the voice recognition model for analysis to obtain a word graph, and performing reverse indexing on the word graph;
the speech recognition model is controlled to decode the speech segment to be detected to generate a word graph (lattice), so that the condition that the keyword detection is wrong due to the speech recognition error is effectively avoided, and the accuracy of the keyword detection is improved;
step S91, converting the index result into a factor converter, and inputting preset keywords in a keyword table into the factor converter for retrieval to obtain a keyword retrieval result;
the number and the vocabulary of the preset keywords can be selected according to requirements, and in the step, the word graph of the voice fragment to be detected is subjected to inverted indexing, and the indexing is converted into the design of a factor converter, so that the detection speed and the detection efficiency of the keyword detection are effectively increased;
preferably, in this step, the keyword search result stores a voice segment ID, a start time, an end time, and a posterior probability of each preset keyword appearing in the voice segment to be detected;
s101, sequencing the preset keywords according to the posterior probability, and sequentially calculating the occurrence probability of each preset keyword according to a sequencing result;
in this step, the calculation formula for calculating the occurrence probability of each preset keyword according to the keyword search result is as follows:
Figure GDA0003937125750000101
wherein s is the preset keyword to be calculated, N true (s) the preset keywords are in the voice segment to be detectedNumber of actual occurrences, N correct (s) is the corresponding calculated occurrence number of the preset keyword in the keyword search result, N spurious (s) is the number of occurrences that the preset keyword is not in the to-be-detected voice segment but is judged to be in, namely the number of false detection times of the preset keyword, T is the total duration of the to-be-detected voice segment, beta is a parameter for adjusting the false detection rate and the missed detection rate, and ATWV is the occurrence probability;
step S111, when the probability of occurrence is greater than a probability threshold, judging that the preset keywords appear in the voice segment to be detected;
in the embodiment, the voice recognition model is controlled to decode the voice fragment to be detected to generate the word graph, the condition of keyword detection error caused by voice recognition error is effectively avoided, the accuracy of keyword detection is improved, the word graph allows an acoustic modeling unit smaller than a word, out-of-set words can be detected, inverted indexing is carried out on the word graph of the voice fragment to be detected, and the index is converted into a factor converter, so that the detection speed and the detection efficiency of keyword detection are effectively increased.
EXAMPLE III
Please refer to Fig. 3, which is a schematic structural diagram of a keyword detection system 100 according to a third embodiment of the present invention, including: a language model training module 10, a model combination module 11, a word lattice indexing module 12, a keyword retrieval module 13 and an occurrence probability calculation module 14, wherein:
the language model training module 10 is configured to obtain a text corpus and a transcription text corresponding to the text corpus in a training set, and perform model training on a language model according to the text corpus and the transcription text;
and the model combination module 11 is used for performing model training on the chain model according to the acoustic features in the training set, and combining the chain model and the language model to obtain a voice recognition model.
Wherein the model combination module 11 is further configured to: train a single-phone acoustic model according to the acoustic features, and perform differential processing on the acoustic features to obtain differential features;
perform triphone model training on the training set according to the differential features to obtain a triphone acoustic model, and align the phonemes according to the triphone acoustic model;
perform vector transformation on the acoustic features to obtain feature vectors, and train the triphone acoustic model according to the feature vectors;
and train the chain model according to the triphone acoustic model.
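The differential processing of acoustic features mentioned above is conventionally the delta-feature regression over neighboring frames. A sketch under that assumption; the window width n = 2 is an illustrative choice:

```python
def delta(features, n=2):
    """First-order differential (delta) features over a list of frame vectors.

    Standard regression: d_t = sum_{k=1..n} k*(c_{t+k} - c_{t-k}) / (2*sum k^2),
    with edge frames clamped. `features` is a list of equal-length lists,
    one per frame (e.g. MFCC vectors).
    """
    num_frames = len(features)
    dim = len(features[0])
    denom = 2.0 * sum(k * k for k in range(1, n + 1))
    out = []
    for t in range(num_frames):
        d = [0.0] * dim
        for k in range(1, n + 1):
            prev = features[max(t - k, 0)]             # clamp at the left edge
            nxt = features[min(t + k, num_frames - 1)] # clamp at the right edge
            for i in range(dim):
                d[i] += k * (nxt[i] - prev[i]) / denom
        out.append(d)
    return out
```

Applying the same operator to the delta features yields the second-order (delta-delta) features commonly appended for triphone training.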
Preferably, the model combination module 11 is further configured to: construct a pronunciation dictionary according to the text corpus and the transcribed text, and control the single-phone acoustic model, the language model and the pronunciation dictionary to decode a verification set to obtain a verification decoding result;
and query model adjustment parameters according to the verification decoding result, and update the parameters of the single-phone acoustic model and the language model according to the model adjustment parameters.
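The verification-driven parameter update can be sketched as a simple search over candidate settings. The `decode_validation` callback and the `lm_weight` parameter name are hypothetical stand-ins, not from the patent:

```python
def tune_model_params(decode_validation, param_grid):
    """Decode the verification set under each candidate setting and keep the
    setting with the lowest error metric.

    decode_validation(params) -> error rate (e.g. WER), lower is better;
    param_grid: list of candidate parameter dicts (e.g. language-model weight).
    """
    best_params, best_error = None, float("inf")
    for params in param_grid:
        error = decode_validation(params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error
```

The chosen parameters would then be written back into the single-phone acoustic model and the language model before the triphone and chain-model stages.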
In addition, in this embodiment, the model combination module 11 is further configured to: obtain the usage frequency of the acoustic features, and sort the acoustic features according to the usage frequency;
acquire a locally pre-stored feature quantity value, and select from the sorted acoustic features according to the feature quantity value;
and train the single-phone acoustic model according to the selected acoustic features.
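The frequency-based selection described above amounts to sorting the features by usage count and truncating at the locally stored feature quantity value. A minimal sketch with illustrative names:

```python
def select_features_by_frequency(features, usage_counts, max_features):
    """Keep the `max_features` most frequently used acoustic features.

    `features` and `usage_counts` are parallel lists; `max_features` plays the
    role of the locally pre-stored feature quantity value in the text.
    """
    ranked = sorted(zip(features, usage_counts), key=lambda fc: fc[1], reverse=True)
    return [f for f, _ in ranked[:max_features]]
```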
And the word graph indexing module 12 is configured to input the voice segment to be detected into the speech recognition model for analysis to obtain a word graph, and to perform inverted indexing on the word graph.
The keyword retrieval module 13 is configured to convert the index result into a factor converter, and input a preset keyword in the keyword table into the factor converter for retrieval, so as to obtain a keyword retrieval result;
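A much-simplified, dictionary-based sketch of the indexing and retrieval steps handled by modules 12 and 13: each word-graph arc (unit, start, end, posterior) is posted into an inverted index, and a keyword is found by chaining its units in time order. A real system would compile the index into a factor converter (a weighted finite-state transducer) rather than scan lists; the data layout and `max_gap` tolerance here are assumptions for illustration only:

```python
def build_inverted_index(lattices):
    """lattices: {utt_id: [(unit, start, end, posterior), ...]}, one tuple per
    word-graph arc. Returns {unit: [(utt_id, start, end, posterior), ...]}.
    """
    index = {}
    for utt_id, arcs in lattices.items():
        for unit, start, end, post in arcs:
            index.setdefault(unit, []).append((utt_id, start, end, post))
    return index

def search_keyword(index, keyword_units, max_gap=0.5):
    """Find hits where the keyword's units occur in order, each starting within
    `max_gap` seconds of the previous unit's end; the hit score is the product
    of the unit posteriors. A toy stand-in for factor-transducer composition.
    """
    hits = []
    first = keyword_units[0]
    for utt_id, start, end, post in index.get(first, []):
        t, score, ok = end, post, True
        for unit in keyword_units[1:]:
            candidates = [(u, s, e, p) for (u, s, e, p) in index.get(unit, [])
                          if u == utt_id and 0 <= s - t <= max_gap]
            if not candidates:
                ok = False
                break
            _, s, e, p = min(candidates, key=lambda a: a[1])  # earliest match
            t, score = e, score * p
        if ok:
            hits.append((utt_id, start, t, score))
    return hits
```

Because the index is keyed by sub-word units, a keyword never seen in training can still be matched, which is the out-of-vocabulary property the embodiments emphasize.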
and the occurrence probability calculation module 14 is configured to calculate occurrence probability of each preset keyword according to the keyword search result, and determine that the preset keyword appears in the to-be-detected speech segment when the occurrence probability is greater than a probability threshold.
The calculation formula for respectively calculating the occurrence probability of each preset keyword according to the keyword retrieval result is as follows:
ATWV(s) = 1 - (N_true(s) - N_correct(s)) / N_true(s) - beta · N_spurious(s) / (T - N_true(s))
wherein s is the preset keyword to be calculated, N_true(s) is the actual number of occurrences of the preset keyword in the voice segment to be detected, N_correct(s) is the number of occurrences of the preset keyword correctly found in the keyword retrieval result, N_spurious(s) is the number of times the preset keyword is judged to occur in the voice segment to be detected although it does not, namely the number of false detections of the preset keyword, T is the total duration of the voice segment to be detected, beta is a parameter for adjusting the false detection rate and the missed detection rate, and ATWV is the occurrence probability.
Specifically, the keyword search result stores a voice segment ID, a start time, an end time, and a posterior probability of each preset keyword appearing in the to-be-detected voice segment.
Further, the occurrence probability calculation module 14 is further configured to: and sequencing the preset keywords according to the posterior probability, and sequentially calculating the occurrence probability of each preset keyword according to a sequencing result.
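The per-hit record and posterior-ordered processing described above can be sketched as follows; the field names are illustrative:

```python
def rank_search_results(results):
    """Sort keyword hits by posterior probability, highest first, so that
    occurrence probabilities are computed for the most confident hits first.
    Each result mirrors the fields stored per hit in the text: segment ID,
    start time, end time, and posterior probability.
    """
    return sorted(results, key=lambda r: r["posterior"], reverse=True)

results = [
    {"segment_id": "seg1", "start": 1.2, "end": 1.6, "posterior": 0.55},
    {"segment_id": "seg2", "start": 0.3, "end": 0.8, "posterior": 0.91},
]
ranked = rank_search_results(results)
```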
In this embodiment, the speech recognition model is controlled to decode the voice segment to be detected into a word graph, which effectively avoids keyword detection errors caused by speech recognition errors and improves the accuracy of keyword detection. Because the word graph allows acoustic modeling units smaller than a word, out-of-vocabulary words can also be detected. Inverted indexing is performed on the word graph of the voice segment to be detected and the index is converted into a factor converter, which effectively improves the speed and efficiency of keyword detection.
Embodiment Four
Referring to fig. 4, a mobile terminal 101 according to a fourth embodiment of the present invention includes a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to enable the mobile terminal 101 to execute the keyword detection method.
The present embodiment also provides a storage medium storing the computer program used in the above-mentioned mobile terminal 101, which, when executed, performs the following steps:
acquiring a text corpus and a transcription text corresponding to the text corpus in a training set, and performing model training on a language model according to the text corpus and the transcription text;
performing model training on a chain model according to the acoustic features in the training set, and combining the chain model and the language model to obtain a voice recognition model;
inputting the voice segments to be detected into the voice recognition model for analysis to obtain a word graph, and performing reverse indexing on the word graph;
converting the index result into a factor converter, and inputting preset keywords in a keyword table into the factor converter for retrieval to obtain a keyword retrieval result;
and respectively calculating the occurrence probability of each preset keyword according to the keyword retrieval result, and determining that the preset keyword occurs in the voice segment to be detected when the occurrence probability is greater than a probability threshold. The storage medium may be, for example, a ROM/RAM, a magnetic disk, or an optical disk.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is used as an example, in practical applications, the above-mentioned function distribution may be performed by different functional units or modules according to needs, that is, the internal structure of the storage device is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the present application.
Those skilled in the art will appreciate that the component structure shown in fig. 3 does not limit the keyword detection system of the present invention, which may include more or fewer components than shown, combine some components, or arrange the components differently; likewise, the keyword detection method of fig. 1-2 may be implemented with more or fewer components than shown in fig. 3, with some components combined, or with a different arrangement of components. The units and modules referred to in the present invention are a series of computer programs that can be executed by a processor (not shown) of the keyword detection system and that perform specific functions; all of these computer programs can be stored in a storage device (not shown) of the keyword detection system.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (8)

1. A keyword detection method, the method comprising:
acquiring a text corpus and a transcription text corresponding to the text corpus in a training set, and performing model training on a language model according to the text corpus and the transcription text;
performing model training on a chain model according to the acoustic features in the training set, and combining the chain model and the language model to obtain a voice recognition model;
inputting the voice segment to be detected into the voice recognition model for analysis to obtain a word graph, and performing inverted indexing on the word graph;
converting the index result into a factor converter, and inputting preset keywords in a keyword table into the factor converter for retrieval to obtain a keyword retrieval result;
respectively calculating the occurrence probability of each preset keyword according to the keyword retrieval result, and judging that the preset keyword appears in the voice segment to be detected when the occurrence probability is greater than a probability threshold value;
the step of performing model training on the chain model according to the acoustic features in the training set comprises:
training a single-phone acoustic model according to the acoustic features, and carrying out differential processing on the acoustic features to obtain differential features;
carrying out triphone model training on the training set according to the differential features to obtain a triphone acoustic model, and aligning the phonemes according to the triphone acoustic model;
performing vector transformation on the acoustic features to obtain feature vectors, and training the triphone acoustic model according to the feature vectors;
training the chain model according to the triphone acoustic model;
the calculation formula for respectively calculating the occurrence probability of each preset keyword according to the keyword retrieval result is as follows:
ATWV(s) = 1 - (N_true(s) - N_correct(s)) / N_true(s) - beta · N_spurious(s) / (T - N_true(s))
wherein s is the preset keyword to be calculated, N_true(s) is the actual number of occurrences of the preset keyword in the voice segment to be detected, N_correct(s) is the number of occurrences of the preset keyword correctly found in the keyword retrieval result, N_spurious(s) is the number of times the preset keyword is judged to occur in the voice segment to be detected although it does not, namely the number of false detections of the preset keyword, T is the total duration of the voice segment to be detected, beta is a parameter for adjusting the false detection rate and the missed detection rate, and ATWV is the occurrence probability.
2. The keyword detection method of claim 1, wherein after the step of training a monophonic acoustic model based on the acoustic features, the method further comprises:
constructing a pronunciation dictionary according to the text corpus and the transcribed text, and controlling the single-phone acoustic model, the language model and the pronunciation dictionary to decode a verification set to obtain a verification decoding result;
and inquiring a model adjusting parameter according to the verification decoding result, and updating the parameters of the single-phone acoustic model and the language model according to the model adjusting parameter.
3. The keyword detection method according to claim 1, wherein the keyword search result stores therein a speech segment ID, a start time, an end time, and a posterior probability of each of the predetermined keywords appearing in the speech segment to be detected.
4. The method for detecting keywords according to claim 3, wherein after the step of inputting the preset keywords in the keyword list into the factor converter for retrieval to obtain the keyword retrieval result, the method further comprises:
and sequencing the preset keywords according to the posterior probability, and sequentially calculating the occurrence probability of each preset keyword according to a sequencing result.
5. The keyword detection method of claim 1, wherein the step of training a monophonic acoustic model based on the acoustic features comprises:
obtaining the use frequency of the acoustic features, and sequencing the acoustic features according to the use frequency;
acquiring a locally pre-stored characteristic quantity value, and acquiring the sorted acoustic characteristics according to the characteristic quantity value;
and training the single-phone acoustic model according to the acquired acoustic features.
6. A keyword detection system, the system comprising:
the language model training module is used for acquiring text corpora and transcription texts corresponding to the text corpora in a training set and performing model training on a language model according to the text corpora and the transcription texts;
the model combination module is used for carrying out model training on the chain model according to the acoustic features in the training set and combining the chain model with the language model to obtain a voice recognition model;
the word graph indexing module is used for inputting the voice segment to be detected into the voice recognition model for analysis to obtain a word graph, and performing inverted indexing on the word graph;
the keyword retrieval module is used for converting the index result into a factor converter and inputting preset keywords in a keyword table into the factor converter for retrieval to obtain a keyword retrieval result;
the occurrence probability calculation module is used for respectively calculating the occurrence probability of each preset keyword according to the keyword retrieval result, and judging that the preset keyword appears in the voice segment to be detected when the occurrence probability is greater than a probability threshold value;
the model combination module is further configured to: training a single-phone acoustic model according to the acoustic features, and carrying out differential processing on the acoustic features to obtain differential features;
carrying out triphone model training on the training set according to the differential features to obtain a triphone acoustic model, and aligning the phonemes according to the triphone acoustic model;
carrying out vector transformation on the acoustic features to obtain feature vectors, and training the triphone acoustic model according to the feature vectors;
training the chain model according to the triphone acoustic model;
the calculation formula for respectively calculating the occurrence probability of each preset keyword according to the keyword retrieval result is as follows:
ATWV(s) = 1 - (N_true(s) - N_correct(s)) / N_true(s) - beta · N_spurious(s) / (T - N_true(s))
wherein s is the preset keyword to be calculated, N_true(s) is the actual number of occurrences of the preset keyword in the voice segment to be detected, N_correct(s) is the number of occurrences of the preset keyword correctly found in the keyword retrieval result, N_spurious(s) is the number of times the preset keyword is judged to occur in the voice segment to be detected although it does not, namely the number of false detections of the preset keyword, T is the total duration of the voice segment to be detected, beta is a parameter for adjusting the false detection rate and the missed detection rate, and ATWV is the occurrence probability.
7. A mobile terminal, characterized by comprising a storage device for storing a computer program and a processor for executing the computer program to cause the mobile terminal to perform the keyword detection method according to any one of claims 1 to 5.
8. A storage medium having stored thereon a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the keyword detection method as claimed in any one of claims 1 to 5.
CN202010184549.6A 2020-03-17 2020-03-17 Keyword detection method, system, mobile terminal and storage medium Active CN111429912B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant