CN111429912B - Keyword detection method, system, mobile terminal and storage medium - Google Patents


Info

Publication number
CN111429912B
CN111429912B (application CN202010184549.6A)
Authority
CN
China
Prior art keywords
model
keyword
training
acoustic
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010184549.6A
Other languages
Chinese (zh)
Other versions
CN111429912A (en)
Inventor
徐敏
肖龙源
李稀敏
蔡振华
刘晓葳
谭玉坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202010184549.6A priority Critical patent/CN111429912B/en
Publication of CN111429912A publication Critical patent/CN111429912A/en
Application granted granted Critical
Publication of CN111429912B publication Critical patent/CN111429912B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/61 Information retrieval of audio data; Indexing; Data structures therefor; Storage structures
    • G06N20/00 Machine learning
    • G10L15/04 Speech recognition; Segmentation; Word boundary detection
    • G10L15/063 Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/08 Speech classification or search
    • G10L2015/0631 Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a keyword detection method, a system, a mobile terminal and a storage medium, wherein the method comprises the following steps: obtaining a text corpus and a transcription text and performing model training on a language model; performing model training on a chain model according to acoustic features in a training set, and combining the chain model with the language model to obtain a speech recognition model; inputting a speech segment to be detected into the speech recognition model for analysis to obtain a word lattice, and building an inverted index of the word lattice; converting the index result into a factor transducer, and inputting preset keywords into the factor transducer for retrieval to obtain a keyword retrieval result; and calculating the occurrence probability of each preset keyword according to the keyword retrieval result, and judging that a preset keyword appears in the speech segment to be detected when its occurrence probability is greater than a probability threshold. By controlling the speech recognition model to decode the speech segment to be detected into a word lattice, the method avoids keyword detection errors caused by speech recognition errors and improves the accuracy of keyword detection.

Description

Keyword detection method, system, mobile terminal and storage medium
Technical Field
The invention belongs to the technical field of keyword detection, and particularly relates to a keyword detection method, a keyword detection system, a mobile terminal and a storage medium.
Background
Keyword detection is a technology for detecting keywords in continuous speech of interest, and has important applications in fields such as smart homes, telephone monitoring and speech data mining. Keyword detection has been studied for over 40 years, but keyword detection under low-resource, low-power and low-computational-complexity conditions remains a research hotspot. By application scenario, keyword detection falls into two categories. In the first, the number of keywords is small and fixed, and a continuous speech stream is monitored for the presence of any keyword from a keyword list; a typical application is wake-word recognition in smart homes. In the second, the number of keywords is large and not fixed, but the speech to be detected already exists, and an algorithm locates the speech segments in which the keywords occur; a typical application is speech data mining.
However, existing keyword detection methods mainly extract a large number of speech features from targeted keyword data, normalize them, and feed them into a neural network for machine-learning model training. The resulting models have poor robustness: when the deployment scenario differs from the training scenario, the recognition rate drops sharply, which in turn reduces the accuracy of keyword detection.
Disclosure of Invention
The embodiment of the invention aims to provide a keyword detection method, a keyword detection system, a mobile terminal and a storage medium, and aims to solve the problem that the existing keyword detection method is low in detection accuracy.
The embodiment of the invention is realized in such a way that a keyword detection method comprises the following steps:
acquiring a text corpus and a transcription text corresponding to the text corpus in a training set, and performing model training on a language model according to the text corpus and the transcription text;
performing model training on a chain model according to acoustic features in the training set, and combining the chain model with the language model to obtain a speech recognition model;
inputting a speech segment to be detected into the speech recognition model for analysis to obtain a word lattice, and building an inverted index of the word lattice;
converting the index result into a factor transducer, and inputting preset keywords in a keyword table into the factor transducer for retrieval to obtain a keyword retrieval result;
and respectively calculating the occurrence probability of each preset keyword according to the keyword retrieval result, and judging that the preset keyword appears in the speech segment to be detected when the occurrence probability is greater than a probability threshold.
Further, the step of performing model training on the chain model according to the acoustic features in the training set includes:
training a monophone acoustic model according to the acoustic features, and performing difference processing on the acoustic features to obtain difference features;
performing triphone model training on the training set according to the difference features to obtain a triphone acoustic model, and aligning the phonemes according to the triphone acoustic model;
performing vector transformation on the acoustic features to obtain feature vectors, and training the triphone acoustic model according to the feature vectors;
and training the chain model according to the triphone acoustic model.
Further, after the step of training the monophone acoustic model according to the acoustic features, the method further includes:
constructing a pronunciation dictionary according to the text corpus and the transcription text, and controlling the monophone acoustic model, the language model and the pronunciation dictionary to decode a verification set to obtain a verification decoding result;
and querying model adjustment parameters according to the verification decoding result, and updating the parameters of the monophone acoustic model and the language model according to the model adjustment parameters.
Further, the occurrence probability of each preset keyword is calculated from the keyword retrieval result as:

ATWV(s) = N_correct(s) / N_true(s) - beta * N_spurious(s) / (T - N_true(s))

wherein s is the preset keyword to be calculated; N_true(s) is the actual number of occurrences of the preset keyword in the speech segment to be detected; N_correct(s) is the number of occurrences of the preset keyword correctly found in the keyword retrieval result; N_spurious(s) is the number of times the preset keyword is judged to occur although it does not occur in the speech segment to be detected, i.e. the number of false detections of the preset keyword; T is the total duration of the speech segment to be detected; beta is a parameter that trades off the false-detection rate against the missed-detection rate; and ATWV is the occurrence probability.
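As an illustration of this formula, the following is a minimal Python sketch of the per-keyword ATWV computation; the function name and the default beta value of 999.9 (the convention used in the NIST spoken-term-detection evaluation) are assumptions for illustration, not part of the patent.

```python
def atwv(n_true, n_correct, n_spurious, total_duration, beta=999.9):
    """Per-keyword actual term-weighted value (ATWV).

    n_true: actual occurrences of the keyword in the audio
    n_correct: occurrences correctly found by the search
    n_spurious: false detections
    total_duration: total audio duration in seconds (the number of
        false-alarm trials is approximated as total_duration - n_true)
    beta: trade-off between miss rate and false-alarm rate
    """
    p_miss = 1.0 - n_correct / n_true
    p_fa = n_spurious / (total_duration - n_true)
    return 1.0 - p_miss - beta * p_fa
```

A perfect result (every occurrence found, no false detections) yields an ATWV of 1.0; each false detection is heavily penalized through beta.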
Further, the keyword retrieval result stores, for each preset keyword, the ID of the speech segment, the start time, the end time, and the posterior probability of the keyword appearing in the speech segment to be detected.
Further, after the step of inputting the preset keywords in the keyword table into the factor transducer for retrieval to obtain the keyword retrieval result, the method further comprises:
sorting the preset keywords according to their posterior probabilities, and calculating the occurrence probability of each preset keyword in turn according to the sorted order.
Further, the step of training the monophone acoustic model according to the acoustic features comprises:
obtaining the usage frequency of the acoustic features, and sorting the acoustic features according to the usage frequency;
acquiring a locally pre-stored feature-count value, and selecting that number of the sorted acoustic features;
and training the monophone acoustic model according to the selected acoustic features.
Another object of an embodiment of the present invention is to provide a keyword detection system, including:
a language model training module, used for acquiring a text corpus and a transcription text corresponding to the text corpus in a training set and performing model training on a language model according to the text corpus and the transcription text;
a model combination module, used for performing model training on a chain model according to the acoustic features in the training set and combining the chain model with the language model to obtain a speech recognition model;
a word lattice indexing module, used for inputting the speech segment to be detected into the speech recognition model for analysis to obtain a word lattice and building an inverted index of the word lattice;
a keyword retrieval module, used for converting the index result into a factor transducer and inputting preset keywords in a keyword table into the factor transducer for retrieval to obtain a keyword retrieval result;
and an occurrence probability calculation module, used for respectively calculating the occurrence probability of each preset keyword according to the keyword retrieval result, and judging that the corresponding preset keyword appears in the speech segment to be detected when the occurrence probability is greater than a probability threshold.
Another object of an embodiment of the present invention is to provide a mobile terminal, including a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to make the mobile terminal execute the above keyword detection method.
Another object of an embodiment of the present invention is to provide a storage medium storing the computer program used in the above mobile terminal, wherein the computer program, when executed by a processor, implements the steps of the above keyword detection method.
According to the embodiment of the invention, the speech recognition model is controlled to decode the speech segment to be detected into a word lattice, which effectively avoids keyword detection errors caused by speech recognition errors and improves the accuracy of keyword detection. Because the word lattice allows acoustic modeling units smaller than a word, out-of-vocabulary keywords can also be detected. Building an inverted index of the word lattice of the speech segment to be detected and converting the index into a factor transducer effectively increases the speed and efficiency of keyword detection.
Drawings
Fig. 1 is a flowchart of a keyword detection method according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a keyword detection method according to a second embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a keyword detection system according to a third embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a mobile terminal according to a fourth embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless otherwise specifically stated.
Example one
Please refer to fig. 1, which is a flowchart illustrating a keyword detection method according to a first embodiment of the present invention, including the steps of:
step S10, obtaining a text corpus and a transcription text corresponding to the text corpus in a training set, and performing model training on a language model according to the text corpus and the transcription text;
the text corpus is a language to be recognized by the speech recognition model, the text corpus can be selected according to requirements, for example, the text corpus can be a language such as cantonese or Minnan, in the step, an expression mode of Mandarin is adopted in the transcription text, and the text corpus and the transcription text adopt a one-to-one corresponding relation;
preferably, the data is divided into a training set, a verification set and a test set by dividing a locally pre-stored data set, wherein the training set is used for providing training data for a language model and an acoustic model in the speech recognition model, and the verification set and the test set are used for verifying and testing the language model and the acoustic model, and in particular, in the step, the data of the training set, the verification set and the test set accounts for 70%, 10% and 20%;
step S20, performing model training on a chain model according to the acoustic features in the training set, and combining the chain model and the language model to obtain a voice recognition model;
when the training of the chain model is finished, the chain model, the language model and the pronunciation dictionary are controlled to decode the verification set and the test set so as to judge whether the chain model and the language model meet the training requirements;
preferably, when the test results of the chain model and the language model are judged not to meet the training requirements, parameter adjustment is carried out on the chain model and the language model, so that the accuracy of parameters in the voice recognition model is effectively guaranteed, and the accuracy of subsequent voice recognition is improved;
step S30, inputting the voice segment to be detected into the voice recognition model for analysis to obtain a word graph, and performing reverse indexing on the word graph;
the speech recognition model is controlled to decode the speech segment to be detected to generate a word graph (lattice), so that the condition that the keyword detection is wrong due to the speech recognition error is effectively avoided, and the accuracy of the keyword detection is improved;
step S40, converting the index result into a factor converter, and inputting preset keywords in a keyword table into the factor converter for retrieval to obtain a keyword retrieval result;
the method comprises the steps that the number and the vocabulary of preset keywords can be selected according to requirements, in the step, inverted indexing is carried out on a word graph of a voice fragment to be detected, and the indexing is converted into a factor converter, so that the detection speed and the detection efficiency of the keyword detection are effectively improved, and particularly, in the step, an indexing result can be converted into a factor converter (factor converter) by adopting WFST (WFST), and the factor converter is of a three-dimensional data structure and comprises the starting time, the ending time and the posterior probability of the preset keywords in the voice fragment;
therefore, in the step, the preset keywords in the keyword table are input into the factor converter for searching to obtain the design of the keyword search result, so that the voice segment ID, the starting time, the ending time and the posterior probability of the keywords appearing in the voice segment to be detected of each preset keyword are obtained;
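The per-keyword retrieval record described above, together with the threshold decision of step S60, can be sketched as a small data structure. All names here are illustrative assumptions, not identifiers from the patent.

```python
from dataclasses import dataclass

@dataclass
class KeywordHit:
    """One entry of the keyword retrieval result."""
    segment_id: str    # ID of the speech segment to be detected
    start_time: float  # start of the keyword occurrence, in seconds
    end_time: float    # end of the keyword occurrence, in seconds
    posterior: float   # posterior probability of the keyword occurring

def filter_hits(hits, threshold):
    """Keep only hits whose posterior exceeds the probability threshold,
    i.e. the occurrences judged to actually appear in the segment."""
    return [h for h in hits if h.posterior > threshold]
```

In practice the hits would come from searching the factor transducer; here they are constructed by hand to show the decision step.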
s50, respectively calculating the occurrence probability of each preset keyword according to the keyword retrieval result;
the probability value of each preset keyword which possibly appears in the voice segment to be detected is analyzed by calculating the occurrence probability of each preset keyword; preferably, the occurrence probability of the preset keyword can be calculated by adopting a preset function or a preset calculation formula;
step S60, when the occurrence probability is larger than a probability threshold value, judging that the preset keywords appear in the voice segment to be detected;
the probability threshold value can be set according to the requirement, the occurrence probability and the probability threshold value can be judged in a comparator mode, preferably, when the occurrence probability is judged to be larger than the probability threshold value, the corresponding preset keywords are marked, and a user is prompted that the corresponding preset keywords appear in the voice segment to be detected;
in the embodiment, the voice fragment to be detected is decoded by controlling the voice recognition model to generate the word graph, the condition of keyword detection error caused by voice recognition error is effectively avoided, the accuracy of keyword detection is improved, the word graph allows an acoustic modeling unit smaller than a word, out-of-collection words can be detected, inverted indexing is carried out on the word graph of the voice fragment to be detected, and the index is converted into the design of a factor converter, so that the detection speed and the detection efficiency of keyword detection are effectively accelerated.
Example two
Please refer to fig. 2, which is a flowchart illustrating a keyword detection method according to a second embodiment of the present invention, including the steps of:
s11, acquiring a text corpus and a transcription text corresponding to the text corpus in a training set, and performing model training on a language model according to the text corpus and the transcription text;
after the text corpus is acquired, noise and reverberation processing can be performed on the text corpus, so that data can be effectively expanded, the robustness of a language model is improved, and the model can adapt to more complex environments;
preferably, the data is divided into a training set, a verification set and a test set by dividing a locally pre-stored data set, wherein the training set is used for providing training data for a language model and an acoustic model in the speech recognition model, and the verification set and the test set are used for verifying and testing the language model and the acoustic model, and in particular, in the step, the data of the training set, the verification set and the test set accounts for 70%, 10% and 20%;
step S21, training a single-phone acoustic model according to the acoustic features, and constructing a pronunciation dictionary according to the text corpus and the transcription text;
in this step, the training of the monophonic acoustic model according to the acoustic features includes:
obtaining the use frequency of the acoustic features, and sequencing the acoustic features according to the use frequency;
acquiring a locally pre-stored characteristic quantity value, and acquiring the sorted acoustic characteristics according to the characteristic quantity value;
training the single-phone acoustic model according to the acquired acoustic features;
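The frequency-based selection in the sub-steps above can be sketched as a toy illustration; representing the features as hashable IDs and calling the locally pre-stored feature-count value `k` are assumptions made here for clarity.

```python
from collections import Counter

def select_features(feature_ids, k):
    """Rank features by usage frequency and keep the k most frequent,
    mirroring the sort-then-select step described above."""
    counts = Counter(feature_ids)
    ranked = [feature for feature, _ in counts.most_common()]
    return ranked[:k]
```

For example, selecting the two most frequently used features from a stream of observations keeps the top two by count and discards the rest.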
step S31, controlling the single-phone acoustic model, the language model and the pronunciation dictionary to decode a verification set to obtain a verification decoding result, and inquiring model adjusting parameters according to the verification decoding result;
step S41, updating parameters of the single-phone acoustic model and the language model according to the model adjusting parameters and carrying out differential processing on the acoustic features to obtain differential features;
the accuracy of the phoneme acoustic model and the language model identification is effectively improved by designing the parameter updating of the single-phoneme acoustic model and the language model according to the model adjusting parameters, and the overall identification efficiency of the voice identification model is further guaranteed;
specifically, in this step, the difference feature is obtained by performing first-order difference and second-order difference on the acoustic feature;
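A minimal sketch of the first-order difference (delta) computation mentioned above, using the common windowed-regression formula over +/-2 frames; that particular formula is the usual convention and is assumed here rather than specified by the patent. Second-order differences are obtained by applying the same function to its own output.

```python
import numpy as np

def delta(feat, window=2):
    """First-order dynamic (delta) features via windowed regression.

    feat: array of shape (num_frames, num_coeffs).
    Returns an array of the same shape; frame edges are handled by
    repeating the first and last frames.
    """
    num = np.zeros_like(feat, dtype=float)
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    padded = np.pad(feat, ((window, window), (0, 0)), mode="edge")
    for t in range(feat.shape[0]):
        for n in range(1, window + 1):
            # padded index t + window corresponds to original frame t
            num[t] += n * (padded[t + window + n] - padded[t + window - n])
    return num / denom

# second-order (delta-delta) features: delta(delta(feat))
```

On a linearly increasing feature track the interior deltas come out as the slope, and on a constant track they are zero, as expected of a derivative estimate.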
step S51, carrying out triphone model training on the training set according to the difference characteristics to obtain a triphone acoustic model, and aligning the phonemes according to the triphone acoustic model;
the phoneme is subjected to the design of initial and final alignment by controlling the triphone acoustic model, so that the training of a subsequent chain model is effectively facilitated;
s61, carrying out vector transformation on the acoustic features to obtain feature vectors, and training the triphone acoustic model according to the feature vectors;
the feature vector may be an MFCC feature vector or an FBank feature vector, etc., in this embodiment, the MFCC feature vector is used, and in terms of voice recognition and speaker recognition, the most commonly used voice feature is Mel-scale frequency cepstral Coefficients (MFCC for short);
specifically, in the step, fast fourier transform is performed through the acoustic features, the transform structure is input into a triangular band-pass filter, logarithmic energy output by each filter bank is calculated, and Discrete Cosine Transform (DCT) is performed on the logarithmic energy to obtain MFCC coefficient features;
because the standard cepstrum parameter MFCC only reflects the static characteristics of the voice parameters, and the dynamic characteristics of the voice can be described by using the difference spectrum of the static characteristics, the feature vector is obtained by extracting the dynamic difference parameters of the MFCC coefficient characteristics;
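The final DCT step of the pipeline above can be sketched as a direct type-II DCT applied to the log filter-bank energies; the function name and the choice of 13 coefficients are illustrative assumptions.

```python
import numpy as np

def cepstral_coefficients(log_energies, num_ceps=13):
    """Type-II DCT of per-filter log energies -> cepstral coefficients.

    log_energies: array of shape (..., num_filters), the logarithmic
    energy output by each triangular band-pass filter.
    """
    m = log_energies.shape[-1]
    n = np.arange(m)
    # DCT-II basis: cos(pi * k * (2n + 1) / (2m)) for k = 0 .. num_ceps-1
    basis = np.cos(np.pi * np.outer(np.arange(num_ceps), 2 * n + 1) / (2 * m))
    return log_energies @ basis.T
```

A flat log-energy spectrum produces energy only in the zeroth coefficient, which is why the DCT concentrates the spectral-envelope information in the first few cepstral coefficients.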
step S71, training the chain model according to the triphone acoustic model, and combining the chain model and the language model to obtain a voice recognition model;
s81, inputting a voice segment to be detected into the voice recognition model for analysis to obtain a word graph, and performing reverse indexing on the word graph;
the speech recognition model is controlled to decode the speech segment to be detected to generate a word graph (lattice), so that the condition that the keyword detection is wrong due to the speech recognition error is effectively avoided, and the accuracy of the keyword detection is improved;
step S91, converting the index result into a factor converter, and inputting preset keywords in a keyword table into the factor converter for retrieval to obtain a keyword retrieval result;
the number and the vocabulary of the preset keywords can be selected according to requirements, and in the step, the word graph of the voice fragment to be detected is subjected to inverted indexing, and the indexing is converted into the design of a factor converter, so that the detection speed and the detection efficiency of the keyword detection are effectively increased;
preferably, in this step, the keyword search result stores a voice segment ID, a start time, an end time, and a posterior probability of each preset keyword appearing in the voice segment to be detected;
s101, sequencing the preset keywords according to the posterior probability, and sequentially calculating the occurrence probability of each preset keyword according to a sequencing result;
in this step, the calculation formula for calculating the occurrence probability of each preset keyword according to the keyword search result is as follows:
Figure GDA0003937125750000101
wherein s is the preset keyword to be calculated, N true (s) the preset keywords are in the voice segment to be detectedNumber of actual occurrences, N correct (s) is the corresponding calculated occurrence number of the preset keyword in the keyword search result, N spurious (s) is the number of occurrences that the preset keyword is not in the to-be-detected voice segment but is judged to be in, namely the number of false detection times of the preset keyword, T is the total duration of the to-be-detected voice segment, beta is a parameter for adjusting the false detection rate and the missed detection rate, and ATWV is the occurrence probability;
step S111, when the probability of occurrence is greater than a probability threshold, judging that the preset keywords appear in the voice segment to be detected;
in the embodiment, the voice recognition model is controlled to decode the voice fragment to be detected to generate the word graph, the condition of keyword detection error caused by voice recognition error is effectively avoided, the accuracy of keyword detection is improved, the word graph allows an acoustic modeling unit smaller than a word, out-of-set words can be detected, inverted indexing is carried out on the word graph of the voice fragment to be detected, and the index is converted into a factor converter, so that the detection speed and the detection efficiency of keyword detection are effectively increased.
EXAMPLE III
Please refer to Fig. 3, which is a schematic structural diagram of a keyword detection system 100 according to a third embodiment of the present invention, including: a language model training module 10, a model combination module 11, a word lattice indexing module 12, a keyword retrieval module 13 and an occurrence probability calculation module 14, wherein:
the language model training module 10 is configured to obtain a text corpus and a transcription text corresponding to the text corpus in a training set, and perform model training on a language model according to the text corpus and the transcription text;
and the model combination module 11 is used for performing model training on the chain model according to the acoustic features in the training set, and combining the chain model and the language model to obtain a voice recognition model.
Wherein the model combination module 11 is further configured to: train a single-phone acoustic model according to the acoustic features, and perform differential processing on the acoustic features to obtain differential features;
perform triphone model training on the training set according to the differential features to obtain a triphone acoustic model, and align the phonemes according to the triphone acoustic model;
perform vector transformation on the acoustic features to obtain feature vectors, and train the triphone acoustic model according to the feature vectors;
and train the chain model according to the triphone acoustic model.
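The differential processing of acoustic features mentioned above is conventionally the delta-feature regression over neighboring frames. A sketch under that assumption; the window width n = 2 is an illustrative choice:

```python
def delta(features, n=2):
    """First-order differential (delta) features over a list of frame vectors.

    Standard regression: d_t = sum_{k=1..n} k*(c_{t+k} - c_{t-k}) / (2*sum k^2),
    with edge frames clamped. `features` is a list of equal-length lists,
    one per frame (e.g. MFCC vectors).
    """
    num_frames = len(features)
    dim = len(features[0])
    denom = 2.0 * sum(k * k for k in range(1, n + 1))
    out = []
    for t in range(num_frames):
        d = [0.0] * dim
        for k in range(1, n + 1):
            prev = features[max(t - k, 0)]             # clamp at the left edge
            nxt = features[min(t + k, num_frames - 1)] # clamp at the right edge
            for i in range(dim):
                d[i] += k * (nxt[i] - prev[i]) / denom
        out.append(d)
    return out
```

Applying the same operator to the delta features yields the second-order (delta-delta) features commonly appended for triphone training.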
Preferably, the model combination module 11 is further configured to: construct a pronunciation dictionary according to the text corpus and the transcribed text, and control the single-phone acoustic model, the language model and the pronunciation dictionary to decode a verification set to obtain a verification decoding result;
and query model adjustment parameters according to the verification decoding result, and update the parameters of the single-phone acoustic model and the language model according to the model adjustment parameters.
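The verification-driven parameter update can be sketched as a simple search over candidate settings. The `decode_validation` callback and the `lm_weight` parameter name are hypothetical stand-ins, not from the patent:

```python
def tune_model_params(decode_validation, param_grid):
    """Decode the verification set under each candidate setting and keep the
    setting with the lowest error metric.

    decode_validation(params) -> error rate (e.g. WER), lower is better;
    param_grid: list of candidate parameter dicts (e.g. language-model weight).
    """
    best_params, best_error = None, float("inf")
    for params in param_grid:
        error = decode_validation(params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error
```

The chosen parameters would then be written back into the single-phone acoustic model and the language model before the triphone and chain-model stages.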
In addition, in this embodiment, the model combination module 11 is further configured to: obtain the usage frequency of the acoustic features, and sort the acoustic features according to the usage frequency;
acquire a locally pre-stored feature quantity value, and select from the sorted acoustic features according to the feature quantity value;
and train the single-phone acoustic model according to the selected acoustic features.
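The frequency-based selection described above amounts to sorting the features by usage count and truncating at the locally stored feature quantity value. A minimal sketch with illustrative names:

```python
def select_features_by_frequency(features, usage_counts, max_features):
    """Keep the `max_features` most frequently used acoustic features.

    `features` and `usage_counts` are parallel lists; `max_features` plays the
    role of the locally pre-stored feature quantity value in the text.
    """
    ranked = sorted(zip(features, usage_counts), key=lambda fc: fc[1], reverse=True)
    return [f for f, _ in ranked[:max_features]]
```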
And the word graph indexing module 12 is configured to input the voice segment to be detected into the speech recognition model for analysis to obtain a word graph, and to perform inverted indexing on the word graph.
The keyword retrieval module 13 is configured to convert the index result into a factor converter, and input a preset keyword in the keyword table into the factor converter for retrieval, so as to obtain a keyword retrieval result;
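A much-simplified, dictionary-based sketch of the indexing and retrieval steps handled by modules 12 and 13: each word-graph arc (unit, start, end, posterior) is posted into an inverted index, and a keyword is found by chaining its units in time order. A real system would compile the index into a factor converter (a weighted finite-state transducer) rather than scan lists; the data layout and `max_gap` tolerance here are assumptions for illustration only:

```python
def build_inverted_index(lattices):
    """lattices: {utt_id: [(unit, start, end, posterior), ...]}, one tuple per
    word-graph arc. Returns {unit: [(utt_id, start, end, posterior), ...]}.
    """
    index = {}
    for utt_id, arcs in lattices.items():
        for unit, start, end, post in arcs:
            index.setdefault(unit, []).append((utt_id, start, end, post))
    return index

def search_keyword(index, keyword_units, max_gap=0.5):
    """Find hits where the keyword's units occur in order, each starting within
    `max_gap` seconds of the previous unit's end; the hit score is the product
    of the unit posteriors. A toy stand-in for factor-transducer composition.
    """
    hits = []
    first = keyword_units[0]
    for utt_id, start, end, post in index.get(first, []):
        t, score, ok = end, post, True
        for unit in keyword_units[1:]:
            candidates = [(u, s, e, p) for (u, s, e, p) in index.get(unit, [])
                          if u == utt_id and 0 <= s - t <= max_gap]
            if not candidates:
                ok = False
                break
            _, s, e, p = min(candidates, key=lambda a: a[1])  # earliest match
            t, score = e, score * p
        if ok:
            hits.append((utt_id, start, t, score))
    return hits
```

Because the index is keyed by sub-word units, a keyword never seen in training can still be matched, which is the out-of-vocabulary property the embodiments emphasize.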
and the occurrence probability calculation module 14 is configured to calculate occurrence probability of each preset keyword according to the keyword search result, and determine that the preset keyword appears in the to-be-detected speech segment when the occurrence probability is greater than a probability threshold.
The calculation formula for respectively calculating the occurrence probability of each preset keyword according to the keyword retrieval result is as follows:
ATWV(s) = 1 - (N_true(s) - N_correct(s)) / N_true(s) - beta · N_spurious(s) / (T - N_true(s))
wherein s is the preset keyword to be calculated, N_true(s) is the actual number of occurrences of the preset keyword in the voice segment to be detected, N_correct(s) is the number of occurrences of the preset keyword correctly found in the keyword retrieval result, N_spurious(s) is the number of times the preset keyword is judged to occur in the voice segment to be detected although it does not, namely the number of false detections of the preset keyword, T is the total duration of the voice segment to be detected, beta is a parameter for adjusting the false detection rate and the missed detection rate, and ATWV is the occurrence probability.
Specifically, the keyword search result stores a voice segment ID, a start time, an end time, and a posterior probability of each preset keyword appearing in the to-be-detected voice segment.
Further, the occurrence probability calculation module 14 is further configured to: and sequencing the preset keywords according to the posterior probability, and sequentially calculating the occurrence probability of each preset keyword according to a sequencing result.
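The per-hit record and posterior-ordered processing described above can be sketched as follows; the field names are illustrative:

```python
def rank_search_results(results):
    """Sort keyword hits by posterior probability, highest first, so that
    occurrence probabilities are computed for the most confident hits first.
    Each result mirrors the fields stored per hit in the text: segment ID,
    start time, end time, and posterior probability.
    """
    return sorted(results, key=lambda r: r["posterior"], reverse=True)

results = [
    {"segment_id": "seg1", "start": 1.2, "end": 1.6, "posterior": 0.55},
    {"segment_id": "seg2", "start": 0.3, "end": 0.8, "posterior": 0.91},
]
ranked = rank_search_results(results)
```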
In this embodiment, the speech recognition model is controlled to decode the voice segment to be detected into a word graph, which effectively avoids keyword detection errors caused by speech recognition errors and improves the accuracy of keyword detection. Because the word graph allows acoustic modeling units smaller than a word, out-of-vocabulary words can also be detected. Inverted indexing is performed on the word graph of the voice segment to be detected and the index is converted into a factor converter, which effectively improves the speed and efficiency of keyword detection.
Embodiment Four
Referring to fig. 4, a mobile terminal 101 according to a fourth embodiment of the present invention includes a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to enable the mobile terminal 101 to execute the keyword detection method.
The present embodiment also provides a storage medium storing the computer program used in the above-mentioned mobile terminal 101, which, when executed, performs the following steps:
acquiring a text corpus and a transcription text corresponding to the text corpus in a training set, and performing model training on a language model according to the text corpus and the transcription text;
performing model training on a chain model according to the acoustic features in the training set, and combining the chain model and the language model to obtain a voice recognition model;
inputting the voice segments to be detected into the voice recognition model for analysis to obtain a word graph, and performing reverse indexing on the word graph;
converting the index result into a factor converter, and inputting preset keywords in a keyword table into the factor converter for retrieval to obtain a keyword retrieval result;
and respectively calculating the occurrence probability of each preset keyword according to the keyword retrieval result, and determining that the preset keyword occurs in the voice segment to be detected when the occurrence probability is greater than a probability threshold. The storage medium may be, for example, a ROM/RAM, a magnetic disk, or an optical disk.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is used as an example, in practical applications, the above-mentioned function distribution may be performed by different functional units or modules according to needs, that is, the internal structure of the storage device is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the present application.
Those skilled in the art will appreciate that the component structure shown in fig. 3 does not limit the keyword detection system of the present invention, which may include more or fewer components than shown, combine some components, or arrange the components differently; likewise, the keyword detection method of fig. 1-2 may be implemented with more or fewer components than shown in fig. 3, with some components combined, or with a different arrangement of components. The units and modules referred to in the present invention are a series of computer programs that can be executed by a processor (not shown) of the keyword detection system and that perform specific functions; all of these computer programs can be stored in a storage device (not shown) of the keyword detection system.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (8)

1. A keyword detection method, the method comprising:
acquiring a text corpus and a transcription text corresponding to the text corpus in a training set, and performing model training on a language model according to the text corpus and the transcription text;
performing model training on a chain model according to the acoustic features in the training set, and combining the chain model and the language model to obtain a voice recognition model;
inputting the voice segment to be detected into the voice recognition model for analysis to obtain a word graph, and performing inverted indexing on the word graph;
converting the index result into a factor converter, and inputting preset keywords in a keyword table into the factor converter for retrieval to obtain a keyword retrieval result;
respectively calculating the occurrence probability of each preset keyword according to the keyword retrieval result, and judging that the preset keyword appears in the voice segment to be detected when the occurrence probability is greater than a probability threshold value;
the step of performing model training on the chain model according to the acoustic features in the training set comprises:
training a single-phone acoustic model according to the acoustic features, and carrying out differential processing on the acoustic features to obtain differential features;
carrying out triphone model training on the training set according to the differential features to obtain a triphone acoustic model, and aligning the phonemes according to the triphone acoustic model;
performing vector transformation on the acoustic features to obtain feature vectors, and training the triphone acoustic model according to the feature vectors;
training the chain model according to the triphone acoustic model;
the calculation formula for respectively calculating the occurrence probability of each preset keyword according to the keyword retrieval result is as follows:
ATWV(s) = 1 - (N_true(s) - N_correct(s)) / N_true(s) - beta · N_spurious(s) / (T - N_true(s))
wherein s is the preset keyword to be calculated, N_true(s) is the actual number of occurrences of the preset keyword in the voice segment to be detected, N_correct(s) is the number of occurrences of the preset keyword correctly found in the keyword retrieval result, N_spurious(s) is the number of times the preset keyword is judged to occur in the voice segment to be detected although it does not, namely the number of false detections of the preset keyword, T is the total duration of the voice segment to be detected, beta is a parameter for adjusting the false detection rate and the missed detection rate, and ATWV is the occurrence probability.
2. The keyword detection method of claim 1, wherein after the step of training a monophonic acoustic model based on the acoustic features, the method further comprises:
constructing a pronunciation dictionary according to the text corpus and the transcribed text, and controlling the single-phone acoustic model, the language model and the pronunciation dictionary to decode a verification set to obtain a verification decoding result;
and inquiring a model adjusting parameter according to the verification decoding result, and updating the parameters of the single-phone acoustic model and the language model according to the model adjusting parameter.
3. The keyword detection method according to claim 1, wherein the keyword search result stores therein a speech segment ID, a start time, an end time, and a posterior probability of each of the predetermined keywords appearing in the speech segment to be detected.
4. The method for detecting keywords according to claim 3, wherein after the step of inputting the preset keywords in the keyword list into the factor converter for retrieval to obtain the keyword retrieval result, the method further comprises:
and sequencing the preset keywords according to the posterior probability, and sequentially calculating the occurrence probability of each preset keyword according to a sequencing result.
5. The keyword detection method of claim 1, wherein the step of training a monophonic acoustic model based on the acoustic features comprises:
obtaining the use frequency of the acoustic features, and sequencing the acoustic features according to the use frequency;
acquiring a locally pre-stored characteristic quantity value, and acquiring the sorted acoustic characteristics according to the characteristic quantity value;
and training the single-phone acoustic model according to the acquired acoustic features.
6. A keyword detection system, the system comprising:
the language model training module is used for acquiring text corpora and transcription texts corresponding to the text corpora in a training set and performing model training on a language model according to the text corpora and the transcription texts;
the model combination module is used for carrying out model training on the chain model according to the acoustic features in the training set and combining the chain model with the language model to obtain a voice recognition model;
the word graph indexing module is used for inputting the voice segment to be detected into the voice recognition model for analysis to obtain a word graph, and performing inverted indexing on the word graph;
the keyword retrieval module is used for converting the index result into a factor converter and inputting preset keywords in a keyword table into the factor converter for retrieval to obtain a keyword retrieval result;
the occurrence probability calculation module is used for respectively calculating the occurrence probability of each preset keyword according to the keyword retrieval result, and judging that the preset keyword appears in the voice segment to be detected when the occurrence probability is greater than a probability threshold value;
the model combination module is further configured to: training a single-phone acoustic model according to the acoustic features, and carrying out differential processing on the acoustic features to obtain differential features;
carrying out triphone model training on the training set according to the differential features to obtain a triphone acoustic model, and aligning the phonemes according to the triphone acoustic model;
carrying out vector transformation on the acoustic features to obtain feature vectors, and training the triphone acoustic model according to the feature vectors;
training the chain model according to the triphone acoustic model;
the calculation formula for respectively calculating the occurrence probability of each preset keyword according to the keyword retrieval result is as follows:
ATWV(s) = 1 - (N_true(s) - N_correct(s)) / N_true(s) - beta · N_spurious(s) / (T - N_true(s))
wherein s is the preset keyword to be calculated, N_true(s) is the actual number of occurrences of the preset keyword in the voice segment to be detected, N_correct(s) is the number of occurrences of the preset keyword correctly found in the keyword retrieval result, N_spurious(s) is the number of times the preset keyword is judged to occur in the voice segment to be detected although it does not, namely the number of false detections of the preset keyword, T is the total duration of the voice segment to be detected, beta is a parameter for adjusting the false detection rate and the missed detection rate, and ATWV is the occurrence probability.
7. A mobile terminal, characterized by comprising a storage device for storing a computer program and a processor for executing the computer program to cause the mobile terminal to perform the keyword detection method according to any one of claims 1 to 5.
8. A storage medium having stored thereon a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the keyword detection method as claimed in any one of claims 1 to 5.
CN202010184549.6A 2020-03-17 2020-03-17 Keyword detection method, system, mobile terminal and storage medium Active CN111429912B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant