CN115798461A - Training method and device for user-defined command words and electronic equipment - Google Patents

Training method and device for user-defined command words and electronic equipment

Info

Publication number
CN115798461A
Authority
CN
China
Prior art keywords
decoding
words
language model
user
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211317550.7A
Other languages
Chinese (zh)
Inventor
岳昌洁
黄惠祥
吴人杰
林聚财
殷俊
方瑞东
王宝俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202211317550.7A priority Critical patent/CN115798461A/en
Publication of CN115798461A publication Critical patent/CN115798461A/en
Pending legal-status Critical Current

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The application relates to a training method and device for user-defined command words and to electronic equipment. The method comprises the following steps: obtaining an original decoding graph based on a first acoustic model and a first language model; collecting audio data of a preset number of user-defined command words, and obtaining confusion words of the user-defined command words based on the audio data and the original decoding graph; training the first language model with the user-defined command words and the confusion words to obtain a second language model; and performing a composition operation on the first acoustic model and the second language model to obtain a target decoding graph, where the target decoding graph is used for identifying the user-defined command words in the audio to be detected. Based on this method, the configuration of user-defined command words can be completed without the audio of the user-defined command words participating in acoustic model training, so that resources are saved and decoding efficiency is improved.

Description

Training method and device for user-defined command words and electronic equipment
Technical Field
The application relates to the technical field of voice recognition, in particular to a training method and device for a user-defined command word and electronic equipment.
Background
Currently, the basic tasks in speech recognition can be roughly divided into keyword detection and continuous speech recognition, and keyword detection can in turn be divided into wake-up word recognition and command word recognition. Command word recognition is a technique that enables a machine to correctly perform a specific function by detecting and recognizing human voice instructions.
An existing command word recognition system needs to collect a large amount of audio for the command words, apply framing and windowing to that audio, extract its acoustic features, and train an acoustic model on the acoustic features. When different command words need to be customized, a large amount of audio of the customized command words must be collected again to extract acoustic features and retrain the acoustic model. However, acquiring a large amount of audio and processing it to extract acoustic features and retrain the acoustic model is time-consuming.
Disclosure of Invention
The application aims to provide a training method and device for user-defined command words, and electronic equipment, so that the configuration of user-defined command words can be completed without the audio of the user-defined command words participating in acoustic model training, thereby saving resources and improving decoding efficiency.
In a first aspect, the present application provides a method for training a custom command word, where the method includes:
obtaining an original decoding graph based on the first acoustic model and the first language model;
collecting audio data of a preset number of user-defined command words, and obtaining confusion vocabularies of the user-defined command words based on the audio data and the original decoding graph;
training the first language model through the user-defined command words and the confusion vocabulary to obtain a second language model;
and performing a composition operation on the first acoustic model and the second language model to obtain a target decoding graph, wherein the target decoding graph is used for identifying the custom command words in the audio to be detected.
With this method, when user-defined command words are configured, a large amount of audio data does not need to be collected to retrain the acoustic model, which saves time and resources; the user-defined command words and the confusion words participate in the language model training, which greatly improves the language model's ability to distinguish confusable audio; and the language model trained on the user-defined command words and confusion words is composed with the large-scale acoustic model, which greatly reduces the scale of the decoding graph and improves decoding efficiency.
In a possible implementation manner, the obtaining an original decoding graph based on the first acoustic model and the first language model includes: training the first acoustic model and the first language model with audio data and corpus data, respectively; and performing a composition operation on the first acoustic model and the first language model to obtain the original decoding graph.
The large-scale first acoustic model is trained with open-source audio data, so that when command words are configured later the first acoustic model already has sufficient capacity to distinguish each state; therefore, audio data of the command words does not need to be collected again to participate in acoustic model training, which saves time and resources.
In one possible design, the obtaining a confusion vocabulary of the custom command words based on the audio data and the original decoding graph includes: extracting acoustic features of the audio data; performing a decoding search on the original decoding graph using the acoustic features, and obtaining N decoding paths of the acoustic features based on the result of the decoding search, where N is an integer greater than 1; and taking the vocabulary corresponding to the N decoding paths as the confusion vocabulary of the custom command words.
In one possible design, the N decoding paths are the first N decoding paths in the results of the decoding search when ordered by decoding probability value from largest to smallest, where the decoding probability characterizes the probability of the audio data being decoded into the corresponding vocabulary.
The audio data of the custom command words is decoded on the original decoding graph formed by the large-scale first acoustic model and first language model, so the confusion vocabulary of the custom command words can be obtained; the confusion vocabulary is subsequently used to retrain the language model, which improves the language model's ability to distinguish confusable audio.
In one possible design, the training the first language model by the custom command words and the confusion vocabulary to obtain a second language model includes: taking the self-defined command words and the confusion vocabulary as an in-set word set; and retraining the first language model by using the set of words in the set to obtain the second language model.
The second language model is obtained by retraining with the user-defined command words and the confusion words mixed together, so that the discrimination capability of the language model is greatly improved and misrecognition of the command words is reduced.
In one possible design, the performing a composition operation on the first acoustic model and the second language model to obtain a target decoding graph includes: performing composition on the first acoustic model and the second language model to generate decoding paths corresponding to the user-defined command words and the confusion words and decoding paths corresponding to an out-of-set word set, wherein the out-of-set word set is the set of words other than the user-defined command words and the confusion words; changing the decoding path corresponding to each word in the out-of-set word set into a noise path; and obtaining the target decoding graph based on the decoding paths corresponding to the user-defined command words and the confusion words and on the noise path.
Since the custom command words and the confusion vocabulary participate in graph building as the in-set words, and the decoding paths corresponding to the out-of-set words are changed into a noise path, the resulting target decoding graph is small in scale, so the decoding speed can be greatly improved.
In a second aspect, the present application provides a device for training a custom command word, the device comprising:
the first acquisition module is used for acquiring an original decoding graph based on a first acoustic model and a first language model;
the second acquisition module is used for acquiring audio data of a preset number of user-defined command words and acquiring confusion vocabularies of the user-defined command words based on the audio data and the original decoding graph;
the model training module is used for training the first language model through the user-defined command words and the confusion vocabulary to obtain a second language model;
and the third acquisition module is used for performing a composition operation on the first acoustic model and the second language model to obtain a target decoding graph, wherein the target decoding graph is used for identifying the user-defined command words in the audio to be detected.
In one possible design, the first obtaining module is specifically configured to: train the first acoustic model and the first language model with audio data and corpus data, respectively; and perform a composition operation on the first acoustic model and the first language model to obtain the original decoding graph.
In a possible design, the second obtaining module is specifically configured to: extract acoustic features of the audio data; perform a decoding search on the original decoding graph using the acoustic features, and obtain N decoding paths of the acoustic features based on the decoding search result, where N is an integer greater than 1, the N decoding paths are the first N decoding paths in the decoding search result when ordered by decoding probability value from largest to smallest, and the decoding probability represents the probability of the audio data being decoded into the corresponding vocabulary; and take the vocabulary corresponding to the N decoding paths as the confusion vocabulary of the custom command words.
In a possible design, the model training module is specifically configured to: taking the self-defined command words and the confusion vocabulary as an in-set word set; and retraining the first language model by using the set of words in the set to obtain the second language model.
In a possible design, the third obtaining module is specifically configured to: perform composition on the first acoustic model and the second language model to generate decoding paths corresponding to the user-defined command words and the confusion words and decoding paths corresponding to an out-of-set word set, wherein the out-of-set word set is the set of words other than the user-defined command words and the confusion words; change the decoding path corresponding to each word in the out-of-set word set into a noise path; and obtain the target decoding graph based on the decoding paths corresponding to the user-defined command words and the confusion words and on the noise path.
In a third aspect, the present application provides an electronic device, comprising:
a memory for storing a computer program;
and the processor is used for realizing the steps of the training method of the self-defined command words when executing the computer program stored in the memory.
In a fourth aspect, the present application provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the method for training a custom command word described above.
For the technical effects that can be achieved by the second to fourth aspects and their possible designs, refer to the description above of the technical effects achievable by the first aspect and its various possible solutions; details are not repeated here.
Drawings
Fig. 1 is a flowchart of a training method for custom command words provided in the present application;
FIG. 2 is a schematic diagram of a training apparatus for custom command words provided in the present application;
fig. 3 is a schematic diagram of a structure of an electronic device provided in the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings. The specific operation methods in the method embodiments may also be applied to the apparatus embodiments, system embodiments, and computer program products.
In the description of the present application, "plurality" is understood as "at least two". "And/or" describes an association relationship between associated objects and means that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. "A is connected with B" may mean: A and B are directly connected, or A and B are connected through C. In addition, in the description of the present application, the terms "first", "second" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or order.
In order to facilitate those skilled in the art to better understand the technical solutions provided in the embodiments of the present application, the following technical terms are briefly described as follows:
Open-source data refers to data that is legally collected from publicly available resources. In short, open-source data is data that anyone can access, modify, reuse, and share.
A corpus refers to a large-scale electronic text library that is scientifically sampled and processed and that stores language material actually occurring in real language use.
MFCC (Mel-frequency cepstral coefficients) features, also called Mel-frequency cepstral coefficient features, retain content relevant to the speech while filtering out extraneous information such as background noise. MFCC features use a set of key coefficients to build a Mel-frequency cepstrum, so that the cepstrum is closer to the human nonlinear auditory system.
i-Vector (Identity Vector) refers to the mapping coordinates of each speech segment in the total factor matrix, which models both speaker differences and channel differences; it is equivalent to an identity for the speaker.
Pitch feature, i.e. the fundamental-frequency feature. The pitch frequency is the vibration frequency of the vocal cords; it is the reciprocal of the period of the glottal opening-and-closing cycle through which the airflow passes during voiced speech.
The chain model is a discriminatively trained neural network model that is trained with a maximum mutual information objective function and does not need to generate denominator word lattices.
A pronunciation dictionary is the lexicon used in speech recognition; it contains the mapping from words to phonemes, describing the pronunciation of each word, i.e. giving the relation between each word and its phonemes, and it connects the acoustic model with the language model.
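As a minimal illustration of this data structure (the words and phoneme symbols below are hypothetical examples, not entries from the application), a pronunciation dictionary can be represented as a mapping from words to phoneme sequences:

```python
# Minimal illustration of a pronunciation dictionary: a mapping from words to
# phoneme sequences. The entries and phoneme symbols below are hypothetical.
pronunciation_dict = {
    "打开": ["d", "a3", "k", "ai1"],    # "turn on"
    "电视": ["d", "ian4", "sh", "i4"],  # "television"
}

def word_to_phonemes(word):
    """Look up the phoneme sequence for a word; unknown words return an empty list."""
    return pronunciation_dict.get(word, [])

print(word_to_phonemes("打开"))  # ['d', 'a3', 'k', 'ai1']
```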
It should be noted that the technical features included in the embodiments of the present application may be arbitrarily combined for use, and those skilled in the art should understand that, from the practical application situation, the technical solutions obtained by reasonably combining the technical features in the embodiments of the present application may also solve the same technical problems or achieve the same technical effects.
The method provided by the embodiment of the application is further described in detail with reference to the attached drawings.
Referring to fig. 1, an embodiment of the present application provides a method for training a custom command word, which includes the following specific processes:
step 101: obtaining an original decoding graph based on the first acoustic model and the first language model;
In the embodiment of the present application, a first acoustic model for large-scale speech recognition may be trained by collecting a large amount of audio data, where the audio data used in the embodiment of the present application may be open-source audio data or audio data collected or purchased through legal channels.
Specifically, a large amount of audio data is collected and acoustic feature extraction is performed on it. In the acoustic feature extraction stage, a series of operations such as framing, windowing and pre-emphasis need to be performed on the audio data to convert it into acoustic feature vectors. The acoustic features usually extracted comprise 13-dimensional or 39-dimensional Mel-frequency cepstral coefficient (MFCC) features, i-vector features carrying speaker information, and pitch features specific to Chinese; these acoustic features are used as training data to train the first acoustic model. The structure of the first acoustic model may be a chain model structure. When training the first acoustic model with a chain structure, a Gaussian mixture-hidden Markov model (GMM-HMM) is first used to obtain the state of each frame of audio data, and the extracted acoustic features are then fed into a chain neural network model to form the first acoustic model with the chain structure. The first acoustic model is used to obtain the probability of an audio feature corresponding to a certain word (or character, phoneme, or any other modeling unit), that is, the probability that the unit generates a given segment of acoustic features. Because homophones exist, the acoustic features of each piece of audio data may correspond to multiple words, and these words are stored in a lattice word graph. In a word graph, the words are typically stored on paths or nodes.
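The feature-extraction pipeline described above can be sketched as follows. This is a minimal sketch assuming the librosa library is available; the frame length, frame shift and pre-emphasis coefficient are typical values rather than values prescribed by the application, and i-vector and pitch extraction are omitted.

```python
import numpy as np
import librosa

def extract_features(wav_path, sr=16000, pre_emph=0.97,
                     frame_len=0.025, frame_shift=0.010, n_mfcc=13):
    """Pre-emphasize, frame, window and compute MFCC features for one utterance."""
    y, sr = librosa.load(wav_path, sr=sr)

    # Pre-emphasis: boost high frequencies before framing.
    y = np.append(y[0], y[1:] - pre_emph * y[:-1])

    # Explicit framing + Hamming windowing, shown only to illustrate the step;
    # librosa also frames internally when computing MFCCs below.
    win = int(frame_len * sr)
    hop = int(frame_shift * sr)
    frames = librosa.util.frame(y, frame_length=win, hop_length=hop).T
    frames = frames * np.hamming(win)

    # 13-dimensional MFCCs; delta and delta-delta features could be appended
    # to reach 39 dimensions.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=win, hop_length=hop)
    return mfcc.T  # shape: (num_frames, n_mfcc)
```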
In the embodiment of the application, the first language model for large-scale speech recognition can be trained by using data in the corpus.
Specifically, the text collected in the large-scale corpus is used as training data to train the large-scale first language model. The first language model may use a conventional statistical n-gram language model structure. The first language model is used to compute the probability of a word sequence, which can be understood as correcting the grammar and semantics of the recognized words.
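The role of the statistical n-gram language model, assigning a probability to a word sequence from corpus counts, can be illustrated with a minimal bigram sketch; the add-one smoothing and the toy corpus below are illustrative assumptions, not details from the application.

```python
from collections import Counter
import math

def train_bigram_lm(sentences):
    """Count unigrams and bigrams from a tokenized corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def sentence_logprob(sent, unigrams, bigrams):
    """Log-probability of a word sequence under the bigram model (add-one smoothing)."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sent + ["</s>"]
    logp = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        logp += math.log(p)
    return logp

# Toy corpus (illustrative): each sentence is a list of words.
corpus = [["turn", "on", "the", "television"], ["turn", "off", "the", "light"]]
uni, bi = train_bigram_lm(corpus)
print(sentence_logprob(["turn", "on", "the", "light"], uni, bi))
```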
After the large-scale first acoustic model and first language model are trained, a composition operation is performed on the first acoustic model and the first language model to obtain the original decoding graph. In the embodiment of the present application, the original decoding graph can be understood as a decoding space defined by the large-scale first language model; the decoding space contains multiple paths, each path corresponds to a word, and each path corresponds to a decoding probability.
The large-scale first acoustic model is trained with open-source audio data, so that when command words are configured later the first acoustic model already has sufficient capacity to distinguish each state; therefore, audio data of the command words does not need to be collected again to participate in acoustic model training, which saves time and resources.
Step 102: acquiring audio data of a preset number of user-defined command words, and acquiring confusion words of the user-defined command words based on the audio data and an original decoding graph;
In the embodiment of the application, the audio data of the preset number of custom command words is collected based on a custom command word list. The number of custom command words on the custom command word list is at least 1, and the preset number is smaller than the amount of audio data that must be collected when an acoustic model is trained in the prior art. For example, if a user needs to configure 5 new command words, the 5 command words are made into a custom command word list. When an acoustic model is trained in the prior art, 100 audio recordings of each of the 5 custom command words need to be acquired, i.e. 500 recordings in total, to train the acoustic model. In the embodiment of the application, because the original decoding graph is obtained by composing the large-scale first acoustic model and first language model, and the first acoustic model already has sufficient capacity to analyze each state of the audio, only 30 recordings of each of the 5 custom command words, i.e. 150 recordings in total, need to be acquired for decoding on the original decoding graph, and the acoustic model does not need to be retrained.
Compared with collecting a large amount of audio data and retraining the acoustic model, collecting a small amount of custom command word audio data for decoding on the original decoding graph saves a large amount of resources and time.
In the embodiment of the application, acoustic features are extracted from the collected audio data of the user-defined command words in the preset number, and decoding search is performed on an original decoding graph by using the acoustic features to obtain N decoding paths of the acoustic features, wherein the N decoding paths are the first N decoding paths with decoding probability values ordered from large to small in a search result, which are also called N-best paths, and N is an integer greater than 1; the decoding probability represents the probability of the self-defined command word audio data to be decoded into a corresponding vocabulary. For example, if the collected audio data of the custom command word is "turn on a television", the acoustic features are extracted to perform decoding search on the original decoding graph, and the obtained search result may be as shown in table 1.
TABLE 1
In the embodiment of the application, the N-best paths in the decoding search result can be saved as a word-graph structure file, and the vocabulary corresponding to the N-best paths is used as the confusion vocabulary of the custom command words; the confusion vocabulary of the custom command words consists of words whose semantic similarity or textual similarity to the custom command words meets a preset threshold.
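A minimal sketch of turning N-best decoding hypotheses into a confusion vocabulary; the hypothesis structure (word sequence, decoding probability) is an assumption for illustration, and the decoding search itself is assumed to have been performed already.

```python
import heapq

def n_best_confusion_words(hypotheses, command_word, n=5):
    """Pick the vocabulary of the top-N decoding hypotheses as confusion words.

    `hypotheses` is assumed to be a list of (word_sequence, decoding_probability)
    pairs produced by decoding the command-word audio on the original decoding graph.
    """
    top_n = heapq.nlargest(n, hypotheses, key=lambda h: h[1])
    confusion = set()
    for words, _prob in top_n:
        if words != command_word:  # keep only hypotheses that differ from the command word
            confusion.add(words)
    return confusion

# Illustrative hypotheses for the command word "turn on the television".
hyps = [("turn on the television", 0.62), ("turn on the provision", 0.21),
        ("turn of the television", 0.09), ("burn on the television", 0.05)]
print(n_best_confusion_words(hyps, "turn on the television", n=3))
```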
By performing the decoding search on the audio data of the custom command words on the original decoding graph formed by the large-scale first acoustic model and first language model, the confusion vocabulary of the custom command words can be obtained. The confusion vocabulary is saved and used to retrain the language model, which improves the language model's ability to distinguish confusable audio; together with a confidence judgment when the custom command words are later recognized, this can effectively reduce the occurrence of misrecognition.
Step 103: training the first language model through the user-defined command words and the confusion words to obtain a second language model;
In the embodiment of the application, the confusion vocabulary of the custom command words already comes from the corpus used to train the first language model, so the second language model can be obtained simply by adding the text of the custom command words to the corpus and retraining the first language model.
Specifically, the custom command words and the confusion vocabulary are taken as the in-set word set, and the other words in the corpus, excluding the custom command words and the confusion vocabulary, are taken as the out-of-set word set; the out-of-set words are represented by <SPOKEN_NOISE>, and the first language model is retrained to finally obtain the second language model.
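A minimal sketch of preparing the retraining text in this way, with every out-of-set word mapped to <SPOKEN_NOISE>; the toy corpus and word sets below are illustrative assumptions, not data from the application.

```python
def prepare_lm_training_text(corpus_sentences, command_words, confusion_words):
    """Map every word outside the in-set word set to <SPOKEN_NOISE> before retraining the LM."""
    in_set = set(command_words) | set(confusion_words)
    remapped = []
    for sent in corpus_sentences:
        remapped.append([w if w in in_set else "<SPOKEN_NOISE>" for w in sent])
    return remapped

# Illustrative data: one sentence made of in-set words, one made of out-of-set words.
corpus = [["turn", "on", "the", "television"], ["play", "some", "music"]]
cmd = {"turn", "on", "the", "television"}
conf = {"provision"}
print(prepare_lm_training_text(corpus, cmd, conf))
# [['turn', 'on', 'the', 'television'],
#  ['<SPOKEN_NOISE>', '<SPOKEN_NOISE>', '<SPOKEN_NOISE>']]
```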
Retraining the language model with the custom command words and the confusion words mixed together greatly improves the discrimination capability of the language model; together with a confidence judgment when the custom command words are later recognized, this can effectively reduce the occurrence of misrecognition.
Step 104: performing a composition operation on the first acoustic model and the second language model to obtain a target decoding graph.
In the embodiment of the present application, the composition operation refers to the composition operation of weighted finite-state transducers (WFSTs), and the target decoding graph is obtained by performing the composition operation on the first acoustic model and the second language model. Specifically, the mapping between the modeling units of the first acoustic model (phonemes) and the modeling units of the second language model (words) is obtained through the pronunciation dictionary, so that the first acoustic model and the second language model are connected to form a searchable state space (the target decoding graph) used by the decoder for decoding.
First, the pronunciation dictionary, the second language model, and the first acoustic model are each expressed as a WFST. The pronunciation dictionary in WFST form (L.fst) expands words into phonemes: the input word is transduced by the WFST, and the phoneme sequence corresponding to the word is output. The second language model in WFST form (G.fst) is a mapping that expands words into grammatically constrained words. L.fst and G.fst are composed to obtain LG.fst; that is, the transduction of the pronunciation-dictionary WFST and the transduction of the second-language-model WFST are combined into a single equivalent transduction, yielding the mapping from phonemes to grammatically constrained words.
LG.fst is then composed with the context-dependency WFST (C.fst) to obtain CLG.fst, yielding the expansion from grammatically constrained words to context-dependent units. Finally, CLG.fst is composed with the first acoustic model in WFST form (H.fst) to obtain HCLG.fst, yielding the expansion from grammatically constrained words to acoustic features.
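The successive compositions (L with G, then C with LG, then H with CLG) can be sketched with the OpenFst Python bindings. This is a minimal sketch assuming pywrapfst is installed and that H.fst, C.fst, L.fst and G.fst already exist on disk; the determinization and minimization steps shown are the usual ones rather than steps prescribed by the application, and production toolkits typically use more specialized composition routines.

```python
import pywrapfst as fst

def compose_decoding_graph(h_path, c_path, l_path, g_path, out_path="HCLG.fst"):
    """Build HCLG.fst by the successive compositions described in the text."""
    H = fst.Fst.read(h_path)
    C = fst.Fst.read(c_path)
    L = fst.Fst.read(l_path)
    G = fst.Fst.read(g_path)

    # L o G: mapping from phonemes to grammatically constrained words.
    L.arcsort(sort_type="olabel")
    LG = fst.determinize(fst.compose(L, G))
    LG.minimize()

    # C o LG: add context dependency.
    C.arcsort(sort_type="olabel")
    CLG = fst.determinize(fst.compose(C, LG))
    CLG.minimize()

    # H o CLG: connect the acoustic model, yielding the final decoding graph.
    H.arcsort(sort_type="olabel")
    HCLG = fst.compose(H, CLG)
    HCLG.write(out_path)
    return HCLG
```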
In the embodiment of the present application, the mapping relationships of L.fst, G.fst, LG.fst obtained by composing L.fst and G.fst, C.fst, CLG.fst obtained by composing C.fst and LG.fst, H.fst, and HCLG.fst obtained by composing H.fst and CLG.fst are shown in Table 2 below.
TABLE 2
In Table 2, the mapping relationship between the grammatically constrained words and the acoustic features can be regarded as a path, and the weight in HCLG.fst can be regarded as a decoding probability. Because homophones exist, several grammatically constrained words may map to the same acoustic features; these words, their respective mapping relationships and their respective weights are saved as a word-graph structure file, which forms the target decoding graph.
In the target decoding graph, the acoustic features of the audio data correspond to multiple words, each word corresponds to a path, and each path corresponds to a decoding probability. When the decoding paths corresponding to the custom command words and their confusion words and the decoding paths corresponding to the out-of-set word set are generated, the decoding paths corresponding to the out-of-set word set are changed into a <SPOKEN_NOISE> noise path.
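A minimal sketch of collapsing the out-of-set decoding paths into a single noise label; the decoding graph is represented here abstractly as a list of weighted, word-labeled edges, which is an assumption for illustration rather than the actual WFST representation.

```python
def collapse_out_of_set_paths(edges, in_set_words, noise_label="<SPOKEN_NOISE>"):
    """Relabel every decoding path whose word is outside the in-set word set as a noise path.

    `edges` is assumed to be a list of (src_state, dst_state, word, weight) tuples,
    an abstract stand-in for the paths of the target decoding graph.
    """
    collapsed = []
    for src, dst, word, weight in edges:
        if word in in_set_words:
            collapsed.append((src, dst, word, weight))
        else:
            collapsed.append((src, dst, noise_label, weight))
    return collapsed

# Illustrative edges: a command word, a confusion word, and an out-of-set word.
edges = [(0, 1, "turn on the television", 0.62),
         (0, 1, "turn on the provision", 0.21),
         (0, 1, "play some music", 0.17)]
in_set = {"turn on the television", "turn on the provision"}
print(collapse_out_of_set_paths(edges, in_set))
```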
Since the custom command words and the confusion vocabulary participate in graph building as the in-set words, and the decoding paths corresponding to the out-of-set words are changed into a noise path, the resulting target decoding graph is small in scale and decoding of audio is highly efficient.
Further, after the target decoding graph is obtained, when the custom command word in the audio to be detected is identified through the target decoding graph, whether the command word is output can be determined by judging the confidence of the recognition result.
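A minimal sketch of this confidence judgment; the threshold value is an illustrative assumption, not one given in the application.

```python
def accept_command(recognized_word, confidence, command_words, threshold=0.8):
    """Output the command word only if it is a configured command word and its
    confidence exceeds the threshold; otherwise reject it as noise / a false alarm.
    The threshold value here is an illustrative assumption.
    """
    if recognized_word in command_words and confidence >= threshold:
        return recognized_word
    return None

print(accept_command("turn on the television", 0.91, {"turn on the television"}))  # accepted
print(accept_command("turn on the provision", 0.55, {"turn on the television"}))   # None (rejected)
```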
In summary, with this method a large amount of audio data does not need to be collected to retrain the acoustic model when custom command words are configured, which saves time and resources; the custom command words and the confusion words participate in the language model training, which greatly improves the language model's ability to distinguish confusable audio; and the language model trained on the custom command words and confusion words is composed with the large-scale acoustic model, which greatly reduces the scale of the decoding graph and improves decoding efficiency.
Based on the same inventive concept, the present application further provides a training device for a custom command word, which is used to implement configuration of the custom command word without the need of the custom command word audio participating in acoustic model training, save resources, and improve decoding efficiency, and with reference to fig. 2, the device includes:
a first obtaining module 201, configured to obtain an original decoding graph based on a first acoustic model and a first language model;
the second obtaining module 202 is configured to collect audio data of a preset number of custom command words, and obtain confusion words of the custom command words based on the audio data and the original decoding graph;
the model training module 203 is configured to train the first language model through the custom command words and the confusion vocabulary to obtain a second language model;
the third obtaining module 204 is configured to perform a composition operation on the first acoustic model and the second language model to obtain a target decoding graph, where the target decoding graph is used to identify the custom command word in the audio to be detected.
In a possible design, the first obtaining module 201 is specifically configured to: train the first acoustic model and the first language model with audio data and corpus data, respectively; and perform a composition operation on the first acoustic model and the first language model to obtain the original decoding graph.
In a possible design, the second obtaining module 202 is specifically configured to: extract acoustic features of the audio data; perform a decoding search on the original decoding graph using the acoustic features, and obtain N decoding paths of the acoustic features based on the decoding search result, where N is an integer greater than 1, the N decoding paths are the first N decoding paths in the decoding search result when ordered by decoding probability value from largest to smallest, and the decoding probability represents the probability of the audio data being decoded into the corresponding vocabulary; and take the vocabulary corresponding to the N decoding paths as the confusion vocabulary of the custom command words.
In one possible design, the model training module 203 is specifically configured to: taking the self-defined command words and the confusion vocabulary as an in-set word set; and retraining the first language model by using the set of words in the set to obtain the second language model.
In a possible design, the third obtaining module 204 is specifically configured to: perform composition on the first acoustic model and the second language model to generate decoding paths corresponding to the custom command words and the confusion words and decoding paths corresponding to an out-of-set word set, where the out-of-set word set is the set of words other than the custom command words and the confusion words; change the decoding path corresponding to each word in the out-of-set word set into a noise path; and obtain the target decoding graph based on the decoding paths corresponding to the custom command words and the confusion words and on the noise path.
Based on the same inventive concept, an embodiment of the present application further provides an electronic device, where the electronic device can implement the function of the training apparatus for self-defining command words, and with reference to fig. 3, the electronic device includes:
at least one processor 301 and a memory 302 connected to the at least one processor 301. In this embodiment, the specific connection medium between the processor 301 and the memory 302 is not limited in this application; fig. 3 illustrates an example in which the processor 301 and the memory 302 are connected through a bus 300. The bus 300 is shown in fig. 3 by a thick line; the connections between other components are merely illustrative and are not limited thereto. The bus 300 may be divided into an address bus, a data bus, a control bus, etc., and is shown with only one thick line in fig. 3 for ease of illustration, which does not mean that there is only one bus or one type of bus. Alternatively, the processor 301 may also be referred to as a controller, and its name is not limited.
In the embodiment of the present application, the memory 302 stores instructions executable by the at least one processor 301, and the at least one processor 301 can execute the training method of the custom command word discussed above by executing the instructions stored in the memory 302. The processor 301 may implement the functions of the various modules in the apparatus shown in fig. 2.
The processor 301 is a control center of the apparatus, and may connect various parts of the entire control device by using various interfaces and lines, and perform various functions of the apparatus and process data by operating or executing instructions stored in the memory 302 and calling up data stored in the memory 302, thereby performing overall monitoring of the apparatus.
In one possible design, processor 301 may include one or more processing units, and processor 301 may integrate an application processor that primarily handles operating systems, user interfaces, application programs, and the like, and a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 301. In some embodiments, the processor 301 and the memory 302 may be implemented on the same chip, or in some embodiments, they may be implemented separately on separate chips.
The processor 301 may be a general-purpose processor, such as a Central Processing Unit (CPU), digital signal processor, application specific integrated circuit, field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like, that may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the training method for the custom command word disclosed in the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor.
Memory 302, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 302 may include at least one type of storage medium, for example, a flash memory, a hard disk, a multimedia card, a card-type memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read Only Memory (PROM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 302 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 302 in the embodiments of the present application may also be circuitry or any other device capable of performing a storage function, for storing program instructions and/or data.
By programming the processor 301, the code corresponding to the training method for custom command words described in the foregoing embodiments can be fixed into the chip, so that the chip can execute the steps of the training method for custom command words of the embodiment shown in fig. 1 when it runs. How to program the processor 301 is well known to those skilled in the art and is not described here.
Based on the same inventive concept, embodiments of the present application further provide a storage medium storing computer instructions, which when executed on a computer, cause the computer to perform the method for training the custom command word discussed above.
In some possible embodiments, the aspects of the method for training a custom command word provided by the present application may also be implemented in the form of a program product including program code for causing the control apparatus to perform the steps of the method for training a custom command word according to various exemplary embodiments of the present application described above in this specification when the program product is run on a device.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (10)

1. A training method of a custom command word is characterized by comprising the following steps:
obtaining an original decoding graph based on the first acoustic model and the first language model;
acquiring audio data of a preset number of user-defined command words, and acquiring confusion words of the user-defined command words based on the audio data and the original decoding graph;
training the first language model through the user-defined command words and the confusion vocabulary to obtain a second language model;
and performing a composition operation on the first acoustic model and the second language model to obtain a target decoding graph, wherein the target decoding graph is used for identifying the custom command words in the audio to be detected.
2. The method of claim 1, wherein obtaining the original decoding graph based on the first acoustic model and the first language model comprises:
respectively training the first acoustic model and the first language model by using audio data and data of a corpus;
and performing a composition operation on the first acoustic model and the first language model to obtain the original decoding graph.
3. The method of claim 1, wherein the obtaining an obfuscated vocabulary of the custom command words based on the audio data and the original decoding graph comprises:
extracting acoustic features of the audio data;
performing decoding search on the original decoding graph by using the acoustic features, and obtaining N decoding paths of the acoustic features based on the result of the decoding search, wherein N is an integer greater than 1;
and determining the vocabulary corresponding to each decoding path in the N decoding paths as the confusion vocabulary of the self-defined command words.
4. The method of claim 3, wherein the N decoding paths are the first N decoding paths with decoding probability values ordered from large to small in the results of the decoding search, the decoding probabilities characterizing the probability of the audio data being decoded into corresponding words.
5. The method of claim 1, wherein said training said first language model with said custom command words and said obfuscated vocabulary to obtain a second language model comprises:
taking the self-defined command words and the confusion vocabulary as an in-set word set;
and retraining the first language model by using the set of words in the set to obtain the second language model.
6. The method of claim 1, wherein said performing a composition operation on the first acoustic model and the second language model to obtain a target decoding graph comprises:
performing a composition operation on the first acoustic model and the second language model to generate a decoding path corresponding to each of the custom command words and the confusion vocabulary and a decoding path corresponding to each word in an out-of-set word set, wherein the out-of-set word set is the set of words other than the custom command words and the confusion vocabulary;
changing the decoding path corresponding to each word in the out-of-set word set into a noise path;
and obtaining the target decoding graph based on the decoding paths corresponding to the custom command words and the confusion vocabulary and on the noise path.
7. An apparatus for training a custom command word, the apparatus comprising:
the first acquisition module is used for acquiring an original decoding graph based on a first acoustic model and a first language model;
the second acquisition module is used for acquiring audio data of a preset number of user-defined command words and acquiring confusion vocabularies of the user-defined command words based on the audio data and the original decoding graph;
the model training module trains the first language model through the user-defined command words and the confusion vocabularies to obtain a second language model;
and the third acquisition module is used for performing a composition operation on the first acoustic model and the second language model to obtain a target decoding graph, wherein the target decoding graph is used for identifying the user-defined command words in the audio to be detected.
8. The apparatus of claim 7, wherein the second obtaining module is specifically configured to:
extracting acoustic features of the audio data;
performing decoding search on the original decoding graph by using the acoustic features, and obtaining N decoding paths of the acoustic features based on the result of the decoding search, wherein N is an integer greater than 1;
and determining the vocabulary corresponding to each decoding path in the N decoding paths as the confusion vocabulary of the self-defined command words.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the method steps of any one of claims 1-6 when executing the computer program stored on the memory.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1-6.
CN202211317550.7A 2022-10-26 2022-10-26 Training method and device for user-defined command words and electronic equipment Pending CN115798461A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211317550.7A CN115798461A (en) 2022-10-26 2022-10-26 Training method and device for user-defined command words and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211317550.7A CN115798461A (en) 2022-10-26 2022-10-26 Training method and device for user-defined command words and electronic equipment

Publications (1)

Publication Number Publication Date
CN115798461A true CN115798461A (en) 2023-03-14

Family

ID=85433851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211317550.7A Pending CN115798461A (en) 2022-10-26 2022-10-26 Training method and device for user-defined command words and electronic equipment

Country Status (1)

Country Link
CN (1) CN115798461A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination