CN113506574A - Method and device for recognizing user-defined command words and computer equipment - Google Patents

Method and device for recognizing user-defined command words and computer equipment

Info

Publication number
CN113506574A
Authority
CN
China
Prior art keywords
corpus
command word
data
command
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111054121.0A
Other languages
Chinese (zh)
Inventor
李杰
王广新
杨汉丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Youjie Zhixin Technology Co ltd
Original Assignee
Shenzhen Youjie Zhixin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Youjie Zhixin Technology Co ltd filed Critical Shenzhen Youjie Zhixin Technology Co ltd
Priority to CN202111054121.0A priority Critical patent/CN113506574A/en
Publication of CN113506574A publication Critical patent/CN113506574A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L2015/223 Execution procedure of a spoken command

Abstract

The application provides a method, an apparatus, and computer equipment for recognizing user-defined command words. The recognition system collects voice data and inputs it into a pre-constructed speech recognition model to obtain a sequence matrix. It then retrieves a plurality of predefined custom command words and, for each one, searches the sequence matrix for the corresponding command word path. The system computes a posterior score for each command word path and selects the custom command word whose path has the highest posterior score as the command word contained in the currently collected voice data. Because the system searches the sequence matrix separately for each custom command word to obtain its path, instead of recognizing and comparing over the whole sequence matrix, recognition time is greatly reduced and the recognition accuracy of custom command words is effectively improved.

Description

Method and device for recognizing user-defined command words and computer equipment
Technical Field
The present application relates to the field of command word recognition technologies, and in particular, to a method and an apparatus for recognizing a user-defined command word, and a computer device.
Background
Command word recognition, also called keyword spotting, is a technique for detecting specific command words in a stretch of speech. It is widely used in Internet-of-Things devices, which have very strict power-consumption requirements. Existing command word recognition is generally implemented in one of two ways: 1. voice data is collected on the device side and recognized by a cloud server; 2. voice data is both collected and recognized on the device side. The first approach must upload the collected voice data to a server, which introduces processing latency and risks leaking user privacy: processing is slow and there are potential security hazards. The second approach can only recognize fixed command words; if the user redefines a command word, new command word corpora must be collected for model training, which is time-consuming and labor-intensive.
Disclosure of Invention
The main purpose of the application is to provide a method, an apparatus, and computer equipment for recognizing custom command words, aiming to solve the problem that existing command word recognition methods process user-defined command words slowly.
In order to achieve the above object, the present application provides a method for identifying a custom command word, including:
collecting voice data;
inputting the voice data into a pre-constructed voice recognition model to obtain a sequence matrix;
retrieving a plurality of predefined custom command words, and searching the sequence matrix for the command word path corresponding to each custom command word;
and calculating a posterior score for each command word path, and selecting, as the command word contained in the voice data, the custom command word whose command word path has the highest posterior score, provided that score exceeds a score threshold.
The application also provides a device for identifying the user-defined command words, which comprises:
the acquisition module is used for acquiring voice data;
the recognition module is used for inputting the voice data into a pre-constructed voice recognition model to obtain a sequence matrix;
the searching module is used for retrieving a plurality of predefined custom command words and searching the sequence matrix for the command word path corresponding to each custom command word;
and the calculating module is used for calculating a posterior score for each command word path, and selecting, as the command word contained in the voice data, the custom command word whose command word path has the highest posterior score, provided that score exceeds the score threshold.
The present application further provides a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of any one of the above methods when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of any of the above.
With the method, apparatus, and computer equipment for recognizing user-defined command words provided by the application, the recognition system collects voice data input by a user and feeds it into a pre-constructed speech recognition model to obtain a sequence matrix. The recognition system retrieves a plurality of predefined custom command words and searches the sequence matrix for the command word path corresponding to each one. It then calculates a posterior score for each command word path and selects, as the command word contained in the voice data, the custom command word whose path has the highest posterior score above the score threshold. Because the recognition system searches the sequence matrix per custom command word to obtain its path, and scores only those paths, it does not need to recognize and compare over the whole sequence matrix; this greatly reduces recognition time while effectively improving the recognition accuracy of custom command words.
Drawings
FIG. 1 is a schematic diagram illustrating steps of a method for recognizing a custom command word according to an embodiment of the present application;
fig. 2 is a block diagram illustrating an overall structure of a device for recognizing a custom command word according to an embodiment of the present application;
fig. 3 is a block diagram schematically illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, an embodiment of the present application provides a method for identifying a custom command word, including:
S1, collecting voice data;
S2, inputting the voice data into a pre-constructed speech recognition model to obtain a sequence matrix;
S3, retrieving a plurality of predefined custom command words, and searching the sequence matrix for the command word path corresponding to each custom command word;
S4, calculating a posterior score for each command word path, and selecting, as the command word contained in the voice data, the custom command word whose command word path has the highest posterior score, provided that score exceeds the score threshold.
In this embodiment, the recognition system collects the voice data spoken by the user and inputs it into a pre-constructed speech recognition model, which processes it into a sequence matrix. The speech recognition model is preferably deployed on the end-side device and is used to recognize custom command words in the user's speech. The sequence matrix is a multidimensional matrix of shape N x T, where N is the number of modeling-unit categories of the model and T is the time length of the voice data. For example, if the modeling unit of the speech recognition model is the phoneme and there are 65 phoneme categories, then N is 66 (one extra dimension for blank) and the sequence matrix has shape 66 x T. The recognition system retrieves the custom command words the user defined in advance and determines the command word path corresponding to each of them in the sequence matrix. It then calculates the posterior score of each command word path (the path posterior is computed as in the prior art and is not detailed here), screens out the highest posterior score among them, and compares it against a preset score threshold. If the highest posterior score is below the score threshold, the system judges that no command word can be recognized in the currently collected voice data, and collects and recognizes again. If the highest posterior score exceeds the score threshold, the custom command word corresponding to the command word path with the highest posterior score is selected as the command word contained in the currently collected voice data. Once the custom command word spoken by the user has been recognized, the recognition system controls the device to execute the command action associated with that word according to the pre-established association. For example, if the device is an air conditioner and the custom command word is 'turn on the air conditioner', the recognition system controls the air conditioner to turn on automatically.
In this embodiment, the recognition system searches the sequence matrix per custom command word to obtain the corresponding command word path (which effectively restricts the search range to the custom command words predefined by the user) and computes a posterior score for each such path. It never performs recognition and comparison over the whole sequence matrix (in the prior art, a recognition result is first obtained over the search space formed by the entire sequence matrix and only then compared against the custom command words). The recognition system can therefore greatly reduce recognition time while effectively improving the recognition accuracy of custom command words.
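By way of illustration only, and not as a limitation of the embodiment, the path scoring described above can be sketched as follows. The sketch assumes a T x N matrix of per-frame log posteriors (the sequence matrix in time-major layout), phoneme-ID label sequences for each custom command word, and the standard CTC forward recursion; the helper names (ctc_log_score, recognize) are illustrative assumptions, not part of the application.

    import numpy as np

    def ctc_log_score(log_probs, labels, blank=0):
        """CTC forward algorithm: log P(labels | log_probs).

        log_probs: T x N array of per-frame log posteriors;
        labels: non-empty phoneme-ID sequence of one custom command word.
        """
        T = log_probs.shape[0]
        ext = [blank]                      # extended sequence: b, l1, b, l2, ..., b
        for l in labels:
            ext += [l, blank]
        S = len(ext)
        alpha = np.full((T, S), -np.inf)
        alpha[0, 0] = log_probs[0, ext[0]]
        alpha[0, 1] = log_probs[0, ext[1]]
        for t in range(1, T):
            for s in range(S):
                cands = [alpha[t - 1, s]]
                if s > 0:
                    cands.append(alpha[t - 1, s - 1])
                # A blank may be skipped only between two different labels.
                if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                    cands.append(alpha[t - 1, s - 2])
                alpha[t, s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]
        # Valid endings: final blank or final label.
        return np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])

    def recognize(log_probs, command_words, threshold):
        """Score every custom command word; keep the best one above the threshold."""
        scores = {w: ctc_log_score(log_probs, ids) for w, ids in command_words.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > threshold else None

Each command word is scored in time linear in T, so the total cost grows with the number of registered command words rather than with the size of the full decoding space.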
Further, before the step of inputting the voice data into a pre-constructed speech recognition model to obtain a sequence matrix, the method includes:
S5, acquiring a general corpus;
S6, performing data processing on the general corpus to obtain a training corpus;
S7, training a speech recognition network with the training corpus, using multi-scale feature fusion during training, with the loss function of the model defined as a sequence loss function and the modeling unit of the model being the phoneme; the speech recognition model is obtained once training is complete.
In this embodiment, the recognition system acquires a general corpus as the training data for model training. The corpus may be Chinese, English, or any other language; whichever language is chosen, the phonemes used by the model's modeling unit correspond to that language. Compared with using a corpus restricted to a particular field or scene, using a general corpus means no corpus data needs to be collected for a specific command word list, which greatly improves the flexibility of model training and lowers its cost (on the current market, general corpus data from data vendors costs about 300 yuan per hour, while domain-specific data, such as children's English or dialects, exceeds 700 yuan per hour). The recognition system then applies a series of data-processing steps to the general corpus. Long and short sentences are screened: since command word recognition is short-phrase speech recognition, generally within 3 s, long sentences in the general corpus can be removed or segmented; doing so keeps the distribution of the training data consistent with the distribution of the data seen at inference, which improves the generalization ability of the trained speech recognition model. Different types of data enhancement are applied to the general corpus to diversify the training data. The text corresponding to the audio data of the general corpus is converted into phonemes, which are then mapped to numeric IDs according to a phoneme table, to simplify engineering in practical applications. Once data processing is complete, the recognition system has the training corpus required for model training. The recognition system trains the speech recognition network with this corpus (the network can be any one commonly used in speech recognition, such as a conventional TDNN or LSTM, or a state-of-the-art Conformer). During training, the network extracts target features layer by layer in increasingly abstract form, and the recognition system uses multi-scale feature fusion to combine the features of low network layers with those of high layers. Because short speech carries little information (few phonemes), using only the topmost features yields a model that is not robust; adding multi-layer feature fusion improves the robustness of the features and the generalization ability of the model, and thus the robustness of the trained speech recognition model. Meanwhile, the training loss is defined as CTC (connectionist temporal classification) loss and the modeling unit of the model is defined as the phoneme, which improves the recognition accuracy of the trained model. The phonemes may be pinyin (toned or untoned), initials and finals, or Chinese characters/single characters. Training the model with these settings yields the desired speech recognition model.
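As one possible realization of this training setup, offered as a sketch only (the LSTM stand-in, feature dimension, phoneme inventory, and hyperparameters are illustrative assumptions, not the concrete architecture of the application), a phoneme-level model with low/high-layer feature fusion can be trained against PyTorch's built-in CTC loss roughly as follows:

    import torch
    import torch.nn as nn

    NUM_PHONEMES = 65          # assumed phoneme inventory; index 0 reserved for blank
    NUM_CLASSES = NUM_PHONEMES + 1

    class TinyAcousticModel(nn.Module):
        """Illustrative stand-in for a TDNN/LSTM/Conformer acoustic network."""
        def __init__(self, feat_dim=80, hidden=256):
            super().__init__()
            self.low = nn.LSTM(feat_dim, hidden, batch_first=True)   # low-layer features
            self.high = nn.LSTM(hidden, hidden, batch_first=True)    # high-layer features
            # Multi-scale fusion: concatenate low- and high-layer outputs.
            self.out = nn.Linear(hidden * 2, NUM_CLASSES)

        def forward(self, feats):                  # feats: (B, T, feat_dim)
            low, _ = self.low(feats)
            high, _ = self.high(low)
            fused = torch.cat([low, high], dim=-1)
            return self.out(fused).log_softmax(-1)  # (B, T, NUM_CLASSES)

    model = TinyAcousticModel()
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # One illustrative training step on random stand-in data.
    feats = torch.randn(4, 120, 80)                   # 4 utterances, 120 frames each
    targets = torch.randint(1, NUM_CLASSES, (4, 10))  # phoneme-ID labels (no blanks)
    in_lens = torch.full((4,), 120, dtype=torch.long)
    tgt_lens = torch.full((4,), 10, dtype=torch.long)

    log_probs = model(feats).transpose(0, 1)          # CTCLoss expects (T, B, C)
    loss = ctc(log_probs, targets, in_lens, tgt_lens)
    loss.backward()
    opt.step()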
Preferably, during model training, the recognition system uses an ASR model to obtain frame-level alignment labels for the audio data in the training corpus, and uses the CE (cross-entropy) criterion on those labels as an auxiliary objective when training the NN model.
Further, the step of performing data processing on the general corpus to obtain a training corpus includes:
S601, performing short-sentence processing on the general corpus to obtain a preprocessed corpus;
S602, performing data enhancement on the preprocessed corpus to obtain a secondary processed corpus;
S603, extracting features from the audio data of the secondary processed corpus and the preprocessed corpus, and digitally converting their text data, to form the training corpus.
In this embodiment, command word recognition is short-phrase speech recognition (the voice data of a command word usually does not exceed 3 s), while the corpus used in training is a general corpus, unrestricted to any application scene or field; even if the trained speech recognition model is used for air-conditioner control, a user can still define a custom command such as 'turn on the television' to switch the air conditioner on, which improves the universality of user-defined command words. Long sentences in the general corpus therefore need short-sentence processing (long sentences whose duration exceeds the duration threshold are segmented or simply removed), yielding the preprocessed corpus. Next, the recognition system applies data enhancement to the preprocessed corpus (for example changing speech speed, changing volume, adding noise, SpecAug, pitch shifting, and similar means) to obtain the secondary processed corpus and improve the diversity of the training data. When enhancing the preprocessed corpus, only one type of enhancement is applied to any single preprocessed corpus item; multiple enhancement methods are never stacked on the same item. The recognition system extracts features from the audio data of the secondary processed corpus and the preprocessed corpus (i.e., the corpus not subjected to data enhancement) using conventional feature extraction such as MFCC (Mel-frequency cepstral coefficients), FBANK (filter banks), or LOGFBANK features. It converts the text data corresponding to both corpora into phonemes with the pypinyin tool and maps the phonemes to numeric IDs according to the phoneme table (the processing logic of the trained speech recognition model is stated in terms of the command word's phonemes, but the actual engineering operates on the numeric IDs). The audio features obtained from feature extraction, together with the corresponding numeric IDs, form the training corpus required for model training.
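A minimal sketch of this preprocessing step, assuming librosa for feature extraction and the pypinyin tool named above for text-to-phoneme conversion; the phoneme table and file path are placeholders, not the application's actual table:

    import librosa
    from pypinyin import lazy_pinyin, Style

    # Placeholder phoneme table mapping toned-pinyin phonemes to numeric IDs;
    # ID 0 is reserved for the CTC blank. A real table covers the full inventory.
    phoneme_table = {"da3": 1, "kai1": 2, "kong1": 3, "tiao2": 4}

    def extract_features(wav_path):
        """Per-frame acoustic features (MFCC here; FBANK/LOGFBANK work similarly)."""
        y, sr = librosa.load(wav_path, sr=16000)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # shape (T, 13)

    def text_to_ids(text):
        """Convert a transcript to toned-pinyin phonemes, then to numeric IDs."""
        phones = lazy_pinyin(text, style=Style.TONE3)           # e.g. ['da3', 'kai1', ...]
        return [phoneme_table[p] for p in phones]

    feats = extract_features("sample.wav")      # placeholder path
    ids = text_to_ids("打开空调")               # -> [1, 2, 3, 4] with the table above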
Further, the step of performing data enhancement on the preprocessed corpus to obtain a secondary processed corpus includes:
S6021, dividing the preprocessed corpus into a plurality of corpora to be enhanced according to preset proportions;
S6022, applying a different type of data enhancement method to each corpus to be enhanced to obtain the secondary processed corpus, where a single corpus to be enhanced corresponds to a single type of data enhancement method.
In this embodiment, the recognition system divides the preprocessed corpus into several corpora to be enhanced according to preset proportions (the proportions may be equal or not; they are set by developers and not specifically limited). It then applies a different type of data enhancement method to each corpus to be enhanced, with a single corpus to be enhanced corresponding to a single type of enhancement (i.e., enhancement methods are never stacked on one corpus to be enhanced). The enhanced corpora together constitute the secondary processed corpus. A sketch of this split-then-augment scheme follows.
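The sketch below uses simple stand-in augmentations (gain change, additive noise, and a crude resampling-based speed change); the split ratios and function names are arbitrary assumptions:

    import random
    import numpy as np

    def change_volume(y):
        return y * random.uniform(0.5, 1.5)

    def add_noise(y):
        return y + 0.005 * np.random.randn(len(y))

    def change_speed(y):
        # Crude resample-based speed change, for illustration only.
        idx = np.arange(0, len(y), random.uniform(0.9, 1.1))
        return np.interp(idx, np.arange(len(y)), y)

    AUGMENTERS = [change_volume, add_noise, change_speed]
    RATIOS = [0.4, 0.3, 0.3]   # assumed preset proportions

    def augment_corpus(corpus):
        """Split the preprocessed corpus by RATIOS; each part gets exactly one
        augmentation type (methods are never stacked on a single utterance)."""
        random.shuffle(corpus)
        out, start = [], 0
        for ratio, aug in zip(RATIOS, AUGMENTERS):
            end = start + int(ratio * len(corpus))
            out.extend(aug(y) for y in corpus[start:end])
            start = end
        out.extend(corpus[start:])   # remainder kept untouched
        return out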
Further, the step of performing short-sentence processing on the general corpus to obtain a preprocessed corpus includes:
S6011, removing long sentences whose duration exceeds a duration threshold from the general corpus to obtain the preprocessed corpus;
or S6012, screening out long sentences whose duration exceeds the duration threshold from the general corpus;
S6013, aligning the long sentences and segmenting the aligned long sentences at the alignment time nodes to obtain a short-sentence corpus;
S6014, combining the short-sentence corpus with the corpus other than the long sentences in the general corpus to obtain the preprocessed corpus.
In this embodiment, the recognition system removes from the general corpus the long sentences whose duration exceeds a duration threshold set by developers (for example, with a 3 s threshold, long-sentence corpus items over 3 s are deleted); the general corpus remaining after removal is the preprocessed corpus. Alternatively, the recognition system screens out the long sentences exceeding the duration threshold, aligns them with automatic speech recognition (ASR) technology, and then segments them at the aligned time nodes so that no resulting short sentence exceeds the duration threshold, obtaining the short-sentence corpus. The recognition system merges the segmented short-sentence corpus with the corpus other than the long sentences in the original general corpus (i.e., the part of the original corpus that was not segmented) to obtain the preprocessed corpus.
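Both options can be sketched as follows, assuming each corpus item records its duration and, for the segmentation option, word-level timestamps produced by ASR alignment; the data layout is an assumption for illustration:

    MAX_DUR = 3.0   # assumed duration threshold, in seconds

    def drop_long(corpus):
        """Option 1: simply remove long sentences."""
        return [u for u in corpus if u["duration"] <= MAX_DUR]

    def split_long(utt):
        """Option 2: cut an aligned long sentence at word boundaries so that
        every resulting segment stays within MAX_DUR.

        utt["alignment"] is assumed to be [(word, start_sec, end_sec), ...].
        """
        segments, cur, cur_start = [], [], None
        for word, start, end in utt["alignment"]:
            if cur_start is None:
                cur_start = start
            if end - cur_start > MAX_DUR and cur:
                segments.append(cur)          # close the current short sentence
                cur, cur_start = [], start
            cur.append((word, start, end))
        if cur:
            segments.append(cur)
        return segments

    def preprocess(corpus):
        short = [u for u in corpus if u["duration"] <= MAX_DUR]
        long_ = [u for u in corpus if u["duration"] > MAX_DUR]
        return short, [split_long(u) for u in long_]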
Further, before the step of retrieving a plurality of predefined custom command words and searching the sequence matrix for the command word path corresponding to each custom command word, the method includes:
S8, acquiring a plurality of associated custom command words and command actions input by a user;
S9, parsing each custom command word into pinyin, or into initials and finals;
S10, mapping each pinyin, or each initial and final, to a corresponding numeric ID according to a pre-constructed phoneme table, and associating each numeric ID with the corresponding command action, where the numeric IDs correspond to the output of the speech recognition model.
In this embodiment, the user can customize the command words that control the device according to personal needs: the user enters a number of custom command words and associates each with a command action. For example, if the device is an air conditioner, the user enters the custom command word 'turn on the air conditioner' and associates it with the action of starting the air conditioner; when the air conditioner's recognition system hears the user speak 'turn on the air conditioner', it automatically starts the unit. The custom command words entered by the user are preferably text. The recognition system parses each custom command word into phonemes, such as pinyin or initials and finals, with the pypinyin tool, and then maps each pinyin or each initial and final to a numeric ID according to a pre-constructed phoneme table (the phoneme table contains multiple one-to-one pairs of phonemes and numeric IDs, each phoneme corresponding to a single numeric ID). Following the association between custom command words and command actions, the recognition system associates the numeric IDs of each custom command word with the corresponding command action. The numeric IDs correspond to the output of the speech recognition model; that is, the vector values in the sequence matrix finally output by the model are also numeric IDs.
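The enrollment step can be sketched as follows, assuming text input, the pypinyin-based parsing described above, and the same kind of placeholder phoneme table as earlier; the command_table built here is what the path search consumes:

    from pypinyin import lazy_pinyin, Style

    # Placeholder phoneme table (toned pinyin -> numeric ID); 0 reserved for blank.
    phoneme_table = {"da3": 1, "kai1": 2, "kong1": 3, "tiao2": 4}

    command_table = {}   # custom command word -> (phoneme IDs, device action)

    def register_command(word, action):
        """Parse a user-entered command word into phonemes, map them to the
        numeric IDs the model outputs, and associate the command action."""
        phones = lazy_pinyin(word, style=Style.TONE3)
        ids = [phoneme_table[p] for p in phones]
        command_table[word] = (ids, action)

    register_command("打开空调", action=lambda: print("air conditioner on"))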
Further, the step of retrieving a plurality of predefined custom command words and searching the sequence matrix for the command word path corresponding to each custom command word includes:
S301, retrieving the numeric IDs corresponding to the plurality of custom command words;
S302, searching the sequence matrix for the path corresponding to each group of numeric IDs to obtain the command word paths.
In this embodiment, the recognition system retrieves the numeric IDs of the custom command words predefined by the user, then searches the sequence matrix output by the speech recognition model for the path corresponding to each group of numeric IDs; these paths are the command word paths of the custom command words in the sequence matrix. In the prior art, the text corresponding to the voice data must first be recognized over the whole sequence matrix (also called the decoding space) and then compared with the candidate command words in a command word list to obtain the final recognition result. In this embodiment, the command word paths are obtained by searching the sequence matrix directly from the custom command words set by the user, which effectively restricts the search range to those custom command words; this greatly reduces search time, improves the recognition efficiency of custom command words, and can also improve recognition accuracy.
After each command word path has been found, the recognition system computes the sum of the scores of all paths that map to the same custom command word; that sum is the posterior score of the command word path for that single custom command word. Taking the CTC loss used in training the speech recognition model as an example, suppose the custom command word is 'good bar' (two labels) and the length of the sequence matrix is 4. Then 15 alignment paths map to 'good bar', such as 'bb good bar', 'b good bar b', and 'good good bar b', where b denotes blank: the recognition system merges repeated labels and removes blanks, so each of these paths collapses to 'good bar'. It then sums the scores of the 15 mapping paths; that sum is the posterior score of the command word path for the custom command word 'good bar'. The actual engineering calculation uses the forward-backward algorithm, which obtains the result in a single pass over the sequence matrix, so the time complexity is linear in the sequence length.
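The 15-path count can be checked by brute force: enumerate every length-4 frame labelling over the blank and the two labels, collapse each with the CTC rule (merge repeats, then drop blanks), and keep those that map to the command word. A production system uses the forward-backward algorithm instead, but the enumeration below (purely illustrative) makes the mapping explicit:

    from itertools import product

    BLANK = "b"

    def collapse(path):
        """CTC mapping: merge repeated symbols, then remove blanks."""
        merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
        return tuple(s for s in merged if s != BLANK)

    labels = ("good", "bar")               # the two labels of the command word
    alphabet = (BLANK,) + labels
    paths = [p for p in product(alphabet, repeat=4) if collapse(p) == labels]
    print(len(paths))                      # -> 15

    # The posterior score of the command word is the (probability-domain) sum
    # of the scores of these 15 alignment paths under the sequence matrix.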
Referring to fig. 2, an embodiment of the present application further provides an apparatus for recognizing a custom command word, including:
the acquisition module 1 is used for acquiring voice data;
the recognition module 2 is used for inputting the voice data into a pre-constructed voice recognition model to obtain a sequence matrix;
the searching module 3 is used for retrieving a plurality of predefined custom command words and searching the sequence matrix for the command word path corresponding to each custom command word;
and the calculating module 4 is used for calculating a posterior score for each command word path, and for selecting, as the command word contained in the voice data, the custom command word whose command word path has the highest posterior score, provided that score exceeds the score threshold.
Further, the identification apparatus further includes:
the first obtaining module 5 is used for acquiring a general corpus;
the processing module 6 is used for performing data processing on the general corpus to obtain a training corpus;
and the training module 7 is used for training the speech recognition network with the training corpus, using multi-scale feature fusion during training, with the loss function of the model defined as a sequence loss function and the modeling unit of the model being the phoneme, and for obtaining the speech recognition model after training.
Further, the processing module 6 includes:
the processing unit is used for performing short-sentence processing on the general corpus to obtain a preprocessed corpus;
the enhancing unit is used for performing data enhancement on the preprocessed corpus to obtain a secondary processed corpus;
and the extraction unit is used for extracting features from the audio data of the secondary processed corpus and the preprocessed corpus and digitally converting their text data to form the training corpus.
Further, the enhancement unit includes:
the dividing subunit is used for dividing the preprocessed corpus into a plurality of corpora to be enhanced according to preset proportions;
and the enhancement subunit is used for applying a different type of data enhancement method to each corpus to be enhanced to obtain the secondary processed corpus, where a single corpus to be enhanced corresponds to a single type of data enhancement method.
Further, the processing unit includes:
a removing subunit, configured to remove long sentences whose duration exceeds a duration threshold from the general corpus to obtain the preprocessed corpus;
or, the screening subunit, configured to screen out long sentences whose duration exceeds the duration threshold from the general corpus;
the segmentation subunit, used for aligning the long sentences and segmenting the aligned long sentences at the alignment time nodes to obtain a short-sentence corpus;
and the combining subunit, used for combining the short-sentence corpus with the corpus other than the long sentences in the general corpus to obtain the preprocessed corpus.
Further, the identification apparatus further includes:
the second obtaining module 8 is configured to obtain a plurality of associated custom command words and command actions input by the user;
the analysis module 9 is used for parsing each custom command word into pinyin, or into initials and finals;
and the mapping module 10 is configured to map each pinyin, or each initial and final, to a corresponding numeric ID according to a pre-constructed phoneme table, and to associate each numeric ID with the corresponding command action, where the numeric IDs correspond to the output of the speech recognition model.
Further, the search module 3 includes:
the calling unit is used for retrieving the numeric IDs corresponding to the custom command words;
and the searching unit is used for searching the sequence matrix for the paths corresponding to the numeric IDs to obtain the command word paths.
In this embodiment, each module, unit, and subunit in the apparatus for recognizing the custom command word are used to correspondingly execute each step in the method for recognizing the custom command word, and the specific implementation process thereof is not described in detail herein.
When the apparatus for recognizing user-defined command words provided by this embodiment is applied, the recognition system collects voice data input by a user and feeds it into a pre-constructed speech recognition model to obtain a sequence matrix. The recognition system retrieves a plurality of predefined custom command words and searches the sequence matrix for the command word path corresponding to each one. It then calculates a posterior score for each command word path and selects, as the command word contained in the voice data, the custom command word whose path has the highest posterior score above the score threshold. Because the recognition system searches the sequence matrix per custom command word to obtain its path, and scores only those paths, it does not need to recognize and compare over the whole sequence matrix; this greatly reduces recognition time while effectively improving the recognition accuracy of custom command words.
Referring to fig. 3, an embodiment of the present application also provides a computer device, which may be a server and whose internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computation and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device stores data such as custom command words. The network interface of the computer device communicates with an external terminal through a network connection. The computer program, when executed by the processor, implements the method for recognizing custom command words.
The method for recognizing custom command words executed by the processor comprises the following steps:
S1, collecting voice data;
S2, inputting the voice data into a pre-constructed speech recognition model to obtain a sequence matrix;
S3, retrieving a plurality of predefined custom command words, and searching the sequence matrix for the command word path corresponding to each custom command word;
S4, calculating a posterior score for each command word path, and selecting, as the command word contained in the voice data, the custom command word whose command word path has the highest posterior score, provided that score exceeds the score threshold.
Further, before the step of inputting the voice data into a pre-constructed speech recognition model to obtain a sequence matrix, the method includes:
S5, acquiring a general corpus;
S6, performing data processing on the general corpus to obtain a training corpus;
S7, training a speech recognition network with the training corpus, using multi-scale feature fusion during training, with the loss function of the model defined as a sequence loss function and the modeling unit of the model being the phoneme; the speech recognition model is obtained once training is complete.
Further, the step of performing data processing on the general corpus to obtain a training corpus includes:
S601, performing short-sentence processing on the general corpus to obtain a preprocessed corpus;
S602, performing data enhancement on the preprocessed corpus to obtain a secondary processed corpus;
S603, extracting features from the audio data of the secondary processed corpus and the preprocessed corpus, and digitally converting their text data, to form the training corpus.
Further, the step of performing data enhancement on the preprocessed corpus to obtain a secondary processed corpus includes:
S6021, dividing the preprocessed corpus into a plurality of corpora to be enhanced according to preset proportions;
S6022, applying a different type of data enhancement method to each corpus to be enhanced to obtain the secondary processed corpus, where a single corpus to be enhanced corresponds to a single type of data enhancement method.
Further, the step of performing short-sentence processing on the general corpus to obtain a preprocessed corpus includes:
S6011, removing long sentences whose duration exceeds a duration threshold from the general corpus to obtain the preprocessed corpus;
or S6012, screening out long sentences whose duration exceeds the duration threshold from the general corpus;
S6013, aligning the long sentences and segmenting the aligned long sentences at the alignment time nodes to obtain a short-sentence corpus;
S6014, combining the short-sentence corpus with the corpus other than the long sentences in the general corpus to obtain the preprocessed corpus.
Further, before the step of retrieving a plurality of predefined custom command words and searching the sequence matrix for the command word path corresponding to each custom command word, the method includes:
S8, acquiring a plurality of associated custom command words and command actions input by a user;
S9, parsing each custom command word into pinyin, or into initials and finals;
S10, mapping each pinyin, or each initial and final, to a corresponding numeric ID according to a pre-constructed phoneme table, and associating each numeric ID with the corresponding command action, where the numeric IDs correspond to the output of the speech recognition model.
Further, the step of retrieving a plurality of predefined custom command words and searching the sequence matrix for the command word path corresponding to each custom command word includes:
S301, retrieving the numeric IDs corresponding to the plurality of custom command words;
S302, searching the sequence matrix for the path corresponding to each group of numeric IDs to obtain the command word paths.
An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements a method for recognizing custom command words, which specifically comprises:
S1, collecting voice data;
S2, inputting the voice data into a pre-constructed speech recognition model to obtain a sequence matrix;
S3, retrieving a plurality of predefined custom command words, and searching the sequence matrix for the command word path corresponding to each custom command word;
S4, calculating a posterior score for each command word path, and selecting, as the command word contained in the voice data, the custom command word whose command word path has the highest posterior score, provided that score exceeds the score threshold.
Further, before the step of inputting the voice data into a pre-constructed speech recognition model to obtain a sequence matrix, the method includes:
S5, acquiring a general corpus;
S6, performing data processing on the general corpus to obtain a training corpus;
S7, training a speech recognition network with the training corpus, using multi-scale feature fusion during training, with the loss function of the model defined as a sequence loss function and the modeling unit of the model being the phoneme; the speech recognition model is obtained once training is complete.
Further, the step of performing data processing on the general corpus to obtain a training corpus includes:
S601, performing short-sentence processing on the general corpus to obtain a preprocessed corpus;
S602, performing data enhancement on the preprocessed corpus to obtain a secondary processed corpus;
S603, extracting features from the audio data of the secondary processed corpus and the preprocessed corpus, and digitally converting their text data, to form the training corpus.
Further, the step of performing data enhancement on the preprocessed corpus to obtain a secondary processed corpus includes:
S6021, dividing the preprocessed corpus into a plurality of corpora to be enhanced according to preset proportions;
S6022, applying a different type of data enhancement method to each corpus to be enhanced to obtain the secondary processed corpus, where a single corpus to be enhanced corresponds to a single type of data enhancement method.
Further, the step of performing short-sentence processing on the general corpus to obtain a preprocessed corpus includes:
S6011, removing long sentences whose duration exceeds a duration threshold from the general corpus to obtain the preprocessed corpus;
or S6012, screening out long sentences whose duration exceeds the duration threshold from the general corpus;
S6013, aligning the long sentences and segmenting the aligned long sentences at the alignment time nodes to obtain a short-sentence corpus;
S6014, combining the short-sentence corpus with the corpus other than the long sentences in the general corpus to obtain the preprocessed corpus.
Further, before the step of retrieving a plurality of predefined custom command words and searching the sequence matrix for the command word path corresponding to each custom command word, the method includes:
S8, acquiring a plurality of associated custom command words and command actions input by a user;
S9, parsing each custom command word into pinyin, or into initials and finals;
S10, mapping each pinyin, or each initial and final, to a corresponding numeric ID according to a pre-constructed phoneme table, and associating each numeric ID with the corresponding command action, where the numeric IDs correspond to the output of the speech recognition model.
Further, the step of retrieving a plurality of predefined custom command words and searching the sequence matrix for the command word path corresponding to each custom command word includes:
S301, retrieving the numeric IDs corresponding to the plurality of custom command words;
S302, searching the sequence matrix for the path corresponding to each group of numeric IDs to obtain the command word paths.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by hardware under the instruction of a computer program, which may be stored on a non-volatile computer-readable storage medium and which, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a/an …" does not exclude the presence of another identical element in the process, apparatus, article, or method that comprises that element.
The above description is only for the preferred embodiment of the present application and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are intended to be included within the scope of the present application.

Claims (10)

1. A method for recognizing a custom command word is characterized by comprising the following steps:
collecting voice data;
inputting the voice data into a pre-constructed voice recognition model to obtain a sequence matrix;
retrieving a plurality of predefined custom command words, and searching the sequence matrix for the command word path corresponding to each custom command word;
and calculating a posterior score for each command word path, and selecting, as the command word contained in the voice data, the custom command word whose command word path has the highest posterior score, provided that score exceeds a score threshold.
2. The method for recognizing the custom command word according to claim 1, wherein the step of inputting the speech data into a pre-constructed speech recognition model to obtain a sequence matrix is preceded by the steps of:
acquiring a universal corpus;
performing data processing on the general corpus to obtain a training corpus;
and training the speech recognition network with the training corpus, using multi-scale feature fusion during training, wherein the loss function of the model is defined as a sequence loss function and the modeling unit of the model is a phoneme, the speech recognition model being obtained after training.
3. The method for recognizing the custom command word according to claim 2, wherein the step of performing data processing on the general corpus to obtain a training corpus comprises:
performing short-sentence processing on the general corpus to obtain a preprocessed corpus;
performing data enhancement on the preprocessed corpus to obtain a secondary processed corpus;
and performing feature extraction on the audio data of the secondary processed corpus and the pre-processed corpus, and performing digital conversion on the text data of the secondary processed corpus and the pre-processed corpus to form the training corpus.
4. The method according to claim 3, wherein the step of performing data enhancement on the preprocessed corpus to obtain a secondary processed corpus comprises:
dividing the preprocessed corpus into a plurality of corpora to be enhanced according to a preset proportion;
and respectively performing data enhancement on the linguistic data to be enhanced by using different types of data enhancement methods to obtain the secondarily processed linguistic data, wherein a single linguistic data to be enhanced corresponds to a single type of data enhancement method.
5. The method according to claim 3, wherein the step of performing short-sentence processing on the general corpus to obtain a preprocessed corpus comprises:
removing long sentences with the duration exceeding a duration threshold value from the general corpus to obtain the preprocessed corpus;
or screening out long sentences with the duration exceeding a duration threshold from the universal corpus;
aligning the long sentences, and dividing the aligned long sentences according to alignment time nodes to obtain short sentence corpora;
and synthesizing the short sentence corpus and the corpus except the long sentence in the general corpus to obtain the preprocessed corpus.
6. The method for identifying custom command words according to claim 1, wherein the step of retrieving a plurality of pre-defined custom command words and searching for corresponding command word paths from the sequence matrix based on each of the custom command words is preceded by the step of:
acquiring a plurality of associated custom command words and command actions input by a user;
parsing each custom command word into pinyin, or into initials and finals;
and mapping each pinyin, or each initial and final, to a corresponding numeric ID according to a pre-constructed phoneme table, and associating each numeric ID with the corresponding command action, wherein the numeric IDs correspond to the output of the speech recognition model.
7. The method for identifying custom command words according to claim 6, wherein the step of retrieving a plurality of predefined custom command words and searching the corresponding command word path from the sequence matrix based on each of the custom command words comprises:
calling digital IDs corresponding to a plurality of the user-defined command words;
and searching paths respectively corresponding to the digital IDs from the sequence matrix to obtain the paths of the command words.
8. An apparatus for recognizing a custom command word, comprising:
the acquisition module is used for acquiring voice data;
the recognition module is used for inputting the voice data into a pre-constructed voice recognition model to obtain a sequence matrix;
the searching module is used for retrieving a plurality of predefined custom command words and searching the sequence matrix for the command word path corresponding to each custom command word;
and the calculating module is used for calculating a posterior score for each command word path, and selecting, as the command word contained in the voice data, the custom command word whose command word path has the highest posterior score, provided that score exceeds the score threshold.
9. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202111054121.0A 2021-09-09 2021-09-09 Method and device for recognizing user-defined command words and computer equipment Pending CN113506574A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111054121.0A CN113506574A (en) 2021-09-09 2021-09-09 Method and device for recognizing user-defined command words and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111054121.0A CN113506574A (en) 2021-09-09 2021-09-09 Method and device for recognizing user-defined command words and computer equipment

Publications (1)

Publication Number Publication Date
CN113506574A true CN113506574A (en) 2021-10-15

Family

ID=78017139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111054121.0A Pending CN113506574A (en) 2021-09-09 2021-09-09 Method and device for recognizing user-defined command words and computer equipment

Country Status (1)

Country Link
CN (1) CN113506574A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113724688A (en) * 2021-11-04 2021-11-30 深圳市友杰智新科技有限公司 Post-processing method and device for speech recognition and computer equipment
CN113763960A (en) * 2021-11-09 2021-12-07 深圳市友杰智新科技有限公司 Post-processing method and device for model output and computer equipment
CN113870848A (en) * 2021-12-02 2021-12-31 深圳市友杰智新科技有限公司 Method and device for constructing voice modeling unit and computer equipment
CN115497484A (en) * 2022-11-21 2022-12-20 深圳市友杰智新科技有限公司 Voice decoding result processing method, device, equipment and storage medium
CN115831100A (en) * 2023-02-22 2023-03-21 深圳市友杰智新科技有限公司 Voice command word recognition method, device, equipment and storage medium
CN116168687A (en) * 2023-04-24 2023-05-26 北京探境科技有限公司 Voice data processing method and device, computer equipment and storage medium
CN116168703A (en) * 2023-04-24 2023-05-26 北京探境科技有限公司 Voice recognition method, device, system, computer equipment and storage medium
CN117690434A (en) * 2024-02-04 2024-03-12 深圳市友杰智新科技有限公司 Speech decoding recognition method, device and equipment for multi-command words and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104143326A (en) * 2013-12-03 2014-11-12 腾讯科技(深圳)有限公司 Voice command recognition method and device
KR20180069477A (en) * 2016-12-15 2018-06-25 현대자동차주식회사 Method and vehicle device controlling refined program behavior using voice recognizing
US20200066271A1 (en) * 2018-08-23 2020-02-27 Google Llc Key phrase spotting
CN111402891A (en) * 2020-03-23 2020-07-10 北京字节跳动网络技术有限公司 Speech recognition method, apparatus, device and storage medium
CN111862963A (en) * 2019-04-12 2020-10-30 阿里巴巴集团控股有限公司 Voice wake-up method, device and equipment
CN113284499A (en) * 2021-05-24 2021-08-20 湖北亿咖通科技有限公司 Voice instruction recognition method and electronic equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104143326A (en) * 2013-12-03 2014-11-12 腾讯科技(深圳)有限公司 Voice command recognition method and device
KR20180069477A (en) * 2016-12-15 2018-06-25 현대자동차주식회사 Method and vehicle device controlling refined program behavior using voice recognizing
US20200066271A1 (en) * 2018-08-23 2020-02-27 Google Llc Key phrase spotting
CN111862963A (en) * 2019-04-12 2020-10-30 阿里巴巴集团控股有限公司 Voice wake-up method, device and equipment
CN111402891A (en) * 2020-03-23 2020-07-10 北京字节跳动网络技术有限公司 Speech recognition method, apparatus, device and storage medium
CN113284499A (en) * 2021-05-24 2021-08-20 湖北亿咖通科技有限公司 Voice instruction recognition method and electronic equipment

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113724688A (en) * 2021-11-04 2021-11-30 深圳市友杰智新科技有限公司 Post-processing method and device for speech recognition and computer equipment
CN113724688B (en) * 2021-11-04 2022-03-29 深圳市友杰智新科技有限公司 Post-processing method and device for speech recognition and computer equipment
CN113763960A (en) * 2021-11-09 2021-12-07 深圳市友杰智新科技有限公司 Post-processing method and device for model output and computer equipment
CN113870848A (en) * 2021-12-02 2021-12-31 深圳市友杰智新科技有限公司 Method and device for constructing voice modeling unit and computer equipment
CN113870848B (en) * 2021-12-02 2022-04-26 深圳市友杰智新科技有限公司 Method and device for constructing voice modeling unit and computer equipment
CN115497484A (en) * 2022-11-21 2022-12-20 深圳市友杰智新科技有限公司 Voice decoding result processing method, device, equipment and storage medium
CN115831100A (en) * 2023-02-22 2023-03-21 深圳市友杰智新科技有限公司 Voice command word recognition method, device, equipment and storage medium
CN116168687A (en) * 2023-04-24 2023-05-26 北京探境科技有限公司 Voice data processing method and device, computer equipment and storage medium
CN116168703A (en) * 2023-04-24 2023-05-26 北京探境科技有限公司 Voice recognition method, device, system, computer equipment and storage medium
CN117690434A (en) * 2024-02-04 2024-03-12 深圳市友杰智新科技有限公司 Speech decoding recognition method, device and equipment for multi-command words and storage medium

Similar Documents

Publication Publication Date Title
CN113506574A (en) Method and device for recognizing user-defined command words and computer equipment
CN108711422B (en) Speech recognition method, speech recognition device, computer-readable storage medium and computer equipment
CN109063221B (en) Query intention identification method and device based on mixed strategy
CN108287858B (en) Semantic extraction method and device for natural language
WO2021042503A1 (en) Information classification extraction method, apparatus, computer device and storage medium
CN108182937B (en) Keyword recognition method, device, equipment and storage medium
CN112102815B (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN110444198B (en) Retrieval method, retrieval device, computer equipment and storage medium
CN110910903B (en) Speech emotion recognition method, device, equipment and computer readable storage medium
CN114245203B (en) Video editing method, device, equipment and medium based on script
CN111223476B (en) Method and device for extracting voice feature vector, computer equipment and storage medium
CN112925945A (en) Conference summary generation method, device, equipment and storage medium
CN110930993A (en) Specific field language model generation method and voice data labeling system
CN111191032A (en) Corpus expansion method and device, computer equipment and storage medium
CN112052324A (en) Intelligent question answering method and device and computer equipment
CN114783438B (en) Adaptive decoding method, apparatus, computer device and storage medium
CN112562640B (en) Multilingual speech recognition method, device, system, and computer-readable storage medium
CN114996463B (en) Intelligent classification method and device for cases
CN113870844A (en) Training method and device of speech recognition model and computer equipment
CN113192516A (en) Voice role segmentation method and device, computer equipment and storage medium
CN113254613A (en) Dialogue question-answering method, device, equipment and storage medium
CN112818680A (en) Corpus processing method and device, electronic equipment and computer-readable storage medium
CN113569021B (en) Method for classifying users, computer device and readable storage medium
CN110503943B (en) Voice interaction method and voice interaction system
CN113223504A (en) Acoustic model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211015