CN116386613B - Model training method for enhancing command word voice - Google Patents

Model training method for enhancing command word voice

Info

Publication number
CN116386613B
Authority
CN
China
Prior art keywords
word
corpus
audio
command
command word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310650948.0A
Other languages
Chinese (zh)
Other versions
CN116386613A (en)
Inventor
温登峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chipintelli Technology Co Ltd
Original Assignee
Chipintelli Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chipintelli Technology Co Ltd
Priority to CN202310650948.0A
Publication of CN116386613A
Application granted
Publication of CN116386613B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

S1: perform initial training to obtain an original speech recognition model MD1. S2: acquire the command word entries C1 used by a client project, select corresponding audio for screening, and expand the audio of the command words that need expansion. S3: remove the audio with decoding errors to obtain a corrected command word corpus B4. S4: record the actual noise of the customer's product environment, and apply noise addition and reverberation to the corrected command word corpus B4 to obtain a command word expansion corpus B5. S5: select an initial chip-side model MD2 trained on a corpus, and fine-tune it with the command word expansion corpus B5 obtained in step S4 to obtain a final model MD3. The invention uses an existing corpus to generate the missing command word corpus; it improves the model's recognition rate in real scenarios without significantly increasing labor cost and meets customers' application requirements.

Description

Model training method for enhancing command word voice
Technical Field
The invention belongs to the technical field of voice signal processing, and particularly relates to a model training method for enhancing command word voice.
Background
In recent years, with the continued development of artificial intelligence, voice devices of all kinds have come into ever more frequent use. However, because computing power at the chip side is limited, practical applications of voice chips are mainly based on command words. The conventional end-side command word recognition process trains a continuous speech recognition model on a large corpus and then uses that model to recognize the command words of a given product. Command word recognition in quiet environments is essentially a solved problem; recognition in noisy and far-field environments, however, remains a major challenge.
Developers therefore usually record a dedicated corpus for the command words specified by the client and use it to fine-tune the initial model. Recording such a corpus, however, consumes considerable manpower and material resources and adds cost. Another option is to synthesize the command word corpus with a speech synthesis model, but this raises a new problem: a corresponding text-to-speech (TTS) model must be trained, and the training benefit of a synthetic corpus is limited.
Disclosure of Invention
To overcome the deficiencies of the prior art, the present invention discloses a model training method for command word speech enhancement.
The invention relates to a model training method for enhancing command word voice, which comprises the following steps:
S1: perform initial training to obtain an original speech recognition model MD1, and establish a quiet corpus;
S2: acquire the command word entries C1 used by a client project, the entries comprising at least one command word;
S21: select corresponding audio from the existing quiet corpus A1 according to the command word entries C1, count the number of audio entries in A1 for each command word, set a first screening threshold to screen all command words in C1, and screen out the command words to be expanded, namely those whose audio counts are below the first screening threshold;
perform audio expansion on the command words to be expanded; the expansion method comprises the following steps:
S22: segment the command words: the first segmentation decomposes a command word into one or more words, and the second segmentation, performed on the result of the first, divides each word into one or more single characters;
S23: set a second screening threshold and, according to the first segmentation result, screen the quiet corpus A1 for audio entries containing each word; if the number of audio entries for a word is below the second screening threshold, screen A1 again according to the second segmentation result, keeping the single characters whose audio counts exceed the second screening threshold;
if neither screening pass yields results, lower the second screening threshold and repeat step S23;
step S23 thus yields, by screening, original audio containing the results of both segmentations;
S24: align the audio obtained in step S23 using the original speech recognition model MD1 to obtain the time labels of each word or single character within the audio, and cut out the audio containing only the corresponding word or single character according to those time labels; the cut-out audio forms the segmentation sub-corpus B1;
S25: repeat steps S22-S24 for each command word, and merge all segmentation sub-corpora B1 into the segmentation corpus B2;
S26: randomly select audio from the segmentation corpus B2 and combine it into the whole-word command corpus B3;
S3: decode the synthesized whole-word command corpus B3 with the original speech recognition model MD1, and remove the audio with decoding errors to obtain the corrected command word corpus B4;
S4: record the actual noise of the customer's product environment, and apply noise addition and reverberation to the corrected command word corpus B4 to obtain the command word expansion corpus B5;
S5: select an initial chip-side model MD2 trained on a corpus, and fine-tune it with the command word expansion corpus B5 obtained in step S4 to obtain the final model MD3.
Preferably, in step S1, the original speech recognition model MD1 is trained using a CTC/RNNT training method.
Preferably, in step S4, part of the audio is selected from the corrected command word corpus B4, played in the corresponding noise environment, and picked up; the picked-up audio forms the supplementary noise corpus B6;
in step S5, the command word expansion corpus B5 and the supplementary noise corpus B6 are used together to fine-tune MD2 and obtain the final model MD3.
The invention uses an existing corpus to generate the missing command word corpus; it improves the model's recognition rate in real scenarios without significantly increasing labor cost and meets customers' application requirements.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the present invention.
Description of the embodiments
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings; the description is illustrative, not limiting.
The invention relates to a model training method for enhancing command word voice, which comprises the following steps:
S1: initial training is carried out on a large corpus of more than 10,000 hours to obtain an original speech recognition model MD1. The model is trained with a character-level CTC/RNNT method (Connectionist Temporal Classification and Recurrent Neural Network Transducer); for Chinese, the modeling unit is the Chinese character;
a quiet corpus is also established, containing a large number of utterances recorded in a quiet environment; the audio covers single characters, short sentences, long sentences, and whole articles, and the speakers include people of different genders, ages, and pronunciation habits;
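As a rough illustration of the character-level objective named in S1, the following is a minimal sketch of CTC training in PyTorch; the LSTM encoder, feature dimension, and vocabulary size are illustrative assumptions rather than the patent's architecture (the embodiment below actually trains a Conformer in k2).

```python
# Minimal sketch of character-level CTC training for MD1 (assumptions: PyTorch;
# the encoder, 80-dim filterbank features, and vocabulary size are placeholders).
import torch
import torch.nn as nn

VOCAB = 5000  # assumed: Chinese characters plus the CTC blank (id 0)
encoder = nn.LSTM(input_size=80, hidden_size=512, num_layers=4, batch_first=True)
proj = nn.Linear(512, VOCAB)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def ctc_step(feats, feat_lens, targets, target_lens):
    """feats: (B, T, 80) features; targets: 1-D tensor of concatenated char ids."""
    hidden, _ = encoder(feats)                    # (B, T, 512)
    log_probs = proj(hidden).log_softmax(dim=-1)  # (B, T, VOCAB)
    # nn.CTCLoss expects time-major input: (T, B, VOCAB)
    return ctc(log_probs.transpose(0, 1), targets, feat_lens, target_lens)
```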
s2: obtaining command word entries C1 used by client projects, wherein the command word entries comprise at least one command word, taking a tea bar machine project as an example, the command word entries C1 comprise command words such as tea mode, black tea mode, green tea mode, tea boiling closing, tea boiling stopping and the like,
s21, firstly selecting corresponding audio in the existing quiet corpus A1 according to command word entries C1, counting the number of audio entries of each command word in the command word entries C1 in the quiet corpus A1, setting a first screening threshold value to screen command words in the command word entries C1, screening command words to be expanded, and independently listing audio with the command words less than the screening threshold value, such as less than 1000 entries, as the command words to be expanded C2, wherein the number of audio entries in the current nectar mode is assumed to be 0, so that the number of audio of each command word in the command words to be expanded is required to be expanded. The specific expansion method is as follows:
s22, word segmentation is carried out on the command words, wherein the first word segmentation is to decompose the command words into more than one word, the second word segmentation is carried out on the basis of the first word segmentation, and each word is divided into more than one single word;
s23, setting a second screening threshold, screening audio entries comprising each word from the quiet corpus A1 according to the first word segmentation result, and screening again in the quiet corpus A1 according to the second word segmentation result if the number of the audio entries of the single word is less than the second screening threshold; screening out single words with the number of the audio frequency greater than a second screening threshold value;
if the results cannot be screened out by the two screening steps, the second screening threshold is set higher, and the second screening threshold is lowered; repeating step S23;
the step S23 of obtaining the original audio containing the two word segmentation results through screening;
s24, aligning the audio obtained in the step S23 by using an original voice recognition model MD1 model to obtain a corresponding time tag of the word or the single word in the audio, and cutting out the audio only containing the corresponding word or the single word according to the corresponding time tag, wherein the cut-out audio is used as a word segmentation sub-corpus B1;
s25, repeating the steps S21-S24 for each command word, and combining all word segmentation sub-corpora B1 to obtain a word segmentation corpus B2;
s26, randomly screening audio from the word segmentation corpus B2 to combine to obtain a command word whole word corpus B3;
for example, one embodiment is shown in FIG. 1:
(1) the command word "fruit tea mode" is segmented: the first segmentation yields the words "fruit tea" and "mode", and the second segmentation, which continues segmenting the first result, yields the single characters of "fruit", "tea", and "mode";
(2) the second screening threshold is set to 50; audio entries corresponding to "fruit tea" and "mode" are screened from corpus A1 according to the first segmentation, and if the number of audio entries for an entry is below 50, the corpus is screened again according to the second segmentation; for example, if the audio for "fruit tea" still numbers fewer than 50 entries, the corpus is screened using the single characters "fruit" and "tea";
(3) the original audio containing both segmentation results, obtained by the screening of steps (1) and (2), is collected;
(4) the audio obtained in step (3) is aligned with the original speech recognition model MD1 to obtain the time labels of words or single characters such as "fruit tea" and "mode" within the audio, and the audio containing only the corresponding word or single character is cut out according to those time labels; the cut-out audio forms the segmentation sub-corpus B1;
for example, the audio of "fruit" and "mode" is cut out of the audio of "fruit candy", "group purchase mode", and so on;
(5) steps (1)-(4) are repeated for each command word, and all segmentation sub-corpora B1 are merged into the segmentation corpus B2;
(6) audio is randomly selected from the segmentation corpus B2 and combined into the whole-word command corpus B3;
for example, the segmentation corpus B2 contains many audio clips of "fruit tea", "fruit", "tea", and "mode"; some of them are selected, or randomly sampled, and randomly spliced into the complete command word "fruit tea mode"; the result is used as the whole-word command corpus B3;
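As a minimal sketch of steps (3)-(6), assuming MD1 alignment has already produced start/end offsets in seconds for each word or character, the cutting and splicing could look as follows (Python with the soundfile package; all helper names are illustrative, not the patent's tooling):

```python
# Minimal sketch: cut aligned clips out of screened audio (S24) and splice
# them into synthetic whole command words (S26). Assumes mono audio and that
# alignment offsets come from MD1; names and the 30 ms gap are illustrative.
import random
import numpy as np
import soundfile as sf

def cut_segment(wav_path, start_s, end_s, out_path):
    """Keep only the audio spanned by a unit's time labels."""
    audio, sr = sf.read(wav_path)
    sf.write(out_path, audio[int(start_s * sr):int(end_s * sr)], sr)

def splice_command(unit_clip_lists, sr=16000, gap_ms=30):
    """Pick one random clip per unit (e.g. 'fruit tea', then 'mode') and
    concatenate them, separated by short silences, into one utterance."""
    gap = np.zeros(int(sr * gap_ms / 1000), dtype=np.float32)
    parts = []
    for clips in unit_clip_lists:
        audio, _ = sf.read(random.choice(clips))
        parts.extend([audio.astype(np.float32), gap])
    return np.concatenate(parts[:-1])  # drop the trailing gap
```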
the step of screening and word segmentation is adopted twice, because the pronunciation habit of a person carries out word segmentation on a whole word, then pronouncing is carried out, the pronunciations of the words are more consistent, the corpus obtained by the two word segmentation not only maintains the complete word pronunciation, but also is replaced by single word audio when the condition of insufficient complete word pronunciation audio is not satisfied, thus being beneficial to obtaining the optimal screening result and being beneficial to the segmentation of the corpus;
s3: the raw speech recognition model MD1 is used to decode the synthesized command word whole word library B3,
the decoding is to identify the synthesized audio, if the identified text is different from the corresponding command word text, the decoding is considered to be wrong, and the audio needs to be deleted to make the corpus put into training more effectively. The reason for the decoding error is usually that the original speech recognition model MD1 has systematic small probability decoding error or that the synthesized audio is distorted due to the audio segmentation inaccuracy;
removing the audio with decoding errors to obtain a corrected command word corpus B4;
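A minimal sketch of this rejection step, assuming a decode function that wraps MD1 inference and returns the recognized text (a placeholder, not an actual API):

```python
# Minimal sketch of step S3: keep a synthetic clip only if MD1 decodes it
# back to the intended command word text.
def reject_decode_errors(b3_entries, decode):
    """b3_entries: iterable of (wav_path, command_text). Returns corpus B4."""
    b4 = []
    for wav_path, text in b3_entries:
        if decode(wav_path) == text:  # recognized text matches the label
            b4.append((wav_path, text))
    return b4
```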
s4, recording the actual noise of the customer home products, and carrying out noise adding and reverberation processing on the corrected command word corpus B4 to obtain a command word expansion corpus B5;
a small amount of audio can be selected from the correction command word corpus B4 to be played in the corresponding noise environment so as to simulate the real environment, and a voice chip is used for picking up the voice to obtain a supplementary noise corpus B6; the corresponding noise environment is the noise environment where the audio corresponding command word is always located, for example, the equipment corresponding to the audio of 'I want to receive water' is a tea bar machine, and the corresponding noise environment is the noise generated during water heating and water receiving;
the supplementary noise corpus B6 is noise audio of real environment sound collection, and has better effect compared with the command word expansion corpus B5 generated by simulated noise addition;
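A minimal sketch of the simulated augmentation that produces B5, assuming numpy/scipy and an externally recorded room impulse response (RIR); the SNR handling and normalization are illustrative choices:

```python
# Minimal sketch of step S4: mix recorded product noise into a clean clip at a
# target SNR, then add reverberation by convolving with a room impulse response.
import numpy as np
from scipy.signal import fftconvolve

def augment(clean, noise, rir, snr_db=10.0):
    if len(noise) < len(clean):                   # loop noise to cover the clip
        noise = np.tile(noise, len(clean) // len(noise) + 1)
    noise = noise[:len(clean)]
    clean_pow = np.mean(clean ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_pow / (noise_pow * 10 ** (snr_db / 10)))
    noisy = clean + scale * noise                 # noise at the requested SNR
    reverbed = fftconvolve(noisy, rir)[:len(noisy)]
    peak = np.max(np.abs(reverbed)) + 1e-12
    return reverbed / max(1.0, peak)              # normalize only if clipping
```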
s5, selecting an initial model MD2 used for corpus training chip end, wherein the MD2 model is a basic model which can be operated on a chip and is constructed for chip end training, and training the MD2 by using the command word expansion corpus B5 and the supplementary noise corpus B6 obtained in the step S4 to obtain a final model MD3.
The invention uses an existing corpus to generate the missing command word corpus; it improves the model's recognition rate in real scenarios without significantly increasing labor cost and meets customers' application requirements.
The invention improves the recognition rate by expanding the command word material. For model training, the two-pass screening and segmentation follows human pronunciation habits: a whole phrase is segmented into words before being pronounced, and pronunciation is most consistent within words, which benefits corpus segmentation. Cutting out single characters is comparatively harder, because a single character rarely occurs on its own in a stretch of speech.
This embodiment is implemented with the open-source speech recognition tools k2 and Kaldi. First, an end-to-end Conformer model is trained on the k2 platform with tens of thousands of hours of basic corpus to serve as the original speech recognition model MD1, which is used for aligning and decoding the subsequent corpora. Conformer is a speech recognition model proposed by Google in 2020.
About 100 command words of a tea bar machine are screened and expanded according to step S2 of the training method to obtain the whole-word command corpus B3, and the rejection of step S3 is completed.
In step S4, the noise produced while the tea bar machine is running is recorded and used for noise addition, yielding the command word expansion corpus B5 and the supplementary noise corpus B6, which improve recognition in the noisy environments of actual use.
In the Kaldi environment, an initial chip-side model MD2 with an f-TDNN (Factorized Time-Delay Neural Network) structure is trained on thousands of hours of basic corpus; the command word expansion corpus B5 and the supplementary noise corpus B6 obtained above are then used to fine-tune MD2, giving the final model MD3 adapted to the specific command words.
In this embodiment, based on the tea bar machine test items, each test set contains 220 audio samples; the PC-side test results of each model are shown in Table 1:
Model                    MD2     MD3     MD4
Parameter count          850K    850K    850K
Quiet                    98%     99%     100%
Music                    79%     87%     93%
Water-dispensing noise   90%     95%     96%
Water-heating noise      90%     93%     97%
Table 1: Tea bar machine test results
In Table 1, MD2 is the model trained on the basic corpus only; MD3 is the model trained after the expanded command word corpus was generated with the method of this patent; and MD4 is the model trained with command word audio recorded in high fidelity (i.e., genuinely recorded command word speech). The test results show that the expanded command word corpus improves the model's recognition under the actual use case.
In the foregoing description of preferred embodiments, absent obvious contradiction or reliance on a particular preferred embodiment, the preferred embodiments may be combined in any overlapping manner. The embodiments and the specific parameters therein are only intended to clearly describe the inventor's verification process, not to limit the scope of the invention, which remains defined by the claims; all equivalent structural changes made using the contents of this specification and drawings fall within the scope of the invention.

Claims (3)

1. A model training method for command word speech enhancement, comprising the following steps:
S1: perform initial training to obtain an original speech recognition model MD1, and establish a quiet corpus;
S2: acquire the command word entries C1 used by a client project, the entries comprising at least one command word;
S21: select corresponding audio from the existing quiet corpus A1 according to the command word entries C1, count the number of audio entries in A1 for each command word, set a first screening threshold to screen all command words in C1, and screen out the command words to be expanded, namely those whose audio counts are below the first screening threshold;
perform audio expansion on the command words to be expanded; the expansion method comprises the following steps:
S22: segment the command words: the first segmentation decomposes a command word into one or more words, and the second segmentation, performed on the result of the first, divides each word into one or more single characters;
S23: set a second screening threshold and, according to the first segmentation result, screen the quiet corpus A1 for audio entries containing each word; if the number of audio entries for a word is below the second screening threshold, screen A1 again according to the second segmentation result, keeping the single characters whose audio counts exceed the second screening threshold;
if neither screening pass yields results, lower the second screening threshold and repeat step S23;
step S23 thus yields, by screening, original audio containing the results of both segmentations;
S24: align the audio obtained in step S23 using the original speech recognition model MD1 to obtain the time labels of each word or single character within the audio, and cut out the audio containing only the corresponding word or single character according to those time labels; the cut-out audio forms the segmentation sub-corpus B1;
S25: repeat steps S22-S24 for each command word, and merge all segmentation sub-corpora B1 into the segmentation corpus B2;
S26: randomly select audio from the segmentation corpus B2 and combine it into the whole-word command corpus B3;
S3: decode the synthesized whole-word command corpus B3 with the original speech recognition model MD1, and remove the audio with decoding errors to obtain the corrected command word corpus B4;
S4: record the actual noise of the customer's product environment, and apply noise addition and reverberation to the corrected command word corpus B4 to obtain the command word expansion corpus B5;
S5: select an initial chip-side model MD2 trained on a corpus, and fine-tune it with the command word expansion corpus B5 obtained in step S4 to obtain the final model MD3.
2. The model training method according to claim 1, wherein in step S1 the original speech recognition model MD1 is trained using a CTC/RNNT training method.
3. The model training method according to claim 1, wherein:
in step S4, part of the audio is selected from the corrected command word corpus B4, played in the corresponding noise environment, and picked up; the picked-up audio forms the supplementary noise corpus B6;
in step S5, the command word expansion corpus B5 and the supplementary noise corpus B6 are used together to fine-tune MD2 and obtain the final model MD3.
CN202310650948.0A 2023-06-05 2023-06-05 Model training method for enhancing command word voice Active CN116386613B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310650948.0A CN116386613B (en) 2023-06-05 2023-06-05 Model training method for enhancing command word voice


Publications (2)

Publication Number Publication Date
CN116386613A CN116386613A (en) 2023-07-04
CN116386613B (en) 2023-07-25

Family

ID=86973587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310650948.0A Active CN116386613B (en) 2023-06-05 2023-06-05 Model training method for enhancing command word voice

Country Status (1)

Country Link
CN (1) CN116386613B (en)


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160225372A1 (en) * 2015-02-03 2016-08-04 Samsung Electronics Company, Ltd. Smart home connected device contextual learning using audio commands
US11011157B2 (en) * 2018-11-13 2021-05-18 Adobe Inc. Active learning for large-scale semi-supervised creation of speech recognition training corpora based on number of transcription mistakes and number of word occurrences
US11645460B2 (en) * 2020-12-28 2023-05-09 Genesys Telecommunications Laboratories, Inc. Punctuation and capitalization of speech recognition transcripts

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595696A (en) * 2018-05-09 2018-09-28 长沙学院 A kind of human-computer interaction intelligent answering method and system based on cloud platform
EP3617930A1 * 2018-08-28 2020-03-04 Accenture Global Solutions Limited Training data augmentation for conversational AI bots
CN112530417A (en) * 2019-08-29 2021-03-19 北京猎户星空科技有限公司 Voice signal processing method and device, electronic equipment and storage medium
CN110853625A (en) * 2019-09-18 2020-02-28 厦门快商通科技股份有限公司 Speech recognition model word segmentation training method and system, mobile terminal and storage medium
CN112151021A (en) * 2020-09-27 2020-12-29 北京达佳互联信息技术有限公司 Language model training method, speech recognition device and electronic equipment
CN112151080A (en) * 2020-10-28 2020-12-29 成都启英泰伦科技有限公司 Method for recording and processing training corpus
CN114692634A (en) * 2022-01-27 2022-07-01 清华大学 Chinese named entity recognition and classification method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on Keyword Extraction of Word2vec Model in Chinese Corpus; Chenchen Zhang; 2018 IEEE/ACIS 17th International Conference on Computer and Information Science (ICIS); full text *
Research on an Embedded Continuous Speech Recognition System for Small and Medium Vocabularies; Lin Weimin; China Master's Theses Full-text Database; full text *

Also Published As

Publication number Publication date
CN116386613A (en) 2023-07-04

Similar Documents

Publication Publication Date Title
Vasquez et al. Melnet: A generative model for audio in the frequency domain
Chandna et al. Wgansing: A multi-voice singing voice synthesizer based on the wasserstein-gan
CN113470662A (en) Generating and using text-to-speech data for keyword spotting systems and speaker adaptation in speech recognition systems
CN108922518A (en) voice data amplification method and system
Song et al. Noise invariant frame selection: a simple method to address the background noise problem for text-independent speaker verification
CN105023580B (en) Unsupervised noise estimation based on separable depth automatic coding and sound enhancement method
CN110782872A (en) Language identification method and device based on deep convolutional recurrent neural network
CN1835075B (en) Speech synthetizing method combined natural sample selection and acaustic parameter to build mould
Yağlı et al. Artificial bandwidth extension of spectral envelope along a Viterbi path
CN111508470B (en) Training method and device for speech synthesis model
Ai et al. SampleRNN-based neural vocoder for statistical parametric speech synthesis
CN110246489A (en) Audio recognition method and system for children
Ronanki et al. A Template-Based Approach for Speech Synthesis Intonation Generation Using LSTMs.
CN114267372A (en) Voice noise reduction method, system, electronic device and storage medium
Du et al. Noise-robust voice conversion with domain adversarial training
Lee et al. A new voice transformation method based on both linear and nonlinear prediction analysis
Koizumi et al. Miipher: A robust speech restoration model integrating self-supervised speech and text representations
CN113436607B (en) Quick voice cloning method
Zhang et al. AccentSpeech: Learning accent from crowd-sourced data for target speaker TTS with accents
Ma et al. Two-stage training method for Japanese electrolaryngeal speech enhancement based on sequence-to-sequence voice conversion
CN116386613B (en) Model training method for enhancing command word voice
Kumar et al. Towards building text-to-speech systems for the next billion users
CN116092475B (en) Stuttering voice editing method and system based on context-aware diffusion model
Du et al. Effective wavenet adaptation for voice conversion with limited data
CN114005428A (en) Speech synthesis method, apparatus, electronic device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant