CN111933116B - Speech recognition model training method, system, mobile terminal and storage medium - Google Patents
- Publication number: CN111933116B
- Application number: CN202010573045.3A
- Authority
- CN
- China
- Prior art keywords
- corpus
- text
- amplified
- audio
- homophone
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
Abstract
The invention provides a speech recognition model training method, system, mobile terminal and storage medium. The method comprises the following steps: obtaining a sample corpus and the corpus text corresponding to the sample corpus, and performing corpus amplification on both to obtain an amplified corpus and an amplified text; training a language model in the speech recognition model according to the amplified corpus and the amplified text, and performing sentence alignment on the amplified corpus according to specified phonemes to obtain sentence alignment positions; acquiring the formant starting position of the audio corresponding to each sentence alignment position, and deleting the audio data at the formant starting position from the amplified corpus; and performing feature extraction on the amplified corpus after the deletion to obtain acoustic features, and training an acoustic model in the speech recognition model according to the acoustic features. By amplifying the sample corpus and the corpus text, the invention increases the amount of training data and thereby improves the training effect of the speech recognition model.
Description
Technical Field
The invention belongs to the technical field of speech recognition, and particularly relates to a speech recognition model training method, system, mobile terminal and storage medium.
Background
Speech recognition has been studied for decades. Speech recognition technology mainly comprises four parts: acoustic model modeling, language model modeling, pronunciation dictionary construction and decoding, each of which can be an independent research direction. Compared with images and text, speech data is much more difficult to collect and label, so building a complete speech recognition model training system is time-consuming and difficult work, which has greatly hindered the development of speech recognition technology.
In the existing speech recognition model training process, a language model and an acoustic model are trained on the input sample corpus and corpus text, and the size of the sample corpus and corpus text affects the training effect of the speech recognition model. For low-resource languages, however, little sample corpus and corpus text data is available, so the recognition performance of the trained speech recognition model is poor.
Disclosure of Invention
The embodiment of the invention aims to provide a speech recognition model training method, system, mobile terminal and storage medium, so as to solve the problem that the training effect of the speech recognition model is poor because only a small amount of sample corpus and corpus text data is available in existing speech recognition model training.
The embodiment of the invention is realized in such a way that a speech recognition model training method comprises the following steps:
obtaining a sample corpus and a corpus text corresponding to the sample corpus, and performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text;
training a language model in a speech recognition model according to the corpus text, and performing sentence alignment on the amplified corpus according to a specified phoneme to obtain a sentence alignment position;
acquiring the formant starting position of the audio corresponding to the sentence alignment position, and deleting the audio data at the formant starting position from the amplified corpus;
and performing feature extraction on the amplified corpus after the data deletion to obtain acoustic features, and training an acoustic model in the speech recognition model according to the acoustic features.
Further, the step of performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text includes:
extracting the single-character pronunciations in the sample corpus, and extracting the single-character texts in the corpus text;
mapping the homophone audio in the single-character pronunciations to specific-character audio to obtain the amplified corpus;
and mapping the homophone texts in the single-character texts into specific-character texts according to the amplified corpus to obtain the amplified text, and establishing data correspondence between the amplified corpus and the amplified text.
Furthermore, the step of mapping the homophone audio in the single-character pronunciations to specific-character audio to obtain the amplified corpus includes:
performing pronunciation matching between the homophone audio in a preset homophone list and the pronunciation audio in the single-character pronunciations;
if a pronunciation audio matches any homophone audio in the preset homophone list, setting that pronunciation audio as the homophone audio;
and acquiring the pronunciation number of the matched homophone audio, and mapping and marking the homophone audio according to the pronunciation number to obtain the amplified corpus.
Furthermore, the step of mapping the homophone texts in the single-character texts into specific-character texts according to the amplified corpus to obtain the amplified text includes:
querying the texts corresponding to the homophone audio in the single-character texts, and setting the queried texts as the homophone texts;
and querying the specific-character text according to the pronunciation number, and replacing the homophone text corresponding to the pronunciation number with the specific-character text to obtain the amplified text.
Further, the step of performing sentence alignment on the amplified corpus according to the specified phoneme to obtain a sentence alignment position includes:
performing phoneme recognition on each corpus in the amplified corpus according to the specified phoneme;
and acquiring the starting position and the ending position of the specified phoneme in the corresponding corpus according to the phoneme recognition result, so as to obtain the sentence alignment position.
Furthermore, the method for acquiring the formant starting position of the audio corresponding to the sentence alignment position includes a spectral envelope extraction method, a cepstrum method, an LPC method or a root-finding method.
Another object of an embodiment of the present invention is to provide a speech recognition model training system, including:
the corpus amplification module is used for acquiring a sample corpus and a corpus text corresponding to the sample corpus, and performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text;
the language model training module is used for training a language model in the speech recognition model according to the corpus text, and performing sentence alignment on the amplified corpus according to the specified phonemes to obtain a sentence alignment position;
a formant obtaining module, configured to obtain the formant starting position of the audio corresponding to the sentence alignment position, and delete the audio data at the formant starting position from the amplified corpus;
and the acoustic model training module is used for performing feature extraction on the amplified corpus after the data deletion to obtain acoustic features, and training an acoustic model in the speech recognition model according to the acoustic features.
Still further, the corpus amplification module is further configured to:
extracting single-word pronunciations in the sample corpus, and extracting single-word texts in the corpus texts;
mapping homophone audio in the single character pronunciation to specific character audio to obtain the amplified corpus;
and mapping the homophone texts in the single-character texts into specific-character texts according to the amplified corpus to obtain the amplified text, and establishing data correspondence between the amplified corpus and the amplified text.
Another object of an embodiment of the present invention is to provide a mobile terminal, which includes a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to make the mobile terminal execute the above-mentioned speech recognition model training method.
Another object of an embodiment of the present invention is to provide a storage medium, which stores a computer program used in the mobile terminal, wherein the computer program, when executed by a processor, implements the steps of the speech recognition model training method.
According to the embodiment of the invention, corpus amplification of the sample corpus and the corpus text effectively increases the amount of training data, which improves the training effect of the speech recognition model and allows a good model to be trained from less data. In addition, deleting the audio data at the formant starting position from the amplified corpus effectively avoids the influence of transition features between different characters on model training, further improving the training effect of the speech recognition model.
Drawings
FIG. 1 is a flowchart of a speech recognition model training method according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a speech recognition model training method according to a second embodiment of the present invention;
FIG. 3 is a flowchart of a speech recognition model training method according to a third embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a speech recognition model training system according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of a mobile terminal according to a fifth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In order to illustrate the technical means of the present invention, the following description is given by way of specific examples.
Example one
Referring to fig. 1, a flowchart of a speech recognition model training method according to a first embodiment of the present invention includes the steps of:
step S10, obtaining a sample corpus and a corpus text corresponding to the sample corpus, and performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text;
the sample corpus is a language to be recognized by the speech recognition model, such as a cantonese language or a Minnan language, an expression mode of Mandarin is adopted in the corpus text, and the sample corpus and the corpus text are stored in a one-to-one corresponding relation;
furthermore, the sample corpus covers all vowels, consonants and mixed tones. In this step, corpus amplification of the sample corpus and the corpus text is performed by homophone mapping: the homophone audio in the sample corpus is mapped to specific-character audio to amplify the sample corpus, and the homophone texts in the corpus text are mapped to specific-character texts to amplify the corpus text;
in this step, performing corpus amplification on the sample corpus and the corpus text to obtain the amplified corpus and the amplified text effectively increases the training data of the speech recognition model, preventing the poor training effect that results when a low-resource language has little training data, so the trained speech recognition model has high accuracy.
Step S20, training a language model in a speech recognition model according to the corpus text, and performing sentence alignment on the amplified corpus according to the specified phonemes to obtain a sentence alignment position;
the specified phoneme may be set as required; for example, it may be any consonant. Sentence alignment is performed on the amplified corpus according to the specified phoneme to obtain the starting position and the ending position of the specified phoneme in the amplified corpus, and the range between the starting position and the ending position is set as the sentence alignment position;
specifically, the step of training a language model in the speech recognition model according to the corpus text includes: preprocessing the corpus text, and performing word segmentation on the preprocessed corpus text to obtain a segmented text; performing vocabulary statistics on the segmented text, and removing low-frequency words according to the statistics; and constructing a dictionary from the segmented text, counting the 3-gram frequencies in the segmented text, and training the language model according to the dictionary and the 3-gram frequencies;
the preprocessing removes punctuation marks from the corpus text, converts English to lower case and normalizes numbers. The vocabulary statistics compute the frequency of each word in the segmented text, and any word whose frequency is below a frequency threshold is marked as a low-frequency word.
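The language-model preparation described above (preprocessing, word segmentation, low-frequency word removal, 3-gram counting) can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: the whitespace tokenizer, the `<num>` placeholder for normalized digits and the `min_count` threshold are all assumed choices.

```python
import re
from collections import Counter

def preprocess(line):
    """Lowercase, map digit runs to a <num> placeholder, strip punctuation."""
    line = line.lower()
    line = re.sub(r"\d+", "<num>", line)
    return re.sub(r"[^\w<>\s]", " ", line)

def train_trigram_counts(lines, min_count=2):
    """Return ({(w1, w2, w3): count}, vocabulary), with words whose
    frequency falls below min_count removed as low-frequency words."""
    tokens_per_line = [preprocess(l).split() for l in lines]
    unigrams = Counter(t for toks in tokens_per_line for t in toks)
    vocab = {w for w, c in unigrams.items() if c >= min_count}
    trigrams = Counter()
    for toks in tokens_per_line:
        # Low-frequency words are replaced rather than silently dropped.
        toks = [t if t in vocab else "<unk>" for t in toks]
        toks = ["<s>", "<s>"] + toks + ["</s>"]
        for i in range(len(toks) - 2):
            trigrams[tuple(toks[i : i + 3])] += 1
    return trigrams, vocab
```

In practice these raw counts would be fed into a smoothed n-gram toolkit rather than used directly.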
Step S30, acquiring the formant starting position of the audio corresponding to the sentence alignment position, and deleting the audio data at the formant starting position from the amplified corpus;
formants are regions of relatively concentrated energy in the frequency spectrum of a sound; they not only determine the sound quality but also reflect the physical characteristics of the vocal tract (resonant cavity). The formant starting position of the audio corresponding to the sentence alignment position may be acquired by a spectral envelope extraction method, a cepstrum method, an LPC method or a root-finding method;
specifically, in this step, the audio corresponding to the sentence alignment position is cepstrally separated by a cepstral filter, an inverse Fourier transform is applied to the separated cepstrum, and the formant starting position is obtained from the result of the inverse Fourier transform.
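A minimal sketch of the cepstrum method, assuming a rectangular low-quefrency lifter and a simple first-peak rule; the frame length, lifter cutoff and peak-picking heuristic are illustrative choices, not values taken from the patent.

```python
import numpy as np

def spectral_envelope(frame, lifter_cutoff=30):
    """Cepstral smoothing: FFT -> log magnitude -> cepstrum -> keep the
    low-quefrency part -> back to a smooth log-spectral envelope."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-10)
    cepstrum = np.fft.irfft(log_mag)
    lifter = np.zeros_like(cepstrum)
    lifter[:lifter_cutoff] = 1.0
    lifter[-lifter_cutoff + 1:] = 1.0  # keep the symmetric low-quefrency bins
    return np.fft.rfft(cepstrum * lifter).real

def first_formant_bin(frame, lifter_cutoff=30):
    """Index of the first peak of the smoothed envelope -- a stand-in for
    the 'formant starting position' the embodiment describes."""
    env = spectral_envelope(frame, lifter_cutoff)
    for i in range(1, len(env) - 1):
        if env[i] > env[i - 1] and env[i] >= env[i + 1]:
            return i
    return int(np.argmax(env))
```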
Step S40, performing feature extraction on the amplified corpus after the data deletion to obtain acoustic features, and training an acoustic model in the speech recognition model according to the acoustic features;
specifically, MFCC features and i-vector features are extracted from the amplified corpus, and the MFCC features and the i-vector features are combined to obtain the acoustic features;
specifically, in this step, monophone training is performed on the acoustic model according to the acoustic features, difference processing is applied to the acoustic features to obtain difference features, triphone training is performed according to the difference features to obtain a triphone model, the phonemes are aligned according to the triphone model, the acoustic features are transformed into feature vectors, and the acoustic model is trained according to the feature vectors.
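The feature-combination step can be sketched as follows. The 13-dimensional MFCCs, the 100-dimensional i-vector and the first-order difference scheme are assumed values for illustration; real i-vector extraction requires a trained extractor, which is out of scope here.

```python
import numpy as np

def add_deltas(feats):
    """Append first-order differences (the 'difference features' used for
    the triphone stage) to an (n_frames, dim) feature matrix."""
    delta = np.diff(feats, axis=0, prepend=feats[:1])
    return np.hstack([feats, delta])

def splice_ivector(mfcc, ivector):
    """Tile the per-utterance i-vector onto every MFCC frame, giving the
    combined per-frame acoustic feature the embodiment trains on."""
    tiled = np.tile(ivector, (mfcc.shape[0], 1))
    return np.hstack([mfcc, tiled])
```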
In this embodiment, corpus amplification of the sample corpus and the corpus text effectively increases the amount of training data, which improves the training effect of the speech recognition model and allows a good model to be trained from less data. Deleting the audio data at the formant starting position from the amplified corpus effectively avoids the influence of transition features between different characters on model training, further improving the training effect of the speech recognition model.
Example two
Please refer to fig. 2, which is a flowchart of a speech recognition model training method according to a second embodiment of the present application. The second embodiment refines step S10 of the first embodiment to describe how corpus amplification is performed on the sample corpus and the corpus text to obtain the amplified corpus and the amplified text, and includes the steps of:
s11, extracting the pronunciation of the single character in the sample corpus and extracting the text of the single character in the corpus text;
wherein a single-character pronunciation is an audio segment containing exactly one pronounced syllable, and a single-character text is a single written character; for example, a single-character pronunciation may be the syllable hao, lai or lao, and the corresponding single-character texts are the various Chinese characters that share that pronunciation;
step S12, mapping the homophone voice frequency in the single character pronunciation to a specific character voice frequency to obtain the amplified corpus;
specifically, in this step, the step of mapping the homophone audio in the single-character pronunciations to specific-character audio to obtain the amplified corpus includes:
performing pronunciation matching between the homophone audio in a preset homophone list and the pronunciation audio in the single-character pronunciations;
if a pronunciation audio matches any homophone audio in the preset homophone list, setting that pronunciation audio as the homophone audio;
and acquiring the pronunciation number of the matched homophone audio, and mapping and marking the homophone audio according to the pronunciation number to obtain the amplified corpus;
for example, the audio of every character whose pronunciation is hao is mapped to the single homophone audio corresponding to the pronunciation hao;
step S13, mapping homophone texts in the single character texts into specific character texts according to the amplification linguistic data to obtain the amplification texts, and carrying out data correspondence on the amplification linguistic data and the amplification texts;
specifically, in this step, the step of mapping the homophone texts in the single-character texts into specific-character texts according to the amplified corpus to obtain the amplified text includes:
querying the texts corresponding to the homophone audio in the single-character texts, and setting the queried texts as the homophone texts;
and querying the specific-character text according to the pronunciation number, and replacing the homophone text corresponding to the pronunciation number with the specific-character text to obtain the amplified text;
for example, all the single-character texts whose pronunciation is hao are mapped to the single homophone text corresponding to the pronunciation hao.
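The homophone mapping of Steps S12 and S13 can be sketched together as follows. The `HOMOPHONES` table, the choice of the first character as the "specific" character, and the numbering scheme are hypothetical stand-ins for the patent's preset homophone list.

```python
# Hypothetical homophone list: pronunciation -> characters sharing it,
# with the first entry taken as the "specific" (canonical) character.
HOMOPHONES = {
    "hao": ["好", "号", "毫", "豪"],
    "lai": ["来", "赖"],
}

def build_mapping(homophones):
    """char -> (canonical_char, pronunciation_number)."""
    mapping = {}
    for num, (pron, chars) in enumerate(sorted(homophones.items())):
        for ch in chars:
            mapping[ch] = (chars[0], num)
    return mapping

def augment_text(text, mapping):
    """Replace every homophone character by its canonical character, and
    record the pronunciation numbers so the original text stays
    recoverable and text/audio labels remain in one-to-one
    correspondence (-1 marks characters outside the homophone list)."""
    out, nums = [], []
    for ch in text:
        canonical, num = mapping.get(ch, (ch, -1))
        out.append(canonical)
        nums.append(num)
    return "".join(out), nums
```

Merging every character that shares a pronunciation into one label is what shrinks the label set and multiplies the examples per label, which is the amplification effect the embodiment relies on.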
In this embodiment, corpus amplification of the sample corpus and the corpus text effectively increases the amount of training data, thereby improving the training effect of the speech recognition model and allowing a good model to be trained from less data.
EXAMPLE III
Please refer to fig. 3, which is a flowchart of a speech recognition model training method according to a third embodiment of the present application. The third embodiment refines step S20 of the first embodiment to describe how sentence alignment is performed on the amplified corpus according to the specified phoneme to obtain the sentence alignment position, and includes the steps of:
step S21, performing phoneme recognition on the linguistic data in the amplified linguistic data respectively according to the specified phonemes;
the method comprises the steps of respectively carrying out phoneme recognition design on linguistic data in an amplified linguistic data according to specified phonemes to query phoneme audio frequencies corresponding to the specified phonemes in different linguistic data in the amplified linguistic data, and extracting the queried phoneme audio frequencies, wherein the specified phonemes can be any consonant;
step S22, acquiring the initial position and the end position of the designated phoneme in the corresponding corpus according to the phoneme recognition result to obtain the sentence alignment position;
the starting time and the stopping time of each phoneme audio are respectively obtained to obtain the starting position and the ending position, and the phoneme audio is extracted according to the starting position and the ending position to obtain the sentence alignment position.
In this embodiment, obtaining the sentence alignment position makes it easy to acquire the formant starting position of the audio, which further improves the efficiency of speech recognition model training.
Example four
Referring to fig. 4, a schematic structural diagram of a speech recognition model training system 100 according to a fourth embodiment of the present invention is shown, including: corpus augmentation module 10, language model training module 11, formant acquisition module 12 and acoustic model training module 13, wherein:
the corpus amplification module 10 is configured to obtain a sample corpus and a corpus text corresponding to the sample corpus, and perform corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text.
Wherein the corpus amplification module 10 is further configured to: extract the single-character pronunciations in the sample corpus, and extract the single-character texts in the corpus text;
map the homophone audio in the single-character pronunciations to specific-character audio to obtain the amplified corpus;
and map the homophone texts in the single-character texts into specific-character texts according to the amplified corpus to obtain the amplified text, and establish data correspondence between the amplified corpus and the amplified text.
Preferably, the corpus amplification module 10 is further configured to: perform pronunciation matching between the homophone audio in a preset homophone list and the pronunciation audio in the single-character pronunciations;
if a pronunciation audio matches any homophone audio in the preset homophone list, set that pronunciation audio as the homophone audio;
and acquire the pronunciation number of the matched homophone audio, and map and mark the homophone audio according to the pronunciation number to obtain the amplified corpus.
Further, the corpus amplification module 10 is further configured to: query the texts corresponding to the homophone audio in the single-character texts, and set the queried texts as the homophone texts;
and query the specific-character text according to the pronunciation number, and replace the homophone text corresponding to the pronunciation number with the specific-character text to obtain the amplified text.
And the language model training module 11 is configured to train a language model in the speech recognition model according to the corpus text, and perform sentence alignment on the amplified corpus according to the specified phonemes to obtain a sentence alignment position.
Wherein the language model training module 11 is further configured to: perform phoneme recognition on each corpus in the amplified corpus according to the specified phonemes;
and acquire the starting position and the ending position of the specified phoneme in the corresponding corpus according to the phoneme recognition result, so as to obtain the sentence alignment position.
A formant obtaining module 12, configured to obtain the formant starting position of the audio corresponding to the sentence alignment position, and delete the audio data at the formant starting position from the amplified corpus, where the method used to obtain the formant starting position includes a spectral envelope extraction method, a cepstrum method, an LPC method or a root-finding method.
And an acoustic model training module 13, configured to perform feature extraction on the amplified corpus after the data deletion to obtain acoustic features, and train an acoustic model in the speech recognition model according to the acoustic features.
In this embodiment, corpus amplification of the sample corpus and the corpus text effectively increases the amount of training data, which improves the training effect of the speech recognition model and allows a good model to be trained from less data. Deleting the audio data at the formant starting position from the amplified corpus effectively avoids the influence of transition features between different characters on model training, further improving the training effect of the speech recognition model.
EXAMPLE five
Referring to fig. 5, a mobile terminal 101 according to a fifth embodiment of the present invention includes a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to enable the mobile terminal 101 to execute the above-mentioned speech recognition model training method, and the mobile terminal 101 may be a robot.
The present embodiment also provides a storage medium storing the computer program used in the above-mentioned mobile terminal 101, which, when executed by a processor, implements the following steps:
obtaining a sample corpus and a corpus text corresponding to the sample corpus, and performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text;
training a language model in a speech recognition model according to the corpus text, and performing sentence alignment on the amplified corpus according to a specified phoneme to obtain a sentence alignment position;
acquiring the formant starting position of the audio corresponding to the sentence alignment position, and deleting the audio data at the formant starting position from the amplified corpus;
and performing feature extraction on the amplified corpus after data deletion to obtain acoustic features, and training an acoustic model in the speech recognition model according to the acoustic features. The storage medium may be, for example, a ROM/RAM, a magnetic disk, or an optical disk.
It will be apparent to those skilled in the art that the above division of functional units and modules is only an example given for convenience and brevity of description. In practical applications, the functions may be distributed among different functional units or modules as needed; that is, the internal structure of the storage device may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or as a software functional unit. In addition, the specific names of the functional units and modules are only used to distinguish them from one another and do not limit the protection scope of the present application.
Those skilled in the art will appreciate that the component structures illustrated in FIG. 4 do not limit the speech recognition model training system of the present invention: the system may include more or fewer components than those illustrated, combine some components, or arrange the components differently, and the speech recognition model training methods of FIGS. 1-3 may likewise be implemented with more or fewer components than those shown in FIG. 4. The units and modules referred to herein are a series of computer programs that can be executed by a processor (not shown) of the speech recognition model training system to perform specific functions, and all of them can be stored in a storage device (not shown) of the system.
The above description covers only preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalents, and improvements made within the spirit and principles of the present invention are intended to fall within its scope.
Claims (7)
1. A method for training a speech recognition model, the method comprising:
obtaining a sample corpus and a corpus text corresponding to the sample corpus, and performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text;
training a language model in a speech recognition model according to the corpus text, and performing sentence alignment on the amplified corpus according to the specified phoneme to obtain a sentence alignment position;
acquiring a formant initial position of the audio corresponding to the sentence alignment position, and deleting data of the audio in the amplified corpus at the formant initial position;
performing feature extraction on the amplified corpus after data deletion to obtain acoustic features, and training an acoustic model in the speech recognition model according to the acoustic features;
the step of performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text comprises:
extracting single-character pronunciations from the sample corpus, and extracting single-character texts from the corpus text;
mapping homophone audio in the single-character pronunciations to specific character audio to obtain the amplified corpus;
mapping homophone texts in the single-character texts to specific character texts according to the amplified corpus to obtain the amplified text, and establishing a data correspondence between the amplified corpus and the amplified text;
the step of performing sentence alignment on the amplified corpus according to the designated phoneme to obtain a sentence alignment position comprises:
performing phoneme recognition on each corpus in the amplified corpus according to the specified phoneme;
and acquiring the start position and the end position of the specified phoneme in the corresponding corpus according to the phoneme recognition result, so as to obtain the sentence alignment position.
2. The method for training a speech recognition model according to claim 1, wherein the step of mapping the homophone audio in the single character pronunciation to a specific character audio to obtain the augmented corpus comprises:
performing pronunciation matching between homophone audio in a preset homophone list and pronunciation audio in the single-character pronunciations;
if the pronunciation audio matches any homophone audio in the preset homophone list, setting the pronunciation audio as homophone audio;
and acquiring the pronunciation number matched to the homophone audio, and mapping and marking the homophone audio according to the pronunciation number to obtain the amplified corpus.
3. The method for training a speech recognition model according to claim 2, wherein the step of mapping homophone texts in the single character texts to specific character texts according to the augmented corpus to obtain the augmented texts comprises:
querying the text corresponding to the homophone audio in the single-character texts, and setting the queried text as the homophone text;
and querying the specific character text according to the pronunciation number, and replacing the homophone text corresponding to the pronunciation number with the specific character text to obtain the amplified text.
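The homophone replacement described in claims 2 and 3 can be sketched as follows. The homophone list format, the pronunciation keys, and the function name are all hypothetical illustrations; the patent does not prescribe a concrete data structure, only that homophone characters sharing a pronunciation number are replaced by one specific character so that corpus and text stay in correspondence.

```python
# Hypothetical homophone list: pronunciation number -> specific character.
# E.g. 事/市/是 share the pronunciation "shi4" and all map to 是.
HOMOPHONE_LIST = {
    "shi4": "是",
    "ta1": "他",
}

def augment_text(chars, pronunciations, homophone_list):
    """Replace each homophone character with the specific character that
    shares its pronunciation number; characters whose pronunciation is not
    in the list are kept unchanged, so text and corpus stay aligned."""
    return [homophone_list.get(p, c) for c, p in zip(chars, pronunciations)]
```

Because every homophone collapses onto one specific character, each amplified text token still corresponds one-to-one with an audio segment in the amplified corpus.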
4. The method for training a speech recognition model according to claim 1, wherein the method for obtaining the formant start position of the audio corresponding to the sentence alignment position comprises a spectral envelope extraction method, a cepstrum method, an LPC method or a root finding method.
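Of the formant-detection methods named in claim 4, the root-finding method can be sketched as below: fit LPC coefficients to a voiced frame and convert the angles of the complex roots of the prediction polynomial into frequencies. This is a generic textbook sketch, not the patent's implementation; the LPC order and windowing choices are illustrative assumptions.

```python
import numpy as np

def lpc_formants(frame, sr, order=12):
    """Estimate formant frequencies of one voiced frame by the root-finding
    method: solve the LPC normal equations, then take the angles of the
    upper-half-plane roots of the prediction polynomial A(z)."""
    frame = frame * np.hamming(len(frame))
    # Biased autocorrelation r[0..order]
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    # Toeplitz normal equations R a = r[1..order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]            # one root per conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)   # root angle -> frequency in Hz
    return np.sort(freqs)
```

The formant start position would then be the first frame index at which a stable formant track appears after the sentence alignment position.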
5. A speech recognition model training system, the system comprising:
the corpus amplification module is used for acquiring a sample corpus and a corpus text corresponding to the sample corpus, and performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text;
the language model training module is used for training a language model in the speech recognition model according to the corpus text, and for performing sentence alignment on the amplified corpus according to the specified phoneme to obtain a sentence alignment position;
a formant obtaining module, configured to obtain a formant starting position of the audio corresponding to the sentence alignment position, and delete data of the audio in the amplified corpus at the formant starting position;
the acoustic model training module is used for performing feature extraction on the amplified corpus after data deletion to obtain acoustic features, and for training an acoustic model in the speech recognition model according to the acoustic features;
the corpus amplification module is further configured to:
extract single-character pronunciations from the sample corpus, and extract single-character texts from the corpus text;
map homophone audio in the single-character pronunciations to specific character audio to obtain the amplified corpus;
map homophone texts in the single-character texts to specific character texts according to the amplified corpus to obtain the amplified text, and establish a data correspondence between the amplified corpus and the amplified text;
the language model training module is further configured to: perform phoneme recognition on each corpus in the amplified corpus according to the specified phoneme;
and acquire the start position and the end position of the specified phoneme in the corresponding corpus according to the phoneme recognition result, so as to obtain the sentence alignment position.
6. A mobile terminal, characterized in that it comprises a storage device for storing a computer program and a processor that runs the computer program to cause the mobile terminal to execute the speech recognition model training method according to any one of claims 1 to 4.
7. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the speech recognition model training method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010573045.3A CN111933116B (en) | 2020-06-22 | 2020-06-22 | Speech recognition model training method, system, mobile terminal and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111933116A CN111933116A (en) | 2020-11-13 |
CN111933116B true CN111933116B (en) | 2023-02-14 |
Family
ID=73316584
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112634897B (en) * | 2020-12-31 | 2022-10-28 | 青岛海尔科技有限公司 | Equipment awakening method and device, storage medium and electronic device |
CN113539245B (en) * | 2021-07-05 | 2024-03-15 | 思必驰科技股份有限公司 | Language model automatic training method and system |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5146539A (en) * | 1984-11-30 | 1992-09-08 | Texas Instruments Incorporated | Method for utilizing formant frequencies in speech recognition |
US6618699B1 (en) * | 1999-08-30 | 2003-09-09 | Lucent Technologies Inc. | Formant tracking based on phoneme information |
CN101004911B (en) * | 2006-01-17 | 2012-06-27 | 纽昂斯通讯公司 | Method and device for generating frequency bending function and carrying out frequency bending |
CN101030197A (en) * | 2006-02-28 | 2007-09-05 | 株式会社东芝 | Method and apparatus for bilingual word alignment, method and apparatus for training bilingual word alignment model |
CN109036381A (en) * | 2018-08-08 | 2018-12-18 | 平安科技(深圳)有限公司 | Method of speech processing and device, computer installation and readable storage medium storing program for executing |
CN109326277B (en) * | 2018-12-05 | 2022-02-08 | 四川长虹电器股份有限公司 | Semi-supervised phoneme forced alignment model establishing method and system |
CN110853625B (en) * | 2019-09-18 | 2022-05-17 | 厦门快商通科技股份有限公司 | Speech recognition model word segmentation training method and system, mobile terminal and storage medium |
CN111145729B (en) * | 2019-12-23 | 2022-10-28 | 厦门快商通科技股份有限公司 | Speech recognition model training method, system, mobile terminal and storage medium |
CN111192570B (en) * | 2020-01-06 | 2022-12-06 | 厦门快商通科技股份有限公司 | Language model training method, system, mobile terminal and storage medium |
CN111179917B (en) * | 2020-01-17 | 2023-01-03 | 厦门快商通科技股份有限公司 | Speech recognition model training method, system, mobile terminal and storage medium |
CN111210807B (en) * | 2020-02-21 | 2023-03-31 | 厦门快商通科技股份有限公司 | Speech recognition model training method, system, mobile terminal and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||