CN111933116B - Speech recognition model training method, system, mobile terminal and storage medium - Google Patents
- Publication number: CN111933116B
- Application number: CN202010573045.3A
- Authority
- CN
- China
- Prior art keywords
- corpus
- text
- amplified
- audio
- homophone
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
Abstract
The invention provides a speech recognition model training method, system, mobile terminal and storage medium. The method comprises the following steps: obtaining a sample corpus and the corpus text corresponding to the sample corpus, and performing corpus amplification on both to obtain an amplified corpus and an amplified text; training a language model in the speech recognition model according to the amplified corpus and the amplified text, and performing sentence alignment on the amplified corpus according to specified phonemes to obtain sentence alignment positions; acquiring the formant starting position of the audio corresponding to each sentence alignment position, and deleting the audio data at the formant starting position from the amplified corpus; and performing feature extraction on the amplified corpus after the deletion to obtain acoustic features, and training an acoustic model in the speech recognition model according to the acoustic features. By amplifying the sample corpus and the corpus text, the invention increases the amount of training data and thereby improves the training effect of the speech recognition model.
Description
Technical Field
The invention belongs to the technical field of speech recognition, and particularly relates to a speech recognition model training method, system, mobile terminal and storage medium.
Background
Speech recognition has been studied for decades. Speech recognition technology mainly comprises four parts: acoustic model modeling, language model modeling, pronunciation dictionary construction and decoding, each of which can be an independent research direction. Compared with images and text, speech data is much more difficult to collect and label, so building a complete speech recognition model training system is time-consuming and difficult work, which has greatly hindered the development of speech recognition technology.
In the existing speech recognition model training process, a language model and an acoustic model are trained on the input sample corpus and corpus text, and the size of the sample corpus and corpus text affects the training effect of the speech recognition model. For low-resource languages, however, little sample corpus and corpus text data is available, so the recognition performance of the trained speech recognition model is poor.
Disclosure of Invention
The embodiment of the invention aims to provide a speech recognition model training method, system, mobile terminal and storage medium, so as to solve the problem that the training effect of the speech recognition model is poor because only a small amount of sample corpus and corpus text data is available in existing speech recognition model training.
The embodiment of the invention is realized in such a way that a speech recognition model training method comprises the following steps:
obtaining a sample corpus and a corpus text corresponding to the sample corpus, and performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text;
training a language model in a speech recognition model according to the corpus text, and performing sentence alignment on the amplified corpus according to a specified phoneme to obtain a sentence alignment position;
acquiring the formant starting position of the audio corresponding to the sentence alignment position, and deleting the audio data at the formant starting position from the amplified corpus;
and performing feature extraction on the amplified corpus after the data deletion to obtain acoustic features, and training an acoustic model in the speech recognition model according to the acoustic features.
Further, the step of performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text includes:
extracting the single-character pronunciations in the sample corpus, and extracting the single-character texts in the corpus text;
mapping the homophone audio in the single-character pronunciations to specific-character audio to obtain the amplified corpus;
and mapping the homophone texts in the single-character texts into specific-character texts according to the amplified corpus to obtain the amplified text, and establishing data correspondence between the amplified corpus and the amplified text.
Furthermore, the step of mapping the homophone audio in the single-character pronunciations to specific-character audio to obtain the amplified corpus includes:
performing pronunciation matching between the homophone audio in a preset homophone list and the pronunciation audio in the single-character pronunciations;
if a pronunciation audio matches any homophone audio in the preset homophone list, setting that pronunciation audio as the homophone audio;
and acquiring the pronunciation number of the matched homophone audio, and mapping and marking the homophone audio according to the pronunciation number to obtain the amplified corpus.
Furthermore, the step of mapping the homophone texts in the single-character texts into specific-character texts according to the amplified corpus to obtain the amplified text includes:
querying the texts corresponding to the homophone audio in the single-character texts, and setting the queried texts as the homophone texts;
and querying the specific-character text according to the pronunciation number, and replacing the homophone text corresponding to the pronunciation number with the specific-character text to obtain the amplified text.
Further, the step of performing sentence alignment on the amplified corpus according to the specified phoneme to obtain a sentence alignment position includes:
performing phoneme recognition on each corpus in the amplified corpus according to the specified phoneme;
and acquiring the starting position and the ending position of the specified phoneme in the corresponding corpus according to the phoneme recognition result, so as to obtain the sentence alignment position.
Furthermore, the method for acquiring the formant starting position of the audio corresponding to the sentence alignment position includes a spectral envelope extraction method, a cepstrum method, an LPC method or a root-finding method.
Another object of an embodiment of the present invention is to provide a speech recognition model training system, including:
the corpus amplification module is used for acquiring a sample corpus and a corpus text corresponding to the sample corpus, and performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text;
the language model training module is used for training a language model in the speech recognition model according to the corpus text, and performing sentence alignment on the amplified corpus according to the specified phonemes to obtain a sentence alignment position;
a formant obtaining module, configured to obtain the formant starting position of the audio corresponding to the sentence alignment position, and delete the audio data at the formant starting position from the amplified corpus;
and the acoustic model training module is used for performing feature extraction on the amplified corpus after the data deletion to obtain acoustic features, and training an acoustic model in the speech recognition model according to the acoustic features.
Still further, the corpus amplification module is further configured to:
extracting single-word pronunciations in the sample corpus, and extracting single-word texts in the corpus texts;
mapping homophone audio in the single character pronunciation to specific character audio to obtain the amplified corpus;
and mapping the homophone texts in the single-character texts into specific-character texts according to the amplified corpus to obtain the amplified text, and establishing data correspondence between the amplified corpus and the amplified text.
Another object of an embodiment of the present invention is to provide a mobile terminal, which includes a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to make the mobile terminal execute the above-mentioned speech recognition model training method.
Another object of an embodiment of the present invention is to provide a storage medium, which stores a computer program used in the mobile terminal, wherein the computer program, when executed by a processor, implements the steps of the speech recognition model training method.
According to the embodiment of the invention, corpus amplification of the sample corpus and the corpus text effectively increases the amount of training data, which improves the training effect of the speech recognition model and allows a good model to be trained from less data. In addition, deleting the audio data at the formant starting position from the amplified corpus effectively avoids the influence of transition features between different characters on model training, further improving the training effect of the speech recognition model.
Drawings
FIG. 1 is a flowchart of a speech recognition model training method according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a speech recognition model training method according to a second embodiment of the present invention;
FIG. 3 is a flowchart of a speech recognition model training method according to a third embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a speech recognition model training system according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of a mobile terminal according to a fifth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In order to illustrate the technical means of the present invention, the following description is given by way of specific examples.
Example one
Referring to fig. 1, a flowchart of a speech recognition model training method according to a first embodiment of the present invention includes the steps of:
step S10, obtaining a sample corpus and a corpus text corresponding to the sample corpus, and performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text;
the sample corpus is a language to be recognized by the speech recognition model, such as a cantonese language or a Minnan language, an expression mode of Mandarin is adopted in the corpus text, and the sample corpus and the corpus text are stored in a one-to-one corresponding relation;
furthermore, the sample corpus covers all vowels, consonants and mixed tones. In this step, corpus amplification of the sample corpus and the corpus text is performed by homophone mapping: the homophone audio in the sample corpus is mapped to specific-character audio to amplify the sample corpus, and the homophone texts in the corpus text are mapped to specific-character texts to amplify the corpus text;
in this step, performing corpus amplification on the sample corpus and the corpus text to obtain the amplified corpus and the amplified text effectively increases the training data of the speech recognition model, preventing the poor training effect that results when a low-resource language has little training data, so the trained speech recognition model has high accuracy.
Step S20, training a language model in a speech recognition model according to the corpus text, and performing sentence alignment on the amplified corpus according to the specified phonemes to obtain a sentence alignment position;
the specified phoneme may be set as required; for example, it may be any consonant. Sentence alignment is performed on the amplified corpus according to the specified phoneme to obtain the starting position and the ending position of the specified phoneme in the amplified corpus, and the range between the starting position and the ending position is set as the sentence alignment position;
specifically, the step of training a language model in the speech recognition model according to the corpus text includes: preprocessing the corpus text, and performing word segmentation on the preprocessed corpus text to obtain a segmented text; performing vocabulary statistics on the segmented text, and removing low-frequency words according to the statistics; and constructing a dictionary from the segmented text, counting the 3-gram frequencies in the segmented text, and training the language model according to the dictionary and the 3-gram frequencies;
the preprocessing removes punctuation marks from the corpus text, converts English to lower case and normalizes numbers. The vocabulary statistics compute the frequency of each word in the segmented text, and any word whose frequency is below a frequency threshold is marked as a low-frequency word.
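The language-model preparation described above (preprocessing, word segmentation, low-frequency word removal, 3-gram counting) can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: the whitespace tokenizer, the `<num>` placeholder for normalized digits and the `min_count` threshold are all assumed choices.

```python
import re
from collections import Counter

def preprocess(line):
    """Lowercase, map digit runs to a <num> placeholder, strip punctuation."""
    line = line.lower()
    line = re.sub(r"\d+", "<num>", line)
    return re.sub(r"[^\w<>\s]", " ", line)

def train_trigram_counts(lines, min_count=2):
    """Return ({(w1, w2, w3): count}, vocabulary), with words whose
    frequency falls below min_count removed as low-frequency words."""
    tokens_per_line = [preprocess(l).split() for l in lines]
    unigrams = Counter(t for toks in tokens_per_line for t in toks)
    vocab = {w for w, c in unigrams.items() if c >= min_count}
    trigrams = Counter()
    for toks in tokens_per_line:
        # Low-frequency words are replaced rather than silently dropped.
        toks = [t if t in vocab else "<unk>" for t in toks]
        toks = ["<s>", "<s>"] + toks + ["</s>"]
        for i in range(len(toks) - 2):
            trigrams[tuple(toks[i : i + 3])] += 1
    return trigrams, vocab
```

In practice these raw counts would be fed into a smoothed n-gram toolkit rather than used directly.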
Step S30, acquiring the formant starting position of the audio corresponding to the sentence alignment position, and deleting the audio data at the formant starting position from the amplified corpus;
formants are regions of relatively concentrated energy in the frequency spectrum of a sound; they not only determine the sound quality but also reflect the physical characteristics of the vocal tract (resonant cavity). The formant starting position of the audio corresponding to the sentence alignment position may be acquired by a spectral envelope extraction method, a cepstrum method, an LPC method or a root-finding method;
specifically, in this step, the audio corresponding to the sentence alignment position is cepstrally separated by a cepstral filter, an inverse Fourier transform is applied to the separated cepstrum, and the formant starting position is obtained from the result of the inverse Fourier transform.
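A minimal sketch of the cepstrum method, assuming a rectangular low-quefrency lifter and a simple first-peak rule; the frame length, lifter cutoff and peak-picking heuristic are illustrative choices, not values taken from the patent.

```python
import numpy as np

def spectral_envelope(frame, lifter_cutoff=30):
    """Cepstral smoothing: FFT -> log magnitude -> cepstrum -> keep the
    low-quefrency part -> back to a smooth log-spectral envelope."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-10)
    cepstrum = np.fft.irfft(log_mag)
    lifter = np.zeros_like(cepstrum)
    lifter[:lifter_cutoff] = 1.0
    lifter[-lifter_cutoff + 1:] = 1.0  # keep the symmetric low-quefrency bins
    return np.fft.rfft(cepstrum * lifter).real

def first_formant_bin(frame, lifter_cutoff=30):
    """Index of the first peak of the smoothed envelope -- a stand-in for
    the 'formant starting position' the embodiment describes."""
    env = spectral_envelope(frame, lifter_cutoff)
    for i in range(1, len(env) - 1):
        if env[i] > env[i - 1] and env[i] >= env[i + 1]:
            return i
    return int(np.argmax(env))
```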
Step S40, performing feature extraction on the amplified corpus after the data deletion to obtain acoustic features, and training an acoustic model in the speech recognition model according to the acoustic features;
specifically, MFCC features and i-vector features are extracted from the amplified corpus, and the MFCC features and the i-vector features are combined to obtain the acoustic features;
specifically, in this step, monophone training is performed on the acoustic model according to the acoustic features, difference processing is applied to the acoustic features to obtain difference features, triphone training is performed according to the difference features to obtain a triphone model, the phonemes are aligned according to the triphone model, the acoustic features are transformed into feature vectors, and the acoustic model is trained according to the feature vectors.
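The feature-combination step can be sketched as follows. The 13-dimensional MFCCs, the 100-dimensional i-vector and the first-order difference scheme are assumed values for illustration; real i-vector extraction requires a trained extractor, which is out of scope here.

```python
import numpy as np

def add_deltas(feats):
    """Append first-order differences (the 'difference features' used for
    the triphone stage) to an (n_frames, dim) feature matrix."""
    delta = np.diff(feats, axis=0, prepend=feats[:1])
    return np.hstack([feats, delta])

def splice_ivector(mfcc, ivector):
    """Tile the per-utterance i-vector onto every MFCC frame, giving the
    combined per-frame acoustic feature the embodiment trains on."""
    tiled = np.tile(ivector, (mfcc.shape[0], 1))
    return np.hstack([mfcc, tiled])
```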
In this embodiment, corpus amplification of the sample corpus and the corpus text effectively increases the amount of training data, which improves the training effect of the speech recognition model and allows a good model to be trained from less data. Deleting the audio data at the formant starting position from the amplified corpus effectively avoids the influence of transition features between different characters on model training, further improving the training effect of the speech recognition model.
Example two
Please refer to fig. 2, which is a flowchart of a speech recognition model training method according to a second embodiment of the present application. The second embodiment refines step S10 of the first embodiment to describe how corpus amplification is performed on the sample corpus and the corpus text to obtain the amplified corpus and the amplified text, and includes the steps of:
s11, extracting the pronunciation of the single character in the sample corpus and extracting the text of the single character in the corpus text;
wherein a single-character pronunciation is an audio segment containing exactly one pronounced syllable, and a single-character text is a single written character; for example, a single-character pronunciation may be the syllable hao, lai or lao, and the corresponding single-character texts are the various Chinese characters that share that pronunciation;
step S12, mapping the homophone voice frequency in the single character pronunciation to a specific character voice frequency to obtain the amplified corpus;
specifically, in this step, the step of mapping the homophone audio in the single-character pronunciations to specific-character audio to obtain the amplified corpus includes:
performing pronunciation matching between the homophone audio in a preset homophone list and the pronunciation audio in the single-character pronunciations;
if a pronunciation audio matches any homophone audio in the preset homophone list, setting that pronunciation audio as the homophone audio;
and acquiring the pronunciation number of the matched homophone audio, and mapping and marking the homophone audio according to the pronunciation number to obtain the amplified corpus;
for example, the audio of every character whose pronunciation is hao is mapped to the single homophone audio corresponding to the pronunciation hao;
step S13, mapping homophone texts in the single character texts into specific character texts according to the amplification linguistic data to obtain the amplification texts, and carrying out data correspondence on the amplification linguistic data and the amplification texts;
specifically, in this step, the step of mapping the homophone texts in the single-character texts into specific-character texts according to the amplified corpus to obtain the amplified text includes:
querying the texts corresponding to the homophone audio in the single-character texts, and setting the queried texts as the homophone texts;
and querying the specific-character text according to the pronunciation number, and replacing the homophone text corresponding to the pronunciation number with the specific-character text to obtain the amplified text;
for example, all the single-character texts whose pronunciation is hao are mapped to the single homophone text corresponding to the pronunciation hao.
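The homophone mapping of Steps S12 and S13 can be sketched together as follows. The `HOMOPHONES` table, the choice of the first character as the "specific" character, and the numbering scheme are hypothetical stand-ins for the patent's preset homophone list.

```python
# Hypothetical homophone list: pronunciation -> characters sharing it,
# with the first entry taken as the "specific" (canonical) character.
HOMOPHONES = {
    "hao": ["好", "号", "毫", "豪"],
    "lai": ["来", "赖"],
}

def build_mapping(homophones):
    """char -> (canonical_char, pronunciation_number)."""
    mapping = {}
    for num, (pron, chars) in enumerate(sorted(homophones.items())):
        for ch in chars:
            mapping[ch] = (chars[0], num)
    return mapping

def augment_text(text, mapping):
    """Replace every homophone character by its canonical character, and
    record the pronunciation numbers so the original text stays
    recoverable and text/audio labels remain in one-to-one
    correspondence (-1 marks characters outside the homophone list)."""
    out, nums = [], []
    for ch in text:
        canonical, num = mapping.get(ch, (ch, -1))
        out.append(canonical)
        nums.append(num)
    return "".join(out), nums
```

Merging every character that shares a pronunciation into one label is what shrinks the label set and multiplies the examples per label, which is the amplification effect the embodiment relies on.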
In this embodiment, corpus amplification of the sample corpus and the corpus text effectively increases the amount of training data, thereby improving the training effect of the speech recognition model and allowing a good model to be trained from less data.
EXAMPLE III
Please refer to fig. 3, which is a flowchart of a speech recognition model training method according to a third embodiment of the present application. The third embodiment refines step S20 of the first embodiment to describe how sentence alignment is performed on the amplified corpus according to the specified phoneme to obtain the sentence alignment position, and includes the steps of:
step S21, performing phoneme recognition on the linguistic data in the amplified linguistic data respectively according to the specified phonemes;
the method comprises the steps of respectively carrying out phoneme recognition design on linguistic data in an amplified linguistic data according to specified phonemes to query phoneme audio frequencies corresponding to the specified phonemes in different linguistic data in the amplified linguistic data, and extracting the queried phoneme audio frequencies, wherein the specified phonemes can be any consonant;
step S22, acquiring the initial position and the end position of the designated phoneme in the corresponding corpus according to the phoneme recognition result to obtain the sentence alignment position;
the starting time and the stopping time of each phoneme audio are respectively obtained to obtain the starting position and the ending position, and the phoneme audio is extracted according to the starting position and the ending position to obtain the sentence alignment position.
In this embodiment, obtaining the sentence alignment position makes it easy to acquire the formant starting position of the audio, which further improves the efficiency of speech recognition model training.
Example four
Referring to fig. 4, a schematic structural diagram of a speech recognition model training system 100 according to a fourth embodiment of the present invention is shown, including: corpus augmentation module 10, language model training module 11, formant acquisition module 12 and acoustic model training module 13, wherein:
the corpus amplification module 10 is configured to obtain a sample corpus and a corpus text corresponding to the sample corpus, and perform corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text.
Wherein the corpus amplification module 10 is further configured to: extract the single-character pronunciations in the sample corpus, and extract the single-character texts in the corpus text;
map the homophone audio in the single-character pronunciations to specific-character audio to obtain the amplified corpus;
and map the homophone texts in the single-character texts into specific-character texts according to the amplified corpus to obtain the amplified text, and establish data correspondence between the amplified corpus and the amplified text.
Preferably, the corpus amplification module 10 is further configured to: perform pronunciation matching between the homophone audio in a preset homophone list and the pronunciation audio in the single-character pronunciations;
if a pronunciation audio matches any homophone audio in the preset homophone list, set that pronunciation audio as the homophone audio;
and acquire the pronunciation number of the matched homophone audio, and map and mark the homophone audio according to the pronunciation number to obtain the amplified corpus.
Further, the corpus amplification module 10 is further configured to: query the texts corresponding to the homophone audio in the single-character texts, and set the queried texts as the homophone texts;
and query the specific-character text according to the pronunciation number, and replace the homophone text corresponding to the pronunciation number with the specific-character text to obtain the amplified text.
And the language model training module 11 is configured to train a language model in the speech recognition model according to the corpus text, and perform sentence alignment on the amplified corpus according to the specified phonemes to obtain a sentence alignment position.
Wherein the language model training module 11 is further configured to: perform phoneme recognition on each corpus in the amplified corpus according to the specified phonemes;
and acquire the starting position and the ending position of the specified phoneme in the corresponding corpus according to the phoneme recognition result, so as to obtain the sentence alignment position.
A formant obtaining module 12, configured to obtain the formant starting position of the audio corresponding to the sentence alignment position, and delete the audio data at the formant starting position from the amplified corpus, where the method used to obtain the formant starting position includes a spectral envelope extraction method, a cepstrum method, an LPC method or a root-finding method.
And an acoustic model training module 13, configured to perform feature extraction on the amplified corpus after the data deletion to obtain acoustic features, and train an acoustic model in the speech recognition model according to the acoustic features.
In this embodiment, corpus amplification of the sample corpus and the corpus text effectively increases the amount of training data, which improves the training effect of the speech recognition model and allows a good model to be trained from less data. Deleting the audio data at the formant starting position from the amplified corpus effectively avoids the influence of transition features between different characters on model training, further improving the training effect of the speech recognition model.
EXAMPLE five
Referring to fig. 5, a mobile terminal 101 according to a fifth embodiment of the present invention includes a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to enable the mobile terminal 101 to execute the above-mentioned speech recognition model training method, and the mobile terminal 101 may be a robot.
The present embodiment also provides a storage medium storing the computer program used in the above-mentioned mobile terminal 101, which, when executed by a processor, implements the following steps:
obtaining a sample corpus and a corpus text corresponding to the sample corpus, and performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text;
training a language model in a speech recognition model according to the corpus text, and performing sentence alignment on the amplified corpus according to a specified phoneme to obtain a sentence alignment position;
acquiring the formant starting position of the audio corresponding to the sentence alignment position, and deleting the audio data at the formant starting position from the amplified corpus;
and performing feature extraction on the amplified corpus after data deletion to obtain acoustic features, and training an acoustic model in the speech recognition model according to the acoustic features. The storage medium may be, for example, a ROM/RAM, a magnetic disk, or an optical disk.
It will be apparent to those skilled in the art that the above division of functional units and modules is only an example given for convenience and brevity of description. In practical applications, the functions may be distributed among different functional units or modules as needed; that is, the internal structure of the storage device may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or as a software functional unit. In addition, the specific names of the functional units and modules are only used to distinguish them from one another and do not limit the protection scope of the present application.
Those skilled in the art will appreciate that the component structures illustrated in FIG. 4 do not limit the speech recognition model training system of the present invention: the system may include more or fewer components than those illustrated, combine some components, or arrange the components differently, and the speech recognition model training methods of FIGS. 1-3 may likewise be implemented with more or fewer components than those shown in FIG. 4. The units and modules referred to herein are a series of computer programs that can be executed by a processor (not shown) of the speech recognition model training system to perform specific functions, and all of them can be stored in a storage device (not shown) of the system.
The above description covers only preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalents, and improvements made within the spirit and principles of the present invention are intended to fall within its scope.
Claims (7)
1. A method for training a speech recognition model, the method comprising:
obtaining a sample corpus and a corpus text corresponding to the sample corpus, and performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text;
training a language model in a speech recognition model according to the corpus text, and performing sentence alignment on the amplified corpus according to the specified phoneme to obtain a sentence alignment position;
acquiring a formant initial position of the audio corresponding to the sentence alignment position, and deleting data of the audio in the amplified corpus at the formant initial position;
performing feature extraction on the amplified corpus after data deletion to obtain acoustic features, and training an acoustic model in the speech recognition model according to the acoustic features;
the step of performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text comprises:
extracting single-character pronunciations from the sample corpus, and extracting single-character texts from the corpus text;
mapping homophone audio in the single-character pronunciations to specific character audio to obtain the amplified corpus;
mapping homophone texts in the single-character texts to specific character texts according to the amplified corpus to obtain the amplified text, and establishing a data correspondence between the amplified corpus and the amplified text;
the step of performing sentence alignment on the amplified corpus according to the designated phoneme to obtain a sentence alignment position comprises:
performing phoneme recognition on each corpus in the amplified corpus according to the specified phoneme;
and acquiring the start position and the end position of the specified phoneme in the corresponding corpus according to the phoneme recognition result, so as to obtain the sentence alignment position.
2. The method for training a speech recognition model according to claim 1, wherein the step of mapping the homophone audio in the single character pronunciation to a specific character audio to obtain the augmented corpus comprises:
performing pronunciation matching between homophone audio in a preset homophone list and pronunciation audio in the single-character pronunciations;
if the pronunciation audio matches any homophone audio in the preset homophone list, setting the pronunciation audio as homophone audio;
and acquiring the pronunciation number matched to the homophone audio, and mapping and marking the homophone audio according to the pronunciation number to obtain the amplified corpus.
3. The method for training a speech recognition model according to claim 2, wherein the step of mapping homophone texts in the single character texts to specific character texts according to the augmented corpus to obtain the augmented texts comprises:
querying the text corresponding to the homophone audio in the single-character texts, and setting the queried text as the homophone text;
and querying the specific character text according to the pronunciation number, and replacing the homophone text corresponding to the pronunciation number with the specific character text to obtain the amplified text.
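The homophone replacement described in claims 2 and 3 can be sketched as follows. The homophone list format, the pronunciation keys, and the function name are all hypothetical illustrations; the patent does not prescribe a concrete data structure, only that homophone characters sharing a pronunciation number are replaced by one specific character so that corpus and text stay in correspondence.

```python
# Hypothetical homophone list: pronunciation number -> specific character.
# E.g. 事/市/是 share the pronunciation "shi4" and all map to 是.
HOMOPHONE_LIST = {
    "shi4": "是",
    "ta1": "他",
}

def augment_text(chars, pronunciations, homophone_list):
    """Replace each homophone character with the specific character that
    shares its pronunciation number; characters whose pronunciation is not
    in the list are kept unchanged, so text and corpus stay aligned."""
    return [homophone_list.get(p, c) for c, p in zip(chars, pronunciations)]
```

Because every homophone collapses onto one specific character, each amplified text token still corresponds one-to-one with an audio segment in the amplified corpus.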
4. The method for training a speech recognition model according to claim 1, wherein the method for obtaining the formant start position of the audio corresponding to the sentence alignment position comprises a spectral envelope extraction method, a cepstrum method, an LPC method or a root finding method.
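Of the formant-detection methods named in claim 4, the root-finding method can be sketched as below: fit LPC coefficients to a voiced frame and convert the angles of the complex roots of the prediction polynomial into frequencies. This is a generic textbook sketch, not the patent's implementation; the LPC order and windowing choices are illustrative assumptions.

```python
import numpy as np

def lpc_formants(frame, sr, order=12):
    """Estimate formant frequencies of one voiced frame by the root-finding
    method: solve the LPC normal equations, then take the angles of the
    upper-half-plane roots of the prediction polynomial A(z)."""
    frame = frame * np.hamming(len(frame))
    # Biased autocorrelation r[0..order]
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    # Toeplitz normal equations R a = r[1..order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]            # one root per conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)   # root angle -> frequency in Hz
    return np.sort(freqs)
```

The formant start position would then be the first frame index at which a stable formant track appears after the sentence alignment position.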
5. A speech recognition model training system, the system comprising:
the corpus amplification module is used for acquiring a sample corpus and a corpus text corresponding to the sample corpus, and performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text;
the language model training module is used for training a language model in the speech recognition model according to the corpus text, and for performing sentence alignment on the amplified corpus according to the specified phoneme to obtain a sentence alignment position;
a formant obtaining module, configured to obtain a formant starting position of the audio corresponding to the sentence alignment position, and delete data of the audio in the amplified corpus at the formant starting position;
the acoustic model training module is used for performing feature extraction on the amplified corpus after data deletion to obtain acoustic features, and for training an acoustic model in the speech recognition model according to the acoustic features;
the corpus amplification module is further configured to:
extract single-character pronunciations from the sample corpus, and extract single-character texts from the corpus text;
map homophone audio in the single-character pronunciations to specific character audio to obtain the amplified corpus;
map homophone texts in the single-character texts to specific character texts according to the amplified corpus to obtain the amplified text, and establish a data correspondence between the amplified corpus and the amplified text;
the language model training module is further configured to: perform phoneme recognition on each corpus in the amplified corpus according to the specified phoneme;
and acquire the start position and the end position of the specified phoneme in the corresponding corpus according to the phoneme recognition result, so as to obtain the sentence alignment position.
6. A mobile terminal, characterized in that it comprises a storage device for storing a computer program and a processor that runs the computer program to cause the mobile terminal to execute the speech recognition model training method according to any one of claims 1 to 4.
7. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the speech recognition model training method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010573045.3A CN111933116B (en) | 2020-06-22 | 2020-06-22 | Speech recognition model training method, system, mobile terminal and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111933116A CN111933116A (en) | 2020-11-13 |
CN111933116B true CN111933116B (en) | 2023-02-14 |
Family
ID=73316584
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112634897B (en) * | 2020-12-31 | 2022-10-28 | 青岛海尔科技有限公司 | Equipment awakening method and device, storage medium and electronic device |
CN113539245B (en) * | 2021-07-05 | 2024-03-15 | 思必驰科技股份有限公司 | Language model automatic training method and system |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5146539A (en) * | 1984-11-30 | 1992-09-08 | Texas Instruments Incorporated | Method for utilizing formant frequencies in speech recognition |
US6618699B1 (en) * | 1999-08-30 | 2003-09-09 | Lucent Technologies Inc. | Formant tracking based on phoneme information |
CN101004911B (en) * | 2006-01-17 | 2012-06-27 | 纽昂斯通讯公司 | Method and device for generating frequency bending function and carrying out frequency bending |
CN101030197A (en) * | 2006-02-28 | 2007-09-05 | 株式会社东芝 | Method and apparatus for bilingual word alignment, method and apparatus for training bilingual word alignment model |
CN109036381A (en) * | 2018-08-08 | 2018-12-18 | 平安科技(深圳)有限公司 | Method of speech processing and device, computer installation and readable storage medium storing program for executing |
CN109326277B (en) * | 2018-12-05 | 2022-02-08 | 四川长虹电器股份有限公司 | Semi-supervised phoneme forced alignment model establishing method and system |
CN110853625B (en) * | 2019-09-18 | 2022-05-17 | 厦门快商通科技股份有限公司 | Speech recognition model word segmentation training method and system, mobile terminal and storage medium |
CN111145729B (en) * | 2019-12-23 | 2022-10-28 | 厦门快商通科技股份有限公司 | Speech recognition model training method, system, mobile terminal and storage medium |
CN111192570B (en) * | 2020-01-06 | 2022-12-06 | 厦门快商通科技股份有限公司 | Language model training method, system, mobile terminal and storage medium |
CN111179917B (en) * | 2020-01-17 | 2023-01-03 | 厦门快商通科技股份有限公司 | Speech recognition model training method, system, mobile terminal and storage medium |
CN111210807B (en) * | 2020-02-21 | 2023-03-31 | 厦门快商通科技股份有限公司 | Speech recognition model training method, system, mobile terminal and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||