CN111933116B - Speech recognition model training method, system, mobile terminal and storage medium - Google Patents

Speech recognition model training method, system, mobile terminal and storage medium Download PDF

Info

Publication number
CN111933116B
CN111933116B CN202010573045.3A
Authority
CN
China
Prior art keywords
corpus
text
amplified
audio
homophone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010573045.3A
Other languages
Chinese (zh)
Other versions
CN111933116A (en)
Inventor
张广学
肖龙源
叶志坚
李稀敏
刘晓葳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202010573045.3A priority Critical patent/CN111933116B/en
Publication of CN111933116A publication Critical patent/CN111933116A/en
Application granted granted Critical
Publication of CN111933116B publication Critical patent/CN111933116B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a speech recognition model training method, system, mobile terminal and storage medium, wherein the method comprises the following steps: obtaining a sample corpus and a corpus text corresponding to the sample corpus, and performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text; training a language model in the speech recognition model according to the amplified corpus and the corpus text, and performing sentence alignment on the amplified corpus according to specified phonemes to obtain sentence alignment positions; acquiring the formant initial position of the audio corresponding to each sentence alignment position, and deleting the data of that audio at the formant initial position from the amplified corpus; and performing feature extraction on the amplified corpus after the data deletion to obtain acoustic features, and training an acoustic model in the speech recognition model according to the acoustic features. By amplifying the sample corpus and the corpus text in this way, the invention increases the available training data and thereby improves the training effect of the speech recognition model.

Description

Speech recognition model training method, system, mobile terminal and storage medium
Technical Field
The invention belongs to the technical field of speech recognition, and in particular relates to a speech recognition model training method, system, mobile terminal and storage medium.
Background
Speech recognition has been researched for decades. Speech recognition technology mainly comprises four parts, namely acoustic model modeling, language model modeling, pronunciation dictionary construction and decoding, each of which can be an independent research direction. Moreover, compared with images and text, speech data is far more difficult to collect and label, so building a complete speech recognition model training system is time-consuming and difficult work, which has greatly hindered the development of speech recognition technology.
In the existing speech recognition model training process, a language model and an acoustic model are trained according to the input sample corpus and corpus text, and the amount of sample corpus and corpus text directly affects the training effect of the speech recognition model. When training a speech recognition model for a low-resource language, however, the sample corpus and corpus text data are scarce, so the recognition accuracy of the trained speech recognition model is low.
Disclosure of Invention
The embodiments of the invention aim to provide a speech recognition model training method, system, mobile terminal and storage medium, so as to solve the problem of poor training effect caused by the scarcity of sample corpus and corpus text data in existing speech recognition model training for low-resource languages.
The embodiment of the invention is realized in such a way that a speech recognition model training method comprises the following steps:
obtaining a sample corpus and a corpus text corresponding to the sample corpus, and performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text;
training a language model in a speech recognition model according to the corpus text, and performing sentence alignment on the amplified corpus according to a specified phoneme to obtain a sentence alignment position;
acquiring a formant initial position of the audio corresponding to the sentence alignment position, and deleting data of the audio in the amplified corpus at the formant initial position;
and performing feature extraction on the amplified corpus after data deletion to obtain acoustic features, and training an acoustic model in the voice recognition model according to the acoustic features.
Further, the step of performing corpus expansion on the sample corpus and the corpus text to obtain an expanded corpus and an expanded text includes:
extracting single character pronunciations in the sample corpus, and extracting single character texts in the corpus texts;
mapping homophone audio in the single character pronunciation to specific character audio to obtain the amplified corpus;
and mapping homophone texts in the single character texts into specific character texts according to the amplification linguistic data to obtain the amplification texts, and carrying out data correspondence on the amplification linguistic data and the amplification texts.
Furthermore, the step of mapping the homophone audio in the single-word pronunciation to a specific word audio to obtain the augmented corpus includes:
carrying out pronunciation matching according to homophone audio in a preset homophone list and pronunciation audio in the single character pronunciation;
if the pronunciation audio frequency is matched with any homophone audio frequency in the preset homophone list, setting the pronunciation audio frequency as the homophone audio frequency;
and acquiring the matched pronunciation serial number of the homophone audio, and mapping and marking the homophone audio according to the pronunciation serial number to obtain the amplification corpus.
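The matching-and-mapping steps above can be sketched roughly as follows. The homophone-list format, the `HOMO_` label scheme and the data shapes are assumptions for illustration, not the patent's actual data structures.

```python
# Hypothetical sketch of the homophone-mapping step: every single-character
# utterance whose pronunciation appears in a preset homophone list is
# relabelled with that pronunciation's serial number, so all homophones
# share one "specific character" label in the amplified corpus.
HOMOPHONE_LIST = {"hao": 1, "lai": 2, "lao": 3}  # pronunciation -> serial number (assumed)

def map_homophones(utterances):
    """utterances: list of (pronunciation, audio_id) pairs."""
    amplified = []
    for pron, audio_id in utterances:
        if pron in HOMOPHONE_LIST:            # pronunciation matches the preset list
            serial = HOMOPHONE_LIST[pron]     # matched pronunciation serial number
            amplified.append((f"HOMO_{serial}", audio_id))  # map to the shared label
        else:
            amplified.append((pron, audio_id))  # leave non-homophones unchanged
    return amplified

corpus = [("hao", "a.wav"), ("ni", "b.wav"), ("lai", "c.wav")]
print(map_homophones(corpus))
```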
Furthermore, the step of mapping the homophone text in the single character text into a specific character text according to the augmented corpus to obtain the augmented text includes:
inquiring a text corresponding to the homophone audio in the single character text, and setting the inquired text as the homophone text;
and inquiring the specific character text according to the pronunciation number, and replacing the homophone text corresponding to the pronunciation number by the specific character text to obtain the amplified text.
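A minimal sketch of the text-side replacement, under the same assumed data shapes; `SPECIFIC_TEXT` and the index-based match map are hypothetical stand-ins for the pronunciation-number lookup described above.

```python
# Text-side counterpart of the homophone mapping: look up the text positions
# that correspond to matched homophone audio and replace each with the
# specific-character text registered under the same pronunciation number.
SPECIFIC_TEXT = {1: "好"}  # pronunciation number -> specific character text (assumed)

def replace_homophone_text(tokens, matches):
    """tokens: list of single-character texts; matches: token index -> pronunciation number."""
    out = list(tokens)
    for idx, serial in matches.items():
        out[idx] = SPECIFIC_TEXT[serial]  # replace homophone text with the specific text
    return out

print(replace_homophone_text(["号", "你"], {0: 1}))  # both hao-characters collapse to one text
```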
Further, the step of performing sentence alignment on the augmented corpus according to the specified phoneme to obtain a sentence alignment position includes:
performing phoneme recognition on the linguistic data in the amplified linguistic data respectively according to the specified phonemes;
and acquiring the initial position and the end position of the appointed phoneme in the corresponding corpus according to the phoneme recognition result so as to obtain the sentence alignment position.
Furthermore, the method for acquiring the formant initial position of the audio corresponding to the sentence alignment position includes a spectral envelope extraction method, a cepstrum method, an LPC method or a root-finding method.
Another object of an embodiment of the present invention is to provide a speech recognition model training system, including:
the corpus amplification module is used for acquiring a sample corpus and a corpus text corresponding to the sample corpus, and performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text;
the language model training module is used for training a language model in the speech recognition model according to the corpus text and performing statement alignment on the amplified corpus according to the specified phonemes to obtain a statement alignment position;
a formant obtaining module, configured to obtain a formant starting position of an audio corresponding to the sentence alignment position, and delete data of the audio in the amplified corpus at the formant starting position;
and the acoustic model training module is used for extracting the characteristics of the amplified corpus after the data deletion is finished to obtain acoustic characteristics, and training an acoustic model in the voice recognition model according to the acoustic characteristics.
Still further, the corpus expansion module is further configured to:
extracting single-word pronunciations in the sample corpus, and extracting single-word texts in the corpus texts;
mapping homophone audio in the single character pronunciation to specific character audio to obtain the amplified corpus;
mapping homophone texts in the single character texts into specific character texts according to the amplification linguistic data to obtain the amplification texts, and carrying out data correspondence on the amplification linguistic data and the amplification texts.
Another object of an embodiment of the present invention is to provide a mobile terminal, which includes a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to make the mobile terminal execute the above-mentioned speech recognition model training method.
Another object of an embodiment of the present invention is to provide a storage medium, which stores a computer program used in the mobile terminal, wherein the computer program, when executed by a processor, implements the steps of the speech recognition model training method.
According to the embodiment of the invention, by designing the corpus amplification of the sample corpus and the corpus text, the data of the sample corpus and the corpus text are effectively increased, so that the training effect of the speech recognition model is improved, a better model training effect can be achieved based on less training data, and by designing the data of the audio frequency in the amplified corpus at the initial position of the formant, the influence of transition characteristics among different characters on the training of the speech recognition model is effectively avoided, and the training effect of the speech recognition model is further improved.
Drawings
FIG. 1 is a flowchart of a speech recognition model training method according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a speech recognition model training method according to a second embodiment of the present invention;
FIG. 3 is a flowchart of a speech recognition model training method according to a third embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a speech recognition model training system according to a fourth embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a mobile terminal according to a fifth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In order to illustrate the technical means of the present invention, the following description is given by way of specific examples.
Example one
Referring to fig. 1, a flowchart of a speech recognition model training method according to a first embodiment of the present invention includes the steps of:
step S10, obtaining a sample corpus and a corpus text corresponding to the sample corpus, and performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text;
the sample corpus is a language to be recognized by the speech recognition model, such as a cantonese language or a Minnan language, an expression mode of Mandarin is adopted in the corpus text, and the sample corpus and the corpus text are stored in a one-to-one corresponding relation;
furthermore, the sample corpus covers all vowels, consonants and mixed tones. In this step, the corpus amplification of the sample corpus and the corpus text is performed through homophone mapping: homophone audio in the sample corpus is mapped to a specific-character audio to amplify the sample corpus, and homophone text in the corpus text is mapped to a specific-character text to amplify the corpus text;
in this step, amplifying the sample corpus and the corpus text into the amplified corpus and the amplified text effectively increases the training data of the speech recognition model, preventing the poor training effect that scarce training data would otherwise cause for a low-resource language and improving the accuracy of the trained speech recognition model.
Step S20, training a language model in a speech recognition model according to the corpus text, and performing sentence alignment on the amplified corpus according to the specified phonemes to obtain a sentence alignment position;
the designated phoneme may be set as required; for example, it may be any consonant. Sentence alignment is performed on the amplified corpus according to the designated phoneme so as to obtain the start position and end position of the designated phoneme in the amplified corpus, and the range between the start position and the end position is set as the sentence alignment position;
specifically, the step of training a language model in the speech recognition model according to the corpus text includes: preprocessing the corpus text and performing word segmentation on the preprocessed corpus text to obtain a segmented text; carrying out vocabulary statistics on the segmented text and removing low-frequency words according to the statistics; and constructing a dictionary from the segmented text, calculating the 3-gram frequencies in the segmented text, and training the language model according to the dictionary and the 3-gram frequencies;
the preprocessing removes punctuation marks from the corpus text, converts English to lower case and normalizes numbers; the vocabulary statistics yield the frequency of each word in the segmented text, and any word whose frequency is below a frequency threshold is treated as a low-frequency word.
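The language-model data preparation described above (vocabulary statistics, low-frequency removal, 3-gram counting) can be sketched as follows; the threshold value and the padding symbols are illustrative assumptions.

```python
from collections import Counter

# Minimal sketch of the LM data preparation: count vocabulary over the
# segmented text, drop words below a frequency threshold, then count 3-gram
# frequencies for language-model training.
def prepare_lm_data(sentences, min_count=2):
    counts = Counter(w for s in sentences for w in s)
    vocab = {w for w, c in counts.items() if c >= min_count}  # remove low-frequency words
    cleaned = [[w for w in s if w in vocab] for s in sentences]
    trigrams = Counter()
    for s in cleaned:
        padded = ["<s>", "<s>"] + s + ["</s>"]  # sentence-boundary padding (assumed)
        for i in range(len(padded) - 2):
            trigrams[tuple(padded[i:i + 3])] += 1
    return vocab, trigrams

vocab, tg = prepare_lm_data([["我", "来", "了"], ["我", "来", "过"]])
print(sorted(vocab))               # only words seen at least min_count times survive
print(tg[("<s>", "我", "来")])
```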
Step S30, obtaining the formant initial position of the audio corresponding to the sentence alignment position, and deleting the data of the audio in the amplified corpus at the formant initial position;
formants are the regions of relatively concentrated energy in the frequency spectrum of a sound; they are not only a determining factor of sound quality but also reflect the physical characteristics of the vocal tract (resonant cavity). The method for acquiring the formant initial position of the audio corresponding to the sentence alignment position may be a spectral envelope extraction method, a cepstrum method, an LPC method or a root-finding method;
specifically, in the step, cepstrum separation is performed on the audio corresponding to the sentence alignment position through a cepstrum filter, inverse fourier transform is performed on the separated cepstrum, and the formant initial position is obtained based on the transform result of the inverse fourier transform.
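A rough sketch of such a cepstrum-based estimate, demonstrated on a synthetic 500 Hz resonance rather than real speech. The lifter cutoff and frame length are assumed values, and peak picking on the smoothed envelope stands in for the patent's formant-start detection; this is a sketch of the general cepstrum method, not the exact filter used.

```python
import numpy as np

# Cepstrum method sketch: low-quefrency liftering of the real cepstrum yields
# a smooth spectral envelope (via the inverse transform back to the spectral
# domain), and the envelope peak approximates a formant frequency.
def formant_peak(frame, sr, lifter=32):
    spectrum = np.abs(np.fft.rfft(frame)) + 1e-10   # magnitude spectrum (floored)
    cepstrum = np.fft.irfft(np.log(spectrum))       # real cepstrum
    cepstrum[lifter:-lifter] = 0.0                  # keep only the low-quefrency part
    envelope = np.fft.rfft(cepstrum).real           # smoothed log-spectral envelope
    peak_bin = int(np.argmax(envelope[1:])) + 1     # skip the DC bin
    return peak_bin * sr / len(frame)               # peak frequency in Hz

sr = 8000
t = np.arange(1024) / sr
frame = np.exp(-200 * t) * np.sin(2 * np.pi * 500 * t)  # damped 500 Hz resonance
print(round(formant_peak(frame, sr)))
```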
Step S40, extracting the characteristics of the amplified corpus after the data deletion is completed to obtain acoustic characteristics, and training an acoustic model in the voice recognition model according to the acoustic characteristics;
wherein MFCC features and i-vector features are extracted from the amplified corpus and combined to obtain the acoustic features;
specifically, in this step, monophone training is first performed on the acoustic model according to the acoustic features; the acoustic features are then differenced to obtain difference features, and triphone training is performed on the acoustic model according to the difference features to obtain a triphone model; phonemes are aligned according to the triphone model; finally, the acoustic features are vector-transformed into feature vectors, and the acoustic model is trained according to the feature vectors.
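The difference-feature step can be illustrated in isolation. The plain first- and second-order differences below are a simplified stand-in for the regression-based deltas usual in acoustic front ends, and the input array merely mimics MFCC-like frame features; both are illustrative assumptions, not the patent's exact procedure.

```python
import numpy as np

# Sketch of the difference-feature computation: frame-level base features
# (standing in for the combined MFCC + i-vector features) are extended with
# first- and second-order differences before triphone training.
def add_deltas(feats):
    """feats: (frames, dims) array; returns (frames, 3*dims) with deltas appended."""
    delta = np.diff(feats, axis=0, prepend=feats[:1])   # first-order difference
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])  # second-order difference
    return np.hstack([feats, delta, delta2])

feats = np.arange(12, dtype=float).reshape(4, 3)  # 4 frames of 3-dim features
out = add_deltas(feats)
print(out.shape)  # each frame now carries base + delta + delta-delta dims
```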
In this embodiment, amplifying the sample corpus and the corpus text effectively increases their data and thus improves the training effect of the speech recognition model, so that a good model can be trained even from a small amount of data; deleting the audio data at the formant initial positions in the amplified corpus effectively avoids the influence of the transition features between different characters on training, which further improves the training effect of the speech recognition model.
Example two
Please refer to fig. 2, which is a flowchart of a speech recognition model training method according to a second embodiment of the present application, where the second embodiment is used to refine step S10 in the first embodiment to describe how to perform corpus expansion on a sample corpus and a corpus text to obtain an expanded corpus and an expanded text, and includes the steps of:
s11, extracting the pronunciation of the single character in the sample corpus and extracting the text of the single character in the corpus text;
wherein a single-character pronunciation is an audio segment carrying exactly one pronunciation, and a single-character text is a character with exactly one pronunciation; for example, such pronunciations include hao, lai and lao, and the corresponding single-character texts are the individual characters carrying those pronunciations;
step S12, mapping the homophone voice frequency in the single character pronunciation to a specific character voice frequency to obtain the amplified corpus;
specifically, in this step, the step of mapping the homophone audio in the single character pronunciation to a specific character audio to obtain the augmented corpus includes:
carrying out pronunciation matching according to homophone audio in a preset homophone list and pronunciation audio in the single character pronunciation;
if the pronunciation audio frequency is matched with any homophone audio frequency in the preset homophone list, setting the pronunciation audio frequency as the homophone audio frequency;
acquiring the matched pronunciation number of the homophone audio, and mapping and marking the homophone audio according to the pronunciation number to obtain the amplification corpus;
for example, all characters pronounced hao are homophones of one another, and the audio corresponding to each of them is mapped to the homophone audio registered for the hao pronunciation;
step S13, mapping homophone texts in the single character texts into specific character texts according to the amplification linguistic data to obtain the amplification texts, and carrying out data correspondence on the amplification linguistic data and the amplification texts;
specifically, in this step, the step of mapping the homophone text in the single character text into a specific character text according to the augmented corpus to obtain the augmented text includes:
inquiring a text corresponding to the homophone audio in the single character text, and setting the inquired text as the homophone text;
inquiring the specific character text according to the pronunciation number, and replacing the homophone text corresponding to the pronunciation number by the specific character text to obtain the amplified text;
for example, all single-character texts sharing the pronunciation hao are mapped to the specific homophone text registered for hao.
According to the embodiment, by the design of corpus amplification of the sample corpus and the corpus text, the data of the sample corpus and the corpus text are effectively increased, the training effect of the voice recognition model is further improved, and a better model training effect can be achieved based on less training data.
EXAMPLE III
Please refer to fig. 3, which is a flowchart of a speech recognition model training method according to a third embodiment of the present application, where the third embodiment is used to refine step S20 in the first embodiment to refine and describe how to perform sentence alignment on an augmented corpus according to a specified phoneme to obtain a sentence alignment position, and includes the steps of:
step S21, performing phoneme recognition on the linguistic data in the amplified linguistic data respectively according to the specified phonemes;
the method comprises the steps of respectively carrying out phoneme recognition design on linguistic data in an amplified linguistic data according to specified phonemes to query phoneme audio frequencies corresponding to the specified phonemes in different linguistic data in the amplified linguistic data, and extracting the queried phoneme audio frequencies, wherein the specified phonemes can be any consonant;
step S22, acquiring the initial position and the end position of the designated phoneme in the corresponding corpus according to the phoneme recognition result to obtain the sentence alignment position;
the starting time and the stopping time of each phoneme audio are respectively obtained to obtain the starting position and the ending position, and the phoneme audio is extracted according to the starting position and the ending position to obtain the sentence alignment position.
In this embodiment, the design of obtaining the sentence alignment positions makes the formant starting positions of the audio easy to obtain, which further improves the efficiency of speech recognition model training.
Example four
Referring to fig. 4, a schematic structural diagram of a speech recognition model training system 100 according to a fourth embodiment of the present invention is shown, including: corpus augmentation module 10, language model training module 11, formant acquisition module 12 and acoustic model training module 13, wherein:
the corpus amplification module 10 is configured to obtain a sample corpus and a corpus text corresponding to the sample corpus, and perform corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text.
Wherein the corpus expansion module 10 is further configured to: extracting single character pronunciations in the sample corpus, and extracting single character texts in the corpus texts;
mapping homophone audio in the single character pronunciation to specific character audio to obtain the amplified corpus;
and mapping homophone texts in the single character texts into specific character texts according to the amplification linguistic data to obtain the amplification texts, and carrying out data correspondence on the amplification linguistic data and the amplification texts.
Preferably, the corpus expansion module 10 is further configured to: carrying out pronunciation matching according to homophone audio in a preset homophone list and pronunciation audio in the single character pronunciation;
if the pronunciation audio frequency is matched with any homophone audio frequency in the preset homophone list, setting the pronunciation audio frequency as the homophone audio frequency;
and acquiring the matched pronunciation number of the homophone audio, and mapping and marking the homophone audio according to the pronunciation number to obtain the amplification corpus.
Further, the corpus expansion module 10 is further configured to: inquiring a text corresponding to the homophone audio in the single character text, and setting the inquired text as the homophone text;
and inquiring the specific character text according to the pronunciation number, and replacing the homophone text corresponding to the pronunciation number by the specific character text to obtain the amplified text.
And the language model training module 11 is configured to train a language model in the speech recognition model according to the corpus text, and perform sentence alignment on the amplified corpus according to the specified phonemes to obtain a sentence alignment position.
Wherein, the language model training module 11 is further configured to: respectively carrying out phoneme recognition on the linguistic data in the amplified linguistic data according to the specified phonemes;
and acquiring the initial position and the end position of the appointed phoneme in the corresponding corpus according to the phoneme recognition result so as to obtain the sentence alignment position.
A formant obtaining module 12, configured to obtain a formant starting position of the audio corresponding to the sentence alignment position, and delete data of the audio in the augmented corpus at the formant starting position, where the method used to obtain the formant starting position of the audio corresponding to the sentence alignment position includes a spectral envelope extraction method, a cepstrum method, an LPC method, or a root-finding method.
And the acoustic model training module 13 is configured to perform feature extraction on the augmented corpus after the data deletion is completed, obtain acoustic features, and train an acoustic model in the speech recognition model according to the acoustic features.
In this embodiment, amplifying the sample corpus and the corpus text effectively increases their data and thus improves the training effect of the speech recognition model, so that a good model can be trained even from a small amount of data; deleting the audio data at the formant initial positions in the amplified corpus effectively avoids the influence of the transition features between different characters on training, which further improves the training effect of the speech recognition model.
EXAMPLE five
Referring to fig. 5, a mobile terminal 101 according to a fifth embodiment of the present invention includes a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to enable the mobile terminal 101 to execute the above-mentioned speech recognition model training method, and the mobile terminal 101 may be a robot.
The present embodiment also provides a storage medium on which a computer program used in the above-mentioned mobile terminal 101 is stored, which when executed, includes the steps of:
obtaining a sample corpus and a corpus text corresponding to the sample corpus, and performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text;
training a language model in a speech recognition model according to the corpus text, and performing sentence alignment on the amplified corpus according to a specified phoneme to obtain a sentence alignment position;
acquiring a formant initial position of the audio corresponding to the sentence alignment position, and deleting data of the audio in the amplified corpus at the formant initial position;
and performing feature extraction on the amplified corpus after data deletion to obtain acoustic features, and training an acoustic model in the speech recognition model according to the acoustic features. The storage medium may be, for example, a ROM/RAM, a magnetic disk or an optical disk.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is used as an example, in practical applications, the above-mentioned function distribution may be performed by different functional units or modules according to needs, that is, the internal structure of the storage device is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit, and the integrated unit may be implemented in a form of hardware, or may be implemented in a form of software functional unit. In addition, specific names of the functional units and modules are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the present application.
Those skilled in the art will appreciate that the component structure shown in FIG. 4 does not limit the speech recognition model training system of the present invention, which may include more or fewer components than those illustrated, combine certain components, or arrange the components differently; likewise, the speech recognition model training methods of FIGS. 1-3 may be implemented using more or fewer components than those shown in FIG. 4, a combination of certain components, or a different arrangement of components. The units and modules referred to herein are series of computer program instructions that can be executed by a processor (not shown) of the speech recognition model training system to perform specific functions, and they may all be stored in a storage device (not shown) of the system.
The above description presents only preferred embodiments of the present invention and is not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (7)

1. A method for training a speech recognition model, the method comprising:
obtaining a sample corpus and a corpus text corresponding to the sample corpus, and performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text;
training a language model in a speech recognition model according to the corpus text, and performing sentence alignment on the amplified corpus according to a designated phoneme to obtain a sentence alignment position;
acquiring the formant start position of the audio corresponding to the sentence alignment position, and deleting the data of the audio in the amplified corpus at the formant start position;
performing feature extraction on the amplified corpus after the data deletion to obtain acoustic features, and training an acoustic model in the speech recognition model according to the acoustic features;
the step of performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text comprises:
extracting single-character pronunciations from the sample corpus, and extracting single-character texts from the corpus text;
mapping homophone audio in the single-character pronunciations to specific-character audio to obtain the amplified corpus;
mapping homophone texts in the single-character texts to specific-character texts according to the amplified corpus to obtain the amplified text, and establishing a data correspondence between the amplified corpus and the amplified text;
the step of performing sentence alignment on the amplified corpus according to the designated phoneme to obtain a sentence alignment position comprises:
performing phoneme recognition on each corpus item in the amplified corpus according to the designated phoneme;
and acquiring the start position and end position of the designated phoneme in the corresponding corpus item according to the phoneme recognition result, so as to obtain the sentence alignment position.
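The alignment step of claim 1 can be pictured as a scan over a phoneme recognition result for the designated phoneme's spans. The sketch below is illustrative only; the (phoneme, start, end) track format is an assumed data structure, not one specified by the patent:

```python
def align_by_phoneme(phoneme_track, target):
    """Return (start, end) spans where the designated phoneme occurs.

    phoneme_track: list of (phoneme_label, start_time, end_time) tuples,
    a hypothetical representation of a phoneme recognition result.
    The returned spans serve as sentence alignment positions.
    """
    return [(start, end) for ph, start, end in phoneme_track if ph == target]
```

For example, aligning on a silence phoneme would yield the pause positions between sentences.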
2. The method for training a speech recognition model according to claim 1, wherein the step of mapping homophone audio in the single-character pronunciations to specific-character audio to obtain the amplified corpus comprises:
performing pronunciation matching between homophone audio in a preset homophone list and pronunciation audio in the single-character pronunciations;
if the pronunciation audio matches any homophone audio in the preset homophone list, setting the pronunciation audio as homophone audio;
and acquiring the pronunciation number of the matched homophone audio, and mapping and labeling the homophone audio according to the pronunciation number to obtain the amplified corpus.
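The matching in claim 2 amounts to looking up each single-character pronunciation in the preset homophone list and attaching the matched pronunciation number. The sketch below assumes pinyin-with-tone strings as the pronunciation representation and an illustrative two-entry list; both are assumptions for illustration, not the patent's data:

```python
# Hypothetical preset homophone list: pinyin syllable -> pronunciation number.
HOMOPHONE_LIST = {"shi4": 1, "yi1": 2}

def tag_homophone_audio(pronunciations):
    """Match each single-character pronunciation against the homophone list
    and attach the pronunciation number used in the later mapping step;
    unmatched pronunciations are tagged None."""
    return [(p, HOMOPHONE_LIST.get(p)) for p in pronunciations]
```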
3. The method for training a speech recognition model according to claim 2, wherein the step of mapping homophone texts in the single-character texts to specific-character texts according to the amplified corpus to obtain the amplified text comprises:
querying the text corresponding to the homophone audio in the single-character texts, and setting the queried text as the homophone text;
and querying the specific-character text according to the pronunciation number, and replacing the homophone text corresponding to the pronunciation number with the specific-character text to obtain the amplified text.
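Claim 3's replacement step can then reuse the pronunciation numbers: each homophone text is swapped for the specific-character text registered under the same number, keeping the amplified corpus and amplified text in correspondence. The mapping table and tuple format below are illustrative assumptions:

```python
# Hypothetical mapping: pronunciation number -> specific-character text.
SPECIFIC_CHAR_TEXT = {1: "是", 2: "一"}

def build_augmented_text(tagged_chars):
    """Build the amplified text from (character, pronunciation_number) pairs:
    tagged homophone texts are replaced by the specific-character text that
    shares their pronunciation number; untagged characters pass through."""
    return "".join(SPECIFIC_CHAR_TEXT.get(num, ch) for ch, num in tagged_chars)
```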
4. The method for training a speech recognition model according to claim 1, wherein the method for acquiring the formant start position of the audio corresponding to the sentence alignment position comprises a spectral envelope extraction method, a cepstrum method, an LPC method, or a root-finding method.
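Of the methods listed in claim 4, the LPC root-finding approach is the easiest to sketch: fit an all-pole model to an audio frame and read candidate formant frequencies off the angles of the model's complex roots. The following is a minimal NumPy sketch (autocorrelation LPC via the Levinson-Durbin recursion), not the patent's implementation; the model order and frequency threshold are illustrative:

```python
import numpy as np

def lpc_formants(frame, sr, order=8):
    """Estimate candidate formant frequencies of one audio frame by the
    LPC root-finding method."""
    frame = frame * np.hamming(len(frame))
    n = len(frame)
    # Autocorrelation at lags 0..order
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    # Levinson-Durbin recursion for LPC coefficients a[0..order], a[0] = 1
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= 1.0 - k * k
    # Roots above the real axis correspond to candidate formants;
    # the root angle maps to frequency as angle * sr / (2*pi)
    roots = [z for z in np.roots(a) if z.imag > 1e-6]
    freqs = sorted(np.angle(z) * sr / (2.0 * np.pi) for z in roots)
    return [f for f in freqs if f > 90.0]  # drop near-DC artifacts
```

In practice one would also screen the roots by bandwidth (root magnitude) before accepting them as formants.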
5. A speech recognition model training system, the system comprising:
the corpus amplification module is used for acquiring a sample corpus and a corpus text corresponding to the sample corpus, and performing corpus amplification on the sample corpus and the corpus text to obtain an amplified corpus and an amplified text;
the language model training module is used for training a language model in the speech recognition model according to the corpus text, and performing sentence alignment on the amplified corpus according to a designated phoneme to obtain a sentence alignment position;
the formant acquisition module is used for acquiring the formant start position of the audio corresponding to the sentence alignment position, and deleting the data of the audio in the amplified corpus at the formant start position;
the acoustic model training module is used for performing feature extraction on the amplified corpus after the data deletion is completed to obtain acoustic features, and training an acoustic model in the speech recognition model according to the acoustic features;
the corpus amplification module is further configured to:
extract single-character pronunciations from the sample corpus, and extract single-character texts from the corpus text;
map homophone audio in the single-character pronunciations to specific-character audio to obtain the amplified corpus;
map homophone texts in the single-character texts to specific-character texts according to the amplified corpus to obtain the amplified text, and establish a data correspondence between the amplified corpus and the amplified text;
the language model training module is further configured to: perform phoneme recognition on each corpus item in the amplified corpus according to the designated phoneme;
and acquire the start position and end position of the designated phoneme in the corresponding corpus item according to the phoneme recognition result, so as to obtain the sentence alignment position.
6. A mobile terminal, comprising a storage device for storing a computer program and a processor that runs the computer program to cause the mobile terminal to execute the speech recognition model training method according to any one of claims 1 to 4.
7. A storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the speech recognition model training method according to any one of claims 1 to 4.
CN202010573045.3A 2020-06-22 2020-06-22 Speech recognition model training method, system, mobile terminal and storage medium Active CN111933116B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010573045.3A CN111933116B (en) 2020-06-22 2020-06-22 Speech recognition model training method, system, mobile terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010573045.3A CN111933116B (en) 2020-06-22 2020-06-22 Speech recognition model training method, system, mobile terminal and storage medium

Publications (2)

Publication Number Publication Date
CN111933116A CN111933116A (en) 2020-11-13
CN111933116B (en) 2023-02-14

Family

Family ID: 73316584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010573045.3A Active CN111933116B (en) 2020-06-22 2020-06-22 Speech recognition model training method, system, mobile terminal and storage medium

Country Status (1)

Country Link
CN (1) CN111933116B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634897B (en) * 2020-12-31 2022-10-28 青岛海尔科技有限公司 Equipment awakening method and device, storage medium and electronic device
CN113539245B (en) * 2021-07-05 2024-03-15 思必驰科技股份有限公司 Language model automatic training method and system

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5146539A (en) * 1984-11-30 1992-09-08 Texas Instruments Incorporated Method for utilizing formant frequencies in speech recognition
US6618699B1 (en) * 1999-08-30 2003-09-09 Lucent Technologies Inc. Formant tracking based on phoneme information
CN101004911B (en) * 2006-01-17 2012-06-27 纽昂斯通讯公司 Method and device for generating frequency bending function and carrying out frequency bending
CN101030197A (en) * 2006-02-28 2007-09-05 株式会社东芝 Method and apparatus for bilingual word alignment, method and apparatus for training bilingual word alignment model
CN109036381A (en) * 2018-08-08 2018-12-18 平安科技(深圳)有限公司 Method of speech processing and device, computer installation and readable storage medium storing program for executing
CN109326277B (en) * 2018-12-05 2022-02-08 四川长虹电器股份有限公司 Semi-supervised phoneme forced alignment model establishing method and system
CN110853625B (en) * 2019-09-18 2022-05-17 厦门快商通科技股份有限公司 Speech recognition model word segmentation training method and system, mobile terminal and storage medium
CN111145729B (en) * 2019-12-23 2022-10-28 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium
CN111192570B (en) * 2020-01-06 2022-12-06 厦门快商通科技股份有限公司 Language model training method, system, mobile terminal and storage medium
CN111179917B (en) * 2020-01-17 2023-01-03 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium
CN111210807B (en) * 2020-02-21 2023-03-31 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium

Also Published As

Publication number Publication date
CN111933116A (en) 2020-11-13

Similar Documents

Publication Publication Date Title
Schuster et al. Japanese and Korean voice search
Ramani et al. A common attribute based unified HTS framework for speech synthesis in Indian languages
US7957969B2 Systems and methods for building a native language phoneme lexicon having native pronunciations of non-native words derived from non-native pronunciations
US7085716B1 (en) Speech recognition using word-in-phrase command
US7840399B2 (en) Method, device, and computer program product for multi-lingual speech recognition
WO2007097176A1 (en) Speech recognition dictionary making supporting system, speech recognition dictionary making supporting method, and speech recognition dictionary making supporting program
JP2002287787A (en) Disambiguation language model
CN111933116B (en) Speech recognition model training method, system, mobile terminal and storage medium
CN111192572A (en) Semantic recognition method, device and system
CN110852075A (en) Voice transcription method and device for automatically adding punctuation marks and readable storage medium
Al-Anzi et al. Synopsis on Arabic speech recognition
CN111798841B (en) Acoustic model training method and system, mobile terminal and storage medium
Kiecza et al. Data-driven determination of appropriate dictionary units for Korean LVCSR
Jyothi et al. Improved Hindi broadcast ASR by adapting the language model and pronunciation model using a priori syntactic and morphophonemic knowledge.
CN111429921B (en) Voiceprint recognition method, system, mobile terminal and storage medium
Al-Anzi et al. Performance evaluation of sphinx and HTK speech recognizers for spoken Arabic language
KR20050101695A (en) A system for statistical speech recognition using recognition results, and method thereof
Sung et al. Deploying google search by voice in cantonese
JP2938865B1 (en) Voice recognition device
KR20050101694A (en) A system for statistical speech recognition with grammatical constraints, and method thereof
Dawa et al. Multilingual Text–Speech Corpus of Mongolian
Ma et al. Russian speech recognition system design based on HMM
CN113506561B (en) Text pinyin conversion method and device, storage medium and electronic equipment
US20220189462A1 (en) Method of training a speech recognition model of an extended language by speech in a source language
JP2001188556A (en) Method and device for voice recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant