CN111696530B - Target acoustic model obtaining method and device - Google Patents

Target acoustic model obtaining method and device

Info

Publication number: CN111696530B (application CN202010366725.8A)
Authority: CN (China)
Prior art keywords: preset, word, target, pronunciation, tone
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN111696530A
Inventors: Zheng Xiaoming (郑晓明), Li Jian (李健), Wu Weidong (武卫东)
Assignee (original and current): Beijing Sinovoice Technology Co., Ltd.
Filed: 2020-04-30; published as CN111696530A on 2020-09-22; granted as CN111696530B on 2023-04-18

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/26: Speech to text systems
    • G10L2015/0631: Creating reference templates; Clustering
    • G10L2015/0633: Creating reference templates; Clustering using lexical or orthographic knowledge sources
    • G10L2015/0635: Training updating or merging of old and new templates; Mean values; Weighting

Abstract

Embodiments of the invention provide a method and an apparatus for obtaining a target acoustic model. The method comprises the following steps: screening, from a preset pronunciation dictionary, the preset words whose pronunciation changes tone when they are collocated with specific words, to obtain target preset words; adding the pronunciations of the target preset words as collocated with the specific words to the preset pronunciation dictionary, to generate a target pronunciation dictionary; and training a preset acoustic model with the target pronunciation dictionary to obtain the target acoustic model. The embodiments of the invention thus handle the tone changes that arise when words are collocated and supplement the pronunciation dictionary with the changed pronunciations, so that the recorded pronunciations are more complete and closer to users' actual speech; training the acoustic model on the optimized pronunciation dictionary then improves the speech recognition accuracy of the resulting target acoustic model.

Description

Target acoustic model obtaining method and device
Technical Field
The invention relates to the field of speech recognition, and in particular to a target acoustic model obtaining method and device.
Background
Automatic Speech Recognition (ASR) is the technology of converting human speech into text. It is widely applied in services such as voice dialing, voice navigation, indoor device control, voice document retrieval, and simple dictation data entry.
To implement speech recognition, an acoustic model must first be obtained. A method of acquiring an acoustic model is therefore needed.
Disclosure of Invention
The embodiments of the present invention provide a target acoustic model obtaining method and device, aiming to solve the problem of low speech recognition accuracy in the prior art.
In order to solve the above problem, the embodiment of the present invention is implemented as follows:
in a first aspect, an embodiment of the present invention discloses a target acoustic model obtaining method, including:
screening, from a preset pronunciation dictionary, preset words whose pronunciation changes tone when collocated with specific words, to obtain target preset words; the pronunciation dictionary contains at least preset words and the basic pronunciations of the preset words;
adding the pronunciation of the target preset word as collocated with the specific word to the preset pronunciation dictionary to generate a target pronunciation dictionary;
and training a preset acoustic model according to the target pronunciation dictionary to obtain a target acoustic model.
In a second aspect, an embodiment of the present invention discloses a target acoustic model obtaining apparatus, including:
the screening module is used for screening, from a preset pronunciation dictionary, preset words whose pronunciation changes tone when collocated with specific words, to obtain target preset words; the pronunciation dictionary contains at least preset words and the basic pronunciations of the preset words;
the adding module is used for adding the pronunciation of the target preset word as collocated with the specific word to the preset pronunciation dictionary to generate a target pronunciation dictionary;
and the training module is used for training a preset acoustic model according to the target pronunciation dictionary to obtain a target acoustic model.
In a third aspect, an embodiment of the present invention further provides an electronic device, including a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the target acoustic model obtaining method according to the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the target acoustic model obtaining method according to the first aspect.
In the embodiments of the invention, preset words whose pronunciation changes tone when collocated with specific words are screened from a preset pronunciation dictionary to obtain target preset words; the pronunciations of the target preset words as collocated with the specific words are added to the preset pronunciation dictionary to generate a target pronunciation dictionary; and a preset acoustic model is then trained with the target pronunciation dictionary to obtain a target acoustic model. The embodiments of the invention handle the tone changes that occur when words are collocated and supplement the pronunciation dictionary with the changed pronunciations, so that the pronunciations in the dictionary are more complete and closer to users' actual speech; training the acoustic model on the optimized pronunciation dictionary then improves the speech recognition accuracy of the resulting target acoustic model.
Drawings
FIG. 1 is a flow chart illustrating the steps of a target acoustic model acquisition method of the present invention;
FIG. 2 is a flow chart illustrating the steps of another target acoustic model acquisition method of the present invention;
FIG. 3 is a block diagram of a target acoustic model acquisition apparatus according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, a flowchart illustrating steps of a target acoustic model obtaining method according to the present invention is shown, where the method may specifically include:
Step 101, screening, from a preset pronunciation dictionary, preset words whose pronunciation changes tone when collocated with specific words, to obtain target preset words; the pronunciation dictionary contains at least preset words and the basic pronunciations of the preset words.
In the embodiments of the present invention, the preset pronunciation dictionary (lexicon) records the correspondence between words and phonemes, where the phonemes represent the pronunciations of the corresponding words, for example as pinyin or phonetic symbols. A word may consist of one character or of several characters. The dictionary may, for instance, map Chinese initials and finals to characters or words, or map English phonetic symbols to English words.
The basic pronunciation refers to the original pronunciation of a word and/or its pronunciation after within-word tone change. The pronunciation dictionary stores many preset words together with their basic pronunciations; for example, it stores the word "premier" with the basic pronunciation "zong2li3" and the word "goji berry" with the basic pronunciation "gou2qi3". The digit 2 in a basic pronunciation indicates that the syllable it follows, such as "zong" or "gou", carries the rising tone (second tone), and the digit 3 indicates that the syllable, such as "li" or "qi", carries the third tone.
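For concreteness, such dictionary entries can be pictured as a small mapping. The following Python sketch is illustrative only: the Chinese word keys 总理 and 枸杞 are our reading of the translated examples "premier" and "goji berry", and the helper function is not part of the patent.

```python
# Minimal illustrative lexicon: word -> base pronunciation, written as
# pinyin syllables with tone digits (2 = rising/second tone, 3 = third tone).
lexicon = {
    "总理": "zong2li3",   # "premier"; within-word sandhi (3-3 -> 2-3) already applied
    "枸杞": "gou2qi3",    # "goji berry"; likewise stored with within-word sandhi
    "展览": "zhan2lan3",  # "exhibition"
}

def final_tone(pronunciation: str) -> int:
    """Tone digit of the last syllable, e.g. final_tone("zong2li3") == 3."""
    return int(pronunciation[-1])
```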
Tone change (tone sandhi) refers to the change of tone that occurs when syllables are pronounced in succession in actual speech, i.e., the tone value of a syllable is altered under the influence of the following tone. For example, the original pronunciation of "premier" is "zong3li3"; in actual speech a tone change generally occurs and "premier" is read as "zong2li3". This is within-word tone change. Pronunciations after within-word tone change are usually annotated directly in the pronunciation dictionary, and in the embodiments of the present invention such annotated pronunciations are likewise treated as the basic pronunciations of the preset words.
Furthermore, tone change occurs not only within a single word but also between a word and a specific word collocated after it. For example, in the phrase "premier + hao3" ("hello, Premier"), the pronunciation of "premier" changes to "zong2li2": the last character "li" carries the third tone in the basic pronunciation, but when the word is collocated with the third-tone word "hao3", the tone of that character changes to the rising (second) tone in actual speech. This is inter-word tone change, which arises when a word is collocated with certain specific words; similar cases include "exhibition hall", "election method", "horse race", "show prize", and the like.
The target preset words are the set of preset words in the preset pronunciation dictionary whose pronunciation changes tone when they are collocated with the specific words. The screening may be performed manually, judging the words one by one, or programmatically with a script, for example along the lines sketched below.
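As a sketch of the programmatic route, assuming (as in the examples above) that the candidates for inter-word tone change are exactly the words ending in a third-tone syllable; the names are illustrative, not from the patent:

```python
def screen_target_words(lexicon: dict) -> list:
    """Select preset words whose last syllable carries tone 3: these are the
    words that may undergo inter-word tone change when followed by another
    third-tone word, i.e., the screening candidates."""
    return [word for word, pron in lexicon.items() if int(pron[-1]) == 3]

# With the lexicon above: screen_target_words(lexicon)
# -> ["总理", "枸杞", "展览"]
```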
Step 102, adding the pronunciation of the target preset word as collocated with the specific word to the preset pronunciation dictionary to generate a target pronunciation dictionary.
In the embodiments of the present invention, the pronunciation of a target preset word as collocated with a specific word is its pronunciation after the inter-word tone change. For example, "premier" is pronounced "zong2li2" after the tone change, and "goji berry" is pronounced "gou2qi2". Adding the changed pronunciations to the preset pronunciation dictionary yields a target pronunciation dictionary that contains both the basic pronunciations of the preset words and their pronunciations after inter-word tone change, expanding the phonetic annotation of the preset pronunciation dictionary. When the acoustic model is subsequently trained, the annotated tones then match the actual pronunciations, which improves the accuracy of the acoustic model and, in turn, of speech recognition.
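One possible shape for this expansion step, keeping the changed pronunciation alongside the base one; this is a sketch, as the patent does not prescribe a particular data layout:

```python
def add_sandhi_pronunciations(lexicon: dict, target_words: list) -> dict:
    """Expand the dictionary: each target word gains an extra pronunciation
    variant in which the last syllable's tone 3 is replaced by tone 2."""
    expanded = {word: [pron] for word, pron in lexicon.items()}
    for word in target_words:
        base = expanded[word][0]
        expanded[word].append(base[:-1] + "2")  # "zong2li3" -> "zong2li2"
    return expanded
```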
Step 103, training a preset acoustic model according to the target pronunciation dictionary to obtain a target acoustic model.
In the embodiments of the present invention, an acoustic model (AM) receives a speech signal and outputs the phoneme information corresponding to that signal; it models variability in acoustics, phonetics, recording environment, speaker gender, accent, and the like.
Specifically, when training the acoustic model in this step, the preset text corpus to be recognized and the training audio collected for that corpus may first be obtained, and a monophone model may then be trained from the training audio and the target pronunciation dictionary. A monophone model is trained on single words and phonemes only, without using the context of the preceding or following phonemes. In a typical training procedure, a Gaussian mixture model-hidden Markov model (GMM-HMM) is used as the basic framework: the training audio is aligned against the acoustic model, and further training iterations refine the model parameters by re-aligning audio and text. A triphone model is then trained on top of the monophone model; a triphone model captures how a phoneme varies with its preceding and following phonemes. The training audio is then re-aligned with the new acoustic model and the triphone model is retrained, i.e., the acoustic model is optimized in a training-and-alignment loop, also known as Viterbi training. Training the acoustic model on a target pronunciation dictionary that contains both the basic pronunciations and the inter-word tone-changed pronunciations of the preset words improves the accuracy of the correspondence between speech signals and phonemes, and hence the accuracy of speech recognition.
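This monophone/triphone loop is normally built with a dedicated toolkit such as Kaldi. Purely as a toy illustration of GMM-HMM training and Viterbi alignment on acoustic features, here is a sketch using the third-party hmmlearn package; all data, shapes, and parameter values are invented for the example:

```python
import numpy as np
from hmmlearn.hmm import GMMHMM  # third-party package: pip install hmmlearn

# Toy stand-in for a single phone's acoustic model: a 3-state HMM whose
# emissions are 2-component diagonal-covariance GMMs over 13-dim MFCC frames.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 13))   # fake MFCC features: 200 frames x 13 dims
lengths = [120, 80]                   # frame counts of two "utterances"

phone_hmm = GMMHMM(n_components=3, n_mix=2, covariance_type="diag",
                   n_iter=20, random_state=0)
phone_hmm.fit(frames, lengths)        # Baum-Welch (EM) training

# Viterbi decoding yields the most likely state sequence for an utterance,
# which is the "alignment" step referred to above.
log_prob, states = phone_hmm.decode(frames[:120], algorithm="viterbi")
print(log_prob, states[:10])
```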
Of course, other approaches may also be adopted, such as training the acoustic model from the pronunciation dictionary with an artificial neural network (ANN), and the embodiments of the present invention are not limited in this respect.
In summary, in the target acoustic model obtaining method provided by the embodiments of the present invention, preset words whose pronunciation changes tone when collocated with specific words are screened from a preset pronunciation dictionary to obtain target preset words; the pronunciations of the target preset words as collocated with the specific words are added to the preset pronunciation dictionary to generate a target pronunciation dictionary; and a preset acoustic model is then trained with the target pronunciation dictionary to obtain a target acoustic model. The embodiments of the present invention handle the tone changes that occur when words are collocated and supplement the pronunciation dictionary with the changed pronunciations, so that the pronunciations of the preset words in the dictionary are more complete and closer to users' actual speech; training the acoustic model on the optimized pronunciation dictionary then improves the speech recognition accuracy of the resulting target acoustic model.
Referring to fig. 2, a flowchart illustrating steps of another target acoustic model acquisition method according to the present invention is shown, where the method may specifically include:
Step 201, screening, from a preset pronunciation dictionary, preset words whose pronunciation changes tone when collocated with specific words, to obtain target preset words; the pronunciation dictionary contains at least preset words and the basic pronunciations of the preset words.
Specifically, for the implementation of this step, reference may be made to step 101; it is not repeated here.
Specifically, this step may be implemented by the following substeps 2011 to 2014:
Substep 2011: determining the original tone of the last character of the preset word.
In this step, the original tone is the tone that the last character of a preset word carries in its basic pronunciation in the preset pronunciation dictionary. For example, for the word "premier" stored in the preset pronunciation dictionary with the basic pronunciation "zong2li3", the last character is "li", and its original tone is the third tone.
It should be noted that the basic pronunciation in the embodiments of the present invention may be the original pronunciation of a word and/or its pronunciation after within-word tone change, and the tone of the last character in the original pronunciation may be used directly as that character's original tone; since within-word tone change does not alter the tone of the last character, the tone of the last character in the basic pronunciation recorded in the preset pronunciation dictionary is used directly as the original tone in the embodiments of the present invention.
Substep 2012: determining the preset words whose last character's original tone is a first preset tone, and taking these preset words as first words.
In this step, the first preset tone may be a tone that is liable to change when the word is collocated with other words; for example, the first preset tone may be the third tone. Of course, the first tone (level), the second tone (rising), or the fourth tone (falling) could each likewise be used as the first preset tone, which is not limited in the embodiments of the present invention.
In this step, the original tones of the last characters of all words in the pronunciation dictionary may be determined in advance, and the words whose last character carries the first preset tone may be stored in a preset set; the first words can then be obtained directly from that set. In this way, the embodiments of the present invention can classify the preset words by the original tone of the word-final character, which facilitates subsequent comparison and lookup.
Substep 2013: determining the pronunciation of the first word when it is collocated with other words, to obtain the comparison pronunciation of the first word.
In this step, the comparison pronunciation is the pronunciation of the first word when it is collocated with another word. If no tone change occurs in the collocation, the comparison pronunciation is the same as the basic pronunciation; if a tone change occurs, the comparison pronunciation differs from the basic pronunciation.
Optionally, the first character of the other words carries an original tone that is a second preset tone, and both the first preset tone and the second preset tone are the third tone.
In this step, the other words may be words contained in the preset pronunciation dictionary or in other training text corpora, which is not limited in the embodiments of the present invention.
Specifically, when a preset word whose last character originally carries the third tone is collocated with a following third-tone word, inter-word tone change may occur. For example, the basic pronunciation of "premier" is "zong2li3"; in "premier + hao3" it changes to "zong2li2", i.e., the last character "li" originally carries the third tone, and when the word is collocated with the third-tone "hao3", the tone of that character changes to the rising (second) tone in actual speech. Similarly, the basic pronunciation of "goji berry" is "gou2qi3"; in "goji berry + zi3" it changes to "gou2qi2", i.e., the last character "qi" originally carries the third tone, and when the word is collocated with the third-tone character "zi", its tone changes to the rising tone in actual speech.
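The inter-word rule described in this paragraph can be written down directly; a sketch, where the function name and interface are our own:

```python
def collocation_pronunciation(word_pron: str, next_tone: int) -> str:
    """Inter-word third-tone sandhi: if a word ends in tone 3 and the next
    syllable also carries tone 3, the word's final syllable surfaces as tone 2."""
    if int(word_pron[-1]) == 3 and next_tone == 3:
        return word_pron[:-1] + "2"
    return word_pron

assert collocation_pronunciation("zong2li3", 3) == "zong2li2"  # premier + hao3
assert collocation_pronunciation("gou2qi3", 3) == "gou2qi2"    # goji + zi3
assert collocation_pronunciation("zong2li3", 1) == "zong2li3"  # no change
```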
In this step, restricting attention to first words whose last character carries the third tone, collocated with other words whose first character carries the third tone, narrows the search range for target preset words with inter-word tone change, which improves the efficiency of the search and saves time.
Substep 2014: if the original tone of the last character of the first word does not match the target tone, determining the first word as a target preset word; the target tone is the tone of the last character of the first word in the comparison pronunciation.
In this step, when the target tone does not match the original tone, it can be concluded that inter-word tone change occurs when the first word is collocated with the other word, and the first word can be determined as a target preset word, for example "premier" and "goji berry" in the foregoing examples.
Step 202, adding the pronunciation of the target preset word as collocated with the specific word to the preset pronunciation dictionary to generate a target pronunciation dictionary.
This step may be implemented in either of two ways: implementation one, consisting of substep (1) below, or implementation two, consisting of substeps (2) and (3) below.
Implementation one:
Substep (1): acquiring the comparison pronunciation of the first word corresponding to the target preset word, and adding the comparison pronunciation to the preset pronunciation dictionary.
In this step, the comparison pronunciation is the pronunciation of the target preset word after it is collocated with other words. For example, the basic pronunciation of "premier" stored in the pronunciation dictionary is "zong2li3", while the comparison pronunciation of this target preset word in "premier + hao3" is "zong2li2"; the mapping from "premier" to "zong2li2" may then be added to the preset pronunciation dictionary, thereby expanding the phonetic annotation of the dictionary.
In this step, adding the comparison pronunciations of the target preset words to the preset pronunciation dictionary expands the pronunciations in the dictionary, making them more complete and closer to users' actual pronunciation, which improves accuracy in the subsequent speech recognition process.
Implementation two:
Substep (2): acquiring the basic pronunciation of the target preset word.
In this step, after the target preset word is determined, its basic pronunciation is obtained from the preset pronunciation dictionary. For example, for the target preset word "exhibition", the basic pronunciation "zhan2lan3" is acquired.
Substep (3): modifying the tone of the last character of the target preset word in the basic pronunciation to obtain a target pronunciation, and adding the target pronunciation to the preset pronunciation dictionary; the target pronunciation is the same as the comparison pronunciation of the first word corresponding to the target preset word.
In this step, after the basic pronunciation of the target preset word is determined, the tone of its last character is modified, and the resulting target pronunciation is added to the preset pronunciation dictionary. For example, the basic pronunciation of the target preset word "exhibition" is "zhan2lan3"; inter-word tone change occurs when it is collocated with the third-tone word "hall", i.e., the comparison pronunciation of "exhibition" in "exhibition hall" is "zhan2lan2" in actual speech. In this step, the tone of the last character in the basic pronunciation is modified accordingly, i.e., the syllable "lan" is changed from the third tone to the rising (second) tone, giving the target pronunciation "zhan2lan2". Modifying the tone on top of the existing word-to-pronunciation mappings in the preset pronunciation dictionary reduces the amount of input data to be processed and is more flexible and convenient.
Step 203, training a preset acoustic model according to the target pronunciation dictionary to obtain a target acoustic model.
Specifically, for the implementation of this step, reference may be made to step 103; it is not repeated here.
Step 204, fusing the target pronunciation dictionary, the target acoustic model and a preset language model in a preset manner to obtain a decoder.
In this step, the decoder is used to output text for audio data after feature extraction, using the target acoustic model, the target pronunciation dictionary, and the language model. A preset language model (LM) is trained on a large amount of text and captures the probabilities with which individual words or word sequences co-occur; it represents the inherent associations between words in an arrangement.
Specifically, the decoder can be obtained by the standard speech recognition composition known as HCLG: the language model (G), the lexicon (L), the context-dependency information (C), and the hidden Markov models (H) are each built as a weighted finite-state transducer, and these four parts are then composed, using finite-state transducer operations, into a decoder that maps context-dependent phone sub-states to words. In the space formed by sentences or word sequences, the decoder generates a search state space, according to a chosen optimization criterion, from the target acoustic model, the language model, and the target pronunciation dictionary, and searches that space for the optimal state sequence, i.e., for the sentence or word sequence that outputs the speech signal with maximum probability.
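The search itself is, in essence, Viterbi dynamic programming over that composed state space. A generic toy implementation follows; the matrices here are placeholders, not an actual HCLG graph:

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely state path through an HMM-like state space.
    log_init: (S,) initial log-probabilities; log_trans: (S, S) transition
    log-probabilities; log_emit: (T, S) per-frame emission log-probabilities."""
    T, S = log_emit.shape
    delta = log_init + log_emit[0]           # best score ending in each state
    back = np.zeros((T, S), dtype=int)       # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # scores[i, j]: prev i -> cur j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):            # trace backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```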
Step 205, inputting the acoustic features of the speech signal to be recognized into the decoder.
In this step, before a decoder is used for speech recognition, features must first be extracted from the speech signal. Speech signals contain very rich parameters, and different feature vectors carry different acoustic meanings; extracting the acoustic features of the speech signal to be recognized amounts to selecting an effective representation of the audio.
Specifically, mel-frequency cepstral coefficient (MFCC) features are typically used for feature extraction from speech signals. The extraction process may include: first applying a Fast Fourier Transform (FFT) to the framed speech signal, then converting the resulting magnitude spectrum to the mel frequency scale with a triangular filter bank, computing the output of each triangular filter, taking the logarithm of all filter outputs, and finally applying a Discrete Cosine Transform (DCT) to the log filter-bank outputs to obtain the MFCC acoustic features of the speech signal.
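A sketch of this step using the third-party librosa package, which performs the FFT, mel filter bank, log, and DCT internally; the file name and parameter values are illustrative:

```python
import librosa  # third-party package: pip install librosa

# Load the speech signal and extract 13-dimensional MFCCs per frame.
signal, sr = librosa.load("utterance.wav", sr=16000)    # hypothetical file
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)  # 25 ms window, 10 ms hop
print(mfcc.shape)  # (13, n_frames): one 13-dim feature vector per frame
```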
Of course, the process of extracting the acoustic features in this step may also adopt other manners, which is not limited in this embodiment of the present invention.
In this step, extracting features from the speech signal before feeding the acoustic features of the speech signal to be recognized into the decoder reduces interference from redundant factors such as noise and silent segments and improves the quality of the representation, which in turn improves the accuracy of speech recognition.
Step 206, performing speech recognition on the acoustic features with the decoder, and outputting the text corresponding to the speech signal.
In this step, after the acoustic features are input into the decoder, the decoder uses the search state space composed from the target acoustic model, the language model, and the target pronunciation dictionary to determine the current phoneme from the feature vector of each frame, assembles phonemes into words, and assembles words into sentences. The speech signal is thus finally converted by speech recognition into its corresponding text.
In summary, in the target acoustic model obtaining method provided by the embodiments of the present invention, preset words whose pronunciation changes tone when collocated with specific words are screened from a preset pronunciation dictionary to obtain target preset words; the pronunciations of the target preset words as collocated with the specific words are added to the preset pronunciation dictionary to generate a target pronunciation dictionary; a preset acoustic model is trained with the target pronunciation dictionary to obtain a target acoustic model; speech recognition composition is then performed with the target pronunciation dictionary, the target acoustic model, and a preset language model to obtain a decoder; and the acoustic features of the speech signal to be recognized can then be input into the decoder, which outputs the corresponding text after speech recognition. The embodiments of the present invention handle the tone changes that occur when words are collocated and supplement the pronunciation dictionary with the inter-word tone-changed pronunciations, so that the pronunciations of the preset words are more complete and closer to users' actual speech; training the acoustic model on the optimized dictionary and composing the recognition network from the target acoustic model and the expanded target pronunciation dictionary addresses inter-word tone change in the recognition network and improves the accuracy of speech recognition.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those of skill in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the embodiments of the invention.
Referring to fig. 3, a block diagram of a target acoustic model obtaining apparatus of the present invention is shown, and specifically, the apparatus 30 may include the following modules:
the screening module 301 is configured to screen preset words which are generated by tone change when matched with the specific word from a preset pronunciation dictionary to obtain target preset words; the pronunciation dictionary at least comprises preset words and basic pronunciations of the preset words.
An adding module 302, configured to add the pronunciation of the target preset word as collocated with the specific word to the preset pronunciation dictionary to generate a target pronunciation dictionary.
And the training module 303 is configured to train a preset acoustic model according to the target pronunciation dictionary to obtain a target acoustic model.
Optionally, the screening module 301 is specifically configured to:
determining the original tone of the last character of the preset word; determining the preset words whose last character's original tone is a first preset tone, and taking these preset words as first words; determining the pronunciation of a first word when it is collocated with other words, to obtain the comparison pronunciation of the first word; and if the original tone of the last character of the first word does not match the target tone, determining the first word as a target preset word, the target tone being the tone of the last character of the first word in the comparison pronunciation.
Optionally, the first character of the other words carries an original tone that is a second preset tone, and both the first preset tone and the second preset tone are the third tone.
Optionally, the adding module 302 is configured to: acquire the comparison pronunciation of the first word corresponding to the target preset word, and add the comparison pronunciation to the preset pronunciation dictionary; or acquire the basic pronunciation of the target preset word, modify the tone of the last character of the target preset word in the basic pronunciation to obtain a target pronunciation, and add the target pronunciation to the preset pronunciation dictionary, the target pronunciation being the same as the comparison pronunciation of the first word corresponding to the target preset word.
Optionally, the apparatus further comprises:
the fusion module is used for fusing the target pronunciation dictionary, the target acoustic model and a preset language model according to a preset mode to obtain a decoder; the input module is used for inputting the acoustic characteristics of the voice signal to be recognized into the decoder; and the output module is used for carrying out voice recognition on the acoustic characteristics based on the decoder and outputting a text corresponding to the voice signal.
In summary, the apparatus provided by the embodiments of the present invention can screen, from a preset pronunciation dictionary, preset words whose pronunciation changes tone when collocated with specific words, to obtain target preset words; add the pronunciations of the target preset words as collocated with the specific words to the preset pronunciation dictionary to generate a target pronunciation dictionary; and then train a preset acoustic model with the target pronunciation dictionary to obtain a target acoustic model. The embodiments of the present invention handle the tone changes that occur when words are collocated and supplement the pronunciation dictionary with the changed pronunciations, so that the pronunciations of the preset words in the dictionary are more complete and closer to users' actual speech; training the acoustic model on the optimized pronunciation dictionary then improves the speech recognition accuracy of the resulting target acoustic model.
Optionally, an embodiment of the present invention further provides an electronic device, including a processor, a memory, and a computer program stored on the memory and executable on the processor; when executed by the processor, the computer program implements the processes of the above embodiments of the target acoustic model obtaining method and achieves the same technical effects, which are not repeated here to avoid repetition.
Optionally, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the processes of the above embodiments of the target acoustic model obtaining method and achieves the same technical effects, which are not repeated here to avoid repetition.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
Those skilled in the art will readily appreciate that any combination of the above embodiments is possible, and any such combination is therefore an embodiment of the present invention; for reasons of space, the details are not set forth here.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components in the embodiments may be combined into one module or unit or component, and furthermore, may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Moreover, those skilled in the art will appreciate that although some embodiments described herein include some features included in other embodiments, not others, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means can be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.

Claims (10)

1. A target acoustic model obtaining method, comprising:
screening, from a preset pronunciation dictionary, preset words whose pronunciation changes tone when collocated with specific words, to obtain target preset words; the pronunciation dictionary contains at least preset words and the basic pronunciations of the preset words;
adding the pronunciation of the target preset word as collocated with the specific word to the preset pronunciation dictionary to generate a target pronunciation dictionary;
and training a preset acoustic model according to the target pronunciation dictionary to obtain a target acoustic model.
2. The method of claim 1, wherein the screening, from the preset pronunciation dictionary, of preset words whose pronunciation changes tone when collocated with the specific words to obtain target preset words comprises:
determining the original tone of the last character of the preset word;
determining the preset words whose last character's original tone is a first preset tone, and taking these preset words as first words;
determining the pronunciation of the first word when it is collocated with other words, to obtain the comparison pronunciation of the first word;
if the original tone of the last character of the first word does not match the target tone, determining the first word as the target preset word; the target tone is the tone of the last character of the first word in the comparison pronunciation.
3. The method of claim 2, wherein the original tone of the first character of the other words is a second preset tone;
the first preset tone and the second preset tone are both the third tone.
4. The method according to claim 2 or 3, wherein the adding of the pronunciation of the target preset word as collocated with the specific word into the preset pronunciation dictionary comprises:
acquiring the comparison pronunciation of a first word corresponding to the target preset word; adding the comparison pronunciation to the preset pronunciation dictionary;
or, acquiring the basic pronunciation of the target preset word; modifying the tone of the last character of the target preset word in the basic pronunciation to obtain a target pronunciation; and adding the target pronunciation to the preset pronunciation dictionary; wherein the target pronunciation is the same as the comparison pronunciation of the first word corresponding to the target preset word.
5. The method of claim 1, wherein after the step of training a preset acoustic model according to the target pronunciation dictionary to obtain a target acoustic model, the method further comprises:
fusing the target pronunciation dictionary, the target acoustic model and a preset language model in a preset manner to obtain a decoder;
inputting acoustic features of a speech signal to be recognized into the decoder;
and performing speech recognition on the acoustic features based on the decoder, and outputting the text corresponding to the speech signal.
6. A target acoustic model obtaining apparatus, comprising:
the screening module is used for screening, from a preset pronunciation dictionary, preset words whose pronunciation changes tone when collocated with specific words, to obtain target preset words; the pronunciation dictionary contains at least preset words and the basic pronunciations of the preset words;
the adding module is used for adding the pronunciation of the target preset word as collocated with the specific word to the preset pronunciation dictionary to generate a target pronunciation dictionary;
and the training module is used for training a preset acoustic model according to the target pronunciation dictionary to obtain a target acoustic model.
7. The apparatus of claim 6, wherein the screening module is configured to:
determining the original tone of the last character of the preset word;
determining the preset words whose last character's original tone is a first preset tone, and taking these preset words as first words;
determining the pronunciation of the first word when it is collocated with other words, to obtain the comparison pronunciation of the first word;
if the original tone of the last character of the first word does not match the target tone, determining the first word as the target preset word; the target tone is the tone of the last character of the first word in the comparison pronunciation.
8. The apparatus of claim 7, wherein the original tone of the first character of the other words is a second preset tone; the first preset tone and the second preset tone are both the third tone.
9. The apparatus of claim 7 or 8, wherein the adding module is configured to:
acquiring the comparison pronunciation of a first word corresponding to the target preset word; adding the comparison pronunciation to the preset pronunciation dictionary;
or acquiring the basic pronunciation of the target preset word;
modifying the tone of the last character of the target preset word in the basic pronunciation to obtain a target pronunciation;
adding the target pronunciation into the preset pronunciation dictionary; and the target pronunciation is the same as the comparison pronunciation of the first word corresponding to the target preset word.
10. The apparatus of claim 6, further comprising:
the fusion module is used for fusing the target pronunciation dictionary, the target acoustic model and a preset language model in a preset manner to obtain a decoder;
the input module is used for inputting the acoustic features of the speech signal to be recognized into the decoder;
and the output module is used for performing speech recognition on the acoustic features based on the decoder and outputting the text corresponding to the speech signal.