CN107767858B - Pronunciation dictionary generating method and device, storage medium and electronic equipment - Google Patents
- Publication number
- CN107767858B CN107767858B CN201710805626.3A CN201710805626A CN107767858B CN 107767858 B CN107767858 B CN 107767858B CN 201710805626 A CN201710805626 A CN 201710805626A CN 107767858 B CN107767858 B CN 107767858B
- Authority
- CN
- China
- Prior art keywords
- pronunciation
- unit
- path
- score
- alternative
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Abstract
The disclosure provides a pronunciation dictionary generation method and apparatus, a storage medium, and an electronic device. The method comprises the following steps: acquiring a speech segment corresponding to a word whose pronunciation is to be determined, and constructing a pronunciation recognition network for that word, the network comprising the word's correct pronunciation units and sound-change pronunciation units; decoding the speech segment with the pronunciation recognition network to determine the pronunciation path corresponding to the segment, the path being formed from the correct pronunciation units and/or the sound-change pronunciation units; and calculating the confidence of the pronunciation represented by each pronunciation path, and generating a pronunciation dictionary for the word from the pronunciations represented by paths whose confidence exceeds a preset value. With this scheme, the generated pronunciation dictionary better matches users' actual pronunciations and is more accurate.
Description
Technical Field
The present disclosure relates to the field of speech recognition, and in particular, to a method and an apparatus for generating a pronunciation dictionary, a storage medium, and an electronic device.
Background
With the continuous development of speech recognition technology, speech recognition is now widely applied in many fields, such as speech input methods, conference transcription, and film and television subtitle generation.
One important resource in speech recognition is the pronunciation dictionary, which typically maps words to phoneme strings. For example, the Chinese pronunciation dictionary entry "拔 b a2" indicates that the word "拔" (to pluck) is pronounced "b a2", where "b" and "a2" are monophones, the representational units modeled by the acoustic model; these are generally stable pronunciation units that indicate how a word is pronounced. An accurate pronunciation dictionary therefore directly determines the quality of the acoustic model and, in turn, can affect the overall recognition performance of the speech recognition system.
Currently, pronunciation dictionaries are mostly generated by different schemes for different types of languages:
For languages in which pronunciation information can be obtained directly from the written form, i.e., phonographic ("pinyin-type") languages such as Uyghur or Korean, words are spelled out from phonemes, and a pronunciation dictionary generally need not be built manually. For example, the Latin-transcribed Uyghur word "KoGun: k o G u n", where the text before the colon is the word and the text after it is its phoneme string.
For languages in which pronunciation information cannot be obtained directly from the written form, i.e., ideographic languages such as Chinese, one written form may correspond to several pronunciations and the pronunciation cannot be read off the characters, so the pronunciation dictionary must be built manually. As in the example above, the phoneme string "b a2" for the word "拔" cannot be obtained from the character itself.
In practice, sound-change phenomena are widespread in both phonographic and ideographic languages. For example: devoicing in Uyghur, where the "b" in the word "kitablar" is devoiced to "p" when spoken; voicing in Uyghur, where the "k" in the word "qelikikin" is voiced to "g" when spoken; deletion in Uyghur, where the "r" in the word "kitablar" is dropped when spoken; and sound change in Chinese dialects, where, for example, the Hefei dialect reads the word for "bathing" with its initial "xi" pronounced as "si".
Because of these sound changes, the dictionaries produced by generation schemes for phonographic languages may deviate from actual pronunciation, while for ideographic languages it is difficult to manually annotate the variant pronunciations of all words.
Disclosure of Invention
The present disclosure provides a pronunciation dictionary generation method and apparatus, a storage medium, and an electronic device, so that the generated pronunciation dictionary better matches users' actual pronunciations and is more accurate.
In order to achieve the above object, the present disclosure provides a pronunciation dictionary generating method, the method including:
acquiring a speech segment corresponding to a word whose pronunciation is to be determined, and constructing a pronunciation recognition network for the word, wherein the network comprises correct pronunciation units and sound-change pronunciation units of the word;

decoding the speech segment using the pronunciation recognition network to determine a pronunciation path corresponding to the speech segment, wherein the path is formed from the correct pronunciation units and/or the sound-change pronunciation units;

and calculating the confidence of the pronunciation represented by each pronunciation path, and generating a pronunciation dictionary for the word from the pronunciations represented by paths whose confidence exceeds a preset value.
Optionally, the word whose pronunciation is to be determined is in a phonographic (pinyin-type) language, and constructing the pronunciation recognition network for the word includes:

setting the levels of the correct pronunciation units in sequence according to the spelling order of the word;

for each level where sound change is possible, adding in parallel at that level the sound-change pronunciation units corresponding to the correct pronunciation unit;

and setting score values corresponding to each correct pronunciation unit and each sound-change pronunciation unit, thereby forming the pronunciation recognition network.
Optionally, decoding the speech segment using the pronunciation recognition network to determine the pronunciation path corresponding to the speech segment includes:

taking the speech segment as input to the pronunciation recognition network and traversing the levels in sequence to determine all alternative pronunciation paths corresponding to the speech segment;

and determining the alternative pronunciation path with the highest score value as the pronunciation path corresponding to the speech segment.
Optionally, the score value of an alternative pronunciation path is calculated as follows:

obtaining the final score of each pronunciation unit included in the alternative path: if the unit is a deletion-type sound-change pronunciation unit, its final score is the score value corresponding to that unit; otherwise, its final score is obtained by a mathematical operation on the unit's decoded acoustic score and its corresponding score value;

and performing a mathematical operation on the final scores of all the pronunciation units to obtain the score of the alternative pronunciation path.
Optionally, the word whose pronunciation is to be determined is in an ideographic language, and constructing the pronunciation recognition network for the word includes:

arranging in sequence a pronunciation-unit level and an empty-node level, wherein the pronunciation-unit level comprises correct pronunciation units and sound-change pronunciation units arranged in parallel, and the empty-node level performs a jump back to the pronunciation-unit level before decoding of the speech segment is finished;

and setting score values corresponding to each correct pronunciation unit, each sound-change pronunciation unit, and the level jump-back, thereby forming the pronunciation recognition network.
Optionally, decoding the speech segment using the pronunciation recognition network to determine the pronunciation path corresponding to the speech segment includes:

taking the speech segment as input to the pronunciation recognition network and reaching the empty-node level after passing through the pronunciation-unit level;

judging whether decoding of the speech segment is finished, and if not, jumping back to the pronunciation-unit level until decoding finishes, yielding an alternative pronunciation path;

and after all alternative pronunciation paths corresponding to the speech segment are determined, determining the alternative path with the highest score value as the pronunciation path corresponding to the speech segment.
Optionally, the score value of an alternative pronunciation path is calculated as follows:

obtaining the final score of each pronunciation unit included in the alternative path: if the unit is a deletion-type sound-change pronunciation unit, its final score is the score value corresponding to that unit; otherwise, its final score is obtained by a mathematical operation on the unit's decoded acoustic score and its corresponding score value;

and performing a mathematical operation on the final score of each pronunciation unit together with the score values corresponding to all level jump-backs in the alternative path to obtain the score value of the alternative pronunciation path.
Optionally, if M pronunciation paths are obtained, with M ≥ 1, the confidence of the pronunciation represented by each path is obtained as follows:

obtaining the score S_j of the j-th pronunciation path and the sum of the scores of the M paths, S = S_1 + … + S_j + … + S_M;

and determining the ratio S_j / S as the confidence of the pronunciation represented by the j-th path.
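The confidence computation above can be sketched directly. This is an illustrative example, not code from the patent; the function name `path_confidences` is a hypothetical choice, and the input is simply the list of path scores S_1 … S_M.

```python
# Sketch of the confidence rule above: the confidence of the j-th
# pronunciation path is its score S_j divided by the total S of all paths.
def path_confidences(scores):
    """Given scores S_1..S_M of M pronunciation paths, return S_j / S
    for each path, where S is the sum of all scores."""
    total = sum(scores)
    return [s / total for s in scores]

# Example: three paths scoring 6, 3, 1 yield confidences 0.6, 0.3, 0.1,
# so the first path's pronunciation would clear a preset value of 0.5.
print(path_confidences([6.0, 3.0, 1.0]))
```

Note that the confidences always sum to 1, so the preset threshold effectively selects how dominant a pronunciation must be among the decoded alternatives.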
The present disclosure provides a pronunciation dictionary generating apparatus, the apparatus including:

a speech segment acquisition module, configured to acquire a speech segment corresponding to the word whose pronunciation is to be determined;

a pronunciation recognition network construction module, configured to construct a pronunciation recognition network for the word, the network comprising correct pronunciation units and sound-change pronunciation units of the word;

a pronunciation path determination module, configured to decode the speech segment using the pronunciation recognition network and determine the pronunciation path corresponding to the segment, the path being formed from the correct pronunciation units and/or the sound-change pronunciation units;

a confidence calculation module, configured to calculate the confidence of the pronunciation represented by the pronunciation path;

and a pronunciation dictionary generation module, configured to generate a pronunciation dictionary for the word from the pronunciations represented by paths whose confidence exceeds a preset value.
Optionally, the word whose pronunciation is to be determined is in a phonographic (pinyin-type) language, and

the pronunciation recognition network construction module is configured to add, in parallel at each level where sound change is possible, the sound-change pronunciation units corresponding to the correct pronunciation unit, and to set score values corresponding to each correct pronunciation unit and each sound-change pronunciation unit, thereby forming the pronunciation recognition network.
Optionally, the pronunciation path determination module includes:

an alternative pronunciation path determination module, configured to take the speech segment as input to the pronunciation recognition network, traverse the levels in sequence, and determine all alternative pronunciation paths corresponding to the speech segment;

and a pronunciation path determination submodule, configured to determine the alternative path with the highest score value as the pronunciation path corresponding to the speech segment.
Optionally, the pronunciation path determination module further comprises:
a score value obtaining module, configured to obtain the final score of each pronunciation unit included in the alternative pronunciation path: if the unit is a deletion-type sound-change pronunciation unit, its final score is the score value corresponding to that unit; otherwise, its final score is obtained by a mathematical operation on the unit's decoded acoustic score and its corresponding score value; and to perform a mathematical operation on the final scores of all the pronunciation units to obtain the score of the alternative pronunciation path.
Optionally, the word whose pronunciation is to be determined is in an ideographic language, and

the pronunciation recognition network construction module is configured to arrange in sequence a pronunciation-unit level and an empty-node level, the pronunciation-unit level comprising correct pronunciation units and sound-change pronunciation units arranged in parallel, and the empty-node level performing a jump back to the pronunciation-unit level before decoding of the speech segment is finished; and to set score values corresponding to each correct pronunciation unit, each sound-change pronunciation unit, and the level jump-back, thereby forming the pronunciation recognition network.
Optionally, the pronunciation path determination module includes:

an alternative pronunciation path determination module, configured to take the speech segment as input to the pronunciation recognition network, reach the empty-node level after passing through the pronunciation-unit level, judge whether decoding of the segment is finished, and if not, jump back to the pronunciation-unit level until decoding finishes, yielding an alternative pronunciation path;

and a pronunciation path determination submodule, configured to determine, after all alternative paths corresponding to the speech segment have been determined, the alternative path with the highest score value as the pronunciation path corresponding to the speech segment.
Optionally, the pronunciation path determination module further comprises:
a score value obtaining module, configured to obtain the final score of each pronunciation unit included in the alternative pronunciation path: if the unit is a deletion-type sound-change pronunciation unit, its final score is the score value corresponding to that unit; otherwise, its final score is obtained by a mathematical operation on the unit's decoded acoustic score and its corresponding score value; and to perform a mathematical operation on the final score of each pronunciation unit together with the score values corresponding to all level jump-backs in the alternative path to obtain the score value of the alternative pronunciation path.
Optionally, if M pronunciation paths are obtained, with M ≥ 1,

the confidence calculation module is configured to obtain the score S_j of the j-th pronunciation path and the sum of the scores of the M paths, S = S_1 + … + S_j + … + S_M, and to determine the ratio S_j / S as the confidence of the pronunciation represented by the j-th path.
The present disclosure provides a storage medium having stored therein a plurality of instructions that are loaded by a processor to perform the steps of the pronunciation dictionary generation method described above.
The present disclosure provides an electronic device, comprising:
the storage medium described above; and
a processor to execute the instructions in the storage medium.
In the above scheme, a pronunciation recognition network can be constructed for a word whose pronunciation is to be determined, the network comprising the word's correct pronunciation units and sound-change pronunciation units. The network can then be used to decode the speech segment corresponding to the word, yielding the pronunciation path for the segment, and the word's pronunciation is obtained according to the confidence of the pronunciation represented by that path so as to generate the pronunciation dictionary. With this scheme, the generated dictionary better matches users' actual pronunciations and is more accurate.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
FIG. 1 is a flow chart of a pronunciation dictionary generation method according to the present disclosure;
FIG. 2 is a schematic diagram of a pronunciation recognition network in accordance with aspects of the present disclosure;
FIG. 3 is a schematic diagram of another pronunciation recognition network in accordance with aspects of the present disclosure;
FIG. 4 is a schematic diagram of a pronunciation dictionary generating device according to the present disclosure;
fig. 5 is a schematic structural diagram of an electronic device for generating a pronunciation dictionary according to the present disclosure.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
Referring to FIG. 1, a flow chart of the pronunciation dictionary generation method of the present disclosure is shown. The method may include the following steps:
s101, obtaining a voice segment corresponding to the pronunciation word to be determined.
It is understood that the word whose pronunciation is to be determined may be in different types of languages, such as a phonographic (pinyin-type) language or an ideographic language; in different individual languages, such as Chinese or Uyghur; or even in different dialects of the same language, such as the Hefei and Nanjing dialects of Chinese. The present disclosure places no particular limitation on the form of the word.
In the present disclosure, the speech segment corresponding to the pronunciation word to be determined may be obtained in various ways, which is exemplified below.
In the first way, the correspondence between words and speech segments can be established by manual annotation, and the speech segment corresponding to the word can be retrieved from this correspondence when needed.
In the second way, the speech segment corresponding to the word is obtained by automatic recognition.
For example, historical speech data may be recognized and decoded to convert the audio into text, and the speech segment corresponding to the word can then be extracted from the historical speech data by forced-alignment segmentation, i.e., by determining the time position of the word within the historical speech data.
For example, for Chinese (an ideographic language), if a piece of historical speech data is labeled "洗衣服" (wash clothes), forced-alignment segmentation yields the time information shown in Table 1 below.
TABLE 1

| Word | Start time | End time | Duration |
| 洗 (wash) | 20 ms | 39 ms | 20 ms |
| 衣服 (clothes) | 40 ms | 79 ms | 40 ms |
If the word whose pronunciation is to be determined is "衣服" (clothes), the segment at 40-79 ms of the historical speech data can be taken as the speech segment corresponding to "衣服".
For example, for Uyghur (a phonographic language), if a piece of historical speech data is labeled "nurGun kitablar bar", forced-alignment segmentation yields the time information shown in Table 2 below.
TABLE 2

| Word | Start time | End time | Duration |
| nurGun | 20 ms | 59 ms | 40 ms |
| kitablar | 60 ms | 109 ms | 50 ms |
| bar | 110 ms | 139 ms | 30 ms |
If the word whose pronunciation is to be determined is "kitablar", the segment at 60-109 ms of the historical speech data can be taken as the speech segment corresponding to "kitablar".
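The segment extraction from the alignment times can be sketched as follows. This is an illustrative example rather than the patent's implementation; `extract_segment`, the PCM sample buffer, and the 16 kHz sample rate are all assumptions for the sketch.

```python
# Sketch: slice a word's speech segment out of an utterance using the
# start/end times (in milliseconds, end inclusive) from forced alignment.
def extract_segment(samples, sample_rate_hz, start_ms, end_ms):
    """Return the slice of `samples` covering [start_ms, end_ms]."""
    start = start_ms * sample_rate_hz // 1000
    end = (end_ms + 1) * sample_rate_hz // 1000  # end_ms is inclusive
    return samples[start:end]

# For "kitablar" aligned to 60-109 ms at 16 kHz, this selects samples
# 960..1759 of the utterance, i.e. 800 samples = 50 ms of audio.
segment = extract_segment(list(range(3000)), 16000, 60, 109)
print(len(segment))  # 800
```

The same call with 40-79 ms would extract the "衣服" segment from the Table 1 example.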
S102, constructing a pronunciation recognition network for the word, the network comprising correct pronunciation units and sound-change pronunciation units of the word.
In practice, sound changes in phonographic languages usually follow certain regular patterns, such as the devoicing, voicing, and deletion in Uyghur introduced above, whereas sound changes in ideographic languages are far less regular, as with the many local dialects derived from Mandarin.
It is understood that the granularity of the pronunciation units in the network can be set according to practical requirements: for example, the granularity may be the phoneme level or the syllable level. The present disclosure places no particular limitation on this, so long as the pronunciation units at the same level of the network share the same granularity.
1. The word whose pronunciation is to be determined is in a phonographic (pinyin-type) language
Correspondingly, the levels of the correct pronunciation units can be set in sequence according to the spelling order of the word; for each level where sound change is possible, the sound-change pronunciation units corresponding to the correct pronunciation unit are added in parallel at that level; and score values are set for each correct pronunciation unit and each sound-change pronunciation unit, forming the pronunciation recognition network.
As an example, the levels where sound change is possible can be determined from the regularity of the sound changes. For the Uyghur word "kitablar", the correct pronunciation is "k i t a b l a r", but in actual speech the phoneme "b" may be devoiced to "p" and the phoneme "r" may be dropped. If the granularity of the pronunciation units is the phoneme level, the levels of the units "b" and "r" can be marked as levels where sound change is possible. Accordingly, the levels of the correct units "k", "i", "t", "a", "b", "l", "a", and "r" are set in sequence; the sound-change unit "p" is then added as a parallel node at the same level as "b", and an empty arc is added at "r" to model the deletion, constructing the pronunciation recognition network shown in FIG. 2.
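The linear network above can be sketched with a simple level-list representation. This representation, the function name `build_pinyin_network`, and the use of `None` for the deletion arc are assumptions for illustration, not the patent's data structure.

```python
# Sketch: a linear pronunciation recognition network as a list of levels.
# Each level holds the correct unit plus any sound-change units in
# parallel; None models a deletion-type variant (an empty arc).
def build_pinyin_network(correct_units, variants):
    """variants maps a level index to extra units added in parallel."""
    network = []
    for i, unit in enumerate(correct_units):
        level = [unit] + variants.get(i, [])
        network.append(level)
    return network

# "kitablar": "b" (level 4) may devoice to "p"; "r" (level 7) may drop.
net = build_pinyin_network(
    list("kitablar"),
    {4: ["p"],    # devoicing variant of "b"
     7: [None]})  # empty arc modeling the dropped "r"
print(net)
```

Decoding then amounts to choosing one unit per level, which is why the FIG. 2 network yields a small, enumerable set of alternative paths.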
2. The word whose pronunciation is to be determined is in an ideographic language
Correspondingly, a pronunciation-unit level and an empty-node level can be arranged in sequence, the pronunciation-unit level comprising correct pronunciation units and sound-change pronunciation units arranged in parallel, and the empty-node level performing a jump back to the pronunciation-unit level before decoding of the speech segment finishes; score values are then set for each correct pronunciation unit, each sound-change pronunciation unit, and the level jump-back, forming the pronunciation recognition network.
As an example, to ensure that all possible variant pronunciations can be constructed, when the granularity of the pronunciation units is the phoneme level, the sound-change pronunciation units may be all the phonemes contained in the Chinese dictionary.
As an example, a sound-change pronunciation unit that models deletion can also be placed in the network according to practical requirements.
If the word whose pronunciation is to be determined is the Chinese word "洗" (wash), its correct pronunciation is "x i3", i.e., "x" and "i3" are the word's correct pronunciation units. When constructing the network, besides these correct units, all the phonemes contained in the Chinese dictionary, plus a deletion unit, can be used as sound-change pronunciation units and set as parallel nodes at the same level as the correct units, constructing the pronunciation recognition network shown in FIG. 3.
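The looped network for an ideographic word can be sketched as follows. The dictionary representation, the function name `build_ideographic_network`, the tiny five-phoneme set, and the penalty value are all assumptions for illustration; a real Chinese phoneme inventory is of course much larger.

```python
# Sketch: a looped pronunciation recognition network. One pronunciation-
# unit level contains every dictionary phoneme in parallel plus a
# deletion unit (None); an empty node jumps back to this level, with a
# penalty, until decoding of the speech segment finishes.
def build_ideographic_network(all_phonemes, jump_back_penalty):
    unit_level = list(all_phonemes) + [None]  # None = deletion-type unit
    return {"units": unit_level, "jump_penalty": jump_back_penalty}

# For "洗" (correct units "x", "i3"), no word-specific wiring is needed:
# the correct units are simply members of the full phoneme set.
net = build_ideographic_network(["x", "i3", "s", "b", "a2"],
                                jump_back_penalty=-0.5)
print(len(net["units"]))  # 6: five phonemes plus the deletion unit
```

Because every phoneme is available at every pass through the level, this single structure covers variant pronunciations of any word, at the cost of a much larger search space than the linear network of FIG. 2.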
As can be understood, for a given word, the speech segment can be obtained first and the pronunciation recognition network constructed afterwards, as shown in FIG. 1; the network can be constructed first and the segment obtained afterwards; or the two can be done simultaneously. The present disclosure places no particular limitation on the order.
S103, decoding the speech segment using the pronunciation recognition network to determine the pronunciation path corresponding to the segment, the path being formed from the correct pronunciation units and/or the sound-change pronunciation units.
Based on the two pronunciation recognition networks introduced above, the disclosed scheme provides the following two decoding schemes, which are explained separately below.
1. Decoding with the network constructed for a phonographic language
Specifically, the speech segment may be taken as input to the pronunciation recognition network, and each level is traversed in sequence to determine all alternative pronunciation paths corresponding to the segment; the alternative path with the highest score value is then determined as the pronunciation path corresponding to the segment.
Taking the network shown in FIG. 2 as an example: from the "start" node the decoder passes in sequence through the levels "k", "i", "t", and "a", where no sound change occurs; at the next level it may pass through either the correct unit "b" or the sound-change unit "p"; it then passes through the levels "l" and "a", where no sound change occurs; finally it passes through either the correct unit "r" or the empty arc modeling the deletion, reaching the "end" node. This yields 4 alternative pronunciation paths. The score of each alternative path can then be calculated, and the path with the highest score is determined as the pronunciation path of the speech segment; that is, the pronunciation of the word "kitablar" is the pronunciation represented by the highest-scoring path.
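The path count above can be checked by enumeration. This is a small illustrative sketch (the level-list form and the use of `None` for the dropped "r" are assumed representations, not the patent's):

```python
# Sketch: the alternative pronunciation paths of the FIG. 2 network are
# the Cartesian product of the per-level choices; None marks the empty
# arc where "r" is dropped and is filtered out of the resulting path.
from itertools import product

levels = [["k"], ["i"], ["t"], ["a"], ["b", "p"], ["l"], ["a"], ["r", None]]
paths = [[u for u in combo if u is not None] for combo in product(*levels)]
print(len(paths))  # 4 paths: b/p choice times r-kept/r-dropped choice
```

Only the two branching levels contribute alternatives, so 2 × 2 = 4 paths, matching the count in the text.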
Note that L in fig. 2 is the score value corresponding to each pronunciation unit, and can be understood as a penalty score for that unit. When decoding with the pronunciation recognition network, every pronunciation unit other than a drop-type sound change pronunciation unit also corresponds to a decoded acoustic score; as an example, the acoustic score may be the probability value assigned to the pronunciation unit when the voice segment is decoded.
Specifically, the score value of an alternative pronunciation path may be calculated as follows: obtain the final score of every pronunciation unit included in the alternative pronunciation path, and perform a mathematical operation on these final scores to obtain the score of the alternative pronunciation path. If a pronunciation unit is a drop-type sound change pronunciation unit, its final score is the score value corresponding to that sound change pronunciation unit; otherwise, its final score is obtained by performing a mathematical operation on the unit's decoded acoustic score and its corresponding score value. The mathematical operation in the present disclosure may be addition, multiplication, etc.; this is not specifically limited.
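A minimal sketch of this score calculation, assuming addition as the mathematical operation (the disclosure also permits multiplication); the score dictionaries are illustrative placeholders, not values from the disclosure:

```python
def path_score(units, acoustic_scores, penalty_scores, dropped=()):
    """Sum the final scores of all pronunciation units on an alternative
    pronunciation path. A drop-type sound-change unit (listed in `dropped`)
    contributes only its penalty score; every other unit contributes its
    decoded acoustic score plus its penalty score."""
    total = 0.0
    for unit in units:
        if unit in dropped:
            total += penalty_scores[unit]  # no acoustic score exists for a drop
        else:
            total += acoustic_scores[unit] + penalty_scores[unit]
    return total

# Example: a two-unit fragment where the final "r" dropped (penalty -10,
# matching the later example; the acoustic score is a made-up log-probability).
score = path_score(["b", "r"], {"b": -1.5}, {"b": 0.0, "r": -10.0},
                   dropped=("r",))
# → -11.5
```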
2. Pronunciation recognition network constructed for ideographic language
Specifically, the voice segment may be used as the input of the pronunciation recognition network, reaching the empty node level after passing through the pronunciation unit level; whether decoding of the voice segment is finished is then judged, and if not, decoding jumps back to the pronunciation unit level, repeating until decoding is finished and an alternative pronunciation path is obtained; after all alternative pronunciation paths corresponding to the voice segment have been determined, the one with the highest score value is determined as the pronunciation path corresponding to the voice segment.
Taking the network shown in fig. 3 as an example: after the "start" node, decoding passes through the "x" node and reaches the empty node "null"; since decoding of the voice segment is not yet finished, it jumps back from the empty node level to the pronunciation unit level, passes through the "i3" node, and reaches "null" again; at this point decoding of the voice segment is finished, so it can proceed to the "end" node, yielding an alternative pronunciation path. In this example the alternative pronunciation path includes 2 pronunciation units and 1 level jump-back.
According to the above process, all alternative pronunciation paths corresponding to the voice segments can be obtained, that is, all possible pronunciations of the pronunciation word "wash" to be determined. The score of each alternative pronunciation path can then be calculated, and the path with the highest score is determined as the pronunciation path corresponding to the voice segment; that is, the pronunciation of the pronunciation word "wash" to be determined is the pronunciation represented by the highest-scoring path.
Note that L in fig. 3 is the score value corresponding to each pronunciation unit, and can be understood as a penalty score for that unit. When decoding with the pronunciation recognition network, every pronunciation unit other than a drop-type sound change pronunciation unit also corresponds to a decoded acoustic score; as an example, the acoustic score may be the probability value assigned to the pronunciation unit when the voice segment is decoded. In addition, each level jump-back performed during decoding also corresponds to a score value, which likewise enters the calculation of the score of the alternative pronunciation path.
Specifically, the score value of an alternative pronunciation path may be calculated as follows: obtain the final score of every pronunciation unit included in the alternative pronunciation path, and perform a mathematical operation on these final scores together with the score values corresponding to all level jump-backs in the path to obtain the score of the alternative pronunciation path. If a pronunciation unit is a drop-type sound change pronunciation unit, its final score is the score value corresponding to that sound change pronunciation unit; otherwise, its final score is obtained by performing a mathematical operation on the unit's decoded acoustic score and its corresponding score value. The mathematical operation in the present disclosure may be addition, multiplication, etc.; this is not specifically limited.
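For the ideographic-language network, the same calculation can be sketched with the jump-back penalties included; addition is again assumed as the mathematical operation, and all concrete numbers are illustrative:

```python
def path_score_with_jumps(units, acoustic_scores, penalty_scores,
                          num_jump_backs, jump_back_penalty, dropped=()):
    """Score an alternative pronunciation path in the ideographic-language
    network: each unit's final score (computed as in the alphabetic case)
    plus the score value of every level jump-back taken during decoding."""
    total = num_jump_backs * jump_back_penalty
    for unit in units:
        if unit in dropped:
            total += penalty_scores[unit]  # drop-type unit: penalty only
        else:
            total += acoustic_scores[unit] + penalty_scores[unit]
    return total

# The fig. 3 example path "x i3" uses 2 units and 1 level jump-back.
score = path_score_with_jumps(["x", "i3"], {"x": -2.0, "i3": -1.0},
                              {"x": 0.0, "i3": 0.0},
                              num_jump_backs=1, jump_back_penalty=-10.0)
# → -13.0
```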
As an example, when setting the penalty scores, the penalty score of the pronunciation units corresponding to the original pronunciation of the pronunciation word to be determined may be set to 0, and the penalty scores of the other pronunciation units, level jump-backs, etc. may be set to -10. This is mainly because the sound changes of a word are usually limited to a few cases and are not too drastic, so setting penalty scores keeps the decoding result from becoming too confused. As an example, for the pronunciation word "wash" to be determined: if the user reads it as "xi", the pronunciation units corresponding to the original pronunciation are "x" and "i3"; if the user reads it as "si", the pronunciation units corresponding to the original pronunciation are "s" and "i3".
As one example, the penalty scores may be determined empirically. For example, a number of pronunciation words to be determined in standard Mandarin are selected, networks are constructed and decoded according to the disclosed scheme, and when the accuracy of the pronunciation paths decoded for all of these words is higher than a preset threshold, the penalty scores then in effect for the pronunciation units, level jump-backs, etc. are adopted. As an example, the preset threshold may be 95%. In the present disclosure, accuracy may be understood as the proportion of cases in which the pronunciation represented by the decoded pronunciation path is the same as the user's original pronunciation.
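The empirical procedure above can be sketched as a simple search over candidate penalty values. Here `decode_fn` is a hypothetical callable standing in for the disclosure's construct-and-decode step, `dev_words` pairs each word with its known correct pronunciation, and the 95% threshold matches the example:

```python
def tune_penalty(candidates, decode_fn, dev_words, threshold=0.95):
    """Return the first candidate penalty value for which decoding the
    development words reaches an accuracy above `threshold`, or None if
    no candidate qualifies. `decode_fn(penalty, word)` is a hypothetical
    stand-in for building the network with that penalty and decoding the
    word's voice segments into a pronunciation string."""
    for penalty in candidates:
        correct = sum(decode_fn(penalty, word) == pron
                      for word, pron in dev_words)
        if correct / len(dev_words) > threshold:
            return penalty
    return None
```

In practice the search would cover the unit penalties and the jump-back penalty jointly; a single scalar is used here only to keep the sketch short.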
And S104, calculating the confidence coefficient of the pronunciation represented by the pronunciation path, and generating a pronunciation dictionary of the pronunciation words to be determined by using the pronunciation represented by the pronunciation path with the confidence coefficient higher than a preset value.
The disclosed scheme can screen out the pronunciations corresponding to the pronunciation word to be determined based on the confidences and the preset value, so that the determined pronunciations can be added to, or replace entries in, the original pronunciation dictionary to form a new pronunciation dictionary. The preset value may be set according to actual application requirements and is not limited here.
As an example, if M pronunciation paths are obtained, M ≥ 1, the confidence may be calculated as follows: obtain the score S_j of the j-th pronunciation path and the sum of the scores of the M pronunciation paths, S = S_1 + … + S_j + … + S_M; then determine the ratio of S_j to S as the confidence of the pronunciation represented by the j-th pronunciation path. This may be embodied as the following equation (Equation 1):

s(i, j) = S_j / (S_1 + … + S_j + … + S_M)

where s(i, j) denotes the confidence of the pronunciation represented by the j-th pronunciation path of the pronunciation word i to be determined, S_j denotes the score of the j-th pronunciation path of the pronunciation word i to be determined, and the denominator is the sum S of the scores of the M pronunciation paths.
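Equation 1 can be sketched directly. Note that the ratio only behaves as a confidence when the per-path scores are non-negative; in the sketch below, S_j is taken as the number of voice segments decoded to path j in the 1000-segment example (an assumption, since the disclosure leaves the score domain open):

```python
def confidences(path_scores):
    """Equation 1: the confidence of each pronunciation path is its score
    S_j divided by the sum S = S_1 + ... + S_M over all M distinct paths.
    `path_scores` maps each distinct path to its (non-negative) score."""
    total = sum(path_scores.values())
    return {path: score / total for path, score in path_scores.items()}

# 1000 voice segments split over three distinct pronunciation paths:
conf = confidences({"x i3": 380, "s i3": 520, "q i3": 100})
# → {"x i3": 0.38, "s i3": 0.52, "q i3": 0.1}
```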
Taking the word "wash" in the Hefei dialect of Chinese as an example, and assuming that 1000 voice segments are obtained in total, 1000 pronunciation paths and the score value of each path can be obtained according to the above description. It can be understood that among the 1000 pronunciation paths some may be identical, or all may differ; that is, the number of distinct pronunciation paths M is at most 1000, and the confidence of the pronunciation represented by each path can be calculated using Equation 1.
As an example, the preset value may be set to 0.5. Suppose the 1000 voice segments corresponding to the pronunciation word "wash" to be determined yield 3 distinct pronunciation paths with the following confidences: the pronunciation "x i3" has confidence 0.38, the pronunciation "s i3" has confidence 0.52, and the pronunciation "q i3" has confidence 0.1. Then "s i3" can be used as the pronunciation of the Hefei-dialect pronunciation word "wash" to be determined, and the pronunciation dictionary is generated accordingly.
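The screening of step S104 for this example then reduces to a threshold filter (values copied from the example above):

```python
# Confidences of the three distinct pronunciation paths for "wash".
conf = {"x i3": 0.38, "s i3": 0.52, "q i3": 0.10}
preset = 0.5  # the preset value of the example

# Keep only pronunciations whose confidence exceeds the preset value.
entry = [path for path, c in conf.items() if c > preset]
# → ["s i3"]
```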
As an example, the preset value may be set to 0.3. Suppose the 1000 voice segments corresponding to the pronunciation word "kitablar" to be determined yield 4 distinct pronunciation paths with the following confidences: "k i t a b l a r" has confidence 0.2, "k i t a p l a r" has confidence 0.31, "k i t a b l a" has confidence 0.35, and "k i t a p l a" has confidence 0.14. Then both "k i t a p l a r" and "k i t a b l a" can be used as pronunciations of the Uyghur pronunciation word "kitablar" to be determined, and the pronunciation dictionary is generated accordingly.
In summary, when a pronunciation dictionary for a Pinyin-type language is generated with the disclosed scheme, the deviation caused in the prior art by ignoring actual sound changes is avoided; when a pronunciation dictionary for an ideographic language is generated, the prior-art difficulty of manually annotating all pronunciations of a word is avoided. In other words, the pronunciation dictionary generated by the disclosed scheme better matches users' actual pronunciations and is more accurate; performing speech recognition based on this dictionary helps improve the quality of the acoustic model and, in turn, the overall recognition performance of the speech recognition model.
Referring to fig. 4, a schematic diagram of the pronunciation dictionary generating device of the present disclosure is shown. The apparatus may include:
a voice segment acquiring module 201, configured to acquire a voice segment corresponding to a pronunciation word to be determined;
a pronunciation identification network construction module 202, configured to construct a pronunciation identification network for the pronunciation word to be determined, where the pronunciation identification network includes a correct pronunciation unit and a sound change pronunciation unit of the pronunciation word to be determined;
a pronunciation path determining module 203, configured to decode the voice segment by using the pronunciation recognition network, and determine a pronunciation path corresponding to the voice segment, where the pronunciation path is formed by the correct pronunciation unit and/or the sound change pronunciation unit;
a confidence calculation module 204, configured to calculate a confidence of the pronunciation represented by the pronunciation path;
and the pronunciation dictionary generating module 205 is configured to generate a pronunciation dictionary of the pronunciation word to be determined by using the pronunciation represented by the pronunciation path with the confidence higher than the preset value.
Optionally, the pronunciation words to be determined are Pinyin language,
the pronunciation identification network construction module is used for adding, in parallel at each level where sound change may occur, the sound change pronunciation unit corresponding to the correct pronunciation unit; and setting the scoring values corresponding to each correct pronunciation unit and each sound change pronunciation unit to form the pronunciation identification network.
Optionally, the pronunciation path determination module includes:
the alternative pronunciation path determining module is used for taking the voice segments as the input of the pronunciation recognition network, sequentially traversing each level and determining all alternative pronunciation paths corresponding to the voice segments;
and the pronunciation path determining submodule is used for determining the pronunciation path with the highest score value in the alternative pronunciation paths as the pronunciation path corresponding to the voice segment.
Optionally, the pronunciation path determination module further comprises:
a score value obtaining module, configured to obtain final scores of all pronunciation units included in the alternative pronunciation path: if the pronunciation unit is a falling-off type sound change pronunciation unit, the final score of the pronunciation unit is the corresponding score of the sound change pronunciation unit; otherwise, the final score of the pronunciation unit is obtained by performing mathematical operation on the decoded acoustic score of the pronunciation unit and the score value corresponding to the pronunciation unit; and performing mathematical operation by using the final scores of all the pronunciation units to obtain the score of the alternative pronunciation path.
Optionally, the pronunciation words to be determined are ideographic languages,
the pronunciation identification network construction module is used for sequentially setting a pronunciation unit level and a hollow node level, the pronunciation unit level comprises a correct pronunciation unit and a sound variation pronunciation unit which are arranged in parallel, and the hollow node level is used for carrying out level jump back before the decoding of the voice segment is finished; and setting the scoring values corresponding to each correct pronunciation unit, each sound change pronunciation unit and the level jump to form the pronunciation identification network.
Optionally, the pronunciation path determination module includes:
the alternative pronunciation path determining module is used for taking the voice segment as the input of the pronunciation recognition network and reaching the empty node level after passing through the pronunciation unit level; judging whether the decoding of the voice segment is finished, if the decoding of the voice segment is not finished, jumping back to the pronunciation unit level until the decoding of the voice segment is finished to obtain an alternative pronunciation path;
and the pronunciation path determining submodule is used for determining all alternative pronunciation paths corresponding to the voice segment, and then determining the pronunciation path corresponding to the voice segment with the highest score value in the alternative pronunciation paths.
Optionally, the pronunciation path determination module further comprises:
a score value obtaining module, configured to obtain final scores of all pronunciation units included in the alternative pronunciation path: if the pronunciation unit is a falling-off type sound change pronunciation unit, the final score of the pronunciation unit is the corresponding score of the sound change pronunciation unit; otherwise, the final score of the pronunciation unit is obtained by performing mathematical operation on the decoded acoustic score of the pronunciation unit and the score value corresponding to the pronunciation unit; and performing mathematical operation by using the final score of each pronunciation unit and the score values corresponding to the rebound of all the levels in the alternative pronunciation path to obtain the score values of the alternative pronunciation path.
Optionally, if M pronunciation paths are obtained, M ≥ 1,
the confidence calculation module is used for obtaining the score S_j of the j-th pronunciation path and the sum S = S_1 + … + S_j + … + S_M of the scores of the M pronunciation paths; and determining the ratio of S_j to S as the confidence of the pronunciation represented by the j-th pronunciation path.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Referring to fig. 5, a schematic structural diagram of an electronic device 300 for generating a pronunciation dictionary according to the present disclosure is shown. Referring to fig. 5, the electronic device 300 includes a processing component 301 that further includes one or more processors, and storage device resources, represented by storage medium 302, for storing instructions, such as application programs, that are executable by the processing component 301. The application programs stored in the storage medium 302 may include one or more modules that each correspond to a set of instructions. Further, the processing component 301 is configured to execute instructions to perform the pronunciation dictionary generation method described above.
The preferred embodiments of the present disclosure are described in detail with reference to the accompanying drawings, however, the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solution of the present disclosure within the technical idea of the present disclosure, and these simple modifications all belong to the protection scope of the present disclosure.
It should be noted that, in the foregoing embodiments, various features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various combinations that are possible in the present disclosure are not described again.
In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure, as long as it does not depart from the spirit of the present disclosure.
Claims (18)
1. A pronunciation dictionary generating method, the method comprising:
acquiring a voice segment corresponding to a pronunciation word to be determined, and constructing a pronunciation recognition network aiming at the pronunciation word to be determined, wherein the pronunciation recognition network comprises a correct pronunciation unit and a sound variation pronunciation unit of the pronunciation word to be determined; when the pronunciation words to be determined are ideographic languages, the pronunciation recognition network comprises: the pronunciation unit level and the empty node level are sequentially arranged, and the empty node level is used for carrying out level jump back before the decoding of the voice fragment is finished;
decoding the voice segment by utilizing the pronunciation recognition network to determine a pronunciation path corresponding to the voice segment, wherein the pronunciation path is formed by the correct pronunciation unit and/or the sound change pronunciation unit;
and calculating the confidence coefficient of the pronunciation represented by the pronunciation path, and generating a pronunciation dictionary of the pronunciation words to be determined by using the pronunciation represented by the pronunciation path with the confidence coefficient higher than a preset value.
2. The method of claim 1, wherein the pronunciation word to be determined is a Pinyin language, and wherein constructing a pronunciation recognition network for the pronunciation word to be determined comprises:
sequentially setting the levels of all correct pronunciation units according to the spelling sequence of the pronunciation words to be determined;
for the level with the possibility of sound change, adding the sound change pronunciation units corresponding to the correct pronunciation units in parallel at the level;
and setting the scoring values corresponding to the correct pronunciation units and the sound change pronunciation units to form the pronunciation identification network.
3. The method according to claim 2, wherein the decoding the speech segment by using the pronunciation recognition network to determine a pronunciation path corresponding to the speech segment comprises:
taking the voice segments as the input of the pronunciation recognition network, sequentially traversing each level, and determining all alternative pronunciation paths corresponding to the voice segments;
and determining the pronunciation path with the highest score value in the alternative pronunciation paths as the pronunciation path corresponding to the voice segment.
4. The method according to claim 3, wherein the score values of the alternative pronunciation paths are calculated by:
obtaining the final scores of all pronunciation units included in the alternative pronunciation path: if the pronunciation unit is a falling-off type sound change pronunciation unit, the final score of the pronunciation unit is the corresponding score of the sound change pronunciation unit; otherwise, the final score of the pronunciation unit is obtained by performing mathematical operation on the decoded acoustic score of the pronunciation unit and the score value corresponding to the pronunciation unit;
and performing mathematical operation by using the final scores of all the pronunciation units to obtain the score of the alternative pronunciation path.
5. The method according to claim 1, wherein the pronunciation words to be determined are ideographic languages, and the constructing a pronunciation recognition network for the pronunciation words to be determined specifically comprises:
the pronunciation unit hierarchy comprises a correct pronunciation unit and a sound change pronunciation unit which are arranged in parallel;
and setting the scoring values corresponding to each correct pronunciation unit, each sound change pronunciation unit and the level jump to form the pronunciation identification network.
6. The method according to claim 5, wherein the decoding the speech segment by using the pronunciation recognition network to determine the pronunciation path corresponding to the speech segment comprises:
taking the voice segments as the input of the pronunciation recognition network, and reaching the empty node level after passing through the pronunciation unit level;
judging whether the decoding of the voice segment is finished, if the decoding of the voice segment is not finished, jumping back to the pronunciation unit level until the decoding of the voice segment is finished to obtain an alternative pronunciation path;
and after all the alternative pronunciation paths corresponding to the voice segments are determined, determining the pronunciation path corresponding to the voice segments with the highest score value in the alternative pronunciation paths.
7. The method according to claim 6, wherein the score values of the alternative pronunciation paths are calculated by:
obtaining the final scores of all pronunciation units included in the alternative pronunciation path: if the pronunciation unit is a falling-off type sound change pronunciation unit, the final score of the pronunciation unit is the corresponding score of the sound change pronunciation unit; otherwise, the final score of the pronunciation unit is obtained by performing mathematical operation on the decoded acoustic score of the pronunciation unit and the score value corresponding to the pronunciation unit;
and performing mathematical operation by using the final score of each pronunciation unit and the score values corresponding to the rebound of all the levels in the alternative pronunciation path to obtain the score values of the alternative pronunciation path.
8. The method according to any one of claims 1 to 7, wherein if M pronunciation paths are obtained, M ≧ 1, the confidence of the pronunciation represented by each pronunciation path is obtained as follows:
obtaining the score S_j of the j-th pronunciation path and the sum S of the scores of the M pronunciation paths, where S = S_1 + … + S_j + … + S_M;
and determining the ratio of S_j to S as the confidence of the pronunciation represented by the j-th pronunciation path.
9. A pronunciation dictionary generating apparatus, comprising:
the voice segment acquisition module is used for acquiring a voice segment corresponding to the pronunciation word to be determined;
the pronunciation recognition network construction module is used for constructing a pronunciation recognition network aiming at the pronunciation words to be determined, and the pronunciation recognition network comprises a correct pronunciation unit and a sound change pronunciation unit of the pronunciation words to be determined; when the pronunciation words to be determined are ideographic languages, the pronunciation recognition network comprises: the pronunciation unit level and the empty node level are sequentially arranged, and the empty node level is used for carrying out level jump back before the decoding of the voice fragment is finished;
a pronunciation path determining module, configured to decode the voice segment by using the pronunciation recognition network, and determine a pronunciation path corresponding to the voice segment, where the pronunciation path is formed by the correct pronunciation unit and/or the sound change pronunciation unit;
the confidence coefficient calculation module is used for calculating the confidence coefficient of the pronunciation represented by the pronunciation path;
and the pronunciation dictionary generating module is used for generating a pronunciation dictionary of the pronunciation words to be determined by utilizing the pronunciation represented by the pronunciation path with the confidence coefficient higher than the preset value.
10. The apparatus of claim 9, wherein the pronunciation words to be determined are Pinyin language,
the pronunciation identification network construction module is used for adding, in parallel at each level where sound change may occur, the sound change pronunciation unit corresponding to the correct pronunciation unit; and setting the scoring values corresponding to each correct pronunciation unit and each sound change pronunciation unit to form the pronunciation identification network.
11. The apparatus of claim 10, wherein the pronunciation path determination module comprises:
the alternative pronunciation path determining module is used for taking the voice segments as the input of the pronunciation recognition network, sequentially traversing each level and determining all alternative pronunciation paths corresponding to the voice segments;
and the pronunciation path determining submodule is used for determining the pronunciation path with the highest score value in the alternative pronunciation paths as the pronunciation path corresponding to the voice segment.
12. The apparatus of claim 11, wherein the pronunciation path determination module further comprises:
a score value obtaining module, configured to obtain final scores of all pronunciation units included in the alternative pronunciation path: if the pronunciation unit is a falling-off type sound change pronunciation unit, the final score of the pronunciation unit is the corresponding score of the sound change pronunciation unit; otherwise, the final score of the pronunciation unit is obtained by performing mathematical operation on the decoded acoustic score of the pronunciation unit and the score value corresponding to the pronunciation unit; and performing mathematical operation by using the final scores of all the pronunciation units to obtain the score of the alternative pronunciation path.
13. The apparatus of claim 9, wherein the pronunciation words to be determined are ideographic languages,
the pronunciation identification network construction module is specifically used for arranging a correct pronunciation unit and a sound change pronunciation unit in parallel at the pronunciation unit level; and setting the scoring values corresponding to each correct pronunciation unit, each sound change pronunciation unit and the level jump to form the pronunciation identification network.
14. The apparatus of claim 13, wherein the pronunciation path determination module comprises:
the alternative pronunciation path determining module is used for taking the voice segment as the input of the pronunciation recognition network and reaching the empty node level after passing through the pronunciation unit level; judging whether the decoding of the voice segment is finished, if the decoding of the voice segment is not finished, jumping back to the pronunciation unit level until the decoding of the voice segment is finished to obtain an alternative pronunciation path;
and the pronunciation path determining submodule is used for determining all alternative pronunciation paths corresponding to the voice segment, and then determining the pronunciation path corresponding to the voice segment with the highest score value in the alternative pronunciation paths.
15. The apparatus of claim 14, wherein the pronunciation path determination module further comprises:
a score value obtaining module, configured to obtain final scores of all pronunciation units included in the alternative pronunciation path: if the pronunciation unit is a falling-off type sound change pronunciation unit, the final score of the pronunciation unit is the corresponding score of the sound change pronunciation unit; otherwise, the final score of the pronunciation unit is obtained by performing mathematical operation on the decoded acoustic score of the pronunciation unit and the score value corresponding to the pronunciation unit; and performing mathematical operation by using the final score of each pronunciation unit and the score values corresponding to the rebound of all the levels in the alternative pronunciation path to obtain the score values of the alternative pronunciation path.
16. The apparatus according to any one of claims 9 to 15, wherein if M pronunciation paths are obtained, M ≧ 1,
the confidence coefficient calculation module is used for obtaining the score S_j of the j-th pronunciation path and the sum S = S_1 + … + S_j + … + S_M of the scores of the M pronunciation paths; and determining the ratio of S_j to S as the confidence of the pronunciation represented by the j-th pronunciation path.
17. A storage medium having stored thereon a plurality of instructions, wherein the instructions are loadable by a processor and adapted to cause execution of the steps of the method according to any of claims 1 to 8.
18. An electronic device, characterized in that the electronic device comprises:
the storage medium of claim 17; and
a processor to execute the instructions in the storage medium.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710805626.3A CN107767858B (en) | 2017-09-08 | 2017-09-08 | Pronunciation dictionary generating method and device, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107767858A CN107767858A (en) | 2018-03-06 |
CN107767858B true CN107767858B (en) | 2021-05-04 |
Family
ID=61265107
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710805626.3A Active CN107767858B (en) | 2017-09-08 | 2017-09-08 | Pronunciation dictionary generating method and device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107767858B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110827803A (en) * | 2019-11-11 | 2020-02-21 | 广州国音智能科技有限公司 | Method, device and equipment for constructing dialect pronunciation dictionary and readable storage medium |
CN111369974B (en) * | 2020-03-11 | 2024-01-19 | 北京声智科技有限公司 | Dialect pronunciation marking method, language identification method and related device |
CN111681635A (en) * | 2020-05-12 | 2020-09-18 | 深圳市镜象科技有限公司 | Method, apparatus, device and medium for real-time cloning of voice based on small sample |
CN111798834B (en) * | 2020-07-03 | 2022-03-15 | 北京字节跳动网络技术有限公司 | Method and device for identifying polyphone, readable medium and electronic equipment |
CN113506559B (en) * | 2021-07-21 | 2023-06-09 | 成都启英泰伦科技有限公司 | Method for generating pronunciation dictionary according to Vietnam written text |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102063900A (en) * | 2010-11-26 | 2011-05-18 | 北京交通大学 | Speech recognition method and system for overcoming confusing pronunciation |
CN105893414A (en) * | 2015-11-26 | 2016-08-24 | 乐视致新电子科技(天津)有限公司 | Method and apparatus for screening valid term of a pronunciation lexicon |
CN105957518A (en) * | 2016-06-16 | 2016-09-21 | 内蒙古大学 | Mongolian large vocabulary continuous speech recognition method |
CN106935239A (en) * | 2015-12-29 | 2017-07-07 | 阿里巴巴集团控股有限公司 | The construction method and device of a kind of pronunciation dictionary |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3481497B2 (en) * | 1998-04-29 | 2003-12-22 | 松下電器産業株式会社 | Method and apparatus using a decision tree to generate and evaluate multiple pronunciations for spelled words |
US7353174B2 (en) * | 2003-03-31 | 2008-04-01 | Sony Corporation | System and method for effectively implementing a Mandarin Chinese speech recognition dictionary |
US7428491B2 (en) * | 2004-12-10 | 2008-09-23 | Microsoft Corporation | Method and system for obtaining personal aliases through voice recognition |
CN100411011C (en) * | 2005-11-18 | 2008-08-13 | 清华大学 | Pronunciation quality evaluating method for language learning machine |
CN101447184B (en) * | 2007-11-28 | 2011-07-27 | 中国科学院声学研究所 | Chinese-English bilingual speech recognition method based on phoneme confusion |
CN101763855B (en) * | 2009-11-20 | 2012-01-04 | 安徽科大讯飞信息科技股份有限公司 | Method and device for judging confidence of speech recognition |
CN101840699B (en) * | 2010-04-30 | 2012-08-15 | 中国科学院声学研究所 | Voice quality evaluation method based on pronunciation model |
CN103164403B (en) * | 2011-12-08 | 2016-03-16 | 深圳市北科瑞声科技有限公司 | The generation method and system of video index data |
JP6413220B2 (en) * | 2013-10-15 | 2018-10-31 | ヤマハ株式会社 | Composite information management device |
CN103578464B (en) * | 2013-10-18 | 2017-01-11 | 威盛电子股份有限公司 | Language model establishing method, speech recognition method and electronic device |
CN106155341B (en) * | 2015-03-25 | 2020-05-26 | 李佳俊 | Computer character input method for Chinese characters based on Chinese and Zhuang language writing system |
CN105513589B (en) * | 2015-12-18 | 2020-04-28 | 百度在线网络技术(北京)有限公司 | Speech recognition method and device |
CN106653007B (en) * | 2016-12-05 | 2019-07-16 | 苏州奇梦者网络科技有限公司 | A kind of speech recognition system |
2017-09-08: CN201710805626.3A — patent CN107767858B/en, status Active
Non-Patent Citations (1)
Title |
---|
Mandarin Chinese Speech Recognition for Ethnic-Language Accents Based on Pronunciation Dictionary Adaptation; Chen Jiang; China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology; 2011-05-15 (No. 5); pp. 1-4, 8-12, 16-18 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107767858B (en) | Pronunciation dictionary generating method and device, storage medium and electronic equipment | |
CN110648658B (en) | Method and device for generating voice recognition model and electronic equipment | |
JP6675463B2 (en) | Bidirectional stochastic rewriting and selection of natural language | |
CN111667816B (en) | Model training method, speech synthesis method, device, equipment and storage medium | |
CN105632499B (en) | Method and apparatus for optimizing speech recognition results | |
US8126714B2 (en) | Voice search device | |
CN103065630B (en) | User personalized information voice recognition method and user personalized information voice recognition system | |
Irie et al. | On the choice of modeling unit for sequence-to-sequence speech recognition | |
CN109858038B (en) | Text punctuation determination method and device | |
Zenkel et al. | Subword and crossword units for CTC acoustic models | |
US9984689B1 (en) | Apparatus and method for correcting pronunciation by contextual recognition | |
CN111369974B (en) | Dialect pronunciation marking method, language identification method and related device | |
Li et al. | Language modeling with functional head constraint for code switching speech recognition | |
EP3915104A1 (en) | Word lattice augmentation for automatic speech recognition | |
Waters et al. | Leveraging language id in multilingual end-to-end speech recognition | |
CN102439660A (en) | Voice-tag method and apparatus based on confidence score | |
CN113808571B (en) | Speech synthesis method, speech synthesis device, electronic device and storage medium | |
CN112580340A (en) | Word-by-word lyric generating method and device, storage medium and electronic equipment | |
JP2016001242A (en) | Question sentence creation method, device, and program | |
Song et al. | Zeroprompt: Streaming acoustic encoders are zero-shot masked lms | |
JP6485941B2 (en) | LANGUAGE MODEL GENERATION DEVICE, ITS PROGRAM, AND VOICE RECOGNIZING DEVICE | |
CN105632500B (en) | Speech recognition apparatus and control method thereof | |
CN114519358A (en) | Translation quality evaluation method and device, electronic equipment and storage medium | |
JP6276516B2 (en) | Dictionary creation apparatus and dictionary creation program | |
CN113378553A (en) | Text processing method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||