CN107767858B - Pronunciation dictionary generating method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN107767858B
CN107767858B (application CN201710805626.3A)
Authority
CN
China
Prior art keywords
pronunciation
unit
path
score
alternative
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710805626.3A
Other languages
Chinese (zh)
Other versions
CN107767858A (en)
Inventor
Fang Xin (方昕)
Liu Junhua (刘俊华)
Wei Si (魏思)
Hu Guoping (胡国平)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201710805626.3A priority Critical patent/CN107767858B/en
Publication of CN107767858A publication Critical patent/CN107767858A/en
Application granted granted Critical
Publication of CN107767858B publication Critical patent/CN107767858B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/06 Elementary speech units used in speech synthesisers; Concatenation rules
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The disclosure provides a pronunciation dictionary generation method and apparatus, a storage medium, and an electronic device. The method comprises the following steps: acquiring a speech segment corresponding to a word whose pronunciation is to be determined, and constructing a pronunciation recognition network for that word, the network containing the word's correct pronunciation units and its sound-change (variant) pronunciation units; decoding the speech segment with the pronunciation recognition network to determine the pronunciation path corresponding to the segment, the path being formed from the correct pronunciation units and/or the variant pronunciation units; and computing the confidence of the pronunciation represented by each pronunciation path, then generating a pronunciation dictionary for the word from the pronunciations whose path confidence exceeds a preset value. With this scheme, the generated pronunciation dictionary better matches users' actual pronunciations and is more accurate.

Description

Pronunciation dictionary generating method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of speech recognition, and in particular, to a method and an apparatus for generating a pronunciation dictionary, a storage medium, and an electronic device.
Background
With the continuous development of speech recognition technology, speech recognition is widely applied in many fields, such as speech input methods, meeting transcription, and film and television subtitle generation.
One important resource in speech recognition technology is the pronunciation dictionary, which typically maps words to phoneme strings. For example, the Chinese dictionary entry "拔 b a2" indicates that the word "拔" ("pluck") is pronounced "b a2", where "b" and "a2" are mono-phonemes, the generally stable pronunciation units modeled by the acoustic model; together they indicate how the word is pronounced. The accuracy of the pronunciation dictionary therefore directly determines the quality of the acoustic model, and in turn can affect the overall recognition accuracy of the speech recognition system.
Currently, pronunciation dictionaries are mostly generated by different schemes for different language types:
For languages in which pronunciation information can be read off the written form, i.e., alphabetic (pinyin-type) languages such as Uyghur or Korean, words are spelled out of phonemes, and it is generally not necessary to construct a pronunciation dictionary by hand. For example, the Latin-transcribed Uyghur entry "KoGun: k o G u n", where the part before the colon is the word and the part after it is the word's phoneme string.
For languages whose pronunciation cannot be obtained directly from the written characters, i.e., ideographic languages such as Chinese, one written form may carry multiple pronunciations and meanings, pronunciation cannot be derived from the characters themselves, and a pronunciation dictionary must be built manually. As in the example above, the phoneme string "b a2" for the word "拔" cannot be obtained from the character itself.
In practice, a great deal of sound-change (pronunciation-variation) phenomena exist in both alphabetic and ideographic languages. For example: devoicing in Uyghur, where the "b" in the word "kitablar" is read as "p"; voicing in Uyghur, where the "k" in the word "qelikikin" is read as "g"; elision in Uyghur, where the "r" in the word "kitablar" is dropped when read; and Chinese dialects, e.g., the Hefei dialect pronounces the "x" initial of "洗" (xi, "wash") as "s", reading it "si".
Because of these sound changes, a dictionary produced by the alphabetic-language generation scheme may deviate from actual pronunciation, while under the ideographic-language scheme it is difficult to manually annotate the variant pronunciations of all words.
Disclosure of Invention
The present disclosure provides a pronunciation dictionary generation method and apparatus, a storage medium, and an electronic device, so that the generated pronunciation dictionary better matches users' actual pronunciation and is more accurate.
To achieve the above object, the present disclosure provides a pronunciation dictionary generation method, the method including:
acquiring a speech segment corresponding to a word whose pronunciation is to be determined, and constructing a pronunciation recognition network for that word, the network including the word's correct pronunciation units and sound-change (variant) pronunciation units;
decoding the speech segment with the pronunciation recognition network to determine the pronunciation path corresponding to the segment, the path being formed from the correct pronunciation units and/or the variant pronunciation units;
and computing the confidence of the pronunciation represented by each pronunciation path, and generating a pronunciation dictionary for the word from the pronunciations whose paths have confidence above a preset value.
Optionally, when the word to be determined is in an alphabetic (pinyin-type) language, constructing the pronunciation recognition network for the word includes:
setting the levels of the correct pronunciation units in sequence, following the spelling order of the word;
at each level where a sound change is possible, adding the variant pronunciation units corresponding to that correct pronunciation unit in parallel;
and setting the score values corresponding to the correct pronunciation units and the variant pronunciation units, thereby forming the pronunciation recognition network.
Optionally, decoding the speech segment with the pronunciation recognition network to determine the pronunciation path corresponding to the segment includes:
taking the speech segment as the input of the pronunciation recognition network and traversing the levels in sequence to determine all alternative pronunciation paths corresponding to the segment;
and determining the highest-scoring alternative pronunciation path as the pronunciation path corresponding to the speech segment.
Optionally, the score value of an alternative pronunciation path is calculated as follows:
obtain the final score of each pronunciation unit on the alternative path: if the unit is a drop-type variant pronunciation unit, its final score is the score value assigned to that variant unit; otherwise, its final score is obtained by a mathematical operation combining the unit's decoded acoustic score with its assigned score value;
and combine the final scores of all the pronunciation units by a mathematical operation to obtain the score of the alternative path.
Optionally, when the word to be determined is in an ideographic language, constructing the pronunciation recognition network for the word includes:
arranging in sequence a pronunciation-unit level and an empty-node level, where the pronunciation-unit level contains correct pronunciation units and variant pronunciation units arranged in parallel, and the empty node is used to jump back to the pronunciation-unit level until decoding of the speech segment finishes;
and setting the score values corresponding to each correct pronunciation unit, each variant pronunciation unit, and the level jump-back, thereby forming the pronunciation recognition network.
Optionally, decoding the speech segment with the pronunciation recognition network to determine the pronunciation path corresponding to the segment includes:
taking the speech segment as the input of the pronunciation recognition network and reaching the empty-node level after passing through the pronunciation-unit level;
judging whether decoding of the speech segment has finished, and if not, jumping back to the pronunciation-unit level, repeating until decoding finishes and an alternative pronunciation path is obtained;
and after all alternative pronunciation paths corresponding to the speech segment have been determined, taking the highest-scoring one as the pronunciation path corresponding to the segment.
Optionally, the score value of an alternative pronunciation path is calculated as follows:
obtain the final score of each pronunciation unit on the alternative path: if the unit is a drop-type variant pronunciation unit, its final score is the score value assigned to that variant unit; otherwise, its final score is obtained by a mathematical operation combining the unit's decoded acoustic score with its assigned score value;
and combine the final scores of all the pronunciation units with the score values of all the level jump-backs on the alternative path by a mathematical operation to obtain the score of the alternative path.
Optionally, if M pronunciation paths are obtained (M ≥ 1), the confidence of the pronunciation represented by each path is obtained as follows:
obtain the score S_j of the j-th pronunciation path and the sum of the scores of the M paths, S = S_1 + … + S_j + … + S_M;
determine the ratio S_j / S as the confidence of the pronunciation represented by the j-th path.
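As a concrete illustration of this rule, the sketch below computes each path's confidence as its share of the summed path scores; the function name and example scores are our own, not from the patent.

```python
def path_confidences(scores):
    """scores: per-path score values S_1 .. S_M (M >= 1)."""
    total = sum(scores)                 # S = S_1 + ... + S_M
    return [s / total for s in scores]  # confidence of path j is S_j / S

# three candidate paths; the first carries 60% of the score mass
confs = path_confidences([6.0, 3.0, 1.0])
```

The confidences always sum to 1, so comparing a path's confidence against the preset value is equivalent to asking what fraction of the total score its pronunciation accounts for.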
The present disclosure provides a pronunciation dictionary generation apparatus, the apparatus including:
a speech-segment acquisition module for acquiring a speech segment corresponding to a word whose pronunciation is to be determined;
a pronunciation-recognition-network construction module for constructing a pronunciation recognition network for the word, the network including the word's correct pronunciation units and variant pronunciation units;
a pronunciation-path determination module for decoding the speech segment with the pronunciation recognition network and determining the pronunciation path corresponding to the segment, the path being formed from the correct pronunciation units and/or the variant pronunciation units;
a confidence calculation module for calculating the confidence of the pronunciation represented by each pronunciation path;
and a pronunciation-dictionary generation module for generating a pronunciation dictionary for the word from the pronunciations whose paths have confidence above a preset value.
Optionally, when the word to be determined is in an alphabetic (pinyin-type) language,
the network construction module sets the levels of the correct pronunciation units in sequence following the word's spelling order; at each level where a sound change is possible, it adds the corresponding variant pronunciation units in parallel; and it sets the score values corresponding to the correct and variant pronunciation units, thereby forming the pronunciation recognition network.
Optionally, the pronunciation-path determination module includes:
an alternative-path determination module for taking the speech segment as the input of the pronunciation recognition network, traversing the levels in sequence, and determining all alternative pronunciation paths corresponding to the segment;
and a pronunciation-path determination submodule for determining the highest-scoring alternative path as the pronunciation path corresponding to the speech segment.
Optionally, the pronunciation-path determination module further includes:
a score acquisition module for obtaining the final score of each pronunciation unit on an alternative path: if the unit is a drop-type variant pronunciation unit, its final score is the score value assigned to that variant unit; otherwise, its final score is obtained by a mathematical operation combining the unit's decoded acoustic score with its assigned score value. The module then combines the final scores of all units by a mathematical operation to obtain the score of the alternative path.
Optionally, when the word to be determined is in an ideographic language,
the network construction module arranges in sequence a pronunciation-unit level and an empty-node level, the pronunciation-unit level containing correct and variant pronunciation units in parallel and the empty node being used to jump back to the pronunciation-unit level until decoding of the speech segment finishes; it sets the score values corresponding to each correct pronunciation unit, each variant pronunciation unit, and the level jump-back, thereby forming the pronunciation recognition network.
Optionally, the pronunciation-path determination module includes:
an alternative-path determination module for taking the speech segment as the input of the pronunciation recognition network, reaching the empty-node level after the pronunciation-unit level, judging whether decoding of the segment has finished, and if not, jumping back to the pronunciation-unit level until decoding finishes and an alternative path is obtained;
and a pronunciation-path determination submodule for, after all alternative paths corresponding to the segment have been determined, taking the highest-scoring one as the pronunciation path corresponding to the segment.
Optionally, the pronunciation-path determination module further includes:
a score acquisition module for obtaining the final score of each pronunciation unit on an alternative path: if the unit is a drop-type variant pronunciation unit, its final score is the score value assigned to that variant unit; otherwise, its final score is obtained by a mathematical operation combining the unit's decoded acoustic score with its assigned score value. The module then combines the final scores of all units with the score values of all level jump-backs on the path by a mathematical operation to obtain the score of the alternative path.
Optionally, if M pronunciation paths are obtained (M ≥ 1),
the confidence calculation module obtains the score S_j of the j-th pronunciation path and the sum of the scores of the M paths, S = S_1 + … + S_j + … + S_M, and determines the ratio S_j / S as the confidence of the pronunciation represented by the j-th path.
The present disclosure provides a storage medium storing a plurality of instructions which, when loaded by a processor, perform the steps of the pronunciation dictionary generation method described above.
The present disclosure provides an electronic device, including:
the storage medium described above; and
a processor for executing the instructions in the storage medium.
In the above scheme, a pronunciation recognition network can be constructed for the word whose pronunciation is to be determined, the network containing the word's correct pronunciation units and variant pronunciation units. The network is then used to decode the speech segment corresponding to the word, yielding the pronunciation paths for the segment, and the word's pronunciation is selected according to the confidence of the pronunciation each path represents, from which the pronunciation dictionary is generated. With this scheme, the generated dictionary better matches users' actual pronunciation and is more accurate.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
FIG. 1 is a flow chart of a pronunciation dictionary generation method according to the present disclosure;
FIG. 2 is a schematic diagram of a pronunciation recognition network in accordance with aspects of the present disclosure;
FIG. 3 is a schematic diagram of another pronunciation recognition network in accordance with aspects of the present disclosure;
FIG. 4 is a schematic diagram of a pronunciation dictionary generating device according to the present disclosure;
fig. 5 is a schematic structural diagram of an electronic device for generating a pronunciation dictionary according to the present disclosure.
Detailed Description
The following describes specific embodiments of the present disclosure in detail with reference to the accompanying drawings. It should be understood that the detailed description and specific examples are intended only to illustrate and explain the disclosure, not to limit it.
Referring to fig. 1, a flow chart of the pronunciation dictionary generation method of the present disclosure is shown. The method may include the following steps:
s101, obtaining a voice segment corresponding to the pronunciation word to be determined.
It is understood that the words whose pronunciation is to be determined may come from different types of languages, such as alphabetic (pinyin-type) languages and ideographic languages; from different languages, such as Chinese and Uyghur; or from different dialects of the same language, such as the Hefei and Nanjing dialects of Chinese. The present disclosure does not specifically limit the form of the word to be determined.
In the present disclosure, the speech segment corresponding to the word to be determined may be obtained in various ways, exemplified below.
In the first way, the correspondence between words to be determined and speech segments can be established by manual labeling, and the speech segment for a given word retrieved from this correspondence when needed.
In the second way, the speech segment corresponding to the word is obtained by automatic recognition.
For example, historical speech data may be recognized and decoded to convert the audio into text; forced-alignment segmentation then locates the word within the historical speech data, i.e., determines the time span of the word to be determined inside the historical data, from which its speech segment is cut.
For example, for Chinese, an ideographic language, if a piece of historical speech data is labeled "洗衣" ("do laundry"), forced-alignment segmentation can yield the time information shown in Table 1 below.
TABLE 1
Word          Start time   End time   Duration
洗 (wash)     20 ms        39 ms      20 ms
衣 (clothes)  40 ms        79 ms      40 ms
If the word to be determined is "衣" (clothes), the segment spanning 40-79 ms of the historical speech data can be taken as the speech segment corresponding to "衣".
For example, for Uyghur, an alphabetic language, if a piece of historical speech data is labeled "nurGun kitablar bar", forced-alignment segmentation can yield the time information shown in Table 2 below.
TABLE 2
Word       Start time   End time   Duration
nurGun     20 ms        59 ms      40 ms
kitablar   60 ms        109 ms     50 ms
bar        110 ms       139 ms     30 ms
If the word to be determined is "kitablar", the segment spanning 60-109 ms of the historical speech data can be taken as the speech segment corresponding to "kitablar".
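The segment-cutting step above can be sketched as follows. This is an illustrative helper using the forced-alignment times from Tables 1 and 2; the function name, the plain-list audio representation, and the 16 kHz sample rate are our own assumptions, not from the patent.

```python
def cut_segment(samples, start_ms, end_ms, sample_rate=16000):
    """Return the samples covering [start_ms, end_ms] (end inclusive)."""
    start = start_ms * sample_rate // 1000
    end = (end_ms + 1) * sample_rate // 1000  # end time is inclusive
    return samples[start:end]

audio = list(range(16000 * 2))           # 2 s of dummy mono audio
kitablar = cut_segment(audio, 60, 109)   # "kitablar": 60-109 ms in Table 2
```

With real data, `samples` would be the decoded waveform of the historical speech, and the start/end times would come from the forced-alignment output.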
S102, constructing a pronunciation recognition network for the word to be determined, the network including the word's correct pronunciation units and variant pronunciation units.
In practice, sound change in alphabetic languages generally shows some regularity, e.g., the devoicing, voicing, and elision in Uyghur introduced above, while sound change in ideographic languages is less regular, e.g., the many regional dialects derived from Mandarin.
It is understood that the scheme of the present disclosure can set the granularity of the pronunciation units in the pronunciation recognition network according to the practical application requirements. For example, the granularity of the pronunciation unit may be a phoneme level, or the granularity of the pronunciation unit may be a syllable level, which is not specifically limited by the present disclosure, and it is sufficient to ensure that the granularities of the pronunciation units at the same level in the network are consistent.
1. The word to be determined is in an alphabetic (pinyin-type) language
Correspondingly, the levels of all correct pronunciation units can be sequentially set according to the spelling sequence of the pronunciation words to be determined; for the level with the possibility of sound change, adding the sound change pronunciation units corresponding to the correct pronunciation units in parallel at the level; and setting the scoring values corresponding to the correct pronunciation units and the sound change pronunciation units to form the pronunciation identification network.
As an example, the levels where a sound change is possible may be determined from the regularity of the sound changes. For the Uyghur word "kitablar", the correct pronunciation is "k i t a b l a r", but in actual speech the phoneme "b" may be devoiced to "p", and the phoneme "r" may be dropped. If the granularity of the pronunciation units is the phoneme level, the levels of the units "b" and "r" can be marked as levels where sound change is possible. Accordingly, the levels of the correct units "k", "i", "t", "a", "b", "l", "a", and "r" are set in order, the variant unit "p" is added as a parallel node at the level of "b", and an empty arc fitting the drop is added at "r", constructing the pronunciation recognition network shown in fig. 2.
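The construction just described can be sketched as a list of levels, one per phoneme of the correct spelling, with variant units added in parallel at the levels where a sound change is possible. The variant table, score values, and `<eps>` marker for the empty (drop) arc are illustrative assumptions, not values from the patent.

```python
CORRECT = ["k", "i", "t", "a", "b", "l", "a", "r"]
VARIANTS = {4: [("p", -0.5)],       # level of "b": devoicing b -> p
            7: [("<eps>", -1.0)]}   # level of "r": drop, via an empty arc

def build_network(correct, variants, unit_score=0.0):
    """One level per correct unit; variants sit in parallel at their level."""
    levels = []
    for i, unit in enumerate(correct):
        arcs = [(unit, unit_score)]      # correct unit with its score value
        arcs += variants.get(i, [])      # parallel variant units, if any
        levels.append(arcs)
    return levels

net = build_network(CORRECT, VARIANTS)
# level 4 now offers both "b" and its variant "p" as parallel nodes
```

Traversing one arc per level from start to end then yields exactly the alternative pronunciation paths of fig. 2.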
2. The word to be determined is in an ideographic language
Correspondingly, a pronunciation-unit level and an empty-node level may be arranged in sequence, the pronunciation-unit level containing correct pronunciation units and variant pronunciation units arranged in parallel, and the empty node being used to jump back to the pronunciation-unit level until decoding of the speech segment finishes; score values are then set for each correct pronunciation unit, each variant pronunciation unit, and the level jump-back, forming the pronunciation recognition network.
As an example, to ensure that all possible variant pronunciations can be constructed, when the pronunciation units are at the phoneme level the variant pronunciation units may be all phonemes contained in the Chinese dictionary.
As an example, a variant pronunciation unit fitting the dropped sound change can also be placed in the network according to practical application requirements.
If the word to be determined is the Chinese word "洗" ("wash"), its correct pronunciation is "x i3", i.e., "x" and "i3" are the word's correct pronunciation units. When constructing the network, besides the correct units, all phonemes contained in the Chinese dictionary plus the dropped sound change can serve as variant pronunciation units, set as parallel peer nodes of the correct units, constructing the pronunciation recognition network shown in FIG. 3.
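The flat network of fig. 3 can be sketched as a single pronunciation-unit level holding the correct units plus the full phoneme inventory (and a drop unit) in parallel, together with a jump-back score for the empty node. The tiny phoneme inventory, score values, and dictionary layout below are illustrative assumptions.

```python
def build_flat_network(correct_units, all_phonemes, jump_score=-0.2,
                       unit_score=0.0, drop_score=-1.0):
    """One parallel level of units plus an empty-node jump-back score."""
    units = {u: unit_score for u in correct_units}
    for p in all_phonemes:           # every dictionary phoneme is a variant
        units.setdefault(p, unit_score)
    units["<eps>"] = drop_score      # drop-type variant pronunciation unit
    return {"units": units, "jump_score": jump_score}

# correct units of "洗" plus a toy phoneme inventory including "s"
net = build_flat_network(["x", "i3"], ["x", "i3", "s", "a1"])
# "s" sits in parallel with "x", so the Hefei-style "s i3" is reachable
```

Because every phoneme is available at the single level, any pronunciation of any length can be spelled out by repeatedly passing through the level and jumping back at the empty node.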
As can be understood, for the pronunciation words to be determined, the voice segments can be obtained first and then the pronunciation recognition network is constructed as shown in fig. 1; or, a pronunciation recognition network can be constructed first and then the voice segments can be obtained; alternatively, both actions may be performed simultaneously, which may not be specifically limited by the present disclosure.
S103, decoding the voice segment by using the pronunciation recognition network, and determining a pronunciation path corresponding to the voice segment, wherein the pronunciation path is formed by the correct pronunciation unit and/or the sound change pronunciation unit.
Based on the two pronunciation recognition networks introduced above, the disclosed scheme provides the following two decoding schemes, which are explained separately below.
1. Pronunciation recognition network constructed for an alphabetic (pinyin-type) language
Specifically, the voice segment may be used as an input of the pronunciation recognition network, and all the alternative pronunciation paths corresponding to the voice segment are determined by sequentially traversing each level; and determining the pronunciation path with the highest score value in the alternative pronunciation paths as the pronunciation path corresponding to the voice segment.
Taking the network shown in fig. 2 as an example: from the "start" node, decoding passes in turn through the levels "k", "i", "t", and "a", where no sound change occurs; at the next level it may pass through either the correct unit "b" or the variant unit "p"; it then passes through the levels "l" and "a", where again no sound change occurs, and finally through either the correct unit "r" or the empty arc fitting the dropped sound change, before reaching the "end" node. This gives 4 alternative pronunciation paths. The score of each alternative path can then be calculated, and the highest-scoring path is taken as the pronunciation path corresponding to the speech segment, i.e., the pronunciation of the word "kitablar" is the pronunciation represented by the highest-scoring path.
Note that L in fig. 2 denotes the score value corresponding to each pronunciation unit, which can be understood as the penalty score of that unit. When decoding with the pronunciation recognition network, every pronunciation unit other than a drop-type sound change pronunciation unit also corresponds to a decoded acoustic score; as an example, the acoustic score may be the probability value assigned to the pronunciation unit when the speech segment is decoded.
Specifically, the score value of an alternative pronunciation path may be calculated as follows: obtain the final score of every pronunciation unit included in the alternative pronunciation path, then combine the final scores of all the pronunciation units by a mathematical operation to obtain the score of the alternative pronunciation path. If a pronunciation unit is a drop-type sound change pronunciation unit, its final score is the score value corresponding to that sound change pronunciation unit; otherwise, its final score is obtained by performing a mathematical operation on the decoded acoustic score of the pronunciation unit and the score value corresponding to the pronunciation unit. The mathematical operation in the present disclosure may be addition, multiplication, or the like, and is not specifically limited.
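The scoring rule above can be sketched as follows, taking addition as the mathematical operation; the unit names, penalty values, and acoustic scores below are illustrative assumptions, not values from the patent:

```python
def final_score(unit, is_drop, penalty, acoustic):
    """Final score of one pronunciation unit on a path.

    A drop-type sound-change unit carries no acoustic evidence, so its
    final score is just its penalty; any other unit combines its decoded
    acoustic score with its penalty (addition chosen as the operation).
    """
    if is_drop:
        return penalty[unit]
    return acoustic[unit] + penalty[unit]

def path_score(path, penalty, acoustic):
    """Score of an alternative path: combine all unit final scores."""
    return sum(final_score(u, d, penalty, acoustic) for u, d in path)

# Illustrative numbers only: original-pronunciation units get penalty 0,
# the sound-change unit "p" and the drop arc get penalty -10.
penalty = {"b": 0, "p": -10, "r": 0, "r-drop": -10}
acoustic = {"b": 1.2, "p": 2.5, "r": 0.4}

correct = [("b", False), ("r", False)]        # ... b ... r
variant = [("p", False), ("r-drop", True)]    # ... p ... (r dropped)
print(path_score(correct, penalty, acoustic))  # = 1.2 + 0.4
print(path_score(variant, penalty, acoustic))  # = (2.5 - 10) + (-10)
```
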
2. Pronunciation recognition network constructed for ideographic language
Specifically, the voice segment may be used as the input of the pronunciation recognition network, reaching the empty node level after passing through the pronunciation unit level; whether decoding of the voice segment is finished is then judged, and if not, decoding jumps back to the pronunciation unit level, repeating until decoding of the voice segment is finished and an alternative pronunciation path is obtained. After all alternative pronunciation paths corresponding to the voice segment have been determined, the alternative pronunciation path with the highest score value is determined as the pronunciation path corresponding to the voice segment.
Taking the network shown in fig. 3 as an example, after the "start" node passes through the "x" node and reaches the empty node "null", decoding of the speech segment is not yet finished, so decoding jumps back from the "null" node level to the pronunciation unit level; the path then reaches "null" again after passing through the "i3" node, at which point decoding of the speech segment is finished, so the path finally reaches the "end" node and one alternative pronunciation path is obtained. In this example, the alternative pronunciation path includes 2 pronunciation units and 1 level jump back.
Following the above process, all alternative pronunciation paths corresponding to the voice segments can be obtained, that is, all possible pronunciations of the pronunciation word "wash" to be determined. The score of each alternative pronunciation path can then be calculated, and the path with the highest score is determined as the pronunciation path corresponding to the speech segment; that is, the pronunciation of the pronunciation word "wash" to be determined is the pronunciation represented by the highest-scoring path.
Note that L in fig. 3 denotes the score value corresponding to each pronunciation unit, which can be understood as the penalty score of that unit. When decoding with the pronunciation recognition network, every pronunciation unit other than a drop-type sound change pronunciation unit also corresponds to a decoded acoustic score; as an example, the acoustic score may be the probability value assigned to the pronunciation unit when the speech segment is decoded. In addition, it should be noted that each level jump back performed during decoding also corresponds to a score value, which likewise enters the calculation of the score of the alternative pronunciation path.
Specifically, the score value of an alternative pronunciation path may be calculated as follows: obtain the final score of every pronunciation unit included in the alternative pronunciation path, then combine, through a mathematical operation, the final scores of all the pronunciation units with the score values corresponding to all the level jump backs in the alternative pronunciation path to obtain the score of the alternative pronunciation path. If a pronunciation unit is a drop-type sound change pronunciation unit, its final score is the score value corresponding to that sound change pronunciation unit; otherwise, its final score is obtained by performing a mathematical operation on the decoded acoustic score of the pronunciation unit and the score value corresponding to the pronunciation unit. The mathematical operation in the present disclosure may be addition, multiplication, or the like, and is not specifically limited.
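For the ideographic-language network, the only addition to the scoring is the jump-back term. A sketch under the same assumptions (addition as the mathematical operation; penalty and acoustic values illustrative):

```python
def ideographic_path_score(units, n_jump_backs, penalty, acoustic,
                           jump_back_penalty=-10):
    """Path score = combined unit final scores plus one score value per
    level jump back taken during decoding (all values are assumptions)."""
    total = 0.0
    for unit, is_drop in units:
        # Drop-type units contribute only their penalty; others combine
        # decoded acoustic score with the unit's penalty.
        total += penalty[unit] if is_drop else acoustic[unit] + penalty[unit]
    return total + n_jump_backs * jump_back_penalty

# The fig. 3 example path "x" -> "i3" has 2 units and 1 level jump back.
penalty = {"x": 0, "i3": 0}
acoustic = {"x": 0.9, "i3": 0.7}
score = ideographic_path_score([("x", False), ("i3", False)], 1,
                               penalty, acoustic)
# = (0.9 + 0) + (0.7 + 0) + 1 * (-10)
```
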
As an example, when setting the penalty scores, the penalty score of each pronunciation unit corresponding to the original pronunciation of the pronunciation word to be determined may be set to 0, while the penalty scores of other pronunciation units, level jump backs, and so on may be set to -10. This is mainly because a word usually has only a few pronunciation variants, and they do not vary too wildly; setting penalty scores in this way keeps the decoding result from becoming too confused. As an example, if the pronunciation word to be determined is "wash" and the user reads it as "xi3", the pronunciation units corresponding to the original pronunciation are "x" and "i3"; if the user reads it as "si3", the pronunciation units corresponding to the original pronunciation are "s" and "i3".
As one example, the penalty scores may be determined empirically. For example, a number of pronunciation words to be determined may be selected from standard Mandarin and networks constructed and decoded according to the disclosed scheme; when the accuracy of the pronunciation paths decoded for all these pronunciation words is higher than a preset threshold, the penalty scores then in effect for the pronunciation units, level jump backs, and so on can be adopted. As an example, the preset threshold may be 95%. In the present disclosure, accuracy may be understood as the proportion of decoded pronunciation paths whose represented pronunciation is the same as the user's original pronunciation.
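The empirical tuning loop above can be sketched as a search over candidate penalty settings; `decode_word` and the reference pronunciations below are placeholders standing in for the decoding procedure of the disclosed scheme, not part of the patent:

```python
def tune_penalty(candidates, decode_word, reference, words, threshold=0.95):
    """Return the first penalty setting under which enough standard-Mandarin
    words decode to their known pronunciation.

    decode_word(word, penalty) -> decoded pronunciation (placeholder hook).
    """
    for penalty in candidates:
        hits = sum(decode_word(w, penalty) == reference[w] for w in words)
        if hits / len(words) >= threshold:
            return penalty
    return None  # no candidate reached the accuracy threshold

# Toy stand-in decoder: pretend only penalty -10 decodes correctly.
reference = {"wash": "x i3", "book": "sh u1"}
decode_word = lambda w, p: reference[w] if p == -10 else "?"
print(tune_penalty([0, -5, -10], decode_word, reference, list(reference)))
# prints -10
```
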
S104, calculating the confidence of the pronunciation represented by each pronunciation path, and generating the pronunciation dictionary of the pronunciation word to be determined by using the pronunciations represented by the pronunciation paths whose confidence is higher than a preset value.
The disclosed scheme can screen out the pronunciations corresponding to the pronunciation word to be determined based on the confidences and the preset value, so that the determined pronunciations can be added to, or substituted into, the original pronunciation dictionary to form a new pronunciation dictionary. The preset value may be set according to the actual application requirements and is not limited here.
As an example, if M pronunciation paths are obtained, M ≥ 1, the confidence can be calculated as follows: obtain the score Sj of the jth pronunciation path and the sum S of the scores of the M pronunciation paths, S = S1 + … + Sj + … + SM; then determine the ratio of Sj to S as the confidence of the pronunciation represented by the jth pronunciation path. This may be embodied as the following formula:

S(i, j) = Sj / (S1 + … + Sj + … + SM)    (Formula 1)

where S(i, j) denotes the confidence of the pronunciation represented by the jth pronunciation path of the pronunciation word i to be determined, Sj denotes the score of the jth pronunciation path of the pronunciation word i to be determined, and the denominator is the sum S of the scores of the M pronunciation paths.
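Formula 1 normalizes each path score by the sum over all M paths. A minimal sketch; the scores below are illustrative (chosen to reproduce the confidences in the example that follows), and real path scores come from the decoding step:

```python
def path_confidences(scores):
    """Formula 1: confidence of path j is S_j divided by S_1 + ... + S_M."""
    total = sum(scores)
    return [s / total for s in scores]

# Illustrative scores for three paths of "wash": 380 + 520 + 100 = 1000,
# yielding confidences 0.38, 0.52 and 0.10.
print(path_confidences([380, 520, 100]))
```
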
Taking the word "wash" in the Hefei dialect of Chinese as an example, assuming that 1000 voice segments are obtained in total, 1000 pronunciation paths and the score value of each pronunciation path can be obtained as described above. It can be understood that among the 1000 pronunciation paths some may be identical and some may all differ, that is, the number of distinct pronunciation paths M is less than or equal to 1000, and the confidence of the pronunciation represented by each distinct pronunciation path can be calculated using Formula 1.
As an example, the preset value may be set to 0.5. Suppose the 1000 speech segments corresponding to the pronunciation word "wash" to be determined yield 3 distinct pronunciation paths, with confidences as follows: the confidence of the pronunciation "x i3" is 0.38, the confidence of the pronunciation "s i3" is 0.52, and the confidence of the pronunciation "q i3" is 0.1. Then "s i3" can be used as the pronunciation of the pronunciation word "wash" to be determined in the Hefei dialect, and the pronunciation dictionary is generated accordingly.
As an example, the preset value may be set to 0.3. Suppose the 1000 speech segments corresponding to the pronunciation word "kitablar" to be determined yield 4 distinct pronunciation paths, with confidences as follows: the confidence of the pronunciation "k i t a b l a r" is 0.2, the confidence of the pronunciation "k i t a p l a r" is 0.31, the confidence of the pronunciation "k i t a b l a" is 0.35, and the confidence of the pronunciation "k i t a p l a" is 0.14. Then "k i t a p l a r" and "k i t a b l a" can be used as the pronunciations of the pronunciation word "kitablar" to be determined in Uyghur, and the pronunciation dictionary is generated accordingly.
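The screening step then keeps every pronunciation whose confidence exceeds the preset value. A sketch using the confidence figures from the two examples above (the pronunciation strings are illustrative reconstructions):

```python
def select_pronunciations(confidence_by_pron, preset):
    """Keep pronunciations whose path confidence is above the preset value."""
    return sorted(p for p, c in confidence_by_pron.items() if c > preset)

wash = {"x i3": 0.38, "s i3": 0.52, "q i3": 0.10}
kitablar = {"k i t a b l a r": 0.20, "k i t a p l a r": 0.31,
            "k i t a b l a": 0.35, "k i t a p l a": 0.14}

print(select_pronunciations(wash, 0.5))       # only "s i3" survives
print(select_pronunciations(kitablar, 0.3))   # two pronunciations survive
```
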
In conclusion, when a pronunciation dictionary of a Pinyin language is generated using the disclosed scheme, the deviation caused by the prior art's disregard of actual sound changes is overcome; when a pronunciation dictionary of an ideographic language is generated using the disclosed scheme, the prior-art difficulty of manually annotating all pronunciations of a word is overcome. That is to say, the pronunciation dictionary generated by the disclosed scheme better matches users' actual pronunciations and is more accurate; performing speech recognition based on this pronunciation dictionary helps improve the quality of the acoustic model and thereby the overall recognition effect of the speech recognition model.
Referring to fig. 4, a schematic diagram of the pronunciation dictionary generating device of the present disclosure is shown. The apparatus may include:
a voice segment acquiring module 201, configured to acquire a voice segment corresponding to a pronunciation word to be determined;
a pronunciation recognition network construction module 202, configured to construct a pronunciation recognition network for the pronunciation word to be determined, where the pronunciation recognition network includes correct pronunciation units and sound change pronunciation units of the pronunciation word to be determined;
a pronunciation path determining module 203, configured to decode the voice segment by using the pronunciation recognition network, and determine a pronunciation path corresponding to the voice segment, where the pronunciation path is formed by the correct pronunciation unit and/or the sound change pronunciation unit;
a confidence calculation module 204, configured to calculate a confidence of the pronunciation represented by the pronunciation path;
and the pronunciation dictionary generating module 205 is configured to generate a pronunciation dictionary of the pronunciation word to be determined by using the pronunciation represented by the pronunciation path with the confidence higher than the preset value.
Optionally, the pronunciation words to be determined are Pinyin language,
the pronunciation recognition network construction module is configured to add, in parallel at each level where sound change may occur, the sound change pronunciation unit corresponding to the correct pronunciation unit; and to set the score values corresponding to each correct pronunciation unit and each sound change pronunciation unit, so as to form the pronunciation recognition network.
Optionally, the pronunciation path determination module includes:
the alternative pronunciation path determining module is used for taking the voice segments as the input of the pronunciation recognition network, sequentially traversing each level and determining all alternative pronunciation paths corresponding to the voice segments;
and the pronunciation path determining submodule is used for determining the pronunciation path with the highest score value in the alternative pronunciation paths as the pronunciation path corresponding to the voice segment.
Optionally, the pronunciation path determination module further comprises:
a score value obtaining module, configured to obtain final scores of all pronunciation units included in the alternative pronunciation path: if the pronunciation unit is a falling-off type sound change pronunciation unit, the final score of the pronunciation unit is the corresponding score of the sound change pronunciation unit; otherwise, the final score of the pronunciation unit is obtained by performing mathematical operation on the decoded acoustic score of the pronunciation unit and the score value corresponding to the pronunciation unit; and performing mathematical operation by using the final scores of all the pronunciation units to obtain the score of the alternative pronunciation path.
Optionally, the pronunciation words to be determined are ideographic languages,
the pronunciation recognition network construction module is configured to set a pronunciation unit level and an empty node level in sequence, where the pronunciation unit level includes correct pronunciation units and sound change pronunciation units arranged in parallel, and the empty node level is used for performing a level jump back before decoding of the voice segment is finished; and to set the score values corresponding to each correct pronunciation unit, each sound change pronunciation unit, and the level jump back, so as to form the pronunciation recognition network.
Optionally, the pronunciation path determination module includes:
the alternative pronunciation path determining module is used for taking the voice segment as the input of the pronunciation recognition network and reaching the empty node level after passing through the pronunciation unit level; judging whether the decoding of the voice segment is finished, if the decoding of the voice segment is not finished, jumping back to the pronunciation unit level until the decoding of the voice segment is finished to obtain an alternative pronunciation path;
and the pronunciation path determining submodule is used for determining all alternative pronunciation paths corresponding to the voice segment, and then determining the pronunciation path corresponding to the voice segment with the highest score value in the alternative pronunciation paths.
Optionally, the pronunciation path determination module further comprises:
a score value obtaining module, configured to obtain the final scores of all pronunciation units included in the alternative pronunciation path: if the pronunciation unit is a drop-type sound change pronunciation unit, the final score of the pronunciation unit is the score corresponding to that sound change pronunciation unit; otherwise, the final score of the pronunciation unit is obtained by performing a mathematical operation on the decoded acoustic score of the pronunciation unit and the score value corresponding to the pronunciation unit; and to perform a mathematical operation using the final score of each pronunciation unit and the score values corresponding to all the level jump backs in the alternative pronunciation path, so as to obtain the score value of the alternative pronunciation path.
Alternatively, if M pronunciation paths are available, M ≧ 1,
the confidence calculation module is configured to obtain the score Sj of the jth pronunciation path and the sum S of the scores of the M pronunciation paths, S = S1 + … + Sj + … + SM; and to determine the ratio of Sj to S as the confidence of the pronunciation represented by the jth pronunciation path.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Referring to fig. 5, a schematic structural diagram of an electronic device 300 for generating a pronunciation dictionary according to the present disclosure is shown. Referring to fig. 5, the electronic device 300 includes a processing component 301 that further includes one or more processors, and storage device resources, represented by storage medium 302, for storing instructions, such as application programs, that are executable by the processing component 301. The application programs stored in the storage medium 302 may include one or more modules that each correspond to a set of instructions. Further, the processing component 301 is configured to execute instructions to perform the pronunciation dictionary generation method described above.
The electronic device 300 may also include a power component 303 configured to perform power management of the electronic device 300; a wired or wireless network interface 306 configured to connect the electronic device 300 to a network; and an input/output (I/O) interface 305. The electronic device 300 may operate based on an operating system stored on the storage medium 302, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
The preferred embodiments of the present disclosure are described in detail with reference to the accompanying drawings, however, the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solution of the present disclosure within the technical idea of the present disclosure, and these simple modifications all belong to the protection scope of the present disclosure.
It should be noted that, in the foregoing embodiments, various features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various combinations that are possible in the present disclosure are not described again.
In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure, as long as it does not depart from the spirit of the present disclosure.

Claims (18)

1. A pronunciation dictionary generating method, the method comprising:
acquiring a voice segment corresponding to a pronunciation word to be determined, and constructing a pronunciation recognition network aiming at the pronunciation word to be determined, wherein the pronunciation recognition network comprises a correct pronunciation unit and a sound variation pronunciation unit of the pronunciation word to be determined; when the pronunciation words to be determined are ideographic languages, the pronunciation recognition network comprises: the pronunciation unit level and the empty node level are sequentially arranged, and the empty node level is used for carrying out level jump back before the decoding of the voice fragment is finished;
decoding the voice segment by utilizing the pronunciation recognition network to determine a pronunciation path corresponding to the voice segment, wherein the pronunciation path is formed by the correct pronunciation unit and/or the sound change pronunciation unit;
and calculating the confidence coefficient of the pronunciation represented by the pronunciation path, and generating a pronunciation dictionary of the pronunciation words to be determined by using the pronunciation represented by the pronunciation path with the confidence coefficient higher than a preset value.
2. The method of claim 1, wherein the pronunciation word to be determined is a Pinyin language, and wherein constructing a pronunciation recognition network for the pronunciation word to be determined comprises:
sequentially setting the levels of all correct pronunciation units according to the spelling sequence of the pronunciation words to be determined;
for the level with the possibility of sound change, adding the sound change pronunciation units corresponding to the correct pronunciation units in parallel at the level;
and setting the scoring values corresponding to the correct pronunciation units and the sound change pronunciation units to form the pronunciation identification network.
3. The method according to claim 2, wherein the decoding the speech segment by using the pronunciation recognition network to determine a pronunciation path corresponding to the speech segment comprises:
taking the voice segments as the input of the pronunciation recognition network, sequentially traversing each level, and determining all alternative pronunciation paths corresponding to the voice segments;
and determining the pronunciation path with the highest score value in the alternative pronunciation paths as the pronunciation path corresponding to the voice segment.
4. The method according to claim 3, wherein the score values of the alternative pronunciation paths are calculated by:
obtaining the final scores of all pronunciation units included in the alternative pronunciation path: if the pronunciation unit is a falling-off type sound change pronunciation unit, the final score of the pronunciation unit is the corresponding score of the sound change pronunciation unit; otherwise, the final score of the pronunciation unit is obtained by performing mathematical operation on the decoded acoustic score of the pronunciation unit and the score value corresponding to the pronunciation unit;
and performing mathematical operation by using the final scores of all the pronunciation units to obtain the score of the alternative pronunciation path.
5. The method according to claim 1, wherein the pronunciation words to be determined are ideographic languages, and the constructing a pronunciation recognition network for the pronunciation words to be determined specifically comprises:
arranging, at the pronunciation unit level, a correct pronunciation unit and a sound change pronunciation unit in parallel;
and setting the scoring values corresponding to each correct pronunciation unit, each sound change pronunciation unit and the level jump to form the pronunciation identification network.
6. The method according to claim 5, wherein the decoding the speech segment by using the pronunciation recognition network to determine the pronunciation path corresponding to the speech segment comprises:
taking the voice segments as the input of the pronunciation recognition network, and reaching the empty node level after passing through the pronunciation unit level;
judging whether the decoding of the voice segment is finished, if the decoding of the voice segment is not finished, jumping back to the pronunciation unit level until the decoding of the voice segment is finished to obtain an alternative pronunciation path;
and after all the alternative pronunciation paths corresponding to the voice segments are determined, determining the pronunciation path corresponding to the voice segments with the highest score value in the alternative pronunciation paths.
7. The method according to claim 6, wherein the score values of the alternative pronunciation paths are calculated by:
obtaining the final scores of all pronunciation units included in the alternative pronunciation path: if the pronunciation unit is a falling-off type sound change pronunciation unit, the final score of the pronunciation unit is the corresponding score of the sound change pronunciation unit; otherwise, the final score of the pronunciation unit is obtained by performing mathematical operation on the decoded acoustic score of the pronunciation unit and the score value corresponding to the pronunciation unit;
and performing mathematical operation by using the final score of each pronunciation unit and the score values corresponding to the rebound of all the levels in the alternative pronunciation path to obtain the score values of the alternative pronunciation path.
8. The method according to any one of claims 1 to 7, wherein if M pronunciation paths are obtained, M ≧ 1, the confidence of the pronunciation represented by each pronunciation path is obtained as follows:
obtaining the score Sj of the jth pronunciation path and the sum S of the scores of the M pronunciation paths, where S = S1 + … + Sj + … + SM;
and determining the ratio of Sj to S as the confidence of the pronunciation represented by the jth pronunciation path.
9. A pronunciation dictionary generating apparatus, comprising:
the voice segment acquisition module is used for acquiring a voice segment corresponding to the pronunciation word to be determined;
the pronunciation recognition network construction module is used for constructing a pronunciation recognition network aiming at the pronunciation words to be determined, and the pronunciation recognition network comprises a correct pronunciation unit and a sound change pronunciation unit of the pronunciation words to be determined; when the pronunciation words to be determined are ideographic languages, the pronunciation recognition network comprises: the pronunciation unit level and the empty node level are sequentially arranged, and the empty node level is used for carrying out level jump back before the decoding of the voice fragment is finished;
a pronunciation path determining module, configured to decode the voice segment by using the pronunciation recognition network, and determine a pronunciation path corresponding to the voice segment, where the pronunciation path is formed by the correct pronunciation unit and/or the sound change pronunciation unit;
the confidence coefficient calculation module is used for calculating the confidence coefficient of the pronunciation represented by the pronunciation path;
and the pronunciation dictionary generating module is used for generating a pronunciation dictionary of the pronunciation words to be determined by utilizing the pronunciation represented by the pronunciation path with the confidence coefficient higher than the preset value.
10. The apparatus of claim 9, wherein the pronunciation words to be determined are Pinyin language,
the pronunciation identification network construction module is used for adding the pronunciation unit corresponding to the correct pronunciation unit in parallel at the hierarchy with possible pronunciation; and setting the scoring values corresponding to the correct pronunciation units and the sound change pronunciation units to form the pronunciation identification network.
11. The apparatus of claim 10, wherein the pronunciation path determination module comprises:
the alternative pronunciation path determining module is used for taking the voice segments as the input of the pronunciation recognition network, sequentially traversing each level and determining all alternative pronunciation paths corresponding to the voice segments;
and the pronunciation path determining submodule is used for determining the pronunciation path with the highest score value in the alternative pronunciation paths as the pronunciation path corresponding to the voice segment.
12. The apparatus of claim 11, wherein the pronunciation path determination module further comprises:
a score value obtaining module, configured to obtain final scores of all pronunciation units included in the alternative pronunciation path: if the pronunciation unit is a falling-off type sound change pronunciation unit, the final score of the pronunciation unit is the corresponding score of the sound change pronunciation unit; otherwise, the final score of the pronunciation unit is obtained by performing mathematical operation on the decoded acoustic score of the pronunciation unit and the score value corresponding to the pronunciation unit; and performing mathematical operation by using the final scores of all the pronunciation units to obtain the score of the alternative pronunciation path.
13. The apparatus of claim 9, wherein the pronunciation words to be determined are ideographic languages,
the pronunciation identification network construction module is specifically used for arranging a correct pronunciation unit and a sound change pronunciation unit in parallel at the pronunciation unit level; and setting the scoring values corresponding to each correct pronunciation unit, each sound change pronunciation unit and the level jump to form the pronunciation identification network.
14. The apparatus of claim 13, wherein the pronunciation path determination module comprises:
the alternative pronunciation path determining module is used for taking the voice segment as the input of the pronunciation recognition network and reaching the empty node level after passing through the pronunciation unit level; judging whether the decoding of the voice segment is finished, if the decoding of the voice segment is not finished, jumping back to the pronunciation unit level until the decoding of the voice segment is finished to obtain an alternative pronunciation path;
and the pronunciation path determining submodule is used for determining all alternative pronunciation paths corresponding to the voice segment, and then determining the pronunciation path corresponding to the voice segment with the highest score value in the alternative pronunciation paths.
15. The apparatus of claim 14, wherein the pronunciation path determination module further comprises:
a score value obtaining module, configured to obtain final scores of all pronunciation units included in the alternative pronunciation path: if the pronunciation unit is a falling-off type sound change pronunciation unit, the final score of the pronunciation unit is the corresponding score of the sound change pronunciation unit; otherwise, the final score of the pronunciation unit is obtained by performing mathematical operation on the decoded acoustic score of the pronunciation unit and the score value corresponding to the pronunciation unit; and performing mathematical operation by using the final score of each pronunciation unit and the score values corresponding to the rebound of all the levels in the alternative pronunciation path to obtain the score values of the alternative pronunciation path.
16. The apparatus according to any one of claims 9 to 15, wherein if M pronunciation paths are obtained, M ≧ 1,
the confidence calculation module is configured to obtain the score Sj of the jth pronunciation path and the sum S of the scores of the M pronunciation paths, S = S1 + … + Sj + … + SM; and determine the ratio of Sj to S as the confidence of the pronunciation represented by the jth pronunciation path.
17. A storage medium having stored thereon a plurality of instructions, wherein the instructions are adapted to be loaded by a processor and to cause execution of the steps of the method according to any one of claims 1 to 8.
18. An electronic device, characterized in that the electronic device comprises:
the storage medium of claim 17; and
a processor to execute the instructions in the storage medium.
CN201710805626.3A 2017-09-08 2017-09-08 Pronunciation dictionary generating method and device, storage medium and electronic equipment Active CN107767858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710805626.3A CN107767858B (en) 2017-09-08 2017-09-08 Pronunciation dictionary generating method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN107767858A CN107767858A (en) 2018-03-06
CN107767858B true CN107767858B (en) 2021-05-04

Family

ID=61265107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710805626.3A Active CN107767858B (en) 2017-09-08 2017-09-08 Pronunciation dictionary generating method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN107767858B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827803A (en) * 2019-11-11 2020-02-21 广州国音智能科技有限公司 Method, device and equipment for constructing dialect pronunciation dictionary and readable storage medium
CN111369974B (en) * 2020-03-11 2024-01-19 北京声智科技有限公司 Dialect pronunciation marking method, language identification method and related device
CN111681635A (en) * 2020-05-12 2020-09-18 深圳市镜象科技有限公司 Method, apparatus, device and medium for real-time cloning of voice based on small sample
CN111798834B (en) * 2020-07-03 2022-03-15 北京字节跳动网络技术有限公司 Method and device for identifying polyphone, readable medium and electronic equipment
CN113506559B (en) * 2021-07-21 2023-06-09 成都启英泰伦科技有限公司 Method for generating pronunciation dictionary according to Vietnam written text

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063900A (en) * 2010-11-26 2011-05-18 北京交通大学 Speech recognition method and system for overcoming confusing pronunciation
CN105893414A (en) * 2015-11-26 2016-08-24 乐视致新电子科技(天津)有限公司 Method and apparatus for screening valid term of a pronunciation lexicon
CN105957518A (en) * 2016-06-16 2016-09-21 内蒙古大学 Mongolian large vocabulary continuous speech recognition method
CN106935239A (en) * 2015-12-29 2017-07-07 阿里巴巴集团控股有限公司 The construction method and device of a kind of pronunciation dictionary

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3481497B2 (en) * 1998-04-29 2003-12-22 松下電器産業株式会社 Method and apparatus using a decision tree to generate and evaluate multiple pronunciations for spelled words
US7353174B2 (en) * 2003-03-31 2008-04-01 Sony Corporation System and method for effectively implementing a Mandarin Chinese speech recognition dictionary
US7428491B2 (en) * 2004-12-10 2008-09-23 Microsoft Corporation Method and system for obtaining personal aliases through voice recognition
CN100411011C (en) * 2005-11-18 2008-08-13 清华大学 Pronunciation quality evaluating method for language learning machine
CN101447184B (en) * 2007-11-28 2011-07-27 中国科学院声学研究所 Chinese-English bilingual speech recognition method based on phoneme confusion
CN101763855B (en) * 2009-11-20 2012-01-04 安徽科大讯飞信息科技股份有限公司 Method and device for judging confidence of speech recognition
CN101840699B (en) * 2010-04-30 2012-08-15 中国科学院声学研究所 Voice quality evaluation method based on pronunciation model
CN103164403B (en) * 2011-12-08 2016-03-16 深圳市北科瑞声科技有限公司 The generation method and system of video index data
JP6413220B2 (en) * 2013-10-15 2018-10-31 ヤマハ株式会社 Composite information management device
CN103578464B (en) * 2013-10-18 2017-01-11 威盛电子股份有限公司 Language model establishing method, speech recognition method and electronic device
CN106155341B (en) * 2015-03-25 2020-05-26 李佳俊 Computer character input method for Chinese characters based on Chinese and Zhuang language writing system
CN105513589B (en) * 2015-12-18 2020-04-28 百度在线网络技术(北京)有限公司 Speech recognition method and device
CN106653007B (en) * 2016-12-05 2019-07-16 苏州奇梦者网络科技有限公司 A kind of speech recognition system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Mandarin Chinese Speech Recognition for Ethnic-Minority-Language Accents Based on Pronunciation Dictionary Adaptation; Chen Jiang; China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology Series; 2011-05-15 (No. 5); pp. 1-4, 8-12, 16-18 *


Similar Documents

Publication Publication Date Title
CN107767858B (en) Pronunciation dictionary generating method and device, storage medium and electronic equipment
CN110648658B (en) Method and device for generating voice recognition model and electronic equipment
JP6675463B2 (en) Bidirectional stochastic rewriting and selection of natural language
CN111667816B (en) Model training method, speech synthesis method, device, equipment and storage medium
CN105632499B (en) Method and apparatus for optimizing speech recognition results
US8126714B2 (en) Voice search device
CN103065630B (en) User personalized information voice recognition method and user personalized information voice recognition system
Irie et al. On the choice of modeling unit for sequence-to-sequence speech recognition
CN109858038B (en) Text punctuation determination method and device
Zenkel et al. Subword and crossword units for CTC acoustic models
US9984689B1 (en) Apparatus and method for correcting pronunciation by contextual recognition
CN111369974B (en) Dialect pronunciation marking method, language identification method and related device
Li et al. Language modeling with functional head constraint for code switching speech recognition
EP3915104A1 (en) Word lattice augmentation for automatic speech recognition
Waters et al. Leveraging language id in multilingual end-to-end speech recognition
CN102439660A (en) Voice-tag method and apparatus based on confidence score
CN113808571B (en) Speech synthesis method, speech synthesis device, electronic device and storage medium
CN112580340A (en) Word-by-word lyric generating method and device, storage medium and electronic equipment
JP2016001242A (en) Question sentence creation method, device, and program
Song et al. Zeroprompt: Streaming acoustic encoders are zero-shot masked lms
JP6485941B2 (en) LANGUAGE MODEL GENERATION DEVICE, ITS PROGRAM, AND VOICE RECOGNIZING DEVICE
CN105632500B (en) Speech recognition apparatus and control method thereof
CN114519358A (en) Translation quality evaluation method and device, electronic equipment and storage medium
JP6276516B2 (en) Dictionary creation apparatus and dictionary creation program
CN113378553A (en) Text processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant