CN107767858B - Pronunciation dictionary generating method and device, storage medium and electronic equipment - Google Patents
- Publication number
- CN107767858B CN107767858B CN201710805626.3A CN201710805626A CN107767858B CN 107767858 B CN107767858 B CN 107767858B CN 201710805626 A CN201710805626 A CN 201710805626A CN 107767858 B CN107767858 B CN 107767858B
- Authority
- CN
- China
- Prior art keywords
- pronunciation
- unit
- path
- score
- alternative
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Abstract
The disclosure provides a pronunciation dictionary generation method and apparatus, a storage medium, and an electronic device. The method comprises the following steps: acquiring a speech segment corresponding to a word whose pronunciation is to be determined, and constructing a pronunciation recognition network for that word, the network comprising the word's correct pronunciation units and sound-change pronunciation units; decoding the speech segment with the pronunciation recognition network to determine the pronunciation path corresponding to the segment, the path being formed from the correct pronunciation units and/or the sound-change pronunciation units; and calculating the confidence of the pronunciation represented by each pronunciation path, and generating a pronunciation dictionary for the word from the pronunciations represented by paths whose confidence exceeds a preset value. With this scheme, the generated pronunciation dictionary better matches users' actual pronunciations and is more accurate.
Description
Technical Field
The present disclosure relates to the field of speech recognition, and in particular, to a method and an apparatus for generating a pronunciation dictionary, a storage medium, and an electronic device.
Background
With the continuous development of speech recognition technology, speech recognition is now widely applied in many fields, such as speech input methods, conference transcription, and film and television subtitle generation.
One important resource in speech recognition is the pronunciation dictionary, which typically maps words to phoneme strings. For example, the Chinese pronunciation dictionary entry "拔 b a2" indicates that the word "拔" (to pluck) is pronounced "b a2", where "b" and "a2" are monophones, the representational units modeled by the acoustic model; these are generally stable pronunciation units that indicate how a word is pronounced. An accurate pronunciation dictionary therefore directly determines the quality of the acoustic model and, in turn, can affect the overall recognition performance of the speech recognition system.
Currently, pronunciation dictionaries are mostly generated by different schemes for different types of languages:
For languages in which pronunciation information can be obtained directly from the written form, i.e., phonographic ("pinyin-type") languages such as Uyghur or Korean, words are spelled out from phonemes, and a pronunciation dictionary generally need not be built manually. For example, the Latin-transcribed Uyghur word "KoGun: k o G u n", where the text before the colon is the word and the text after it is its phoneme string.
For languages in which pronunciation information cannot be obtained directly from the written form, i.e., ideographic languages such as Chinese, one written form may correspond to several pronunciations and the pronunciation cannot be read off the characters, so the pronunciation dictionary must be built manually. As in the example above, the phoneme string "b a2" for the word "拔" cannot be obtained from the character itself.
In practice, sound-change phenomena are widespread in both phonographic and ideographic languages. For example: devoicing in Uyghur, where the "b" in the word "kitablar" is devoiced to "p" when spoken; voicing in Uyghur, where the "k" in the word "qelikikin" is voiced to "g" when spoken; deletion in Uyghur, where the "r" in the word "kitablar" is dropped when spoken; and sound change in Chinese dialects, where, for example, the Hefei dialect reads the word for "bathing" with its initial "xi" pronounced as "si".
Because of these sound changes, the dictionaries produced by generation schemes for phonographic languages may deviate from actual pronunciation, while for ideographic languages it is difficult to manually annotate the variant pronunciations of all words.
Disclosure of Invention
The present disclosure provides a pronunciation dictionary generation method and apparatus, a storage medium, and an electronic device, so that the generated pronunciation dictionary better matches users' actual pronunciations and is more accurate.
In order to achieve the above object, the present disclosure provides a pronunciation dictionary generating method, the method including:
acquiring a speech segment corresponding to a word whose pronunciation is to be determined, and constructing a pronunciation recognition network for the word, wherein the network comprises correct pronunciation units and sound-change pronunciation units of the word;

decoding the speech segment using the pronunciation recognition network to determine a pronunciation path corresponding to the speech segment, wherein the path is formed from the correct pronunciation units and/or the sound-change pronunciation units;

and calculating the confidence of the pronunciation represented by each pronunciation path, and generating a pronunciation dictionary for the word from the pronunciations represented by paths whose confidence exceeds a preset value.
Optionally, the word whose pronunciation is to be determined is in a phonographic (pinyin-type) language, and constructing the pronunciation recognition network for the word includes:

setting the levels of the correct pronunciation units in sequence according to the spelling order of the word;

for each level where sound change is possible, adding in parallel at that level the sound-change pronunciation units corresponding to the correct pronunciation unit;

and setting score values corresponding to each correct pronunciation unit and each sound-change pronunciation unit, thereby forming the pronunciation recognition network.
Optionally, decoding the speech segment using the pronunciation recognition network to determine the pronunciation path corresponding to the speech segment includes:

taking the speech segment as input to the pronunciation recognition network and traversing the levels in sequence to determine all alternative pronunciation paths corresponding to the speech segment;

and determining the alternative pronunciation path with the highest score value as the pronunciation path corresponding to the speech segment.
Optionally, the score value of an alternative pronunciation path is calculated as follows:

obtaining the final score of each pronunciation unit included in the alternative path: if the unit is a deletion-type sound-change pronunciation unit, its final score is the score value corresponding to that unit; otherwise, its final score is obtained by a mathematical operation on the unit's decoded acoustic score and its corresponding score value;

and performing a mathematical operation on the final scores of all the pronunciation units to obtain the score of the alternative pronunciation path.
Optionally, the word whose pronunciation is to be determined is in an ideographic language, and constructing the pronunciation recognition network for the word includes:

arranging in sequence a pronunciation-unit level and an empty-node level, wherein the pronunciation-unit level comprises correct pronunciation units and sound-change pronunciation units arranged in parallel, and the empty-node level performs a jump back to the pronunciation-unit level before decoding of the speech segment is finished;

and setting score values corresponding to each correct pronunciation unit, each sound-change pronunciation unit, and the level jump-back, thereby forming the pronunciation recognition network.
Optionally, decoding the speech segment using the pronunciation recognition network to determine the pronunciation path corresponding to the speech segment includes:

taking the speech segment as input to the pronunciation recognition network and reaching the empty-node level after passing through the pronunciation-unit level;

judging whether decoding of the speech segment is finished, and if not, jumping back to the pronunciation-unit level until decoding finishes, yielding an alternative pronunciation path;

and after all alternative pronunciation paths corresponding to the speech segment are determined, determining the alternative path with the highest score value as the pronunciation path corresponding to the speech segment.
Optionally, the score value of an alternative pronunciation path is calculated as follows:

obtaining the final score of each pronunciation unit included in the alternative path: if the unit is a deletion-type sound-change pronunciation unit, its final score is the score value corresponding to that unit; otherwise, its final score is obtained by a mathematical operation on the unit's decoded acoustic score and its corresponding score value;

and performing a mathematical operation on the final score of each pronunciation unit together with the score values corresponding to all level jump-backs in the alternative path to obtain the score value of the alternative pronunciation path.
Optionally, if M pronunciation paths are obtained, with M ≥ 1, the confidence of the pronunciation represented by each path is obtained as follows:

obtaining the score S_j of the j-th pronunciation path and the sum of the scores of the M paths, S = S_1 + … + S_j + … + S_M;

and determining the ratio S_j / S as the confidence of the pronunciation represented by the j-th path.
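The confidence computation above can be sketched directly. This is an illustrative example, not code from the patent; the function name `path_confidences` is a hypothetical choice, and the input is simply the list of path scores S_1 … S_M.

```python
# Sketch of the confidence rule above: the confidence of the j-th
# pronunciation path is its score S_j divided by the total S of all paths.
def path_confidences(scores):
    """Given scores S_1..S_M of M pronunciation paths, return S_j / S
    for each path, where S is the sum of all scores."""
    total = sum(scores)
    return [s / total for s in scores]

# Example: three paths scoring 6, 3, 1 yield confidences 0.6, 0.3, 0.1,
# so the first path's pronunciation would clear a preset value of 0.5.
print(path_confidences([6.0, 3.0, 1.0]))
```

Note that the confidences always sum to 1, so the preset threshold effectively selects how dominant a pronunciation must be among the decoded alternatives.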
The present disclosure provides a pronunciation dictionary generating apparatus, the apparatus including:

a speech segment acquisition module, configured to acquire a speech segment corresponding to the word whose pronunciation is to be determined;

a pronunciation recognition network construction module, configured to construct a pronunciation recognition network for the word, the network comprising correct pronunciation units and sound-change pronunciation units of the word;

a pronunciation path determination module, configured to decode the speech segment using the pronunciation recognition network and determine the pronunciation path corresponding to the segment, the path being formed from the correct pronunciation units and/or the sound-change pronunciation units;

a confidence calculation module, configured to calculate the confidence of the pronunciation represented by the pronunciation path;

and a pronunciation dictionary generation module, configured to generate a pronunciation dictionary for the word from the pronunciations represented by paths whose confidence exceeds a preset value.
Optionally, the word whose pronunciation is to be determined is in a phonographic (pinyin-type) language, and

the pronunciation recognition network construction module is configured to add, in parallel at each level where sound change is possible, the sound-change pronunciation units corresponding to the correct pronunciation unit, and to set score values corresponding to each correct pronunciation unit and each sound-change pronunciation unit, thereby forming the pronunciation recognition network.
Optionally, the pronunciation path determination module includes:

an alternative pronunciation path determination module, configured to take the speech segment as input to the pronunciation recognition network, traverse the levels in sequence, and determine all alternative pronunciation paths corresponding to the speech segment;

and a pronunciation path determination submodule, configured to determine the alternative path with the highest score value as the pronunciation path corresponding to the speech segment.
Optionally, the pronunciation path determination module further comprises:
a score value obtaining module, configured to obtain the final score of each pronunciation unit included in the alternative pronunciation path: if the unit is a deletion-type sound-change pronunciation unit, its final score is the score value corresponding to that unit; otherwise, its final score is obtained by a mathematical operation on the unit's decoded acoustic score and its corresponding score value; and to perform a mathematical operation on the final scores of all the pronunciation units to obtain the score of the alternative pronunciation path.
Optionally, the word whose pronunciation is to be determined is in an ideographic language, and

the pronunciation recognition network construction module is configured to arrange in sequence a pronunciation-unit level and an empty-node level, the pronunciation-unit level comprising correct pronunciation units and sound-change pronunciation units arranged in parallel, and the empty-node level performing a jump back to the pronunciation-unit level before decoding of the speech segment is finished; and to set score values corresponding to each correct pronunciation unit, each sound-change pronunciation unit, and the level jump-back, thereby forming the pronunciation recognition network.
Optionally, the pronunciation path determination module includes:

an alternative pronunciation path determination module, configured to take the speech segment as input to the pronunciation recognition network, reach the empty-node level after passing through the pronunciation-unit level, judge whether decoding of the segment is finished, and if not, jump back to the pronunciation-unit level until decoding finishes, yielding an alternative pronunciation path;

and a pronunciation path determination submodule, configured to determine, after all alternative paths corresponding to the speech segment have been determined, the alternative path with the highest score value as the pronunciation path corresponding to the speech segment.
Optionally, the pronunciation path determination module further comprises:
a score value obtaining module, configured to obtain the final score of each pronunciation unit included in the alternative pronunciation path: if the unit is a deletion-type sound-change pronunciation unit, its final score is the score value corresponding to that unit; otherwise, its final score is obtained by a mathematical operation on the unit's decoded acoustic score and its corresponding score value; and to perform a mathematical operation on the final score of each pronunciation unit together with the score values corresponding to all level jump-backs in the alternative path to obtain the score value of the alternative pronunciation path.
Optionally, if M pronunciation paths are obtained, with M ≥ 1,

the confidence calculation module is configured to obtain the score S_j of the j-th pronunciation path and the sum of the scores of the M paths, S = S_1 + … + S_j + … + S_M, and to determine the ratio S_j / S as the confidence of the pronunciation represented by the j-th path.
The present disclosure provides a storage medium having stored therein a plurality of instructions that are loaded by a processor to perform the steps of the pronunciation dictionary generation method described above.
The present disclosure provides an electronic device, comprising:
the storage medium described above; and
a processor to execute the instructions in the storage medium.
In the above scheme, a pronunciation recognition network can be constructed for a word whose pronunciation is to be determined, the network comprising the word's correct pronunciation units and sound-change pronunciation units. The network can then be used to decode the speech segment corresponding to the word, yielding the pronunciation path for the segment, and the word's pronunciation is obtained according to the confidence of the pronunciation represented by that path so as to generate the pronunciation dictionary. With this scheme, the generated dictionary better matches users' actual pronunciations and is more accurate.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
FIG. 1 is a flow chart of a pronunciation dictionary generation method according to the present disclosure;
FIG. 2 is a schematic diagram of a pronunciation recognition network in accordance with aspects of the present disclosure;
FIG. 3 is a schematic diagram of another pronunciation recognition network in accordance with aspects of the present disclosure;
FIG. 4 is a schematic diagram of a pronunciation dictionary generating device according to the present disclosure;
fig. 5 is a schematic structural diagram of an electronic device for generating a pronunciation dictionary according to the present disclosure.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
Referring to FIG. 1, a flow chart of the pronunciation dictionary generation method of the present disclosure is shown. The method may include the following steps:
s101, obtaining a voice segment corresponding to the pronunciation word to be determined.
It is understood that the word whose pronunciation is to be determined may be in different types of languages, such as a phonographic (pinyin-type) language or an ideographic language; in different individual languages, such as Chinese or Uyghur; or even in different dialects of the same language, such as the Hefei and Nanjing dialects of Chinese. The present disclosure places no particular limitation on the form of the word.
In the present disclosure, the speech segment corresponding to the pronunciation word to be determined may be obtained in various ways, which is exemplified below.
In the first way, the correspondence between words and speech segments can be established by manual annotation, and the speech segment corresponding to the word can be retrieved from this correspondence when needed.
In the second way, the speech segment corresponding to the word is obtained by automatic recognition.
For example, historical speech data may be recognized and decoded to convert the audio into text, and the speech segment corresponding to the word can then be extracted from the historical speech data by forced-alignment segmentation, i.e., by determining the time position of the word within the historical speech data.
For example, for Chinese (an ideographic language), if a piece of historical speech data is labeled "洗衣服" (wash clothes), forced-alignment segmentation yields the time information shown in Table 1 below.
TABLE 1

| Word | Start time | End time | Duration |
| 洗 (wash) | 20 ms | 39 ms | 20 ms |
| 衣服 (clothes) | 40 ms | 79 ms | 40 ms |
If the word whose pronunciation is to be determined is "衣服" (clothes), the segment at 40-79 ms of the historical speech data can be taken as the speech segment corresponding to "衣服".
For example, for Uyghur (a phonographic language), if a piece of historical speech data is labeled "nurGun kitablar bar", forced-alignment segmentation yields the time information shown in Table 2 below.
TABLE 2

| Word | Start time | End time | Duration |
| nurGun | 20 ms | 59 ms | 40 ms |
| kitablar | 60 ms | 109 ms | 50 ms |
| bar | 110 ms | 139 ms | 30 ms |
If the word whose pronunciation is to be determined is "kitablar", the segment at 60-109 ms of the historical speech data can be taken as the speech segment corresponding to "kitablar".
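The segment extraction from the alignment times can be sketched as follows. This is an illustrative example rather than the patent's implementation; `extract_segment`, the PCM sample buffer, and the 16 kHz sample rate are all assumptions for the sketch.

```python
# Sketch: slice a word's speech segment out of an utterance using the
# start/end times (in milliseconds, end inclusive) from forced alignment.
def extract_segment(samples, sample_rate_hz, start_ms, end_ms):
    """Return the slice of `samples` covering [start_ms, end_ms]."""
    start = start_ms * sample_rate_hz // 1000
    end = (end_ms + 1) * sample_rate_hz // 1000  # end_ms is inclusive
    return samples[start:end]

# For "kitablar" aligned to 60-109 ms at 16 kHz, this selects samples
# 960..1759 of the utterance, i.e. 800 samples = 50 ms of audio.
segment = extract_segment(list(range(3000)), 16000, 60, 109)
print(len(segment))  # 800
```

The same call with 40-79 ms would extract the "衣服" segment from the Table 1 example.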
S102, constructing a pronunciation recognition network for the word, the network comprising correct pronunciation units and sound-change pronunciation units of the word.
In practice, sound changes in phonographic languages usually follow certain regular patterns, such as the devoicing, voicing, and deletion in Uyghur introduced above, whereas sound changes in ideographic languages are far less regular, as with the many local dialects derived from Mandarin.
It is understood that the granularity of the pronunciation units in the network can be set according to practical requirements: for example, the granularity may be the phoneme level or the syllable level. The present disclosure places no particular limitation on this, so long as the pronunciation units at the same level of the network share the same granularity.
1. The word whose pronunciation is to be determined is in a phonographic (pinyin-type) language
Correspondingly, the levels of the correct pronunciation units can be set in sequence according to the spelling order of the word; for each level where sound change is possible, the sound-change pronunciation units corresponding to the correct pronunciation unit are added in parallel at that level; and score values are set for each correct pronunciation unit and each sound-change pronunciation unit, forming the pronunciation recognition network.
As an example, the levels where sound change is possible can be determined from the regularity of the sound changes. For the Uyghur word "kitablar", the correct pronunciation is "k i t a b l a r", but in actual speech the phoneme "b" may be devoiced to "p" and the phoneme "r" may be dropped. If the granularity of the pronunciation units is the phoneme level, the levels of the units "b" and "r" can be marked as levels where sound change is possible. Accordingly, the levels of the correct units "k", "i", "t", "a", "b", "l", "a", and "r" are set in sequence; the sound-change unit "p" is then added as a parallel node at the same level as "b", and an empty arc is added at "r" to model the deletion, constructing the pronunciation recognition network shown in FIG. 2.
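The linear network above can be sketched with a simple level-list representation. This representation, the function name `build_pinyin_network`, and the use of `None` for the deletion arc are assumptions for illustration, not the patent's data structure.

```python
# Sketch: a linear pronunciation recognition network as a list of levels.
# Each level holds the correct unit plus any sound-change units in
# parallel; None models a deletion-type variant (an empty arc).
def build_pinyin_network(correct_units, variants):
    """variants maps a level index to extra units added in parallel."""
    network = []
    for i, unit in enumerate(correct_units):
        level = [unit] + variants.get(i, [])
        network.append(level)
    return network

# "kitablar": "b" (level 4) may devoice to "p"; "r" (level 7) may drop.
net = build_pinyin_network(
    list("kitablar"),
    {4: ["p"],    # devoicing variant of "b"
     7: [None]})  # empty arc modeling the dropped "r"
print(net)
```

Decoding then amounts to choosing one unit per level, which is why the FIG. 2 network yields a small, enumerable set of alternative paths.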
2. The word whose pronunciation is to be determined is in an ideographic language
Correspondingly, a pronunciation-unit level and an empty-node level can be arranged in sequence, the pronunciation-unit level comprising correct pronunciation units and sound-change pronunciation units arranged in parallel, and the empty-node level performing a jump back to the pronunciation-unit level before decoding of the speech segment finishes; score values are then set for each correct pronunciation unit, each sound-change pronunciation unit, and the level jump-back, forming the pronunciation recognition network.
As an example, to ensure that all possible variant pronunciations can be constructed, when the granularity of the pronunciation units is the phoneme level, the sound-change pronunciation units may be all the phonemes contained in the Chinese dictionary.
As an example, a sound-change pronunciation unit that models deletion can also be placed in the network according to practical requirements.
If the word whose pronunciation is to be determined is the Chinese word "洗" (wash), its correct pronunciation is "x i3", i.e., "x" and "i3" are the word's correct pronunciation units. When constructing the network, besides these correct units, all the phonemes contained in the Chinese dictionary, plus a deletion unit, can be used as sound-change pronunciation units and set as parallel nodes at the same level as the correct units, constructing the pronunciation recognition network shown in FIG. 3.
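The looped network for an ideographic word can be sketched as follows. The dictionary representation, the function name `build_ideographic_network`, the tiny five-phoneme set, and the penalty value are all assumptions for illustration; a real Chinese phoneme inventory is of course much larger.

```python
# Sketch: a looped pronunciation recognition network. One pronunciation-
# unit level contains every dictionary phoneme in parallel plus a
# deletion unit (None); an empty node jumps back to this level, with a
# penalty, until decoding of the speech segment finishes.
def build_ideographic_network(all_phonemes, jump_back_penalty):
    unit_level = list(all_phonemes) + [None]  # None = deletion-type unit
    return {"units": unit_level, "jump_penalty": jump_back_penalty}

# For "洗" (correct units "x", "i3"), no word-specific wiring is needed:
# the correct units are simply members of the full phoneme set.
net = build_ideographic_network(["x", "i3", "s", "b", "a2"],
                                jump_back_penalty=-0.5)
print(len(net["units"]))  # 6: five phonemes plus the deletion unit
```

Because every phoneme is available at every pass through the level, this single structure covers variant pronunciations of any word, at the cost of a much larger search space than the linear network of FIG. 2.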
As can be understood, for a given word, the speech segment can be obtained first and the pronunciation recognition network constructed afterwards, as shown in FIG. 1; the network can be constructed first and the segment obtained afterwards; or the two can be done simultaneously. The present disclosure places no particular limitation on the order.
S103, decoding the speech segment using the pronunciation recognition network to determine the pronunciation path corresponding to the segment, the path being formed from the correct pronunciation units and/or the sound-change pronunciation units.
Based on the two pronunciation recognition networks introduced above, the disclosed scheme provides the following two decoding schemes, which are explained separately below.
1. Decoding with the network constructed for a phonographic language
Specifically, the speech segment may be taken as input to the pronunciation recognition network, and each level is traversed in sequence to determine all alternative pronunciation paths corresponding to the segment; the alternative path with the highest score value is then determined as the pronunciation path corresponding to the segment.
Taking the network shown in FIG. 2 as an example: from the "start" node the decoder passes in sequence through the levels "k", "i", "t", and "a", where no sound change occurs; at the next level it may pass through either the correct unit "b" or the sound-change unit "p"; it then passes through the levels "l" and "a", where no sound change occurs; finally it passes through either the correct unit "r" or the empty arc modeling the deletion, reaching the "end" node. This yields 4 alternative pronunciation paths. The score of each alternative path can then be calculated, and the path with the highest score is determined as the pronunciation path of the speech segment; that is, the pronunciation of the word "kitablar" is the pronunciation represented by the highest-scoring path.
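The path count above can be checked by enumeration. This is a small illustrative sketch (the level-list form and the use of `None` for the dropped "r" are assumed representations, not the patent's):

```python
# Sketch: the alternative pronunciation paths of the FIG. 2 network are
# the Cartesian product of the per-level choices; None marks the empty
# arc where "r" is dropped and is filtered out of the resulting path.
from itertools import product

levels = [["k"], ["i"], ["t"], ["a"], ["b", "p"], ["l"], ["a"], ["r", None]]
paths = [[u for u in combo if u is not None] for combo in product(*levels)]
print(len(paths))  # 4 paths: b/p choice times r-kept/r-dropped choice
```

Only the two branching levels contribute alternatives, so 2 × 2 = 4 paths, matching the count in the text.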
Note that L in fig. 2 is the score value corresponding to each pronunciation unit, and can be understood as a penalty score for that unit. When decoding with the pronunciation recognition network, every pronunciation unit other than a drop-type sound change pronunciation unit also corresponds to a decoded acoustic score; as an example, the acoustic score may be the probability value assigned to the pronunciation unit when the voice segment is decoded.
Specifically, the score value of an alternative pronunciation path may be calculated as follows: obtain the final score of every pronunciation unit included in the alternative pronunciation path, and perform a mathematical operation on these final scores to obtain the score of the alternative pronunciation path. If a pronunciation unit is a drop-type sound change pronunciation unit, its final score is the score value corresponding to that sound change pronunciation unit; otherwise, its final score is obtained by performing a mathematical operation on the unit's decoded acoustic score and its corresponding score value. The mathematical operation in the present disclosure may be addition, multiplication, etc.; this is not specifically limited.
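A minimal sketch of this score calculation, assuming addition as the mathematical operation (the disclosure also permits multiplication); the score dictionaries are illustrative placeholders, not values from the disclosure:

```python
def path_score(units, acoustic_scores, penalty_scores, dropped=()):
    """Sum the final scores of all pronunciation units on an alternative
    pronunciation path. A drop-type sound-change unit (listed in `dropped`)
    contributes only its penalty score; every other unit contributes its
    decoded acoustic score plus its penalty score."""
    total = 0.0
    for unit in units:
        if unit in dropped:
            total += penalty_scores[unit]  # no acoustic score exists for a drop
        else:
            total += acoustic_scores[unit] + penalty_scores[unit]
    return total

# Example: a two-unit fragment where the final "r" dropped (penalty -10,
# matching the later example; the acoustic score is a made-up log-probability).
score = path_score(["b", "r"], {"b": -1.5}, {"b": 0.0, "r": -10.0},
                   dropped=("r",))
# → -11.5
```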
2. Pronunciation recognition network constructed for ideographic language
Specifically, the voice segment may be used as the input of the pronunciation recognition network, reaching the empty node level after passing through the pronunciation unit level; whether decoding of the voice segment is finished is then judged, and if not, decoding jumps back to the pronunciation unit level, repeating until decoding is finished and an alternative pronunciation path is obtained; after all alternative pronunciation paths corresponding to the voice segment have been determined, the one with the highest score value is determined as the pronunciation path corresponding to the voice segment.
Taking the network shown in fig. 3 as an example: after the "start" node, decoding passes through the "x" node and reaches the empty node "null"; since decoding of the voice segment is not yet finished, it jumps back from the empty node level to the pronunciation unit level, passes through the "i3" node, and reaches "null" again; at this point decoding of the voice segment is finished, so it can proceed to the "end" node, yielding an alternative pronunciation path. In this example the alternative pronunciation path includes 2 pronunciation units and 1 level jump-back.
According to the above process, all alternative pronunciation paths corresponding to the voice segments can be obtained, that is, all possible pronunciations of the pronunciation word "wash" to be determined. The score of each alternative pronunciation path can then be calculated, and the path with the highest score is determined as the pronunciation path corresponding to the voice segment; that is, the pronunciation of the pronunciation word "wash" to be determined is the pronunciation represented by the highest-scoring path.
Note that L in fig. 3 is the score value corresponding to each pronunciation unit, and can be understood as a penalty score for that unit. When decoding with the pronunciation recognition network, every pronunciation unit other than a drop-type sound change pronunciation unit also corresponds to a decoded acoustic score; as an example, the acoustic score may be the probability value assigned to the pronunciation unit when the voice segment is decoded. In addition, each level jump-back performed during decoding also corresponds to a score value, which likewise enters the calculation of the score of the alternative pronunciation path.
Specifically, the score value of an alternative pronunciation path may be calculated as follows: obtain the final score of every pronunciation unit included in the alternative pronunciation path, and perform a mathematical operation on these final scores together with the score values corresponding to all level jump-backs in the path to obtain the score of the alternative pronunciation path. If a pronunciation unit is a drop-type sound change pronunciation unit, its final score is the score value corresponding to that sound change pronunciation unit; otherwise, its final score is obtained by performing a mathematical operation on the unit's decoded acoustic score and its corresponding score value. The mathematical operation in the present disclosure may be addition, multiplication, etc.; this is not specifically limited.
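For the ideographic-language network, the same calculation can be sketched with the jump-back penalties included; addition is again assumed as the mathematical operation, and all concrete numbers are illustrative:

```python
def path_score_with_jumps(units, acoustic_scores, penalty_scores,
                          num_jump_backs, jump_back_penalty, dropped=()):
    """Score an alternative pronunciation path in the ideographic-language
    network: each unit's final score (computed as in the alphabetic case)
    plus the score value of every level jump-back taken during decoding."""
    total = num_jump_backs * jump_back_penalty
    for unit in units:
        if unit in dropped:
            total += penalty_scores[unit]  # drop-type unit: penalty only
        else:
            total += acoustic_scores[unit] + penalty_scores[unit]
    return total

# The fig. 3 example path "x i3" uses 2 units and 1 level jump-back.
score = path_score_with_jumps(["x", "i3"], {"x": -2.0, "i3": -1.0},
                              {"x": 0.0, "i3": 0.0},
                              num_jump_backs=1, jump_back_penalty=-10.0)
# → -13.0
```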
As an example, when setting the penalty scores, the penalty score of the pronunciation units corresponding to the original pronunciation of the pronunciation word to be determined may be set to 0, and the penalty scores of the other pronunciation units, level jump-backs, etc. may be set to -10. This is mainly because the sound changes of a word are usually limited to a few cases and are not too drastic, so setting penalty scores keeps the decoding result from becoming too confused. As an example, for the pronunciation word "wash" to be determined: if the user reads it as "xi", the pronunciation units corresponding to the original pronunciation are "x" and "i3"; if the user reads it as "si", the pronunciation units corresponding to the original pronunciation are "s" and "i3".
As one example, the penalty scores may be determined empirically. For example, a number of pronunciation words to be determined in standard Mandarin are selected, networks are constructed and decoded according to the disclosed scheme, and when the accuracy of the pronunciation paths decoded for all of these words is higher than a preset threshold, the penalty scores then in effect for the pronunciation units, level jump-backs, etc. are adopted. As an example, the preset threshold may be 95%. In the present disclosure, accuracy may be understood as the proportion of cases in which the pronunciation represented by the decoded pronunciation path is the same as the user's original pronunciation.
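The empirical procedure above can be sketched as a simple search over candidate penalty values. Here `decode_fn` is a hypothetical callable standing in for the disclosure's construct-and-decode step, `dev_words` pairs each word with its known correct pronunciation, and the 95% threshold matches the example:

```python
def tune_penalty(candidates, decode_fn, dev_words, threshold=0.95):
    """Return the first candidate penalty value for which decoding the
    development words reaches an accuracy above `threshold`, or None if
    no candidate qualifies. `decode_fn(penalty, word)` is a hypothetical
    stand-in for building the network with that penalty and decoding the
    word's voice segments into a pronunciation string."""
    for penalty in candidates:
        correct = sum(decode_fn(penalty, word) == pron
                      for word, pron in dev_words)
        if correct / len(dev_words) > threshold:
            return penalty
    return None
```

In practice the search would cover the unit penalties and the jump-back penalty jointly; a single scalar is used here only to keep the sketch short.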
And S104, calculating the confidence coefficient of the pronunciation represented by the pronunciation path, and generating a pronunciation dictionary of the pronunciation words to be determined by using the pronunciation represented by the pronunciation path with the confidence coefficient higher than a preset value.
The disclosed scheme can screen out the pronunciations corresponding to the pronunciation word to be determined based on the confidences and the preset value, so that the determined pronunciations can be added to, or replace entries in, the original pronunciation dictionary to form a new pronunciation dictionary. The preset value may be set according to actual application requirements and is not limited here.
As an example, if M pronunciation paths are obtained, M ≥ 1, the confidence may be calculated as follows: obtain the score S_j of the j-th pronunciation path and the sum of the scores of the M pronunciation paths, S = S_1 + … + S_j + … + S_M; then determine the ratio of S_j to S as the confidence of the pronunciation represented by the j-th pronunciation path. This may be embodied as the following equation (Equation 1):

s(i, j) = S_j / (S_1 + … + S_j + … + S_M)

where s(i, j) denotes the confidence of the pronunciation represented by the j-th pronunciation path of the pronunciation word i to be determined, S_j denotes the score of the j-th pronunciation path of the pronunciation word i to be determined, and the denominator is the sum S of the scores of the M pronunciation paths.
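Equation 1 can be sketched directly. Note that the ratio only behaves as a confidence when the per-path scores are non-negative; in the sketch below, S_j is taken as the number of voice segments decoded to path j in the 1000-segment example (an assumption, since the disclosure leaves the score domain open):

```python
def confidences(path_scores):
    """Equation 1: the confidence of each pronunciation path is its score
    S_j divided by the sum S = S_1 + ... + S_M over all M distinct paths.
    `path_scores` maps each distinct path to its (non-negative) score."""
    total = sum(path_scores.values())
    return {path: score / total for path, score in path_scores.items()}

# 1000 voice segments split over three distinct pronunciation paths:
conf = confidences({"x i3": 380, "s i3": 520, "q i3": 100})
# → {"x i3": 0.38, "s i3": 0.52, "q i3": 0.1}
```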
Taking the word "wash" in the Hefei dialect of Chinese as an example, and assuming that 1000 voice segments are obtained in total, 1000 pronunciation paths and the score value of each path can be obtained according to the above description. It can be understood that among the 1000 pronunciation paths some may be identical, or all may differ; that is, the number of distinct pronunciation paths M is at most 1000, and the confidence of the pronunciation represented by each path can be calculated using Equation 1.
As an example, the preset value may be set to 0.5. Suppose the 1000 voice segments corresponding to the pronunciation word "wash" to be determined yield 3 distinct pronunciation paths with the following confidences: the pronunciation "x i3" has confidence 0.38, the pronunciation "s i3" has confidence 0.52, and the pronunciation "q i3" has confidence 0.1. Then "s i3" can be used as the pronunciation of the Hefei-dialect pronunciation word "wash" to be determined, and the pronunciation dictionary is generated accordingly.
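The screening of step S104 for this example then reduces to a threshold filter (values copied from the example above):

```python
# Confidences of the three distinct pronunciation paths for "wash".
conf = {"x i3": 0.38, "s i3": 0.52, "q i3": 0.10}
preset = 0.5  # the preset value of the example

# Keep only pronunciations whose confidence exceeds the preset value.
entry = [path for path, c in conf.items() if c > preset]
# → ["s i3"]
```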
As an example, the preset value may be set to 0.3. Suppose the 1000 voice segments corresponding to the pronunciation word "kitablar" to be determined yield 4 distinct pronunciation paths with the following confidences: "k i t a b l a r" has confidence 0.2, "k i t a p l a r" has confidence 0.31, "k i t a b l a" has confidence 0.35, and "k i t a p l a" has confidence 0.14. Then both "k i t a p l a r" and "k i t a b l a" can be used as pronunciations of the Uyghur pronunciation word "kitablar" to be determined, and the pronunciation dictionary is generated accordingly.
In summary, when a pronunciation dictionary for a Pinyin-type language is generated with the disclosed scheme, the deviation caused in the prior art by ignoring actual sound changes is avoided; when a pronunciation dictionary for an ideographic language is generated, the prior-art difficulty of manually annotating all pronunciations of a word is avoided. In other words, the pronunciation dictionary generated by the disclosed scheme better matches users' actual pronunciations and is more accurate; performing speech recognition based on this dictionary helps improve the quality of the acoustic model and, in turn, the overall recognition performance of the speech recognition model.
Referring to fig. 4, a schematic diagram of the pronunciation dictionary generating device of the present disclosure is shown. The apparatus may include:
a voice segment acquiring module 201, configured to acquire a voice segment corresponding to a pronunciation word to be determined;
a pronunciation identification network construction module 202, configured to construct a pronunciation identification network for the pronunciation word to be determined, where the pronunciation identification network includes a correct pronunciation unit and a sound change pronunciation unit of the pronunciation word to be determined;
a pronunciation path determining module 203, configured to decode the voice segment by using the pronunciation recognition network, and determine a pronunciation path corresponding to the voice segment, where the pronunciation path is formed by the correct pronunciation unit and/or the sound change pronunciation unit;
a confidence calculation module 204, configured to calculate a confidence of the pronunciation represented by the pronunciation path;
and the pronunciation dictionary generating module 205 is configured to generate a pronunciation dictionary of the pronunciation word to be determined by using the pronunciation represented by the pronunciation path with the confidence higher than the preset value.
Optionally, the pronunciation words to be determined are Pinyin language,
the pronunciation identification network construction module is used for adding, in parallel at each level where sound change may occur, the sound change pronunciation unit corresponding to the correct pronunciation unit; and setting the scoring values corresponding to each correct pronunciation unit and each sound change pronunciation unit to form the pronunciation identification network.
Optionally, the pronunciation path determination module includes:
the alternative pronunciation path determining module is used for taking the voice segments as the input of the pronunciation recognition network, sequentially traversing each level and determining all alternative pronunciation paths corresponding to the voice segments;
and the pronunciation path determining submodule is used for determining the pronunciation path with the highest score value in the alternative pronunciation paths as the pronunciation path corresponding to the voice segment.
Optionally, the pronunciation path determination module further comprises:
a score value obtaining module, configured to obtain final scores of all pronunciation units included in the alternative pronunciation path: if the pronunciation unit is a falling-off type sound change pronunciation unit, the final score of the pronunciation unit is the corresponding score of the sound change pronunciation unit; otherwise, the final score of the pronunciation unit is obtained by performing mathematical operation on the decoded acoustic score of the pronunciation unit and the score value corresponding to the pronunciation unit; and performing mathematical operation by using the final scores of all the pronunciation units to obtain the score of the alternative pronunciation path.
Optionally, the pronunciation words to be determined are ideographic languages,
the pronunciation identification network construction module is used for sequentially setting a pronunciation unit level and a hollow node level, the pronunciation unit level comprises a correct pronunciation unit and a sound variation pronunciation unit which are arranged in parallel, and the hollow node level is used for carrying out level jump back before the decoding of the voice segment is finished; and setting the scoring values corresponding to each correct pronunciation unit, each sound change pronunciation unit and the level jump to form the pronunciation identification network.
Optionally, the pronunciation path determination module includes:
the alternative pronunciation path determining module is used for taking the voice segment as the input of the pronunciation recognition network and reaching the empty node level after passing through the pronunciation unit level; judging whether the decoding of the voice segment is finished, if the decoding of the voice segment is not finished, jumping back to the pronunciation unit level until the decoding of the voice segment is finished to obtain an alternative pronunciation path;
and the pronunciation path determining submodule is used for determining all alternative pronunciation paths corresponding to the voice segment, and then determining the pronunciation path corresponding to the voice segment with the highest score value in the alternative pronunciation paths.
Optionally, the pronunciation path determination module further comprises:
a score value obtaining module, configured to obtain final scores of all pronunciation units included in the alternative pronunciation path: if the pronunciation unit is a falling-off type sound change pronunciation unit, the final score of the pronunciation unit is the corresponding score of the sound change pronunciation unit; otherwise, the final score of the pronunciation unit is obtained by performing mathematical operation on the decoded acoustic score of the pronunciation unit and the score value corresponding to the pronunciation unit; and performing mathematical operation by using the final score of each pronunciation unit and the score values corresponding to the rebound of all the levels in the alternative pronunciation path to obtain the score values of the alternative pronunciation path.
Optionally, if M pronunciation paths are obtained, M ≥ 1,
the confidence calculation module is used for obtaining the score S_j of the j-th pronunciation path and the sum S = S_1 + … + S_j + … + S_M of the scores of the M pronunciation paths; and determining the ratio of S_j to S as the confidence of the pronunciation represented by the j-th pronunciation path.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Referring to fig. 5, a schematic structural diagram of an electronic device 300 for generating a pronunciation dictionary according to the present disclosure is shown. Referring to fig. 5, the electronic device 300 includes a processing component 301 that further includes one or more processors, and storage device resources, represented by storage medium 302, for storing instructions, such as application programs, that are executable by the processing component 301. The application programs stored in the storage medium 302 may include one or more modules that each correspond to a set of instructions. Further, the processing component 301 is configured to execute instructions to perform the pronunciation dictionary generation method described above.
The preferred embodiments of the present disclosure are described in detail with reference to the accompanying drawings, however, the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solution of the present disclosure within the technical idea of the present disclosure, and these simple modifications all belong to the protection scope of the present disclosure.
It should be noted that, in the foregoing embodiments, various features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various combinations that are possible in the present disclosure are not described again.
In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure, as long as it does not depart from the spirit of the present disclosure.
Claims (18)
1. A pronunciation dictionary generating method, the method comprising:
acquiring a voice segment corresponding to a pronunciation word to be determined, and constructing a pronunciation recognition network aiming at the pronunciation word to be determined, wherein the pronunciation recognition network comprises a correct pronunciation unit and a sound variation pronunciation unit of the pronunciation word to be determined; when the pronunciation words to be determined are ideographic languages, the pronunciation recognition network comprises: the pronunciation unit level and the empty node level are sequentially arranged, and the empty node level is used for carrying out level jump back before the decoding of the voice fragment is finished;
decoding the voice segment by utilizing the pronunciation recognition network to determine a pronunciation path corresponding to the voice segment, wherein the pronunciation path is formed by the correct pronunciation unit and/or the sound change pronunciation unit;
and calculating the confidence coefficient of the pronunciation represented by the pronunciation path, and generating a pronunciation dictionary of the pronunciation words to be determined by using the pronunciation represented by the pronunciation path with the confidence coefficient higher than a preset value.
2. The method of claim 1, wherein the pronunciation word to be determined is a Pinyin language, and wherein constructing a pronunciation recognition network for the pronunciation word to be determined comprises:
sequentially setting the levels of all correct pronunciation units according to the spelling sequence of the pronunciation words to be determined;
for the level with the possibility of sound change, adding the sound change pronunciation units corresponding to the correct pronunciation units in parallel at the level;
and setting the scoring values corresponding to the correct pronunciation units and the sound change pronunciation units to form the pronunciation identification network.
3. The method according to claim 2, wherein the decoding the speech segment by using the pronunciation recognition network to determine a pronunciation path corresponding to the speech segment comprises:
taking the voice segments as the input of the pronunciation recognition network, sequentially traversing each level, and determining all alternative pronunciation paths corresponding to the voice segments;
and determining the pronunciation path with the highest score value in the alternative pronunciation paths as the pronunciation path corresponding to the voice segment.
4. The method according to claim 3, wherein the score values of the alternative pronunciation paths are calculated by:
obtaining the final scores of all pronunciation units included in the alternative pronunciation path: if the pronunciation unit is a falling-off type sound change pronunciation unit, the final score of the pronunciation unit is the corresponding score of the sound change pronunciation unit; otherwise, the final score of the pronunciation unit is obtained by performing mathematical operation on the decoded acoustic score of the pronunciation unit and the score value corresponding to the pronunciation unit;
and performing mathematical operation by using the final scores of all the pronunciation units to obtain the score of the alternative pronunciation path.
5. The method according to claim 1, wherein the pronunciation words to be determined are ideographic languages, and the constructing a pronunciation recognition network for the pronunciation words to be determined specifically comprises:
the pronunciation unit hierarchy comprises a correct pronunciation unit and a sound change pronunciation unit which are arranged in parallel;
and setting the scoring values corresponding to each correct pronunciation unit, each sound change pronunciation unit and the level jump to form the pronunciation identification network.
6. The method according to claim 5, wherein the decoding the speech segment by using the pronunciation recognition network to determine the pronunciation path corresponding to the speech segment comprises:
taking the voice segments as the input of the pronunciation recognition network, and reaching the empty node level after passing through the pronunciation unit level;
judging whether the decoding of the voice segment is finished, if the decoding of the voice segment is not finished, jumping back to the pronunciation unit level until the decoding of the voice segment is finished to obtain an alternative pronunciation path;
and after all the alternative pronunciation paths corresponding to the voice segments are determined, determining the pronunciation path corresponding to the voice segments with the highest score value in the alternative pronunciation paths.
7. The method according to claim 6, wherein the score values of the alternative pronunciation paths are calculated by:
obtaining the final scores of all pronunciation units included in the alternative pronunciation path: if the pronunciation unit is a falling-off type sound change pronunciation unit, the final score of the pronunciation unit is the corresponding score of the sound change pronunciation unit; otherwise, the final score of the pronunciation unit is obtained by performing mathematical operation on the decoded acoustic score of the pronunciation unit and the score value corresponding to the pronunciation unit;
and performing mathematical operation by using the final score of each pronunciation unit and the score values corresponding to the rebound of all the levels in the alternative pronunciation path to obtain the score values of the alternative pronunciation path.
8. The method according to any one of claims 1 to 7, wherein if M pronunciation paths are obtained, M ≧ 1, the confidence of the pronunciation represented by each pronunciation path is obtained as follows:
obtaining the score S_j of the j-th pronunciation path and the sum S of the scores of the M pronunciation paths, where S = S_1 + … + S_j + … + S_M;
and determining the ratio of S_j to S as the confidence of the pronunciation represented by the j-th pronunciation path.
9. A pronunciation dictionary generating apparatus, comprising:
the voice segment acquisition module is used for acquiring a voice segment corresponding to the pronunciation word to be determined;
the pronunciation recognition network construction module is used for constructing a pronunciation recognition network aiming at the pronunciation words to be determined, and the pronunciation recognition network comprises a correct pronunciation unit and a sound change pronunciation unit of the pronunciation words to be determined; when the pronunciation words to be determined are ideographic languages, the pronunciation recognition network comprises: the pronunciation unit level and the empty node level are sequentially arranged, and the empty node level is used for carrying out level jump back before the decoding of the voice fragment is finished;
a pronunciation path determining module, configured to decode the voice segment by using the pronunciation recognition network, and determine a pronunciation path corresponding to the voice segment, where the pronunciation path is formed by the correct pronunciation unit and/or the sound change pronunciation unit;
the confidence coefficient calculation module is used for calculating the confidence coefficient of the pronunciation represented by the pronunciation path;
and the pronunciation dictionary generating module is used for generating a pronunciation dictionary of the pronunciation words to be determined by utilizing the pronunciation represented by the pronunciation path with the confidence coefficient higher than the preset value.
10. The apparatus of claim 9, wherein the pronunciation words to be determined are Pinyin language,
the pronunciation identification network construction module is used for adding, in parallel at each level where sound change may occur, the sound change pronunciation unit corresponding to the correct pronunciation unit; and setting the scoring values corresponding to each correct pronunciation unit and each sound change pronunciation unit to form the pronunciation identification network.
11. The apparatus of claim 10, wherein the pronunciation path determination module comprises:
the alternative pronunciation path determining module is used for taking the voice segments as the input of the pronunciation recognition network, sequentially traversing each level and determining all alternative pronunciation paths corresponding to the voice segments;
and the pronunciation path determining submodule is used for determining the pronunciation path with the highest score value in the alternative pronunciation paths as the pronunciation path corresponding to the voice segment.
12. The apparatus of claim 11, wherein the pronunciation path determination module further comprises:
a score value obtaining module, configured to obtain final scores of all pronunciation units included in the alternative pronunciation path: if the pronunciation unit is a falling-off type sound change pronunciation unit, the final score of the pronunciation unit is the corresponding score of the sound change pronunciation unit; otherwise, the final score of the pronunciation unit is obtained by performing mathematical operation on the decoded acoustic score of the pronunciation unit and the score value corresponding to the pronunciation unit; and performing mathematical operation by using the final scores of all the pronunciation units to obtain the score of the alternative pronunciation path.
13. The apparatus of claim 9, wherein the pronunciation words to be determined are ideographic languages,
the pronunciation identification network construction module is specifically used for arranging a correct pronunciation unit and a sound change pronunciation unit in parallel at the pronunciation unit level; and setting the scoring values corresponding to each correct pronunciation unit, each sound change pronunciation unit and the level jump to form the pronunciation identification network.
14. The apparatus of claim 13, wherein the pronunciation path determination module comprises:
the alternative pronunciation path determining module is used for taking the voice segment as the input of the pronunciation recognition network and reaching the empty node level after passing through the pronunciation unit level; judging whether the decoding of the voice segment is finished, if the decoding of the voice segment is not finished, jumping back to the pronunciation unit level until the decoding of the voice segment is finished to obtain an alternative pronunciation path;
and the pronunciation path determining submodule is used for determining all alternative pronunciation paths corresponding to the voice segment, and then determining the pronunciation path corresponding to the voice segment with the highest score value in the alternative pronunciation paths.
15. The apparatus of claim 14, wherein the pronunciation path determination module further comprises:
a score value obtaining module, configured to obtain final scores of all pronunciation units included in the alternative pronunciation path: if the pronunciation unit is a falling-off type sound change pronunciation unit, the final score of the pronunciation unit is the corresponding score of the sound change pronunciation unit; otherwise, the final score of the pronunciation unit is obtained by performing mathematical operation on the decoded acoustic score of the pronunciation unit and the score value corresponding to the pronunciation unit; and performing mathematical operation by using the final score of each pronunciation unit and the score values corresponding to the rebound of all the levels in the alternative pronunciation path to obtain the score values of the alternative pronunciation path.
16. The apparatus according to any one of claims 9 to 15, wherein if M pronunciation paths are obtained, M ≧ 1,
the confidence coefficient calculation module is used for obtaining the score S_j of the j-th pronunciation path and the sum S = S_1 + … + S_j + … + S_M of the scores of the M pronunciation paths; and determining the ratio of S_j to S as the confidence of the pronunciation represented by the j-th pronunciation path.
17. A storage medium having stored thereon a plurality of instructions, wherein the instructions are loadable by a processor and adapted to cause execution of the steps of the method according to any of claims 1 to 8.
18. An electronic device, characterized in that the electronic device comprises:
the storage medium of claim 17; and
a processor to execute the instructions in the storage medium.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710805626.3A CN107767858B (en) | 2017-09-08 | 2017-09-08 | Pronunciation dictionary generating method and device, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107767858A CN107767858A (en) | 2018-03-06 |
CN107767858B true CN107767858B (en) | 2021-05-04 |
Family
ID=61265107
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710805626.3A Active CN107767858B (en) | 2017-09-08 | 2017-09-08 | Pronunciation dictionary generating method and device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107767858B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110827803A (en) * | 2019-11-11 | 2020-02-21 | 广州国音智能科技有限公司 | Method, device and equipment for constructing dialect pronunciation dictionary and readable storage medium |
CN111369974B (en) * | 2020-03-11 | 2024-01-19 | 北京声智科技有限公司 | Dialect pronunciation marking method, language identification method and related device |
CN111681635A (en) * | 2020-05-12 | 2020-09-18 | 深圳市镜象科技有限公司 | Method, apparatus, device and medium for real-time cloning of voice based on small sample |
CN111798834B (en) * | 2020-07-03 | 2022-03-15 | 北京字节跳动网络技术有限公司 | Method and device for identifying polyphone, readable medium and electronic equipment |
CN113506559B (en) * | 2021-07-21 | 2023-06-09 | 成都启英泰伦科技有限公司 | Method for generating pronunciation dictionary according to Vietnam written text |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102063900A (en) * | 2010-11-26 | 2011-05-18 | 北京交通大学 | Speech recognition method and system for overcoming confusing pronunciation |
CN105893414A (en) * | 2015-11-26 | 2016-08-24 | 乐视致新电子科技(天津)有限公司 | Method and apparatus for screening valid term of a pronunciation lexicon |
CN105957518A (en) * | 2016-06-16 | 2016-09-21 | 内蒙古大学 | Mongolian large vocabulary continuous speech recognition method |
CN106935239A (en) * | 2015-12-29 | 2017-07-07 | 阿里巴巴集团控股有限公司 | The construction method and device of a kind of pronunciation dictionary |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3481497B2 (en) * | 1998-04-29 | 2003-12-22 | 松下電器産業株式会社 | Method and apparatus using a decision tree to generate and evaluate multiple pronunciations for spelled words |
US7353174B2 (en) * | 2003-03-31 | 2008-04-01 | Sony Corporation | System and method for effectively implementing a Mandarin Chinese speech recognition dictionary |
US7428491B2 (en) * | 2004-12-10 | 2008-09-23 | Microsoft Corporation | Method and system for obtaining personal aliases through voice recognition |
CN100411011C (en) * | 2005-11-18 | 2008-08-13 | 清华大学 | Pronunciation quality evaluating method for language learning machine |
CN101447184B (en) * | 2007-11-28 | 2011-07-27 | 中国科学院声学研究所 | Chinese-English bilingual speech recognition method based on phoneme confusion |
CN101763855B (en) * | 2009-11-20 | 2012-01-04 | 安徽科大讯飞信息科技股份有限公司 | Method and device for judging confidence of speech recognition |
CN101840699B (en) * | 2010-04-30 | 2012-08-15 | 中国科学院声学研究所 | Voice quality evaluation method based on pronunciation model |
CN103164403B (en) * | 2011-12-08 | 2016-03-16 | 深圳市北科瑞声科技有限公司 | The generation method and system of video index data |
JP6413220B2 (en) * | 2013-10-15 | 2018-10-31 | ヤマハ株式会社 | Composite information management device |
CN103578464B (en) * | 2013-10-18 | 2017-01-11 | 威盛电子股份有限公司 | Language model establishing method, speech recognition method and electronic device |
CN106155341B (en) * | 2015-03-25 | 2020-05-26 | 李佳俊 | Computer character input method for Chinese characters based on Chinese and Zhuang language writing system |
CN105513589B (en) * | 2015-12-18 | 2020-04-28 | 百度在线网络技术(北京)有限公司 | Speech recognition method and device |
CN106653007B (en) * | 2016-12-05 | 2019-07-16 | 苏州奇梦者网络科技有限公司 | A kind of speech recognition system |
2017-09-08: CN201710805626.3A — patent CN107767858B/en, status Active
Non-Patent Citations (1)
Title |
---|
Mandarin Chinese Speech Recognition for Ethnic-Language Accents Based on Pronunciation Dictionary Adaptation; Chen Jiang; China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology; 2011-05-15 (No. 5); pp. 1-4, 8-12, 16-18 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107767858B (en) | Pronunciation dictionary generating method and device, storage medium and electronic equipment | |
CN110648658B (en) | Method and device for generating voice recognition model and electronic equipment | |
JP6675463B2 (en) | Bidirectional stochastic rewriting and selection of natural language | |
CN111667816B (en) | Model training method, speech synthesis method, device, equipment and storage medium | |
CN105632499B (en) | Method and apparatus for optimizing speech recognition results | |
US8126714B2 (en) | Voice search device | |
CN103065630B (en) | User personalized information voice recognition method and user personalized information voice recognition system | |
Irie et al. | On the choice of modeling unit for sequence-to-sequence speech recognition | |
CN109858038B (en) | Text punctuation determination method and device | |
Zenkel et al. | Subword and crossword units for CTC acoustic models | |
US9984689B1 (en) | Apparatus and method for correcting pronunciation by contextual recognition | |
CN111369974B (en) | Dialect pronunciation marking method, language identification method and related device | |
Li et al. | Language modeling with functional head constraint for code switching speech recognition | |
EP3915104A1 (en) | Word lattice augmentation for automatic speech recognition | |
Waters et al. | Leveraging language id in multilingual end-to-end speech recognition | |
CN102439660A (en) | Voice-tag method and apparatus based on confidence score | |
CN113808571B (en) | Speech synthesis method, speech synthesis device, electronic device and storage medium | |
CN112580340A (en) | Word-by-word lyric generating method and device, storage medium and electronic equipment | |
JP2016001242A (en) | Question sentence creation method, device, and program | |
Song et al. | Zeroprompt: Streaming acoustic encoders are zero-shot masked lms | |
JP6485941B2 (en) | LANGUAGE MODEL GENERATION DEVICE, ITS PROGRAM, AND VOICE RECOGNIZING DEVICE | |
CN105632500B (en) | Speech recognition apparatus and control method thereof | |
CN114519358A (en) | Translation quality evaluation method and device, electronic equipment and storage medium | |
JP6276516B2 (en) | Dictionary creation apparatus and dictionary creation program | |
CN113378553A (en) | Text processing method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||