WO2022048354A1 - Speech forced alignment model evaluation method, apparatus, electronic device, and storage medium - Google Patents
Speech forced alignment model evaluation method, apparatus, electronic device, and storage medium
- Publication number
- WO2022048354A1 (PCT/CN2021/108899)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- phoneme
- score
- combination
- weight
- current
- Prior art date
Classifications
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
- G10L15/01—Assessment or evaluation of speech recognition systems
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/04—Segmentation; Word boundary detection
- G10L25/48—Speech or voice analysis techniques specially adapted for particular use
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Definitions
- Embodiments of the present disclosure relate to the field of computers, and in particular, to a method, apparatus, electronic device, and storage medium for evaluating a speech forced alignment model.
- Speech synthesis technology has been widely used in applications such as voice broadcast, voice navigation, and smart speakers.
- To improve the performance of speech synthesis, a speech synthesis model needs to be trained.
- This requires obtaining the phoneme time points of the training speech.
- The phoneme time points are usually obtained using phoneme forced alignment technology (i.e., machine annotation).
- However, the accuracy of the phoneme time points obtained by a forced alignment model is often not high.
- Embodiments of the present disclosure provide a method, apparatus, electronic device, and storage medium for evaluating a speech forced alignment model, so as to evaluate the accuracy of a speech forced alignment model at lower cost.
- An embodiment of the present disclosure provides a method for evaluating a speech forced alignment model, including:
- using the to-be-evaluated speech forced alignment model to obtain, according to each segment of audio in the test set and the text corresponding to each segment of audio, the phoneme sequence corresponding to each segment of audio and the predicted start and end times of each phoneme in the phoneme sequence;
- for each phoneme, obtaining the time accuracy score of the phoneme according to the predicted start and end times of the phoneme and the predetermined reference start and end times of the phoneme, wherein the time accuracy score characterizes the closeness of the phoneme's predicted start and end times to the reference start and end times; and
- determining the time accuracy score of the to-be-evaluated speech forced alignment model according to the time accuracy score of each phoneme.
- An embodiment of the present disclosure provides a speech forced alignment model evaluation apparatus, including:
- a first acquisition unit, configured to use the to-be-evaluated speech forced alignment model to obtain, according to each segment of audio in the test set and the text corresponding to each segment of audio, the phoneme sequence corresponding to each segment of audio and the predicted start and end times of each phoneme in the phoneme sequence;
- a second acquisition unit, configured to, for each phoneme, obtain the time accuracy score of the phoneme according to the predicted start and end times of the phoneme and the predetermined reference start and end times of the phoneme, wherein the time accuracy score characterizes the closeness of the phoneme's predicted start and end times to the reference start and end times; and
- a third acquisition unit, configured to determine the time accuracy score of the to-be-evaluated speech forced alignment model according to the time accuracy score of each phoneme.
- An embodiment of the present disclosure provides a storage medium storing a program suitable for speech forced alignment model evaluation, so as to implement the speech forced alignment model evaluation method according to any one of the foregoing.
- An embodiment of the present disclosure provides an electronic device, including at least one memory and at least one processor, wherein the memory stores a program and the processor invokes the program to execute the speech forced alignment model evaluation method according to any one of the foregoing.
- In the method, apparatus, electronic device, and storage medium for evaluating a speech forced alignment model provided by the embodiments of the present disclosure, the evaluation method first inputs each segment of audio in the test set and the text corresponding to the audio into the speech forced alignment model to be evaluated,
- and uses the speech forced alignment model to be evaluated to obtain the phoneme sequence corresponding to each audio segment and the predicted start and end times of each phoneme of each phoneme sequence; it then obtains the time accuracy score of each phoneme according to the predicted start and end times and the pre-known reference start and end times of the corresponding phonemes,
- and from the temporal accuracy score of each phoneme obtains the temporal accuracy score of the to-be-evaluated speech forced alignment model, thereby realizing its evaluation. It can be seen that in this evaluation method, the time accuracy score of each phoneme is obtained from the closeness of its predicted start and end times to the reference start and end times, and from these per-phoneme scores the time accuracy score of the speech forced alignment model to be evaluated follows.
- In some embodiments, the evaluation method further determines, for each phoneme, the current phoneme and constructs the phoneme combination of the current phoneme, so that every phoneme has a phoneme combination,
- with the same combination manner for each. Then, when obtaining the time accuracy score of the speech forced alignment model to be evaluated, the time accuracy correction score of the current phoneme is obtained according to the time accuracy scores of the phonemes in the current phoneme's combination,
- until the time accuracy correction score of each phoneme in the sequence is obtained, and the time accuracy score of the speech forced alignment model to be evaluated is obtained from the time accuracy correction scores of the phonemes in the phoneme sequence.
- This corrects the temporal accuracy score of the current phoneme with the temporal accuracy score of at least one adjacent phoneme, exploiting the context information of the current phoneme; by taking into account the influence of neighboring phonemes on the current phoneme, the revised temporal accuracy score of the current phoneme has higher accuracy.
- In some embodiments, to obtain the time accuracy score of each phoneme, the method first obtains the start-and-end-time intersection and the start-and-end-time union of the predicted start and end times and the reference start and end times of the same phoneme, and then obtains the time accuracy score of the corresponding phoneme as the ratio of the intersection to the union.
- The intersection of the start and end times represents the coincidence of the predicted start and end times with the reference start and end times,
- and the union of the start and end times represents the maximum overall extent of the predicted and reference start and end times.
- Their ratio therefore accurately represents the degree of overlap, so the phoneme time accuracy score accurately characterizes the closeness of the predicted start and end times to the reference start and end times.
- FIG. 1 is a schematic flowchart of a method for evaluating a speech forced alignment model provided by an embodiment of the present disclosure
- FIG. 2 is a schematic flowchart of a step of obtaining a time accuracy score of each phoneme in a method for evaluating a speech forced alignment model provided by an embodiment of the present disclosure
- FIG. 3 is another schematic flowchart of a method for evaluating a speech forced alignment model provided by an embodiment of the present disclosure
- FIG. 4 is another schematic flowchart of a method for evaluating a forced alignment model of speech provided by an embodiment of the present disclosure
- FIG. 5 is a schematic flowchart of a step of obtaining a time accuracy score of a forced alignment model for speech to be evaluated according to an embodiment of the present disclosure
- FIG. 6 is a block diagram of a speech forced alignment model evaluation device provided by an embodiment of the present disclosure.
- FIG. 7 is an optional hardware device architecture of an electronic device provided by an embodiment of the present disclosure.
- the present disclosure provides a speech forced alignment model evaluation method, which can automatically realize the accuracy evaluation of the speech forced alignment model.
- Embodiments of the present disclosure provide a method for evaluating a speech forced alignment model, including:
- using the to-be-evaluated speech forced alignment model to obtain, according to each segment of audio in the test set and the text corresponding to each segment of audio, the phoneme sequence corresponding to each segment of audio and the predicted start and end times of each phoneme in the phoneme sequence;
- for each phoneme, obtaining the time accuracy score of the phoneme according to the predicted start and end times of the phoneme and the predetermined reference start and end times of the phoneme, wherein the time accuracy score characterizes the closeness of the phoneme's predicted start and end times to the reference start and end times; and
- determining the time accuracy score of the to-be-evaluated speech forced alignment model according to the time accuracy score of each phoneme.
- The method for evaluating a speech forced alignment model includes first inputting each segment of audio in the test set and the text corresponding to the audio into the speech forced alignment model to be evaluated, and using the speech forced alignment model to be evaluated to obtain the phoneme sequence corresponding to each segment of audio
- and the predicted start and end times of each phoneme of each phoneme sequence; then, according to the predicted start and end times and the known reference start and end times of the corresponding phoneme, the time accuracy score of each phoneme is obtained,
- and based on the time accuracy score of each phoneme, the temporal accuracy score of the speech forced alignment model to be evaluated is obtained, thereby realizing the evaluation of the speech forced alignment model to be evaluated.
- In the evaluation method provided by the embodiment of the present disclosure, the time accuracy score of each phoneme is obtained from the closeness of its predicted start and end times to the reference start and end times, from which the time accuracy score of the speech forced alignment model to be evaluated follows. It is not necessary to manually re-test each time the predicted start and end times are obtained through the model, nor to evaluate indirectly through subsequent speech synthesis. This simplifies the accuracy evaluation of the forced alignment model while reducing the labor cost and time cost required, improving efficiency.
- FIG. 1 is a schematic flowchart of a method for evaluating a speech forced alignment model provided by an embodiment of the present disclosure.
- the speech forced alignment model evaluation method includes the following steps:
- Step S10 Using the speech forced alignment model to be evaluated, according to each segment of audio in the test set and the text corresponding to each segment of audio, obtain the phoneme sequence corresponding to each segment of audio and the predicted start and end times of each phoneme in the phoneme sequence.
- The method for evaluating the speech forced alignment model provided by the embodiment of the present disclosure is used to evaluate the speech forced alignment effect of the speech forced alignment model to be evaluated. Therefore, it is necessary to first establish, or obtain an already established, speech forced alignment model, that is, the speech forced alignment model to be evaluated.
- the prediction start and end time may include a time span from the prediction start time to the prediction end time.
- In some embodiments, the speech forced alignment model to be evaluated may include a GMM (Gaussian mixture model) and a Viterbi decoding model: each segment of audio in the test set and the text corresponding to each segment of audio are input into the GMM to obtain the undecoded
- phoneme sequence and predicted start and end times, which are then decoded by the Viterbi decoding model to obtain the decoded phoneme sequence and predicted start and end times.
- Step S11 For each phoneme, obtain the time accuracy score of the phoneme according to the predicted start and end time of the phoneme and the predetermined reference start and end time of the phoneme.
- The time accuracy score is used to represent the closeness of the predicted start and end times of the phoneme to the reference start and end times.
- The reference start and end times are the phoneme start and end times used as the evaluation reference, and can be obtained by manual annotation.
- The time accuracy score is obtained phoneme by phoneme until every phoneme has been scored.
- FIG. 2 is a schematic flowchart of obtaining the temporal accuracy score of each phoneme in the method for evaluating a speech forced alignment model provided by the embodiment of the present disclosure.
- the temporal accuracy score of each phoneme can be obtained through the following steps.
- Step S110 For each phoneme, according to the predicted start and end time of the phoneme and the predetermined reference start and end time of the phoneme, obtain the start and end time intersection and start and end time union of the predicted start and end time of the phoneme and the reference start and end time.
- The intersection of the start and end times refers to the overlapping time of the predicted start and end times and the reference start and end times of the same phoneme,
- and the union of the start and end times refers to the overall time covered by the predicted start and end times and the reference start and end times of the same phoneme.
- For example, if the predicted start and end times run from the 3rd to the 5th ms and the reference start and end times run from the 4th to the 6th ms, then the intersection of the start and end times is from the 4th to the 5th ms and the union of the start and end times is from the 3rd to the 6th ms.
- Step S111 Obtain the time accuracy score of each phoneme according to the ratio of the intersection of the start and end times of each phoneme to the union of the start and end times.
- The ratio of the intersection to the union then gives the time accuracy score of each phoneme.
- Continuing the example, for a phoneme "b" with these times, the time accuracy score is (4th to 5th ms) / (3rd to 6th ms), i.e. 1 ms / 3 ms = 1/3.
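The intersection-over-union computation above can be sketched as a small helper; intervals are (start, end) pairs in milliseconds, and the function name is illustrative, not taken from the patent:

```python
def time_accuracy_score(pred, ref):
    """Time accuracy score of one phoneme: length of the intersection
    of the predicted and reference (start, end) intervals divided by
    the length of their union. Times are in milliseconds."""
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    # For overlapping intervals this equals max(end) - min(start);
    # for disjoint intervals the score is 0 anyway, since inter == 0.
    union = (pred[1] - pred[0]) + (ref[1] - ref[0]) - inter
    return inter / union if union > 0 else 0.0

# The document's example: predicted 3rd-5th ms, reference 4th-6th ms.
print(time_accuracy_score((3, 5), (4, 6)))  # prints 0.3333333333333333
```

A perfect prediction scores 1, and non-overlapping intervals score 0, so the score is bounded in [0, 1] regardless of phoneme duration.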
- The intersection of the start and end times represents the coincidence of the predicted start and end times with the reference start and end times,
- and the union of the start and end times represents the maximum overall extent of the predicted and reference start and end times.
- Their ratio therefore accurately represents the degree of overlap, so the phoneme time accuracy score accurately characterizes the closeness of the predicted start and end times to the reference start and end times.
- Step S12 According to the temporal accuracy scores of each phoneme, determine the temporal accuracy scores of the to-be-evaluated speech forced alignment model.
- the time accuracy score of the speech forced alignment model to be evaluated can be obtained by further using the time accuracy score of each phoneme.
- The temporal accuracy scores of all phonemes in the test set can be directly summed to obtain the temporal accuracy score of the speech forced alignment model to be evaluated.
- In the evaluation method provided by the embodiment of the present disclosure, the time accuracy score of each phoneme is obtained from the closeness of its predicted start and end times to the reference start and end times, from which the time accuracy score of the speech forced alignment model to be evaluated follows. It is not necessary to manually re-test each time the predicted start and end times are obtained through the model, nor to evaluate indirectly through subsequent speech synthesis. This simplifies the accuracy evaluation of the forced alignment model while reducing the labor cost and time cost required, improving efficiency.
- FIG. 3 is another schematic flowchart of the speech forced alignment model evaluation method provided by the embodiment of the present disclosure.
- Step S20 Using the speech forced alignment model to be evaluated, according to each segment of audio in the test set and the text corresponding to each segment of audio, obtain the phoneme sequence corresponding to each segment of audio and the predicted start and end times of each phoneme in the phoneme sequence.
- For the specific content of step S20, please refer to the description of step S10 in FIG. 1, which will not be repeated here.
- Step S21 For each phoneme, obtain the time accuracy score of the phoneme according to the predicted start and end time of the phoneme and the predetermined reference start and end time of the phoneme.
- For the specific content of step S21, please refer to the description of step S11 in FIG. 1, which will not be repeated here.
- Step S22 Determine the current phoneme, and construct the phoneme combination of the current phoneme to obtain the phoneme combination of each phoneme.
- the phoneme combination includes the current phoneme and at least one phoneme adjacent to the current phoneme, and the combination manner of the phoneme combination of each phoneme is the same.
- Each phoneme in the phoneme sequence is taken in turn as the current phoneme; at least one phoneme adjacent to the current phoneme is then determined and forms a phoneme combination together with the current phoneme, giving
- the phoneme combination corresponding to the current phoneme. Determining each phoneme in the phoneme sequence one by one as the current phoneme yields the phoneme combination corresponding to each phoneme in the phoneme sequence.
- If the phoneme combination is composed of 2 phonemes, each phoneme of the phoneme sequence constructs a 2-phoneme combination in the same manner: the combination may consist of the adjacent phoneme before the current phoneme together with the current phoneme, or of the current phoneme together with the adjacent phoneme after it. If the phoneme combination is composed of 3 phonemes, each phoneme in the phoneme sequence constructs a 3-phoneme
- combination in the same manner, formed by the adjacent phonemes before and after the current phoneme together with the current phoneme. If the phoneme combination is composed of 4 phonemes, each phoneme constructs a 4-phoneme combination in the same manner: it may consist of the 2 phonemes before the current phoneme, the current phoneme, and the phoneme after it, or of the phoneme before the current phoneme, the current phoneme, and the 2 phonemes after it.
- For example, when "t" is determined as the current phoneme: if the phoneme combination consists of 2 phonemes, the phoneme combination of the current phoneme "t" can be "in t" or "t ian"; either one, or both, can be used as the phoneme combination of "t". If the phoneme combination consists of 3 phonemes, the phoneme combination of the current phoneme "t" can be "in t ian". If the phoneme combination consists of 4 phonemes, it can be "j in t ian" or "in t ian + silence"; again, either one, or both, can be used as the phoneme combination of "t".
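The windowed construction in the "jintian" example can be sketched as follows; the helper enumerates every run of `size` adjacent phonemes containing the current phoneme, and the function name and list representation are illustrative assumptions, not from the patent:

```python
def phoneme_combinations(phonemes, idx, size):
    """Enumerate the contiguous runs of `size` adjacent phonemes that
    contain the current phoneme at index `idx`; the method may pick
    any one (or several) of them as the phoneme combination."""
    combos = []
    for start in range(idx - size + 1, idx + 1):
        if 0 <= start and start + size <= len(phonemes):
            combos.append(phonemes[start:start + size])
    return combos

# "jintian" as phonemes, with a trailing silence marker:
seq = ["j", "in", "t", "ian", "sil"]
print(phoneme_combinations(seq, 2, 2))  # 2-phoneme combinations of "t"
```

For the 3-phoneme case the text picks the centered window, ["in", "t", "ian"], which is among the candidates this enumerator returns.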
- A 2-phoneme combination, a 3-phoneme combination, and a 4-phoneme combination can also all be used as phoneme combinations of the same phoneme.
- the current phoneme and adjacent phonemes are taken into account to form a phoneme combination, which can provide corrections for the subsequent time accuracy score of the current phoneme.
- Step S23 Obtain the time accuracy correction score of the current phoneme in each phoneme combination according to the time accuracy score of each phoneme in each phoneme combination, so as to obtain the time accuracy correction score of each phoneme in the phoneme sequence.
- the time accuracy correction score of the current phoneme is obtained by using the time accuracy score of each phoneme in the phoneme combination corresponding to the current phoneme.
- For example, suppose the phoneme combination is composed of 3 phonemes,
- and take the phoneme combination "in t ian" of the current phoneme "t" as an example;
- the time accuracy correction score of the current phoneme "t" can then be:
- Score(t)' = (Score(in) + Score(t) + Score(ian)) / 3.
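The averaging step can be sketched as below; `scores` maps each phoneme to the time accuracy score from the previous step, the numeric values are hypothetical, and the function name is an illustrative assumption:

```python
def correction_score(scores, combination):
    """Time accuracy correction score of the current phoneme: the mean
    of the time accuracy scores of every phoneme in its combination,
    e.g. Score(t)' = (Score(in) + Score(t) + Score(ian)) / 3."""
    return sum(scores[p] for p in combination) / len(combination)

# Hypothetical per-phoneme scores for the "in t ian" combination:
scores = {"in": 0.5, "t": 0.25, "ian": 0.75}
print(correction_score(scores, ["in", "t", "ian"]))  # prints 0.5
```

A low score on "t" alone is thus smoothed by its neighbours, reflecting the contextual influence the text describes.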
- Step S24 According to the time accuracy correction score of each phoneme in the phoneme sequence, the time accuracy score of the to-be-evaluated speech forced alignment model is obtained.
- step S24 can refer to the content of step S12 shown in FIG. 1 , except that the time accuracy score of each phoneme is replaced by the time accuracy correction score of each phoneme, and other content will not be repeated.
- The speech forced alignment model evaluation method uses the temporal accuracy score of at least one phoneme adjacent to the current phoneme to correct the temporal accuracy score of the current phoneme, exploiting the context information of the current phoneme.
- By taking into account the influence of neighboring phonemes on the current phoneme, the revised temporal accuracy score of the current phoneme has higher accuracy.
- An embodiment of the present disclosure also provides another method for evaluating a speech forced alignment model. Please refer to FIG. 4, which is another schematic flowchart of the method for evaluating a speech forced alignment model provided by an embodiment of the present disclosure.
- the speech forced alignment model evaluation method includes:
- Step S30 Using the to-be-evaluated speech forced alignment model, obtain the phoneme sequence corresponding to each audio segment and the predicted start and end times of each phoneme in the phoneme sequence according to each segment of audio in the test set and the text corresponding to each segment of audio.
- For the specific content of step S30, please refer to the description of step S10 in FIG. 1, which will not be repeated here.
- Step S31 For each phoneme, obtain the time accuracy score of the phoneme according to the predicted start and end time of the phoneme and the predetermined reference start and end time of the phoneme.
- For the specific content of step S31, please refer to the description of step S11 in FIG. 1, which will not be repeated here.
- Step S32 Determine the current phoneme, and construct the phoneme combination of the current phoneme to obtain the phoneme combination of each phoneme.
- For the specific content of step S32, please refer to the description of step S22 in FIG. 3, which will not be repeated here.
- Step S33 Classify each phoneme combination according to the pronunciation mode of each phoneme in the combination to obtain the combination category of the phoneme combination; then, according to the combination category of each phoneme combination, determine the number of phoneme combinations of the same combination category and the corresponding combination weight of each phoneme combination.
- After the phoneme combination of each current phoneme is obtained, it is classified according to the pronunciation mode of each phoneme of the combination.
- Since different pronunciation modes of adjacent phonemes have a certain impact on the parameters of the current phoneme, the phoneme combinations can be classified by the pronunciation mode of each of their phonemes, determining the combination category of each phoneme combination.
- From the combination category of each phoneme combination, the number of phoneme combinations of the same category can be determined,
- from which the combination weight of each category of phoneme combination is obtained; the weight score of each phoneme is then obtained according to the combination weight. This reduces the variation in the time accuracy score of the speech forced alignment model to be evaluated caused by differences in the numbers of phonemes of each category obtained from the test set,
- and thereby improves the evaluation accuracy of the speech forced alignment model evaluation method provided by the embodiment of the present disclosure.
- In some embodiments, the pronunciation modes can be divided according to initials and finals, into initial pronunciation modes and final pronunciation modes: the initial pronunciation modes include the part pronunciation mode, classified by pronunciation part, and the method pronunciation mode, classified by pronunciation method, while the final pronunciation modes include the structure pronunciation mode, classified by structure, and the mouth pronunciation mode, classified by mouth shape.
- The pronunciation modes can also be divided according to the pronunciation of other languages, such as English.
- In some embodiments, the pronunciation modes of the initials and finals can be combined to obtain specific classification categories, for example, two-phoneme combinations: bilabial + nasal final, or nasal final + labiodental; three-phoneme combinations: bilabial + nasal final + labiodental, single final + bilabial + single final, or single-final open call + bilabial stop + single-final qi-tooth call; four-phoneme combinations: single final + bilabial + single final + bilabial.
- the combination weight of each phoneme combination is further acquired.
- the combination weight is the ratio of the number of phoneme combinations of the same combination category to the total number of phonemes in the phoneme sequence.
- For example, suppose a phoneme sequence includes 100 phonemes and each phoneme forms one phoneme combination, so that 100 phoneme combinations are formed. The combination category of each phoneme combination is determined according to the pronunciation mode of each phoneme in it, and the phoneme combinations are then classified; assume that a total of 3 combination categories are formed.
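The counting-and-normalizing step described above can be sketched as follows. This is a minimal illustration, not the patented implementation; the function name and the category labels are invented for the example, and the reading that each phoneme contributes exactly one combination follows the 100-phonemes example above.

```python
from collections import Counter

def combination_weights(categories):
    # categories[i] is the combination category of the i-th phoneme's
    # phoneme combination (one combination per phoneme).
    # The combination weight of a phoneme is the number of combinations
    # sharing its category divided by the total number of phonemes.
    counts = Counter(categories)
    total = len(categories)
    return [counts[c] / total for c in categories]

# 100 phonemes falling into 3 combination categories, as in the example above
cats = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
weights = combination_weights(cats)
```

Under this reading, a phoneme whose combination falls in category "A" gets weight 50/100 = 0.5.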
- Step S34: Obtain the time accuracy correction score of the current phoneme according to the time accuracy score of each phoneme in the phoneme combination of the current phoneme.
- For the specific content of step S34, please refer to the description of step S23 shown in FIG. 3; details are not repeated here.
- The execution order of step S33 and step S34 is not limited: the time accuracy correction score can also be obtained first, and the combination weight obtained afterwards.
- Step S35: For each phoneme, obtain the weight score of the phoneme according to the time accuracy correction score of the phoneme and the combination weight of the phoneme combination corresponding to the phoneme.
- Based on the combination weight and the time accuracy correction score of the phoneme, the weight score of the phoneme is obtained.
- The combination weight and the time accuracy correction score are both obtained based on the same phoneme combination of the same phoneme, so the two correspond to each other.
- The weight score of each phoneme is obtained by multiplying the combination weight by the time accuracy correction score.
- Step S36 According to the weight score of each phoneme of the phoneme sequence, obtain the time accuracy score of the forced alignment model of the speech to be evaluated.
- the time accuracy score of the forced alignment model of the speech to be evaluated can be obtained through the weight score of each phoneme.
- The time accuracy score of the speech forced alignment model to be evaluated is obtained by the following formula: Score_model = W1*Score1 + W2*Score2 + ... + Wn*Scoren, where Score_model is the time accuracy score of the speech forced alignment model to be evaluated, Wn is the combination weight of the nth phoneme, and Scoren is the time accuracy correction score of the nth phoneme.
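Read as code, the weighted sum Score_model = W1*Score1 + ... + Wn*Scoren might look like the following sketch, under the assumption that the weights and correction scores are supplied as parallel lists (the function name is illustrative):

```python
def model_time_accuracy(weights, correction_scores):
    # Score_model = W1*Score1 + W2*Score2 + ... + Wn*Scoren, where
    # weights[n] is the combination weight of the (n+1)-th phoneme and
    # correction_scores[n] is its time accuracy correction score.
    if len(weights) != len(correction_scores):
        raise ValueError("one weight per phoneme is required")
    return sum(w * s for w, s in zip(weights, correction_scores))
```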
- Using the weight scores reduces the influence, on the time accuracy score, of differences in the number of phonemes in the phoneme sequences predicted by different speech forced alignment models under evaluation, further improving the accuracy of the evaluation.
- The phoneme combinations of each phoneme may include both a diphone combination composed of 2 phonemes and a triphone combination composed of 3 phonemes.
- The diphone combination includes the current phoneme and one phoneme directly adjacent to the current phoneme, and the triphone combination includes the current phoneme and the two phonemes directly adjacent to the current phoneme. The time accuracy correction score of the current phoneme is then calculated separately for each phoneme combination, yielding multiple time accuracy correction scores for the same phoneme, namely a diphone time accuracy correction score and a triphone time accuracy correction score, and the corresponding weight scores of the phoneme are obtained respectively.
- FIG. 5 is a schematic flowchart of the step of obtaining the time accuracy score of the speech forced alignment model to be evaluated according to an embodiment of the present disclosure; the step may include:
- Step S361 Obtain the fusion weight score of the current phoneme according to the diphone weight score and the triphone weight score of the current phoneme.
- The fusion weight score can be obtained by the following formula: score = v2*score″ + v3*score‴, where v2 + v3 = 1 and v3 > v2; score is the fusion weight score, score″ is the diphone weight score, v2 is the diphone fusion factor, score‴ is the triphone weight score, and v3 is the triphone fusion factor.
- In this way, the fusion of the different weight scores of the same phoneme can be realized simply, and making the triphone fusion factor greater than the diphone fusion factor highlights the influence of the triphone combination, further improving accuracy.
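A minimal sketch of this two-way fusion, assuming the constraint v2 + v3 = 1 with v3 > v2 stated above; the concrete default values 0.4 and 0.6 are arbitrary illustrations, not values given by the disclosure:

```python
def fuse_weight_scores(diphone_score, triphone_score, v2=0.4, v3=0.6):
    # score = v2*score'' + v3*score''' with v2 + v3 = 1 and v3 > v2,
    # so the triphone combination dominates the fused score.
    if abs(v2 + v3 - 1.0) > 1e-9 or v3 <= v2:
        raise ValueError("factors must satisfy v2 + v3 = 1 and v3 > v2")
    return v2 * diphone_score + v3 * triphone_score
```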
- Step S362 According to the fusion weight score of each phoneme in the phoneme sequence, obtain the time accuracy score of the speech forced alignment model to be evaluated.
- the time accuracy score of the forced alignment model of the speech to be evaluated can be obtained.
- For the specific content, please refer to the description of step S12 in FIG. 1, which will not be repeated here.
- Each phoneme can also have 3 phoneme combinations: in addition to the diphone combination composed of 2 phonemes and the triphone combination composed of 3 phonemes, a tetraphone combination composed of 4 phonemes is also included.
- For the tetraphone combination, the tetraphone combination category and tetraphone combination weight of the phoneme, as well as the tetraphone weight score, are also obtained. In this case, the step of obtaining the time accuracy score of the speech forced alignment model to be evaluated according to the weight score of each phoneme of the phoneme sequence can include:
- obtaining the fusion weight score of the current phoneme according to the diphone weight score, the triphone weight score and the tetraphone weight score of the current phoneme, and then obtaining the time accuracy score of the speech forced alignment model to be evaluated according to the fusion weight score of each phoneme.
- The fusion weight score can be obtained by the following formula: score = v2*score″ + v3*score‴ + v4*score″″, where v2 + v3 + v4 = 1, v3 > v2 and v3 > v4; score is the fusion weight score, score″ is the diphone weight score, v2 is the diphone fusion factor, score‴ is the triphone weight score, v3 is the triphone fusion factor, score″″ is the tetraphone weight score, and v4 is the tetraphone fusion factor.
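The three-way variant can be sketched the same way, assuming the constraints v2 + v3 + v4 = 1, v3 > v2 and v3 > v4 stated for the formula; the default factor values are again only illustrative:

```python
def fuse_weight_scores_tetra(di_score, tri_score, tetra_score,
                             v2=0.3, v3=0.5, v4=0.2):
    # score = v2*score'' + v3*score''' + v4*score'''' with
    # v2 + v3 + v4 = 1, v3 > v2 and v3 > v4, so the triphone
    # combination still carries the largest share of the fusion.
    if abs(v2 + v3 + v4 - 1.0) > 1e-9 or v3 <= v2 or v3 <= v4:
        raise ValueError("factors must satisfy v2+v3+v4 = 1, v3 > v2, v3 > v4")
    return v2 * di_score + v3 * tri_score + v4 * tetra_score
```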
- In this way, the fusion of the different weight scores of the same phoneme can be realized simply, and making the triphone fusion factor greater than both the diphone fusion factor and the tetraphone fusion factor highlights the influence of the triphone combination, further improving accuracy.
- the apparatus for evaluating the forced alignment model for speech provided by the embodiment of the present disclosure will be introduced below.
- The apparatus for evaluating the speech forced alignment model described below may be regarded as the functional modules that an electronic device (such as a PC) needs in order to implement the speech forced alignment model evaluation provided by the embodiments of the present disclosure.
- the contents of the apparatus for evaluating the forced speech alignment model described below can be referred to each other in correspondence with the contents of the method for evaluating the forced speech alignment model described above.
- FIG. 6 is a block diagram of an apparatus for evaluating a speech forced alignment model provided by an embodiment of the present disclosure.
- The apparatus can be applied to both a client and a server, and may include:
- the first obtaining unit 100 is configured to use the speech forced alignment model to be evaluated to obtain, according to each segment of audio in the test set and the text corresponding to each segment of audio, the phoneme sequence corresponding to each segment of audio and the predicted start and end time of each phoneme in the phoneme sequence;
- the second obtaining unit 110 is configured to, for each phoneme, obtain a time accuracy score of the phoneme according to the predicted start and end time of the phoneme and the predetermined reference start and end time of the phoneme, wherein the time accuracy score is used to represent How close the predicted start and end times of the phoneme are to the reference start and end times;
- the third obtaining unit 120 is configured to determine, according to the time accuracy score of each phoneme, the time accuracy score of the speech forced alignment model to be evaluated.
- Each segment of audio in the test set and the text corresponding to each segment of audio are input into the speech forced alignment model to be evaluated, so as to obtain the phoneme sequence corresponding to each segment of audio and the predicted start and end time of each phoneme.
- the prediction start and end time may include a time span from the prediction start time to the prediction end time.
- The speech forced alignment model to be evaluated may include a GMM (Gaussian mixture model) and a Viterbi decoding model: each segment of audio in the test set and the text corresponding to each segment of audio are input into the GMM to obtain the undecoded phoneme sequence and predicted start and end times, which are then decoded by the Viterbi decoding model to obtain the decoded phoneme sequence and the predicted start and end times.
- the time accuracy score is the degree of closeness between the predicted start and end times corresponding to each of the phonemes and the corresponding reference start and end times.
- the reference start and end time refers to the phoneme start and end time used as an evaluation reference, and can be obtained by manual annotation.
- the time accuracy score of the phoneme is obtained, until the time accuracy score of each phoneme is obtained.
- the second obtaining unit 110 includes:
- the third acquisition subunit is configured to obtain the intersection of the start and end times and the union of the start and end times of the predicted start and end times of each phoneme and the reference start and end times according to the predicted start and end times of each phoneme;
- the fourth obtaining subunit is configured to obtain the ratio of the intersection of the start and end times of each phoneme to the union of the start and end times, and obtain the time accuracy score of each phoneme.
- The intersection of the start and end times refers to the overlapping time between the predicted start and end times and the reference start and end times of the same phoneme, and the union of the start and end times refers to the overall time covered by the predicted start and end times and the reference start and end times of the same phoneme.
- The ratio of the intersection to the union is then obtained, giving the time accuracy score of each phoneme.
- The intersection of start and end times represents the coincidence between the predicted start and end times and the reference start and end times, and the union of start and end times represents the maximum overall span of the predicted start and end times and the reference start and end times.
- The ratio of the two therefore accurately represents the degree of coincidence of the start and end times, so that the obtained phoneme time accuracy score accurately represents how close the predicted start and end times are to the reference start and end times.
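The intersection-over-union reading of the score can be sketched as follows, treating each start and end time as a (start, end) interval in seconds. The function name is illustrative, and the handling of non-overlapping intervals (score 0) is an assumption of the sketch:

```python
def time_accuracy_score(predicted, reference):
    # Each argument is a (start, end) pair for one phoneme.
    (ps, pe), (rs, re) = predicted, reference
    intersection = max(0.0, min(pe, re) - max(ps, rs))  # overlapping time
    union = max(pe, re) - min(ps, rs)                   # overall span of both
    return intersection / union if union > 0 else 0.0
```

Identical intervals score 1.0, and the score falls toward 0 as the predicted interval drifts away from the reference interval.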
- the third obtaining unit 120 can obtain the temporal accuracy score of the speech forced alignment model to be evaluated through the temporal accuracy score of each phoneme.
- the temporal accuracy scores of each phoneme in the test set can be directly added to obtain the temporal accuracy scores of the speech forced alignment model to be evaluated.
- When the evaluation apparatus evaluates the speech forced alignment model to be evaluated, the time accuracy score of each phoneme can be obtained based on how close the predicted start and end times of that phoneme are to the reference start and end times, and the time accuracy score of the model can then be obtained. There is no need to manually re-test each time the predicted start and end times are obtained through the speech forced alignment model, nor to verify them through subsequently synthesized speech. This simplifies the accuracy evaluation of the forced alignment model, reduces the labor and time costs required, and improves efficiency.
- an embodiment of the present disclosure further provides a device for evaluating a speech forced alignment model.
- the apparatus for evaluating the speech forced alignment model provided by the embodiment of the present disclosure further includes:
- the fourth obtaining unit 130 is configured to determine the current phoneme, and construct the phoneme combination of the current phoneme, so as to obtain the phoneme combination of each phoneme.
- the phoneme combination includes the current phoneme and at least one phoneme adjacent to the current phoneme, and the combination manner of the phoneme combination of each phoneme is the same.
- Each phoneme in the phoneme sequence is taken in turn as the current phoneme; at least one phoneme adjacent to the current phoneme is determined and forms a phoneme combination together with the current phoneme, yielding the phoneme combination corresponding to the current phoneme. By determining each phoneme in the phoneme sequence one by one as the current phoneme, the phoneme combination corresponding to each phoneme in the phoneme sequence is obtained.
- The phoneme combination may consist of 2 phonemes, of 3 phonemes, or of 4 phonemes; in each case at least one phoneme adjacent to the current phoneme is determined and, together with the current phoneme, forms the phoneme combination.
- the current phoneme and adjacent phonemes are taken into account to form a phoneme combination, which can provide corrections for the subsequent time accuracy score of the current phoneme.
- the third obtaining unit 120 includes:
- the first acquisition subunit is configured to obtain the time accuracy correction score of the current phoneme in each phoneme combination according to the time accuracy score of each phoneme in each phoneme combination, so as to obtain the time accuracy correction score of each phoneme in the phoneme sequence; and
- the second obtaining subunit is configured to obtain the time accuracy score of the forced alignment model of the speech to be evaluated according to the time accuracy correction score of each phoneme in the phoneme sequence.
- the time accuracy correction score of the current phoneme is obtained by using the time accuracy score of each phoneme in the phoneme combination corresponding to the current phoneme.
- For example, if the phoneme combination of the current phoneme "t" is "in t ian", the time accuracy correction score of the current phoneme "t" can be:
- Score(t)' = (Score(in) + Score(t) + Score(ian)) / 3.
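Averaging over the combination, as in the Score(t)' example above, can be sketched like this for triphone combinations. The function name is invented, and the shrinking of the averaging window at sequence boundaries is an assumption of the sketch rather than something the disclosure specifies:

```python
def triphone_correction_scores(scores):
    # scores[i] is the time accuracy score of the i-th phoneme.
    # The correction score of each phoneme is the mean over its phoneme
    # combination: the phoneme itself plus its directly adjacent neighbors.
    n = len(scores)
    corrected = []
    for i in range(n):
        window = scores[max(0, i - 1):min(n, i + 2)]
        corrected.append(sum(window) / len(window))
    return corrected
```

For scores [Score(in), Score(t), Score(ian)], the middle entry reproduces (Score(in) + Score(t) + Score(ian)) / 3.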
- the time accuracy score of the forced alignment model of the speech to be evaluated is obtained by using the time accuracy correction score of each phoneme.
- The evaluation apparatus uses the time accuracy score of at least one phoneme adjacent to the current phoneme to correct the time accuracy score of the current phoneme, thereby making use of the context information of the current phoneme. Because the influence of neighboring phonemes on the current phoneme is taken into account, the corrected time accuracy score of the current phoneme has higher accuracy.
- the apparatus for evaluating the forced alignment model of speech provided by the embodiment of the present disclosure further includes:
- the fifth obtaining unit 140 is configured to classify the phoneme combination according to the pronunciation mode of each phoneme in the phoneme combination, obtain the combination category of the phoneme combination, and determine, according to the combination category of each phoneme combination, the number of phoneme combinations of the same combination category and the corresponding combination weight.
- the second obtaining subunit included in the third obtaining unit 120 includes:
- the first acquisition module is configured to, for each phoneme, obtain the weight score of the phoneme according to the time accuracy correction score of the phoneme and the combination weight of the phoneme combination corresponding to the phoneme;
- the second obtaining module is configured to obtain the time accuracy score of the forced alignment model of the speech to be evaluated according to the weight score of each phoneme of the phoneme sequence.
- After the phoneme combination of each current phoneme is obtained, the combinations are classified according to the pronunciation mode of each phoneme in the phoneme combination.
- Different pronunciation modes of adjacent phonemes have a certain influence on the parameters of the current phoneme; the combinations can therefore be classified according to the pronunciation mode of each phoneme in the phoneme combination, and the combination category of each phoneme combination determined.
- According to the combination category of each phoneme combination, the number of phoneme combinations of the same category can be determined, and the combination weight of each category obtained; the weight score of each phoneme is then obtained according to the combination weight, which reduces the variation in the time accuracy score of the speech forced alignment model to be evaluated caused by differences in the number of phonemes obtained based on the test set.
- The pronunciation modes can be divided according to initials and finals, including initial pronunciation modes and final pronunciation modes. The initial pronunciation modes include the place pronunciation mode, classified according to the place of articulation, and the method pronunciation mode, classified according to the method of articulation; the final pronunciation modes include the structure pronunciation mode, classified according to the structure of the final, and the mouth-shape pronunciation mode, classified according to the mouth shape.
- the combination weight of each phoneme combination is further acquired.
- the combination weight is the ratio of the number of phoneme combinations in the same combination category to the total number of phonemes in the phoneme sequence.
- For example, suppose a phoneme sequence includes 100 phonemes and each phoneme forms one phoneme combination, so that 100 phoneme combinations are formed. The combination category of each phoneme combination is determined according to the pronunciation mode of each phoneme in it, and the phoneme combinations are then classified; assume that a total of 3 combination categories are formed.
- The weight score of the phoneme is obtained based on the combination weight and the time accuracy correction score of the phoneme.
- The combination weight and the time accuracy correction score are both obtained based on the same phoneme combination of the same phoneme, so the two correspond to each other.
- the weighted score of each of the phonemes is obtained by multiplying the combined weight by the temporal accuracy correction score.
- the time accuracy score of the forced alignment model of the speech to be evaluated can be obtained through the weight score of each phoneme.
- The time accuracy score of the speech forced alignment model to be evaluated is obtained by the following formula: Score_model = W1*Score1 + W2*Score2 + ... + Wn*Scoren, where Score_model is the time accuracy score of the speech forced alignment model to be evaluated, Wn is the combination weight of the nth phoneme, and Scoren is the time accuracy correction score of the nth phoneme.
- Using the weight scores reduces the influence, on the time accuracy score, of differences in the number of phonemes in the phoneme sequences predicted by different speech forced alignment models under evaluation, further improving the accuracy of the evaluation.
- Multiple phoneme combinations of the same phoneme may also be constructed; the phoneme combinations of each phoneme may include a diphone combination consisting of 2 phonemes and a triphone combination consisting of 3 phonemes. The diphone combination includes the current phoneme and one phoneme directly adjacent to the current phoneme, and the triphone combination includes the current phoneme and the two phonemes directly adjacent to the current phoneme.
- multiple phoneme combinations can be used to further improve the correction of the temporal accuracy score of the current phoneme.
- the temporal accuracy correction scores of the current phoneme of each phoneme combination need to be calculated separately, so as to obtain multiple temporal accuracy correction scores of the same phoneme.
- the diphone combination category and triphone combination category of the phoneme, as well as the diphone combination weight and the triphone combination weight, are obtained respectively.
- the combination weight includes the diphone combination weight and the triphone combination weight
- the temporal accuracy correction score includes the diphone temporal accuracy correction score and the triphone temporal accuracy correction score.
- the obtained weight scores include the diphone weight score and the triphone weight score.
- the second acquisition module in the second acquisition subunit included in the third acquisition unit 120 of the evaluation device includes:
- a first obtaining submodule configured to obtain the fusion weight score of the current phoneme according to the diphone weight score and the triphone weight score of the current phoneme;
- the second obtaining sub-module is configured to obtain the time accuracy score of the forced alignment model of the speech to be evaluated according to the fusion weight score of each phoneme of the phoneme sequence.
- The fusion weight score can be obtained by the following formula: score = v2*score″ + v3*score‴, where v2 + v3 = 1 and v3 > v2; score is the fusion weight score, score″ is the diphone weight score, v2 is the diphone fusion factor, score‴ is the triphone weight score, and v3 is the triphone fusion factor.
- In this way, the fusion of the different weight scores of the same phoneme can be realized simply, and making the triphone fusion factor greater than the diphone fusion factor highlights the influence of the triphone combination, further improving accuracy.
- the fusion weight score is obtained, and the time accuracy score of the forced alignment model of the speech to be evaluated is further obtained.
- The fourth obtaining unit 130 may also construct 3 phoneme combinations for each phoneme: in addition to the diphone combination consisting of 2 phonemes and the triphone combination consisting of 3 phonemes, a tetraphone combination consisting of 4 phonemes is also constructed.
- the fifth obtaining unit 140 is further configured to obtain the tetraphone combination category and the tetraphone combination weight of the phoneme.
- the first obtaining module in the second obtaining subunit included in the third obtaining unit 120 obtains a tetraphone weight score.
- the second acquisition module in the second acquisition subunit includes:
- a third obtaining submodule configured to obtain the fusion weight score of the current phoneme according to the diphone weight score, the triphone weight score and the tetraphone weight score of the current phoneme;
- the fourth obtaining sub-module is configured to obtain the time accuracy score of the forced alignment model of the speech to be evaluated according to the fusion weight score of each phoneme in the phoneme sequence.
- The fusion weight score can be obtained by the following formula: score = v2*score″ + v3*score‴ + v4*score″″, where v2 + v3 + v4 = 1, v3 > v2 and v3 > v4; score is the fusion weight score, score″ is the diphone weight score, v2 is the diphone fusion factor, score‴ is the triphone weight score, v3 is the triphone fusion factor, score″″ is the tetraphone weight score, and v4 is the tetraphone fusion factor.
- In this way, the fusion of the different weight scores of the same phoneme can be realized simply, and making the triphone fusion factor greater than both the diphone fusion factor and the tetraphone fusion factor highlights the influence of the triphone combination, further improving accuracy.
- the embodiment of the present disclosure also provides an electronic device, and the electronic device provided by the embodiment of the present disclosure can load the above-mentioned program module architecture in the form of a program, so as to realize the evaluation method of the speech forced alignment model provided by the embodiment of the present disclosure;
- In terms of hardware, the above can be applied to an electronic device with data processing capability; the electronic device can be, for example, a terminal device or a server device.
- FIG. 7 shows an optional hardware device architecture provided by an embodiment of the present disclosure, which may include: at least one memory 3 and at least one processor 1, wherein the memory stores a program and the processor calls the program to carry out the aforementioned speech forced alignment model evaluation method; and, in addition, at least one communication interface 2 and at least one communication bus 4.
- Processor 1 and memory 3 can be located in the same electronic device, for example processor 1 and memory 3 can be located in server device or terminal device; the processor 1 and the memory 3 may also be located in different electronic devices.
- the memory 3 may store a program
- the processor 1 may call the program to execute the speech forced alignment model evaluation method provided by the foregoing embodiments of the present disclosure.
- the electronic device may be a tablet computer, a notebook computer, or other device capable of performing evaluation of the speech forced alignment model.
- The number of each of the processor 1, the communication interface 2, the memory 3, and the communication bus 4 is at least one, and the processor 1, the communication interface 2, and the memory 3 communicate with each other through the communication bus 4. Obviously, the communication connection among the processor 1, the communication interface 2, the memory 3 and the communication bus 4 shown in FIG. 7 is only one optional arrangement.
- the communication interface 2 can be an interface of a communication module, such as an interface of a GSM module;
- the processor 1 may be a central processing unit (CPU), or a specific integrated circuit ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement the embodiments of the present disclosure.
- the memory 3 may include high-speed RAM memory, and may also include non-volatile memory, such as at least one disk memory.
- The above-mentioned device may also include other components (not shown) that are not necessary for understanding the disclosure of the embodiments of the present disclosure; these components are therefore not introduced one by one.
- Embodiments of the present disclosure further provide a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and when the instructions are executed by a processor, the above-described method for evaluating a speech forced alignment model can be implemented.
- When evaluating the speech forced alignment model to be evaluated, the computer-executable instructions stored in the storage medium can obtain the time accuracy score of each phoneme based on how close the predicted start and end times of that phoneme are to the reference start and end times, and then obtain the time accuracy score of the model. There is no need to manually re-test each time the predicted start and end times are obtained through the speech forced alignment model, nor to verify them through subsequently synthesized speech. This simplifies the accuracy evaluation of the forced alignment model, reduces the labor and time costs required, and improves efficiency.
- For hardware implementation, the embodiments of the present disclosure may be implemented by one or more of application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), and field programmable gate arrays (FPGAs).
- the embodiments of the present disclosure may be implemented in the form of modules, procedures, functions, and the like.
- Software codes may be stored in a memory unit and executed by a processor.
- the memory unit is located inside or outside the processor and can transmit and receive data to and from the processor via various known means.
Claims (24)
- 一种语音强制对齐模型评价方法,包括:利用待评价语音强制对齐模型,根据测试集中各段音频和与所述各段音频对应的文本,获取每段音频对应的音素序列以及该音素序列中各个音素的预测起止时间;针对各个音素,根据该音素的预测起止时间和预先确定的该音素的基准起止时间,获取该音素的时间准确性得分,其中,所述时间准确性得分用于表征该音素的所述预测起止时间与所述基准起止时间的接近程度;以及根据各个音素的时间准确性得分,确定所述待评价语音强制对齐模型的时间准确性得分。
- 如权利要求1所述的方法,其中,所述根据各个音素的时间准确性得分,确定所述待评价语音强制对齐模型的时间准确性得分之前,还包括:确定当前音素,构建所述当前音素的音素组合,以获取各个音素的音素组合,其中,所述当前音素的音素组合包括所述当前音素和与所述当前音素临近的至少一个音素,并且其中,各个音素的音素组合的组合方式相同;并且其中,所述根据各个音素的时间准确性得分,确定所述待评价语音强制对齐模型的时间准确性得分包括:根据各个音素组合中各个音素的时间准确性得分,获取各个音素组合中当前音素的时间准确性修正得分,以得到所述音素序列中各个音素的时间准确性修正得分;以及根据所述音素序列中各个音素的时间准确性修正得分,获取所述待评价语音强制对齐模型的时间准确性得分。
- 如权利要求2所述的方法,还包括:根据所述音素组合中各音素的发音方式对所述音素组合进行分类,得到所述音素组合的组合类别;以及根据各个音素组合的组合类别,确定同一组合类别的音素组合的数量以及对应的组合权重,其中,所述组合权重为同一组合类别的音素组合的数量与所述音素序列中音素的数量的比值;并且其中,所述根据所述音素序列中各个音素的时间准确性修正得分, 获取所述待评价语音强制对齐模型的时间准确性得分包括:针对各个音素,根据该音素的时间准确性修正得分和该音素对应的音素组合的组合权重,获取该音素的权重得分;以及根据所述音素序列中各个音素的权重得分,获取所述待评价语音强制对齐模型的时间准确性得分。
- 如权利要求3所述的方法,其中,所述当前音素的音素组合包括二音素组合和三音素组合,所述二音素组合包括所述当前音素和与所述当前音素直接相邻的一个音素,所述三音素组合包括所述当前音素和与所述当前音素直接相邻的两个音素;所述组合类别包括各个二音素组合类别和各个三音素组合类别,所述组合权重包括与各个所述二音素组合类别对应的二音素组合权重和与各个所述三音素组合类别对应的三音素组合权重,所述时间准确性修正得分包括所述当前音素的二音素时间准确性修正得分和三音素时间准确性修正得分,所述权重得分包括所述当前音素的二音素权重得分和三音素权重得分;并且其中,所述根据所述音素序列中各个音素的权重得分,获取所述待评价语音强制对齐模型的时间准确性得分包括:根据所述当前音素的二音素权重得分和三音素权重得分获取所述当前音素的融合权重得分;以及根据所述音素序列中各个音素的融合权重得分,获取所述待评价语音强制对齐模型的时间准确性得分。
- 如权利要求4所述的方法,其中,所述二音素组合包括所述当前音素和所述当前音素前的音素。
- 如权利要求4所述的方法,其中,所述融合权重得分通过以下公式获取:score=v2*score”+v3*score”’;其中:v2+v3=1,且v3>v2,score为融合权重得分,score”为二音素权重得分,v2为二音素融合因子,score”’为三音素权重得分,v3为三音素融合因子。
- 如权利要求4所述的方法,其中,所述当前音素的音素组合还包括四音素组合,所述四音素组合包括所述当前音素和与所述当前音素临近的 三个音素;所述组合类别还包括各个四音素组合类别,所述组合权重还包括与各个四音素组合类别对应的四音素组合权重,所述时间准确性修正得分还包括所述当前音素的四音素时间准确性修正得分,所述权重得分还包括所述当前音素的四音素权重得分;并且其中,所述根据所述音素序列中各个音素的权重得分,获取所述待评价语音强制对齐模型的时间准确性得分包括:根据所述当前音素的所述二音素权重得分、所述三音素权重得分和所述四音素权重得分获取所述当前音素的融合权重得分;以及根据所述音素序列中各个音素的融合权重得分,获取所述待评价语音强制对齐模型的时间准确性得分。
- 如权利要求7所述的方法,其中,所述融合权重得分通过以下公式获取:score=v2*score”+v3*score”’+v4*score””;其中:v2+v3+v4=1,且v3>v2,v3>v4,score为融合权重得分,score”为二音素权重得分,v2为二音素融合因子,score”’为三音素权重得分,v3为三音素融合因子,score””为四音素权重得分,v4为四音素融合因子。
- 如权利要求3-8中任一项所述的方法,其中,所述待评价语音强制对齐模型的时间准确性得分通过以下公式获取:Score模型=W1*Score1+W2*Score2……+Wn*Scoren,其中,Score模型为所述待评价语音强制对齐模型的时间准确性得分,Wn为第n个音素的组合权重,Scoren为第n个音素的时间准确性修正得分。
- 如权利要求3-8中任一项所述的方法,其中,所述发音方式包括声母发音方式和韵母发音方式,所述声母发音方式包括根据发音部位分类的部位发音方式和根据发音方法分类的方法发音方式,所述韵母发音方式包括根据结构分类的结构发音方式和根据口型分类的口型发音方式。
- 如权利要求1-8中任一项所述的方法,其中,针对各个音素,所述根据该音素的预测起止时间和预先确定的该音素的基准起止时间,获取该音素的时间准确性得分包括:获取各个音素的预测起止时间和基准起止时间的起止时间交集和起止时间并集;以及根据各个音素的起止时间交集与起止时间并集的比值,得到各个音素的时间准确性得分。
- 一种语音强制对齐模型评价装置,包括:第一获取单元,配置为利用待评价语音强制对齐模型,根据测试集中各段音频和与所述各段音频对应的文本,获取每段音频对应的音素序列以及该音素序列中各个音素的预测起止时间;第二获取单元,配置为针对各个音素,根据该音素的预测起止时间和预先确定的该音素的基准起止时间,获取该音素的时间准确性得分,其中,所述时间准确性得分用于表征该音素的所述预测起止时间与所述基准起止时间的接近程度;以及第三获取单元,配置为根据各个音素的时间准确性得分,确定所述待评价语音强制对齐模型的时间准确性得分。
- 根据权利要求12所述的装置,还包括第四获取单元,配置为:确定当前音素,构建所述当前音素的音素组合,以获取各个音素的音素组合,其中,所述当前音素的音素组合包括所述当前音素和与所述当前音素临近的至少一个音素,并且其中,各个音素的音素组合的组合方式相同;并且其中,所述第三获取单元包括:第一获取子单元,配置为根据各个音素组合中各个音素的时间准确性得分,获取各个音素组合中当前音素的时间准确性修正得分,以得到所述音素序列中各个音素的时间准确性修正得分;以及第二获取子单元,配置为根据所述音素序列中各个音素的时间准确性修正得分,获取所述待评价语音强制对齐模型的时间准确性得分。
- The device of claim 13, further comprising a fifth acquisition unit configured to: classify each phoneme combination according to the articulation manners of the phonemes in the phoneme combination to obtain the combination category of the phoneme combination; and determine, according to the combination categories of the respective phoneme combinations, the number of phoneme combinations in the same combination category and the corresponding combination weight, wherein the combination weight is the ratio of the number of phoneme combinations in the same combination category to the number of phonemes in the phoneme sequence; and wherein the second acquisition subunit comprises: a first acquisition module configured to obtain, for each phoneme, the weight score of the phoneme according to the temporal accuracy correction score of the phoneme and the combination weight of the phoneme combination corresponding to the phoneme; and a second acquisition module configured to obtain the temporal accuracy score of the speech forced alignment model to be evaluated according to the weight scores of the respective phonemes in the phoneme sequence.
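The combination-weight rule recited here (the count of combinations in a category divided by the number of phonemes in the sequence) can be illustrated with a short sketch. The articulation-manner lookup table below is an invented example for demonstration only; the patent classifies by initial/final articulation manners rather than these labels.

```python
from collections import Counter

# Invented example mapping from phoneme to an articulation-manner label.
MANNER = {"b": "plosive", "a": "open", "n": "nasal", "i": "closed"}

def combination_weights(combos, num_phonemes):
    # Group combinations by category (here, the tuple of manner labels of
    # their phonemes) and weight each category by count / sequence length.
    categories = [tuple(MANNER[p] for p in combo) for combo in combos]
    counts = Counter(categories)
    return {cat: n / num_phonemes for cat, n in counts.items()}

combos = [("b", "a"), ("a", "n"), ("n", "i"), ("b", "i")]
print(combination_weights(combos, 4))
```

Each combination then inherits its category's weight when the per-phoneme weight scores are computed.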
- The device of claim 14, wherein the phoneme combination of the current phoneme comprises a diphone combination and a triphone combination, the diphone combination comprising the current phoneme and one phoneme directly adjacent to the current phoneme, and the triphone combination comprising the current phoneme and the two phonemes directly adjacent to the current phoneme; the combination categories comprise respective diphone combination categories and respective triphone combination categories, the combination weights comprise diphone combination weights corresponding to the respective diphone combination categories and triphone combination weights corresponding to the respective triphone combination categories, the temporal accuracy correction score comprises a diphone temporal accuracy correction score and a triphone temporal accuracy correction score of the current phoneme, and the weight score comprises a diphone weight score and a triphone weight score of the current phoneme; and wherein the second acquisition module comprises: a first acquisition submodule configured to obtain the fused weight score of the current phoneme according to the diphone weight score and the triphone weight score of the current phoneme; and a second acquisition submodule configured to obtain the temporal accuracy score of the speech forced alignment model to be evaluated according to the fused weight scores of the respective phonemes in the phoneme sequence.
- The device of claim 15, wherein the diphone combination comprises the current phoneme and the phoneme preceding the current phoneme.
- The device of claim 15, wherein the fused weight score is obtained by the formula score = v2*score″ + v3*score‴, where v2 + v3 = 1 and v3 > v2; score is the fused weight score, score″ is the diphone weight score, v2 is the diphone fusion factor, score‴ is the triphone weight score, and v3 is the triphone fusion factor.
- The device of claim 15, wherein the phoneme combination of the current phoneme further comprises a quadphone combination, the quadphone combination comprising the current phoneme and three phonemes adjacent to the current phoneme; the combination categories further comprise respective quadphone combination categories, the combination weights further comprise quadphone combination weights corresponding to the respective quadphone combination categories, the temporal accuracy correction score further comprises a quadphone temporal accuracy correction score of the current phoneme, and the weight score further comprises a quadphone weight score of the current phoneme; and wherein the second acquisition module comprises: a third acquisition submodule configured to obtain the fused weight score of the current phoneme according to the diphone weight score, the triphone weight score, and the quadphone weight score of the current phoneme; and a fourth acquisition submodule configured to obtain the temporal accuracy score of the speech forced alignment model to be evaluated according to the fused weight scores of the respective phonemes in the phoneme sequence.
- The device of claim 18, wherein the fused weight score is obtained by the formula score = v2*score″ + v3*score‴ + v4*score⁗, where v2 + v3 + v4 = 1, v3 > v2, and v3 > v4; score is the fused weight score, score″ is the diphone weight score, v2 is the diphone fusion factor, score‴ is the triphone weight score, v3 is the triphone fusion factor, score⁗ is the quadphone weight score, and v4 is the quadphone fusion factor.
- The device of any one of claims 14-19, wherein the temporal accuracy score of the speech forced alignment model to be evaluated is obtained by the formula Score_model = W1*Score1 + W2*Score2 + … + Wn*Scoren, where Score_model is the temporal accuracy score of the speech forced alignment model to be evaluated, Wn is the combination weight of the n-th phoneme, and Scoren is the temporal accuracy correction score of the n-th phoneme.
- The device of any one of claims 14-19, wherein the articulation manners comprise initial (shengmu) articulation manners and final (yunmu) articulation manners, the initial articulation manners comprising place articulation manners classified by place of articulation and method articulation manners classified by method of articulation, and the final articulation manners comprising structure articulation manners classified by structure and mouth-shape articulation manners classified by mouth shape.
- The device of any one of claims 12-19, wherein the second acquisition unit comprises: a third acquisition subunit configured to obtain the intersection and the union of the predicted start-end time and the reference start-end time of each phoneme; and a fourth acquisition subunit configured to obtain the temporal accuracy score of each phoneme according to the ratio of the start-end time intersection to the start-end time union.
- A storage medium, wherein the storage medium stores a program adapted for speech forced alignment model evaluation, so as to implement the speech forced alignment model evaluation method of any one of claims 1-11.
- An electronic device, comprising: at least one memory; and at least one processor, wherein the memory stores a program and the processor invokes the program to perform the speech forced alignment model evaluation method of any one of claims 1-11.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2021336957A AU2021336957B2 (en) | 2020-09-07 | 2021-07-28 | Method for evaluating a speech forced alignment model, electronic device, and storage medium |
CA3194051A CA3194051C (en) | 2020-09-07 | 2021-07-28 | Method for evaluating a speech forced alignment model, electronic device, and storage medium |
US18/178,813 US11749257B2 (en) | 2020-09-07 | 2023-03-06 | Method for evaluating a speech forced alignment model, electronic device, and storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010925650.2 | 2020-09-07 | ||
CN202010925650.2A CN111798868B (zh) | 2020-09-07 | Speech forced alignment model evaluation method and apparatus, electronic device, and storage medium (zh) |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/178,813 Continuation US11749257B2 (en) | 2020-09-07 | 2023-03-06 | Method for evaluating a speech forced alignment model, electronic device, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022048354A1 true WO2022048354A1 (zh) | 2022-03-10 |
Family
ID=72834301
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/108899 WO2022048354A1 (zh) | 2020-09-07 | Speech forced alignment model evaluation method and apparatus, electronic device, and storage medium (zh) |
Country Status (5)
Country | Link |
---|---|
US (1) | US11749257B2 (zh) |
CN (1) | CN111798868B (zh) |
AU (1) | AU2021336957B2 (zh) |
CA (1) | CA3194051C (zh) |
WO (1) | WO2022048354A1 (zh) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112466272B (zh) * | 2020-10-23 | 2023-01-17 | Zhejiang Tonghuashun Intelligent Technology Co., Ltd. | Evaluation method, apparatus, device, and storage medium for a speech synthesis model |
US11798527B2 (en) | 2020-08-19 | 2023-10-24 | Zhejiang Tonghu Ashun Intelligent Technology Co., Ltd. | Systems and methods for synthesizing speech |
CN111798868B (zh) * | 2020-09-07 | 2020-12-08 | Beijing Century TAL Education Technology Co., Ltd. | Speech forced alignment model evaluation method and apparatus, electronic device, and storage medium |
CN112420015A (zh) * | 2020-11-18 | 2021-02-26 | Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. | Audio synthesis method, apparatus, device, and computer-readable storage medium |
CN112542159B (zh) * | 2020-12-01 | 2024-04-09 | Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. | Data processing method and device |
CN112908308B (zh) * | 2021-02-02 | 2024-05-14 | Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. | Audio processing method, apparatus, device, and medium |
CN112992184B (zh) * | 2021-04-20 | 2021-09-10 | Beijing Century TAL Education Technology Co., Ltd. | Pronunciation evaluation method and apparatus, electronic device, and storage medium |
CN117095672A (zh) * | 2023-07-12 | 2023-11-21 | Alipay (Hangzhou) Information Technology Co., Ltd. | Digital human lip shape generation method and apparatus |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7216079B1 (en) * | 1999-11-02 | 2007-05-08 | Speechworks International, Inc. | Method and apparatus for discriminative training of acoustic models of a speech recognition system |
CN101651788A (zh) * | 2008-12-26 | 2010-02-17 | Institute of Acoustics, Chinese Academy of Sciences | Online speech-text alignment system and method |
US20100324900A1 (en) * | 2009-06-19 | 2010-12-23 | Ronen Faifkov | Searching in Audio Speech |
CN109326277A (zh) * | 2018-12-05 | 2019-02-12 | Sichuan Changhong Electric Co., Ltd. | Semi-supervised phoneme forced alignment model building method and system |
CN109545243A (zh) * | 2019-01-23 | 2019-03-29 | Beijing Orion Star Technology Co., Ltd. | Pronunciation quality evaluation method and apparatus, electronic device, and storage medium |
CN109903752A (zh) * | 2018-05-28 | 2019-06-18 | Huawei Technologies Co., Ltd. | Method and apparatus for aligning speech |
WO2020027394A1 (ko) * | 2018-08-02 | 2020-02-06 | MediaZen Inc. | Apparatus and method for evaluating phoneme-level pronunciation accuracy |
CN111312231A (zh) * | 2020-05-14 | 2020-06-19 | Tencent Technology (Shenzhen) Co., Ltd. | Audio detection method and apparatus, electronic device, and readable storage medium |
CN111798868A (zh) * | 2020-09-07 | 2020-10-20 | Beijing Century TAL Education Technology Co., Ltd. | Speech forced alignment model evaluation method and apparatus, electronic device, and storage medium |
Family Cites Families (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5333275A (en) * | 1992-06-23 | 1994-07-26 | Wheatley Barbara J | System and method for time aligning speech |
US7146319B2 (en) * | 2003-03-31 | 2006-12-05 | Novauris Technologies Ltd. | Phonetically based speech recognition system and method |
US20090326947A1 (en) * | 2008-06-27 | 2009-12-31 | James Arnold | System and method for spoken topic or criterion recognition in digital media and contextual advertising |
US11062615B1 (en) * | 2011-03-01 | 2021-07-13 | Intelligibility Training LLC | Methods and systems for remote language learning in a pandemic-aware world |
US8729374B2 (en) * | 2011-07-22 | 2014-05-20 | Howling Technology | Method and apparatus for converting a spoken voice to a singing voice sung in the manner of a target singer |
WO2014031918A2 (en) * | 2012-08-24 | 2014-02-27 | Interactive Intelligence, Inc. | Method and system for selectively biased linear discriminant analysis in automatic speech recognition systems |
JP2014240940A (ja) * | 2013-06-12 | 2014-12-25 | 株式会社東芝 | 書き起こし支援装置、方法、及びプログラム |
US9418650B2 (en) * | 2013-09-25 | 2016-08-16 | Verizon Patent And Licensing Inc. | Training speech recognition using captions |
US9947322B2 (en) * | 2015-02-26 | 2018-04-17 | Arizona Board Of Regents Acting For And On Behalf Of Northern Arizona University | Systems and methods for automated evaluation of human speech |
US9558734B2 (en) * | 2015-06-29 | 2017-01-31 | Vocalid, Inc. | Aging a text-to-speech voice |
US9336782B1 (en) * | 2015-06-29 | 2016-05-10 | Vocalid, Inc. | Distributed collection and processing of voice bank data |
US9786270B2 (en) * | 2015-07-09 | 2017-10-10 | Google Inc. | Generating acoustic models |
US10706873B2 (en) * | 2015-09-18 | 2020-07-07 | Sri International | Real-time speaker state analytics platform |
US10884503B2 (en) * | 2015-12-07 | 2021-01-05 | Sri International | VPA with integrated object recognition and facial expression recognition |
WO2017112813A1 (en) * | 2015-12-22 | 2017-06-29 | Sri International | Multi-lingual virtual personal assistant |
US10043519B2 (en) * | 2016-09-02 | 2018-08-07 | Tim Schlippe | Generation of text from an audio speech signal |
US11443646B2 (en) * | 2017-12-22 | 2022-09-13 | Fathom Technologies, LLC | E-Reader interface system with audio and highlighting synchronization for digital books |
CN108510978B (zh) | 2018-04-18 | 2020-08-21 | Unit 62315 of the Chinese People's Liberation Army | Modeling method and system of an English acoustic model applied to language identification |
US11887622B2 (en) * | 2018-09-14 | 2024-01-30 | United States Department Of Veteran Affairs | Mental health diagnostics using audio data |
CN109377981B (zh) * | 2018-11-22 | 2021-07-23 | Sichuan Changhong Electric Co., Ltd. | Phoneme alignment method and apparatus |
CN111105785B (zh) | 2019-12-17 | 2023-06-16 | Guangzhou Duoyi Network Co., Ltd. | Method and apparatus for text prosodic boundary recognition |
- 2020-09-07: CN application CN202010925650.2A filed; granted as CN111798868B (status: Active)
- 2021-07-28: CA application CA3194051A filed; granted as CA3194051C (status: Active)
- 2021-07-28: PCT application PCT/CN2021/108899 filed; published as WO2022048354A1 (status: Application Filing)
- 2021-07-28: AU application AU2021336957A filed; granted as AU2021336957B2 (status: Active)
- 2023-03-06: US application US18/178,813 filed; granted as US11749257B2 (status: Active)
Also Published As
Publication number | Publication date |
---|---|
US20230206902A1 (en) | 2023-06-29 |
CA3194051A1 (en) | 2022-03-10 |
AU2021336957A1 (en) | 2023-05-04 |
CA3194051C (en) | 2023-11-07 |
CN111798868A (zh) | 2020-10-20 |
CN111798868B (zh) | 2020-12-08 |
US11749257B2 (en) | 2023-09-05 |
AU2021336957B2 (en) | 2023-09-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022048354A1 (zh) | Speech forced alignment model evaluation method and apparatus, electronic device, and storage medium | |
US10832002B2 (en) | System and method for scoring performance of chatbots | |
TWI666558B (zh) | Semantic analysis method, semantic analysis system, and non-transitory computer-readable medium | |
WO2017166650A1 (zh) | Voice recognition method and apparatus | |
US9396724B2 (en) | Method and apparatus for building a language model | |
WO2018157840A1 (zh) | Speech recognition test method, test terminal, computing device, and storage medium | |
CN108564953B (zh) | Punctuation processing method and apparatus for speech recognition text | |
US20220375459A1 (en) | Decoding network construction method, voice recognition method, device and apparatus, and storage medium | |
US10553206B2 (en) | Voice keyword detection apparatus and voice keyword detection method | |
CN108052498A (zh) | Word-level correction of speech input | |
US10796096B2 (en) | Semantic expression generation method and apparatus | |
CN107039040A (zh) | Speech recognition system | |
US10102771B2 (en) | Method and device for learning language and computer readable recording medium | |
WO2021120602A1 (zh) | Rhythm point detection method and apparatus, and electronic device | |
US11373638B2 (en) | Presentation assistance device for calling attention to words that are forbidden to speak | |
TWI660340B (zh) | Voice control method and system | |
US10997966B2 (en) | Voice recognition method, device and computer storage medium | |
US20230059882A1 (en) | Speech synthesis method and apparatus, device and computer storage medium | |
CN112331194A (zh) | Input method, input apparatus, and electronic device | |
CN111048098B (zh) | Voice correction system and voice correction method | |
CN115116442B (zh) | Voice interaction method and electronic device | |
EP4095847A1 (en) | Method and apparatus for processing voice recognition result, electronic device, and computer medium | |
JP2014153479A (ja) | Diagnostic system, diagnostic method, and program | |
CN117668151A (zh) | Intelligent question answering method and apparatus, electronic device, and medium | |
JP2013130904A (ja) | Compound word reading display method and program, and reading generation device | |
Legal Events

Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21863422; Country of ref document: EP; Kind code of ref document: A1
| ENP | Entry into the national phase | Ref document number: 3194051; Country of ref document: CA
| WWE | Wipo information: entry into national phase | Ref document number: AU2021336957; Country of ref document: AU
| WWE | Wipo information: entry into national phase | Ref document number: 2021336957; Country of ref document: AU
| NENP | Non-entry into the national phase | Ref country code: DE
| ENP | Entry into the national phase | Ref document number: 2021336957; Country of ref document: AU; Date of ref document: 20210728; Kind code of ref document: A
| 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03.07.2023)
| 122 | Ep: pct application non-entry in european phase | Ref document number: 21863422; Country of ref document: EP; Kind code of ref document: A1
Ref document number: 21863422 Country of ref document: EP Kind code of ref document: A1 |