WO2022048354A1 - Speech forced alignment model evaluation method, apparatus, electronic device, and storage medium - Google Patents

Speech forced alignment model evaluation method, apparatus, electronic device, and storage medium

Info

Publication number
WO2022048354A1
Authority
WO
WIPO (PCT)
Prior art keywords
phoneme
score
combination
weight
current
Prior art date
Application number
PCT/CN2021/108899
Other languages
English (en)
French (fr)
Inventor
郭立钊
杨嵩
袁军峰
Original Assignee
北京世纪好未来教育科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京世纪好未来教育科技有限公司
Priority to AU2021336957A (granted as AU2021336957B2)
Priority to CA3194051A (granted as CA3194051C)
Publication of WO2022048354A1
Priority to US18/178,813 (granted as US11749257B2)

Classifications

    • G10L (Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding), under G10 (Musical instruments; Acoustics) in section G (Physics):
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
    • G10L15/01: Assessment or evaluation of speech recognition systems
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04: Segmentation; Word boundary detection
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use
    • G10L2015/025: Phonemes, fenemes or fenones being the recognition units

Definitions

  • Embodiments of the present disclosure relate to the field of computers, and in particular, to a method, apparatus, electronic device, and storage medium for evaluating a speech forced alignment model.
  • Speech synthesis technology has been widely used in applications such as voice broadcasting, voice navigation, and smart speakers.
  • To improve the performance of speech synthesis, a speech synthesis model needs to be trained.
  • Training such a model requires the phoneme time points of the training speech.
  • The phoneme time points are usually obtained using phoneme forced alignment technology (i.e., machine annotation).
  • However, the accuracy of the phoneme time points obtained by a forced alignment model is often not high.
  • Embodiments of the present disclosure provide a method, apparatus, electronic device, and storage medium for evaluating a speech forced alignment model, so as to realize accuracy evaluation of a speech forced alignment model at lower cost.
  • An embodiment of the present disclosure provides a method for evaluating a speech forced alignment model, including:
  • using the to-be-evaluated speech forced alignment model to obtain, according to each segment of audio in the test set and the text corresponding to each segment of audio, the phoneme sequence corresponding to each segment of audio and the predicted start and end time of each phoneme in the phoneme sequence;
  • for each phoneme, obtaining the time accuracy score of the phoneme according to the predicted start and end time of the phoneme and the predetermined reference start and end time of the phoneme, wherein the time accuracy score is used to characterize the closeness of the predicted start and end time of the phoneme to the reference start and end time; and
  • determining, according to the time accuracy score of each phoneme, the time accuracy score of the to-be-evaluated speech forced alignment model.
  • An embodiment of the present disclosure provides a speech forced alignment model evaluation apparatus, including:
  • a first acquisition unit, configured to use the to-be-evaluated speech forced alignment model to obtain, according to each segment of audio in the test set and the text corresponding to each segment of audio, the phoneme sequence corresponding to each segment of audio and the predicted start and end time of each phoneme in the phoneme sequence;
  • a second acquisition unit, configured to, for each phoneme, obtain the time accuracy score of the phoneme according to the predicted start and end time of the phoneme and the predetermined reference start and end time of the phoneme, wherein the time accuracy score is used to characterize the closeness of the predicted start and end time of the phoneme to the reference start and end time; and
  • a third acquisition unit, configured to determine, according to the time accuracy score of each phoneme, the time accuracy score of the to-be-evaluated speech forced alignment model.
  • An embodiment of the present disclosure provides a storage medium storing a program suitable for speech forced alignment model evaluation, so as to implement the method for evaluating a speech forced alignment model according to any one of the foregoing embodiments.
  • An embodiment of the present disclosure provides an electronic device including at least one memory and at least one processor, wherein the memory stores a program and the processor invokes the program to execute the speech forced alignment model evaluation method according to any one of the foregoing embodiments.
  • In the method, apparatus, electronic device, and storage medium for evaluating a speech forced alignment model provided by the embodiments of the present disclosure, the evaluation method first inputs each segment of audio in the test set and the text corresponding to the audio into the speech forced alignment model to be evaluated, and uses that model to obtain the phoneme sequence corresponding to each audio segment as well as the predicted start and end times of each phoneme in each phoneme sequence; it then obtains the time accuracy score of each phoneme from the predicted start and end times and the pre-known reference start and end times of the corresponding phoneme.
  • From the time accuracy score of each phoneme, the time accuracy score of the to-be-evaluated speech forced alignment model is obtained, realizing the evaluation of the model. It can be seen that, in the evaluation method provided by the embodiment of the present disclosure, when evaluating the speech forced alignment model to be evaluated, the time accuracy score of each phoneme can be obtained based on the closeness of its predicted start and end times to the reference start and end times, and from these the time accuracy score of the model is obtained.
  • In an embodiment, the method further includes: for each phoneme, first determining the current phoneme and constructing the phoneme combination of the current phoneme, so as to obtain the phoneme combination of each phoneme, where the combination manner is the same for all phonemes. Then, when obtaining the time accuracy score of the speech forced alignment model to be evaluated, the time accuracy correction score of the current phoneme is obtained according to the time accuracy scores of the phonemes in the current phoneme's combination, so that the time accuracy correction score of each phoneme in the sequence is obtained, and the time accuracy score of the speech forced alignment model to be evaluated is obtained according to the time accuracy correction scores of the phonemes in the phoneme sequence.
  • The evaluation method thus uses the time accuracy score of at least one phoneme adjacent to the current phoneme to correct the time accuracy score of the current phoneme, exploiting the context information of the current phoneme. Because the influence of neighboring phonemes on the current phoneme is taken into account, the corrected time accuracy score of the current phoneme has higher accuracy.
  • In an embodiment, to obtain the time accuracy score of each phoneme, the method first obtains the intersection and the union of the predicted start and end times and the reference start and end times of the same phoneme, and then obtains the phoneme's time accuracy score as the ratio of the start-and-end-time intersection to the start-and-end-time union.
  • The intersection of the start and end times represents the coincidence of the predicted start and end times with the reference start and end times,
  • while the union of the start and end times represents the maximum overall extent of the predicted and reference start and end times.
  • Their ratio therefore accurately represents the degree of overlap of the start and end times, so the phoneme time accuracy score accurately characterizes the closeness of the predicted start and end times to the reference start and end times.
  • FIG. 1 is a schematic flowchart of a method for evaluating a speech forced alignment model provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic flowchart of a step of obtaining a time accuracy score of each phoneme in a method for evaluating a speech forced alignment model provided by an embodiment of the present disclosure
  • FIG. 3 is another schematic flowchart of a method for evaluating a speech forced alignment model provided by an embodiment of the present disclosure
  • FIG. 4 is another schematic flowchart of a method for evaluating a forced alignment model of speech provided by an embodiment of the present disclosure
  • FIG. 5 is a schematic flowchart of a step of obtaining a time accuracy score of a forced alignment model for speech to be evaluated according to an embodiment of the present disclosure
  • FIG. 6 is a block diagram of a speech forced alignment model evaluation device provided by an embodiment of the present disclosure.
  • FIG. 7 is an optional hardware device architecture of an electronic device provided by an embodiment of the present disclosure.
  • the present disclosure provides a speech forced alignment model evaluation method, which can automatically realize the accuracy evaluation of the speech forced alignment model.
  • Embodiments of the present disclosure provide a method for evaluating a speech forced alignment model, including:
  • using the to-be-evaluated speech forced alignment model to obtain, according to each segment of audio in the test set and the text corresponding to each segment of audio, the phoneme sequence corresponding to each segment of audio and the predicted start and end time of each phoneme in the phoneme sequence;
  • for each phoneme, obtaining the time accuracy score of the phoneme according to the predicted start and end time of the phoneme and the predetermined reference start and end time of the phoneme, wherein the time accuracy score is used to characterize the closeness of the predicted start and end time of the phoneme to the reference start and end time; and
  • determining, according to the time accuracy score of each phoneme, the time accuracy score of the to-be-evaluated speech forced alignment model.
  • The method first inputs each segment of audio in the test set and the text corresponding to the audio into the speech forced alignment model to be evaluated, and uses the model to obtain the phoneme sequence corresponding to each segment of audio and the predicted start and end time of each phoneme in each phoneme sequence; then, according to the predicted start and end times and the known reference start and end times of the corresponding phonemes, the time accuracy score of each phoneme is obtained, and based on the time accuracy scores of the phonemes, the time accuracy score of the speech forced alignment model to be evaluated is obtained, so as to realize the evaluation of the model.
  • With the evaluation method provided by the embodiment of the present disclosure, when evaluating the speech forced alignment model to be evaluated, the time accuracy score of each phoneme can be obtained based on the closeness of its predicted start and end times to the reference start and end times, and from these the time accuracy score of the model is obtained. There is no need for manual re-annotation each time predicted start and end times are produced by the model, nor for verification through subsequently synthesized speech. This simplifies the accuracy evaluation of the forced alignment model while reducing the labor cost and time cost required, improving efficiency.
  • FIG. 1 is a schematic flowchart of a method for evaluating a speech forced alignment model provided by an embodiment of the present disclosure.
  • the speech forced alignment model evaluation method includes the following steps:
  • Step S10 Using the speech forced alignment model to be evaluated, according to each segment of audio in the test set and the text corresponding to each segment of audio, obtain the phoneme sequence corresponding to each segment of audio and the predicted start and end times of each phoneme in the phoneme sequence.
  • The method provided by the embodiment of the present disclosure is used to evaluate the forced alignment effect of a speech forced alignment model. Therefore, it is necessary to first establish, or obtain an already established, speech forced alignment model; this model is the speech forced alignment model to be evaluated.
  • the prediction start and end time may include a time span from the prediction start time to the prediction end time.
  • In an embodiment, the speech forced alignment model to be evaluated may include a GMM (Gaussian mixture model) and a Viterbi decoding model: each segment of audio in the test set and the text corresponding to each segment of audio are input into the GMM model to obtain an undecoded phoneme sequence and predicted start and end times, which are then decoded by the Viterbi decoding model to obtain the decoded phoneme sequence and predicted start and end times, as sketched below.
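  • As a minimal sketch of the interface assumed by the rest of this description, the following Python fragment defines the data produced in step S10; `forced_align` is a hypothetical placeholder for the model under evaluation (for example, a GMM acoustic model followed by Viterbi decoding), not an implementation from the disclosure:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AlignedPhoneme:
    """One phoneme with its predicted start and end time (in milliseconds)."""
    label: str    # e.g. "t"
    start: float  # predicted start time
    end: float    # predicted end time

def forced_align(audio: bytes, text: str) -> List[AlignedPhoneme]:
    """Hypothetical stand-in for the model under evaluation (step S10): maps an
    audio segment and its transcript to a time-stamped phoneme sequence."""
    raise NotImplementedError("plug the speech forced alignment model in here")
```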
  • Step S11 For each phoneme, obtain the time accuracy score of the phoneme according to the predicted start and end time of the phoneme and the predetermined reference start and end time of the phoneme.
  • time accuracy score is used to represent the closeness of the predicted start and end times of the phoneme to the reference start and end times.
  • the reference start and end time refers to the phoneme start and end time used as an evaluation reference, and can be obtained by manual annotation.
  • For each phoneme in turn, the time accuracy score of that phoneme is obtained, until the time accuracy scores of all phonemes have been obtained.
  • FIG. 2 is a schematic flowchart of obtaining the time accuracy score of each phoneme in the method for evaluating a speech forced alignment model provided by the embodiment of the present disclosure.
  • As shown in FIG. 2, the time accuracy score of each phoneme can be obtained through the following steps.
  • Step S110 For each phoneme, according to the predicted start and end time of the phoneme and the predetermined reference start and end time of the phoneme, obtain the start and end time intersection and start and end time union of the predicted start and end time of the phoneme and the reference start and end time.
  • The intersection of the start and end times refers to the overlapping time of the predicted start and end times and the reference start and end times of the same phoneme, while the union of the start and end times refers to the overall time spanned by the predicted start and end times and the reference start and end times of the same phoneme.
  • For example, if the predicted start and end times run from the 3rd to the 5th ms and the reference start and end times run from the 4th to the 6th ms, then the intersection of the start and end times is from the 4th to the 5th ms, and the union of the start and end times is from the 3rd to the 6th ms.
  • Step S111 Obtain the time accuracy score of each phoneme according to the ratio of the intersection of the start and end times of each phoneme to the union of the start and end times.
  • The ratio of the two is then computed, giving the time accuracy score of each phoneme.
  • Continuing the example, the time accuracy score of the phoneme "b" is the 1 ms intersection (4th to 5th ms) divided by the 3 ms union (3rd to 6th ms), which is 1/3.
  • Since the intersection of the start and end times represents the coincidence of the predicted start and end times with the reference start and end times,
  • and the union of the start and end times represents the maximum overall extent of the predicted and reference start and end times,
  • their ratio accurately represents the degree of overlap of the start and end times; the phoneme time accuracy score thus obtained accurately characterizes the closeness of the predicted start and end times to the reference start and end times (see the sketch below).
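  • The following Python sketch implements this intersection-over-union score for a single phoneme; representing intervals as (start, end) pairs in milliseconds is an assumption, not fixed by the text:

```python
def time_accuracy_score(pred, ref):
    """Steps S110/S111: ratio of the intersection to the union of the predicted
    interval `pred` and the reference interval `ref`, both (start, end) pairs."""
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))  # overlap length
    union = (pred[1] - pred[0]) + (ref[1] - ref[0]) - inter        # |A| + |B| - |A n B|
    return inter / union if union > 0 else 0.0

# The phoneme "b" example from the text: predicted 3-5 ms, reference 4-6 ms -> 1/3
assert abs(time_accuracy_score((3, 5), (4, 6)) - 1 / 3) < 1e-9
```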
  • Step S12 According to the temporal accuracy scores of each phoneme, determine the temporal accuracy scores of the to-be-evaluated speech forced alignment model.
  • the time accuracy score of the speech forced alignment model to be evaluated can be obtained by further using the time accuracy score of each phoneme.
  • the temporal accuracy scores of each phoneme in the test set can be directly added to obtain the temporal accuracy scores of the speech forced alignment model to be evaluated.
  • With the evaluation method provided by the embodiment of the present disclosure, when evaluating the speech forced alignment model to be evaluated, the time accuracy score of each phoneme can be obtained based on the closeness of its predicted start and end times to the reference start and end times, and from these the time accuracy score of the model is obtained. There is no need for manual re-annotation each time predicted start and end times are produced by the model, nor for verification through subsequently synthesized speech. This simplifies the accuracy evaluation of the forced alignment model while reducing the labor cost and time cost required, improving efficiency.
  • FIG. 3 is another schematic flowchart of the method for evaluating a speech forced alignment model provided by an embodiment of the present disclosure.
  • Step S20 Using the speech forced alignment model to be evaluated, according to each segment of audio in the test set and the text corresponding to each segment of audio, obtain the phoneme sequence corresponding to each segment of audio and the predicted start and end times of each phoneme in the phoneme sequence.
  • For the specific content of step S20, please refer to the description of step S10 in FIG. 1, which will not be repeated here.
  • Step S21 For each phoneme, obtain the time accuracy score of the phoneme according to the predicted start and end time of the phoneme and the predetermined reference start and end time of the phoneme.
  • For the specific content of step S21, please refer to the description of step S11 in FIG. 1, which will not be repeated here.
  • Step S22 Determine the current phoneme, and construct the phoneme combination of the current phoneme to obtain the phoneme combination of each phoneme.
  • the phoneme combination includes the current phoneme and at least one phoneme adjacent to the current phoneme, and the combination manner of the phoneme combination of each phoneme is the same.
  • Specifically, each phoneme in the phoneme sequence is taken in turn as the current phoneme; at least one phoneme adjacent to the current phoneme is then determined and, together with the current phoneme, forms the phoneme combination corresponding to the current phoneme.
  • Each phoneme in the phoneme sequence is determined one by one as the current phoneme, so that the phoneme combination corresponding to each phoneme in the phoneme sequence is obtained.
  • If the phoneme combination consists of 2 phonemes, every phoneme in the sequence constructs a 2-phoneme combination in the same manner: for example, the adjacent phoneme before the current phoneme together with the current phoneme, or alternatively the adjacent phoneme after the current phoneme together with the current phoneme. If the phoneme combination consists of 3 phonemes, every phoneme constructs a 3-phoneme combination in the same manner, for example the phonemes immediately before and after the current phoneme together with the current phoneme. If the phoneme combination consists of 4 phonemes, every phoneme constructs a 4-phoneme combination in the same manner, for example the 2 phonemes before the current phoneme and the 1 phoneme after it together with the current phoneme, or alternatively 1 phoneme before and 2 phonemes after the current phoneme.
  • For example, taking the phoneme sequence of "jintian" ("j", "in", "t", "ian") and determining "t" as the current phoneme: if the phoneme combination consists of 2 phonemes, the combination of the current phoneme "t" can be "int" or "tian", and either one, or both, can be used as a phoneme combination of "t"; if the combination consists of 3 phonemes, it can be "intian"; if it consists of 4 phonemes, it can be "jintian" or "intian+silence", and again either one, or both, can be used as a phoneme combination of "t".
  • Moreover, a 2-phoneme combination, a 3-phoneme combination, and a 4-phoneme combination can simultaneously serve as phoneme combinations of the same phoneme.
  • By forming a phoneme combination from the current phoneme and its adjacent phonemes, a basis is provided for the subsequent correction of the current phoneme's time accuracy score, as sketched below.
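  • The following Python sketch builds such combinations; the choice of offsets (one neighbour on each side in the 3-phoneme case, and so on) is one of the valid manners described above, and boundary phonemes are simply clipped here rather than padded with silence:

```python
def phoneme_combination(seq, i, size=3):
    """Step S22 sketch: the phoneme combination of the current phoneme seq[i],
    built from `size` consecutive phonemes in the same manner for every phoneme."""
    if size == 2:
        lo, hi = i - 1, i + 1  # the neighbour before, plus the current phoneme
    elif size == 3:
        lo, hi = i - 1, i + 2  # one neighbour before and one after
    else:                      # size == 4: two before and one after (one valid choice)
        lo, hi = i - 2, i + 2
    return seq[max(lo, 0):min(hi, len(seq))]

# "jintian" example: current phoneme "t" (index 2)
print(phoneme_combination(["j", "in", "t", "ian"], 2, size=3))  # ['in', 't', 'ian']
print(phoneme_combination(["j", "in", "t", "ian"], 2, size=2))  # ['in', 't'], i.e. "int"
```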
  • Step S23 Obtain the time accuracy correction score of the current phoneme in each phoneme combination according to the time accuracy score of each phoneme in each phoneme combination, so as to obtain the time accuracy correction score of each phoneme in the phoneme sequence.
  • the time accuracy correction score of the current phoneme is obtained by using the time accuracy score of each phoneme in the phoneme combination corresponding to the current phoneme.
  • For example, taking the case where the phoneme combination is composed of 3 phonemes and the phoneme combination of the current phoneme "t" is "intian", the time accuracy correction score of the current phoneme "t" can be:
  • Score(t)' = (Score(in) + Score(t) + Score(ian)) / 3.
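  • Read as the mean of the combination's per-phoneme scores, this step is a one-liner; the sketch below assumes the combination is given as a list of time accuracy scores:

```python
def correction_score(combo_scores):
    """Step S23 sketch: the time accuracy correction score of the current phoneme
    is the average of the time accuracy scores of the phonemes in its combination,
    e.g. Score(t)' = (Score(in) + Score(t) + Score(ian)) / 3."""
    return sum(combo_scores) / len(combo_scores)
```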
  • Step S24 According to the time accuracy correction score of each phoneme in the phoneme sequence, the time accuracy score of the to-be-evaluated speech forced alignment model is obtained.
  • The specific content of step S24 can refer to the content of step S12 shown in FIG. 1, except that the time accuracy score of each phoneme is replaced by the time accuracy correction score of each phoneme; the other content is not repeated here.
  • The evaluation method thus uses the time accuracy score of at least one phoneme adjacent to the current phoneme to correct the time accuracy score of the current phoneme, exploiting the context information of the current phoneme.
  • Because the influence of neighboring phonemes on the current phoneme is taken into account, the corrected time accuracy score of the current phoneme has higher accuracy.
  • In an embodiment, the present disclosure also provides another method for evaluating a speech forced alignment model. Please refer to FIG. 4, which is another schematic flowchart of the method provided by an embodiment of the present disclosure.
  • the speech forced alignment model evaluation method includes:
  • Step S30 using the to-be-evaluated speech forced alignment model to obtain the phoneme sequence corresponding to each audio segment and the predicted start and end time of each phoneme in the phoneme sequence according to each segment of audio in the test set and the text corresponding to each segment of audio.
  • For the specific content of step S30, please refer to the description of step S10 in FIG. 1, which will not be repeated here.
  • Step S31 For each phoneme, obtain the time accuracy score of the phoneme according to the predicted start and end time of the phoneme and the predetermined reference start and end time of the phoneme.
  • For the specific content of step S31, please refer to the description of step S11 in FIG. 1, which will not be repeated here.
  • Step S32 Determine the current phoneme, and construct the phoneme combination of the current phoneme to obtain the phoneme combination of each phoneme.
  • For the specific content of step S32, please refer to the description of step S22 in FIG. 3, which will not be repeated here.
  • Step S33 Classify each phoneme combination according to the pronunciation mode of each phoneme in the combination to obtain the combination category of the phoneme combination; then, according to the combination category of each phoneme combination, determine the number of phoneme combinations in the same combination category and the corresponding combination weight.
  • After the phoneme combination of each current phoneme is obtained, it is classified according to the pronunciation mode of each phoneme in the combination.
  • Different pronunciation modes of adjacent phonemes have a certain impact on the acoustic parameters of the current phoneme; therefore, the phoneme combinations can be classified according to the pronunciation mode of each phoneme they contain, and the combination category of each phoneme combination can be determined.
  • Based on the combination category of each phoneme combination, the number of phoneme combinations in the same category can be determined, from which the combination weight of that category of phoneme combination is obtained; the weight score of each phoneme is then obtained according to the combination weight. This reduces differences in the time accuracy score of the speech forced alignment model to be evaluated that are caused by differences in the number of phonemes obtained from the test set, and improves the evaluation accuracy of the method provided by the embodiment of the present disclosure.
  • Specifically, the pronunciation modes can be divided according to initials and finals, including initial pronunciation modes and final pronunciation modes: the initial pronunciation modes include part-of-articulation modes classified by pronunciation part and manner modes classified by pronunciation method, while the final pronunciation modes include structure modes classified by final structure and mouth-shape modes classified by mouth shape.
  • In other embodiments, the pronunciation modes can also be divided according to the phonetics of other languages, such as English.
  • In an embodiment, the pronunciation modes of initials and finals can be combined to obtain specific combination categories, for example: two-phoneme combinations such as bilabial + nasal final, or nasal final + labiodental; three-phoneme combinations such as bilabial + nasal final + labiodental, single final + bilabial + single final, or single final (open-mouth class) + bilabial stop + single final (even-teeth class); and four-phoneme combinations such as single final + bilabial + single final + bilabial.
  • After the combination categories are obtained, the combination weight of each phoneme combination is further acquired.
  • In an embodiment, the combination weight is the ratio of the number of phoneme combinations of the same combination category to the total number of phonemes in the phoneme sequence.
  • For example, if a phoneme sequence includes 100 phonemes and each phoneme forms one phoneme combination, 100 phoneme combinations are formed. The combination category of each is determined according to the pronunciation mode of each phoneme in the combination, and the combinations are classified accordingly; assuming 3 combination categories result, containing for instance 20, 30, and 50 combinations, the corresponding combination weights are 20/100, 30/100, and 50/100. A sketch of this computation follows.
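  • The following Python sketch computes combination weights; `category_of` is a hypothetical stand-in for the pronunciation-mode classification described above (e.g. mapping a combination to a label such as "bilabial+nasal final"):

```python
from collections import Counter

def combination_weights(combos, category_of):
    """Step S33 sketch: weight of a category = (number of combinations in that
    category) / (total number of phonemes); with one combination per phoneme,
    the number of combinations equals the number of phonemes in the sequence."""
    counts = Counter(category_of(c) for c in combos)
    total = len(combos)
    return {category: n / total for category, n in counts.items()}
```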
  • Step S34 Obtain the time accuracy correction score of the current phoneme according to the time accuracy score of each phoneme in the phoneme combination of the current phoneme.
  • For the specific content of step S34, please refer to the description of step S23 shown in FIG. 3, which will not be repeated here.
  • The execution order of step S33 and step S34 is not limited; the time accuracy correction score can also be obtained first, and the combination weight afterwards.
  • Step S35 For each phoneme, obtain the weight score of the phoneme according to the time accuracy correction score of the phoneme and the combined weight of the phoneme combination corresponding to the phoneme.
  • That is, the weight score of the phoneme is obtained from its time accuracy correction score and the combination weight of its phoneme combination.
  • The combination weight and the time accuracy correction score are both obtained from the same phoneme combination of the same phoneme, so the two correspond to each other.
  • In an embodiment, the weight score of each phoneme is obtained by multiplying the combination weight by the time accuracy correction score.
  • Step S36 According to the weight score of each phoneme of the phoneme sequence, obtain the time accuracy score of the forced alignment model of the speech to be evaluated.
  • the time accuracy score of the forced alignment model of the speech to be evaluated can be obtained through the weight score of each phoneme.
  • In an embodiment, the time accuracy score of the speech forced alignment model to be evaluated is obtained by the following formula: Score_model = Σ_n (W_n × Score_n), where
  • Score_model is the time accuracy score of the speech forced alignment model to be evaluated,
  • W_n is the combination weight of the n-th phoneme, and
  • Score_n is the time accuracy correction score of the n-th phoneme.
  • The use of weight scores reduces the influence, on the time accuracy score of the speech forced alignment model to be evaluated, of differences in the number of phonemes in the phoneme sequences predicted by different models to be evaluated, further improving the accuracy of the evaluation.
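  • A sketch of this weighted aggregation, assuming the per-phoneme combination weights and correction scores are already available as parallel lists:

```python
def model_score(weights, corrected_scores):
    """Step S36 sketch: Score_model = sum over n of W_n * Score_n, where W_n is
    the combination weight of the n-th phoneme's combination and Score_n its
    time accuracy correction score."""
    return sum(w * s for w, s in zip(weights, corrected_scores))
```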
  • In an embodiment, multiple phoneme combinations may also be constructed for the same phoneme: the phoneme combinations of each phoneme may include a diphone combination composed of 2 phonemes and a triphone combination composed of 3 phonemes, where the diphone combination includes the current phoneme and one phoneme directly adjacent to it, and the triphone combination includes the current phoneme and the two phonemes directly adjacent to it. The time accuracy correction score of the current phoneme is then calculated separately for each phoneme combination, yielding multiple time accuracy correction scores for the same phoneme, namely a diphone time accuracy correction score and a triphone time accuracy correction score, from which the corresponding weight scores of the phoneme (the diphone weight score and the triphone weight score) are obtained respectively.
  • FIG. 5 is a schematic flowchart of the step of obtaining the time accuracy score of the speech forced alignment model to be evaluated according to an embodiment of the present disclosure; the step may include:
  • Step S361 Obtain the fusion weight score of the current phoneme according to the diphone weight score and the triphone weight score of the current phoneme.
  • In an embodiment, the fusion weight score can be obtained by the following formula: Score_fusion = v2 × Score_diphone + v3 × Score_triphone, where Score_diphone and Score_triphone are the diphone and triphone weight scores of the current phoneme,
  • v2 is the diphone fusion factor, and
  • v3 is the triphone fusion factor.
  • In this way, the fusion of the different weight scores of the same phoneme can be realized simply; making the triphone fusion factor greater than the diphone fusion factor highlights the influence of the triphone combination and further improves accuracy. A sketch follows.
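  • The following Python sketch fuses the weight scores; the factor values are illustrative assumptions only, chosen so that the triphone fusion factor exceeds the others as the text requires:

```python
def fusion_weight_score(score_di, score_tri, score_tetra=None,
                        v2=0.3, v3=0.5, v4=0.2):
    """Step S361 sketch: weighted fusion of the diphone, triphone and
    (optionally, in the 4-phoneme variant) tetraphone weight scores."""
    fused = v2 * score_di + v3 * score_tri
    if score_tetra is not None:
        fused += v4 * score_tetra
    return fused
```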
  • Step S362 According to the fusion weight score of each phoneme in the phoneme sequence, obtain the time accuracy score of the speech forced alignment model to be evaluated.
  • the time accuracy score of the forced alignment model of the speech to be evaluated can be obtained.
  • For details, please refer to the description of step S12 in FIG. 1, which will not be repeated here.
  • In an embodiment, each phoneme can also have 3 phoneme combinations: in addition to the diphone combination composed of 2 phonemes and the triphone combination composed of 3 phonemes, a tetraphone combination composed of 4 phonemes is also constructed.
  • In that case, the tetraphone combination category and tetraphone combination weight of the phoneme, as well as the tetraphone weight score, are also obtained, and the step of obtaining the time accuracy score of the speech forced alignment model to be evaluated according to the weight score of each phoneme of the phoneme sequence can include:
  • obtaining the fusion weight score of the current phoneme according to the diphone weight score, the triphone weight score, and the tetraphone weight score of the current phoneme, and then obtaining the time accuracy score of the speech forced alignment model to be evaluated according to the fusion weight score of each phoneme.
  • In an embodiment, the fusion weight score can be obtained by the following formula: Score_fusion = v2 × Score_diphone + v3 × Score_triphone + v4 × Score_tetraphone, where
  • v2 is the diphone fusion factor,
  • v3 is the triphone fusion factor, and
  • v4 is the tetraphone fusion factor.
  • In this way, the fusion of the different weight scores of the same phoneme can be realized simply; making the triphone fusion factor larger than both the diphone fusion factor and the tetraphone fusion factor highlights the influence of the triphone combination and further improves accuracy.
  • The apparatus for evaluating a speech forced alignment model provided by the embodiments of the present disclosure is introduced below.
  • The apparatus described below can be regarded as the functional module architecture that an electronic device (such as a PC) needs in order to implement the speech forced alignment model evaluation method provided by the embodiments of the present disclosure.
  • The contents of the apparatus described below and the contents of the method described above may be referred to in correspondence with each other.
  • FIG. 6 is a block diagram of an apparatus for evaluating a speech forced alignment model provided by an embodiment of the present disclosure.
  • The apparatus for evaluating a speech forced alignment model can be applied to both a client and a server, and may include:
  • the first obtaining unit 100 is configured to use the to-be-evaluated speech forced alignment model to obtain the phoneme sequence corresponding to each segment of audio and the prediction of each phoneme in the phoneme sequence according to each segment of audio in the test set and the text corresponding to each segment of audio Start and end time;
  • the second obtaining unit 110 is configured to, for each phoneme, obtain a time accuracy score of the phoneme according to the predicted start and end time of the phoneme and the predetermined reference start and end time of the phoneme, wherein the time accuracy score is used to represent How close the predicted start and end times of the phoneme are to the reference start and end times;
  • the third obtaining unit 120 is configured to determine, according to the time accuracy scores of the respective phonemes, the time accuracy scores of the speech forced alignment model to be evaluated.
  • That is, each segment of audio in the test set and the text corresponding to each segment of audio are input into the speech forced alignment model to be evaluated, so as to obtain the phoneme sequence corresponding to each audio segment and the predicted start and end time of each phoneme.
  • the prediction start and end time may include a time span from the prediction start time to the prediction end time.
  • In an embodiment, the speech forced alignment model to be evaluated may include a GMM (Gaussian mixture model) and a Viterbi decoding model: each segment of audio in the test set and the text corresponding to each segment of audio are input into the GMM model to obtain an undecoded phoneme sequence and predicted start and end times, which are then decoded by the Viterbi decoding model to obtain the decoded phoneme sequence and predicted start and end times.
  • the time accuracy score is the degree of closeness between the predicted start and end times corresponding to each of the phonemes and the corresponding reference start and end times.
  • the reference start and end time refers to the phoneme start and end time used as an evaluation reference, and can be obtained by manual annotation.
  • For each phoneme in turn, the time accuracy score of that phoneme is obtained, until the time accuracy scores of all phonemes have been obtained.
  • the second obtaining unit 110 includes:
  • a third acquisition subunit, configured to obtain, for each phoneme, the intersection of the start and end times and the union of the start and end times of the phoneme's predicted start and end times and reference start and end times; and
  • the fourth obtaining subunit is configured to obtain the ratio of the intersection of the start and end times of each phoneme to the union of the start and end times, and obtain the time accuracy score of each phoneme.
  • The intersection of the start and end times refers to the overlapping time of the predicted start and end times and the reference start and end times of the same phoneme, while the union of the start and end times refers to the overall time spanned by the predicted start and end times and the reference start and end times of the same phoneme.
  • the ratio of the two is further obtained, and the time accuracy score of each phoneme is obtained.
  • Since the intersection of the start and end times represents the coincidence of the predicted start and end times with the reference start and end times,
  • and the union of the start and end times represents the maximum overall extent of the predicted and reference start and end times,
  • their ratio accurately represents the degree of overlap of the start and end times, so the phoneme time accuracy score accurately characterizes the closeness of the predicted start and end times to the reference start and end times.
  • the third obtaining unit 120 can obtain the temporal accuracy score of the speech forced alignment model to be evaluated through the temporal accuracy score of each phoneme.
  • the temporal accuracy scores of each phoneme in the test set can be directly added to obtain the temporal accuracy scores of the speech forced alignment model to be evaluated.
  • With the evaluation apparatus for the speech forced alignment model, when evaluating the model to be evaluated, the time accuracy score of each phoneme can be obtained based on the closeness of its predicted start and end times to the reference start and end times, and from these the time accuracy score of the model is obtained. There is no need for manual re-annotation each time predicted start and end times are produced by the model, nor for verification through subsequently synthesized speech. This simplifies the accuracy evaluation of the forced alignment model while reducing the labor cost and time cost required, improving efficiency.
  • an embodiment of the present disclosure further provides a device for evaluating a speech forced alignment model.
  • the apparatus for evaluating the speech forced alignment model provided by the embodiment of the present disclosure further includes:
  • the fourth obtaining unit 130 is configured to determine the current phoneme, and construct the phoneme combination of the current phoneme, so as to obtain the phoneme combination of each phoneme.
  • the phoneme combination includes the current phoneme and at least one phoneme adjacent to the current phoneme, and the combination manner of the phoneme combination of each phoneme is the same.
  • Specifically, each phoneme in the phoneme sequence is taken in turn as the current phoneme; at least one phoneme adjacent to the current phoneme is then determined and, together with the current phoneme, forms the phoneme combination corresponding to the current phoneme.
  • Each phoneme in the phoneme sequence is determined one by one as the current phoneme, so that the phoneme combination corresponding to each phoneme in the phoneme sequence is obtained.
  • The phoneme combination may consist of 2, 3, or 4 phonemes; in each case, the adjacent phoneme or phonemes determined around the current phoneme, together with the current phoneme, form the phoneme combination, in the same manner as described for the method above.
  • By forming a phoneme combination from the current phoneme and its adjacent phonemes, a basis is provided for the subsequent correction of the current phoneme's time accuracy score.
  • the third obtaining unit 120 includes:
  • the first acquisition subunit is configured to obtain the time accuracy correction score of the current phoneme in each phoneme combination according to the time accuracy score of each phoneme in each phoneme combination, so as to obtain the time accuracy correction score of each phoneme in the phoneme sequence ;as well as
  • the second obtaining subunit is configured to obtain the time accuracy score of the forced alignment model of the speech to be evaluated according to the time accuracy correction score of each phoneme in the phoneme sequence.
  • the time accuracy correction score of the current phoneme is obtained by using the time accuracy score of each phoneme in the phoneme combination corresponding to the current phoneme.
  • the phoneme combination of the current phoneme "t” is "intian”, and the time accuracy correction score of the current phoneme t can be:
  • Score(t)' (Score(in)+Score(t)+Score(ian))/3.
  • the time accuracy score of the forced alignment model of the speech to be evaluated is obtained by using the time accuracy correction score of each phoneme.
  • The apparatus thus uses the time accuracy score of at least one phoneme adjacent to the current phoneme to correct the time accuracy score of the current phoneme, exploiting the context information of the current phoneme. Because the influence of neighboring phonemes on the current phoneme is taken into account, the corrected time accuracy score of the current phoneme has higher accuracy.
  • the apparatus for evaluating the forced alignment model of speech provided by the embodiment of the present disclosure further includes:
  • the fifth obtaining unit 140 is configured to classify the phoneme combination according to the pronunciation mode of each phoneme in the phoneme combination, obtain the combination category of the phoneme combination, and determine the same combination category according to the combination category of each phoneme combination. The number of phoneme combinations and the corresponding combination weights.
  • the second obtaining subunit included in the third obtaining unit 120 includes:
  • a first acquisition module, configured to, for each phoneme, obtain the weight score of the phoneme according to the time accuracy correction score of the phoneme and the combination weight of the phoneme combination corresponding to the phoneme; and
  • the second obtaining module is configured to obtain the time accuracy score of the forced alignment model of the speech to be evaluated according to the weight score of each phoneme of the phoneme sequence.
  • After the phoneme combination of each current phoneme is obtained, it is classified according to the pronunciation mode of each phoneme in the combination.
  • Different pronunciation modes of adjacent phonemes have a certain impact on the acoustic parameters of the current phoneme; therefore, the phoneme combinations can be classified according to the pronunciation mode of each phoneme they contain, and the combination category of each phoneme combination can be determined.
  • Based on the combination category of each phoneme combination, the number of phoneme combinations in the same category can be determined, from which the combination weight of that category is obtained; the weight score of each phoneme is then obtained according to the combination weight, which reduces differences in the time accuracy score of the speech forced alignment model to be evaluated caused by differences in the number of phonemes obtained from the test set.
  • Specifically, the pronunciation modes can be divided according to initials and finals, including initial pronunciation modes and final pronunciation modes: the initial pronunciation modes include part-of-articulation modes classified by pronunciation part and manner modes classified by pronunciation method, while the final pronunciation modes include structure modes classified by final structure and mouth-shape modes classified by mouth shape.
  • the combination weight of each phoneme combination is further acquired.
  • the combination weight is the ratio of the number of phoneme combinations in the same combination category to the total number of phonemes in the phoneme sequence.
  • For example, if a phoneme sequence includes 100 phonemes and each phoneme forms one phoneme combination, 100 phoneme combinations are formed; the combination category of each is determined according to the pronunciation mode of each phoneme in the combination, and the combinations are classified accordingly, assuming 3 combination categories result.
  • That is, the weight score of the phoneme is obtained based on the combination weight and the time accuracy correction score.
  • The combination weight and the time accuracy correction score are both obtained from the same phoneme combination of the same phoneme, so the two correspond to each other.
  • the weighted score of each of the phonemes is obtained by multiplying the combined weight by the temporal accuracy correction score.
  • the time accuracy score of the forced alignment model of the speech to be evaluated can be obtained through the weight score of each phoneme.
  • In an embodiment, the time accuracy score of the speech forced alignment model to be evaluated is obtained by the following formula: Score_model = Σ_n (W_n × Score_n), where
  • Score_model is the time accuracy score of the speech forced alignment model to be evaluated,
  • W_n is the combination weight of the n-th phoneme, and
  • Score_n is the time accuracy correction score of the n-th phoneme.
  • The use of weight scores reduces the influence, on the time accuracy score of the speech forced alignment model to be evaluated, of differences in the number of phonemes in the phoneme sequences predicted by different models to be evaluated, further improving the accuracy of the evaluation.
  • In an embodiment, multiple phoneme combinations of the same phoneme may also be constructed: the phoneme combinations of each phoneme may include a diphone combination consisting of 2 phonemes and a triphone combination consisting of 3 phonemes, where the diphone combination includes the current phoneme and one phoneme directly adjacent to it, and the triphone combination includes the current phoneme and the two phonemes directly adjacent to it.
  • multiple phoneme combinations can be used to further improve the correction of the temporal accuracy score of the current phoneme.
  • the temporal accuracy correction scores of the current phoneme of each phoneme combination need to be calculated separately, so as to obtain multiple temporal accuracy correction scores of the same phoneme.
  • the diphone combination category and triphone combination category of the phoneme, as well as the diphone combination weight and the triphone combination weight, are obtained respectively.
  • the combination weight includes the diphone combination weight and the triphone combination weight
  • the temporal accuracy correction score includes the diphone temporal accuracy correction score and the triphone temporal accuracy correction score.
  • the obtained weight scores include the diphone weight score and the triphone weight score.
  • the second acquisition module in the second acquisition subunit included in the third acquisition unit 120 of the evaluation device includes:
  • a first obtaining submodule configured to obtain the fusion weight score of the current phoneme according to the diphone weight score and the triphone weight score of the current phoneme;
  • the second obtaining sub-module is configured to obtain the time accuracy score of the forced alignment model of the speech to be evaluated according to the fusion weight score of each phoneme of the phoneme sequence.
  • In an embodiment, the fusion weight score can be obtained by the following formula: Score_fusion = v2 × Score_diphone + v3 × Score_triphone, where Score_diphone and Score_triphone are the diphone and triphone weight scores of the current phoneme,
  • v2 is the diphone fusion factor, and
  • v3 is the triphone fusion factor.
  • In this way, the fusion of the different weight scores of the same phoneme can be realized simply; making the triphone fusion factor greater than the diphone fusion factor highlights the influence of the triphone combination and further improves accuracy.
  • the fusion weight score is obtained, and the time accuracy score of the forced alignment model of the speech to be evaluated is further obtained.
  • In an embodiment, the fourth obtaining unit 130 may also construct 3 phoneme combinations for each phoneme: in addition to the diphone combination consisting of 2 phonemes and the triphone combination consisting of 3 phonemes, a tetraphone combination consisting of 4 phonemes is also constructed.
  • the fifth obtaining unit 140 is further configured to obtain the tetraphone combination category and the tetraphone combination weight of the phoneme.
  • the first obtaining module in the second obtaining subunit included in the third obtaining unit 120 obtains a tetraphone weight score.
  • the second acquisition module in the second acquisition subunit includes:
  • a third obtaining submodule configured to obtain the fusion weight score of the current phoneme according to the diphone weight score, the triphone weight score and the tetraphone weight score of the current phoneme;
  • the fourth obtaining sub-module is configured to obtain the time accuracy score of the forced alignment model of the speech to be evaluated according to the fusion weight score of each phoneme in the phoneme sequence.
  • In an embodiment, the fusion weight score can be obtained by the following formula: Score_fusion = v2 × Score_diphone + v3 × Score_triphone + v4 × Score_tetraphone, where
  • v2 is the diphone fusion factor,
  • v3 is the triphone fusion factor, and
  • v4 is the tetraphone fusion factor.
  • In this way, the fusion of the different weight scores of the same phoneme can be realized simply; making the triphone fusion factor larger than both the diphone fusion factor and the tetraphone fusion factor highlights the influence of the triphone combination and further improves accuracy.
  • An embodiment of the present disclosure also provides an electronic device, which can load the above-described program module architecture in the form of a program to implement the speech forced alignment model evaluation method provided by the embodiments of the present disclosure.
  • This hardware architecture can be applied to an electronic device with data processing capability, such as a terminal device or a server device.
  • FIG. 7 shows an optional hardware device architecture provided by an embodiment of the present disclosure, which may include: at least one memory 3 and at least one processor 1, where the memory stores a program and the processor calls the program to execute the foregoing speech forced alignment model evaluation method; in addition, at least one communication interface 2 and at least one communication bus 4 are included.
  • Processor 1 and memory 3 can be located in the same electronic device, for example processor 1 and memory 3 can be located in server device or terminal device; the processor 1 and the memory 3 may also be located in different electronic devices.
  • the memory 3 may store a program
  • the processor 1 may call the program to execute the speech forced alignment model evaluation method provided by the foregoing embodiments of the present disclosure.
  • the electronic device may be a tablet computer, a notebook computer, or other device capable of performing evaluation of the speech forced alignment model.
  • the number of each of the processor 1, the communication interface 2, the memory 3, and the communication bus 4 is at least one, and the processor 1, the communication interface 2, and the memory 3 communicate with each other through the communication bus 4; obviously, the communication connection of the processor 1, the communication interface 2, the memory 3, and the communication bus 4 shown in FIG. 7 is only one optional way;
  • the communication interface 2 can be an interface of a communication module, such as an interface of a GSM module;
  • the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present disclosure.
  • the memory 3 may include high-speed RAM memory, and may also include non-volatile memory, such as at least one disk storage.
  • the above-mentioned device may also include other devices (not shown) that may not be necessary for the disclosure of the embodiments of the present disclosure; since these other devices may not be necessary for understanding the disclosure, they are not introduced one by one here.
  • Embodiments of the present disclosure further provide a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and when the instructions are executed by a processor, the above-described method for evaluating a speech forced alignment model can be implemented.
  • with the computer-executable instructions stored in the storage medium, when evaluating the speech forced alignment model to be evaluated, the time accuracy score of each phoneme can be obtained based on how close the predicted start and end time of each phoneme is to its reference start and end time, and the time accuracy score of the model then follows; there is no need to manually re-check each time a predicted start and end time is obtained through the model, or to verify through the speech obtained by subsequent speech synthesis, which simplifies the accuracy evaluation of the forced alignment model, reduces the labor and time costs the evaluation requires, and improves efficiency.
  • the embodiments of the present disclosure may be implemented in the form of modules, procedures, functions, and the like.
  • Software codes may be stored in a memory unit and executed by a processor.
  • the memory unit is located inside or outside the processor and can transmit and receive data to and from the processor via various known means.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Auxiliary Devices For Music (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Machine Translation (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

A speech forced alignment model evaluation method and apparatus, an electronic device, and a storage medium. The speech forced alignment model evaluation method includes: using a speech forced alignment model to be evaluated, obtaining, according to each audio segment in a test set and the text corresponding to each audio segment, a phoneme sequence corresponding to each audio segment and a predicted start and end time of each phoneme in the phoneme sequence (S10); for each phoneme, obtaining a time accuracy score of the phoneme according to the predicted start and end time of the phoneme and a predetermined reference start and end time of the phoneme (S11); and determining the time accuracy score of the speech forced alignment model to be evaluated according to the time accuracy score of each phoneme (S12).

Description

Speech forced alignment model evaluation method and apparatus, electronic device, and storage medium
This disclosure claims priority to the Chinese patent application filed with the Chinese Patent Office on September 7, 2020, with application number 202010925650.2 and invention title "Speech forced alignment model evaluation method and apparatus, electronic device, and storage medium", the entire content of which is incorporated into this disclosure by reference.
Technical Field
The embodiments of the present disclosure relate to the field of computers, and in particular to a speech forced alignment model evaluation method and apparatus, an electronic device, and a storage medium.
Background
With the development of computer technology and deep learning technology, speech synthesis has been widely applied, for example in voice broadcasting, voice navigation, and smart speakers.
In speech synthesis, the speech synthesis model needs to be trained to improve the performance of speech synthesis. To train a speech synthesis model, the phoneme time points of the training speech must be obtained.
Phoneme time points are usually obtained with speech forced alignment technology (i.e., machine annotation), which determines phoneme time points through a forced alignment model. In the related art, however, the accuracy of the phoneme time points obtained through a forced alignment model is not high.
Summary
The embodiments of the present disclosure provide a speech forced alignment model evaluation method and apparatus, an electronic device, and a storage medium, so as to evaluate the accuracy of a speech forced alignment model at relatively low cost.
To solve the above problem, the embodiments of the present disclosure provide a speech forced alignment model evaluation method, including:
using a speech forced alignment model to be evaluated, obtaining, according to each audio segment in a test set and the text corresponding to each audio segment, a phoneme sequence corresponding to each audio segment and a predicted start and end time of each phoneme in the phoneme sequence;
for each phoneme, obtaining a time accuracy score of the phoneme according to the predicted start and end time of the phoneme and a predetermined reference start and end time of the phoneme, where the time accuracy score characterizes how close the predicted start and end time of the phoneme is to the reference start and end time; and
determining the time accuracy score of the speech forced alignment model to be evaluated according to the time accuracy score of each phoneme.
To solve the above problem, the embodiments of the present disclosure provide a speech forced alignment model evaluation apparatus, including:
a first obtaining unit configured to use a speech forced alignment model to be evaluated to obtain, according to each audio segment in a test set and the text corresponding to each audio segment, a phoneme sequence corresponding to each audio segment and a predicted start and end time of each phoneme in the phoneme sequence;
a second obtaining unit configured to obtain, for each phoneme, a time accuracy score of the phoneme according to the predicted start and end time of the phoneme and a predetermined reference start and end time of the phoneme, where the time accuracy score characterizes how close the predicted start and end time of the phoneme is to the reference start and end time; and
a third obtaining unit configured to determine the time accuracy score of the speech forced alignment model to be evaluated according to the time accuracy score of each phoneme.
To solve the above problem, the embodiments of the present disclosure provide a storage medium that stores a program suitable for speech forced alignment model evaluation, so as to implement the speech forced alignment model evaluation method according to any of the foregoing.
To solve the above problem, the embodiments of the present disclosure provide an electronic device including at least one memory and at least one processor, where the memory stores a program, and the processor calls the program to execute the speech forced alignment model evaluation method according to any of the foregoing.
Compared with the prior art, the technical solution of the present disclosure has the following advantages:
In the speech forced alignment model evaluation method, apparatus, electronic device, and storage medium provided by the embodiments of the present disclosure, the evaluation method includes: first inputting each audio segment of the test set and the text corresponding to the audio into the speech forced alignment model to be evaluated, and using the model to obtain the phoneme sequence corresponding to each audio segment and the predicted start and end time of each phoneme in each phoneme sequence; then obtaining the time accuracy score of each phoneme according to the predicted start and end time and the previously known reference start and end time of the corresponding phoneme; and, based on the time accuracy score of each phoneme, obtaining the time accuracy score of the speech forced alignment model to be evaluated, thereby evaluating the model. It can be seen that with this evaluation method, the time accuracy score of each phoneme can be obtained simply from how close its predicted start and end time is to its reference start and end time, and the time accuracy score of the model follows from these per-phoneme scores; there is no need to manually recheck each time a predicted start and end time is obtained through the model, nor to verify through the speech produced by subsequent speech synthesis. This simplifies the accuracy evaluation of the forced alignment model, reduces the labor and time costs the evaluation requires, and improves efficiency.
In an optional scheme, the speech forced alignment model evaluation method provided by the embodiments of the present disclosure further, for each phoneme, first determines a current phoneme and constructs the phoneme combination of the current phoneme, obtaining the phoneme combination of each phoneme, with the combination of every phoneme constructed in the same manner; then, when obtaining the time accuracy score of the model to be evaluated, the time accuracy correction score of the current phoneme is obtained according to the time accuracy scores of the phonemes in the current phoneme's combination, yielding the time accuracy correction score of each phoneme in the phoneme sequence, from which the time accuracy score of the model to be evaluated is obtained. In this way, the method corrects the time accuracy score of the current phoneme with the time accuracy score of at least one phoneme adjacent to it, exploiting the context information of the current phoneme and taking into account the influence of the adjacent phonemes on the current phoneme, so that the resulting time accuracy score of the current phoneme is corrected and therefore more accurate.
In an optional scheme, in the speech forced alignment model evaluation method provided by the embodiments of the present disclosure, to obtain the time accuracy score of each phoneme, the intersection and the union of the predicted start and end time and the reference start and end time of the same phoneme are first obtained, and the time accuracy score of the corresponding phoneme is then obtained from the ratio of the intersection to the union. The intersection represents the overlap between the predicted and reference start and end times, and the union represents their maximum overall extent, so their ratio accurately expresses the degree of overlap of the predicted start and end time; the phoneme time accuracy score thus obtained accurately represents how close the predicted start and end time is to the reference start and end time.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of the speech forced alignment model evaluation method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of the step of obtaining the time accuracy score of each phoneme in the speech forced alignment model evaluation method provided by an embodiment of the present disclosure;
FIG. 3 is another schematic flowchart of the speech forced alignment model evaluation method provided by an embodiment of the present disclosure;
FIG. 4 is yet another schematic flowchart of the speech forced alignment model evaluation method provided by an embodiment of the present disclosure;
FIG. 5 is a schematic flowchart of the step of obtaining the time accuracy score of the speech forced alignment model to be evaluated, provided by an embodiment of the present disclosure;
FIG. 6 is a block diagram of the speech forced alignment model evaluation apparatus provided by an embodiment of the present disclosure;
FIG. 7 is an optional hardware device architecture of the electronic device provided by an embodiment of the present disclosure.
Detailed Description
In the related art, evaluating a speech forced alignment model manually is time-consuming and labor-intensive, and the evaluation result is easily affected by subjectivity.
In view of this, the present disclosure provides a speech forced alignment model evaluation method that can automatically evaluate the accuracy of a speech forced alignment model. The embodiments of the present disclosure provide a speech forced alignment model evaluation method, including:
using a speech forced alignment model to be evaluated, obtaining, according to each audio segment in a test set and the text corresponding to each audio segment, a phoneme sequence corresponding to each audio segment and a predicted start and end time of each phoneme in the phoneme sequence;
for each phoneme, obtaining a time accuracy score of the phoneme according to the predicted start and end time of the phoneme and a predetermined reference start and end time of the phoneme, where the time accuracy score characterizes how close the predicted start and end time of the phoneme is to the reference start and end time; and
determining the time accuracy score of the speech forced alignment model to be evaluated according to the time accuracy score of each phoneme.
Thus, the speech forced alignment model evaluation method provided by the embodiments of the present disclosure includes: first inputting each audio segment of the test set and the text corresponding to the audio into the speech forced alignment model to be evaluated, and using the model to obtain the phoneme sequence corresponding to each audio segment and the predicted start and end time of each phoneme in each phoneme sequence; then obtaining the time accuracy score of each phoneme according to the predicted start and end time and the known reference start and end time of the corresponding phoneme; and, based on the time accuracy score of each phoneme, obtaining the time accuracy score of the speech forced alignment model to be evaluated, thereby evaluating the model.
It can be seen that with the speech forced alignment model evaluation method provided by the embodiments of the present disclosure, when evaluating the model to be evaluated, the time accuracy score of each phoneme can be obtained from how close the predicted start and end time of each phoneme is to its reference start and end time, and the time accuracy score of the model then follows; there is no need to manually recheck each time a predicted start and end time is obtained through the model, nor to verify through the speech produced by subsequent speech synthesis, which simplifies the accuracy evaluation of the forced alignment model, reduces the labor and time costs the evaluation requires, and improves efficiency.
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
Please refer to FIG. 1, which is a schematic flowchart of the speech forced alignment model evaluation method provided by an embodiment of the present disclosure.
As shown in the figure, the speech forced alignment model evaluation method provided by the embodiments of the present disclosure includes the following steps:
Step S10: using the speech forced alignment model to be evaluated, obtain, according to each audio segment in the test set and the text corresponding to each audio segment, the phoneme sequence corresponding to each audio segment and the predicted start and end time of each phoneme in the phoneme sequence.
It is easy to understand that the method provided by the embodiments of the present disclosure is used to evaluate the forced alignment effect of a speech forced alignment model, so the speech forced alignment model to be evaluated must first be built, or an already built model obtained.
Each audio segment of the test set and the text corresponding to each audio segment are input into the model to be evaluated, thereby obtaining the phoneme sequence corresponding to each audio segment and the predicted start and end time of each phoneme in each phoneme sequence.
Of course, the predicted start and end time may include the time span from the predicted start moment to the predicted end moment.
Specifically, the speech forced alignment model to be evaluated may include a GMM (Gaussian mixture model) and a Viterbi decoding model; each audio segment of the test set and the text corresponding to each audio segment are input into the GMM to obtain the undecoded phoneme sequence and predicted start and end times, which are then decoded by the Viterbi decoding model to obtain the decoded phoneme sequence and the predicted start and end times.
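As a non-normative illustration (the disclosure prescribes no code), the following Python sketch models the data this step produces: one aligned phoneme with its predicted times from the model and its manually annotated reference times. The class and field names are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass

@dataclass
class AlignedPhoneme:
    """One phoneme of a phoneme sequence produced by forced alignment.

    Times are in seconds; pred_* come from the model to be evaluated,
    ref_* from manual annotation (the evaluation benchmark)."""
    symbol: str        # e.g. "t"
    pred_start: float  # predicted start time
    pred_end: float    # predicted end time
    ref_start: float   # reference start time
    ref_end: float     # reference end time
```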
Step S11: for each phoneme, obtain the time accuracy score of the phoneme according to the predicted start and end time of the phoneme and the predetermined reference start and end time of the phoneme.
It can be understood that the time accuracy score characterizes how close the predicted start and end time of the phoneme is to the reference start and end time.
Here, the reference start and end time refers to the phoneme start and end time used as the evaluation benchmark, which may be obtained through manual annotation.
By comparing how close the predicted and reference start and end times of the same phoneme are, the time accuracy score of that phoneme is obtained, until the time accuracy score of every phoneme has been obtained.
In a specific implementation, to facilitate obtaining the time accuracy score of each phoneme, please refer to FIG. 2, which is a schematic flowchart of the step of obtaining the time accuracy score of each phoneme in the speech forced alignment model evaluation method provided by an embodiment of the present disclosure.
As shown in the figure, the speech forced alignment model evaluation method provided by the embodiments of the present disclosure may obtain the time accuracy score of each phoneme through the following steps.
Step S110: for each phoneme, according to the predicted start and end time of the phoneme and the predetermined reference start and end time of the phoneme, obtain the intersection and the union of the predicted start and end time and the reference start and end time of the phoneme.
It is easy to understand that the intersection of the predicted and reference start and end times of a phoneme is the overlapping time of the predicted and reference start and end times of the same phoneme, and the union is the overall time spanned by the predicted and reference start and end times of the same phoneme.
For example, for the phoneme "b", suppose the predicted start and end time runs from the 3rd ms to the 5th ms and the reference start and end time runs from the 4th ms to the 6th ms; then the intersection runs from the 4th ms to the 5th ms, and the union runs from the 3rd ms to the 6th ms.
Step S111: obtain the time accuracy score of each phoneme according to the ratio of the intersection to the union of the start and end times of each phoneme.
After the intersection and the union of the start and end times of each phoneme are obtained, their ratio is further computed to obtain the time accuracy score of each phoneme.
Following the previous example, the time accuracy score of the phoneme "b" is (4th ms to 5th ms)/(3rd ms to 6th ms), i.e., 1 ms/3 ms = 1/3.
It can be understood that the larger the ratio of the intersection to the union of a phoneme's start and end times, the more accurate the model to be evaluated is on that phoneme.
In this way, the intersection represents the overlap between the predicted and reference start and end times, and the union represents their maximum overall extent; their ratio accurately expresses the degree of overlap of the predicted start and end time, so the phoneme time accuracy score can be obtained, and the score accurately represents how close the predicted start and end time is to the reference start and end time.
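A minimal Python sketch of steps S110/S111 follows. The intersection-over-union computation comes directly from the definitions above; returning 0.0 for non-overlapping intervals is our own assumption, since the text does not address that case.

```python
def time_accuracy_score(pred_start: float, pred_end: float,
                        ref_start: float, ref_end: float) -> float:
    """Steps S110/S111: ratio of the intersection to the union of the
    predicted and reference start-and-end intervals."""
    intersection = max(0.0, min(pred_end, ref_end) - max(pred_start, ref_start))
    union = max(pred_end, ref_end) - min(pred_start, ref_start)
    return intersection / union if union > 0 else 0.0

# The phoneme "b" from the text: predicted 3 ms-5 ms, reference 4 ms-6 ms -> 1/3.
print(time_accuracy_score(3, 5, 4, 6))  # 0.3333...
```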
Step S12: determine the time accuracy score of the speech forced alignment model to be evaluated according to the time accuracy score of each phoneme.
After the time accuracy score of each phoneme in the test set is obtained, the time accuracy score of the model to be evaluated can then be obtained from the time accuracy scores of the phonemes.
In a specific implementation, the time accuracy scores of the phonemes of the test set may simply be added up to obtain the time accuracy score of the model to be evaluated.
It is easy to understand that the higher the time accuracy score of each phoneme, the higher the time accuracy score of the model to be evaluated and the better its forced alignment effect, which makes it possible to evaluate the alignment effect of different speech forced alignment models, or the alignment effect of a speech forced alignment model before and after parameter adjustment.
It can be seen that with the speech forced alignment model evaluation method provided by the embodiments of the present disclosure, when evaluating the model to be evaluated, the time accuracy score of each phoneme can be obtained from how close the predicted start and end time of each phoneme is to its reference start and end time, and the time accuracy score of the model then follows; there is no need to manually recheck each time a predicted start and end time is obtained through the model, nor to verify through the speech produced by subsequent speech synthesis, which simplifies the accuracy evaluation of the forced alignment model, reduces the labor and time costs the evaluation requires, and improves efficiency.
To further improve the accuracy of the evaluation of a speech forced alignment model, the embodiments of the present disclosure further provide another speech forced alignment model evaluation method; please refer to FIG. 3, which is another schematic flowchart of the speech forced alignment model evaluation method provided by an embodiment of the present disclosure.
The speech forced alignment model evaluation method provided by the embodiments of the present disclosure includes:
Step S20: using the speech forced alignment model to be evaluated, obtain, according to each audio segment in the test set and the text corresponding to each audio segment, the phoneme sequence corresponding to each audio segment and the predicted start and end time of each phoneme in the phoneme sequence.
For the details of step S20, please refer to the description of step S10 in FIG. 1, which is not repeated here.
Step S21: for each phoneme, obtain the time accuracy score of the phoneme according to the predicted start and end time of the phoneme and the predetermined reference start and end time of the phoneme.
For the details of step S21, please refer to the description of step S11 in FIG. 1, which is not repeated here.
Step S22: determine a current phoneme and construct the phoneme combination of the current phoneme, so as to obtain the phoneme combination of each phoneme.
Of course, the phoneme combination includes the current phoneme and at least one phoneme adjacent to the current phoneme, and the phoneme combination of each phoneme is constructed in the same manner.
After the phoneme sequence of each audio segment of the test set is obtained, one phoneme in the phoneme sequence is determined as the current phoneme, and then at least one phoneme adjacent to the current phoneme is determined to form a phoneme combination together with the current phoneme, thereby obtaining the phoneme combination corresponding to the current phoneme in the phoneme sequence; each phoneme in the phoneme sequence is determined as the current phoneme in turn, thereby obtaining the phoneme combination corresponding to each phoneme in the phoneme sequence.
It can be understood that if the phoneme combination is built from 2 phonemes, every phoneme in the sequence gets a combination of 2 phonemes built in the same manner: the adjacent phoneme before the current phoneme may be determined to form the combination with the current phoneme, or, of course, the adjacent phoneme after the current phoneme. If the combination is built from 3 phonemes, every phoneme gets a combination of 3 phonemes built in the same manner, and the phonemes adjacent to the current phoneme on both sides may be determined to form the combination with the current phoneme. If the combination is built from 4 phonemes, every phoneme gets a combination of 4 phonemes built in the same manner, and the 2 phonemes before the current phoneme and the one phoneme after it may be determined to form the combination with it; of course, 1 phoneme before and 2 phonemes after the current phoneme may also be chosen.
For example, for a phoneme sequence such as "jintian", when "t" is determined as the current phoneme: if combinations are built from 2 phonemes, the combination of the current phoneme "t" may be "int" or "tian" — either one may be chosen as a combination of "t", or both may serve as combinations of "t"; if combinations are built from 3 phonemes, the combination of "t" may be "intian"; if combinations are built from 4 phonemes, the combination of "t" may be "jintian" or "intian+silence" — either one may be chosen as a combination of "t", or both may serve as combinations of "t".
Of course, a combination built from 2 phonemes, a combination built from 3 phonemes, and a combination built from 4 phonemes may all serve as combinations of the same phoneme.
Since the start and end time of each phoneme is affected by the phonemes adjacent to it, forming a phoneme combination that takes the current phoneme and its neighboring phonemes into account provides the basis for the subsequent correction of the current phoneme's time accuracy score.
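The following sketch builds the diphone and triphone combinations described above for one current phoneme. Padding the sequence edges with a "silence" pseudo-phoneme mirrors the "intian+silence" example but is otherwise our own assumption, as is the choice of the preceding neighbor for the diphone; the function name is hypothetical.

```python
def phoneme_combinations(seq: list[str], i: int) -> dict[str, list[str]]:
    """Step S22 sketch: diphone (preceding neighbor + current phoneme)
    and triphone (both neighbors + current phoneme) combinations of
    seq[i], padding with "silence" at the sequence edges."""
    padded = ["silence"] + seq + ["silence"]
    j = i + 1  # index of the current phoneme in the padded sequence
    return {
        "diphone": padded[j - 1 : j + 1],   # e.g. ["in", "t"]
        "triphone": padded[j - 1 : j + 2],  # e.g. ["in", "t", "ian"]
    }

# "jintian" as the pinyin phonemes from the text, current phoneme "t":
print(phoneme_combinations(["j", "in", "t", "ian"], 2))
# {'diphone': ['in', 't'], 'triphone': ['in', 't', 'ian']}
```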
Step S23: according to the time accuracy score of each phoneme in each phoneme combination, obtain the time accuracy correction score of the current phoneme of each phoneme combination, so as to obtain the time accuracy correction score of each phoneme in the phoneme sequence.
After the phoneme combination of each phoneme is obtained, the time accuracy scores of the phonemes in the combination corresponding to the current phoneme are used to obtain the time accuracy correction score of the current phoneme.
As in the example above, with combinations built from 3 phonemes and the combination of the current phoneme "t" being "intian", the time accuracy correction score of the current phoneme t may be:
Score(t)' = (Score(in) + Score(t) + Score(ian)) / 3.
Step S24: obtain the time accuracy score of the speech forced alignment model to be evaluated according to the time accuracy correction score of each phoneme in the phoneme sequence.
For the details of step S24, refer to the content of step S12 shown in FIG. 1, simply replacing the time accuracy score of each phoneme with its time accuracy correction score; the other content is not repeated here.
The speech forced alignment model evaluation method provided by the embodiments of the present disclosure corrects the time accuracy score of the current phoneme with the time accuracy score of at least one phoneme adjacent to it, exploiting the context information of the current phoneme and taking into account the influence of the adjacent phonemes on the current phoneme, so that the resulting time accuracy score of the current phoneme is corrected and therefore more accurate.
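A sketch of the correction in step S23, matching the formula Score(t)' = (Score(in) + Score(t) + Score(ian)) / 3; the numeric scores below are illustrative values, not from the disclosure.

```python
def correction_score(combination_scores: list[float]) -> float:
    """Step S23: average the time accuracy scores of all phonemes in
    the current phoneme's combination."""
    return sum(combination_scores) / len(combination_scores)

# Illustrative scores for the combination "in", "t", "ian" of phoneme "t":
print(correction_score([0.8, 0.5, 0.9]))  # 0.7333...
```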
To further improve the accuracy of the evaluation, the embodiments of the present disclosure further provide yet another speech forced alignment model evaluation method; please refer to FIG. 4, which is yet another schematic flowchart of the speech forced alignment model evaluation method provided by an embodiment of the present disclosure.
As shown in the figure, the speech forced alignment model evaluation method provided by the embodiments of the present disclosure includes:
Step S30: using the speech forced alignment model to be evaluated, obtain, according to each audio segment in the test set and the text corresponding to each audio segment, the phoneme sequence corresponding to each audio segment and the predicted start and end time of each phoneme in the phoneme sequence.
For the details of step S30, please refer to the description of step S10 in FIG. 1, which is not repeated here.
Step S31: for each phoneme, obtain the time accuracy score of the phoneme according to the predicted start and end time of the phoneme and the predetermined reference start and end time of the phoneme.
For the details of step S31, please refer to the description of step S11 in FIG. 1, which is not repeated here.
Step S32: determine a current phoneme and construct the phoneme combination of the current phoneme, so as to obtain the phoneme combination of each phoneme.
For the details of step S32, please refer to the description of step S22 in FIG. 3, which is not repeated here.
Step S33: classify the phoneme combinations according to the articulation manner of each phoneme in the combination to obtain the combination category of the combination; according to the combination category of each phoneme combination, determine the number of phoneme combinations of the same combination category and the corresponding combination weight.
After the combination of each current phoneme is obtained, the combinations are classified according to the articulation manner of each of their phonemes. Different articulation manners of adjacent phonemes have a certain influence on the parameters of the current phoneme, so the combinations can be classified by the articulation manner of each of their phonemes and the combination category of each combination determined; then, from the combination categories, the number of combinations of the same category can be determined and the combination weight of a given category obtained. From the combination weight, the weight score of each phoneme can further be obtained, which reduces the differences in the time accuracy score of the model to be evaluated caused by differing numbers of phonemes obtained from the test set, and improves the evaluation accuracy of the speech forced alignment model evaluation method provided by the embodiments of the present disclosure.
Specifically, the articulation manners may be divided according to initials (shengmu) and finals (yunmu), comprising initial articulation manners and final articulation manners, where the initial articulation manners include place articulation manners classified by articulation position and method articulation manners classified by articulation method, and the final articulation manners include structural articulation manners classified by structure and mouth-shape articulation manners classified by mouth shape.
The specific classification of the initial articulation manners is given in Table 1:
Table 1. Initial articulation manners
[Table 1 is provided as an image in the original publication and is not reproduced here.]
The specific classification of the final articulation manners is given in Table 2:
Table 2. Final articulation manners
[Table 2 is provided as an image in the original publication and is not reproduced here.]
Of course, the articulation manners may instead be divided according to the pronunciation of other languages, for example English.
When the division is based on pinyin articulation manners, the initial and final articulation manners can be combined during grouping to obtain concrete classification categories, for example — diphone combinations: bilabial + nasal final, nasal final + labiodental; triphone combinations: bilabial + nasal final + labiodental, simple final + bilabial + simple final, or simple final (open-mouth class) + bilabial plosive + simple final (even-teeth class); tetraphone combinations: simple final + bilabial + simple final + bilabial.
In this way, combining the articulation-manner classification with the articulation manners of initials and finals makes the classification easier to carry out and less difficult. After the combination categories are obtained, the combination weight of each phoneme combination is further obtained; specifically, the combination weight is the ratio of the number of phoneme combinations of the same category to the total number of phonemes in the phoneme sequence.
For ease of understanding, consider an example: when a phoneme sequence includes 100 phonemes, if each phoneme forms one combination, 100 combinations are formed; the combination category can be determined from the articulation manner of each phoneme in each combination and each combination then classified — suppose 3 categories are formed in total.
The number of combinations in each category can then be counted; suppose the first category has 20 combinations, the second 45, and the third 35. The combination weights can then be determined from the number of combinations in each category, for example: the combination weight of the first category may be 20/100 = 0.2, the combination weight of the second category may be 45/100 = 0.45, and the combination weight of the third category may be 35/100 = 0.35.
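A sketch of the weight computation in step S33, under the one-combination-per-phoneme assumption stated above, so the number of combinations equals the total number of phonemes; the category labels are placeholders.

```python
from collections import Counter

def combination_weights(categories: list[str]) -> dict[str, float]:
    """Step S33: weight of a category = (# combinations in that
    category) / (total # of phonemes in the phoneme sequence)."""
    total = len(categories)
    return {cat: n / total for cat, n in Counter(categories).items()}

# The 100-phoneme example from the text: 20 + 45 + 35 combinations.
cats = ["A"] * 20 + ["B"] * 45 + ["C"] * 35
print(combination_weights(cats))  # {'A': 0.2, 'B': 0.45, 'C': 0.35}
```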
Step S34: obtain the time accuracy correction score of the current phoneme according to the time accuracy score of each phoneme in the phoneme combination of the current phoneme.
For the details of step S34, please refer to the description of step S23 shown in FIG. 3, which is not repeated here.
Moreover, the execution order of steps S33 and S34 is not limited; the time accuracy correction score may also be obtained before the combination weight.
Step S35: for each phoneme, obtain the weight score of the phoneme according to the time accuracy correction score of the phoneme and the combination weight of the phoneme combination corresponding to the phoneme.
The weight score of the phoneme is obtained based on the combination weight obtained in step S33 and the time accuracy correction score obtained in step S34.
Of course, the combination weight and the time accuracy correction score are obtained from the same combination of the same phoneme, so the two correspond to each other.
Specifically, the weight score of each phoneme is obtained by multiplying the combination weight by the time accuracy correction score.
Step S36: obtain the time accuracy score of the speech forced alignment model to be evaluated according to the weight score of each phoneme in the phoneme sequence.
After the weight score of each phoneme is obtained, the time accuracy score of the model to be evaluated can then be obtained from the weight scores of the phonemes.
Specifically, the time accuracy score of the speech forced alignment model to be evaluated is obtained through the following formula:
Score_model = W1*Score1 + W2*Score2 + ... + Wn*Scoren,
where: Score_model is the time accuracy score of the speech forced alignment model to be evaluated,
Wn is the combination weight of the n-th phoneme, and
Scoren is the time accuracy correction score of the n-th phoneme.
Obtaining weight scores reduces the influence on the time accuracy score of the model to be evaluated of the differing phoneme counts of the phoneme sequences predicted by different models, further improving the accuracy of the evaluation.
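A sketch of step S36 exactly as the formula above describes it; the combination weights and correction scores are passed in per phoneme, in sequence order.

```python
def model_score(weights: list[float], corr_scores: list[float]) -> float:
    """Step S36: Score_model = W1*Score1 + W2*Score2 + ... + Wn*Scoren."""
    assert len(weights) == len(corr_scores)
    return sum(w * s for w, s in zip(weights, corr_scores))
```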
In another embodiment, to further improve the correction of the current phoneme's time accuracy score, multiple phoneme combinations may be constructed for the same phoneme. Specifically, the combinations of each phoneme may include a diphone combination consisting of 2 phonemes and a triphone combination consisting of 3 phonemes, where of course the diphone combination includes the current phoneme and one phoneme directly adjacent to it, and the triphone combination includes the current phoneme and the two phonemes directly adjacent to it. The time accuracy correction score of the current phoneme is then computed separately for each combination, yielding multiple time accuracy correction scores for the same phoneme, including a diphone time accuracy correction score and a triphone time accuracy correction score; the diphone combination category and triphone combination category of the phoneme, its diphone combination weight and triphone combination weight, and its diphone weight score and triphone weight score are obtained accordingly.
FIG. 5 is a schematic flowchart of the step of obtaining the time accuracy score of the speech forced alignment model to be evaluated; the step of obtaining the time accuracy score of the model to be evaluated may include:
Step S361: obtain the fusion weight score of the current phoneme according to the diphone weight score and the triphone weight score of the current phoneme.
In a specific implementation, the fusion weight score may be obtained through the following formula:
score = v2*score” + v3*score”';
where: v2 + v3 = 1 and v3 > v2,
score is the fusion weight score,
score” is the diphone weight score,
v2 is the diphone fusion factor,
score”' is the triphone weight score, and
v3 is the triphone fusion factor.
In this way, the fusion of the different weight scores of the same phoneme is realized simply, and making the triphone fusion factor greater than the diphone fusion factor highlights the influence of the triphone combination, further improving accuracy.
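A sketch of step S361 under the stated constraints; the factor values 0.4/0.6 satisfy v2 + v3 = 1 and v3 > v2 but are illustrative assumptions, as the disclosure gives no concrete values.

```python
def fusion_weight_score(diphone_ws: float, triphone_ws: float,
                        v2: float = 0.4, v3: float = 0.6) -> float:
    """Step S361: score = v2*score'' + v3*score''' with v2 + v3 = 1
    and v3 > v2 (triphone influence dominates)."""
    assert abs(v2 + v3 - 1.0) < 1e-9 and v3 > v2
    return v2 * diphone_ws + v3 * triphone_ws
```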
Step S362: obtain the time accuracy score of the speech forced alignment model to be evaluated according to the fusion weight score of each phoneme in the phoneme sequence.
Once the fusion weight scores are obtained, the time accuracy score of the model to be evaluated can be obtained; for details, please refer to the description of step S12 in FIG. 1, which is not repeated here.
Of course, in another specific implementation, each phoneme may also have 3 phoneme combinations: in addition to the diphone combination consisting of 2 phonemes and the triphone combination consisting of 3 phonemes, a tetraphone combination consisting of 4 phonemes is included. In that case, the tetraphone combination category, the tetraphone combination weight, and the tetraphone weight score of the phoneme are also obtained, and the step of obtaining the time accuracy score of the model to be evaluated according to the weight score of each phoneme in the phoneme sequence may include:
obtaining the fusion weight score of the current phoneme according to the diphone weight score, the triphone weight score, and the tetraphone weight score of the current phoneme; and
obtaining the time accuracy score of the speech forced alignment model to be evaluated according to the fusion weight score of each phoneme in the phoneme sequence.
In a specific implementation, the fusion weight score may be obtained through the following formula:
score = v2*score” + v3*score”' + v4*score””;
where: v2 + v3 + v4 = 1, v3 > v2, and v3 > v4,
score is the fusion weight score,
score” is the diphone weight score,
v2 is the diphone fusion factor,
score”' is the triphone weight score,
v3 is the triphone fusion factor,
score”” is the tetraphone weight score, and
v4 is the tetraphone fusion factor.
In this way, the fusion of the different weight scores of the same phoneme is realized simply, and making the triphone fusion factor greater than the diphone fusion factor and greater than the tetraphone fusion factor highlights the influence of the triphone combination, further improving accuracy.
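The tetraphone variant admits the same kind of sketch; again the factor values (0.3/0.45/0.25) are illustrative assumptions that merely satisfy v2 + v3 + v4 = 1, v3 > v2, and v3 > v4.

```python
def fusion_weight_score_3(diphone_ws: float, triphone_ws: float,
                          tetraphone_ws: float, v2: float = 0.3,
                          v3: float = 0.45, v4: float = 0.25) -> float:
    """Tetraphone variant: score = v2*score'' + v3*score''' + v4*score''''
    with v2 + v3 + v4 = 1, v3 > v2 and v3 > v4."""
    assert abs(v2 + v3 + v4 - 1.0) < 1e-9 and v3 > v2 and v3 > v4
    return v2 * diphone_ws + v3 * triphone_ws + v4 * tetraphone_ws
```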
The speech forced alignment model evaluation apparatus provided by the embodiments of the present disclosure is introduced below. The apparatus described below may be regarded as the functional module architecture that an electronic device (e.g., a PC) needs to deploy in order to implement the speech forced alignment model evaluation method provided by the embodiments of the present disclosure. The content of the apparatus described below may be cross-referenced with the content of the method described above.
FIG. 6 is a block diagram of the speech forced alignment model evaluation apparatus provided by an embodiment of the present disclosure. The apparatus may be applied on the client side as well as on the server side; referring to FIG. 6, the apparatus may include:
a first obtaining unit 100 configured to use the speech forced alignment model to be evaluated to obtain, according to each audio segment in the test set and the text corresponding to each audio segment, the phoneme sequence corresponding to each audio segment and the predicted start and end time of each phoneme in the phoneme sequence;
a second obtaining unit 110 configured to obtain, for each phoneme, the time accuracy score of the phoneme according to the predicted start and end time of the phoneme and the predetermined reference start and end time of the phoneme, where the time accuracy score characterizes how close the predicted start and end time of the phoneme is to the reference start and end time; and
a third obtaining unit 120 configured to determine the time accuracy score of the speech forced alignment model to be evaluated according to the time accuracy score of each phoneme.
It is easy to understand that the apparatus provided by the embodiments of the present disclosure inputs each audio segment of the test set and the text corresponding to each audio segment into the speech forced alignment model to be evaluated, thereby obtaining the phoneme sequence corresponding to each audio segment and the predicted start and end time of each phoneme in each phoneme sequence.
Of course, the predicted start and end time may include the time span from the predicted start moment to the predicted end moment.
Specifically, the speech forced alignment model to be evaluated may include a GMM (Gaussian mixture model) and a Viterbi decoding model; each audio segment of the test set and the text corresponding to each audio segment are input into the GMM to obtain the undecoded phoneme sequence and predicted start and end times, which are then decoded by the Viterbi decoding model to obtain the decoded phoneme sequence and the predicted start and end times.
It can be understood that the time accuracy score is how close the predicted start and end time of each phoneme is to the corresponding reference start and end time.
Here, the reference start and end time refers to the phoneme start and end time used as the evaluation benchmark, which may be obtained through manual annotation.
By comparing how close the predicted and reference start and end times of the same phoneme are, the time accuracy score of that phoneme is obtained, until the time accuracy score of every phoneme has been obtained.
The second obtaining unit 110 includes:
a third obtaining subunit configured to obtain, according to the predicted and reference start and end times of each phoneme, the intersection and the union of the predicted and reference start and end times of each phoneme; and
a fourth obtaining subunit configured to obtain the ratio of the intersection to the union of the start and end times of each phoneme, yielding the time accuracy score of each phoneme.
It is easy to understand that the intersection of the predicted and reference start and end times of a phoneme is the overlapping time of the predicted and reference start and end times of the same phoneme, and the union is the overall time spanned by the predicted and reference start and end times of the same phoneme.
After the intersection and the union of the start and end times of each phoneme are obtained, their ratio is further computed to obtain the time accuracy score of each phoneme.
It can be understood that the larger the ratio of the intersection to the union of a phoneme's start and end times, the more accurate the model to be evaluated is on that phoneme.
In this way, the intersection represents the overlap between the predicted and reference start and end times, and the union represents their maximum overall extent; their ratio accurately expresses the degree of overlap of the predicted start and end time, so the phoneme time accuracy score can be obtained, and the score accurately represents how close the predicted start and end time is to the reference start and end time.
After the time accuracy score of each phoneme in the test set is obtained, the third obtaining unit 120 can obtain the time accuracy score of the model to be evaluated from the time accuracy scores of the phonemes.
In a specific implementation, the time accuracy scores of the phonemes of the test set may simply be added up to obtain the time accuracy score of the model to be evaluated.
It is easy to understand that the higher the time accuracy score of each phoneme, the higher the time accuracy score of the model to be evaluated and the better its forced alignment effect, which makes it possible to evaluate the alignment effect of different speech forced alignment models, or the alignment effect of a speech forced alignment model before and after parameter adjustment.
It can be seen that when the speech forced alignment model evaluation apparatus provided by the embodiments of the present disclosure evaluates the model to be evaluated, the time accuracy score of each phoneme can be obtained from how close the predicted start and end time of each phoneme is to its reference start and end time, and the time accuracy score of the model then follows; there is no need to manually recheck each time a predicted start and end time is obtained through the model, nor to verify through the speech produced by subsequent speech synthesis, which simplifies the accuracy evaluation of the forced alignment model, reduces the labor and time costs the evaluation requires, and improves efficiency.
To further improve the accuracy of the evaluation of a speech forced alignment model, the embodiments of the present disclosure further provide a speech forced alignment model evaluation apparatus.
As shown in FIG. 6, the speech forced alignment model evaluation apparatus provided by the embodiments of the present disclosure further includes:
a fourth obtaining unit 130 configured to determine a current phoneme and construct the phoneme combination of the current phoneme, so as to obtain the phoneme combination of each phoneme.
The phoneme combination includes the current phoneme and at least one phoneme adjacent to the current phoneme, and the phoneme combination of each phoneme is constructed in the same manner.
After the phoneme sequence of each audio segment of the test set is obtained, one phoneme in the phoneme sequence is determined as the current phoneme, and then at least one phoneme adjacent to the current phoneme is determined to form a phoneme combination together with the current phoneme, thereby obtaining the phoneme combination corresponding to the current phoneme in the phoneme sequence; each phoneme in the phoneme sequence is determined as the current phoneme in turn, thereby obtaining the phoneme combination corresponding to each phoneme in the phoneme sequence.
It can be understood that if the phoneme combination consists of 2 phonemes, the adjacent phoneme before the current phoneme may be determined to form the combination with the current phoneme, or, of course, the adjacent phoneme after the current phoneme; if the combination consists of 3 phonemes, the phonemes adjacent to the current phoneme on both sides may be determined to form the combination with it; if the combination consists of 4 phonemes, the 2 phonemes before the current phoneme and the one phoneme after it may be determined to form the combination with it, or, of course, 1 phoneme before and 2 phonemes after the current phoneme may be chosen.
Since the start and end time of each phoneme is affected by the phonemes adjacent to it, forming a phoneme combination that takes the current phoneme and its neighboring phonemes into account provides the basis for the subsequent correction of the current phoneme's time accuracy score.
The third obtaining unit 120 includes:
a first obtaining subunit configured to obtain, according to the time accuracy score of each phoneme in each phoneme combination, the time accuracy correction score of the current phoneme of each phoneme combination, so as to obtain the time accuracy correction score of each phoneme in the phoneme sequence; and
a second obtaining subunit configured to obtain the time accuracy score of the speech forced alignment model to be evaluated according to the time accuracy correction score of each phoneme in the phoneme sequence.
After the phoneme combination of each phoneme is obtained, when one combination is constructed per phoneme, the time accuracy scores of the phonemes in the combination corresponding to the current phoneme are used to obtain the time accuracy correction score of the current phoneme.
For example, if the phoneme combination includes 3 phonemes and the combination of the current phoneme "t" is "intian", the time accuracy correction score of the current phoneme t may be:
Score(t)' = (Score(in) + Score(t) + Score(ian)) / 3.
Then the time accuracy correction scores of the phonemes are used to obtain the time accuracy score of the speech forced alignment model to be evaluated.
In this way, the speech forced alignment model evaluation apparatus provided by the embodiments of the present disclosure corrects the time accuracy score of the current phoneme with the time accuracy score of at least one phoneme adjacent to it, exploiting the context information of the current phoneme and taking into account the influence of the adjacent phonemes on the current phoneme, so that the resulting time accuracy score of the current phoneme is corrected and therefore more accurate.
To further improve the accuracy of the evaluation, the speech forced alignment model evaluation apparatus provided by the embodiments of the present disclosure further includes:
a fifth obtaining unit 140 configured to classify the phoneme combinations according to the articulation manner of each phoneme in the combination to obtain the combination category of the combination, and to determine, according to the combination category of each phoneme combination, the number of phoneme combinations of the same category and the corresponding combination weight.
The second obtaining subunit included in the third obtaining unit 120 includes:
a first obtaining module configured to obtain, for each phoneme, the weight score of the phoneme according to the time accuracy correction score of the phoneme and the combination weight of the phoneme combination corresponding to the phoneme; and
a second obtaining module configured to obtain the time accuracy score of the speech forced alignment model to be evaluated according to the weight score of each phoneme in the phoneme sequence.
After the combination of each current phoneme is obtained, the combinations are classified according to the articulation manner of each of their phonemes. Different articulation manners of adjacent phonemes have a certain influence on the parameters of the current phoneme, so the combinations can be classified by the articulation manner of each of their phonemes and the combination category of each combination determined; then, from the combination categories, the number of combinations of the same category can be determined and the combination weight of a given category obtained. From the combination weight, the weight score of each phoneme can further be obtained, which reduces the differences in the time accuracy score of the model to be evaluated caused by differing numbers of phonemes obtained from the test set, and improves the evaluation accuracy of the speech forced alignment model evaluation method provided by the embodiments of the present disclosure. Specifically, the articulation manners may be divided according to initials and finals, comprising initial articulation manners and final articulation manners, where the initial articulation manners include place articulation manners classified by articulation position and method articulation manners classified by articulation method, and the final articulation manners include structural articulation manners classified by structure and mouth-shape articulation manners classified by mouth shape.
In this way, combining the articulation-manner classification with the articulation manners of initials and finals makes the classification easier to carry out and less difficult.
After the combination categories are obtained, the combination weight of each phoneme combination is further obtained; specifically, the combination weight is the ratio of the number of phoneme combinations of the same category to the total number of phonemes in the phoneme sequence.
For ease of understanding, consider an example: when a phoneme sequence includes 100 phonemes, if each phoneme forms one combination, 100 combinations are formed; the combination category can be determined from the articulation manner of each phoneme in each combination and each combination then classified — suppose 3 categories are formed in total.
The number of combinations in each category can then be counted; suppose the first category has 20 combinations, the second 45, and the third 35. The combination weights can then be determined from the number of combinations in each category, for example: the combination weight of the first category may be 20/100 = 0.2, the combination weight of the second category may be 45/100 = 0.45, and the combination weight of the third category may be 35/100 = 0.35.
Then the weight score of the phoneme is obtained based on the combination weight and the time accuracy correction score.
Of course, the combination weight and the time accuracy correction score are obtained from the same combination of the same phoneme, so the two correspond to each other.
Specifically, the weight score of each phoneme is obtained by multiplying the combination weight by the time accuracy correction score.
After the weight score of each phoneme is obtained, the time accuracy score of the model to be evaluated can then be obtained from the weight scores of the phonemes.
Specifically, the time accuracy score of the speech forced alignment model to be evaluated is obtained through the following formula:
Score_model = W1*Score1 + W2*Score2 + ... + Wn*Scoren,
where: Score_model is the time accuracy score of the speech forced alignment model to be evaluated,
Wn is the combination weight of the n-th phoneme, and
Scoren is the time accuracy correction score of the n-th phoneme.
Obtaining weight scores reduces the influence on the time accuracy score of the model to be evaluated of the differing phoneme counts of the phoneme sequences predicted by different models, further improving the accuracy of the evaluation.
In another embodiment, to improve the accuracy of the evaluation, multiple phoneme combinations of the same phoneme may also be constructed; the combinations of each phoneme may include a diphone combination consisting of 2 phonemes and a triphone combination consisting of 3 phonemes, where of course the diphone combination includes the current phoneme and one phoneme directly adjacent to it, and the triphone combination includes the current phoneme and the two phonemes directly adjacent to it.
Constructing multiple combinations for the same phoneme makes it possible to use the multiple combinations to further improve the correction of the current phoneme's time accuracy score.
When the same phoneme has multiple combinations, the time accuracy correction score of the current phoneme is computed separately for each combination, yielding multiple time accuracy correction scores for the same phoneme.
When the same phoneme has at least two combinations at the same time, for example a diphone combination and a triphone combination, the diphone combination category and the triphone combination category of the phoneme, as well as its diphone combination weight and triphone combination weight, are obtained separately.
When a diphone combination and a triphone combination are both constructed for the same phoneme, the combination weights include the diphone combination weight and the triphone combination weight, the time accuracy correction scores include the diphone time accuracy correction score and the triphone time accuracy correction score, and the resulting weight scores include the diphone weight score and the triphone weight score.
It is easy to understand that when the weight scores of the same phoneme include a diphone weight score and a triphone weight score, in order to ensure that the time accuracy score of the model to be evaluated can be obtained, the second obtaining module in the second obtaining subunit included in the third obtaining unit 120 of the evaluation apparatus provided by the embodiments of the present disclosure includes:
a first obtaining submodule configured to obtain the fusion weight score of the current phoneme according to the diphone weight score and the triphone weight score of the current phoneme; and
a second obtaining submodule configured to obtain the time accuracy score of the speech forced alignment model to be evaluated according to the fusion weight score of each phoneme in the phoneme sequence.
In a specific implementation, the fusion weight score may be obtained through the following formula:
score = v2*score” + v3*score”';
where: v2 + v3 = 1 and v3 > v2;
score is the fusion weight score;
score” is the diphone weight score;
v2 is the diphone fusion factor;
score”' is the triphone weight score;
v3 is the triphone fusion factor.
In this way, the fusion of the different weight scores of the same phoneme is realized simply, and making the triphone fusion factor greater than the diphone fusion factor highlights the influence of the triphone combination, further improving accuracy.
Once the fusion weight scores are obtained, the time accuracy score of the speech forced alignment model to be evaluated is then further obtained.
Of course, in another specific implementation, to improve accuracy, the fourth obtaining unit 130 may also construct 3 phoneme combinations for each phoneme: in addition to the diphone combination consisting of 2 phonemes and the triphone combination consisting of 3 phonemes, a tetraphone combination consisting of 4 phonemes is also constructed. The fifth obtaining unit 140 is further configured to obtain the tetraphone combination category and the tetraphone combination weight of the phoneme. The first obtaining module in the second obtaining subunit included in the third obtaining unit 120 obtains the tetraphone weight score. The second obtaining module in the second obtaining subunit includes:
a third obtaining submodule configured to obtain the fusion weight score of the current phoneme according to the diphone weight score, the triphone weight score, and the tetraphone weight score of the current phoneme; and
a fourth obtaining submodule configured to obtain the time accuracy score of the speech forced alignment model to be evaluated according to the fusion weight score of each phoneme in the phoneme sequence.
In a specific implementation, the fusion weight score may be obtained through the following formula:
score = v2*score” + v3*score”' + v4*score””;
where: v2 + v3 + v4 = 1, v3 > v2, and v3 > v4;
score is the fusion weight score;
score” is the diphone weight score;
v2 is the diphone fusion factor;
score”' is the triphone weight score;
v3 is the triphone fusion factor;
score”” is the tetraphone weight score;
v4 is the tetraphone fusion factor.
In this way, the fusion of the different weight scores of the same phoneme is realized simply, and making the triphone fusion factor greater than the diphone fusion factor and greater than the tetraphone fusion factor highlights the influence of the triphone combination, further improving accuracy.
Of course, the embodiments of the present disclosure also provide an electronic device. The electronic device provided by the embodiments of the present disclosure can load the above-described program module architecture in the form of a program, so as to implement the speech forced alignment model evaluation method provided by the embodiments of the present disclosure; this hardware electronic device can be applied to an electronic device with specific data processing capabilities, which may be, for example, a terminal device or a server device.
Optionally, FIG. 7 shows an optional hardware device architecture provided by an embodiment of the present disclosure, which may include: at least one memory 3 and at least one processor 1, where the memory stores a program and the processor calls the program to execute the aforementioned speech forced alignment model evaluation method; in addition, at least one communication interface 2 and at least one communication bus 4. The processor 1 and the memory 3 may be located in the same electronic device, for example in a server device or a terminal device; the processor 1 and the memory 3 may also be located in different electronic devices.
As an optional implementation of the disclosure of the embodiments of the present disclosure, the memory 3 may store a program, and the processor 1 may call the program to execute the speech forced alignment model evaluation method provided by the above embodiments of the present disclosure.
In the embodiments of the present disclosure, the electronic device may be a tablet computer, a notebook computer, or another device capable of performing speech forced alignment model evaluation.
In the embodiments of the present disclosure, the number of each of the processor 1, the communication interface 2, the memory 3, and the communication bus 4 is at least one, and the processor 1, the communication interface 2, and the memory 3 communicate with one another through the communication bus 4; obviously, the communication connection of the processor 1, the communication interface 2, the memory 3, and the communication bus 4 shown in FIG. 7 is only one optional way.
Optionally, the communication interface 2 may be an interface of a communication module, for example an interface of a GSM module.
The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present disclosure.
The memory 3 may include high-speed RAM memory, and may also include non-volatile memory, for example at least one disk storage.
It should be noted that the above device may also include other devices (not shown) that may not be necessary for the disclosure of the embodiments of the present disclosure; since these other devices may not be necessary for understanding the disclosure of the embodiments of the present disclosure, they are not introduced one by one here.
The embodiments of the present disclosure further provide a computer-readable storage medium that stores computer-executable instructions; when the instructions are executed by a processor, the speech forced alignment model evaluation method described above can be implemented.
With the computer-executable instructions stored in the storage medium provided by the embodiments of the present disclosure, when evaluating the speech forced alignment model to be evaluated, the time accuracy score of each phoneme can be obtained from how close the predicted start and end time of each phoneme is to its reference start and end time, and the time accuracy score of the model then follows; there is no need to manually recheck each time a predicted start and end time is obtained through the model, nor to verify through the speech produced by subsequent speech synthesis, which simplifies the accuracy evaluation of the forced alignment model, reduces the labor and time costs the evaluation requires, and improves efficiency.
The above embodiments of the present disclosure are combinations of the elements and features of the present disclosure. Unless otherwise mentioned, the elements or features may be considered optional. Each element or feature may be practiced without being combined with other elements or features. In addition, embodiments of the present disclosure may be constructed by combining some of the elements and/or features. The order of operations described in the embodiments of the present disclosure may be rearranged. Some constructions of any one embodiment may be included in another embodiment and may be replaced with corresponding constructions of another embodiment. It is obvious to those skilled in the art that claims that do not explicitly cite each other in the appended claims may be combined into an embodiment of the present disclosure, or may be included as new claims in amendments filed after this disclosure is filed.
The embodiments of the present disclosure may be implemented by various means, for example hardware, firmware, software, or a combination thereof. In a hardware configuration, a method according to the exemplary embodiments of the present disclosure may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, and the like.
In a firmware or software configuration, the embodiments of the present disclosure may be implemented in the form of modules, procedures, functions, and the like. Software code may be stored in a memory unit and executed by a processor. The memory unit is located inside or outside the processor and may transmit data to and receive data from the processor via various known means.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present disclosure. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present disclosure. Therefore, the present disclosure will not be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Although the embodiments of the present disclosure are disclosed as above, the present disclosure is not limited thereto. Any person skilled in the art can make various changes and modifications without departing from the spirit and scope of the present disclosure; therefore, the protection scope of the present disclosure shall be subject to the scope defined by the claims.

Claims (24)

  1. A speech forced alignment model evaluation method, comprising:
    using a speech forced alignment model to be evaluated, obtaining, according to each audio segment in a test set and text corresponding to each audio segment, a phoneme sequence corresponding to each audio segment and a predicted start and end time of each phoneme in the phoneme sequence;
    for each phoneme, obtaining a time accuracy score of the phoneme according to the predicted start and end time of the phoneme and a predetermined reference start and end time of the phoneme, wherein the time accuracy score characterizes how close the predicted start and end time of the phoneme is to the reference start and end time; and
    determining the time accuracy score of the speech forced alignment model to be evaluated according to the time accuracy score of each phoneme.
  2. The method according to claim 1, wherein before the determining the time accuracy score of the speech forced alignment model to be evaluated according to the time accuracy score of each phoneme, the method further comprises:
    determining a current phoneme and constructing a phoneme combination of the current phoneme, so as to obtain a phoneme combination of each phoneme, wherein the phoneme combination of the current phoneme comprises the current phoneme and at least one phoneme adjacent to the current phoneme, and wherein the phoneme combination of each phoneme is constructed in the same manner;
    and wherein the determining the time accuracy score of the speech forced alignment model to be evaluated according to the time accuracy score of each phoneme comprises:
    obtaining, according to the time accuracy score of each phoneme in each phoneme combination, a time accuracy correction score of the current phoneme of each phoneme combination, so as to obtain a time accuracy correction score of each phoneme in the phoneme sequence; and
    obtaining the time accuracy score of the speech forced alignment model to be evaluated according to the time accuracy correction score of each phoneme in the phoneme sequence.
  3. The method according to claim 2, further comprising:
    classifying the phoneme combination according to an articulation manner of each phoneme in the phoneme combination to obtain a combination category of the phoneme combination; and
    determining, according to the combination category of each phoneme combination, the number of phoneme combinations of the same combination category and a corresponding combination weight, wherein the combination weight is a ratio of the number of phoneme combinations of the same combination category to the number of phonemes in the phoneme sequence;
    and wherein the obtaining the time accuracy score of the speech forced alignment model to be evaluated according to the time accuracy correction score of each phoneme in the phoneme sequence comprises:
    for each phoneme, obtaining a weight score of the phoneme according to the time accuracy correction score of the phoneme and the combination weight of the phoneme combination corresponding to the phoneme; and
    obtaining the time accuracy score of the speech forced alignment model to be evaluated according to the weight score of each phoneme in the phoneme sequence.
  4. The method according to claim 3, wherein the phoneme combination of the current phoneme comprises a diphone combination and a triphone combination, the diphone combination comprising the current phoneme and one phoneme directly adjacent to the current phoneme, and the triphone combination comprising the current phoneme and two phonemes directly adjacent to the current phoneme;
    the combination category comprises diphone combination categories and triphone combination categories, the combination weight comprises a diphone combination weight corresponding to each diphone combination category and a triphone combination weight corresponding to each triphone combination category, the time accuracy correction score comprises a diphone time accuracy correction score and a triphone time accuracy correction score of the current phoneme, and the weight score comprises a diphone weight score and a triphone weight score of the current phoneme;
    and wherein the obtaining the time accuracy score of the speech forced alignment model to be evaluated according to the weight score of each phoneme in the phoneme sequence comprises:
    obtaining a fusion weight score of the current phoneme according to the diphone weight score and the triphone weight score of the current phoneme; and
    obtaining the time accuracy score of the speech forced alignment model to be evaluated according to the fusion weight score of each phoneme in the phoneme sequence.
  5. The method according to claim 4, wherein the diphone combination comprises the current phoneme and the phoneme preceding the current phoneme.
  6. The method according to claim 4, wherein the fusion weight score is obtained through the following formula:
    score = v2*score” + v3*score”';
    where v2 + v3 = 1 and v3 > v2; score is the fusion weight score, score” is the diphone weight score, v2 is the diphone fusion factor, score”' is the triphone weight score, and v3 is the triphone fusion factor.
  7. The method according to claim 4, wherein the phoneme combination of the current phoneme further comprises a tetraphone combination, the tetraphone combination comprising the current phoneme and three phonemes adjacent to the current phoneme;
    the combination category further comprises tetraphone combination categories, the combination weight further comprises a tetraphone combination weight corresponding to each tetraphone combination category, the time accuracy correction score further comprises a tetraphone time accuracy correction score of the current phoneme, and the weight score further comprises a tetraphone weight score of the current phoneme;
    and wherein the obtaining the time accuracy score of the speech forced alignment model to be evaluated according to the weight score of each phoneme in the phoneme sequence comprises:
    obtaining the fusion weight score of the current phoneme according to the diphone weight score, the triphone weight score, and the tetraphone weight score of the current phoneme; and
    obtaining the time accuracy score of the speech forced alignment model to be evaluated according to the fusion weight score of each phoneme in the phoneme sequence.
  8. The method according to claim 7, wherein the fusion weight score is obtained through the following formula:
    score = v2*score” + v3*score”' + v4*score””;
    where v2 + v3 + v4 = 1, v3 > v2, and v3 > v4; score is the fusion weight score, score” is the diphone weight score, v2 is the diphone fusion factor, score”' is the triphone weight score, v3 is the triphone fusion factor, score”” is the tetraphone weight score, and v4 is the tetraphone fusion factor.
  9. The method according to any one of claims 3-8, wherein the time accuracy score of the speech forced alignment model to be evaluated is obtained through the following formula:
    Score_model = W1*Score1 + W2*Score2 + ... + Wn*Scoren,
    where Score_model is the time accuracy score of the speech forced alignment model to be evaluated, Wn is the combination weight of the n-th phoneme, and Scoren is the time accuracy correction score of the n-th phoneme.
  10. The method according to any one of claims 3-8, wherein the articulation manner comprises an initial articulation manner and a final articulation manner, the initial articulation manner comprising a place articulation manner classified by articulation position and a method articulation manner classified by articulation method, and the final articulation manner comprising a structural articulation manner classified by structure and a mouth-shape articulation manner classified by mouth shape.
  11. The method according to any one of claims 1-8, wherein, for each phoneme, the obtaining the time accuracy score of the phoneme according to the predicted start and end time of the phoneme and the predetermined reference start and end time of the phoneme comprises:
    obtaining an intersection and a union of the predicted start and end time and the reference start and end time of each phoneme; and
    obtaining the time accuracy score of each phoneme according to a ratio of the intersection to the union of the start and end times of each phoneme.
  12. A speech forced alignment model evaluation apparatus, comprising:
    a first obtaining unit configured to use a speech forced alignment model to be evaluated to obtain, according to each audio segment in a test set and text corresponding to each audio segment, a phoneme sequence corresponding to each audio segment and a predicted start and end time of each phoneme in the phoneme sequence;
    a second obtaining unit configured to obtain, for each phoneme, a time accuracy score of the phoneme according to the predicted start and end time of the phoneme and a predetermined reference start and end time of the phoneme, wherein the time accuracy score characterizes how close the predicted start and end time of the phoneme is to the reference start and end time; and
    a third obtaining unit configured to determine the time accuracy score of the speech forced alignment model to be evaluated according to the time accuracy score of each phoneme.
  13. The apparatus according to claim 12, further comprising a fourth obtaining unit configured to:
    determine a current phoneme and construct a phoneme combination of the current phoneme, so as to obtain a phoneme combination of each phoneme, wherein the phoneme combination of the current phoneme comprises the current phoneme and at least one phoneme adjacent to the current phoneme, and wherein the phoneme combination of each phoneme is constructed in the same manner;
    and wherein the third obtaining unit comprises:
    a first obtaining subunit configured to obtain, according to the time accuracy score of each phoneme in each phoneme combination, a time accuracy correction score of the current phoneme of each phoneme combination, so as to obtain a time accuracy correction score of each phoneme in the phoneme sequence; and
    a second obtaining subunit configured to obtain the time accuracy score of the speech forced alignment model to be evaluated according to the time accuracy correction score of each phoneme in the phoneme sequence.
  14. The apparatus according to claim 13, further comprising:
    a fifth obtaining unit configured to classify the phoneme combination according to an articulation manner of each phoneme in the phoneme combination to obtain a combination category of the phoneme combination; and
    determine, according to the combination category of each phoneme combination, the number of phoneme combinations of the same combination category and a corresponding combination weight, wherein the combination weight is a ratio of the number of phoneme combinations of the same combination category to the number of phonemes in the phoneme sequence;
    and wherein the second obtaining subunit comprises:
    a first obtaining module configured to obtain, for each phoneme, a weight score of the phoneme according to the time accuracy correction score of the phoneme and the combination weight of the phoneme combination corresponding to the phoneme; and
    a second obtaining module configured to obtain the time accuracy score of the speech forced alignment model to be evaluated according to the weight score of each phoneme in the phoneme sequence.
  15. The apparatus according to claim 14, wherein the phoneme combination of the current phoneme comprises a diphone combination and a triphone combination, the diphone combination comprising the current phoneme and one phoneme directly adjacent to the current phoneme, and the triphone combination comprising the current phoneme and two phonemes directly adjacent to the current phoneme;
    the combination category comprises diphone combination categories and triphone combination categories, the combination weight comprises a diphone combination weight corresponding to each diphone combination category and a triphone combination weight corresponding to each triphone combination category, the time accuracy correction score comprises a diphone time accuracy correction score and a triphone time accuracy correction score of the current phoneme, and the weight score comprises a diphone weight score and a triphone weight score of the current phoneme;
    and wherein the second obtaining module comprises:
    a first obtaining submodule configured to obtain a fusion weight score of the current phoneme according to the diphone weight score and the triphone weight score of the current phoneme; and
    a second obtaining submodule configured to obtain the time accuracy score of the speech forced alignment model to be evaluated according to the fusion weight score of each phoneme in the phoneme sequence.
  16. The apparatus according to claim 15, wherein the diphone combination comprises the current phoneme and the phoneme preceding the current phoneme.
  17. The apparatus according to claim 15, wherein the fusion weight score is obtained through the following formula:
    score = v2*score” + v3*score”';
    where v2 + v3 = 1 and v3 > v2; score is the fusion weight score, score” is the diphone weight score, v2 is the diphone fusion factor, score”' is the triphone weight score, and v3 is the triphone fusion factor.
  18. The apparatus according to claim 15, wherein the phoneme combination of the current phoneme further comprises a tetraphone combination, the tetraphone combination comprising the current phoneme and three phonemes adjacent to the current phoneme;
    the combination category further comprises tetraphone combination categories, the combination weight further comprises a tetraphone combination weight corresponding to each tetraphone combination category, the time accuracy correction score further comprises a tetraphone time accuracy correction score of the current phoneme, and the weight score further comprises a tetraphone weight score of the current phoneme;
    and wherein the second obtaining module comprises:
    a third obtaining submodule configured to obtain the fusion weight score of the current phoneme according to the diphone weight score, the triphone weight score, and the tetraphone weight score of the current phoneme; and
    a fourth obtaining submodule configured to obtain the time accuracy score of the speech forced alignment model to be evaluated according to the fusion weight score of each phoneme in the phoneme sequence.
  19. The apparatus according to claim 18, wherein the fusion weight score is obtained through the following formula:
    score = v2*score” + v3*score”' + v4*score””;
    where v2 + v3 + v4 = 1, v3 > v2, and v3 > v4; score is the fusion weight score, score” is the diphone weight score, v2 is the diphone fusion factor, score”' is the triphone weight score, v3 is the triphone fusion factor, score”” is the tetraphone weight score, and v4 is the tetraphone fusion factor.
  20. The apparatus according to any one of claims 14-19, wherein the time accuracy score of the speech forced alignment model to be evaluated is obtained through the following formula:
    Score_model = W1*Score1 + W2*Score2 + ... + Wn*Scoren,
    where Score_model is the time accuracy score of the speech forced alignment model to be evaluated, Wn is the combination weight of the n-th phoneme, and Scoren is the time accuracy correction score of the n-th phoneme.
  21. The apparatus according to any one of claims 14-19, wherein the articulation manner comprises an initial articulation manner and a final articulation manner, the initial articulation manner comprising a place articulation manner classified by articulation position and a method articulation manner classified by articulation method, and the final articulation manner comprising a structural articulation manner classified by structure and a mouth-shape articulation manner classified by mouth shape.
  22. The apparatus according to any one of claims 12-19, wherein the second obtaining unit comprises:
    a third obtaining subunit configured to obtain an intersection and a union of the predicted start and end time and the reference start and end time of each phoneme; and
    a fourth obtaining subunit configured to obtain the time accuracy score of each phoneme according to a ratio of the intersection to the union of the start and end times of each phoneme.
  23. A storage medium, wherein the storage medium stores a program suitable for speech forced alignment model evaluation, so as to implement the speech forced alignment model evaluation method according to any one of claims 1-11.
  24. An electronic device, comprising:
    at least one memory; and
    at least one processor,
    wherein the memory stores a program, and the processor calls the program to execute the speech forced alignment model evaluation method according to any one of claims 1-11.
PCT/CN2021/108899 2020-09-07 2021-07-28 Speech forced alignment model evaluation method and apparatus, electronic device, and storage medium WO2022048354A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
AU2021336957A AU2021336957B2 (en) 2020-09-07 2021-07-28 Method for evaluating a speech forced alignment model, electronic device, and storage medium
CA3194051A CA3194051C (en) 2020-09-07 2021-07-28 Method for evaluating a speech forced alignment model, electronic device, and storage medium
US18/178,813 US11749257B2 (en) 2020-09-07 2023-03-06 Method for evaluating a speech forced alignment model, electronic device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010925650.2 2020-09-07
CN202010925650.2A CN111798868B (zh) 2020-09-07 2020-09-07 Speech forced alignment model evaluation method and apparatus, electronic device, and storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/178,813 Continuation US11749257B2 (en) 2020-09-07 2023-03-06 Method for evaluating a speech forced alignment model, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022048354A1 true WO2022048354A1 (zh) 2022-03-10

Family

ID=72834301

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/108899 WO2022048354A1 (zh) 2020-09-07 2021-07-28 语音强制对齐模型评价方法、装置、电子设备及存储介质

Country Status (5)

Country Link
US (1) US11749257B2 (zh)
CN (1) CN111798868B (zh)
AU (1) AU2021336957B2 (zh)
CA (1) CA3194051C (zh)
WO (1) WO2022048354A1 (zh)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112466272B (zh) * 2020-10-23 2023-01-17 浙江同花顺智能科技有限公司 Evaluation method, apparatus, device, and storage medium for a speech synthesis model
US11798527B2 (en) 2020-08-19 2023-10-24 Zhejiang Tonghu Ashun Intelligent Technology Co., Ltd. Systems and methods for synthesizing speech
CN111798868B (zh) * 2020-09-07 2020-12-08 北京世纪好未来教育科技有限公司 Speech forced alignment model evaluation method and apparatus, electronic device, and storage medium
CN112420015A (zh) * 2020-11-18 2021-02-26 腾讯音乐娱乐科技(深圳)有限公司 Audio synthesis method, apparatus, device, and computer-readable storage medium
CN112542159B (zh) * 2020-12-01 2024-04-09 腾讯音乐娱乐科技(深圳)有限公司 A data processing method and device
CN112908308B (zh) * 2021-02-02 2024-05-14 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method, apparatus, device, and medium
CN112992184B (zh) * 2021-04-20 2021-09-10 北京世纪好未来教育科技有限公司 Pronunciation evaluation method and apparatus, electronic device, and storage medium
CN117095672A (zh) * 2023-07-12 2023-11-21 支付宝(杭州)信息技术有限公司 Digital human lip shape generation method and apparatus


Family Cites Families (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5333275A (en) * 1992-06-23 1994-07-26 Wheatley Barbara J System and method for time aligning speech
US7146319B2 (en) * 2003-03-31 2006-12-05 Novauris Technologies Ltd. Phonetically based speech recognition system and method
US20090326947A1 (en) * 2008-06-27 2009-12-31 James Arnold System and method for spoken topic or criterion recognition in digital media and contextual advertising
US11062615B1 (en) * 2011-03-01 2021-07-13 Intelligibility Training LLC Methods and systems for remote language learning in a pandemic-aware world
US8729374B2 (en) * 2011-07-22 2014-05-20 Howling Technology Method and apparatus for converting a spoken voice to a singing voice sung in the manner of a target singer
WO2014031918A2 (en) * 2012-08-24 2014-02-27 Interactive Intelligence, Inc. Method and system for selectively biased linear discriminant analysis in automatic speech recognition systems
JP2014240940A (ja) * 2013-06-12 2014-12-25 株式会社東芝 Transcription support apparatus, method, and program
US9418650B2 (en) * 2013-09-25 2016-08-16 Verizon Patent And Licensing Inc. Training speech recognition using captions
US9947322B2 (en) * 2015-02-26 2018-04-17 Arizona Board Of Regents Acting For And On Behalf Of Northern Arizona University Systems and methods for automated evaluation of human speech
US9558734B2 (en) * 2015-06-29 2017-01-31 Vocalid, Inc. Aging a text-to-speech voice
US9336782B1 (en) * 2015-06-29 2016-05-10 Vocalid, Inc. Distributed collection and processing of voice bank data
US9786270B2 (en) * 2015-07-09 2017-10-10 Google Inc. Generating acoustic models
US10706873B2 (en) * 2015-09-18 2020-07-07 Sri International Real-time speaker state analytics platform
US10884503B2 (en) * 2015-12-07 2021-01-05 Sri International VPA with integrated object recognition and facial expression recognition
WO2017112813A1 (en) * 2015-12-22 2017-06-29 Sri International Multi-lingual virtual personal assistant
US10043519B2 (en) * 2016-09-02 2018-08-07 Tim Schlippe Generation of text from an audio speech signal
US11443646B2 (en) * 2017-12-22 2022-09-13 Fathom Technologies, LLC E-Reader interface system with audio and highlighting synchronization for digital books
CN108510978B (zh) 2018-04-18 2020-08-21 中国人民解放军62315部队 Modeling method and system for an English acoustic model applied to language identification
US11887622B2 (en) * 2018-09-14 2024-01-30 United States Department Of Veteran Affairs Mental health diagnostics using audio data
CN109377981B (zh) * 2018-11-22 2021-07-23 四川长虹电器股份有限公司 Phoneme alignment method and apparatus
CN111105785B (zh) 2019-12-17 2023-06-16 广州多益网络股份有限公司 Method and apparatus for text prosodic boundary recognition

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7216079B1 (en) * 1999-11-02 2007-05-08 Speechworks International, Inc. Method and apparatus for discriminative training of acoustic models of a speech recognition system
CN101651788A (zh) * 2008-12-26 2010-02-17 中国科学院声学研究所 Online speech-text alignment system and method
US20100324900A1 (en) * 2009-06-19 2010-12-23 Ronen Faifkov Searching in Audio Speech
CN109903752A (zh) * 2018-05-28 2019-06-18 华为技术有限公司 Method and apparatus for aligning speech
WO2020027394A1 (ko) * 2018-08-02 2020-02-06 미디어젠 주식회사 Apparatus and method for evaluating phoneme-level pronunciation accuracy
CN109326277A (zh) * 2018-12-05 2019-02-12 四川长虹电器股份有限公司 Semi-supervised phoneme forced alignment model building method and system
CN109545243A (zh) * 2019-01-23 2019-03-29 北京猎户星空科技有限公司 Pronunciation quality evaluation method and apparatus, electronic device, and storage medium
CN111312231A (zh) * 2020-05-14 2020-06-19 腾讯科技(深圳)有限公司 Audio detection method and apparatus, electronic device, and readable storage medium
CN111798868A (zh) * 2020-09-07 2020-10-20 北京世纪好未来教育科技有限公司 Speech forced alignment model evaluation method and apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
US20230206902A1 (en) 2023-06-29
CA3194051A1 (en) 2022-03-10
AU2021336957A1 (en) 2023-05-04
CA3194051C (en) 2023-11-07
CN111798868A (zh) 2020-10-20
CN111798868B (zh) 2020-12-08
US11749257B2 (en) 2023-09-05
AU2021336957B2 (en) 2023-09-28

Similar Documents

Publication Publication Date Title
WO2022048354A1 (zh) Speech forced alignment model evaluation method and apparatus, electronic device, and storage medium
US10832002B2 (en) System and method for scoring performance of chatbots
TWI666558B (zh) Semantic analysis method, semantic analysis system, and non-transitory computer-readable medium
WO2017166650A1 (zh) Speech recognition method and apparatus
US9396724B2 (en) Method and apparatus for building a language model
WO2018157840A1 (zh) 语音识别测试方法及测试终端、计算设备及存储介质
CN108564953B (zh) Punctuation processing method and apparatus for speech recognition text
US20220375459A1 (en) Decoding network construction method, voice recognition method, device and apparatus, and storage medium
US10553206B2 (en) Voice keyword detection apparatus and voice keyword detection method
CN108052498A (zh) Word-level correction of speech input
US10796096B2 (en) Semantic expression generation method and apparatus
CN107039040A (zh) Speech recognition system
US10102771B2 (en) Method and device for learning language and computer readable recording medium
WO2021120602A1 (zh) Rhythm point detection method and apparatus, and electronic device
US11373638B2 (en) Presentation assistance device for calling attention to words that are forbidden to speak
TWI660340B (zh) Voice control method and system
US10997966B2 (en) Voice recognition method, device and computer storage medium
US20230059882A1 (en) Speech synthesis method and apparatus, device and computer storage medium
CN112331194A (zh) Input method, apparatus, and electronic device
CN111048098B (zh) Speech correction system and speech correction method
CN115116442B (zh) Voice interaction method and electronic device
EP4095847A1 (en) Method and apparatus for processing voice recognition result, electronic device, and computer medium
JP2014153479A (ja) Diagnosis system, diagnosis method, and program
CN117668151A (zh) Intelligent question answering method and apparatus, electronic device, and medium
JP2013130904A (ja) Compound word reading display method and program, and reading generation apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21863422

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 3194051

Country of ref document: CA

WWE Wipo information: entry into national phase

Ref document number: AU2021336957

Country of ref document: AU

WWE Wipo information: entry into national phase

Ref document number: 2021336957

Country of ref document: AU

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2021336957

Country of ref document: AU

Date of ref document: 20210728

Kind code of ref document: A

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03.07.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21863422

Country of ref document: EP

Kind code of ref document: A1