CN109166569B - Detection method and device for phoneme mislabeling - Google Patents

Detection method and device for phoneme mislabeling

Info

Publication number
CN109166569B
CN109166569B (application CN201810827500.0A)
Authority
CN
China
Prior art keywords
phoneme
confidence
labeling result
combined
target
Prior art date
Legal status
Active
Application number
CN201810827500.0A
Other languages
Chinese (zh)
Other versions
CN109166569A (en)
Inventor
孟君
廖晓玲
郝玉峰
Current Assignee
Beijing Haitian Rui Sheng Polytron Technologies Inc
Original Assignee
Beijing Haitian Rui Sheng Polytron Technologies Inc
Priority date
Filing date
Publication date
Family has litigation
Application filed by Beijing Haitian Rui Sheng Polytron Technologies Inc filed Critical Beijing Haitian Rui Sheng Polytron Technologies Inc
Priority to CN201810827500.0A
Publication of CN109166569A
Application granted
Publication of CN109166569B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method and a device for detecting phoneme mislabeling.

Description

Detection method and device for phoneme mislabeling
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a method and a device for detecting phoneme mislabeling.
Background
At present, the mainstream speech synthesis methods are the parametric speech synthesis method and the waveform concatenation synthesis method. Both methods require fine-grained phoneme labeling data, and the correctness of the phoneme labels directly affects the timbre, expressiveness, and other qualities of the finally synthesized speech. The correctness of phoneme labeling is therefore an important index for measuring the quality of a speech synthesis database.
In the prior art, a labeling person first checks the labeled phonemes and re-labels the wrong ones, and the re-labeled phoneme data is then returned for inspection. For a phoneme data packet, the inspector spot-checks 30% of the phoneme data; if the labels in that 30% are all correct, all the phoneme labels in the packet are assumed to be correct.
This manual way of checking phoneme labels is time-consuming and labor-intensive, and phonemes with obviously wrong labels in the unchecked data are easily missed, so the quality of the synthesized speech suffers.
Disclosure of Invention
The invention provides a method and a device for detecting phoneme mislabeling that use a preset phoneme confidence model to obtain the confidence of a phoneme labeling result, so that mis-labeled phonemes can be found quickly and accurately.
A first aspect of the present invention provides a method for detecting phoneme mislabeling, comprising:
acquiring a confidence of the phoneme labeling result of a target combined phoneme according to a target voice corresponding to the target combined phoneme, the phoneme labeling result of the target combined phoneme, and a preset phoneme confidence model, wherein the phoneme confidence model represents the correspondence between the target voice, the phoneme labeling result of the target combined phoneme, and the confidence of the phoneme labeling result of the target combined phoneme;
and when the confidence of the phoneme labeling result of the target combined phoneme is smaller than a confidence threshold, determining the target combined phoneme as a mis-labeled phoneme.
Optionally, before determining that the target combined phoneme is the mis-labeled phoneme, the method further includes:
obtaining a confidence standard deviation and a confidence mean according to the confidence of the phoneme labeling result of each target combined phoneme;
and acquiring the confidence threshold according to the confidence standard deviation and the confidence mean.
Optionally, the obtaining the confidence threshold according to the confidence standard deviation and the confidence mean includes:
obtaining a triple confidence standard deviation, which is three times the confidence standard deviation;
and determining the difference between the confidence mean and the triple confidence standard deviation as the confidence threshold.
Optionally, before the obtaining the target combined phoneme, the method further includes:
and training the phoneme confidence model according to the voices corresponding to a plurality of combined phonemes corresponding to all the phoneme data in the speech synthesis database, the phoneme labeling result of each combined phoneme, and the confidence of the phoneme labeling result of each combined phoneme, wherein each piece of phoneme data includes at least one phoneme sequence.
Optionally, before the training of the phoneme confidence model, the method further includes:
performing phoneme segmentation processing on each phoneme sequence to obtain a plurality of single phonemes corresponding to each phoneme sequence;
and combining each monophone with the monophones adjacent to it to obtain the combined phoneme.
Optionally, the phoneme labeling result of the target combined phoneme includes a phoneme boundary labeling result of the target combined phoneme.
Optionally, after determining that the target combined phoneme is the mis-labeled phoneme, the method further includes:
and re-labeling the mis-labeled phoneme.
A second aspect of the present invention provides an apparatus for detecting phoneme mislabeling, comprising:
a confidence level obtaining module, configured to obtain a confidence level of a phoneme labeling result of a target combined phoneme according to a target voice corresponding to the target combined phoneme, the phoneme labeling result of the target combined phoneme, and a preset phoneme confidence level model, where the phoneme confidence level model is used to represent a correspondence relationship between the target voice, the phoneme labeling result of the target combined phoneme, and the confidence level of the phoneme labeling result of the target combined phoneme;
and a mis-labeled phoneme determining module, configured to determine the target combined phoneme as a mis-labeled phoneme when the confidence of the phoneme labeling result of the target combined phoneme is less than a confidence threshold.
Optionally, the apparatus further comprises: a confidence threshold acquisition module.
The confidence threshold acquisition module is configured to acquire a confidence standard deviation and a confidence mean according to the confidence of the phoneme labeling result of each target combined phoneme, and to acquire the confidence threshold according to the confidence standard deviation and the confidence mean.
Optionally, the confidence threshold acquisition module is specifically configured to obtain a triple confidence standard deviation, which is three times the confidence standard deviation;
and to determine the difference between the confidence mean and the triple confidence standard deviation as the confidence threshold.
Optionally, the apparatus further comprises a phoneme confidence model acquisition module.
The phoneme confidence model acquisition module is configured to train the phoneme confidence model according to the voices corresponding to the multiple combined phonemes corresponding to all the phoneme data in the speech synthesis database, the phoneme labeling result of each combined phoneme, and the confidence of the phoneme labeling result of each combined phoneme, wherein each piece of phoneme data includes at least one phoneme sequence.
Optionally, the apparatus further comprises a combined phoneme acquisition module.
The combined phoneme acquisition module is configured to perform phoneme segmentation on each phoneme sequence to obtain the plurality of monophones corresponding to each phoneme sequence, and to combine each monophone with the monophones adjacent to it to obtain a combined phoneme.
Optionally, the phoneme labeling result of the target combined phoneme includes a phoneme boundary labeling result of the target combined phoneme.
Optionally, the apparatus further comprises a re-labeling module.
The re-labeling module is configured to re-label the mis-labeled phonemes.
A third aspect of the present invention provides a device for detecting phoneme mislabeling, comprising at least one processor and a memory;
the memory stores computer-executable instructions;
the at least processors execute the computer-executable instructions stored in the memory to cause the device to perform the method of phoneme mislabeling.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon computer-executable instructions that, when executed by a processor, implement the above method for detecting phoneme mislabeling.
The invention provides a method and a device for detecting phoneme mislabeling. By using a preset phoneme confidence model to obtain the confidence of a phoneme labeling result, mis-labeled phonemes can be found quickly and accurately.
Drawings
FIG. 1 is a schematic diagram of a system architecture suitable for the method for detecting phoneme mislabeling provided by the present invention;
FIG. 2 is a first flowchart of the method for detecting phoneme mislabeling provided by the present invention;
FIG. 3 is a second flowchart of the method for detecting phoneme mislabeling provided by the present invention;
FIG. 4 is a first schematic structural diagram of the device for detecting phoneme mislabeling provided by the present invention;
FIG. 5 is a second schematic structural diagram of the device for detecting phoneme mislabeling provided by the present invention;
FIG. 6 is a third schematic structural diagram of the device for detecting phoneme mislabeling provided by the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be described below in conjunction with those embodiments; obviously, the described embodiments are some, but not all, of the embodiments of the present invention.
The method for detecting phoneme mislabeling provided by the invention is suitable for a phoneme mislabeling detection system. Fig. 1 is a schematic diagram of a system architecture suitable for this method. As shown in Fig. 1, the system includes a device for detecting phoneme mislabeling that performs the method, and a speech synthesis database. The speech synthesis database includes a plurality of speech segments and a phoneme text corresponding to each speech segment; specifically, the phoneme text may be a phoneme sequence text corresponding to the speech, and each speech segment may correspond to at least one phoneme sequence text. Correspondingly, the speech synthesis database also includes a phoneme labeling result for each phoneme in each phoneme sequence.
The phoneme labeling result may include tone labeling, rhythm labeling, phoneme boundary labeling, and other labeling results. Phoneme boundary labeling marks the starting and ending time points of a phoneme; if the phoneme boundary label of a phoneme is wrong, the naturalness and fluency of the synthesized sentence are seriously affected and the user experience is poor.
In this embodiment, the device for detecting phoneme mislabeling checks the phoneme labels in the speech synthesis database to obtain mis-labeled phonemes; further, the mis-labeled phonemes are corrected so that the accuracy of the phoneme labels in the speech synthesis database reaches more than 99%, thereby improving the quality of the speech synthesis database and the user experience.
Fig. 2 is a schematic flowchart of the method for detecting phoneme mislabeling provided by the present invention. The execution body of the flow shown in Fig. 2 may be a device for detecting phoneme mislabeling, which may be implemented by any software and/or hardware. As shown in Fig. 2, the method provided by this embodiment may include:
s101, obtaining the confidence of the phoneme labeling result of the target combined phoneme according to the target voice corresponding to the target combined phoneme, the phoneme labeling result of the target combined phoneme and a preset phoneme confidence model, wherein the phoneme confidence model is used for representing the corresponding relation of the target voice, the phoneme labeling result of the target combined phoneme and the confidence of the phoneme labeling result of the target combined phoneme.
In this embodiment, in order to improve the accuracy of phoneme label detection, the phoneme labels in the speech synthesis database are detected not only from the perspective of monophone labeling but also, in combination with context, from the perspective of triphone or multi-phone labeling.
The speech synthesis database in this embodiment stores the phoneme labeling result of the target combined phoneme in advance. Specifically, the phoneme labeling may be performed manually, directly by a machine, or in a combined man-machine manner; the specific labeling manner of the target combined phoneme is not limited in this embodiment.
The confidence of the phoneme labeling result of the target combined phoneme is acquired according to the target voice corresponding to the target combined phoneme in the speech synthesis database to be detected, the phoneme labeling result of the target combined phoneme, and a preset phoneme confidence model, where the phoneme confidence model represents the correspondence between the target voice, the phoneme labeling result of the target combined phoneme, and the confidence of the phoneme labeling result. The confidence of a phoneme labeling result is the probability that the labeling result is correct; specifically, in this embodiment it is the similarity between a phoneme and the corresponding pronunciation in the voice. The higher the confidence, the higher the probability that the phoneme labeling result of the target combined phoneme is correct, and the lower the confidence, the lower that probability.
Specifically, the whole speech synthesis database is speech of the same person, so the phoneme confidence model obtained from the combined-phoneme data in the speech synthesis database already corresponds to that speaker, and no speaker-adaptive training is needed.
S102, when the confidence of the phoneme labeling result of the target combined phoneme is smaller than a confidence threshold, determining the target combined phoneme as a mis-labeled phoneme.
In this embodiment, the confidence threshold may be preset. Specifically, it may be set as follows: obtain the confidences of the phoneme labeling results corresponding to a plurality of speech synthesis databases different from the one containing the target combined phoneme, compute the mean of all those confidences, and use the mean as the confidence threshold. Alternatively, the confidence threshold may be obtained from the confidences of all the phoneme labeling results in the speech synthesis database under detection; this setting is closer to the actual situation of the detected database and yields the mis-labeled phonemes more accurately.
Specifically, because the number of phonemes in the speech synthesis database is large, by the central limit theorem the confidences of the phoneme labeling results approximately follow a normal distribution. When the confidence of the phoneme labeling result of the target combined phoneme is smaller than the confidence threshold, the target combined phoneme is determined to be a mis-labeled phoneme, and a further step may be performed to re-label it, so as to improve the accuracy of phoneme labeling in the speech synthesis database.
The method for detecting phoneme mislabeling provided by this embodiment includes: obtaining the confidence of the phoneme labeling result of the target combined phoneme according to the target voice corresponding to the target combined phoneme, the phoneme labeling result of the target combined phoneme, and a preset phoneme confidence model, where the phoneme confidence model represents the correspondence between the target voice, the phoneme labeling result, and its confidence; and determining the target combined phoneme as a mis-labeled phoneme when the confidence is smaller than the confidence threshold. Because a preset phoneme confidence model is used to obtain the confidence of the phoneme labeling result, mis-labeled phonemes can be obtained quickly and accurately.
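The flow of S101 and S102 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: `score_fn` is a hypothetical stand-in for the preset phoneme confidence model, and the record layout is assumed.

```python
def detect_mislabeled(combined_phonemes, score_fn, confidence_threshold):
    """Return the target combined phonemes whose labeling-result
    confidence falls below the confidence threshold.

    score_fn is a hypothetical stand-in for the preset phoneme
    confidence model: it maps (target_voice, labeling_result) to a
    confidence score.
    """
    mislabeled = []
    for phoneme in combined_phonemes:
        # S101: obtain the confidence of the phoneme labeling result
        confidence = score_fn(phoneme["voice"], phoneme["labeling"])
        # S102: flag the phoneme when the confidence is below the threshold
        if confidence < confidence_threshold:
            mislabeled.append(phoneme)
    return mislabeled
```

When trying this out, a dictionary of precomputed scores can stand in for the trained model.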
Fig. 3 is a second flowchart of the method for detecting phoneme mislabeling provided by the present invention. As shown in Fig. 3, the method provided by this embodiment may include:
s201, training and acquiring a phoneme confidence model according to a plurality of combined phonemes corresponding to all the phoneme data in the speech synthesis database, a phoneme labeling result of each combined phoneme and a confidence of the phoneme labeling result of each combined phoneme, wherein phoneme data comprise at least phoneme sequences.
Specifically, in this embodiment, in order to improve the reliability of the phoneme confidence model and the accuracy of the detection of phoneme labeling results, phoneme segmentation is performed on each phoneme sequence to obtain the plurality of monophones corresponding to each phoneme sequence.
The speech synthesis database includes a plurality of phoneme sequences. Each phoneme sequence is segmented with an existing segmentation tool to obtain the monophones corresponding to it, where a monophone is the smallest unit of a phoneme sequence. In Chinese, for example, a monophone can be a segmentation unit of pinyin, such as an initial or a final: ā (a) has only one monophone, ài (ai) has one monophone, and dāi (dai) has two monophones.
Correspondingly, the phoneme sequences of different languages have their own characteristics, and the monophones corresponding to phoneme sequences in different languages can be obtained with language-specific phoneme segmentation methods; for example, an English phoneme sequence may be a phonetic-symbol sequence, and a monophone may be a single phonetic symbol.
Each monophone is then combined with the monophones adjacent to it to obtain a combined phoneme.
In order to capture context, each monophone and its adjacent monophones are combined to obtain the combined phonemes corresponding to the phoneme sequence.
A phoneme sequence may correspond to one or more combined phonemes in this embodiment. Specifically, the combined phonemes obtained in this embodiment are triphones, that is, combinations of three monophones.
When a monophone is at the head of a phoneme sequence, it is combined with the two monophones following it to obtain a combined phoneme; when it is at the tail of the phoneme sequence, it is combined with the two monophones preceding it; and when it is in the middle of the phoneme sequence, it is combined with the monophone before it and the monophone after it. When the phoneme sequence contains three or fewer monophones, the number of combined phonemes is one.
Specifically, the combined phonemes may be obtained in the order in which the monophones are arranged in the phoneme sequence; that is, after a monophone has been combined with its adjacent monophones, it is not repeatedly combined with other monophones adjacent to it.
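One possible reading of the head/middle/tail combination rule can be sketched as follows. This is a hedged illustration only: the patent gives no code, and duplicate windows are dropped here so that no combination is repeated.

```python
def extract_triphones(monophones):
    """Combine each monophone with its neighbours into triphones.

    Head monophone: combined with the two monophones after it.
    Tail monophone: combined with the two monophones before it.
    Middle monophone: combined with one before and one after.
    A sequence of three or fewer monophones yields one combined phoneme.
    """
    n = len(monophones)
    if n <= 3:
        return [tuple(monophones)]
    triphones = []
    for i in range(n):
        if i == 0:
            lo = 0          # head: itself plus the two following
        elif i == n - 1:
            lo = n - 3      # tail: itself plus the two preceding
        else:
            lo = i - 1      # middle: one before, itself, one after
        window = tuple(monophones[lo:lo + 3])
        if window not in triphones:  # do not repeat a combination
            triphones.append(window)
    return triphones
```

For a five-monophone sequence this yields three distinct triphones, since the head and tail windows coincide with their neighbouring middle windows.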
The training samples for the phoneme confidence model in this embodiment may come from the speech synthesis database under detection. Specifically, a training sample includes the combined phonemes corresponding to all the phoneme data in the database, the phoneme labeling result of each combined phoneme, and the confidence of the phoneme labeling result of each combined phoneme; these three parameters of a combined phoneme are used as input to obtain the correspondence between the combined phoneme, its phoneme labeling result, and the confidence of that labeling result.
Since the number of training samples for the phoneme confidence model in this embodiment is large, the model is trained continuously in a deep-learning manner to obtain a phoneme confidence model of high reliability.
Specifically, when training the phoneme confidence model in this embodiment, the labeling result of each combined phoneme may also be a phoneme boundary labeling result; that is, the obtained phoneme confidence model can also represent the correspondence between the combined phoneme, the phoneme boundary labeling result of each combined phoneme, and the confidence of the phoneme boundary labeling result of each combined phoneme.
S202, obtaining the confidence of the phoneme labeling result of the target combined phoneme according to the target voice corresponding to the target combined phoneme, the phoneme labeling result of the target combined phoneme and a preset phoneme confidence model.
For the specific manner of obtaining the confidence of the phoneme labeling result of the target combined phoneme in S202, reference may be made to the description of S101 in the above embodiment, which is not repeated here.
S203, obtaining a confidence standard deviation and a confidence mean according to the confidence of the phoneme labeling result of each target combined phoneme, and obtaining the confidence threshold according to the confidence standard deviation and the confidence mean.
Here the confidence threshold is obtained from the confidences of the phoneme labeling results of the target combined phonemes in the speech synthesis database under detection. Because the number of phonemes in the database is large, by the central limit theorem the confidences approximately follow a normal distribution, so the threshold below which phonemes are most likely to be problematic can be determined as a multiple of the standard deviation below the confidence mean, and a target combined phoneme whose confidence is smaller than the threshold is determined to be a mis-labeled phoneme.
Specifically, the confidence standard deviation is obtained from the confidences of the phoneme labeling results of the target combined phonemes in the same way that the standard deviation of a normally distributed series is obtained in the prior art, and the confidence mean is obtained from the same confidences.
A triple confidence standard deviation, three times the confidence standard deviation, is obtained from the confidence standard deviation, and the difference between the confidence mean and the triple confidence standard deviation is determined as the confidence threshold.
For example, an English speech synthesis database contains 4427 WAV files, where WAV is a storage format of speech, so the database can be understood as containing 4427 speech segments, specifically 201,000 phonemes. In the manner described above in S201 to S203, the confidence mean is -92.38 and the confidence standard deviation is 12.54, and the confidence threshold is determined as the difference between the confidence mean and the triple confidence standard deviation, that is, -92.38 - 3 × 12.54 = -130.00.
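The threshold computation in this example can be checked with a short sketch. Population standard deviation is assumed here, since the patent does not specify sample versus population statistics:

```python
import math

def confidence_threshold(confidences):
    """Confidence mean minus three confidence standard deviations."""
    n = len(confidences)
    mean = sum(confidences) / n
    variance = sum((c - mean) ** 2 for c in confidences) / n  # population variance
    return mean - 3 * math.sqrt(variance)

# Reproducing the numbers from the English-database example:
mean, std = -92.38, 12.54
threshold = mean - 3 * std  # -92.38 - 37.62 = -130.00
```

Any target combined phoneme scoring below this threshold would then be flagged as mis-labeled.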
S204, when the confidence of the phoneme labeling result of the target combined phoneme is smaller than the confidence threshold, determining the target combined phoneme as a mis-labeled phoneme.
For the specific manner of obtaining the mis-labeled phoneme in S204, reference may be made to the description of S102 in the above embodiment, which is not repeated here.
S205, re-labeling the mis-labeled phonemes.
The device for detecting phoneme mislabeling in this embodiment may be configured with an existing phoneme labeling function; after a mis-labeled phoneme is determined, the device re-labels it to obtain the re-labeled target combined phonemes and the phoneme re-labeling result of each re-labeled target combined phoneme.
In this embodiment, triphones in the phoneme sequences are used as combined phonemes. Because context is taken into account, the obtained phoneme confidence model is more reliable, and the confidence obtained for the labeling result of a phoneme under detection is more accurate. The confidence threshold is then obtained from the confidences of the target combined phonemes under detection, which further improves the accuracy of the obtained mis-labeled phonemes, and the detection device in this embodiment can re-label the obtained mis-labeled phonemes, improving the accuracy of the result.
Fig. 4 is a schematic structural diagram of the device for detecting phoneme mislabeling provided by the present invention. As shown in Fig. 4, the device 300 for detecting phoneme mislabeling includes a confidence obtaining module 301 and a mis-labeled phoneme determining module 302.
A confidence obtaining module 301, configured to obtain a confidence of the phoneme labeling result of the target combined phoneme according to the target speech corresponding to the target combined phoneme, the phoneme labeling result of the target combined phoneme, and a preset phoneme confidence model, where the phoneme confidence model is used to represent a correspondence between the target speech, the phoneme labeling result of the target combined phoneme, and the confidence of the phoneme labeling result of the target combined phoneme;
and a mis-labeled phoneme determining module 302, configured to determine the target combined phoneme as a mis-labeled phoneme when the confidence of the phoneme labeling result of the target combined phoneme is smaller than a confidence threshold.
The principle and technical effect of the device for detecting phoneme mislabeling provided in this embodiment are similar to those of the above method embodiments, and are not repeated here.
Optionally, Fig. 5 is a schematic structural diagram of the device for detecting phoneme mislabeling provided by the present invention. As shown in Fig. 5, the device 300 further includes: a combined phoneme acquisition module 303, a confidence threshold acquisition module 304, a phoneme confidence model acquisition module 305, and a re-labeling module 306.
A combined phoneme obtaining module 303, configured to perform phoneme segmentation on each phoneme sequence to obtain the plurality of monophones corresponding to each phoneme sequence, and to combine each monophone with the monophones adjacent to it to obtain a combined phoneme.
Optionally, there are a plurality of target combined phonemes, and the confidence threshold obtaining module 304 is configured to obtain a confidence standard deviation and a confidence mean according to the confidence of the phoneme labeling result of each target combined phoneme before the target combined phoneme is determined to be a mis-labeled phoneme;
and to obtain the confidence threshold according to the confidence standard deviation and the confidence mean.
A phoneme confidence model obtaining module 305, configured to train the phoneme confidence model, before the target combined phoneme is obtained, according to the speech corresponding to the plurality of combined phonemes corresponding to all the phoneme data in the speech synthesis database, the phoneme labeling result of each combined phoneme, and the confidence of the phoneme labeling result of each combined phoneme, where each piece of phoneme data includes at least one phoneme sequence.
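As a rough illustration of such training, the sketch below fits a tiny logistic-regression confidence model from (feature vector, correctness) pairs. Both the feature representation and the model family are assumptions for illustration; the patent does not specify either.

```python
import math

def train_confidence_model(examples, lr=0.5, epochs=200):
    """Train a minimal logistic-regression confidence model via SGD.

    examples: list of (feature_vector, correct_flag) pairs, one per
    combined phoneme, where correct_flag is 1 if its labeling was
    correct and 0 otherwise. Returns a function mapping a feature
    vector to a confidence in (0, 1). Illustrative stand-in only.
    """
    dim = len(examples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted confidence
            g = p - y                         # gradient of log loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g

    def confidence(x):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1.0 / (1.0 + math.exp(-z))

    return confidence
```

In practice the feature vector would be derived from the speech signal and the labeling result (e.g. acoustic scores around the labeled boundary), but any model that maps those inputs to a confidence fits the role described above.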
And a re-labeling module 306, configured to re-label the mis-labeled phoneme after determining that the target combined phoneme is the mis-labeled phoneme.
Optionally, the confidence threshold obtaining module 304 is specifically configured to obtain a triple confidence standard deviation, i.e., three times the confidence standard deviation;
and to determine the difference between the confidence mean and the triple confidence standard deviation as the confidence threshold.
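Concretely, the mean-minus-three-standard-deviations rule can be sketched as below. The use of the population standard deviation is an assumption; the patent does not say which estimator is intended.

```python
import statistics

def confidence_threshold(confidences):
    """Threshold = mean - 3 * standard deviation of the confidences."""
    mean = statistics.mean(confidences)
    std = statistics.pstdev(confidences)  # population std; sample std is equally plausible
    return mean - 3 * std

def flag_mislabeled(labeled_confidences):
    """Return the combined phonemes whose confidence falls below the threshold.

    labeled_confidences: list of (combined_phoneme, confidence) pairs.
    """
    thr = confidence_threshold([c for _, c in labeled_confidences])
    return [phoneme for phoneme, c in labeled_confidences if c < thr]
```

Because the threshold adapts to the confidence distribution of the batch, only clear outliers (confidences more than three standard deviations below the mean) are flagged for re-labeling.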
Optionally, the phoneme labeling result of the target combined phoneme is a phoneme boundary labeling result of the target combined phoneme.
Fig. 6 is a schematic structural diagram of a device for detecting phoneme mislabeling provided by the present invention; the device may be a terminal device such as a smart phone, a tablet computer, or a computer. As shown in Fig. 6, the detection device 400 includes a memory 401 and at least one processor 402.
A memory 401 for storing program instructions.
The processor 402 is configured to implement the method for detecting phoneme mislabeling in this embodiment when the program instructions are executed; for the specific implementation principle, reference may be made to the above embodiments, which are not repeated here.
The apparatus 400 for detecting phoneme mislabeling may further include an input/output interface 403.
The input/output interface 403 may include a separate output interface and input interface, or may be an integrated interface that integrates input and output. The output interface is used for outputting data, and the input interface is used for acquiring input data; the output data is a general term for the outputs in the above method embodiments, and the input data is a general term for the inputs in the above method embodiments.
The present invention further provides a readable storage medium storing executable instructions; when at least one processor of the device for detecting phoneme mislabeling executes the executable instructions, the method for detecting phoneme mislabeling in the above embodiments is implemented.
The at least one processor of the phoneme mislabeling detection device can read the executable instructions from the readable storage medium and execute them, so that the phoneme mislabeling detection device implements the phoneme mislabeling detection method provided by the various embodiments.
The above-described apparatus embodiments are merely illustrative. For example, the division into units is merely a logical functional division; in actual implementation there may be other divisions, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the couplings, direct couplings, or communication connections shown or discussed may be implemented via some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; that is, they may be located in one place, or may be distributed over multiple network units.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (such as a personal computer, a server, or a network device) or a processor to execute some of the steps of the methods according to the embodiments of the present invention.
In the foregoing embodiments of the network device or the terminal device, it should be understood that the processor may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of a method disclosed in connection with the present application may be embodied directly in a hardware processor, or in a combination of hardware and software modules in the processor.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced, and such modifications or substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A detection method for phoneme mislabeling, characterized by comprising:
acquiring a confidence coefficient of a phoneme labeling result of the target combined phoneme according to a target voice corresponding to the target combined phoneme, a phoneme labeling result of the target combined phoneme and a preset phoneme confidence coefficient model, wherein the phoneme confidence coefficient model is used for representing a corresponding relation among the target voice, the phoneme labeling result of the target combined phoneme and the confidence coefficient of the phoneme labeling result of the target combined phoneme;
when the confidence of the phoneme labeling result of the target combined phoneme is smaller than a confidence threshold, determining the target combined phoneme as a mis-labeled phoneme;
wherein there are a plurality of target combined phonemes, and the confidence threshold is obtained by:
obtaining a confidence standard deviation and a confidence mean value according to the confidence of the phoneme labeling result of each target combined phoneme;
obtaining a triple confidence standard deviation according to the confidence standard deviation, wherein the triple confidence standard deviation is three times of the confidence standard deviation;
determining the difference between the confidence mean and the triple confidence standard deviation as the confidence threshold.
2. The method of claim 1, wherein before said obtaining the target combined phoneme, further comprising:
training to obtain the phoneme confidence model according to the speech corresponding to a plurality of combined phonemes corresponding to all the phoneme data in the speech synthesis database, the phoneme labeling result of each combined phoneme, and the confidence of the phoneme labeling result of each combined phoneme, wherein each piece of phoneme data comprises at least one phoneme sequence.
3. The method of claim 2, wherein before the training obtains the phone confidence model, further comprising:
performing phoneme segmentation processing on each phoneme sequence to obtain a plurality of monophones corresponding to each phoneme sequence;
and combining each monophone with the monophones adjacent to it to obtain the combined phonemes.
4. The method of any of claims 1-3, wherein the phoneme labeling result of the target combined phoneme comprises a phoneme boundary labeling result of the target combined phoneme.
5. The method of claim 4, wherein after determining the target combined phone to be a mis-labeled phone, further comprising:
and re-labeling the mis-labeled phoneme.
6. A device for detecting phoneme mislabeling, characterized by comprising:
a confidence level obtaining module, configured to obtain a confidence level of a phoneme labeling result of a target combined phoneme according to a target voice corresponding to the target combined phoneme, the phoneme labeling result of the target combined phoneme, and a preset phoneme confidence level model, where the phoneme confidence level model is used to represent a correspondence relationship between the target voice, the phoneme labeling result of the target combined phoneme, and the confidence level of the phoneme labeling result of the target combined phoneme;
the mislabeling phoneme determining module is used for determining the target combined phoneme as a mislabeling phoneme when the confidence coefficient of the phoneme labeling result of the target combined phoneme is smaller than a confidence coefficient threshold value;
the device for detecting phoneme mislabeling further comprises a confidence threshold obtaining module, wherein there are a plurality of target combined phonemes;
the confidence threshold obtaining module is configured to obtain a confidence standard deviation and a confidence mean according to the confidence of the phoneme labeling result of each target combination phoneme, obtain a triple confidence standard deviation according to the confidence standard deviation, and determine a difference between the confidence mean and the triple confidence standard deviation as the confidence threshold, where the triple confidence standard deviation is three times of the confidence standard deviation.
7. A detection device for phoneme mislabeling, characterized by comprising at least one processor and a memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the phoneme mislabeling detection device to perform the method of any one of claims 1-5.
8. A computer-readable storage medium having computer-executable instructions stored thereon that, when executed by a processor, implement the method of any one of claims 1-5.
CN201810827500.0A 2018-07-25 2018-07-25 Detection method and device for phoneme mislabeling Active CN109166569B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810827500.0A CN109166569B (en) 2018-07-25 2018-07-25 Detection method and device for phoneme mislabeling


Publications (2)

Publication Number Publication Date
CN109166569A CN109166569A (en) 2019-01-08
CN109166569B true CN109166569B (en) 2020-01-31

Family

ID=64898134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810827500.0A Active CN109166569B (en) 2018-07-25 2018-07-25 Detection method and device for phoneme mislabeling

Country Status (1)

Country Link
CN (1) CN109166569B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066482A (en) * 2019-12-13 2021-07-02 阿里巴巴集团控股有限公司 Voice model updating method, voice data processing method, voice model updating device, voice data processing device and storage medium
CN112530402B (en) * 2020-11-30 2024-01-12 深圳市优必选科技股份有限公司 Speech synthesis method, speech synthesis device and intelligent equipment
CN112562636B (en) * 2020-12-03 2024-07-05 云知声智能科技股份有限公司 Speech synthesis error correction method and device
CN112669814B (en) * 2020-12-17 2024-06-14 北京猎户星空科技有限公司 Data processing method, device, equipment and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7280967B2 (en) * 2003-07-30 2007-10-09 International Business Machines Corporation Method for detecting misaligned phonetic units for a concatenative text-to-speech voice
CN101840699B (en) * 2010-04-30 2012-08-15 中国科学院声学研究所 Voice quality evaluation method based on pronunciation model
CN104795077B (en) * 2015-03-17 2018-02-02 北京航空航天大学 A kind of consistency detecting method for examining voice annotation quality
CN108281137A (en) * 2017-01-03 2018-07-13 中国科学院声学研究所 A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN107123417B (en) * 2017-05-16 2020-06-09 上海交通大学 Customized voice awakening optimization method and system based on discriminant training



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant