CN112242132A - Data labeling method, device and system in speech synthesis
- Publication number
- CN112242132A (application number CN201910650880.XA)
- Authority
- CN
- China
- Prior art keywords
- phoneme
- data
- recording
- mute
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
Abstract
The application discloses a data labeling method, device and system for speech synthesis. The method includes: acquiring recording audio and recording text; and performing recording annotation processing on the recording audio and the recording text to obtain recording annotation data, where the recording annotation data includes at least one of the following: pronunciation annotation data, prosody annotation data and phoneme boundary annotation data. The method and device solve the technical problem that speech synthesis cannot be completed online in real time because data annotation in the speech synthesis process requires manual participation.
Description
Technical Field
The present application relates to the field of speech processing, and in particular, to a method, an apparatus, and a system for data annotation in speech synthesis.
Background
Text To Speech (TTS) is a speech synthesis technique that converts text into speech. To perform speech synthesis, recording audio, recording text and recording annotation data need to be acquired, and speech synthesis is then performed according to the recording audio, the recording text and the recording annotation data. The recording annotation data mainly includes pronunciation annotation, prosody annotation, phoneme boundary annotation and the like.
In conventional TTS data annotation, recording annotation data such as pronunciation annotation, prosody annotation and phoneme boundary annotation must be labeled manually. Although automatic labeling tools exist in the prior art, they are applied only to a certain stage of the manual labeling process and cannot achieve fully automatic annotation of the recording annotation data; moreover, manually labeling data for speech synthesis increases the production cost of the sound library.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of the application provide a data labeling method, device and system for speech synthesis, which at least solve the technical problem that speech synthesis cannot be completed online in real time because data annotation in the speech synthesis process requires manual participation.
According to one aspect of the embodiments of the present application, a data annotation method in speech synthesis is provided, including: acquiring recording audio and recording text; and performing recording annotation processing on the recording audio and the recording text to obtain recording annotation data, where the recording annotation data includes at least one of the following: pronunciation annotation data, prosody annotation data and phoneme boundary annotation data. Performing recording annotation processing on the recording audio and the recording text to obtain the recording annotation data includes: converting the recording text into a structured prosody text to obtain pronunciation annotation data and a prosody pause prediction result; performing speech recognition processing on the recording audio and the recording text and performing voice detection processing on the recording audio to obtain phoneme boundary annotation data; and correcting the prosody pause prediction result with the phoneme boundary annotation data to obtain prosody annotation data.
According to another aspect of the embodiments of the present application, a data annotation device in speech synthesis is also provided, including: a first labeling module, configured to acquire recording text and convert the recording text into a structured prosody text; a second labeling module, configured to acquire recording audio and recording text and perform speech recognition on the recording audio and the recording text to obtain a first processing result, where the first processing result describes the time boundary information of each phoneme in the recording audio; a third labeling module, configured to acquire the recording audio and perform signal processing on the recording audio to obtain a second processing result, where the second processing result describes the speech portion information and mute portion information detected from the recording audio; and a processing module, configured to perform at least one of the following operations: determining pronunciation annotation data according to the structured prosody text; determining phoneme boundary annotation data according to the first processing result and the second processing result; and determining prosody annotation data according to the structured prosody text and the phoneme boundary annotation data.
According to another aspect of the embodiments of the present application, a data annotation device in speech synthesis is also provided, including: an acquisition module, configured to acquire recording audio and recording text; and an annotation module, configured to perform recording annotation processing on the recording audio and the recording text to obtain recording annotation data, where the recording annotation data includes at least one of the following: pronunciation annotation data, prosody annotation data and phoneme boundary annotation data. The annotation module includes: a conversion module, configured to convert the recording text into a structured prosody text to obtain pronunciation annotation data and a prosody pause prediction result; a first processing module, configured to perform speech recognition processing on the recording audio and the recording text and perform voice detection processing on the recording audio to obtain phoneme boundary annotation data; and a second processing module, configured to correct the prosody pause prediction result with the phoneme boundary annotation data to obtain prosody annotation data.
According to another aspect of the embodiments of the present application, there is also provided a storage medium including a stored program, wherein when the program runs, a device in which the storage medium is located is controlled to execute the data annotation method in speech synthesis.
According to another aspect of the embodiments of the present application, there is also provided a processor, configured to execute a program, where the program executes the method for tagging data in speech synthesis.
According to another aspect of the embodiments of the present application, a data annotation system in speech synthesis is also provided, including: a processor; and a memory connected to the processor, configured to provide the processor with instructions for processing the following steps: acquiring recording audio and recording text; and performing recording annotation processing on the recording audio and the recording text to obtain recording annotation data, where the recording annotation data includes at least one of the following: pronunciation annotation data, prosody annotation data and phoneme boundary annotation data. The processor is further configured to convert the recording text into a structured prosody text to obtain pronunciation annotation data and a prosody pause prediction result; obtain phoneme boundary annotation data by performing speech recognition processing on the recording audio and the recording text and performing voice detection processing on the recording audio; and correct the prosody pause prediction result with the phoneme boundary annotation data to obtain prosody annotation data.
In the embodiments of the present application, the recording annotation data is labeled automatically: after the recording audio and the recording text are acquired, recording annotation is performed on them to obtain the recording annotation data. Performing recording annotation processing on the recording audio and the recording text to obtain the recording annotation data includes: converting the recording text into a structured prosody text to obtain pronunciation annotation data and a prosody pause prediction result; performing speech recognition processing on the recording audio and the recording text and performing voice detection processing on the recording audio to obtain phoneme boundary annotation data; and correcting the prosody pause prediction result with the phoneme boundary annotation data to obtain prosody annotation data. The recording annotation data includes at least one of the following: pronunciation annotation data, prosody annotation data and phoneme boundary annotation data. It is easy to note that the above recording annotation process requires no manual intervention. In addition, the data labeling method provided by the application can label all of the pronunciation annotation data, the prosody annotation data and the phoneme boundary annotation data rather than only one or a few types of data, thereby achieving fully automatic data annotation in speech synthesis, saving labor cost, meeting online real-time requirements, and solving the technical problem that speech synthesis cannot be completed online in real time because data annotation in the speech synthesis process requires manual participation.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a block diagram of a hardware configuration of a computer terminal according to an embodiment of the present application;
FIG. 2 is a flow chart of a method for annotating data in speech synthesis according to an embodiment of the present application;
FIG. 3 is a flow chart of an alternative data annotation according to an embodiment of the present application;
FIG. 4 is a diagram of a data annotation device in speech synthesis according to an embodiment of the present application;
FIG. 5 is a diagram of a data annotation device in speech synthesis according to an embodiment of the present application; and
fig. 6 is a block diagram of a computer terminal according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
There is also provided, in accordance with an embodiment of the present application, an embodiment of a method for data annotation in speech synthesis, where it is noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer-executable instructions, and that while a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than here.
The method provided by the first embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Fig. 1 shows a block diagram of the hardware configuration of a computer terminal (or mobile device) for implementing the data annotation method in speech synthesis. As shown in fig. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, ..., 102n; the processors 102 may include, but are not limited to, processing devices such as a microprocessor MCU or a programmable logic device FPGA), a memory 104 for storing data, and a transmission device 106 for communication functions. In addition, the computer terminal may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 1 is only illustrative and does not limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
It should be noted that the one or more processors 102 and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuit may be a single stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the embodiments of the application, the data processing circuit acts as a processor control (e.g. selection of a variable resistance termination path connected to the interface).
The memory 104 can be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the transmission module in the embodiment of the present application, and the processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, that is, implementing the data tagging method in speech synthesis. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 can be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).
It should be noted here that in some alternative embodiments, the computer device (or mobile device) shown in fig. 1 may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should also be noted that fig. 1 is only one specific example and is intended to illustrate the types of components that may be present in the computer device (or mobile device) described above.
Under the operating environment, the application provides a data annotation method in speech synthesis as shown in fig. 2. The method is applied to a training scene of a speech synthesis model. Fig. 2 is a flowchart of a data annotation method in speech synthesis according to a first embodiment of the present application, and as can be seen from fig. 2, the method includes the following steps:
step S202, acquiring a recording audio and a recording text.
Optionally, in step S202, the recording audio and the recording text correspond to each other: each recording audio has an audio identifier, each recording text has a text identifier, and an association exists between the audio identifier and the text identifier. Through this association, the recording text associated with a recording audio, or the recording audio associated with a recording text, can be queried. The recording text, the recording audio and the association between the two may be stored in a database.
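For illustration only, the following sketch shows one possible way to represent this association; the identifiers and schema are hypothetical and not part of the claimed method.

```python
# A minimal illustration (hypothetical identifiers and schema, not part of the
# claimed method) of the association between recording audio and recording text.
from dataclasses import dataclass

@dataclass
class RecordingAudio:
    audio_id: str        # audio identifier
    wav_path: str        # location of the recorded waveform

@dataclass
class RecordingText:
    text_id: str         # text identifier
    content: str         # transcript corresponding to the recording

# Association between audio identifiers and text identifiers, stored in the
# database together with the recording audio and the recording text.
associations = {"audio_0001": "text_0001"}

def text_id_for_audio(audio_id: str) -> str:
    """Query the recording text associated with a given recording audio."""
    return associations[audio_id]
```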
In an optional embodiment, the recording personnel can input the recording audio through the voice acquisition equipment, the text processing personnel can write out the recording text corresponding to the recording audio according to the recording audio input by the recording personnel, and the data storage personnel can store the recording audio and the corresponding recording text into the database. The speech synthesis system may obtain the recorded audio and recorded text corresponding to the recorded audio from a database.
Step S204, performing recording annotation processing on the recording audio and the recording text to obtain recording annotation data, where the recording annotation data includes at least one of the following: pronunciation annotation data, prosody annotation data and phoneme boundary annotation data.
It should be noted that labeling the recording audio and the recording text means annotating the recording audio at multiple levels such as semantics, grammar and phonemes, so that a machine can learn rules from the recording audio and human-machine voice interaction can be realized.
In an alternative embodiment, as shown in the flowchart of fig. 3, the pronunciation annotation data, the prosody annotation data and the phoneme boundary annotation data can all be obtained by performing annotation processing on the recording audio and the recording text. As can be seen from fig. 3, three labeling tools are used in the data annotation process: a TTS front-end tool, an ASR (Automatic Speech Recognition) Align tool, and a VAD (Voice Activity Detection) tool. The TTS front-end tool converts text into the structured prosody text used in speech synthesis, so as to generate information such as the pronunciation and prosody pauses corresponding to the text; the ASR Align tool performs speech recognition and is mainly used to generate the time boundary information of each phoneme in the recording audio; and the VAD tool performs signal processing on the recording audio to detect the portions of the recording audio that contain speech and silence, where silence is a portion without speech.
In addition, as can be seen from fig. 3, the present application also adopts the VAD Silence Detect and Silence Prosody Detect techniques, where VAD Silence Detect is a silence detection method that combines ASR Align and VAD, and Silence Prosody Detect is a method of labeling prosody pauses through silence detection.
In an optional embodiment, after the recording audio and the recording text are acquired, the speech synthesis system converts the recording text into a structured prosody text to obtain pronunciation annotation data and a prosody pause prediction result; obtains phoneme boundary annotation data by performing speech recognition processing on the recording audio and the recording text and performing voice detection processing on the recording audio; and finally obtains prosody annotation data by correcting the prosody pause prediction result with the phoneme boundary annotation data. Specifically, the speech synthesis system obtains phoneme boundary information by performing speech recognition processing on the recording audio and the recording text, obtains mute boundary information by performing voice detection processing on the recording audio, and then performs cross-comparison verification on the phoneme boundary information and the mute boundary information to obtain the phoneme boundary annotation data.
Taking fig. 3 as an example, the speech synthesis system inputs the recording text into the TTS front-end tool, inputs the recording text and the recording audio into the ASR Align tool, and inputs the recording audio into the VAD tool. The TTS front-end tool converts the recording text into a structured prosody text, which contains pronunciation annotation data and a prosody pause prediction result; the pronunciation annotation data generated by the TTS front-end tool is used directly as the pronunciation annotation result. The ASR Align tool performs speech recognition processing on the recording audio and the recording text to obtain phoneme boundary information, and the VAD tool processes the recording audio to obtain mute boundary information; the phoneme boundary information and the mute boundary information are then cross-compared and verified using the VAD Silence Detect technique to obtain the phoneme boundary annotation data. Optionally, the phoneme boundary information includes, but is not limited to, the speech duration corresponding to a phoneme and the speech start time and end time corresponding to a phoneme, and the mute boundary information includes, but is not limited to, the mute duration and the mute start time and end time. A phoneme is the smallest speech unit divided according to the natural attributes of speech; for example, the Chinese syllable "a" has only one phoneme, "ai" has two phonemes, and "dai" has three phonemes.
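For illustration only, the overall flow of fig. 3 can be sketched as follows; the callables stand in for the three tools and the two cross-check steps and are assumed interfaces, not the actual tools.

```python
# A minimal sketch (not the patented implementation) of the labeling flow in
# fig. 3. The callables tts_front_end, asr_align, vad, silence_detect and
# prosody_detect are assumed placeholder interfaces standing in for the three
# tools and the two cross-check steps described in this description.
def label_recording(audio, text, tts_front_end, asr_align, vad,
                    silence_detect, prosody_detect):
    # TTS front-end tool: recording text -> structured prosody text, yielding
    # pronunciation annotation data and a prosody pause prediction result.
    pronunciation_labels, prosody_prediction = tts_front_end(text)

    # ASR Align tool: time boundary information of each phoneme in the audio.
    phoneme_boundaries = asr_align(audio, text)

    # VAD tool: mute (no-speech) segments detected in the recording audio.
    mute_segments = vad(audio)

    # VAD Silence Detect: cross-compare and verify the two kinds of boundary
    # information to obtain the phoneme boundary annotation data.
    phoneme_labels = silence_detect(phoneme_boundaries, mute_segments)

    # Silence Prosody Detect: correct the prosody pause prediction result with
    # the phoneme boundary annotation data to obtain the prosody annotation data.
    prosody_labels = prosody_detect(prosody_prediction, phoneme_labels)

    return pronunciation_labels, prosody_labels, phoneme_labels
```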
Based on the scheme defined in steps S202 to S204, after the recording audio and the recording text are acquired, the recording annotation data is labeled automatically: recording annotation is performed on the recording audio and the recording text to obtain the recording annotation data. Performing recording annotation processing on the recording audio and the recording text to obtain the recording annotation data includes: converting the recording text into a structured prosody text to obtain pronunciation annotation data and a prosody pause prediction result; performing speech recognition processing on the recording audio and the recording text and performing voice detection processing on the recording audio to obtain phoneme boundary annotation data; and correcting the prosody pause prediction result with the phoneme boundary annotation data to obtain prosody annotation data. The recording annotation data includes at least one of the following: pronunciation annotation data, prosody annotation data and phoneme boundary annotation data.
It is easy to note that the above recording annotation process requires no manual intervention. In addition, the data labeling method provided by the application can label all of the pronunciation annotation data, the prosody annotation data and the phoneme boundary annotation data rather than only one or a few types of data, thereby achieving fully automatic data annotation in speech synthesis, saving labor cost, meeting online real-time requirements, and solving the technical problem that speech synthesis cannot be completed online in real time because data annotation in the speech synthesis process requires manual participation.
In an optional embodiment, after obtaining the phoneme boundary information and the mute boundary information, the speech synthesis system performs silence detection, that is, it cross-compares and verifies the phoneme boundary information and the mute boundary information to obtain the phoneme boundary annotation data. Specifically, the speech synthesis system scans the mute boundary information to obtain a mute segment information list, scans the phoneme boundary information to obtain mute phonemes, and compares the start time of each mute phoneme with the start time of the same mute segment recorded in the mute segment information list, and the end time of the mute phoneme with the end time of that mute segment, to obtain a comparison result. Finally, the phoneme boundary annotation data is determined according to the comparison result. The mute segment information list records the start time and the end time of each mute segment.
Optionally, the speech synthesis system scans all mute boundary information generated by the VAD tool to obtain a mute segment information list, in which the start time and end time of each mute segment are recorded; for example, [c, d] indicates that the mute starts at time c and ends at time d. The entries in the mute segment information list may be sorted in the chronological order in which the mute segments occur. The speech synthesis system then cyclically scans each piece of phoneme boundary information generated by the ASR Align tool to obtain the corresponding mute phonemes, acquires the start time and end time of each mute phoneme, and compares them one by one with the start time and end time of the same mute segment in the mute segment information list to obtain a comparison result.
In the above process, before obtaining the mute phoneme corresponding to each piece of phoneme boundary information, the speech synthesis system further detects whether the phoneme corresponding to the phoneme boundary information is a mute phoneme; only if the phoneme is detected to be a mute phoneme does the speech synthesis system compare its start time and end time.
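For illustration only, the scan-and-compare step can be sketched as follows, under stated assumptions about the VAD and ASR Align output formats.

```python
# A small sketch of the scan-and-compare step, under the assumption that the
# VAD output is a list of (start, end) mute segments and each ASR Align entry
# is a (phone, start, end) tuple in which the symbol "sil" marks a mute phoneme.
def build_mute_segment_list(vad_output):
    # Mute segment information list [(c, d), ...], sorted chronologically.
    return sorted(vad_output)

def mute_phoneme_indices(align_output):
    """Cyclically scan the phoneme boundary information for mute phonemes."""
    return [i for i, (phone, _start, _end) in enumerate(align_output)
            if phone == "sil"]

def find_matching_segment(start, end, mute_segment_list):
    """Compare a mute phoneme's [start, end] with every recorded segment [c, d]."""
    for c, d in mute_segment_list:
        if start >= c and end <= d:   # the phoneme lies inside this mute segment
            return (c, d)
    return None                       # no mute segment contains the phoneme
```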
Further, after the comparison result is obtained, if the time period of the mute phoneme is determined to fall within the range of some mute segment according to the comparison result, the mute phoneme is determined to be effective mute. The speech synthesis system then corrects the start time of the mute phoneme to the start time of that mute segment and the end time of the mute phoneme to the end time of that mute segment to obtain a correction result, and adjusts, according to the correction result, the end time of the previous phoneme adjacent to the start time of the mute phoneme and the start time of the next phoneme adjacent to the end time of the mute phoneme. If, according to the correction result, the corrected start time of the mute phoneme is less than or equal to the start time of the previous phoneme, or the corrected end time of the mute phoneme is greater than or equal to the end time of the next phoneme, the labeling is determined to be invalid and the labeling result for this mute phoneme is discarded.
For example, suppose the boundary of a mute phoneme from ASR Align is [a, b] and the boundary of the same mute segment recorded in the VAD mute segment list is [c, d]. If a is greater than or equal to c and b is less than or equal to d, the mute phoneme is determined to be effective mute; the speech synthesis system then corrects the boundary of the mute phoneme in the ASR Align result to [c, d] and adjusts the times of the adjacent preceding and following phonemes accordingly. If, however, the corrected mute phoneme boundary crosses the preceding or following phoneme boundary, that is, if the preceding phoneme boundary is [e, f], the following phoneme boundary is [g, h], and c is less than or equal to e or d is greater than or equal to h, the labeling is determined to be invalid and this piece of data is discarded.
In addition, if the time period of the mute phoneme is determined not to fall within the range of any mute segment according to the comparison result, the mute phoneme is determined to be invalid mute; the mute phoneme is then deleted, and its time period is merged into the time period of an adjacent phoneme. For example, if the time span of a mute phoneme does not fall within the range of any VAD mute segment, the mute phoneme is invalid; the speech synthesis system deletes the mute phoneme from the ASR Align result and merges its time period into the time period of the adjacent phoneme.
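For illustration only, the boundary correction rules described above can be sketched as follows; the phoneme record format is an assumption, and merging the invalid mute into the preceding phoneme is one possible reading of "the adjacent phoneme".

```python
# A sketch of the correction rules above. It assumes the ASR Align result is a
# list of dicts {"phone": ..., "start": ..., "end": ...}, i indexes a mute
# phoneme that has a neighbour on each side, and segment is the matching VAD
# mute segment (or None when no segment contains the phoneme). It returns the
# updated list, or None when the labeling is invalid and must be discarded.
def correct_mute_phoneme(phonemes, i, segment):
    prev, cur, nxt = phonemes[i - 1], phonemes[i], phonemes[i + 1]
    if segment is not None:                     # effective mute
        c, d = segment
        # The corrected boundary must not cross the neighbouring phoneme
        # boundaries [e, f] and [g, h]: c <= e or d >= h is invalid labeling.
        if c <= prev["start"] or d >= nxt["end"]:
            return None
        cur["start"], cur["end"] = c, d         # snap to the VAD mute segment
        prev["end"], nxt["start"] = c, d        # adjust the adjacent phonemes
        return phonemes
    # Invalid mute: delete the phoneme and merge its time period into the
    # adjacent phoneme (here the preceding one, as one possible reading).
    prev["end"] = cur["end"]
    del phonemes[i]
    return phonemes
```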
Further, after the phoneme boundary annotation data is obtained, the speech synthesis system corrects the prosody pause prediction result with the phoneme boundary annotation data to obtain the prosody annotation data. Specifically, the speech synthesis system merges the speech phonemes in the phoneme boundary annotation data into the corresponding words according to a phoneme rule, sets the mute phonemes in the phoneme boundary annotation data that are not at the head or tail of the recording audio as pause identifiers to obtain pause information, compares the prosody pause prediction result with the pause information taking the merged words as units to obtain a comparison result, and obtains the prosody annotation data according to the comparison result. If a pause identifier is determined to exist at the position adjacent to (immediately after) a merged word, a first label is set at that position, where the first label adopts a first pause level; if no pause identifier exists at the position adjacent to the merged word but that position has been identified as a second pause level, a second label is set at that position, where the second label adopts the second pause level.
Optionally, the speech synthesis system scans the phoneme boundary annotation data produced by VAD Silence Detect and merges the speech phonemes in the phoneme boundary annotation data into words according to the phoneme rule. If a phoneme is a mute phoneme and is not the mute at the head or tail of the audio, it is used as a pause identifier. The speech synthesis system then compares the prosody pause prediction result generated by the TTS front-end tool with the pause information word by word. If a word is followed by a pause identifier, it is labeled #3 (the first label); if the word is not followed by a pause identifier but was identified as #1 (the second pause level) by the TTS front-end tool, it is labeled #1 (the second label). The final prosody annotation data is obtained from this comparison result.
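For illustration only, the word-by-word comparison can be sketched as follows; the input formats are assumptions made for this sketch.

```python
# A sketch of the word-by-word comparison described above. The input formats
# are assumptions made only for this illustration: words_with_pauses is a list
# of (word, followed_by_pause) pairs built from the phoneme boundary annotation
# data, and front_end_prediction maps a word index to the pause level predicted
# by the TTS front-end tool.
def correct_prosody(words_with_pauses, front_end_prediction):
    prosody_labels = []
    for idx, (word, followed_by_pause) in enumerate(words_with_pauses):
        prosody_labels.append(word)
        if followed_by_pause:
            prosody_labels.append("#3")              # a detected pause follows the word
        elif front_end_prediction.get(idx) == "#1":
            prosody_labels.append("#1")              # keep the front end's #1 level
    return prosody_labels

# Example: the first word is followed by a detected pause; the second is not,
# but the TTS front-end tool identified it as #1.
print(correct_prosody([("word_a", True), ("word_b", False)], {1: "#1"}))
# -> ['word_a', '#3', 'word_b', '#1']
```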
It should be noted that different languages have different phonemes; for example, Chinese has Chinese phonemes and English has English phonemes (English has 48 phonemes, comprising 20 vowel phonemes and 28 consonant phonemes). When merging the speech phonemes in the phoneme boundary annotation data according to the phoneme rule, the speech synthesis system may first detect the language corresponding to the recording and then perform the merging using the phoneme rule corresponding to that language.
According to the above, the scheme provided by the application uses the ASR Align tool and the VAD tool for double verification, which guarantees the accuracy and precision of the mute segment labeling, and uses the TTS front-end tool together with the mute segment identification to guarantee the accuracy of the prosody prediction. In addition, the scheme requires no manual participation and can run online fully automatically, which meets the requirements of TTS data annotation, reduces the production cost of the sound library, and makes online real-time training of personalized TTS possible.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Through the above description of the embodiments, those skilled in the art can clearly understand that the data annotation method in speech synthesis according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
Example 2
According to an embodiment of the present application, there is also provided a data annotation device in speech synthesis for implementing the data annotation method in speech synthesis, as shown in fig. 4, the device includes: a first annotation module 401, a second annotation module 403, a third annotation module 405, and a processing module 407.
The first labeling module 401 is configured to obtain a recording text and convert the recording text into a structured prosodic text; the second labeling module 403 is configured to obtain a recording audio and a recording text, and perform speech recognition on the recording audio and the recording text to obtain a first processing result, where the first processing result is used to describe time boundary information of each phoneme in the recording audio; a third labeling module 405, configured to obtain a recording audio, and perform signal processing on the recording audio to obtain a second processing result, where the second processing result is used to describe voice part information and mute part information detected from the recording audio; a processing module 407 configured to perform at least one of the following operations: determining pronunciation marking data according to the structured prosodic text; determining phoneme boundary labeling data according to the first processing result and the second processing result; prosody labeling data is determined from the structured prosody text and the phoneme boundary labeling data.
Alternatively, the first labeling module may be the TTS front-end tool in fig. 3, the second labeling module may be the ASR Align tool in fig. 3, and the third labeling module may be the VAD tool in fig. 3. The TTS front-end tool converts text into the structured prosody text used in speech synthesis so as to generate information such as the pronunciation and prosody pauses corresponding to the text; the ASR Align tool performs speech recognition and is mainly used to generate the time boundary information of each phoneme in the recording audio; and the VAD tool performs signal processing on the recording audio to detect the speech portion information and mute portion information of the recording audio, where the mute portion information describes the portions without speech.
Furthermore, the processing module determines the pronunciation annotation data according to the structured prosody text output by the first labeling module, determines the phoneme boundary annotation data according to the first processing result output by the second labeling module and the second processing result output by the third labeling module, and determines the prosody annotation data according to the structured prosody text and the phoneme boundary annotation data. Optionally, as shown in fig. 3, the TTS front-end tool converts the recording text into a structured prosody text containing pronunciation annotation data and a prosody pause prediction result, and the processing module uses the pronunciation annotation data generated by the TTS front-end tool directly as the pronunciation annotation result. The ASR Align tool performs speech recognition processing on the recording audio and the recording text to obtain phoneme boundary information, the VAD tool processes the recording audio to obtain mute boundary information, and the processing module then cross-compares and verifies the phoneme boundary information and the mute boundary information using the VAD Silence Detect technique to obtain the phoneme boundary annotation data. The phoneme boundary information includes, but is not limited to, the speech duration corresponding to a phoneme and the speech start time and end time corresponding to a phoneme; the mute boundary information includes, but is not limited to, the mute duration and the mute start time and end time. A phoneme is the smallest speech unit divided according to the natural attributes of speech; for example, the Chinese syllable "a" has only one phoneme, "ai" has two phonemes, and "dai" has three phonemes.
As can be seen from the above, the data annotation device in speech synthesis provided by this embodiment requires no manual intervention in the recording annotation process. In addition, the data labeling device provided by the application can label all of the pronunciation annotation data, the prosody annotation data and the phoneme boundary annotation data rather than only one or a few types of data, thereby achieving fully automatic data annotation in speech synthesis, saving labor cost, meeting online real-time requirements, and solving the technical problem that speech synthesis cannot be completed online in real time because data annotation in the speech synthesis process requires manual participation.
In an optional embodiment, the second labeling module performs speech recognition processing on the recording audio and the recording text to obtain phoneme boundary information, performs speech detection processing on the recording audio to obtain mute boundary information, and performs cross comparison and verification on the phoneme boundary information and the mute boundary information to obtain phoneme boundary labeling data, i.e., obtains the first processing result.
Specifically, the second labeling module scans the silence boundary information to obtain a silence segment information list, then scans the phoneme boundary information to obtain a silence phoneme, compares the start time of the silence phoneme with the start time of the same silence segment recorded in the silence segment information list and the end time of the silence phoneme with the end time of the same silence segment recorded in the silence segment information list to obtain a comparison result, and finally determines phoneme boundary labeling data according to the comparison result.
If the time period of the mute phoneme is determined to fall within the range of some mute segment according to the comparison result, the mute phoneme is determined to be effective mute; the start time of the mute phoneme is corrected to the start time of that mute segment and the end time of the mute phoneme to the end time of that mute segment to obtain a correction result, and the end time of the previous phoneme adjacent to the start time of the mute phoneme and the start time of the next phoneme adjacent to the end time of the mute phoneme are then adjusted according to the correction result. If, according to the correction result, the corrected start time of the mute phoneme is less than or equal to the start time of the previous phoneme, or the corrected end time of the mute phoneme is greater than or equal to the end time of the next phoneme, the labeling is determined to be invalid and the labeling result for the mute phoneme is discarded. In addition, if the time period of the mute phoneme is determined not to fall within the range of any mute segment according to the comparison result, the mute phoneme is determined to be invalid mute; the mute phoneme is then deleted and its time period is merged into the time period of the adjacent phoneme.
It should be noted that in the above process the mute segment information list records the start time and end time of each mute segment; for example, [c, d] indicates that the mute starts at time c and ends at time d. The entries stored in the mute segment information list may be sorted in the chronological order in which the mute segments occur.
In an optional embodiment, after the first processing result and the second processing result are obtained, the processing module obtains pronunciation annotation data and a prosody pause prediction result by converting the recording text into a structured prosody text, then performs speech recognition processing on the recording audio and the recording text and performs speech detection processing on the recording audio to obtain phoneme boundary annotation data, and finally corrects the prosody pause prediction result by using the phoneme boundary annotation data to obtain prosody annotation data.
Specifically, the processing module merges the speech phonemes in the phoneme boundary annotation data into the corresponding words according to the phoneme rule, sets the mute phonemes in the phoneme boundary annotation data that are not at the head or tail of the recording audio as pause identifiers to obtain pause information, and compares the prosody pause prediction result with the pause information taking the merged words as units to obtain a comparison result. If a pause identifier exists at the position adjacent to a merged word, a first label is set at that position, where the first label adopts a first pause level; if no pause identifier exists at the position adjacent to the merged word but that position has been identified as a second pause level, a second label is set at that position, where the second label adopts the second pause level.
Through the above scheme, the accuracy of prosody prediction is guaranteed and the production cost of the sound library is reduced.
Example 3
According to an embodiment of the present application, there is also provided a data annotation device in speech synthesis for implementing the data annotation method in speech synthesis, as shown in fig. 5, the device 50 includes: an acquisition module 501 and a labeling module 503.
The acquiring module 501 is configured to acquire recording audio and recording text; the labeling module 503 is configured to perform recording annotation processing on the recording audio and the recording text to obtain recording annotation data, where the recording annotation data includes at least one of the following: pronunciation annotation data, prosody annotation data and phoneme boundary annotation data.
It should be noted that the acquiring module 501 and the labeling module 503 correspond to steps S202 to S204 in embodiment 1, and the two modules are the same as the corresponding steps in the implementation example and application scenario, but are not limited to the disclosure of the first embodiment. It should be noted that the modules described above as part of the apparatus may be run in the computer terminal 10 provided in the first embodiment.
In an alternative embodiment, the annotation module comprises: the device comprises a conversion module, a first processing module and a second processing module. The conversion module is used for converting the recording text into a structured prosody text to obtain pronunciation marking data and a prosody pause prediction result; the first processing module is used for carrying out voice recognition processing on the recording audio and the recording text and carrying out voice detection processing on the recording audio to obtain phoneme boundary labeling data; and the second processing module is used for correcting the prosody pause prediction result by adopting the phoneme boundary annotation data to obtain prosody annotation data.
In an alternative embodiment, the first processing module comprises: the device comprises a third processing module, a fourth processing module and a checking module. The third processing module is used for performing voice recognition processing on the recording audio and the recording text to obtain phoneme boundary information; the fourth processing module is used for carrying out voice detection processing on the recorded audio to obtain mute boundary information; and the verification module is used for performing cross comparison verification on the phoneme boundary information and the mute boundary information to obtain phoneme boundary labeling data.
In an alternative embodiment, the verification module comprises: the device comprises a fifth processing module, a scanning module and a determining module. The fifth processing module is configured to perform scanning processing on the silence boundary information to obtain a silence segment information list, where the silence segment information list is used to record a start time and an end time of each silence segment; the scanning module is used for scanning the phoneme boundary information to obtain a mute phoneme and comparing the starting time of the mute phoneme with the starting time of the same mute section recorded in the mute section information list and the ending time of the mute phoneme with the ending time of the same mute section recorded in the mute section information list respectively to obtain a comparison result; and the determining module is used for determining phoneme boundary labeling data according to the comparison result.
In an alternative embodiment, the determining module includes: the device comprises a first determining module, a correcting module and an adjusting module. The first determining module is used for determining the mute phoneme to be effective mute if the time period of the mute phoneme is determined to fall into the range of any mute period according to the comparison result; the correction module is used for correcting the starting time of the mute phoneme to the starting time of any mute section and correcting the ending time of the mute phoneme to the ending time of any mute section respectively to obtain a correction result; and the adjusting module is used for adjusting the ending time of the previous phoneme adjacent to the starting time of the mute phoneme and adjusting the starting time of the next phoneme adjacent to the ending time of the mute phoneme according to the correction result.
In an alternative embodiment, the data annotation device in speech synthesis further comprises: a second determination module. And the second determining module is used for determining that the labeling is invalid and discarding the labeling result of the mute phoneme if the modified start time of the mute phoneme is less than or equal to the start time of the first phoneme or the modified end time of the mute phoneme is greater than or equal to the end time of the second phoneme after the correction result is obtained by respectively modifying the start time of the mute phoneme to the start time of any mute section and the end time of the mute phoneme to the end time of any mute section.
In an alternative embodiment, the determining module includes: a third determining module and a first combining module. The third determining module is configured to determine that the mute phoneme is invalid mute if it is determined according to the comparison result that the time period of the mute phoneme does not fall within the range of any mute segment; and the first merging module is used for deleting the mute phoneme and merging the time period corresponding to the mute phoneme into the time period corresponding to the adjacent phoneme.
In an alternative embodiment, the second processing module comprises: the device comprises a second merging module, a comparison module and a fourth determination module. The second merging module is used for merging the voice phonemes in the phoneme boundary tagging data into corresponding words according to the phoneme rule, and setting mute phonemes, which are positioned outside the head and the tail of the recording audio in the phoneme boundary tagging data, as pause identifiers to obtain pause information; the comparison module is used for comparing the rhythm pause prediction result with pause information by taking the combined character as a unit to obtain a comparison result; and the fourth determining module is used for obtaining prosody labeling data according to the comparison result.
In an alternative embodiment, the fourth determining module includes: a sixth processing module and a seventh processing module. The sixth processing module is configured to, if a pause identifier is determined to exist at the position adjacent to the merged word, set a first label at that position, where the first label adopts a first pause level; the seventh processing module is configured to, if no pause identifier exists at the position adjacent to the merged word but that position is identified as a second pause level, set a second label at that position, where the second label adopts the second pause level.
Optionally, the data labeling apparatus in speech synthesis is applied to a training scenario of a speech synthesis model.
Example 4
According to an embodiment of the present application, there is also provided a data annotation system in speech synthesis for implementing the data annotation method in speech synthesis, the system including: a processor and a memory.
The memory is connected to the processor and is configured to provide the processor with instructions for processing the following steps: acquiring recording audio and recording text; and performing recording annotation processing on the recording audio and the recording text to obtain recording annotation data, where the recording annotation data includes at least one of the following: pronunciation annotation data, prosody annotation data and phoneme boundary annotation data.
As described above, the recording annotation data is labeled automatically: after the recording audio and the recording text are acquired, recording annotation is performed on them to obtain the recording annotation data, where the recording annotation data includes pronunciation annotation data, prosody annotation data and phoneme boundary annotation data. The processor is further configured to convert the recording text into a structured prosody text to obtain pronunciation annotation data and a prosody pause prediction result; obtain phoneme boundary annotation data by performing speech recognition processing on the recording audio and the recording text and performing voice detection processing on the recording audio; and correct the prosody pause prediction result with the phoneme boundary annotation data to obtain prosody annotation data.
It is easy to note that no manual intervention is needed in the above recording annotation process. In addition, the data labeling method provided by the present application can label all of the pronunciation labeling data, the rhythm labeling data and the phoneme boundary labeling data, rather than only one or two types of data, thereby achieving the aim of automatically labeling data in speech synthesis, attaining the technical effects of saving labor cost and meeting online real-time requirements, and solving the technical problem that speech synthesis cannot be completed online in real time because manual data labeling is required in the speech synthesis process.
It should be noted that, the data annotation system in speech synthesis in this embodiment can also execute the data annotation method in speech synthesis provided in embodiment 1, and related contents have been described in embodiment 1 and are not described herein again.
Example 5
An embodiment of the present application further provides a computer terminal, which may be any computer terminal device in a computer terminal group. Optionally, in this embodiment, the computer terminal may also be replaced with a terminal device such as a mobile terminal.
Optionally, in this embodiment, the computer terminal may be located in at least one network device of a plurality of network devices of a computer network.
In this embodiment, the computer terminal may execute program code of the following steps in the data annotation method in speech synthesis: acquiring recording audio and a recording text; and performing recording annotation processing on the recording audio and the recording text to obtain recording annotation data, where the recording annotation data includes at least one of the following: pronunciation annotation data, rhythm annotation data and phoneme boundary annotation data.
Optionally, fig. 6 is a block diagram of a computer terminal according to an embodiment of the present application. As shown in fig. 6, the computer terminal 10 may include: one or more processors 602 (only one is shown), a memory 604, and a peripheral interface 606.
The memory may be configured to store software programs and modules, such as program instructions/modules corresponding to the data annotation method and apparatus in speech synthesis in the embodiments of the present application; by running the software programs and modules stored in the memory, the processor executes various functional applications and data processing, that is, implements the data annotation method in speech synthesis. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory remotely located from the processor, and such remote memory may be connected to the computer terminal through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor can call the information and application programs stored in the memory through the transmission device to perform the following steps: acquiring recording audio and a recording text; and performing recording annotation processing on the recording audio and the recording text to obtain recording annotation data, where the recording annotation data includes: pronunciation annotation data, rhythm annotation data and phoneme boundary annotation data.
Optionally, the processor may further execute the program code of the following steps: obtaining pronunciation marking data and a prosody pause prediction result by converting the recording text into a structured prosody text; obtaining phoneme boundary labeling data by performing voice recognition processing on the recording audio and the recording text and performing voice detection processing on the recording audio; and correcting the prosody pause prediction result by adopting the phoneme boundary annotation data to obtain prosody annotation data.
Optionally, the processor may further execute the program code of the following steps: performing voice recognition processing on the recording audio and the recording text to obtain phoneme boundary information; carrying out voice detection processing on the recorded audio to obtain mute boundary information; and performing cross comparison and verification on the phoneme boundary information and the mute boundary information to obtain phoneme boundary labeling data.
Optionally, the processor may further execute the program code of the following steps: scanning the mute boundary information to obtain a mute section information list, wherein the mute section information list is used for recording the starting time and the ending time of each mute section; obtaining a mute phoneme by scanning the phoneme boundary information, and respectively comparing the starting time of the mute phoneme with the starting time of the same mute section recorded in the mute section information list and the ending time of the mute phoneme with the ending time of the same mute section recorded in the mute section information list to obtain a comparison result; and determining phoneme boundary labeling data according to the comparison result.
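As a sketch of this cross-comparison, the following assumes a simple illustrative data layout in which each phoneme is a (label, start, end) tuple with the label "sil" marking a mute phoneme, and the mute section information list is a list of (start, end) pairs produced by voice detection; this layout is an assumption, not the application's own format.

```python
# Hypothetical data layout: phonemes are (label, start, end) tuples in seconds,
# with "sil" marking a mute phoneme; silence_segments is the mute section
# information list of (start, end) pairs produced by voice detection.

def find_covering_segment(sil_phoneme, silence_segments):
    """Return the mute segment whose range covers the mute phoneme, else None."""
    _, p_start, p_end = sil_phoneme
    for s_start, s_end in silence_segments:
        if s_start <= p_start and p_end <= s_end:
            return (s_start, s_end)   # the phoneme's time period falls inside this segment
    return None                        # no covering segment: the silence is invalid

# Example: a "sil" phoneme at 1.32-1.58 s matched against detected silences.
silence_segments = [(0.00, 0.21), (1.30, 1.62), (3.05, 3.40)]
print(find_covering_segment(("sil", 1.32, 1.58), silence_segments))  # -> (1.30, 1.62)
```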
Optionally, the processor may further execute the program code of the following steps: if the time period of the mute phoneme is determined to fall into the range of any mute section according to the comparison result, determining the mute phoneme to be effective mute; respectively correcting the starting time of the mute phoneme to the starting time of any mute section and correcting the ending time of the mute phoneme to the ending time of any mute section to obtain correction results; and adjusting the ending time of the previous phoneme adjacent to the starting time of the mute phoneme and the starting time of the next phoneme adjacent to the ending time of the mute phoneme according to the correction result.
Optionally, the processor may further execute the program code of the following steps: after the start time of the mute phoneme is corrected to the start time of any mute section and the end time of the mute phoneme is corrected to the end time of any mute section respectively to obtain correction results, if the start time of the modified mute phoneme is determined to be less than or equal to the start time of the first phoneme or the end time of the modified mute phoneme is determined to be greater than or equal to the end time of the second phoneme according to the correction results, the labeling is determined to be invalid, and the labeling result of the mute phoneme is discarded.
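A compact sketch of this correction and validity check, under the same hypothetical (label, start, end) layout as above; returning None to signal a discarded labeling is likewise an illustrative convention rather than the application's own interface.

```python
# Sketch of the valid-silence correction: the mute phoneme is snapped to the
# detected mute segment and the adjacent phonemes are adjusted, unless the
# snapped boundaries would reach the previous phoneme's start or the next
# phoneme's end, in which case the labeling is treated as invalid and discarded.

def correct_valid_silence(prev_ph, sil_ph, next_ph, segment):
    seg_start, seg_end = segment
    # Invalidation check: the corrected silence must stay strictly inside the
    # span between the previous phoneme's start and the next phoneme's end.
    if seg_start <= prev_ph[1] or seg_end >= next_ph[2]:
        return None                                            # labeling invalid, discard it
    corrected_prev = (prev_ph[0], prev_ph[1], seg_start)       # previous phoneme ends at the new silence start
    corrected_sil = (sil_ph[0], seg_start, seg_end)            # silence snapped to the detected segment
    corrected_next = (next_ph[0], seg_end, next_ph[2])         # next phoneme starts at the new silence end
    return corrected_prev, corrected_sil, corrected_next
```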
Optionally, the processor may further execute the program code of the following steps: if the time period of the mute phoneme is determined not to fall into the range of any mute section according to the comparison result, determining the mute phoneme to be invalid mute; deleting the mute phoneme and merging the time period corresponding to the mute phoneme into the time period corresponding to the adjacent phoneme.
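For the invalid-silence case, a minimal sketch under the same assumed layout; the text only states that the mute phoneme's time period is merged into "the adjacent phoneme", so absorbing it into the preceding phoneme is one illustrative choice.

```python
# Sketch of the invalid-silence handling: the mute phoneme is deleted and its
# time period is absorbed by an adjacent phoneme (here, the preceding one).

def drop_invalid_silence(phonemes, sil_index):
    assert 0 < sil_index < len(phonemes), "interior silence phoneme expected"
    _sil_label, _sil_start, sil_end = phonemes[sil_index]
    prev_label, prev_start, _prev_end = phonemes[sil_index - 1]
    merged = list(phonemes[:sil_index - 1])
    merged.append((prev_label, prev_start, sil_end))   # preceding phoneme absorbs the span
    merged.extend(phonemes[sil_index + 1:])
    return merged
```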
Optionally, the processor may further execute the program code of the following steps: merging the voice phonemes in the phoneme boundary labeling data into corresponding words according to a phoneme rule, and setting mute phonemes, which are positioned outside the head and the tail of the recorded audio, in the phoneme boundary labeling data as pause identifiers to obtain pause information; comparing the rhythm pause prediction result with pause information by taking the word obtained by merging as a unit to obtain a comparison result; and obtaining prosody labeling data according to the comparison result.
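The following sketch shows how word-level pause information might be derived from the phoneme boundary annotation data. The phoneme rule for merging phonemes into words is not spelled out in the text, so each phoneme here is assumed to already carry the index of the word it belongs to (None for mute phonemes); that assumption exists only for illustration.

```python
# Sketch: interior mute phonemes (i.e. not the leading/trailing silence of the
# recording) become pause identifiers after the word that precedes them.
# Each phoneme is assumed to be (word_idx, label, start, end), word_idx None for silence.

def build_pause_info(phonemes):
    """Return the set of word indices that are followed by a pause identifier."""
    pauses = set()
    last_word = None
    for i, (word_idx, _label, _start, _end) in enumerate(phonemes):
        if word_idx is None:                          # mute phoneme
            interior = 0 < i < len(phonemes) - 1      # skip head/tail silence
            if interior and last_word is not None:
                pauses.add(last_word)                 # pause after the preceding word
        else:
            last_word = word_idx
    return pauses

# Example: word 0, an interior silence, word 1, then trailing silence.
phs = [(None, "sil", 0.0, 0.2), (0, "n", 0.2, 0.3), (0, "i", 0.3, 0.4),
       (None, "sil", 0.4, 0.6), (1, "m", 0.6, 0.7), (1, "a", 0.7, 0.8),
       (None, "sil", 0.8, 1.0)]
print(build_pause_info(phs))   # -> {0}
```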
Optionally, the processor may further execute the program code of the following steps: if it is determined that a pause identifier exists at the adjacent position after the merged word, setting a first label at the adjacent position, wherein the first label adopts a first pause level; if it is determined that no pause identifier exists at the adjacent position after the merged word but the position is predicted as a second pause level, setting a second label at the adjacent position, wherein the second label adopts the second pause level.
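A sketch of the final comparison between the prosody pause prediction result and the pause information follows. The level codes and the level kept in the second branch are assumptions (the description is ambiguous on that point), so this is only one plausible reading.

```python
# Sketch of the pause-level labeling. FIRST_LEVEL / SECOND_LEVEL are hypothetical
# pause-level codes; keeping the predicted second level when no pause identifier
# is found is an assumed reading of the description.

FIRST_LEVEL, SECOND_LEVEL = "#3", "#2"   # assumed codes for stronger / weaker breaks

def label_pauses(word_indices, audio_pauses, predicted_levels):
    """word_indices: merged words in order; audio_pauses: set of word indices with a
    pause identifier; predicted_levels: dict mapping word index -> predicted level."""
    labels = {}
    for w in word_indices:
        if w in audio_pauses:
            labels[w] = FIRST_LEVEL                   # pause confirmed by the recording audio
        elif predicted_levels.get(w) == SECOND_LEVEL:
            labels[w] = SECOND_LEVEL                  # keep the weaker predicted break
    return labels

# Example usage, reusing the pause info from the previous sketch.
print(label_pauses([0, 1], {0}, {1: SECOND_LEVEL}))   # -> {0: '#3', 1: '#2'}
```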
It can be understood by those skilled in the art that the structure shown in fig. 6 is only illustrative, and the computer terminal may also be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID) or a PAD. Fig. 6 does not limit the structure of the above electronic device. For example, the computer terminal 10 may also include more or fewer components (e.g., a network interface, a display device, etc.) than shown in fig. 6, or have a configuration different from that shown in fig. 6.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
Example 6
Embodiments of the present application also provide a storage medium. Optionally, in this embodiment, the storage medium may be configured to store program code for executing the data annotation method in speech synthesis provided in Embodiment 1.
Optionally, in this embodiment, the storage medium may be located in any one of computer terminals in a computer terminal group in a computer network, or in any one of mobile terminals in a mobile terminal group.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring recording audio and a recording text; and performing recording annotation processing on the recording audio and the recording text to obtain recording annotation data, where the recording annotation data includes at least one of the following: pronunciation annotation data, rhythm annotation data and phoneme boundary annotation data.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: obtaining pronunciation marking data and a prosody pause prediction result by converting the recording text into a structured prosody text; obtaining phoneme boundary labeling data by performing voice recognition processing on the recording audio and the recording text and performing voice detection processing on the recording audio; and correcting the prosody pause prediction result by adopting the phoneme boundary annotation data to obtain prosody annotation data.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: performing voice recognition processing on the recording audio and the recording text to obtain phoneme boundary information; carrying out voice detection processing on the recorded audio to obtain mute boundary information; and performing cross comparison and verification on the phoneme boundary information and the mute boundary information to obtain phoneme boundary labeling data.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: scanning the mute boundary information to obtain a mute section information list, wherein the mute section information list is used for recording the starting time and the ending time of each mute section; obtaining a mute phoneme by scanning the phoneme boundary information, and respectively comparing the starting time of the mute phoneme with the starting time of the same mute section recorded in the mute section information list and the ending time of the mute phoneme with the ending time of the same mute section recorded in the mute section information list to obtain a comparison result; and determining phoneme boundary labeling data according to the comparison result.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: if the time period of the mute phoneme is determined to fall into the range of any mute section according to the comparison result, determining the mute phoneme to be effective mute; respectively correcting the starting time of the mute phoneme to the starting time of any mute section and correcting the ending time of the mute phoneme to the ending time of any mute section to obtain correction results; and adjusting the ending time of the previous phoneme adjacent to the starting time of the mute phoneme and the starting time of the next phoneme adjacent to the ending time of the mute phoneme according to the correction result.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: after the start time of the mute phoneme is corrected to the start time of any mute section and the end time of the mute phoneme is corrected to the end time of any mute section respectively to obtain correction results, if the start time of the modified mute phoneme is determined to be less than or equal to the start time of the first phoneme or the end time of the modified mute phoneme is determined to be greater than or equal to the end time of the second phoneme according to the correction results, the labeling is determined to be invalid, and the labeling result of the mute phoneme is discarded.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: if the time period of the mute phoneme is determined not to fall into the range of any mute section according to the comparison result, determining the mute phoneme to be invalid mute; deleting the mute phoneme and merging the time period corresponding to the mute phoneme into the time period corresponding to the adjacent phoneme.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: merging the voice phonemes in the phoneme boundary labeling data into corresponding words according to a phoneme rule, and setting mute phonemes, which are positioned outside the head and the tail of the recorded audio, in the phoneme boundary labeling data as pause identifiers to obtain pause information; comparing the rhythm pause prediction result with pause information by taking the word obtained by merging as a unit to obtain a comparison result; and obtaining prosody labeling data according to the comparison result.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: if it is determined that a pause identifier exists at the adjacent position after the merged word, setting a first label at the adjacent position, wherein the first label adopts a first pause level; if it is determined that no pause identifier exists at the adjacent position after the merged word but the position is predicted as a second pause level, setting a second label at the adjacent position, wherein the second label adopts the second pause level.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.
Claims (14)
1. A method for labeling data in speech synthesis is characterized by comprising the following steps:
acquiring a recording audio and a recording text;
performing recording annotation processing on the recording audio and the recording text to obtain recording annotation data, wherein the recording annotation data includes at least one of the following: pronunciation annotation data, rhythm annotation data and phoneme boundary annotation data;
wherein the performing recording annotation processing on the recording audio and the recording text to obtain the recording annotation data comprises:
obtaining the pronunciation marking data and a prosody pause prediction result by converting the recording text into a structured prosody text;
obtaining the phoneme boundary labeling data by performing voice recognition processing on the recording audio and the recording text and performing voice detection processing on the recording audio;
and correcting the prosody pause prediction result by adopting the phoneme boundary annotation data to obtain the prosody annotation data.
2. The method of claim 1, wherein obtaining the phoneme boundary labeling data by performing a speech recognition process on the recorded audio and the recorded text and by performing a speech detection process on the recorded audio comprises:
performing voice recognition processing on the recording audio and the recording text to obtain phoneme boundary information;
carrying out voice detection processing on the recorded audio to obtain mute boundary information;
and performing cross comparison and verification on the phoneme boundary information and the mute boundary information to obtain phoneme boundary labeling data.
3. The method of claim 2, wherein cross-comparing the phoneme boundary information with the silence boundary information to obtain the phoneme boundary labeling data comprises:
scanning the mute boundary information to obtain a mute section information list, wherein the mute section information list is used for recording the starting time and the ending time of each mute section;
obtaining a mute phoneme by scanning the phoneme boundary information, and comparing the starting time of the mute phoneme with the starting time of the same mute section recorded in the mute section information list and the ending time of the mute phoneme with the ending time of the same mute section recorded in the mute section information list respectively to obtain a comparison result;
and determining the phoneme boundary labeling data according to the comparison result.
4. The method of claim 3, wherein determining the phoneme boundary labeling data according to the comparison result comprises:
if the time period of the mute phoneme is determined to fall into the range of any mute section according to the comparison result, determining the mute phoneme to be effective mute;
respectively correcting the starting time of the mute phoneme to be the starting time of any mute section and correcting the ending time of the mute phoneme to be the ending time of any mute section to obtain a correction result;
and adjusting the ending time of the previous phoneme adjacent to the starting time of the mute phoneme and the starting time of the next phoneme adjacent to the ending time of the mute phoneme according to the correction result.
5. The method according to claim 4, wherein after the modifying the start time of the mute phoneme to the start time of any mute segment and the end time of the mute phoneme to the end time of any mute segment respectively to obtain the modifying results, the method further comprises:
and if the starting time of the modified mute phoneme is determined to be less than or equal to the starting time of the first phoneme or the ending time of the modified mute phoneme is determined to be greater than or equal to the ending time of the second phoneme according to the modification result, determining that the labeling is invalid and discarding the labeling result of the mute phoneme.
6. The method of claim 3, wherein determining the phoneme boundary labeling data according to the comparison result comprises:
if the time period of the mute phoneme is determined not to fall into the range of any mute section according to the comparison result, determining the mute phoneme to be invalid mute;
deleting the mute phoneme, and merging the time period corresponding to the mute phoneme into the time period corresponding to the adjacent phoneme.
7. The method of claim 1, wherein the modifying the prosody pause prediction result by using the phoneme boundary labeling data to obtain the prosody labeling data comprises:
merging the voice phonemes in the phoneme boundary labeling data into corresponding words according to a phoneme rule, and setting mute phonemes, which are positioned outside the head and the tail of the recording audio in the phoneme boundary labeling data, as pause identifiers to obtain pause information;
comparing the rhythm pause prediction result with the pause information by taking the word obtained by merging as a unit to obtain a comparison result;
and obtaining the prosody labeling data according to the comparison result.
8. The method of claim 7, wherein obtaining the prosody labeling data according to the comparison result comprises:
if the pause identifier is determined to exist at the adjacent position after the merged word, setting a first label at the adjacent position, wherein the first label adopts a first pause level;
if it is determined that no pause identifier is present at the adjacent position after the merged word but the position is predicted as a second pause level, setting a second label at the adjacent position, wherein the second label adopts the second pause level.
9. The method of claim 1, wherein the method is applied to a training scenario of a speech synthesis model.
10. An apparatus for annotating data in speech synthesis, comprising:
the first labeling module is used for acquiring a recording text and converting the recording text into a structured prosodic text;
the second labeling module is used for acquiring a recording audio and the recording text, and performing voice recognition on the recording audio and the recording text to obtain a first processing result, wherein the first processing result is used for describing time boundary information of each phoneme in the recording audio;
the third labeling module is used for acquiring the recording audio and performing signal processing on the recording audio to obtain a second processing result, wherein the second processing result is used for describing voice part information and mute part information detected from the recording audio;
a processing module to perform at least one of the following operations:
determining pronunciation marking data according to the structured prosodic text;
determining phoneme boundary annotation data according to the first processing result and the second processing result;
and determining prosody labeling data according to the structured prosody text and the phoneme boundary labeling data.
11. An apparatus for annotating data in speech synthesis, comprising:
the acquisition module is used for acquiring the recording audio and the recording text;
the labeling module is used for performing recording annotation processing on the recording audio and the recording text to obtain recording annotation data, wherein the recording annotation data comprises at least one of the following: pronunciation annotation data, rhythm annotation data and phoneme boundary annotation data;
wherein, the labeling module comprises: the conversion module is used for converting the recording text into a structured prosody text to obtain the pronunciation annotation data and a prosody pause prediction result; the first processing module is used for performing voice recognition processing on the recording audio and the recording text and performing voice detection processing on the recording audio to obtain the phoneme boundary labeling data; and the second processing module is used for correcting the prosody pause prediction result by adopting the phoneme boundary annotation data to obtain the prosody annotation data.
12. A storage medium, characterized in that the storage medium comprises a stored program, wherein when the program runs, a device in which the storage medium is located is controlled to execute the data annotation method in speech synthesis according to any one of claims 1 to 9.
13. A processor, characterized in that the processor is configured to run a program, wherein the program is configured to execute the method for data annotation in speech synthesis according to any one of claims 1 to 9 when running.
14. A system for annotating data in speech synthesis, comprising:
a processor; and
a memory coupled to the processor and configured to provide the processor with instructions for processing the following steps: acquiring recording audio and a recording text; and performing recording annotation processing on the recording audio and the recording text to obtain recording annotation data, wherein the recording annotation data includes at least one of the following: pronunciation annotation data, rhythm annotation data and phoneme boundary annotation data;
the processor is further used for converting the recording text into a structured prosody text to obtain the pronunciation annotation data and a prosody pause prediction result; obtaining the phoneme boundary labeling data by performing voice recognition processing on the recording audio and the recording text and performing voice detection processing on the recording audio; and correcting the prosody pause prediction result by adopting the phoneme boundary annotation data to obtain the prosody annotation data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910650880.XA CN112242132B (en) | 2019-07-18 | 2019-07-18 | Data labeling method, device and system in voice synthesis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112242132A true CN112242132A (en) | 2021-01-19 |
CN112242132B CN112242132B (en) | 2024-06-14 |
Family
ID=74167879
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910650880.XA Active CN112242132B (en) | 2019-07-18 | 2019-07-18 | Data labeling method, device and system in voice synthesis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112242132B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114203160A (en) * | 2021-12-28 | 2022-03-18 | 深圳市优必选科技股份有限公司 | Method, device and equipment for generating sample data set |
EP4033484A3 (en) * | 2021-06-09 | 2023-02-01 | Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd. | Recognition of semantic information of a speech signal, training a recognition model |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060009977A1 (en) * | 2004-06-04 | 2006-01-12 | Yumiko Kato | Speech synthesis apparatus |
CN101000764A (en) * | 2006-12-18 | 2007-07-18 | 黑龙江大学 | Speech synthetic text processing method based on rhythm structure |
CN101000765A (en) * | 2007-01-09 | 2007-07-18 | 黑龙江大学 | Speech synthetic method based on rhythm character |
US20140149116A1 (en) * | 2011-07-11 | 2014-05-29 | Nec Corporation | Speech synthesis device, speech synthesis method, and speech synthesis program |
CN105374350A (en) * | 2015-09-29 | 2016-03-02 | 百度在线网络技术(北京)有限公司 | Speech marking method and device |
CN108962217A (en) * | 2018-07-28 | 2018-12-07 | 华为技术有限公司 | Phoneme synthesizing method and relevant device |
CN109300468A (en) * | 2018-09-12 | 2019-02-01 | 科大讯飞股份有限公司 | A kind of voice annotation method and device |
CN109285535A (en) * | 2018-10-11 | 2019-01-29 | 四川长虹电器股份有限公司 | Phoneme synthesizing method based on Front-end Design |
Non-Patent Citations (1)
Title |
---|
Fu Ruibo (傅睿博): "Automatic Prosodic Boundary Annotation Based on the Fusion of Silence Duration and Text Features", Journal of Tsinghua University (《清华大学学报》), vol. 58, no. 1, 31 December 2018 (2018-12-31) *
Also Published As
Publication number | Publication date |
---|---|
CN112242132B (en) | 2024-06-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105845125B (en) | Phoneme synthesizing method and speech synthetic device | |
CN107622054B (en) | Text data error correction method and device | |
CN106534548B (en) | Voice error correction method and device | |
KR102191425B1 (en) | Apparatus and method for learning foreign language based on interactive character | |
CN109522564B (en) | Voice translation method and device | |
CN105095186A (en) | Semantic analysis method and device | |
CN109543021B (en) | Intelligent robot-oriented story data processing method and system | |
CN112463942B (en) | Text processing method, text processing device, electronic equipment and computer readable storage medium | |
CN109102824B (en) | Voice error correction method and device based on man-machine interaction | |
US20220358297A1 (en) | Method for human-machine dialogue, computing device and computer-readable storage medium | |
CN111128116B (en) | Voice processing method and device, computing equipment and storage medium | |
CN111881297A (en) | Method and device for correcting voice recognition text | |
CN111883137A (en) | Text processing method and device based on voice recognition | |
CN111899859A (en) | Surgical instrument counting method and device | |
CN112908308B (en) | Audio processing method, device, equipment and medium | |
CN109166569B (en) | Detection method and device for phoneme mislabeling | |
CN111916062A (en) | Voice recognition method, device and system | |
CN113535925A (en) | Voice broadcasting method, device, equipment and storage medium | |
CN114783424A (en) | Text corpus screening method, device, equipment and storage medium | |
CN112242132A (en) | Data labeling method, device and system in speech synthesis | |
CN106528715B (en) | Audio content checking method and device | |
CN109065019B (en) | Intelligent robot-oriented story data processing method and system | |
US9087512B2 (en) | Speech synthesis method and apparatus for electronic system | |
CN112447168A (en) | Voice recognition system and method, sound box, display device and interaction platform | |
CN112151019A (en) | Text processing method and device and computing equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |