CN111145719B - Data labeling method and device for Chinese-English mixing and tone labeling - Google Patents

Data labeling method and device for Chinese-English mixing and tone labeling

Info

Publication number
CN111145719B
CN111145719B (application CN201911404092.9A)
Authority
CN
China
Prior art keywords
training
text
audio file
characters
english
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911404092.9A
Other languages
Chinese (zh)
Other versions
CN111145719A (en)
Inventor
Dai Jian
Zhou Weidong
Liu Hua
Liu Kai
Yu Ling
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Taiji Huabao Technology Co ltd
Original Assignee
Beijing Taiji Huabao Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Taiji Huabao Technology Co ltd filed Critical Beijing Taiji Huabao Technology Co ltd
Priority to CN201911404092.9A priority Critical patent/CN111145719B/en
Publication of CN111145719A publication Critical patent/CN111145719A/en
Application granted granted Critical
Publication of CN111145719B publication Critical patent/CN111145719B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The embodiment of the application discloses a data labeling method and device for Chinese-English mixing and tone labeling, applied to a deep-learning speech synthesis algorithm. The method comprises the following steps: capturing training text, which covers both Chinese and English characters, from a data source; adding emotion labels to the captured training text, and recording the speaker's reading of the labeled text as an audio file for training; checking whether the training audio file is consistent with the emotion labels of the corresponding training text, and revising the inconsistent portions of the audio file; and mapping the training text into a text vector, submitting the text vector and the speaker's recorded audio file to a deep learning engine of a neural network for training, and learning, through deep learning training, the pronunciation characteristics of text under the various combinations of Chinese, English, and emotion labels.

Description

Data labeling method and device for Chinese-English mixing and tone labeling
Technical Field
The embodiment of the application relates to a data labeling method and device for Chinese-English mixing and tone labeling.
Background
Existing speech synthesis technology has greatly improved synthesis quality: lifelike speech can be generated directly from text and applied in fields such as voice navigation, automatic broadcasting, and automatic queue-number calling. However, in current text-to-speech technology the intonation of the output is often flattened; although the speech sounds smooth, it lacks emotional color, and the listening experience is poor. Moreover, traditional speech output technology cannot handle mixed Chinese and English input: when Chinese-English mixed pronunciation is involved, two models are usually invoked for processing, so processing efficiency is low and the speech output effect is poor. This is because, in the conventional text labeling technique, characters are directly converted into pinyin, and the pinyin is converted into vectors that serve as the input of a neural network. With such labeled data, the uniformity of the data preparation makes it essentially impossible to train speech with a natural rising-and-falling cadence.
Disclosure of Invention
In order to solve the above technical problem, embodiments of the present application provide a data labeling method and device for Chinese-English mixing and tone labeling.
The technical scheme of the invention is realized as follows:
the embodiment of the application provides a data labeling method for Chinese-English mixing and tone labeling, which comprises the following steps:
capturing a training text from a data source, wherein the training text covers Chinese and English characters;
adding emotion labels to the captured training text, wherein the emotion labels comprise at least one of: short pause, flat tone, surprise, question, drawn-out tone, rhetorical question, and emphasis;
recording the speaker's reading of the emotion-labeled training text as an audio file for training;
checking whether the audio file for training is consistent with the emotion labels of the corresponding training text, and revising the inconsistent portions of the audio file;
mapping the training text into a text vector, submitting the text vector and the speaker's recorded audio file to a deep learning engine of a neural network for training, and learning, through deep learning training, the pronunciation characteristics of text under the various combinations of Chinese, English, and emotion labels.
As one implementation, mapping the training text into a text vector includes:
labeling the pronunciation of the Chinese characters, numbers, and English characters in the training text; converting the labeled pronunciation of each sentence into a number string, letter by letter, according to the correspondence between letters and calibrated numbers; converting the tone corresponding to each character into a corresponding number; and converting the emotion label of the sentence into a corresponding number identifier, wherein phonetic labels are marked with set identifiers and the set identifiers are likewise converted into numbers;
the text thus converted into a number string is mapped into a vector.
As an implementation, the method further comprises:
marking the retroflex (erhua) finals of characters in the training text, while leaving the neutral tones of characters unmarked;
and, before the labeled text is mapped into a vector, converting the retroflex final in the tone corresponding to the character into a corresponding number.
As one implementation, after revising the audio file, the method further includes:
and when the audio file still cannot meet the requirement after revision, deleting the audio file, or reading aloud again the training text corresponding to the audio file to regenerate the audio file.
A data labeling device for Chinese-English mixing and tone labeling comprises:
the capturing unit is used for capturing the training text from the data source; the training text covers Chinese and English characters;
the adding unit is used for adding emotion labels to the captured training text, wherein the emotion labels comprise at least one of: short pause, flat tone, surprise, question, drawn-out tone, rhetorical question, and emphasis;
the recording unit is used for recording the speaker's reading of the emotion-labeled training text as the audio file for training;
the checking unit is used for checking whether the audio file for training is consistent with the emotion labels of the corresponding training text, and for triggering the revision unit when they are inconsistent;
a revision unit for revising the portions of the training audio file that are inconsistent with the corresponding training text;
the mapping unit is used for mapping the training text into a text vector;
and the training unit is used for submitting the text vector and the speaker's recorded audio file to a deep learning engine of a neural network for training, and for learning, through deep learning training, the pronunciation characteristics of text under the various combinations of Chinese, English, and emotion labels.
As an implementation manner, the mapping unit is further configured to label the pronunciation of the Chinese characters, numbers, and English characters in the training text, convert the labeled pronunciation of each sentence into a number string, letter by letter, according to the correspondence between letters and calibrated numbers, convert the tone corresponding to each character into a corresponding number, and convert the emotion labels of the sentence into corresponding number identifiers, wherein phonetic labels are marked with set identifiers and the set identifiers are likewise converted into numbers;
the text thus converted into a number string is mapped into a vector.
As an implementation manner, the mapping unit is further configured to mark the retroflex (erhua) finals of characters in the training text, while leaving the neutral tones of characters unmarked;
and, before the labeled text is mapped into a vector, to convert the retroflex final in the tone corresponding to the character into a corresponding number.
As an implementation, the revision unit is further configured to:
after the audio file is revised, when it still cannot meet the requirement, the audio file is deleted, or the training text corresponding to the audio file is read aloud again to regenerate the audio file.
Compared with the prior art, the technical scheme of the embodiment of the application has the following advantages:
the embodiment of the application can solve the problems that the synthesized voice is too flat and Chinese and English can not be read in a mixed mode in the traditional end-to-end deep learning network, can train a better voice model by carrying out a data labeling algorithm on the text, synthesizes a voice effect of inhibiting the rising and the falling, and can support the Chinese and English mixed reading in a model, so that the output voice is consistent with the intuition of people, the complexity of a neural network is not increased, and the network learning is facilitated. The intuitive data annotation method for the end-to-end speech synthesis model provided by the embodiment of the application can meet the requirement of basic intonation definition without increasing extra complexity.
Drawings
FIG. 1 is a schematic flow chart illustrating a data labeling method for Chinese-English mixing and tone labeling according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram illustrating a data labeling apparatus for Chinese-English mixing and tone labeling.
Detailed Description
The embodiments described in the present application can be combined with one another provided they do not conflict.
The technical solution of the present application will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a data labeling method for Chinese-English mixing and tone labeling provided by an embodiment of the present application. As shown in fig. 1, the method includes the following steps:
step 101, capturing a training text from a data source, wherein the training text covers Chinese and English characters.
In the embodiment of the application, the training text can be obtained from a data training library. The data source may be various web pages on the network, such as text from Baidu encyclopedia, or text from textbooks, magazines, and the like. The embodiment of the application captures training text containing both Chinese and English characters from the data source.
And 102, adding emotion labels to the captured training text, wherein the emotion labels comprise at least one of: short pause, flat tone, surprise, question, drawn-out tone, rhetorical question, and emphasis.
In the embodiment of the application, emotion labels such as short pause, flat tone, surprise, question, drawn-out tone, rhetorical question, and emphasis need to be added to the training text. Determining the mood of a sentence from punctuation marks alone is too coarse, because the mood is often strongly tied to the semantics of the context: the same sentence carries very different moods in different scenarios, and the differences are obvious when it is read intensely, flatly, or sarcastically. The mood of a sentence is therefore determined from semantic analysis of the context, the punctuation marks, the characters, and the positions of the characters within the sentence, so that emotion labels can be added to the training text more accurately.
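As a minimal illustration of the punctuation-based first pass described above (which, as just noted, is too coarse on its own and would be refined from context semantics), the following Python sketch assigns a default mood label from the sentence-final mark. The function and label names are illustrative assumptions, not the patent's implementation:

```python
# Sketch: assign a coarse default emotion label from sentence-final punctuation.
# As the text above notes, this is only a first pass; context semantics
# (speaker intent, surrounding sentences) must refine the label.
DEFAULT_MOOD = {
    "。": "flat",        # level reading
    "!": "surprise",
    "?": "question",
    "^": "rhetorical",
}

def default_mood(sentence: str) -> str:
    """Return a coarse mood label based on the final punctuation mark."""
    return DEFAULT_MOOD.get(sentence[-1], "flat") if sentence else "flat"

print(default_mood("你确认吗?"))  # -> "question"
print(default_mood("我很好^"))    # -> "rhetorical"
```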
And 103, recording the speaker's reading of the emotion-labeled training text as the audio file for training.
In the embodiment of the application, after the training text is captured, the speaker reads the text aloud according to the emotion labels, and the reading is recorded and stored as the audio file for training.
And 104, checking whether the training audio file is consistent with the emotion labels of the corresponding training text, and revising the inconsistent portions of the audio file.
In the embodiment of the application, the audio file produced by the reading needs to be checked, and any portion that does not meet the requirements is revised; if the audio file still cannot meet the requirements after revision, it is deleted, or the training text corresponding to the audio file is read aloud again to regenerate the audio file.
And 105, mapping the training text into a text vector, submitting the text vector and the speaker's recorded audio file to a deep learning engine of a neural network for training, and learning, through deep learning training, the pronunciation characteristics of text under the various combinations of Chinese, English, and emotion labels.
How to map the training text into a text vector is described in detail below.
In the embodiment of the application, the pronunciation of each character, word, numeral, English character, and so on in the training text is labeled. Chinese characters are converted directly into the pinyin of the corresponding character, with tone marks appended: tones 1-4 are written as the digits 1-4 after the pinyin, while the neutral tone is left unmarked. For example, "小数点儿" (decimal point) converts to the pinyin "xiao3 shu4 dian3 er", and "点击" (click) converts to "dian3 ji1".
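A minimal Python sketch of this character-to-pinyin conversion, using a tiny hand-built lexicon covering only the two examples above; a real system would use a full pronunciation dictionary, and the lexicon entries and function name here are illustrative:

```python
# Sketch of Chinese-character-to-numbered-pinyin conversion.
# The lexicon below is a tiny illustrative sample; a production system
# would use a full pronunciation dictionary.
LEXICON = {
    "小": "xiao3",
    "数": "shu4",
    "点": "dian3",
    "击": "ji1",
    "儿": "er",   # erhua / neutral tone: no tone digit appended
}

def to_pinyin(text: str) -> str:
    """Convert a run of Chinese characters to space-separated pinyin
    with tone digits 1-4 appended; neutral tones carry no digit."""
    return " ".join(LEXICON[ch] for ch in text)

print(to_pinyin("小数点儿"))  # -> "xiao3 shu4 dian3 er"
print(to_pinyin("点击"))      # -> "dian3 ji1"
```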
English is labeled with phonetic symbols using the CMU phoneme set, as shown in the following table:
[Table: CMU phonetic symbol set; reproduced as images in the original document]
Sounds labeled with CMU phonetic symbols are enclosed in "{ }"; for example, the English letter Q converts to "{K Y UW}".
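A sketch of this bracketing step; only the entry for Q is taken from the text above, and the other letter pronunciations are assumed ARPAbet spellings added for illustration:

```python
# Sketch: spell out English letters as CMU (ARPAbet) phonemes in braces.
# "Q" -> "{K Y UW}" is taken from the example above; the remaining
# entries are assumed ARPAbet letter names, added for illustration.
LETTER_PHONES = {
    "Q": "K Y UW",
    "A": "EY",      # assumed
    "B": "B IY",    # assumed
    "M": "EH M",    # assumed
}

def english_to_cmu(token: str) -> str:
    """Wrap the CMU phonemes of each English letter in '{ }'."""
    return "".join("{" + LETTER_PHONES[c] + "}" for c in token.upper())

print(english_to_cmu("Q"))   # -> "{K Y UW}"
print(english_to_cmu("AB"))  # -> "{EY}{B IY}"
```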
For example, the sentence "您的京Q3HM21的丰田汽车是您或您的家人在开吗?" ("Is the Toyota with license plate Jing Q3HM21 being driven by you or your family?") converts to:
"nin2 de jing1{K Y UW}san1{HH AH M}er4 yi1 de feng1 tian2 qi4 che1 shi4 nin2 huo4 nin2 de jia1 ren2 zai4 kai1 ma?".
In the embodiment of the present application, the labels include:
1. Retroflex (erhua) final: for example, the converted pinyin of "小数点儿" (decimal point) is "xiao3 shu4 dian3 er".
2. Short pause: ",". Indicates a brief pause in the reading.
3. Flat reading: "。". Indicates normal, level reading.
4. Surprised reading: "!". Indicates that the sentence should be read with surprise.
5. Questioning reading: "?". Indicates that the sentence should be read in a questioning tone.
6. Drawn-out reading: "-". Indicates that the marked word should be prolonged; for example, in "请问您是-" ("Excuse me, you are-"), the final "是" is drawn out.
7. Rhetorical reading: "^". Indicates that the sentence should be read in the tone of a rhetorical question, generally with heavy, third-tone-like emphasis; for example, the sarcastic "我很好" ("I am fine").
8. Emphasized reading: "+". Indicates reading with emphasis; for example, "你确认您同意张三先生代替您签字吗?" ("Do you confirm that you agree to have Mr. Zhang San sign on your behalf?") converts to "ni3 que4 ren4 nin2 tong2 yi4 zhang1 san1 xian1 sheng1 dai4 ti4 nin2 qian1 zi4 ma", where "张三" (Zhang San) is emphasized character by character when reading.
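The punctuation-style labels above amount to a small symbol table. A sketch of how they might be collected from a labeled sentence; the label names are illustrative:

```python
# Sketch: the punctuation-style emotion labels and their meanings.
# The symbol set follows the list above; the names are illustrative.
EMOTION_TAGS = {
    ",": "short_pause",
    "。": "flat",
    "!": "surprise",
    "?": "question",
    "-": "drawn_out",
    "^": "rhetorical",
    "+": "emphasis",
}

def extract_tags(labeled: str):
    """Return the emotion labels appearing in a labeled sentence, in order."""
    return [EMOTION_TAGS[ch] for ch in labeled if ch in EMOTION_TAGS]

print(extract_tags("ni3 hao3 ma?"))  # -> ['question']
```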
In the embodiment of the application, the labeled text data is converted into numbers according to the following rules:
1) each letter and digit of the pinyin is mapped one-by-one to a number;
2) each phonetic-symbol label is mapped to its own number;
3) each mood label is mapped to its own number;
4) all other symbols are ignored;
5) phonemes are joined by spaces, and the space itself is mapped to a number.
According to these rules, the mapping between labels and numbers is organized as in the following table:
[Table: mapping of pinyin letters, phonetic-symbol labels, and mood labels to numbers; reproduced as images in the original document]
According to the mapping table, the labels converted from the text are mapped into a vector; the vector is then input into the end-to-end neural network for model training, and the training result is recorded.
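Pulling rules 1) through 5) together, the following sketch builds an illustrative symbol-to-number vocabulary and maps a labeled string to an integer vector. The ID assignments are illustrative only, since the patent's actual mapping table appears above only as images:

```python
import re

# Sketch: map a labeled, space-separated string to a vector of integer IDs
# following rules 1)-5) above. All ID values here are illustrative.
PINYIN_CHARS = "abcdefghijklmnopqrstuvwxyz0123456789"
VOCAB = {" ": 0}                                    # rule 5: the space itself gets an ID
VOCAB.update({c: i + 1 for i, c in enumerate(PINYIN_CHARS)})  # rule 1
for sym in ["{K Y UW}", "{HH AH M}"]:               # rule 2: one ID per phonetic-symbol label
    VOCAB[sym] = len(VOCAB)
for tag in [",", "。", "!", "?", "-", "^", "+"]:     # rule 3: one ID per mood label
    VOCAB[tag] = len(VOCAB)

TOKEN = re.compile(r"\{[^}]*\}|.")                  # a brace group, or any single character

def to_ids(labeled: str) -> list:
    """Tokenize into phonetic-symbol groups and single characters;
    symbols outside the vocabulary are ignored (rule 4)."""
    return [VOCAB[t] for t in TOKEN.findall(labeled) if t in VOCAB]

print(to_ids("nin2 de jing1{K Y UW}san1 ma?"))      # -> a vector of integer IDs
```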
In the embodiment of the application, tone-label suggestion and correction can then be performed on newly input text based on the result of the neural network training.
As an implementation manner, the data labeling method for Chinese-English mixing and tone labeling in the embodiment of the present application further includes:
marking the retroflex (erhua) finals of characters in the training text, while leaving the neutral tones of characters unmarked;
and, before the labeled text is mapped into a vector, converting the retroflex final in the tone corresponding to the character into a corresponding number.
The embodiment of the application solves two problems of the traditional end-to-end deep learning network: synthesized speech that is too flat, and the inability to read mixed Chinese and English. By applying the data labeling algorithm to the text, a better speech model can be trained, one that synthesizes speech with a natural rising-and-falling cadence and supports mixed Chinese-English reading within a single model. The output speech thus matches human intuition, the complexity of the neural network is not increased, and network learning is facilitated. The intuitive data annotation method for end-to-end speech synthesis models provided by the embodiment of the application meets the need for basic intonation definition without adding extra complexity.
Fig. 2 is a schematic structural diagram of a data labeling apparatus for Chinese-English mixing and tone labeling provided by an embodiment of the present application. As shown in fig. 2, the apparatus includes:
a capturing unit 20, configured to capture training text from a data source; the training text covers Chinese and English characters;
an adding unit 21, configured to add emotion labels to the captured training text, where the emotion labels include at least one of: short pause, flat tone, surprise, question, drawn-out tone, rhetorical question, and emphasis;
the recording unit 22 is used for recording the speaker's reading of the emotion-labeled training text as the audio file for training;
the checking unit 23 is used for checking whether the audio file for training is consistent with the emotion labels of the corresponding training text, and for triggering the revision unit when they are inconsistent;
a revision unit 24 for revising the portions of the training audio file that are inconsistent with the corresponding training text;
a mapping unit 25, configured to map the training text into a text vector;
and the training unit 26 is used for submitting the text vector and the speaker's recorded audio file to a deep learning engine of a neural network for training, and for learning, through deep learning training, the pronunciation characteristics of text under the various combinations of Chinese, English, and emotion labels.
In this embodiment of the application, the mapping unit 25 is further configured to label the pronunciation of the Chinese characters, numbers, and English characters in the training text, convert the labeled pronunciation of each sentence into a number string, letter by letter, according to the correspondence between letters and calibrated numbers, convert the tone corresponding to each character into a corresponding number, and convert the emotion labels of the sentence into corresponding number identifiers, wherein phonetic labels are marked with set identifiers and the set identifiers are likewise converted into numbers;
the text thus converted into a number string is mapped into a vector.
In the embodiment of the present application, the mapping unit 25 is further configured to mark the retroflex (erhua) finals of characters in the training text, while leaving the neutral tones of characters unmarked;
and, before the labeled text is mapped into a vector, to convert the retroflex final in the tone corresponding to the character into a corresponding number.
In the embodiment of the present application, the revising unit 24 is further configured to:
after the audio file is revised, when it still cannot meet the requirement, the audio file is deleted, or the training text corresponding to the audio file is read aloud again to regenerate the audio file.
It should be understood by those skilled in the art that the functions of each processing unit in the data labeling apparatus of the embodiment of the present application can be understood with reference to the description of the data labeling method for Chinese-English mixing and tone labeling above. Each processing unit in the data labeling apparatus of the embodiment of the present application may be implemented by an analog circuit that realizes the functions described in this embodiment, or by running, on an intelligent device, software that performs the functions described in this embodiment.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application.

Claims (8)

1. A data labeling method for Chinese-English mixing and tone labeling is characterized by comprising the following steps:
capturing a training text from a data source, wherein the training text covers Chinese and English characters;
adding emotion labels to the captured training texts;
recording the speaker's reading of the emotion-labeled training text as an audio file for training;
checking whether the audio file for training is consistent with the emotion labels of the corresponding training text, and revising the inconsistent portions of the audio file;
mapping the training text into a text vector, submitting the text vector and the speaker's recorded audio file to a deep learning engine of a neural network for training, and learning, through deep learning training, the pronunciation characteristics of text under the various combinations of Chinese, English, and emotion labels.
2. The labeling method of claim 1, wherein the mapping the training text to a text vector comprises:
labeling the pronunciation of the Chinese characters, numbers, and English characters in the training text; converting the labeled pronunciation of each sentence into a number string, letter by letter, according to the correspondence between letters and calibrated numbers; converting the tone corresponding to each character into a corresponding number; and converting the emotion label of the sentence into a corresponding number identifier, wherein phonetic labels are marked with set identifiers and the set identifiers are converted into numbers;
the text thus converted into a number string is mapped into a vector.
3. The annotation method of claim 2, further comprising:
marking the retroflex (erhua) finals of characters in the training text, while leaving the neutral tones of characters unmarked;
and, before the labeled text is mapped into a vector, converting the retroflex final in the tone corresponding to the character into a corresponding number.
4. The annotation method of claim 1, wherein after revising the audio file, the method further comprises:
and when the audio file still cannot meet the requirement after revision, deleting the audio file, or reading aloud again the training text corresponding to the audio file to regenerate the audio file.
5. A data labeling device for Chinese-English mixing and tone labeling is characterized in that the device comprises:
the capturing unit is used for capturing the training text from the data source; the training text covers Chinese and English characters;
the adding unit is used for adding emotion labels to the captured training texts;
the recording unit is used for recording the speaker's reading of the emotion-labeled training text as the audio file for training;
the checking unit is used for checking whether the audio file for training is consistent with the emotion labels of the corresponding training text, and for triggering the revision unit when they are inconsistent;
a revision unit for revising the portions of the training audio file that are inconsistent with the corresponding training text;
the mapping unit is used for mapping the training text into a text vector;
and the training unit is used for submitting the text vector and the speaker's recorded audio file to a deep learning engine of a neural network for training, and for learning, through deep learning training, the pronunciation characteristics of text under the various combinations of Chinese, English, and emotion labels.
6. The labeling device of claim 5, wherein the mapping unit is further configured to label the pronunciation of the Chinese characters, numbers, and English characters in the training text, convert the labeled pronunciation of each sentence into a number string, letter by letter, according to the correspondence between letters and calibrated numbers, convert the tone corresponding to each character into a corresponding number, and convert the emotion labels of the sentence into corresponding number identifiers, wherein phonetic labels are marked with set identifiers and the set identifiers are converted into numbers;
the text thus converted into a number string is mapped into a vector.
7. The labeling device of claim 6, wherein the mapping unit is further configured to mark the retroflex (erhua) finals of characters in the training text, while leaving the neutral tones of characters unmarked;
and, before the labeled text is mapped into a vector, to convert the retroflex final in the tone corresponding to the character into a corresponding number.
8. The annotating device of claim 5, wherein the revision unit is further configured to:
after the audio file is revised, when it still cannot meet the requirement, the audio file is deleted, or the training text corresponding to the audio file is read aloud again to regenerate the audio file.
CN201911404092.9A 2019-12-31 2019-12-31 Data labeling method and device for Chinese-English mixing and tone labeling Active CN111145719B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911404092.9A CN111145719B (en) 2019-12-31 2019-12-31 Data labeling method and device for Chinese-English mixing and tone labeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911404092.9A CN111145719B (en) 2019-12-31 2019-12-31 Data labeling method and device for Chinese-English mixing and tone labeling

Publications (2)

Publication Number Publication Date
CN111145719A CN111145719A (en) 2020-05-12
CN111145719B true CN111145719B (en) 2022-04-05

Family

ID=70522293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911404092.9A Active CN111145719B (en) 2019-12-31 2019-12-31 Data labeling method and device for Chinese-English mixing and tone labeling

Country Status (1)

Country Link
CN (1) CN111145719B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110675854B (en) * 2019-08-22 2022-10-28 厦门快商通科技股份有限公司 Chinese and English mixed speech recognition method and device
CN111785249A (en) * 2020-07-10 2020-10-16 恒信东方文化股份有限公司 Training method, device and obtaining method of input phoneme of speech synthesis
CN112634865B (en) * 2020-12-23 2022-10-28 爱驰汽车有限公司 Speech synthesis method, apparatus, computer device and storage medium
CN113838448B (en) * 2021-06-16 2024-03-15 腾讯科技(深圳)有限公司 Speech synthesis method, device, equipment and computer readable storage medium
CN113611286B (en) * 2021-10-08 2022-01-18 之江实验室 Cross-language speech emotion recognition method and system based on common feature extraction

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102385858A (en) * 2010-08-31 2012-03-21 国际商业机器公司 Emotional voice synthesis method and system
CN107103900A (en) * 2017-06-06 2017-08-29 西北师范大学 A kind of across language emotional speech synthesizing method and system
CN109147760A (en) * 2017-06-28 2019-01-04 阿里巴巴集团控股有限公司 Synthesize method, apparatus, system and the equipment of voice
CN109817198A (en) * 2019-03-06 2019-05-28 广州多益网络股份有限公司 Multiple sound training method, phoneme synthesizing method and device for speech synthesis
CN112151005A (en) * 2020-09-28 2020-12-29 四川长虹电器股份有限公司 Chinese and English mixed speech synthesis method and device
CN113012680A (en) * 2021-03-03 2021-06-22 北京太极华保科技股份有限公司 Speech technology synthesis method and device for speech robot
CN113380221A (en) * 2021-06-21 2021-09-10 携程科技(上海)有限公司 Chinese and English mixed speech synthesis method and device, electronic equipment and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102385858A (en) * 2010-08-31 2012-03-21 国际商业机器公司 Emotional voice synthesis method and system
CN107103900A (en) * 2017-06-06 2017-08-29 西北师范大学 A kind of across language emotional speech synthesizing method and system
CN109147760A (en) * 2017-06-28 2019-01-04 阿里巴巴集团控股有限公司 Synthesize method, apparatus, system and the equipment of voice
CN109817198A (en) * 2019-03-06 2019-05-28 广州多益网络股份有限公司 Multiple sound training method, phoneme synthesizing method and device for speech synthesis
CN112151005A (en) * 2020-09-28 2020-12-29 四川长虹电器股份有限公司 Chinese and English mixed speech synthesis method and device
CN113012680A (en) * 2021-03-03 2021-06-22 北京太极华保科技股份有限公司 Speech technology synthesis method and device for speech robot
CN113380221A (en) * 2021-06-21 2021-09-10 携程科技(上海)有限公司 Chinese and English mixed speech synthesis method and device, electronic equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"A HMM-based fuzzy affective model for motional speech synthesis";Yuqiang Qin 等;《2010 2nd International Conference on Signal Processing System》;20100707;全文 *
"情感语音识别与合成的研究";孙颖;《中国优秀硕士学位论文全文数据库(信息科技辑)》;20121015;全文 *
"面向情感语音合成的语言情感建模研究";高莹莹;《中国博士学位论文全文数据库(信息科技辑)》;20161215;全文 *

Also Published As

Publication number Publication date
CN111145719A (en) 2020-05-12

Similar Documents

Publication Publication Date Title
CN111145719B (en) Data labeling method and device for Chinese-English mixing and tone labeling
CN101079301B (en) Time sequence mapping method for text to audio realized by computer
Eyben et al. Unsupervised clustering of emotion and voice styles for expressive TTS
CN109389968B (en) Waveform splicing method, device, equipment and storage medium based on double syllable mixing and lapping
CN111667812A (en) Voice synthesis method, device, equipment and storage medium
CN109241330A (en) The method, apparatus, equipment and medium of key phrase in audio for identification
CN109754783A (en) Method and apparatus for determining the boundary of audio sentence
CN110600002B (en) Voice synthesis method and device and electronic equipment
JP4634889B2 (en) Voice dialogue scenario creation method, apparatus, voice dialogue scenario creation program, recording medium
CN112818089B (en) Text phonetic notation method, electronic equipment and storage medium
CN112365878A (en) Speech synthesis method, device, equipment and computer readable storage medium
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN115002491A (en) Network live broadcast method, device, equipment and storage medium based on intelligent machine
CN112231015A (en) Browser-based operation guidance method, SDK plug-in and background management system
CN116320607A (en) Intelligent video generation method, device, equipment and medium
CN109492126B (en) Intelligent interaction method and device
WO2021169825A1 (en) Speech synthesis method and apparatus, device and storage medium
JP2006236037A (en) Voice interaction content creation method, device, program and recording medium
CN116092472A (en) Speech synthesis method and synthesis system
CN116129868A (en) Method and system for generating structured photo
CN114118068B (en) Method and device for amplifying training text data and electronic equipment
Jones Macsen: A voice assistant for speakers of a lesser resourced language
CN114267325A (en) Method, system, electronic device and storage medium for training speech synthesis model
CN113515586A (en) Data processing method and device
Meron et al. Improving the authoring of foreign language interactive lessons in the tactical language training system.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Dai Jian

Inventor after: Zhou Weidong

Inventor after: Liu Hua

Inventor after: Liu Kai

Inventor after: Yu Ling

Inventor before: Dai Jian

CB03 Change of inventor or designer information
GR01 Patent grant
GR01 Patent grant