CN111145719B - Data labeling method and device for Chinese-English mixing and tone labeling - Google Patents

Data labeling method and device for Chinese-English mixing and tone labeling

Info

Publication number
CN111145719B
CN111145719B (application CN201911404092.9A)
Authority
CN
China
Prior art keywords
training
text
audio file
characters
english
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911404092.9A
Other languages
Chinese (zh)
Other versions
CN111145719A (en)
Inventor
Dai Jian
Zhou Weidong
Liu Hua
Liu Kai
Yu Ling
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Taiji Huabao Technology Co ltd
Original Assignee
Beijing Taiji Huabao Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Taiji Huabao Technology Co ltd filed Critical Beijing Taiji Huabao Technology Co ltd
Priority to CN201911404092.9A priority Critical patent/CN111145719B/en
Publication of CN111145719A publication Critical patent/CN111145719A/en
Application granted granted Critical
Publication of CN111145719B publication Critical patent/CN111145719B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The embodiment of the application discloses a data labeling method and device for Chinese-English mixing and tone labeling, applied to a deep-learning speech synthesis algorithm. The method comprises the following steps: capturing training text, which covers both Chinese and English characters, from a data source; adding emotion labels to the captured training text, and recording the speaker's reading of the labeled text as an audio file for training; checking whether the training audio file is consistent with the emotion labels of the corresponding training text, and revising the inconsistent portions of the audio file; and mapping the training text into a text vector, submitting the text vector and the speaker's recorded audio file to a deep learning engine of a neural network for training, and learning, through deep learning training, the pronunciation characteristics of text under the various combinations of Chinese, English, and emotion labels.

Description

Data labeling method and device for Chinese-English mixing and tone labeling
Technical Field
The embodiment of the application relates to a data labeling method and device for Chinese-English mixing and tone labeling.
Background
Existing speech synthesis technology has greatly improved synthesis quality: lifelike speech can be generated directly from text and applied in fields such as voice navigation, automatic broadcasting, and automatic queue-number calling. However, in current text-to-speech technology the intonation of the output is often flattened; although the speech sounds smooth, it lacks emotional color, and the listening experience is poor. Moreover, traditional speech output technology cannot handle mixed Chinese and English input: when Chinese-English mixed pronunciation is involved, two models are usually invoked for processing, so processing efficiency is low and the speech output effect is poor. This is because, in the conventional text labeling technique, characters are directly converted into pinyin, and the pinyin is converted into vectors that serve as the input of a neural network. With such labeled data, the uniformity of the data preparation makes it essentially impossible to train speech with a natural rising-and-falling cadence.
Disclosure of Invention
In order to solve the above technical problem, embodiments of the present application provide a data labeling method and device for Chinese-English mixing and tone labeling.
The technical scheme of the invention is realized as follows:
the embodiment of the application provides a data labeling method for Chinese-English mixing and tone labeling, which comprises the following steps:
capturing a training text from a data source, wherein the training text covers Chinese and English characters;
adding emotion labels to the captured training text, wherein the emotion labels comprise at least one of: short pause, flat tone, surprise, question, drawn-out tone, rhetorical question, and emphasis;
recording the speaker's reading of the emotion-labeled training text as an audio file for training;
checking whether the audio file for training is consistent with the emotion labels of the corresponding training text, and revising the inconsistent portions of the audio file;
mapping the training text into a text vector, submitting the text vector and the speaker's recorded audio file to a deep learning engine of a neural network for training, and learning, through deep learning training, the pronunciation characteristics of text under the various combinations of Chinese, English, and emotion labels.
As one implementation, mapping the training text into a text vector includes:
labeling the pronunciation of the Chinese characters, numbers, and English characters in the training text; converting the labeled pronunciation of each sentence into a number string, letter by letter, according to the correspondence between letters and calibrated numbers; converting the tone corresponding to each character into a corresponding number; and converting the emotion label of the sentence into a corresponding number identifier, wherein phonetic labels are marked with set identifiers and the set identifiers are likewise converted into numbers;
the text thus converted into a number string is mapped into a vector.
As an implementation, the method further comprises:
marking the retroflex (erhua) finals of characters in the training text, while leaving the neutral tones of characters unmarked;
and, before the labeled text is mapped into a vector, converting the retroflex final in the tone corresponding to the character into a corresponding number.
As one implementation, after revising the audio file, the method further includes:
and when the audio file still cannot meet the requirement after revision, deleting the audio file, or reading aloud again the training text corresponding to the audio file to regenerate the audio file.
A data labeling device for Chinese-English mixing and tone labeling comprises:
the capturing unit is used for capturing the training text from the data source; the training text covers Chinese and English characters;
the adding unit is used for adding emotion labels to the captured training text, wherein the emotion labels comprise at least one of: short pause, flat tone, surprise, question, drawn-out tone, rhetorical question, and emphasis;
the recording unit is used for recording the speaker's reading of the emotion-labeled training text as the audio file for training;
the checking unit is used for checking whether the audio file for training is consistent with the emotion labels of the corresponding training text, and for triggering the revision unit when they are inconsistent;
a revision unit for revising the portions of the training audio file that are inconsistent with the corresponding training text;
the mapping unit is used for mapping the training text into a text vector;
and the training unit is used for submitting the text vector and the speaker's recorded audio file to a deep learning engine of a neural network for training, and for learning, through deep learning training, the pronunciation characteristics of text under the various combinations of Chinese, English, and emotion labels.
As an implementation manner, the mapping unit is further configured to label the pronunciation of the Chinese characters, numbers, and English characters in the training text, convert the labeled pronunciation of each sentence into a number string, letter by letter, according to the correspondence between letters and calibrated numbers, convert the tone corresponding to each character into a corresponding number, and convert the emotion labels of the sentence into corresponding number identifiers, wherein phonetic labels are marked with set identifiers and the set identifiers are likewise converted into numbers;
the text thus converted into a number string is mapped into a vector.
As an implementation manner, the mapping unit is further configured to mark the retroflex (erhua) finals of characters in the training text, while leaving the neutral tones of characters unmarked;
and, before the labeled text is mapped into a vector, to convert the retroflex final in the tone corresponding to the character into a corresponding number.
As an implementation, the revision unit is further configured to:
after the audio file is revised, when it still cannot meet the requirement, the audio file is deleted, or the training text corresponding to the audio file is read aloud again to regenerate the audio file.
Compared with the prior art, the technical scheme of the embodiment of the application has the following advantages:
the embodiment of the application can solve the problems that the synthesized voice is too flat and Chinese and English can not be read in a mixed mode in the traditional end-to-end deep learning network, can train a better voice model by carrying out a data labeling algorithm on the text, synthesizes a voice effect of inhibiting the rising and the falling, and can support the Chinese and English mixed reading in a model, so that the output voice is consistent with the intuition of people, the complexity of a neural network is not increased, and the network learning is facilitated. The intuitive data annotation method for the end-to-end speech synthesis model provided by the embodiment of the application can meet the requirement of basic intonation definition without increasing extra complexity.
Drawings
FIG. 1 is a schematic flow chart illustrating a data labeling method for Chinese-English mixing and tone labeling according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram illustrating a data labeling apparatus for Chinese-English mixing and tone labeling.
Detailed Description
The embodiments described in the present application can be combined with one another provided they do not conflict.
The technical solution of the present application will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a data labeling method for Chinese-English mixing and tone labeling provided by an embodiment of the present application. As shown in fig. 1, the method includes the following steps:
step 101, capturing a training text from a data source, wherein the training text covers Chinese and English characters.
In the embodiment of the application, the training text can be obtained from a data training library. The data source may be various web pages on the network, such as text from Baidu encyclopedia, or text from textbooks, magazines, and the like. The embodiment of the application captures training text containing both Chinese and English characters from the data source.
And 102, adding emotion labels to the captured training text, wherein the emotion labels comprise at least one of: short pause, flat tone, surprise, question, drawn-out tone, rhetorical question, and emphasis.
In the embodiment of the application, emotion labels such as short pause, flat tone, surprise, question, drawn-out tone, rhetorical question, and emphasis need to be added to the training text. Determining the mood of a sentence from punctuation marks alone is too coarse, because the mood is often strongly tied to the semantics of the context: the same sentence carries very different moods in different scenarios, and the differences are obvious when it is read intensely, flatly, or sarcastically. The mood of a sentence is therefore determined from semantic analysis of the context, the punctuation marks, the characters, and the positions of the characters within the sentence, so that emotion labels can be added to the training text more accurately.
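As a minimal illustration of the punctuation-based first pass described above (which, as just noted, is too coarse on its own and would be refined from context semantics), the following Python sketch assigns a default mood label from the sentence-final mark. The function and label names are illustrative assumptions, not the patent's implementation:

```python
# Sketch: assign a coarse default emotion label from sentence-final punctuation.
# As the text above notes, this is only a first pass; context semantics
# (speaker intent, surrounding sentences) must refine the label.
DEFAULT_MOOD = {
    "。": "flat",        # level reading
    "!": "surprise",
    "?": "question",
    "^": "rhetorical",
}

def default_mood(sentence: str) -> str:
    """Return a coarse mood label based on the final punctuation mark."""
    return DEFAULT_MOOD.get(sentence[-1], "flat") if sentence else "flat"

print(default_mood("你确认吗?"))  # -> "question"
print(default_mood("我很好^"))    # -> "rhetorical"
```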
And 103, recording the speaker's reading of the emotion-labeled training text as the audio file for training.
In the embodiment of the application, after the training text is captured, the speaker reads the text aloud according to the emotion labels, and the reading is recorded and stored as the audio file for training.
And 104, checking whether the training audio file is consistent with the emotion labels of the corresponding training text, and revising the inconsistent portions of the audio file.
In the embodiment of the application, the audio file produced by the reading needs to be checked, and any portion that does not meet the requirements is revised; if the audio file still cannot meet the requirements after revision, it is deleted, or the training text corresponding to the audio file is read aloud again to regenerate the audio file.
And 105, mapping the training text into a text vector, submitting the text vector and the speaker's recorded audio file to a deep learning engine of a neural network for training, and learning, through deep learning training, the pronunciation characteristics of text under the various combinations of Chinese, English, and emotion labels.
How to map the training text into a text vector is described in detail below.
In the embodiment of the application, the pronunciation of each character, word, numeral, English character, and so on in the training text is labeled. Chinese characters are converted directly into the pinyin of the corresponding character, with tone marks appended: tones 1-4 are written as the digits 1-4 after the pinyin, while the neutral tone is left unmarked. For example, "小数点儿" (decimal point) converts to the pinyin "xiao3 shu4 dian3 er", and "点击" (click) converts to "dian3 ji1".
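A minimal Python sketch of this character-to-pinyin conversion, using a tiny hand-built lexicon covering only the two examples above; a real system would use a full pronunciation dictionary, and the lexicon entries and function name here are illustrative:

```python
# Sketch of Chinese-character-to-numbered-pinyin conversion.
# The lexicon below is a tiny illustrative sample; a production system
# would use a full pronunciation dictionary.
LEXICON = {
    "小": "xiao3",
    "数": "shu4",
    "点": "dian3",
    "击": "ji1",
    "儿": "er",   # erhua / neutral tone: no tone digit appended
}

def to_pinyin(text: str) -> str:
    """Convert a run of Chinese characters to space-separated pinyin
    with tone digits 1-4 appended; neutral tones carry no digit."""
    return " ".join(LEXICON[ch] for ch in text)

print(to_pinyin("小数点儿"))  # -> "xiao3 shu4 dian3 er"
print(to_pinyin("点击"))      # -> "dian3 ji1"
```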
English is labeled with phonetic symbols using the CMU phoneme set, as shown in the following table:
[Table: CMU phonetic symbol set; reproduced as images in the original document]
Sounds labeled with CMU phonetic symbols are enclosed in "{ }"; for example, the English letter Q converts to "{K Y UW}".
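A sketch of this bracketing step; only the entry for Q is taken from the text above, and the other letter pronunciations are assumed ARPAbet spellings added for illustration:

```python
# Sketch: spell out English letters as CMU (ARPAbet) phonemes in braces.
# "Q" -> "{K Y UW}" is taken from the example above; the remaining
# entries are assumed ARPAbet letter names, added for illustration.
LETTER_PHONES = {
    "Q": "K Y UW",
    "A": "EY",      # assumed
    "B": "B IY",    # assumed
    "M": "EH M",    # assumed
}

def english_to_cmu(token: str) -> str:
    """Wrap the CMU phonemes of each English letter in '{ }'."""
    return "".join("{" + LETTER_PHONES[c] + "}" for c in token.upper())

print(english_to_cmu("Q"))   # -> "{K Y UW}"
print(english_to_cmu("AB"))  # -> "{EY}{B IY}"
```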
For example, the sentence "您的京Q3HM21的丰田汽车是您或您的家人在开吗?" ("Is the Toyota with license plate Jing Q3HM21 being driven by you or your family?") converts to:
"nin2 de jing1{K Y UW}san1{HH AH M}er4 yi1 de feng1 tian2 qi4 che1 shi4 nin2 huo4 nin2 de jia1 ren2 zai4 kai1 ma?".
In the embodiment of the present application, the labels include:
1. Retroflex (erhua) final: for example, the converted pinyin of "小数点儿" (decimal point) is "xiao3 shu4 dian3 er".
2. Short pause: ",". Indicates a brief pause in the reading.
3. Flat reading: "。". Indicates normal, level reading.
4. Surprised reading: "!". Indicates that the sentence should be read with surprise.
5. Questioning reading: "?". Indicates that the sentence should be read in a questioning tone.
6. Drawn-out reading: "-". Indicates that the marked word should be prolonged; for example, in "请问您是-" ("Excuse me, you are-"), the final "是" is drawn out.
7. Rhetorical reading: "^". Indicates that the sentence should be read in the tone of a rhetorical question, generally with heavy, third-tone-like emphasis; for example, the sarcastic "我很好" ("I am fine").
8. Emphasized reading: "+". Indicates reading with emphasis; for example, "你确认您同意张三先生代替您签字吗?" ("Do you confirm that you agree to have Mr. Zhang San sign on your behalf?") converts to "ni3 que4 ren4 nin2 tong2 yi4 zhang1 san1 xian1 sheng1 dai4 ti4 nin2 qian1 zi4 ma", where "张三" (Zhang San) is emphasized character by character when reading.
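The punctuation-style labels above amount to a small symbol table. A sketch of how they might be collected from a labeled sentence; the label names are illustrative:

```python
# Sketch: the punctuation-style emotion labels and their meanings.
# The symbol set follows the list above; the names are illustrative.
EMOTION_TAGS = {
    ",": "short_pause",
    "。": "flat",
    "!": "surprise",
    "?": "question",
    "-": "drawn_out",
    "^": "rhetorical",
    "+": "emphasis",
}

def extract_tags(labeled: str):
    """Return the emotion labels appearing in a labeled sentence, in order."""
    return [EMOTION_TAGS[ch] for ch in labeled if ch in EMOTION_TAGS]

print(extract_tags("ni3 hao3 ma?"))  # -> ['question']
```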
In the embodiment of the application, the labeled text data is converted into numbers according to the following rules:
1) each letter and digit of the pinyin is mapped one-by-one to a number;
2) each phonetic-symbol label is mapped to its own number;
3) each mood label is mapped to its own number;
4) all other symbols are ignored;
5) phonemes are joined by spaces, and the space itself is mapped to a number.
According to these rules, the mapping between labels and numbers is organized as in the following table:
[Table: mapping of pinyin letters, phonetic-symbol labels, and mood labels to numbers; reproduced as images in the original document]
According to the mapping table, the labels converted from the text are mapped into a vector; the vector is then input into the end-to-end neural network for model training, and the training result is recorded.
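Pulling rules 1) through 5) together, the following sketch builds an illustrative symbol-to-number vocabulary and maps a labeled string to an integer vector. The ID assignments are illustrative only, since the patent's actual mapping table appears above only as images:

```python
import re

# Sketch: map a labeled, space-separated string to a vector of integer IDs
# following rules 1)-5) above. All ID values here are illustrative.
PINYIN_CHARS = "abcdefghijklmnopqrstuvwxyz0123456789"
VOCAB = {" ": 0}                                    # rule 5: the space itself gets an ID
VOCAB.update({c: i + 1 for i, c in enumerate(PINYIN_CHARS)})  # rule 1
for sym in ["{K Y UW}", "{HH AH M}"]:               # rule 2: one ID per phonetic-symbol label
    VOCAB[sym] = len(VOCAB)
for tag in [",", "。", "!", "?", "-", "^", "+"]:     # rule 3: one ID per mood label
    VOCAB[tag] = len(VOCAB)

TOKEN = re.compile(r"\{[^}]*\}|.")                  # a brace group, or any single character

def to_ids(labeled: str) -> list:
    """Tokenize into phonetic-symbol groups and single characters;
    symbols outside the vocabulary are ignored (rule 4)."""
    return [VOCAB[t] for t in TOKEN.findall(labeled) if t in VOCAB]

print(to_ids("nin2 de jing1{K Y UW}san1 ma?"))      # -> a vector of integer IDs
```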
In the embodiment of the application, tone-label suggestion and correction can then be performed on newly input text based on the result of the neural network training.
As an implementation manner, the data labeling method for Chinese-English mixing and tone labeling in the embodiment of the present application further includes:
marking the retroflex (erhua) finals of characters in the training text, while leaving the neutral tones of characters unmarked;
and, before the labeled text is mapped into a vector, converting the retroflex final in the tone corresponding to the character into a corresponding number.
The embodiment of the application solves two problems of the traditional end-to-end deep learning network: synthesized speech that is too flat, and the inability to read mixed Chinese and English. By applying the data labeling algorithm to the text, a better speech model can be trained, one that synthesizes speech with a natural rising-and-falling cadence and supports mixed Chinese-English reading within a single model. The output speech thus matches human intuition, the complexity of the neural network is not increased, and network learning is facilitated. The intuitive data annotation method for end-to-end speech synthesis models provided by the embodiment of the application meets the need for basic intonation definition without adding extra complexity.
Fig. 2 is a schematic structural diagram of a data labeling apparatus for Chinese-English mixing and tone labeling provided by an embodiment of the present application. As shown in fig. 2, the apparatus includes:
a capturing unit 20, configured to capture training text from a data source; the training text covers Chinese and English characters;
an adding unit 21, configured to add emotion labels to the captured training text, where the emotion labels include at least one of: short pause, flat tone, surprise, question, drawn-out tone, rhetorical question, and emphasis;
the recording unit 22 is used for recording the speaker's reading of the emotion-labeled training text as the audio file for training;
the checking unit 23 is used for checking whether the audio file for training is consistent with the emotion labels of the corresponding training text, and for triggering the revision unit when they are inconsistent;
a revision unit 24 for revising the portions of the training audio file that are inconsistent with the corresponding training text;
a mapping unit 25, configured to map the training text into a text vector;
and the training unit 26 is used for submitting the text vector and the speaker's recorded audio file to a deep learning engine of a neural network for training, and for learning, through deep learning training, the pronunciation characteristics of text under the various combinations of Chinese, English, and emotion labels.
In this embodiment of the application, the mapping unit 25 is further configured to label the pronunciation of the Chinese characters, numbers, and English characters in the training text, convert the labeled pronunciation of each sentence into a number string, letter by letter, according to the correspondence between letters and calibrated numbers, convert the tone corresponding to each character into a corresponding number, and convert the emotion labels of the sentence into corresponding number identifiers, wherein phonetic labels are marked with set identifiers and the set identifiers are likewise converted into numbers;
the text thus converted into a number string is mapped into a vector.
In the embodiment of the present application, the mapping unit 25 is further configured to mark the retroflex (erhua) finals of characters in the training text, while leaving the neutral tones of characters unmarked;
and, before the labeled text is mapped into a vector, to convert the retroflex final in the tone corresponding to the character into a corresponding number.
In the embodiment of the present application, the revising unit 24 is further configured to:
after the audio file is revised, when it still cannot meet the requirement, the audio file is deleted, or the training text corresponding to the audio file is read aloud again to regenerate the audio file.
It should be understood by those skilled in the art that the functions of each processing unit in the data labeling apparatus of the embodiment of the present application can be understood with reference to the description of the data labeling method for Chinese-English mixing and tone labeling above. Each processing unit in the data labeling apparatus of the embodiment of the present application may be implemented by an analog circuit that realizes the functions described in this embodiment, or by running, on an intelligent device, software that performs the functions described in this embodiment.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application.

Claims (8)

1. A data labeling method for Chinese-English mixing and tone labeling is characterized by comprising the following steps:
capturing a training text from a data source, wherein the training text covers Chinese and English characters;
adding emotion labels to the captured training texts;
recording the speaker's reading of the emotion-labeled training text as an audio file for training;
checking whether the audio file for training is consistent with the emotion labels of the corresponding training text, and revising the inconsistent portions of the audio file;
mapping the training text into a text vector, submitting the text vector and the speaker's recorded audio file to a deep learning engine of a neural network for training, and learning, through deep learning training, the pronunciation characteristics of text under the various combinations of Chinese, English, and emotion labels.
2. The labeling method of claim 1, wherein the mapping the training text to a text vector comprises:
labeling the pronunciation of the Chinese characters, numbers, and English characters in the training text; converting the labeled pronunciation of each sentence into a number string, letter by letter, according to the correspondence between letters and calibrated numbers; converting the tone corresponding to each character into a corresponding number; and converting the emotion label of the sentence into a corresponding number identifier, wherein phonetic labels are marked with set identifiers and the set identifiers are converted into numbers;
the text thus converted into a number string is mapped into a vector.
3. The annotation method of claim 2, further comprising:
marking the retroflex (erhua) finals of characters in the training text, while leaving the neutral tones of characters unmarked;
and, before the labeled text is mapped into a vector, converting the retroflex final in the tone corresponding to the character into a corresponding number.
4. The annotation method of claim 1, wherein after revising the audio file, the method further comprises:
and when the audio file still cannot meet the requirement after revision, deleting the audio file, or reading aloud again the training text corresponding to the audio file to regenerate the audio file.
5. A data labeling device for Chinese-English mixing and tone labeling is characterized in that the device comprises:
the capturing unit is used for capturing the training text from the data source; the training text covers Chinese and English characters;
the adding unit is used for adding emotion labels to the captured training texts;
the recording unit is used for recording the speaker's reading of the emotion-labeled training text as the audio file for training;
the checking unit is used for checking whether the audio file for training is consistent with the emotion labels of the corresponding training text, and for triggering the revision unit when they are inconsistent;
a revision unit for revising the portions of the training audio file that are inconsistent with the corresponding training text;
the mapping unit is used for mapping the training text into a text vector;
and the training unit is used for submitting the text vector and the speaker's recorded audio file to a deep learning engine of a neural network for training, and for learning, through deep learning training, the pronunciation characteristics of text under the various combinations of Chinese, English, and emotion labels.
6. The labeling device of claim 5, wherein the mapping unit is further configured to label the pronunciation of the Chinese characters, numbers, and English characters in the training text, convert the labeled pronunciation of each sentence into a number string, letter by letter, according to the correspondence between letters and calibrated numbers, convert the tone corresponding to each character into a corresponding number, and convert the emotion labels of the sentence into corresponding number identifiers, wherein phonetic labels are marked with set identifiers and the set identifiers are converted into numbers;
the text thus converted into a number string is mapped into a vector.
7. The labeling device of claim 6, wherein the mapping unit is further configured to mark the retroflex (erhua) finals of characters in the training text, while leaving the neutral tones of characters unmarked;
and, before the labeled text is mapped into a vector, to convert the retroflex final in the tone corresponding to the character into a corresponding number.
8. The annotating device of claim 5, wherein the revision unit is further configured to:
after the audio file is revised, when it still cannot meet the requirement, the audio file is deleted, or the training text corresponding to the audio file is read aloud again to regenerate the audio file.
CN201911404092.9A 2019-12-31 2019-12-31 Data labeling method and device for Chinese-English mixing and tone labeling Active CN111145719B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911404092.9A CN111145719B (en) 2019-12-31 2019-12-31 Data labeling method and device for Chinese-English mixing and tone labeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911404092.9A CN111145719B (en) 2019-12-31 2019-12-31 Data labeling method and device for Chinese-English mixing and tone labeling

Publications (2)

Publication Number Publication Date
CN111145719A CN111145719A (en) 2020-05-12
CN111145719B true CN111145719B (en) 2022-04-05

Family

ID=70522293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911404092.9A Active CN111145719B (en) 2019-12-31 2019-12-31 Data labeling method and device for Chinese-English mixing and tone labeling

Country Status (1)

Country Link
CN (1) CN111145719B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110675854B (en) * 2019-08-22 2022-10-28 厦门快商通科技股份有限公司 Chinese and English mixed speech recognition method and device
CN111785249A (en) * 2020-07-10 2020-10-16 恒信东方文化股份有限公司 Training method, device and obtaining method of input phoneme of speech synthesis
CN112634865B (en) * 2020-12-23 2022-10-28 爱驰汽车有限公司 Speech synthesis method, apparatus, computer device and storage medium
CN113838448B (en) * 2021-06-16 2024-03-15 腾讯科技(深圳)有限公司 Speech synthesis method, device, equipment and computer readable storage medium
CN113611286B (en) * 2021-10-08 2022-01-18 之江实验室 Cross-language speech emotion recognition method and system based on common feature extraction

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102385858A (en) * 2010-08-31 2012-03-21 国际商业机器公司 Emotional voice synthesis method and system
CN107103900A (en) * 2017-06-06 2017-08-29 西北师范大学 A kind of across language emotional speech synthesizing method and system
CN109147760A (en) * 2017-06-28 2019-01-04 阿里巴巴集团控股有限公司 Synthesize method, apparatus, system and the equipment of voice
CN109817198A (en) * 2019-03-06 2019-05-28 广州多益网络股份有限公司 Multiple sound training method, phoneme synthesizing method and device for speech synthesis
CN112151005A (en) * 2020-09-28 2020-12-29 四川长虹电器股份有限公司 Chinese and English mixed speech synthesis method and device
CN113012680A (en) * 2021-03-03 2021-06-22 北京太极华保科技股份有限公司 Speech technology synthesis method and device for speech robot
CN113380221A (en) * 2021-06-21 2021-09-10 携程科技(上海)有限公司 Chinese and English mixed speech synthesis method and device, electronic equipment and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102385858A (en) * 2010-08-31 2012-03-21 国际商业机器公司 Emotional voice synthesis method and system
CN107103900A (en) * 2017-06-06 2017-08-29 西北师范大学 A kind of across language emotional speech synthesizing method and system
CN109147760A (en) * 2017-06-28 2019-01-04 阿里巴巴集团控股有限公司 Synthesize method, apparatus, system and the equipment of voice
CN109817198A (en) * 2019-03-06 2019-05-28 广州多益网络股份有限公司 Multiple sound training method, phoneme synthesizing method and device for speech synthesis
CN112151005A (en) * 2020-09-28 2020-12-29 四川长虹电器股份有限公司 Chinese and English mixed speech synthesis method and device
CN113012680A (en) * 2021-03-03 2021-06-22 北京太极华保科技股份有限公司 Speech technology synthesis method and device for speech robot
CN113380221A (en) * 2021-06-21 2021-09-10 携程科技(上海)有限公司 Chinese and English mixed speech synthesis method and device, electronic equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"A HMM-based fuzzy affective model for motional speech synthesis";Yuqiang Qin 等;《2010 2nd International Conference on Signal Processing System》;20100707;全文 *
"情感语音识别与合成的研究";孙颖;《中国优秀硕士学位论文全文数据库(信息科技辑)》;20121015;全文 *
"面向情感语音合成的语言情感建模研究";高莹莹;《中国博士学位论文全文数据库(信息科技辑)》;20161215;全文 *

Also Published As

Publication number Publication date
CN111145719A (en) 2020-05-12

Similar Documents

Publication Publication Date Title
CN111145719B (en) Data labeling method and device for Chinese-English mixing and tone labeling
CN101079301B (en) Time sequence mapping method for text to audio realized by computer
Eyben et al. Unsupervised clustering of emotion and voice styles for expressive TTS
CN109389968B (en) Waveform splicing method, device, equipment and storage medium based on double syllable mixing and lapping
CN111667812A (en) Voice synthesis method, device, equipment and storage medium
CN109241330A (en) The method, apparatus, equipment and medium of key phrase in audio for identification
CN109754783A (en) Method and apparatus for determining the boundary of audio sentence
CN110600002B (en) Voice synthesis method and device and electronic equipment
JP4634889B2 (en) Voice dialogue scenario creation method, apparatus, voice dialogue scenario creation program, recording medium
CN112818089B (en) Text phonetic notation method, electronic equipment and storage medium
CN112365878A (en) Speech synthesis method, device, equipment and computer readable storage medium
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN115002491A (en) Network live broadcast method, device, equipment and storage medium based on intelligent machine
CN112231015A (en) Browser-based operation guidance method, SDK plug-in and background management system
CN116320607A (en) Intelligent video generation method, device, equipment and medium
CN109492126B (en) Intelligent interaction method and device
WO2021169825A1 (en) Speech synthesis method and apparatus, device and storage medium
JP2006236037A (en) Voice interaction content creation method, device, program and recording medium
CN116092472A (en) Speech synthesis method and synthesis system
CN116129868A (en) Method and system for generating structured photo
CN114118068B (en) Method and device for amplifying training text data and electronic equipment
Jones Macsen: A voice assistant for speakers of a lesser resourced language
CN114267325A (en) Method, system, electronic device and storage medium for training speech synthesis model
CN113515586A (en) Data processing method and device
Meron et al. Improving the authoring of foreign language interactive lessons in the tactical language training system.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Dai Jian

Inventor after: Zhou Weidong

Inventor after: Liu Hua

Inventor after: Liu Kai

Inventor after: Yu Ling

Inventor before: Dai Jian

CB03 Change of inventor or designer information
GR01 Patent grant
GR01 Patent grant