Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
To explain the technical solutions of the present invention, the following description is given by way of specific examples.
To achieve the above technical purpose, as shown in fig. 1, an embodiment of the present invention provides an automatic speech synthesis method suitable for a virtual anchor in e-commerce live broadcasting, including:
step 1, processing Chinese data to obtain Chinese audio and a Chinese phoneme library.
The Chinese data can be drawn from an open-source Chinese corpus: the audio data can be used directly, or the text data can be used and the corresponding audio recorded from it. Note that the audio must be produced by a single speaker, and the different corpora should be uttered as uniformly as possible.
In some embodiments, step 1, processing the Chinese data to obtain the Chinese audio and the Chinese phoneme library, as shown in fig. 2, includes:
step 11, processing the Chinese data to obtain Chinese audio and subtitles;
step 12, converting the subtitles into pinyin and converting the pinyin into corresponding Chinese phonemes;
and step 13, collecting and storing all the Chinese phonemes to serve as a Chinese phoneme library.
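Steps 11 to 13 can be sketched as follows. The character-to-pinyin table and the initial/final split rules below are simplified, hypothetical stand-ins (a real implementation would use a full pinyin converter), not the embodiment's actual tooling.

```python
# Sketch: subtitles -> pinyin -> Chinese phonemes -> phoneme library.

# Hypothetical character -> pinyin-with-tone lookup (illustration only).
PINYIN = {"你": "ni3", "好": "hao3"}

# Simplified initial consonants; multi-letter initials listed first so they
# are matched before their single-letter prefixes.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def pinyin_to_phonemes(syllable):
    """Split one pinyin syllable into an initial and a tone-carrying final."""
    tone = syllable[-1] if syllable[-1].isdigit() else ""
    base = syllable.rstrip("0123456789")
    for ini in INITIALS:
        if base.startswith(ini):
            return [ini, base[len(ini):] + tone]
    return [base + tone]  # syllable with no initial

def subtitle_to_phonemes(text):
    phones = []
    for ch in text:
        phones.extend(pinyin_to_phonemes(PINYIN[ch]))
    return phones

# Step 13: collect all phonemes seen so far as the phoneme library.
phoneme_library = sorted(set(subtitle_to_phonemes("你好")))
print(subtitle_to_phonemes("你好"))  # ['n', 'i3', 'h', 'ao3']
print(phoneme_library)
```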
Step 2, processing English words to obtain English word audio, and obtaining the Chinese phoneme labels corresponding to the English words according to the Chinese phoneme library.
The English words can be derived from an English word list, for example the CET-6 or postgraduate entrance examination vocabulary, comprising about 6000 words or more; the corresponding audio data is recorded according to the word list. As before, the audio must be produced by a single speaker, and the different corpora should be uttered as uniformly as possible.
In some embodiments, step 2, processing the English words to obtain the English word audio, and obtaining the Chinese phoneme labels corresponding to the English words according to the Chinese phoneme library, as shown in fig. 3, includes:
step 21, mapping each international phoneme to the phoneme with the closest pronunciation in the Chinese phoneme library, to obtain a phoneme alignment dictionary;
step 22, converting the English words into English phonetic symbols, and converting the phonetic symbols into the corresponding international phoneme labels of the words;
step 23, converting the international phoneme labels of the words into the corresponding Chinese phoneme labels according to the phoneme alignment dictionary.
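Steps 21 to 23 can be sketched as follows. The alignment entries and the word-to-IPA lookup here are hypothetical one-word illustrations, not the embodiment's actual dictionaries.

```python
# Sketch: international (IPA) phonemes aligned to the closest-sounding
# Chinese phonemes, then a word's IPA label rewritten as a Chinese label.

# Hypothetical phoneme alignment dictionary (step 21).
ALIGN = {"h": "h", "ə": "e", "l": "l", "oʊ": "ou"}

# Hypothetical word -> IPA phoneme label lookup (step 22).
WORD_IPA = {"hello": ["h", "ə", "l", "oʊ"]}

def word_to_chinese_label(word):
    """Step 23: replace each IPA phoneme by its aligned Chinese phoneme."""
    return [ALIGN[p] for p in WORD_IPA[word]]

print(word_to_chinese_label("hello"))  # ['h', 'e', 'l', 'ou']
```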
Step 3, processing English letters to obtain letter pronunciation audio and the Chinese phoneme labels corresponding to the letters.
The English letters are the 26 letters of the alphabet.
In some embodiments, step 3, processing the English letters to obtain the letter pronunciation audio and the Chinese phoneme labels corresponding to the letters, as shown in fig. 4, includes:
step 31, acquiring the Chinese phoneme label corresponding to each English letter according to the Chinese phoneme library;
step 32, when the Chinese phoneme label corresponding to an English letter is similar to the pronunciation of a common Chinese word (for example, the pronunciation of 'G' is very close to that of a common Chinese word), the Chinese phoneme label corresponding to that letter is specially processed, for example by retaining it as an international phoneme, or by replacing the corresponding Chinese phoneme label with a custom phoneme;
step 33, obtaining letter pronunciation text: the 26 letters are randomly combined into letter lists whose lengths are drawn randomly from 1 to 40, 300 such lists are generated, and pronunciation audio is obtained for each list. In this embodiment, to better delineate the boundaries of letter pronunciation, each letter in a list is selected with a probability of one third, and a letter identical to the selected letter is inserted after it; for example, if B is selected in [A, B, C], the letter list becomes [A, B, B, C].
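The letter-list generation of step 33 can be sketched as follows. The reading that each letter is independently selected with probability one third is an assumption about the embodiment, and the fixed seed is only for reproducibility.

```python
import random

LETTERS = [chr(c) for c in range(ord("A"), ord("Z") + 1)]

def make_letter_lists(n_lists=300, max_len=40, dup_prob=1 / 3, seed=0):
    """Generate random letter lists; with probability dup_prob a letter is
    doubled (e.g. [A, B, C] -> [A, B, B, C]) to sharpen the pronunciation
    boundaries between letters."""
    rng = random.Random(seed)
    lists = []
    for _ in range(n_lists):
        seq = [rng.choice(LETTERS) for _ in range(rng.randint(1, max_len))]
        out = []
        for letter in seq:
            out.append(letter)
            if rng.random() < dup_prob:
                out.append(letter)  # insert an identical letter after it
        lists.append(out)
    return lists

lists = make_letter_lists()
print(len(lists), min(map(len, lists)), max(map(len, lists)))
```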
English letters appear more frequently in actual text than English words do, so the requirement on letter pronunciation quality is higher; the above processing effectively improves the pronunciation quality of the letters.
Step 4, taking the Chinese audio, the Chinese phoneme library, the English word audio, the Chinese phoneme labels corresponding to the English words, the letter pronunciation audio, and the Chinese phoneme labels corresponding to the letters as a mixed corpus for model training to obtain a speech model;
in general, the speech model mainly includes two parts, a synthesizer and a vocoder, in this embodiment, a cocotron is used as the synthesizer, and mel frequency spectrum features are used as output labels of the cocotron model, the mel channel is 80, the Hop _ size of the cocotron model is 256, and the win _ size is 1024.
The vocoder adopts a hifi-gan model, and a pre-trained model is used, i.e., the vocoder is pre-trained on large-scale data in advance.
Specifically, as shown in fig. 5:
extracting the mel spectrum features of the Chinese audio, the English word audio, and the letter pronunciation audio to serve as the output labels of the Tacotron model;
taking the Chinese phoneme library, the Chinese phoneme labels corresponding to the English words, and the Chinese phoneme labels corresponding to the letters as the input of the Tacotron model to fit the mel spectrum features of the corresponding audio;
and inputting the mel spectrum features generated by Tacotron into the pre-trained hifi-gan model to generate audio, and testing the model effect.
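The mel-target parameters above imply a fixed relation between audio length and the number of synthesizer output frames. The sketch below only checks that arithmetic; the 22050 Hz sample rate is an assumption (the embodiment does not state one), and a library such as librosa would perform the actual mel extraction with these parameters.

```python
# Frame math for the synthesizer's mel targets (80 mel channels,
# hop_size 256, win_size 1024, as stated above).

SAMPLE_RATE = 22050  # assumed; not specified in the embodiment
N_MELS, HOP_SIZE, WIN_SIZE = 80, 256, 1024

def n_mel_frames(n_samples, hop=HOP_SIZE):
    # With center-padding (the common default in mel extractors),
    # there is one frame per hop plus one.
    return n_samples // hop + 1

print(n_mel_frames(SAMPLE_RATE))  # frames per second of audio
```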
Step 5, preprocessing the corpus to be synthesized to obtain the corresponding Chinese phoneme labels, inputting the Chinese phoneme labels into the speech model, and generating the audio file corresponding to the text to complete the speech synthesis.
In some embodiments, preprocessing the corpus to be synthesized to obtain the corresponding Chinese phoneme labels, inputting the Chinese phoneme labels into the speech model, and generating the audio file corresponding to the text to complete the speech synthesis, as shown in fig. 6, includes:
step 51, performing text normalization with regular expressions, for example converting special characters such as Arabic numerals, telephone numbers, unit names, and operation symbols into Chinese;
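Step 51 can be sketched as a couple of regular-expression rules. The digit table and the two operator replacements below are illustrative only, far from the full normalization rule set of the embodiment.

```python
import re

# Digit -> Chinese character table.
DIGITS = "零一二三四五六七八九"

def digits_to_chinese(m):
    # Simplification: digit strings are read digit by digit; a full
    # normalizer would read e.g. 21 as 二十一.
    return "".join(DIGITS[int(d)] for d in m.group())

def normalize(text):
    text = text.replace("+", "加").replace("=", "等于")  # operation symbols
    text = re.sub(r"\d+", digits_to_chinese, text)       # Arabic numerals
    return text

print(normalize("1+1=2"))  # 一加一等于二
```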
step 52, performing cutting processing on long sentences exceeding a preset standard to obtain short text sentences, which may specifically be:
predicting Chinese prosody through a pre-trained sequence labeling model, which can be a bert-bilstm-crf model; the output prosody levels are 0, 1, 2, and 3, and the higher the level, the longer the pause after the character;
uniformly setting the output labels of English characters to 0 and those of spaces to 3, and setting a threshold: if a sentence contains more characters than the threshold and does not end with a letter, the sentence is cut; in this embodiment, the length threshold is set to 22;
the cutting rule is as follows: scanning the sentence from back to front, the position with the largest prosody label level is the cut point, and a letter cannot be a cut point;
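The cutting rule of step 52 can be sketched as follows. The per-character prosody levels are supplied by hand here, whereas in the embodiment they come from the pre-trained sequence labeling model.

```python
THRESHOLD = 22  # length threshold from the embodiment

def effective_level(ch, prosody):
    if ch.isascii() and ch.isalpha():
        return -1   # letters are never cut points (-1 encodes the exclusion)
    if ch == " ":
        return 3    # spaces are cut preferentially
    return prosody

def cut(text, prosody_levels):
    """Cut once at the highest-level position, scanning back to front."""
    if len(text) <= THRESHOLD:
        return [text]
    levels = [effective_level(c, p) for c, p in zip(text, prosody_levels)]
    best_i, best_lvl = None, -1
    for i in range(len(text) - 1, -1, -1):  # back to front
        if levels[i] > best_lvl:
            best_i, best_lvl = i, levels[i]
    return [text[:best_i + 1], text[best_i + 1:]]

sentence = "一" * 10 + " " + "二" * 13  # 24 characters, space at index 10
print(cut(sentence, [1] * len(sentence)))
```

Applied to the 24-character example, the sentence exceeds the threshold and is cut just after the space, the highest-level position found when scanning from the end.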
and step 53, converting the short text sentences into the corresponding phoneme labels in combination with the phoneme dictionary, inputting the phoneme labels into the synthesizer model to obtain the output mel spectrum features, and inputting the mel spectrum features into the hifi-gan model, thereby generating the audio file corresponding to the text and completing the speech synthesis.
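The inference path of step 5 can be sketched end to end with stubbed models. Every function below is a placeholder (the real synthesizer is the trained Tacotron and the real vocoder is hifi-gan); the sketch only shows how the pieces connect.

```python
def text_to_phonemes(sentence):
    # Stub for the phoneme-dictionary lookup; one token per character here.
    return list(sentence)

def synthesizer(phonemes):
    # Stub for Tacotron: one 80-channel mel frame per input token.
    return [[0.0] * 80 for _ in phonemes]

def vocoder(mel):
    # Stub for hifi-gan: hop_size 256 waveform samples per mel frame.
    return [0.0] * (len(mel) * 256)

def synthesize(sentence):
    mel = synthesizer(text_to_phonemes(sentence))
    return vocoder(mel)

audio = synthesize("你好")
print(len(audio))  # 2 frames * 256 samples
```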
The invention trains a speech synthesis model on mixed Chinese-English corpora in which the English corpus is smaller than the Chinese corpus. When performing speech synthesis on input text, longer text is cut so that it can be processed: the Chinese part is cut at prosodic boundaries, while word boundaries are effectively identified for the English part. Finally, the cut text is input into the previously trained Chinese-English mixed speech synthesis model, which returns the synthesized audio.
The speech synthesized by the invention has good Chinese-English mixing capability and sounds natural, and the pronunciation of words can be flexibly controlled through the phoneme dictionary, which makes the method well suited to the mixed Chinese-English broadcast text found in e-commerce live streaming.
In order to further clarify the present invention, the following detailed description is given with reference to specific examples.
1. The audio corpus is obtained from the Chinese dataset, and the corresponding text labels are then converted into corresponding phoneme labels, as shown in table 1 below:
table 1:
2. While converting the Chinese pinyin into phonemes, all the phonemes are collected as a phoneme library, as shown in table 2 below:
table 2:
3. The international phonemes are mapped to the corresponding Chinese phonemes from the Chinese phoneme library, as shown in table 3 below:
table 3:
4. The Chinese phoneme labels corresponding to the words are generated from the English word list: the international phonemes of each English word's pronunciation are replaced with the Chinese phonemes corresponding to them, as shown in table 4 below:
table 4:
5. During testing, if the input text is longer than 22 characters, the text is cut: the cut point is chosen by prosody level, English letters are not taken as cut points, and spaces are cut preferentially. After cutting, the characters are converted into the corresponding phoneme labels according to the preceding rules and input into the models to generate audio, as shown in table 5 below:
table 5:
embodiments of the present invention also provide a computer-readable storage medium, on which a computer program is stored, and the computer program is executed by a processor to implement the method of the present invention.
The memory may be an internal storage unit of the terminal device, such as a hard disk or a memory of the terminal device. The memory may also be an external storage device of the terminal device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device. Further, the memory may also include both an internal storage unit of the terminal device and an external storage device. The memory is used for storing the computer program and other programs and data required by the terminal equipment. The memory may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.