Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
To explain the technical solutions of the present invention, the following description is given by way of specific examples.
To achieve the above technical purpose, as shown in fig. 1, an embodiment of the present invention provides an automatic speech synthesis method suitable for a virtual anchor in e-commerce live broadcasting, including:
step 1, processing Chinese data to obtain Chinese audio and a Chinese phoneme library.
The Chinese data can be drawn from an open-source Chinese corpus: the audio data can be used directly, or the text data can be used and the corresponding audio recorded from it. Note that the audio must be produced by a single speaker, and the different corpora should be uttered as uniformly as possible.
In some embodiments, step 1, processing the Chinese data to obtain the Chinese audio and the Chinese phoneme library, as shown in fig. 2, includes:
step 11, processing the Chinese data to obtain Chinese audio and subtitles;
step 12, converting the subtitles into pinyin and converting the pinyin into corresponding Chinese phonemes;
and step 13, collecting and storing all the Chinese phonemes to serve as a Chinese phoneme library.
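Steps 11 to 13 can be sketched as follows. The character-to-pinyin table and the initial/final split rules below are simplified, hypothetical stand-ins (a real implementation would use a full pinyin converter), not the embodiment's actual tooling.

```python
# Sketch: subtitles -> pinyin -> Chinese phonemes -> phoneme library.

# Hypothetical character -> pinyin-with-tone lookup (illustration only).
PINYIN = {"你": "ni3", "好": "hao3"}

# Simplified initial consonants; multi-letter initials listed first so they
# are matched before their single-letter prefixes.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def pinyin_to_phonemes(syllable):
    """Split one pinyin syllable into an initial and a tone-carrying final."""
    tone = syllable[-1] if syllable[-1].isdigit() else ""
    base = syllable.rstrip("0123456789")
    for ini in INITIALS:
        if base.startswith(ini):
            return [ini, base[len(ini):] + tone]
    return [base + tone]  # syllable with no initial

def subtitle_to_phonemes(text):
    phones = []
    for ch in text:
        phones.extend(pinyin_to_phonemes(PINYIN[ch]))
    return phones

# Step 13: collect all phonemes seen so far as the phoneme library.
phoneme_library = sorted(set(subtitle_to_phonemes("你好")))
print(subtitle_to_phonemes("你好"))  # ['n', 'i3', 'h', 'ao3']
print(phoneme_library)
```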
Step 2, processing English words to obtain English word audio, and obtaining the Chinese phoneme labels corresponding to the English words according to the Chinese phoneme library.
The English words can be derived from an English word list, for example the CET-6 or postgraduate entrance examination vocabulary, comprising about 6000 words or more; the corresponding audio data is recorded according to the word list. As before, the audio must be produced by a single speaker, and the different corpora should be uttered as uniformly as possible.
In some embodiments, step 2, processing the English words to obtain the English word audio, and obtaining the Chinese phoneme labels corresponding to the English words according to the Chinese phoneme library, as shown in fig. 3, includes:
step 21, mapping each international phoneme to the phoneme with the closest pronunciation in the Chinese phoneme library, to obtain a phoneme alignment dictionary;
step 22, converting the English words into English phonetic symbols, and converting the phonetic symbols into the corresponding international phoneme labels of the words;
step 23, converting the international phoneme labels of the words into the corresponding Chinese phoneme labels according to the phoneme alignment dictionary.
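Steps 21 to 23 can be sketched as follows. The alignment entries and the word-to-IPA lookup here are hypothetical one-word illustrations, not the embodiment's actual dictionaries.

```python
# Sketch: international (IPA) phonemes aligned to the closest-sounding
# Chinese phonemes, then a word's IPA label rewritten as a Chinese label.

# Hypothetical phoneme alignment dictionary (step 21).
ALIGN = {"h": "h", "ə": "e", "l": "l", "oʊ": "ou"}

# Hypothetical word -> IPA phoneme label lookup (step 22).
WORD_IPA = {"hello": ["h", "ə", "l", "oʊ"]}

def word_to_chinese_label(word):
    """Step 23: replace each IPA phoneme by its aligned Chinese phoneme."""
    return [ALIGN[p] for p in WORD_IPA[word]]

print(word_to_chinese_label("hello"))  # ['h', 'e', 'l', 'ou']
```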
Step 3, processing English letters to obtain letter pronunciation audio and the Chinese phoneme labels corresponding to the letters.
The English letters are the 26 letters of the alphabet.
In some embodiments, step 3, processing the English letters to obtain the letter pronunciation audio and the Chinese phoneme labels corresponding to the letters, as shown in fig. 4, includes:
step 31, acquiring the Chinese phoneme label corresponding to each English letter according to the Chinese phoneme library;
step 32, when the Chinese phoneme label corresponding to an English letter is similar to the pronunciation of a common Chinese word (for example, the pronunciation of 'G' is very close to that of a common Chinese word), the Chinese phoneme label corresponding to that letter is specially processed, for example by retaining it as an international phoneme, or by replacing the corresponding Chinese phoneme label with a custom phoneme;
step 33, obtaining letter pronunciation text: the 26 letters are randomly combined into letter lists whose lengths are drawn randomly from 1 to 40, 300 such lists are generated, and pronunciation audio is obtained for each list. In this embodiment, to better delineate the boundaries of letter pronunciation, each letter in a list is selected with a probability of one third, and a letter identical to the selected letter is inserted after it; for example, if B is selected in [A, B, C], the letter list becomes [A, B, B, C].
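The letter-list generation of step 33 can be sketched as follows. The reading that each letter is independently selected with probability one third is an assumption about the embodiment, and the fixed seed is only for reproducibility.

```python
import random

LETTERS = [chr(c) for c in range(ord("A"), ord("Z") + 1)]

def make_letter_lists(n_lists=300, max_len=40, dup_prob=1 / 3, seed=0):
    """Generate random letter lists; with probability dup_prob a letter is
    doubled (e.g. [A, B, C] -> [A, B, B, C]) to sharpen the pronunciation
    boundaries between letters."""
    rng = random.Random(seed)
    lists = []
    for _ in range(n_lists):
        seq = [rng.choice(LETTERS) for _ in range(rng.randint(1, max_len))]
        out = []
        for letter in seq:
            out.append(letter)
            if rng.random() < dup_prob:
                out.append(letter)  # insert an identical letter after it
        lists.append(out)
    return lists

lists = make_letter_lists()
print(len(lists), min(map(len, lists)), max(map(len, lists)))
```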
English letters appear more frequently in actual text than English words do, so the requirement on letter pronunciation quality is higher; the above processing effectively improves the pronunciation quality of the letters.
Step 4, taking the Chinese audio, the Chinese phoneme library, the English word audio, the Chinese phoneme labels corresponding to the English words, the letter pronunciation audio, and the Chinese phoneme labels corresponding to the letters as a mixed corpus for model training to obtain a speech model;
in general, the speech model mainly includes two parts, a synthesizer and a vocoder, in this embodiment, a cocotron is used as the synthesizer, and mel frequency spectrum features are used as output labels of the cocotron model, the mel channel is 80, the Hop _ size of the cocotron model is 256, and the win _ size is 1024.
The vocoder adopts a hifi-gan model, and a pre-trained model is used, i.e., the vocoder is pre-trained on large-scale data in advance.
Specifically, as shown in fig. 5:
extracting the mel spectrum features of the Chinese audio, the English word audio, and the letter pronunciation audio to serve as the output labels of the Tacotron model;
taking the Chinese phoneme library, the Chinese phoneme labels corresponding to the English words, and the Chinese phoneme labels corresponding to the letters as the input of the Tacotron model to fit the mel spectrum features of the corresponding audio;
and inputting the mel spectrum features generated by Tacotron into the pre-trained hifi-gan model to generate audio, and testing the model effect.
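The mel-target parameters above imply a fixed relation between audio length and the number of synthesizer output frames. The sketch below only checks that arithmetic; the 22050 Hz sample rate is an assumption (the embodiment does not state one), and a library such as librosa would perform the actual mel extraction with these parameters.

```python
# Frame math for the synthesizer's mel targets (80 mel channels,
# hop_size 256, win_size 1024, as stated above).

SAMPLE_RATE = 22050  # assumed; not specified in the embodiment
N_MELS, HOP_SIZE, WIN_SIZE = 80, 256, 1024

def n_mel_frames(n_samples, hop=HOP_SIZE):
    # With center-padding (the common default in mel extractors),
    # there is one frame per hop plus one.
    return n_samples // hop + 1

print(n_mel_frames(SAMPLE_RATE))  # frames per second of audio
```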
Step 5, preprocessing the corpus to be synthesized to obtain the corresponding Chinese phoneme labels, inputting the Chinese phoneme labels into the speech model, and generating the audio file corresponding to the text to complete the speech synthesis.
In some embodiments, preprocessing the corpus to be synthesized to obtain the corresponding Chinese phoneme labels, inputting the Chinese phoneme labels into the speech model, and generating the audio file corresponding to the text to complete the speech synthesis, as shown in fig. 6, includes:
step 51, performing text normalization with regular expressions, for example converting special characters such as Arabic numerals, telephone numbers, unit names, and operation symbols into Chinese;
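Step 51 can be sketched as a couple of regular-expression rules. The digit table and the two operator replacements below are illustrative only, far from the full normalization rule set of the embodiment.

```python
import re

# Digit -> Chinese character table.
DIGITS = "零一二三四五六七八九"

def digits_to_chinese(m):
    # Simplification: digit strings are read digit by digit; a full
    # normalizer would read e.g. 21 as 二十一.
    return "".join(DIGITS[int(d)] for d in m.group())

def normalize(text):
    text = text.replace("+", "加").replace("=", "等于")  # operation symbols
    text = re.sub(r"\d+", digits_to_chinese, text)       # Arabic numerals
    return text

print(normalize("1+1=2"))  # 一加一等于二
```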
step 52, performing cutting processing on long sentences exceeding a preset standard to obtain short text sentences, which may specifically be:
predicting Chinese prosody through a pre-trained sequence labeling model, which can be a bert-bilstm-crf model; the output prosody levels are 0, 1, 2, and 3, and the higher the level, the longer the pause after the character;
uniformly setting the output labels of English characters to 0 and those of spaces to 3, and setting a threshold: if a sentence contains more characters than the threshold and does not end with a letter, the sentence is cut; in this embodiment, the length threshold is set to 22;
the cutting rule is as follows: scanning the sentence from back to front, the position with the largest prosody label level is the cut point, and a letter cannot be a cut point;
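The cutting rule of step 52 can be sketched as follows. The per-character prosody levels are supplied by hand here, whereas in the embodiment they come from the pre-trained sequence labeling model.

```python
THRESHOLD = 22  # length threshold from the embodiment

def effective_level(ch, prosody):
    if ch.isascii() and ch.isalpha():
        return -1   # letters are never cut points (-1 encodes the exclusion)
    if ch == " ":
        return 3    # spaces are cut preferentially
    return prosody

def cut(text, prosody_levels):
    """Cut once at the highest-level position, scanning back to front."""
    if len(text) <= THRESHOLD:
        return [text]
    levels = [effective_level(c, p) for c, p in zip(text, prosody_levels)]
    best_i, best_lvl = None, -1
    for i in range(len(text) - 1, -1, -1):  # back to front
        if levels[i] > best_lvl:
            best_i, best_lvl = i, levels[i]
    return [text[:best_i + 1], text[best_i + 1:]]

sentence = "一" * 10 + " " + "二" * 13  # 24 characters, space at index 10
print(cut(sentence, [1] * len(sentence)))
```

Applied to the 24-character example, the sentence exceeds the threshold and is cut just after the space, the highest-level position found when scanning from the end.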
and step 53, converting the short text sentences into the corresponding phoneme labels in combination with the phoneme dictionary, inputting the phoneme labels into the synthesizer model to obtain the output mel spectrum features, and inputting the mel spectrum features into the hifi-gan model, thereby generating the audio file corresponding to the text and completing the speech synthesis.
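The inference path of step 5 can be sketched end to end with stubbed models. Every function below is a placeholder (the real synthesizer is the trained Tacotron and the real vocoder is hifi-gan); the sketch only shows how the pieces connect.

```python
def text_to_phonemes(sentence):
    # Stub for the phoneme-dictionary lookup; one token per character here.
    return list(sentence)

def synthesizer(phonemes):
    # Stub for Tacotron: one 80-channel mel frame per input token.
    return [[0.0] * 80 for _ in phonemes]

def vocoder(mel):
    # Stub for hifi-gan: hop_size 256 waveform samples per mel frame.
    return [0.0] * (len(mel) * 256)

def synthesize(sentence):
    mel = synthesizer(text_to_phonemes(sentence))
    return vocoder(mel)

audio = synthesize("你好")
print(len(audio))  # 2 frames * 256 samples
```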
The invention trains a speech synthesis model on mixed Chinese-English corpora in which the English corpus is smaller than the Chinese corpus. When performing speech synthesis on input text, longer text is cut so that it can be processed: the Chinese part is cut at prosodic boundaries, while word boundaries are effectively identified for the English part. Finally, the cut text is input into the previously trained Chinese-English mixed speech synthesis model, which returns the synthesized audio.
The speech synthesized by the invention has good Chinese-English mixing capability and sounds natural, and the pronunciation of words can be flexibly controlled through the phoneme dictionary, which makes the method well suited to the mixed Chinese-English broadcast text found in e-commerce live streaming.
In order to further clarify the present invention, the following detailed description is given with reference to specific examples.
1. The audio corpus is obtained from the Chinese dataset, and the corresponding text labels are then converted into corresponding phoneme labels, as shown in table 1 below:
table 1:
2. While converting the Chinese pinyin into phonemes, all the phonemes are collected as a phoneme library, as shown in table 2 below:
table 2:
3. The international phonemes are mapped to the corresponding Chinese phonemes from the Chinese phoneme library, as shown in table 3 below:
table 3:
4. The Chinese phoneme labels corresponding to the words are generated from the English word list: the international phonemes of each English word's pronunciation are replaced with the Chinese phonemes corresponding to them, as shown in table 4 below:
table 4:
5. During testing, if the input text is longer than 22 characters, the text is cut: the cut point is chosen by prosody level, English letters are not taken as cut points, and spaces are cut preferentially. After cutting, the characters are converted into the corresponding phoneme labels according to the preceding rules and input into the models to generate audio, as shown in table 5 below:
table 5:
embodiments of the present invention also provide a computer-readable storage medium, on which a computer program is stored, and the computer program is executed by a processor to implement the method of the present invention.
The memory may be an internal storage unit of the terminal device, such as a hard disk or a memory of the terminal device. The memory may also be an external storage device of the terminal device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device. Further, the memory may also include both an internal storage unit of the terminal device and an external storage device. The memory is used for storing the computer program and other programs and data required by the terminal equipment. The memory may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.