CN114387947A - Automatic voice synthesis method suitable for virtual anchor in E-commerce live broadcast - Google Patents

Automatic voice synthesis method suitable for virtual anchor in E-commerce live broadcast

Info

Publication number
CN114387947A
CN114387947A (application CN202210285104.6A)
Authority
CN
China
Prior art keywords
Chinese
audio
English
phoneme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210285104.6A
Other languages
Chinese (zh)
Other versions
CN114387947B (en)
Inventor
梁晨阳 (Liang Chenyang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Shenzhi Technology Co ltd
Original Assignee
Beijing Zhongke Shenzhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Shenzhi Technology Co ltd filed Critical Beijing Zhongke Shenzhi Technology Co ltd
Priority to CN202210285104.6A priority Critical patent/CN114387947B/en
Publication of CN114387947A publication Critical patent/CN114387947A/en
Application granted granted Critical
Publication of CN114387947B publication Critical patent/CN114387947B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 - Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21 - Server components or server architectures
    • H04N21/218 - Source of audio or video content, e.g. local disk arrays
    • H04N21/2187 - Live feed
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 - Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 - Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233 - Processing of audio elementary streams

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Databases & Information Systems (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses an automatic speech synthesis method suitable for a virtual anchor in e-commerce live broadcast, comprising the following steps: processing Chinese data to obtain Chinese audio and a Chinese phoneme library; processing English words to obtain English word audio, and obtaining the Chinese phoneme labels corresponding to the English words from the Chinese phoneme library; processing English letters to obtain the pronunciation audio of the English letters and the Chinese phoneme labels corresponding to the letters; training a model on the Chinese audio, the Chinese phoneme library, the English word audio, the Chinese phoneme labels corresponding to the English words, the letter pronunciation audio and the Chinese phoneme labels corresponding to the letters as a mixed corpus to obtain a speech model; and preprocessing the corpus to be synthesized to obtain the corresponding Chinese phoneme labels, inputting the phoneme labels into the speech model, and generating the audio file corresponding to the text to complete speech synthesis. The invention provides good Chinese-English code-mixing capability and natural speech.

Description

Automatic voice synthesis method suitable for virtual anchor in E-commerce live broadcast
Technical Field
The invention relates to the technical field of computer information processing and speech signal processing, and in particular to an automatic speech synthesis method suitable for a virtual anchor in e-commerce live broadcast.
Background
In recent years, with the rapid development of deep learning, artificial intelligence techniques have begun to permeate every aspect of life. Speech synthesis has likewise made unprecedented progress and is now widely deployed in fields such as intelligent customer service and chat robots. With the rapid growth of e-commerce live streaming and live selling, intelligent virtual anchors are becoming increasingly popular. Yet although artificial intelligence has begun to spread, it still struggles to replace humans in practical applications. A saying is even widespread in the industry that artificial intelligence is built on manual work: the more manual effort, the more "intelligence". This is fully borne out in speech processing.
At present, most speech recognition models can keep the error rate below 10%, but in practical applications even those few errors may prevent subsequent processing from proceeding normally. Speech synthesis has no hard metric like the error rate, yet it faces a similar problem: although a synthesizer can turn the target text into corresponding audio, the naturalness of that audio does not reach a human level, and naturalness is particularly important in intelligent broadcasting.
Unlike speech synthesis systems in other fields, an intelligent-anchor speech synthesis system for e-commerce has the following characteristics:
First, naturalness of the language. Because the system imitates a human anchor, the synthesized speech must be as coherent and natural as possible to ensure the live-stream user experience.
Second, extensive Chinese-English code-mixing. Unlike a Chinese chat question-answering system, an e-commerce live broadcast generally contains many Chinese-English mixed cases, such as product model numbers and habitual expressions, although the English segments are usually short. An e-commerce live-broadcast speech synthesis system therefore needs good Chinese-English mixing capability.
Finally, in a live-broadcast business, knowledge is updated in real time, especially the range of products on offer, so an e-commerce live question-answering system needs good extensibility.
In practice, no existing speech synthesis system satisfies all of the above requirements. The prior-art methods are as follows. The first is the traditional concatenative approach: phonemes are recorded in advance and then spliced together according to the text. Although this method generates speech conveniently, the result sounds strongly mechanical and very unnatural, which is hard to accept in human-imitating live broadcast in particular.
The second is a common deep learning model such as Tacotron, which realizes speech synthesis on a statistical basis and must be trained on a large amount of corpus. However, Chinese and English belong to different pronunciation systems, so it is difficult to train a good mixed Chinese-English synthesis model on a small English corpus. Moreover, during a live broadcast a token sometimes needs to be pronounced as a word and sometimes letter by letter, for example "LED", whereas a statistics-based deep learning model uniformly synthesizes it according to the word pronunciation. At the same time, a few English words may be mentioned frequently during e-commerce live broadcast while most are mentioned rarely, and the word frequencies of different live-broadcast vocabularies may need to change flexibly, so good extensibility is needed.
Disclosure of Invention
In view of this, an embodiment of the present invention provides an automatic speech synthesis method suitable for a virtual anchor in e-commerce live broadcast, so as to solve the above technical problems.
In order to achieve the above object, an embodiment of the present invention provides an automatic speech synthesis method suitable for a virtual anchor in e-commerce live broadcast, the method comprising:
processing Chinese data to obtain Chinese audio and a Chinese phoneme library;
processing English words to obtain English word audio, and obtaining the Chinese phoneme labels corresponding to the English words from the Chinese phoneme library;
processing English letters to obtain the pronunciation audio of the English letters and the Chinese phoneme labels corresponding to the letters;
training a model on the Chinese audio, the Chinese phoneme library, the English word audio, the Chinese phoneme labels corresponding to the English words, the letter pronunciation audio and the Chinese phoneme labels corresponding to the letters as a mixed corpus to obtain a speech model;
and preprocessing the corpus to be synthesized to obtain the corresponding Chinese phoneme labels, inputting the phoneme labels into the speech model, and generating the audio file corresponding to the text to complete speech synthesis.
Through the technical scheme provided by the invention, the invention at least has the following technical effects:
the invention has better Chinese-English mixing capability, natural voice and can flexibly control the pronunciation of words by depending on the phoneme dictionary, thereby being very suitable for the condition of Chinese-English mixed broadcast text in live telecast of E-commerce.
Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the embodiments of the invention without limiting the embodiments of the invention. In the drawings:
FIG. 1 is a flow chart of one embodiment of the automatic speech synthesis method suitable for a virtual anchor in e-commerce live broadcast according to the present invention;
FIG. 2 is a flow diagram of one embodiment of the processing of the Chinese data in FIG. 1;
FIG. 3 is a flow diagram of one embodiment of the processing of the English words in FIG. 1;
FIG. 4 is a flow diagram of one embodiment of the processing of the English letters in FIG. 1;
FIG. 5 is a flow diagram of one embodiment of obtaining the speech model in FIG. 1;
FIG. 6 is a flow diagram of one embodiment of the speech synthesis in FIG. 1.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
To achieve the above technical purpose, as shown in fig. 1, an embodiment of the present invention provides an automatic speech synthesis method suitable for a virtual anchor in e-commerce live broadcast, comprising:
step 1, processing Chinese data to obtain Chinese audio and a Chinese factor library.
The Chinese data can be drawn from an open-source Chinese corpus: the audio data can be used directly, or the text data can be retrieved and the corresponding audio recorded from it. Note that the audio must come from a single speaker, and different corpora should be uttered as consistently as possible.
In some embodiments, as shown in fig. 2, step 1 (processing the Chinese data to obtain the Chinese audio and the Chinese phoneme library) includes the following sub-steps; a code sketch follows the list:
Step 11, processing the Chinese data to obtain Chinese audio and subtitles;
Step 12, converting the subtitles into pinyin and converting the pinyin into the corresponding Chinese phonemes;
Step 13, collecting and deduplicating all Chinese phonemes to serve as the Chinese phoneme library.
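As a minimal sketch of steps 11 to 13, the snippet below assumes the open-source pypinyin package for grapheme-to-pinyin conversion and uses the common initial/final split (with tone-numbered finals) as the phoneme scheme; the patent does not name a toolchain, so these choices and the sample subtitles are illustrative only.

```python
from pypinyin import lazy_pinyin, Style

def text_to_phonemes(subtitle: str) -> list[str]:
    """Convert a Chinese subtitle line into initial/final phoneme labels."""
    initials = lazy_pinyin(subtitle, style=Style.INITIALS, strict=False)
    finals = lazy_pinyin(subtitle, style=Style.FINALS_TONE3)
    phonemes = []
    for ini, fin in zip(initials, finals):
        if ini:                       # zero-initial syllables yield ''
            phonemes.append(ini)
        phonemes.append(fin)
    return phonemes

# Step 13: collect every phoneme seen in the corpus into the phoneme library.
phoneme_library: set[str] = set()
for subtitle in ["欢迎来到直播间", "今天的商品非常优惠"]:  # sample subtitles
    phoneme_library.update(text_to_phonemes(subtitle))
```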
Step 2, processing English words to obtain English word audio, and obtaining the Chinese phoneme labels corresponding to the English words from the Chinese phoneme library.
The English words can come from an English vocabulary list, for example the CET-6 or postgraduate-entrance-exam vocabulary of roughly 6,000 or more words, with the corresponding audio recorded from that list. As before, the audio must come from a single speaker, and different corpora should be uttered as consistently as possible.
In some embodiments, as shown in fig. 3, step 2 (processing the English words to obtain the English word audio and the corresponding Chinese phoneme labels) includes the following sub-steps; a code sketch follows the list:
Step 21, mapping each international phoneme to the phoneme in the Chinese phoneme library whose pronunciation is closest, yielding a phoneme alignment dictionary;
Step 22, converting English words into English phonetic transcriptions, and converting the transcriptions into the corresponding international phoneme labels of each word;
Step 23, converting the international phoneme labels of each word into the corresponding Chinese phoneme labels according to the phoneme alignment dictionary.
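A minimal sketch of steps 21 to 23 follows. The alignment dictionary is hypothetical: the patent does not publish its phoneme alignment table, so every mapping entry and the sample transcription below are placeholders for illustration only.

```python
# Step 21: hypothetical phoneme alignment dictionary, mapping international
# phonemes to the closest-sounding phonemes in the Chinese phoneme library.
PHONEME_ALIGNMENT = {
    "s": "s", "t": "t", "k": "k", "p": "p",
    "ɑː": "a1", "iː": "i1", "ə": "e5",        # illustrative entries only
}

def word_ipa_to_chinese(ipa: list[str]) -> list[str]:
    """Steps 22-23: map a word's international phoneme labels to Chinese ones."""
    return [PHONEME_ALIGNMENT.get(p, p) for p in ipa]  # unmapped pass through

# Sample usage with a hypothetical transcription of "star" (/s t ɑː/):
print(word_ipa_to_chinese(["s", "t", "ɑː"]))           # -> ['s', 't', 'a1']
```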
Step 3, processing English letters to obtain the pronunciation audio of the English letters and the Chinese phoneme labels corresponding to the letters.
The English letters can be the 26 letters of the alphabet.
In some embodiments, as shown in fig. 4, step 3 (processing the English letters to obtain their pronunciation audio and the corresponding Chinese phoneme labels) includes the following sub-steps; a code sketch of the letter-list generation follows:
Step 31, obtaining the Chinese phoneme label corresponding to each English letter from the Chinese phoneme library;
Step 32, when the Chinese phoneme label of an English letter is close to the pronunciation of a common Chinese word (for example, the letter "G" sounds very similar to a common Chinese syllable), handling that label specially, for example keeping it as an international phoneme, or replacing the corresponding Chinese phoneme label with a custom phoneme;
Step 33, obtaining the letter pronunciation texts: randomly combine the 26 letters into letter lists whose lengths are drawn at random from 1 to 40, generate 300 such lists, and obtain the pronunciation audio of each list. In this embodiment, to better delimit the boundaries of letter pronunciation, each letter in a list is selected with probability one third and an identical letter is inserted after it; for example, if B is selected in the list A, B, C, the list becomes A, B, B, C.
Compared with English words, English letters appear far more often in real broadcast texts, so the requirement on their pronunciation quality is higher; the processing above effectively improves letter pronunciation quality.
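The letter-list generation in step 33 can be sketched as follows, assuming uniform random choices; the function name and defaults are illustrative.

```python
import random
import string

def make_letter_lists(n_lists: int = 300, max_len: int = 40,
                      dup_prob: float = 1 / 3) -> list[list[str]]:
    """Step 33: build random letter lists with in-place duplication."""
    lists = []
    for _ in range(n_lists):
        length = random.randint(1, max_len)           # length drawn from 1-40
        letters = random.choices(string.ascii_uppercase, k=length)
        with_dups = []
        for ch in letters:
            with_dups.append(ch)
            if random.random() < dup_prob:            # e.g. [A,B,C] -> [A,B,B,C]
                with_dups.append(ch)
        lists.append(with_dups)
    return lists

letter_lists = make_letter_lists()
```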
Step 4, training a model on the Chinese audio, the Chinese phoneme library, the English word audio, the Chinese phoneme labels corresponding to the English words, the letter pronunciation audio and the Chinese phoneme labels corresponding to the letters as a mixed corpus, so as to obtain a speech model.
In general, the speech model comprises two parts, a synthesizer and a vocoder. In this embodiment, Tacotron is used as the synthesizer, and mel spectrogram features are used as the output labels of the Tacotron model, with 80 mel channels, a hop_size of 256 and a win_size of 1024.
The vocoder is a HiFi-GAN model used as a pre-trained model, i.e. trained in advance on large-scale data.
Specifically, as shown in fig. 5 (a code sketch of the feature extraction follows):
extracting the mel spectrogram features of the Chinese audio, the English word audio and the letter pronunciation audio to serve as the output labels of the Tacotron model;
taking the Chinese phoneme library, the Chinese phoneme labels corresponding to the English words and the Chinese phoneme labels corresponding to the letters as the input of the Tacotron model, to fit the mel spectrogram features of the corresponding audio;
and feeding the mel spectrogram features generated by Tacotron into the pre-trained HiFi-GAN model to generate audio and test the model's effect.
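A minimal sketch of the output-label extraction, assuming the librosa library for mel spectrogram computation with the parameters stated above (80 mel channels, hop_size 256, win_size 1024); the log compression and the file path are assumptions, not taken from the patent.

```python
import librosa
import numpy as np

def extract_mel(path: str, sr: int = 22050) -> np.ndarray:
    """Compute an 80-channel mel spectrogram to serve as a Tacotron label."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256,
        win_length=1024, n_mels=80,
    )
    # Log compression is the usual convention for Tacotron output labels.
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))

mel_label = extract_mel("corpus/chinese_0001.wav")  # hypothetical file
```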
Step 5, preprocessing the corpus to be synthesized to obtain the corresponding Chinese phoneme labels, inputting the phoneme labels into the speech model, and generating the audio file corresponding to the text to complete speech synthesis.
In some embodiments, as shown in fig. 6, this includes:
Step 51, normalizing the text with regular expressions, for example converting special characters such as Arabic numerals, telephone numbers, unit names and operator symbols into Chinese;
Step 52, cutting long sentences that exceed a preset length to obtain short text sentences (a code sketch of steps 51 and 52 follows step 53), which may specifically be:
predicting Chinese prosody with a pre-trained sequence labeling model, which may be a BERT-BiLSTM-CRF model; the output prosody levels are 0, 1, 2 and 3, and the higher the level, the longer the pause after the character;
uniformly setting the output label of English letters to 0 and that of spaces to 3, and setting a length threshold (22 in this embodiment); if a sentence contains more characters than the threshold and does not end with a letter, the sentence is cut;
the cutting rule is: scanning the sentence from back to front, the position with the largest prosody level is the cutting point, and a letter can never be a cutting point;
Step 53, converting each short text sentence into the corresponding phoneme labels with the phoneme dictionary, inputting the phoneme labels into the synthesizer model to obtain the output mel spectrogram features, and feeding those features into the HiFi-GAN model to generate the audio file corresponding to the text, completing speech synthesis.
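A minimal sketch of steps 51 and 52, assuming per-character prosody levels (0 to 3) have already been predicted, with letters forced to 0 and spaces to 3 as described above; the normalization shown covers only Arabic numerals as an illustration, and the simplified cut rule relies on letters carrying level 0 rather than excluding them explicitly.

```python
import re

DIGITS = "零一二三四五六七八九"

def normalize(text: str) -> str:
    """Step 51: convert special characters (here only digits) into Chinese."""
    return re.sub(r"\d", lambda m: DIGITS[int(m.group())], text)

def cut_sentence(chars: list[str], prosody: list[int],
                 threshold: int = 22) -> list[list[str]]:
    """Step 52: recursively cut a sentence longer than the threshold.

    Scanning from back to front, the position with the largest prosody
    level becomes the cut point; letters carry level 0, so they are only
    picked when no real prosodic boundary exists.
    """
    if len(chars) <= threshold:
        return [chars]
    best_i, best_level = None, -1
    for i in range(len(chars) - 1, 0, -1):   # back to front
        if prosody[i] > best_level:
            best_i, best_level = i, prosody[i]
    return (cut_sentence(chars[:best_i], prosody[:best_i], threshold)
            + cut_sentence(chars[best_i:], prosody[best_i:], threshold))
```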
The invention trains a speech synthesis model on a mixed Chinese-English corpus in which the English material is far smaller than the Chinese. When synthesizing speech for input text, longer texts are cut: the Chinese part is cut at prosodic boundaries, while word boundaries are preserved for English. The cut text is then fed into the previously trained mixed Chinese-English speech synthesis model, which returns the synthesized audio.
The speech synthesized by the invention shows good Chinese-English code-mixing capability and naturalness, the pronunciation of words can be controlled flexibly through the phoneme dictionary, and the method is well suited to the Chinese-English mixed broadcast texts of e-commerce live broadcast.
In order to further clarify the present invention, the following detailed description is given with reference to specific examples.
1. The audio corpus is obtained from the Chinese dataset, and the corresponding text labels are then converted into corresponding phoneme labels, as shown in table 1 below:
Table 1 (reproduced as an image in the original publication).
2. While converting the Chinese pinyin to phonemes, all the phonemes are collected as a phoneme library, as shown in table 2 below:
Table 2 (reproduced as an image in the original publication).
3. The international phonemes are mapped to the corresponding Chinese phonemes from the Chinese phoneme library, as shown in table 3 below:
Table 3 (reproduced as an image in the original publication).
4. The Chinese phoneme labels corresponding to the words are generated from the English word table by transcribing each English word into international phonemes and replacing those with the corresponding Chinese phonemes, as shown in table 4 below:
Table 4 (reproduced as an image in the original publication).
5. During testing, if the input text is longer than 22 characters it is cut. The cut point is chosen by prosody level, English letters are never used as cut points, and spaces are cut preferentially. After cutting, the characters are converted into the corresponding phoneme labels according to the preceding rules and fed into the models to generate audio, as shown in table 5 below:
Table 5 (reproduced as an image in the original publication).
embodiments of the present invention also provide a computer-readable storage medium, on which a computer program is stored, and the computer program is executed by a processor to implement the method of the present invention.
The memory may be an internal storage unit of the terminal device, such as a hard disk or a memory of the terminal device. The memory may also be an external storage device of the terminal device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device. Further, the memory may also include both an internal storage unit of the terminal device and an external storage device. The memory is used for storing the computer program and other programs and data required by the terminal equipment. The memory may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, a software distribution medium, and the like. It should be noted that the content of the computer readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer readable media do not include electrical carrier signals and telecommunications signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (6)

1. An automatic speech synthesis method suitable for a virtual anchor in e-commerce live broadcast, characterized by comprising the following steps:
processing Chinese data to obtain Chinese audio and a Chinese phoneme library;
processing English words to obtain English word audio, and obtaining the Chinese phoneme labels corresponding to the English words from the Chinese phoneme library;
processing English letters to obtain the pronunciation audio of the English letters and the Chinese phoneme labels corresponding to the letters;
training a model on the Chinese audio, the Chinese phoneme library, the English word audio, the Chinese phoneme labels corresponding to the English words, the letter pronunciation audio and the Chinese phoneme labels corresponding to the letters as a mixed corpus to obtain a speech model;
and preprocessing the corpus to be synthesized to obtain the corresponding Chinese phoneme labels, inputting the phoneme labels into the speech model, and generating the audio file corresponding to the text to complete speech synthesis.
2. The method of claim 1, wherein the processing of the Chinese data to obtain the Chinese audio and the Chinese phoneme library comprises:
processing the Chinese data to obtain Chinese audio and subtitles;
converting the subtitles into pinyin and converting the pinyin into the corresponding Chinese phonemes;
and collecting and deduplicating all Chinese phonemes to serve as the Chinese phoneme library.
3. The method of claim 1, wherein the processing of the English words to obtain the English word audio and obtaining the Chinese phoneme labels corresponding to the English words from the Chinese phoneme library comprises:
mapping each international phoneme to the phoneme in the Chinese phoneme library whose pronunciation is closest, yielding a phoneme alignment dictionary;
converting English words into English phonetic transcriptions, and converting the transcriptions into the corresponding international phoneme labels of each word;
and converting the international phoneme labels of each word into the corresponding Chinese phoneme labels according to the phoneme alignment dictionary.
4. The method of claim 1, wherein the processing of the English letters to obtain the pronunciation audio of the English letters and the Chinese phoneme labels corresponding to the letters comprises:
obtaining the Chinese phoneme label corresponding to each English letter from the Chinese phoneme library;
when the Chinese phoneme label of an English letter is close to the pronunciation of a common Chinese word, handling that label specially;
and obtaining the letter pronunciation texts by randomly combining the 26 letters into letter lists, and obtaining the pronunciation audio of the letter lists.
5. The method of claim 1, wherein the model training on the Chinese audio, the Chinese phoneme library, the English word audio, the Chinese phoneme labels corresponding to the English words, the letter pronunciation audio and the Chinese phoneme labels corresponding to the letters as a mixed corpus to obtain the speech model comprises:
extracting the mel spectrogram features of the Chinese audio, the English word audio and the letter pronunciation audio to serve as the output labels of a Tacotron model;
taking the Chinese phoneme library, the Chinese phoneme labels corresponding to the English words and the Chinese phoneme labels corresponding to the letters as the input of the Tacotron model, to fit the mel spectrogram features of the corresponding audio;
and feeding the mel spectrogram features generated by Tacotron into a pre-trained HiFi-GAN model to generate audio and test the model's effect.
6. The method according to claim 1, wherein the preprocessing of the corpus to be synthesized to obtain the corresponding Chinese phoneme labels, inputting the phoneme labels into the speech model, and generating the audio file corresponding to the text to complete speech synthesis comprises:
normalizing the text with regular expressions;
cutting long sentences that exceed a preset length to obtain short text sentences;
and converting each short text sentence into the corresponding phoneme labels with the phoneme dictionary, inputting the phoneme labels into the synthesizer model to obtain the output mel spectrogram features, and feeding those features into the HiFi-GAN model to generate the audio file corresponding to the text, completing speech synthesis.
CN202210285104.6A 2022-03-23 2022-03-23 Automatic voice synthesis method suitable for virtual anchor in E-commerce live broadcast Active CN114387947B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210285104.6A CN114387947B (en) 2022-03-23 2022-03-23 Automatic voice synthesis method suitable for virtual anchor in E-commerce live broadcast

Publications (2)

Publication Number Publication Date
CN114387947A (en) 2022-04-22
CN114387947B (en) 2022-08-02

Family

ID=81204909

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210285104.6A Active CN114387947B (en) 2022-03-23 2022-03-23 Automatic voice synthesis method suitable for virtual anchor in E-commerce live broadcast

Country Status (1)

Country Link
CN (1) CN114387947B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1212404A (en) * 1997-09-19 1999-03-31 国际商业机器公司 Method for identifying character/numeric string in Chinese speech recognition system
US20080059191A1 (en) * 2006-09-04 2008-03-06 Fortemedia, Inc. Method, system and apparatus for improved voice recognition
CN112151005A (en) * 2020-09-28 2020-12-29 四川长虹电器股份有限公司 Chinese and English mixed speech synthesis method and device
CN112365878A (en) * 2020-10-30 2021-02-12 广州华多网络科技有限公司 Speech synthesis method, device, equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN114387947B (en) 2022-08-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room 911, 9th Floor, Block B, Xingdi Center, Building 2, No. 10, Jiuxianqiao North Road, Jiangtai Township, Chaoyang District, Beijing, 100000
Patentee after: Beijing Zhongke Shenzhi Technology Co., Ltd.
Country or region after: China
Address before: Room 311a, Floor 3, Building 4, Courtyard 4, Yongchang Middle Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing, 100000
Patentee before: Beijing Zhongke Shenzhi Technology Co., Ltd.
Country or region before: China