CN112530404A - Voice synthesis method, voice synthesis device and intelligent equipment - Google Patents

Publication number
CN112530404A
Authority
CN
China
Prior art keywords
word
chinese
english
list
phoneme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011380178.5A
Other languages
Chinese (zh)
Inventor
钱程浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ubtech Robotics Corp
Original Assignee
Ubtech Robotics Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ubtech Robotics Corp filed Critical Ubtech Robotics Corp
Priority to CN202011380178.5A priority Critical patent/CN112530404A/en
Publication of CN112530404A publication Critical patent/CN112530404A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

The application discloses a voice synthesis method, a voice synthesis device, intelligent equipment and a computer readable storage medium. The method comprises the following steps: performing word segmentation processing on an input text based on a preset word segmentation algorithm to obtain a Chinese word list and an English word list; determining the pinyin corresponding to each Chinese word in the Chinese word list; searching for the phonemes corresponding to each English word in the English word list based on a preset word prefix dictionary; if a target English word exists, that is, an English word whose phonemes cannot be found in the word prefix dictionary, inputting the target English word into a grapheme-to-phoneme model to obtain the phonemes output by the model for that word; and performing speech synthesis of the input text according to the pinyin of each Chinese word and the phonemes of each English word. This scheme can improve the speech synthesis effect of an intelligent device when it faces Chinese-English mixed text.

Description

Voice synthesis method, voice synthesis device and intelligent equipment
Technical Field
The present application belongs to the technical field of artificial intelligence, and in particular, relates to a speech synthesis method, a speech synthesis apparatus, and an intelligent device.
Background
When speech synthesis is carried out, the speech synthesis system carried by an intelligent device analyzes the text to be synthesized. The purpose of the analysis is to let the computer recognize the characters in the text, determine which sounds are to be produced and how they are to be pronounced, and pass this pronunciation information to the intelligent device. In addition, the speech synthesis system lets the intelligent device know which words, phrases and sentences the text contains, so that the device knows where to pause during pronunciation and can produce more fluent speech. However, current speech synthesis systems can only perform speech synthesis on single-language text and perform poorly on Chinese-English mixed text.
Disclosure of Invention
The application provides a voice synthesis method, a voice synthesis device, intelligent equipment and a computer readable storage medium, which can improve the voice synthesis effect of the intelligent equipment when facing Chinese and English mixed texts.
In a first aspect, the present application provides a speech synthesis method, including:
performing word segmentation processing on an input text based on a preset word segmentation algorithm to obtain a Chinese word list and an English word list, wherein the Chinese word list comprises each Chinese word forming the input text, and the English word list comprises each English word forming the input text;
determining the pinyin corresponding to each Chinese word in the Chinese word list;
searching phonemes corresponding to each English word in the English word list based on a preset word prefix dictionary, wherein the word prefix dictionary is configured with at least one English word and corresponding phonemes;
if the target English word exists, inputting the target English word into a grapheme-to-phoneme model to obtain a phoneme corresponding to the target English word output by the grapheme-to-phoneme model;
and performing speech synthesis on the input text according to the pinyin corresponding to each Chinese word in the Chinese word list and the phoneme corresponding to each English word in the English word list.
In a second aspect, the present application provides a speech synthesis apparatus comprising:
the text word segmentation unit is used for performing word segmentation processing on an input text based on a preset word segmentation algorithm to obtain a Chinese word list and an English word list, wherein the Chinese word list comprises each Chinese word forming the input text, and the English word list comprises each English word forming the input text;
a pinyin determining unit, configured to determine a pinyin corresponding to each chinese term in the chinese term list;
a first phoneme determining unit, configured to search, based on a preset word prefix dictionary, a phoneme corresponding to each english word in the english word list, where the word prefix dictionary is configured with at least one english word and a corresponding phoneme;
a second phoneme determining unit, configured to, if there is a target english word, input the target english word into a grapheme-to-phoneme model to obtain a phoneme corresponding to the target english word output by the grapheme-to-phoneme model;
and the voice synthesis unit is used for carrying out voice synthesis on the input text according to the pinyin corresponding to each Chinese word in the Chinese word list and the phoneme corresponding to each English word in the English word list.
In a third aspect, the present application provides a smart device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to the first aspect when executing the computer program.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of the first aspect.
In a fifth aspect, the present application provides a computer program product comprising a computer program which, when executed by one or more processors, performs the steps of the method of the first aspect as described above.
Compared with the prior art, the application has the following beneficial effects. When facing an input text that mixes Chinese and English, the input text is first segmented to obtain a Chinese word list, which comprises each Chinese word forming the input text, and an English word list, which comprises each English word forming the input text. The two lists are then processed separately: for the Chinese word list, the pinyin corresponding to each Chinese word is determined directly; for the English word list, the phonemes corresponding to each English word are searched through the word prefix dictionary, and any target English word is input into the grapheme-to-phoneme model to obtain the phonemes output by the model for that word. Finally, speech synthesis is carried out according to the pinyin of each Chinese word and the phonemes of each English word in the input text. In this process, the scheme processes the English words and the Chinese words in the input text separately and further safeguards the speech synthesis of English words through the grapheme-to-phoneme model, which can greatly improve the speech synthesis effect of an intelligent device facing Chinese-English mixed text. It is understood that, for the beneficial effects of the second to fifth aspects, reference may be made to the related description of the first aspect, which is not repeated here.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of an implementation of a speech synthesis method provided in an embodiment of the present application;
Fig. 2 is an exemplary diagram of a directed acyclic graph in a speech synthesis method provided in an embodiment of the present application;
Fig. 3 is a block diagram of a speech synthesis apparatus according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an intelligent device provided in an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
In order to explain the technical solution proposed in the present application, the following description will be given by way of specific examples.
A speech synthesis method provided in an embodiment of the present application is described below. Referring to fig. 1, the speech synthesis method includes:
Step 101: performing word segmentation processing on an input text based on a preset word segmentation algorithm to obtain a Chinese word list and an English word list.
In the embodiment of the application, when the input text contains both English and Chinese, word segmentation processing can be performed on the input text to obtain a Chinese word list and an English word list, wherein the Chinese word list comprises each Chinese word forming the input text, and the English word list comprises each English word forming the input text. That is, for Chinese-English mixed text, the Chinese portion is divided into Chinese words (which may span several characters), and the English portion is divided into individual English words. Specifically, the mixed input text can be segmented with jieba, whose working principle is briefly described as follows:
jieba first performs a preliminary analysis of the mixed input text and splits off each English word, which completes the English word segmentation. The input text with the English words removed is then divided into sentences based on punctuation marks, and each sentence forms a sentence array. Each sentence array is then processed further: a directed acyclic graph is constructed from the sentence array, the maximum probability path through the graph is calculated, and the segmentation corresponding to that path is taken as the segmentation result of the sentence array. In this way the Chinese words forming each sentence are obtained, which completes the Chinese word segmentation.
For example, suppose the input text is "编程的第一课是学习hello world" ("the first lesson of programming is learning hello world"). When jieba processes this text, it first splits off the English words "hello" and "world". Because the input text contains only one sentence, no sentence splitting is needed, and the content left after removing the English words, "编程的第一课是学习", forms a single sentence array. The directed acyclic graph of this sentence array is then constructed, as shown in fig. 2. For each path, the word-forming probability of each word is calculated starting from the last position of the sentence array. The segmentation is taken from the path with the largest sum of word-forming probabilities, giving: 编程 (programming), 的 (of), 第一课 (first lesson), 是 (is) and 学习 (learning). This yields the English word list [hello, world] and the Chinese word list [编程, 的, 第一课, 是, 学习].
Of course, other word segmentation tools, such as SnowNLP, pkuseg, THULAC and pyhanlp, may also be used to segment the input text; this is not limited here.
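The segmentation procedure described above can be sketched as follows. This is a minimal illustration and not jieba's actual code: the dictionary `TOY_FREQ`, its frequency values, and the function names are all hypothetical, and the English words are simply pulled out with a regular expression before the Chinese remainder is segmented along the maximum probability path of a directed acyclic graph.

```python
import math
import re

# Hypothetical toy dictionary (word -> frequency); the values are made up.
TOY_FREQ = {
    "编程": 50, "的": 500, "第一": 40, "第一课": 30, "课": 20,
    "是": 400, "学习": 60, "编": 5, "程": 5, "第": 5, "一": 50,
    "学": 10, "习": 5,
}
TOTAL = sum(TOY_FREQ.values())
ENGLISH_WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def split_english(text):
    """Return the English words and the text with those words blanked out."""
    return ENGLISH_WORD.findall(text), ENGLISH_WORD.sub(" ", text)


def build_dag(sentence):
    """Map each start index to the end indices that close a dictionary word."""
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i, len(sentence))
                if sentence[i:j + 1] in TOY_FREQ]
        dag[i] = ends or [i]  # an unknown character stands alone
    return dag


def best_route(sentence, dag):
    """Compute the best log-probability path, working from the last position."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(TOY_FREQ.get(sentence[i:j + 1], 1)) - math.log(TOTAL)
             + route[j + 1][0], j)
            for j in dag[i]
        )
    return route


def segment_chinese(sentence):
    """Cut one sentence array along the maximum probability path."""
    route = best_route(sentence, build_dag(sentence))
    words, i = [], 0
    while i < len(sentence):
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


def segment_mixed(text):
    """Return (chinese_words, english_words) for a mixed input text."""
    english, remaining = split_english(text)
    chinese = []
    for sentence in re.split(r"[\s，。！？、；：,.!?;:]+", remaining):
        if sentence:
            chinese.extend(segment_chinese(sentence))
    return chinese, english
```

With this toy dictionary, `segment_mixed("编程的第一课是学习hello world")` returns the Chinese list [编程, 的, 第一课, 是, 学习] and the English list [hello, world], matching the example above.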
Step 102: determining the pinyin corresponding to each Chinese word in the Chinese word list.
In the embodiment of the present application, considering that the chinese is pronounced using pinyin, for the chinese word list, the pinyin corresponding to each chinese word in the chinese word list may be determined based on a preset pinyin conversion tool, such as pypinyin.
In some embodiments, after obtaining the chinese word list, part-of-speech tagging may be performed on each chinese word in the chinese word list based on the input text to obtain a part-of-speech of each chinese word; accordingly, the pinyin conversion tool may perform pinyin conversion based on the part of speech of each chinese word; that is, the pinyin corresponding to each chinese word is determined based on the pinyin conversion tool and the part-of-speech of each chinese word in the list of chinese words. By the method, when polyphones appear in the input text, the accurate pinyin of each Chinese word is determined according to the part of speech of each Chinese word, so that the speech synthesis of the Chinese words in the input text is more accurate.
For example, for the Chinese word list [编程, 的, 第一课, 是, 学习] from the previous example, the pinyin conversion tool gives:
the pinyin corresponding to "编程" (programming) is "biān chéng";
the pinyin corresponding to "的" (of) is "de";
the pinyin corresponding to "第一课" (first lesson) is "dì yī kè";
the pinyin corresponding to "是" (is) is "shì";
the pinyin corresponding to "学习" (learning) is "xué xí".
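The part-of-speech-based pinyin determination described above can be illustrated with a small lookup sketch. Everything here is an assumption made for illustration: the two tables are hand-written stand-ins for a pinyin conversion tool, the tag names "v", "a", "n" and "u" are hypothetical, and the polyphonic character 长 (cháng as the adjective "long", zhǎng as the verb "to grow") is used only to show how a part of speech can select a reading.

```python
# Hypothetical default readings; a real system would query a pinyin tool.
DEFAULT_PINYIN = {
    "编程": "biān chéng",
    "的": "de",
    "第一课": "dì yī kè",
    "是": "shì",
    "学习": "xué xí",
    "长": "cháng",
}

# (word, part of speech) -> pinyin, consulted first for polyphonic words.
POS_PINYIN = {
    ("长", "v"): "zhǎng",  # verb, "to grow"
    ("长", "a"): "cháng",  # adjective, "long"
}


def to_pinyin(tagged_words):
    """Convert (word, part_of_speech) pairs into (word, pinyin) pairs."""
    result = []
    for word, pos in tagged_words:
        # Prefer the reading tied to the part of speech, then the default.
        pinyin = POS_PINYIN.get((word, pos), DEFAULT_PINYIN.get(word))
        if pinyin is None:
            raise KeyError(f"no pinyin entry for {word!r}")
        result.append((word, pinyin))
    return result
```

The design point is only that the part-of-speech tag is part of the lookup key, so the same written word can map to different pinyin in different contexts.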
Step 103: searching for the phonemes corresponding to each English word in the English word list based on a preset word prefix dictionary.
In the embodiment of the present application, because English is pronounced using phonemes, the phonemes corresponding to each English word in the English word list can be looked up in a preset word prefix dictionary, cmudict, where the word prefix dictionary is configured with at least one English word and its corresponding phonemes. An example of this word prefix dictionary is given below:
Word     Phoneme
HELLO    HH AH L OW
WORLD    W ER L D
……       ……
For example, for the English word list [hello, world] from the previous example, the word prefix dictionary gives:
the phoneme corresponding to "hello" is "HH AH L OW";
the phoneme corresponding to "world" is "W ER L D".
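The lookup in step 103 can be sketched as a plain dictionary query. This is a minimal sketch under stated assumptions: `DICTIONARY_TEXT` holds only the two entries from the example above, the "WORD PH1 PH2 ..." line layout is assumed from that example, and the function names are hypothetical. Words that are not found are collected separately, since they are the candidates for the grapheme-to-phoneme model.

```python
# Only the two example entries; a real dictionary would hold far more words.
DICTIONARY_TEXT = """\
HELLO HH AH L OW
WORLD W ER L D
"""


def parse_dictionary(text):
    """Parse 'WORD PH1 PH2 ...' lines into {WORD: [PH1, PH2, ...]}."""
    entries = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            entries[parts[0].upper()] = parts[1:]
    return entries


def lookup_phonemes(english_words, dictionary):
    """Return (found, targets): phonemes for known words, and the misses."""
    found, targets = {}, []
    for word in english_words:
        phonemes = dictionary.get(word.upper())
        if phonemes is None:
            targets.append(word)  # no entry: a target English word
        else:
            found[word] = phonemes
    return found, targets
```

Calling `lookup_phonemes(["hello", "world"], parse_dictionary(DICTIONARY_TEXT))` finds both words and leaves the list of target English words empty.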
Step 104: if a target English word exists, inputting the target English word into a grapheme-to-phoneme model to obtain the phoneme corresponding to the target English word output by the grapheme-to-phoneme model.
In the embodiment of the present application, because the number of English words stored in the word prefix dictionary is limited, some uncommon English words may have no corresponding phonemes in the dictionary; these English words are recorded as target English words. That is, a target English word is an English word in the English word list whose phonemes cannot be found through the word prefix dictionary. Each target English word may be input into a grapheme-to-phoneme (G2P) model; the intelligent device then takes the phonemes output by the model as the phonemes corresponding to that target English word. The grapheme-to-phoneme model employed in the embodiments of the present application is briefly introduced as follows:
Grapheme-to-phoneme conversion can be regarded as a machine translation task in which a source grapheme sequence is converted into a target phoneme sequence. In that approach an alignment model is built first, followed by a translation model based on an n-gram model; the n-gram-based translation model is typically implemented as a weighted finite-state transducer (WFST). Alternatively, the conversion can be treated as a classification problem and solved with a maximum entropy classifier, or viewed as a sequence labeling problem and solved with statistical sequence labeling techniques such as Conditional Random Field (CRF) and perceptron (HMN). Specifically, the embodiment of the present application uses a grapheme-to-phoneme model based on a Long Short-Term Memory (LSTM) artificial neural network, in which the length of the input layer equals the number of graphemes and the length of the output layer equals the number of phonemes. Taking English to have 27 graphemes and 40 phonemes, the input layer is a one-hot coding layer of length 27, and the output layer is a one-hot coding layer of length 40.
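The one-hot input and output layers mentioned above can be illustrated without the network itself. The sketch below covers only the encoding and decoding around the model, not the LSTM. It assumes that the 27 graphemes are the 26 lowercase letters plus the apostrophe, which the text does not state, and the phoneme inventory is passed in by the caller rather than listed here; all function names are hypothetical.

```python
import string

# Assumption: 26 lowercase letters plus the apostrophe make up 27 graphemes.
GRAPHEMES = list(string.ascii_lowercase) + ["'"]
GRAPHEME_INDEX = {g: i for i, g in enumerate(GRAPHEMES)}


def one_hot(index, size):
    """Return a vector of zeros with a single 1 at the given index."""
    vec = [0] * size
    vec[index] = 1
    return vec


def encode_word(word):
    """Encode each grapheme of a word as a one-hot vector of length 27."""
    return [one_hot(GRAPHEME_INDEX[ch], len(GRAPHEMES)) for ch in word.lower()]


def decode_phonemes(output_vectors, phoneme_inventory):
    """Pick, for each output vector, the phoneme with the highest score."""
    decoded = []
    for vec in output_vectors:
        best = max(range(len(vec)), key=vec.__getitem__)
        decoded.append(phoneme_inventory[best])
    return decoded
```

A trained model would sit between `encode_word` and `decode_phonemes`, turning the sequence of length-27 input vectors into a sequence of score vectors whose length matches the phoneme inventory.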
Step 105: performing speech synthesis on the input text according to the pinyin corresponding to each Chinese word in the Chinese word list and the phoneme corresponding to each English word in the English word list.
In the embodiment of the application, after obtaining the pinyin of each chinese word and the phoneme of each english word, the speech synthesis system can determine how each word in the input text should be pronounced, so as to implement speech synthesis of the input text. Specifically, the intelligent device may first generate a pronunciation list of the input text according to the pinyin corresponding to each chinese word in the chinese word list and the phoneme corresponding to each english word in the english word list, and input the pronunciation list to the speech synthesis system to instruct the speech synthesis system to perform speech synthesis on the input text based on the pronunciation list.
For example, for the input text "编程的第一课是学习hello world" ("the first lesson of programming is learning hello world"), the generated pronunciation list could be:
Word      Pronunciation identification
编程      biān chéng
的        de
第一课    dì yī kè
是        shì
学习      xué xí
hello     HH AH L OW
world     W ER L D
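One way to assemble such a pronunciation list is sketched below. It assumes that the list should follow the order in which the words appear in the input text, which the description does not spell out, and that each word list arrives in order of appearance; the function name is hypothetical.

```python
def build_pronunciation_list(text, chinese, english):
    """Merge two (word, pronunciation) lists in input-text order.

    Both lists are assumed to be ordered by appearance in the text.
    """
    queues = [list(chinese), list(english)]
    result, pos = [], 0
    while pos < len(text):
        for queue in queues:
            # Take the next pending word if it starts at this position.
            if queue and text.startswith(queue[0][0], pos):
                word, pronunciation = queue.pop(0)
                result.append((word, pronunciation))
                pos += len(word)
                break
        else:
            pos += 1  # skip spaces, punctuation and anything unmatched
    return result
```

Scanning the text once keeps Chinese and English entries interleaved correctly even when English words appear in the middle of a Chinese sentence.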
In some embodiments, it is considered that the grapheme-to-phoneme model, although being able to convert more rare english words into corresponding phonemes, still cannot achieve a hundred percent conversion accuracy; based on this, after the step 105, the speech synthesis method may further include:
if receiving user voice input by a user based on the target English word, converting the user voice into a phoneme;
and updating the phoneme corresponding to the target English word into the phoneme converted by the user voice.
After the speech synthesis is performed in step 105, the result of the speech synthesis may be output, and any target English word in the input text may be marked on the screen of the smart device. After hearing the voice output, the user can adjust a target English word whose pronunciation is inaccurate, for example by inputting a selection instruction on the screen to select the target English word that needs adjustment. The smart device may then turn on the microphone to receive the user's voice. If user voice input based on the target English word is received, the user voice can be converted into phonemes, and the phoneme corresponding to the target English word is updated to the phoneme obtained from this conversion. Through this process, an uncommon English word whose pronunciation is wrong or nonstandard can be corrected, which improves the accuracy of subsequent speech synthesis. In addition, the target English word and its updated phoneme may be added to the word prefix dictionary to update the dictionary. If the same English word appears again in a later input text, its phoneme can then be obtained directly from the word prefix dictionary, which improves the efficiency of phoneme acquisition to a certain extent.
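The correction flow above can be sketched as a small bookkeeping class. This is a hypothetical illustration: the class and method names are invented, the conversion of user speech into phonemes is out of scope and is represented by a phoneme list passed in by the caller, and the example word and phonemes used with it are purely illustrative.

```python
class PronunciationStore:
    """Track dictionary entries and model guesses for target English words."""

    def __init__(self, dictionary):
        self.dictionary = dict(dictionary)  # WORD -> list of phonemes
        self.g2p_guesses = {}               # target WORD -> model output

    def record_g2p(self, word, phonemes):
        """Remember a grapheme-to-phoneme guess for a target English word."""
        self.g2p_guesses[word.upper()] = list(phonemes)

    def pending_targets(self):
        """Words that could be marked on screen for user correction."""
        return sorted(self.g2p_guesses)

    def apply_user_correction(self, word, user_phonemes):
        """Replace the guess with the user's phonemes and store it."""
        key = word.upper()
        if key not in self.g2p_guesses:
            raise KeyError(f"{word!r} is not a target English word")
        del self.g2p_guesses[key]
        self.dictionary[key] = list(user_phonemes)

    def phonemes_for(self, word):
        """Prefer the dictionary entry, then fall back to the model guess."""
        key = word.upper()
        if key in self.dictionary:
            return self.dictionary[key]
        return self.g2p_guesses.get(key)
```

Once a correction has been applied, the word is served from the dictionary on later lookups and no longer appears among the pending target words.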
In some embodiments, it is considered that the embodiment of the present application mainly performs speech synthesis on a chinese-english mixed text, and in practical applications, the chinese-english mixed text is not a mainstream text; that is, there are still large batches of text that are monolingual text. Based on this, before the step 101, the speech synthesis method further includes:
detecting whether the input text has Chinese and English at the same time;
if the input text contains both Chinese and English, loading the word prefix dictionary and the grapheme-to-phoneme model.
After receiving an input text that requires speech synthesis, that is, an input text to be pronounced, the intelligent device can detect the languages in the input text to determine whether it contains both Chinese and English. The intelligent device may receive characters entered by a user to obtain the input text, or it may import and parse a file specified by the user; the manner of obtaining the input text is not limited here. Illustratively, the languages present in the input text may be detected using the langid algorithm, the langdetect algorithm, or the like. The method provided by the embodiment of the application is used for speech synthesis only when the input text contains both Chinese and English; in that case the pinyin conversion tool, the word prefix dictionary and the grapheme-to-phoneme model are loaded in preparation for subsequently determining the pinyin of the Chinese words and the phonemes of the English words.
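The detect-then-load behaviour can be sketched as follows. Note that this sketch swaps in a simple character-range check in place of the langid or langdetect tools named above: it only tests for CJK Unified Ideographs and Latin letters, so it cannot tell English apart from other Latin-script languages. The class, the function names and the resource names "prefix_dictionary" and "g2p_model" are all hypothetical.

```python
import re

CJK = re.compile(r"[\u4e00-\u9fff]")   # CJK Unified Ideographs block
LATIN = re.compile(r"[A-Za-z]")        # basic Latin letters


def has_chinese_and_english(text):
    """Rough check: at least one CJK ideograph and one Latin letter."""
    return bool(CJK.search(text)) and bool(LATIN.search(text))


class LazyResources:
    """Load each resource the first time it is requested, then cache it."""

    def __init__(self, loaders):
        self._loaders = loaders  # name -> zero-argument loader function
        self._loaded = {}

    def get(self, name):
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]


def prepare(text, resources):
    """Load the mixed-text resources only for Chinese-English input."""
    if not has_chinese_and_english(text):
        return None  # single-language text: nothing is loaded here
    names = ("prefix_dictionary", "g2p_model")
    return {name: resources.get(name) for name in names}
```

Deferring the loading this way means single-language input never pays the cost of loading the dictionary and the model.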
As can be seen from the above, through the embodiments of the present application, words belonging to english and words belonging to chinese in an input text are separately processed; in addition, in consideration of the fact that the limited words stored in the word prefix dictionary may cause target English words to appear, the embodiment of the application also provides a remedial measure, the speech synthesis of uncommon English words is further guaranteed through the grapheme-to-phoneme model, and the speech synthesis effect of the intelligent equipment in the face of Chinese-English mixed texts can be greatly improved.
Corresponding to the speech synthesis method proposed above, an embodiment of the present application provides a speech synthesis apparatus, which is integrated in an intelligent device. Referring to fig. 3, a speech synthesis apparatus 300 according to an embodiment of the present invention includes:
a text segmentation unit 301, configured to perform segmentation processing on an input text based on a preset segmentation algorithm to obtain a chinese word list and an english word list, where the chinese word list includes each chinese word constituting the input text, and the english word list includes each english word constituting the input text;
a pinyin determining unit 302, configured to determine a pinyin corresponding to each chinese term in the chinese term list;
a first phoneme determining unit 303, configured to search, based on a preset word prefix dictionary, a phoneme corresponding to each english word in the english word list, where the word prefix dictionary is configured with at least one english word and a corresponding phoneme;
a second phoneme determining unit 304, configured to, if there is a target english word, input the target english word into a grapheme-to-phoneme model to obtain a phoneme corresponding to the target english word output by the grapheme-to-phoneme model;
a speech synthesis unit 305, configured to perform speech synthesis on the input text according to the pinyin corresponding to each chinese word in the chinese word list and the phoneme corresponding to each english word in the english word list.
Optionally, the speech synthesis apparatus 300 further includes:
a speech conversion unit, configured to, after the speech synthesis unit performs speech synthesis on the input text according to the pinyin corresponding to each chinese word in the chinese word list and the phoneme corresponding to each english word in the english word list, convert the user speech into phonemes if a user speech input by a user based on the target english word is received;
and a phoneme updating unit, configured to update the phoneme corresponding to the target english word to the phoneme obtained by the user speech conversion.
Optionally, the speech synthesis apparatus 300 further includes:
a dictionary updating unit, configured to, after the phoneme corresponding to the target English word is updated to the phoneme obtained by converting the user speech, add the target English word and its updated phoneme to the word prefix dictionary to update the word prefix dictionary.
Optionally, the speech synthesis apparatus 300 further includes:
a part-of-speech tagging unit, configured to, after the text segmentation unit 301 performs word segmentation processing on the input text based on the preset word segmentation algorithm to obtain the Chinese word list and the English word list, perform part-of-speech tagging on each Chinese word in the Chinese word list based on the input text to obtain the part of speech of each Chinese word;
accordingly, the pinyin determining unit 302 is specifically configured to determine pinyins corresponding to each chinese word based on the part-of-speech of each chinese word in the chinese word list.
Optionally, the speech synthesis apparatus 300 further includes:
a text detection unit, configured to detect whether the input text has both chinese and english words before the text segmentation unit 301 performs a segmentation process on the input text based on a preset segmentation algorithm to obtain a chinese word list and an english word list;
and a loading unit, configured to load the word prefix dictionary and the grapheme-to-phoneme model if the input text contains both Chinese and English.
Optionally, the speech synthesis unit 305 includes:
a pronunciation list generating subunit, configured to generate a pronunciation list of the input text according to the pinyin corresponding to each chinese word in the chinese word list and the phoneme corresponding to each english word in the english word list;
and the pronunciation list input subunit is used for inputting the pronunciation list to a preset voice synthesis system so as to instruct the voice synthesis system to carry out voice synthesis on the input text based on the pronunciation list.
As can be seen from the above, according to the embodiment of the present application, words belonging to english and words belonging to chinese in an input text are separately processed; in addition, in consideration of the fact that the limited words stored in the word prefix dictionary may cause target English words to appear, the embodiment of the application also provides a remedial measure, the speech synthesis of uncommon English words is further guaranteed through the grapheme-to-phoneme model, and the speech synthesis effect of the intelligent equipment in the face of Chinese-English mixed texts can be greatly improved.
An embodiment of the present application further provides an intelligent device, please refer to fig. 4, where the intelligent device 4 in the embodiment of the present application includes: a memory 401, one or more processors 402 (only one shown in fig. 4), and computer programs stored on the memory 401 and executable on the processors. Wherein: the memory 401 is used for storing software programs and units, and the processor 402 executes various functional applications and data processing by running the software programs and units stored in the memory 401, so as to acquire resources corresponding to the preset events. Specifically, the processor 402, by running the above-mentioned computer program stored in the memory 401, implements the steps of:
performing word segmentation processing on an input text based on a preset word segmentation algorithm to obtain a Chinese word list and an English word list, wherein the Chinese word list comprises each Chinese word forming the input text, and the English word list comprises each English word forming the input text;
determining the pinyin corresponding to each Chinese word in the Chinese word list;
searching phonemes corresponding to each English word in the English word list based on a preset word prefix dictionary, wherein the word prefix dictionary is configured with at least one English word and corresponding phonemes;
if the target English word exists, inputting the target English word into a grapheme-to-phoneme model to obtain a phoneme corresponding to the target English word output by the grapheme-to-phoneme model;
and performing speech synthesis on the input text according to the pinyin corresponding to each Chinese word in the Chinese word list and the phoneme corresponding to each English word in the English word list.
Assuming that the above is the first possible implementation manner, in a second possible implementation manner provided on the basis of the first possible implementation manner, after performing speech synthesis on the input text according to the pinyin corresponding to each Chinese word in the Chinese word list and the phoneme corresponding to each English word in the English word list, the processor 402 further implements the following steps when running the computer program stored in the memory 401:
if a user voice input by a user based on the target English word is received, converting the user voice into a phoneme;
and updating the phoneme corresponding to the target English word to the phoneme converted from the user voice.
In a third possible implementation manner provided on the basis of the second possible implementation manner, after the phoneme corresponding to the target English word is updated to the phoneme converted from the user voice, the processor 402 further implements the following steps when running the computer program stored in the memory 401:
and adding the target English word and the phoneme corresponding to the updated target English word into the word prefix dictionary to update the word prefix dictionary.
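The correction flow of the second and third implementation manners might look like the sketch below. It is an illustration under assumptions: `apply_user_correction` and its parameters are hypothetical names, the speech-to-phoneme converter is passed in as a callable because the text does not specify how user voice is converted, and the dictionary is modeled as a plain mapping.

```python
from typing import Callable, Dict, List, Optional

Phonemes = List[str]


def apply_user_correction(
    word: str,
    user_audio: Optional[bytes],
    speech_to_phonemes: Callable[[bytes], Phonemes],
    current_phonemes: Dict[str, Phonemes],
    prefix_dict: Dict[str, Phonemes],
) -> Phonemes:
    """Replace the phonemes of a target English word with the user's pronunciation.

    If no user voice was received, the existing phonemes are kept unchanged.
    """
    if user_audio is None:
        return current_phonemes[word]
    corrected = speech_to_phonemes(user_audio)
    # Second implementation manner: update the phonemes of the target word.
    current_phonemes[word] = corrected
    # Third implementation manner: store the word in the dictionary so that
    # later lookups find it directly.
    prefix_dict[word] = list(corrected)
    return corrected
```

After such an update, the word is no longer a dictionary miss, so a later synthesis of the same word would not need the grapheme-to-phoneme model.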
In a fourth possible implementation manner provided on the basis of the first possible implementation manner, after performing word segmentation processing on an input text based on a preset word segmentation algorithm to obtain a Chinese word list and an English word list, the processor 402 further implements the following steps when running the computer program stored in the memory 401:
performing part-of-speech tagging on each Chinese word in the Chinese word list based on the input text to obtain the part-of-speech of each Chinese word;
correspondingly, the determining the pinyin corresponding to each Chinese word in the Chinese word list includes:
and determining the pinyin corresponding to each Chinese word based on the part of speech of each Chinese word in the Chinese word list.
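Part of speech matters mainly for polyphonic characters. As a hedged illustration, the character 长 is read cháng as an adjective ("long") and zhǎng as a verb ("to grow"). The sketch below assumes a hypothetical table keyed by word and part-of-speech tag; the tag names `a` and `v`, the table contents, and the function name are illustrative and do not come from the application.

```python
from typing import Dict, Optional, Tuple

# Hypothetical readings keyed by (word, part-of-speech tag).
POLYPHONE_TABLE: Dict[Tuple[str, str], str] = {
    ("长", "a"): "chang2",  # adjective: "long"
    ("长", "v"): "zhang3",  # verb: "to grow"
}

# Hypothetical default readings, used when the tag does not disambiguate.
DEFAULT_PINYIN: Dict[str, str] = {"长": "chang2", "大": "da4"}


def pinyin_for(word: str, pos: str) -> Optional[str]:
    """Return the reading for (word, pos), else the default reading, else None."""
    return POLYPHONE_TABLE.get((word, pos), DEFAULT_PINYIN.get(word))
```

Falling back to a default reading keeps the lookup total for words that are not polyphonic, so only ambiguous entries need a part-of-speech key.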
In a fifth possible implementation manner provided on the basis of the first possible implementation manner, before performing word segmentation processing on an input text based on a preset word segmentation algorithm to obtain a Chinese word list and an English word list, the processor 402 further implements the following steps when running the computer program stored in the memory 401:
detecting whether the input text has Chinese and English at the same time;
and if the input text contains both Chinese and English, loading the word prefix dictionary and the grapheme-to-phoneme model.
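One way to read this step is as lazy loading: the English resources are loaded only when the text is actually mixed. The sketch below makes that reading explicit and is an assumption throughout; the character ranges used for detection, the `EnglishResources` class, and the loader callables are all hypothetical.

```python
import re
from typing import Any, Callable, Optional

# Assumed detection rule: at least one CJK unified ideograph and at least one
# Latin letter. Other scripts and digits are ignored in this sketch.
CHINESE_RE = re.compile(r"[\u4e00-\u9fff]")
ENGLISH_RE = re.compile(r"[A-Za-z]")


def is_mixed(text: str) -> bool:
    """Return True if the text contains both Chinese and English characters."""
    return bool(CHINESE_RE.search(text)) and bool(ENGLISH_RE.search(text))


class EnglishResources:
    """Holds the word prefix dictionary and the model, loaded at most once."""

    def __init__(self, load_dictionary: Callable[[], Any], load_g2p: Callable[[], Any]):
        self._load_dictionary = load_dictionary
        self._load_g2p = load_g2p
        self.dictionary: Optional[Any] = None
        self.g2p: Optional[Any] = None

    def ensure_loaded(self) -> None:
        if self.dictionary is None:
            self.dictionary = self._load_dictionary()
        if self.g2p is None:
            self.g2p = self._load_g2p()


def prepare(text: str, resources: EnglishResources) -> bool:
    """Load the English resources only for mixed text; report whether it was mixed."""
    if is_mixed(text):
        resources.ensure_loaded()
        return True
    return False
```

With this arrangement, purely Chinese input never pays the cost of loading the dictionary or the model.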
In a sixth possible implementation manner provided on the basis of the first, second, third, fourth, or fifth possible implementation manner, the performing speech synthesis on the input text according to the pinyin corresponding to each Chinese word in the Chinese word list and the phoneme corresponding to each English word in the English word list includes:
generating a pronunciation list of the input text according to the pinyin corresponding to each Chinese word in the Chinese word list and the phoneme corresponding to each English word in the English word list;
and inputting the pronunciation list into a preset voice synthesis system to instruct the voice synthesis system to perform voice synthesis on the input text based on the pronunciation list.
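The two steps above can be sketched as merging the pinyin and phoneme results into one list and handing it to a synthesis backend. The sketch assumes that the pronunciation list follows the original token order, which the text does not state explicitly; the function names and the `backend` callable are hypothetical stand-ins for the preset voice synthesis system.

```python
from typing import Callable, Dict, List


def build_pronunciation_list(
    tokens: List[str],
    pinyin: Dict[str, str],
    phonemes: Dict[str, List[str]],
) -> List[str]:
    """Merge per-word pinyin and phonemes into one list, in token order."""
    pronunciation: List[str] = []
    for token in tokens:
        if token in pinyin:
            pronunciation.append(pinyin[token])
        elif token.lower() in phonemes:
            pronunciation.extend(phonemes[token.lower()])
        else:
            raise KeyError(f"no pronunciation for token: {token!r}")
    return pronunciation


def synthesize(pronunciation: List[str], backend: Callable[[List[str]], bytes]) -> bytes:
    """Pass the pronunciation list to a synthesis backend and return its output."""
    return backend(pronunciation)
```

Raising an error for an unknown token, rather than skipping it, makes a gap in the earlier lookup steps visible instead of silently dropping part of the input text.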
It should be understood that, in the embodiments of the present application, the processor 402 may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 401 may include read-only memory and random access memory, and provides instructions and data to the processor 402. A part or all of the memory 401 may also include non-volatile random access memory. For example, the memory 401 may also store device type information.
As can be seen from the above, in the embodiments of the present application the English words and the Chinese words in an input text are processed separately. In addition, because the word prefix dictionary stores only a limited number of words, target English words, that is, English words whose phonemes cannot be found in the dictionary, may appear. The embodiments of the present application therefore also provide a remedial measure: the grapheme-to-phoneme model ensures that uncommon English words can still be synthesized, which can greatly improve the speech synthesis effect of the intelligent device on mixed Chinese-English text.
It will be apparent to those skilled in the art that, for convenience and brevity of description, the above division into functional units and modules is used only as an example. In practical applications, the above functions may be allocated to different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to implement all or part of the functions described above. The functional units and modules in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for convenience of distinguishing them from each other, and are not used to limit the protection scope of the present application. For the specific working processes of the units and modules in the system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art would appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of external device software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the system embodiments described above are merely illustrative: the division into the above modules or units is only a division by logical function, and other divisions are possible in an actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The integrated unit may be stored in a computer-readable storage medium if it is implemented in the form of a software functional unit and sold or used as a separate product. Based on such understanding, all or part of the flow in the methods of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and which, when executed by a processor, realizes the steps of the method embodiments described above. The computer program includes computer program code, and the computer program code may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable storage medium may include: any entity or device capable of carrying the above computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer-readable memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content covered by the computer-readable storage medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable storage medium does not include electrical carrier signals and telecommunication signals.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method of speech synthesis, comprising:
performing word segmentation processing on an input text based on a preset word segmentation algorithm to obtain a Chinese word list and an English word list, wherein the Chinese word list comprises each Chinese word forming the input text, and the English word list comprises each English word forming the input text;
determining the pinyin corresponding to each Chinese word in the Chinese word list;
searching phonemes corresponding to each English word in the English word list respectively based on a preset word prefix dictionary, wherein the word prefix dictionary is configured with at least one English word and corresponding phonemes;
if the target English word exists, inputting the target English word into a grapheme-to-phoneme model to obtain a phoneme corresponding to the target English word output by the grapheme-to-phoneme model;
and performing speech synthesis on the input text according to the pinyin corresponding to each Chinese word in the Chinese word list and the phoneme corresponding to each English word in the English word list.
2. The speech synthesis method of claim 1, wherein after the speech synthesis of the input text is performed according to the pinyin corresponding to each Chinese word in the Chinese word list and the phoneme corresponding to each English word in the English word list, the speech synthesis method further comprises:
if a user voice input by a user based on the target English word is received, converting the user voice into a phoneme;
and updating the phoneme corresponding to the target English word to the phoneme obtained by converting the user voice.
3. The speech synthesis method of claim 2, wherein after updating the phoneme corresponding to the target English word to the phoneme obtained by converting the user voice, the speech synthesis method further comprises:
and adding the target English word and the phoneme corresponding to the updated target English word into the word prefix dictionary to update the word prefix dictionary.
4. The speech synthesis method of claim 1, wherein after performing word segmentation processing on the input text based on the preset word segmentation algorithm to obtain the Chinese word list and the English word list, the speech synthesis method further comprises:
performing part-of-speech tagging on each Chinese word in the Chinese word list based on the input text to obtain the part of speech of each Chinese word;
correspondingly, the determining the pinyin corresponding to each Chinese word in the Chinese word list comprises:
determining the pinyin corresponding to each Chinese word based on the part of speech of each Chinese word in the Chinese word list.
5. The speech synthesis method of claim 1, wherein before performing word segmentation processing on the input text based on the preset word segmentation algorithm to obtain the Chinese word list and the English word list, the speech synthesis method further comprises:
detecting whether the input text contains both Chinese and English;
and if the input text contains both Chinese and English, loading the word prefix dictionary and the grapheme-to-phoneme model.
6. The speech synthesis method of any one of claims 1 to 5, wherein the performing speech synthesis of the input text according to the pinyin corresponding to each Chinese word in the Chinese word list and the phoneme corresponding to each English word in the English word list comprises:
generating a pronunciation list of the input text according to the pinyin corresponding to each Chinese word in the Chinese word list and the phoneme corresponding to each English word in the English word list;
and inputting the pronunciation list to a preset voice synthesis system so as to instruct the voice synthesis system to perform voice synthesis on the input text based on the pronunciation list.
7. A speech synthesis apparatus, comprising:
a text word segmentation unit, configured to perform word segmentation processing on an input text based on a preset word segmentation algorithm to obtain a Chinese word list and an English word list, wherein the Chinese word list comprises each Chinese word forming the input text, and the English word list comprises each English word forming the input text;
a pinyin determining unit, configured to determine the pinyin corresponding to each Chinese word in the Chinese word list;
a first phoneme determining unit, configured to search for the phoneme corresponding to each English word in the English word list based on a preset word prefix dictionary, wherein the word prefix dictionary is configured with at least one English word and its corresponding phoneme;
a second phoneme determining unit, configured to, if a target English word exists, input the target English word into a grapheme-to-phoneme model to obtain the phoneme corresponding to the target English word output by the grapheme-to-phoneme model;
and a voice synthesis unit, configured to perform speech synthesis on the input text according to the pinyin corresponding to each Chinese word in the Chinese word list and the phoneme corresponding to each English word in the English word list.
8. The speech synthesis apparatus of claim 7, wherein the speech synthesis apparatus further comprises:
a voice conversion unit, configured to, after the voice synthesis unit performs speech synthesis on the input text according to the pinyin corresponding to each Chinese word in the Chinese word list and the phoneme corresponding to each English word in the English word list, convert a user voice into a phoneme if the user voice input by a user based on the target English word is received;
and a phoneme updating unit, configured to update the phoneme corresponding to the target English word to the phoneme obtained by converting the user voice.
9. A smart device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 6.
CN202011380178.5A 2020-11-30 2020-11-30 Voice synthesis method, voice synthesis device and intelligent equipment Pending CN112530404A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011380178.5A CN112530404A (en) 2020-11-30 2020-11-30 Voice synthesis method, voice synthesis device and intelligent equipment

Publications (1)

Publication Number Publication Date
CN112530404A true CN112530404A (en) 2021-03-19

Family

ID=74996033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011380178.5A Pending CN112530404A (en) 2020-11-30 2020-11-30 Voice synthesis method, voice synthesis device and intelligent equipment

Country Status (1)

Country Link
CN (1) CN112530404A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112951204A (en) * 2021-03-29 2021-06-11 北京大米科技有限公司 Speech synthesis method and device
CN113345408A (en) * 2021-06-02 2021-09-03 云知声智能科技股份有限公司 Chinese and English voice mixed synthesis method and device, electronic equipment and storage medium
CN113380221A (en) * 2021-06-21 2021-09-10 携程科技(上海)有限公司 Chinese and English mixed speech synthesis method and device, electronic equipment and storage medium

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09152884A (en) * 1995-11-30 1997-06-10 Fujitsu Ten Ltd Speech synthesizing device
CN1196531A (en) * 1997-04-14 1998-10-21 英业达股份有限公司 Articulation compounding method for computer phonetic signal
US6829580B1 (en) * 1998-04-24 2004-12-07 British Telecommunications Public Limited Company Linguistic converter
CN1629933A (en) * 2003-12-17 2005-06-22 摩托罗拉公司 Sound unit for bilingualism connection and speech synthesis
US20070038455A1 (en) * 2005-08-09 2007-02-15 Murzina Marina V Accent detection and correction system
WO2009150591A1 (en) * 2008-06-11 2009-12-17 Koninklijke Philips Electronics N.V. Method and device for the generation of a topic-specific vocabulary and computer program product
CN102543069A (en) * 2010-12-30 2012-07-04 财团法人工业技术研究院 Multi-language text-to-speech synthesis system and method
JP2015118222A (en) * 2013-12-18 2015-06-25 株式会社日立超エル・エス・アイ・システムズ Voice synthesis system and voice synthesis method
US20160358596A1 (en) * 2015-06-08 2016-12-08 Nuance Communications, Inc. Process for improving pronunciation of proper nouns foreign to a target language text-to-speech system
US20170133010A1 (en) * 2013-05-30 2017-05-11 Promptu Systems Corporation Systems and methods for adaptive proper name entity recognition and understanding
CN108172211A (en) * 2017-12-28 2018-06-15 云知声(上海)智能科技有限公司 Adjustable waveform concatenation system and method
CN109545183A (en) * 2018-11-23 2019-03-29 北京羽扇智信息科技有限公司 Text handling method, device, electronic equipment and storage medium
CN109767755A (en) * 2019-03-01 2019-05-17 广州多益网络股份有限公司 A kind of phoneme synthesizing method and system
CN109887506A (en) * 2019-03-21 2019-06-14 广东美的制冷设备有限公司 Control method, device, the apparatus of air conditioning and the server of the apparatus of air conditioning
CN109903748A (en) * 2019-02-14 2019-06-18 平安科技(深圳)有限公司 A kind of phoneme synthesizing method and device based on customized sound bank
WO2019165748A1 (en) * 2018-02-28 2019-09-06 科大讯飞股份有限公司 Speech translation method and apparatus
CN111402862A (en) * 2020-02-28 2020-07-10 问问智能信息科技有限公司 Voice recognition method, device, storage medium and equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
王永生等: "英语语音合成中基于DFGA的字音转换算法", 《计算机工程与应用》, no. 13, pages 158 - 161 *
纪正飚;王吉林;赵力;: "基于HMM的中英文语音合成技术研究", 科学技术与工程, no. 32 *
胡刚;王嘉梅;李炳泽;林睿;林碧彤;: "汉英-泰互译有声语料的数据库研究", 计算机系统应用, no. 09 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination