CN112420017A - Speech synthesis method and device - Google Patents

Speech synthesis method and device

Info

Publication number
CN112420017A
Authority
CN
China
Prior art keywords: features, determining, processed, acoustic, phoneme
Prior art date
Legal status: Pending
Application number
CN202011266074.1A
Other languages
Chinese (zh)
Inventor
满达
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN202011266074.1A
Publication of CN112420017A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; Stress or intonation

Abstract

The present disclosure provides a speech synthesis method, which includes: receiving a voice service request from a user terminal; determining a corpus to be processed associated with the voice service request; determining text features for the corpus to be processed, the text features including a phoneme sequence and prosodic features; determining acoustic features for the corpus to be processed based on the text features; performing speech synthesis for the corpus to be processed based on the text features and the acoustic features; and returning the speech synthesis result to the user terminal. The present disclosure also provides a speech synthesis apparatus, an electronic device and a computer-readable storage medium.

Description

Speech synthesis method and device
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly, to a speech synthesis method, a speech synthesis apparatus, an electronic device, and a computer-readable storage medium.
Background
As computer technology matures, the field of artificial intelligence has developed rapidly, and speech synthesis plays an important role in it. With the spread of artificial intelligence technology, the demand for speech synthesis services keeps growing.
In the course of realizing the concept of the present disclosure, the inventor found that, because different platforms have different operating systems and software environments, a corresponding speech synthesis tool needs to be designed for each platform; moreover, because different user terminals have different hardware environments and configuration parameters, the running performance of the same speech synthesis tool may differ across user terminals. This increases the development cost of speech synthesis tools on the one hand and affects their usability on the other.
Disclosure of Invention
In view of this, the present disclosure provides a speech synthesis method and apparatus with low development cost, strong versatility and a stable speech synthesis effect.
One aspect of the present disclosure provides a speech synthesis method applied to a speech synthesis engine, including: receiving a voice service request from a user terminal; determining a corpus to be processed associated with the voice service request; determining text features for the corpus to be processed, the text features including a phoneme sequence and prosodic features; determining acoustic features for the corpus to be processed based on the text features; performing speech synthesis for the corpus to be processed based on the text features and the acoustic features; and returning the speech synthesis result to the user terminal.
Optionally, the receiving a voice service request from a user terminal includes: receiving the voice service request through a preset communication link, where the preset communication link supports offline data transmission.
Optionally, the determining text features for the corpus to be processed includes: obtaining a word segmentation sequence for the corpus to be processed; determining a phoneme sequence associated with the word segmentation sequence; and determining the pronunciation duration of each phoneme in the phoneme sequence by using a preset prosodic structure model, so as to obtain the prosodic features.
Optionally, the determining, by using a preset prosodic structure model, the pronunciation duration of each phoneme in the phoneme sequence to obtain the prosodic features includes: inputting the phoneme sequence into the prosodic structure model and determining at least one prosodic structure associated with the phoneme sequence, where the prosodic structure includes at least one of a prosodic word structure, a prosodic phrase structure and a semantic phrase structure; determining a pause duration for each prosodic structure according to a preset association between prosodic structures and pause durations; and determining the pronunciation duration of each phoneme in the phoneme sequence according to the pause duration for each prosodic structure, so as to obtain the prosodic features.
Optionally, the determining, based on the text features, acoustic features for the corpus to be processed includes: converting each phoneme into a frame sequence unit according to the pronunciation duration of each phoneme indicated by the prosodic features; and determining the acoustic features associated with each frame sequence unit by using a preset acoustic feature model, so as to obtain the acoustic features for the corpus to be processed.
Optionally, the acoustic feature model includes: a basic acoustic model obtained by training on general sample data, and/or a personalized acoustic model obtained by training the basic acoustic model with personalized sample data, where the general sample data includes voice and/or corpus samples of at least one user, and the personalized sample data includes voice and/or corpus samples of a preset user.
Optionally, the acoustic features include at least one of a spectral feature, a fundamental frequency feature, an energy feature and an aperiodic feature.
Another aspect of the present disclosure provides a speech synthesis apparatus, including: a receiving module for receiving a voice service request from a user terminal; a first processing module for determining a corpus to be processed associated with the voice service request; a second processing module for determining text features for the corpus to be processed, the text features including a phoneme sequence and prosodic features; a third processing module for determining, based on the text features, acoustic features for the corpus to be processed; a fourth processing module for performing speech synthesis for the corpus to be processed based on the text features and the acoustic features; and a sending module for returning the speech synthesis result to the user terminal.
Optionally, the receiving module includes a receiving submodule for receiving the voice service request through a preset communication link, where the preset communication link supports offline data transmission.
Optionally, the second processing module includes: a first processing submodule for obtaining a word segmentation sequence for the corpus to be processed; a second processing submodule for determining a phoneme sequence associated with the word segmentation sequence; and a third processing submodule for determining the pronunciation duration of each phoneme in the phoneme sequence by using a preset prosodic structure model, so as to obtain the prosodic features.
Optionally, the third processing submodule includes: a first processing unit for inputting the phoneme sequence into the prosodic structure model and determining at least one prosodic structure associated with the phoneme sequence, where the prosodic structure includes at least one of a prosodic word structure, a prosodic phrase structure and a semantic phrase structure; a second processing unit for determining a pause duration for each prosodic structure according to a preset association between prosodic structures and pause durations; and a third processing unit for determining the pronunciation duration of each phoneme in the phoneme sequence according to the pause duration for each prosodic structure, so as to obtain the prosodic features.
Optionally, the third processing module includes: a fourth processing submodule for converting each phoneme into a frame sequence unit according to the pronunciation duration of each phoneme indicated by the prosodic features; and a fifth processing submodule for determining the acoustic features associated with each frame sequence unit by using a preset acoustic feature model, so as to obtain the acoustic features for the corpus to be processed.
Optionally, the acoustic feature model includes: a basic acoustic model obtained by training on general sample data, and/or a personalized acoustic model obtained by training the basic acoustic model with personalized sample data, where the general sample data includes voice and/or corpus samples of at least one user, and the personalized sample data includes voice and/or corpus samples of a preset user.
Optionally, the acoustic features include at least one of a spectral feature, a fundamental frequency feature, an energy feature and an aperiodic feature.
Another aspect of the present disclosure provides an electronic device. The electronic device includes at least one processor and a memory communicatively coupled to the at least one processor. The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to implement the method of the embodiment of the disclosure.
Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions that, when executed, implement the method of embodiments of the present disclosure.
Another aspect of the disclosure provides a computer program comprising computer executable instructions that when executed perform the method of an embodiment of the disclosure.
According to the embodiments of the present disclosure, the technical solution of receiving a voice service request from a user terminal; determining a corpus to be processed associated with the voice service request; determining text features for the corpus to be processed, the text features including a phoneme sequence and prosodic features; determining acoustic features for the corpus to be processed based on the text features; performing speech synthesis for the corpus to be processed based on the text features and the acoustic features; and returning the speech synthesis result to the user terminal at least partially overcomes the technical problems of high development cost, poor versatility and unstable synthesis quality of speech synthesis tools in the related art, thereby effectively reducing the development cost of speech synthesis tools and effectively improving their usability.
Drawings
For a more complete understanding of the present disclosure and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
FIG. 1 schematically illustrates a speech synthesis system architecture according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow diagram of a method of speech synthesis according to an embodiment of the present disclosure;
FIG. 3 schematically shows an application diagram of a speech synthesis method according to an embodiment of the present disclosure;
FIG. 4 schematically shows an overall schematic diagram of a speech synthesis process according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of a method of determining text features according to an embodiment of the disclosure;
FIG. 6 schematically shows a block diagram of a speech synthesis apparatus according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a block diagram of an electronic device suitable for implementing a speech synthesis method and apparatus according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It is to be understood that such description is merely illustrative and not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, such a construction is generally intended in the sense that one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
Various embodiments of the present disclosure provide a speech synthesis method and a speech synthesis apparatus to which the method can be applied. The method first receives a voice service request from a user terminal, then determines a corpus to be processed associated with the voice service request and determines text features for the corpus to be processed, the text features including a phoneme sequence and prosodic features; it then determines acoustic features for the corpus to be processed based on the text features, performs speech synthesis for the corpus to be processed based on the text features and the acoustic features, and finally returns the speech synthesis result to the user terminal.
As shown in FIG. 1, the system architecture 100 includes at least one user terminal (several are shown, e.g., user terminals 101, 102, 103) and a speech synthesis engine 104. In the system architecture 100, the speech synthesis engine 104 receives a voice service request from a user terminal (e.g., user terminal 101, 102 or 103), determines a corpus to be processed associated with the voice service request, and determines text features for the corpus to be processed, the text features including a phoneme sequence and prosodic features; it then determines acoustic features for the corpus to be processed based on the text features, performs speech synthesis for the corpus to be processed based on the text features and the acoustic features, and finally returns the speech synthesis result to the user terminal (e.g., user terminal 101, 102 or 103).
The present disclosure will be described in detail below with reference to the drawings and specific embodiments.
FIG. 2 schematically shows a flow diagram of a speech synthesis method applied to a speech synthesis engine according to an embodiment of the present disclosure.
As shown in fig. 2, the method may include operations S210 to S260, for example.
In operation S210, a voice service request from a user terminal is received.
In the embodiments of the present disclosure, the speech synthesis engine receives the voice service request from the user terminal. Specifically, the speech synthesis engine may receive voice service requests from user terminals running different operating systems, which may include, for example, Android, iOS, Ubuntu, Linux, Windows, generic Linux, and the like. In addition, the speech synthesis engine can receive voice service requests from user terminals with different software environments, different hardware environments and different configurations. The programming languages and programming interface parameters may differ across software environments, and parameters such as CPU/GPU computing power and memory size may differ across hardware environments. Note that a "user" in this solution may be any object with a speech synthesis need.
The speech synthesis engine receives the voice service request through a preset communication link, and the preset communication link supports offline data transmission. Specifically, the speech synthesis engine authorizes an application programming interface to applications running on different operating systems, in different software and hardware environments and under different configurations, and obtains the voice service request from the user terminal through that interface, thereby implementing an offline speech synthesis function based on front-end invocation and back-end synthesis. Illustratively, the speech synthesis engine opens an Objective-C API (Application Programming Interface) to the user terminal and obtains the voice service request from the user terminal through the Objective-C API, where the Objective-C API supports offline data transmission; Objective-C is a programming language that adds object-oriented features on top of the C language.
The design is not only beneficial to improving the general degree of the speech synthesis engine, but also beneficial to reducing the development cost of the speech synthesis engine. Because the preset communication link supports offline data transmission, the voice synthesis requirement of the user in a network-free environment or a weak network environment can be effectively met.
Fig. 3 schematically illustrates an application diagram of a speech synthesis method according to an embodiment of the present disclosure. As shown in Fig. 3, different users initiate voice service requests through their user terminals, and a service server associated with the user terminal exchanges data with the speech synthesis engine through an interactive interface. Specifically, the speech synthesis engine obtains the corpus to be processed associated with the voice service request from the service server, determines text features and acoustic features for the corpus to be processed, performs speech synthesis based on the text features and the acoustic features to obtain the synthesized speech, and returns the synthesized speech to the user terminal through the service server, thereby providing a speech synthesis service for users of different user terminals; a sketch of this flow is given below. The interactive interface may specifically be a C/C++ based application programming interface.
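Illustratively, the overall request-handling flow described above can be summarized by the following sketch. The sketch is written in Python purely for readability; the class name SpeechSynthesisEngine, the method names and the request format are assumptions introduced for illustration and do not correspond to an interface defined in this disclosure.

```python
# Illustrative sketch of the engine-side flow (operations S210 to S260).
# All names and the request format are assumptions, not the actual interface.
class SpeechSynthesisEngine:
    def handle_request(self, request: dict) -> bytes:
        corpus = request["text"]                               # S210/S220: corpus to be processed
        phonemes, prosody = self.text_features(corpus)         # S230: phoneme sequence + prosodic features
        acoustics = self.acoustic_features(phonemes, prosody)  # S240: frame-level acoustic features
        audio = self.synthesize(phonemes, prosody, acoustics)  # S250: waveform generation
        return audio                                           # S260: returned to the user terminal

    def text_features(self, corpus):
        return ["S", "E"], {"durations_ms": []}                # placeholder implementation

    def acoustic_features(self, phonemes, prosody):
        return []                                              # placeholder implementation

    def synthesize(self, phonemes, prosody, acoustics):
        return b""                                             # placeholder implementation

engine = SpeechSynthesisEngine()
audio = engine.handle_request({"text": "..."})                 # bytes of the synthesized audio
```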
Next, in operation S220, a to-be-processed corpus associated with the voice service request is determined.
In the embodiments of the present disclosure, the corpus to be processed associated with the voice service request is determined; specifically, it is the corpus in the user terminal that needs to be synthesized into speech. It may be, for example, a corpus to be broadcast by a voiced application (the voiced application may include, for example, audiobook software, navigation software, video software, and the like), a corpus to be broadcast by an interactive application (the interactive application may include, for example, game software, learning software, and the like), a corpus to be broadcast by an intelligent interactive device (the intelligent interactive device may include, for example, a smart speaker, a vehicle-mounted device, a wearable device, a personal digital assistant, a point-of-sale terminal, an intelligent robot, and the like), or a corpus to be announced by a public service device (the public service device may include, for example, public transportation equipment and the like).
Next, in operation S230, text features for the corpus to be processed are determined, the text features including a phoneme sequence and prosodic features.
In the embodiments of the present disclosure, after the corpus to be processed associated with the voice service request is obtained, word segmentation is performed on it to obtain a word segmentation sequence for the corpus to be processed. The corpus to be processed can be divided into a number of word segments using an existing word segmentation tool, and the segments are arranged in order to form the word segmentation sequence; the word segmentation tool may include SnowNLP, NLPIR, THULAC, and the like. Illustratively, for the corpus "there are still fifteen minutes until the launch event", the segmentation result includes "until" (preposition), "launch event" (noun), "still" (adverb), "are" (verb), "fifteen" (numeral) and "minutes" (measure word). A sketch of this segmentation step is given below.
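As an illustration only, the segmentation step could be performed as follows; the jieba toolkit and the Chinese sentence used here (corresponding to the running example above) are choices made for this sketch and are not named in the disclosure, and any segmenter that produces an ordered sequence of word segments with part-of-speech tags would serve the same purpose.

```python
# Illustrative word segmentation with part-of-speech tags (jieba is used here
# only as an example; the disclosure mentions SnowNLP, NLPIR and THULAC).
import jieba.posseg as pseg

corpus = "距离发布会还有十五分钟"   # the running example: "there are still fifteen minutes until the launch event"
word_segmentation_sequence = []
for token in pseg.cut(corpus):
    word_segmentation_sequence.append((token.word, token.flag))  # (segment, part-of-speech tag)

print(word_segmentation_sequence)
# e.g. [('距离', 'p'), ('发布会', 'n'), ('还有', 'v'), ('十五', 'm'), ('分钟', 'q')]
```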
Optionally, before word segmentation is performed on the corpus to be processed, preprocessing such as text replacement and text normalization may be applied to it; converting non-standard text in the corpus to be processed into standard text improves the accuracy of the subsequent word segmentation.
After the word segmentation sequence for the corpus to be processed is obtained, G2P (grapheme-to-phoneme) inference is performed on the word segmentation sequence to obtain the phoneme sequence associated with it. A phoneme is the smallest unit of speech divided according to the natural attributes of speech: analyzed by articulatory action within a syllable, one articulatory action forms one phoneme, and phonemes are divided into vowels and consonants. For example, a single Chinese syllable is typically composed of an initial consonant phoneme and a final vowel phoneme.
Phonemes associated with each grapheme in the word segmentation sequence are determined to obtain the phoneme sequence for the corpus to be processed. For example, for the corpus to be processed "dà jiā hǎo" (大家好, "hello everyone"), the phoneme sequence is "S_d_a_4_SP0_j_ia_1_SP1_h_ao_3_E", where S is a sentence-start marker, E is a sentence-end marker, the digits 4, 1 and 3 are tone markers, SP0 and SP1 are pause markers, and different elements are separated by underscores "_". A sketch of assembling such a sequence is given below. After the phoneme sequence is determined, information such as the pronunciation duration and stress of each phoneme in the phoneme sequence is determined using a preset prosodic structure model, so as to obtain the prosodic features for the corpus to be processed.
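For illustration, the marked-up phoneme string above can be assembled from per-syllable grapheme-to-phoneme entries as in the following sketch; the tiny lookup table stands in for a real G2P dictionary, and the marker conventions simply follow the example in the text.

```python
# Toy assembly of the marked-up phoneme string shown above. The pinyin lookup
# table is a stand-in for a real G2P dictionary; marker conventions (S/E for
# sentence start/end, SP0/SP1 for pauses, digits for tones) follow the example.
G2P = {"大": ("d", "a", 4), "家": ("j", "ia", 1), "好": ("h", "ao", 3)}  # toy lexicon

def to_phoneme_sequence(chars, pauses):
    parts = ["S"]
    for i, ch in enumerate(chars):
        initial, final, tone = G2P[ch]
        parts += [initial, final, str(tone)]
        if i in pauses:                      # insert a pause marker after this syllable
            parts.append(pauses[i])
    parts.append("E")
    return "_".join(parts)

print(to_phoneme_sequence("大家好", {0: "SP0", 1: "SP1"}))
# -> S_d_a_4_SP0_j_ia_1_SP1_h_ao_3_E
```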
Next, in operation S240, based on the text features, acoustic features for the corpus to be processed are determined.
In the embodiments of the present disclosure, the acoustic features may include, for example, at least one of spectral features, fundamental frequency features, energy features and aperiodic features. Spectral features, which may include, for example, linear prediction cepstral coefficients and Mel-frequency cepstral coefficients, can be used to indicate the correlation between vocal tract shape changes and vocal tract motion. The fundamental frequency carries many characteristics that can represent speech emotion and plays an important role in speech emotion recognition; the fundamental frequency features include the pitch period, i.e., the vocal cord vibration period, which indicates the period at which the airflow through the vocal tract makes the vocal cords vibrate during voiced speech. The energy features may in particular include intensity or volume features, which indicate the sound level. The aperiodic features may include, for example, speaking rate features, emotional intensity and prosodic features, where the prosodic features may include, for example, features characterizing intonation, pitch, stress, pauses and rhythm. An illustrative extraction of these feature types is sketched below.
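For illustration, features of these types can be extracted from a reference waveform as in the following sketch; the librosa toolkit, the file name and the parameter values are assumptions made for this example and are not specified by the disclosure.

```python
# Illustrative extraction of the acoustic feature types listed above.
# librosa and all parameter values are assumptions for this sketch.
import librosa
import numpy as np

y, sr = librosa.load("sample.wav", sr=16000)            # placeholder reference waveform

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=200, n_mels=80)
log_mel = np.log(mel + 1e-6)                            # spectral feature (80-band log-mel)

f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
    sr=sr, hop_length=200)                              # fundamental frequency (pitch) track

energy = librosa.feature.rms(y=y, hop_length=200)       # energy / intensity per frame
```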
The acoustic features of the corpus to be processed are determined based on the phoneme sequence and the prosodic features associated with the corpus to be processed. Specifically, each phoneme can be expanded into a frame sequence unit according to the pronunciation duration of that phoneme indicated by the prosodic features, so as to obtain a frame sequence associated with the corpus to be processed. A frame sequence unit contains a number of frames consistent with the pronunciation duration of the phoneme, and each frame in the frame sequence unit has its corresponding acoustic features. The pronunciation duration of a phoneme is determined by its start time and end time, which can be output by the prosodic structure model in operation S230.
When the acoustic features associated with the frame sequence units are determined using the preset acoustic feature model, the acoustic features associated with each frame in each frame sequence unit are determined. The acoustic feature model is a pre-trained neural network model: its input is the frame features of each frame in each frame sequence unit, and its output is the acoustic features of each frame in each frame sequence unit. The frame features of a frame correspond to the phoneme features of the phoneme associated with that frame, which may include information such as tone features, prosodic features and position in the dictionary. In addition, the frame features include the position index of the current frame in its frame sequence unit, the position index of the current frame in the corresponding phoneme sequence, and the like. A sketch of this frame expansion is given below.
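As an illustration, expanding phonemes into frame sequence units and attaching the position indices described above could look like the following sketch; the 10 ms frame shift and all names are assumptions for this example.

```python
# Sketch of expanding phonemes into frame sequence units. The 10 ms frame
# shift is an illustrative assumption; the disclosure does not fix a frame rate.
FRAME_SHIFT_MS = 10

def expand_to_frames(phonemes):
    """phonemes: list of (phoneme, start_ms, end_ms) output by the prosodic structure model."""
    frames = []
    for p_idx, (phoneme, start_ms, end_ms) in enumerate(phonemes):
        n_frames = max(1, round((end_ms - start_ms) / FRAME_SHIFT_MS))
        for f_idx in range(n_frames):
            frames.append({
                "phoneme": phoneme,            # phoneme-level features would be attached here
                "frame_in_phoneme": f_idx,     # position index within the frame sequence unit
                "phoneme_in_sequence": p_idx,  # position index of the phoneme in the sequence
            })
    return frames

frames = expand_to_frames([("d", 0, 80), ("a", 80, 230), ("j", 230, 300)])
print(len(frames))  # 8 + 15 + 7 = 30 frame-level feature vectors fed to the acoustic model
```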
The acoustic feature model may be a basic acoustic model obtained by training on general sample data, or a personalized acoustic model obtained by further training the basic acoustic model with personalized sample data, where the general sample data includes voice and/or corpus samples of at least one user, and the personalized sample data includes voice and/or corpus samples of a preset user. The basic acoustic model is trained with a large number of voice and/or corpus samples from different speakers or with different speaking styles as sample data. The personalized acoustic model is trained, on top of the basic acoustic model, with voice and/or corpus samples of a target speaker as sample data. The personalized acoustic model can better satisfy personalized requirements on the speech synthesis function, can provide selectable differentiated configurations for users, and helps improve the stylistic diversity of the speech synthesis function and the versatility of the speech synthesis engine. A sketch of this personalization step is given below.
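For illustration, deriving a personalized acoustic model from a basic acoustic model could follow the sketch below; PyTorch, the toy network and the randomly generated "personalized sample data" are stand-ins used only to show the fine-tuning pattern and are not part of the disclosure.

```python
# Sketch of personalizing a base acoustic model by fine-tuning on a target
# speaker's data. The model architecture and data here are toy stand-ins.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

class AcousticModel(nn.Module):                 # toy stand-in for the real acoustic feature model
    def __init__(self, in_dim=64, out_dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))
    def forward(self, x):
        return self.net(x)

model = AcousticModel()
# model.load_state_dict(torch.load("base_acoustic_model.pt"))  # weights trained on general sample data

# toy "personalized sample data": frame features -> target acoustic features of a preset user
personalized = TensorDataset(torch.randn(512, 64), torch.randn(512, 80))
loader = DataLoader(personalized, batch_size=32, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)       # small LR: stay close to the base model
loss_fn = nn.L1Loss()

model.train()
for epoch in range(5):
    for frame_feats, target_acoustics in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(frame_feats), target_acoustics)
        loss.backward()
        optimizer.step()

torch.save(model.state_dict(), "personalized_acoustic_model.pt")
```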
Next, in operation S250, speech synthesis for the corpus to be processed is performed based on the text feature and the acoustic feature.
In the embodiments of the present disclosure, after the acoustic features of each frame in the frame sequence are obtained, speech synthesis for the corpus to be processed is performed based on the text features and the acoustic features associated with the corpus to be processed. The waveform generation itself can be implemented with existing techniques and is not described in detail here.
Next, in operation S260, the voice synthesis result is returned to the user terminal.
In the embodiments of the present disclosure, the speech synthesis result is returned to the user terminal through the preset communication link, for example through the application programming interface that the speech synthesis engine opens to the user terminal. This effectively provides a unified speech synthesis service across different edge computing devices, different operating systems, different software and hardware environments and different configuration environments; the speech synthesis tool therefore not only has the advantage of strong versatility but can also provide differentiated speech synthesis functions for different users and different edge computing devices. At the same time, this design helps guarantee stable performance of the speech synthesis engine and makes it possible to output synthesized speech with stable quality and a good synthesis effect.
Fig. 4 schematically shows an overall schematic diagram of a speech synthesis process according to an embodiment of the present disclosure. As shown in Fig. 4, the basic acoustic model is obtained by training on general sample data, and the personalized acoustic model is obtained by training the basic acoustic model with personalized sample data. After the corpus to be processed associated with the voice service request is obtained, the text features of the corpus to be processed are determined using the prosodic structure model, and the acoustic features of the corpus to be processed are determined using the personalized acoustic model based on the determined text features. Speech synthesis is then performed based on the text features and acoustic features of the corpus to be processed to obtain an audio file of the synthesized speech, and the audio file is returned to the user terminal.
In the embodiments of the present disclosure, a voice service request is received from a user terminal; a corpus to be processed associated with the voice service request is determined; text features for the corpus to be processed are determined, the text features including a phoneme sequence and prosodic features; acoustic features for the corpus to be processed are determined based on the text features; speech synthesis for the corpus to be processed is performed based on the text features and the acoustic features; and the speech synthesis result is returned to the user terminal. Because the speech synthesis engine is not integrated in the user terminal or its operating system, but instead receives the voice service request from the user terminal, performs the speech synthesis processing for the request and then returns the speech synthesis result to the user terminal, this design allows the speech synthesis engine to satisfy the speech synthesis requirements of different operating systems, different software and hardware environments and different configurations, thereby effectively improving the versatility of the speech synthesis engine, reducing its development cost and effectively guaranteeing a stable speech synthesis effect.
FIG. 5 schematically shows a flow chart of a method of determining text features according to an embodiment of the disclosure.
As shown in fig. 5, operation S230 may include, for example, operations S510 to S520.
In operation S510, a phoneme sequence associated with the corpus to be processed is determined.
In the embodiments of the present disclosure, determining the phoneme sequence associated with the corpus to be processed includes: obtaining a word segmentation sequence for the corpus to be processed, and determining the phoneme sequence associated with the word segmentation sequence, as already described for operation S230 and not repeated here.
Next, in operation S520, the pronunciation duration of each phoneme in the phoneme sequence is determined using a preset prosodic structure model, so as to obtain the prosodic features.
In the embodiments of the present disclosure, the phoneme sequence is input into the prosodic structure model, and at least one prosodic structure associated with the phoneme sequence is determined, where the prosodic structure includes at least one of a prosodic word structure, a prosodic phrase structure and a semantic phrase structure; then a pause duration for each prosodic structure is determined according to a preset association between prosodic structures and pause durations; and finally the pronunciation duration and stress of each phoneme in the phoneme sequence are determined according to the pause duration for each prosodic structure, so as to obtain the prosodic features.
The hierarchical structure of the phoneme sequence is determined using the preset prosodic structure model; this hierarchy generally includes prosodic words, prosodic phrases and semantic phrases. Illustratively, for the corpus "there are still fifteen minutes until the launch event, let's set off", segments such as "until", "launch event", "set off" and "still" are labeled as prosodic words, and the pause between adjacent prosodic words is assumed to be T1; "there are still fifteen minutes until the launch event" is labeled as a prosodic phrase, and the pause between adjacent prosodic phrases is assumed to be T2; "there are still fifteen minutes until the launch event" and "let's set off" are labeled as semantic phrases, and the pause between adjacent semantic phrases is assumed to be T3. There is no pause between graphemes labeled as belonging to the same prosodic word, and the pause durations differ across prosodic structures of different levels: the pause between adjacent prosodic words is the shortest, the pause between adjacent prosodic phrases is intermediate, and the pause between adjacent semantic phrases is the longest, so that T1 < T2 < T3.
After the pause duration for each prosodic structure is determined, the start time and end time of each phoneme in the phoneme sequence are determined, and the pronunciation duration of each phoneme is obtained, yielding the prosodic features for the corpus to be processed; a toy illustration is given below. Meanwhile, the prosodic structure model can also determine, according to the part of speech of each word segment, which phonemes in the phoneme sequence should be stressed or de-stressed, so as to obtain the prosodic features for the corpus to be processed, where the prosodic structure model is trained with a large number of annotated corpus samples.
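The toy illustration referenced above: the mapping from prosodic boundary levels to pause durations and the resulting phoneme start and end times could be computed as in the following sketch, where the millisecond values are arbitrary and only the ordering T1 < T2 < T3 reflects the text.

```python
# Toy illustration of the pause-duration mapping described above. PW/PP/IP
# denote prosodic word, prosodic phrase and semantic phrase boundaries; the
# millisecond values are illustrative (only T1 < T2 < T3 comes from the text).
PAUSE_MS = {None: 0, "PW": 50, "PP": 150, "IP": 300}   # T1 < T2 < T3

def timeline(phonemes):
    """phonemes: list of (phoneme, duration_ms, boundary_after) -> (phoneme, start, end)."""
    t, out = 0, []
    for phoneme, dur, boundary in phonemes:
        out.append((phoneme, t, t + dur))
        t += dur + PAUSE_MS[boundary]       # pause inserted after a prosodic boundary
    return out

print(timeline([("d", 80, None), ("a", 150, "PW"), ("j", 60, None), ("ia", 140, "PP")]))
# [('d', 0, 80), ('a', 80, 230), ('j', 280, 340), ('ia', 340, 480)]
```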
Fig. 6 schematically shows a block diagram of a speech synthesis apparatus according to an embodiment of the present disclosure.
As shown in Fig. 6, the apparatus may include a receiving module 601, a first processing module 602, a second processing module 603, a third processing module 604, a fourth processing module 605 and a sending module 606.
Specifically, the receiving module 601 is configured to receive a voice service request from a user terminal; the first processing module 602 is configured to determine a corpus to be processed associated with the voice service request; the second processing module 603 is configured to determine text features for the corpus to be processed, the text features including a phoneme sequence and prosodic features; the third processing module 604 is configured to determine, based on the text features, acoustic features for the corpus to be processed; the fourth processing module 605 is configured to perform speech synthesis for the corpus to be processed based on the text features and the acoustic features; and the sending module 606 is configured to return the speech synthesis result to the user terminal.
In the embodiments of the present disclosure, a voice service request is received from a user terminal; a corpus to be processed associated with the voice service request is determined; text features for the corpus to be processed are determined, the text features including a phoneme sequence and prosodic features; acoustic features for the corpus to be processed are determined based on the text features; speech synthesis for the corpus to be processed is performed based on the text features and the acoustic features; and the speech synthesis result is returned to the user terminal. Because the speech synthesis engine is not integrated in the user terminal or its operating system, but instead receives the voice service request from the user terminal, performs the speech synthesis processing for the request and then returns the speech synthesis result to the user terminal, this design allows the speech synthesis engine to satisfy the speech synthesis requirements of different operating systems, different software and hardware environments and different configurations, thereby effectively improving the versatility of the speech synthesis engine, reducing its development cost and effectively guaranteeing a stable speech synthesis effect.
As an alternative embodiment, the receiving module includes a receiving submodule for receiving the voice service request through a preset communication link, where the preset communication link supports offline data transmission.
As an alternative embodiment, the second processing module includes: a first processing submodule for obtaining a word segmentation sequence for the corpus to be processed; a second processing submodule for determining a phoneme sequence associated with the word segmentation sequence; and a third processing submodule for determining the pronunciation duration of each phoneme in the phoneme sequence by using a preset prosodic structure model, so as to obtain the prosodic features.
As an alternative embodiment, the third processing submodule includes: a first processing unit for inputting the phoneme sequence into the prosodic structure model and determining at least one prosodic structure associated with the phoneme sequence, where the prosodic structure includes at least one of a prosodic word structure, a prosodic phrase structure and a semantic phrase structure; a second processing unit for determining a pause duration for each prosodic structure according to a preset association between prosodic structures and pause durations; and a third processing unit for determining the pronunciation duration of each phoneme in the phoneme sequence according to the pause duration for each prosodic structure, so as to obtain the prosodic features.
As an alternative embodiment, the third processing module includes: a fourth processing submodule for converting each phoneme into a frame sequence unit according to the pronunciation duration of each phoneme indicated by the prosodic features; and a fifth processing submodule for determining the acoustic features associated with each frame sequence unit by using a preset acoustic feature model, so as to obtain the acoustic features for the corpus to be processed.
As an alternative embodiment, the acoustic feature model includes: a basic acoustic model obtained by training on general sample data, and/or a personalized acoustic model obtained by training the basic acoustic model with personalized sample data, where the general sample data includes voice and/or corpus samples of at least one user, and the personalized sample data includes voice and/or corpus samples of a preset user.
As an alternative embodiment, the acoustic features include at least one of a spectral feature, a fundamental frequency feature, an energy feature and an aperiodic feature.
Optionally, any number of the receiving module 601, the first processing module 602, the second processing module 603, the third processing module 604, the fourth processing module 605 and the sending module 606, or at least part of the functions of any one or more of them, may be implemented in one module. Any one or more of the modules according to the embodiments of the present disclosure may be split into multiple modules for implementation. Any one or more of the modules according to the embodiments of the present disclosure may be implemented at least partially as a hardware circuit, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system on chip, a system on a substrate, a system in a package or an application specific integrated circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or by any one of, or a suitable combination of, the three implementations of software, hardware and firmware. Alternatively, one or more of the modules according to the embodiments of the present disclosure may be implemented at least partially as computer program modules which, when executed, perform the corresponding functions.
For example, any number of the receiving module 601, the first processing module 602, the second processing module 603, the third processing module 604, the fourth processing module 605 and the sending module 606 may be combined and implemented in one module, or any one of them may be split into multiple modules. Alternatively, at least part of the functions of one or more of these modules may be combined with at least part of the functions of other modules and implemented in one module. Alternatively, at least one of the receiving module 601, the first processing module 602, the second processing module 603, the third processing module 604, the fourth processing module 605 and the sending module 606 may be implemented at least partially as a hardware circuit, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system on chip, a system on a substrate, a system in a package or an application specific integrated circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or by any one of, or a suitable combination of, the three implementations of software, hardware and firmware. Alternatively, at least one of these modules may be implemented at least partially as a computer program module which, when executed, performs the corresponding functions.
FIG. 7 schematically illustrates a block diagram of an electronic device suitable for implementing a speech synthesis method and apparatus according to an embodiment of the present disclosure. The computer system illustrated in FIG. 7 is only one example and should not impose any limitations on the scope of use or functionality of embodiments of the disclosure.
As shown in fig. 7, a computer system 700 according to an embodiment of the present disclosure includes a processor 701, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. The processor 701 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 701 may also include on-board memory for caching purposes. The processor 701 may comprise a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
In the RAM 703, various programs and data necessary for the operation of the system 700 are stored. The processor 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. The processor 701 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 702 and/or the RAM 703. It is noted that the programs may also be stored in one or more memories other than the ROM 702 and RAM 703. The processor 701 may also perform various operations of method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
Optionally, the system 700 may also include an input/output (I/O) interface 705, which is also connected to the bus 704. The system 700 may also include one or more of the following components connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse and the like; an output portion 707 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker and the like; a storage portion 708 including a hard disk and the like; and a communication portion 709 including a network interface card such as a LAN card or a modem. The communication portion 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 710 as needed, so that a computer program read therefrom is installed into the storage portion 708 as needed.
Alternatively, the method flows according to embodiments of the present disclosure may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program, when executed by the processor 701, performs the above-described functions defined in the system of the embodiment of the present disclosure. Alternatively, the systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
Alternatively, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but not limited to: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus or device. For example, the computer-readable storage medium may include one or more memories other than the ROM 702 and/or RAM 703 described above.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure may be combined and/or integrated in various ways, even if such combinations are not expressly recited in the present disclosure. In particular, the features recited in the various embodiments and/or claims of the present disclosure may be combined and/or integrated in various ways without departing from the spirit or teaching of the present disclosure. All such combinations and/or integrations fall within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (10)

1. A speech synthesis method applied to a speech synthesis engine comprises the following steps:
receiving a voice service request from a user terminal;
determining a corpus to be processed associated with the voice service request;
determining text features for the corpus to be processed, wherein the text features comprise a phoneme sequence and prosodic features;
determining acoustic features for the corpus to be processed based on the text features;
performing speech synthesis for the corpus to be processed based on the text features and the acoustic features;
and returning the speech synthesis result to the user terminal.
2. The method of claim 1, wherein the receiving a voice service request from a user terminal comprises:
receiving the voice service request through a preset communication link, wherein the preset communication link supports offline data transmission.
3. The method of claim 1, wherein the determining text features for the corpus to be processed comprises:
obtaining a word segmentation sequence for the corpus to be processed;
determining a phoneme sequence associated with the word segmentation sequence;
and determining the pronunciation duration of each phoneme in the phoneme sequence by using a preset prosodic structure model, so as to obtain the prosodic features.
4. The method according to claim 3, wherein the determining the pronunciation duration of each phoneme in the phoneme sequence by using a preset prosodic structure model to obtain the prosodic features comprises:
inputting the phoneme sequence into the prosodic structure model, and determining at least one prosodic structure associated with the phoneme sequence, wherein the prosodic structure comprises at least one of a prosodic word structure, a prosodic phrase structure and a semantic phrase structure;
determining a pause duration for each prosodic structure according to a preset association between prosodic structures and pause durations;
and determining the pronunciation duration of each phoneme in the phoneme sequence according to the pause duration for each prosodic structure, so as to obtain the prosodic features.
5. The method according to claim 3, wherein the determining the acoustic features for the corpus to be processed based on the text features comprises:
converting each phoneme into a frame sequence unit according to the pronunciation duration of each phoneme indicated by the prosodic features;
and determining the acoustic features associated with each frame sequence unit by using a preset acoustic feature model, so as to obtain the acoustic features for the corpus to be processed.
6. The method of claim 5, wherein the acoustic feature model comprises:
a basic acoustic model obtained by training on general sample data, and/or a personalized acoustic model obtained by training the basic acoustic model with personalized sample data,
wherein the general sample data comprises voice and/or corpus samples of at least one user, and the personalized sample data comprises voice and/or corpus samples of a preset user.
7. The method of claim 5, wherein the acoustic features comprise at least one of a spectral feature, a fundamental frequency feature, an energy feature and an aperiodic feature.
8. A speech synthesis apparatus comprising:
a receiving module, configured to receive a voice service request from a user terminal;
a first processing module, configured to determine a corpus to be processed associated with the voice service request;
a second processing module, configured to determine text features for the corpus to be processed, wherein the text features comprise a phoneme sequence and prosodic features;
a third processing module, configured to determine acoustic features for the corpus to be processed based on the text features;
a fourth processing module, configured to perform speech synthesis for the corpus to be processed based on the text features and the acoustic features;
and a sending module, configured to return a speech synthesis result to the user terminal.
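The claimed modules map naturally onto one class per apparatus, with one method per module. The skeleton below is purely illustrative; the method names and signatures are assumptions.

```python
class SpeechSynthesisApparatus:
    """Skeleton mirroring the modules of claim 8; bodies are left as stubs."""

    def receive(self, request: dict) -> dict:             # receiving module
        return request

    def determine_corpus(self, request: dict) -> str:     # first processing module
        return request["text"]

    def determine_text_features(self, corpus: str):       # second processing module
        ...

    def determine_acoustic_features(self, text_features):  # third processing module
        ...

    def synthesize(self, text_features, acoustic_features) -> bytes:  # fourth processing module
        ...

    def send(self, result: bytes) -> bytes:                # sending module
        return result
```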
9. An electronic device, comprising:
one or more processors; and
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.
10. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to carry out the method of any one of claims 1 to 7.
CN202011266074.1A 2020-11-13 2020-11-13 Speech synthesis method and device Pending CN112420017A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011266074.1A CN112420017A (en) 2020-11-13 2020-11-13 Speech synthesis method and device

Publications (1)

Publication Number Publication Date
CN112420017A (en) 2021-02-26

Family

ID=74832216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011266074.1A Pending CN112420017A (en) 2020-11-13 2020-11-13 Speech synthesis method and device

Country Status (1)

Country Link
CN (1) CN112420017A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016065900A (en) * 2014-09-22 2016-04-28 カシオ計算機株式会社 Voice synthesizer, method and program
US20180254034A1 (en) * 2015-10-20 2018-09-06 Baidu Online Network Technology (Beijing) Co., Ltd Training method for multiple personalized acoustic models, and voice synthesis method and device
CN108597492A (en) * 2018-05-02 2018-09-28 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device
US10706837B1 (en) * 2018-06-13 2020-07-07 Amazon Technologies, Inc. Text-to-speech (TTS) processing
CN110136692A (en) * 2019-04-30 2019-08-16 北京小米移动软件有限公司 Phoneme synthesizing method, device, equipment and storage medium
CN110264993A (en) * 2019-06-27 2019-09-20 百度在线网络技术(北京)有限公司 Phoneme synthesizing method, device, equipment and computer readable storage medium
CN110797006A (en) * 2020-01-06 2020-02-14 北京海天瑞声科技股份有限公司 End-to-end speech synthesis method, device and storage medium
CN111369971A (en) * 2020-03-11 2020-07-03 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111599339A (en) * 2020-05-19 2020-08-28 苏州奇梦者网络科技有限公司 Speech splicing synthesis method, system, device and medium with high naturalness
CN111754978A (en) * 2020-06-15 2020-10-09 北京百度网讯科技有限公司 Rhythm hierarchy marking method, device, equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113178187A (en) * 2021-04-26 2021-07-27 北京有竹居网络技术有限公司 Voice processing method, device, equipment and medium, and program product
CN113555003A (en) * 2021-07-23 2021-10-26 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN114613353A (en) * 2022-03-25 2022-06-10 马上消费金融股份有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN114613353B (en) * 2022-03-25 2023-08-08 马上消费金融股份有限公司 Speech synthesis method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination