WO2021127979A1 - Speech synthesis method and apparatus, computer device, and computer-readable storage medium (语音合成方法、装置、计算机设备及计算机可读存储介质) - Google Patents

Speech synthesis method and apparatus, computer device, and computer-readable storage medium

Info

Publication number
WO2021127979A1
WO2021127979A1 (PCT/CN2019/127914, CN2019127914W)
Authority
WO
WIPO (PCT)
Prior art keywords
spectrum
emotional
synthesized
preset
superimposed
Prior art date
Application number
PCT/CN2019/127914
Other languages
English (en)
French (fr)
Inventor
黄东延
盛乐园
熊友军
Original Assignee
深圳市优必选科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市优必选科技股份有限公司 filed Critical 深圳市优必选科技股份有限公司
Priority to PCT/CN2019/127914 priority Critical patent/WO2021127979A1/zh
Priority to CN201980003185.2A priority patent/CN111108549B/zh
Publication of WO2021127979A1 publication Critical patent/WO2021127979A1/zh


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10 Prosody rules derived from text; Stress or intonation
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques where the extracted parameters are spectral information of each sub-band
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques using neural networks
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques for estimating an emotional state

Definitions

  • This application relates to the technical field of speech synthesis, and in particular to a speech synthesis method, device, computer equipment, and computer-readable storage medium.
  • Speech synthesis is a technology that generates artificial speech by mechanical and electronic means. Specifically, it refers to a technology that converts text information, generated by a computer or input into a computer from outside, into intelligible and fluent speech output.
  • In the prior art, emotional features are extracted from a reference speech, and the extracted emotional features are then used in an unsupervised manner to control the style of the synthesized speech.
  • However, speech carries not only emotion but also stress and other prosodic factors, and these prosodic factors need to be finely controlled to make the synthesized speech more realistic.
  • the embodiment of the present application provides a speech synthesis method, the method includes:
  • a speech synthesis device includes:
  • the spectrum acquisition module is used to acquire the spectrum to be synthesized and the preset spectrum
  • a superimposed spectrum module configured to obtain a superimposed spectrum according to the spectrum to be synthesized and the preset spectrum
  • An emotional semantics module configured to perform emotional semantic feature extraction on the superimposed spectrum to obtain the emotional semantic feature corresponding to the superimposed spectrum
  • a fundamental frequency extraction module configured to perform fundamental frequency extraction on the preset frequency spectrum to obtain the fundamental frequency characteristics corresponding to the preset frequency spectrum
  • the emotional prosody module is configured to obtain the emotional prosody spectrum corresponding to the spectrum to be synthesized according to the emotional semantic feature corresponding to the superimposed spectrum and the fundamental frequency feature corresponding to the preset spectrum, so as to generate speech according to the emotional prosody spectrum.
  • a computer device includes a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the following steps:
  • a computer-readable storage medium that stores a computer program, and when the computer program is executed by a processor, the processor executes the following steps:
  • In the above speech synthesis method, device, computer equipment, and computer-readable storage medium, the spectrum to be synthesized and the preset spectrum are first obtained; a superimposed spectrum is then obtained from the spectrum to be synthesized and the preset spectrum; emotional semantic feature extraction is performed on the superimposed spectrum to obtain the emotional semantic feature corresponding to the superimposed spectrum; fundamental frequency extraction is performed on the preset spectrum to obtain the fundamental frequency feature corresponding to the preset spectrum; and finally the emotional prosody spectrum corresponding to the spectrum to be synthesized is obtained from the emotional semantic feature corresponding to the superimposed spectrum and the fundamental frequency feature corresponding to the preset spectrum, so that speech is generated from the emotional prosody spectrum.
  • This speech synthesis method first extracts emotional semantic features, giving the speech emotion, and then extracts the fundamental frequency of the preset spectrum; since the fundamental frequency reflects prosody, prosodic factors such as stress can be controlled, which ultimately makes the synthesized speech more realistic.
  • Figure 1 is an application environment diagram of a speech synthesis method in an embodiment
  • Figure 2 is a flowchart of a speech synthesis method in an embodiment
  • FIG. 3 is a flowchart of obtaining a superimposed spectrum according to the spectrum to be synthesized and the preset spectrum in an embodiment
  • FIG. 4 is a flowchart of obtaining the superimposed spectrum according to the emotional feature corresponding to the preset spectrum and the spectrum to be synthesized in an embodiment
  • Figure 5 is a structural block diagram of a speech synthesis device in an embodiment
  • Fig. 6 is a structural block diagram of a computer device in an embodiment.
  • Fig. 1 is an application environment diagram of a speech synthesis method in an embodiment. Referring to FIG. 1, the speech synthesis method is applied to a speech synthesis system.
  • the speech synthesis system can be set in a terminal or a server.
  • the terminal can be a desktop terminal or a mobile terminal, and the mobile terminal can be at least one of a mobile phone, a robot, a tablet computer, a notebook computer, and the like.
  • the desktop terminal may be a desktop computer or a vehicle-mounted computer; the server includes a high-performance computer and a high-performance computer cluster.
  • The speech synthesis system includes a spectrum acquisition module for acquiring a spectrum to be synthesized and a preset spectrum; a superimposed spectrum module for obtaining a superimposed spectrum from the spectrum to be synthesized and the preset spectrum; an emotional semantics module for performing emotional semantic feature extraction on the superimposed spectrum to obtain the emotional semantic feature corresponding to the superimposed spectrum; a fundamental frequency extraction module for performing fundamental frequency extraction on the preset spectrum to obtain the fundamental frequency feature corresponding to the preset spectrum; and an emotional prosody module for obtaining the emotional prosody spectrum corresponding to the spectrum to be synthesized from the emotional semantic feature corresponding to the superimposed spectrum and the fundamental frequency feature corresponding to the preset spectrum, so as to generate speech from the emotional prosody spectrum.
  • a speech synthesis method is provided.
  • the method can be applied to terminals, servers, and other speech synthesis devices.
  • the speech synthesis method specifically includes the following steps:
  • Step 202 Obtain a spectrum to be synthesized and a preset spectrum.
  • The spectrum to be synthesized refers to the spectrum corresponding to the text to be synthesized, carrying no emotion or prosody.
  • Exemplarily, the spectrum to be synthesized may be the Mel spectrum corresponding to the text to be synthesized, or the Mel cepstrum corresponding to the text to be synthesized.
  • The preset spectrum refers to a spectrum, set in advance, that corresponds to a target speech having a certain emotion and prosody.
  • The emotion and prosody in the preset spectrum are extracted and superimposed onto the spectrum to be synthesized, which has no emotion or prosody.
  • An emotional prosody spectrum with that certain emotion and prosody is thereby obtained, and speech with that certain emotion and prosody is generated according to the emotional prosody spectrum.
  • Exemplarily, a target speech with a certain emotion and prosody is obtained, and the preset spectrum corresponding to the target speech is derived from the target speech.
  • the preset frequency spectrum may be preset in the device that executes the speech synthesis method described in the embodiment of the present invention, or the preset frequency spectrum may be obtained from other devices when there is a need for speech synthesis.
  • Step 204 Obtain a superimposed spectrum according to the spectrum to be synthesized and the preset spectrum.
  • The superimposed spectrum contains features of both the spectrum to be synthesized and the preset spectrum.
  • Specifically, the superimposed spectrum may include all of the features of the spectrum to be synthesized and of the preset spectrum, or only part of the features of each; in either case, it must include the semantic features of the spectrum to be synthesized and the emotional features of the preset spectrum.
  • Step 206 Perform emotional semantic feature extraction on the superimposed spectrum to obtain the emotional semantic feature corresponding to the superimposed spectrum.
  • emotional semantic features include emotional features and semantic features.
  • Emotional features reflect the emotion to be expressed by the voice or text;
  • semantic features reflect the semantics of the speech or text (for example, for the text "What's the date today?", the semantics expressed is a request for today's date).
  • Emotional semantic feature extraction is performed on the superimposed spectrum, and the emotional feature of the obtained emotional semantic feature is consistent with the emotion to be expressed by the preset frequency spectrum, and the semantic feature is consistent with the semantics to be expressed by the spectrum to be synthesized.
  • the final generated speech contains emotion and is close to the real human speech.
  • emotion is the emotional attribute of the entire speech or text.
  • For example, the emotion to be expressed by the entire speech or text may be "happy", "sad", or "angry". Prosody, by contrast, reflects the emotional attributes of individual characters or words within the speech or text; for example, some characters carry stress. In "Xiao Ming is in the mall", the stress may fall on "Xiao Ming" or on "the mall". Expressing the emotion of those parts of the speech or text through prosody makes the synthesized speech more expressive, with a certain intonation, stress, and rhythm.
  • Step 208 Perform fundamental frequency extraction on the preset frequency spectrum to obtain fundamental frequency characteristics corresponding to the preset frequency spectrum.
  • The fundamental frequency corresponds to the set of lowest-frequency sine waves in the preset spectrum.
  • In sound, the fundamental frequency refers to the frequency of the fundamental tone in a complex tone.
  • Among the tones that make up a complex tone, the fundamental tone has the lowest frequency and the greatest intensity.
  • Pitch is the auditory, psychological perception of the fundamental frequency.
  • Changes in tone depend on changes in pitch; therefore, changes in tone depend on changes in the fundamental frequency.
  • Changes in tone are manifested as the rise and fall of the target speech, so the fundamental frequency feature of the preset spectrum corresponding to the target speech can reflect the prosody of that target speech.
  • By performing fundamental frequency extraction on the preset spectrum, the fundamental frequency feature of the preset spectrum can be obtained. Since the fundamental frequency feature can express prosody, the resulting emotional prosody spectrum carries both emotional and prosodic characteristics, so that the finally synthesized speech has both emotion and prosody.
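  • As an illustration of step 208, the sketch below extracts a fundamental frequency contour from a reference waveform corresponding to the preset spectrum. The patent does not name an F0 extractor; librosa's pYIN, the file name, sampling rate, and frequency range used here are assumptions.

```python
import numpy as np
import librosa

# Hypothetical reference audio from which the preset spectrum was derived.
y, sr = librosa.load("preset_reference.wav", sr=22050)

# pYIN fundamental-frequency tracking over a broad speech range (assumed bounds).
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
f0 = np.nan_to_num(f0)   # unvoiced frames are returned as NaN; zero them out

# f0 is a per-frame contour that can be aligned with the mel-spectrogram frames and
# summarized into the "fundamental frequency feature" used later with the emotional semantic feature.
```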
  • Step 210 Obtain the emotional prosody spectrum corresponding to the spectrum to be synthesized according to the emotional semantic feature corresponding to the superimposed spectrum and the fundamental frequency feature corresponding to the preset spectrum, so as to generate speech according to the emotional prosody spectrum.
  • The emotional prosody spectrum refers to a spectrum that contains the semantic features of the spectrum to be synthesized together with the emotional features and the fundamental frequency feature of the preset spectrum.
  • The semantics of the speech generated from the emotional prosody spectrum are the same as the semantics to be expressed by the spectrum to be synthesized, and the emotion and prosody of that speech are the same as the emotion and prosody to be expressed by the preset spectrum.
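  • As a rough illustration of generating speech from an emotional prosody spectrum, the sketch below inverts a mel-scale spectrum to a waveform with Griffin-Lim. The patent does not specify a vocoder; the random stand-in spectrum and the STFT parameters are assumptions.

```python
import numpy as np
import librosa

# Stand-in for a decoder output converted to a mel power spectrogram (80 bins x 120 frames).
emotional_prosody_mel = np.abs(np.random.randn(80, 120))

# Griffin-Lim inversion of the mel spectrum back to an audio waveform.
waveform = librosa.feature.inverse.mel_to_audio(emotional_prosody_mel, sr=22050,
                                                n_fft=1024, hop_length=256)
```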
  • In this method, the spectrum to be synthesized and the preset spectrum are first obtained; a superimposed spectrum is then obtained from the spectrum to be synthesized and the preset spectrum; emotional semantic feature extraction is performed on the superimposed spectrum to obtain the emotional semantic feature corresponding to the superimposed spectrum; fundamental frequency extraction is performed on the preset spectrum to obtain the fundamental frequency feature corresponding to the preset spectrum; and finally the emotional prosody spectrum corresponding to the spectrum to be synthesized is obtained from the emotional semantic feature corresponding to the superimposed spectrum and the fundamental frequency feature corresponding to the preset spectrum, so that speech is generated from the emotional prosody spectrum.
  • The method first extracts emotional semantic features, giving the speech emotion, and then extracts the fundamental frequency of the preset spectrum; since the fundamental frequency reflects prosody, prosodic factors such as stress can be controlled, which ultimately makes the synthesized speech more realistic.
  • obtaining the superimposed spectrum according to the spectrum to be synthesized and the preset spectrum in step 204 includes:
  • Step 204A Use the preset frequency spectrum as the input of the emotion encoder to obtain the emotion feature corresponding to the preset frequency spectrum.
  • the emotion encoder is used to extract the emotion features of the preset frequency spectrum.
  • The emotion encoder includes an emotion extraction unit, an emotion selection unit, and an emotion compression unit.
  • The emotion extraction unit is used to extract emotion-related features from the preset spectrum;
  • the emotion selection unit filters and selects the features extracted by the emotion extraction unit;
  • and the emotion compression unit compresses the features filtered and selected by the emotion selection unit to obtain the emotional feature corresponding to the preset spectrum.
  • Exemplarily, the emotion extraction unit of the emotion encoder consists of six Block modules, each composed of three parts: a two-dimensional convolutional layer, a two-dimensional batch normalization layer, and a rectified linear unit.
  • The emotion extraction unit extracts high-frequency, or rather high-dimensional, features by raising the dimensionality.
  • The emotion selection unit consists of a gated recurrent unit (GRU) and is used to filter and select the features extracted by the emotion extraction unit, for example by filtering out noise features among the extracted high-dimensional features, so as to ensure that all features output by the emotion selection unit are emotion-related.
  • The emotion compression unit compresses the features filtered and selected by the emotion selection unit through a linear affine transformation mapping into a one-dimensional (or two-dimensional or three-dimensional, not specifically limited here) latent vector, which is the emotional feature corresponding to the preset spectrum.
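  • Below is a minimal PyTorch sketch of an encoder with the layout just described (six Conv2d/BatchNorm2d/ReLU blocks, a gated recurrent unit, and a linear affine map). The channel widths, kernel size, strides, and latent dimension are assumptions not given in the text.

```python
import torch
import torch.nn as nn

class EmotionEncoder(nn.Module):
    """Sketch of the emotion encoder: extraction (six conv blocks),
    selection (GRU), and compression (linear affine map)."""
    def __init__(self, n_mels=80, latent_dim=256):
        super().__init__()
        chans = [1, 32, 32, 64, 64, 128, 128]         # channel widths are assumptions
        blocks = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            blocks += [nn.Conv2d(cin, cout, kernel_size=3, stride=(2, 1), padding=1),
                       nn.BatchNorm2d(cout),
                       nn.ReLU()]
        self.extractor = nn.Sequential(*blocks)        # emotion extraction unit
        freq_bins = n_mels
        for _ in range(6):                             # each block halves the mel axis (stride 2)
            freq_bins = (freq_bins + 1) // 2
        self.selector = nn.GRU(128 * freq_bins, 128, batch_first=True)  # emotion selection unit
        self.compressor = nn.Linear(128, latent_dim)   # emotion compression unit

    def forward(self, mel):                            # mel: (batch, n_mels, frames)
        x = self.extractor(mel.unsqueeze(1))           # (batch, channels, reduced mel bins, frames)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f) # time-major sequence for the GRU
        _, h = self.selector(x)                        # keep the final hidden state
        return self.compressor(h[-1])                  # (batch, latent_dim) emotion feature
```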
  • Step 204B: Obtain the superimposed spectrum according to the emotional feature corresponding to the preset spectrum and the spectrum to be synthesized. Specifically, the emotional feature corresponding to the preset spectrum may be superimposed directly onto the spectrum to be synthesized, or the semantic features of the spectrum to be synthesized may be extracted first and superimposed with the emotional feature corresponding to the preset spectrum to obtain the superimposed spectrum.
  • obtaining the superimposed spectrum according to the emotional feature corresponding to the preset spectrum and the spectrum to be synthesized in step 204B includes:
  • Step 204B1 Obtain the dimension to be synthesized corresponding to the spectrum to be synthesized.
  • the dimension to be synthesized refers to the size of the dimension corresponding to the spectrum to be synthesized.
  • Step 204B2 Convert the emotional feature corresponding to the preset frequency spectrum into an emotional conversion feature with the same dimension as the dimension to be synthesized.
  • the dimensional conversion of the emotional feature is performed to obtain the emotional conversion feature, where the dimension of the emotional conversion feature is the dimension to be synthesized.
  • Step 204B3 Obtain the superimposed spectrum according to the spectrum to be synthesized and the emotion conversion feature.
  • For example, if the spectrum to be synthesized is (A, B, C, D) and the emotion conversion feature is (a, b, c, d), the two are added to give the superimposed spectrum (A+a, B+b, C+c, D+d).
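  • A minimal numerical sketch of steps 204B2-204B3, assuming an 80-bin mel spectrum with 120 frames and an emotion feature that already matches the dimension to be synthesized; the element-wise addition mirrors the (A+a, B+b, C+c, D+d) example above.

```python
import numpy as np

# Hypothetical shapes: an 80-bin mel spectrum to be synthesized with 120 frames, and an
# 80-dimensional emotion feature from the emotion encoder (both sizes are assumptions).
spectrum_to_synthesize = np.random.randn(80, 120)      # (A, B, C, D) in the text, per frame
emotion_feature = np.random.randn(80)                  # emotional feature of the preset spectrum

# Step 204B2: here the feature already has the dimension to be synthesized (80);
# otherwise a learned linear projection would map it to that dimension first.
emotion_conversion_feature = emotion_feature           # (a, b, c, d) in the text

# Step 204B3: element-wise addition, broadcast across frames, giving (A+a, B+b, C+c, D+d).
superimposed_spectrum = spectrum_to_synthesize + emotion_conversion_feature[:, None]
print(superimposed_spectrum.shape)                     # (80, 120)
```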
  • performing emotional semantic feature extraction on the superimposed spectrum in step 206 to obtain the emotional semantic feature corresponding to the superimposed spectrum includes:
  • the superimposed spectrum is used as the input of the emotional semantic encoder to obtain the emotional semantic features corresponding to the superimposed spectrum output by the emotional semantic encoder.
  • the emotional semantic encoder is used to extract the emotional semantic features of the superimposed spectrum.
  • the emotion semantic encoder includes an emotion semantic extraction unit, an emotion semantic selection unit and an emotion semantic compression unit.
  • the emotion semantic extraction unit is used to extract features related to emotion semantics in the superimposed spectrum
  • the emotion semantic selection unit is used to filter and select the features extracted by the emotion semantic extraction unit
  • The emotional semantic compression unit compresses the features filtered and selected by the emotional semantic selection unit to obtain the emotional semantic feature corresponding to the superimposed spectrum.
  • Exemplarily, the emotional semantic extraction unit of the emotional semantic encoder consists of six Block modules, each composed of three parts: a two-dimensional convolutional layer, a two-dimensional batch normalization layer, and a rectified linear unit.
  • The emotional semantic extraction unit extracts high-frequency, or rather high-dimensional, features by raising the dimensionality.
  • The emotional semantic selection unit consists of a gated recurrent unit (GRU) and is used to filter and select the features extracted by the emotional semantic extraction unit, for example by filtering out noise features among the extracted high-dimensional features, so as to ensure that all features output by the emotional semantic selection unit relate to emotional semantics.
  • The emotional semantic compression unit consists of a linear affine transformation mapping unit; the emotional semantic features filtered and selected by the emotional semantic selection unit are compressed through this mapping into a one-dimensional (or two-dimensional or three-dimensional, not specifically limited here) latent vector, which is the emotional semantic feature corresponding to the superimposed spectrum.
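  • Since the emotional semantic encoder described here shares the block/GRU/linear layout of the emotion encoder, the same sketch class can stand in for it; shapes and the latent dimension remain assumptions.

```python
import torch

# Assuming the EmotionEncoder sketch given earlier; the emotional semantic encoder
# reuses the same extraction/selection/compression layout, so the class is reused here.
semantic_encoder = EmotionEncoder(n_mels=80, latent_dim=256)
superimposed_batch = torch.randn(2, 80, 120)                        # batch of superimposed mel spectra
emotional_semantic_feature = semantic_encoder(superimposed_batch)   # shape: (2, 256)
```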
  • Obtaining the emotional prosody spectrum corresponding to the spectrum to be synthesized according to the emotional semantic feature corresponding to the superimposed spectrum and the fundamental frequency feature corresponding to the preset spectrum in step 210 includes: combining the emotional semantic feature corresponding to the superimposed spectrum and the fundamental frequency feature corresponding to the preset spectrum to obtain a combined feature; and inputting the combined feature into an emotional prosody decoder to obtain the emotional prosody spectrum, corresponding to the spectrum to be synthesized, output by the emotional prosody decoder.
  • the combined feature includes the semantic feature of the spectrum to be synthesized, the emotional feature and the fundamental frequency feature of the preset spectrum.
  • the emotional semantic feature corresponding to the superimposed spectrum is a one-dimensional vector A
  • the fundamental frequency feature corresponding to the preset spectrum is a one-dimensional vector B
  • the combined feature is a two-dimensional vector (A, B).
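  • A minimal sketch of forming the combined feature by concatenating the emotional semantic feature (vector A above) with the fundamental frequency feature (vector B above); the 64- and 16-dimensional sizes, chosen so that they sum to the 80 dimensions used in the decoder example below, are assumptions.

```python
import torch

# Hypothetical sizes: a 64-dim emotional semantic feature and a 16-dim summarized
# fundamental-frequency feature, concatenated into an 80-dim combined feature.
emotional_semantic_feature = torch.randn(2, 64)   # per-utterance emotional semantic feature (batch of 2)
f0_feature = torch.randn(2, 16)                   # per-utterance fundamental frequency feature
combined_feature = torch.cat([emotional_semantic_feature, f0_feature], dim=-1)
print(combined_feature.shape)                     # torch.Size([2, 80])
```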
  • the emotional prosody decoder is used to obtain the emotional prosody spectrum corresponding to the spectrum to be synthesized.
  • The emotional prosody decoder includes a first dimension conversion unit, a feature extraction unit, a second dimension conversion unit, and a compression unit. After the first dimension conversion unit expands the dimensionality of the combined feature, the feature extraction unit re-extracts features from the dimension-expanded combined feature; the second dimension conversion unit then expands the re-extracted features, and the compression unit compresses the expanded features so that their dimensionality matches that of the combined feature, yielding the emotional prosody spectrum corresponding to the spectrum to be synthesized.
  • Exemplarily, the first dimension conversion unit of the emotional prosody decoder consists of a long short-term memory (LSTM) recurrent neural network; the feature extraction unit consists of three Block modules, each composed of a one-dimensional convolutional layer, a one-dimensional batch normalization layer, and a rectified linear unit; the second dimension conversion unit consists of an LSTM;
  • and the compression unit consists of a linear affine transformation mapping unit.
  • For example, the combined feature is 80-dimensional.
  • The combined feature is input into the emotional prosody decoder.
  • The first dimension conversion unit raises the dimensionality of the combined feature to 256, and the feature extraction unit re-extracts and transforms the 256-dimensional combined feature.
  • The transformed combined feature remains 256-dimensional.
  • To ensure that enough features are available, the second dimension conversion unit raises the transformed combined feature to 1024 dimensions.
  • The compression unit applies a linear affine transformation mapping to the 1024-dimensional features and compresses them into an 80-dimensional output, which is the emotional prosody spectrum corresponding to the spectrum to be synthesized.
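  • A minimal PyTorch sketch of a decoder with the described layout (LSTM up to 256 dimensions, three Conv1d/BatchNorm1d/ReLU blocks, LSTM up to 1024 dimensions, linear map back down to 80). Treating the combined feature as a per-frame 80-dimensional sequence, and the kernel size, are assumptions.

```python
import torch
import torch.nn as nn

class EmotionalProsodyDecoder(nn.Module):
    """Sketch of the emotional prosody decoder: first dimension conversion (LSTM),
    feature extraction (three Conv1d blocks), second dimension conversion (LSTM),
    and compression (linear affine map back to the spectrum dimension)."""
    def __init__(self, in_dim=80, mid_dim=256, wide_dim=1024, out_dim=80):
        super().__init__()
        self.expand1 = nn.LSTM(in_dim, mid_dim, batch_first=True)    # first dimension conversion unit
        blocks = []
        for _ in range(3):                                           # feature extraction unit
            blocks += [nn.Conv1d(mid_dim, mid_dim, kernel_size=5, padding=2),
                       nn.BatchNorm1d(mid_dim),
                       nn.ReLU()]
        self.extract = nn.Sequential(*blocks)
        self.expand2 = nn.LSTM(mid_dim, wide_dim, batch_first=True)  # second dimension conversion unit
        self.compress = nn.Linear(wide_dim, out_dim)                 # compression unit

    def forward(self, combined):          # combined: (batch, frames, 80)
        x, _ = self.expand1(combined)     # (batch, frames, 256)
        x = self.extract(x.transpose(1, 2)).transpose(1, 2)   # Conv1d expects (batch, channels, frames)
        x, _ = self.expand2(x)            # (batch, frames, 1024)
        return self.compress(x)           # (batch, frames, 80) emotional prosody spectrum
```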
  • In one embodiment, the emotion encoder, the emotional semantic encoder, and the emotional prosody decoder are integrated into the same speech synthesis neural network, which is obtained by training on the spectra of training speech.
  • During training, the spectrum of the training speech is input into the speech synthesis neural network; the emotion encoder extracts the training emotion feature corresponding to the spectrum of the training speech, and this training emotion feature is superimposed on the spectrum of the training speech to obtain a training superimposed spectrum. The training superimposed spectrum is input into the emotional semantic encoder, which outputs the training emotional semantic feature corresponding to the training superimposed spectrum. The training fundamental frequency feature corresponding to the spectrum of the training speech and the training emotional semantic feature corresponding to the training superimposed spectrum are combined to obtain a training combined feature, which is input into the emotional prosody decoder to output a training emotional prosody spectrum. The error between the spectrum of the training speech and the training emotional prosody spectrum is computed, and training of the speech synthesis neural network is complete once this error is smaller than a preset error value.
  • Accordingly, when the spectrum to be synthesized and the preset spectrum are input into the trained speech synthesis neural network, the network directly outputs the emotional prosody spectrum corresponding to the spectrum to be synthesized.
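  • A hedged sketch of one training step along the lines described above, reusing the EmotionEncoder and EmotionalProsodyDecoder sketches from earlier bullets; the latent sizes, the L1 reconstruction criterion, and broadcasting the combined feature over frames are assumptions rather than details given in the text.

```python
import torch
import torch.nn.functional as F

# Assumed instantiations, matching the earlier sketches:
#   emotion_enc  = EmotionEncoder(n_mels=80, latent_dim=80)   # so it can be added to the 80-bin spectrum
#   semantic_enc = EmotionEncoder(n_mels=80, latent_dim=64)
#   decoder      = EmotionalProsodyDecoder()
def training_step(train_mel, train_f0_feat, emotion_enc, semantic_enc, decoder, optimizer):
    # train_mel: (batch, 80, frames); train_f0_feat: (batch, 16) summarized F0 feature.
    emo = emotion_enc(train_mel)                                 # training emotion feature, (batch, 80)
    superimposed = train_mel + emo.unsqueeze(-1)                 # training superimposed spectrum
    sem = semantic_enc(superimposed)                             # training emotional semantic feature, (batch, 64)
    combined = torch.cat([sem, train_f0_feat], dim=-1)           # training combined feature, (batch, 80)
    frames = train_mel.size(-1)
    pred = decoder(combined.unsqueeze(1).repeat(1, frames, 1))   # training emotional prosody spectrum
    loss = F.l1_loss(pred.transpose(1, 2), train_mel)            # error between training spectrum and prediction
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```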
  • obtaining the spectrum to be synthesized in step 202 includes:
  • the text to be synthesized refers to the text content corresponding to the spectrum to be synthesized.
  • The text to be synthesized is recognized to obtain a number of text contents, and the speech to be synthesized corresponding to those text contents is generated.
  • The spectrum to be synthesized of the text to be synthesized can then be determined from the speech to be synthesized (for example, the speech to be synthesized is processed with a Fourier transform to obtain the spectrum to be synthesized).
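  • A hedged sketch of this last step: turning the generated plain speech for the text into a mel spectrum to be synthesized via a short-time Fourier transform. The file name, sampling rate, and STFT/mel parameters are assumptions.

```python
import librosa

# Hypothetical plain (emotion-free) speech generated for the text to be synthesized.
speech_to_synthesize, sr = librosa.load("synthesized_plain.wav", sr=22050)

# Short-time Fourier transform + mel filterbank -> 80-bin mel spectrum to be synthesized.
mel = librosa.feature.melspectrogram(y=speech_to_synthesize, sr=sr,
                                     n_fft=1024, hop_length=256, n_mels=80)
mel_db = librosa.power_to_db(mel)   # shape: (80, frames)
```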
  • As shown in FIG. 5, in one embodiment, a speech synthesis device is provided, and the device includes:
  • the spectrum acquisition module 502 is used to acquire the spectrum to be synthesized and the preset spectrum
  • the superimposed spectrum module 504 is configured to obtain a superimposed spectrum according to the spectrum to be synthesized and the preset spectrum;
  • the emotional semantic module 506 is configured to perform emotional semantic feature extraction on the superimposed spectrum to obtain the emotional semantic feature corresponding to the superimposed spectrum;
  • the fundamental frequency extraction module 508 is configured to perform fundamental frequency extraction on the preset frequency spectrum to obtain the fundamental frequency characteristics corresponding to the preset frequency spectrum;
  • the emotional prosody module 510 is configured to obtain the emotional prosody spectrum corresponding to the spectrum to be synthesized according to the emotional semantic feature corresponding to the superimposed spectrum and the fundamental frequency feature corresponding to the preset spectrum, so as to generate speech according to the emotional prosody spectrum.
  • The speech synthesis device described above first obtains the spectrum to be synthesized and the preset spectrum; then obtains a superimposed spectrum from the spectrum to be synthesized and the preset spectrum; performs emotional semantic feature extraction on the superimposed spectrum to obtain the emotional semantic feature corresponding to the superimposed spectrum; performs fundamental frequency extraction on the preset spectrum to obtain the fundamental frequency feature corresponding to the preset spectrum; and finally obtains the emotional prosody spectrum corresponding to the spectrum to be synthesized from the emotional semantic feature corresponding to the superimposed spectrum and the fundamental frequency feature corresponding to the preset spectrum, so as to generate speech from the emotional prosody spectrum.
  • This speech synthesis approach first extracts emotional semantic features, giving the speech emotion, and then extracts the fundamental frequency of the preset spectrum; since the fundamental frequency reflects prosody, prosodic factors such as stress can be controlled, which ultimately makes the synthesized speech more realistic.
  • The superimposing spectrum module 504 includes: an emotional feature extraction module, configured to use the preset spectrum as the input of the emotion encoder to obtain the emotional feature corresponding to the preset spectrum; and a superimposing module, configured to obtain the superimposed spectrum according to the emotional feature corresponding to the preset spectrum and the spectrum to be synthesized.
  • the superposition module is specifically configured to: obtain the dimension to be synthesized corresponding to the spectrum to be synthesized; and convert the emotional feature corresponding to the preset frequency spectrum into an emotional conversion feature with a dimension consistent with the dimension to be synthesized ; Obtain the superimposed spectrum according to the spectrum to be synthesized and the emotional conversion feature.
  • the emotion semantic module 506 is specifically configured to: use the superimposed spectrum as the input of the emotional semantic encoder to obtain the emotional semantic features corresponding to the superimposed spectrum output by the emotional semantic encoder.
  • the emotional prosody module 510 is specifically configured to: combine the emotional semantic feature corresponding to the superimposed spectrum and the fundamental frequency feature corresponding to the preset spectrum to obtain a combined feature; and input the combined feature
  • the emotional prosody decoder obtains the emotional prosody spectrum corresponding to the to-be-synthesized spectrum output by the emotional prosody decoder.
  • the spectrum acquisition module 502 is configured to: acquire the text to be synthesized; and obtain the spectrum to be synthesized of the text to be synthesized according to the text to be synthesized.
  • Fig. 6 shows an internal structure diagram of a computer device in an embodiment.
  • the computer device can be a terminal, a server, or a speech synthesis device.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus.
  • the memory includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operating system and may also store a computer program.
  • When that computer program is executed by the processor, the processor can implement the speech synthesis method.
  • a computer program may also be stored in the internal memory, and when the computer program is executed by the processor, the processor can execute the speech synthesis method.
  • FIG. 6 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • The specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • a computer device including a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the following steps:
  • With the computer device described above, the spectrum to be synthesized and the preset spectrum are first obtained; a superimposed spectrum is then obtained from the spectrum to be synthesized and the preset spectrum; emotional semantic feature extraction is performed on the superimposed spectrum to obtain the emotional semantic feature corresponding to the superimposed spectrum; fundamental frequency extraction is performed on the preset spectrum to obtain the fundamental frequency feature corresponding to the preset spectrum; and finally the emotional prosody spectrum corresponding to the spectrum to be synthesized is obtained from the emotional semantic feature corresponding to the superimposed spectrum and the fundamental frequency feature corresponding to the preset spectrum, so that speech is generated from the emotional prosody spectrum.
  • This speech synthesis method first extracts emotional semantic features, giving the speech emotion, and then extracts the fundamental frequency of the preset spectrum; since the fundamental frequency reflects prosody, prosodic factors such as stress can be controlled, which ultimately makes the synthesized speech more realistic.
  • The obtaining of the superimposed spectrum according to the spectrum to be synthesized and the preset spectrum includes: using the preset spectrum as an input of an emotion encoder to obtain the emotional feature corresponding to the preset spectrum; and obtaining the superimposed spectrum according to the emotional feature corresponding to the preset spectrum and the spectrum to be synthesized.
  • The obtaining of the superimposed spectrum according to the emotional feature corresponding to the preset spectrum and the spectrum to be synthesized includes: obtaining the dimension to be synthesized corresponding to the spectrum to be synthesized; converting the emotional feature corresponding to the preset spectrum into an emotion conversion feature whose dimension is consistent with the dimension to be synthesized; and obtaining the superimposed spectrum according to the spectrum to be synthesized and the emotion conversion feature.
  • The performing of emotional semantic feature extraction on the superimposed spectrum to obtain the emotional semantic feature corresponding to the superimposed spectrum includes: using the superimposed spectrum as an input of an emotional semantic encoder to obtain the emotional semantic feature, corresponding to the superimposed spectrum, output by the emotional semantic encoder.
  • The obtaining of the emotional prosody spectrum corresponding to the spectrum to be synthesized according to the emotional semantic feature corresponding to the superimposed spectrum and the fundamental frequency feature corresponding to the preset spectrum includes: combining the emotional semantic feature corresponding to the superimposed spectrum with the fundamental frequency feature corresponding to the preset spectrum to obtain a combined feature; and inputting the combined feature into an emotional prosody decoder to obtain the emotional prosody spectrum, corresponding to the spectrum to be synthesized, output by the emotional prosody decoder.
  • the obtaining the spectrum to be synthesized includes: obtaining the text to be synthesized; and obtaining the spectrum to be synthesized of the text to be synthesized according to the text to be synthesized.
  • a computer-readable storage medium that stores a computer program, and when the computer program is executed by a processor, the processor executes the following steps:
  • With the computer-readable storage medium described above, the spectrum to be synthesized and the preset spectrum are first obtained; a superimposed spectrum is then obtained from the spectrum to be synthesized and the preset spectrum; emotional semantic feature extraction is performed on the superimposed spectrum to obtain the emotional semantic feature corresponding to the superimposed spectrum; fundamental frequency extraction is performed on the preset spectrum to obtain the fundamental frequency feature corresponding to the preset spectrum; and finally the emotional prosody spectrum corresponding to the spectrum to be synthesized is obtained from the emotional semantic feature corresponding to the superimposed spectrum and the fundamental frequency feature corresponding to the preset spectrum, so that speech is generated from the emotional prosody spectrum.
  • This speech synthesis method first extracts emotional semantic features, giving the speech emotion, and then extracts the fundamental frequency of the preset spectrum; since the fundamental frequency reflects prosody, prosodic factors such as stress can be controlled, which ultimately makes the synthesized speech more realistic.
  • The obtaining of the superimposed spectrum according to the spectrum to be synthesized and the preset spectrum includes: using the preset spectrum as an input of an emotion encoder to obtain the emotional feature corresponding to the preset spectrum; and obtaining the superimposed spectrum according to the emotional feature corresponding to the preset spectrum and the spectrum to be synthesized.
  • The obtaining of the superimposed spectrum according to the emotional feature corresponding to the preset spectrum and the spectrum to be synthesized includes: obtaining the dimension to be synthesized corresponding to the spectrum to be synthesized; converting the emotional feature corresponding to the preset spectrum into an emotion conversion feature whose dimension is consistent with the dimension to be synthesized; and obtaining the superimposed spectrum according to the spectrum to be synthesized and the emotion conversion feature.
  • The performing of emotional semantic feature extraction on the superimposed spectrum to obtain the emotional semantic feature corresponding to the superimposed spectrum includes: using the superimposed spectrum as an input of an emotional semantic encoder to obtain the emotional semantic feature, corresponding to the superimposed spectrum, output by the emotional semantic encoder.
  • The obtaining of the emotional prosody spectrum corresponding to the spectrum to be synthesized according to the emotional semantic feature corresponding to the superimposed spectrum and the fundamental frequency feature corresponding to the preset spectrum includes: combining the emotional semantic feature corresponding to the superimposed spectrum with the fundamental frequency feature corresponding to the preset spectrum to obtain a combined feature; and inputting the combined feature into an emotional prosody decoder to obtain the emotional prosody spectrum, corresponding to the spectrum to be synthesized, output by the emotional prosody decoder.
  • the obtaining the spectrum to be synthesized includes: obtaining the text to be synthesized; and obtaining the spectrum to be synthesized of the text to be synthesized according to the text to be synthesized.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A speech synthesis method and apparatus, a computer device, and a computer-readable storage medium. The method includes: obtaining a spectrum to be synthesized and a preset spectrum (202); obtaining a superimposed spectrum from the spectrum to be synthesized and the preset spectrum (204); performing emotional semantic feature extraction on the superimposed spectrum to obtain the corresponding emotional semantic feature (206); performing fundamental frequency extraction on the preset spectrum to obtain the fundamental frequency feature corresponding to the preset spectrum (208); and obtaining, from the emotional semantic feature corresponding to the superimposed spectrum and the fundamental frequency feature corresponding to the preset spectrum, the emotional prosody spectrum corresponding to the spectrum to be synthesized, and generating speech from the emotional prosody spectrum (210). The generated speech has the same semantics as the spectrum to be synthesized and matches the emotional and prosodic characteristics of the preset spectrum. The method enables control over prosody such as stress, ultimately making the synthesized speech more realistic.

Description

语音合成方法、装置、计算机设备及计算机可读存储介质 技术领域
本申请涉及语言合成技术领域,尤其涉及一种语音合成方法、装置、计算机设备及计算机可读存储介质。
背景技术
语音合成是通过机械的、电子的方法产生人造语音的技术,具体是指将计算机自己产生的、或外部输入计算机的文字信息转变为可以听得懂的、流利的语音输出的技术。
技术问题
现有技术中,从参考的语音中提取情感特征,然后通过无监督的方式利用提取的情感特征来控制语音的风格,但是,语音中不止情感,还包括有重音等,需要对重音等韵律因素进行精细控制,从而使得合成的语音更加真实。
技术解决方案
基于此,有必要针对上述问题,提出了一种能够同时对情感和韵律进行控制的语音合成、装置、计算机设备及存储介质。
本申请实施例提供了一种语音合成方法,所述方法包括:
获取待合成频谱和预置频谱;
根据所述待合成频谱和所述预置频谱得到叠加频谱;
对所述叠加频谱进行情感语义特征提取得到所述叠加频谱对应的情感语义特征;
对所述预置频谱进行基频提取,得到所述预置频谱对应的基频特征;
根据所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征得到所述待合成频谱对应的情感韵律频谱,以根据所述情感韵律频谱生成语音。
一种语音合成装置,所述装置包括:
频谱获取模块,用于获取待合成频谱和预置频谱;
叠加频谱模块,用于根据所述待合成频谱和所述预置频谱得到叠加频谱;
情感语义模块,用于对所述叠加频谱进行情感语义特征提取得到所述叠加频谱对应的情感语义特征;
基频提取模块,用于对所述预置频谱进行基频提取,得到所述预置频谱对应的基频特征;
情感韵律模块,用于根据所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征得到所述待合成频谱对应的情感韵律频谱,以根据所述情感韵律频谱生成语音。
一种计算机设备,包括存储器和处理器,所述存储器存储有计算机程序,所述计算机程序被所述处理器执行时,使得所述处理器执行以下步骤:
获取待合成频谱和预置频谱;
根据所述待合成频谱和所述预置频谱得到叠加频谱;
对所述叠加频谱进行情感语义特征提取得到所述叠加频谱对应的情感语义特征;
对所述预置频谱进行基频提取,得到所述预置频谱对应的基频特征;
根据所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征得到所述待合成频谱对应的情感韵律频谱,以根据所述情感韵律频谱生成语音。
一种计算机可读存储介质,存储有计算机程序,所述计算机程序被处理器执行时,使得所述处理器执行以下步骤:
获取待合成频谱和预置频谱;
根据所述待合成频谱和所述预置频谱得到叠加频谱;
对所述叠加频谱进行情感语义特征提取得到所述叠加频谱对应的情感语义特征;
对所述预置频谱进行基频提取,得到所述预置频谱对应的基频特征;
根据所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征得到所述待合成频谱对应的情感韵律频谱,以根据所述情感韵律频谱生成语音。
有益效果
实施本申请实施例,将具有如下有益效果:
上述语音合成方法、装置、计算机设备及计算机可读存储介质,首先获取待合成频谱和预置频谱;然后根据所述待合成频谱和所述预置频谱得到叠加频谱;同时对所述叠加频谱进行情感语义特征提取得到所述叠加频谱对应的情感语义特征;并且对所述预置频谱进行基频提取,得到所述预置频谱对应的基频特征;最后根据所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征得到所述待合成频谱对应的情感韵律频谱,以根据所述情感韵律频谱生成语音。上述语音合成方法,首先提取到了情感语义特征,赋予了语音情感,然后提取到了预置频谱的基频,而基频能够体现韵律,由此实现了对语音的重音等韵律进行控制,最终使得合成的语音更加真实。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
其中:
图1为一个实施例中语音合成方法的应用环境图;
图2为一个实施例中语音合成方法的流程图;
图3为一个实施例中根据所述待合成频谱和所述预置频谱得到叠加频谱的流程图;
图4为一个实施例中根据所述预置频谱对应的情感特征和所述待合成频谱得到所述叠加频谱的流程图;
图5为一个实施例中语音合成装置的结构框图;
图6为一个实施例中计算机设备的结构框图。
本发明的实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
图1为一个实施例中语音合成方法的应用环境图。参照图1,该语音合成方法应用于语音合成系统,该语音合成系统可设置于终端中,也可以设置于服务器中,其中,终端具体可以是台式终端或移动终端,移动终端具体可以是手机、机器人、平板电脑、笔记本电脑等中的至少一种,台式终端可以是台式电脑、车载电脑;服务器包括高性能计算机和高性能计算机集群。该语音合成系统包括用于获取待合成频谱和预置频谱的频谱获取模块;用于根据所述待合成频谱和所述预置频谱得到叠加频谱的叠加频谱模块;用于对所述叠加频谱进行情感语义特征提取得到所述叠加频谱对应的情感语义特征的情感语义模块;用于对所述预置频谱进行基频提取,得到所述预置频谱对应的基频特征的基频提取模块;用于根据所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征得到所述待合成频谱对应的情感韵律频谱,以根据所述情感韵律频谱生成语音的情感韵律模块。
如图2所示,在一个实施例中,提供了一种语音合成方法。该方法既可以应用于终端,也可以应用于服务器,还可以应用于其他语音合成装置中。该语音合成方法具体包括如下步骤:
步骤202:获取待合成频谱和预置频谱。
其中,待合成频谱是指不具有情感、韵律的待合成文本对应的频谱。示例性的,待合成频谱可以是待合成文本对应的梅尔频谱,还可以是待合成文本对应的梅尔倒谱。
其中,预置频谱,是指预先设置的具有一定的情感和韵律的目标语音对应的频谱,将预置频谱中的情感和韵律提取出来,并叠加到不具有情感、韵律的待合成频谱上,得到具有该一定的情感和韵律的情感韵律频谱,从而根据该情感韵律频谱生成具有该一定的情感和韵律的语音。示例性的,获取具有一定的情感和韵律的目标语音;根据所述目标语音得到所述目标语音对应的预置频谱。预置频谱可以预先设置于执行本发明实施例所述的语音合成方法的设备中,也可以在有语音合成需求的时候,从其他设备中获取到该预置频谱。
步骤204:根据所述待合成频谱和所述预置频谱得到叠加频谱。
其中,叠加频谱,同时包含有待合成频谱的特征和预置频谱的特征,具体的,叠加频谱可以同时包括所述待合成频谱和所述预置频谱的全部特征,也可以同时包括待合成频谱和所述预置频谱的部分特征,但需要说明的是,叠加频谱必须包括所述待合成频谱中的语义特征和预置频谱中的情感特征。
步骤206:对所述叠加频谱进行情感语义特征提取得到所述叠加频谱对应的情感语义特征。
其中,情感语义特征包括情感特征和语义特征。情感特征反映语音或者文本所要表达的情感;语义特征反映语音或者文本的语义(例如,文本“今天几号?”,表达的语义就是想询问今天的日期)。
对所述叠加频谱进行情感语义特征提取,得到的情感语义特征中的情感特征与预置频谱所要表达的情感一致,语义特征与待合成频谱所要表达的语义一致。
通过对叠加频谱进行情感语义特征提取,使得最终生成的语音包含有情感,接近人真实的语音。
其中,情感,为整个语音或者文本的情感属性,例如,整个语音或者文本所要表达的情感为“高兴”、“伤心”或者为“生气”;韵律,反映整个语音或者文本中的部分汉字的情感属性,例如,部分汉字具有重音,“小明在商场”,重音可能在小明,也可能在商场,通过韵律对整个语音或者文本中的部分汉字的情感进行表达,使得合成的语音更加的抑扬顿挫,具备一定的语调、重音和节奏。
步骤208:对所述预置频谱进行基频提取,得到所述预置频谱对应的基频特征。
其中,基频,为预置频谱中的一组频率最低的正弦波。在声音中,基频是指一个复音中基音的频率。在构成一个复音的若干个音中,基音的频率最低,强度最大。音调是对基频的听觉心理感知量。声调高低变化取决于音调的高低变化,因此,声调的高低变化取决于基频的大小变化。声调的高低变化表现为目标语音的抑扬顿挫,因此目标语音对应的预置频谱的基频特征可以反映该目标语音的韵律。
通过对所述预置频谱进行基频提取,可以得到预置频谱中的基频特征,而基频特征能够表达韵律,使得最终得到的情感韵律频谱同时具备情感特征和韵律特征,从而使得最终合成的语音具备情感和韵律。
步骤210:根据所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征得到所述待合成频谱对应的情感韵律频谱,以根据所述情感韵律频谱生成语音。
其中,情感韵律频谱是指同时包含待合成频谱的语义特征、预置频谱的情感特征和基频特征的频谱,根据所述情感韵律频谱生成的语音所要的语义与所述待合成频谱所要表达的语义相同,根据所述情感韵律频谱生成的语音所要表达的情感、韵律和所述预置频谱所要表达的情感、韵律相同。
上述语音合成方法,首先获取待合成频谱和预置频谱;然后根据所述待合成频谱和所述预置频谱得到叠加频谱;同时对所述叠加频谱进行情感语义特征提取得到所述叠加频谱对应的情感语义特征;并且对所述预置频谱进行基频提取,得到所述预置频谱对应的基频特征;最后根据所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征得到所述待合成频谱对应的情感韵律频谱,以根据所述情感韵律频谱生成语音。上述语音合成方法,首先提取到了情感语义特征,赋予了语音情感,然后提取到了预置频谱的基频,而基频能够体现韵律,由此实现了对语音的重音等韵律进行控制,最终使得合成的语音更加真实。
在一个实施例中,如图3所示,步骤204所述根据所述待合成频谱和所述预置频谱得到叠加频谱,包括:
步骤204A,将所述预置频谱作为情感编码器的输入,得到所述预置频谱对应的情感特征。
其中,情感编码器,用于提取所述预置频谱的情感特征。情感编码器包括情感提取部,情感选取部和情感压缩部。其中,情感提取部用于提取所述预置频谱中关于情感的特征,情感选取部对所述情感提取部提取得到的特征进行过滤和选取,情感压缩部将所述情感选取部选取过滤的特征进行压缩以获取所述预置频谱对应的情感特征。示例性的,情感编码器的情感提取部由六个块(Block)模块构成,每个Block模块均由三部分组成:一个二维卷积层,一个二维批标准化层和一个修正线性单元。情感提取部通过升维提取高频或者说是高维的特征。情感选取部由门控循环单元构成,用于将所述情感提取部提取的特征进行过滤和选取,如过滤掉提取高维的特征中的噪音特征,以保障情感选取部输出特征均为关于情感的特征。情感压缩部将所述情感选取部过滤和选取的特征经过线性仿射变换映射压缩得到一个一维(或者二维、三维,在此不做具体的限定)的潜在向量,即为所述预置频谱对应的情感特征。
步骤204B,根据所述预置频谱对应的情感特征和所述待合成频谱得到所述叠加频谱。
根据所述预置频谱对应的情感特征和所述待合成频谱得到所述叠加频谱,具体是将所述预置频谱对应的情感特征和所述待合成频谱直接叠加得到所述叠加频谱,还可以是提取所述待合成频谱对应的语义特征,将所述预置频谱对应的情感特征和所述待合成频谱对应的语义特征叠加得到所述叠加频谱。
如图4所示,在一个实施例中,步骤204B所述根据所述预置频谱对应的情感特征和所述待合成频谱得到所述叠加频谱,包括:
步骤204B1:获取所述待合成频谱对应的待合成维度。
其中,待合成维度是指待合成频谱对应的维度大小。
步骤204B2:将所述预置频谱对应的情感特征转换成维度和所述待合成维度一致的情感转换特征。
对情感特征进行维度转换得到情感转换特征,其中,情感转换特征的维度为待合成维度。
步骤204B3:根据所述待合成频谱和所述情感转换特征得到所述叠加频谱。
示例性的,待合成频谱为(A,B,C,D),情感转换特征为(a,b,c,d),将待合成频谱和情感转换特征相加,得到叠加频谱为(A+a,B+b,C+c,D+d)。
在一个实施例中,步骤206对所述叠加频谱进行情感语义特征提取得到所述叠加频谱对应的情感语义特征,包括:
将所述叠加频谱作为情感语义编码器的输入,得到所述情感语义编码器输出的所述叠加频谱对应的情感语义特征。
其中,情感语义编码器用于提取所述叠加频谱的情感语义特征。情感语义编码器包括情感语义提取部,情感语义选取部和情感语义压缩部。其中,情感语义提取部用于提取所述叠加频谱中关于情感语义的特征,情感语义选取部用于对所述情感语义提取部提取得到的特征进行过滤和选取,情感语义压缩部将所述情感语义选取部选取过滤的特征进行压缩以获取所述叠加频谱对应的情感语义特征。示例性的,情感语义编码器的情感语义提取部由六个Block模块构成,每个Block模块均由三部分组成:一个二维卷积层,一个二维批标准化层和一个修正线性单元。情感语义提取部通过升维提取高频或者说是高维的特征。情感语义选取部由门控循环单元构成,用于将所述情感语义提取部提取的特征进行过滤和选取,如过滤掉提取高维的特征中的噪音特征,以保障情感语义选取部输出特征均为关于情感语义的特征。情感语义压缩部由线性仿射变换映射单元构成,将所述情感语义选取部过滤和选取的情感语义特征经过线性仿射变换映射压缩得到一个一维(或者二维、三维,在此不做具体的限定)的潜在向量,即为所述叠加频谱对应的情感语义特征。
在一个实施例中,步骤210根据所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征得到所述待合成频谱对应的情感韵律频谱,包括:
将所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征进行组合,得到组合特征;将所述组合特征输入情感韵律解码器,得到所述情感韵律解码器输出的所述待合成频谱对应的情感韵律频谱。
其中,组合特征包括所述待合成频谱的语义特征、所述预置频谱的情感特征和基频特征。示例性的,所述叠加频谱对应的情感语义特征为一维向量A,所述预置频谱对应的基频特征为一维向量B,则所述组合特征为二维向量(A,B)。
其中,情感韵律解码器用于获取待合成频谱对应的情感韵律频谱。情感韵律解码器包括第一维度转换部、特征提取部、第二维度转换部和压缩部。第一维度转换部将所述组合特征的维度扩展后,特征提取部对维度扩展后的组合特征进行特征的再次提取,第二维度转换部对再次提取的特征进行扩展,扩展后经压缩部压缩,使其维度与组合特征的维度一样,即可获取所述待合成频谱对应的情感韵律频谱。示例性的,情感解码器的第一维度转换部由一个长短时记忆循环神经网络(Long Short-Term Memory,LSTM)构成,特征提取部由三个Block模块构成,每个Block模块均由一维卷积层、一维批标准化层和修正线性单元构成,第二维度转换部由一个LSTM构成,压缩部由线性仿射变换映射单元构成。组合特征的维度为80维,将组合特征输入情感解码器中,第一维度转换部将组合特征的维度升为256维,特征提取部将256维的组合特征进行特征的再次提取和转换,转换后的组合特征仍为256维,为保障有足够多的特征,第二维度转换部对转换后的组合特征进行升维,将其维度升为1024维。压缩部将1024维的特征进行线性仿射变换映射,压缩得到一个80维的数据,即为待合成频谱对应的情感韵律频谱。
在一个实施例中,所述情感编码器、所述情感语义编码器和所述情感韵律解码器集成在同一个语音合成神经网络中,根据训练语音的频谱训练得到。将训练语音的频谱输入到语音合成神经网络中,情感编码器提取训练语音的频谱对应的训练情感特征,训练情感特征和训练语音的频谱叠加得到训练叠加频谱,将训练叠加频谱输入到情感语义编码器中,情感语义编码器输出训练叠加频谱对应的训练情感语义特征,将训练语音的频谱对应的训练基频特征和训练叠加频谱对应的训练情感语义特征合并得到的训练组合特征,训练组合特征输入情感韵律解码器中输出训练情感韵律频谱,计算训练语音的频谱和训练情感韵律频谱之间的误差值,直至误差值小于预设误差值,该语音合成神经网络训练完成。
相应的,将待合成频谱和预置频谱输入训练完成的语音合成神经网络中,语音合成神经网络直接输出所述待合成频谱对应的情感韵律频谱。
在一个实施例中,步骤202获取待合成频谱,包括:
获取待合成文本;根据所述待合成文本得到所述待合成文本的待合成频谱。
其中,待合成文本是指待合成频谱对应的文本内容。
对待合成文本进行识别,得到多个文字内容,生成与所述多个文字内容对应的待合成语音,根据待合成语音可以确定所述待合成文本的待合成频谱(例如,使用傅里叶变换对待合成语音进行处理得到待合成频谱)。
如图5所示,在一个实施例中,提供了一种语音合成装置,该装置包括:
频谱获取模块502,用于获取待合成频谱和预置频谱;
叠加频谱模块504,用于根据所述待合成频谱和所述预置频谱得到叠加频谱;
情感语义模块506,用于对所述叠加频谱进行情感语义特征提取得到所述叠加频谱对应的情感语义特征;
基频提取模块508,用于对所述预置频谱进行基频提取,得到所述预置频谱对应的基频特征;
情感韵律模块510,用于根据所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征得到所述待合成频谱对应的情感韵律频谱,以根据所述情感韵律频谱生成语音。
上述语音合成装置,首先获取待合成频谱和预置频谱;然后根据所述待合成频谱和所述预置频谱得到叠加频谱;同时对所述叠加频谱进行情感语义特征提取得到所述叠加频谱对应的情感语义特征;并且对所述预置频谱进行基频提取,得到所述预置频谱对应的基频特征;最后根据所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征得到所述待合成频谱对应的情感韵律频谱,以根据所述情感韵律频谱生成语音。上述语音合成方法,首先提取到了情感语义特征,赋予了语音情感,然后提取到了预置频谱的基频,而基频能够体现韵律,由此实现了对语音的重音等韵律进行控制,最终使得合成的语音更加真实。
在一个实施例中,所述叠加频谱模块504,包括:提取情感特征模块,用于将所述预置频谱作为情感编码器的输入,得到所述预置频谱对应的情感特征;叠加模块,用于根据所述预置频谱对应的情感特征和所述待合成频谱得到所述叠加频谱。
在一个实施例中,所述叠加模块具体用于:获取所述待合成频谱对应的待合成维度;将所述预置频谱对应的情感特征转换成维度和所述待合成维度一致的情感转换特征;根据所述待合成频谱和所述情感转换特征得到所述叠加频谱。
在一个实施例中,所述情感语义模块506具体用于:将所述叠加频谱作为情感语义编码器的输入,得到所述情感语义编码器输出的所述叠加频谱对应的情感语义特征。
在一个实施例中,所述情感韵律模块510具体用于:将所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征进行组合,得到组合特征;将所述组合特征输入情感韵律解码器,得到所述情感韵律解码器输出的所述待合成频谱对应的情感韵律频谱。
在一个实施例中,所述频谱获取模块502用于:获取待合成文本;根据所述待合成文本得到所述待合成文本的待合成频谱。
图6示出了一个实施例中计算机设备的内部结构图。该计算机设备具体可以是终端,也可以是服务器,还可以是语音合成装置。如图6所示,该计算机设备包括通过系统总线连接的处理器、存储器和网络接口。其中,存储器包括非易失性存储介质和内存储器。该计算机设备的非易失性存储介质存储有操作系统,还可存储有计算机程序,该计算机程序被处理器执行时,可使得处理器实现语音合成方法。该内存储器中也可储存有计算机程序,该计算机程序被处理器执行时,可使得处理器执行语音合成方法。本领域技术人员可以理解,图6中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
在一个实施例中,提出了一种计算机设备,包括存储器和处理器,所述存储器存储有计算机程序,所述计算机程序被所述处理器执行时,使得所述处理器执行以下步骤:
获取待合成频谱和预置频谱;
根据所述待合成频谱和所述预置频谱得到叠加频谱;
对所述叠加频谱进行情感语义特征提取得到所述叠加频谱对应的情感语义特征;
对所述预置频谱进行基频提取,得到所述预置频谱对应的基频特征;
根据所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征得到所述待合成频谱对应的情感韵律频谱,以根据所述情感韵律频谱生成语音。
上述计算机设备,首先获取待合成频谱和预置频谱;然后根据所述待合成频谱和所述预置频谱得到叠加频谱;同时对所述叠加频谱进行情感语义特征提取得到所述叠加频谱对应的情感语义特征;并且对所述预置频谱进行基频提取,得到所述预置频谱对应的基频特征;最后根据所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征得到所述待合成频谱对应的情感韵律频谱,以根据所述情感韵律频谱生成语音。上述语音合成方法,首先提取到了情感语义特征,赋予了语音情感,然后提取到了预置频谱的基频,而基频能够体现韵律,由此实现了对语音的重音等韵律进行控制,最终使得合成的语音更加真实。
在一个实施例中,所述根据所述待合成频谱和所述预置频谱得到叠加频谱,包括:将所述预置频谱作为情感编码器的输入,得到所述预置频谱对应的情感特征;根据所述预置频谱对应的情感特征和所述待合成频谱得到所述叠加频谱。
在一个实施例中,所述根据所述预置频谱对应的情感特征和所述待合成频谱得到所述叠加频谱,包括:获取所述待合成频谱对应的待合成维度;将所述预置频谱对应的情感特征转换成维度和所述待合成维度一致的情感转换特征;根据所述待合成频谱和所述情感转换特征得到所述叠加频谱。
在一个实施例中,所述对所述叠加频谱进行情感语义特征提取得到所述叠加频谱对应的情感语义特征,包括:将所述叠加频谱作为情感语义编码器的输入,得到所述情感语义编码器输出的所述叠加频谱对应的情感语义特征。
在一个实施例中,所述根据所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征得到所述待合成频谱对应的情感韵律频谱,包括:将所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征进行组合,得到组合特征;将所述组合特征输入情感韵律解码器,得到所述情感韵律解码器输出的所述待合成频谱对应的情感韵律频谱。
在一个实施例中,所述获取待合成频谱,包括:获取待合成文本;根据所述待合成文本得到所述待合成文本的待合成频谱。
在一个实施例中,提出了一种计算机可读存储介质,存储有计算机程序,所述计算机程序被处理器执行时,使得所述处理器执行以下步骤:
获取待合成频谱和预置频谱;
根据所述待合成频谱和所述预置频谱得到叠加频谱;
对所述叠加频谱进行情感语义特征提取得到所述叠加频谱对应的情感语义特征;
对所述预置频谱进行基频提取,得到所述预置频谱对应的基频特征;
根据所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征得到所述待合成频谱对应的情感韵律频谱,以根据所述情感韵律频谱生成语音。
上述计算机可读存储介质,首先获取待合成频谱和预置频谱;然后根据所述待合成频谱和所述预置频谱得到叠加频谱;同时对所述叠加频谱进行情感语义特征提取得到所述叠加频谱对应的情感语义特征;并且对所述预置频谱进行基频提取,得到所述预置频谱对应的基频特征;最后根据所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征得到所述待合成频谱对应的情感韵律频谱,以根据所述情感韵律频谱生成语音。上述语音合成方法,首先提取到了情感语义特征,赋予了语音情感,然后提取到了预置频谱的基频,而基频能够体现韵律,由此实现了对语音的重音等韵律进行控制,最终使得合成的语音更加真实。
在一个实施例中,所述根据所述待合成频谱和所述预置频谱得到叠加频谱,包括:将所述预置频谱作为情感编码器的输入,得到所述预置频谱对应的情感特征;根据所述预置频谱对应的情感特征和所述待合成频谱得到所述叠加频谱。
在一个实施例中,所述根据所述预置频谱对应的情感特征和所述待合成频谱得到所述叠加频谱,包括:获取所述待合成频谱对应的待合成维度;将所述预置频谱对应的情感特征转换成维度和所述待合成维度一致的情感转换特征;根据所述待合成频谱和所述情感转换特征得到所述叠加频谱。
在一个实施例中,所述对所述叠加频谱进行情感语义特征提取得到所述叠加频谱对应的情感语义特征,包括:将所述叠加频谱作为情感语义编码器的输入,得到所述情感语义编码器输出的所述叠加频谱对应的情感语义特征。
在一个实施例中,所述根据所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征得到所述待合成频谱对应的情感韵律频谱,包括:将所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征进行组合,得到组合特征;将所述组合特征输入情感韵律解码器,得到所述情感韵律解码器输出的所述待合成频谱对应的情感韵律频谱。
在一个实施例中,所述获取待合成频谱,包括:获取待合成文本;根据所述待合成文本得到所述待合成文本的待合成频谱。
需要说明的是,上述语音合成方法、语音合成装置、计算机设备及计算机可读存储介质属于一个总的发明构思,语音合成方法、语音合成装置、计算机设备及计算机可读存储介质实施例中的内容可相互适用。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,所述的程序可存储于一非易失性计算机可读取存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink) DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。
以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
以上所述实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对本申请专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请专利的保护范围应以所附权利要求为准。

Claims (10)

  1. 一种语音合成方法,其特征在于,所述方法包括:
    获取待合成频谱和预置频谱;
    根据所述待合成频谱和所述预置频谱得到叠加频谱;
    对所述叠加频谱进行情感语义特征提取得到所述叠加频谱对应的情感语义特征;
    对所述预置频谱进行基频提取,得到所述预置频谱对应的基频特征;
    根据所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征得到所述待合成频谱对应的情感韵律频谱,以根据所述情感韵律频谱生成语音。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述待合成频谱和所述预置频谱得到叠加频谱,包括:
    将所述预置频谱作为情感编码器的输入,得到所述预置频谱对应的情感特征;
    根据所述预置频谱对应的情感特征和所述待合成频谱得到所述叠加频谱。
  3. 根据权利要求2所述的方法,其特征在于,所述根据所述预置频谱对应的情感特征和所述待合成频谱得到所述叠加频谱,包括:
    获取所述待合成频谱对应的待合成维度;
    将所述预置频谱对应的情感特征转换成维度和所述待合成维度一致的情感转换特征;
    根据所述待合成频谱和所述情感转换特征得到所述叠加频谱。
  4. 根据权利要求1所述的方法,其特征在于,所述对所述叠加频谱进行情感语义特征提取得到所述叠加频谱对应的情感语义特征,包括:
    将所述叠加频谱作为情感语义编码器的输入,得到所述情感语义编码器输出的所述叠加频谱对应的情感语义特征。
  5. 根据权利要求4所述的方法,其特征在于,所述根据所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征得到所述待合成频谱对应的情感韵律频谱,包括:
    将所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征进行组合,得到组合特征;
    将所述组合特征输入情感韵律解码器,得到所述情感韵律解码器输出的所述待合成频谱对应的情感韵律频谱。
  6. 根据权利要求1所述的方法,其特征在于,所述获取待合成频谱,包括:
    获取待合成文本;
    根据所述待合成文本得到所述待合成文本的待合成频谱。
  7. 一种语音合成装置,其特征在于,所述装置包括:
    频谱获取模块,用于获取待合成频谱和预置频谱;
    叠加频谱模块,用于根据所述待合成频谱和所述预置频谱得到叠加频谱;
    情感语义模块,用于对所述叠加频谱进行情感语义特征提取得到所述叠加频谱对应的情感语义特征;
    基频提取模块,用于对所述预置频谱进行基频提取,得到所述预置频谱对应的基频特征;
    情感韵律模块,用于根据所述叠加频谱对应的情感语义特征和所述预置频谱对应的基频特征得到所述待合成频谱对应的情感韵律频谱,以根据所述情感韵律频谱生成语音。
  8. 根据权利要求7所述的装置,其特征在于,所述叠加频谱模块,包括:
    提取情感特征模块,用于将所述预置频谱作为情感编码器的输入,得到所述预置频谱对应的情感特征;
    叠加模块,用于根据所述预置频谱对应的情感特征和所述待合成频谱得到所述叠加频谱。
  9. 一种计算机可读存储介质,存储有计算机程序,所述计算机程序被处理器执行时,使得所述处理器执行如权利要求1至6中任一项所述语音合成方法的步骤。
  10. 一种计算机设备,包括存储器和处理器,所述存储器存储有计算机程序,所述计算机程序被所述处理器执行时,使得所述处理器执行如权利要求1至6中任一项所述语音合成方法的步骤。
PCT/CN2019/127914 2019-12-24 2019-12-24 语音合成方法、装置、计算机设备及计算机可读存储介质 WO2021127979A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/127914 WO2021127979A1 (zh) 2019-12-24 2019-12-24 语音合成方法、装置、计算机设备及计算机可读存储介质
CN201980003185.2A CN111108549B (zh) 2019-12-24 2019-12-24 语音合成方法、装置、计算机设备及计算机可读存储介质

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/127914 WO2021127979A1 (zh) 2019-12-24 2019-12-24 语音合成方法、装置、计算机设备及计算机可读存储介质

Publications (1)

Publication Number Publication Date
WO2021127979A1 true WO2021127979A1 (zh) 2021-07-01

Family

ID=70427475

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/127914 WO2021127979A1 (zh) 2019-12-24 2019-12-24 语音合成方法、装置、计算机设备及计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN111108549B (zh)
WO (1) WO2021127979A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112885367B (zh) * 2021-01-19 2022-04-08 珠海市杰理科技股份有限公司 基频获取方法、装置、计算机设备和存储介质
CN117877460A (zh) * 2024-01-12 2024-04-12 汉王科技股份有限公司 语音合成方法、装置、语音合成模型训练方法、装置


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101064104B (zh) * 2006-04-24 2011-02-02 中国科学院自动化研究所 基于语音转换的情感语音生成方法
CN110223705B (zh) * 2019-06-12 2023-09-15 腾讯科技(深圳)有限公司 语音转换方法、装置、设备及可读存储介质

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184731A (zh) * 2011-05-12 2011-09-14 北京航空航天大学 一种韵律类和音质类参数相结合的情感语音转换方法
CN103065619A (zh) * 2012-12-26 2013-04-24 安徽科大讯飞信息科技股份有限公司 一种语音合成方法和语音合成系统
JP6433063B2 (ja) * 2014-11-27 2018-12-05 日本放送協会 音声加工装置、及びプログラム
CN105529023A (zh) * 2016-01-25 2016-04-27 百度在线网络技术(北京)有限公司 语音合成方法和装置
JP2017203963A (ja) * 2016-05-13 2017-11-16 日本放送協会 音声加工装置、及びプログラム
CN108615524A (zh) * 2018-05-14 2018-10-02 平安科技(深圳)有限公司 一种语音合成方法、系统及终端设备
CN110556092A (zh) * 2018-05-15 2019-12-10 中兴通讯股份有限公司 语音的合成方法及装置、存储介质、电子装置
CN109599128A (zh) * 2018-12-24 2019-04-09 北京达佳互联信息技术有限公司 语音情感识别方法、装置、电子设备和可读介质
CN110277086A (zh) * 2019-06-25 2019-09-24 中国科学院自动化研究所 基于电网调度知识图谱的语音合成方法、系统及电子设备
CN110299131A (zh) * 2019-08-01 2019-10-01 苏州奇梦者网络科技有限公司 一种可控制韵律情感的语音合成方法、装置、存储介质

Also Published As

Publication number Publication date
CN111108549A (zh) 2020-05-05
CN111108549B (zh) 2024-02-02

Similar Documents

Publication Publication Date Title
JP7106680B2 (ja) ニューラルネットワークを使用したターゲット話者の声でのテキストからの音声合成
US10186251B1 (en) Voice conversion using deep neural network with intermediate voice training
US11763796B2 (en) Computer-implemented method for speech synthesis, computer device, and non-transitory computer readable storage medium
JP3660937B2 (ja) 音声合成方法および音声合成装置
CN111402858B (zh) 一种歌声合成方法、装置、计算机设备及存储介质
CN111133507B (zh) 一种语音合成方法、装置、智能终端及可读介质
JP4391701B2 (ja) 音声信号の区分化及び認識のシステム及び方法
WO2021127979A1 (zh) 语音合成方法、装置、计算机设备及计算机可读存储介质
CN111261177A (zh) 语音转换方法、电子装置及计算机可读存储介质
CN112735454A (zh) 音频处理方法、装置、电子设备和可读存储介质
CN110264993A (zh) 语音合成方法、装置、设备及计算机可读存储介质
CN112712789A (zh) 跨语言音频转换方法、装置、计算机设备和存储介质
US20110046957A1 (en) System and method for speech synthesis using frequency splicing
WO2019218773A1 (zh) 语音的合成方法及装置、存储介质、电子装置
JP6681264B2 (ja) 音声加工装置、及びプログラム
CN113555003B (zh) 语音合成方法、装置、电子设备及存储介质
RU2754920C1 (ru) Способ синтеза речи с передачей достоверного интонирования клонируемого образца
CN113160849B (zh) 歌声合成方法、装置及电子设备和计算机可读存储介质
Oh et al. DurFlex-EVC: Duration-Flexible Emotional Voice Conversion with Parallel Generation
CN116825081B (zh) 基于小样本学习的语音合成方法、装置及存储介质
CN117636842B (zh) 基于韵律情感迁移的语音合成系统及方法
CN111108558B (zh) 语音转换方法、装置、计算机设备及计算机可读存储介质
CN104464717B (zh) 声音合成装置
KR102526338B1 (ko) 음성의 진폭스케일링을 이용하는 감정변환을 위한 음성 주파수 합성 장치 및 방법
Pandya ORIGAMI–Oration to Physiognomy

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19958001

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19958001

Country of ref document: EP

Kind code of ref document: A1