WO2020248393A1 - Speech synthesis method and system, terminal device, and readable storage medium - Google Patents

Speech synthesis method and system, terminal device, and readable storage medium

Info

Publication number
WO2020248393A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
prosody
text
spectrogram
mel
Application number
PCT/CN2019/103582
Other languages
French (fr)
Chinese (zh)
Inventor
彭话易 (PENG Huayi)
王健宗 (WANG Jianzong)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020248393A1 publication Critical patent/WO2020248393A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Definitions

  • This application relates to the field of artificial intelligence technology, in particular to the field of speech semantics, and specifically to a speech synthesis method, system, terminal device and readable storage medium.
  • Speech synthesis can play a great role in quality inspection, machine question answering, and disability assistance, making people's lives more convenient.
  • however, the speech that existing machines can synthesize often follows a fixed pattern: the generated speech is rigid in terms of prosody and differs noticeably from a real human voice. Therefore, in scenarios with high requirements on how human-like the synthesized speech sounds (such as intelligent outbound calling), end users often cannot accept such rigid prosody, so a speech synthesis method based on deep learning is urgently needed.
  • this application proposes a speech synthesis method, system, terminal device and readable storage medium that can transfer the prosody of a real-person recording to the synthesized speech, improving the fidelity of the synthesized speech.
  • the first aspect of the present application provides a speech synthesis method, including: acquiring text data, and generating a text vector according to the text data; acquiring a real-person recording, and modeling the prosody of the real-person recording to generate a prosody vector; combining the text vector and the prosody vector to generate a mel spectrogram; and generating the target speech according to the mel spectrogram.
  • the second aspect of the present application also provides a speech synthesis system, including:
  • a text embedding module, used to acquire text data and generate a text vector according to the text data;
  • a prosody extraction module, used to acquire a real-person recording and model the prosody of the real-person recording to generate a prosody vector;
  • a mel spectrogram generation module, configured to combine the text vector and the prosody vector to generate a mel spectrogram;
  • a speech generation module, used to generate the target speech according to the mel spectrogram.
  • the third aspect of the present application also provides a terminal device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, the steps of the above speech synthesis method are implemented.
  • the fourth aspect of the present application also provides a computer non-volatile readable storage medium that includes a computer program; when the computer program is executed by a processor, the steps of the above speech synthesis method are implemented.
  • This application acquires text data and a real-person recording, generates a text vector from the text data, and models the prosody of the real-person recording to generate a prosody vector; it then combines the text vector and the prosody vector to generate a mel spectrogram, and generates the target speech according to the mel spectrogram, thereby transferring the prosody of the real-person recording to the synthesized speech.
  • this application also models the prosody in the real-person recording and uses a generation method based on a global conditional probability, making the synthesized speech more similar in prosody to the input real-person recording and further giving the synthesized speech high fidelity and high naturalness.
  • Fig. 1 shows a flowchart of a speech synthesis method of the present application.
  • Fig. 2 shows a flowchart of a method for generating a text vector according to an embodiment of the present application.
  • Fig. 3 shows a flow chart of a method for generating a prosody vector according to an embodiment of the present application.
  • Fig. 4 shows a block diagram of a speech synthesis system of the present application.
  • Fig. 5 shows a block diagram of a text embedding module according to an embodiment of the present application.
  • Fig. 6 shows a block diagram of a prosody extraction module according to an embodiment of the present application.
  • Fig. 7 shows a schematic diagram of the operation of a speech synthesis system of the present application.
  • Fig. 8 shows a schematic diagram of the operation of a text embedding module of the present application.
  • Fig. 9 shows a schematic diagram of the operation of a prosody extraction module of the present application.
  • Fig. 10 shows a schematic diagram of a terminal device of the present application.
  • FIG. 1 is a flowchart of a speech synthesis method according to this application.
  • the first aspect of this application provides a speech synthesis method, including:
  • S102 Acquire text data, and generate a text vector according to the text data;
  • S104 Acquire a real-person recording, and model the prosody of the real-person recording to generate a prosody vector;
  • S106 Combine the text vector and the prosody vector to generate a mel spectrogram;
  • S108 Generate the target speech according to the mel spectrogram.
  • step S106, combining the text vector and the prosody vector to generate a mel spectrogram, specifically includes:
  • using the text vector as a local condition and the prosody vector as a global condition, and mapping them through a sequence-to-sequence model to generate the mel spectrogram (also called the mel-frequency spectrogram).
  • the text vector and the prosody vector are input into a sequence-to-sequence (seq2seq) model; the sequence-to-sequence model is a neural network model that generates output based on conditional probability, in which the input text vector serves as the local condition and the input prosody vector serves as the global condition. After mapping through this pre-trained model, the mel spectrogram is obtained.
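  • As a non-authoritative illustration of this conditioning scheme, the following minimal PyTorch sketch (not the patent's actual model; all layer names and sizes are assumptions) consumes the text-vector sequence as the local condition through an encoder and broadcasts the prosody vector as a global condition to every decoder step:

```python
import torch
import torch.nn as nn

class CondSeq2Seq(nn.Module):
    """Sketch of a seq2seq mel generator with local/global conditions."""
    def __init__(self, text_dim=64, prosody_dim=16, hidden=128, n_mels=80):
        super().__init__()
        self.encoder = nn.GRU(text_dim, hidden, batch_first=True)
        # The decoder sees the previous mel frame, the global prosody
        # condition, and a summary of the local (text) condition.
        self.decoder = nn.GRU(n_mels + prosody_dim + hidden, hidden,
                              batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, text_vec, prosody_vec, n_frames):
        # text_vec:    (batch, T_text, text_dim) -- local condition
        # prosody_vec: (batch, prosody_dim)      -- global condition
        _, ctx = self.encoder(text_vec)
        ctx = ctx.transpose(0, 1)                # (batch, 1, hidden)
        frame = torch.zeros(text_vec.size(0), 1, self.proj.out_features)
        state, outputs = None, []
        for _ in range(n_frames):                # autoregressive decoding
            glob = prosody_vec.unsqueeze(1)      # same vector at every step
            out, state = self.decoder(
                torch.cat([frame, glob, ctx], dim=-1), state)
            frame = self.proj(out)               # next mel frame
            outputs.append(frame)
        return torch.cat(outputs, dim=1)         # (batch, n_frames, n_mels)

mel = CondSeq2Seq()(torch.randn(2, 20, 64), torch.randn(2, 16), n_frames=50)
print(mel.shape)  # torch.Size([2, 50, 80])
```

  • Note how the global condition enters every generated frame identically, while the local condition is tied to the text sequence; this mirrors the local/global split described above.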
  • after the real-person recording is acquired, it is also pre-emphasized; pre-emphasis is performed frame by frame and aims to boost the high frequencies and increase the high-frequency resolution of the speech. Because the high-frequency end attenuates at roughly 6 dB/oct (octave) above 800 Hz, the higher the frequency, the smaller the corresponding component; for this reason, the high-frequency part of the real-person recording is boosted before the recording is analyzed, which also improves the high-frequency signal-to-noise ratio.
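  • As an illustration, pre-emphasis is commonly implemented as the first-order filter y[n] = x[n] - a*x[n-1]; the sketch below uses the conventional coefficient 0.97, which the application itself does not specify, and processes a whole one-dimensional signal rather than individual frames:

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, coeff: float = 0.97) -> np.ndarray:
    """Boost high frequencies: y[n] = x[n] - coeff * x[n-1]."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])
```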
  • Fig. 2 shows a flowchart of a method for generating a text vector according to an embodiment of the present application.
  • acquiring text data and generating a text vector from the text data specifically includes:
  • S202 Acquire Chinese character data, and perform word segmentation on the Chinese character data;
  • S204 Translate the word-segmented Chinese character data into Chinese pinyin with tones;
  • S206 Convert the translated tonal Chinese pinyin into one-dimensional vector data;
  • S208 Convert the one-dimensional vector data into two-dimensional vector data according to the time sequence.
  • the tones include the first, second, third and fourth tones and the neutral (soft) tone of Mandarin; the Arabic numerals 1, 2, 3, 4 and 5 are used as the tone codes for the first, second, third and fourth tones and the neutral tone respectively, but this is not limiting: in other embodiments, other numbers may also be used to represent the four tones and the neutral tone of Mandarin.
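  • The following Python sketch illustrates such a text-embedding pipeline; the open-source jieba segmenter and pypinyin converter stand in for the patent's trained word segmenter and language model, and the symbol inventory is an assumption (punctuation handling is omitted):

```python
import numpy as np
import jieba                             # stand-in word segmenter
from pypinyin import lazy_pinyin, Style  # stand-in grapheme-to-pinyin model

# Assumed symbol inventory: lowercase pinyin letters plus tone digits 1-5.
SYMBOLS = list("abcdefghijklmnopqrstuvwxyz12345")
INDEX = {s: i for i, s in enumerate(SYMBOLS)}

def text_to_vectors(text: str) -> np.ndarray:
    words = jieba.lcut(text)             # word segmentation
    # TONE3 puts the tone digit after each syllable, e.g. ['nin2', 'hao3'];
    # the neutral tone is coded as 5, matching the scheme above.
    syllables = lazy_pinyin("".join(words), style=Style.TONE3,
                            neutral_tone_with_five=True)
    chars = [c for syl in syllables for c in syl]
    one_hot = np.zeros((len(chars), len(SYMBOLS)), dtype=np.float32)
    for t, c in enumerate(chars):        # rows ordered by time sequence
        one_hot[t, INDEX[c]] = 1.0
    return one_hot                       # 2-D: (time steps, symbols)

print(text_to_vectors("您好").shape)     # (8, 31) for 'nin2' + 'hao3'
```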
  • Fig. 3 shows a flow chart of a method for generating a prosody vector according to an embodiment of the present application.
  • acquiring a real-person recording and modeling the prosody of the real-person recording to generate a prosody vector specifically includes:
  • S302 Perform a short-time Fourier transform on the acquired real-person recording to obtain the corresponding spectrogram;
  • S304 Perform mel filtering on the spectrogram to obtain a mel spectrogram;
  • S306 Compress the mel spectrogram in the time dimension and optimize its feature representation;
  • S308 Process the mel spectrogram with a recurrent neural network, and produce output according to the time sequence;
  • S310 Obtain the output at each moment, and convert all the outputs of the recurrent neural network into a two-dimensional prosody vector.
  • the short-time Fourier transform (STFT) is a mathematical transform related to the Fourier transform, used to determine the frequency and phase of the sine waves in a local region of a time-varying signal. Specifically, the short-time Fourier transform truncates the signal in the time domain into multiple segments and applies the Fourier transform to each segment separately; each segment is associated with a time t, and the Fourier transform of that segment gives its frequency-domain characteristics, so the frequency-domain characteristics at time t can be roughly estimated (thereby also giving the correspondence between the time domain and the frequency domain). The tool used for truncating the signal is called a window function (its width corresponds to a length of time); the smaller the window, the more pronounced the time-domain characteristics, but the FFT then loses accuracy because of the small number of points, making the frequency-domain characteristics less pronounced.
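  • For illustration, the spectrogram and mel spectrogram of steps S302-S304 could be computed with librosa as follows; the sample rate, window, FFT size and number of mel bands are assumptions, since the application does not specify them:

```python
import numpy as np
import librosa

y, sr = librosa.load("recording.wav", sr=22050)  # 1-D real-person recording
stft = librosa.stft(y, n_fft=1024, hop_length=256, window="hann")
spectrogram = np.abs(stft)                       # 2-D magnitude spectrogram
mel = librosa.feature.melspectrogram(S=spectrogram ** 2, sr=sr, n_mels=80)
log_mel = librosa.power_to_db(mel)               # (n_mels, frames)
```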
  • wavelet transform or Wigner distribution may also be used to obtain the spectrogram, but it is not limited thereto.
  • the real person recording is a one-dimensional signal; the spectrogram is a two-dimensional signal.
  • random noise needs to be added to the acquired real-person recording before the short-time Fourier transform is applied.
  • during data augmentation, some audio is typically synthesized manually, and such artificially synthesized (software-generated) audio may introduce numerical errors such as underflow and overflow; this application effectively avoids these numerical errors by adding random noise to the audio.
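  • A minimal sketch of this noise-addition (dithering) step for a one-dimensional signal, with an assumed noise amplitude:

```python
import numpy as np

def add_dither(audio: np.ndarray, scale: float = 1e-5) -> np.ndarray:
    """Add low-amplitude random noise so samples are never exactly zero,
    guarding against underflow/overflow-style numerical errors."""
    return audio + scale * np.random.randn(len(audio)).astype(audio.dtype)
```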
  • Fig. 4 shows a block diagram of a speech synthesis system of the present application.
  • the speech synthesis system 4 includes:
  • the text embedding module 41 is used to acquire text data and generate a text vector according to the text data;
  • the prosody extraction module 42 is used to acquire a real-person recording and model the prosody of the real-person recording to generate a prosody vector;
  • the mel spectrogram generation module 43 is configured to combine the text vector and the prosody vector to generate a mel spectrogram;
  • the speech generation module 44 is used to generate the target speech according to the mel spectrogram.
  • the mel spectrogram generation module 43 is a sequence-to-sequence (seq2seq) model, and the sequence-to-sequence model is a neural network model that generates output based on conditional probability.
  • the text vector and the prosody vector are input into the sequence-to-sequence model; the input text vector serves as the local condition, and the input prosody vector serves as the global condition.
  • after mapping through this pre-trained sequence-to-sequence model, the mel spectrogram is obtained.
  • the sequence-to-sequence model used in the mel spectrogram generation module 43 and the prosody extraction module are jointly trained on the same non-public speech database.
  • the speech database contains about 30 hours of speech files from one male/female speaker (i.e. the source speaker), recorded in a quiet environment with dedicated recording equipment, together with the text file corresponding to each utterance.
  • after the real-person recording is acquired, it is also pre-emphasized; pre-emphasis is performed frame by frame and aims to boost the high frequencies and increase the high-frequency resolution of the speech. Because the high-frequency end attenuates at roughly 6 dB/oct (octave) above 800 Hz, the higher the frequency, the smaller the corresponding component; for this reason, the high-frequency part of the real-person recording is boosted before the recording is analyzed, which also improves the high-frequency signal-to-noise ratio.
  • Fig. 5 shows a block diagram of a text embedding module according to an embodiment of the present application.
  • the text embedding module 41 includes:
  • the word segmentation unit 411 is used to obtain Chinese character data and perform word segmentation processing on the Chinese character data;
  • the language model unit 412 is used for translating the Chinese character data after word segmentation processing into tonal Chinese pinyin;
  • the one-hot encoding unit 413 is used to convert the tonal Chinese pinyin obtained by the translation into one-dimensional vector data;
  • the text vector generating unit 414 is configured to convert one-dimensional vector data into two-dimensional vector data according to a time sequence
  • the tones include the first, second, third and fourth tones and the neutral tone of Mandarin; the Arabic numerals 1, 2, 3, 4 and 5 are used as the tone codes for the first, second, third and fourth tones and the neutral tone respectively.
  • the one-hot encoding unit performs one-hot coding as follows: an N-bit status register is used to encode N states; each state has its own independent register bit, and at any time only one bit is active. For example, to encode six states:
  • the natural sequence code is 000, 001, 010, 011, 100, 101;
  • the one-hot code is 000001, 000010, 000100, 001000, 010000, 100000.
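  • The same scheme expressed in a few lines of NumPy (bit ordering aside):

```python
import numpy as np

codes = np.eye(6, dtype=int)  # one row per state, exactly one active bit
for state, code in enumerate(codes):
    print(state, "".join(map(str, code)))  # 0 100000, 1 010000, ... 5 000001
```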
  • Fig. 6 shows a block diagram of a prosody extraction module according to an embodiment of the present application.
  • the prosody extraction module 42 includes:
  • the short-time Fourier transform unit 421 is configured to perform a short-time Fourier transform on the acquired real-person recording to obtain the corresponding spectrogram;
  • the mel filtering unit 422 is configured to perform mel filtering on the spectrogram to obtain a mel spectrogram;
  • the convolutional neural network unit 423 is configured to compress the mel spectrogram in the time dimension and optimize its feature representation;
  • the GRU unit 424 is configured to process the mel spectrogram with a recurrent neural network and produce output according to the time sequence;
  • the prosody vector generation unit 425 is used to obtain the output at each moment and convert all the outputs of the recurrent neural network into a two-dimensional prosody vector.
  • the short-time Fourier transform (STFT) is a mathematical transform related to the Fourier transform, used to determine the frequency and phase of the sine waves in a local region of a time-varying signal. Specifically, the short-time Fourier transform truncates the signal in the time domain into multiple segments and applies the Fourier transform to each segment separately; each segment is associated with a time t, and the Fourier transform of that segment gives its frequency-domain characteristics, so the frequency-domain characteristics at time t can be roughly estimated (thereby also giving the correspondence between the time domain and the frequency domain). The tool used for truncating the signal is called a window function (its width corresponds to a length of time); the smaller the window, the more pronounced the time-domain characteristics, but the FFT then loses accuracy because of the small number of points, making the frequency-domain characteristics less pronounced.
  • the short-time Fourier transform unit 421 may also be replaced by a wavelet transform unit or a Wigner distribution unit, but it is not limited thereto.
  • the real person recording is a one-dimensional signal; the spectrogram is a two-dimensional signal.
  • random noise needs to be added to the acquired real-person recording before the short-time Fourier transform is applied.
  • during data augmentation, some audio is typically synthesized manually, and such artificially synthesized (software-generated) audio may introduce numerical errors such as underflow and overflow; this application effectively avoids these numerical errors by adding random noise to the audio.
  • convolutional neural networks (CNNs) are a class of feedforward neural networks that involve convolution computations and have a deep structure; a convolutional neural network includes an input layer, hidden layers, and an output layer.
  • the input layer of a convolutional neural network can process multi-dimensional data.
  • the input layer of a one-dimensional convolutional neural network receives a one-dimensional or two-dimensional array; the one-dimensional array is usually a time-domain or spectral sampling, and the two-dimensional array may contain multiple channels.
  • the input layer of a two-dimensional convolutional neural network receives a two-dimensional or three-dimensional array; the input layer of a three-dimensional convolutional neural network receives a four-dimensional array.
  • in an embodiment of the present application, the convolutional neural network unit adopts a two-dimensional convolutional neural network to perform a six-layer two-dimensional convolution operation on the mel spectrogram of the real-person recording.
  • the hidden layer of a convolutional neural network includes a convolutional layer, a pooling layer, and a fully connected layer.
  • the function of the convolutional layer is to extract features from the input data; it contains multiple convolution kernels, and each element making up a kernel corresponds to a weight coefficient and a bias, similar to a neuron of a feedforward neural network. After feature extraction in the convolutional layer, the output feature map is passed to the pooling layer for feature selection and information filtering.
  • the pooling layer contains a preset pooling function whose function is to replace the value at a single point in the feature map with a statistic of its neighboring region.
  • the fully connected layer is built in the last part of the hidden layer of the convolutional neural network and only transmits signals to other fully connected layers.
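  • As a hedged sketch of such a six-layer two-dimensional convolution stack (the channel counts and strides are illustrative assumptions, chosen so that the time axis is progressively compressed):

```python
import torch
import torch.nn as nn

layers, in_ch = [], 1
for out_ch in (32, 32, 64, 64, 128, 128):           # six convolutional layers
    layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3,
                         stride=(1, 2), padding=1),  # halve the time axis
               nn.ReLU()]
    in_ch = out_ch
conv_stack = nn.Sequential(*layers)

mel = torch.randn(1, 1, 80, 400)   # (batch, channel, mel bins, frames)
print(conv_stack(mel).shape)       # torch.Size([1, 128, 80, 7])
```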
  • a recurrent neural network (RNN) is a class of neural networks that takes sequence data as input, recurses along the direction of evolution of the sequence, and connects all nodes (recurrent units) in a chain, forming a closed loop.
  • a recurrent neural network has memory, shares parameters, and is Turing-complete, so it can learn the nonlinear characteristics of a sequence with high efficiency.
  • Fig. 7 shows a schematic diagram of the operation of a speech synthesis system of the present application.
  • the text content to be synthesized (for example: 您好, "hello") is input into the speech synthesis system, and the text embedding module embeds it into a text vector; a real-person recording is input, and the prosody extraction module models its prosody to form a prosody vector; the generated text vector and prosody vector are input into the trained mel spectrogram generation module to generate a mel spectrogram; finally, a trained speech generation module synthesizes the mel spectrogram into a high-fidelity speech file.
  • the speech generation module is a speech vocoder.
  • in the text embedding module, the input Chinese characters (您好, "hello") are first segmented into words, and a trained language model then translates the characters into Chinese pinyin with tones; through one-hot encoding, the translated pinyin letters and tone codes (digits 1 to 5) are converted into one-dimensional vector data, which is then converted into two-dimensional vector data according to the time sequence.
  • after the speech generation module obtains the mel spectrogram, it uses the mel spectrogram as a conditional input to generate the target speaker's speech.
  • the speech generation module is a WaveNet vocoder, trained on a non-public speech database; this database is the same speech database used to train the mel spectrogram generation module.
  • the prosody extraction module uses a recurrent neural network to realize the conversion from real-person recordings to prosody vectors.
  • the specific steps are as follows:
  • the mel spectrogram is input into a six-layer pre-trained convolutional neural network, whose purpose is to compress the time sequence and better represent the features in the mel spectrogram.
  • the processed mel spectrogram is then input, in time order, into a GRU-based recurrent neural network, which produces an output at each time step; after the output at each moment is obtained, a fully connected network converts the outputs of the recurrent neural network into a two-dimensional vector, which is the prosody vector.
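  • A compact PyTorch sketch of this prosody-extraction pipeline (a two-layer convolution stands in for the six-layer network above, and all dimensions are assumptions):

```python
import torch
import torch.nn as nn

class ProsodyExtractor(nn.Module):
    """Sketch: mel spectrogram -> conv stack -> GRU -> FC -> prosody vector."""
    def __init__(self, n_mels=80, conv_ch=32, hidden=128, prosody_dim=16):
        super().__init__()
        self.conv = nn.Sequential(  # stand-in for the six-layer stack
            nn.Conv2d(1, conv_ch, 3, stride=(1, 2), padding=1), nn.ReLU(),
            nn.Conv2d(conv_ch, conv_ch, 3, stride=(1, 2), padding=1), nn.ReLU())
        self.gru = nn.GRU(conv_ch * n_mels, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, prosody_dim)   # fully connected head

    def forward(self, mel):                        # mel: (B, 1, n_mels, T)
        h = self.conv(mel)                         # (B, C, n_mels, T')
        b, c, m, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * m)  # one row per frame
        out, _ = self.gru(h)                       # output at every moment
        return self.fc(out)                        # (B, T', dim): 2-D per clip

prosody = ProsodyExtractor()(torch.randn(1, 1, 80, 400))
print(prosody.shape)  # torch.Size([1, 100, 16])
```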
  • Fig. 10 shows a schematic diagram of a terminal device of the present application.
  • the third aspect of the present application further provides a terminal device 7.
  • the terminal device 7 includes: a processor 71, a memory 72, and a computer program 73 stored in the memory 72 and runnable on the processor 71.
  • the computer program 73 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 72 and executed by the processor 71 to complete this application.
  • the one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer program 73 in the terminal device 7.
  • the computer program 73 can be divided into a text embedding module, a prosody extraction module, a mel spectrogram generation module, and a speech generation module.
  • the specific functions of each module are as follows:
  • the text embedding module is used to obtain text data and generate a text vector from the text data
  • the prosody extraction module is used to obtain real-person recordings, and to model the prosody of the real-person recordings to generate a prosody vector;
  • a mel spectrogram generation module, configured to combine the text vector and the prosody vector to generate a mel spectrogram;
  • a speech generation module, used to generate the target speech according to the mel spectrogram.
  • the terminal device 7 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud management server.
  • the terminal device 7 may include, but is not limited to, a processor 71 and a memory 72.
  • FIG. 7 is only an example of the terminal device 7 and does not constitute a limitation on the terminal device 7; it may include more or fewer components than shown in the figure, a combination of certain components, or different components.
  • the terminal device may also include input and output devices, network access devices, buses, etc.
  • the so-called processor 71 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory 72 may be an internal storage unit of the terminal device 7, such as a hard disk or memory of the terminal device 7.
  • the memory 72 may also be an external storage device of the terminal device 7, such as a plug-in hard disk equipped on the terminal device 7, a smart media card (SMC), a secure digital (SD) card, a flash card, etc. Further, the memory 72 may also include both an internal storage unit of the terminal device 7 and an external storage device.
  • the memory 72 is used to store the computer program and other programs and data required by the terminal device.
  • the memory 72 can also be used to temporarily store data that has been output or will be output.
  • the fourth aspect of the present application also provides a computer non-volatile readable storage medium; the computer non-volatile readable storage medium includes a computer program, and when the computer program is executed by a processor, the steps of the above speech synthesis method are implemented.
  • This application acquires text data and a real-person recording, generates a text vector from the text data, and models the prosody of the real-person recording to generate a prosody vector; it then combines the text vector and the prosody vector to generate a mel spectrogram, and generates the target speech according to the mel spectrogram, thereby transferring the prosody of the real-person recording to the synthesized speech.
  • this application also models the prosody in the real-person recording and uses a generation method based on a global conditional probability, making the synthesized speech more similar in prosody to the input real-person recording and further giving the synthesized speech high fidelity and high naturalness.
  • the disclosed device and method may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • the functional units in the embodiments of the present application can all be integrated into one processing unit, or each unit can be individually used as a unit, or two or more units can be integrated into one unit;
  • the unit can be implemented in the form of hardware, or in the form of hardware plus software functional units.
  • the foregoing program can be stored in a computer-readable storage medium; when executed, the program performs the steps of the foregoing method embodiments; and the foregoing storage medium includes media that can store program code, such as removable storage devices, read-only memory (ROM), random access memory (RAM), magnetic disks or optical disks.
  • if the above-mentioned integrated unit of this application is implemented in the form of a software function module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: removable storage devices, ROM, RAM, magnetic disks or optical disks and other media that can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

A speech synthesis method, comprising: acquiring text data, and generating a text vector on the basis of the text data (S102); acquiring a speech recording of a real person, and modeling the prosody of the speech recording to generate a prosody vector (S104); combining the text vector with the prosody vector to generate a mel spectrogram (S106); and generating the target speech audio on the basis of the mel spectrogram (S108). In the method, the prosody of a real person's speech recording is modeled, and a generation technique based on global conditional probability is employed to make the prosody of the synthesized speech more similar to that of the input recording, such that the synthesized speech has high fidelity and high naturalness.

Description

Speech synthesis method, system, terminal device and readable storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on June 14, 2019 under application number 201910515578.3 and entitled "Speech synthesis method, system, terminal device and readable storage medium", the entire content of which is incorporated herein by reference.
Technical field
This application relates to the field of artificial intelligence technology, in particular to the field of speech semantics, and specifically to a speech synthesis method, system, terminal device and readable storage medium.
Background
With the development of technology, machines can already speak by means of speech synthesis. Speech synthesis, also known as text-to-speech (TTS), aims to let machines turn text information into artificial speech output through recognition and understanding, and is an important branch of modern artificial intelligence. Speech synthesis can play a great role in fields such as quality inspection, machine question answering, and disability assistance, making people's lives more convenient.
However, the speech that existing machines can synthesize often follows a fixed pattern: the generated speech is rigid in terms of prosody and differs noticeably from a real human voice. Therefore, in scenarios with high requirements on how human-like the synthesized speech sounds (such as intelligent outbound calling), end users often cannot accept such rigid prosody, so a speech synthesis method based on deep learning is urgently needed.
Summary of the invention
In order to solve at least one of the above technical problems, this application proposes a speech synthesis method, system, terminal device and readable storage medium that can transfer the prosody of a real-person recording to the synthesized speech, improving the fidelity of the synthesized speech.
To achieve the foregoing objective, the first aspect of the present application provides a speech synthesis method, including:
acquiring text data, and generating a text vector according to the text data;
acquiring a real-person recording, and modeling the prosody of the real-person recording to generate a prosody vector;
combining the text vector and the prosody vector to generate a mel spectrogram;
generating the target speech according to the mel spectrogram.
The second aspect of the present application also provides a speech synthesis system, including:
a text embedding module, used to acquire text data and generate a text vector according to the text data;
a prosody extraction module, used to acquire a real-person recording and model the prosody of the real-person recording to generate a prosody vector;
a mel spectrogram generation module, configured to combine the text vector and the prosody vector to generate a mel spectrogram;
a speech generation module, used to generate the target speech according to the mel spectrogram.
The third aspect of the present application also provides a terminal device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, the steps of the above speech synthesis method are implemented.
The fourth aspect of the present application also provides a computer non-volatile readable storage medium that includes a computer program; when the computer program is executed by a processor, the steps of the above speech synthesis method are implemented.
This application acquires text data and a real-person recording, generates a text vector from the text data, and models the prosody of the real-person recording to generate a prosody vector; it then combines the text vector and the prosody vector to generate a mel spectrogram, and generates the target speech according to the mel spectrogram, thereby transferring the prosody of the real-person recording to the synthesized speech. At the same time, this application models the prosody in the real-person recording and uses a generation method based on a global conditional probability, making the synthesized speech more similar in prosody to the input real-person recording and further giving the synthesized speech high fidelity and high naturalness.
Additional aspects and advantages of this application will be set forth in part in the following description, and in part will become apparent from the following description or be learned through practice of this application.
Description of the drawings
Fig. 1 shows a flowchart of a speech synthesis method of the present application.
Fig. 2 shows a flowchart of a method for generating a text vector according to an embodiment of the present application.
Fig. 3 shows a flowchart of a method for generating a prosody vector according to an embodiment of the present application.
Fig. 4 shows a block diagram of a speech synthesis system of the present application.
Fig. 5 shows a block diagram of a text embedding module according to an embodiment of the present application.
Fig. 6 shows a block diagram of a prosody extraction module according to an embodiment of the present application.
Fig. 7 shows a schematic diagram of the operation of a speech synthesis system of the present application.
Fig. 8 shows a schematic diagram of the operation of a text embedding module of the present application.
Fig. 9 shows a schematic diagram of the operation of a prosody extraction module of the present application.
Fig. 10 shows a schematic diagram of a terminal device of the present application.
Detailed description
In order that the above objectives, features and advantages of this application can be understood more clearly, the application is further described in detail below in conjunction with the accompanying drawings and specific implementations. It should be noted that, provided there is no conflict, the embodiments of this application and the features in the embodiments can be combined with each other.
Many specific details are set forth in the following description to facilitate a full understanding of this application; however, this application can also be implemented in ways other than those described here, so the scope of protection of this application is not limited by the specific embodiments disclosed below.
There are three mainstream technical approaches to speech synthesis: parametric synthesis, waveform concatenation, and end-to-end synthesis. In comparison, the end-to-end approach gives the generated speech markedly better quality. The speech synthesis method, system and terminal device proposed in this application are likewise based on an end-to-end approach.
Figure 1 is a flowchart of a speech synthesis method of this application.
As shown in Figure 1, the first aspect of this application provides a speech synthesis method, including:
S102: acquiring text data, and generating a text vector according to the text data;
S104: acquiring a real-person recording, and modeling the prosody of the real-person recording to generate a prosody vector;
S106: combining the text vector and the prosody vector to generate a mel spectrogram;
S108: generating the target speech according to the mel spectrogram.
It should be noted that the above step S106, combining the text vector and the prosody vector to generate a mel spectrogram, specifically includes:
using the text vector as a local condition and the prosody vector as a global condition, and mapping them through the sequence-to-sequence model to generate the mel spectrogram (also called the mel-frequency spectrogram).
Further, the text vector and the prosody vector are input into a sequence-to-sequence (seq2seq) model. The sequence-to-sequence model is a neural network model that generates output based on conditional probability; the input text vector serves as the local condition, and the input prosody vector serves as the global condition. After mapping through this pre-trained sequence-to-sequence model, the mel spectrogram is obtained.
It should be noted that after the real-person recording is acquired, it is also pre-emphasized. Pre-emphasis is performed frame by frame and aims to boost the high frequencies and increase the high-frequency resolution of the speech. Because the high-frequency end attenuates at roughly 6 dB/oct (octave) above 800 Hz, the higher the frequency, the smaller the corresponding component; for this reason, the high-frequency part of the real-person recording is boosted before the recording is analyzed, which also improves the high-frequency signal-to-noise ratio.
Fig. 2 shows a flowchart of a method for generating a text vector according to an embodiment of the present application.
As shown in Figure 2, acquiring text data and generating a text vector from the text data specifically includes:
S202: acquiring Chinese character data, and performing word segmentation on the Chinese character data;
S204: translating the word-segmented Chinese character data into Chinese pinyin with tones;
S206: converting the translated tonal Chinese pinyin into one-dimensional vector data;
S208: converting the one-dimensional vector data into two-dimensional vector data according to the time sequence.
It should be noted that the tones include the first, second, third and fourth tones and the neutral tone of Mandarin, and the Arabic numerals 1, 2, 3, 4 and 5 are used as the tone codes for the first, second, third and fourth tones and the neutral tone respectively; however, this is not limiting, and in other embodiments other numbers may also be used to represent the four tones and the neutral tone of Mandarin.
Fig. 3 shows a flowchart of a method for generating a prosody vector according to an embodiment of the present application.
As shown in Figure 3, acquiring a real-person recording and modeling the prosody of the real-person recording to generate a prosody vector specifically includes:
S302: performing a short-time Fourier transform on the acquired real-person recording to obtain the corresponding spectrogram;
S304: performing mel filtering on the spectrogram to obtain a mel spectrogram;
S306: compressing the mel spectrogram in the time dimension and optimizing its feature representation;
S308: processing the mel spectrogram with a recurrent neural network, and producing output according to the time sequence;
S310: obtaining the output at each moment, and converting all the outputs of the recurrent neural network into a two-dimensional prosody vector.
It should be noted that the short-time Fourier transform (STFT) is a mathematical transform related to the Fourier transform, used to determine the frequency and phase of the sine waves in a local region of a time-varying signal. Specifically, the short-time Fourier transform truncates the signal in the time domain into multiple segments and applies the Fourier transform to each segment separately; each segment is associated with a time t, and the Fourier transform of that segment gives its frequency-domain characteristics, so the frequency-domain characteristics at time t can be roughly estimated (thereby also giving the correspondence between the time domain and the frequency domain). The tool used for truncating the signal is called a window function (its width corresponds to a length of time); the smaller the window, the more pronounced the time-domain characteristics, but the FFT then loses accuracy because of the small number of points, making the frequency-domain characteristics less pronounced.
It can be understood that in other embodiments a wavelet transform or a Wigner distribution may also be used to obtain the spectrogram, but the choice is not limited thereto.
Specifically, the real-person recording is a one-dimensional signal, and the spectrogram is a two-dimensional signal.
According to an embodiment of the present application, before the short-time Fourier transform is applied to the acquired real-person recording to obtain the corresponding spectrogram, random noise also needs to be added to the acquired recording. During data augmentation, some audio is typically synthesized manually, and such artificially synthesized (software-generated) audio may introduce numerical errors such as underflow and overflow; by adding random noise to the audio, this application effectively avoids these numerical errors.
Fig. 4 shows a block diagram of a speech synthesis system of the present application.
As shown in Figure 4, the second aspect of the present application also provides a speech synthesis system 4, which includes:
a text embedding module 41, used to acquire text data and generate a text vector according to the text data;
a prosody extraction module 42, used to acquire a real-person recording and model the prosody of the real-person recording to generate a prosody vector;
a mel spectrogram generation module 43, configured to combine the text vector and the prosody vector to generate a mel spectrogram;
a speech generation module 44, used to generate the target speech according to the mel spectrogram.
In an embodiment of the present application, the mel spectrogram generation module 43 is a sequence-to-sequence (seq2seq) model, a neural network model that generates output based on conditional probability. Specifically, the text vector and the prosody vector are input into the sequence-to-sequence model; the input text vector serves as the local condition, and the input prosody vector serves as the global condition. After mapping through this pre-trained sequence-to-sequence model, the mel spectrogram is obtained.
It should be noted that the sequence-to-sequence model used in the mel spectrogram generation module 43 and the prosody extraction module are jointly trained on the same non-public speech database. This speech database contains about 30 hours of speech files from one male/female speaker (i.e. the source speaker), recorded in a quiet environment with dedicated recording equipment, together with the text file corresponding to each utterance.
It should be noted that after the real-person recording is acquired, it is also pre-emphasized. Pre-emphasis is performed frame by frame and aims to boost the high frequencies and increase the high-frequency resolution of the speech. Because the high-frequency end attenuates at roughly 6 dB/oct (octave) above 800 Hz, the higher the frequency, the smaller the corresponding component; for this reason, the high-frequency part of the real-person recording is boosted before the recording is analyzed, which also improves the high-frequency signal-to-noise ratio.
Fig. 5 shows a block diagram of a text embedding module according to an embodiment of the present application.
As shown in Figure 5, the text embedding module 41 includes:
a word segmentation unit 411, used to acquire Chinese character data and perform word segmentation on the Chinese character data;
a language model unit 412, used to translate the word-segmented Chinese character data into Chinese pinyin with tones;
a one-hot encoding unit 413, used to convert the translated tonal Chinese pinyin into one-dimensional vector data;
a text vector generation unit 414, configured to convert the one-dimensional vector data into two-dimensional vector data according to the time sequence;
wherein the tones include the first, second, third and fourth tones and the neutral tone of Mandarin, and the Arabic numerals 1, 2, 3, 4 and 5 are used as the tone codes for the first, second, third and fourth tones and the neutral tone respectively.
In an embodiment of the present application, the one-hot encoding unit performs one-hot coding as follows: an N-bit status register is used to encode N states; each state has its own independent register bit, and at any time only one bit is active. For example, to encode six states:
the natural sequence code is 000, 001, 010, 011, 100, 101;
the one-hot code is 000001, 000010, 000100, 001000, 010000, 100000.
Fig. 6 shows a block diagram of a prosody extraction module according to an embodiment of the present application.
As shown in Figure 6, the prosody extraction module 42 includes:
a short-time Fourier transform unit 421, configured to perform a short-time Fourier transform on the acquired real-person recording to obtain the corresponding spectrogram;
a mel filtering unit 422, configured to perform mel filtering on the spectrogram to obtain a mel spectrogram;
a convolutional neural network unit 423, configured to compress the mel spectrogram in the time dimension and optimize its feature representation;
a GRU unit 424, configured to process the mel spectrogram with a recurrent neural network and produce output according to the time sequence;
a prosody vector generation unit 425, used to obtain the output at each moment and convert all the outputs of the recurrent neural network into a two-dimensional prosody vector.
It should be noted that the short-time Fourier transform (STFT) is a mathematical transform related to the Fourier transform, used to determine the frequency and phase of the sine waves in a local region of a time-varying signal. Specifically, the short-time Fourier transform truncates the signal in the time domain into multiple segments and applies the Fourier transform to each segment separately; each segment is associated with a time t, and the Fourier transform of that segment gives its frequency-domain characteristics, so the frequency-domain characteristics at time t can be roughly estimated (thereby also giving the correspondence between the time domain and the frequency domain). The tool used for truncating the signal is called a window function (its width corresponds to a length of time); the smaller the window, the more pronounced the time-domain characteristics, but the FFT then loses accuracy because of the small number of points, making the frequency-domain characteristics less pronounced.
It can be understood that in other embodiments the short-time Fourier transform unit 421 may also be replaced by a wavelet transform unit or a Wigner distribution unit, but the choice is not limited thereto.
Specifically, the real-person recording is a one-dimensional signal, and the spectrogram is a two-dimensional signal.
According to an embodiment of the present application, before the short-time Fourier transform is applied to the acquired real-person recording to obtain the corresponding spectrogram, random noise also needs to be added to the acquired recording. During data augmentation, some audio is typically synthesized manually, and such artificially synthesized (software-generated) audio may introduce numerical errors such as underflow and overflow; by adding random noise to the audio, this application effectively avoids these numerical errors.
It should be noted that a convolutional neural network (CNN) is a class of feedforward neural networks that perform convolution computations and have a deep structure; a convolutional neural network comprises an input layer, hidden layers, and an output layer.
The input layer of a convolutional neural network can process multi-dimensional data. The input layer of a one-dimensional CNN receives one- or two-dimensional arrays, where a one-dimensional array is typically a sequence of time-domain or spectral samples and a two-dimensional array may contain multiple channels; the input layer of a two-dimensional CNN receives two- or three-dimensional arrays; and the input layer of a three-dimensional CNN receives four-dimensional arrays. In an embodiment of the present application, the convolutional neural network unit employs a two-dimensional CNN that applies six layers of two-dimensional convolution to the Mel spectrogram of the real-person recording.
The hidden layers of a convolutional neural network comprise convolutional layers, pooling layers, and fully connected layers. A convolutional layer extracts features from the input data; it contains multiple convolution kernels, and each element of a kernel corresponds to a weight coefficient and a bias, analogous to a neuron of a feedforward neural network. After feature extraction in a convolutional layer, the output feature map is passed to a pooling layer for feature selection and information filtering. A pooling layer applies a preset pooling function that replaces the value at a single point of a feature map with a statistic of the feature map over its neighboring region. Fully connected layers are placed at the end of the hidden layers and pass signals only to other fully connected layers.
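The sketch below illustrates this convolution/pooling/fully-connected composition in PyTorch; the channel counts, kernel sizes, and input shape are illustrative assumptions rather than values disclosed in the application:

    import torch
    import torch.nn as nn

    hidden = nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolution: feature extraction
        nn.ReLU(),
        nn.MaxPool2d(2),          # pooling: a neighborhood statistic replaces each point
        nn.Flatten(),
        nn.Linear(16 * 40 * 50, 64),                 # fully connected final stage
    )
    out = hidden(torch.randn(1, 1, 80, 100))         # e.g. 80 Mel bins x 100 frames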
A recurrent neural network (RNN) is a class of neural networks that takes sequence data as input and recurses along the direction in which the sequence evolves, with all nodes (recurrent units) connected in a chain to form a closed loop. Recurrent neural networks have memory, share parameters, and are Turing complete, so they can learn the nonlinear characteristics of a sequence with high efficiency.
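As a small runnable example of such a chained recurrent unit (here a GRU, the unit used by the prosody extraction module described below; all sizes are assumptions):

    import torch
    import torch.nn as nn

    gru = nn.GRU(input_size=32, hidden_size=128, batch_first=True)
    sequence = torch.randn(1, 100, 32)    # (batch, time steps, features)
    outputs, final_state = gru(sequence)  # one 128-dim output per time step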
FIG. 7 is a schematic diagram of the operation of a speech synthesis system according to the present application.
As shown in FIG. 7, the operating flow of the speech synthesis system is as follows:
The text content to be synthesized (for example, "您好", i.e. "hello") is input into the speech synthesis system, and the text embedding module embeds the text content into a text vector.
A real-person recording is input into the speech synthesis system, and the prosody extraction module models the prosody of the recording to form a prosody vector.
The generated text vector and prosody vector are input into the trained Mel spectrum generation module to generate a Mel spectrogram.
A trained speech generation module is used to synthesize the Mel spectrogram into a high-fidelity speech file; preferably, the speech generation module is a speech vocoder.
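A minimal end-to-end sketch of this four-module flow is given below. Every stage is a placeholder stub standing in for the trained module described in the text, and all names and shapes are assumptions made for illustration:

    import numpy as np

    def text_embedding(text: str) -> np.ndarray:                  # FIG. 8 module (stub)
        return np.zeros((len(text), 31), dtype=np.float32)

    def prosody_extraction(recording: np.ndarray) -> np.ndarray:  # FIG. 9 module (stub)
        return np.zeros((recording.shape[0] // 256, 128), dtype=np.float32)

    def mel_generator(text_vec, prosody_vec) -> np.ndarray:       # seq2seq stand-in
        return np.zeros((text_vec.shape[0] * 10, 80), dtype=np.float32)

    def vocoder(mel: np.ndarray) -> np.ndarray:                   # vocoder stand-in
        return np.zeros(mel.shape[0] * 256, dtype=np.float32)

    def synthesize(text: str, reference: np.ndarray) -> np.ndarray:
        text_vec = text_embedding(text)              # text -> text vector
        prosody_vec = prosody_extraction(reference)  # recording -> prosody vector
        mel = mel_generator(text_vec, prosody_vec)   # local + global conditions
        return vocoder(mel)                          # Mel spectrogram -> waveform

    wave = synthesize("nin2hao3", np.zeros(16000, dtype=np.float32))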
As shown in FIG. 8, in the text embedding module, the input Chinese characters ("您好") are first segmented into words, and a trained language model then transliterates the characters into tonal Hanyu Pinyin. Through one-hot encoding, the Pinyin letters and tone codes (digits 1 to 5) obtained from the transliteration are converted into one-dimensional vectors, which are then assembled in time order into a single two-dimensional vector.
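A minimal sketch of such a one-hot embedding follows; the symbol inventory and the example transliteration "nin2hao3" (for 您好) are assumptions made for illustration:

    import numpy as np

    SYMBOLS = list("abcdefghijklmnopqrstuvwxyz") + ["1", "2", "3", "4", "5"]
    SYM2IDX = {s: i for i, s in enumerate(SYMBOLS)}

    def embed_pinyin(pinyin: str) -> np.ndarray:
        """Encode a tonal Pinyin string as a (time, |SYMBOLS|) one-hot matrix."""
        seq = np.zeros((len(pinyin), len(SYMBOLS)), dtype=np.float32)
        for t, ch in enumerate(pinyin):
            seq[t, SYM2IDX[ch]] = 1.0   # one one-hot row per letter or tone digit
        return seq                      # rows stacked in time order: a 2-D vector

    text_vector = embed_pinyin("nin2hao3")   # shape (8, 31)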
After obtaining the Mel spectrogram, the speech generation module takes the Mel spectrogram as a conditional input and generates the speech of the target speaker. In this embodiment, the speech generation module is a WaveNet vocoder trained on a non-public speech database, which is the same database used to train the Mel spectrum generation module.
As shown in FIG. 9, the prosody extraction module converts a real-person recording into a prosody vector through a recurrent neural network. The specific steps are as follows:
First, a short-time Fourier transform is applied to the input real-person recording, and a Mel filter bank is then used to obtain its Mel spectrogram. The Mel spectrogram is fed into a pre-trained six-layer convolutional neural network, whose purpose is to compress the time sequence and better represent the features of the Mel spectrum. The processed representation is then fed in time order into a recurrent neural network based on GRU units, which produces an output at each time step. After the output at every time step has been obtained, a fully connected network converts the outputs of the recurrent neural network into a two-dimensional vector; this vector is the prosody vector.
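The sketch below mirrors these stages once the Mel spectrogram is available; the channel counts, strides, hidden sizes, and output width are assumptions, as the application does not disclose them:

    import torch
    import torch.nn as nn

    class ProsodyExtractor(nn.Module):
        def __init__(self, n_mels: int = 80, hidden: int = 128, prosody_dim: int = 128):
            super().__init__()
            layers, ch = [], 1
            for _ in range(6):   # six 2-D convolution layers; stride 2 compresses time
                layers += [nn.Conv2d(ch, 32, 3, stride=2, padding=1),
                           nn.BatchNorm2d(32), nn.ReLU()]
                ch = 32
            self.convs = nn.Sequential(*layers)
            f = n_mels
            for _ in range(6):
                f = (f + 1) // 2                    # frequency bins left after striding
            self.gru = nn.GRU(32 * f, hidden, batch_first=True)  # GRU-based RNN
            self.fc = nn.Linear(hidden, prosody_dim)             # fully connected net

        def forward(self, mel: torch.Tensor) -> torch.Tensor:
            x = self.convs(mel)                     # (batch, 32, f', t')
            x = x.permute(0, 3, 1, 2).flatten(2)    # time-major: (batch, t', 32 * f')
            out, _ = self.gru(x)                    # one output per time step
            return self.fc(out)                     # per-utterance 2-D prosody vector

    prosody = ProsodyExtractor()(torch.randn(1, 1, 80, 640))   # -> (1, 10, 128)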
FIG. 10 is a schematic diagram of a terminal device according to the present application.
As shown in FIG. 10, a third aspect of the present application further provides a terminal device 7. The terminal device 7 includes a processor 71, a memory 72, and a computer program 73 (for example, a program) stored in the memory 72 and executable on the processor 71. When the processor 71 executes the computer program 73, the steps in each of the foregoing speech synthesis method embodiments are implemented.
In an embodiment of the present application, the computer program 73 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 72 and executed by the processor 71 to complete the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer program 73 in the terminal device 7. For example, the computer program 73 may be divided into a text embedding module, a prosody extraction module, a Mel spectrum generation module, and a speech generation module, whose specific functions are as follows:
the text embedding module is configured to acquire text data and generate a text vector from the text data;
the prosody extraction module is configured to acquire a real-person recording and model the prosody of the real-person recording to generate a prosody vector;
the Mel spectrum generation module is configured to combine the text vector and the prosody vector to generate a Mel spectrogram; and
the speech generation module is configured to generate the target speech according to the Mel spectrogram.
The terminal device 7 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud management server. The terminal device 7 may include, but is not limited to, the processor 71 and the memory 72. Those skilled in the art will understand that FIG. 10 is merely an example of the terminal device 7 and does not constitute a limitation on the terminal device 7, which may include more or fewer components than shown, a combination of certain components, or different components; for example, the terminal device may further include input/output devices, network access devices, buses, and the like.
The so-called processor 71 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 72 may be an internal storage unit of the terminal device 7, such as a hard disk or internal memory of the terminal device 7. The memory 72 may also be an external storage device of the terminal device 7, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the terminal device 7. Further, the memory 72 may include both an internal storage unit and an external storage device of the terminal device 7. The memory 72 is used to store the computer program and other programs and data required by the terminal device. The memory 72 may also be used to temporarily store data that has been output or is about to be output.
A fourth aspect of the present application further provides a computer non-volatile readable storage medium. The computer non-volatile readable storage medium includes a computer program, and when the computer program is executed by a processor, the steps of the speech synthesis method described above are implemented.
The present application acquires text data and a real-person recording, generates a text vector from the text data, models the prosody of the real-person recording to generate a prosody vector, combines the text vector and the prosody vector to generate a Mel spectrogram, and then generates the target speech from the Mel spectrogram, thereby transferring the prosody of the real-person recording to the synthesized speech. Moreover, by modeling the prosody of the real-person recording and generating speech with that prosody as a global condition, the present application makes the synthesized speech closer in prosody to the input recording, further giving the synthesized speech high fidelity and high naturalness.
In the several embodiments provided in this application, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may all be integrated into one processing unit, or each unit may serve separately as one unit, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be implemented by instructing the relevant hardware through a program. The foregoing program may be stored in a computer-readable storage medium; when the program is executed, it performs the steps of the above method embodiments. The foregoing storage medium includes various media that can store program code, such as a removable storage device, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Alternatively, if the above integrated unit of the present application is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present application. The foregoing storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
The above are only specific implementations of the present application, but the scope of protection of the present application is not limited thereto. Any change or substitution readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall be covered by the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims (20)

1. A speech synthesis method, characterized in that it comprises:
    acquiring text data, and generating a text vector according to the text data;
    acquiring a real-person recording, and modeling the prosody of the real-person recording to generate a prosody vector;
    combining the text vector and the prosody vector to generate a Mel spectrogram; and
    generating a target speech according to the Mel spectrogram.
2. The speech synthesis method according to claim 1, characterized in that acquiring text data and generating a text vector from the text data comprises:
    acquiring Chinese character data, and performing word segmentation on the Chinese character data;
    transliterating the segmented Chinese character data into tonal Hanyu Pinyin;
    converting the tonal Hanyu Pinyin obtained by the transliteration into one-dimensional vector data; and
    converting the one-dimensional vector data into two-dimensional vector data according to the time series.
3. The speech synthesis method according to claim 1, characterized in that acquiring a real-person recording and modeling the prosody of the real-person recording to generate a prosody vector comprises:
    performing a short-time Fourier transform on the acquired real-person recording to obtain a corresponding spectrogram;
    performing Mel filtering on the spectrogram to obtain a Mel spectrogram;
    compressing the Mel spectrogram in the time sequence and optimizing its feature representation;
    processing the Mel spectrogram with a recurrent neural network, and outputting according to the time sequence; and
    obtaining the output at each time step, and converting all outputs of the recurrent neural network into a two-dimensional prosody vector.
4. The speech synthesis method according to claim 1, characterized in that combining the text vector and the prosody vector to generate a Mel spectrogram comprises:
    using the text vector as a local condition and the prosody vector as a global condition, and generating the Mel spectrogram after mapping through a sequence-to-sequence model.
5. The speech synthesis method according to claim 2, characterized in that the tones include the first tone, the second tone, the third tone, the fourth tone, and the neutral tone of Mandarin, and the Arabic numerals 1, 2, 3, 4, and 5 respectively represent the tone codes of the first tone, the second tone, the third tone, the fourth tone, and the neutral tone of Mandarin.
6. A speech synthesis system, characterized in that it comprises:
    a text embedding module, configured to acquire text data and generate a text vector according to the text data;
    a prosody extraction module, configured to acquire a real-person recording and model the prosody of the real-person recording to generate a prosody vector;
    a Mel spectrum generation module, configured to combine the text vector and the prosody vector to generate a Mel spectrogram; and
    a speech generation module, configured to generate a target speech according to the Mel spectrogram.
7. The speech synthesis system according to claim 6, characterized in that the text embedding module comprises:
    a word segmentation unit, configured to acquire Chinese character data and perform word segmentation on the Chinese character data;
    a language model unit, configured to transliterate the segmented Chinese character data into tonal Hanyu Pinyin;
    a one-hot encoding unit, configured to convert the tonal Hanyu Pinyin obtained by the transliteration into one-dimensional vector data; and
    a text vector generation unit, configured to convert the one-dimensional vector data into two-dimensional vector data according to the time series.
8. The speech synthesis system according to claim 7, characterized in that the tones include the first tone, the second tone, the third tone, the fourth tone, and the neutral tone of Mandarin, and the Arabic numerals 1, 2, 3, 4, and 5 respectively represent the tone codes of the first tone, the second tone, the third tone, the fourth tone, and the neutral tone of Mandarin.
9. The speech synthesis system according to claim 6, characterized in that the prosody extraction module comprises:
    a short-time Fourier transform unit, configured to perform a short-time Fourier transform on the acquired real-person recording to obtain a corresponding spectrogram;
    a Mel filtering unit, configured to perform Mel filtering on the spectrogram to obtain a Mel spectrogram;
    a convolutional neural network unit, configured to compress the Mel spectrogram in the time sequence and optimize its feature representation;
    a GRU unit, configured to process the Mel spectrogram with a recurrent neural network and output according to the time sequence; and
    a prosody vector generation unit, configured to obtain the output at each time step and convert all outputs of the recurrent neural network into a two-dimensional prosody vector.
10. The speech synthesis system according to claim 6, characterized in that the Mel spectrum generation module uses the text vector as a local condition and the prosody vector as a global condition, and generates the Mel spectrogram after mapping through a sequence-to-sequence model.
11. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the following steps when executing the computer program:
    acquiring text data, and generating a text vector according to the text data;
    acquiring a real-person recording, and modeling the prosody of the real-person recording to generate a prosody vector;
    combining the text vector and the prosody vector to generate a Mel spectrogram; and
    generating a target speech according to the Mel spectrogram.
12. The terminal device according to claim 11, characterized in that acquiring text data and generating a text vector from the text data comprises:
    acquiring Chinese character data, and performing word segmentation on the Chinese character data;
    transliterating the segmented Chinese character data into tonal Hanyu Pinyin;
    converting the tonal Hanyu Pinyin obtained by the transliteration into one-dimensional vector data; and
    converting the one-dimensional vector data into two-dimensional vector data according to the time series.
13. The terminal device according to claim 11, characterized in that acquiring a real-person recording and modeling the prosody of the real-person recording to generate a prosody vector comprises:
    performing a short-time Fourier transform on the acquired real-person recording to obtain a corresponding spectrogram;
    performing Mel filtering on the spectrogram to obtain a Mel spectrogram;
    compressing the Mel spectrogram in the time sequence and optimizing its feature representation;
    processing the Mel spectrogram with a recurrent neural network, and outputting according to the time sequence; and
    obtaining the output at each time step, and converting all outputs of the recurrent neural network into a two-dimensional prosody vector.
14. The terminal device according to claim 11, characterized in that combining the text vector and the prosody vector to generate a Mel spectrogram comprises:
    using the text vector as a local condition and the prosody vector as a global condition, and generating the Mel spectrogram after mapping through a sequence-to-sequence model.
15. The terminal device according to claim 12, characterized in that the tones include the first tone, the second tone, the third tone, the fourth tone, and the neutral tone of Mandarin, and the Arabic numerals 1, 2, 3, 4, and 5 respectively represent the tone codes of the first tone, the second tone, the third tone, the fourth tone, and the neutral tone of Mandarin.
16. A computer non-volatile readable storage medium, characterized in that the computer non-volatile readable storage medium includes a computer program, and when the computer program is executed by a processor, the following steps are implemented:
    acquiring text data, and generating a text vector according to the text data;
    acquiring a real-person recording, and modeling the prosody of the real-person recording to generate a prosody vector;
    combining the text vector and the prosody vector to generate a Mel spectrogram; and
    generating a target speech according to the Mel spectrogram.
17. The computer non-volatile readable storage medium according to claim 16, characterized in that acquiring text data and generating a text vector from the text data comprises:
    acquiring Chinese character data, and performing word segmentation on the Chinese character data;
    transliterating the segmented Chinese character data into tonal Hanyu Pinyin;
    converting the tonal Hanyu Pinyin obtained by the transliteration into one-dimensional vector data; and
    converting the one-dimensional vector data into two-dimensional vector data according to the time series.
18. The computer non-volatile readable storage medium according to claim 16, characterized in that acquiring a real-person recording and modeling the prosody of the real-person recording to generate a prosody vector comprises:
    performing a short-time Fourier transform on the acquired real-person recording to obtain a corresponding spectrogram;
    performing Mel filtering on the spectrogram to obtain a Mel spectrogram;
    compressing the Mel spectrogram in the time sequence and optimizing its feature representation;
    processing the Mel spectrogram with a recurrent neural network, and outputting according to the time sequence; and
    obtaining the output at each time step, and converting all outputs of the recurrent neural network into a two-dimensional prosody vector.
19. The computer non-volatile readable storage medium according to claim 16, characterized in that combining the text vector and the prosody vector to generate a Mel spectrogram comprises:
    using the text vector as a local condition and the prosody vector as a global condition, and generating the Mel spectrogram after mapping through a sequence-to-sequence model.
20. The computer non-volatile readable storage medium according to claim 17, characterized in that the tones include the first tone, the second tone, the third tone, the fourth tone, and the neutral tone of Mandarin, and the Arabic numerals 1, 2, 3, 4, and 5 respectively represent the tone codes of the first tone, the second tone, the third tone, the fourth tone, and the neutral tone of Mandarin.
PCT/CN2019/103582 2019-06-14 2019-08-30 Speech synthesis method and system, terminal device, and readable storage medium WO2020248393A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910515578.3A CN110335587B (en) 2019-06-14 2019-06-14 Speech synthesis method, system, terminal device and readable storage medium
CN201910515578.3 2019-06-14

Publications (1)

Publication Number Publication Date
WO2020248393A1

Family

ID=68142115

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/103582 WO2020248393A1 (en) 2019-06-14 2019-08-30 Speech synthesis method and system, terminal device, and readable storage medium

Country Status (2)

Country Link
CN (1) CN110335587B (en)
WO (1) WO2020248393A1 (en)


Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111048065B (en) * 2019-12-18 2024-05-28 腾讯科技(深圳)有限公司 Text error correction data generation method and related device
CN111627420B (en) * 2020-04-21 2023-12-08 升智信息科技(南京)有限公司 Method and device for synthesizing emotion voice of specific speaker under extremely low resource
CN111710326B (en) * 2020-06-12 2024-01-23 携程计算机技术(上海)有限公司 English voice synthesis method and system, electronic equipment and storage medium
CN111916093B (en) * 2020-07-31 2024-09-06 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method and device
RU2754920C1 (en) * 2020-08-17 2021-09-08 Автономная некоммерческая организация поддержки и развития науки, управления и социального развития людей в области разработки и внедрения искусственного интеллекта "ЦифровойТы" Method for speech synthesis with transmission of accurate intonation of the cloned sample
CN112086086B (en) * 2020-10-22 2024-06-25 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and computer readable storage medium
CN112349268A (en) * 2020-11-09 2021-02-09 湖南芒果听见科技有限公司 Emergency broadcast audio processing system and operation method thereof
CN112687257B (en) * 2021-03-11 2021-06-01 北京新唐思创教育科技有限公司 Sentence similarity judging method and device, electronic equipment and readable storage medium
CN113555003B (en) * 2021-07-23 2024-05-28 平安科技(深圳)有限公司 Speech synthesis method, device, electronic equipment and storage medium
CN114913877B (en) * 2022-05-12 2024-07-19 平安科技(深圳)有限公司 Initial consonant and vowel pronunciation duration prediction method, structure, terminal and storage medium
CN116705058B (en) * 2023-08-04 2023-10-27 贝壳找房(北京)科技有限公司 Processing method of multimode voice task, electronic equipment and readable storage medium


Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105355193B (en) * 2015-10-30 2020-09-25 百度在线网络技术(北京)有限公司 Speech synthesis method and device
CN108305612B (en) * 2017-11-21 2020-07-31 腾讯科技(深圳)有限公司 Text processing method, text processing device, model training method, model training device, storage medium and computer equipment
CN108492818B (en) * 2018-03-22 2020-10-30 百度在线网络技术(北京)有限公司 Text-to-speech conversion method and device and computer equipment
CN109308892B (en) * 2018-10-25 2020-09-01 百度在线网络技术(北京)有限公司 Voice synthesis broadcasting method, device, equipment and computer readable medium
CN109754778B (en) * 2019-01-17 2023-05-30 平安科技(深圳)有限公司 Text speech synthesis method and device and computer equipment
CN109785823B (en) * 2019-01-22 2021-04-02 中财颐和科技发展(北京)有限公司 Speech synthesis method and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018183650A2 (en) * 2017-03-29 2018-10-04 Google Llc End-to-end text-to-speech conversion
CN107452369A (en) * 2017-09-28 2017-12-08 百度在线网络技术(北京)有限公司 Phonetic synthesis model generating method and device
CN109697974A (en) * 2017-10-19 2019-04-30 百度(美国)有限责任公司 Use the system and method for the neural text-to-speech that convolution sequence learns
CN109036375A (en) * 2018-07-25 2018-12-18 腾讯科技(深圳)有限公司 Phoneme synthesizing method, model training method, device and computer equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
RJ Skerry-Ryan et al.: "Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron", arXiv preprint arXiv:1803.09047v1, 24 March 2018 (2018-03-24), pages 1-11, XP080862501 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114519997A (en) * 2022-02-17 2022-05-20 湖南快乐阳光互动娱乐传媒有限公司 Processing method and device for video synthesis based on personalized voice
CN115101046A (en) * 2022-06-21 2022-09-23 鼎富智能科技有限公司 Method and device for synthesizing voice of specific speaker

Also Published As

Publication number Publication date
CN110335587B (en) 2023-11-10
CN110335587A (en) 2019-10-15

Similar Documents

Publication Publication Date Title
WO2020248393A1 (en) Speech synthesis method and system, terminal device, and readable storage medium
WO2020215551A1 (en) Chinese speech synthesizing method, apparatus and device, storage medium
CN113470684B (en) Audio noise reduction method, device, equipment and storage medium
WO2022142850A1 (en) Audio processing method and apparatus, vocoder, electronic device, computer readable storage medium, and computer program product
KR102137523B1 (en) Method of text to speech and system of the same
CN109147831A (en) A kind of voice connection playback method, terminal device and computer readable storage medium
WO2022121179A1 (en) Speech synthesis method and apparatus, device, and storage medium
CN116741144B (en) Voice tone conversion method and system
JP7124373B2 (en) LEARNING DEVICE, SOUND GENERATOR, METHOD AND PROGRAM
CN115206284B (en) Model training method, device, server and medium
CN116994553A (en) Training method of speech synthesis model, speech synthesis method, device and equipment
US20230013370A1 (en) Generating audio waveforms using encoder and decoder neural networks
CN113870827A (en) Training method, device, equipment and medium of speech synthesis model
CN113555003B (en) Speech synthesis method, device, electronic equipment and storage medium
CN113327576B (en) Speech synthesis method, device, equipment and storage medium
CN112687262A (en) Voice conversion method and device, electronic equipment and computer readable storage medium
CN111048065B (en) Text error correction data generation method and related device
CN116913244A (en) Speech synthesis method, equipment and medium
CN113380231B (en) Voice conversion method and device and electronic equipment
WO2023102932A1 (en) Audio conversion method, electronic device, program product, and storage medium
CN113066472B (en) Synthetic voice processing method and related device
CN114464163A (en) Method, device, equipment, storage medium and product for training speech synthesis model
Zheng et al. Bandwidth extension WaveNet for bone-conducted speech enhancement
CN111696517A (en) Speech synthesis method, speech synthesis device, computer equipment and computer readable storage medium
CN113160849B (en) Singing voice synthesizing method, singing voice synthesizing device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19932297; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19932297; Country of ref document: EP; Kind code of ref document: A1)