CN111312208A - Speaker-independent neural network vocoder system - Google Patents
- Publication number
- CN111312208A CN111312208A CN202010158293.1A CN202010158293A CN111312208A CN 111312208 A CN111312208 A CN 111312208A CN 202010158293 A CN202010158293 A CN 202010158293A CN 111312208 A CN111312208 A CN 111312208A
- Authority
- CN
- China
- Prior art keywords
- timbre
- feature
- neural network
- acoustic
- waveform
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Electrophonic Musical Instruments (AREA)
Abstract
The invention discloses a speaker-independent neural network vocoder system, which comprises the following steps: S1, a timbre feature extraction module receives an acoustic feature M and performs timbre feature extraction on it to obtain timbre feature information S, where the acoustic feature may be a mel spectrogram, a mel cepstrum, or a linear magnitude spectrum; and S2, a waveform generation module receives the acoustic feature M and the timbre feature S output by the timbre extraction module, and performs waveform generation processing to obtain a speech waveform W. The invention solves the problems that each single-timbre vocoder system can serve only one specific timbre, that service deployment and operation costs are high, that a new vocoder system must be trained from scratch whenever a new timbre is encountered, that training takes a long time, and that training requires a large amount of recorded data of a given timbre.
Description
Technical Field
The invention relates to the technical field of neural networks, and in particular to a speaker-independent neural network vocoder system.
Background
With the rapid development of neural network technology, the quality of speech synthesis has also improved rapidly. Realistic speech synthesis has been applied to news broadcasting, audiobooks, voice assistants, intelligent customer service, virtual characters, voice cloning, and more. As artificial intelligence technology advances and application scenarios multiply, expectations for speech synthesis keep rising: not only must the synthesized speech sound realistic, it should also cover a wide variety of timbres. This poses many challenges for the development and deployment of speech synthesis technology.
The current mainstream speech synthesis stack comprises three subsystems: a front-end system (converting text into phonemes), a back-end system (converting phonemes into acoustic features), and a vocoder system (converting acoustic features into audio). Among these, the vocoder plays a decisive role in the sound quality of the synthesized speech. In recent years, with the success of neural-network vocoders such as WaveNet, SampleRNN, and WaveRNN, existing single-timbre vocoder systems have become able to synthesize speech comparable to real recordings. However, such single-timbre vocoders can each synthesize only one timbre and cannot support high-quality synthesis of multiple timbres with a single system. In application scenarios with high timbre-diversity requirements (e.g., audiobooks, voice cloning), a very large number of vocoder systems is therefore needed. As the number of systems grows, so does the hardware needed for service deployment, greatly increasing operating costs. Moreover, the vocoder for each timbre must be trained to convergence on several hours of recordings of that timbre before it can synthesize speech. The time required varies with the training hardware but is typically 2-7 days. This is a major obstacle to multi-timbre, high-quality speech synthesis, especially in scenarios where sufficient training recordings cannot be obtained.
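The three-subsystem split described above (front end: text to phonemes; back end: phonemes to acoustic features; vocoder: acoustic features to audio) can be sketched as a toy pipeline. Every function body below is an illustrative placeholder standing in for a real subsystem, not an actual synthesis component:

```python
def frontend(text: str) -> list[str]:
    # Text -> phonemes (toy: one "phoneme" per alphabetic character).
    return [c for c in text.lower() if c.isalpha()]

def backend(phonemes: list[str]) -> list[list[float]]:
    # Phonemes -> acoustic feature frames (toy: one 80-dim frame per phoneme).
    return [[float(ord(p))] * 80 for p in phonemes]

def vocoder(features: list[list[float]]) -> list[float]:
    # Acoustic features -> audio samples (toy: 200 samples per 12.5 ms frame,
    # matching an 80 Hz frame rate upsampled to 16000 Hz).
    return [sum(f) / len(f) for f in features for _ in range(200)]

audio = vocoder(backend(frontend("Hi")))  # 2 phonemes -> 400 samples
```

The vocoder stage, the focus of this patent, is where the acoustic features finally become waveform samples.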
In summary, current vocoder systems have the following disadvantages in multi-timbre application scenarios:
1. Each single-timbre vocoder system can serve only one specific timbre, so service deployment and operation costs are high.
2. Each new timbre requires training a new vocoder system from scratch, which takes a long time (typically 2-7 days).
3. Training requires a large amount of recorded data of the given timbre (generally 3 hours or more).
Disclosure of Invention
The invention aims to solve the problems that each single-timbre vocoder system can serve only one specific timbre, that service deployment and operation costs are high, that a new vocoder system must be trained from scratch for each new timbre, that training takes a long time, and that training requires a large amount of recorded data of a given timbre.
To achieve this purpose, the invention adopts the following technical scheme: a speaker-independent neural network vocoder system, comprising the following steps:
S1, a timbre feature extraction module receives an acoustic feature M and performs timbre feature extraction on it to obtain timbre feature information S, where the acoustic feature may be a mel spectrogram, a mel cepstrum, or a linear magnitude spectrum;
S2, a waveform generation module receives the acoustic feature M and the timbre feature S output by the timbre extraction module, and performs waveform generation processing to obtain a speech waveform W.
2. The speaker-independent neural network vocoder system of claim 1, wherein in S1 the acoustic feature may be selected from a mel spectrogram, a mel cepstrum, and a linear magnitude spectrum.
3. The speaker-independent neural network vocoder system of claim 1, wherein in S1 a traditional timbre feature extraction module extracts a traditional timbre feature sp from the input acoustic feature M, the traditional timbre feature being selected from the fundamental frequency (F0), a voiced/unvoiced flag, a magnitude spectrum envelope, linear prediction coefficients, or line spectrum pairs;
a feature mapping network module maps the traditional timbre feature sp output by the traditional timbre feature extraction module into an abstract timbre feature S, and the timbre feature mapping network may be formed by a residual network or a bidirectional recurrent neural network.
4. The speaker-independent neural network vocoder system of claim 1, wherein in S2 the acoustic feature M and the timbre feature S are upsampled to the sampling rate of the audio waveform; for example, if the audio waveform's sampling rate is 16000 Hz and the acoustic feature's frame rate is 80 Hz with a frame duration of 12.5 ms, the acoustic feature and the timbre feature are upsampled from 80 Hz by a factor of 200 to obtain the 16000 Hz-sampled acoustic feature M1 and timbre feature S1;
the upsampled acoustic feature M1 and timbre feature S1 are input to neural network layer 1, which outputs feature M2; the operation is then repeated N times, with the output feature Mi of the previous neural network layer and the timbre feature S1 input to the next neural network layer i, which outputs feature Mi+1. Each neural network layer may be implemented as a CNN or a unidirectional RNN;
a DNN layer converts the output feature MN+1 of neural network layer N into the speech waveform W.
Compared with the prior art, the invention has the following beneficial effects: an independent timbre feature extraction module extracts the timbre features of the target speaker, and these timbre features are continuously injected into each processing network of the waveform generation module. This enhances the robustness of the waveform generation module and enables it to synthesize, at high quality, sounds of different timbres from acoustic features.
The speaker-independent neural network vocoder system can synthesize voices both inside and outside the training data set, with a synthesis quality close to real recordings. Because the system is independent of the target speaker, there is no need to collect large amounts of speech data for a new timbre or to train a new model: a single pre-trained speaker-independent neural network vocoder can be applied to the new timbre. This greatly reduces the time and hardware costs of multi-timbre speech synthesis scenarios.
Drawings
The invention is described in further detail below with reference to the following figures and detailed description:
FIG. 1 is an overall schematic view of the invention;
FIG. 2 is a schematic diagram of the processing details of the timbre feature extraction module;
FIG. 3 is a schematic diagram of the processing details of the waveform generation module.
In the figures: timbre feature extraction module 101, waveform generation module 102, traditional timbre feature extraction module 01, feature mapping network module 02, and upsampling processing module 03.
Detailed Description
The following description illustrates the embodiments of the present invention by way of specific examples; other advantages and effects of the invention will be readily apparent to those skilled in the art from the disclosure herein.
Please refer to FIGS. 1 to 3. It should be understood that the structures, ratios, and sizes shown in the drawings are provided only to accompany the disclosure of the specification so that it can be understood and read by those skilled in the art; they are not intended to limit the conditions under which the invention can be implemented and thus carry no limiting technical significance. Any structural modification, change of proportion, or adjustment of size that does not affect the efficacy or purpose of the invention shall still fall within the scope of the invention. In addition, terms such as "upper", "lower", "left", "right", "middle", and "one" are used in this specification for clarity of description only and are not intended to limit the implementable scope of the invention; changes or adjustments of their relative relationships, without substantive changes to the technical content, shall also be regarded as within the implementable scope of the invention.
The invention provides the following technical scheme: a speaker-independent neural network vocoder system comprising a timbre feature extraction module 101 and a waveform generation module 102, with a processing procedure of the following two steps:
the timbre feature extraction module 101 receives the acoustic feature M, performs timbre feature extraction on it, and outputs the timbre feature information S; in this example the acoustic feature is a mel spectrogram, but it is not limited to the mel spectrogram;
the waveform generation module 102 receives the acoustic feature M and the timbre feature S output by the timbre feature extraction module 101, performs waveform generation processing, and outputs the speech waveform W.
In step 1, the processing details of the timbre feature extraction module 101 are shown in FIG. 2:
the traditional timbre feature extraction module 01 receives the acoustic feature M and outputs the traditional timbre feature sp; in this example the fundamental frequency F0 and the magnitude spectrum envelope are used as the traditional timbre features, but the choice is not limited to these two features;
the feature mapping network module 02 receives the traditional timbre feature sp output by the traditional timbre feature extraction module 01 and maps it into the abstract timbre feature S; the feature mapping network in this example is implemented as a 5-layer residual network, but is not limited to this implementation.
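A minimal sketch of the kind of residual mapping network described in this embodiment (a stack of residual blocks mapping traditional timbre features sp to abstract timbre features S) could look like the following; the dimensions, random weights, and ReLU choice are assumptions, not the patent's trained network:

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)

class ResidualMappingNetwork:
    """Map traditional timbre features sp (frames x dim) to abstract
    timbre features S through a stack of residual blocks (5 here)."""
    def __init__(self, dim: int, num_blocks: int = 5, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.weights = [rng.standard_normal((dim, dim)) * 0.01
                        for _ in range(num_blocks)]
    def __call__(self, sp: np.ndarray) -> np.ndarray:
        h = sp
        for w in self.weights:
            h = h + relu(h @ w)  # residual connection: x + F(x)
        return h

# sp: 10 frames of 2 traditional features (e.g. F0 plus one envelope value)
s = ResidualMappingNetwork(dim=2)(np.ones((10, 2)))
```

The residual connections let the network refine sp incrementally while preserving its shape, which is convenient since S must stay frame-aligned with M.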
In step 2, the processing details of the waveform generation module 102 are shown in FIG. 3:
the upsampling processing module 03 receives the acoustic feature M and the abstract timbre feature S output by the feature mapping network module 02, and raises the sampling rate of both features by a factor of 200 to the sampling rate of the audio waveform; in this example the speech audio's sampling rate is 16000 Hz and the acoustic and timbre features' frame rate is 80 Hz with a frame duration of 12.5 ms, but the scheme is not limited to these parameters;
neural network layer 1 receives the upsampled acoustic feature M1 and timbre feature S1 and outputs the feature M2; the operation is then repeated N times: the output feature Mi+1 of neural network layer i and the timbre feature S1 output by the upsampling processing module 03 are input to the next neural network layer i+1, which outputs the feature Mi+2. In this embodiment each neural network layer is implemented as a CNN and the number of repetitions N is 10, but the scheme is not limited to these parameters;
the DNN network layer 07 receives the output feature MN+1 of neural network layer N (06) and processes it into the output speech waveform W.
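Under the stated example parameters (16000 Hz audio, 80 Hz frames of 12.5 ms, upsampling factor 200, N = 10 layers), the waveform generation module's data flow might be sketched as follows. Plain matrix layers with random weights stand in for the CNN layers and the final DNN layer; every name and dimension here is an assumption rather than the patented implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample(features: np.ndarray, factor: int = 200) -> np.ndarray:
    """Raise 80 Hz frame-rate features (12.5 ms frames) to the 16000 Hz
    audio rate by sample-and-hold repetition of each frame."""
    return np.repeat(features, factor, axis=0)

class WaveformGenerator:
    """N conditioned layers, each fed the previous layer's output plus S1,
    then a final DNN layer that emits one waveform sample per row."""
    def __init__(self, dim: int = 80, n_layers: int = 10):
        self.layers = [(rng.standard_normal((dim, dim)) * 0.05,
                        rng.standard_normal((dim, dim)) * 0.05)
                       for _ in range(n_layers)]
        self.out = rng.standard_normal((dim, 1)) * 0.05
    def __call__(self, m: np.ndarray, s: np.ndarray) -> np.ndarray:
        m1, s1 = upsample(m), upsample(s)   # (frames * 200, dim) each
        h = m1
        for wm, ws in self.layers:          # timbre S1 re-injected per layer
            h = np.tanh(h @ wm + s1 @ ws)
        return (h @ self.out).ravel()       # speech waveform W

gen = WaveformGenerator()
w = gen(np.random.rand(8, 80), np.random.rand(8, 80))  # 8 frames -> 1600 samples
```

Re-injecting S1 into every layer, rather than only at the input, is the conditioning pattern the description credits for the module's robustness across timbres.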
The foregoing embodiments merely illustrate the principles and utilities of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.
Claims (4)
1. A speaker-independent neural network vocoder system, comprising the steps of:
S1, a timbre feature extraction module receives an acoustic feature M and performs timbre feature extraction on it to obtain timbre feature information S;
S2, a waveform generation module receives the acoustic feature M and the timbre feature S output by the timbre extraction module, and performs waveform generation processing to obtain a speech waveform W.
2. The speaker-independent neural network vocoder system of claim 1, wherein in S1 the acoustic feature may be selected from a mel spectrogram, a mel cepstrum, and a linear magnitude spectrum.
3. The speaker-independent neural network vocoder system of claim 1, wherein in S1 the timbre feature extraction module comprises a traditional timbre feature extraction module, which extracts a traditional timbre feature sp from the input acoustic feature M, the traditional timbre feature being selected from the fundamental frequency (F0), a voiced/unvoiced flag, a magnitude spectrum envelope, linear prediction coefficients, or line spectrum pairs;
a feature mapping network module maps the traditional timbre feature sp output by the traditional timbre feature extraction module into an abstract timbre feature S, and the timbre feature mapping network may be formed by a residual network or a bidirectional recurrent neural network.
4. The speaker-independent neural network vocoder system of claim 1, wherein in S2 the acoustic feature M and the timbre feature S are upsampled to the sampling rate of the audio waveform; for example, if the audio waveform's sampling rate is 16000 Hz and the acoustic feature's frame rate is 80 Hz with a frame duration of 12.5 ms, the acoustic feature and the timbre feature are upsampled from 80 Hz by a factor of 200 to obtain the 16000 Hz-sampled acoustic feature M1 and timbre feature S1;
the upsampled acoustic feature M1 and timbre feature S1 are input to neural network layer 1, which outputs feature M2; the operation is then repeated N times, with the output feature Mi of the previous neural network layer and the timbre feature S1 input to the next neural network layer i, which outputs feature Mi+1;
each neural network layer may be implemented as a CNN or a unidirectional RNN;
a DNN layer converts the output feature MN+1 of neural network layer N into the speech waveform W.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010158293.1A CN111312208A (en) | 2020-03-09 | 2020-03-09 | Speaker-independent neural network vocoder system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111312208A true CN111312208A (en) | 2020-06-19 |
Family
ID=71147968
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010158293.1A Pending CN111312208A (en) | 2020-03-09 | 2020-03-09 | Speaker-independent neural network vocoder system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111312208A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105788608A (en) * | 2016-03-03 | 2016-07-20 | 渤海大学 | Chinese initial consonant and compound vowel visualization method based on neural network |
CN107610707A (en) * | 2016-12-15 | 2018-01-19 | 平安科技(深圳)有限公司 | A kind of method for recognizing sound-groove and device |
US20180130474A1 (en) * | 2015-06-19 | 2018-05-10 | Google Llc | Speech recognition with acoustic models |
CN108615525A (en) * | 2016-12-09 | 2018-10-02 | 中国移动通信有限公司研究院 | A kind of audio recognition method and device |
CN110033755A (en) * | 2019-04-23 | 2019-07-19 | 平安科技(深圳)有限公司 | Phoneme synthesizing method, device, computer equipment and storage medium |
JP6578544B1 (en) * | 2019-06-14 | 2019-09-25 | 株式会社テクノスピーチ | Audio processing apparatus and audio processing method |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111883106A (en) * | 2020-07-27 | 2020-11-03 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio processing method and device |
CN111883106B (en) * | 2020-07-27 | 2024-04-19 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio processing method and device |
CN112133278A (en) * | 2020-11-20 | 2020-12-25 | 成都启英泰伦科技有限公司 | Network training and personalized speech synthesis method for personalized speech synthesis model |
CN112133278B (en) * | 2020-11-20 | 2021-02-05 | 成都启英泰伦科技有限公司 | Network training and personalized speech synthesis method for personalized speech synthesis model |
CN112365877A (en) * | 2020-11-27 | 2021-02-12 | 北京百度网讯科技有限公司 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium |
CN113724683A (en) * | 2021-07-23 | 2021-11-30 | 阿里巴巴达摩院(杭州)科技有限公司 | Audio generation method, computer device, and computer-readable storage medium |
CN113724683B (en) * | 2021-07-23 | 2024-03-22 | 阿里巴巴达摩院(杭州)科技有限公司 | Audio generation method, computer device and computer readable storage medium |
WO2023083252A1 (en) * | 2021-11-11 | 2023-05-19 | 北京字跳网络技术有限公司 | Timbre selection method and apparatus, electronic device, readable storage medium, and program product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20200619 |