WO2022156479A1 - Synthesis method and apparatus for custom-timbre singing voice, electronic device, and storage medium - Google Patents

Synthesis method and apparatus for custom-timbre singing voice, electronic device, and storage medium

Info

Publication number
WO2022156479A1
WO2022156479A1 (PCT/CN2021/140858)
Authority
WO
WIPO (PCT)
Prior art keywords
model
speaker
synthesized
singing
sample
Prior art date
Application number
PCT/CN2021/140858
Other languages
English (en)
French (fr)
Inventor
张政臣
吴俊仪
蔡玉玉
袁鑫
宋伟
何晓冬
Original Assignee
北京沃东天骏信息技术有限公司
北京京东世纪贸易有限公司
Priority date
Filing date
Publication date
Application filed by 北京沃东天骏信息技术有限公司 and 北京京东世纪贸易有限公司
Priority to JP2023516595A (publication JP2023541182A)
Priority to US18/252,186 (publication US20230410786A1)
Publication of WO2022156479A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L 19/16 Vocoder architecture
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • The present disclosure relates to the technical field of sound synthesis, and in particular to a synthesis method and apparatus, an electronic device, and a storage medium for custom-timbre singing voices.
  • Existing singing voice synthesis technology, such as the "VOCALOID" synthesizer that virtual idols rely on, mainly builds a corpus from real voice recordings, segments the lyrics provided by the user into words, retrieves the corresponding entries from the corpus, and finally adjusts the beat and pitch of the synthesized speech according to the musical score provided by the user to synthesize the singing voice.
  • The purpose of the present disclosure is to provide a synthesis method and apparatus, an electronic device, and a storage medium for custom-timbre singing voices, which overcome, at least to a certain extent, the problem of low synthesis efficiency of custom-timbre singing voices in the related art.
  • A method for synthesizing a custom-timbre singing voice, including: training a first neural network with speaker recording samples to obtain a speaker recognition model, where the first neural network outputs speaker vector samples as its training result; training a second neural network with a cappella singing samples and the speaker vector samples to obtain an a cappella synthesis model; inputting the speaker recording to be synthesized into the speaker recognition model and obtaining the speaker information output by the intermediate hidden layer of the speaker recognition model; and inputting the a cappella music information to be synthesized and the speaker information into the a cappella synthesis model to obtain the synthesized custom-timbre singing voice.
  • Training the first neural network with the speaker recording samples to obtain the speaker recognition model includes: dividing the speaker recording samples into test recording samples and registration recording samples, and inputting them into the first neural network; outputting registration recording features from the registration recording samples through the first neural network, and average-pooling the registration recording features to obtain a registration recording vector; outputting a test recording vector from the test recording samples through the first neural network; performing a cosine similarity calculation on the registration recording vector and the test recording vector to obtain a cosine similarity result; optimizing the parameters of the first neural network with the cosine similarity result and a regression function until the loss value of the regression function is minimized; and determining the first neural network with optimized parameters as the speaker recognition model.
  • the a cappella synthesis model includes a duration model, an acoustic model, and a vocoder model
  • Training the second neural network with the a cappella singing samples and the speaker vector samples to obtain the a cappella synthesis model includes: parsing the musical score samples, lyric samples, and phoneme duration samples in the a cappella singing samples; and training the duration model with the speaker vector samples, musical score samples, lyric samples, and phoneme duration samples, where the output of the duration model is a duration prediction sample.
  • the a cappella synthesis model includes a duration model, an acoustic model, and a vocoder model
  • Training the second neural network with the a cappella singing samples and the speaker vector samples to obtain the a cappella synthesis model includes: parsing the musical score samples, lyric samples, and phoneme duration samples in the a cappella singing samples; extracting mel spectrogram samples from the songs in the a cappella singing samples; and training the acoustic model with the speaker vector samples, phoneme duration samples, musical score samples, lyric samples, and mel spectrogram samples, where the output of the acoustic model is a mel spectrogram prediction sample.
  • the a cappella synthesis model includes a duration model, an acoustic model, and a vocoder model
  • Training the second neural network with the a cappella singing samples and the speaker vector samples to obtain the a cappella synthesis model includes: extracting mel spectrogram samples from the songs in the a cappella singing samples; and training the vocoder model with the mel spectrogram samples, where the output of the vocoder model is an audio prediction sample.
  • Inputting the a cappella music information to be synthesized and the speaker information into the a cappella synthesis model to obtain the synthesized custom-timbre singing voice includes: parsing the musical score to be synthesized and the lyrics to be synthesized from the a cappella music information;
  • the speaker information, the musical score to be synthesized, and the lyrics to be synthesized are input into the duration model, and the output of the duration model is the duration prediction result to be synthesized;
  • the duration prediction result, the speaker information, the musical score to be synthesized, and the lyrics to be synthesized are input into the acoustic model, and the output of the acoustic model is the mel spectrogram prediction result to be synthesized;
  • the mel spectrogram prediction result is input into the vocoder model, and the output of the vocoder model is the synthesized custom-timbre singing voice.
  • Parsing the musical score to be synthesized and the lyrics to be synthesized from the a cappella music information includes: performing text analysis and feature extraction on the musical score and lyrics in the a cappella music information to obtain the musical score to be synthesized and the lyrics to be synthesized.
  • Inputting the duration prediction result, the speaker information, the musical score to be synthesized, and the lyrics to be synthesized into the acoustic model, where the output of the acoustic model is the mel spectrogram prediction result to be synthesized, includes: performing frame-level expansion on the duration prediction result, the musical score to be synthesized, and the lyrics to be synthesized; and inputting the frame-level expansion result and the speaker information into the acoustic model, where the output of the acoustic model is the mel spectrogram prediction result to be synthesized.
  • An apparatus for synthesizing a custom-timbre singing voice, including: a first training module for training a first neural network with speaker recording samples to obtain a speaker recognition model, where the first neural network outputs speaker vector samples as its training result; a second training module for training a second neural network with a cappella singing samples and the speaker vector samples to obtain an a cappella synthesis model; a recognition module for inputting the speaker recording to be synthesized into the speaker recognition model and obtaining the speaker information output by the intermediate hidden layer of the speaker recognition model; and a synthesis module for inputting the a cappella music information to be synthesized and the speaker information into the a cappella synthesis model to obtain the synthesized custom-timbre singing voice.
  • An electronic device, comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to execute, via execution of the executable instructions, any one of the foregoing methods for synthesizing a custom-timbre singing voice.
  • A computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, any one of the above methods for synthesizing a custom-timbre singing voice is implemented.
  • According to the synthesis scheme provided by the embodiments of the present disclosure, the first neural network is trained with the speaker recording samples to obtain the speaker recognition model, the first neural network outputs speaker vector samples as the training result, and the second neural network is trained with the a cappella singing samples and the speaker vector samples to obtain the a cappella synthesis model, which improves the efficiency of model synthesis and does not require collecting a large amount of recording data to generate a corpus.
  • Further, the speaker recording to be synthesized is input into the speaker recognition model, the speaker information output by the intermediate hidden layer of the speaker recognition model is obtained, and the a cappella music information to be synthesized and the speaker information are input into the a cappella synthesis model to obtain the synthesized custom-timbre singing voice.
  • In this way, a user-customized timbre can be trained with only a small amount of corpus, and the effect of singing voice synthesis is achieved by adjusting the rhythm and pitch of the synthesized speech, which reduces the time and training corpus required in the process of custom-timbre singing voice synthesis and improves the synthesis efficiency of custom-timbre singing voices.
  • FIG. 1 shows a schematic diagram of a method for synthesizing custom timbre singing in an embodiment of the present disclosure
  • FIG. 2 shows a flowchart of another method for synthesizing custom timbre singing in an embodiment of the present disclosure
  • FIG. 3 shows a flowchart of another method for synthesizing custom timbre singing in an embodiment of the present disclosure
  • FIG. 4 shows a flowchart of another method for synthesizing custom timbre singing in an embodiment of the present disclosure
  • FIG. 5 shows a flowchart of another method for synthesizing custom timbre singing in an embodiment of the present disclosure
  • FIG. 6 shows a flowchart of another method for synthesizing custom timbre singing in an embodiment of the present disclosure
  • FIG. 7 shows a flowchart of another method for synthesizing custom timbre singing in an embodiment of the present disclosure
  • FIG. 8 shows a flowchart of another method for synthesizing custom timbre singing in an embodiment of the present disclosure
  • FIG. 9 shows a flowchart of another method for synthesizing custom timbre singing in an embodiment of the present disclosure
  • FIG. 10 shows a flowchart of another method for synthesizing custom timbre singing in an embodiment of the present disclosure
  • FIG. 11 shows a flowchart of another method for synthesizing custom timbre singing in an embodiment of the present disclosure
  • FIG. 12 shows a flowchart of another method for synthesizing custom timbre singing in an embodiment of the present disclosure
  • FIG. 13 shows a flowchart of another method for synthesizing custom timbre singing in an embodiment of the present disclosure
  • FIG. 14 shows a schematic diagram of a synthesis device for customizing timbre singing in an embodiment of the present disclosure
  • FIG. 15 shows a schematic diagram of an electronic device in an embodiment of the present disclosure.
  • Example embodiments will now be described more fully with reference to the accompanying drawings.
  • Example embodiments can be embodied in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
  • the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
  • In the scheme provided by the present disclosure, the first neural network is trained with the speaker recording samples to obtain the speaker recognition model, the first neural network outputs speaker vector samples as the training result, and the second neural network is trained with the a cappella singing samples and the speaker vector samples to obtain the a cappella synthesis model, which improves the efficiency of model synthesis and does not require collecting a large amount of recording data to generate a corpus.
  • Further, the speaker recording to be synthesized is input into the speaker recognition model, the speaker information output by the intermediate hidden layer of the speaker recognition model is obtained, and the a cappella music information to be synthesized and the speaker information are input into the a cappella synthesis model to obtain the synthesized custom-timbre singing voice.
  • In this way, a user-customized timbre can be trained with only a small amount of corpus, and the effect of singing voice synthesis is achieved by adjusting the rhythm and pitch of the synthesized speech, which reduces the time and training corpus required in the process of custom-timbre singing voice synthesis and improves the synthesis efficiency of custom-timbre singing voices.
  • The terminal can be a mobile terminal such as a mobile phone, game console, tablet computer, e-book reader, smart glasses, MP4 (Moving Picture Experts Group Audio Layer IV) player, smart home device, AR (Augmented Reality) device, or VR (Virtual Reality) device; alternatively, the terminal can be a personal computer (PC), such as a laptop computer or a desktop computer.
  • the terminal may be installed with a synthesis application for providing customized timbre and singing.
  • the terminal and the server cluster are connected through a communication network.
  • the communication network is a wired network or a wireless network.
  • a server cluster is a server, or consists of several servers, or a virtualization platform, or a cloud computing service center.
  • the server cluster is used to provide background services for applications that provide synthesis for custom timbre vocals.
  • the server cluster undertakes the main computing work, and the terminal undertakes the secondary computing work; or, the server cluster undertakes the secondary computing work, and the terminal undertakes the main computing work; or, a distributed computing architecture is used between the terminal and the server cluster for collaborative computing.
  • the clients of the application programs installed in different terminals are the same, or the clients of the application programs installed on the two terminals are clients of the same type of application programs of different control system platforms.
  • the specific form of the client of the application program may also differ depending on the terminal platform; for example, the client of the application program may be a mobile phone client, a PC client, or a World Wide Web client.
  • the number of the above-mentioned terminals may be more or less.
  • the above-mentioned terminal may be only one, or the above-mentioned terminal may be dozens or hundreds, or more.
  • the embodiments of the present disclosure do not limit the number of terminals and device types.
  • the system may further include a management device, and the management device and the server cluster are connected through a communication network.
  • the communication network is a wired network or a wireless network.
  • the above-mentioned wireless network or wired network uses standard communication technologies and/or protocols.
  • the network is usually the Internet, but can be any network, including but not limited to any combination of a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile, wired, or wireless network, a private network, or a virtual private network.
  • data exchanged over a network is represented using technologies and/or formats including Hyper Text Mark-up Language (HTML), Extensible Markup Language (XML), and the like.
  • HTML Hyper Text Mark-up Language
  • XML Extensible Markup Language
  • In addition, conventional encryption techniques such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec) can be used to encrypt all or some of the links.
  • custom and/or dedicated data communication techniques may also be used in place of or in addition to the above data communication techniques.
  • FIG. 1 shows a flow chart of a method for synthesizing custom timbre singing according to an embodiment of the present disclosure.
  • the methods provided by the embodiments of the present disclosure may be executed by any electronic device with computing processing capability.
  • the electronic device executes a method for synthesizing custom timbre singing, including the following steps:
  • Step S102 the first neural network is trained by the speaker recording samples to obtain a speaker recognition model, and the first neural network outputs the training result as the speaker vector samples.
  • Step S104 the second neural network is trained by the singing a cappella samples and the speaker vector samples to obtain a cappella synthesis model.
  • Step S106 input the speaker recording to be synthesized into the speaker recognition model, and obtain the speaker information output by the middle hidden layer of the speaker recognition model.
  • Step S108 input the a cappella music information and speaker information to be synthesized into the a cappella synthesis model to obtain a synthesized custom timbre singing voice.
  • In one embodiment of the present disclosure, the first neural network is trained with the speaker recording samples to obtain the speaker recognition model, the first neural network outputs speaker vector samples as the training result, and the second neural network is trained with the a cappella singing samples and the speaker vector samples to obtain the a cappella synthesis model, which improves the efficiency of model synthesis and does not require collecting a large amount of recording data to generate a corpus.
  • Further, the speaker recording to be synthesized is input into the speaker recognition model, the speaker information output by the intermediate hidden layer of the speaker recognition model is obtained, and the a cappella music information to be synthesized and the speaker information are input into the a cappella synthesis model to obtain the synthesized custom-timbre singing voice.
  • In this way, a user-customized timbre can be trained with only a small amount of corpus, and the effect of singing voice synthesis is achieved by adjusting the rhythm and pitch of the synthesized speech, which reduces the time and training corpus required in the process of custom-timbre singing voice synthesis and improves the synthesis efficiency of custom-timbre singing voices.
  • Training the first neural network with the speaker recording samples to obtain the speaker recognition model includes:
  • Step S2022 the speaker recording samples are divided into test recording samples and registration recording samples, and input to the first neural network.
  • Step S2024 the registered recording sample outputs the registered recording feature through the first neural network, and performs the average pooling process on the registered recording feature to obtain the registered recording vector.
  • The forward propagation of average pooling averages the values in a block, and the back propagation divides the gradient of an element into n equal parts and assigns them to the previous layer, ensuring that the sum of the gradients (residuals) before and after pooling remains unchanged; average pooling can reduce the error in the variance of the estimate caused by the limited neighborhood size and retain more feature information.
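  • The following is a minimal NumPy sketch (an illustrative assumption, not part of the disclosure) of the average-pooling behaviour described above: the forward pass averages the values in a block, and the backward pass splits the incoming gradient into n equal shares so the gradient sum is preserved.

```python
import numpy as np

def avg_pool_forward(frames: np.ndarray) -> np.ndarray:
    """Average a block of frame-level features (n, dim) into one vector (dim,)."""
    return frames.mean(axis=0)

def avg_pool_backward(grad_out: np.ndarray, n: int) -> np.ndarray:
    """Distribute the gradient of the pooled vector equally over the n inputs."""
    return np.tile(grad_out / n, (n, 1))

features = np.random.randn(5, 8)           # 5 frames, 8-dim registration features
pooled = avg_pool_forward(features)        # registration recording vector
grad = np.ones_like(pooled)                # gradient flowing back from the loss
grad_in = avg_pool_backward(grad, n=5)
assert np.allclose(grad_in.sum(axis=0), grad)   # gradient sum is preserved
```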
  • Step S2026 the test recording sample outputs a test recording vector through the first neural network.
  • Step S2028 perform cosine similarity calculation on the registration recording vector and the test recording vector to obtain a cosine similarity result.
  • Step S2030 optimize the parameters of the first neural network through the cosine similarity result and the regression function until the loss value of the regression function is minimized.
  • Step S2032 determine the first neural network with optimized parameters as the speaker recognition model.
  • The parameters of the first neural network are optimized through the cosine similarity result and the regression function until the loss value of the regression function is minimized, so as to obtain a speaker recognition model capable of recognizing the speaker; only a few seconds of a speaker's recording are needed to complete the recognition.
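  • As a hedged illustration of this optimization (cosine similarity between the registration vector and the test vector, followed by a regression-style loss), the following PyTorch sketch is one possible reading; the LSTM encoder, the feature dimensions, and the binary cross-entropy loss are assumptions, since the disclosure only specifies a cosine similarity and a regression function.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# assumed encoder: a 3-layer LSTM over 40-dim acoustic frames
encoder = nn.LSTM(input_size=40, hidden_size=256, num_layers=3, batch_first=True)

def embed(utterance: torch.Tensor) -> torch.Tensor:
    """Map an utterance (1, frames, 40) to a fixed 256-dim speaker embedding."""
    outputs, _ = encoder(utterance)
    return outputs[:, -1, :].squeeze(0)          # last frame's hidden state

def verification_loss(enroll, test, label):
    enroll_vec = torch.stack([embed(u) for u in enroll]).mean(dim=0)  # average pooling
    test_vec = embed(test)
    score = F.cosine_similarity(enroll_vec, test_vec, dim=0)
    # regression-style loss: label is 1.0 for same speaker, 0.0 otherwise
    return F.binary_cross_entropy_with_logits(score.unsqueeze(0), label)

label = torch.tensor([1.0])
loss = verification_loss([torch.randn(1, 120, 40) for _ in range(3)],
                         torch.randn(1, 100, 40), label)
loss.backward()   # parameters are optimized until this loss is minimized
```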
  • The a cappella synthesis model includes a duration model, an acoustic model, and a vocoder model, and training the second neural network with the a cappella singing samples and the speaker vector samples to obtain the a cappella synthesis model includes:
  • Step S3042 parse the musical score samples, lyric samples, and phoneme duration samples in the a cappella singing samples.
  • Step S3044 the duration model is trained with the speaker vector samples, musical score samples, lyric samples, and phoneme duration samples, and the output of the duration model is a duration prediction sample.
  • The duration model is trained with the speaker vector samples, musical score samples, lyric samples, and phoneme duration samples, and its output is a duration prediction sample, so as to obtain a duration prediction for the synthesized a cappella song.
  • the duration prediction result is used as an input of the acoustic model.
  • The a cappella synthesis model includes a duration model, an acoustic model, and a vocoder model, and training the second neural network with the a cappella singing samples and the speaker vector samples to obtain the a cappella synthesis model includes:
  • Step S4042 parse the musical score samples, lyric samples, and phoneme duration samples in the a cappella singing samples.
  • Step S4044 extract mel spectrogram samples from the songs in the a cappella singing samples.
  • Step S4046 the acoustic model is trained with the speaker vector samples, phoneme duration samples, musical score samples, lyric samples, and mel spectrogram samples, and the output of the acoustic model is a mel spectrogram prediction sample.
  • The acoustic model is trained with the speaker vector samples, phoneme duration samples, musical score samples, lyric samples, and mel spectrogram samples, and its output is a mel spectrogram prediction sample, so as to obtain sound features of a suitable size; the mel spectrogram simulates the human ear's perception of sounds of various frequencies, that is, it emphasizes the low-frequency part and attenuates the high-frequency part, so that the synthesized a cappella singing is closer to the singing of a natural person.
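  • A minimal sketch (assuming the third-party librosa library) of extracting the mel spectrogram samples mentioned above from an a cappella recording; the window, hop, and mel-band settings are illustrative assumptions rather than values from the disclosure.

```python
import librosa
import numpy as np

def extract_mel(path: str, sr: int = 22050, n_mels: int = 80) -> np.ndarray:
    audio, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=1024, hop_length=256, n_mels=n_mels)
    # the mel filter bank gives finer resolution to low frequencies, matching the
    # perceptual emphasis described above; log compression keeps values in range
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))

# mel_sample = extract_mel("a_cappella_sample.wav")   # shape: (80, frames)
```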
  • The a cappella synthesis model includes a duration model, an acoustic model, and a vocoder model, and training the second neural network with the a cappella singing samples and the speaker vector samples to obtain the a cappella synthesis model includes:
  • Step S5042 extract mel spectrogram samples from the songs in the a cappella singing samples.
  • Step S5044 the vocoder model is trained with the mel spectrogram samples, and the output of the vocoder model is an audio prediction sample.
  • The vocoder model is trained with the mel spectrogram samples and its output is an audio prediction sample, thereby obtaining audio predictions that match the speaker's timbre.
  • Inputting the a cappella music information to be synthesized and the speaker information into the a cappella synthesis model to obtain the synthesized custom-timbre singing voice includes:
  • Step S6082 parse the musical score to be synthesized and the lyrics to be synthesized from the a cappella music information.
  • Step S6084 the speaker information, the musical score to be synthesized, and the lyrics to be synthesized are input into the duration model, and the output result of the duration model is the duration prediction result to be synthesized.
  • Step S6086 input the duration prediction result, speaker information, musical score to be synthesized, and lyrics to be synthesized into the acoustic model, and the output result of the acoustic model is the prediction result of the Mel spectrogram to be synthesized.
  • step S6088 the prediction result of the mel spectrogram is input to the vocoder model, and the output result of the vocoder model is the synthesized custom-timbered singing voice.
  • In the process of synthesizing a custom-timbre singing voice, the speaker is determined through the speaker recognition model, and then the duration model, the acoustic model, and the vocoder model are applied in turn to the speaker information, the musical score to be synthesized, and the lyrics to be synthesized, so as to obtain a custom-timbre singing voice that matches the speaker's voice.
  • Parsing the musical score to be synthesized and the lyrics to be synthesized from the a cappella music information includes:
  • Step S7082 perform text analysis and feature extraction on the musical score and lyrics in the a cappella music information to obtain the musical score to be synthesized and the lyrics to be synthesized.
  • The musical score to be synthesized and the lyrics to be synthesized are obtained by performing text analysis and feature extraction on the musical score and lyrics in the a cappella music information; the lyrics to be synthesized better match the speaker's articulation characteristics, and the musical score to be synthesized better matches the speaker's a cappella timbre.
  • The duration prediction result, speaker information, musical score to be synthesized, and lyrics to be synthesized are input into the acoustic model, and the output of the acoustic model is the mel spectrogram prediction result to be synthesized, which includes:
  • Step S8082 perform frame-level expansion on the duration prediction result, the musical score to be synthesized, and the lyrics to be synthesized.
  • Step S8084 the frame-level extension result and the speaker information are input to the acoustic model, and the output result of the acoustic model is the prediction result of the mel spectrogram to be synthesized.
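  • A hedged sketch of the frame-level expansion just described: phoneme-level score and lyric features are repeated according to the predicted duration of each phoneme before being fed to the acoustic model; the shapes and the use of PyTorch are assumptions.

```python
import torch

def frame_level_expand(phoneme_feats: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """phoneme_feats: (num_phonemes, dim); durations: (num_phonemes,) predicted frames."""
    return torch.repeat_interleave(phoneme_feats, durations, dim=0)

feats = torch.randn(4, 16)                     # 4 phonemes, 16-dim score/lyric features
durations = torch.tensor([3, 5, 2, 7])         # duration prediction result (frames)
frames = frame_level_expand(feats, durations)  # (17, 16) frame-level features
```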
  • A schematic diagram of the synthesis scheme for custom-timbre singing according to this embodiment of the present disclosure is described below with reference to FIG. 9.
  • The training phase 900 of the synthesis scheme for custom-timbre singing shown in FIG. 9 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
  • The training phase 900 of the synthesis scheme for custom-timbre singing includes: inputting recognition data into the speaker recognition model for training; the speaker recognition model outputting the speaker information; and inputting a cappella data and the speaker information into the a cappella synthesis model for training.
  • A schematic diagram of the synthesis scheme for custom-timbre singing according to this embodiment of the present disclosure is described below with reference to FIG. 10.
  • The synthesis phase 1000 of the synthesis scheme for custom-timbre singing shown in FIG. 10 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
  • The synthesis phase 1000 of the synthesis scheme for custom-timbre singing includes: inputting text recording data into the speaker recognition model to obtain the speaker information; and inputting the speaker information, musical score, and lyrics into the a cappella synthesis model to obtain the a cappella singing voice.
  • A schematic diagram of the synthesis scheme for custom-timbre singing according to this embodiment of the present disclosure is described below with reference to FIG. 11.
  • The speaker recognition model 1100 of the synthesis scheme for custom-timbre singing shown in FIG. 11 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
  • The execution stage of the speaker recognition model 1100 of the synthesis scheme for custom-timbre singing includes: (1) inputting the test recording and registration recordings 1 to N into an LSTM (Long Short-Term Memory network), where N is an integer greater than 1; the test recording is processed by the LSTM to output speaker vector 1, and the LSTM outputs of the registration recordings are average-pooled to obtain speaker vector 2; (2) performing a cosine similarity calculation on speaker vector 1 and speaker vector 2 and computing a scoring function; and (3) determining the result of the scoring function as accept or reject through logistic regression.
  • A schematic diagram of the synthesis scheme for custom-timbre singing according to this embodiment of the present disclosure is described below with reference to FIG. 12.
  • The a cappella synthesis model 1200 of the synthesis scheme for custom-timbre singing shown in FIG. 12 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
  • The a cappella synthesis model 1200 of the synthesis scheme for custom-timbre singing includes a phoneme duration model 1202, an acoustic model 1204, and a vocoder model 1206, and the training process of each module can be performed independently as follows:
  • (1) The speaker vector, the musical score and lyrics, and the phoneme durations are input to the phoneme duration model 1202 for training.
  • (2) The speaker vector, the musical score and lyrics, the phoneme durations, and the mel spectrogram are input to the acoustic model 1204 for training.
  • (3) The mel spectrogram and the song are input to the vocoder model 1206 for training.
  • Specifically, the synthesis scheme for custom-timbre singing includes a speaker recognition model and an a cappella synthesis model.
  • The a cappella synthesis model involves a training process and an inference process.
  • The a cappella synthesis model includes a phoneme duration model, an acoustic model, and a neural network vocoder model.
  • The phoneme duration model may be a DNN (Deep Neural Network) model composed of three fully connected layers, whose input is the musical score and lyrics and whose output is the phoneme duration; at prediction time, only the musical score is known.
  • the speaker vector is also added to the phoneme duration model during training, so as to obtain different phoneme duration models according to different speakers.
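  • The following PyTorch sketch illustrates one possible reading of the phoneme duration model described above: three fully connected layers whose input is the score/lyric features concatenated with the speaker vector, and whose output is the phoneme duration; all dimensions are illustrative assumptions, not values from the disclosure.

```python
import torch
import torch.nn as nn

class DurationModel(nn.Module):
    def __init__(self, feat_dim: int = 64, spk_dim: int = 256, hidden: int = 256):
        super().__init__()
        # three fully connected layers, as described for the DNN duration model
        self.net = nn.Sequential(
            nn.Linear(feat_dim + spk_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                 # predicted duration (frames) per phoneme
        )

    def forward(self, score_lyric_feats, speaker_vec):
        # broadcast the speaker vector to every phoneme so timing depends on the speaker
        spk = speaker_vec.expand(score_lyric_feats.size(0), -1)
        return self.net(torch.cat([score_lyric_feats, spk], dim=-1)).squeeze(-1)

model = DurationModel()
durations = model(torch.randn(12, 64), torch.randn(1, 256))   # 12 phonemes
```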
  • the input to the acoustic model is the musical score and phoneme duration
  • the output is a mel spectrogram
  • the speaker vector is also input into the acoustic model.
  • the input to the vocoder model is a mel-spectrogram and the output is audio.
  • One possible implementation of the acoustic model is a deep neural network consisting of three LSTM layers; it may also be a more complex model with an attention mechanism.
  • The vocoder model can use the LPCNet vocoder (LPCNet: Improving Neural Speech Synthesis Through Linear Prediction).
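  • A hedged sketch of such an acoustic model: a network built from three LSTM layers that maps frame-level score features plus the speaker vector to a mel spectrogram; the dimensions are illustrative assumptions, not taken from the disclosure.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    def __init__(self, feat_dim: int = 64, spk_dim: int = 256, n_mels: int = 80):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim + spk_dim, 512, num_layers=3, batch_first=True)
        self.proj = nn.Linear(512, n_mels)

    def forward(self, frame_feats, speaker_vec):
        # frame_feats: (batch, frames, feat_dim); speaker_vec: (batch, spk_dim)
        spk = speaker_vec.unsqueeze(1).expand(-1, frame_feats.size(1), -1)
        out, _ = self.lstm(torch.cat([frame_feats, spk], dim=-1))
        return self.proj(out)                      # (batch, frames, n_mels) mel prediction

mel = AcousticModel()(torch.randn(2, 100, 64), torch.randn(2, 256))
```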
  • In the inference process, the musical score and lyrics are known, as well as the singer's speaker vector; then, using the phoneme duration model, acoustic model, and vocoder model obtained during training, a synthesized song can be output.
  • A schematic diagram of the synthesis scheme for custom-timbre singing according to this embodiment of the present disclosure is described below with reference to FIG. 13.
  • The synthesis scheme for custom-timbre singing shown in FIG. 13 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
  • The execution steps of the synthesis scheme for custom-timbre singing include:
  • Step S1302 input the speaker vector to be synthesized.
  • Step S1304 acquire the musical score and lyrics.
  • Step S1306 text analysis.
  • Step S1308 feature extraction.
  • Step S1310 phoneme duration prediction.
  • Step S1312 frame-level expansion.
  • Step S1314 mel spectrogram prediction.
  • Step S1316 synthesize the song.
  • the speaker vector is extracted using a deep neural network based on the speaker's acoustic features.
  • Through the speaker information, the timbre of the synthesized a cappella voice can be controlled.
  • The present disclosure trains an a cappella synthesis model on a data set of reading and a cappella singing recordings from a large number of speakers.
  • When a new speaker needs an a cappella singing voice synthesized, it is only necessary to record a small amount of the speaker's reading corpus, extract the speaker's vector, and input it into the a cappella synthesis model together with the musical score and lyrics; the inference process of the a cappella synthesis model then generates the speaker's a cappella voice, that is, the custom synthesized singing voice.
  • An a cappella data set containing multiple timbres and multiple singing voices can be constructed to train a basic model that can synthesize an a cappella voice given a musical score and lyrics.
  • The data set also needs to contain some singer recordings of specified texts.
  • A text-dependent speaker recognition model can be trained, and the output of the intermediate hidden layer of the model is taken and defined as the speaker vector.
  • The singer's specified-text recording is fed into the speaker recognition model to obtain the speaker vector, and this speaker vector is then used as part of the a cappella model; an a cappella synthesis model is trained on a large multi-timbre, multi-singing a cappella data set, and the singer's identity information is included in the a cappella synthesis model.
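  • A hedged sketch of taking the intermediate hidden layer of a trained text-dependent speaker recognition model as the speaker vector, using a PyTorch forward hook; the toy recognizer and the hooked layer index are hypothetical placeholders.

```python
import torch
import torch.nn as nn

recognizer = nn.Sequential(                       # stand-in for the trained model
    nn.Linear(40, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),               # intermediate hidden layer
    nn.Linear(256, 100),                          # speaker classification head
)

captured = {}
def save_hidden(module, inputs, output):
    captured["speaker_vector"] = output.detach()

recognizer[2].register_forward_hook(save_hidden)  # hook the middle linear layer

with torch.no_grad():
    recognizer(torch.randn(1, 40))                # run the specified-text recording
speaker_vector = captured["speaker_vector"]       # (1, 256) speaker vector
```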
  • the synthesizing apparatus 1400 for custom timbre singing according to this embodiment of the present disclosure will be described below with reference to FIG. 14 .
  • the synthesizing apparatus 1400 for custom timbre singing shown in FIG. 14 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • the synthesizing apparatus 1400 for custom timbre singing is represented in the form of a hardware module.
  • the components of the synthesis apparatus 1400 for custom timbre singing may include, but are not limited to: a first training module 1402 , a second training module 1404 , a recognition module 1406 and a synthesis module 1408 .
  • the first training module 1402 is used for training the first neural network through the speaker recording samples to obtain the speaker identification model, and the first neural network outputs the training result as the speaker vector samples.
  • the second training module 1404 is configured to train the second neural network by using the singing a cappella samples and the speaker vector samples to obtain a cappella synthesis model.
  • the identification module 1406 is configured to input the speaker recording to be synthesized into the speaker identification model, and obtain the speaker information output by the middle hidden layer of the speaker identification model.
  • the synthesis module 1408 is used for inputting the a cappella music information and speaker information to be synthesized into the a cappella synthesis model to obtain a synthesized custom-timbered singing voice.
  • An electronic device 1500 according to this embodiment of the present disclosure is described below with reference to FIG. 15.
  • the electronic device 1500 shown in FIG. 15 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
  • electronic device 1500 takes the form of a general-purpose computing device.
  • Components of the electronic device 1500 may include, but are not limited to, the above-mentioned at least one processing unit 1510, the above-mentioned at least one storage unit 1520, and a bus 1530 connecting different system components (including the storage unit 1520 and the processing unit 1510).
  • the storage unit stores program codes, which can be executed by the processing unit 1510, so that the processing unit 1510 performs the steps according to various exemplary embodiments of the present disclosure described in the above-mentioned "Exemplary Methods" section of this specification.
  • the processing unit 1510 may perform the steps defined in the method for synthesizing a custom-timbered singing voice of the present disclosure.
  • the storage unit 1520 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 15201 and/or a cache storage unit 15202 , and may further include a read only storage unit (ROM) 15203 .
  • the storage unit 1520 may also include a program/utility 15204 having a set (at least one) of program modules 15205 including, but not limited to, an operating system, one or more application programs, other program modules, and program data; an implementation of a network environment may be included in each or some combination of these examples.
  • the bus 1530 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
  • the electronic device 1500 may also communicate with one or more external devices 1540 (for example, keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable a user to interact with the electronic device 1500, and/or with any device (for example, a router, a modem, etc.) that enables the electronic device 1500 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 1550. In addition, the electronic device 1500 may communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 1560, which communicates with the other modules of the electronic device 1500 through the bus 1530.
  • the exemplary embodiments described herein may be implemented by software, or by software combined with necessary hardware. Therefore, the technical solutions according to the embodiments of the present disclosure may be embodied in the form of a software product, and the software product may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to cause a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to an embodiment of the present disclosure.
  • a computer-readable storage medium on which a program product capable of implementing the above-described method of the present specification is stored.
  • various aspects of the present disclosure can also be implemented in the form of a program product, which includes program code; when the program product runs on a terminal device, the program code causes the terminal device to execute the steps according to various exemplary embodiments of the present disclosure described in the "Exemplary Methods" section of this specification.
  • a program product for implementing the above method according to an embodiment of the present disclosure may employ a portable compact disc read only memory (CD-ROM) and include program codes, and may run on a terminal device such as a personal computer.
  • the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer readable signal medium may include a propagated data signal in baseband or as part of a carrier wave with readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a readable signal medium can also be any readable medium, other than a readable storage medium, that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • Program code embodied on a readable medium may be transmitted using any suitable medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Program code for performing the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
  • the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
  • Although several modules or units of the apparatus for action execution are mentioned in the above detailed description, this division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided into and embodied by multiple modules or units.
  • the exemplary embodiments described herein may be implemented by software, or may be implemented by software combined with necessary hardware. Therefore, the technical solutions according to the embodiments of the present disclosure may be embodied in the form of software products, and the software products may be stored in a non-volatile storage medium (which may be CD-ROM, U disk, mobile hard disk, etc.) or on the network , including several instructions to cause a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to an embodiment of the present disclosure.
  • In the scheme of the present disclosure, the first neural network is trained with the speaker recording samples to obtain the speaker recognition model, the first neural network outputs speaker vector samples as the training result, and the second neural network is trained with the a cappella singing samples and the speaker vector samples to obtain the a cappella synthesis model, which improves the efficiency of model synthesis and does not require collecting a large amount of recording data to generate a corpus.
  • Further, the speaker recording to be synthesized is input into the speaker recognition model, the speaker information output by the intermediate hidden layer of the speaker recognition model is obtained, and the a cappella music information to be synthesized and the speaker information are input into the a cappella synthesis model to obtain the synthesized custom-timbre singing voice.
  • In this way, a user-customized timbre can be trained with only a small amount of corpus, and the effect of singing voice synthesis is achieved by adjusting the rhythm and pitch of the synthesized speech, which reduces the time and training corpus required in the process of custom-timbre singing voice synthesis and improves the synthesis efficiency of custom-timbre singing voices.

Abstract

A synthesis method and apparatus for a custom-timbre singing voice, an electronic device, and a storage medium. The synthesis method includes: training a first neural network with speaker recording samples to obtain a speaker recognition model, the first neural network outputting speaker vector samples as its training result (S102); training a second neural network with a cappella singing samples and the speaker vector samples to obtain an a cappella synthesis model (S104); inputting the speaker recording to be synthesized into the speaker recognition model, and obtaining the speaker information output by the intermediate hidden layer of the speaker recognition model (S106); and inputting the a cappella music information to be synthesized and the speaker information into the a cappella synthesis model to obtain the synthesized custom-timbre singing voice (S108). The method improves the efficiency and quality of custom-timbre singing voice synthesis, and shortens the model training time and response time of custom-timbre singing voice synthesis.

Description

Synthesis method and apparatus for custom-timbre singing voice, electronic device, and storage medium
The present disclosure claims priority to the Chinese patent application No. 202110076168.0, entitled "Synthesis method and apparatus for custom-timbre singing voice, electronic device, and storage medium", filed on January 20, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of sound synthesis, and in particular to a synthesis method and apparatus, an electronic device, and a storage medium for custom-timbre singing voices.
Background Art
With the rapid development of the artificial intelligence industry, intelligent speech synthesis technology has penetrated many fields and is applied in services such as smart homes, voice navigation, and intelligent customer service; the speech synthesized by artificial intelligence is highly human-like and can reach the standard of replacing manual work. To meet users' demand for timbre diversity, existing timbre customization functions have also matured, and a user's exclusive timbre can be trained from a small amount of the user's corpus audio. Meanwhile, as virtual idols become increasingly well known, singing voice synthesis has also become one of the main development directions of speech synthesis technology. Existing singing voice synthesis technology, such as the "VOCALOID" synthesizer that virtual idols rely on, mainly builds a corpus from real voice recordings, segments the lyrics provided by the user into words, retrieves the corresponding entries from the corpus, and finally adjusts the beat and pitch of the synthesized speech according to the musical score provided by the user to synthesize the singing voice.
In the related art, building the corpus required by singing voice synthesis takes a long period of the user's voice data, and the process of retrieving the corpus to generate speech is very time-consuming, resulting in low singing voice synthesis efficiency. In addition, because the corpus is large, a user's demand for timbre customization can only be met by replacing the entire corpus, a process that is cumbersome and time-consuming.
It should be noted that the information disclosed in the above Background Art section is only intended to enhance the understanding of the background of the present disclosure, and therefore may include information that does not constitute prior art known to those of ordinary skill in the art.
Summary
The purpose of the present disclosure is to provide a synthesis method and apparatus, an electronic device, and a storage medium for custom-timbre singing voices, so as to overcome, at least to a certain extent, the problem of low synthesis efficiency of custom-timbre singing voices in the related art.
Other characteristics and advantages of the present disclosure will become apparent from the following detailed description, or will be learned in part through practice of the present disclosure.
According to one aspect of the present disclosure, a method for synthesizing a custom-timbre singing voice is provided, including: training a first neural network with speaker recording samples to obtain a speaker recognition model, where the first neural network outputs speaker vector samples as its training result; training a second neural network with a cappella singing samples and the speaker vector samples to obtain an a cappella synthesis model; inputting the speaker recording to be synthesized into the speaker recognition model, and obtaining the speaker information output by the intermediate hidden layer of the speaker recognition model; and inputting the a cappella music information to be synthesized and the speaker information into the a cappella synthesis model to obtain the synthesized custom-timbre singing voice.
In one embodiment of the present disclosure, training the first neural network with the speaker recording samples to obtain the speaker recognition model includes: dividing the speaker recording samples into test recording samples and registration recording samples, and inputting them into the first neural network; outputting registration recording features from the registration recording samples through the first neural network, and average-pooling the registration recording features to obtain a registration recording vector; outputting a test recording vector from the test recording samples through the first neural network; performing a cosine similarity calculation on the registration recording vector and the test recording vector to obtain a cosine similarity result; optimizing the parameters of the first neural network with the cosine similarity result and a regression function until the loss value of the regression function is minimized; and determining the first neural network with optimized parameters as the speaker recognition model.
In one embodiment of the present disclosure, the a cappella synthesis model includes a duration model, an acoustic model, and a vocoder model, and training the second neural network with the a cappella singing samples and the speaker vector samples to obtain the a cappella synthesis model includes: parsing the musical score samples, lyric samples, and phoneme duration samples in the a cappella singing samples; and training the duration model with the speaker vector samples, the musical score samples, the lyric samples, and the phoneme duration samples, where the output of the duration model is a duration prediction sample.
In one embodiment of the present disclosure, the a cappella synthesis model includes a duration model, an acoustic model, and a vocoder model, and training the second neural network with the a cappella singing samples and the speaker vector samples to obtain the a cappella synthesis model includes: parsing the musical score samples, lyric samples, and phoneme duration samples in the a cappella singing samples; extracting mel spectrogram samples from the songs in the a cappella singing samples; and training the acoustic model with the speaker vector samples, the phoneme duration samples, the musical score samples, the lyric samples, and the mel spectrogram samples, where the output of the acoustic model is a mel spectrogram prediction sample.
In one embodiment of the present disclosure, the a cappella synthesis model includes a duration model, an acoustic model, and a vocoder model, and training the second neural network with the a cappella singing samples and the speaker vector samples to obtain the a cappella synthesis model includes: extracting mel spectrogram samples from the songs in the a cappella singing samples; and training the vocoder model with the mel spectrogram samples, where the output of the vocoder model is an audio prediction sample.
In one embodiment of the present disclosure, inputting the a cappella music information to be synthesized and the speaker information into the a cappella synthesis model to obtain the synthesized custom-timbre singing voice includes: parsing the musical score to be synthesized and the lyrics to be synthesized from the a cappella music information; inputting the speaker information, the musical score to be synthesized, and the lyrics to be synthesized into the duration model, where the output of the duration model is a duration prediction result to be synthesized; inputting the duration prediction result, the speaker information, the musical score to be synthesized, and the lyrics to be synthesized into the acoustic model, where the output of the acoustic model is a mel spectrogram prediction result to be synthesized; and inputting the mel spectrogram prediction result into the vocoder model, where the output of the vocoder model is the synthesized custom-timbre singing voice.
In one embodiment of the present disclosure, parsing the musical score to be synthesized and the lyrics to be synthesized from the a cappella music information includes: performing text analysis and feature extraction on the musical score and lyrics in the a cappella music information to obtain the musical score to be synthesized and the lyrics to be synthesized.
In one embodiment of the present disclosure, inputting the duration prediction result, the speaker information, the musical score to be synthesized, and the lyrics to be synthesized into the acoustic model, where the output of the acoustic model is the mel spectrogram prediction result to be synthesized, includes: performing frame-level expansion on the duration prediction result, the musical score to be synthesized, and the lyrics to be synthesized; and inputting the frame-level expansion result and the speaker information into the acoustic model, where the output of the acoustic model is the mel spectrogram prediction result to be synthesized.
According to another aspect of the present disclosure, an apparatus for synthesizing a custom-timbre singing voice is provided, including: a first training module, configured to train a first neural network with speaker recording samples to obtain a speaker recognition model, where the first neural network outputs speaker vector samples as its training result; a second training module, configured to train a second neural network with a cappella singing samples and the speaker vector samples to obtain an a cappella synthesis model; a recognition module, configured to input the speaker recording to be synthesized into the speaker recognition model and obtain the speaker information output by the intermediate hidden layer of the speaker recognition model; and a synthesis module, configured to input the a cappella music information to be synthesized and the speaker information into the a cappella synthesis model to obtain the synthesized custom-timbre singing voice.
According to a further aspect of the present disclosure, an electronic device is provided, including: a processor; and a memory configured to store executable instructions of the processor; where the processor is configured to execute, via execution of the executable instructions, any one of the above methods for synthesizing a custom-timbre singing voice.
According to yet another aspect of the present disclosure, a computer-readable storage medium is provided on which a computer program is stored, and when the computer program is executed by a processor, any one of the above methods for synthesizing a custom-timbre singing voice is implemented.
In the synthesis scheme for custom-timbre singing voices provided by the embodiments of the present disclosure, the first neural network is trained with the speaker recording samples to obtain the speaker recognition model, the first neural network outputs speaker vector samples as its training result, and the second neural network is trained with the a cappella singing samples and the speaker vector samples to obtain the a cappella synthesis model, which improves the efficiency of model synthesis and does not require collecting a large amount of recording data to generate a corpus.
Further, by inputting the speaker recording to be synthesized into the speaker recognition model to obtain the speaker information output by the intermediate hidden layer of the speaker recognition model, and inputting the a cappella music information to be synthesized and the speaker information into the a cappella synthesis model to obtain the synthesized custom-timbre singing voice, a user-customized timbre can be trained with only a small amount of corpus, and the effect of singing voice synthesis is achieved by adjusting the rhythm and pitch of the synthesized speech, which reduces the time and training corpus required in the process of custom-timbre singing voice synthesis and improves the synthesis efficiency of custom-timbre singing voices.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit the present disclosure.
Brief Description of the Drawings
The accompanying drawings herein are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present disclosure, and together with the specification serve to explain the principles of the present disclosure. Obviously, the drawings in the following description are only some embodiments of the present disclosure, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 shows a schematic diagram of a method for synthesizing a custom-timbre singing voice in an embodiment of the present disclosure;
FIG. 2 shows a flowchart of another method for synthesizing a custom-timbre singing voice in an embodiment of the present disclosure;
FIG. 3 shows a flowchart of yet another method for synthesizing a custom-timbre singing voice in an embodiment of the present disclosure;
FIG. 4 shows a flowchart of yet another method for synthesizing a custom-timbre singing voice in an embodiment of the present disclosure;
FIG. 5 shows a flowchart of yet another method for synthesizing a custom-timbre singing voice in an embodiment of the present disclosure;
FIG. 6 shows a flowchart of yet another method for synthesizing a custom-timbre singing voice in an embodiment of the present disclosure;
FIG. 7 shows a flowchart of yet another method for synthesizing a custom-timbre singing voice in an embodiment of the present disclosure;
FIG. 8 shows a flowchart of yet another method for synthesizing a custom-timbre singing voice in an embodiment of the present disclosure;
FIG. 9 shows a flowchart of yet another method for synthesizing a custom-timbre singing voice in an embodiment of the present disclosure;
FIG. 10 shows a flowchart of yet another method for synthesizing a custom-timbre singing voice in an embodiment of the present disclosure;
FIG. 11 shows a flowchart of yet another method for synthesizing a custom-timbre singing voice in an embodiment of the present disclosure;
FIG. 12 shows a flowchart of yet another method for synthesizing a custom-timbre singing voice in an embodiment of the present disclosure;
FIG. 13 shows a flowchart of yet another method for synthesizing a custom-timbre singing voice in an embodiment of the present disclosure;
FIG. 14 shows a schematic diagram of an apparatus for synthesizing a custom-timbre singing voice in an embodiment of the present disclosure;
FIG. 15 shows a schematic diagram of an electronic device in an embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, example embodiments can be implemented in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be more thorough and complete and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
In addition, the drawings are only schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, so repeated description of them will be omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor apparatuses and/or microcontroller apparatuses.
In the scheme provided by the present disclosure, a first neural network is trained with speaker recording samples to obtain a speaker recognition model, the first neural network outputs speaker vector samples as the training result, and a second neural network is trained with a cappella singing samples and the speaker vector samples to obtain an a cappella synthesis model, which improves the efficiency of model synthesis and does not require collecting a large amount of recording data to generate a corpus.
Further, the speaker recording to be synthesized is input into the speaker recognition model to obtain the speaker information output by the intermediate hidden layer of the speaker recognition model, and the a cappella music information to be synthesized and the speaker information are input into the a cappella synthesis model to obtain the synthesized custom-timbre singing voice; a user-customized timbre can thus be trained with only a small amount of corpus, and the effect of singing voice synthesis is achieved by adjusting the rhythm and pitch of the synthesized speech, which reduces the time and training corpus required in the process of custom-timbre singing voice synthesis and improves the synthesis efficiency of custom-timbre singing voices.
The above synthesis scheme for custom-timbre singing voices can be implemented through the interaction of multiple terminals and a server cluster.
The terminal may be a mobile terminal such as a mobile phone, a game console, a tablet computer, an e-book reader, smart glasses, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a smart home device, an AR (Augmented Reality) device, or a VR (Virtual Reality) device; alternatively, the terminal may be a personal computer (PC), such as a laptop computer or a desktop computer.
An application program for providing the synthesis of custom-timbre singing voices may be installed in the terminal.
The terminal and the server cluster are connected through a communication network. Optionally, the communication network is a wired network or a wireless network.
The server cluster is a single server, or consists of several servers, or is a virtualization platform or a cloud computing service center. The server cluster is used to provide background services for the application program that provides the synthesis of custom-timbre singing voices. Optionally, the server cluster undertakes the main computing work and the terminal undertakes the secondary computing work; or the server cluster undertakes the secondary computing work and the terminal undertakes the main computing work; or a distributed computing architecture is used between the terminal and the server cluster for collaborative computing.
Optionally, the clients of the application program installed on different terminals are the same, or the clients of the application program installed on two terminals are clients of the same type of application program on different control system platforms. Depending on the terminal platform, the specific form of the client of the application program may also differ; for example, the client may be a mobile phone client, a PC client, or a World Wide Web client.
Those skilled in the art will appreciate that the number of the above terminals may be larger or smaller. For example, there may be only one terminal, or there may be dozens, hundreds, or more terminals. The embodiments of the present disclosure do not limit the number of terminals or the device types.
Optionally, the system may further include a management device, and the management device and the server cluster are connected through a communication network. Optionally, the communication network is a wired network or a wireless network.
Optionally, the above wireless network or wired network uses standard communication technologies and/or protocols. The network is usually the Internet, but may be any network, including but not limited to any combination of a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile, wired, or wireless network, a private network, or a virtual private network. In some embodiments, technologies and/or formats including Hyper Text Mark-up Language (HTML) and Extensible Markup Language (XML) are used to represent data exchanged over the network. In addition, conventional encryption techniques such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec) may be used to encrypt all or some of the links. In other embodiments, custom and/or dedicated data communication techniques may also be used in place of or in addition to the above data communication techniques.
The steps of the method for synthesizing a custom-timbre singing voice in this example embodiment will be described in more detail below with reference to the accompanying drawings and embodiments.
FIG. 1 shows a flowchart of a method for synthesizing a custom-timbre singing voice in an embodiment of the present disclosure. The method provided by the embodiments of the present disclosure may be executed by any electronic device with computing and processing capability.
As shown in FIG. 1, the electronic device executes the method for synthesizing a custom-timbre singing voice, which includes the following steps:
Step S102, a first neural network is trained with speaker recording samples to obtain a speaker recognition model, and the first neural network outputs speaker vector samples as the training result.
Step S104, a second neural network is trained with a cappella singing samples and the speaker vector samples to obtain an a cappella synthesis model.
Step S106, the speaker recording to be synthesized is input into the speaker recognition model, and the speaker information output by the intermediate hidden layer of the speaker recognition model is obtained.
Step S108, the a cappella music information to be synthesized and the speaker information are input into the a cappella synthesis model to obtain the synthesized custom-timbre singing voice.
In one embodiment of the present disclosure, the first neural network is trained with the speaker recording samples to obtain the speaker recognition model, the first neural network outputs speaker vector samples as the training result, and the second neural network is trained with the a cappella singing samples and the speaker vector samples to obtain the a cappella synthesis model, which improves the efficiency of model synthesis and does not require collecting a large amount of recording data to generate a corpus.
Further, the speaker recording to be synthesized is input into the speaker recognition model to obtain the speaker information output by the intermediate hidden layer of the speaker recognition model, and the a cappella music information to be synthesized and the speaker information are input into the a cappella synthesis model to obtain the synthesized custom-timbre singing voice; a user-customized timbre can thus be trained with only a small amount of corpus, and the effect of singing voice synthesis is achieved by adjusting the rhythm and pitch of the synthesized speech, which reduces the time and training corpus required in the process of custom-timbre singing voice synthesis and improves the synthesis efficiency of custom-timbre singing voices.
基于图1所示的步骤,如图2所示,通过说话人录音样本对第一神经网络进行训练,以得到说话人识别模型包括:
步骤S2022,将说话人录音样本划分为测试录音样本和注册录音样本,并输入至第一神经网络。
步骤S2024,注册录音样本经第一神经网络输出注册录音特征,将注册录音特征进行平均池化处理,以得到注册录音向量。
在本公开的一个实施例中,平均池化处理的前向传播就是把一个块中的值求取平均来做池化,那么反向传播的过程也就是把一个元素的梯度等分为n份分配给前一层,这样就保证池化前后的梯度(残差)之和保持不变,平均池化处理能够减小邻域大小受限造成的估计值方差的误差,更多的保留特征信息。
步骤S2026,测试录音样本经第一神经网络输出测试录音向量。
步骤S2028,对注册录音向量和测试录音向量进行余弦相似度计算,以获得余弦相似度结果。
步骤S2030,通过余弦相似度结果和回归函数对第一神经网络进行参数优化,至回归函数的损失值最小为止。
步骤S2032,将参数优化后的第一神经网络确定为说话人识别模型。
在本公开的一个实施例中,通过余弦相似度结果和回归函数对第一神经网络进行参数优化,至回归函数的损失值最小为止,以获得能够对说话人进行识别的说话人识别模型,仅需要几秒的说话人录音即可完成识别。
基于图1所示的步骤,如图3所示,清唱合成模型包括持续时间模型、声学模型和声码器模型,通过歌声清唱样本和说话人向量样本对第二神经网络进行训练,以得到清唱合 成模型包括:
步骤S3042,解析歌声清唱样本中的乐谱样本、歌词样本和音素时长样本。
步骤S3044,通过说话人向量样本、乐谱样本、歌词样本和音素时长样本对持续时间模型进行训练,持续时间模型的输出结果为时长预测样本。
在本公开的一个实施例中,通过说话人向量样本、乐谱样本、歌词样本和音素时长样本对持续时间模型进行训练,持续时间模型的输出结果为时长预测样本,以实现对合成后的清唱歌曲的时长预测结果,时长预测结果作为声学模型的一个输入量。
基于图1所示的步骤,如图4所示,清唱合成模型包括持续时间模型、声学模型和声码器模型,通过歌声清唱样本和说话人向量样本对第二神经网络进行训练,以得到清唱合成模型包括:
步骤S4042,解析歌声清唱样本中的乐谱样本、歌词样本和音素时长样本。
步骤S4044,根据歌声清唱样本中的歌曲提取梅尔图谱样本。
步骤S4046,通过说话人向量样本、音素时长样本、乐谱样本、歌词样本和梅尔图谱样本对声学模型进行训练,声学模型输出结果为梅尔谱图预测样本。
在本公开的一个实施例中,通过说话人向量样本、音素时长样本、乐谱样本、歌词样本和梅尔图谱样本对声学模型进行训练,声学模型输出结果为梅尔谱图预测样本,以得到合适大小的声音特征,通过梅尔谱图来模拟人耳对各种频率的声音的感知力,也即通过梅尔谱图强化低频部分,弱化高频部分,进而使清唱合成歌声更接近于自然人的歌声。
Based on the steps shown in FIG. 1, as shown in FIG. 5, the a cappella synthesis model includes a duration model, an acoustic model, and a vocoder model, and training the second neural network with a cappella singing samples and the speaker vector samples to obtain the a cappella synthesis model includes:
Step S5042: extract mel-spectrogram samples from the songs in the a cappella singing samples.
Step S5044: train the vocoder model with the mel-spectrogram samples, the output of the vocoder model being audio prediction samples.
In an embodiment of the present disclosure, the vocoder model is trained with the mel-spectrogram samples and outputs audio prediction samples, thereby obtaining audio predictions that match the speaker's timbre.
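For illustration, a stand-in vocoder training loop is sketched below. The description later mentions an LPCNet vocoder; here a toy transposed-convolution network is used purely so the mel-to-waveform training step is concrete, and the architecture, the 256x upsampling factor, and the L1 loss are our own assumptions:

    import torch
    import torch.nn as nn

    class ToyVocoder(nn.Module):
        # Stand-in for the neural vocoder; a small transposed-convolution stack
        # that upsamples each mel frame by 256 samples (matching the hop length
        # assumed in the acoustic-model sketch).
        def __init__(self, n_mels=80):
            super().__init__()
            self.net = nn.Sequential(
                nn.ConvTranspose1d(n_mels, 128, kernel_size=16, stride=16),
                nn.ReLU(),
                nn.ConvTranspose1d(128, 1, kernel_size=16, stride=16),
            )

        def forward(self, mel):                  # mel: (batch, n_mels, frames)
            return self.net(mel).squeeze(1)      # waveform: (batch, frames * 256)

    vocoder = ToyVocoder()
    mel = torch.randn(4, 80, 100)                # mel frames extracted from the songs
    target_wav = torch.randn(4, 100 * 256)       # aligned ground-truth audio
    loss = nn.L1Loss()(vocoder(mel), target_wav)
    loss.backward()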
Based on the steps shown in FIG. 1 and FIG. 3, as shown in FIG. 6, inputting the a cappella music information to be synthesized and the speaker information into the a cappella synthesis model to obtain the synthesized custom-timbre singing voice includes:
Step S6082: parse the music score to be synthesized and the lyrics to be synthesized from the a cappella music information.
Step S6084: input the speaker information, the music score to be synthesized, and the lyrics to be synthesized into the duration model, the output of the duration model being the duration prediction result to be synthesized.
Step S6086: input the duration prediction result, the speaker information, the music score to be synthesized, and the lyrics to be synthesized into the acoustic model, the output of the acoustic model being the mel-spectrogram prediction result to be synthesized.
Step S6088: input the mel-spectrogram prediction result into the vocoder model, the output of the vocoder model being the synthesized custom-timbre singing voice.
In an embodiment of the present disclosure, during custom-timbre singing synthesis, the speaker is determined by the speaker recognition model, and then, according to the speaker, the music score to be synthesized, and the lyrics to be synthesized, the duration model, the acoustic model, and the vocoder model are applied in turn to obtain a custom-timbre singing voice that matches the speaker's timbre.
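A compact inference sketch tying the three modules together, reusing the hypothetical DurationModel, AcousticModel, and ToyVocoder from the earlier sketches (illustrations rather than the disclosed implementation), might look as follows:

    import torch

    def synthesize(score_feat, lyric_feat, spk_vec,
                   duration_model, acoustic_model, vocoder):
        # score_feat, lyric_feat: (n_phonemes, dim) features from text analysis;
        # spk_vec: (1, spk_dim) speaker vector from the speaker recognition model.
        # 1. phoneme-level duration prediction
        durations = duration_model(score_feat, lyric_feat,
                                   spk_vec.expand(score_feat.size(0), -1))
        frames = durations.round().clamp(min=1).long()
        # 2. frame-level expansion of the phoneme features
        phone_feat = torch.cat([score_feat, lyric_feat], dim=-1)
        frame_feat = torch.repeat_interleave(phone_feat, frames, dim=0)
        # 3. mel-spectrogram prediction conditioned on the speaker vector
        mel = acoustic_model(frame_feat.unsqueeze(0), spk_vec)
        # 4. waveform generation by the vocoder
        return vocoder(mel.transpose(1, 2))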
Based on the steps shown in FIG. 1 and FIG. 3, as shown in FIG. 7, parsing the music score to be synthesized and the lyrics to be synthesized from the a cappella music information includes:
Step S7082: perform text analysis and feature extraction on the music score and lyrics in the a cappella music information to obtain the music score to be synthesized and the lyrics to be synthesized.
In an embodiment of the present disclosure, by performing text analysis and feature extraction on the music score and lyrics in the a cappella music information to obtain the music score to be synthesized and the lyrics to be synthesized, the lyrics to be synthesized better match the speaker's articulation characteristics, and the music score to be synthesized better matches the speaker's a cappella timbre.
Based on the steps shown in FIG. 1 and FIG. 3, as shown in FIG. 8, inputting the duration prediction result, the speaker information, the music score to be synthesized, and the lyrics to be synthesized into the acoustic model, the output of the acoustic model being the mel-spectrogram prediction result to be synthesized, includes:
Step S8082: perform frame-level expansion on the duration prediction result, the music score to be synthesized, and the lyrics to be synthesized.
Step S8084: input the result of the frame-level expansion and the speaker information into the acoustic model, the output of the acoustic model being the mel-spectrogram prediction result to be synthesized.
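The frame-level expansion itself can be illustrated with a toy example (our own, not from the disclosure): each phoneme-level feature row is repeated for the number of frames predicted by the duration model, so the acoustic model receives one input vector per output mel frame.

    import torch

    phone_feat = torch.tensor([[1.0, 0.0],
                               [0.0, 1.0],
                               [1.0, 1.0]])    # 3 phonemes, toy 2-dim features
    pred_frames = torch.tensor([2, 4, 3])      # durations predicted by the duration model
    frame_feat = torch.repeat_interleave(phone_feat, pred_frames, dim=0)
    print(frame_feat.shape)                    # torch.Size([9, 2]) -> 9 mel frames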
A schematic diagram of the custom-timbre singing synthesis solution according to this embodiment of the present disclosure is described below with reference to FIG. 9. The training stage 900 of the solution shown in FIG. 9 is merely an example and imposes no limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in FIG. 9, the training stage 900 of the custom-timbre singing synthesis solution includes: inputting recognition data into the speaker recognition model for training; the speaker recognition model outputting speaker information; and inputting a cappella data and the speaker information into the a cappella synthesis model for training.
A schematic diagram of the custom-timbre singing synthesis solution according to this embodiment of the present disclosure is described below with reference to FIG. 10. The synthesis stage 1000 of the solution shown in FIG. 10 is merely an example and imposes no limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in FIG. 10, the synthesis stage 1000 of the custom-timbre singing synthesis solution includes: inputting text recording data into the speaker recognition model to obtain speaker information; and inputting the speaker information, the music score, and the lyrics into the a cappella synthesis model to obtain the a cappella singing voice.
A schematic diagram of the custom-timbre singing synthesis solution according to this embodiment of the present disclosure is described below with reference to FIG. 11. The speaker recognition model 1100 of the solution shown in FIG. 11 is merely an example and imposes no limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in FIG. 11, the execution stage of the speaker recognition model 1100 of the custom-timbre singing synthesis solution includes:
(1) Input the test recording and enrollment recordings 1, ..., N into an LSTM, where N is an integer greater than 1. An LSTM (Long Short-Term Memory) network is a recurrent neural network suited to processing and predicting important events with relatively long intervals and delays in a time series. The test recording is processed by the LSTM to output speaker vector 1; the vectors of the enrollment recordings processed by the LSTM are average-pooled to obtain speaker vector 2.
(2) Compute the cosine similarity between speaker vector 1 and speaker vector 2, and compute a scoring function.
(3) Determine, through logistic regression, whether the result of the scoring function is accept or reject.
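For illustration only, the accept/reject decision of steps (2) and (3) could be sketched as follows; the scale, bias, and 0.5 threshold are assumed values, not parameters specified by the disclosure:

    import torch
    import torch.nn.functional as F

    def accept_or_reject(test_vec, enroll_vecs, w=10.0, b=-5.0, threshold=0.5):
        # test_vec: (1, emb_dim) speaker vector 1; enroll_vecs: (N, emb_dim)
        # embeddings of the N enrollment recordings.
        enroll_mean = F.normalize(enroll_vecs.mean(dim=0, keepdim=True), dim=-1)
        cos = F.cosine_similarity(test_vec, enroll_mean)   # cosine similarity
        score = w * cos + b                                # scoring function
        prob = torch.sigmoid(score)                        # logistic regression
        return "accept" if prob.item() > threshold else "reject"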
A schematic diagram of the custom-timbre singing synthesis solution according to this embodiment of the present disclosure is described below with reference to FIG. 12. The a cappella synthesis model 1200 of the solution shown in FIG. 12 is merely an example and imposes no limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in FIG. 12, the a cappella synthesis model 1200 of the custom-timbre singing synthesis solution includes a phoneme duration model 1202, an acoustic model 1204, and a vocoder model 1206, and the training of each module may be performed independently as follows:
(1) The speaker vector, the music score and lyrics, and the phoneme durations are input into the phoneme duration model 1202 for training.
(2) The speaker vector, the music score and lyrics, the phoneme durations, and the mel spectrogram are input into the acoustic model 1204 for training.
(3) The mel spectrogram and the song are input into the vocoder model 1206 for training.
Specifically, the custom-timbre singing synthesis solution includes a speaker recognition model and an a cappella synthesis model. The a cappella synthesis model involves a training process and an inference process, and comprises a phoneme duration model, an acoustic model, and a neural-network vocoder model.
For example, the phoneme duration model may be a DNN (Deep Neural Network) model composed of three fully connected layers; its input is the music score and lyrics, and its output is the phoneme durations. At prediction time, only the music score is known.
For example, the speaker vector is also fed into the phoneme duration model during training, so that different phoneme duration models are obtained for different speakers.
For example, the input of the acoustic model is the music score and the phoneme durations, and its output is the mel spectrogram; the speaker vector is also input into the acoustic model.
For example, the input of the vocoder model is the mel spectrogram and its output is audio. One possible implementation of the acoustic model is a deep neural network composed of three LSTM layers; it may also be a more complex model with an attention mechanism.
For example, the vocoder model may adopt an LPCNet vocoder (Improving Neural Speech Synthesis Through Linear Prediction).
For example, in the inference process, given the music score and lyrics as well as the singer's speaker vector, the phoneme duration model, acoustic model, and vocoder model obtained during training can be used to output the synthesized song.
For example, both the training and the inference process start from the music score and lyrics and perform text analysis, phoneme extraction, word segmentation, and other feature extraction, followed by phoneme-duration prediction or training of the phoneme duration model.
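As an illustration of such a text-analysis front end (the concrete tools below are our own choice and are not named by the disclosure), word segmentation and grapheme-to-phoneme conversion for Chinese lyrics could be sketched as:

    import jieba
    from pypinyin import Style, lazy_pinyin

    lyrics = "你好世界"                                # placeholder lyrics
    words = jieba.lcut(lyrics)                        # word segmentation
    phones = lazy_pinyin(lyrics, style=Style.TONE3)   # grapheme-to-phoneme (pinyin)
    print(words)    # ['你好', '世界']
    print(phones)   # ['ni3', 'hao3', 'shi4', 'jie4']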
A schematic diagram of the custom-timbre singing synthesis solution according to this embodiment of the present disclosure is described below with reference to FIG. 13. The solution shown in FIG. 13 is merely an example and imposes no limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in FIG. 13, the execution steps of the custom-timbre singing synthesis solution include:
Step S1302: input the speaker vector to be synthesized.
Step S1304: obtain the music score and lyrics.
Step S1306: text analysis.
Step S1308: feature extraction.
Step S1310: phoneme duration prediction.
Step S1312: frame-level expansion.
Step S1314: mel-spectrogram prediction.
Step S1316: synthesize the song.
The speaker vector is extracted from the speaker's acoustic features using a deep neural network. Through the speaker information, the timbre of the synthesized a cappella voice can be controlled. The present disclosure trains an a cappella synthesis model on a dataset of read speech and a cappella singing recorded by a large number of speakers. When an a cappella singing voice needs to be synthesized for a new speaker, only a small amount of read speech from that speaker needs to be recorded; the speaker's speaker vector is extracted and input into the a cappella synthesis model, which, combined with the music score and lyrics, generates the speaker's a cappella voice through the model's inference process, that is, the custom synthesized singing voice.
For example, an a cappella dataset containing multiple timbres and multiple singing voices is constructed to train a base model that can synthesize an a cappella voice given a music score and lyrics; in addition, the dataset also needs to contain recordings of singers reading specified texts.
For example, a text-dependent speaker recognition model is trained, and the output of an intermediate hidden layer of the model is taken and defined as the speaker vector.
For example, a singer's recording of the specified text is fed into the speaker recognition model to obtain the speaker vector; this speaker vector then becomes part of the a cappella singing model, and an a cappella synthesis model containing the singer's identity information is trained on the large multi-timbre, multi-voice a cappella dataset.
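One way to take an intermediate-hidden-layer output as the speaker vector, illustrated here with the hypothetical SpeakerEncoder instance (encoder) from the earlier sketch, is to attach a forward hook to the layer of interest; which layer is used and how its frames are pooled are design choices the disclosure leaves open:

    import torch

    captured = {}

    def hook(module, inputs, output):
        captured["hidden"] = output[0]            # nn.LSTM returns (output, (h, c))

    # attach the hook to the layer whose output is taken as the speaker vector
    handle = encoder.lstm.register_forward_hook(hook)
    with torch.no_grad():
        _ = encoder(torch.randn(1, 200, 80))      # a short recording of the new speaker
    speaker_vector = captured["hidden"][:, -1].mean(dim=0)   # pooled hidden state
    handle.remove()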
A custom-timbre singing synthesis apparatus 1400 according to this embodiment of the present disclosure is described below with reference to FIG. 14. The apparatus 1400 shown in FIG. 14 is merely an example and imposes no limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in FIG. 14, the custom-timbre singing synthesis apparatus 1400 is presented in the form of hardware modules. Its components may include, but are not limited to: a first training module 1402, a second training module 1404, a recognition module 1406, and a synthesis module 1408.
The first training module 1402 is configured to train a first neural network with speaker recording samples to obtain a speaker recognition model, the training output of the first neural network being speaker vector samples.
The second training module 1404 is configured to train a second neural network with a cappella singing samples and the speaker vector samples to obtain an a cappella synthesis model.
The recognition module 1406 is configured to input the speaker recording to be synthesized into the speaker recognition model and obtain the speaker information output by an intermediate hidden layer of the speaker recognition model.
The synthesis module 1408 is configured to input the a cappella music information to be synthesized and the speaker information into the a cappella synthesis model to obtain the synthesized custom-timbre singing voice.
An electronic device 1500 according to this embodiment of the present disclosure is described below with reference to FIG. 15. The electronic device 1500 shown in FIG. 15 is merely an example and imposes no limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in FIG. 15, the electronic device 1500 is presented in the form of a general-purpose computing device. Its components may include, but are not limited to: at least one processing unit 1510, at least one storage unit 1520, and a bus 1530 connecting the different system components (including the storage unit 1520 and the processing unit 1510).
The storage unit stores program code that can be executed by the processing unit 1510, so that the processing unit 1510 performs the steps according to the various exemplary embodiments of the present disclosure described in the "Exemplary Methods" section above. For example, the processing unit 1510 may perform the steps defined in the custom-timbre singing synthesis method of the present disclosure.
The storage unit 1520 may include readable media in the form of volatile storage units, such as a random access memory (RAM) 15201 and/or a cache memory 15202, and may further include a read-only memory (ROM) 15203.
The storage unit 1520 may also include a program/utility 15204 having a set of (at least one) program modules 15205, including but not limited to: an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
The bus 1530 may represent one or more of several types of bus structures, including a storage-unit bus or storage-unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
The electronic device 1500 may also communicate with one or more external devices 1540 (such as a keyboard, a pointing device, or a Bluetooth device), with one or more devices that enable a user to interact with the electronic device, and/or with any device (such as a router or a modem) that enables the electronic device 1500 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 1550. Furthermore, the electronic device 1500 may communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 1560, which communicates with the other modules of the electronic device 1500 through the bus 1530. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
From the description of the above embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solutions according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) or on a network, and which includes a number of instructions for causing a computing device (which may be a personal computer, a server, a terminal apparatus, a network device, or the like) to execute the methods according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium is also provided, on which a program product capable of implementing the above methods of this specification is stored. In some possible embodiments, the various aspects of the present disclosure may also be implemented in the form of a program product including program code; when the program product runs on a terminal device, the program code causes the terminal device to perform the steps according to the various exemplary embodiments of the present disclosure described in the "Exemplary Methods" section above.
The program product for implementing the above methods according to an embodiment of the present disclosure may adopt a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the present disclosure is not limited thereto; in this document, a readable storage medium may be any tangible medium containing or storing a program that can be used by, or in connection with, an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device.
The program code contained on the readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wired, optical cable, RF, or any suitable combination of the above.
Program code for carrying out the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In scenarios involving a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
It should be noted that, although several modules or units of a device for performing actions are mentioned in the detailed description above, such division is not mandatory. In fact, according to the embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit; conversely, the features and functions of one module or unit described above may be further divided and embodied by multiple modules or units.
Furthermore, although the steps of the methods of the present disclosure are depicted in a particular order in the drawings, this does not require or imply that these steps must be performed in that particular order, or that all of the illustrated steps must be performed, to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
From the description of the above embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solutions according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) or on a network, and which includes a number of instructions for causing a computing device (which may be a personal computer, a server, a mobile terminal, a network device, or the like) to execute the methods according to the embodiments of the present disclosure.
Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. The present disclosure is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the appended claims.
Industrial Applicability
In the solution provided by the present disclosure, a first neural network is trained with speaker recording samples to obtain a speaker recognition model, the training output of the first neural network being speaker vector samples, and a second neural network is trained with a cappella singing samples and the speaker vector samples to obtain an a cappella synthesis model, which improves the efficiency of model synthesis and removes the need to collect a large amount of recorded data to build a corpus. Further, the speaker recording to be synthesized is input into the speaker recognition model to obtain the speaker information output by its intermediate hidden layer, and the a cappella music information to be synthesized together with the speaker information is input into the a cappella synthesis model to obtain the synthesized custom-timbre singing voice; a user's custom timbre can thus be trained with only a small amount of corpus, and singing synthesis is achieved by adjusting the rhythm and pitch of the synthesized speech, reducing the time and training corpus required and improving the efficiency of custom-timbre singing synthesis.

Claims (11)

  1. A method for synthesizing a custom-timbre singing voice, characterized in that the method comprises:
    training a first neural network with speaker recording samples to obtain a speaker recognition model, wherein a training output of the first neural network is speaker vector samples;
    training a second neural network with a cappella singing samples and the speaker vector samples to obtain an a cappella synthesis model;
    inputting a speaker recording to be synthesized into the speaker recognition model, and obtaining speaker information output by an intermediate hidden layer of the speaker recognition model;
    inputting a cappella music information to be synthesized and the speaker information into the a cappella synthesis model to obtain a synthesized custom-timbre singing voice.
  2. The method for synthesizing a custom-timbre singing voice according to claim 1, characterized in that training the first neural network with speaker recording samples to obtain the speaker recognition model comprises:
    dividing the speaker recording samples into test recording samples and enrollment recording samples, and inputting them into the first neural network;
    outputting, by the first neural network, enrollment recording features from the enrollment recording samples, and average-pooling the enrollment recording features to obtain an enrollment recording vector;
    outputting, by the first neural network, a test recording vector from the test recording samples;
    performing a cosine similarity calculation on the enrollment recording vector and the test recording vector to obtain a cosine similarity result;
    optimizing parameters of the first neural network using the cosine similarity result and a regression function until a loss value of the regression function is minimized;
    determining the parameter-optimized first neural network as the speaker recognition model.
  3. The method for synthesizing a custom-timbre singing voice according to claim 1, characterized in that the a cappella synthesis model comprises a duration model, an acoustic model and a vocoder model, and training the second neural network with a cappella singing samples and the speaker vector samples to obtain the a cappella synthesis model comprises:
    parsing music score samples, lyric samples and phoneme duration samples from the a cappella singing samples;
    training the duration model with the speaker vector samples, the music score samples, the lyric samples and the phoneme duration samples, wherein an output of the duration model is duration prediction samples.
  4. The method for synthesizing a custom-timbre singing voice according to claim 1, characterized in that the a cappella synthesis model comprises a duration model, an acoustic model and a vocoder model, and training the second neural network with a cappella singing samples and the speaker vector samples to obtain the a cappella synthesis model comprises:
    parsing music score samples, lyric samples and phoneme duration samples from the a cappella singing samples;
    extracting mel-spectrogram samples from songs in the a cappella singing samples;
    training the acoustic model with the speaker vector samples, the phoneme duration samples, the music score samples, the lyric samples and the mel-spectrogram samples, wherein an output of the acoustic model is mel-spectrogram prediction samples.
  5. The method for synthesizing a custom-timbre singing voice according to claim 1, characterized in that the a cappella synthesis model comprises a duration model, an acoustic model and a vocoder model, and training the second neural network with a cappella singing samples and the speaker vector samples to obtain the a cappella synthesis model comprises:
    extracting mel-spectrogram samples from songs in the a cappella singing samples;
    training the vocoder model with the mel-spectrogram samples, wherein an output of the vocoder model is audio prediction samples.
  6. The method for synthesizing a custom-timbre singing voice according to any one of claims 1 to 5, characterized in that the a cappella synthesis model comprises a duration model, an acoustic model and a vocoder model, and inputting the a cappella music information to be synthesized and the speaker information into the a cappella synthesis model to obtain the synthesized custom-timbre singing voice comprises:
    parsing a music score to be synthesized and lyrics to be synthesized from the a cappella music information;
    inputting the speaker information, the music score to be synthesized and the lyrics to be synthesized into the duration model, wherein an output of the duration model is a duration prediction result to be synthesized;
    inputting the duration prediction result, the speaker information, the music score to be synthesized and the lyrics to be synthesized into the acoustic model, wherein an output of the acoustic model is a mel-spectrogram prediction result to be synthesized;
    inputting the mel-spectrogram prediction result into the vocoder model, wherein an output of the vocoder model is the synthesized custom-timbre singing voice.
  7. The method for synthesizing a custom-timbre singing voice according to claim 6, characterized in that parsing the music score to be synthesized and the lyrics to be synthesized from the a cappella music information comprises:
    performing text analysis and feature extraction on the music score and lyrics in the a cappella music information to obtain the music score to be synthesized and the lyrics to be synthesized.
  8. The method for synthesizing a custom-timbre singing voice according to claim 6, characterized in that inputting the duration prediction result, the speaker information, the music score to be synthesized and the lyrics to be synthesized into the acoustic model, wherein the output of the acoustic model is the mel-spectrogram prediction result to be synthesized, comprises:
    performing frame-level expansion on the duration prediction result, the music score to be synthesized and the lyrics to be synthesized;
    inputting a result of the frame-level expansion and the speaker information into the acoustic model, wherein the output of the acoustic model is the mel-spectrogram prediction result to be synthesized.
  9. An apparatus for synthesizing a custom-timbre singing voice, characterized by comprising:
    a first training module, configured to train a first neural network with speaker recording samples to obtain a speaker recognition model, wherein a training output of the first neural network is speaker vector samples;
    a second training module, configured to train a second neural network with a cappella singing samples and the speaker vector samples to obtain an a cappella synthesis model;
    a recognition module, configured to input a speaker recording to be synthesized into the speaker recognition model and obtain speaker information output by an intermediate hidden layer of the speaker recognition model;
    a synthesis module, configured to input a cappella music information to be synthesized and the speaker information into the a cappella synthesis model to obtain a synthesized custom-timbre singing voice.
  10. An electronic device, characterized by comprising:
    a processor; and
    a memory, configured to store executable instructions of the processor;
    wherein the processor is configured to perform, by executing the executable instructions, the method for synthesizing a custom-timbre singing voice according to any one of claims 1 to 8.
  11. A computer-readable storage medium on which a computer program is stored, characterized in that
    when executed by a processor, the computer program implements the method for synthesizing a custom-timbre singing voice according to any one of claims 1 to 8.
PCT/CN2021/140858 2021-01-20 2021-12-23 定制音色歌声的合成方法、装置、电子设备和存储介质 WO2022156479A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2023516595A JP2023541182A (ja) 2021-01-20 2021-12-23 カスタム音色歌声の合成方法、装置、電子機器及び記憶媒体
US18/252,186 US20230410786A1 (en) 2021-01-20 2021-12-23 Custom tone and vocal synthesis method and apparatus, electronic device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110076168.0 2021-01-20
CN202110076168.0A CN113781993A (zh) 2021-01-20 2021-01-20 定制音色歌声的合成方法、装置、电子设备和存储介质

Publications (1)

Publication Number Publication Date
WO2022156479A1 true WO2022156479A1 (zh) 2022-07-28

Family

ID=78835523

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/140858 WO2022156479A1 (zh) 2021-01-20 2021-12-23 定制音色歌声的合成方法、装置、电子设备和存储介质

Country Status (4)

Country Link
US (1) US20230410786A1 (zh)
JP (1) JP2023541182A (zh)
CN (1) CN113781993A (zh)
WO (1) WO2022156479A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113781993A (zh) * 2021-01-20 2021-12-10 北京沃东天骏信息技术有限公司 定制音色歌声的合成方法、装置、电子设备和存储介质

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPWO2023276234A1 (zh) * 2021-06-29 2023-01-05

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104766603A (zh) * 2014-01-06 2015-07-08 安徽科大讯飞信息科技股份有限公司 构建个性化歌唱风格频谱合成模型的方法及装置
CN108461079A (zh) * 2018-02-02 2018-08-28 福州大学 一种面向音色转换的歌声合成方法
US20200135172A1 (en) * 2018-10-26 2020-04-30 Google Llc Sample-efficient adaptive text-to-speech
CN111583900A (zh) * 2020-04-27 2020-08-25 北京字节跳动网络技术有限公司 歌曲合成方法、装置、可读介质及电子设备
CN111862937A (zh) * 2020-07-23 2020-10-30 平安科技(深圳)有限公司 歌声合成方法、装置及计算机可读存储介质
CN113781993A (zh) * 2021-01-20 2021-12-10 北京沃东天骏信息技术有限公司 定制音色歌声的合成方法、装置、电子设备和存储介质

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150025892A1 (en) * 2012-03-06 2015-01-22 Agency For Science, Technology And Research Method and system for template-based personalized singing synthesis
JP6252420B2 (ja) * 2014-09-30 2017-12-27 ブラザー工業株式会社 音声合成装置、及び音声合成システム
CN111354332A (zh) * 2018-12-05 2020-06-30 北京嘀嘀无限科技发展有限公司 一种歌声合成方法及装置
CN111681637B (zh) * 2020-04-28 2024-03-22 平安科技(深圳)有限公司 歌曲合成方法、装置、设备及存储介质
CN111798821B (zh) * 2020-06-29 2022-06-14 北京字节跳动网络技术有限公司 声音转换方法、装置、可读存储介质及电子设备
CN111899720B (zh) * 2020-07-30 2024-03-15 北京字节跳动网络技术有限公司 用于生成音频的方法、装置、设备和介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104766603A (zh) * 2014-01-06 2015-07-08 安徽科大讯飞信息科技股份有限公司 构建个性化歌唱风格频谱合成模型的方法及装置
CN108461079A (zh) * 2018-02-02 2018-08-28 福州大学 一种面向音色转换的歌声合成方法
US20200135172A1 (en) * 2018-10-26 2020-04-30 Google Llc Sample-efficient adaptive text-to-speech
CN111583900A (zh) * 2020-04-27 2020-08-25 北京字节跳动网络技术有限公司 歌曲合成方法、装置、可读介质及电子设备
CN111862937A (zh) * 2020-07-23 2020-10-30 平安科技(深圳)有限公司 歌声合成方法、装置及计算机可读存储介质
CN113781993A (zh) * 2021-01-20 2021-12-10 北京沃东天骏信息技术有限公司 定制音色歌声的合成方法、装置、电子设备和存储介质

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HEYANG XUE; SHAN YANG; YI LEI; LEI XIE; XIULIN LI: "Learn2Sing: Target Speaker Singing Voice Synthesis by learning from a Singing Teacher", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 17 November 2020 (2020-11-17), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081815923 *
SERCAN ARIK, GREGORY DIAMOS, ANDREW GIBIANSKY, JOHN MILLER, KAINAN PENG, WEI PING, JONATHAN RAIMAN, YANQI ZHOU: "Deep Voice 2: Multi-Speaker Neural Text-to-Speech", 24 May 2017 (2017-05-24), XP055491751, Retrieved from the Internet <URL:http://papers.nips.cc/paper/6889-deep-voice-2-multi-speaker-neural-text-to-speech.pdf> *


Also Published As

Publication number Publication date
US20230410786A1 (en) 2023-12-21
JP2023541182A (ja) 2023-09-28
CN113781993A (zh) 2021-12-10


Legal Events

Code  Title / Description
121   Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21920853; Country of ref document: EP; Kind code of ref document: A1)
ENP   Entry into the national phase (Ref document number: 2023516595; Country of ref document: JP; Kind code of ref document: A)
NENP  Non-entry into the national phase (Ref country code: DE)
122   Ep: pct application non-entry in european phase (Ref document number: 21920853; Country of ref document: EP; Kind code of ref document: A1)