CN113963717A - Cross-language song synthesis method and device, equipment, medium and product thereof - Google Patents

Cross-language song synthesis method and device, equipment, medium and product thereof

Info

Publication number
CN113963717A
CN113963717A (application number CN202111257558.4A)
Authority
CN
China
Prior art keywords
song
information
target
characteristic information
phoneme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111257558.4A
Other languages
Chinese (zh)
Inventor
劳振锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Kugou Computer Technology Co Ltd
Original Assignee
Guangzhou Kugou Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Kugou Computer Technology Co Ltd filed Critical Guangzhou Kugou Computer Technology Co Ltd
Priority to CN202111257558.4A priority Critical patent/CN113963717A/en
Publication of CN113963717A publication Critical patent/CN113963717A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0008 Associated control or indicating means
    • G10H1/0025 Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/101 Music Composition or musical creation; Tools or processes therefor
    • G10H2210/125 Medley, i.e. linking parts of different musical pieces in one single piece, e.g. sound collage, DJ mix
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00 Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/471 General musical sound synthesis principles, i.e. sound category-independent synthesis methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a cross-language song synthesis method and a device, equipment, medium and product thereof. The method comprises the following steps: acquiring a target music score and synthesis configuration information of a target song, wherein the synthesis configuration information comprises a song singing language, a target pitch object and a target tone object; calling the phoneme dictionary corresponding to the song singing language to encode the target music score, so as to obtain phoneme characteristic information and tone sequence characteristic information of the target song; encoding and decoding song synthesis characteristic information with an acoustic model to obtain Mel frequency spectrum information, wherein the song synthesis characteristic information comprises the phoneme characteristic information, the tone sequence characteristic information, pitch characteristic information of the target pitch object and tone characteristic information of the target tone object; and converting the Mel frequency spectrum information into the target song with a vocoder. The method can provide a cross-language song synthesis service, synthesizing target songs in multiple singing languages as required with the same acoustic model.

Description

Cross-language song synthesis method and device, equipment, medium and product thereof
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a cross-lingual song synthesis method, and a corresponding apparatus, computer device, computer-readable storage medium, and computer program product.
Background
At present, the song synthesis technology is generally implemented by using a neural network model, and particularly implemented by using a pre-trained acoustic model and a vocoder, and the basic principle of the technology is to combine a plurality of acoustic features according to a related music score to obtain corresponding singing audio data.
On one hand, because the acoustic model needs to be pre-trained and the copyright cost of the related songs is high, the overall cost of song synthesis places high demands on the corresponding technical implementation, and a good technical scheme is key to saving the corresponding training cost.
On the other hand, in practice it is often desirable to synthesize the same song in different languages; for example, a user may wish to obtain versions of the same song sung in different languages based on his or her own timbre.
In addition, in the prior art, the acoustic models of the various languages operate independently of one another, so cross-language services cannot be realized. This brings inconvenience to interface scheduling for online music synthesis, and the sound quality of the synthesized songs cannot be kept consistent: acoustic models of different languages may produce synthesized songs with completely different sound-quality effects, which may confuse the user.
In summary, there is still room for improvement in the conventional song synthesis technology in terms of processing the synthesis of songs in different languages.
Disclosure of Invention
It is a primary object of the present application to solve at least one of the above problems and to provide a cross-lingual song synthesizing method and a corresponding apparatus, computer device, computer readable storage medium, and computer program product thereof, so as to implement assisted music creation.
In order to meet various purposes of the application, the following technical scheme is adopted in the application:
a cross-lingual song synthesizing method adapted to one of the purposes of the present application, comprising the steps of:
acquiring a target music score and synthesis configuration information of a target song, wherein the synthesis configuration information comprises a song singing language, a target pitch object and a target tone object;
calling a corresponding phoneme dictionary according to the song singing language to encode the target music score to obtain phoneme characteristic information and sound order characteristic information of the target song, wherein the phoneme dictionary comprises a mapping relation between phonemes of the corresponding language and encoding numerical values;
coding and decoding are carried out according to song synthesis characteristic information by adopting a pre-trained acoustic model to obtain Mel frequency spectrum information, wherein the song synthesis characteristic information comprises the phoneme characteristic information, the tone sequence characteristic information, pitch characteristic information generated corresponding to the target pitch object and preset tone characteristic information generated corresponding to the target tone object;
and converting the Mel frequency spectrum information into audio data corresponding to the target song by adopting a vocoder.
In a further embodiment, the method for encoding the target music score by calling the corresponding phoneme dictionary according to the song singing language to obtain the phoneme characteristic information and the phonetic sequence characteristic information of the target song includes the following steps:
according to the song singing language, determining a phoneme dictionary corresponding to the singing language from a phoneme dictionary library, wherein the phoneme dictionary library comprises a plurality of phoneme dictionaries corresponding to different singing languages;
for each phoneme in the lyric pronunciation marking information, in the singing language, of the lyric text in the target music score, searching the coding numerical value corresponding to that phoneme from the phoneme dictionary, and constructing phoneme characteristic information corresponding to the lyrics;
and coding the sound sequence characteristic information corresponding to the phoneme characteristic information according to the position information of each phoneme.
In an extended embodiment, before the step of adopting a pre-trained acoustic model to encode and decode according to the song synthesis characteristic information to obtain the Mel frequency spectrum information, the method comprises the following steps:
generating note characteristic information of the target song according to the melody labeling information in the target music score;
inputting the note characteristic information, phoneme characteristic information and tone sequence characteristic information of the target song into a pre-trained pitch generation model matched with the control parameters of the target object so as to generate pitch characteristic information of the target pitch object;
calling tone characteristic information of the target tone object from a preset tone characteristic library according to the target tone object;
and splicing the phoneme characteristic information, the tone sequence characteristic information, the pitch characteristic information of the target pitch object and the tone characteristic information of the target tone object into song synthesis characteristic information.
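Purely by way of illustration, the splicing of the four kinds of characteristic information may be pictured as a concatenation of frame-aligned arrays along the feature dimension; the following sketch is a minimal example in which the array names, dimensions and random contents are assumptions, not part of the application:

```python
import numpy as np

# Hypothetical frame-aligned feature sequences for a target song of T frames.
T = 200
phoneme_feat = np.random.randint(0, 60, size=(T, 1))   # phoneme encoding values (from the phoneme dictionary)
order_feat   = np.arange(T).reshape(T, 1)               # tone sequence (phoneme position) information
pitch_feat   = np.random.rand(T, 1).astype(np.float32)  # pitch feature of the target pitch object
timbre_feat  = np.random.rand(256).astype(np.float32)   # timbre (voiceprint) vector of the target timbre object

# Broadcast the utterance-level timbre vector to every frame, then splice along the feature axis.
timbre_frames = np.tile(timbre_feat, (T, 1))
song_synthesis_feat = np.concatenate(
    [phoneme_feat, order_feat, pitch_feat, timbre_frames], axis=-1
)
print(song_synthesis_feat.shape)  # (200, 259)
```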
In a further embodiment, the method for obtaining the Mel frequency spectrum information by encoding and decoding the song synthesis characteristic information by using a pre-trained acoustic model comprises the following steps:
coding the song synthesis characteristic information set by adopting a coding network in an acoustic model to obtain a coded coding characteristic vector;
performing down-sampling processing on the coded coding feature vector to obtain a down-sampled coding feature vector;
performing feature recombination processing on the down-sampled coding feature vector by adopting an attention mechanism to obtain a coding feature vector recombined according to context information;
and decoding the recombined coding characteristic vector by adopting a decoding network in the acoustic model to obtain Mel frequency spectrum information.
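Purely by way of illustration, the encode / down-sample / attention-based recombination / decode flow can be sketched as follows; the network type (LSTM), layer sizes and pooling factor are assumptions and do not describe the application's actual acoustic model:

```python
import torch
import torch.nn as nn

class SketchAcousticModel(nn.Module):
    """Illustrative only: encode -> down-sample -> self-attention recombination -> decode to mel frames."""
    def __init__(self, in_dim=259, hidden=256, n_mels=80):
        super().__init__()
        self.encoder = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.downsample = nn.AvgPool1d(kernel_size=2, stride=2)        # halves the time axis
        self.attention = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        self.decoder = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.mel_proj = nn.Linear(hidden, n_mels)

    def forward(self, song_feat):                # (B, T, in_dim) song synthesis characteristic information
        enc, _ = self.encoder(song_feat)         # (B, T, 2H) encoded feature vectors
        enc = self.downsample(enc.transpose(1, 2)).transpose(1, 2)     # (B, T/2, 2H) down-sampled
        ctx, _ = self.attention(enc, enc, enc)   # recombine features according to context information
        dec, _ = self.decoder(ctx)
        return self.mel_proj(dec)                # (B, T/2, n_mels) Mel frequency spectrum information

mel = SketchAcousticModel()(torch.randn(1, 200, 259))
print(mel.shape)   # torch.Size([1, 100, 80])
```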
In a further embodiment, after the decoding network in the acoustic model is used to decode the recombined coding feature vector to obtain mel-frequency spectrum information, the method further includes the following steps:
residual error pre-estimation processing is carried out on the Mel frequency spectrum information of the audio data obtained from the decoding network by adopting a residual error pre-estimation network, so as to obtain residual error information;
and correcting the Mel frequency spectrum information of the audio data based on the residual error information to obtain the corrected Mel frequency spectrum information.
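The residual estimation step is reminiscent of the post-net refinement used in common sequence-to-sequence synthesizers; the following minimal sketch assumes a small convolutional residual network, with layer counts and channel sizes chosen arbitrarily:

```python
import torch
import torch.nn as nn

class ResidualPostNet(nn.Module):
    """Predicts a residual over the decoded mel spectrum and adds it back as a correction."""
    def __init__(self, n_mels=80, channels=256, kernel=5, layers=3):
        super().__init__()
        convs = []
        dims = [n_mels] + [channels] * (layers - 1) + [n_mels]
        for i in range(layers):
            convs += [nn.Conv1d(dims[i], dims[i + 1], kernel, padding=kernel // 2), nn.Tanh()]
        self.net = nn.Sequential(*convs[:-1])    # no activation on the final residual layer

    def forward(self, mel):                      # mel: (B, T, n_mels) from the decoding network
        residual = self.net(mel.transpose(1, 2)).transpose(1, 2)
        return mel + residual                    # corrected Mel frequency spectrum information

corrected = ResidualPostNet()(torch.randn(1, 100, 80))
print(corrected.shape)  # torch.Size([1, 100, 80])
```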
In an embodiment, the method for converting the mel-frequency spectrum information into the audio data corresponding to the target song by using the vocoder comprises the following steps:
obtaining first audio data of a vocal singing part of a corresponding target song output by the acoustic model;
acquiring second audio data of background music corresponding to the target song;
extracting basic music information followed in common by the background music and the target music score of the target song, wherein the basic music information comprises the tempo, the time signature and the key signature;
synthesizing the first audio data and the second audio data into audio data corresponding to a target song according to the music basic information;
and outputting the audio data corresponding to the target song.
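For illustration only, the final mixing step can be sketched in the waveform domain as below, assuming both renditions already follow the same tempo, time signature and key; the file paths, gains and the soundfile dependency are assumptions:

```python
import numpy as np
import soundfile as sf   # assumed available; any WAV I/O library would do

def mix_vocal_with_backing(vocal_path, backing_path, out_path, vocal_gain=1.0, backing_gain=0.6):
    # Assumes mono waveforms; the gains are arbitrary illustrative values.
    vocal, sr_v = sf.read(vocal_path, dtype="float32")
    backing, sr_b = sf.read(backing_path, dtype="float32")
    assert sr_v == sr_b, "expect both renditions to share one sample rate"

    # Because the vocal and the backing follow the same tempo / time signature / key from the score,
    # alignment reduces to padding the shorter signal to the common length.
    n = max(len(vocal), len(backing))
    vocal = np.pad(vocal, (0, n - len(vocal)))
    backing = np.pad(backing, (0, n - len(backing)))

    mix = vocal_gain * vocal + backing_gain * backing
    mix /= max(1.0, np.abs(mix).max())           # avoid clipping
    sf.write(out_path, mix, sr_v)

# mix_vocal_with_backing("vocal.wav", "backing.wav", "target_song.wav")
```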
In an extended embodiment, the acoustic model is pre-trained, and the training process comprises the following steps:
acquiring a training sample set, wherein the training sample set comprises a plurality of groups of training samples, the training samples comprise song samples of different singing languages sung by the same singer and song samples of different singing languages sung by different singers respectively, and each group of song samples comprises corresponding audio data of a song and pronunciation marking information of the song lyrics;
for each group of training samples, performing iterative training of the following process by taking the acoustic model as a target training model:
coding according to the pronunciation label information to obtain phoneme feature information and phonetic sequence feature information corresponding to the song sample, wherein in the phoneme feature information, phonemes in the lyric pronunciation label information of the same singing language are represented according to coding values of a phoneme dictionary corresponding to the singing language in a phoneme dictionary library;
extracting pitch characteristic information of the song samples by adopting a preset algorithm;
extracting tone characteristic information corresponding to a singing singer of a song sample in a pre-trained tone extraction model, and constructing a tone characteristic library for storing mapping relation data between the tone characteristic information and the singing singer;
extracting original Mel frequency spectrum information of the song samples by adopting a preset algorithm;
inputting the phoneme characteristic information, the tone sequence characteristic information, the pitch characteristic information and the tone characteristic information of the training sample into a target training model to predict Mel frequency spectrum information, supervising the training process by utilizing the original Mel frequency spectrum information, and circularly performing iterative training of the next training sample when the target training model is not converged.
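A schematic training loop consistent with the supervision described above might look as follows; the L1 loss against the original Mel spectrum and the Adam optimizer are illustrative assumptions, not the application's stated choices:

```python
import torch
import torch.nn.functional as F

def train_acoustic_model(model, training_samples, epochs=10, lr=1e-4):
    """training_samples: iterable of dicts holding pre-extracted, frame-aligned tensors
    (phoneme, order, pitch, timbre features and the ground-truth mel of each song sample)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for sample in training_samples:
            song_feat = torch.cat(
                [sample["phoneme"], sample["order"], sample["pitch"], sample["timbre"]], dim=-1
            )                                    # splice into song synthesis characteristic information
            pred_mel = model(song_feat)          # predicted Mel frequency spectrum information
            loss = F.l1_loss(pred_mel, sample["mel"])   # supervise with the original Mel spectrum
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```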
A cross-lingual song synthesizing apparatus adapted to one of the objects of the present application includes: the system comprises a data acquisition module, a music score coding module, an acoustic synthesis module and a frequency spectrum conversion module, wherein the data acquisition module is used for acquiring a target music score and synthesis configuration information of a target song, and the synthesis configuration information comprises a song singing language, a target pitch object and a target tone color object; the music score coding module is used for calling a corresponding phoneme dictionary according to the song singing language to code the target music score to obtain phoneme characteristic information and sound order characteristic information of the target song, and the phoneme dictionary contains a mapping relation between phonemes of the corresponding language and coding numerical values; the acoustic synthesis module is used for coding and decoding according to song synthesis characteristic information by adopting a pre-trained acoustic model to obtain Mel frequency spectrum information, wherein the song synthesis characteristic information comprises the phoneme characteristic information, the tone sequence characteristic information, pitch characteristic information generated corresponding to the target pitch object and preset tone characteristic information generated corresponding to the target tone object; and the frequency spectrum conversion module is used for converting the Mel frequency spectrum information into audio data corresponding to the target song by adopting a vocoder.
In a deepened embodiment, the score encoding module includes: a dictionary calling submodule, configured to determine, according to the singing language of the song, a phoneme dictionary corresponding to the singing language from a phoneme dictionary library, the phoneme dictionary library comprising a plurality of phoneme dictionaries corresponding to different singing languages; a phoneme mapping submodule, configured to search, for each phoneme in the lyric pronunciation tagging information, in the singing language, of the lyric text in the target music score, the coding numerical value corresponding to that phoneme from the phoneme dictionary, and to construct phoneme feature information corresponding to the lyrics; and a sound sequence mapping submodule, configured to encode sound sequence characteristic information corresponding to the phoneme characteristic information according to the position information of each phoneme.
In an extended embodiment, the cross-lingual song synthesizing apparatus of the present application further includes: the note coding module is used for generating note characteristic information of the target song according to the melody labeling information in the target music score; a pitch generation module, configured to input the note feature information, the phoneme feature information, and the musical sequence feature information of the target song into a pre-trained pitch generation model matching the control parameters of the target object, so as to generate pitch feature information of the target pitch object; the tone calling module is used for calling tone characteristic information of the target tone object from a preset tone characteristic library according to the target tone object; and the characteristic splicing module is used for splicing the phoneme characteristic information, the tone sequence characteristic information, the pitch characteristic information of the target pitch object and the tone characteristic information of the target tone object into song synthesis characteristic information.
In a further embodiment, the acoustic synthesis module comprises: the characteristic coding submodule is used for coding the song synthesis characteristic information set by adopting a coding network in the acoustic model to obtain a coded coding characteristic vector; the characteristic sampling submodule is used for performing down-sampling processing on the coded characteristic vector to obtain a down-sampled coded characteristic vector; the characteristic recombination submodule is used for performing characteristic recombination processing on the down-sampled coding characteristic vector by adopting an attention mechanism to obtain a coding characteristic vector recombined according to the context information; and the characteristic decoding submodule is used for decoding the recombined coding characteristic vector by adopting a decoding network in the acoustic model to obtain Mel frequency spectrum information.
In a further embodiment, the acoustic synthesis module further includes: the residual error prediction submodule is used for performing residual error prediction processing on the Mel frequency spectrum information of the audio data obtained from the decoding network by adopting a residual error prediction network to obtain residual error information; and the frequency spectrum correction submodule is used for correcting the Mel frequency spectrum information of the audio data based on the residual error information to obtain the corrected Mel frequency spectrum information.
In an embodied embodiment, the spectrum conversion module includes: a voice acquisition submodule, configured to acquire first audio data, output by the acoustic model, of the vocal singing part corresponding to the target song; an accompaniment acquisition submodule, configured to acquire second audio data of background music corresponding to the target song; a music score extraction submodule, configured to extract basic music information followed in common by the background music and the target music score of the target song, the basic music information comprising the tempo, the time signature and the key signature; a full-song synthesizing submodule, configured to synthesize the first audio data and the second audio data into audio data corresponding to the target song according to the basic music information; and a song output submodule, configured to output the audio data corresponding to the target song.
In an extended embodiment, the cross-lingual song synthesizing apparatus of the present application further includes a training module of the acoustic model, where the training module includes: the system comprises a sample acquisition submodule and a training sample set, wherein the training sample set comprises a plurality of groups of training samples, the training samples comprise song samples in different singing languages sung by the same singer and song samples in different singing languages sung by different singers respectively, and each group of song samples comprises corresponding audio data of a song and pronunciation lyric marking information thereof; and the iterative training submodule is used for performing iterative training of the following process by taking the acoustic model as a target training model for each group of training samples: coding according to the pronunciation label information to obtain phoneme feature information and phonetic sequence feature information corresponding to the song sample, wherein in the phoneme feature information, phonemes in the lyric pronunciation label information of the same singing language are represented according to coding values of a phoneme dictionary corresponding to the singing language in a phoneme dictionary library; extracting pitch characteristic information of the song samples by adopting a preset algorithm; extracting tone characteristic information corresponding to a singing singer of a song sample in a pre-trained tone extraction model, and constructing a tone characteristic library for storing mapping relation data between the tone characteristic information and the singing singer; extracting original Mel frequency spectrum information of the song samples by adopting a preset algorithm; inputting the phoneme characteristic information, the tone sequence characteristic information, the pitch characteristic information and the tone characteristic information of the training sample into a target training model to predict Mel frequency spectrum information, supervising the training process by utilizing the original Mel frequency spectrum information, and circularly performing iterative training of the next training sample when the target training model is not converged.
A computer device adapted for one of the purposes of the present application includes a central processing unit and a memory, the central processing unit is used for invoking and running a computer program stored in the memory to execute the steps of the cross-language song synthesis method described in the present application.
A computer-readable storage medium, which is provided for adapting to another object of the present application, stores a computer program implemented according to the cross-lingual song synthesis method in the form of computer-readable instructions, and when the computer program is called by a computer, executes the steps included in the method.
A computer program product, provided to adapt to another object of the present application, comprises computer programs/instructions which, when executed by a processor, implement the steps of the method described in any of the embodiments of the present application.
Compared with the prior art, the application has the following advantages:
the method comprises preparing multiple phonemic dictionaries of different languages, when a target song is required to be synthesized, calling the phonemic dictionaries corresponding to the languages to synthesize the required corresponding phonemic characteristic information and phonetic sequence characteristic information for the coded song according to the pre-specified languages, inputting the characteristic information, the pitch characteristic information of the pre-specified target pitch object and the tone characteristic information of the target tone object into a pre-trained acoustic model to synthesize the corresponding Mel frequency spectrum information of the target song, and finally converting the corresponding target song by a vocoder according to the Mel frequency spectrum information, wherein the phonemic characteristic information is determined according to the phonemic dictionaries of the corresponding languages in the process, and the acoustic model is pre-trained by the same principle, so that the requirement of synthesizing the songs of different languages can be served based on a unified acoustic model, correspondingly synthesizing a target song corresponding to any pre-trained singing language.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow chart diagram of an exemplary embodiment of a cross-lingual song synthesis method of the present application;
FIG. 2 is a schematic diagram of a network architecture for implementing the cross-lingual song synthesis method of the present application;
FIG. 3 is a flowchart illustrating a process of encoding according to a phoneme dictionary in an embodiment of the present application;
FIG. 4 is a flowchart illustrating a process of obtaining song synthesis feature information required by an acoustic model according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of an acoustic model encoding and decoding process in an embodiment of the present application;
FIG. 6 is a flowchart illustrating a process of synthesizing background music and a target song voice part according to an embodiment of the present application;
FIG. 7 is a flowchart illustrating an iterative training process for an acoustic model according to an embodiment of the present disclosure;
FIG. 8 is a schematic block diagram of a cross-lingual song synthesizing apparatus of the present application;
fig. 9 is a schematic structural diagram of a computer device used in the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an", "the" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As will be appreciated by those skilled in the art, "client," "terminal," and "terminal device" as used herein include both devices that are wireless signal receivers, which are devices having only wireless signal receivers without transmit capability, and devices that are receive and transmit hardware, which have receive and transmit hardware capable of two-way communication over a two-way communication link. Such a device may include: cellular or other communication devices such as personal computers, tablets, etc. having single or multi-line displays or cellular or other communication devices without multi-line displays; PCS (Personal Communications Service), which may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio frequency receiver, a pager, internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; a conventional laptop and/or palmtop computer or other device having and/or including a radio frequency receiver. As used herein, a "client," "terminal device" can be portable, transportable, installed in a vehicle (aeronautical, maritime, and/or land-based), or situated and/or configured to operate locally and/or in a distributed fashion at any other location(s) on earth and/or in space. The "client", "terminal Device" used herein may also be a communication terminal, a web terminal, a music/video playing terminal, such as a PDA, an MID (Mobile Internet Device) and/or a Mobile phone with music/video playing function, and may also be a smart tv, a set-top box, and the like.
The hardware referred to by the names "server", "client", "service node", etc. is essentially an electronic device with the performance of a personal computer, and is a hardware device having necessary components disclosed by the von neumann principle such as a central processing unit (including an arithmetic unit and a controller), a memory, an input device, an output device, etc., a computer program is stored in the memory, and the central processing unit calls a program stored in an external memory into the internal memory to run, executes instructions in the program, and interacts with the input and output devices, thereby completing a specific function.
It should be noted that the concept of "server" as referred to in this application can be extended to the case of a server cluster. According to the network deployment principle understood by those skilled in the art, the servers should be logically divided, and in physical space, the servers may be independent from each other but can be called through an interface, or may be integrated into one physical computer or a set of computer clusters. Those skilled in the art will appreciate this variation and should not be so limited as to restrict the implementation of the network deployment of the present application.
One or more technical features of the present application, unless expressly specified otherwise, may be deployed to a server for implementation by a client remotely invoking an online service interface provided by a capture server for access, or may be deployed directly and run on the client for access.
Unless specified in clear text, the neural network model referred to or possibly referred to in the application can be deployed in a remote server and used for remote call at a client, and can also be deployed in a client with qualified equipment capability for direct call.
Various data referred to in the present application may be stored in a server remotely or in a local terminal device unless specified in the clear text, as long as the data is suitable for being called by the technical solution of the present application.
The person skilled in the art will know this: although the various methods of the present application are described based on the same concept so as to be common to each other, they may be independently performed unless otherwise specified. In the same way, for each embodiment disclosed in the present application, it is proposed based on the same inventive concept, and therefore, concepts of the same expression and concepts of which expressions are different but are appropriately changed only for convenience should be equally understood.
The embodiments to be disclosed herein can be flexibly constructed by cross-linking related technical features of the embodiments unless the mutual exclusion relationship between the related technical features is stated in the clear text, as long as the combination does not depart from the inventive spirit of the present application and can meet the needs of the prior art or solve the deficiencies of the prior art. Those skilled in the art will appreciate variations therefrom.
The cross-language song synthesis method of the present application can be programmed into a computer program product and deployed to run in a client and/or a server, so that the client can access an open interface where the computer program product runs in the form of a web page or an application program, and human-computer interaction is realized with the process of the computer program product through a graphical user interface.
Referring to fig. 1 and 2, in an exemplary embodiment, the method is implemented by the network architecture shown in fig. 2, and includes the following steps:
step S1100, obtaining a target music score and synthesis configuration information of a target song, wherein the synthesis configuration information comprises a song singing language, a target pitch object and a target tone object:
in order to compose a target song of the present application, it is necessary to collect materials required to generate the target song, including a target score and synthesis configuration information.
The synthesis configuration information includes the singing language of the song, the target pitch object and the target tone color object. The song singing language indicates the language in which the target song is to be sung, i.e. the language in which the acoustic model should synthesize the target song. The target pitch object may be a singer label, used so that the target song is virtually sung with the singing skill of the corresponding singer; the target timbre object may also be a singer label, used so that the target song takes on the timbre of the corresponding singer. It should be noted that the target pitch object and the target tone object may point to the same singer or to different singers.
Singing skill refers to a singer's technique in aspects such as pitch variation, rhythm control and breath transition when singing a melody and adapting to each note and note transition in it, that is, the singer's way of handling the words and vocalization of the lyrics. A singer's singing skill is usually embodied in voice, breath, articulation and the like. In acoustic terms, singing skill is expressed in the corresponding frequency spectrum as the singer's pitch variation characteristics. Different singers therefore form personalized pitch variation characteristics through long-term singing habits, and these characteristics can be obtained by means of a neural network model for pitch extraction or other speech synthesis means.
Besides determining the target pitch object and target tone color object required by the target song, a corresponding target music score is further obtained. The target music score is generally composed by a user at the user's client through a song-assisted creation system and comprises a melody score and a lyric text: the melody score is composed of a plurality of note sequences with different note durations, and the lyric text is composed of lyric characters aligned with the notes.
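Purely to make the inputs concrete, the following dataclasses sketch one possible shape for the target music score and the synthesis configuration information; all field names are assumptions:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Note:
    pitch: int        # MIDI note number of the melody note
    duration: float   # note length in beats
    lyric: str        # lyric character aligned with this note

@dataclass
class TargetScore:
    tempo_bpm: float
    time_signature: str       # e.g. "4/4"
    key_signature: str        # e.g. "C major"
    notes: List[Note]

@dataclass
class SynthesisConfig:
    singing_language: str     # e.g. "zh" or "en", selects the phoneme dictionary
    target_pitch_object: str  # singer label whose singing skill (pitch variation) is borrowed
    target_timbre_object: str # singer label whose timbre (voiceprint) is borrowed

config = SynthesisConfig("zh", "singer_A", "singer_B")
```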
Step S1200, calling a corresponding phoneme dictionary according to the song singing language to encode the target music score, so as to obtain phoneme feature information and sound order feature information of the target song, where the phoneme dictionary includes a mapping relationship between phonemes of the corresponding language and encoding numerical values:
in the present application, a phoneme dictionary library is prepared, which includes a plurality of phoneme dictionaries, each storing a mapping relationship between phonemes corresponding to one kind of sung language and their encoding numerical values, for example, the initial consonants "zh, ch, sh" are respectively mapped to "0, 1, 2", and the mapping encoding numerical values may be different in the phoneme dictionaries of different sung languages even if the phonemes are substantially the same. Therefore, the phoneme dictionary library is provided with phoneme dictionaries corresponding to different vocal languages, and when song synthesis is required to be carried out on a certain vocal language, the phoneme dictionary corresponding to the vocal language can be called to be used for coding phonemes. The phoneme dictionary is prepared at the stage of training the acoustic model of the application, so that the acoustic model can also perform relevant coding of phonemes in the training process, and the phoneme characteristic information coded by the phoneme dictionary is learned to perform the song synthesis of the target sung language.
Therefore, when a target song needs to be synthesized, the phoneme dictionary corresponding to the singing language preset in the synthesis configuration information can be called, the phonemes in the target music score of the target song are then converted into phoneme characteristic information by means of the mapping relationship between phonemes and coding numerical values, and at the same time the corresponding tone order characteristic information is encoded according to the position information of each phoneme in the lyric text; a person skilled in the art can implement this flexibly according to the disclosure of the present application.
In one embodiment, to convert the lyric text in the target music score of the target song into its phoneme feature information, the lyric pronunciation marking information of the lyric text in the given singing language is obtained first, each initial consonant and final marked in the lyric pronunciation marking information is then regarded as a phoneme, and the corresponding phoneme dictionary is searched to perform the encoding, so as to obtain the corresponding phoneme feature information and tone order feature information.
Step 1300, using a pre-trained acoustic model to perform encoding and decoding according to song synthesis feature information to obtain mel frequency spectrum information, wherein the song synthesis feature information comprises the phoneme feature information, the tone order feature information, pitch feature information generated corresponding to the target pitch object and preset tone feature information generated corresponding to the target tone object:
the method and the device utilize a pre-trained acoustic model for target song synthesis, and the acoustic model is trained to be suitable for encoding and decoding according to song synthesis characteristic information to obtain the corresponding Mel frequency spectrum information of the target song.
The song synthesizing feature information includes the phoneme feature information, the tone order feature information, the pitch feature information of the target pitch object, and the tone feature information of the target tone object, which are required for synthesizing the target song. Regarding the preparation process of the song synthesis characteristic information, it will be disclosed in the embodiments subsequent to the present application, and those skilled in the art can also refer to the implementation manner of the prior art to obtain the pitch characteristic information and the tone characteristic information, and further construct the song synthesis characteristic information required by the present embodiment for the target song. For example, the pitch characteristic information and the tone characteristic information may be prepared and archived in advance, and when song synthesis characteristic information needs to be prepared, the target pitch object and the target tone object are extracted correspondingly.
The acoustic model, including but not limited to Tacotron2, FastSpeech, DurIAN and the like, is usually developed on the basis of LSTM and BiLSTM network models suitable for processing sequence information. It can be understood that any existing or future implementation that is suitable for encoding and decoding according to the song synthesis characteristic information, converting the song synthesis characteristic information formed by the processing of the present application into the Mel frequency spectrum information required by the target song, can be used to construct the acoustic model required by the present application.
And the acoustic model carries out coding and decoding according to the song synthesis characteristic information, converts the song synthesis characteristic information into Mel frequency spectrum information and enables the pitch characteristic information of the target pitch object and the tone characteristic information of the target tone object to be represented.
Step S1400, a vocoder is adopted to convert the Mel frequency spectrum information into audio data corresponding to the target song:
On the basis of the obtained Mel frequency spectrum information, a vocoder such as WaveNet, World/STRAIGHT or Griffin-Lim is applied to convert the Mel frequency spectrum information into audio data, and the audio data contains the vocal singing part of the target song.
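As one concrete and widely available route among the vocoders mentioned above, Griffin-Lim inversion of a Mel spectrogram can be sketched with librosa; the sampling rate and STFT parameters below are assumptions:

```python
import librosa

def mel_to_waveform(mel_db, sr=22050, n_fft=1024, hop_length=256):
    """mel_db: (n_mels, T) Mel frequency spectrum information in dB, as produced by the acoustic model."""
    mel_power = librosa.db_to_power(mel_db)
    # Griffin-Lim based inversion from a mel spectrogram back to a time-domain waveform.
    return librosa.feature.inverse.mel_to_audio(
        mel_power, sr=sr, n_fft=n_fft, hop_length=hop_length
    )

# wav = mel_to_waveform(predicted_mel)   # predicted_mel comes from step S1300
# sf.write("vocal.wav", wav, 22050)      # hand the vocal track to the later mixing step
```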
As can be seen from the exemplary embodiment, a plurality of phoneme dictionaries are prepared for different languages. When a target song needs to be synthesized, the phoneme dictionary corresponding to the pre-specified singing language is called to encode the phoneme characteristic information and tone sequence characteristic information required for synthesizing the song; this characteristic information, the pitch characteristic information of the pre-specified target pitch object and the tone characteristic information of the target tone object are then input into a pre-trained acoustic model to synthesize the Mel spectrum information corresponding to the target song, and finally a vocoder converts the Mel spectrum information into the corresponding target song. Because the phoneme characteristic information is determined according to the phoneme dictionary of the corresponding language, and the acoustic model is pre-trained on the same principle, a unified acoustic model can serve the need of synthesizing songs in different languages and correspondingly synthesize a target song in any singing language on which it was pre-trained.
It can also be seen that, on the basis of the pre-trained acoustic model, the phoneme dictionary library provides a plurality of phoneme dictionaries corresponding to the singing languages for calling during song synthesis, so that the acoustic model can provide a unified service for song synthesis in different singing languages. This facilitates a unified service interface, gives the synthesized target songs consistent sound quality, and is particularly suitable for a song synthesis service with a unified architecture.
Referring to fig. 3, in a further embodiment, the step S1200, according to the song sung language, calls a corresponding phoneme dictionary to encode the target musical score, so as to obtain phoneme feature information and phonetic sequence feature information of the target song, and includes the following steps:
step 1210, according to the song singing language, determining a phoneme dictionary corresponding to the singing language from a phoneme dictionary library, wherein the phoneme dictionary library comprises a plurality of phoneme dictionaries corresponding to different singing languages:
In this embodiment, the phoneme dictionary required for target song synthesis is determined by means of the phoneme dictionary library in which the mapping relationships between phonemes and encoding numerical values were built at the pre-training stage. As described above, each phoneme dictionary stores only the mapping relationship data between the phonemes of one singing language and their corresponding encoding numerical values; the phoneme dictionary library formed by these phoneme dictionaries therefore stores the mapping relationships between phonemes and encoding numerical values for a plurality of singing languages. When a target song needs to be synthesized, the corresponding phoneme dictionary can be called directly according to the song singing language given in the song synthesis configuration information of the present application.
Step S1220, for each phoneme in the lyric pronunciation label information, in the sung language, of the lyric text in the target musical score, finding the coding numerical value corresponding to that phoneme from the phoneme dictionary, and constructing the phoneme feature information corresponding to the lyrics:
the target music score already provides a lyric text corresponding to a given singing language, and a person skilled in the art knows that corresponding lyric pronunciation label information can be obtained according to the lyric text, for example, corresponding pinyin information can be obtained according to a Chinese lyric text, and the pinyin information is the lyric pronunciation label information; and for the English lyric text, phonetic symbol information of the English lyric text can be acquired as corresponding lyric pronunciation marking information.
According to the lyric pronunciation marking information, each initial consonant and each final in the lyric pronunciation marking information can be decomposed to be used as phonemes, then, a phoneme dictionary corresponding to the singing language is utilized to inquire and obtain a coding numerical value corresponding to each phoneme, and a corresponding phoneme coding vector is constructed to be used as corresponding phoneme characteristic information. It can be understood that the phoneme feature information is a phoneme sequence formed by encoding the lyric pronunciation label information of the song score and the lyric text along the time domain framing of the target song.
Step S1230, encoding the phonetic sequence feature information corresponding to the phoneme feature information according to the position information of each phoneme:
In addition, in order to indicate the position information of each phoneme, the position of each phoneme in the phoneme coding vector is also encoded, forming the corresponding phonetic sequence feature information. Thus, the encoding process corresponding to the lyric text is completed. It is understood that the phonetic sequence feature information is a phoneme position sequence formed by encoding the position information of the phonemes in the phoneme sequence.
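A minimal sketch of steps S1210 to S1230 is given below, assuming toy phoneme dictionary fragments and pre-computed pronunciation annotation; the dictionaries and coding numerical values shown are illustrative only:

```python
# Illustrative phoneme dictionary library: one dictionary per singing language.
PHONEME_DICTIONARY_LIBRARY = {
    "zh": {"n": 0, "i": 1, "h": 2, "ao": 3},          # toy fragment of a Mandarin dictionary
    "en": {"HH": 0, "EH": 1, "L": 2, "OW": 3},        # toy fragment of an English dictionary
}

def encode_lyrics(pronunciation_annotation, singing_language):
    """pronunciation_annotation: phonemes (initials/finals or phonetic symbols) per lyric syllable."""
    dictionary = PHONEME_DICTIONARY_LIBRARY[singing_language]   # step S1210: pick the dictionary
    phoneme_feature = [dictionary[p] for syllable in pronunciation_annotation for p in syllable]
    order_feature = list(range(len(phoneme_feature)))           # step S1230: phoneme position sequence
    return phoneme_feature, order_feature

# "ni hao" annotated as [["n", "i"], ["h", "ao"]]
phonemes, order = encode_lyrics([["n", "i"], ["h", "ao"]], "zh")
print(phonemes, order)   # [0, 1, 2, 3] [0, 1, 2, 3]
```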
In the embodiment, a given singing language is taken as a basis, a corresponding phoneme dictionary is called to encode phoneme characteristic information and phonetic sequence characteristic information of the corresponding language, a pronunciation basis is laid for the synthesis of a target song, and because vectorization basic knowledge is established through the mapping relation between the singing language and the phoneme dictionary, deep semantic representation of the corresponding language is conveniently carried out on the target song by an acoustic model, so that the pre-trained acoustic model can play a role of uniformly serving different singing languages, and the target song of the corresponding language is conveniently synthesized by the acoustic model.
Referring to fig. 4, in the extended embodiment, step S1300, before performing encoding and decoding according to song synthesis feature information by using a pre-trained acoustic model to obtain mel-frequency spectrum information, includes the following steps:
step S1301, generating note characteristic information of the target song according to the melody labeling information in the target music score:
in this embodiment, the pitch feature information in the song synthesis feature information may be generated in real time by using a pitch generation model, and for this purpose, the corresponding note feature information is determined according to the melody marking information provided by the score in the target score, that is, the feature vector of the tune, and is used for indicating the tune information of each note corresponding to the melody in the target song.
Step S1302, inputting the note feature information, the phoneme feature information and the musical sequence feature information of the target song into a pre-trained pitch generation model matching the control parameter of the target object to generate pitch feature information of the target pitch object:
the present application employs a pitch generation model for generating pitch feature information of a target pitch object, which can be constructed based on means of speech signal processing in the art or based on deep semantic learning. The pitch generation model is trained in advance, so that the pitch generation model is suitable for acquiring a corresponding control parameter set by using an identity tag of a target pitch object, and under the action of the control parameter set, pitch characteristic information fusing pitch change characteristics of the target pitch object can be generated.
Under the effect of the pitch generation model, pitch characteristic information corresponding to the target music score can be generated, and the pitch variation characteristics it represents are those of the target pitch object. The singing skill represented by these pitch variation characteristics may cover one or more of rhythm, intonation, breath, the smoothness of transitions between chest voice and falsetto, the prominence of breaks between registers, and the clarity of the voice exhibited when the pronunciation data generated on the basis of this singing skill is played.
Therefore, under the action of the pitch generation model, the target music score is converted into fundamental frequency information, and the fundamental frequency information is corrected by the pitch change characteristics of the target pitch object, so that the fundamental frequency information is fused with the corresponding pitch change characteristics of the target pitch object, and the pitch characteristic information of the target song is obtained.
Therefore, the pitch generation model may be a pitch generation model pre-trained in the prior art, and is adapted to obtain corresponding control parameters according to a target pitch object, and then generate pitch feature information of the target pitch object according to the note feature information, the phoneme feature information and the musical sequence feature information.
In one embodiment, the pitch feature information of the target song may be generated using a pitch generation model according to the following process:
Firstly, a corresponding control parameter set is called according to the identity tag of the target pitch object to configure the pitch generation model; this control parameter set, associated with the identity tag of the target pitch object, was generated by the pitch generation model from training samples consisting of audio data of the target pitch object and the corresponding sample music scores:
in order to adapt to the situation that the pitch generation model is controlled by using the control parameter set, the corresponding control parameter set needs to be determined according to the target pitch object, and the control parameter set is generated by the pitch generation model in advance and is stored in association with the identity tag of the target pitch object. For a pitch generation model implemented by adopting speech parameter synthesis, the control parameter set refers to relevant speech control parameters required by the pitch generation model to implement pitch variation characteristics of a fusion target pitch object; for a deep semantic learning based pitch generation model, such as a single-person or multi-person pitch generation model, the set of control parameters refers to the weighting parameters corresponding to its adaptation to a specific target pitch object.
When the pitch generation model is required to be used for generating corresponding pitch characteristic information by combining the pitch change characteristic of the target pitch object and the target music score, the pitch generation model invokes the corresponding control parameter set according to the identity tag of the target pitch object for configuration, and on the basis, the generation of the pitch characteristic information fused with the pitch change characteristic of the target pitch object based on the target music score can be realized.
It can be understood that the pitch generation model should be trained in advance, and in the training process, the multiple audio data of the target pitch object and the corresponding sample music score are used as training samples, so that the pitch generation model realizes correct classification of the audio data, and corresponds to the identity label of the target pitch object, thereby obtaining the capability of performing pitch change feature extraction and synthesis on the target pitch object. Therefore, those skilled in the art can flexibly deal with the situation according to the selected specific network model, which is not repeated herein.
Then, the pitch generation model generates the pitch feature information fused with the pitch variation characteristics of the target pitch object according to the music notation and the lyric text in the target music score:
The pitch generation model generates the pitch feature information as follows: obtain the target music score, which contains the tune information given by the notation and the character pronunciation information determined from the lyric text; obtain a comprehensive feature vector set of the target music score based on the tune information and the character pronunciation information, this set representing the features of both the tune information and the character pronunciation information and thereby integrating the note feature information, the phoneme feature information and the sound sequence feature information; finally, decode the comprehensive feature vector set to generate the pitch feature information corresponding to the target music score.
Thus, with the pre-trained pitch generation model and the capability it has learned, the corresponding control parameter set can be called conveniently according to the identity tag of the target pitch object, and the pitch variation characteristics of that object are fused into the generated pitch feature information. The pitch feature information can therefore be obtained quickly, improving the processing and production efficiency of assisted music creation.
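As a concrete illustration of the process described above, the following minimal PyTorch sketch shows a pitch generation model configured by a per-singer control parameter set selected through an identity tag. The class name, file paths and feature dimensions are assumptions for illustration only, not the implementation of this application.

```python
import os
import torch
import torch.nn as nn

class PitchGenerator(nn.Module):
    """Minimal sketch: encodes the spliced note/phoneme/order features of the
    target music score and decodes a frame-level fundamental-frequency (pitch)
    contour, conditioned on singer-specific control parameters."""
    def __init__(self, feat_dim=256, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, 1)        # one f0 value per frame

    def forward(self, score_feats):                # (B, T, feat_dim)
        h, _ = self.encoder(score_feats)
        return self.decoder(h).squeeze(-1)         # (B, T) pitch contour

# Control parameter sets stored per identity tag (hypothetical paths).
control_params = {"singer_042": "pitch_params_singer_042.pt"}

model = PitchGenerator()
state_path = control_params.get("singer_042")      # look up by identity tag
if state_path is not None and os.path.exists(state_path):
    model.load_state_dict(torch.load(state_path))  # configure for the target pitch object

score_feats = torch.randn(1, 400, 256)             # spliced note/phoneme/order features
f0_contour = model(score_feats)                    # pitch feature information for the target score
```

In this sketch the per-singer control parameter set is simply a saved set of network weights keyed by the identity tag, which mirrors the deep-learning variant described above; a parametric-synthesis variant would instead store speech control parameters.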
Step S1303, according to the target tone object, calling tone feature information of the target tone object from a preset tone feature library:
As described above, the tone feature information is generated in advance and stored in association with the identity tag of the target tone object, so that it can be called directly through that identity tag. The tone feature information is essentially a voiceprint feature vector, extracted in advance by a pre-trained tone extraction model.
The tone extraction model extracts the corresponding tone feature information from the audio sample data of the target tone object as follows: acquire the audio sample data and extract the Mel spectrum information corresponding to the vocal part from it; extract from the Mel spectrum information a vector matrix set representing the timbre of the target tone object, the set comprising a plurality of vector matrices along the time domain; take the mean vector matrix over the vector matrices in the set as the voiceprint feature information of the target tone object; and generate a tone template for the target tone object, the template comprising the identity tag of the target tone object and the voiceprint feature information it points to.
Extracting the vector matrix set representing the timbre of the target tone object from the Mel spectrum information comprises the following steps: extracting, along the time domain, a plurality of vector matrices representing the timbre of the target tone object of the audio sample data; fully connecting these vector matrices to obtain a fully connected comprehensive vector matrix containing a plurality of vector matrices along the time domain; and selecting the last several consecutive vector matrices in the time domain from the comprehensive vector matrix to construct the vector matrix set, each vector matrix containing a plurality of vectors representing timbre.
Extracting, along the time domain, the plurality of vector matrices representing the timbre of the target tone object from the Mel spectrum information comprises the following steps: calling a residual convolutional network to perform representation learning on the Mel spectrum information to obtain the audio texture feature information it contains; and calling a recurrent neural network to organize the audio texture feature information, obtaining a plurality of vector matrices that integrate the temporal correlation of the audio texture feature information.
The training process of the tone extraction model is as follows: extract, from the Mel spectrum information of each training sample, vector matrices representing the timbre of the target tone object of that sample, and fully connect them to obtain a fully connected comprehensive vector matrix containing a plurality of vector matrices along the time domain, each training sample containing singing audio data of a single target tone object; call a preset classification model to classify the comprehensive vector matrix, supervise the classification result with the supervision label corresponding to the training sample, and correct the weight parameters of the tone extraction model by back-propagating the supervision result; repeat this training iteratively until the cross-entropy loss function of the classification model converges.
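The pipeline just described can be sketched as follows in PyTorch. The network sizes, the choice of a single residual convolution plus a GRU, and the number of averaged frames are illustrative assumptions; only the overall structure (residual convolution for texture, recurrent ordering in time, mean of the last frames as the voiceprint) follows the description above.

```python
import torch
import torch.nn as nn

class TimbreExtractor(nn.Module):
    """Sketch: a residual convolution learns audio texture from the Mel
    spectrum, a recurrent network organizes it along the time axis, and
    averaging the last few frames yields a voiceprint vector."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1)
        self.res = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, mel, last_k=32):              # mel: (B, n_mels, T)
        x = torch.relu(self.conv(mel))
        x = x + torch.relu(self.res(x))             # residual connection
        x, _ = self.rnn(x.transpose(1, 2))          # (B, T, hidden), time-ordered
        return x[:, -last_k:, :].mean(dim=1)        # mean over the last frames -> voiceprint

extractor = TimbreExtractor()
mel = torch.randn(1, 80, 500)                       # Mel spectrum of the vocal part
voiceprint = extractor(mel)                         # (1, 256) tone feature vector
timbre_library = {"singer_042": voiceprint}         # stored against the identity tag
```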
Step S1304, concatenating the phoneme feature information, the musical sequence feature information, the pitch feature information of the target pitch object, and the tone feature information of the target tone object into song synthesis feature information:
In order to facilitate processing by the acoustic model, the corresponding song synthesis feature information needs to be prepared. It can be formed by splicing the phoneme feature information, the tone order feature information and the pitch feature information of the target pitch object, and then overlaying the tone feature information of the target tone object along the time domain. The full set of song synthesis feature information needed for target song synthesis is thus prepared for the acoustic model.
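A minimal sketch of this splicing step is shown below; the feature dimensions are arbitrary placeholders, and only the structure (frame-level concatenation, then overlaying the per-singer timbre vector on every frame) follows the description above.

```python
import torch

T = 400                                             # number of frames in the target score
phoneme_feats = torch.randn(1, T, 256)              # phoneme feature information
order_feats   = torch.randn(1, T, 32)               # tone order feature information
pitch_feats   = torch.randn(1, T, 1)                # pitch feature information (f0 per frame)
timbre_vec    = torch.randn(1, 256)                 # tone feature vector of the target tone object

# Splice the frame-level features, then overlay the timbre vector on every frame.
frame_feats = torch.cat([phoneme_feats, order_feats, pitch_feats], dim=-1)
timbre_feats = timbre_vec.unsqueeze(1).expand(-1, T, -1)
song_synthesis_feats = torch.cat([frame_feats, timbre_feats], dim=-1)   # (1, T, 545)
```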
The embodiment allows the pitch characteristic information and the tone characteristic information to be originated from different singers, and prepares the song synthesis characteristic information required for song synthesis at one time, so that the following advantages are more abundant:
firstly, the method acquires related information required for composing a target song at one time, wherein the related information comprises a target pitch object used for determining the singing power applied to the target song, a target tone object used for determining the tone applied to the target song and a target music score of the target song, then invokes a pitch generation model to generate pitch characteristic information fusing the pitch change characteristic of the target pitch object, acquires tone characteristic information corresponding to the target tone object, and generates human voice part audio data of the song sung according to a music score and a lyric text in the target music score by using two kinds of information with different sources and with the assistance of an acoustic model and a vocoder. The audio data is integrated with the unique pitch characteristic information of the target pitch object, the singing power of the target pitch object is reflected, the unique tone characteristic information of the target tone object is integrated, the decoupling of the pitch characteristic information and the tone characteristic information is realized, the pitch characteristic information and the tone characteristic information can be independently constructed, the audio data and the tone characteristic information are flexibly combined, higher flexibility is opened for a song auxiliary creation system, a user is allowed to combine the own tone with the singing power of other singers to generate a target song for the existing melody music score and lyric text, the creation effect is rapidly felt, and the song auxiliary creation efficiency is improved.
Secondly, in the application, the music score and the lyric text of the target music score are used for generating pitch characteristic information in a pitch generation model on one hand, and are quoted by an acoustic model on the other hand to be realized so as to keep the generated Mel frequency spectrum information containing accurate melody information, and the music score and the lyric text of the target music score are embodied in a mode which is most intuitive and convenient for a user to edit, so that the requirement on the specialty of the user side is reduced, the user can concentrate on the composition creation of the music score and the lyric text, the processing between the pitch characteristic information and the tone characteristic information is not required to be processed by self, the creation process of the target song is more intelligent, and the production efficiency of the target song is improved.
In addition, according to the implementation of the technical scheme, on the basis of decoupling of the pitch characteristic information and the tone characteristic information, collaborative creation of songs is more facilitated, for example, a user purchases pitch characteristic information corresponding to singing power from a singer, and target song creation is performed by the pitch characteristic information and the tone characteristic information of the user, so that the quality of the song works of the user is improved by means of the singing power of the singer, collaboration among online entertainment users is promoted, active sharing of the user works is further promoted, user flow is activated, the internet music ecology is redefined, and people who are music players are expected to become reality.
Referring to fig. 5, in a further embodiment, the step S1300 of performing encoding and decoding according to the song synthesis feature information by using the pre-trained acoustic model to obtain mel-frequency spectrum information includes the following steps:
Step S1310, encoding the song synthesis feature information set by using an encoding network in the acoustic model to obtain an encoded feature vector:
The encoding network of the acoustic model is adapted to splice and encode the feature information in the song synthesis feature information set, yielding the corresponding encoded feature vector.
Step S1320, performing down-sampling processing on the encoded feature vector to obtain a down-sampled encoded feature vector:
Further, the encoded feature vector is down-sampled by a down-sampling network to obtain an encoded feature vector with a normalized feature scale.
Step S1330, performing feature recombination processing on the down-sampled encoded feature vector by using an attention mechanism to obtain an encoded feature vector recombined according to the context information:
The attention mechanism can recombine the feature vectors according to the context information in the feature sequence, whose order reflects the contextual semantics; applying feature recombination to the down-sampled encoded feature vector therefore yields a semantically organized encoded feature vector.
Step S1340, decoding the recombined encoded feature vector by adopting a decoding network in the acoustic model to obtain Mel spectrum information:
The decoding network of the acoustic model converts the encoded feature vector organized by the attention mechanism, yielding the corresponding Mel spectrum information.
Step S1350, performing residual estimation processing, with a residual estimation network, on the Mel spectrum information obtained from the decoding network to obtain residual information:
To make the Mel spectrum information purer, it can be further corrected with a residual estimation network: this network performs residual estimation on the Mel spectrum information obtained from the decoding network, producing the corresponding residual information used for the correction.
Step S1360, correcting the mel-frequency spectrum information of the audio data based on the residual information to obtain corrected mel-frequency spectrum information:
The acoustic model of this embodiment may be pre-trained, or a mature existing acoustic model, such as Tacotron or FastSpeech, may be transferred directly. In the training stage, corresponding sample scores together with the pitch and tone feature information are provided and training proceeds to convergence, so that the acoustic model acquires the ability to convert the song synthesis feature information set into the corresponding Mel spectrum information.
In this embodiment, following the speech synthesis principle, the acoustic model encodes and decodes the song synthesis feature information set, carries out semantic organization, and obtains the Mel spectrum information corresponding to the target song, completing the conversion from features to spectrum. The whole process runs automatically and is highly efficient.
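The sequence of stages in steps S1310 to S1360 can be sketched as follows. The layer choices, dimensions and head counts are assumptions for illustration; only the ordering of the stages (encode, down-sample, attention-based recombination, decode, residual estimation and correction) follows the steps above.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Sketch of the described stages: encode, down-sample, attention-based
    recombination, decode to Mel frames, then residual estimation and correction."""
    def __init__(self, in_dim=545, hidden=256, n_mels=80):
        super().__init__()
        self.encoder = nn.Linear(in_dim, hidden)
        self.down = nn.Conv1d(hidden, hidden, kernel_size=3, stride=2, padding=1)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)
        self.residual = nn.Linear(n_mels, n_mels)           # residual estimation network

    def forward(self, feats):                               # (B, T, in_dim)
        x = torch.relu(self.encoder(feats))                 # encoded feature vector
        x = self.down(x.transpose(1, 2)).transpose(1, 2)    # down-sampled
        x, _ = self.attn(x, x, x)                           # context-aware recombination
        x, _ = self.decoder(x)
        mel = self.to_mel(x)
        return mel + self.residual(mel)                     # corrected Mel spectrum

acoustic_model = AcousticModel()
mel = acoustic_model(torch.randn(1, 400, 545))              # (1, 200, 80) Mel-spectrogram frames
```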
Referring to fig. 6, in an embodiment, the step S1400 of converting the mel-frequency spectrum information into audio data corresponding to the target song by using a vocoder includes the following steps:
step S1410, obtaining first audio data of the vocal singing part of the corresponding target song output by the acoustic model:
In the foregoing embodiments, the Mel spectrum information generated by the acoustic model and converted by the vocoder covers only the vocal singing part of the song. To obtain a complete target song, an accompaniment needs to be added from background music, so the output of the vocoder is first obtained as the first audio data.
Step S1420, acquiring second audio data of the background music corresponding to the target song:
As described above, to further improve the efficiency of assisted song creation, background music adapted to the target song can be acquired and synthesized with it; specifically, the second audio data corresponding to the background music is acquired. The correspondence between the background music and the target music score can be preset.
Step S1430, extracting the music basic information commonly followed by the background music and the target music score of the target song, the music basic information including playing tempo, beat number and key number:
Background music is generally organized according to a certain rhythm, so its music basic information, such as playing tempo, beat number and key number, is determined when it is prepared. This music basic information, together with the chord information of the background music, can be packaged into an accompaniment template. When a user starts creating a song, the user selects an accompaniment template, which fixes the music basic information of the notation in the target music score the user creates; the music basic information can therefore be obtained from the accompaniment template, ensuring that the target music score and the background music follow the same music basic information.
Step S1440, synthesizing the first audio data and the second audio data into audio data corresponding to the target song according to the music piece basic information:
Following the constraints of the music basic information, the second audio data corresponding to the background music and the first audio data corresponding to the vocal part of the target song can be aligned and merged with audio synthesis means commonly used by those skilled in the art, yielding the complete audio data corresponding to the target song.
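One simple way to align and merge the two signals, sketched below, is to convert a beat offset into samples using the shared tempo and then mix by summation with clipping; the function, sample rate and offset values are illustrative assumptions, not the specific synthesis means of this application.

```python
import numpy as np

def mix_vocal_and_accompaniment(vocal, accomp, sr, offset_beats=0.0, bpm=120.0):
    """Align the synthesized vocal with the backing track using the shared
    tempo information, then mix by summation with clipping."""
    offset = int(offset_beats * 60.0 / bpm * sr)     # beat offset converted to samples
    length = max(len(vocal) + offset, len(accomp))
    mix = np.zeros(length, dtype=np.float32)
    mix[offset:offset + len(vocal)] += vocal
    mix[:len(accomp)] += accomp
    return np.clip(mix, -1.0, 1.0)

sr = 44100
vocal = np.zeros(sr * 10, dtype=np.float32)          # first audio data (vocal part)
accomp = np.zeros(sr * 12, dtype=np.float32)         # second audio data (background music)
song = mix_vocal_and_accompaniment(vocal, accomp, sr, offset_beats=4, bpm=90)
```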
Step S1450, outputting audio data corresponding to the target song:
After the audio data corresponding to the target song is obtained, it can be pushed to the creating user so that the user can play the target song on their own client, completing the whole assisted song creation process.
This embodiment further completes the assisted music creation flow. Its fully automatic operation greatly simplifies the tedious steps the user would otherwise have to perform, improving the efficiency of assisted music creation.
In an extended embodiment, the acoustic model is pre-trained, and the training process comprises the following steps:
step S2100, a training sample set is obtained, the training sample set includes a plurality of groups of training samples, the plurality of groups of training samples include song samples of different singing languages performed by the same singer and song samples of different singing languages performed by different singers, each group of song samples includes audio data corresponding to a song and pronunciation lyric labeling information thereof:
The principle of training an acoustic model to synthesize a song is known to those skilled in the art, but in the present application, in order for the acoustic model to learn to produce songs of different languages uniformly across languages, specific requirements are imposed on the training samples used in the training phase.
Specifically, to train the acoustic model, a corresponding training sample set needs to be obtained in advance. The training sample set consists of multiple groups of training samples; following the usual neural network training principles, the number of training samples should be adequate, with the specific number chosen to balance convenient convergence of the acoustic model against training cost.
The multiple groups of training samples cover two sample compositions. In the first case, the training sample set contains song samples of different singing languages performed by the same singer; for example, for one singer, both Japanese-language and English-language songs are collected. In the second case, the training sample set contains song samples of different singing languages performed by different singers, such as a Japanese song from singer A and an English song from singer B. In a preferred embodiment, both cases are present. It has been found that including the first case allows the acoustic model to converge faster than using the second case alone, because the first case provides the acoustic model with timbre information associated across songs of different languages sung by the same singer, which helps training. As an equivalent alternative, the training sample set may also be constructed using only the second case, provided sufficient training samples are available. When the two cases coexist, the training cost of the acoustic model can be reduced, which matters particularly given the known high cost of song copyrights.
In a preferred embodiment, the training sample set may also include singing versions of different singing languages corresponding to the same song, and the singing versions of different singing languages may be performed by the same singer or performed by different singers, so as to provide more relevant information for training of the acoustic model at a semantic level.
And each group of training samples in the training sample set comprises audio data corresponding to the song and lyric pronunciation marking information thereof.
Step S2100, for each group of training samples, performing iterative training of the following process shown in fig. 7 with the acoustic model as a target training model:
step S2110, encoding is carried out according to the pronunciation label information to obtain phoneme feature information and sound sequence feature information corresponding to the song sample, wherein in the phoneme feature information, phonemes in the lyric pronunciation label information of the same singing language are represented according to encoding values of a phoneme dictionary corresponding to the singing language in a phoneme dictionary library:
As in the foregoing embodiments, by the training stage of the acoustic model a phoneme dictionary has already been constructed for each singing language and stored in the phoneme dictionary library. For each training sample, the corresponding phoneme feature information and phonetic order feature information therefore need to be provided to the acoustic model; they can be generated from the lyric pronunciation labeling information of the training sample by the means disclosed earlier in this application.
Since the phoneme dictionary stores the mapping between phonemes and coding values, encoding with the phoneme dictionary corresponding to the singing language of the training sample's song naturally yields phoneme feature information specific to that singing language.
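A minimal sketch of this per-language encoding is shown below. The phoneme sets, coding values and language names are hypothetical placeholders; the application does not specify them.

```python
# Hypothetical per-language phoneme dictionaries; real phoneme sets and
# coding values are not specified in this application.
phoneme_dicts = {
    "mandarin": {"n": 1, "i": 2, "h": 3, "ao": 4},
    "english":  {"HH": 101, "AH": 102, "L": 103, "OW": 104},
}

def encode_lyrics(phonemes, language):
    table = phoneme_dicts[language]                       # dictionary chosen by singing language
    phoneme_feats = [table[p] for p in phonemes]          # phoneme feature information
    order_feats = list(range(len(phonemes)))              # phonetic order feature information
    return phoneme_feats, order_feats

print(encode_lyrics(["n", "i", "h", "ao"], "mandarin"))   # ([1, 2, 3, 4], [0, 1, 2, 3])
```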
Step S2120, extracting pitch characteristic information of the song samples by adopting a preset algorithm:
in one embodiment, the skilled person can use various known pitch extraction algorithms to extract the corresponding pitch characteristic information from the song sample.
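For example, the pYIN tracker shipped with librosa is one publicly available pitch extraction algorithm; the application does not prescribe a specific one, and the file path below is hypothetical.

```python
import librosa

# pYIN used as one example of a known pitch extraction algorithm.
y, sr = librosa.load("song_sample.wav", sr=None)          # hypothetical song sample
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
# f0 is a frame-level fundamental-frequency contour usable as pitch feature information.
```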
In another embodiment, similarly to the foregoing, the pre-trained pitch generation model can be used in the training stage: following the processes disclosed in the earlier embodiments of this application, the corresponding note feature information, phoneme feature information and musical sequence feature information are obtained from the tune information and the lyric pronunciation labeling information provided by the training sample, and the corresponding pitch feature information is generated for the song in the training sample. This pitch feature information can in turn be used to synthesize the target song corresponding to the song sample.
Step S2130, extracting, with the pre-trained tone extraction model, the tone feature information corresponding to the singer of the song sample, and constructing a tone feature library storing the mapping relation data between the tone feature information and the singer:
Similarly, with the pre-trained tone extraction model described in the earlier embodiments of this application, the corresponding tone feature information can be extracted from the audio data of the song sample. This tone feature information can in turn be used to synthesize the target song corresponding to the song sample, and it can also be stored in the tone feature library, mapped to the singer's identity tag, for retrieval when a new song is later synthesized.
Step S2140, extracting original Mel frequency spectrum information of the song samples by adopting a preset algorithm:
a person skilled in the art may extract the original mel-frequency spectrum information in the audio data of the song sample by using various known algorithms for supervising the training of the acoustic model, for example, by performing speech processing related algorithm operations such as pre-emphasis, framing, windowing, fast fourier transform, mel filtering and the like on the audio data, so as to obtain the original mel-frequency spectrum information.
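For instance, a sketch using librosa's standard utilities is given below; the FFT size, hop length, 80 Mel bands and the file path are illustrative assumptions rather than parameters required by this application.

```python
import librosa

y, sr = librosa.load("song_sample.wav", sr=22050)     # hypothetical song sample
y = librosa.effects.preemphasis(y)                    # pre-emphasis
mel = librosa.feature.melspectrogram(                 # framing, windowing, FFT and Mel filtering
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
mel_db = librosa.power_to_db(mel)                     # original Mel spectrum used as the supervision target
```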
Step S2150, inputting phoneme feature information, tone order feature information, pitch feature information and tone feature information of the training sample into a target training model to predict Mel frequency spectrum information, supervising the training process by using the original Mel frequency spectrum information, and circularly performing iterative training of the next training sample when the target training model is not converged:
The preceding steps have produced, for a training sample, all the inputs the acoustic model requires: the phoneme feature information, the order feature information, the pitch feature information and the tone feature information. These are assembled into song synthesis feature information and fed into the acoustic model, which predicts the corresponding predicted Mel spectrum information. The loss of this training iteration is then computed from the difference between the original Mel spectrum information and the predicted Mel spectrum information, the acoustic model is updated by gradient descent on this loss, and the network weights are corrected. At each iteration it is checked whether the loss value approaches 0, i.e. whether the acoustic model has converged; if it has not, the process returns to step S2110 and the next group of training samples is used to continue the iterative training until the model converges.
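A minimal sketch of one such supervised training step follows. The acoustic model is stood in for by a tiny placeholder network, and the L1 loss on Mel frames is an assumption; the application only requires supervising the prediction with the original Mel spectrum information.

```python
import torch
import torch.nn as nn

# Placeholder acoustic model and assumed L1 loss on Mel frames.
acoustic_model = nn.Sequential(nn.Linear(545, 256), nn.ReLU(), nn.Linear(256, 80))
optimizer = torch.optim.Adam(acoustic_model.parameters(), lr=1e-4)
criterion = nn.L1Loss()

def train_step(song_synthesis_feats, original_mel):    # (B, T, 545), (B, T, 80)
    optimizer.zero_grad()
    predicted_mel = acoustic_model(song_synthesis_feats)
    loss = criterion(predicted_mel, original_mel)
    loss.backward()                                    # back-propagate and correct the network weights
    optimizer.step()
    return loss.item()

loss = train_step(torch.randn(2, 400, 545), torch.randn(2, 400, 80))
# Iterate over the groups of training samples until the loss converges toward 0.
```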
This embodiment gives the overall process of preparing training samples for the acoustic model and carrying out the training. Because the training samples contain songs of different languages sung by the same singer, the acoustic model converges faster, the dependence on training samples is reduced, and the training cost is lowered.
Furthermore, because a phoneme dictionary is constructed in advance for each singing language when encoding the phoneme feature information, the correspondence between singing languages and phoneme dictionaries is naturally established during encoding. When the acoustic model is put into production, it can therefore serve song synthesis in different singing languages uniformly: given a singing language, the lyric pronunciation labeling information of the target song is encoded with the corresponding phoneme dictionary. Having been trained on encoded feature information compiled from multiple phoneme dictionaries across singing languages, the acoustic model is compatible with the synthesis needs of songs in multiple singing languages and serves them uniformly.
Referring to fig. 8, a cross-lingual song synthesizing apparatus adapted to the cross-lingual song synthesizing method of the present application for functional deployment includes: the system comprises a data acquisition module 1100, a music score coding module 1200, an acoustic synthesis module 1300, and a spectrum conversion module 1400, wherein the data acquisition module 1100 is configured to acquire a target music score of a target song and synthesis configuration information, and the synthesis configuration information includes a song singing language, a target pitch object, and a target tone color object; the music score coding module 1200 is configured to call a corresponding phoneme dictionary according to the language in which the song is sung to code the target music score, so as to obtain phoneme feature information and sound order feature information of the target song, where the phoneme dictionary includes a mapping relationship between phonemes of the corresponding language and coding numerical values; the acoustic synthesis module 1300 is configured to perform coding and decoding according to song synthesis feature information by using a pre-trained acoustic model to obtain mel-frequency spectrum information, where the song synthesis feature information includes the phoneme feature information, the tone order feature information, pitch feature information generated corresponding to the target pitch object, and preset tone feature information generated corresponding to the target tone object; the spectrum conversion module 1400 is configured to convert the mel spectrum information into audio data corresponding to a target song by using a vocoder.
In a deepened embodiment, the score encoding module 1200 includes: the dictionary calling submodule is used for determining a phoneme dictionary corresponding to the singing language from a phoneme dictionary library according to the singing language of the song, and the phoneme dictionary library comprises a plurality of phoneme dictionaries corresponding to different singing languages; a phoneme mapping sub-module, configured to search, according to each phoneme in the lyric pronunciation tagging information of the vocal language corresponding to the lyric text in the target music score, a coding numerical value corresponding to each phoneme from the phoneme dictionary, and construct phoneme feature information corresponding to the lyric; and the sound sequence mapping submodule is used for coding sound sequence characteristic information corresponding to the phoneme characteristic information according to the position information of each phoneme.
In an extended embodiment, the cross-lingual song synthesizing apparatus of the present application further includes: the note coding module is used for generating note characteristic information of the target song according to the melody labeling information in the target music score; a pitch generation module, configured to input the note feature information, the phoneme feature information, and the musical sequence feature information of the target song into a pre-trained pitch generation model matching the control parameters of the target object, so as to generate pitch feature information of the target pitch object; the tone calling module is used for calling tone characteristic information of the target tone object from a preset tone characteristic library according to the target tone object; and the characteristic splicing module is used for splicing the phoneme characteristic information, the tone sequence characteristic information, the pitch characteristic information of the target pitch object and the tone characteristic information of the target tone object into song synthesis characteristic information.
In a further embodiment, the acoustic synthesis module 1300 includes: the characteristic coding submodule is used for coding the song synthesis characteristic information set by adopting a coding network in the acoustic model to obtain a coded coding characteristic vector; the characteristic sampling submodule is used for performing down-sampling processing on the coded characteristic vector to obtain a down-sampled coded characteristic vector; the characteristic recombination submodule is used for performing characteristic recombination processing on the down-sampled coding characteristic vector by adopting an attention mechanism to obtain a coding characteristic vector recombined according to the context information; and the characteristic decoding submodule is used for decoding the recombined coding characteristic vector by adopting a decoding network in the acoustic model to obtain Mel frequency spectrum information.
In a further embodiment, the acoustic synthesis module 1300 further includes: the residual error prediction submodule is used for performing residual error prediction processing on the Mel frequency spectrum information of the audio data obtained from the decoding network by adopting a residual error prediction network to obtain residual error information; and the frequency spectrum correction submodule is used for correcting the Mel frequency spectrum information of the audio data based on the residual error information to obtain the corrected Mel frequency spectrum information.
In an embodied embodiment, the spectrum conversion module 1400 includes: the voice acquisition sub-module is used for acquiring first audio data, output by the acoustic model, of a voice singing part corresponding to the target song; the accompaniment acquisition sub-module is used for acquiring second audio data of background music corresponding to the target song; the music score extraction sub-module is used for extracting music basic information commonly followed by the background music and a target music score of the target song, and the music basic information comprises playing tempo, beat number and key number; the full-song synthesizing submodule is used for synthesizing the first audio data and the second audio data into audio data corresponding to a target song according to the basic information of the music; and the song output submodule is used for outputting the audio data corresponding to the target song.
In an extended embodiment, the cross-lingual song synthesizing apparatus of the present application further includes a training module of the acoustic model, where the training module includes: the system comprises a sample acquisition submodule and a training sample set, wherein the training sample set comprises a plurality of groups of training samples, the training samples comprise song samples in different singing languages sung by the same singer and song samples in different singing languages sung by different singers respectively, and each group of song samples comprises corresponding audio data of a song and pronunciation lyric marking information thereof; and the iterative training submodule is used for performing iterative training of the following process by taking the acoustic model as a target training model for each group of training samples: coding according to the pronunciation label information to obtain phoneme feature information and phonetic sequence feature information corresponding to the song sample, wherein in the phoneme feature information, phonemes in the lyric pronunciation label information of the same singing language are represented according to coding values of a phoneme dictionary corresponding to the singing language in a phoneme dictionary library; extracting pitch characteristic information of the song samples by adopting a preset algorithm; extracting tone characteristic information corresponding to a singing singer of a song sample in a pre-trained tone extraction model, and constructing a tone characteristic library for storing mapping relation data between the tone characteristic information and the singing singer; extracting original Mel frequency spectrum information of the song samples by adopting a preset algorithm; inputting the phoneme characteristic information, the tone sequence characteristic information, the pitch characteristic information and the tone characteristic information of the training sample into a target training model to predict Mel frequency spectrum information, supervising the training process by utilizing the original Mel frequency spectrum information, and circularly performing iterative training of the next training sample when the target training model is not converged.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. As shown in fig. 9, the internal structure of the computer device is schematically illustrated. The computer device includes a processor, a computer-readable storage medium, a memory, and a network interface connected by a system bus. The computer readable storage medium of the computer device stores an operating system, a database and computer readable instructions, the database can store control information sequences, and the computer readable instructions, when executed by the processor, can cause the processor to implement a cross-language song synthesis method. The processor of the computer device is used for providing calculation and control capability and supporting the operation of the whole computer device. The memory of the computer device may have stored therein computer readable instructions that, when executed by the processor, may cause the processor to perform the cross-lingual song synthesis method of the present application. The network interface of the computer device is used for connecting and communicating with the terminal. Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In this embodiment, the processor is configured to execute specific functions of each module and its sub-module in fig. 8, and the memory stores program codes and various data required for executing the modules or sub-modules. The network interface is used for data transmission to and from a user terminal or a server. The memory in this embodiment stores program codes and data required for executing all modules/sub-modules in the cross-lingual song synthesizing apparatus of the present application, and the server can call the program codes and data of the server to execute the functions of all sub-modules.
The present application further provides a storage medium storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the cross-lingual song synthesis method of any of the embodiments of the present application.
The present application also provides a computer program product comprising computer programs/instructions which, when executed by one or more processors, implement the steps of the method as described in any of the embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments of the present application can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when the computer program is executed, the processes of the embodiments of the methods can be included. The storage medium may be a computer-readable storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
In summary, the application can realize cross-language song synthesis service, and can be used for synthesizing target songs in multiple singing languages according to needs by using the same acoustic model.
Those of skill in the art will appreciate that the various operations, methods, steps in the processes, acts, or solutions discussed in this application can be interchanged, modified, combined, or eliminated. Further, other steps, measures, or schemes in various operations, methods, or flows that have been discussed in this application can be alternated, altered, rearranged, broken down, combined, or deleted. Further, steps, measures, schemes in the prior art having various operations, methods, procedures disclosed in the present application may also be alternated, modified, rearranged, decomposed, combined, or deleted.
The foregoing is only a partial embodiment of the present application. It should be noted that those skilled in the art may make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be regarded as falling within the protection scope of the present application.

Claims (10)

1. A cross-language song synthesis method is characterized by comprising the following steps:
acquiring a target music score and synthesis configuration information of a target song, wherein the synthesis configuration information comprises a song singing language, a target pitch object and a target tone object;
calling a corresponding phoneme dictionary according to the song singing language to encode the target music score to obtain phoneme characteristic information and sound order characteristic information of the target song, wherein the phoneme dictionary comprises a mapping relation between phonemes of the corresponding language and encoding numerical values;
coding and decoding are carried out according to song synthesis characteristic information by adopting a pre-trained acoustic model to obtain Mel frequency spectrum information, wherein the song synthesis characteristic information comprises the phoneme characteristic information, the tone sequence characteristic information, pitch characteristic information generated corresponding to the target pitch object and preset tone characteristic information generated corresponding to the target tone object;
and converting the Mel frequency spectrum information into audio data corresponding to the target song by adopting a vocoder.
2. The method for synthesizing a cross-lingual song according to claim 1, wherein the target score is encoded by calling a corresponding phoneme dictionary according to the language in which the song is performed to obtain phoneme feature information and phonetic sequence feature information of the target song, comprising the steps of:
according to the song singing language, determining a phoneme dictionary corresponding to the singing language from a phoneme dictionary library, wherein the phoneme dictionary library comprises a plurality of phoneme dictionaries corresponding to different singing languages;
searching, according to each phoneme in the lyric pronunciation labeling information of the singing language corresponding to the lyric text in the target music score, the coding numerical value corresponding to each phoneme from the phoneme dictionary, and constructing phoneme characteristic information corresponding to the lyrics;
and coding the sound sequence characteristic information corresponding to the phoneme characteristic information according to the position information of each phoneme.
3. The method for synthesizing a cross-lingual song according to claim 1, wherein a pre-trained acoustic model is adopted, and encoding and decoding are performed according to the song synthesis characteristic information, before obtaining mel-frequency spectrum information, the method comprises the following steps:
generating note characteristic information of the target song according to the melody labeling information in the target music score;
inputting the note characteristic information, phoneme characteristic information and tone sequence characteristic information of the target song into a pre-trained pitch generation model matched with the control parameters of the target object so as to generate pitch characteristic information of the target pitch object;
calling tone characteristic information of the target tone object from a preset tone characteristic library according to the target tone object;
and splicing the phoneme characteristic information, the tone sequence characteristic information, the pitch characteristic information of the target pitch object and the tone characteristic information of the target tone object into song synthesis characteristic information.
4. The method for synthesizing a cross-lingual song according to claim 1, wherein a pre-trained acoustic model is used to perform encoding and decoding according to the song synthesis characteristic information to obtain Mel frequency spectrum information, comprising the steps of:
coding the song synthesis characteristic information set by adopting a coding network in an acoustic model to obtain a coded coding characteristic vector;
performing down-sampling processing on the coded coding feature vector to obtain a down-sampled coding feature vector;
performing feature recombination processing on the down-sampled coding feature vector by adopting an attention mechanism to obtain a coding feature vector recombined according to context information;
and decoding the recombined coding characteristic vector by adopting a decoding network in the acoustic model to obtain Mel frequency spectrum information.
5. The method for synthesizing a cross-lingual song according to claim 4, wherein after decoding the recombined coded feature vector by using a decoding network in an acoustic model to obtain Mel frequency spectrum information, the method further comprises the following steps:
residual error pre-estimation processing is carried out on the Mel frequency spectrum information of the audio data obtained from the decoding network by adopting a residual error pre-estimation network, so as to obtain residual error information;
and correcting the Mel frequency spectrum information of the audio data based on the residual error information to obtain the corrected Mel frequency spectrum information.
6. The method according to any one of claims 1 to 5, wherein the step of converting the mel-frequency spectrum information into audio data corresponding to a target song by using a vocoder comprises the steps of:
obtaining first audio data of a vocal singing part of a corresponding target song output by the acoustic model;
acquiring second audio data of background music corresponding to the target song;
extracting music basic information commonly followed by the background music and a target music score of the target song, wherein the music basic information comprises playing tempo, beat number and key number;
synthesizing the first audio data and the second audio data into audio data corresponding to a target song according to the music basic information;
and outputting the audio data corresponding to the target song.
7. The method according to claim 3, wherein the acoustic model is pre-trained, and the training process comprises the following steps:
acquiring a training sample set, wherein the training sample set comprises a plurality of groups of training samples, the training samples comprise song samples of different singing languages sung by the same singer and song samples of different singing languages sung by different singers respectively, and each group of song samples comprises corresponding audio data of a song and pronunciation marking information of the song lyrics;
for each group of training samples, performing iterative training of the following process by taking the acoustic model as a target training model:
coding according to the pronunciation label information to obtain phoneme feature information and phonetic sequence feature information corresponding to the song sample, wherein in the phoneme feature information, phonemes in the lyric pronunciation label information of the same singing language are represented according to coding values of a phoneme dictionary corresponding to the singing language in a phoneme dictionary library;
extracting pitch characteristic information of the song samples by adopting a preset algorithm;
extracting tone characteristic information corresponding to a singing singer of a song sample in a pre-trained tone extraction model, and constructing a tone characteristic library for storing mapping relation data between the tone characteristic information and the singing singer;
extracting original Mel frequency spectrum information of the song samples by adopting a preset algorithm;
inputting the phoneme characteristic information, the tone sequence characteristic information, the pitch characteristic information and the tone characteristic information of the training sample into a target training model to predict Mel frequency spectrum information, supervising the training process by utilizing the original Mel frequency spectrum information, and circularly performing iterative training of the next training sample when the target training model is not converged.
8. A computer device comprising a central processor and a memory, characterized in that the central processor is adapted to invoke execution of a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 7.
9. A computer-readable storage medium, characterized in that it stores, in the form of computer-readable instructions, a computer program implemented according to the method of any one of claims 1 to 7, which, when invoked by a computer, performs the steps comprised by the corresponding method.
10. A computer program product comprising computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the steps of the method as claimed in any one of claims 1 to 7.
CN202111257558.4A 2021-10-27 2021-10-27 Cross-language song synthesis method and device, equipment, medium and product thereof Pending CN113963717A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111257558.4A CN113963717A (en) 2021-10-27 2021-10-27 Cross-language song synthesis method and device, equipment, medium and product thereof

Publications (1)

Publication Number Publication Date
CN113963717A true CN113963717A (en) 2022-01-21

Family

ID=79467681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111257558.4A Pending CN113963717A (en) 2021-10-27 2021-10-27 Cross-language song synthesis method and device, equipment, medium and product thereof

Country Status (1)

Country Link
CN (1) CN113963717A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023245389A1 (en) * 2022-06-20 2023-12-28 北京小米移动软件有限公司 Song generation method, apparatus, electronic device, and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination