CN109949783B - Song synthesis method and system - Google Patents

Info

Publication number
CN109949783B
Authority
CN
China
Prior art keywords
audio
song
lyric
user
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910188123.5A
Other languages
Chinese (zh)
Other versions
CN109949783A (en)
Inventor
初敏
杜斌
杨喜鹏
陈博
刘亚祝
游永彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
AI Speech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AI Speech Ltd
Publication of CN109949783A
Application granted
Publication of CN109949783B
Legal status: Active
Anticipated expiration

Landscapes

  • Reverberation, Karaoke And Other Acoustics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a song synthesis method comprising the following steps: acquiring lyric audio, the lyric audio being read-aloud audio of the lyrics of a song to be synthesized; acquiring a target song dry sound corresponding to the song to be synthesized; acquiring target audio features corresponding to the target song dry sound; adjusting the audio features of the lyric audio according to the target audio features to obtain corresponding song audio; and synthesizing the song audio with the corresponding background music to obtain the song. With the song synthesis method provided by the embodiment of the invention, the user only needs to read the lyrics aloud, and the desired song is synthesized from that read-aloud audio. The user needs no singing skill and no knowledge of rhythm, and obtains a song sung in the user's own voice simply by reading the lyrics.

Description

Song synthesis method and system
Technical Field
The present invention relates to the field of speech synthesis technologies, and in particular, to a song synthesis method, a song synthesis system, an electronic device, and a storage medium.
Background
Song synthesis schemes currently on the market include the following. (1) Score-based synthesis: the song is synthesized from the musical score and a model trained on a large amount of speaker speech. (2) Feature conversion synthesis: the melody of the singing voice is modified by changing the pauses and durations of the sounds; this approach can only synthesize relatively simple songs, such as rap-style songs.
Of the two schemes, the first requires a large amount of user data to train the model, is costly and impractical, and produces audio with a strongly mechanical feel. The second cannot synthesize songs that are harder to sing, such as those with operatic vocal styles or vibrato.
The shortcomings of comparable technologies are therefore as follows: non-real-time song synthesis can synthesize any song, but the result sounds mechanical and the model training cost is high; real-time song synthesis supports only a few specified song types.
Disclosure of Invention
An embodiment of the present invention provides a song synthesis method and system, which are used for solving at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a song synthesis method, including:
acquiring lyric audio which is lyric reading audio corresponding to a song to be synthesized;
acquiring a target song dry sound corresponding to the current song to be synthesized;
acquiring target audio characteristics corresponding to the dry sound of the target song;
adjusting the audio features of the lyric audio according to the target audio features to obtain corresponding song audio;
and synthesizing the song audio and the corresponding background music to obtain the song.
In a second aspect, an embodiment of the present invention provides a song synthesizing system, including:
the audio acquisition program module is used for acquiring lyric audio which is lyric reading audio corresponding to a song to be synthesized;
a dry sound obtaining program module for obtaining the dry sound of the target song corresponding to the current song to be synthesized;
the characteristic acquisition program module is used for acquiring target audio characteristics corresponding to the dry sound of the target song;
the characteristic adjusting program module is used for adjusting the audio features of the lyric audio according to the target audio features to obtain corresponding song audio;
and the audio synthesis program module is used for synthesizing the song audio and the corresponding background music to obtain the song.
In a third aspect, an embodiment of the present invention provides a storage medium, in which one or more programs including execution instructions are stored, where the execution instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any one of the above song synthesis methods of the present invention.
In a fourth aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform any of the song synthesizing methods of the present invention described above.
In a fifth aspect, an embodiment of the present invention further provides a computer program product, which includes a computer program stored on a storage medium, the computer program including program instructions, which, when executed by a computer, cause the computer to execute any one of the above song synthesizing methods.
The beneficial effects of the embodiment of the invention are as follows: with the song synthesis method provided by the embodiment, the user only needs to read the lyrics aloud, and the desired song is synthesized from that read-aloud audio. The user needs no singing skill and no knowledge of rhythm, and obtains a song sung in the user's own voice simply by reading the lyrics. In addition, because the source data for the user song dry sound, which reproduces the user's timbre, is the audio of the user reading the lyrics, that audio only needs to be adaptively adjusted according to the audio features of the standard song dry sound. This simplifies the song synthesis method, reduces its technical difficulty, and greatly improves its efficiency.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow diagram of one embodiment of a song synthesis method of the present invention;
FIG. 2 is a functional block diagram of an embodiment of a song synthesis system in the present invention;
FIG. 3 is a flowchart of the execution method of the "reading sentences into songs" product of the present invention;
FIG. 4 is a flowchart of the song synthesis technique used in "reading sentences into songs" in the present invention;
FIG. 5 is a schematic structural diagram of an embodiment of an electronic device according to the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
As used in this disclosure, "module," "device," "system," and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. In particular, for example, an element may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. Also, an application or script running on a server, or a server, may be an element. One or more elements may be in a process and/or thread of execution and an element may be localized on one computer and/or distributed between two or more computers and may be operated by various computer-readable media. The elements may also communicate by way of local and/or remote processes based on a signal having one or more data packets, e.g., from a data packet interacting with another element in a local system, distributed system, and/or across a network in the internet with other systems by way of the signal.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
As shown in FIG. 1, a song synthesis method provided in an embodiment of the present invention includes: S10, acquiring lyric audio, the lyric audio being read-aloud audio of the lyrics of the song to be synthesized. For example, the lyric audio may be audio of the user reading the lyrics of the song to be synthesized, or read-aloud audio synthesized to simulate the user from the user's historical audio data; the present invention is not limited in this respect.
S20, acquiring a target song dry sound corresponding to the song to be synthesized; illustratively, the target song dry sound is obtained from a song library that is constructed in advance and stores the dry sounds of a plurality of songs available for synthesis.
S30, acquiring target audio features corresponding to the target song dry sound; the target audio features are stored together with the dry sounds in the pre-constructed song library. The target audio features include the fundamental frequency of the initial and final of each word in the target song dry sound; specifically, the fundamental frequency of the target song dry sound is extracted, and the fundamental frequency of each word is then obtained from it.
S40, adjusting the audio features of the lyric audio according to the target audio features to obtain corresponding song audio;
and S50, synthesizing the song audio and the corresponding background music to obtain the song.
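The per-word fundamental frequency described in step S30 can be sketched as follows. This is a minimal illustration, not the patented extraction procedure: it assumes a frame-level F0 contour has already been extracted by some pitch tracker, that a value of 0 marks an unvoiced frame, and that segment boundaries are given as frame indices. The function name `per_segment_f0` and the sample values are hypothetical.

```python
from statistics import mean

def per_segment_f0(f0_frames, segments):
    """Average the voiced (non-zero) F0 frames inside each labelled segment.

    f0_frames: frame-level F0 values in Hz; 0.0 is assumed to mean "unvoiced".
    segments:  (label, start_frame, end_frame) tuples, end exclusive.
    """
    result = []
    for label, start, end in segments:
        voiced = [f for f in f0_frames[start:end] if f > 0]
        # A segment with no voiced frame keeps an F0 of 0.
        result.append((label, mean(voiced) if voiced else 0.0))
    return result

# Hypothetical contour for the two syllables "zhong" and "hua",
# split into initial and final.
f0 = [0.0, 0.0, 220.0, 224.0, 0.0, 196.0, 198.0, 200.0]
segs = [("zh", 0, 2), ("ong", 2, 4), ("h", 4, 5), ("ua", 5, 8)]
features = per_segment_f0(f0, segs)
```

A single mean per segment is the simplest possible summary; a real system would more likely keep the whole contour for each initial and final.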
With the song synthesis method provided by the embodiment of the invention, the user only needs to read the lyrics aloud, and the desired song is synthesized from that read-aloud audio. The user needs no singing skill and no knowledge of rhythm, and can obtain songs of any type and style sung in the user's own voice simply by reading the lyrics.
Illustratively, on the one hand, the user selects the song to be synthesized and is given the corresponding lyrics, then reads the lyrics aloud to produce audio data. On the other hand, the song dry sound and background music of the selected song are acquired from the pre-constructed song library, and the audio features of the song dry sound (that is, the standard audio features) are acquired. The audio data of the user reading the lyrics is then adjusted according to these audio features to obtain a user song dry sound that keeps the user's timbre. Finally, the user song dry sound and the background music are synthesized into a song.
In addition, because the source data for the user song dry sound, which reproduces the user's timbre, is the audio of the user reading the lyrics, that audio only needs to be adaptively adjusted according to the audio features of the standard song dry sound. This simplifies the song synthesis method, reduces its technical difficulty, and greatly improves its efficiency.
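The flow of steps S10-S50 can be sketched as a small pipeline. This is only a structural illustration under stated assumptions: the `SongEntry` record, the `synthesize_song` function, and the list-of-numbers audio representation are hypothetical, and the feature adjustment and mixing steps are passed in as placeholder callables rather than implemented.

```python
from dataclasses import dataclass

@dataclass
class SongEntry:
    """One record of the pre-constructed song library (hypothetical layout)."""
    dry_sound: list        # reference vocal without background music
    target_features: list  # target audio features stored with the dry sound
    background: list       # background music

def synthesize_song(lyric_audio, song_id, library, adjust, mix):
    """lyric_audio is the result of S10 (the user reading the lyrics)."""
    entry = library[song_id]                  # S20: look up the target song
    target = entry.target_features            # S30: its target audio features
    song_audio = adjust(lyric_audio, target)  # S40: adjust the lyric audio
    return mix(song_audio, entry.background)  # S50: add the background music

# Toy stand-ins for the adjustment and mixing steps.
library = {"demo": SongEntry(dry_sound=[3, 3], target_features=[2, 3],
                             background=[10, 10])}
scale = lambda audio, feats: [a * g for a, g in zip(audio, feats)]
add = lambda voice, bgm: [v + b for v, b in zip(voice, bgm)]
song = synthesize_song([1, 4], "demo", library, scale, add)
```

Passing `adjust` and `mix` as parameters keeps the sketch independent of any particular model or signal-processing method.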
In some embodiments, the song synthesis method of the present invention further comprises: acquiring audio segmentation information of user audio according to the lyric audio, wherein the audio segmentation information comprises phoneme segmentation information and/or syllable segmentation information and/or initial and final segmentation information;
the adjusting the audio characteristics of the lyric audio according to the target audio characteristics to obtain a corresponding song audio comprises: and inputting the target audio characteristics, the audio segmentation information and the lyric audio into an adaptive model to obtain corresponding song audio.
Illustratively, the inputting the target audio feature, the audio slicing information, and the lyric audio to an adaptation model to obtain a corresponding song audio comprises:
inputting the lyric audio and audio segmentation information into a pre-trained acoustic adaptive model to perform adaptive processing on the lyric audio;
inputting the target audio features into a pre-trained song rhythm model to obtain rhythm parameters;
and adjusting the lyric audio after the self-adaptive processing according to the prosody parameters to obtain the corresponding song audio.
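To make the idea of prosody parameters concrete, the sketch below derives a duration stretch ratio and a pitch shift for each syllable. It is a rule-based stand-in, not the pre-trained prosody model described above: it assumes the spoken and target syllables are already aligned one to one, and the function name and sample values are hypothetical.

```python
import math

def prosody_adjustments(spoken, target):
    """Return (stretch_ratio, pitch_shift_semitones) for each syllable.

    spoken, target: aligned lists of (duration_seconds, f0_hz) per syllable.
    """
    out = []
    for (d_spoken, f_spoken), (d_target, f_target) in zip(spoken, target):
        stretch = d_target / d_spoken
        if f_spoken > 0 and f_target > 0:
            # 12 semitones per octave; an octave doubles the frequency.
            semitones = 12 * math.log2(f_target / f_spoken)
        else:
            semitones = 0.0  # unvoiced on either side: leave pitch alone
        out.append((stretch, semitones))
    return out

spoken = [(0.2, 110.0), (0.25, 200.0)]
target = [(0.4, 220.0), (0.5, 0.0)]
params = prosody_adjustments(spoken, target)
```

In this example the first syllable is stretched to twice its length and raised by one octave; the second is only stretched.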
From a technical point of view, the song synthesis method of the invention turns a segment of speech read by the user into a personalized song in two processing stages, speech recognition and speech synthesis. The latter is subdivided into two parts: applying an acoustic model, and adjusting prosodic parameters with a prosody model. Put simply, the acoustic model is trained on the user's collected voice data, so that after this personalized learning the generated song has a timbre like the user's own. The prosody model adjusts the prosodic parameters by controlling the duration and pitch of each note, so that the rhythm and pitch follow the song naturally and smoothly. The prosodic parameters and spectral parameters are then combined to generate the singing voice, synthesizing a segment performed in the user's timbre with a melody similar to the original. The whole process takes 1-2 seconds.
In some embodiments, after obtaining the lyric audio, further comprising:
detecting whether the lyric audio correctly corresponds to the lyrics; illustratively, each word read by the user in the lyric audio is checked for correctness. For example, the word "zhong" ("middle") in "Love My China" may be misread as "zong", which counts as a misread word.
If not, further determining wrong words in the lyric audio; if so, the subsequent song synthesizing steps S20-S50 continue.
Determining user audio features corresponding to the current user according to the lyric audio; illustratively, audio features that characterize the user are extracted from the lyric audio read by the user, so that a correct reading can be synthesized in the user's voice or the misreading can be adjusted into a correct one.
Correcting the erroneous words according to the user audio features to obtain lyric audio that correctly corresponds to the lyrics, and then performing steps S20-S50 in sequence.
According to this embodiment, when the user misreads the lyrics the erroneous content is corrected automatically, so that song synthesis proceeds smoothly and the user does not need to read the lyrics again.
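The detection of misread words can be illustrated with a simple comparison. The sketch assumes a speech recognizer has already produced one pinyin syllable per lyric word, aligned with the expected lyrics; the recognizer itself, the function name, and the sample data are hypothetical.

```python
def find_misread(expected, recognized):
    """Return (index, expected, heard) for every position that differs.

    Both arguments are equal-length lists of pinyin syllables, one per word.
    """
    return [
        (i, want, heard)
        for i, (want, heard) in enumerate(zip(expected, recognized))
        if want != heard
    ]

expected = ["ai", "wo", "zhong", "hua"]
recognized = ["ai", "wo", "zong", "hua"]   # "zhong" misread as "zong"
errors = find_misread(expected, recognized)
```

A recognizer that inserts or drops syllables would need a proper sequence alignment before this comparison; the sketch leaves that out.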
In some embodiments, after obtaining the lyric audio, further comprising:
detecting whether the lyric audio correctly corresponds to the lyrics; illustratively, each word read by the user in the lyric audio is checked for correctness. For example, the word "zhong" ("middle") in "Love My China" may be misread as "zong", which counts as a misread word.
If not, further determining wrong words in the lyric audio; if so, the subsequent song synthesizing steps S20-S50 continue.
Presenting the determined erroneous words to the user and directing the user to recite the erroneous words individually;
obtaining a corrected audio of the user separately reciting the wrong word;
determining a correct lyric audio according to the corrected audio and the lyric audio, and sequentially performing steps S20-S50.
According to this embodiment, when the user misreads the lyrics, the erroneous content is identified automatically and the user is guided to re-read only the erroneous part rather than the whole lyrics. Song synthesis thus proceeds smoothly, user experience improves, and the long overall synthesis time that re-reading the whole lyrics would cause is avoided.
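Combining the corrected audio with the original lyric audio can be sketched as a splice. The sketch assumes the sample span of each word in the original recording is known (for example from the segmentation information) and that the spans are contiguous; the function name and the integer sample values are hypothetical.

```python
def splice_corrections(lyric_audio, word_spans, corrections):
    """Rebuild the lyric audio, replacing misread words with re-read audio.

    lyric_audio: list of samples of the original reading.
    word_spans:  (start, end) sample indices per word, end exclusive.
    corrections: {word_index: samples of the separately re-read word}.
    """
    out = []
    for index, (start, end) in enumerate(word_spans):
        out.extend(corrections.get(index, lyric_audio[start:end]))
    return out

audio = [1, 2, 3, 4, 5, 6]
spans = [(0, 2), (2, 4), (4, 6)]
fixed = splice_corrections(audio, spans, {1: [9, 9, 9]})
```

The replacement need not have the same length as the word it replaces, since later duration adjustment aligns each word to the song anyway.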
In some embodiments, before obtaining the lyric audio, further comprising:
acquiring user attribute information, wherein the user attribute information comprises user gender and user age;
generating a song recommendation list to be synthesized according to the user attribute information;
the acquiring the lyric audio comprises:
determining a song to be synthesized according to the selection operation of a user, and presenting the lyrics of the song to be synthesized to the user;
and detecting and acquiring the lyric audio of the song to be synthesized read by the user.
In the embodiment, a suitable song list is recommended for the user according to the attribute information of the user, so that the user can conveniently and quickly obtain interested songs, and the experience of the user in the song synthesis operation is improved.
As shown in fig. 2, an embodiment of the present invention further provides a song synthesizing system 200, which includes:
an audio obtaining program module 210, configured to obtain lyric audio, where the lyric audio is read-aloud audio of the lyrics of a song to be synthesized;
a dry sound obtaining program module 220, configured to obtain a target song dry sound corresponding to the song to be synthesized;
a feature obtaining program module 230, configured to obtain target audio features corresponding to the target song dry sound;
a feature adjustment program module 240, configured to adjust an audio feature of the lyric audio according to the target audio feature to obtain a corresponding song audio;
and an audio synthesizing program module 250, configured to synthesize the song audio and the corresponding background music to obtain a song.
The song synthesis system according to the embodiment of the present invention may be configured to execute the song synthesis method according to the embodiment of the present invention, and accordingly achieves the technical effects of that method, which are not described again here. In the embodiment of the present invention, the relevant functional modules may be implemented by a hardware processor.
The invention provides a technical scheme for real-time synthesis of highly natural songs that supports any song; the scheme is divided into an offline part and an online part.
The offline part builds the song library data. First, a professional singer records the dry sound and background music of each song to be synthesized and converted, and the dry sound is cut into the song segments used for synthesis. The initial and final of each word in the dry sound are labelled, the fundamental frequency of the dry sound (which is a segment of audio data) is extracted, and the fundamental frequency of the initials and finals is corrected. Some initials and finals (unvoiced sounds) have a fundamental frequency of 0 and others (voiced sounds) a nonzero one; a plosive, for example, has a fundamental frequency of 0. Because current fundamental frequency extraction tools may be inaccurate, the fundamental frequency data of the initials and finals is adjusted accordingly.
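One possible form of the fundamental frequency correction is sketched below: any F0 value that a pitch tracker reports inside a plosive initial is forced to 0. This is an assumption about what the correction might look like, not the procedure used in the patent; the function name, the label set, and the sample values are hypothetical.

```python
# Mandarin plosive initials in pinyin; all of them are voiceless.
PLOSIVE_INITIALS = {"b", "p", "d", "t", "g", "k"}

def correct_f0(f0_frames, segments):
    """Zero out spurious F0 values inside plosive initials.

    f0_frames: frame-level F0 values in Hz from a pitch tracker.
    segments:  (label, start_frame, end_frame) tuples, end exclusive.
    """
    fixed = list(f0_frames)  # leave the caller's data untouched
    for label, start, end in segments:
        if label in PLOSIVE_INITIALS:
            for i in range(start, end):
                fixed[i] = 0.0
    return fixed

raw = [180.0, 90.0, 200.0, 202.0]          # tracker error on the initial "d"
segs = [("d", 0, 2), ("a", 2, 4)]
clean = correct_f0(raw, segs)
```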
Here, the song dry sound is the sound of the song with the background music removed. Recording is not the only way to obtain it: the dry sound and the background music can also be separated from an existing song by mature existing techniques. For example, techniques such as echo cancellation (the method is not limited to this) can remove most of the background music from the song audio, leaving the song dry sound.
Cutting is done per syllable, per phoneme, or per initial and final. For example, if an audio file sings "Love My China", the fundamental frequency of the whole phrase is obtained first, and the initials and finals of its four words ("ai", "wo", "zhong", and "hua") are then segmented manually; segmentation by phoneme, by syllable, and so on works on the same principle.
The online part begins with a data preprocessing stage: the audio data (for example, lyric audio of the user reading the lyrics of the song to be synthesized) is proofread by speech recognition to detect whether it corresponds to the lyrics, and phoneme segmentation information for the user audio is obtained with a deep learning model trained on big data.
The audio data then undergoes noise reduction, endpoint detection, UV (unvoiced/voiced) fundamental frequency repair, and similar processing. The phoneme segmentation information, the song features from the song library, and the audio data are fed together into a trained adaptive model for data matching. The adaptive model adjusts the features of the user's read-aloud audio to the melody of the song; its training data is a large amount of accurately labelled speech data with segmentation labels, so that the features of the user audio come to match the features of the song.
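Endpoint detection, one of the preprocessing steps named above, can be illustrated with a short-time energy threshold. This is a common textbook approach shown only as an example; the patent does not state which method is used, and the function name, frame length, and threshold are hypothetical.

```python
def detect_endpoints(samples, frame_len, threshold):
    """Return (start, end) sample indices of the active region, or None.

    A frame counts as active when its mean squared amplitude exceeds
    the threshold; the region runs from the first to the last active frame.
    """
    active_starts = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy > threshold:
            active_starts.append(start)
    if not active_starts:
        return None
    end = min(active_starts[-1] + frame_len, len(samples))
    return active_starts[0], end

signal = [0, 0, 0, 0, 5, -5, 5, -5, 4, -4, 4, -4, 0, 0, 0, 0]
region = detect_endpoints(signal, frame_len=4, threshold=1)
```

Trimming the recording to this region removes leading and trailing silence before segmentation and feature matching.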
The converted audio features are post-processed by signal processing or by big-data-driven methods so that the converted audio is closer in pronunciation characteristics to natural human singing, and finally the background music is mixed in.
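The final step of mixing in the background music can be sketched as a gain-weighted sum with clipping. The sketch assumes both signals are lists of floating-point samples in the range -1.0 to 1.0 at the same sample rate; the function name and the default gains are hypothetical choices, not values from the patent.

```python
def mix(voice, background, voice_gain=1.0, bgm_gain=0.5):
    """Sum the vocal and the background music, clipping to [-1.0, 1.0]."""
    length = max(len(voice), len(background))
    out = []
    for i in range(length):
        v = voice[i] if i < len(voice) else 0.0            # pad the shorter
        b = background[i] if i < len(background) else 0.0  # signal with silence
        sample = voice_gain * v + bgm_gain * b
        out.append(max(-1.0, min(1.0, sample)))
    return out

mixed = mix([0.5, 0.9], [0.5, 0.5, 0.5])
```

Hard clipping is the crudest way to keep the sum in range; a production mixer would more likely normalize or apply a limiter.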
Based on the design of the invention, a "reading sentences into songs" feature can be realized: the user provides the system with audio of the lyric text, including but not limited to natural human voice, synthesized voice, spliced voice, and recorded voice data, and the system outputs a song synthesized in the user's timbre. For the synthesized song, features including but not limited to scoring incentives, sharing song greetings to the user's friend circle, and song voting are supported.
The method can synthesize "reading sentences into songs" audio for any song. It uses a signal-driven or big-data-driven speech model to predict and process the sample points of the user's audio, adjusting the user's voice to the pitch and syllables of the song while preserving the user's timbre and the semantics.
As shown in FIG. 3, the execution method of the "reading sentences into songs" product of the present invention comprises the following steps:
Step 1: Enter the program interface and fill in or select user information (including but not limited to gender and age); a song list is generated.
Step 2: Select the song to be synthesized and upload audio, including but not limited to natural human voice, synthesized voice, spliced voice, and recorded voice.
Step 3: Verify the audio data quality (this function is optional); if the audio is correct the song is synthesized, and if it is in error the audio is recorded again.
Step 4: The user obtains the synthesized song audio and may perform operations including but not limited to playing, sharing, voting for, and downloading the song.
As shown in FIG. 4, the song synthesis technique used in "reading sentences into songs" in the present invention comprises the following steps:
Step 1: Load song resources from the song library, including but not limited to lyrics, musical scores, signal-based or general big-data models, adaptive models for different people, song categories, and environments, and empirically tuned parameters.
Step 2: Preprocess the data, including but not limited to noise reduction, speech recognition, endpoint detection, speaker characterization (gender, etc.), audio quality verification, and language model adaptation. This achieves text proofreading of the audio data, endpoint detection of silent segments, normalization of excitation segments, and so on.
Step 3: Using a signal-driven or big-data-driven speech signal model, and through feature processing and prediction or sample-point processing and prediction, convert the pitch of the user audio to the pitch and corresponding syllables of the designated song while preserving the speaker's timbre and the semantics, so that the result better matches the real musical score of the song.
Step 4: Post-process the converted audio by signal processing or big-data-driven methods so that it is closer in pronunciation characteristics to natural human singing.
It should be noted that for simplicity of explanation, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention. In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In some embodiments, the present invention provides a non-transitory computer readable storage medium, in which one or more programs including executable instructions are stored, and the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any of the above song synthesis methods of the present invention.
In some embodiments, embodiments of the invention also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any of the song synthesis methods described above.
In some embodiments, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a song synthesis method.
In some embodiments, an embodiment of the present invention further provides a storage medium having a computer program stored thereon, wherein the program is executed by a processor to perform a song synthesis method.
Fig. 5 is a schematic diagram of a hardware structure of an electronic device for executing a song synthesizing method according to another embodiment of the present application, and as shown in fig. 5, the electronic device includes:
one or more processors 510 and memory 520, with one processor 510 being an example in fig. 5.
The apparatus for performing the song synthesizing method may further include: an input device 530 and an output device 540.
The processor 510, the memory 520, the input device 530, and the output device 540 may be connected by a bus or other means, and the bus connection is exemplified in fig. 5.
The memory 520, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the song synthesis method in the embodiments of the present application. The processor 510 executes various functional applications of the server and data processing by executing the non-volatile software programs, instructions, and modules stored in the memory 520, so as to implement the song synthesis method of the above-described method embodiment.
The memory 520 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the song synthesis apparatus, and the like. Further, the memory 520 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 520 may optionally include memory located remotely from the processor 510, which may be connected to the song synthesis apparatus via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 530 may receive input numeric or character information and generate signals related to user settings and function control of the song synthesis apparatus. The output device 540 may include a display device such as a display screen.
The one or more modules are stored in the memory 520 and, when executed by the one or more processors 510, perform a song synthesis method in any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
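As a rough illustration of how the program modules stored in the memory 520 might be chained when run by the processor 510, the following Python sketch walks through the steps of the method on toy data. Every name in it (`TargetFeatures`, `extract_target_features`, `adjust_to_target`, `mix`, `synthesize_song`) is hypothetical, and the simple length-and-peak matching is only a stand-in for the adaptive and prosody models of the embodiments, not the processing actually used.

```python
from dataclasses import dataclass
from typing import Dict, List

Audio = List[float]  # mono samples; a toy stand-in for a real audio buffer


@dataclass
class TargetFeatures:
    """Hypothetical, greatly simplified 'target audio characteristics'."""
    length: int   # number of samples in the target song dry sound
    peak: float   # maximum absolute amplitude of the dry sound


def extract_target_features(dry_sound: Audio) -> TargetFeatures:
    """Feature acquisition step: summarize the dry sound."""
    peak = max((abs(s) for s in dry_sound), default=0.0)
    return TargetFeatures(length=len(dry_sound), peak=peak)


def adjust_to_target(lyric_audio: Audio, target: TargetFeatures) -> Audio:
    """Feature adjustment step (assumption: stretch to the target length by
    nearest-neighbour resampling, then scale to the target peak level)."""
    if not lyric_audio or target.length == 0:
        return []
    n = len(lyric_audio)
    stretched = [lyric_audio[i * n // target.length] for i in range(target.length)]
    peak = max(abs(s) for s in stretched)
    gain = target.peak / peak if peak else 0.0
    return [s * gain for s in stretched]


def mix(song_audio: Audio, background: Audio) -> Audio:
    """Synthesis step: average the two tracks sample by sample,
    treating the shorter track as silent once it ends."""
    n = max(len(song_audio), len(background))
    mixed = []
    for i in range(n):
        v = song_audio[i] if i < len(song_audio) else 0.0
        b = background[i] if i < len(background) else 0.0
        mixed.append(0.5 * (v + b))
    return mixed


def synthesize_song(lyric_audio: Audio, song_id: str,
                    dry_library: Dict[str, Audio],
                    background_library: Dict[str, Audio]) -> Audio:
    """Chain the steps: dry sound lookup, feature extraction, adjustment, mixing."""
    dry_sound = dry_library[song_id]
    target = extract_target_features(dry_sound)
    song_audio = adjust_to_target(lyric_audio, target)
    return mix(song_audio, background_library[song_id])
```

In this sketch the two dictionaries play the role of a song library keyed by a song identifier; a real system would replace each function with the corresponding trained model or signal-processing module.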
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication capabilities and whose primary purpose is to provide voice and data communication. Such terminals include smartphones (e.g., iPhones), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as iPads.
(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players (e.g., iPods), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
(4) Servers, whose architecture is similar to that of a general-purpose computer but which, because they must provide highly reliable services, have higher requirements for processing capability, stability, reliability, security, scalability, manageability, and the like.
(5) Other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general-purpose hardware platform, and of course can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or in the part that contributes over the related art, may be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods of the embodiments or of some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A song synthesis method, comprising:
acquiring lyric audio which is lyric reading audio corresponding to a song to be synthesized;
detecting whether the lyric audio correctly corresponds to the corresponding lyrics;
if not, further determining wrong words in the lyric audio;
presenting the determined erroneous words to the user and directing the user to recite the erroneous words individually;
obtaining a corrected audio of the user separately reciting the wrong word;
determining a correct lyric audio according to the corrected audio and the lyric audio;
acquiring a target song dry sound corresponding to the current song to be synthesized;
acquiring target audio characteristics corresponding to the dry sound of the target song;
adjusting the audio characteristics of the lyric audio according to the target audio characteristics to obtain corresponding song audio;
and synthesizing the song audio and the corresponding background music to obtain the song.
2. The method according to claim 1, further comprising pre-constructing a song library in which song dry sounds of a plurality of songs to be synthesized are stored;
the acquiring of the target song dry sound corresponding to the current song to be synthesized comprises: acquiring the target song dry sound corresponding to the song to be synthesized from the song library.
3. The method of claim 2, further comprising obtaining audio segmentation information of the user audio according to the lyric audio;
the adjusting the audio characteristics of the lyric audio according to the target audio characteristics to obtain corresponding song audio comprises:
inputting the target audio characteristics, the audio segmentation information and the lyric audio into an adaptive model to obtain the corresponding song audio.
4. The method of claim 3, wherein the inputting the target audio characteristics, the audio segmentation information, and the lyric audio into an adaptive model to obtain the corresponding song audio comprises:
inputting the lyric audio and the audio segmentation information into a pre-trained acoustic adaptive model to perform adaptive processing on the lyric audio;
inputting the target audio characteristics into a pre-trained song prosody model to obtain prosody parameters;
and adjusting the adaptively processed lyric audio according to the prosody parameters to obtain the corresponding song audio.
5. The method of claim 3 or 4, wherein the audio segmentation information comprises phoneme segmentation information and/or syllable segmentation information and/or initial and final segmentation information.
6. The method of claim 1, wherein after obtaining the lyric audio, further comprising:
detecting whether the lyric audio correctly corresponds to the corresponding lyrics;
if not, further determining wrong words in the lyric audio;
determining a user audio characteristic corresponding to a current user according to the lyric audio;
and correcting the wrong words according to the user audio characteristics to obtain lyric audio that correctly corresponds to the corresponding lyrics.
7. The method of claim 1, wherein prior to obtaining the lyric audio, further comprising:
acquiring user attribute information, wherein the user attribute information comprises user gender and user age;
generating a song recommendation list to be synthesized according to the user attribute information;
the acquiring the lyric audio comprises:
determining a song to be synthesized according to the selection operation of a user, and presenting the lyrics of the song to be synthesized to the user;
and detecting and acquiring the lyric audio of the song to be synthesized read by the user.
8. A song synthesis system, comprising:
an audio acquisition program module for acquiring lyric audio, the lyric audio being lyric reading audio corresponding to a song to be synthesized;
a lyric audio correction program module for detecting whether the lyric audio correctly corresponds to the corresponding lyrics; if not, further determining wrong words in the lyric audio; presenting the determined wrong words to the user and directing the user to recite the wrong words individually; obtaining a corrected audio of the user separately reciting the wrong words; determining a correct lyric audio according to the corrected audio and the lyric audio;
a dry sound acquisition program module for acquiring the target song dry sound corresponding to the current song to be synthesized;
a characteristic acquisition program module for acquiring target audio characteristics corresponding to the dry sound of the target song;
a characteristic adjustment program module for adjusting the audio characteristics of the lyric audio according to the target audio characteristics to obtain corresponding song audio;
and an audio synthesis program module for synthesizing the song audio and the corresponding background music to obtain the song.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-7.
10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
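The wrong-word detection and correction steps of claims 1 and 8 can be pictured with a small Python sketch. It assumes that a speech recognizer has already produced a word-level transcript of the lyric audio and that the lyric audio has already been split into one segment per lyric word; neither component is specified here, and the function names `find_wrong_words` and `splice_corrections` are hypothetical. The alignment uses Python's standard `difflib.SequenceMatcher`, which is one possible choice and not something the claims prescribe.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def find_wrong_words(lyrics: List[str], recognized: List[str]) -> List[int]:
    """Return the indices of lyric words that the reading got wrong or skipped.

    `lyrics` is the reference word sequence; `recognized` is the (assumed)
    word-level transcript of what the user actually read.
    """
    matcher = SequenceMatcher(a=lyrics, b=recognized, autojunk=False)
    wrong: List[int] = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # 'replace' = misread words, 'delete' = words missing from the reading.
        # 'insert' (extra spoken words) covers no lyric index, so it is ignored.
        if tag in ("replace", "delete"):
            wrong.extend(range(i1, i2))
    return wrong


def splice_corrections(word_audio: List[List[float]],
                       corrections: Dict[int, List[float]]) -> List[float]:
    """Build the corrected lyric audio.

    `word_audio[i]` holds the samples originally read for lyric word i, and
    `corrections[i]` holds the samples of the user's separate re-reading of
    word i. Words without a correction keep their original samples.
    """
    corrected: List[float] = []
    for index, segment in enumerate(word_audio):
        corrected.extend(corrections.get(index, segment))
    return corrected


lyrics = ["twinkle", "twinkle", "little", "star"]
recognized = ["twinkle", "twinkle", "middle", "star"]
wrong_indices = find_wrong_words(lyrics, recognized)  # words to re-read
```

The indices returned by `find_wrong_words` identify the words that would be presented to the user for individual re-reading; the re-recorded segments are then passed to `splice_corrections` keyed by the same indices.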
CN201910188123.5A 2019-01-18 2019-03-13 Song synthesis method and system Active CN109949783B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2019100483550 2019-01-18
CN201910048355 2019-01-18

Publications (2)

Publication Number Publication Date
CN109949783A CN109949783A (en) 2019-06-28
CN109949783B CN109949783B (en) 2021-01-29

Family

ID=67009722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910188123.5A Active CN109949783B (en) 2019-01-18 2019-03-13 Song synthesis method and system

Country Status (1)

Country Link
CN (1) CN109949783B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112309351A (en) * 2019-07-31 2021-02-02 武汉Tcl集团工业研究院有限公司 Song generation method and device, intelligent terminal and storage medium
CN112417201A (en) * 2019-08-22 2021-02-26 北京峰趣互联网信息服务有限公司 Audio information pushing method and system, electronic equipment and computer readable medium
CN110600034B (en) * 2019-09-12 2021-12-03 广州酷狗计算机科技有限公司 Singing voice generation method, singing voice generation device, singing voice generation equipment and storage medium
CN111488485B (en) * 2020-04-16 2023-11-17 北京雷石天地电子技术有限公司 Music recommendation method based on convolutional neural network, storage medium and electronic device
CN111554267A (en) * 2020-04-23 2020-08-18 北京字节跳动网络技术有限公司 Audio synthesis method and device, electronic equipment and computer readable medium
CN112164387A (en) * 2020-09-22 2021-01-01 腾讯音乐娱乐科技(深圳)有限公司 Audio synthesis method and device, electronic equipment and computer-readable storage medium
CN112331234A (en) * 2020-10-27 2021-02-05 北京百度网讯科技有限公司 Song multimedia synthesis method and device, electronic equipment and storage medium
CN112289300B (en) * 2020-10-28 2024-01-09 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method and device, electronic equipment and computer readable storage medium
CN112669849A (en) * 2020-12-18 2021-04-16 百度国际科技(深圳)有限公司 Method, apparatus, device and storage medium for outputting information
CN112750422B (en) * 2020-12-23 2023-01-31 出门问问创新科技有限公司 Singing voice synthesis method, device and equipment
CN113539215B (en) * 2020-12-29 2024-01-12 腾讯科技(深圳)有限公司 Music style conversion method, device, equipment and storage medium
CN114863898A (en) * 2021-02-04 2022-08-05 广州汽车集团股份有限公司 Vehicle karaoke audio processing method and system and storage medium
CN113160849B (en) * 2021-03-03 2024-05-14 腾讯音乐娱乐科技(深圳)有限公司 Singing voice synthesizing method, singing voice synthesizing device, electronic equipment and computer readable storage medium
CN113488007B (en) * 2021-07-07 2024-06-11 北京灵动音科技有限公司 Information processing method, information processing device, electronic equipment and storage medium
CN113555001A (en) * 2021-07-23 2021-10-26 平安科技(深圳)有限公司 Singing voice synthesis method and device, computer equipment and storage medium
CN115910002B (en) * 2023-01-06 2023-05-16 之江实验室 Audio generation method, storage medium and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105788589B (en) * 2016-05-04 2021-07-06 腾讯科技(深圳)有限公司 Audio data processing method and device
CN106971704B (en) * 2017-04-27 2020-03-17 维沃移动通信有限公司 Audio processing method and mobile terminal
CN108538302B (en) * 2018-03-16 2020-10-09 广州酷狗计算机科技有限公司 Method and apparatus for synthesizing audio
CN108877766A (en) * 2018-07-03 2018-11-23 百度在线网络技术(北京)有限公司 Song synthetic method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN109949783A (en) 2019-06-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Patentee after: Sipic Technology Co.,Ltd.

Address before: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Patentee before: AI SPEECH Ltd.