CN113436601A - Audio synthesis method and device, electronic equipment and storage medium - Google Patents

Audio synthesis method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN113436601A
Authority
CN
China
Prior art keywords
text
original
target
information
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110587270.7A
Other languages
Chinese (zh)
Inventor
卢家辉 (Lu Jiahui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110587270.7A priority Critical patent/CN113436601A/en
Publication of CN113436601A publication Critical patent/CN113436601A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 - Processing of audio elementary streams
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 - End-user applications
    • H04N21/488 - Data services, e.g. news ticker
    • H04N21/4884 - Data services, e.g. news ticker for displaying subtitles
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81 - Monomedia components thereof
    • H04N21/8106 - Monomedia components thereof involving special audio data, e.g. different tracks for different languages
    • H04N21/8113 - Monomedia components thereof involving special audio data, e.g. different tracks for different languages comprising music, e.g. song in MP3 format
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 - Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845 - Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456 - Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Abstract

The present disclosure relates to an audio synthesis method, apparatus, electronic device, and storage medium. The method comprises: acquiring background music and original voice information of original audio, the original voice information comprising text information of the original voice, melody information, and time information corresponding to the text information and the melody information; acquiring a target text corresponding to the text information; converting the target text into target voice according to the melody information and the time information; and synthesizing the target voice and the background music to obtain target audio. With this technical scheme, the target text can be automatically converted into target voice according to the melody information and the time information, and the target audio with replaced lyrics is obtained by synthesizing the target voice with the background music. This avoids the cumbersome operations of obtaining the target voice manually or by recording, greatly lowers the barrier to creation, increases users' enthusiasm for creation, improves the quality of user-uploaded videos, and in turn increases the traffic and click-through rate of video websites.

Description

Audio synthesis method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to data processing technologies, and in particular, to an audio synthesis method and apparatus, an electronic device, and a storage medium.
Background
On video websites, many authors of Professionally Generated Content (PGC) upload video clips of songs whose lyrics have been replaced. Compared with the original song, the replacement lyrics are usually humorous and satirical, and the original song is usually a popular classic, so video works containing such songs spread easily across the network and can greatly increase a video website's traffic and click-through rate.
However, most of these lyric-replaced songs are currently produced by PGC users. An ordinary internet user who wants to edit a similar work must generate the singing audio of the replacement lyrics manually or obtain it by recording. This process is cumbersome and has a high barrier to entry, so ordinary users cannot conveniently produce a song with replaced lyrics.
Disclosure of Invention
The present disclosure provides an audio synthesis method, apparatus, electronic device, and storage medium, to at least solve the problems in the related art that generating a song with replaced lyrics involves a cumbersome process and a high barrier to entry. The technical scheme of the disclosure is as follows:
according to a first aspect of embodiments of the present disclosure, there is provided an audio synthesis method, including:
acquiring background music and original voice information of original audio, wherein the original voice information comprises: text information of original voice, melody information and time information corresponding to the text information and the melody information;
acquiring a target text corresponding to the text information;
converting the target text into target voice according to the melody information and the time information;
and synthesizing the target voice and the background music to obtain target audio.
In an optional implementation manner, the step of obtaining background music and original voice information of original audio includes:
acquiring the original audio;
extracting the background music and the original voice from the original audio;
performing voice recognition on the original voice to obtain an original text, time information corresponding to characters in the original text, acoustic characteristic parameters of the original voice and time information corresponding to the acoustic characteristic parameters; the text information is information contained in the original text, the melody information is acoustic feature parameters of the original voice, and the time information comprises time information corresponding to characters in the original text and time information corresponding to the acoustic feature parameters.
In an optional implementation manner, the step of converting the target text into the target voice according to the melody information and the time information includes:
performing voice synthesis on a first character in the target text to obtain a first voice fragment;
determining first time information corresponding to an original character according to a corresponding relation between the character and the time information in the original text, wherein the original character is a character corresponding to the first character in the original text;
determining a first acoustic characteristic parameter corresponding to the first time information according to the corresponding relation between the acoustic characteristic parameter and the time information;
adjusting the acoustic feature of the first voice segment according to the first acoustic feature parameter to obtain a target voice segment;
the step of synthesizing the target speech and the background music to obtain a target audio includes:
and synthesizing the target voice segment and the background music according to the first time information to obtain the target audio.
In an optional implementation manner, after the step of performing speech recognition on the original speech to obtain an original text, the method further includes:
performing sentence-breaking processing on the original text according to time information corresponding to each character in the original text to obtain a plurality of original text segments;
and counting the number of characters contained in each original text segment to obtain the text information.
In an optional implementation manner, the target text includes a target text fragment, and the step of obtaining the target text corresponding to the text information includes:
outputting and displaying the number of characters contained in the original text segment to prompt a user to input according to the number of the characters;
and acquiring a target text fragment corresponding to the original text fragment.
In an optional implementation manner, after the step of synthesizing the target speech and the background music to obtain the target audio, the method further includes:
calculating the starting time and the duration of the original text segment according to the time information corresponding to each character in the original text segment;
and outputting and displaying the target text segment according to the starting time and the duration of the original text segment.
In an alternative implementation, the step of outputting and displaying the target text segment includes:
and displaying each character in the target text segment in an animation mode according to the time information corresponding to each character in the original text segment.
According to a second aspect of the embodiments of the present disclosure, there is provided an audio synthesizing apparatus including:
an information obtaining module configured to obtain background music of original audio and original voice information, the original voice information including: text information of original voice, melody information and time information corresponding to the text information and the melody information;
a text acquisition module configured to acquire a target text corresponding to the text information;
a voice conversion module configured to convert the target text into a target voice according to the melody information and the time information;
and the audio synthesis module is configured to synthesize the target voice and the background music to obtain target audio.
In an optional implementation manner, the information obtaining module is specifically configured to:
acquiring the original audio;
extracting the background music and the original voice from the original audio;
performing voice recognition on the original voice to obtain an original text, time information corresponding to characters in the original text, acoustic characteristic parameters of the original voice and time information corresponding to the acoustic characteristic parameters; the text information is information contained in the original text, the melody information is acoustic feature parameters of the original voice, and the time information comprises time information corresponding to characters in the original text and time information corresponding to the acoustic feature parameters.
In an alternative implementation, the speech conversion module is specifically configured to:
performing voice synthesis on a first character in the target text to obtain a first voice fragment;
determining first time information corresponding to an original character according to a corresponding relation between the character and the time information in the original text, wherein the original character is a character corresponding to the first character in the original text;
determining a first acoustic characteristic parameter corresponding to the first time information according to the corresponding relation between the acoustic characteristic parameter and the time information;
adjusting the acoustic feature of the first voice segment according to the first acoustic feature parameter to obtain a target voice segment;
the audio synthesis module is specifically configured to:
and synthesizing the target voice segment and the background music according to the first time information to obtain the target audio.
In an optional implementation, the information obtaining module is further configured to:
performing sentence-breaking processing on the original text according to time information corresponding to each character in the original text to obtain a plurality of original text segments;
and counting the number of characters contained in each original text segment to obtain the text information.
In an optional implementation manner, the target text includes a target text segment, and the text obtaining module is specifically configured to:
outputting and displaying the number of characters contained in the original text segment to prompt a user to input according to the number of the characters;
and acquiring a target text fragment corresponding to the original text fragment.
In an alternative implementation, the apparatus further includes a text display module configured to:
calculating the starting time and the duration of the original text segment according to the time information corresponding to each character in the original text segment;
and outputting and displaying the target text segment according to the starting time and the duration of the original text segment.
In an alternative implementation, the text display module is specifically configured to:
and displaying each character in the target text segment in an animation mode according to the time information corresponding to each character in the original text segment.
According to a third aspect of the present disclosure, there is provided an electronic apparatus comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the audio synthesis method of the first aspect.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the audio synthesis method of the first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor of an electronic device, implements the audio synthesis method of the first aspect.
The technical scheme provided by the embodiment of the disclosure at least has the following beneficial effects:
the method firstly acquires background music and original voice information of original audio, wherein the original voice information comprises the following steps: text information, melody information and time information corresponding to the text information and the melody information of the original voice; then acquiring a target text corresponding to the text information; then, converting the target text into target voice according to the melody information and the time information; and synthesizing the target voice and the background music to obtain the target audio. By adopting the technical scheme, only the background music, the text information, the melody information, the time information and the target text corresponding to the text information of the original audio are required to be acquired, the target text can be automatically converted into the target voice according to the melody information and the time information, the target audio of the original audio which is replaced by the original voice is acquired by synthesizing the target voice and the background music, the complicated operations that a user manually generates the target voice or acquires the target voice through recording and the like are avoided, the creation threshold of the user is greatly reduced, a common internet user can finish video editing of the song with lyrics replaced through video editing software of a mobile terminal, the creation enthusiasm of the user is greatly improved, the quality of the video uploaded by the user is improved, and the flow and click rate of a video website are improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow diagram illustrating a method of audio synthesis according to an example embodiment.
FIG. 2 is a flow diagram illustrating another audio synthesis method according to an example embodiment.
Fig. 3 is a block diagram illustrating an audio synthesis apparatus according to an example embodiment.
FIG. 4 is a block diagram illustrating an electronic device in accordance with an example embodiment.
FIG. 5 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Currently, videos of songs with replaced lyrics are mostly produced by PGC authors. PGC refers to Professionally Generated Content; such content is mostly well made and entertaining, and spreads easily on the internet. PGC authors specialize in producing this content; unlike ordinary internet users, they are mostly skilled with video software, and their editing level is far above that of ordinary users. An ordinary internet user who wants to edit a video of a lyric-replaced song must generate the singing audio of the replacement lyrics manually or obtain it by recording. This process is cumbersome and has a high barrier to entry, so ordinary users cannot conveniently produce a song with replaced lyrics.
Fig. 1 is a flowchart illustrating an audio synthesis method according to an exemplary embodiment; the method may be applied to an electronic device such as a terminal. The terminal includes, but is not limited to, a tablet computer, a smartphone, a handheld e-reader, a laptop computer, a desktop computer, a wearable device, and the like. As shown in Fig. 1, the audio synthesis method includes the following steps.
In step S11, background music and original speech information of the original audio are acquired, the original speech information including: text information of the original voice, melody information, and time information corresponding to the text information and the melody information.
The original audio may be, for example, a segment of a song or an entire song, and is composed of background music and original voice. The background music is music without vocals; it may be music from which the singing voice of a song has been filtered out, or pure instrumental music, which this embodiment does not limit. The original voice may be the singing voice in an original song, which this embodiment likewise does not limit.
The original text may be a text of lyrics obtained by speech recognition of an original speech such as singing voice, or the like.
The text information may include information such as the number of characters in an original text (e.g., lyrics) obtained by performing speech recognition on an original speech, which is not limited in this embodiment.
The melody information may include acoustic feature parameters such as pitch of the original speech, which is not limited in this embodiment.
The time information corresponding to the text information and the melody information may include, for example, the start time and duration (or end time), within the background music, of each character in the original text and of each acoustic feature parameter in the melody information, which this embodiment does not limit.
In this embodiment, the background music and the original voice information of the original audio may be imported into the terminal by the user, or may be stored in the terminal in advance. Alternatively, the user may select an original audio, such as a song, from local storage or a music library; the terminal then extracts the background music and the original voice from it and analyzes the original voice to obtain the text information, the melody information, and the time information corresponding to them. The following embodiments describe this latter implementation in detail.
In step S12, a target text corresponding to the text information is acquired.
In a specific implementation, a target text input by a user is obtained, and the target text can be used as a lyric text replacing an original text such as original lyrics.
The target text input by the user may be a partial lyric text or a complete lyric text corresponding to the background music, and the embodiment does not limit this.
When the text information is the number of characters in the original text, the target text corresponding to the text information may be a target text whose number of characters matches that of the original text.
In step S13, the target text is converted into the target voice according to the melody information and the time information.
In a specific implementation, the melody information of the original speech may be used as the melody information of the target text; for example, the pitch corresponding to each character in the original text may be used as the pitch of the corresponding character in the target text. Likewise, according to the time information corresponding to the text information and the melody information, the time information of each character of the original text within the background music is used as the time information of the corresponding character of the target text within the target voice. The target text is then converted into a voice signal using text-to-speech (TTS) technology to obtain the target voice. TTS technology converts text entered by a user or supplied externally into speech output, generating a voice signal corresponding to the text so that a machine can simulate a human speaking.
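As an illustrative sketch of the TTS step only (the patent does not name an engine; pyttsx3 and the parameter value below are assumptions), converting a target text into a raw speech waveform, before any melody or timing adjustment, might look like this:

```python
# Hedged sketch: synthesize the target text to a plain speech waveform.
# pyttsx3 is an assumed offline TTS engine; the patent names only "TTS".
import pyttsx3

def text_to_speech(target_text: str, out_path: str = "target_speech.wav") -> str:
    engine = pyttsx3.init()
    engine.setProperty("rate", 150)            # speaking rate; assumed value
    engine.save_to_file(target_text, out_path)
    engine.runAndWait()                        # block until synthesis completes
    return out_path
```

The pitch and timing of this raw waveform would then be adjusted toward the original singing voice, as described below.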
In step S14, the target speech is synthesized with the background music to obtain the target audio.
In a specific implementation, a singing voice synthesis technique can be adopted to fuse the target voice with the background music of the original audio to obtain the target audio, thereby creating a song with replaced lyrics.
With the audio synthesis method of this exemplary embodiment, only the background music, text information, melody information, and time information of the original audio, plus the target text corresponding to the text information, need to be acquired; the target text is automatically converted into target voice according to the melody information and the time information, and the target audio, in which the original voice of the original audio is replaced by the target voice, is obtained by synthesizing the target voice with the background music. The result is audio with replaced lyrics, obtained without the cumbersome operations of generating the target voice manually or obtaining it by recording. This greatly lowers the barrier to creation: an ordinary internet user can complete a video clip of a lyric-replaced song with the video editing software of a mobile terminal. It greatly increases users' enthusiasm for creation, improves the quality of user-uploaded videos, and in turn increases the traffic and click-through rate of video websites, which is a significant positive effect.
In an optional implementation manner, step S11 may specifically include: acquiring original audio; extracting background music and original voice from original audio; performing voice recognition on the original voice to obtain an original text, time information corresponding to characters in the original text, acoustic characteristic parameters of the original voice and time information corresponding to the acoustic characteristic parameters; the text information is information contained in the original text, the melody information is acoustic characteristic parameters of the original voice, and the time information comprises time information corresponding to characters in the original text and time information corresponding to the acoustic characteristic parameters.
In a specific implementation, the original audio may first be acquired by the user importing a song from local storage or a music library. A vocal separation technique can then separate the background music and the original voice in the original audio, and the two can be stored separately on a local disk. The vocal separation technique uses voice activity detection to find the parts of the original audio that contain a human voice, then uses Robust Principal Component Analysis (RPCA) to separate the voice from the accompaniment, yielding the voice (the original voice) and the accompaniment (the background music).
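The patent names RPCA for the separation; as a hedged stand-in, the median-filtering and soft-mask approach from librosa's vocal-separation example plays the same role of splitting the voice from the accompaniment:

```python
# Sketch of vocal/accompaniment separation. RPCA is what the patent names;
# this substitute uses librosa's nearest-neighbour median filtering plus
# soft masks, which likewise yields an original-voice and a background track.
import numpy as np
import librosa
import soundfile as sf

def separate(path: str):
    y, sr = librosa.load(path, sr=None)
    S, phase = librosa.magphase(librosa.stft(y))
    # Median-filtered spectrogram approximates the repeating accompaniment.
    S_filter = librosa.decompose.nn_filter(
        S, aggregate=np.median, metric="cosine",
        width=int(librosa.time_to_frames(2, sr=sr)))
    S_filter = np.minimum(S, S_filter)
    margin = 2
    mask_voice = librosa.util.softmask(S - S_filter, margin * S_filter, power=2)
    mask_music = librosa.util.softmask(S_filter, margin * (S - S_filter), power=2)
    voice = librosa.istft(mask_voice * S * phase)   # estimated original voice
    music = librosa.istft(mask_music * S * phase)   # estimated background music
    sf.write("original_voice.wav", voice, sr)
    sf.write("background_music.wav", music, sr)
    return voice, music, sr
```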
The locally stored original speech can then be recognized as original text by speech recognition technology; the original text can include a plurality of characters. During recognition, the time information corresponding to the characters in the original text, the acoustic feature parameters of the original speech, and the time information corresponding to the acoustic feature parameters can also be obtained. The time information corresponding to a character may include its start time and duration (or end time) within the background music; the time information corresponding to an acoustic feature parameter may likewise include its start time and duration (or end time) within the background music.
The above process may be completed in video editing software. Here, speech recognition technology refers to technology by which a machine converts a voice signal into corresponding text or commands through recognition and understanding.
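A hedged sketch of this recognition step, assuming openai-whisper as the recognizer (the patent does not specify one) and word-level timing; character-level lyrics in Chinese would additionally need per-character alignment:

```python
# Sketch: recognize the separated voice and collect per-word start times
# and durations. whisper is an assumed choice, not the patent's method.
import whisper

def recognize_with_times(voice_path: str):
    model = whisper.load_model("base")
    result = model.transcribe(voice_path, word_timestamps=True)
    words = []
    for segment in result["segments"]:
        for w in segment["words"]:
            words.append({
                "char": w["word"].strip(),
                "start": w["start"],                # start time in seconds
                "duration": w["end"] - w["start"],  # duration in seconds
            })
    return result["text"], words
```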
In the implementation mode, the user can independently select the original audio, so that background music and original voice information of any original audio can be obtained, and the quality of the video created by the user can be improved.
In an optional implementation manner, step S13 may specifically include: performing voice synthesis on a first character in a target text to obtain a first voice fragment; determining first time information corresponding to the original characters according to the corresponding relation between the characters and the time information in the original text, wherein the original characters are characters corresponding to the first characters in the original text; determining a first acoustic characteristic parameter corresponding to the first time information according to the corresponding relation between the acoustic characteristic parameter and the time information; and adjusting the acoustic characteristics of the first voice segment according to the first acoustic characteristic parameters to obtain the target voice segment.
The first character is any character in the target text.
The corresponding relation between the characters in the original text and the time information can be obtained according to the time information corresponding to the characters in the original text.
The correspondence between the acoustic characteristic parameter and the time information may be obtained from the time information corresponding to the acoustic characteristic parameter.
In a specific implementation, text-to-speech (TTS) technology may be adopted: first, the first character is converted into a voice signal to obtain the first voice segment; then the first time information corresponding to the first character is determined; then the first acoustic feature parameter corresponding to the first time information is determined; and finally the acoustic features of the first voice segment are adjusted according to the first acoustic feature parameter to obtain the target voice segment corresponding to the first character. The target speech may include a target speech segment corresponding to each character in the target text.
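A minimal sketch of the adjustment, assuming the acoustic feature parameter is pitch (as the description suggests) and that both the pitch measured in the original voice at the first time information and the pitch of the raw TTS segment are known:

```python
# Hedged sketch: shift a synthesized segment by the semitone distance
# between its own pitch and the original voice's pitch at the same time.
import numpy as np
import librosa

def match_pitch(segment: np.ndarray, sr: int,
                target_f0: float, current_f0: float) -> np.ndarray:
    n_steps = 12 * np.log2(target_f0 / current_f0)  # semitone difference
    return librosa.effects.pitch_shift(segment, sr=sr, n_steps=n_steps)
```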
In this implementation, step S14 may specifically include: and synthesizing the target voice segment and the background music according to the first time information to obtain the target audio.
The first time information includes the start time and duration (or end time) of the target speech segment within the background music.
Specifically, a singing voice synthesis technique may be adopted to align and fuse the target speech segment with the background music according to the first time information, obtaining the target audio segment corresponding to the first character. The target audio comprises the target audio segments corresponding to all characters in the target text.
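A sketch of the alignment and fusion, with pydub as an assumed mixing library; genuine singing voice synthesis is more involved, and this shows only the time-positioned overlay:

```python
# Hedged sketch: overlay each target speech segment onto the background
# music at the start time recorded in its first time information.
from pydub import AudioSegment

def mix(background_path: str, segments: list[tuple[str, float]]) -> AudioSegment:
    """segments: (wav_path, start_time_in_seconds) pairs."""
    track = AudioSegment.from_file(background_path)
    for wav_path, start_s in segments:
        piece = AudioSegment.from_file(wav_path)
        track = track.overlay(piece, position=int(start_s * 1000))
    return track

# Example use (file names are assumptions):
# mix("background_music.wav", [("seg0.wav", 12.5)]).export("target_audio.wav", format="wav")
```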
In an alternative implementation manner, in step S11, after the step of performing speech recognition on the original speech to obtain the original text, the method may further include: and performing sentence-breaking processing on the original text according to the time information corresponding to each character in the original text to obtain a plurality of original text segments.
The original text fragment may correspond to a lyric of a sentence in the original text, for example.
When an original text segment is a line of lyrics, sentence breaking can be performed on the lyric text according to the interval between characters in the lyrics, yielding the original text segments.
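A minimal sketch of gap-based sentence breaking, assuming per-character start times and durations from the recognition step; the 0.5 s threshold is an assumed value, not one given by the patent:

```python
# Hedged sketch: start a new original text segment wherever the silence
# between consecutive characters exceeds a threshold.
def break_sentences(chars, gap_threshold=0.5):
    """chars: list of {"char", "start", "duration"} dicts in time order."""
    if not chars:
        return []
    segments, current = [], []
    for prev, cur in zip(chars, chars[1:]):
        current.append(prev)
        gap = cur["start"] - (prev["start"] + prev["duration"])
        if gap > gap_threshold:        # long pause => new line of lyrics
            segments.append(current)
            current = []
    current.append(chars[-1])
    segments.append(current)
    return segments
```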
In a specific implementation, the starting time of each original text segment and the duration of each character it contains may be stored per original text segment, for example in a JSON text format; a reconstructed sketch of the layout is shown below.
It should be noted that, in the JSON text, each line of lyrics (i.e., each original text segment) is an element of an array, and each element records the start time of that line and the duration of each character in it.
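The published figures showing the JSON text are not reproduced here; a minimal Python rendering consistent with that description (the field names are assumptions) might be:

```python
# Hypothetical reconstruction of the stored layout; field names assumed.
lyrics = [
    {                          # one element per line of lyrics (original text segment)
        "start_time": 12.50,   # start of this line within the background music, seconds
        "words": [
            {"char": "...", "duration": 0.42},   # one entry per character
            {"char": "...", "duration": 0.38},
        ],
    },
]
```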
In an alternative implementation manner, in step S11, after the step of obtaining a plurality of original text segments, the method may further include: and counting the number of characters contained in each original text segment to obtain text information.
In an alternative implementation manner, the target text may include a target text segment, and the step S12 may specifically include: outputting and displaying the number of characters contained in the original text fragment to prompt a user to input according to the number of the characters; and acquiring a target text segment corresponding to the original text segment.
In a specific implementation, the array under the JSON root node may be traversed, each element (the information for one line of lyrics) taken out, and the character count of each line obtained. A text box can then pop up in the terminal interface prompting the user to rewrite the lyrics to that character count; the number of characters of the replacement lyrics (the target text segment) entered in the text box is required to match the number of characters of the original lyrics (the original text segment). The target text segments entered by the user are then saved into the JSON file, until the last line of lyrics in the array has been traversed.
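A hedged sketch of this traversal and prompt, reusing the assumed field names from the JSON sketch above; the console input stands in for the pop-up text box:

```python
# Sketch: walk the array under the JSON root, show each line's character
# count, and require the replacement line to match it exactly.
import json

def collect_replacements(json_path: str) -> None:
    with open(json_path, encoding="utf-8") as f:
        lines = json.load(f)
    for i, line in enumerate(lines):
        n = len(line["words"])
        while True:
            new_text = input(f"Line {i + 1}: enter exactly {n} characters: ")
            if len(new_text) == n:     # enforce matching character count
                break
        line["target_text"] = new_text
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(lines, f, ensure_ascii=False, indent=2)
```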
In an optional implementation manner, after step S14, the method may further include: calculating the starting time and the duration of the original text segment according to the time information corresponding to each character in the original text segment; and outputting and displaying the target text segment according to the starting time and the duration of the original text segment.
Specifically, the durations of all characters in the original text segment can be summed to obtain the duration of the original text segment; the target text segment is then output and displayed according to the start time of the first character of the original text segment and the duration of the segment.
The step of outputting and displaying the target text segment may specifically include: and displaying each character in the target text segment in an animation mode according to the time information corresponding to each character in the original text segment.
Specifically, the characters in the target text segment may be displayed in an animated form according to the duration of the characters in the original text segment.
In a specific implementation, the start time and duration of the original text segment may be used as the start time and duration of the target text segment. The duration of the target text segment then serves as its display duration; a subtitle is added in the video editing software, and the target text segment is output and displayed. A karaoke-style animation is added to the subtitle, with the duration of each character in the original text segment used as the duration of the corresponding character in the target text segment.
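As an illustrative sketch continuing the assumed field names above (the cue format is also an assumption), the karaoke timing could be derived like this:

```python
# Hedged sketch: each replacement character is displayed for exactly as
# long as the corresponding original character, producing karaoke cues.
def karaoke_cues(line):
    """line: {"start_time", "words", "target_text"} per the earlier sketches."""
    t = line["start_time"]
    cues = []
    for w, ch in zip(line["words"], line["target_text"]):
        cues.append((t, t + w["duration"], ch))   # (start, end, character)
        t += w["duration"]
    return cues

# The display duration of the whole subtitle line is then
# cues[-1][1] - cues[0][0], i.e. the duration of the original segment.
```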
With this technical scheme, an ordinary internet user can conveniently produce a song with replaced lyrics and edit a video around it, which greatly increases users' enthusiasm for video creation. For a video website, the quality of user-uploaded videos can be greatly improved, with a strongly positive effect on the website's traffic and click-through rate.
Fig. 2 is a flow diagram illustrating another audio synthesis method according to an exemplary embodiment, which includes the following steps, as shown in fig. 2.
In step S21, a song imported by the user from the local or a song selected from the online music library is acquired. Where the song is the original audio.
In step S22, the vocal sounds in the song and the background music are separated and stored in the local disk, respectively, by using the vocal separation technique. Wherein the human voice in the song is the original voice.
In step S23, a speech recognition technique is used to recognize the pitch of the human voice as well as the number of words in each line of lyrics and the start time and duration of each word, and the results are stored in JSON form.
In step S24, a dialog box pops up, prompting the user to modify the lyrics according to the word number of each lyric, and saving the lyrics modified by the user in the json file. Wherein the modified lyrics are the target text.
In step S25, a singing voice synthesis technique generates the target voice from the user-modified lyrics, based on the pitch of the human voice and the duration of each word in each line of lyrics; the target voice is then combined with the background music based on the start time of each line to generate the target audio.
In step S26, subtitle content is added.
With this implementation, a brand-new interaction can be built into video editing software by integrating a vocal separation algorithm, speech recognition technology, and singing voice synthesis technology. This interaction lets ordinary internet users conveniently generate video works of songs with replaced lyrics, greatly increasing their enthusiasm for video creation. The one-stop interaction spares the user the tedious work of aligning and replacing lyrics, greatly lowers the barrier to use, and is of great significance in encouraging users to create lyric-replacement videos.
Fig. 3 is a block diagram illustrating an audio synthesis apparatus according to an example embodiment. Referring to fig. 3, the apparatus includes:
an information obtaining module 31 configured to obtain background music of original audio and original voice information, the original voice information including: text information of original voice, melody information and time information corresponding to the text information and the melody information;
a text acquisition module 32 configured to acquire a target text corresponding to the text information;
a voice conversion module 33 configured to convert the target text into a target voice according to the melody information and the time information;
and an audio synthesizing module 34 configured to synthesize the target speech and the background music to obtain a target audio.
In an optional implementation manner, the information obtaining module is specifically configured to:
acquiring the original audio;
extracting the background music and the original voice from the original audio;
performing voice recognition on the original voice to obtain an original text, time information corresponding to characters in the original text, acoustic characteristic parameters of the original voice and time information corresponding to the acoustic characteristic parameters; the text information is information contained in the original text, the melody information is acoustic feature parameters of the original voice, and the time information comprises time information corresponding to characters in the original text and time information corresponding to the acoustic feature parameters.
In an alternative implementation, the speech conversion module is specifically configured to:
performing voice synthesis on a first character in the target text to obtain a first voice fragment;
determining first time information corresponding to an original character according to a corresponding relation between the character and the time information in the original text, wherein the original character is a character corresponding to the first character in the original text;
determining a first acoustic characteristic parameter corresponding to the first time information according to the corresponding relation between the acoustic characteristic parameter and the time information;
adjusting the acoustic feature of the first voice segment according to the first acoustic feature parameter to obtain a target voice segment;
the audio synthesis module is specifically configured to:
and synthesizing the target voice segment and the background music according to the first time information to obtain the target audio.
In an optional implementation, the information obtaining module is further configured to:
performing sentence-breaking processing on the original text according to time information corresponding to each character in the original text to obtain a plurality of original text segments;
and counting the number of characters contained in each original text segment to obtain the text information.
In an optional implementation manner, the target text includes a target text segment, and the text obtaining module is specifically configured to:
outputting and displaying the number of characters contained in the original text segment to prompt a user to input according to the number of the characters;
and acquiring a target text fragment corresponding to the original text fragment.
In an alternative implementation, the apparatus further includes a text display module configured to:
calculating the starting time and the duration of the original text segment according to the time information corresponding to each character in the original text segment;
and outputting and displaying the target text segment according to the starting time and the duration of the original text segment.
In an alternative implementation, the text display module is specifically configured to:
and displaying each character in the target text segment in an animation mode according to the time information corresponding to each character in the original text segment.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 4 is a block diagram of one type of electronic device 800 shown in the present disclosure. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 4, electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the method in any embodiment. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, it may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the panel. The touch sensors may not only sense the boundary of a touch or slide action but also detect the duration and pressure associated with it. In some embodiments, the multimedia component 808 includes a front-facing camera and/or a rear-facing camera. The front-facing and/or rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each camera may be a fixed optical lens system or have focus and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800. The sensor assembly 814 may also detect a change in the position of the electronic device 800 or one of its components, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in its temperature. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact, and may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the methods described in any of the embodiments.
In an exemplary embodiment, a computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the electronic device 800 to perform the method of any of the embodiments is also provided. For example, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, which comprises readable program code executable by the processor 820 of the device 800 to perform the method of any of the embodiments. Alternatively, the program code may be stored in a storage medium of the apparatus 800, and the computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 5 is a block diagram of one type of electronic device 1900 shown in the present disclosure. For example, the electronic device 1900 may be provided as a server.
Referring to fig. 5, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform a method as described in any of the embodiments.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. An audio synthesis method, comprising:
acquiring background music and original voice information of original audio, wherein the original voice information comprises: text information of original voice, melody information and time information corresponding to the text information and the melody information;
acquiring a target text corresponding to the text information;
converting the target text into target voice according to the melody information and the time information;
and synthesizing the target voice and the background music to obtain target audio.
2. The audio synthesizing method according to claim 1, wherein the step of obtaining the background music and the original voice information of the original audio comprises:
acquiring the original audio;
extracting the background music and the original voice from the original audio;
performing voice recognition on the original voice to obtain an original text, time information corresponding to characters in the original text, acoustic characteristic parameters of the original voice and time information corresponding to the acoustic characteristic parameters; the text information is information contained in the original text, the melody information is acoustic feature parameters of the original voice, and the time information comprises time information corresponding to characters in the original text and time information corresponding to the acoustic feature parameters.
3. The audio synthesizing method according to claim 2, wherein the step of converting the target text into the target voice based on the melody information and the time information includes:
performing voice synthesis on a first character in the target text to obtain a first voice fragment;
determining first time information corresponding to an original character according to a corresponding relation between the character and the time information in the original text, wherein the original character is a character corresponding to the first character in the original text;
determining a first acoustic characteristic parameter corresponding to the first time information according to the corresponding relation between the acoustic characteristic parameter and the time information;
adjusting the acoustic feature of the first voice segment according to the first acoustic feature parameter to obtain a target voice segment;
the step of synthesizing the target speech and the background music to obtain a target audio includes:
and synthesizing the target voice segment and the background music according to the first time information to obtain the target audio.
4. The audio synthesizing method according to claim 2, wherein after the step of performing speech recognition on the original speech to obtain an original text, further comprising:
performing sentence-breaking processing on the original text according to time information corresponding to each character in the original text to obtain a plurality of original text segments;
and counting the number of characters contained in each original text segment to obtain the text information.
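One way to picture the sentence-breaking of claim 4 is to start a new segment whenever the silence between consecutive characters exceeds a threshold; the 0.5 s gap and the toy timestamps below are invented example values.

    def split_segments(char_times, gap_threshold=0.5):
        segments, current = [], [char_times[0]]
        for prev, cur in zip(char_times, char_times[1:]):
            if cur[1] - prev[2] > gap_threshold:  # gap between prev end and cur start
                segments.append(current)
                current = []
            current.append(cur)
        segments.append(current)
        return segments

    char_times = [("小", 0.0, 0.35), ("星", 0.35, 0.8), ("星", 0.8, 1.3),
                  ("亮", 2.1, 2.5), ("晶", 2.5, 2.9), ("晶", 2.9, 3.3)]
    segments = split_segments(char_times)
    char_counts = [len(seg) for seg in segments]  # text information: [3, 3]
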
5. The audio synthesis method according to claim 4, wherein the target text comprises a target text segment, and the step of obtaining the target text corresponding to the text information comprises:
outputting and displaying the number of characters contained in each original text segment, to prompt a user to input text according to the number of characters;
and acquiring a target text fragment corresponding to the original text fragment.
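An illustrative sketch of the prompting step of claim 5; the console interaction is an invented stand-in for whatever user interface the device actually exposes.

    def collect_target_fragments(original_segments):
        fragments = []
        for i, seg in enumerate(original_segments, start=1):
            n = len(seg)  # character count of the original text segment
            fragment = input(f"Segment {i}: enter {n} characters: ")
            fragments.append(fragment)
        return fragments

    # Interactive example:
    # fragments = collect_target_fragments([["小", "星", "星"], ["亮", "晶", "晶"]])
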
6. The audio synthesis method according to claim 4 or 5, further comprising, after the step of synthesizing the target voice and the background music to obtain the target audio:
calculating the starting time and the duration of the original text segment according to the time information corresponding to each character in the original text segment;
and outputting and displaying the target text segment according to the starting time and the duration of the original text segment.
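A sketch of claim 6's timed display, assuming each segment's start time is the start of its first character and its duration runs to the end of its last character (lyric/subtitle style); the values are the same invented toy data as above.

    def display_schedule(original_segments, target_fragments):
        schedule = []
        for seg, fragment in zip(original_segments, target_fragments):
            start = seg[0][1]                        # start time of the first character
            duration = round(seg[-1][2] - start, 2)  # end of last character minus start
            schedule.append((start, duration, fragment))
        return schedule

    segments = [[("小", 0.0, 0.35), ("星", 0.35, 0.8), ("星", 0.8, 1.3)],
                [("亮", 2.1, 2.5), ("晶", 2.5, 2.9), ("晶", 2.9, 3.3)]]
    schedule = display_schedule(segments, ["一闪闪", "亮晶晶"])
    # -> [(0.0, 1.3, '一闪闪'), (2.1, 1.2, '亮晶晶')]
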
7. An audio synthesis apparatus, comprising:
an information acquisition module configured to acquire background music and original voice information of original audio, the original voice information comprising: text information of original voice, melody information and time information corresponding to the text information and the melody information;
a text acquisition module configured to acquire a target text corresponding to the text information;
a voice conversion module configured to convert the target text into target voice according to the melody information and the time information;
and an audio synthesis module configured to synthesize the target voice and the background music to obtain target audio.
8. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 6.
9. A computer-readable storage medium whose instructions, when executed by a processor of an electronic device, enable the electronic device to perform the method of any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 6.
CN202110587270.7A 2021-05-27 2021-05-27 Audio synthesis method and device, electronic equipment and storage medium Pending CN113436601A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110587270.7A CN113436601A (en) 2021-05-27 2021-05-27 Audio synthesis method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113436601A true CN113436601A (en) 2021-09-24

Family

ID=77803113

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210924