WO2022179110A1 - 一种混音歌曲生成方法、装置、设备及存储介质 - Google Patents

一种混音歌曲生成方法、装置、设备及存储介质 Download PDF

Info

Publication number
WO2022179110A1
WO2022179110A1 PCT/CN2021/122573 CN2021122573W WO2022179110A1 WO 2022179110 A1 WO2022179110 A1 WO 2022179110A1 CN 2021122573 W CN2021122573 W CN 2021122573W WO 2022179110 A1 WO2022179110 A1 WO 2022179110A1
Authority
WO
WIPO (PCT)
Prior art keywords
vocal
song
audio
signal
accompaniment
Prior art date
Application number
PCT/CN2021/122573
Other languages
English (en)
French (fr)
Inventor
闫震海
Original Assignee
腾讯音乐娱乐科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯音乐娱乐科技(深圳)有限公司 filed Critical 腾讯音乐娱乐科技(深圳)有限公司
Publication of WO2022179110A1 publication Critical patent/WO2022179110A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00Details of electrophonic musical instruments
    • G10H1/0008Associated control or indicating means
    • G10H1/0025Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00Details of electrophonic musical instruments
    • G10H1/36Accompaniment arrangements
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/005Musical accompaniment, i.e. complete instrumental rhythm synthesis added to a performed melody, e.g. as output by drum machines
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/101Music Composition or musical creation; Tools or processes therefor
    • G10H2210/125Medley, i.e. linking parts of different musical pieces in one single piece, e.g. sound collage, DJ mix

Definitions

  • the present application relates to the technical field of computer signal processing, and in particular, to a method, apparatus, device and storage medium for generating a mixed song.
  • the current way to make a remix song is to mix the left channel audio of one song with the right channel audio of another song, creating a wonderful stereo effect.
  • these two songs are two different sung versions of the same song.
  • the purpose of the present application is to provide a method, apparatus, device and storage medium for generating a mixed song, so as to cover more songs to generate a mixed song with a good mixing effect. Its specific plan is as follows:
  • the application provides a method for generating a remixed song, comprising:
  • the accompaniment signal aligned with the audio track of the vocal audio is determined as the accompaniment audio to be mixed
  • the vocal audio and the accompaniment audio are mixed to obtain a mixed song.
  • the present application also provides an apparatus for generating a mixed song, comprising:
  • an acquisition module for acquiring at least two song audios; the at least two song audios are different singing versions of the same song;
  • the extraction module is used to extract the vocal signal and the accompaniment signal in the audio frequency of each song, and obtains a vocal set comprising at least two vocal signals and an accompaniment set comprising at least two accompaniment signals;
  • the alignment module is used to select reference rhythm information from the rhythm information corresponding to the audio of each song, align all the vocal signals in the vocal set based on the reference rhythm information, and align all the audio tracks after the alignment.
  • the vocal signal is used as the vocal audio to be mixed;
  • a selection module for determining the accompaniment signal aligned with the track of the vocal audio in the accompaniment set as the accompaniment audio to be mixed
  • a mixing module configured to mix the vocal audio and the accompaniment audio to obtain a mixed song.
  • the present application also provides an electronic device, the electronic device includes a processor and a memory; wherein the memory is used to store a computer program, and the computer program is loaded and executed by the processor to realize the foregoing Remix song generation method.
  • the present application also provides a storage medium, where computer-executable instructions are stored in the storage medium, and when the computer-executable instructions are loaded and executed by a processor, the foregoing method for generating a mixed song is implemented.
  • the present application extracts the vocal signal and the accompaniment signal in each song audio, and then selects the reference rhythm information in the rhythm information corresponding to each song audio, All vocal signals are track-aligned based on the reference rhythm information, and all of the track-aligned vocal signals are used as the vocal audio to be mixed, and the accompaniment signal aligned with the vocal audio track is selected as the to-be-mixed accompaniment signal accompaniment audio, and finally mix the vocal audio and accompaniment audio to get a remixed song.
  • the application can mix at least two singing versions of the same song, and can cover more songs for mixing, and during the mixing process, all vocal signals in each singing version are track-aligned, and The accompaniment signal that is aligned with the vocal signal track is selected, so when mixing vocals and accompaniment, elements such as lyrics and beats can be kept in harmony and synchronization, and a remixed song with good mixing effect can be obtained. Effect.
  • the device, device and storage medium for generating a mixed song provided by the present application also have the above technical effects.
  • FIG. 1 is a schematic diagram of a physical architecture applicable to the present application provided by the present application.
  • FIG. 2 is a flow chart of a method for generating a remixed song provided by the application
  • Fig. 4 is a kind of Beat point schematic diagram that this application provides
  • FIG. 5 is a schematic diagram of a data segment corresponding to an adjacent beat group provided by the present application.
  • FIG. 6 is a schematic diagram of a data segment corresponding to another adjacent beat group provided by the present application.
  • FIG. 8 is a flowchart of a method for producing a remixed song provided by the application.
  • FIG. 9 is a schematic diagram of a mixing song generation device provided by the application.
  • Fig. 10 is a kind of server structure diagram provided by this application.
  • FIG. 11 is a structural diagram of a terminal provided by this application.
  • the present application proposes a mixed song generation scheme, which can cover more songs for mixing, and during the mixing process, all vocal signals in each singing version are tracked Align and select the accompaniment signal that is aligned with the vocal signal track, so when mixing vocals and accompaniment, elements such as lyrics, beats and other elements can be kept in harmony and synchronization, and a remixed song with good mixing effect can be obtained. mix effect.
  • the method for generating a mixed song provided by the present application can be applied to a system or program with a sound mixing function, such as a music game.
  • a system or program with a sound mixing function may run in a server, a personal computer, or other devices.
  • FIG. 1 is a schematic diagram of a physical architecture to which this application applies.
  • a system or program with a sound mixing function can run on a server, and the server obtains the song audio of at least two singing versions of the same song from other terminal devices through the network; extracts the human voice in the audio of each song signal and accompaniment signal, obtain a vocal set including at least two vocal signals and an accompaniment set including at least two accompaniment signals; select reference rhythm information in the rhythm information corresponding to the audio of each song, All the vocal signals of the accompaniment set are track-aligned, and all the track-aligned vocal signals are used as the vocal audio to be mixed; the accompaniment signals in the accompaniment set that are aligned with the vocal audio tracks are determined as the to-be-mixed accompaniment signals accompaniment audio for vocals; mix vocal audio and accompaniment audio to end up with a remixed song.
  • the server can establish communication connections with multiple devices, and the server obtains song audio for mixing from these devices.
  • song audio for mixing can also be stored in a database.
  • the server can obtain the corresponding mixed songs by collecting the audio of songs uploaded by these devices and mixing them.
  • Figure 1 shows a variety of terminal devices. In an actual scenario, more or less types of terminal devices may participate in the mixing process. The specific number and type depend on the actual scenario, which is not limited here. , Figure 1 shows one server, but in an actual scenario, multiple servers can also participate, and the specific number of servers depends on the actual scenario.
  • the method for generating a mixed song provided in this embodiment can be performed offline, that is, the server locally stores song audio for mixing, which can directly use the solution provided in this application to mix to obtain a desired mixed song.
  • FIG. 2 is a flowchart of a first method for generating a mixed song provided by an embodiment of the present application.
  • the method for generating a mixed song may include the following steps:
  • the different singing versions of the same song are: the original singing version, the cover version, the adapted version, etc. of the song.
  • Song audio is a song in formats such as MP3.
  • Method 2 Extract the left-channel vocals and right-channel vocals in the audio of each song, and determine the amplitude average or spectral feature average value of the left-channel vocals and the right-channel vocals as each song.
  • the average value of the amplitude corresponds to the range of the time domain
  • the average value of the spectral features corresponds to the range of the frequency domain, that is, the left channel vocal and the right channel vocal can be processed based on the two dimensions of the time domain and the frequency domain.
  • extracting the accompaniment signal in the audio of each song includes: extracting the accompaniment of the left channel or the accompaniment of the right channel in the audio of each song, and determining the accompaniment of the left channel or the accompaniment of the right channel as the accompaniment in the audio of each song Signal.
  • the left and right channel audio of a certain song audio is dataLeft and dataRight respectively
  • the left channel accompaniment can be extracted from dataLeft as the accompaniment signal of the song audio
  • the right channel accompaniment can also be extracted from dataRight as the accompaniment signal of the song audio.
  • Extracting the vocal signal and accompaniment signal in the audio of each song can also be achieved using a vocal accompaniment separation tool (such as spleeter, etc.). Assuming that two different versions of the same song are song1 and song2, respectively, after the vocal accompaniment is separated, two vocal signals can be obtained: vocal1 and vocal2, and two accompaniment signals: surround1 and surround2.
  • a vocal accompaniment separation tool such as spleeter, etc.
  • determining the accompaniment signal aligned with the vocal track in the accompaniment set as the accompaniment audio to be mixed includes: in the accompaniment set, selecting the accompaniment signal aligned with the reference rhythm information as the accompaniment signal to be mixed Accompaniment audio to be mixed; or after track alignment of any accompaniment signal in the accompaniment set with the reference rhythm information, it is used as the accompaniment audio to be mixed.
  • mixing the vocal audio and the accompaniment audio to obtain a mixed song includes: calculating the gain value of the left channel and the gain value of the right channel; Stereo signal of a vocal signal; mix individual stereo signals and accompaniment audio to get a mixed song.
  • the signal on the right channel which is the stereo signal of the vocal signal.
  • vocalALeft vocalA ⁇ gainLeft
  • vocalARight vocalalA ⁇ gainRight
  • the alpha When the alpha is adjusted in the direction of less than 0.5, it means that the final mixing effect is to enhance the background (that is, the accompaniment) sound, thereby increasing the surround and immersion of the music; when the alpha is adjusted in the direction of greater than 0.5, it means that the final sound
  • the mixing effect is to elevate the clarity of the vocals, thus creating the effect of clear vocals.
  • each stereo signal before mixing each stereo signal and accompaniment audio, software such as an equalizer can also be used to enhance the low-frequency components of the surround to enhance the rhythm of the entire music. Or, before mixing each stereo signal and the accompaniment audio, each stereo signal is processed without changing the pitch, so as to obtain more singing styles.
  • Method 1 Calculate the gain value of the left channel and the gain value of the right channel according to the preset sound image angle and the preset position of the vocal signal in the preset sound image angle. Set the pan angle to thetaBase, set the position of the vocal signal in the preset pan angle to theta, then the gain value is:
  • Method 2 Calculate the gain value of the left channel and the gain value of the right channel by assigning a linear gain. Assuming that the human voice is positioned to the left of the center, then
  • Mode 1 uses the method of setting the modulation angle for audio-image modulation
  • mode 2 uses the method of distributing linear gain for audio-image modulation, both of which can place the human voice at any position between the left and right 90 degrees, thereby forming a sound image.
  • the simultaneous chorus effect can create a more three-dimensional voice image, and the chorus effect can be controlled, so that the user can easily and conveniently adjust the position of the voice image without changing the spectral components of the voice signal. It really combines two voices that are not in the same time and space into the same song.
  • each vocal signal in the vocal audio can also be determined over time. For example: in a certain period of time, only one or a few vocal signals appear to achieve duet effect.
  • this embodiment can mix at least two singing versions of the same song, and can cover more songs for mixing, and during the mixing process, the reference rhythm information is selected from the rhythm information corresponding to the audio of each song, Based on the reference rhythm information, all vocal signals in each singing version are track-aligned, and the accompaniment signal that is aligned with the vocal signal is selected. Therefore, when mixing vocals and accompaniment, elements such as lyrics and beats can be used. Keeping coordination and synchronization, you get a well-remixed song that improves your mix.
  • the alignment method includes:
  • the beat information in the audio of each song can be extracted using beattracker or drum beat extraction algorithms.
  • the beat information in the beat set and the vocal signal in the vocal set have a one-to-one mapping relationship.
  • 3 vocal signals ie vocal set
  • vocalA, vocalB and vocalC 3 vocal signals
  • accompaniment signals ie accompaniment set
  • surroundA, surroundB and surroundC 3 beat information (ie, beat collection): BeatA, BeatB, BeatC.
  • the elements in the above three sets have a one-to-one mapping relationship, namely: vocalA-surroundA-BeatA, vocalB-surroundB-BeatB, and vocalC-surroundC-BeatC.
  • S302 Determine whether the number of elements included in each beat information in the beat set is the same; if so, execute S303; if not, execute S308.
  • each beat information in the beat set includes multiple elements (that is, the beat, that is, the beat point). If the number of elements included in different beat information is the same, it indicates that the rhythm of the corresponding song audio is similar and belongs to the same Arrangement, beat points are not much different, so the steps of S303-S307 can be used for rough alignment. On the contrary, if the number of elements included in different beat information is different, it indicates that the rhythm of the corresponding song audio is quite different and does not belong to the same arrangement. The beat point may be quite different and needs to be adjusted frame by frame, so it is necessary to use S309-S313. The steps are segmented for more detailed alignment.
  • FIG. 4 For the Beat points included in the beat information, reference may be made to FIG. 4 . “1, 2, 3...n, n+1...” in FIG. 4 represents each data frame in the audio of the song.
  • the arrows indicate the time stamp positions corresponding to the beat points, and the positions corresponding to these beat points are also applicable to human voice signals.
  • the second beat information is other beat information other than the first beat information in the beat set. For example, suppose that BeatA is selected from the above beat set as the first beat information, then BeatB and BeatC are the second beat information.
  • the second vocal signal is another vocal signal in the vocal set except the first vocal signal
  • the first vocal signal is a vocal signal in the vocal set having a mapping relationship with the first beat information.
  • S306. Determine a corresponding difference value required for adjusting each second human voice signal according to the first corresponding relationship, and determine the redundant end and the to-be-filled end of each second human voice signal based on the corresponding difference value.
  • Steps S303-S307 align the vocal signal by panning the vocal signal as a whole, which follows the Euclidean distance minimization principle.
  • M is a positive number, it means that the singer of the song audio A starts singing later than the singer of the song audio B, then using vocalA as the comparison benchmark, move vocalB backward (right) M data points, and the head and tail of vocalA are used as reference points to determine the redundant end and the to-be-complemented end of vocalB.
  • the redundant end the part of the translated vocalB that exceeds the vocalA is cut off; for the end to be complemented, the part that is lacking in the vocalB compared with the vocalA is filled with zeros, so that the vocalB and the vocalA can be aligned.
  • S308 determine whether the number of currently acquired song audios is only two; if so, execute S309; if not, exit the process.
  • the reference tempo information is the third tempo information
  • the third tempo information is tempo information with the smallest number of elements in the tempo set.
  • the fourth beat information is other beat information except the third beat information in the beat set. Assuming that the beat set includes: BeatA and BeatB, and BeatA includes 3 elements: aA, bA, cA, and BeatB includes 4 elements: aB, bB, cB, dB, then BeatA is the third beat information, and BeatB is the fourth beat information .
  • reducing the number of elements in the fourth beat information to be the same as the number of elements in the third beat information includes: arranging each element in the third beat information into a target sequence according to the time stamp size; determining the current iteration number of times, determine the element in the target sequence at the arrangement position equal to the current iteration number as the target element; calculate the time stamp distance between the target element and each comparison element respectively; the comparison element is the fourth beat information that does not match any element in the target sequence.
  • the current number of iterations is incremented by one, and the current number of iterations is determined, and the element in the target sequence at the arrangement position equal to the current number of iterations is determined as the target element; the target element and each comparison are calculated separately.
  • the timestamp distance of the element the step of determining the contrast element corresponding to the minimum timestamp distance as the element matching the target element, until the current number of iterations is not less than the maximum number of iterations.
  • the maximum number of iterations is the number of elements in the third beat information.
  • the specific process is as follows: Assuming that the elements in BeatA have been arranged in ascending order of timestamps, the maximum number of iterations is 3. In the first iteration, the current iteration number is 1, then the target element is aA. At this time, the timestamp distances of aA and aB, aA and bB, aA and cB, aA and dB are calculated respectively, and 4 distances can be obtained: 0.1 , 0.2, 0.3, 0.4; then the minimum timestamp distance is 0.1, and its corresponding contrast element is aB, so it is determined that aA matches aB.
  • the number of iterations is less than the maximum number of iterations of 3, then the number of iterations changes from 1 to 2, then the target element of the second round of iteration is bA; since aA matches aB, then aB is no longer a comparison element, so calculate bA and bB,
  • the timestamp distance between bA and cB, bA and dB three distances can be obtained: 0.5, 0.6, and 0.7; then the minimum timestamp distance is 0.5, and the corresponding comparison element is bB, so it is determined that bA and bB match.
  • the number of iterations is less than the maximum number of iterations of 3, and the number of iterations changes from 2 to 3, then the target element of the third round of iteration is cA; since aA matches aB, bA matches bB, then aB and bB are no longer comparison elements. , so by calculating the timestamp distance between cA and cB, cA and dB, two distances can be obtained: 0.7 and 0.8, then the minimum timestamp distance is 0.7, and the corresponding contrast element is cB, so it is determined that cA and cB match.
  • BeatA includes 3 elements: aA, bA, cA
  • BeatB includes 3 elements: aB, bB, cB.
  • S311 Determine a plurality of adjacent beat groups based on the third beat information or the fourth beat information.
  • BeatA includes 3 elements: aA, bA, cA
  • BeatB includes 3 elements: aB, bB, cB.
  • two adjacent beat groups can be determined, a and b, b and c.
  • the first data segment corresponding to a and b is the segment in vocalA corresponding to aA-bA
  • the second data segment is the segment in vocalB corresponding to aB-bB
  • the first data segment corresponding to b and c is the segment in vocalA corresponding to bA ⁇ cA
  • the second data segment is the segment in vocalB corresponding to bB ⁇ cB.
  • FIG. 5 illustrates an adjacent beat group a and b.
  • the first data segment (segment in vocalA) corresponding to the adjacent beat group includes 4 data frames (data frames 2, 3, 4, 5), the second data segment (segment in vocalB) includes 3 data frames (data frames 2, 3, and 4).
  • the third human voice signal is a human voice signal in the human voice set that has a mapping relationship with the third beat information
  • the fourth human voice signal is other human voice signals in the human voice set except the third human voice signal. If BeatA is the third beat information and BeatB is the fourth beat information, then the third vocal signal is vocalA, and the fourth vocal signal is vocalB.
  • the first data segment is a segment in the third vocal signal, and the second data segment is a segment in the fourth vocal signal.
  • the number of first data frames in the first data segment is equal to the number of second data frames in the second data segment The number of data frames.
  • the number of first data frames in the first data segment is not equal to the number of second data frames in the second data segment, then the maximum number of the first data frames and the number of second data frames
  • the data segment corresponding to the value is determined as the segment to be deleted; the number of deletions of each data frame in the segment to be deleted is calculated, and each data frame in the segment to be deleted is deleted according to the number of deletions.
  • FIG. 6 illustrates an adjacent beat group b and c.
  • the first data segment (segment in vocalA) corresponding to the adjacent beat group includes 3 data frames (data frames 2, 3, and 4).
  • the second data segment (segment in vocalB) includes 4 data frames (data frames 2, 3, 4, and 5). It can be seen that, when performing data deletion for each adjacent beat group in this embodiment, sometimes vocalA needs to be deleted, and sometimes vocalB needs to be deleted. Therefore, steps S309 to S313 only mix audios of two songs. Align each data segment in vocalA and vocalB according to steps S309 to S313, so that the alignment of vocalA and vocalB can be realized.
  • vocal1 and vocal2 can be aligned according to S309 ⁇ S313 to obtain mutually aligned vocal1' and vocal2'.
  • the data in vocal1' and vocal2' The number of frames is equal, so vocal1' and vocal2' can be considered to be the same (meaning the same number of data frames).
  • align vocal1' and vocal3, vocal2' and vocal3 respectively to complete the alignment of the three vocal signals.
  • vocal1' and vocal2' can be considered to be the same, the data deleted when they align vocal3 are exactly the same.
  • the deleted data is also the same. Therefore, by aligning vocal1' and vocal3, vocal2' and vocal3, the same vocal3' can be obtained. Finally, vocal1", vocal2" and vocal3' can be obtained which are aligned with each other.
  • the corresponding accompaniment signal also needs to be aligned in the same alignment as the vocal signal, and finally output the accompaniment signal aligned with all the aligned vocal signals.
  • tracks of different versions of human voices are aligned according to the beat information of the song audio.
  • at least two singing versions of the same song can be mixed, and more songs can be covered for mixing, and during the mixing process, the reference rhythm information is selected from the rhythm information corresponding to the audio of each song, based on the reference rhythm information.
  • the rhythm information tracks all vocal signals in each singing version, and selects the accompaniment signal that is aligned with the vocal signal's track, so when mixing vocals and accompaniment, elements such as lyrics, beats, etc. can be kept in harmony Synchronization and synchronization, get a remixed song with good mixing effect, and improve the mixing effect.
  • the alignment method provided by this embodiment includes:
  • the BPM value corresponding to the audio of each song can be counted by using the BPM detection algorithm.
  • BPM is the abbreviation of Beat Per Minute, also known as the number of beats, which means the number of beats per minute.
  • the BPM values in the BPM value set and the vocal signals in the vocal set have a one-to-one mapping relationship.
  • 3 vocal signals ie, vocal set
  • 3 BPM values ie, BPM value set
  • BPMA BPMA
  • BPMB BPMB
  • BPMC BPMC
  • the reference BPM value is the reference tempo information. At this time, one BPM value may be randomly selected from the BPM value set as the reference BPM value.
  • the target BPM value is other BPM values other than the reference BPM value in the BPM value set. Assuming that BPMA is selected from the BPM value set as the reference BPM value, then BPMB and BPMC are the target BPM values. From this, the ratios can be obtained: BPMA/BPMB, BPMA/BPMC.
  • the target human voice signal is other human voice signals in the human voice set except the reference human voice signal, and the reference human voice signal is a human voice signal in the human voice set that has a mapping relationship with the reference BPM value. If BPMA is selected as the reference BPM value, then the reference vocal signal is vocalA, and the target vocal signals are vocalB and vocalC.
  • S705 Determine a corresponding ratio required to adjust each target human voice signal according to the second corresponding relationship, and perform variable speed and invariant pitch processing on each target human voice signal based on the corresponding ratio.
  • BPMA/BPMB corresponds to vocalB
  • BPMA/BPMC corresponds to vocalC
  • This embodiment can be implemented by using a variable-speed and constant-modulation processor.
  • tracks of different versions of human voices are aligned according to the beat information of the song audio.
  • at least two singing versions of the same song can be mixed, and more songs can be covered for mixing, and during the mixing process, the reference rhythm information is selected from the rhythm information corresponding to the audio of each song, based on the reference rhythm information.
  • the rhythm information tracks all vocal signals in each singing version, and selects the accompaniment signal that is aligned with the vocal signal's track, so when mixing vocals and accompaniment, elements such as lyrics, beats, etc. can be kept in harmony Synchronization and synchronization, get a remixed song with good mixing effect, and improve the mixing effect.
  • vocalA is randomly selected as the standard vocal signal
  • adjusted vocalB vocalB ⁇ (RMSA/RMSB)
  • adjusted vocalC vocalC ⁇ (RMSA/RMSC)
  • This embodiment utilizes the principle of energy difference between the left and right ears, which can reduce the difference in loudness of different human voice signals, and obtain a human voice chorus effect with a stereo image.
  • the remix song generation scheme can create remix songs based on existing songs.
  • a corresponding mixed song production tool can be designed, and the production of the mixed song can be completed by using the tool.
  • Remix Song Maker can be installed on any computer device.
  • the mixed song production tool executes the mixed song generation method provided by this application.
  • the process of making a remixed song may include the following steps:
  • the client uploads the song audio of at least two singing versions of the same song to the server;
  • the server inputs the audio of each song into the mixing song production tool in itself, and outputs the mixed song by the mixing song production tool;
  • the mixed song creation tool can cover all the songs in the music library. Users can upload any songs they want to remix for remixing. If there is only one singing version of a song in the music library, you can sing it along with the separated accompaniment, so as to create a mixing effect in which you and the professional singer appear in the same song. Moreover, the different singing versions used for mixing only need to have the same score, even if they are performed in different languages.
  • the human voice is aligned based on the Beat point and BPM value of the song.
  • the human voice can be clarified or the background can be enhanced to widen the sound field, and the tone of the human voice can be adjusted to adjust the background The ratio of the spectral energy of the sound.
  • the user can not only adapt the human voice (produce multi-directional dual-tone effects, or perform pitch-shifting processing on the vocals of the song), but also adapt the background sound (produce clear human voice, sound field, etc.) broadening and rhythm enhancement, etc.).
  • This production method greatly expands the range of songs covered by the dual-tone effect, and at the same time, it also makes the production of the mixing effect more adaptable content and methods.
  • FIG. 9 is a schematic diagram of an apparatus for generating a mixed song provided by an embodiment of the present application, including:
  • Obtaining module 901 is used to obtain at least two song audios; at least two song audios are different singing versions of the same song;
  • the extraction module 902 is used to extract the vocal signal and the accompaniment signal in the audio of each song to obtain a vocal set including at least two vocal signals and an accompaniment set including at least two accompaniment signals;
  • Alignment module 903 used for selecting reference rhythm information in the rhythm information corresponding to the audio of each song, aligning all the vocal signals in the vocal set based on the benchmark rhythm information, and aligning all the vocal signals after the audio tracks are aligned As the vocal audio to be mixed;
  • the selection module 904 is used to determine the accompaniment signal aligned with the track of the vocal audio in the accompaniment set as the accompaniment audio to be mixed;
  • the mixing module 905 is used for mixing vocal audio and accompaniment audio to obtain a mixed song.
  • the extraction module includes:
  • the first extraction unit is used to calculate the corresponding mid-position signal of each song audio, and extract the vocal signal in each song audio from the mid-position signal;
  • the second extraction unit is used to extract the left-channel vocals and the right-channel vocals in the audio of each song, and determine the amplitude average value or the spectral characteristic average value of the left-channel vocals and the right-channel vocals to determine for the vocal signal in the audio for each song.
  • the extraction module includes:
  • the third extraction unit is configured to extract the left channel accompaniment or the right channel accompaniment in the audio of each song, and determine the left channel accompaniment or the right channel accompaniment as the accompaniment signal in the audio of each song.
  • the alignment module includes:
  • the beat extraction unit is used to extract the beat information in the audio of each song, and obtain a beat set including at least two beat information; the beat information in the beat set and the vocal signal in the vocal set have a one-to-one mapping relationship;
  • a first selection unit configured to determine that the reference rhythm information is the first rhythm information if the number of elements included in each rhythm information in the rhythm set is the same, and the first rhythm information is any rhythm information in the rhythm set;
  • the first calculation unit is used to calculate the difference value between the first beat information and each second beat information respectively; the second beat information is other beat information except the first beat information in the beat set;
  • the first determination unit is used to determine the first correspondence between each difference value and each second vocal signal according to a one-to-one mapping relationship;
  • the second vocal signal is other people in the vocal set except the first vocal signal an acoustic signal, the first vocal signal is a vocal signal that has a mapping relationship with the first beat information in the vocal set;
  • the second determination unit is configured to determine the corresponding difference value required to adjust each second vocal signal according to the first correspondence, and determine the redundant end and the to-be-filled end of each second vocal signal based on the corresponding difference value ;
  • the first alignment unit is used to delete redundant data equal to the difference value from the redundant end of each second vocal signal, and add an amount equal to the difference value at the to-be-filled end of each second vocal signal all zero data.
  • the first computing unit is specifically used for:
  • M is the difference between Beat0 and BeatX; Beat0 is the vector representation of the first beat information; BeatX is the vector representation of any second beat information; sum(Beat0–BeatX) is the bitwise subtraction of each element in Beat0 and BeatX , the accumulated sum of all the differences obtained; numBeats is the number of elements included in each beat information; L is the length of the unit data frame.
  • the alignment module further includes:
  • the second selection unit is configured to determine that the reference rhythm information is the third rhythm information if two song audios are acquired and the number of elements included in each rhythm information in the rhythm set is different; the third rhythm information is the number of elements in the rhythm set. The least number of beat information;
  • the deletion unit is used to delete the number of elements in the fourth beat information to be the same as the number of elements in the third beat information; the fourth beat information is other beat information except the third beat information in the beat set;
  • a third determining unit configured to determine a plurality of adjacent beat groups based on the third beat information or the fourth beat information
  • the dividing unit is used to divide the third vocal signal and the fourth vocal signal according to each adjacent beat group, and obtain the first data segment and the second data segment corresponding to each adjacent beat group;
  • the third vocal signal is a human
  • the vocal signal in the vocal set has a mapping relationship with the third beat information, and the fourth vocal signal is other vocal signals in the vocal set except the third vocal signal;
  • the second alignment unit is configured to make the data length of the first data segment equal to the data length of the second data segment for each adjacent beat group.
  • the second alignment unit includes:
  • the first determination subunit is used for determining the number of first data frames and the number of second data frames if the number of first data frames in the first data segment is not equal to the number of second data frames in the second data segment.
  • the data segment corresponding to the largest value in the number is determined as the segment to be deleted;
  • the first calculation subunit is used to calculate the number of deletions of each data frame in the segment to be deleted, and delete each data frame in the segment to be deleted according to the number of deletions.
  • calculation subunit is specifically used for:
  • P is the number of deletions in each data frame
  • m is the maximum value
  • n is the minimum value between the number of the first data frame and the number of the second data frame
  • L is the length of the unit data frame.
  • the pruning unit includes:
  • Arranging subunits for arranging each element in the third beat information into a target sequence according to the size of the timestamp
  • the second determination subunit is used to determine the current iteration number, and determines the element on the arrangement position equal to the current iteration number in the target sequence as the target element;
  • the second calculation subunit is used to calculate the time stamp distance between the target element and each contrast element respectively;
  • the contrast element is an element that does not match any element in the target sequence in the fourth beat information;
  • the third determination subunit is used to determine the contrast element corresponding to the minimum time stamp distance as the element matching the target element
  • the deletion subunit is used to delete the contrast element in the current fourth beat information if the current iteration number is not less than the maximum iteration number, and retain the elements matching each target element in the fourth beat information.
  • the pruning unit also includes:
  • the iteration subunit is used to increment the current iteration number by one if the current iteration number is less than the maximum iteration number, and execute the steps in the second determination subunit, the second calculation subunit, and the third determination subunit until the current iteration number is no longer than less than the maximum number of iterations.
  • the alignment module includes:
  • the statistical unit is used to count the BPM value corresponding to each song audio, and obtains a BPM value set including at least two BPM values; the BPM value in the BPM value set and the vocal signal in the vocal set have a one-to-one mapping relationship;
  • the third selection unit is used to select a BPM value from the BPM value set as the reference BPM value; the reference BPM value is the reference rhythm information;
  • the second calculation unit is used to calculate the ratio of the reference BPM value to each target BPM value; the target BPM value is other BPM values except the reference BPM value in the BPM value set;
  • the fourth determination unit is used to determine the second correspondence between each ratio and each target vocal signal according to a one-to-one mapping relationship;
  • the human voice signal is a human voice signal that has a mapping relationship with the reference BPM value in the human voice set;
  • the third aligning unit is configured to determine the corresponding ratio required to adjust each target human voice signal according to the second corresponding relationship, and perform variable speed and invariant tone processing on each target human voice signal based on the corresponding ratio.
  • it also includes:
  • the standard vocal selection module is used to randomly select a vocal signal from all the aligned vocal signals as the standard vocal signal;
  • the adjustment module is used to adjust the loudness of each vocal signal to be tuned according to the third formula; the vocal signal to be tuned is the vocal signal other than the standard vocal signal among all the vocal signals after track alignment;
  • B is the vocal signal to be tuned after adjusting the loudness
  • vocalX is the vocal signal to be tuned before adjusting the loudness
  • RMS0 is the root mean square of the standard vocal signal
  • RMSX is the root mean square of vocalX.
  • the mixing module includes:
  • the third calculation unit is used to calculate the left channel gain value and the right channel gain value
  • the 5th determining unit is used to determine the stereo signal of each vocal signal in the vocal audio based on the left channel gain value and the right channel gain value;
  • the mixing unit is used to mix the individual stereo signals and the accompaniment audio to obtain a mixed song.
  • the mixing unit is specifically used for:
  • SongComb is the mixed song
  • vocal1 is each stereo signal
  • alpha is the preset adjustment factor
  • surround is the accompaniment audio.
  • the third computing unit is specifically used for:
  • the selection module includes:
  • a fourth selection unit configured to select, in the accompaniment set, an accompaniment signal aligned with the reference rhythm information as the accompaniment audio to be mixed;
  • the fourth aligning unit is used for aligning any accompaniment signal in the accompaniment set with the reference rhythm information as the accompaniment audio to be mixed.
  • this embodiment provides an apparatus for generating a mixed song, which aligns the tracks of different versions of human voices according to the beat information of the audio of the song.
  • at least two singing versions of the same song can be mixed, and more songs can be covered for mixing, and during the mixing process, all vocal signals in each singing version are track-aligned,
  • the accompaniment signal aligned with the vocal signal track is selected, so when mixing vocals and accompaniment, elements such as lyrics, beats and other elements can be kept in harmony and synchronization, and a remixed song with good mixing effect can be obtained, which improves the mixing effect. sound effect.
  • FIG. 10 and FIG. 11 are both structural diagrams of an electronic device according to an exemplary embodiment, and the contents in the figures should not be considered as any limitation on the scope of use of the present application.
  • FIG. 10 is a schematic structural diagram of a server according to an embodiment of the present application.
  • the server 50 may specifically include: at least one processor 51 , at least one memory 52 , a power supply 53 , a communication interface 54 , an input and output interface 55 and a communication bus 56 .
  • the memory 52 is used to store a computer program, and the computer program is loaded and executed by the processor 51 to implement the relevant steps in the generation of the mixed song disclosed in any of the foregoing embodiments.
  • the power supply 53 is used to provide working voltage for each hardware device on the server 50;
  • the communication interface 54 can create a data transmission channel between the server 50 and external devices, and the communication protocol it follows is applicable to this Any communication protocol applying for the technical solution is not specifically limited here;
  • the input and output interface 55 is used to obtain external input data or output data to the outside world, and its specific interface type can be selected according to specific application needs, which is not carried out here. Specific restrictions.
  • the memory 52 as a carrier for resource storage, can be a read-only memory, a random access memory, a magnetic disk or an optical disk, etc.
  • the resources stored on the memory 52 include the operating system 521, the computer program 522, and the data 523, etc., and the storage method can be short-term storage or Permanent storage.
  • the operating system 521 is used to manage and control each hardware device and computer program 522 on the server 50, so as to realize the operation and processing of the data 523 in the memory 52 by the processor 51, which can be Windows Server, Netware, Unix, Linux, etc. .
  • the computer program 522 may further include a computer program that can be used to complete other specific tasks in addition to the computer program that can be used to complete the method for generating a remixed song disclosed in any of the foregoing embodiments.
  • the data 523 may include, in addition to data such as song audio for mixing, data such as developer information of the application.
  • FIG. 11 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
  • the terminal 60 may specifically include, but is not limited to, a smart phone, a tablet computer, a notebook computer, or a desktop computer.
  • the terminal 60 in this embodiment includes: a processor 61 and a memory 62 .
  • the processor 61 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • the processor 61 can use at least one hardware form among DSP (Digital Signal Processing, digital signal processing), FPGA (Field-Programmable Gate Array, field programmable gate array), PLA (Programmable Logic Array, programmable logic array) accomplish.
  • the processor 61 may also include a main processor and a coprocessor.
  • the main processor is a processor used to process data in the wake-up state, also called CPU (Central Processing Unit, central processing unit); the coprocessor is A low-power processor for processing data in a standby state.
  • the processor 61 may be integrated with a GPU (Graphics Processing Unit, image processor), and the GPU is used for rendering and drawing the content that needs to be displayed on the display screen.
  • the processor 61 may further include an AI (Artificial Intelligence, artificial intelligence) processor, where the AI processor is used to process computing operations related to machine learning.
  • AI Artificial Intelligence, artificial intelligence
  • Memory 62 may include one or more computer-readable storage media, which may be non-transitory. Memory 62 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash storage devices. In this embodiment, the memory 62 is at least used to store the following computer program 621, wherein, after the computer program is loaded and executed by the processor 61, it can implement the method for generating a mixed song disclosed by any of the foregoing embodiments and executed by the terminal side. related steps. In addition, the resources stored in the memory 62 may also include an operating system 622, data 623, etc., and the storage mode may be short-term storage or permanent storage. The operating system 622 may include Windows, Unix, Linux, and the like. Data 623 may include, but is not limited to, the audio of the song to be mixed.
  • the terminal 60 may further include a display screen 63 , an input/output interface 64 , a communication interface 65 , a sensor 66 , a power supply 67 and a communication bus 68 .
  • FIG. 11 does not constitute a limitation on the terminal 60, and may include more or less components than those shown in the drawings.
  • an embodiment of the present application further discloses a storage medium, where computer-executable instructions are stored in the storage medium, and when the computer-executable instructions are loaded and executed by a processor, the hybrid system disclosed in any of the foregoing embodiments is implemented.
  • Method for generating music For the specific steps of the method, reference may be made to the corresponding content disclosed in the foregoing embodiments, which will not be repeated here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

一种混音歌曲生成方法、装置、设备及存储介质,在该方案中,获取同一首歌曲的至少两个演唱版本的歌曲音频(S201),提取每个歌曲音频中的人声信号和伴奏信号,得到包括至少两个人声信号的人声集合和包括至少两个伴奏信号的伴奏集合(S202),在各个歌曲音频对应的节奏信息中选择基准节奏信息,基于基准节奏信息将所有人声信号进行音轨对齐,并将音轨对齐后的所有人声信号作为待混音的人声音频(S203),将伴奏集合中,与人声音频的音轨对齐的伴奏信号确定为待混音的伴奏音频(S204),混合人声音频和伴奏音频,得到混音歌曲(S205)。能够覆盖更多歌曲进行混音,并将各个歌曲音频中的所有人声信号进行了音轨对齐,选择了与人声信号的音轨对齐的伴奏信号,因此能够保持歌词、节拍等元素的协调性和同步性,提高了混音效果。

Description

一种混音歌曲生成方法、装置、设备及存储介质
本申请要求于2021年02月24日提交至中国专利局、申请号为202110205483.9、发明名称为“一种混音歌曲生成方法、装置、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机信号处理技术领域,特别涉及一种混音歌曲生成方法、装置、设备及存储介质。
背景技术
目前制作混音歌曲的方式为:将一个歌曲的左通道音频和另一个歌曲的右通道音频混合,从而形成一种奇妙的立体声效果。一般这两首歌曲是同一首歌曲的两个不同演唱版本。
但是,上述方式依赖于人工制作,可体验的歌曲数量有限,无法对更多歌曲进行混音。简单的左右通道的混音,不能保证歌词、节拍等元素的协调性和同步性,可能导致混音效果不佳。
发明内容
有鉴于此,本申请的目的在于提供一种混音歌曲生成方法、装置、设备及存储介质,以覆盖更多歌曲生成具有良好混音效果的混音歌曲。其具体方案如下:
为实现上述目的,一方面,本申请提供了一种混音歌曲生成方法,包括:
获取至少两个歌曲音频;所述至少两个歌曲音频为同一首歌曲的不同演唱版本;
提取每个歌曲音频中的人声信号和伴奏信号,得到包括至少两个人声信号的人声集合和包括至少两个伴奏信号的伴奏集合;
在各个歌曲音频对应的节奏信息中选择基准节奏信息,基于所述基准节奏信息将所述人声集合中的所有人声信号进行音轨对齐,并将音轨对齐后的所有人声信号作为待混音的人声音频;
将所述伴奏集合中,与所述人声音频的音轨对齐的伴奏信号确定为待混音的伴奏音频;
混合所述人声音频和所述伴奏音频,得到混音歌曲。
又一方面,本申请还提供了一种混音歌曲生成装置,包括:
获取模块,用于获取至少两个歌曲音频;所述至少两个歌曲音频为同一首歌曲的不同演唱版本;
提取模块,用于提取每个歌曲音频中的人声信号和伴奏信号,得到包括至少两个人声信号的人声集合和包括至少两个伴奏信号的伴奏集合;
对齐模块,用于在各个歌曲音频对应的节奏信息中选择基准节奏信息,基于所述基准节奏信息将所述人声集合中的所有人声信号进行音轨对齐,并将音轨对齐后的所有人声信号作为待混音的人声音频;
选择模块,用于将所述伴奏集合中,与所述人声音频的音轨对齐的伴奏信号确定为待混音的伴奏音频;
混合模块,用于混合所述人声音频和所述伴奏音频,得到混音歌曲。
又一方面,本申请还提供了一种电子设备,所述电子设备包括处理器和存储器;其中,所述存储器用于存储计算机程序,所述计算机程序由所述处理器加载并执行以实现前述混音歌曲生成方法。
又一方面,本申请还提供了一种存储介质,所述存储介质中存储有计算机可执行指令,所述计算机可执行指令被处理器加载并执行时,实现前述混音歌曲生成方法。
可见,本申请在获取到同一首歌曲的至少两个演唱版本的歌曲音频后,提取每个歌曲音频中的人声信号和伴奏信号,然后在各个歌曲音频对应的节奏信息中选择基准节奏信息,基于基准节奏信息将所有人声信号进行音轨对齐,并将音轨对齐后的所有人声信号作为待混音的人声音频,选择与人声音频的音轨对齐的伴奏信号作为待混音的伴奏音频,最后混合人声音频和伴奏音频,可得到混音歌曲。本申请可以针对同一首歌曲的至少两个演唱版本进行混音,能够覆盖更多歌曲进行混音,并且在混音过程中,将各个演唱版本中的所有人声信号进行了音轨对齐,并选择了与人声信号的音轨对齐的伴奏信号,因此混合人声和伴奏时,能够使歌词、节拍等元素保持协调性和同步性,得到混音效果良好的混音歌曲,提高了混音效果。
相应地,本申请提供的一种混音歌曲生成装置、设备及存储介质,也同样具有上述技术效果。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据提供的附图获得其他的附图。
图1为本申请提供的一种本申请适用的物理架构示意图;
图2为本申请提供的一种混音歌曲生成方法流程图;
图3为本申请提供的一种对齐方法流程图;
图4为本申请提供的一种Beat点示意图;
图5为本申请提供的一种相邻节拍组对应的数据片段示意图;
图6为本申请提供的另一种相邻节拍组对应的数据片段示意图;
图7为本申请提供的另一种对齐方法流程图;
图8为本申请提供的一种混音歌曲制作方法流程图;
图9为本申请提供的一种混音歌曲生成装置示意图;
图10为本申请提供的一种服务器结构图;
图11为本申请提供的一种终端结构图。
具体实施方式
现有制作混音歌曲的方式依赖于人工制作,可体验的歌曲数量有限,无法对更多歌曲进行混音。简单的左右通道的混音,不能保证歌词、节拍等元素的协调性和同步性,可能导致混音效果不佳。
鉴于目前所存在的上述问题,本申请提出了混音歌曲生成方案,该方案能够覆盖更多歌曲进行混音,并且在混音过程中,将各个演唱版本中的所有人声信号进行了音轨对齐,并选择了与人声信号的音轨对齐的伴奏信号,因此混合人声和伴奏时,能够使歌词、节拍等元素保持协调性和同步性,得到混音效果良好的混音歌曲,提高了混音效果。
为了便于理解,先对本申请所适用的物理框架进行介绍。
应理解,本申请提供的混音歌曲生成方法可以应用于具有混音功能的系统或程序中,例如音乐游戏。具体的,具有混音功能的系统或程序可以运行于服务器、个人计算机等设备中。
如图1所示,图1为本申请适用的物理架构示意图。在图1中,具有混音功能的系统或程序可以运行于服务器,该服务器通过网络从其他终端设备中获取同一首歌曲的至少两个演唱版本的歌曲音频;提取每个歌曲音频中的人声信号和伴奏信号,得到包括至少两个人声信号的人声集合和包括至少两个伴奏信号的伴奏集合;在各个歌曲音频对应的节奏信息中选择基准节奏信息,基于基准节奏信息将人声集合中的所有人声信号进行音轨对齐,并将音轨对齐后的所有人声信号作为待混音的人声音频;将伴奏集合中,与人声音频的音轨对齐的伴奏信号确定为待混音的伴奏音频;混合人声音频和伴奏音频,最终得到混音歌曲。
如图可知,该服务器可以与多个设备建立通信连接,服务器从这些设备中获取用于混音的歌曲音频。当然,用于混音的歌曲音频也可以以数据库形式存储。服务器通过收集这些设备上传的歌曲音频,并进行混音,从而可得到相应混音歌曲。图1中示出了多种终端设备,在实际场景中可以有更多或更少种类的终端设备参与到混音过程中,具体数量和种类因实际场景而定,此处不做限定,另外,图1中示出了一个服务器,但在实际场景中,也可以有多个服务器的参与,具体服务器数量因实际场景而定。
应当注意的是,本实施例提供的混音歌曲生成方法可以离线进行,即服务器本地存储有用于混音的歌曲音频,其可以直接利用本申请提供的方案混音得到想要的混音歌曲。
可以理解的是,上述具有混音功能的系统或程序也可以运行于个人移动终端,也可以作为云端服务程序的一种,具体运作模式因实际场景而定,此处不做限定。
结合以上共性,请参见图2,图2为本申请实施例提供的第一种混音歌曲生成方法流程图。如图2所示,该混音歌曲生成方法可以包括以下步骤:
S201、获取同一首歌曲的至少两个演唱版本的歌曲音频。
其中,同一首歌曲的不同演唱版本如:歌曲的原唱版本、翻唱版本、改编版本等。歌曲音频即MP3等格式的歌曲。
S202、提取每个歌曲音频中的人声信号和伴奏信号,得到包括至少两个人声信号的人声集合和包括至少两个伴奏信号的伴奏集合。
其中,从一个歌曲音频中提取人声信号可以有如下两种方式,任选其一即可。
方式一:计算每个歌曲音频对应的中置信号,并从中置信号中提取每个歌曲音频中的人声信号。假设某一个歌曲音频的左右通道音频(包括伴奏和人声)分别为dataLeft和dataRight,则该歌曲音频的中置信号为:dataMid=(dataLeft+dataRight)/2。由于中置信号能够更好的表示歌曲音频的内容信息,因此从中置信号中提取人声信号能更好保持人声效果。
方式二:提取每个歌曲音频中的左声道人声和右声道人声,并将左声道人声和右声道人声的幅度平均值或频谱特征平均值,确定为每个歌曲音频中的人声信号。假设某一个歌曲音频的左声道人声(仅包括人声)和右声道人声(仅包括人声)分别为vocalLeft和vocalRight,那么该歌曲音频的vocal均值=(vocalLeft+vocalRight)/2。其中,幅度平均值对应时域范围,频谱特征平均值对应频域范围,也就是能够基于时域、频域两个维度处理左声道人声和右声道人声。
为了保持声场宽度,伴奏信号可以从左通道音频或左通道音频分离得到,即可保持其立体声格式。因此提取每个歌曲音频中的伴奏信号,包括:提取每个歌曲音频中的左声道伴奏或右声道伴奏,并将左声道伴奏或右声道伴奏确定为每个歌曲音频中的伴奏信号。假设某一个歌曲音频的左右通道音频分别为dataLeft和dataRight,那么可以从dataLeft中提取左声道伴奏作为该歌曲音频的伴奏信号,也可以从dataRight中提取右声道伴奏作为该歌曲音频的伴奏信号。
提取每个歌曲音频中的人声信号和伴奏信号还可以使用声伴分离工具(如spleeter等)实现。假设同一首歌的两个不同版本的歌曲分别是song1和song2,分别对其做声伴分离后,可得两个人声信号:vocal1和vocal2,两个伴奏信号:surround1和surround2。
S203、在各个歌曲音频对应的节奏信息中选择基准节奏信息,基于基 准节奏信息将人声集合中的所有人声信号进行音轨对齐,并将音轨对齐后的所有人声信号作为待混音的人声音频。
由于同一歌曲的原唱版本、翻唱版本、改编版本等的演唱方式和语言可能不同,因此其人声信号的音轨可能存在偏差,所以需要对所有人声信号进行音轨对齐,使得所有人声信号具有良好的协调性和同步性。
S204、将伴奏集合中,与人声音频的音轨对齐的伴奏信号确定为待混音的伴奏音频。
相应的,同步所有人声信号后,还需要同步待混音的伴奏音频与所有人声信号的音轨。若将3个歌曲音频(即歌曲音频A、B、C)进行混音,那么可获得3个人声信号:vocalA、vocalB和vocalC,3个伴奏信号:surroundA、surroundB和surroundC,假设保持vocalA的音轨不变,调整vocalB和vocalC的音轨与vocalA对齐,那么可直接选择surroundA作为待混音的伴奏音频。如果想要使用surroundB或surroundC作为待混音的伴奏音频,就需要利用与对齐人声所采用的同样方式使surroundB或surroundC的音轨与surroundA对齐,以保证人声信号与背景声是完全对齐的。
在一种具体实施方式中,将伴奏集合中,与人声音频的音轨对齐的伴奏信号确定为待混音的伴奏音频,包括:在伴奏集合中,选择与基准节奏信息对齐的伴奏信号作为待混音的伴奏音频;或将伴奏集合中的任一个伴奏信号与基准节奏信息进行音轨对齐后,作为待混音的伴奏音频。
S205、混合人声音频和伴奏音频,得到混音歌曲。
需要说明的是,混合人声音频和伴奏音频之前,一般需要计算人声音频在左声道和右声道的分布情况,即:将某一个人声信号分配至左右声道,使左右声道得到不同能量大小的信号。因此,混合人声音频和伴奏音频,得到混音歌曲,包括:计算左声道增益值和右声道增益值;基于左声道增益值和右声道增益值,确定人声音频中的每个人声信号的立体声信号;混合各个立体声信号和伴奏音频,得到混音歌曲。人声音频中的每个人声信号的音轨同步,针对其中的某一个人声信号,利用基于左声道增益值和右声道增益值,可以计算出该人声信号分布在左声道和右声道上的信号,此即为该人声信号的立体声信号。
假设左声道增益值为gainLeft,右声道增益值为gainRight,那么人声信号vocalA在左声道上的信号为:vocalALeft=vocalA×gainLeft,vocalA在右声道上的信号为:vocalARight=vocalA×gainRight。vocalALeft和vocalARight共同组成vocalA的立体声信号。
其中,混合各个立体声信号和伴奏音频,得到混音歌曲,包括:按照第四公式混合各个立体声信号和伴奏音频,得到混音歌曲;其中,第四公式为:SongComb=alpha×(vocal1+…+vocalN)+(1-alpha)×surround;其中,SongComb为混音歌曲,vocal1、…、vocalN为各个立体声信号,alpha为预设调整因子,surround为伴奏音频。alpha在0-1之间取值。当alpha往小于0.5的方向调整时,则表示最终的混音效果为增强背景(即伴奏)声,从而增加了音乐的环绕感和沉浸感;当alpha往大于0.5的方向调整时,则表示最终的混音效果为抬升人声的清晰度,从而营造了清晰人声的效果。
需要说明的是,在混合各个立体声信号和伴奏音频之前,还可以利用均衡器等软件对surround的低频成分做增强处理,以增强整个音乐的节奏感。或者,在混合各个立体声信号和伴奏音频之前,对各个立体声信号进行变调不变速处理,以得到更多的演唱方式。
其中,计算左声道增益值和右声道增益值可以有如下两种方式,任选其一即可。
方式一:根据预设声像角度和人声信号在预设声像角度中的预设位置,计算左声道增益值和右声道增益值。设置声像角度为thetaBase,设置人声信号在预设声像角度中的位置为theta,那么增益值为:
gain=[tan(thetaBase)–tan(theta)]/[tan(thetaBase)+tan(theta)];
左声道增益值为:gainLeft=gain/sqrt(gain×gain+1);
右声道增益值为:gainRight=1/sqrt(gain×gain+1)。
方式二:通过分配线性增益的方式计算左声道增益值和右声道增益值。假设人声在中间偏左的位置,则
gainLeft=1.0;
gainRight=1.0-pan;
其中，参数pan是0-1之间的实数。若pan取值0，则gainLeft=1.0、gainRight=1.0，表示人声在正前方。若pan取值1，则gainLeft=1.0、gainRight=0，表示人声在正左方。所以调整pan的大小，可使人声的位置在正前方和正左方之间变化。如果人声在中间偏右的位置，则只需对调两个增益值即可。
方式一采用设定调制角度的方式进行声像调制,而方式二采用分配线性增益的方式进行声像调制,二者均能够将人声分别放在左右90度之间的任意位置,从而形成一种同时合唱的效果,能够营造一种更加立体的人声声像、且可控制合唱效果,使得用户可以简单方便的调整声像位置,且不改变人声信号的频谱成分。真正将两个不在同一时空的人声,捏合在了同一首歌曲内。
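两种声像调制方式的增益计算可参考如下示意性实现（角度以度为单位等均为说明性假设；若人声偏右，只需对调返回的两个增益值）：

import math

def pan_gains_by_angle(theta_base_deg, theta_deg):
    # 方式一：按预设声像角度thetaBase与人声位置theta的正切值计算增益
    tb = math.tan(math.radians(theta_base_deg))
    t = math.tan(math.radians(theta_deg))
    gain = (tb - t) / (tb + t)
    gain_left = gain / math.sqrt(gain * gain + 1.0)
    gain_right = 1.0 / math.sqrt(gain * gain + 1.0)
    return gain_left, gain_right

def pan_gains_linear(pan):
    # 方式二：线性增益分配，pan取0~1；pan=0人声在正前方，pan=1人声在正左方
    return 1.0, 1.0 - pan

# 得到增益后即可构造某一人声信号的立体声信号：
# vocalALeft = vocalA * gain_left，vocalARight = vocalA * gain_right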
当然,人声音频中的每个人声信号还可以随时间来决定出现与否。比如:某段时间内,只出现一个或某几个人声信号,以实现对唱效果。
可见,本实施例可以针对同一首歌曲的至少两个演唱版本进行混音,能够覆盖更多歌曲进行混音,并且在混音过程中,在各个歌曲音频对应的节奏信息中选择基准节奏信息,基于基准节奏信息将各个演唱版本中的所有人声信号进行了音轨对齐,并选择了与人声信号的音轨对齐的伴奏信号,因此混合人声和伴奏时,能够使歌词、节拍等元素保持协调性和同步性,得到混音效果良好的混音歌曲,提高了混音效果。
上述实施例介绍的“在各个歌曲音频对应的节奏信息中选择基准节奏信息,基于基准节奏信息将人声集合中的所有人声信号进行音轨对齐”可以有多种方式,本实施例将对其中的一种对齐方法进行介绍。若节奏信息为节拍信息,则本实施例提供的对齐方法包括:
S301、提取每个歌曲音频中的节拍信息,得到包括至少两个节拍信息的节拍集合。
每个歌曲音频中的节拍信息可以利用beattracker或鼓点提取算法完成提取。
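节拍信息（Beat点）的提取可借助现成的节拍跟踪工具完成，下面是一个基于librosa的示意性写法（函数名与参数以所安装版本为准，songA.wav为假设的示例文件名）：

import librosa

def extract_beats(path):
    # 返回歌曲的BPM估计值与各Beat点对应的时间戳（秒）
    y, sr = librosa.load(path, sr=None, mono=True)
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    return tempo, beat_times

# 用法示意：bpmA, beatA = extract_beats('songA.wav')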
其中,节拍集合中的节拍信息和人声集合中的人声信号具有一一映射关系。例如:针对3个歌曲音频:A、B、C进行混音,那么可获得3个人声信号(即人声集合):vocalA、vocalB和vocalC,3个伴奏信号(即伴奏集合):surroundA、surroundB和surroundC,3个节拍信息(即节拍集合): BeatA、BeatB、BeatC。可见,上述3个集合中的元素具有一一映射关系,即:vocalA-surroundA-BeatA,vocalB-surroundB-BeatB,vocalC-surroundC-BeatC。
S302、判断节拍集合中各个节拍信息包括的元素个数是否相同;若是,则执行S303;若否,则执行S308。
需要说明的是,节拍集合中的每个节拍信息包括多个元素(即节拍,也就是Beat点),若不同节拍信息所包括的元素个数相同,则表明相应歌曲音频的节奏相近,属于同一编曲,Beat点相差不大,因此可以采用S303-S307的步骤进行粗略对齐。反之,若是不同节拍信息所包括的元素个数不同,则表明相应歌曲音频的节奏差别较大,不属于同一编曲,Beat点可能存在较大差异,需要逐帧调整,因此需要采用S309-S313的步骤分片段进行较为细致的对齐。
节拍信息包括的Beat点可参照图4,图4中的“1、2、3…n、n+1…”表示歌曲音频中的各个数据帧。箭头表示Beat点所对应的时间戳位置,这些Beat点所对应的位置同样适用于人声信号。
S303、确定基准节奏信息为第一节拍信息,第一节拍信息为节拍集合中的任一个节拍信息。
S304、分别计算第一节拍信息与每个第二节拍信息的差异值。
其中,第二节拍信息为节拍集合中除第一节拍信息以外的其他节拍信息。例如:假设从上述节拍集合中选择BeatA作为第一节拍信息,那么BeatB和BeatC即为第二节拍信息。
其中,分别计算第一节拍信息与每个第二节拍信息的差异值,包括:按照第一公式分别计算第一节拍信息与每个第二节拍信息的差异值;第一公式为:M=[sum(Beat0–BeatX)/numBeats]×L;其中,M为Beat0与BeatX的差异值;Beat0为第一节拍信息的向量表示;BeatX为任一个第二节拍信息的向量表示;sum(Beat0–BeatX)为Beat0和BeatX中各个元素对位相减(即各个元素的时间戳对位相减)后,得到的所有差值的累加和;numBeats为各个节拍信息包括的元素个数(即某个节拍信息包括的元素个数);L为单位数据帧长度。例如:若计算BeatA和BeatB的差异值,那么差异值M=[sum(BeatA–BeatB)/numBeats]×L。
S305、按照一一映射关系确定每个差异值与每个第二人声信号的第一对应关系。
其中,第二人声信号为人声集合中除第一人声信号以外的其他人声信号,第一人声信号为人声集合中与第一节拍信息具有映射关系的人声信号。以上述示例为例,当选择BeatA作为第一节拍信息后,第一人声信号即为vocalA,第二人声信号即为vocalB和vocalC。
S306、按照第一对应关系确定调整每个第二人声信号所需的相应差异值,并基于相应差异值确定每个第二人声信号的冗余端和待补位端。
S307、从每个第二人声信号的冗余端删除与差异值等量的冗余数据,并在每个第二人声信号的待补位端添加与差异值等量的全零数据。
步骤S303-S307通过整体平移人声信号对齐人声信号,该方式遵循欧式距离最小化原则。按照上述示例,若M为正数,则表明歌曲音频A的演唱者开始演唱的时间晚于歌曲音频B的演唱者开始演唱的时间,那么以vocalA为对比基准,将vocalB向后(右)平移M个数据点,并以vocalA的首尾为参考点确定vocalB的冗余端和待补位端。针对冗余端,将平移后的vocalB超出vocalA的部分切掉;针对待补位端,对与vocalA相比vocalB欠缺的部分补零,即可使vocalB与vocalA对齐。
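步骤S303-S307所描述的整体平移对齐可以用如下示意性代码表示（假设beatA、beatB以数据帧为单位给出各Beat点位置且元素个数相同，frame_len为单位数据帧长度L对应的采样点数，vocalA、vocalB为一维numpy采样数组）：

import numpy as np

def align_by_shift(vocalA, vocalB, beatA, beatB, frame_len):
    # 第一公式：M=[sum(Beat0-BeatX)/numBeats]×L，此处以vocalA为基准对齐vocalB
    num_beats = len(beatA)
    m = int(round(np.sum(np.asarray(beatA) - np.asarray(beatB)) / num_beats * frame_len))
    if m > 0:
        # M为正：vocalA开唱更晚，将vocalB向右平移m个数据点，前端补零、尾部超出vocalA的部分切掉
        shifted = np.concatenate([np.zeros(m, dtype=vocalB.dtype), vocalB])[:len(vocalA)]
    else:
        # M为负或零：将vocalB向左平移|m|个数据点，尾部不足的部分补零
        shifted = np.concatenate([vocalB[-m:], np.zeros(-m, dtype=vocalB.dtype)])[:len(vocalA)]
    if len(shifted) < len(vocalA):
        shifted = np.concatenate([shifted, np.zeros(len(vocalA) - len(shifted), dtype=vocalB.dtype)])
    return shifted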
S308、判断当前获取到的歌曲音频的数量是否仅有两个;若是,则执行S309;若否,则退出流程。
S309、确定基准节奏信息为第三节拍信息,第三节拍信息为节拍集合中元素个数最少的节拍信息。
S310、将第四节拍信息中的元素个数删减至与第三节拍信息中的元素个数相同。
其中,第四节拍信息为节拍集合中除第三节拍信息以外的其他节拍信息。假设节拍集合包括:BeatA和BeatB,且BeatA包括3个元素:aA、bA、cA,BeatB包括4个元素:aB、bB、cB、dB,那么BeatA作为第三节拍信息,BeatB为第四节拍信息。
其中,将第四节拍信息中的元素个数删减至与第三节拍信息中的元素个数相同,包括:将第三节拍信息中的各个元素按照时间戳大小排列为目标序列;确定当前迭代次数,将目标序列中与当前迭代次数相等的排列位 置上的元素确定为目标元素;分别计算目标元素与各个对比元素的时间戳距离;对比元素为第四节拍信息中不与目标序列中的任一个元素匹配的元素;将最小时间戳距离对应的对比元素确定为与目标元素匹配的元素;若当前迭代次数不小于最大迭代次数,则删除当前第四节拍信息中的对比元素,保留第四节拍信息中与每个目标元素匹配的元素。
若当前迭代次数小于最大迭代次数,则当前迭代次数递增一,并执行确定当前迭代次数,将目标序列中与当前迭代次数相等的排列位置上的元素确定为目标元素;分别计算目标元素与各个对比元素的时间戳距离;将最小时间戳距离对应的对比元素确定为与目标元素匹配的元素的步骤,直至当前迭代次数不小于最大迭代次数。最大迭代次数为第三节拍信息中的元素个数。
基于上述示例，需要将BeatB中的某个元素删除，那么具体过程为：假设BeatA中的元素已按照时间戳升序排列，最大迭代次数为3。第一次迭代时，当前迭代次数取值1，那么目标元素为aA，此时分别计算aA与aB，aA与bB，aA与cB，aA与dB的时间戳距离，可获得4个距离：0.1、0.2、0.3、0.4；那么最小时间戳距离为0.1，其对应的对比元素为aB，因此确定aA与aB匹配。此时迭代次数小于最大迭代次数3，则迭代次数由1变为2，那么第二轮迭代的目标元素为bA；由于aA与aB匹配，那么aB不再是对比元素，因此计算bA与bB，bA与cB，bA与dB的时间戳距离，可获得3个距离：0.5、0.6、0.7；那么最小时间戳距离为0.5，其对应的对比元素为bB，因此确定bA与bB匹配。此时迭代次数小于最大迭代次数3，则迭代次数由2变为3，那么第三轮迭代的目标元素为cA；由于aA与aB匹配，bA与bB匹配，那么aB和bB不再是对比元素，因此计算cA与cB，cA与dB的时间戳距离，可获得2个距离：0.7、0.8，那么最小时间戳距离为0.7，其对应的对比元素为cB，因此确定cA与cB匹配。此时迭代次数不小于最大迭代次数3，则删除BeatB中的对比元素dB（由于aA与aB匹配，bA与bB匹配，cA与cB匹配，那么对比元素仅有dB），保留aB、bB、cB。至此，BeatA和BeatB都只有3个元素。BeatA包括3个元素：aA、bA、cA，BeatB包括3个元素：aB、bB、cB。
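上述按最小时间戳距离逐轮匹配、最后删除多余Beat点的过程，可写成如下示意性代码（假设beat3、beat4为Beat点时间戳列表，且beat3的元素个数较少）：

def prune_beats(beat3, beat4):
    # 将beat4删减至与beat3元素个数相同：每轮取beat3的当前目标元素，
    # 在尚未匹配的beat4元素（对比元素）中找时间戳距离最小者作为匹配元素
    target = sorted(beat3)          # 目标序列：按时间戳升序排列
    matched = []                    # beat4中已匹配元素的下标
    for t in target:                # 当前迭代次数从1递增到最大迭代次数len(beat3)
        candidates = [i for i in range(len(beat4)) if i not in matched]
        best = min(candidates, key=lambda i: abs(beat4[i] - t))
        matched.append(best)
    # 删除未匹配的对比元素，只保留与各目标元素匹配的元素
    return [beat4[i] for i in sorted(matched)]

# 用法示意：BeatB_pruned = prune_beats([aA, bA, cA], [aB, bB, cB, dB])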
S311、基于第三节拍信息或第四节拍信息确定多个相邻节拍组。
若BeatA包括3个元素：aA、bA、cA，BeatB包括3个元素：aB、bB、cB。那么可确定2个相邻节拍组：a与b、b与c。a与b对应的第一数据片段即为aA~bA所对应的vocalA中的片段，第二数据片段为aB~bB所对应的vocalB中的片段。b与c对应的第一数据片段即为bA~cA所对应的vocalA中的片段，第二数据片段为bB~cB所对应的vocalB中的片段。
请参见图5,图5示意了a与b这一个相邻节拍组,该相邻节拍组对应的第一数据片段(vocalA中的片段)包括4个数据帧(数据帧2、3、4、5),第二数据片段(vocalB中的片段)包括3个数据帧(数据帧2、3、4)。
S312、按照每个相邻节拍组划分第三人声信号和第四人声信号,得到每个相邻节拍组对应的第一数据片段和第二数据片段。
其中,第三人声信号为人声集合中与第三节拍信息具有映射关系的人声信号,第四人声信号为人声集合中除第三人声信号以外的其他人声信号。若是BeatA作为第三节拍信息,BeatB为第四节拍信息,那么第三人声信号即为vocalA,第四人声信号即为vocalB。第一数据片段为第三人声信号中的片段,第二数据片段为第四人声信号中的片段。
S313、针对每个相邻节拍组,使第一数据片段的数据长度和第二数据片段的数据长度相等。
由于单位数据帧长度恒定不变,因此第一数据片段的数据长度和第二数据片段的数据长度相等后,第一数据片段中的第一数据帧个数就等于第二数据片段中的第二数据帧个数。
请参见图5,第一数据片段中的第一数据帧个数不等于第二数据片段中的第二数据帧个数,则将第一数据帧个数和第二数据帧个数中的最大值对应的数据片段确定为待删减片段;计算待删减片段中每个数据帧的删减数,并按照删减数删减待删减片段中每个数据帧。
其中，计算待删减片段中每个数据帧的删减数，包括：按照第二公式计算待删减片段中每个数据帧的删减数；第二公式为：P=[(m-n)×L]/m；其中，P为每个数据帧的删减数，m为最大值，n为第一数据帧个数和第二数据帧个数中的最小值，L为单位数据帧长度。如图5所示，最大值为4，最小值为3，那么每个数据帧的删减数P=[(4-3)×L]/4=L/4。针对每个数据帧进行删减时，统一删除每个数据帧的头部或者尾部，并将删除后的所有数据帧按照原有顺序重新拼接起来。
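针对某个相邻节拍组，按第二公式对较长的数据片段逐帧删减的示意性实现如下（假设待删减片段可按单位数据帧长度frame_len切分为m帧，且统一从每帧尾部删减，仅为说明性写法）：

import numpy as np

def equalize_segment(long_seg, m, n, frame_len):
    # long_seg为数据帧个数较多（m帧）的待删减片段，n为较少一方的帧数
    # 第二公式：P=[(m-n)×L]/m，每帧删减p个采样点后按原有顺序重新拼接
    p = int(round((m - n) * frame_len / m))
    frames = [long_seg[i * frame_len:(i + 1) * frame_len] for i in range(m)]
    trimmed = [f[:frame_len - p] for f in frames]   # 统一删除每帧尾部p个采样点
    return np.concatenate(trimmed)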
请参见图6,图6示意了b与c这一个相邻节拍组,该相邻节拍组对应的第一数据片段(vocalA中的片段)包括3个数据帧(数据帧2、3、4),第二数据片段(vocalB中的片段)包括4个数据帧(数据帧2、3、4、5)。可见,本实施例针对每个相邻节拍组进行数据删减时,有时需要删减vocalA,有时需要删减vocalB,因此步骤S309~S313仅针对两个歌曲音频进行混音。按照步骤S309~S313使vocalA和vocalB中的各个数据片段对齐,即可实现vocalA和vocalB的对齐。
当然,按照步骤S309~S313的逻辑进行适应性改变,即可对3个、4个等更多人声信号进行对齐。假设三个待对齐的人声信号分别是:vocal1、vocal2、vocal3,那么可按照S309~S313对齐vocal1和vocal2,从而得到相互对齐的vocal1’和vocal2’,此时vocal1’和vocal2’中的数据帧个数相等,故vocal1’和vocal2’可认为是相同的(指数据帧个数相同)。之后分别对齐vocal1’和vocal3,vocal2’和vocal3,即可完成三个人声信号的对齐。
其中,由于vocal1’和vocal2’可认为是相同的,因此二者对齐vocal3时所删减的数据完全一致。同时,针对vocal3而言,其对齐vocal1’和vocal2’时,所删减的数据也相同。因此对齐vocal1’和vocal3,vocal2’和vocal3,可得到同一个vocal3’。最后可得到相互对齐的vocal1”、vocal2”和vocal3’。当然,如果vocal1”=vocal1’,那么也就无需对齐vocal2’和vocal3了,因为此种情况下vocal2”也会等于vocal2’。
如果对齐过程中每个人声信号都有改动,那么相应的伴奏信号也需要按照与人声信号相同的对齐方式进行对齐,最终输出与对齐后的所有人声信号对齐的伴奏信号。
本实施例根据歌曲音频的节拍信息,将不同版本的人声进行音轨对齐。本实施例可以针对同一首歌曲的至少两个演唱版本进行混音,能够覆盖更多歌曲进行混音,并且在混音过程中,在各个歌曲音频对应的节奏信息中选择基准节奏信息,基于基准节奏信息将各个演唱版本中的所有人声信号进行了音轨对齐,并选择了与人声信号的音轨对齐的伴奏信号,因此混合人声和伴奏时,能够使歌词、节拍等元素保持协调性和同步性,得到混音效果良好的混音歌曲,提高了混音效果。
上述实施例介绍的“在各个歌曲音频对应的节奏信息中选择基准节奏信息,基于基准节奏信息将人声集合中的所有人声信号进行音轨对齐”可以有多种方式,本实施例将对其中的另一种对齐方法进行介绍。若节奏信息为BPM值,则本实施例提供的对齐方法包括:
S701、统计每个歌曲音频对应的BPM值,得到包括至少两个BPM值的BPM值集合。
每个歌曲音频对应的BPM值可利用BPM检测算法完成统计。
BPM是Beat Per Minute的简称,又称拍子数,表示每分钟含有的节拍数。BPM值集合中的BPM值和人声集合中的人声信号具有一一映射关系。例如:针对3个歌曲音频:A、B、C进行混音,那么可获得3个人声信号(即人声集合):vocalA、vocalB和vocalC,3个BPM值(即BPM值集合):BPMA、BPMB、BPMC。可见,上述人声集合和BPM值集合中的元素具有一一映射关系,即:vocalA-BPMA,vocalB-BPMB,vocalC-BPMC。
S702、从BPM值集合中选择一个BPM值作为基准BPM值。
其中,基准BPM值即基准节奏信息。此时可在BPM值集合中随机选择一个BPM值作为基准BPM值。
S703、计算基准BPM值与每个目标BPM值的比值。
其中,目标BPM值为BPM值集合中除基准BPM值以外的其他BPM值。假设从BPM值集合中选择BPMA为基准BPM值,那么BPMB和BPMC即为目标BPM值。据此可获得比值:BPMA/BPMB、BPMA/BPMC。
S704、按照一一映射关系确定每个比值与每个目标人声信号的第二对应关系。
其中,目标人声信号为人声集合中除基准人声信号以外的其他人声信号,基准人声信号为人声集合中与基准BPM值具有映射关系的人声信号。若选择BPMA为基准BPM值,那么基准人声信号为vocalA,目标人声信号为vocalB和vocalC。
S705、按照第二对应关系确定调整每个目标人声信号所需的相应比值,并基于相应比值对每个目标人声信号进行变速不变调处理。
根据上述示例,BPMA/BPMB与vocalB对应,BPMA/BPMC与vocalC 对应,那么利用BPMA/BPMB对vocalB进行变速不变调处理,利用BPMA/BPMC对vocalC进行变速不变调处理,即可对齐vocalA、vocalB和vocalC。本实施例可利用变速不变调的处理器实现。
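变速不变调处理可借助现成的处理器或音频库完成，例如下面基于librosa的示意性写法（具体接口以所安装版本为准）：

import librosa

def stretch_to_reference(vocal, bpm_ref, bpm_cur):
    # 按比值 bpm_ref/bpm_cur 对目标人声做变速不变调处理，使其节奏与基准BPM一致
    rate = bpm_ref / bpm_cur        # rate>1加快，rate<1放慢
    return librosa.effects.time_stretch(vocal, rate=rate)

# 用法示意：vocalB_aligned = stretch_to_reference(vocalB, BPMA, BPMB)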
本实施例根据歌曲音频的BPM值，将不同版本的人声进行音轨对齐。本实施例可以针对同一首歌曲的至少两个演唱版本进行混音，能够覆盖更多歌曲进行混音，并且在混音过程中，在各个歌曲音频对应的节奏信息中选择基准节奏信息，基于基准节奏信息将各个演唱版本中的所有人声信号进行了音轨对齐，并选择了与人声信号的音轨对齐的伴奏信号，因此混合人声和伴奏时，能够使歌词、节拍等元素保持协调性和同步性，得到混音效果良好的混音歌曲，提高了混音效果。
基于上述任意实施例,需要说明的是,将音轨对齐后的所有人声信号作为待混音的人声音频之前,还可以基于人声信号的均方根(Root Mean Square,RMS)平衡不同人声信号的响度,以避免因响度不同而导致混音效果降低。在本实施例中,平衡不同人声信号的响度包括:从音轨对齐后的所有人声信号中随机选择一个人声信号作为标准人声信号;按照第三公式调整每个待调人声信号的响度;待调人声信号为音轨对齐后的所有人声信号中除标准人声信号以外的其他人声信号;其中,第三公式为:B=vocalX×(RMS0/RMSX);其中,B为调整响度之后的待调人声信号,vocalX为调整响度之前的待调人声信号,RMS0为标准人声信号的均方根,RMSX为vocalX的均方根。
假设音轨对齐后的所有人声信号为vocalA、vocalB和vocalC,那么可有RMSA、RMSB和RMSC。随机选择vocalA为标准人声信号,那么调整后的vocalB=vocalB×(RMSA/RMSB),调整后的vocalC=vocalC×(RMSA/RMSC),从而可减小vocalA、vocalB和vocalC的响度差异。
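按第三公式平衡响度的示意性实现如下（假设各人声信号为已完成音轨对齐的numpy数组，仅为说明性写法）：

import numpy as np

def rms(x):
    # 计算信号的均方根（RMS）
    return float(np.sqrt(np.mean(np.square(x))))

def balance_loudness(vocal_ref, vocal_x):
    # 第三公式：B = vocalX × (RMS0 / RMSX)，使待调人声的响度向标准人声靠拢
    return vocal_x * (rms(vocal_ref) / rms(vocal_x))

# 用法示意：vocalB = balance_loudness(vocalA, vocalB)，vocalC = balance_loudness(vocalA, vocalC)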
当然,也可以将两个人声信号分别放在左右声道上试听效果,通过人耳来判断两个音轨的响度是否是相近的。如果不相近,则调整人声信号的响度,以达到两个人声信号响度相近的效果。
本实施例利用左右耳能量差的原理，可减小不同人声信号的响度差异，获得具有立体声像的人声合唱效果。
下面通过具体的应用场景实例描述,来介绍本申请提供的混音歌曲生成方案。混音歌曲生成方案可以基于已有歌曲制作混音歌曲。按照本申请提供的混音歌曲生成方案可设计相应的混音歌曲制作工具,利用该工具即可完成混音歌曲的制作。混音歌曲制作工具可安装在任意计算机设备上。混音歌曲制作工具执行本申请提供的混音歌曲生成方法。
请参见图8,混音歌曲制作过程可以包括如下步骤:
S801、客户端上传同一首歌曲的至少两个演唱版本的歌曲音频至服务器;
S802、服务器将各个歌曲音频输入至自身中的混音歌曲制作工具,并由混音歌曲制作工具输出混音歌曲;
S803、服务器将混音歌曲发送至客户端;
S804、客户端播放该混音歌曲。
可见,本实施例提供的混音歌曲制作工具可以覆盖曲库中所有的歌曲。用户可以任意上传自己想要改编的歌曲做混音改编。如果一个歌曲在曲库中只有一个演唱版本,则可以自己跟着分离出的伴奏演唱一遍,从而制作出自己与专业演唱者出现在同一首歌里的混音效果。而且,用于混音的不同演唱版本,只需要对应的乐谱是相同的即可,哪怕是不同语言演绎的版本也是可以的。
本实施例以歌曲的Beat点和BPM值为依据对齐人声，能够通过改变背景声和人声的比例，来提升人声清晰度或增强背景声，以进行声场展宽，还可以调整人声音调，调整背景声各频谱能量比例。此外，还可以任意调整人声的声像位置和出现时间、人声与背景声的比例、人声的音调，以及背景声各频段的能量，可得到具备不同混音风格和演唱效果的混音歌曲，降低了音乐二次创作的门槛。
基于本实施例提供的混音歌曲制作工具，用户既可以改编人声（制作多方位的双音效果，或者对歌曲人声做单独变调处理），也可以改编背景声（制作清晰人声、声场展宽和节奏增强等）。这种制作方法，大大扩展了双音效果所覆盖的歌曲范围，同时，也使得混音效果的制作有了更多可改编的内容和方式。
请参见图9,图9为本申请实施例提供的一种混音歌曲生成装置示意图,包括:
获取模块901,用于获取至少两个歌曲音频;至少两个歌曲音频为同一首歌曲的不同演唱版本;
提取模块902,用于提取每个歌曲音频中的人声信号和伴奏信号,得到包括至少两个人声信号的人声集合和包括至少两个伴奏信号的伴奏集合;
对齐模块903,用于在各个歌曲音频对应的节奏信息中选择基准节奏信息,基于基准节奏信息将人声集合中的所有人声信号进行音轨对齐,并将音轨对齐后的所有人声信号作为待混音的人声音频;
选择模块904,用于将伴奏集合中,与人声音频的音轨对齐的伴奏信号确定为待混音的伴奏音频;
混合模块905,用于混合人声音频和伴奏音频,得到混音歌曲。
在一种具体实施方式中,提取模块包括:
第一提取单元,用于计算每个歌曲音频对应的中置信号,并从中置信号中提取每个歌曲音频中的人声信号;
第二提取单元,用于提取每个歌曲音频中的左声道人声和右声道人声,并将左声道人声和右声道人声的幅度平均值或频谱特征平均值,确定为每个歌曲音频中的人声信号。
在一种具体实施方式中,提取模块包括:
第三提取单元,用于提取每个歌曲音频中的左声道伴奏或右声道伴奏,并将左声道伴奏或右声道伴奏确定为每个歌曲音频中的伴奏信号。
在一种具体实施方式中,若节奏信息为节拍信息,则对齐模块包括:
节拍提取单元,用于提取每个歌曲音频中的节拍信息,得到包括至少两个节拍信息的节拍集合;节拍集合中的节拍信息和人声集合中的人声信号具有一一映射关系;
第一选择单元,用于若节拍集合中各个节拍信息包括的元素个数相同, 则确定基准节奏信息为第一节拍信息,第一节拍信息为节拍集合中的任一个节拍信息;
第一计算单元,用于分别计算第一节拍信息与每个第二节拍信息的差异值;第二节拍信息为节拍集合中除第一节拍信息以外的其他节拍信息;
第一确定单元,用于按照一一映射关系确定每个差异值与每个第二人声信号的第一对应关系;第二人声信号为人声集合中除第一人声信号以外的其他人声信号,第一人声信号为人声集合中与第一节拍信息具有映射关系的人声信号;
第二确定单元,用于按照第一对应关系确定调整每个第二人声信号所需的相应差异值,并基于相应差异值确定每个第二人声信号的冗余端和待补位端;
第一对齐单元,用于从每个第二人声信号的冗余端删除与差异值等量的冗余数据,并在每个第二人声信号的待补位端添加与差异值等量的全零数据。
在一种具体实施方式中,第一计算单元具体用于:
按照第一公式分别计算第一节拍信息与每个第二节拍信息的差异值;第一公式为:M=[sum(Beat0–BeatX)/numBeats]×L;
其中,M为Beat0与BeatX的差异值;Beat0为第一节拍信息的向量表示;BeatX为任一个第二节拍信息的向量表示;sum(Beat0–BeatX)为Beat0和BeatX中各个元素对位相减后,得到的所有差值的累加和;numBeats为各个节拍信息包括的元素个数;L为单位数据帧长度。
在一种具体实施方式中,对齐模块还包括:
第二选择单元,用于若获取到两个歌曲音频,且节拍集合中各个节拍信息包括的元素个数不同,则确定基准节奏信息为第三节拍信息;第三节拍信息为节拍集合中元素个数最少的节拍信息;
删减单元,用于将第四节拍信息中的元素个数删减至与第三节拍信息中的元素个数相同;第四节拍信息为节拍集合中除第三节拍信息以外的其他节拍信息;
第三确定单元,用于基于第三节拍信息或第四节拍信息确定多个相邻节拍组;
划分单元,用于按照每个相邻节拍组划分第三人声信号和第四人声信号,得到每个相邻节拍组对应的第一数据片段和第二数据片段;第三人声信号为人声集合中与第三节拍信息具有映射关系的人声信号,第四人声信号为人声集合中除第三人声信号以外的其他人声信号;
第二对齐单元,用于针对每个相邻节拍组,使第一数据片段的数据长度和第二数据片段的数据长度相等。
在一种具体实施方式中,第二对齐单元包括:
第一确定子单元,用于若第一数据片段中的第一数据帧个数不等于第二数据片段中的第二数据帧个数,则将第一数据帧个数和第二数据帧个数中的最大值对应的数据片段确定为待删减片段;
第一计算子单元,用于计算待删减片段中每个数据帧的删减数,并按照删减数删减待删减片段中每个数据帧。
在一种具体实施方式中，第一计算子单元具体用于：
按照第二公式计算待删减片段中每个数据帧的删减数;第二公式为:P=[(m-n)×L]/m;
其中,P为每个数据帧的删减数,m为最大值,n为第一数据帧个数和第二数据帧个数中的最小值,L为单位数据帧长度。
在一种具体实施方式中,删减单元包括:
排列子单元,用于将第三节拍信息中的各个元素按照时间戳大小排列为目标序列;
第二确定子单元,用于确定当前迭代次数,将目标序列中与当前迭代次数相等的排列位置上的元素确定为目标元素;
第二计算子单元,用于分别计算目标元素与各个对比元素的时间戳距离;对比元素为第四节拍信息中不与目标序列中的任一个元素匹配的元素;
第三确定子单元,用于将最小时间戳距离对应的对比元素确定为与目标元素匹配的元素;
删减子单元,用于若当前迭代次数不小于最大迭代次数,则删除当前第四节拍信息中的对比元素,保留第四节拍信息中与每个目标元素匹配的元素。
在一种具体实施方式中,删减单元还包括:
迭代子单元,用于若当前迭代次数小于最大迭代次数,则当前迭代次数递增一,并执行第二确定子单元、第二计算子单元、第三确定子单元中的步骤,直至当前迭代次数不小于最大迭代次数。
在一种具体实施方式中,若节奏信息为BPM值,则对齐模块包括:
统计单元,用于统计每个歌曲音频对应的BPM值,得到包括至少两个BPM值的BPM值集合;BPM值集合中的BPM值和人声集合中的人声信号具有一一映射关系;
第三选择单元,用于从BPM值集合中选择一个BPM值作为基准BPM值;基准BPM值即基准节奏信息;
第二计算单元,用于计算基准BPM值与每个目标BPM值的比值;目标BPM值为BPM值集合中除基准BPM值以外的其他BPM值;
第四确定单元,用于按照一一映射关系确定每个比值与每个目标人声信号的第二对应关系;目标人声信号为人声集合中除基准人声信号以外的其他人声信号,基准人声信号为人声集合中与基准BPM值具有映射关系的人声信号;
第三对齐单元,用于按照第二对应关系确定调整每个目标人声信号所需的相应比值,并基于相应比值对每个目标人声信号进行变速不变调处理。
在一种具体实施方式中,还包括:
标准人声选择模块,用于从音轨对齐后的所有人声信号中随机选择一个人声信号作为标准人声信号;
调整模块,用于按照第三公式调整每个待调人声信号的响度;待调人声信号为音轨对齐后的所有人声信号中除标准人声信号以外的其他人声信号;
其中,第三公式为:B=vocalX×(RMS0/RMSX);
其中,B为调整响度之后的待调人声信号,vocalX为调整响度之前的待调人声信号,RMS0为标准人声信号的均方根,RMSX为vocalX的均方根。
在一种具体实施方式中,混合模块包括:
第三计算单元,用于计算左声道增益值和右声道增益值;
第五确定单元，用于基于左声道增益值和右声道增益值，确定人声音频中的每个人声信号的立体声信号；
混合单元,用于混合各个立体声信号和伴奏音频,得到混音歌曲。
在一种具体实施方式中,混合单元具体用于:
按照第四公式混合各个立体声信号和伴奏音频,得到混音歌曲;
其中,第四公式为:
SongComb=alpha×(vocal1+…+vocalN)+(1-alpha)×surround;
其中,SongComb为混音歌曲,vocal1、…、vocalN为各个立体声信号,alpha为预设调整因子,surround为伴奏音频。
在一种具体实施方式中,第三计算单元具体用于:
根据预设声像角度和人声信号在预设声像角度中的预设位置,计算左声道增益值和右声道增益值;或通过分配线性增益的方式计算左声道增益值和右声道增益值。
在一种具体实施方式中,选择模块包括:
第四选择单元,用于在所述伴奏集合中,选择与所述基准节奏信息对齐的伴奏信号作为待混音的伴奏音频;
第四对齐单元,用于将所述伴奏集合中的任一个伴奏信号与所述基准节奏信息进行音轨对齐后,作为待混音的伴奏音频。
其中,关于本实施例中各个模块、单元更加具体的工作过程可以参考前述实施例中公开的相应内容,在此不再进行赘述。
可见,本实施例提供了一种混音歌曲生成装置,该装置根据歌曲音频的节拍信息,将不同版本的人声进行音轨对齐。本实施例可以针对同一首歌曲的至少两个演唱版本进行混音,能够覆盖更多歌曲进行混音,并且在混音过程中,将各个演唱版本中的所有人声信号进行了音轨对齐,并选择了与人声信号的音轨对齐的伴奏信号,因此混合人声和伴奏时,能够使歌词、节拍等元素保持协调性和同步性,得到混音效果良好的混音歌曲,提高了混音效果。
进一步的，本申请实施例还提供了一种电子设备。其中，上述电子设备既可以是如图10所示的服务器50，也可以是如图11所示的终端60。图10和图11均是根据一示例性实施例示出的电子设备结构图，图中的内容不能被认为是对本申请的使用范围的任何限制。
图10为本申请实施例提供的一种服务器的结构示意图。该服务器50,具体可以包括:至少一个处理器51、至少一个存储器52、电源53、通信接口54、输入输出接口55和通信总线56。其中,所述存储器52用于存储计算机程序,所述计算机程序由所述处理器51加载并执行,以实现前述任一实施例公开的混音歌曲生成中的相关步骤。
本实施例中,电源53用于为服务器50上的各硬件设备提供工作电压;通信接口54能够为服务器50创建与外界设备之间的数据传输通道,其所遵循的通信协议是能够适用于本申请技术方案的任意通信协议,在此不对其进行具体限定;输入输出接口55,用于获取外界输入数据或向外界输出数据,其具体的接口类型可以根据具体应用需要进行选取,在此不进行具体限定。
另外,存储器52作为资源存储的载体,可以是只读存储器、随机存储器、磁盘或者光盘等,其上所存储的资源包括操作系统521、计算机程序522及数据523等,存储方式可以是短暂存储或者永久存储。
其中,操作系统521用于管理与控制服务器50上的各硬件设备以及计算机程序522,以实现处理器51对存储器52中数据523的运算与处理,其可以是Windows Server、Netware、Unix、Linux等。计算机程序522除了包括能够用于完成前述任一实施例公开的混音歌曲生成方法的计算机程序之外,还可以进一步包括能够用于完成其他特定工作的计算机程序。数据523除了可以包括用于混音的歌曲音频等数据外,还可以包括应用程序的开发商信息等数据。
图11为本申请实施例提供的一种终端的结构示意图,该终端60具体可以包括但不限于智能手机、平板电脑、笔记本电脑或台式电脑等。
通常,本实施例中的终端60包括有:处理器61和存储器62。
其中，处理器61可以包括一个或多个处理核心，比如4核心处理器、8核心处理器等。处理器61可以采用DSP（Digital Signal Processing，数字信号处理）、FPGA（Field-Programmable Gate Array，现场可编程门阵列）、PLA（Programmable Logic Array，可编程逻辑阵列）中的至少一种硬件形式来实现。处理器61也可以包括主处理器和协处理器，主处理器是用于对在唤醒状态下的数据进行处理的处理器，也称CPU（Central Processing Unit，中央处理器）；协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中，处理器61中可以集成有GPU（Graphics Processing Unit，图像处理器），GPU用于负责显示屏所需要显示的内容的渲染和绘制。一些实施例中，处理器61还可以包括AI（Artificial Intelligence，人工智能）处理器，该AI处理器用于处理有关机器学习的计算操作。
存储器62可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。存储器62还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。本实施例中,存储器62至少用于存储以下计算机程序621,其中,该计算机程序被处理器61加载并执行之后,能够实现前述任一实施例公开的由终端侧执行的混音歌曲生成方法中的相关步骤。另外,存储器62所存储的资源还可以包括操作系统622和数据623等,存储方式可以是短暂存储或者永久存储。其中,操作系统622可以包括Windows、Unix、Linux等。数据623可以包括但不限于待混音的歌曲音频。
在一些实施例中,终端60还可包括有显示屏63、输入输出接口64、通信接口65、传感器66、电源67以及通信总线68。
本领域技术人员可以理解,图11中示出的结构并不构成对终端60的限定,可以包括比图示更多或更少的组件。
进一步的,本申请实施例还公开了一种存储介质,所述存储介质中存储有计算机可执行指令,所述计算机可执行指令被处理器加载并执行时,实现前述任一实施例公开的混音歌曲生成方法。关于该方法的具体步骤可以参考前述实施例中公开的相应内容,在此不再进行赘述。
需要指出的是,上述仅为本申请的较佳实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。
本说明书中各个实施例采用递进的方式描述，每个实施例重点说明的都是与其它实施例的不同之处，各个实施例之间相同或相似部分互相参见即可。对于实施例公开的装置而言，由于其与实施例公开的方法相对应，所以描述的比较简单，相关之处参见方法部分说明即可。
本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (19)

  1. 一种混音歌曲生成方法,其特征在于,包括:
    获取至少两个歌曲音频;所述至少两个歌曲音频为同一首歌曲的不同演唱版本;
    提取每个歌曲音频中的人声信号和伴奏信号,得到包括至少两个人声信号的人声集合和包括至少两个伴奏信号的伴奏集合;
    在各个歌曲音频对应的节奏信息中选择基准节奏信息,基于所述基准节奏信息将所述人声集合中的所有人声信号进行音轨对齐,并将音轨对齐后的所有人声信号作为待混音的人声音频;
    将所述伴奏集合中,与所述人声音频的音轨对齐的伴奏信号确定为待混音的伴奏音频;
    混合所述人声音频和所述伴奏音频,得到混音歌曲。
  2. 根据权利要求1所述的混音歌曲生成方法,其特征在于,所述提取每个歌曲音频中的人声信号,包括:
    计算每个歌曲音频对应的中置信号,并从所述中置信号中提取每个歌曲音频中的人声信号;
    提取每个歌曲音频中的左声道人声和右声道人声,并将所述左声道人声和所述右声道人声的幅度平均值或频谱特征平均值,确定为每个歌曲音频中的人声信号。
  3. 根据权利要求1所述的混音歌曲生成方法,其特征在于,所述提取每个歌曲音频中的伴奏信号,包括:
    提取每个歌曲音频中的左声道伴奏或右声道伴奏,并将所述左声道伴奏或所述右声道伴奏确定为每个歌曲音频中的伴奏信号。
  4. 根据权利要求1所述的混音歌曲生成方法,其特征在于,若所述节奏信息为节拍信息,则所述在各个歌曲音频对应的节奏信息中选择基准节奏信息,基于所述基准节奏信息将所述人声集合中的所有人声信号进行音轨对齐,包括:
    提取每个歌曲音频中的节拍信息,得到包括至少两个节拍信息的节拍集合;所述节拍集合中的节拍信息和所述人声集合中的人声信号具有一一 映射关系;
    若所述节拍集合中各个节拍信息包括的元素个数相同,则确定所述基准节奏信息为第一节拍信息;所述第一节拍信息为所述节拍集合中的任一个节拍信息;
    分别计算所述第一节拍信息与每个第二节拍信息的差异值;所述第二节拍信息为所述节拍集合中除所述第一节拍信息以外的其他节拍信息;
    按照所述一一映射关系确定每个差异值与每个第二人声信号的第一对应关系;所述第二人声信号为所述人声集合中除第一人声信号以外的其他人声信号,所述第一人声信号为所述人声集合中与所述第一节拍信息具有映射关系的人声信号;
    按照所述第一对应关系确定调整每个第二人声信号所需的相应差异值,并基于相应差异值确定每个第二人声信号的冗余端和待补位端;
    从每个第二人声信号的冗余端删除与所述差异值等量的冗余数据,并在每个第二人声信号的待补位端添加与所述差异值等量的全零数据。
  5. 根据权利要求4所述的混音歌曲生成方法,其特征在于,所述分别计算所述第一节拍信息与每个第二节拍信息的差异值,包括:
    按照第一公式分别计算所述第一节拍信息与每个第二节拍信息的差异值;所述第一公式为:M=[sum(Beat0–BeatX)/numBeats]×L;
    其中,M为Beat0与BeatX的差异值;Beat0为所述第一节拍信息的向量表示;BeatX为任一个第二节拍信息的向量表示;sum(Beat0–BeatX)为Beat0和BeatX中各个元素对位相减后,得到的所有差值的累加和;numBeats为各个节拍信息包括的元素个数;L为单位数据帧长度。
  6. 根据权利要求4所述的混音歌曲生成方法,其特征在于,还包括:
    若获取到两个歌曲音频,且所述节拍集合中各个节拍信息包括的元素个数不同,则确定所述基准节奏信息为第三节拍信息;所述第三节拍信息为所述节拍集合中元素个数最少的节拍信息;
    将第四节拍信息中的元素个数删减至与所述第三节拍信息中的元素个数相同;所述第四节拍信息为所述节拍集合中除所述第三节拍信息以外的其他节拍信息;
    基于所述第三节拍信息或所述第四节拍信息确定多个相邻节拍组;
    按照每个相邻节拍组划分第三人声信号和第四人声信号,得到每个相邻节拍组对应的第一数据片段和第二数据片段;所述第三人声信号为所述人声集合中与所述第三节拍信息具有映射关系的人声信号,所述第四人声信号为所述人声集合中除所述第三人声信号以外的其他人声信号;
    针对每个相邻节拍组,使所述第一数据片段的数据长度和所述第二数据片段的数据长度相等。
  7. 根据权利要求6所述的混音歌曲生成方法,其特征在于,所述使所述第一数据片段的数据长度和所述第二数据片段的数据长度相等,包括:
    若所述第一数据片段中的第一数据帧个数不等于所述第二数据片段中的第二数据帧个数,则将所述第一数据帧个数和所述第二数据帧个数中的最大值对应的数据片段确定为待删减片段;
    计算所述待删减片段中每个数据帧的删减数,并按照所述删减数删减所述待删减片段中每个数据帧。
  8. 根据权利要求7所述的混音歌曲生成方法,其特征在于,所述计算所述待删减片段中每个数据帧的删减数,包括:
    按照第二公式计算所述待删减片段中每个数据帧的删减数;所述第二公式为:P=[(m-n)×L]/m;
    其中,P为每个数据帧的删减数,m为所述最大值,n为所述第一数据帧个数和所述第二数据帧个数中的最小值,L为单位数据帧长度。
  9. 根据权利要求6所述的混音歌曲生成方法,其特征在于,所述将第四节拍信息中的元素个数删减至与所述第三节拍信息中的元素个数相同,包括:
    将所述第三节拍信息中的各个元素按照时间戳大小排列为目标序列;
    确定当前迭代次数,将所述目标序列中与当前迭代次数相等的排列位置上的元素确定为目标元素;
    分别计算所述目标元素与各个对比元素的时间戳距离;所述对比元素为所述第四节拍信息中不与所述目标序列中的任一个元素匹配的元素;
    将最小时间戳距离对应的对比元素确定为与所述目标元素匹配的元素;
    若当前迭代次数不小于最大迭代次数,则删除当前所述第四节拍信息 中的对比元素,保留所述第四节拍信息中与每个目标元素匹配的元素。
  10. 根据权利要求9所述的混音歌曲生成方法,其特征在于,
    若当前迭代次数小于最大迭代次数,则当前迭代次数递增一,并执行确定当前迭代次数,将所述目标序列中与当前迭代次数相等的排列位置上的元素确定为目标元素;分别计算所述目标元素与各个对比元素的时间戳距离;将最小时间戳距离对应的对比元素确定为与所述目标元素匹配的元素的步骤,直至当前迭代次数不小于最大迭代次数。
  11. 根据权利要求1所述的混音歌曲生成方法,其特征在于,若所述节奏信息为BPM值,则所述在各个歌曲音频对应的节奏信息中选择基准节奏信息,基于所述基准节奏信息将所述人声集合中的所有人声信号进行音轨对齐,包括:
    统计每个歌曲音频对应的BPM值,得到包括至少两个BPM值的BPM值集合;所述BPM值集合中的BPM值和所述人声集合中的人声信号具有一一映射关系;
    从所述BPM值集合中选择一个BPM值作为基准BPM值;所述基准BPM值为所述基准节奏信息;
    计算所述基准BPM值与每个目标BPM值的比值;所述目标BPM值为所述BPM值集合中除所述基准BPM值以外的其他BPM值;
    按照所述一一映射关系确定每个比值与每个目标人声信号的第二对应关系;所述目标人声信号为所述人声集合中除基准人声信号以外的其他人声信号,所述基准人声信号为所述人声集合中与所述基准BPM值具有映射关系的人声信号;
    按照所述第二对应关系确定调整每个目标人声信号所需的相应比值,并基于相应比值对每个目标人声信号进行变速不变调处理。
  12. 根据权利要求1所述的混音歌曲生成方法,其特征在于,所述将音轨对齐后的所有人声信号作为待混音的人声音频之前,还包括:
    从音轨对齐后的所有人声信号中随机选择一个人声信号作为标准人声信号;
    按照第三公式调整每个待调人声信号的响度;所述待调人声信号为音轨对齐后的所有人声信号中除所述标准人声信号以外的其他人声信号;
    其中,所述第三公式为:B=vocalX×(RMS0/RMSX);
    其中,B为调整响度之后的待调人声信号,vocalX为调整响度之前的待调人声信号,RMS0为所述标准人声信号的均方根,RMSX为vocalX的均方根。
  13. 根据权利要求1所述的混音歌曲生成方法,其特征在于,所述混合所述人声音频和所述伴奏音频,得到混音歌曲,包括:
    计算左声道增益值和右声道增益值;
    基于所述左声道增益值和所述右声道增益值,确定所述人声音频中的每个人声信号的立体声信号;
    混合各个立体声信号和所述伴奏音频,得到所述混音歌曲。
  14. 根据权利要求13所述的混音歌曲生成方法,其特征在于,所述混合各个立体声信号和所述伴奏音频,得到混音歌曲,包括:
    按照第四公式混合各个立体声信号和所述伴奏音频,得到所述混音歌曲;
    其中,所述第四公式为:
    SongComb=alpha×(vocal1+…+vocalN)+(1-alpha)×surround;
    其中,SongComb为所述混音歌曲,vocal1、…、vocalN为各个立体声信号,alpha为预设调整因子,surround为所述伴奏音频。
  15. 根据权利要求13所述的混音歌曲生成方法,其特征在于,所述计算左声道增益值和右声道增益值,包括:
    根据预设声像角度和人声信号在所述预设声像角度中的预设位置,计算所述左声道增益值和所述右声道增益值;
    通过分配线性增益的方式计算所述左声道增益值和所述右声道增益值。
  16. 根据权利要求1所述的混音歌曲生成方法,其特征在于,所述将所述伴奏集合中,与所述人声音频的音轨对齐的伴奏信号确定为待混音的伴奏音频,包括:
    在所述伴奏集合中,选择与所述基准节奏信息对齐的伴奏信号作为待混音的伴奏音频;
    将所述伴奏集合中的任一个伴奏信号与所述基准节奏信息进行音轨对齐后,作为待混音的伴奏音频。
  17. 一种混音歌曲生成装置,其特征在于,包括:
    获取模块,用于获取至少两个歌曲音频;所述至少两个歌曲音频为同一首歌曲的不同演唱版本;
    提取模块,用于提取每个歌曲音频中的人声信号和伴奏信号,得到包括至少两个人声信号的人声集合和包括至少两个伴奏信号的伴奏集合;
    对齐模块,用于在各个歌曲音频对应的节奏信息中选择基准节奏信息,基于所述基准节奏信息将所述人声集合中的所有人声信号进行音轨对齐,并将音轨对齐后的所有人声信号作为待混音的人声音频;
    选择模块,用于将所述伴奏集合中,与所述人声音频的音轨对齐的伴奏信号确定为待混音的伴奏音频;
    混合模块,用于混合所述人声音频和所述伴奏音频,得到混音歌曲。
  18. 一种电子设备,其特征在于,所述电子设备包括处理器和存储器;其中,所述存储器用于存储计算机程序,所述计算机程序由所述处理器加载并执行以实现如权利要求1至16任一项所述的混音歌曲生成方法。
  19. 一种存储介质,其特征在于,所述存储介质中存储有计算机可执行指令,所述计算机可执行指令被处理器加载并执行时,实现如权利要求1至16任一项所述的混音歌曲生成方法。
PCT/CN2021/122573 2021-02-24 2021-10-08 一种混音歌曲生成方法、装置、设备及存储介质 WO2022179110A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110205483.9A CN112967705B (zh) 2021-02-24 2021-02-24 一种混音歌曲生成方法、装置、设备及存储介质
CN202110205483.9 2021-02-24

Publications (1)

Publication Number Publication Date
WO2022179110A1 true WO2022179110A1 (zh) 2022-09-01

Family

ID=76285886

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/122573 WO2022179110A1 (zh) 2021-02-24 2021-10-08 一种混音歌曲生成方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN112967705B (zh)
WO (1) WO2022179110A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112967705B (zh) * 2021-02-24 2023-11-28 腾讯音乐娱乐科技(深圳)有限公司 一种混音歌曲生成方法、装置、设备及存储介质
CN114203163A (zh) * 2022-02-16 2022-03-18 荣耀终端有限公司 音频信号处理方法及装置
CN117059055A (zh) * 2022-05-07 2023-11-14 北京字跳网络技术有限公司 音频处理方法、装置、设备及存储介质
CN116524883B (zh) * 2023-07-03 2024-01-05 腾讯科技(深圳)有限公司 音频合成方法、装置、电子设备和计算机可读存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070218444A1 (en) * 2006-03-02 2007-09-20 David Konetski System and method for presenting karaoke audio features from an optical medium
CN110534078A (zh) * 2019-07-30 2019-12-03 黑盒子科技(北京)有限公司 一种基于音频特征的细粒度音乐节奏提取系统及方法
CN111326132A (zh) * 2020-01-22 2020-06-23 北京达佳互联信息技术有限公司 音频处理方法、装置、存储介质及电子设备
CN111916039A (zh) * 2019-05-08 2020-11-10 北京字节跳动网络技术有限公司 音乐文件的处理方法、装置、终端及存储介质
CN112216294A (zh) * 2020-08-31 2021-01-12 北京达佳互联信息技术有限公司 音频处理方法、装置、电子设备及存储介质
CN112967705A (zh) * 2021-02-24 2021-06-15 腾讯音乐娱乐科技(深圳)有限公司 一种混音歌曲生成方法、装置、设备及存储介质

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8005666B2 (en) * 2006-10-24 2011-08-23 National Institute Of Advanced Industrial Science And Technology Automatic system for temporal alignment of music audio signal with lyrics
CN106686431B (zh) * 2016-12-08 2019-12-10 杭州网易云音乐科技有限公司 一种音频文件的合成方法和设备
CN111345010B (zh) * 2018-08-17 2021-12-28 华为技术有限公司 一种多媒体内容同步方法、电子设备及存储介质
CN110992970B (zh) * 2019-12-13 2022-05-31 腾讯音乐娱乐科技(深圳)有限公司 音频合成方法及相关装置

Also Published As

Publication number Publication date
CN112967705A (zh) 2021-06-15
CN112967705B (zh) 2023-11-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21927535; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 18278602; Country of ref document: US)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21927535; Country of ref document: EP; Kind code of ref document: A1)
32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11/12/2023))