CN110599989B - Audio processing method, device and storage medium - Google Patents
- Publication number
- CN110599989B (application CN201910942105.1A)
- Authority
- CN
- China
- Prior art keywords
- audio
- loudness
- accompaniment
- singing
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/36—Accompaniment arrangements
- G10H1/361—Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
- G10H1/366—Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems with means for modifying or correcting the external signal, e.g. pitch correction, reverberation, changing a singer's voice
Landscapes
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
The embodiment of the invention discloses an audio processing method, an audio processing device, and a storage medium. In the scheme, the original audio of a target audio resource is obtained and segmented to obtain a first accompaniment audio and a mixed audio, where the mixed audio contains both accompaniment and vocals. The loudness of the first accompaniment audio and of the mixed audio is calculated, and the loudness of the first vocal audio is calculated from those two values. The cover audio of the target audio resource is then obtained, and the loudness of the second accompaniment audio and of the second vocal audio segmented from the cover audio is adjusted according to the loudness of the first accompaniment audio and of the first vocal audio, respectively. By measuring the loudness of the vocals and the accompaniment in the original audio and adjusting the vocals and accompaniment in the cover work accordingly, the cover work is brought closer to the original, improving the audio processing effect and the accuracy of loudness adjustment for cover songs.
Description
Technical Field
The present invention relates to the field of data processing, and in particular to an audio processing method, an audio processing device, and a storage medium.
Background
In recent years, the market for karaoke software on mobile terminals has grown steadily, with users spanning all ages and music skill levels. With the popularization of smart terminals such as smartphones and tablet computers, users can now sing karaoke without leaving home: after installing karaoke software on a smartphone, a user can sing a song without going to a KTV. In a karaoke scenario, to improve how a karaoke work — which contains both a vocal part and an accompaniment part — sounds, the smart terminal usually still needs to perform loudness adjustment on the vocal audio recorded by the karaoke user and the accompaniment audio played during the session.
In the prior art, loudness adjustment of the user audio and the accompaniment audio is generally done manually: for each song, the user adjusts the loudness of the vocal track and of the accompaniment track by hand to set the vocal-to-accompaniment ratio.
The applicant has found two problems in the related art. First, the ideal vocal-to-accompaniment ratio differs between song types and styles, and even between different passages of the same song, so keeping a single fixed ratio fails to achieve the best listening experience. Second, because the vocal-to-accompaniment ratio of the original recording is not analyzed, a cover work that sounds consistent with the original cannot be synthesized directly; the user must adjust the work manually to approach the ideal result, which is cumbersome and inaccurate.
Disclosure of Invention
The embodiment of the invention provides an audio processing method, an audio processing device, and a storage medium, aiming to improve the audio processing effect and the accuracy of loudness adjustment for cover songs.
The embodiment of the invention provides an audio processing method, which comprises the following steps:
acquiring the original audio of a target audio resource and segmenting it to obtain a first accompaniment audio and a mixed audio, wherein the mixed audio contains both accompaniment and vocals;
calculating the loudness of the first accompaniment audio and of the mixed audio, respectively;
calculating the loudness of a first vocal audio based on the loudness of the first accompaniment audio and of the mixed audio;
and acquiring the cover audio of the target audio resource, and adjusting the loudness of a second accompaniment audio and a second vocal audio segmented from the cover audio according to the loudness of the first accompaniment audio and of the first vocal audio, respectively.
An embodiment of the present invention further provides an audio processing apparatus, including:
a segmentation unit, configured to acquire and segment the original audio of a target audio resource to obtain a first accompaniment audio and a mixed audio, wherein the mixed audio contains both accompaniment and vocals;
a first calculating unit, configured to calculate the loudness of the first accompaniment audio and of the mixed audio, respectively;
a second calculating unit, configured to calculate the loudness of a first vocal audio based on the loudness of the first accompaniment audio and of the mixed audio;
and an adjusting unit, configured to acquire the cover audio of the target audio resource and adjust the loudness of a second accompaniment audio and a second vocal audio segmented from the cover audio according to the loudness of the first accompaniment audio and of the first vocal audio, respectively.
The embodiment of the invention also provides a storage medium storing a plurality of instructions, the instructions being suitable for being loaded by a processor to execute any audio processing method provided by the embodiment of the invention.
According to the audio processing scheme provided by the embodiment of the invention, the original audio of the target audio resource is obtained and segmented to obtain the first accompaniment audio and the mixed audio, where the mixed audio contains both accompaniment and vocals; the loudness of the first accompaniment audio and of the mixed audio is calculated; the loudness of the first vocal audio is calculated from those two values; the cover audio of the target audio resource is obtained; and the loudness of the second accompaniment audio and of the second vocal audio in the cover audio is adjusted according to the loudness of the first accompaniment audio and of the first vocal audio, respectively. By measuring the loudness of the vocals and the accompaniment in the original audio and adjusting the vocals and accompaniment in the cover work accordingly, the cover work is brought closer to the original, improving the audio processing effect and the accuracy of loudness adjustment for cover songs.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1a is a first flowchart of an audio processing method according to an embodiment of the invention;
fig. 1b is a second flow chart of the audio processing method according to the embodiment of the invention;
fig. 2 is a schematic view of a scene of an audio processing method according to an embodiment of the present invention;
fig. 3a is a schematic diagram of a first structure of an audio processing apparatus according to an embodiment of the present invention;
fig. 3b is a schematic diagram of a second structure of an audio processing apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
An embodiment of the present invention provides an audio processing method. The execution subject of the method may be the audio processing apparatus provided in the embodiment of the present invention, or a server integrating that apparatus; the apparatus may be implemented in hardware or software.
Before describing the technical solution of the present invention, the related technical terms are briefly explained:
loudness: loudness is the subjective perception of sound pressure by a person and is also an attribute of hearing according to which sounds can be ranked, e.g., from quiet to loud or from loud to quiet. Although loudness is a physical attribute of sound, it is closely related to the physiological and psychological feelings of a listener, and precisely, this belongs to the category of psychophysics.
Vocal-to-accompaniment ratio: the loudness relationship between the vocals and the accompaniment in a piece of music. It is not a specific numerical scale but a relative relationship of loudness. A high ratio means the vocals are loud and the accompaniment is quiet; a low ratio means the vocals are quiet and the accompaniment is loud; a balanced ratio means the vocal loudness and the accompaniment loudness are close.
Loudness standard EBU R128: EBU stands for European Broadcasting Union, and EBU R128 is its recommendation for loudness control. It defines details such as the measured loudness quantities and the integration window lengths more precisely on the basis of the ITU-R BS.1770 standard (the International Telecommunication Union's measurement algorithm for the loudness and true peak of audio programmes).
Loudness unit LUFS: LUFS stands for Loudness Units relative to Full Scale. The larger the LUFS value, the greater the loudness. In particular, 0 is the maximum of this full-scale unit, so LUFS values are negative.
Loudness gain: a loudness difference. For example, if the current loudness is -10 LUFS and the target loudness is -8 LUFS, the loudness gain is 2 LU.
Lyric file QRC: a lyric file format that supports the karaoke function; lyric display is positioned word by word with millisecond accuracy, so synchronized lyric display is more precise.
As shown in fig. 1a, fig. 1a is a first flowchart of an audio processing method according to an embodiment of the present invention, and the specific flow of the audio processing method may be as follows:
101. Obtain the original audio of a target audio resource and the lyric file of that original audio, and segment the original audio according to the lyric file to obtain a first accompaniment audio and a mixed audio containing both accompaniment and vocals.
In an embodiment, the target audio resource may be a song the karaoke user is singing or about to sing. Specifically, the original audio of the target audio resource and its lyric file may be requested from a server: a request can be sent according to an identifier of the target audio resource (song title, album name, artist, and the like), and the original audio and the lyric file returned by the server in response to the request can then be received. The original audio is the original recorded work, the first accompaniment audio is the pure-accompaniment part of the original audio, and the mixed audio is the part of the original audio containing both accompaniment and vocals.
in an embodiment, the original singing audio frequency can be segmented according to the lyric file, and specifically, the lyric file of the original singing audio frequency can be crawled in the internet through a crawler technology. For example, the electronic device captures lyric files of the original audio in each music platform by running a preset insect capturing script. The preset insect catching script can be written by a user according to actual requirements. It should be noted that the lyrics file of the original audio frequency may also be directly imported by a user, and those skilled in the art can understand that in practical application, the lyrics file of the original audio frequency may be obtained in various ways, and the embodiment does not limit the specific way of obtaining the lyrics file of the original audio frequency.
In an embodiment, the lyric file may be a QRC lyric file. Since a QRC lyric file contains timestamp information for every word of the lyrics, these timestamps may be used to segment the original audio. For example, for a three-minute song, the lyric timestamps may show that vocals are present in the intervals 0:20-1:10 and 1:30-2:40. The original audio can then be divided into two parts. The first part consists of 0:00-0:20, 1:10-1:30, and 2:40 to the end of the audio at 3:00; it is the pure accompaniment containing no vocals, i.e., the first accompaniment audio. The second part consists of 0:20-1:10 and 1:30-2:40 of the original audio, which contains both accompaniment and vocals, i.e., the mixed audio.
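The segmentation step above can be sketched as follows (a minimal illustration assuming the track is a mono sample array and the vocal intervals have already been read from the lyric timestamps; the function and variable names are hypothetical, not from the patent):

```python
import numpy as np

def split_by_vocal_intervals(samples, sample_rate, vocal_intervals):
    """Split an original track into a pure-accompaniment part and a
    mixed (accompaniment + voice) part.

    vocal_intervals: list of (start_sec, end_sec) pairs where the lyric
    timestamps say a voice is present, e.g. [(20, 70), (90, 160)] for
    the 0:20-1:10 and 1:30-2:40 intervals in the text above.
    """
    vocal_mask = np.zeros(len(samples), dtype=bool)
    for start_sec, end_sec in vocal_intervals:
        lo = int(start_sec * sample_rate)
        hi = min(int(end_sec * sample_rate), len(samples))
        vocal_mask[lo:hi] = True
    mixed = samples[vocal_mask]           # accompaniment + voice
    accompaniment = samples[~vocal_mask]  # pure accompaniment
    return accompaniment, mixed
```

In practice the intervals would come from parsing the per-word QRC timestamps; here they are passed in directly.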
102. And respectively calculating the loudness of the first accompaniment audio and the mixed audio.
In one embodiment, the loudness of the first accompaniment audio and of the mixed audio may each be calculated using the EBU R128 algorithm. The EBU specifies three loudness measures, namely the loudness level, the loudness range, and the true peak level.
The loudness level describes how loud the programme is, i.e., the subjective volume of the audio programme under test compared with a standard reference programme under specified playback conditions and the same broadcast duration. The EBU loudness level units are LUFS and LU, where a step of 1 LUFS equals 1 LU, and a larger value means a louder programme. The EBU follows the ITU-R K-weighted loudness algorithm, which has three steps. First, a sliding rectangular window cuts a loudness block of duration T from the audio under test and applies K-weighting filtering. Second, after filtering, the mean-square energy of each channel's samples is computed. Third, the mean-square values of all channels are weighted, summed, and converted to a logarithmic value to obtain the loudness level.
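The three steps above can be sketched for a single block as follows (a simplified sketch: step 1's K-weighting filter is omitted for brevity, so the result matches ITU-R BS.1770 only for signals the filter barely changes, such as a mid-frequency tone; the -0.691 offset and the logarithmic form follow the standard, and the function name is hypothetical):

```python
import numpy as np

def loudness_level(channels, channel_weights=None):
    """Steps 2-3 of the K-weighted loudness measurement for one block.

    channels: list of 1-D sample arrays, one per channel, assumed to be
    already K-weighted (step 1 is omitted in this sketch).
    """
    if channel_weights is None:
        channel_weights = [1.0] * len(channels)  # front-channel weight
    # Step 2: mean-square energy of each channel's samples.
    energies = [np.mean(np.square(ch)) for ch in channels]
    # Step 3: weighted sum over channels, then take the logarithm.
    total = sum(g * z for g, z in zip(channel_weights, energies))
    return -0.691 + 10.0 * np.log10(total)
```

For a full-scale mono sine the mean-square energy is 0.5, giving roughly -0.691 + 10·log10(0.5) ≈ -3.7 LUFS.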
The loudness range describes the loudness contrast of a programme, i.e., the dispersion of its short-term loudness level, or the distribution width covering the most frequently occurring 85% of loudness values. Measuring the loudness range takes four steps. First, the programme audio is cut into overlapping short-term loudness blocks. Second, a loudness-probability distribution of the programme is drawn with loudness on the horizontal axis and probability density on the vertical axis. Third, the portions below -70 LUFS (the absolute gate) and more than 20 LU below the gated overall loudness (the relative gate) are removed. Fourth, the loudness range is the width of the horizontal axis between the 10% and 95% points of the remaining cumulative loudness distribution.
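A sketch of the four steps, assuming the short-term loudness of the overlapping blocks has already been measured (the overall gated loudness is approximated here by the mean energy of the absolutely gated blocks — an approximation for brevity, and the names are hypothetical):

```python
import numpy as np

def loudness_range(short_term_loudness):
    """Estimate the loudness range (LRA, in LU) from the short-term
    loudness values (in LUFS) of overlapping blocks."""
    st = np.asarray(short_term_loudness, dtype=float)
    # Absolute gate: drop blocks below -70 LUFS.
    st = st[st > -70.0]
    # Relative gate: drop blocks more than 20 LU below the overall
    # loudness (approximated by the mean energy of the remaining blocks).
    overall = 10.0 * np.log10(np.mean(10.0 ** (st / 10.0)))
    st = st[st > overall - 20.0]
    # Width between the 10th and 95th percentiles of what remains.
    lo, hi = np.percentile(st, [10, 95])
    return hi - lo
```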
The true peak level differs from the commonly used PPM "peak level". The transient response of an analog quasi-peak meter is limited by its rise time, so peaks shorter than the rise time are not displayed; a digital PPM implemented as a sample-peak meter indicates the maximum of the sample points and cannot reflect peaks that fall between samples. A true-peak meter is an improved sample-peak meter that upsamples the audio by at least a factor of 4 before reading the sample peak. Compared with a PPM, the error of a true-peak meter is small, the headroom reserved for measurement uncertainty can be reduced, and the dynamic range of the digital signal is used to the fullest. The unit of the true peak level is dBTP, referenced to full scale.
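The upsample-then-read-peak idea can be sketched in a few lines (FFT-based upsampling is used here as a stand-in for the standard's polyphase interpolator — an assumption made for brevity, exact only for band-limited periodic blocks; the function name is hypothetical):

```python
import numpy as np

def true_peak_dbtp(samples, oversample=4):
    """True peak level: upsample by >= 4x before reading the peak, so
    inter-sample peaks are not missed."""
    n = len(samples) * oversample
    spectrum = np.fft.rfft(samples)
    padded = np.zeros(n // 2 + 1, dtype=complex)
    padded[: len(spectrum)] = spectrum
    # irfft normalizes by 1/n, so scale by the oversampling factor to
    # preserve the original amplitude.
    upsampled = np.fft.irfft(padded, n) * oversample
    return 20.0 * np.log10(np.max(np.abs(upsampled)))
```

A sine at a quarter of the sample rate, sampled 45° away from its crests, has a sample peak of about -3 dBFS but a true peak of 0 dBTP — exactly the case a sample-peak meter misses.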
In other embodiments, the loudness of the first accompaniment audio and the mixed audio may also be calculated according to methods such as an average amplitude or a maximum amplitude, which is not further limited in this application.
103. Calculate the loudness of the first vocal audio according to the loudness of the first accompaniment audio and of the mixed audio.
In an embodiment, the loudness of the dry vocal in the mixed audio may be calculated according to the sound superposition principle. Specifically, the loudness gain contributed by the vocals in the mixed audio can be calculated from the loudness of the first accompaniment audio and the loudness of the mixed audio, and the loudness of the dry vocal in the mixed audio is then calculated from that gain. That is, the loudness of the first vocal audio is calculated from the loudness of the first accompaniment audio and of the mixed audio using a preset formula:
L_G = L_M - L_A

where L_A is the loudness of the first accompaniment audio, L_M is the loudness of the mixed audio, L_G is the loudness gain contributed by the first vocal audio in the mixed audio, and L_V is the loudness of the first vocal audio.
104. Obtain the cover audio of the target audio resource, and adjust the loudness of a second accompaniment audio and a second vocal audio in the cover audio according to the loudness of the first accompaniment audio and of the first vocal audio, respectively.
In an embodiment, the cover audio of the target audio resource is the user's rendition. After the cover audio is obtained, the second accompaniment audio and the second vocal audio in it may first be extracted, and their loudness values calculated separately. The loudness of the second accompaniment audio and of the second vocal audio can then be adjusted according to the loudness of the first accompaniment audio and of the first vocal audio: for example, the loudness of the second accompaniment audio is adjusted according to the loudness of the first accompaniment audio, and the loudness of the second vocal audio is adjusted according to the loudness of the first vocal audio.
Furthermore, after the loudness of the second accompaniment audio and of the second vocal audio has been adjusted, the two can be mixed and synthesized. The loudness of the resulting synthesized work is close to or consistent with that of the original work, and its vocal-to-accompaniment ratio is likewise close to or consistent with the original, so the synthesized work better matches the ideal human listening experience and the user's expectations. That is, after adjusting the loudness of the second accompaniment audio and of the second vocal audio in the cover audio, the method further includes:
and synthesizing the adjusted second accompaniment audio and the second voice audio to obtain synthesized vocal flipping audio.
As described above, the audio processing method provided by the embodiment of the present invention obtains the original audio of the target audio resource and its lyric file, segments the original audio according to the lyric file to obtain the first accompaniment audio and the mixed audio containing both accompaniment and vocals, calculates the loudness of the first accompaniment audio and of the mixed audio, calculates the loudness of the first vocal audio from those two values, obtains the cover audio of the target audio resource, and adjusts the loudness of the second accompaniment audio and of the second vocal audio in the cover audio according to the loudness of the first accompaniment audio and of the first vocal audio, respectively. By measuring the loudness of the vocals and the accompaniment in the original audio and adjusting the cover work accordingly, the cover is brought closer to the original, improving the audio processing effect and the accuracy of loudness adjustment for cover songs.
The method described in the previous examples is described in further detail below.
Referring to fig. 1b, fig. 1b is a second flow chart of the audio processing method according to the embodiment of the invention. The method comprises the following steps:
201. Obtain the original audio of the target audio resource and the lyric file of the original audio, and extract the timestamp information corresponding to the lyrics in the lyric file.
In an embodiment, the target audio resource may be a song the karaoke user is singing or about to sing. Specifically, the original audio of the target audio resource and its lyric file may be requested from a server: a request can be sent according to an identifier of the target audio resource, and the original audio and the lyric file returned by the server in response to the request can then be received.
In an embodiment, the lyric file may be a QRC lyric file, which contains timestamp information for every word of the lyrics.
202. Segment the original audio according to the timestamp information to obtain a first accompaniment audio and a mixed audio containing both accompaniment and vocals.
For example, for a three-minute song, the lyric timestamps may show that vocals are present in the intervals 0:20-1:10 and 1:30-2:40. The original audio can then be divided into two parts. The first part consists of 0:00-0:20, 1:10-1:30, and 2:40 to the end of the audio at 3:00; it is the pure accompaniment containing no vocals, i.e., the first accompaniment audio, which can be denoted as segment A. The second part consists of 0:20-1:10 and 1:30-2:40 of the original audio, which contains both accompaniment and vocals, i.e., the mixed audio, which can be denoted as segment M.
203. Calculate the loudness of the first accompaniment audio and of the mixed audio, respectively.
In an embodiment of the application, the loudness L_A of the first accompaniment audio and the loudness L_M of the mixed audio may each be calculated using the EBU R128 algorithm, with the loudness level as the EBU loudness measure. Specifically, a sliding rectangular window can be used to cut a loudness block of duration T from the audio under test and apply K-weighting filtering; after filtering, the mean-square energy of each channel's samples is determined, and the channel mean-square values are then weighted, summed, and converted to a logarithmic value to obtain the loudness level.
204. Calculate the loudness of the first vocal audio according to the loudness of the first accompaniment audio and of the mixed audio.
In an embodiment, the loudness L_V of the dry vocal in the mixed audio may be calculated according to the sound superposition principle. Specifically, the loudness gain L_G contributed by the vocals in the mixed audio is first calculated from the loudness L_A of the first accompaniment audio and the loudness L_M of the mixed audio, and the loudness L_V of the dry vocal in the mixed audio — i.e., the loudness of the first vocal audio — is then calculated from L_G.
205. Extract the accompaniment part of the cover audio of the target audio resource, and determine the second accompaniment audio in that part according to the lyric file.
For example, for a three-minute song, the lyric timestamps may show that vocal audio is present in the intervals 0:20-1:10 and 1:30-2:40, so the pure-accompaniment intervals of the song are 0:00-0:20, 1:10-1:30, and 2:40 to the end of the audio at 3:00. After the accompaniment part of the cover audio is extracted, the portions of it corresponding to these intervals — 0:00-0:20, 1:10-1:30, and 2:40-3:00 — constitute the second accompaniment audio, which can be denoted as A'.
206. Extract the vocal part recorded by the user in the cover audio of the target audio resource, and determine the second vocal audio in that part according to the lyric file.
The vocal part recorded by the user in the cover audio may be the audio recorded through a microphone during the karaoke session, and the portions corresponding to the original vocals can be extracted from it according to the lyric file. For example, the portions of the dry vocal from 0:20 to 1:10 and from 1:30 to 2:40 are extracted; these constitute the second vocal audio, which can be denoted as V'.
207. Adjust the loudness of the second accompaniment audio and of the second vocal audio according to the loudness of the first accompaniment audio and of the first vocal audio, respectively.
In one embodiment, the loudness L_A' of the second accompaniment audio and the loudness L_V' of the second vocal audio may be calculated first. The loudness gain of the second accompaniment audio is then computed from the loudness L_A of the first accompaniment audio:

G_A = L_A - L_A'

and the loudness gain of the second vocal audio is computed from the loudness L_V of the first vocal audio:

G_V = L_V - L_V'
gain G of loudness of audio frequency through the second accompaniment A And loudness gain G of the second human voice audio V And adjusting the loudness of the second accompaniment audio and the second human voice audio. That is, adjusting the loudness of the second accompaniment audio and the loudness of the second human voice audio according to the loudness of the first accompaniment audio and the loudness of the first human voice audio respectively includes:
calculating the loudness gain of the second accompaniment audio from the loudness of the first and second accompaniment audio, and adjusting the loudness of the second accompaniment audio according to that gain;
and calculating the loudness gain of the second vocal audio from the loudness of the first and second vocal audio, and adjusting the loudness of the second vocal audio according to that gain.
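The two gain computations above reduce to simple LUFS differences; as a sketch (function name hypothetical), together with the linear amplitude factors that would realize the adjustment:

```python
def loudness_gains(l_a, l_v, l_a2, l_v2):
    """G_A = L_A - L_A' and G_V = L_V - L_V' (all values in LUFS).

    A positive gain means the cover track must be made louder to match
    the original. Returns the gains in dB/LU and the corresponding
    linear amplitude factors.
    """
    g_a = l_a - l_a2
    g_v = l_v - l_v2
    return g_a, g_v, 10.0 ** (g_a / 20.0), 10.0 ** (g_v / 20.0)
```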
Furthermore, after the loudness of the second accompaniment audio and the second human voice audio has been adjusted, the two can be mixed and synthesized. The loudness of the resulting synthesized work is close to or consistent with that of the original work, and its vocal-to-accompaniment ratio is likewise close to or consistent with that of the original work, so the synthesized work better matches ideal human auditory perception and better meets the user's expectations.
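The mixing-synthesis step can be sketched as follows; clamping to the full-scale range [-1, 1] is an assumed safeguard against clipping rather than something the patent specifies.

```python
def mix_tracks(accompaniment, vocal):
    # Sum the adjusted accompaniment and vocal sample-by-sample,
    # clamping to [-1.0, 1.0] so the synthesized work cannot clip.
    return [max(-1.0, min(1.0, a + v)) for a, v in zip(accompaniment, vocal)]
```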
208. Receive an operation instruction input by the user, and adjust the loudness of the second accompaniment audio and the second human voice audio again according to the operation instruction.
In an embodiment, the scheme further allows the user to adjust the vocal-to-accompaniment ratio, the volume, the sound effects, and the like of the resulting cover audio. As shown in fig. 2, fig. 2 is a scene schematic diagram of an audio processing method according to an embodiment of the present invention. For example, the user can adjust the vocal-to-accompaniment ratio of the cover audio by dragging a slider, where the slider moves along a line that visually represents the loudness ratio of the human voice audio to the accompaniment audio: one end of the line is the maximum proportion of accompaniment loudness, the other end is the maximum proportion of vocal loudness, and the middle defaults to the initial ratio of vocal loudness to accompaniment loudness, representing the intelligently recommended ratio of the two.
Furthermore, while the user adjusts the vocal-to-accompaniment ratio, amplifying the loudness of the human voice correspondingly reduces the loudness of the accompaniment audio, and reducing the loudness of the human voice correspondingly amplifies the loudness of the accompaniment audio.
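This coupled slider behaviour, where boosting the voice cuts the accompaniment by the same amount, can be sketched as follows; the 12 dB maximum swing is an assumed tuning constant, not a value from the patent.

```python
def slider_gains(position, max_swing_db=12.0):
    # position in [0.0, 1.0]; 0.5 is the recommended balance.
    # Moving toward 1.0 boosts the vocal and cuts the accompaniment
    # by the same number of dB, and vice versa.
    offset = (position - 0.5) * 2.0 * max_swing_db
    return offset, -offset  # (vocal_gain_db, accompaniment_gain_db)
```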
As described above, the audio processing method provided in the embodiment of the present invention may obtain the original singing audio of a target audio resource and the lyric file of the original singing audio, extract timestamp information corresponding to the lyrics in the lyric file, and segment the original singing audio according to the timestamp information to obtain a first accompaniment audio and a mixed audio including accompaniment and vocals. It may then calculate the loudness of the first accompaniment audio and the mixed audio respectively, calculate the loudness of the first human voice audio based on those loudness values, extract the accompaniment part of the cover audio of the target audio resource and determine a second accompaniment audio in it according to the lyric file, extract the dry-sound part recorded by the user in the cover audio and determine a second human voice audio in it according to the lyric file, adjust the loudness of the second accompaniment audio and the second human voice audio according to the loudness of the first accompaniment audio and the first human voice audio respectively, and finally receive an operation instruction input by the user and adjust the loudness of the second accompaniment audio and the second human voice audio again according to that instruction. By calculating the loudness of the voice and the accompaniment in the original singing audio and adjusting the voice and the accompaniment in the cover work accordingly, the cover work is brought closer to the original singing, improving the audio processing effect and the accuracy of loudness adjustment for cover music.
In order to implement the above method, an embodiment of the present invention further provides an audio processing apparatus, where the audio processing apparatus may be specifically integrated in a terminal device, such as a mobile phone, a tablet computer, and the like.
For example, as shown in fig. 3a, it is a schematic diagram of a first structure of an audio processing apparatus according to an embodiment of the present invention. The audio processing apparatus may include:
the segmentation unit 301 is configured to obtain an original audio of the target audio resource and segment the original audio to obtain a first accompaniment audio and a mixed audio, where the mixed audio includes an accompaniment and a voice.
In an embodiment, the target audio resource may be the song that the karaoke user is currently singing or about to sing. Specifically, the original singing audio of the target audio resource and the lyric file of the original singing audio may be requested from the server, for example by sending a request carrying the identifier of the target audio resource to the server, and then receiving the original singing audio and the lyric file returned by the server in response to the request.
In an embodiment, the lyric file may be a QRC lyric file. Since a QRC lyric file includes timestamp information for each word of the lyrics, the dividing unit 301 may segment the original singing audio using these timestamps.
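The timestamp-based segmentation can be sketched as follows; the span format and the per-sample scan are assumptions for illustration, not the patent's concrete implementation.

```python
def split_by_timestamps(samples, rate, vocal_spans):
    # vocal_spans: (start_sec, end_sec) intervals in which lyrics are sung.
    # Samples inside a span go to the mixed (accompaniment + voice) part;
    # the remaining samples form the pure-accompaniment part.
    mixed, accompaniment = [], []
    for i, s in enumerate(samples):
        t = i / rate
        if any(start <= t < end for start, end in vocal_spans):
            mixed.append(s)
        else:
            accompaniment.append(s)
    return mixed, accompaniment
```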
A first calculating unit 302, configured to calculate loudness of the first accompaniment audio and the mixed audio respectively.
In an embodiment of the application, the first calculating unit 302 may calculate the loudness of the first accompaniment audio and the mixed audio respectively using the EBU R128 measurement. EBU R128 specifies three loudness measures, namely the loudness level, the loudness range, and the true peak level. The loudness of the first accompaniment audio and the mixed audio may be calculated according to a mean amplitude or a maximum amplitude.
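Two of these measures can be illustrated with simplified stand-ins; a real EBU R128 meter adds K-weighting, gating, and inter-sample true-peak estimation, all omitted here, and the function names are assumptions.

```python
import math

def loudness_level_db(samples):
    # Simplified loudness level: mean power in dB, without the
    # K-weighting and gating that EBU R128 prescribes.
    power = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(power + 1e-12)

def peak_level_db(samples):
    # Sample-peak level in dB relative to full scale; a true-peak
    # measurement would also account for inter-sample peaks.
    return 20.0 * math.log10(max(abs(s) for s in samples) + 1e-12)
```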
A second calculating unit 303, configured to calculate the loudness of the first human voice audio based on the loudness of the first accompaniment audio and the mixed audio.
In an embodiment, the second calculating unit 303 may calculate the loudness of the dry voice in the mixed audio according to the sound superposition principle. Specifically, the loudness gain contributed by the human voice in the mixed audio can be calculated from the loudness of the first accompaniment audio and the loudness of the mixed audio, and the loudness of the dry voice in the mixed audio can then be derived from that gain.
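Under the sound-superposition assumption that signal powers add incoherently (P_mix = P_accompaniment + P_voice), the dry-voice loudness can be recovered as follows; this is a sketch, and the function name is an assumption.

```python
import math

def dry_voice_loudness_db(mix_db, accomp_db):
    # Convert dB loudness back to linear power, subtract the
    # accompaniment's share, and return the remainder in dB.
    p_voice = 10.0 ** (mix_db / 10.0) - 10.0 ** (accomp_db / 10.0)
    if p_voice <= 0.0:
        raise ValueError("mixed loudness must exceed accompaniment loudness")
    return 10.0 * math.log10(p_voice)
```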
The adjusting unit 304 is configured to acquire the cover audio of the target audio resource, and to adjust the loudness of the second accompaniment audio and the second human voice audio segmented from the cover audio according to the loudness of the first accompaniment audio and the first human voice audio, respectively.
In an embodiment, the cover audio of the target audio resource is the user's rendition of the song. After the cover audio is acquired, the second accompaniment audio and the second human voice audio in it may be extracted and their loudness calculated respectively. After calculating the loudness of the second accompaniment audio and the second human voice audio, the adjusting unit 304 may adjust their loudness in the cover audio according to the loudness of the first accompaniment audio and the first human voice audio respectively, for example adjusting the loudness of the second accompaniment audio according to the loudness of the first accompaniment audio and the loudness of the second human voice audio according to the loudness of the first human voice audio.
In an embodiment, referring to fig. 3b, the segmentation unit 301 may include:
an obtaining subunit 3011, configured to obtain timestamp information corresponding to a lyric in a lyric file of the original audio;
and a dividing subunit 3012, configured to divide the original singing audio according to the timestamp information to obtain a first accompaniment audio and a mixed audio, where the mixed audio includes an accompaniment and a voice.
In an embodiment, the adjusting unit 304 may include:
a first extracting subunit 3041, configured to extract an accompaniment part in the singing audio of the target audio resource, and determine a second accompaniment audio in the accompaniment part according to the lyric file;
a second extracting subunit 3042, configured to extract a dry-sound part recorded by the user in the cover audio of the target audio resource, and determine a second human voice audio in the dry-sound part according to the lyric file;
an adjusting subunit 3043, configured to adjust the loudness of the second accompaniment audio and the loudness of the second human voice audio according to the loudness of the first accompaniment audio and the loudness of the first human voice audio, respectively.
The audio processing apparatus provided in the embodiment of the present invention obtains, through the dividing unit 301, the original singing audio of a target audio resource and the lyric file of the original singing audio, and divides the original singing audio according to the lyric file to obtain a first accompaniment audio and a mixed audio including accompaniment and vocals. The first calculating unit 302 calculates the loudness of the first accompaniment audio and the mixed audio respectively, the second calculating unit 303 calculates the loudness of the first human voice audio based on those loudness values, and the adjusting unit 304 obtains the cover audio of the target audio resource and adjusts the loudness of the second accompaniment audio and the second human voice audio in the cover audio according to the loudness of the first accompaniment audio and the first human voice audio respectively. By calculating the loudness of the voice and the accompaniment in the original singing audio and adjusting the voice and the accompaniment in the cover work accordingly, the cover work is brought closer to the original singing, improving the audio processing effect and the accuracy of loudness adjustment for cover music.
An embodiment of the present invention further provides a terminal, as shown in fig. 4, the terminal may include a Radio Frequency (RF) circuit 601, a memory 602 including one or more computer-readable storage media, an input unit 603, a display unit 604, a sensor 605, an audio circuit 606, a Wireless Fidelity (WiFi) module 607, a processor 608 including one or more processing cores, and a power supply 609. Those skilled in the art will appreciate that the terminal configuration shown in fig. 4 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the RF circuit 601 may be used for receiving and transmitting signals during a message transmission or communication process, and in particular, for receiving downlink messages from a base station and then processing the received downlink messages by one or more processors 608; in addition, data relating to uplink is transmitted to the base station. In general, the RF circuit 601 includes, but is not limited to, an antenna, at least one Amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuitry 601 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), general Packet Radio Service (GPRS), code Division Multiple Access (CDMA), wideband Code Division Multiple Access (WCDMA), long Term Evolution (LTE), email, short Message Service (SMS), and the like.
The memory 602 may be used to store software programs and modules, and the processor 608 executes various functional applications and information processing by running the software programs and modules stored in the memory 602. The memory 602 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the terminal, etc. Further, the memory 602 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 602 may also include a memory controller to provide the processor 608 and the input unit 603 access to the memory 602.
The input unit 603 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. In particular, in one particular embodiment, input unit 603 may include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations by a user (e.g., operations by a user on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or attachment) thereon or nearby, and drive the corresponding connection device according to a predetermined program. Alternatively, the touch sensitive surface may comprise two parts, a touch detection means and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 608, and can receive and execute commands sent by the processor 608. In addition, the touch sensitive surface can be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 603 may include other input devices in addition to the touch-sensitive surface. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 604 may be used to display information input by or provided to the user and various graphical user interfaces of the terminal, which may be made up of graphics, text, icons, video, and any combination thereof. The display unit 604 may include a display panel, and optionally, the display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like. Further, the touch-sensitive surface may overlay the display panel, and when a touch operation is detected on or near the touch-sensitive surface, the touch operation is transmitted to the processor 608 to determine the type of touch event, and the processor 608 then provides a corresponding visual output on the display panel according to the type of touch event. Although in FIG. 4 the touch-sensitive surface and the display panel are implemented as two separate components for input and output functions, in some embodiments the touch-sensitive surface may be integrated with the display panel to implement input and output functions.
The terminal may also include at least one sensor 605, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel according to the brightness of ambient light, and a proximity sensor that may turn off the display panel and/or the backlight when the terminal is moved to the ear. As one of the motion sensors, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when the mobile phone is stationary, and can be used for applications of recognizing the posture of the mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured in the terminal, detailed description is omitted here.
WiFi is a short-range wireless transmission technology. Through the WiFi module 607, the terminal can help the user send and receive e-mail, browse web pages, access streaming media, and the like, providing the user with wireless broadband Internet access. Although fig. 4 shows the WiFi module 607, it is understood that it is not an essential part of the terminal and may be omitted as needed without changing the essence of the invention.
The processor 608 is a control center of the terminal, connects various parts of the entire handset using various interfaces and lines, and performs various functions of the terminal and processes data by operating or executing software programs and/or modules stored in the memory 602 and calling data stored in the memory 602, thereby performing overall monitoring of the handset. Optionally, processor 608 may include one or more processing cores; preferably, the processor 608 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 608.
The terminal also includes a power supply 609 (e.g., a battery) for powering the various components, which may preferably be logically coupled to the processor 608 via a power management system, such that functions such as managing charging, discharging, and power consumption are performed via the power management system. The power supply 609 may also include any component, such as one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
Although not shown, the terminal may further include a camera, a bluetooth module, and the like, which will not be described herein. Specifically, in this embodiment, the processor 608 in the terminal loads the executable file corresponding to the process of one or more application programs into the memory 602 according to the following instructions, and the processor 608 runs the application programs stored in the memory 602, thereby implementing various functions:
acquiring and segmenting original singing audio of a target audio resource to obtain first accompaniment audio and mixed audio, wherein the mixed audio comprises accompaniment and human voice;
calculating loudness of the first accompaniment audio and the mixed audio respectively;
calculating the loudness of a first human voice audio based on the loudness of the first accompaniment audio and the mixed audio;
and acquiring the singing audio of the target audio resource, and respectively adjusting the loudness of a second accompaniment audio and a second human voice audio which are divided in the singing audio according to the loudness of the first accompaniment audio and the first human voice audio.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and parts that are not described in detail in a certain embodiment may refer to the above detailed description of the audio processing method, and are not described herein again.
As can be seen from the above, the terminal according to the embodiment of the present invention may obtain the original singing audio of the target audio resource and the lyric file of the original singing audio, segment the original singing audio according to the lyric file to obtain the first accompaniment audio and the mixed audio including the accompaniment and the vocal, calculate the loudness of the first accompaniment audio and the mixed audio respectively, calculate the loudness of the first human voice audio based on those loudness values, obtain the cover audio of the target audio resource, and adjust the loudness of the second accompaniment audio and the second human voice audio segmented from the cover audio according to the loudness of the first accompaniment audio and the first human voice audio respectively. By calculating the loudness of the voice and the accompaniment in the original singing audio and adjusting the voice and the accompaniment in the cover work accordingly, the cover work is brought closer to the original singing, improving the audio processing effect and the accuracy of loudness adjustment for cover music.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, the present invention provides a storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any one of the audio processing methods provided by the embodiments of the present invention. For example, the instructions may perform the steps of:
acquiring and segmenting original singing audio of a target audio resource to obtain first accompaniment audio and mixed audio, wherein the mixed audio comprises accompaniment and human voice;
calculating loudness of the first accompaniment audio and the mixed audio respectively;
calculating the loudness of a first human voice audio based on the loudness of the first accompaniment audio and the mixed audio;
and acquiring the singing audio of the target audio resource, and respectively adjusting the loudness of a second accompaniment audio and a second human voice audio which are divided in the singing audio according to the loudness of the first accompaniment audio and the first human voice audio.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute the steps in any audio processing method provided in the embodiments of the present invention, the beneficial effects that can be achieved by any audio processing method provided in the embodiments of the present invention can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The audio processing method, apparatus, storage medium, and terminal provided in the embodiments of the present invention are described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help in understanding the method and its core idea. Meanwhile, those skilled in the art may, according to the idea of the present invention, make changes to the specific implementations and the application scope. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (12)
1. An audio processing method, comprising:
acquiring and segmenting original singing audio of a target audio resource to obtain first accompaniment audio and mixed audio, wherein the mixed audio comprises accompaniment and human voice;
calculating loudness of the first accompaniment audio and the mixed audio respectively;
calculating the loudness of a first human voice audio based on the loudness of the first accompaniment audio and the mixed audio;
and acquiring the singing audio of the target audio resource, and respectively adjusting the loudness of a second accompaniment audio and a second human voice audio which are divided in the singing audio according to the loudness of the first accompaniment audio and the first human voice audio.
2. The audio processing method of claim 1, wherein the step of obtaining and segmenting the original audio of the target audio resource comprises:
acquiring original singing audio of a target audio resource and a lyric file of the original singing audio;
and segmenting the original singing audio according to the lyric file.
3. The audio processing method of claim 2, wherein the segmenting the original song audio according to the lyric file to obtain a first accompaniment audio and a mixed audio including an accompaniment and a human voice comprises:
obtaining timestamp information corresponding to lyrics in the lyric file;
and segmenting the original singing audio according to the timestamp information to obtain a first accompaniment audio and a mixed audio comprising the accompaniment and the voice.
4. The audio processing method of claim 1, wherein the loudness of the first human voice audio is calculated based on a preset formula and the loudness of the first accompaniment audio and the mixed audio, wherein the preset formula is:

L_G = L_M - L_A

wherein: L_A is the loudness of the first accompaniment audio, L_M is the loudness of the mixed audio, L_G is the loudness gain corresponding to the first human voice audio in the mixed audio, and L_V is the loudness of the first human voice audio.
5. The audio processing method of claim 1, wherein obtaining the singing audio of the target audio resource, and adjusting the loudness of the second accompaniment audio and the second human voice audio, which are divided from the singing audio, according to the loudness of the first accompaniment audio and the first human voice audio, respectively, comprises:
extracting an accompaniment part in the singing audio of the target audio resource, and determining a second accompaniment audio in the accompaniment part;
extracting a dry-sound part recorded by a user in the singing audio of the target audio resource, and determining a second human voice audio in the dry-sound part;
and adjusting the loudness of the second accompaniment audio and the second human voice audio respectively according to the loudness of the first accompaniment audio and the first human voice audio.
6. The audio processing method of claim 5, wherein adjusting the loudness of the second accompaniment audio and the second human voice audio according to the loudness of the first accompaniment audio and the first human voice audio, respectively, comprises:
calculating the loudness gain of the second accompaniment audio according to the loudness of the first accompaniment audio and the second accompaniment audio, and adjusting the loudness of the second accompaniment audio according to the loudness gain of the second accompaniment audio;
and calculating the loudness gain of the second human voice audio according to the loudness of the first human voice audio and the second human voice audio, and adjusting the loudness of the second human voice audio according to the loudness gain of the second human voice audio.
7. The audio processing method of claim 1, wherein after adjusting the loudness of the second accompaniment audio and the second vocal audio, respectively, of the cover audio, the method further comprises:
and synthesizing the adjusted second accompaniment audio and the second human voice audio to obtain a synthesized cover audio.
8. The audio processing method of claim 1, wherein after adjusting the loudness of the second accompaniment audio and the second vocal audio, respectively, of the cover audio, the method further comprises:
and receiving an operation instruction input by a user, and adjusting the loudness of the second accompaniment audio and the second voice audio again according to the operation instruction.
9. An audio processing apparatus, comprising:
the dividing unit is used for acquiring and dividing original singing audio of the target audio resource to obtain first accompaniment audio and mixed audio, wherein the mixed audio comprises accompaniment and voice;
a first calculating unit for calculating loudness of the first accompaniment audio and the mixed audio respectively;
a second calculation unit configured to calculate the loudness of the first human voice audio based on the loudness of the first accompaniment audio and the mixed audio;
and the adjusting unit is used for acquiring the singing audio of the target audio resource and respectively adjusting the loudness of the second accompaniment audio and the second vocal audio which are divided in the singing audio according to the loudness of the first accompaniment audio and the first vocal audio.
10. The audio processing apparatus of claim 9, wherein the segmentation unit comprises:
the acquisition subunit is used for acquiring timestamp information corresponding to the lyrics in the lyric file of the original singing audio;
and the dividing subunit is used for dividing the original singing audio according to the timestamp information to obtain a first accompaniment audio and a mixed audio, and the mixed audio comprises an accompaniment and a voice.
11. The audio processing apparatus according to claim 9, wherein the adjusting unit includes:
the first extraction subunit is used for extracting an accompaniment part in the singing audio of the target audio resource and determining a second accompaniment audio in the accompaniment part;
the second extraction subunit is used for extracting a dry-sound part recorded by the user in the singing audio of the target audio resource and determining a second human voice audio in the dry-sound part;
and the adjusting subunit is used for respectively adjusting the loudness of the second accompaniment audio and the second human voice audio according to the loudness of the first accompaniment audio and the first human voice audio.
12. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the audio processing method of any of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910942105.1A CN110599989B (en) | 2019-09-30 | 2019-09-30 | Audio processing method, device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910942105.1A CN110599989B (en) | 2019-09-30 | 2019-09-30 | Audio processing method, device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110599989A CN110599989A (en) | 2019-12-20 |
CN110599989B true CN110599989B (en) | 2022-11-29 |
Family
ID=68865257
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910942105.1A Active CN110599989B (en) | 2019-09-30 | 2019-09-30 | Audio processing method, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110599989B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111192594B (en) * | 2020-01-10 | 2022-12-09 | 腾讯音乐娱乐科技(深圳)有限公司 | Method for separating voice and accompaniment and related product |
CN111785238B (en) * | 2020-06-24 | 2024-02-27 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio calibration method, device and storage medium |
CN111739496B (en) * | 2020-06-24 | 2023-06-23 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio processing method, device and storage medium |
CN112165648B (en) * | 2020-10-19 | 2022-02-01 | 腾讯科技(深圳)有限公司 | Audio playing method, related device, equipment and storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5869783A (en) * | 1997-06-25 | 1999-02-09 | Industrial Technology Research Institute | Method and apparatus for interactive music accompaniment |
CN106486128B (en) * | 2016-09-27 | 2021-10-22 | 腾讯科技(深圳)有限公司 | Method and device for processing double-sound-source audio data |
CN107705778B (en) * | 2017-08-23 | 2020-09-15 | 腾讯音乐娱乐(深圳)有限公司 | Audio processing method, device, storage medium and terminal |
- 2019-09-30: CN application CN201910942105.1A, patent CN110599989B, status Active
Also Published As
Publication number | Publication date |
---|---|
CN110599989A (en) | 2019-12-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110599989B (en) | Audio processing method, device and storage medium | |
CN107705778B (en) | Audio processing method, device, storage medium and terminal | |
CN105872253B (en) | Live broadcast sound processing method and mobile terminal | |
CN110675848B (en) | Audio processing method, device and storage medium | |
CN106782600B (en) | Scoring method and device for audio files | |
CN111883091B (en) | Audio noise reduction method and training method of audio noise reduction model | |
CN112863547B (en) | Virtual resource transfer processing method, device, storage medium and computer equipment | |
US10438572B2 (en) | Sound effect parameter adjustment method, mobile terminal and storage medium | |
US20230252964A1 (en) | Method and apparatus for determining volume adjustment ratio information, device, and storage medium | |
CN108470571B (en) | Audio detection method and device and storage medium | |
US10283168B2 (en) | Audio file re-recording method, device and storage medium | |
CN107863095A (en) | Acoustic signal processing method, device and storage medium | |
CN111785238B (en) | Audio calibration method, device and storage medium | |
CN106782613B (en) | Signal detection method and device | |
CN109872710B (en) | Sound effect modulation method, device and storage medium | |
CN106847307B (en) | Signal detection method and device | |
WO2018223837A1 (en) | Music playing method and related product | |
CN111083289B (en) | Audio playing method and device, storage medium and mobile terminal | |
CN112270913B (en) | Pitch adjusting method and device and computer storage medium | |
CN106328176B (en) | A kind of method and apparatus generating song audio | |
CN109243488B (en) | Audio detection method, device and storage medium | |
CN107526569B (en) | A kind of volume adjusting method, device, storage medium and mobile terminal | |
CN106599204A (en) | Method and device for recommending multimedia content | |
CN110660376B (en) | Audio processing method, device and storage medium | |
CN111739496B (en) | Audio processing method, device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |