WO2023217003A1 - Audio processing method, device, equipment and storage medium - Google Patents

Audio processing method, device, equipment and storage medium

Info

Publication number
WO2023217003A1
Authority: WIPO (PCT)
Prior art keywords: audio, accompaniment, control, response, interface
Application number: PCT/CN2023/092377
Other languages: English (en), French (fr)
Inventors: 袁帅 (Yuan Shuai), 孟文翰 (Meng Wenhan), 黄益修 (Huang Yixiu)
Original Assignees: 北京字跳网络技术有限公司 (Beijing Zitiao Network Technology Co., Ltd.), 北京字节跳动网络技术有限公司 (Beijing ByteDance Network Technology Co., Ltd.)
Application filed by 北京字跳网络技术有限公司 (Beijing Zitiao Network Technology Co., Ltd.) and 北京字节跳动网络技术有限公司 (Beijing ByteDance Network Technology Co., Ltd.)
Publication of WO2023217003A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 1/00 Details of electrophonic musical instruments
    • G10H 1/36 Accompaniment arrangements
    • G10H 1/361 Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
    • G10H 1/366 Recording/reproducing of accompaniment for use with an external source, with means for modifying or correcting the external signal, e.g. pitch correction, reverberation, changing a singer's voice
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 Changing voice quality, e.g. pitch or formants
    • G10L 21/007 Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L 21/01 Correction of time axis
    • G10L 21/04 Time compression or expansion
    • G10L 21/055 Time compression or expansion for synchronising with other signals, e.g. video signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 2430/00 Signal processing covered by H04R, not provided for in its groups

Definitions

  • the embodiments of the present disclosure relate to the field of human-computer interaction technology, and in particular, to an audio processing method, device, equipment and storage medium.
  • Audio editing is a common way to create media content.
  • Embodiments of the present disclosure provide an audio processing method, device, equipment and storage medium to improve audio processing efficiency and meet users' personalized needs for audio production.
  • an embodiment of the present disclosure provides an audio processing method, including: acquiring a human voice in response to a first instruction; acquiring an accompaniment in response to a second instruction; and, in response to a third instruction, mixing the human voice and the accompaniment to obtain target audio.
  • an audio processing device including:
  • an acquisition module configured to acquire the human voice in response to the first instruction
  • the acquisition module is further configured to acquire the accompaniment in response to the second instruction
  • a processing module configured to mix the vocal and the accompaniment in response to the third instruction to obtain target audio.
  • embodiments of the present disclosure provide an electronic device, including: a processor and a memory;
  • the memory stores computer-executable instructions;
  • the processor executes the computer-executable instructions stored in the memory, causing the processor to perform the audio processing method described in the first aspect and its various possible designs.
  • embodiments of the present disclosure provide a computer-readable storage medium.
  • Computer-executable instructions are stored in the computer-readable storage medium.
  • when the processor executes the computer-executable instructions, the audio processing method described in the first aspect and its various possible designs is implemented.
  • embodiments of the present disclosure provide a computer program product, including a computer program that, when executed by a processor, implements the audio processing method described in the first aspect and various possible designs of the first aspect.
  • embodiments of the present disclosure provide a computer program that, when executed by a processor, implements the audio processing method described in the first aspect and various possible designs of the first aspect.
  • Figure 1 is a schematic diagram of an application scenario of the audio processing method provided by an embodiment of the present disclosure;
  • Figure 2 is a first schematic flowchart of an audio processing method provided by an embodiment of the present disclosure;
  • Figure 3 is a second schematic flowchart of the audio processing method provided by an embodiment of the present disclosure;
  • Figure 4 is a first schematic diagram of the user interface provided by an embodiment of the present disclosure;
  • Figure 5 is a first schematic diagram of user interface changes provided by an embodiment of the present disclosure;
  • Figure 6a is a second schematic diagram of the user interface provided by an embodiment of the present disclosure;
  • Figure 6b is a third schematic diagram of the user interface provided by an embodiment of the present disclosure;
  • Figure 7a is a second schematic diagram of user interface changes provided by an embodiment of the present disclosure;
  • Figure 7b is a third schematic diagram of user interface changes provided by an embodiment of the present disclosure;
  • Figure 7c is a fourth schematic diagram of user interface changes provided by an embodiment of the present disclosure;
  • Figure 8 is a fifth schematic diagram of user interface changes provided by an embodiment of the present disclosure;
  • Figure 9 is a fourth schematic diagram of the user interface provided by an embodiment of the present disclosure;
  • Figure 10a is a sixth schematic diagram of user interface changes provided by an embodiment of the present disclosure;
  • Figure 10b is a seventh schematic diagram of user interface changes provided by an embodiment of the present disclosure;
  • Figure 11 is a structural block diagram of an audio processing device provided by an embodiment of the present disclosure;
  • Figure 12 is a structural block diagram of an electronic device provided by an embodiment of the present disclosure.
  • embodiments of the present disclosure propose an audio processing method that provides a visualized, intelligent audio processing flow, can automatically fuse the human voice and the accompaniment into the target audio, and allows audio editing to be performed directly after the intelligent processing.
  • In addition, the material can be packaged and output as a bundle, meeting the personalized needs of different users and improving the user's audio production experience.
  • FIG. 1 is a schematic diagram of an application scenario of the audio processing method provided by an embodiment of the present disclosure.
  • the application scenario provided by this embodiment includes a terminal device 101 and a server 102, and the terminal device 101 is communicatively connected with the server 102.
  • the terminal device 101 is preset with an audio processing application (APP), which provides the user with one or more of the following functions: a recording studio editing function, an accompaniment separation function, an audio mixing function, a style synthesis function, and an audio optimization function.
  • the user accesses the server 102 through the terminal device 101, for example, uploads two pieces of audio data through the terminal device 101.
  • the server 102 first performs sound source separation on the two pieces of audio data (separating human voices, musical instruments, etc.), for example obtaining the human voice from one piece of audio data and the accompaniment from the other; it then performs segment recognition on the human voice and the accompaniment to obtain their target segments (such as the climax segments); finally, it performs rhythm detection and rhythm alignment on the target segments and mixes them to generate the target audio.
  • the server 102 sends the target audio to the terminal device 101 so that the user can listen to, save, share the target audio, or perform post-processing on the target audio.
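  • The embodiments above leave the separation model unspecified. Purely as an illustration of the first stage of this pipeline (a vocal stem from one upload, an accompaniment stem from the other), an off-the-shelf two-stem separator such as the open-source Spleeter library could be used; the file names below are assumptions, not part of the disclosure:

```python
# Illustrative sketch only: the disclosure does not name a separation model.
# Spleeter's pretrained 2-stem configuration splits a track into "vocals"
# and "accompaniment", matching the two extraction steps described above.
from spleeter.separator import Separator

separator = Separator('spleeter:2stems')
# Vocal stem of the first upload -> out_a/upload_a/vocals.wav
separator.separate_to_file('upload_a.mp3', 'out_a')
# Accompaniment stem of the second upload -> out_b/upload_b/accompaniment.wav
separator.separate_to_file('upload_b.mp3', 'out_b')
```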
  • the terminal device in this embodiment can be any electronic device with information display function, including but not limited to smartphones, laptops, tablets, smart vehicle equipment, smart wearable devices, smart screens, etc.
  • the server in this embodiment can be an ordinary server or a cloud server.
  • the cloud server is also called a cloud computing server or a cloud host, and is a host product in the cloud computing service system.
  • the server can also be a distributed system server or a server combined with a blockchain.
  • the product implementation form of the present disclosure is program code included in platform software and deployed on electronic devices (which may also be computing-capable hardware such as cloud servers or mobile terminals).
  • the program code of the present disclosure may be stored inside an electronic device.
  • the program code runs in the electronic device's host memory and/or GPU memory.
  • Embodiments of the present disclosure provide an audio processing method, device, equipment and storage medium.
  • The method includes: in response to a first instruction, obtaining the human voice from one piece of audio uploaded by the user; in response to a second instruction, obtaining the accompaniment from another piece of audio uploaded by the user; and, in response to a third instruction, automatically mixing the human voice and the accompaniment from the two pieces of audio, thereby improving audio processing efficiency and meeting the user's personalized needs for audio production.
  • FIG. 2 is a schematic flowchart 1 of an audio processing method provided by an embodiment of the present disclosure. As shown in Figure 2, the method of this embodiment can be applied to terminal devices or servers.
  • the audio processing method includes:
  • Step 201 In response to the first instruction, obtain a human voice.
  • obtaining the human voice includes: in response to the first instruction, obtaining audio data containing only the human voice.
  • when a trigger operation by the user is detected (for example, a touch operation on an interface control, a voice input, or a press of a physical button), a first instruction is generated, and the human voice is acquired according to the first instruction.
  • the first instruction is not only used to instruct the acquisition of audio data containing human voice, but also used to trigger the extraction of the human voice part of the audio data.
  • Step 202 In response to the second instruction, obtain the accompaniment.
  • obtaining the accompaniment includes: in response to the second instruction, obtaining audio data containing only the accompaniment.
  • when a trigger operation by the user is detected (for example, a touch operation on an interface control, a voice input, or a press of a physical button), a second instruction is generated, and the accompaniment is acquired according to the second instruction.
  • the second instruction not only instructs to obtain the audio data containing the accompaniment, but is also used to trigger the extraction of the accompaniment part of the audio data.
  • Step 203 In response to the third instruction, mix the vocal and the accompaniment to obtain the target audio.
  • when a trigger operation by the user is detected, a third instruction is generated, and the human voice and the accompaniment are mixed according to the third instruction to obtain the target audio.
  • mixing the human voice and the accompaniment can also be described as mashing up (remixing) the human voice and the accompaniment.
  • In the method of this embodiment, the user uploads the two pieces of audio data to be mixed through interface touch or voice control; the human voice is extracted from one piece of audio data and the accompaniment from the other, realizing automatic mixing and matching of the human voice and accompaniment from the two pieces of audio, improving audio processing efficiency and meeting the user's personalized needs for audio production.
  • FIG 3 is a schematic flowchart 2 of an audio processing method provided by an embodiment of the present disclosure. As shown in Figure 3, the method of this embodiment can be applied to terminal devices or servers.
  • the audio processing method includes:
  • Step 301 In response to a touch operation on the first control on the first interface, import the first audio, and separate the human voice from the first audio.
  • Step 302 In response to the touch operation on the second control on the first interface, import the second audio, and separate the accompaniment from the second audio.
  • Step 303 In response to the touch operation on the third control on the first interface, mix the vocal and the accompaniment to obtain the target audio.
  • the first interface can also be described as an audio import interface.
  • the touch operations for the first control, the second control and the third control on the first interface include but are not limited to click operations.
  • FIG. 4 is a schematic diagram of a user interface provided by an embodiment of the present disclosure.
  • the first interface 400 includes: a first control 401 , a second control 402 , and a third control 403 .
  • the first control 401 is used to extract the human voice from the audio data
  • the second control 402 is used to extract the accompaniment from the audio data
  • the third control 403 is used to automatically mash (mix) the extracted human voice and accompaniment.
  • interface controls in this embodiment include but are not limited to icons, buttons, drop-down boxes, sliders, etc.
  • touch operations include but are not limited to click operations, long press operations, double click operations, sliding operations, etc.
  • Alternatively, a first voice input by the user is obtained, the first instruction is generated through speech recognition, the first audio is imported according to the first instruction, and the human voice is separated from the first audio.
  • Similarly, a second voice input by the user is obtained, the second instruction is generated through speech recognition, the second audio is imported according to the second instruction, and the accompaniment is separated from the second audio.
  • Likewise, a third voice input by the user is obtained, a third instruction is generated through speech recognition, and the human voice and the accompaniment are mixed to obtain the target audio.
  • the user can also trigger control through physical buttons of the device, such as the side buttons of a smartphone, to import the above-mentioned first audio or second audio, or to perform audio mixing.
  • FIG. 5 is a first schematic diagram of user interface changes provided by an embodiment of the present disclosure;
  • FIG. 6a is a second schematic diagram of the user interface provided by an embodiment of the present disclosure.
  • the user imports audio data by clicking the first control 401 of the first interface 400.
  • the user can choose to import audio data from a file or video album.
  • the user selects Audio 1 on the video album interface 404.
  • While the audio data is being imported, the human voice part is extracted from it; the obtained human voice is visually displayed in the first interface 400, such as the vocal track shown in Figure 5 or Figure 6a, and the user can audition the extracted human voice.
  • the user imports another audio data by clicking the second control 402 of the first interface 400. While importing the audio data, the accompaniment part of the audio data is extracted, the accompaniment is obtained, and the accompaniment is visualized in the first interface 400.
  • The extracted accompaniment is visually displayed, such as the accompaniment track shown in Figure 6a, and the user can audition it.
  • the user automatically mixes the extracted vocals and accompaniment by clicking the third control 403 of the first interface 400 to obtain the target audio.
  • For example, the user uploads a recorded playing-and-singing audio and at the same time uploads an existing finished musical work; the extracted human voice and accompaniment are mixed to obtain the target audio, which combines the user's vocals with the existing accompaniment.
  • the above audio processing process greatly facilitates users to create personalized music and meets the music creation needs of different users.
  • FIG. 6b is a third schematic diagram of the user interface provided by an embodiment of the present disclosure.
  • the interface shown in Figure 6b can be regarded as an optimized version of the interface shown in Figure 6a, including more functional controls.
  • the first interface 400 includes: a first playback control, a first delete control and a first replacement control associated with the human voice.
  • the first playback control is used to listen to the human voice
  • the first delete control is used to delete the vocal
  • the first replacement control is used to replace the vocal
  • a second playback control, a second delete control and a second replacement control associated with the accompaniment; the second playback control is used to audition the accompaniment
  • the second delete control is used to delete the accompaniment
  • the second replacement control is used to replace the accompaniment.
  • the first interface 400 also includes: a fourth control 405 and a fifth control 406.
  • the fourth control 405 is used to trigger customized processing of the human voice and/or the accompaniment, where the customized processing includes audio clipping of the human voice and/or the accompaniment.
  • the fifth control 406 is used to trigger audio editing of the human voice and/or the accompaniment (entering the recording studio for audio editing or processing), as described in detail below.
  • mixing the human voice and the accompaniment to obtain the target audio specifically includes: obtaining the vocal segment of the human voice and the accompaniment segment of the accompaniment; and mixing the vocal segment and the accompaniment segment to obtain the target audio. That is, when mixing the human voice and the accompaniment, mixable vocal and accompaniment segments can first be extracted from the human voice and the accompaniment respectively, and audio mixing is then performed on the vocal segment and the accompaniment segment to obtain the target audio.
  • the vocal clips and accompaniment clips can be obtained through the following implementation:
  • the vocal and the accompaniment are respectively input into the segment recognition model to obtain the vocal segment of the human voice and the accompaniment segment of the accompaniment.
  • the segment recognition model is used to identify target segments of audio.
  • For example, the human voice is input into the segment recognition model to obtain the target segment of the human voice.
  • the target segment can be a chorus segment, climax segment, or other segment of the audio.
  • the target segment is a repeated segment in a song.
  • the segment recognition model can be obtained by training a deep learning model.
  • This embodiment does not limit the structure of the deep learning model.
  • This implementation achieves intelligent extraction of vocal segments and accompaniment segments through a trained model, which can improve audio processing efficiency and accuracy.
  • the training process of the segment recognition model includes: obtaining a training data set.
  • the training data set includes multiple audio samples and annotation information of each audio sample.
  • the annotation information is used to indicate the target segment corresponding to the audio sample.
  • Multiple audio samples in the training data set are used as the input of the segment recognition model, and the annotation information of each audio sample in the training data set is used as the expected output of the segment recognition model.
  • the segment recognition model is trained until its loss function converges; training is then stopped and the model parameters of the trained segment recognition model are obtained.
  • the segment recognition model can analyze information such as the rhythm and loudness changes of the input audio, identify the intro, verse, chorus, interlude, bridge, outro, silence and other segments of the audio, and extract the most probable chorus, i.e. the climax. Specifically, the start and end timestamps of the different segments are extracted, subsequent trimming is performed, and the target audio segment is finally output.
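  • The disclosure trains a deep model for this step and does not provide its architecture. As a rough, assumption-laden stand-in that only conveys the idea of locating a climax by loudness, the heuristic below picks the loudest sustained window of a clip and returns its start and end timestamps:

```python
# Heuristic sketch, NOT the patented model: approximate the "climax" as
# the loudest sustained window, returning its start/end timestamps.
import librosa
import numpy as np

def rough_climax(path: str, window_s: float = 15.0) -> tuple[float, float]:
    y, sr = librosa.load(path, mono=True)
    hop = 512                                         # librosa's default hop length
    rms = librosa.feature.rms(y=y, hop_length=hop)[0] # frame-wise loudness
    win = int(window_s * sr / hop)                    # window length in frames
    if len(rms) <= win:                               # clip shorter than the window
        return 0.0, len(y) / sr
    csum = np.cumsum(rms)
    scores = csum[win:] - csum[:-win]                 # total loudness per window
    start = int(np.argmax(scores))
    t0 = librosa.frames_to_time(start, sr=sr, hop_length=hop)
    return t0, t0 + window_s
```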
  • Alternatively, the vocal track and the accompaniment track are displayed on the second interface; the vocal segment is obtained in response to a clipping operation on the vocal track, and the accompaniment segment is obtained in response to a clipping operation on the accompaniment track.
  • In this implementation, the user clips segments on the interface to obtain the target segments of the human voice and the accompaniment for subsequent mixing. This adds user-customized processing of the imported human voice and accompaniment, which can increase the user's participation in audio production and meet the audio production needs of different users.
  • In some embodiments, mixing the human voice and the accompaniment to obtain the target audio includes: obtaining a first rhythm of the human voice and a second rhythm of the accompaniment; rhythmically aligning the first rhythm with the second rhythm; and mixing the aligned human voice and accompaniment to obtain the target audio.
  • For example, taking the first rhythm of the human voice as a reference, the second rhythm of the accompaniment is adjusted so that the first rhythm of the human voice and the second rhythm of the accompaniment are consistent.
  • the first rhythm of the vocal is adjusted based on the second rhythm of the accompaniment, so that the first rhythm of the vocal is consistent with the second rhythm of the accompaniment.
  • Alternatively, mixing the human voice and the accompaniment to obtain the target audio includes: mixing the vocal segment of the human voice and the accompaniment segment of the accompaniment to obtain the target audio. Specifically, a first rhythm of the vocal segment and a second rhythm of the accompaniment segment are obtained, the two rhythms are rhythmically aligned, and the aligned vocal segment and accompaniment segment are mixed to obtain the target audio.
  • the first rhythm of the vocal segment is used as a reference to adjust the second rhythm of the accompaniment segment so that the first rhythm of the vocal segment and the second rhythm of the accompaniment segment are consistent.
  • the first rhythm of the vocal segment is adjusted based on the second rhythm of the accompaniment segment, so that the first rhythm of the vocal segment is consistent with the second rhythm of the accompaniment segment.
  • In general terms, the target audio is obtained by: obtaining a first rhythm of a third audio and a second rhythm of a fourth audio; rhythmically aligning the first rhythm of the third audio with the second rhythm of the fourth audio; and obtaining the target audio based on the aligned third audio and fourth audio. Specifically, based on the first rhythm of the third audio, the second rhythm of the fourth audio is adjusted so that the rhythms of the third audio and the fourth audio are consistent.
  • the third audio may be one of the human voice and the accompaniment, and correspondingly the fourth audio may be the other of the human voice and the accompaniment.
  • Alternatively, the third audio may be one of the vocal segment and the accompaniment segment, and the fourth audio may be the other of the vocal segment and the accompaniment segment.
  • Rhythm detection is used to detect the downbeat times within the beats and to infer the tempo of the entire audio or audio segment. Adjusting the audio rhythm includes stretching or compressing the audio in time. Typically, the rhythm of the vocal track is aligned to the accompaniment track, and the vocal track file is processed through audio stretching or compression.
  • the vocals and accompaniment in the mixed target audio are better integrated and the audio processing effect is improved.
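  • A minimal sketch of this alignment step, assuming librosa's beat tracker and time-stretcher as stand-ins for the unspecified rhythm-detection and stretching components: each clip's tempo is estimated, the vocal is stretched to the accompaniment's tempo, and the aligned clips are mixed.

```python
# Hedged sketch of rhythm alignment: estimate each clip's tempo, stretch
# the vocal to the accompaniment's tempo (vocal aligned to accompaniment,
# as described above), then mix the aligned clips.
import librosa

def align_and_mix(vocal_path: str, accomp_path: str):
    v, sr = librosa.load(vocal_path, sr=None, mono=True)
    a, _ = librosa.load(accomp_path, sr=sr, mono=True)
    tempo_v, _ = librosa.beat.beat_track(y=v, sr=sr)
    tempo_a, _ = librosa.beat.beat_track(y=a, sr=sr)
    # rate > 1 compresses (speeds up) the vocal, rate < 1 stretches it.
    v = librosa.effects.time_stretch(v, rate=float(tempo_a) / float(tempo_v))
    n = min(len(v), len(a))
    return 0.5 * (v[:n] + a[:n]), sr   # naive equal-gain mix of aligned clips
```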
  • In some embodiments, in response to the touch operation on the third control on the first interface, the interface jumps to a third interface; the third interface includes a third playback control, and the third playback control is used to trigger playback of the target audio.
  • the third interface is the audio mixing preview interface. The following is a graphical explanation of the user interface changes to obtain the target audio after the user imports two pieces of audio.
  • FIG. 7a is a second schematic diagram of user interface changes provided by an embodiment of the present disclosure.
  • the vocal and accompaniment are visually displayed in the first interface 400.
  • the user can directly click the third control 403 to automatically mix and match the human voice and the accompaniment, and the mixed target audio is visually displayed in the third interface 701.
  • the user can listen to, export, and share the mixed target audio on the third interface 701, or choose to play it again, or import it to the recording studio for further audio processing.
  • FIG. 7b is a third schematic diagram of user interface changes provided by an embodiment of the present disclosure.
  • the user can also click the fourth control 405 of the first interface 400 after uploading two pieces of audio data to trigger customized processing of the vocal and/or accompaniment, and the interface jumps to the second interface 700.
  • the user can perform audio editing on the human voice and/or the accompaniment on the second interface 700, such as clipping out the climax segment of the human voice and the climax segment of the accompaniment.
  • the user can also audition the clipped climax segment of the human voice or the accompaniment on the second interface 700.
  • After completing the audio editing, the user jumps to the third interface 701 by clicking the "automatic mix and match" control on the second interface 700.
  • FIG. 7c is a fourth schematic diagram of user interface changes provided by an embodiment of the present disclosure.
  • After uploading the two pieces of audio data, the user directly clicks the third control 403 of the first interface 400 to automatically mix and match the vocals and accompaniment. In the third interface 701 shown in Figure 7c, the user can audition, export and share the mixed target audio, choose to cancel, or import it into the recording studio for further audio processing; the cover of the target audio can also be set in the third interface 701 shown in Figure 7c.
  • the third interface 701 shown in Figure 7c can be regarded as an optimized version of the third interface 701 shown in Figure 7a.
  • In some embodiments, in response to a touch operation on the cover editing control on the third interface, a first window is displayed; in response to a control selection operation on the first window, a target cover is obtained.
  • the first window includes a cover import control, one or more preset static cover controls, and one or more preset animation effect controls.
  • the target cover is a static cover or a dynamic cover.
  • Optionally, obtaining the target cover in response to a control selection operation on the first window includes: obtaining a static cover and an animation effect in response to the control selection operation on the first window; and generating, based on the audio characteristics of the target audio, the static cover and the animation effect, a dynamic cover that changes with the audio characteristics of the target audio.
  • the audio characteristics include audio beat and/or volume.
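  • The disclosure does not specify how the audio characteristics drive the animation. One plausible sketch of the mapping: derive one normalized intensity value per video frame from the frame-wise volume (RMS), which a renderer could then use as the scale or opacity of the animation layer (function and parameter names are illustrative assumptions):

```python
# Illustrative mapping from audio volume to an animation envelope:
# one intensity value in [0, 1] per video frame of the dynamic cover.
import librosa
import numpy as np

def animation_envelope(audio_path: str, fps: int = 30) -> np.ndarray:
    y, sr = librosa.load(audio_path, mono=True)
    hop = sr // fps                        # one analysis frame per video frame
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    peak = rms.max()
    return rms / peak if peak > 0 else rms  # normalized per-frame intensity
```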
  • FIG. 8 is a fifth schematic diagram of user interface changes provided by an embodiment of the present disclosure.
  • the user clicks the cover editing control 705 on the third interface 701, and the first window 800 pops up at the bottom of the third interface 701.
  • the first window 800 includes a cover import control 801, a plurality of preset static covers, such as Cover 1 to Cover 3 in Figure 8, and multiple animation effects, such as Animation 1 to Animation 3 in Figure 8.
  • the user can import a custom picture from the local photo album by clicking the cover import control 801 and use it as the static cover, or directly select a preset static cover; the user can likewise directly select a preset animation or choose not to set one.
  • The user can then export the target audio and the generated target cover to the photo album or file system, share them to a designated application, or import them into the recording studio for further audio processing.
  • FIG. 9 is a fourth schematic diagram of the user interface provided by an embodiment of the present disclosure.
  • After the user sets the static cover and animation effect of the target audio as shown in Figure 8, the synthesized target cover can be previewed in the audio mixing preview interface.
  • the target cover includes a static cover and animation effects that change with the audio characteristics of the target audio.
  • the animation effect can be seen as an animation special-effects layer added underneath the static cover.
  • the animation effect can dynamically change at any position around the static cover.
  • This embodiment provides users with the function of setting an audio cover, enabling different users to edit the cover in a personalized manner, thereby improving the user's audio production experience.
  • data associated with the target audio is exported to the target location.
  • the target location includes a photo album or file system.
  • the user triggers the first selection window by clicking the export control 702 on the third interface 701, and on the first selection window the user can choose to export the data associated with the target audio to the photo album or the file system.
  • Alternatively, a fourth voice input by the user is obtained, an export instruction is generated through speech recognition, and the data associated with the target audio is exported to the target location according to the export instruction.
  • the data associated with the target audio is shared to the target application.
  • the user triggers the second selection window by clicking the sharing control 704 on the third interface 701, and on the second selection window the user can choose to share the data associated with the target audio to a target application, or to a specified user in the target application.
  • Alternatively, a fifth voice input by the user is obtained, a sharing instruction is generated through speech recognition, and the data associated with the target audio is shared to the target application, or to a specified user in the target application, according to the sharing instruction.
  • the data associated with the target audio includes at least one of the following: target audio, vocal, accompaniment, vocal segment of the vocal, accompaniment segment of the accompaniment, static cover of the target audio, and dynamic cover of the target audio.
  • the data exported or shared by the user may contain only the target audio, or may also contain all the intermediate data produced in the process of obtaining the target audio.
  • the data can be compressed first, and the compressed data can then be exported locally or shared with other users. If the shared data received by other users contains all the intermediate data produced in the process of obtaining the target audio, those users can not only play the target audio but also query or re-edit the intermediate data and generate new target audio, thereby enabling multi-person collaborative audio production, increasing interaction between users and improving the user experience.
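  • As a small sketch of this packaging idea, the target audio and the intermediate assets could be bundled into one compressed archive for export or sharing; the file names are assumptions for illustration only:

```python
# Sketch: bundle the target audio with intermediate assets so a recipient
# can re-edit them. File names are illustrative, not from the disclosure.
import zipfile

ASSETS = ["target.mp3", "vocal.wav", "accompaniment.wav",
          "vocal_clip.wav", "accompaniment_clip.wav", "cover.png"]

with zipfile.ZipFile("shared_package.zip", "w",
                     compression=zipfile.ZIP_DEFLATED) as z:
    for name in ASSETS:
        z.write(name)  # compress each associated asset into the package
```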
  • In some embodiments, the interface jumps from the third interface to a fourth interface, and the fourth interface includes audio processing function controls.
  • the fourth interface is an interface for audio post-processing, which can also be described as a recording studio interface; the user can perform audio post-processing on the vocals and accompaniment of the target audio on the fourth interface.
  • Alternatively, the fourth interface includes a trigger control associated with the audio processing function controls, and the trigger control is used to trigger display of the audio processing function controls.
  • audio processing function controls include one or more of the following:
  • Audio optimization control used to trigger editing of audio to optimize audio
  • Accompaniment separation controls for triggering the separation of vocals and/or accompaniment from the audio
  • Style synthesis controls for triggering the separation of vocals from the audio, and the mixing and editing of the separated vocals with preset accompaniments.
  • Audio mashup controls that trigger the separation of the vocal from the first audio, the accompaniment from the second audio, and the mixing and editing of the separated vocal with the separated accompaniment.
  • audio optimization includes optimizing the vocals and/or accompaniment of the user's playing-and-singing audio; that is, audio optimization includes playing-and-singing optimization presets such as male vocal with guitar, female vocal with guitar, male vocal with piano, and female vocal with piano.
  • accompaniment separation includes separation processing such as vocal removal and instrument removal.
  • style synthesis includes style optimization such as car music, classic pop, heart-warming moments, relaxing moments, childhood memories, etc.
  • audio mashup includes optimization processing such as rhythm alignment and pitch transposition.
  • Figure 10a is a sixth schematic diagram of user interface changes provided by an embodiment of the present disclosure.
  • the user clicks the audio editing control 703 on the third interface 701, and the interface jumps to the fourth interface 1000.
  • the fourth interface 1000 directly displays the target audio after mixing the vocal and accompaniment in track 1, and displays a plurality of optional audio processing controls in the audio processing window 1004 of the fourth interface 1000 .
  • Figure 10b is a seventh schematic diagram of user interface changes provided by an embodiment of the present disclosure.
  • the user clicks the audio editing control 703 on the third interface 701, and the interface jumps to the fourth interface 1000.
  • the user can perform audio post-processing on the vocals and accompaniment in the target audio on the fourth interface 1000.
  • track 1 of the fourth interface 1000 corresponds to the vocal
  • track 2 of the fourth interface 1000 corresponds to the accompaniment.
  • the user can also enter the fifth interface 1002 by clicking the interface switching control 1001 on the fourth interface 1000, or by sliding left or right.
  • the fifth interface 1002 includes a trigger control 1003 associated with the audio processing function control.
  • the trigger control 1003 is used to trigger the display of audio processing function controls.
  • Audio processing controls include multiple selectable controls shown in the audio processing window 1004 of Figure 10b.
  • the user can add effects, perform further audio processing, adjust the volume, etc. for the vocals of track 1 and the accompaniment of track 2 respectively on the fifth interface 1002.
  • the user can also adjust the overall volume of the vocals and accompaniment on the fifth interface 1002.
  • the effects include reverb, equalization, electronic music, phase shifter, flanger, filter, etc.
  • In summary, the user imports the first audio by touching the first control on the first interface, and the human voice is separated from the first audio; the user then imports the second audio by touching the second control on the first interface, and the accompaniment is separated from the second audio; finally, by touching the third control on the first interface, the human voice and the accompaniment are mixed to obtain the target audio.
  • the above process realizes the automatic mixing and matching of vocals and accompaniment in two pieces of audio, improves the audio processing effect, and meets the user's personalized needs for audio production.
  • FIG. 11 is a structural block diagram of an audio processing device provided by an embodiment of the present disclosure.
  • the audio processing device 1100 provided in this embodiment includes: an acquisition module 1101 and a processing module 1102.
  • Acquisition module 1101 configured to acquire human voices in response to the first instruction
  • the acquisition module 1101 is also configured to acquire the accompaniment in response to the second instruction
  • the processing module 1102 is configured to mix the vocal and the accompaniment in response to the third instruction to obtain target audio.
  • the acquisition module 1101 is configured to import the first audio in response to a touch operation on the first control on the first interface, and separate the human voice from the first audio;
  • the acquisition module 1101 is also configured to import the second audio in response to the touch operation on the second control on the first interface, and separate the accompaniment from the second audio.
  • the processing module 1102 is configured to mix the human voice and the accompaniment in response to a touch operation on the third control on the first interface, to obtain the target audio.
  • Optionally, the processing module 1102 is configured to: obtain the vocal segment of the human voice and the accompaniment segment of the accompaniment; and mix the vocal segment and the accompaniment segment to obtain the target audio.
  • Optionally, the processing module 1102 is configured to input the human voice and the accompaniment respectively into the segment recognition model to obtain the vocal segment of the human voice and the accompaniment segment of the accompaniment; the segment recognition model is used to identify target segments of audio.
  • the audio processing device 1100 further includes: a display module 1103;
  • the display module 1103 is configured to display the vocal track and the accompaniment track on the second interface in response to a touch operation on the fourth control on the first interface;
  • the acquisition module 1101 is configured to acquire the vocal segment in response to the editing operation of the vocal track; and in response to the editing operation of the accompaniment track, acquire the accompaniment segment.
  • Optionally, the processing module 1102 is configured to: obtain a first rhythm of a third audio and a second rhythm of a fourth audio; rhythmically align the first rhythm of the third audio with the second rhythm of the fourth audio; and obtain the target audio based on the aligned third audio and fourth audio;
  • wherein the third audio is one of the human voice and the accompaniment, and the fourth audio is the other of the human voice and the accompaniment;
  • or the third audio is one of the vocal segment and the accompaniment segment, and the fourth audio is the other of the vocal segment and the accompaniment segment.
  • the processing module 1102 is configured to adjust the second rhythm of the fourth audio based on the first rhythm of the third audio, so that the The third audio has the same rhythm as the fourth audio.
  • the first interface includes:
  • a first playback control, a first deletion control and a first replacement control associated with the human voice; the first playback control is used to audition the human voice, the first deletion control is used to delete the human voice, and the first replacement control is used to replace the human voice;
  • a second playback control, a second deletion control and a second replacement control associated with the accompaniment; the second playback control is used to audition the accompaniment, the second deletion control is used to delete the accompaniment, and the second replacement control is used to replace the accompaniment.
  • the processing module 1102 is configured to jump to a third interface in response to a touch operation on the third control on the first interface, where the third interface includes A third playback control, the third playback control is used to trigger playback of the target audio.
  • the display module 1103 is configured to display a first window in response to a touch operation on the cover editing control on the third interface, where the first window includes a cover import control, one or more preset static cover controls and one or more preset animation effect controls;
  • the acquisition module 1101 is configured to acquire a target cover in response to a control selection operation on the first window; the target cover is a static cover or a dynamic cover.
  • the acquisition module 1101 is configured to acquire static cover and animation effects in response to a control selection operation on the first window;
  • the processing module 1102 is configured to generate a dynamic cover that changes with the audio characteristics of the target audio according to the audio characteristics of the target audio, the static cover, and the animation effect;
  • the audio characteristics include audio beat and/or volume.
  • the processing module 1102 is configured to export data associated with the target audio to a target location in response to an export instruction on the third interface; the target Locations include photo albums or file systems.
  • the processing module 1102 is configured to share data associated with the target audio to a target application in response to a sharing instruction on the third interface.
  • the data associated with the target audio includes at least one of the following:
  • the target audio, the human voice, the accompaniment, the vocal segment of the human voice, the accompaniment segment of the accompaniment, the static cover of the target audio, and the dynamic cover of the target audio.
  • the processing module 1102 is configured to jump from the third interface to the fourth interface in response to a touch operation on the audio editing control on the third interface,
  • the fourth interface includes an audio processing function control or a trigger control associated with the audio processing function control, the trigger control being used to trigger display of the audio processing function control;
  • the audio processing function controls include one or more of the following:
  • Audio optimization controls for triggering editing of audio to optimize said audio
  • Accompaniment separation controls for triggering the separation of vocals and/or accompaniment from the audio
  • Audio mashup controls that trigger the separation of the vocal from the first audio, the accompaniment from the second audio, and the mixing and editing of the separated vocal with the separated accompaniment.
  • the audio processing device provided in this embodiment can be used to execute the technical solutions of the above method embodiments. Its implementation principles and technical effects are similar, and will not be described again in this embodiment.
  • FIG 12 is a structural block diagram of an electronic device provided by an embodiment of the present disclosure.
  • the electronic device 1200 may be a terminal device or a server.
  • the terminal devices may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (PADs), portable multimedia players (PMPs) and vehicle-mounted terminals (such as vehicle-mounted navigation terminals), as well as fixed terminals such as digital TVs and desktop computers.
  • the electronic device shown is only an example and should not bring any limitations to the functions and scope of use of the embodiments of the present disclosure.
  • the electronic device 1200 may include a processing device (such as a central processing unit or a graphics processor) 1201, which may perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 1202 or a program loaded from a storage device 1208 into a random access memory (RAM) 1203. The RAM 1203 also stores various programs and data required for the operation of the electronic device 1200.
  • the processing device 1201, ROM 1202 and RAM 1203 are connected to each other via a bus 1204.
  • An input/output (I/O for short) interface 1205 is also connected to bus 1204.
  • the following devices can be connected to the I/O interface 1205: input devices 1206 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer and a gyroscope; output devices 1207 including, for example, a liquid crystal display (LCD), a speaker and a vibrator; storage devices 1208 including, for example, a magnetic tape and a hard disk; and a communication device 1209.
  • the communication device 1209 may allow the electronic device 1200 to communicate wirelessly or wiredly with other devices to exchange data.
  • While FIG. 12 illustrates the electronic device 1200 with various means, it should be understood that implementing or providing all of the illustrated means is not required; more or fewer means may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product including a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via communication device 1209, or from storage device 1208, or from ROM 1202.
  • When the computer program is executed by the processing device 1201, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination thereof.
  • Computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: wire, optical cable, RF (radio frequency), etc., or any suitable combination of the above.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; it may also exist independently without being assembled into the electronic device.
  • the computer-readable medium carries one or more programs.
  • When the one or more programs are executed by the electronic device, the electronic device performs the methods shown in the above embodiments.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, via the Internet using an Internet service provider).
  • each block in the flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown one after another may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by a combination of special-purpose hardware and computer instructions.
  • the units involved in the embodiments of the present disclosure can be implemented in software or hardware.
  • the name of the unit does not constitute a limitation on the unit itself under certain circumstances.
  • the first acquisition unit can also be described as "the unit that acquires at least two Internet Protocol addresses.”
  • exemplary types of hardware logic components that can be used include: field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and the like.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, devices or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • an audio processing method, including: acquiring a human voice in response to a first instruction; acquiring an accompaniment in response to a second instruction; and, in response to a third instruction, mixing the human voice and the accompaniment to obtain target audio.
  • obtaining the human voice in response to the first instruction includes: importing a first audio in response to a touch operation on a first control on a first interface, and separating the human voice from the first audio.
  • obtaining the accompaniment in response to the second instruction includes: importing a second audio in response to a touch operation on a second control on the first interface, and separating the accompaniment from the second audio.
  • mixing the human voice and the accompaniment in response to the third instruction to obtain the target audio includes: in response to a touch operation on a third control on the first interface, mixing the human voice and the accompaniment to obtain the target audio.
  • mixing the human voice and the accompaniment to obtain the target audio includes: obtaining a vocal segment of the human voice and an accompaniment segment of the accompaniment; and mixing the vocal segment and the accompaniment segment to obtain the target audio.
  • obtaining the vocal segment of the human voice and the accompaniment segment of the accompaniment includes: inputting the human voice and the accompaniment respectively into a segment recognition model to obtain the vocal segment of the human voice and the accompaniment segment of the accompaniment; the segment recognition model is used to identify target segments of audio.
  • Alternatively, obtaining the vocal segment of the human voice and the accompaniment segment of the accompaniment includes: obtaining the vocal segment in response to a clipping operation on the audio track of the human voice; and obtaining the accompaniment segment in response to a clipping operation on the audio track of the accompaniment.
  • mixing the human voice and the accompaniment to obtain the target audio includes: obtaining a first rhythm of a third audio and a second rhythm of a fourth audio; rhythmically aligning the first rhythm of the third audio with the second rhythm of the fourth audio; and obtaining the target audio based on the aligned third audio and fourth audio; wherein the third audio is one of the human voice and the accompaniment and the fourth audio is the other of the human voice and the accompaniment, or the third audio is one of the vocal segment and the accompaniment segment and the fourth audio is the other of the vocal segment and the accompaniment segment.
  • rhythmically aligning the first rhythm of the third audio and the second rhythm of the fourth audio includes:
  • based on the first rhythm of the third audio, the second rhythm of the fourth audio is adjusted so that the rhythm of the third audio is consistent with that of the fourth audio.
  • the first interface includes:
  • a first playback control, a first deletion control and a first replacement control associated with the human voice; the first playback control is used to audition the human voice, the first deletion control is used to delete the human voice, and the first replacement control is used to replace the human voice;
  • a second playback control, a second deletion control and a second replacement control associated with the accompaniment; the second playback control is used to audition the accompaniment, the second deletion control is used to delete the accompaniment, and the second replacement control is used to replace the accompaniment.
  • jumping to the third interface includes a third playback control, and the third interface Three playback controls are used to trigger playback of the target audio.
  • a first window is displayed, the first window includes a cover import control, one or more presets static cover control and one or more preset animation effect controls;
  • the target cover is a static cover or a dynamic cover.
  • obtaining the target cover in response to a control selection operation on the first window includes:
  • the static cover and the animation effect generate a dynamic cover that changes with the audio characteristics of the target audio
  • the audio characteristics include audio beat and/or volume.
  • data associated with the target audio is exported to a target location; the target location includes a photo album or a file system.
  • data associated with the target audio is shared to a target application in response to a sharing instruction on the third interface.
  • data associated with the target audio includes at least one of the following:
  • the target audio the human voice, the accompaniment, the vocal segment of the human voice, the accompaniment segment of the accompaniment, the static cover of the target audio, and the dynamic cover of the target audio.
  • the fourth interface in response to a touch operation on the audio editing control on the third interface, jumping from the third interface to a fourth interface, the fourth interface includes audio processing Function controls or trigger controls associated with the audio processing function controls, the trigger controls being used to trigger display of the audio processing function controls;
  • the audio processing function controls include one or more of the following:
  • Audio optimization controls for triggering editing of audio to optimize said audio
  • Style detachment controls for triggering the separation of vocals and/or accompaniment from the audio
  • Audio mashup controls that trigger the separation of the vocal from the first audio, the accompaniment from the second audio, and the mixing and editing of the separated vocal with the separated accompaniment.
  • an audio processing device including:
  • an acquisition module configured to acquire the human voice in response to the first instruction
  • the acquisition module is also used to acquire the accompaniment in response to the second instruction
  • a processing module configured to mix the vocal and the accompaniment in response to the third instruction to obtain target audio.
  • the acquisition module is configured to import the first audio in response to a touch operation on the first control on the first interface, and separate the first audio from the first audio. describe the human voice;
  • the acquisition module is further configured to import the second audio in response to a touch operation on the second control on the first interface, and separate the accompaniment from the second audio.
  • the processing module is configured to mix the vocal and the accompaniment in response to a touch operation on a third control on the first interface to obtain the target audio.
  • the processing module is used to:
  • the vocal segment and the accompaniment segment are mixed to obtain the target audio.
  • the processing module is used to:
  • paragraph recognition model is used to identify target segments of audio.
  • the audio processing device further includes: a display module
  • the display module is configured to display the vocal track and the accompaniment track on the second interface in response to a touch operation on the fourth control on the first interface;
  • the acquisition module is configured to acquire the vocal segment in response to an editing operation on the audio track of the human voice; and acquire the accompaniment segment in response to an editing operation on the audio track of the accompaniment.
  • the processing module is used to:
  • the third audio is one audio of the human voice and the accompaniment
  • the fourth audio is the other audio of the human voice and the accompaniment
  • the third audio is the audio of the human voice and the accompaniment
  • One audio of the vocal segment and the accompaniment segment, and the fourth audio is the other audio of the vocal segment and the accompaniment segment.
  • the processing module is configured to adjust the second rhythm of the fourth audio based on the first rhythm of the third audio, so that the The third audio has the same rhythm as the fourth audio.
  • the first interface includes:
  • a first playback control, a first delete control and a first replacement control associated with the human voice the first playback control is used to listen to the human voice, the first delete control is used to delete the human voice, The first replacement control is used to replace the human voice;
  • a second playback control, a second deletion control and a second replacement control associated with the accompaniment is used to listen to the accompaniment, the second deletion control is used to delete the accompaniment, and the second deletion control is used to delete the accompaniment.
  • Two replacement controls are used to replace the accompaniment.
  • the processing module is configured to jump to a third interface in response to a touch operation on the third control on the first interface, and the third interface includes a third interface.
  • the third playback control is used to trigger playback of the target audio.
  • the display module is configured to display a first window in response to a touch operation on the cover editing control on the third interface, the first window including a cover import control , one or more preset static cover controls and one or more preset animation effect controls;
  • the acquisition module is configured to acquire a target cover in response to a control selection operation on the first window; the
  • the target cover is a static cover or a dynamic cover.
  • the acquisition module is configured to acquire static cover and animation effects in response to a control selection operation on the first window
  • the processing module is configured to generate a dynamic cover that changes with the audio characteristics of the target audio according to the audio characteristics of the target audio, the static cover, and the animation effect;
  • the audio characteristics include audio beat and/or volume.
  • the processing module is configured to export data associated with the target audio to a target location in response to an export instruction on the third interface; the target location Including photo album or file system.
  • the processing module is configured to share data associated with the target audio to a target application in response to a sharing instruction on the third interface.
  • data associated with the target audio includes at least one of the following:
  • the target audio the human voice, the accompaniment, the vocal segment of the human voice, the accompaniment segment of the accompaniment, the static cover of the target audio, and the dynamic cover of the target audio.
  • the processing module is configured to jump from the third interface to the fourth interface in response to a touch operation on the audio editing control on the third interface, so
  • the fourth interface includes an audio processing function control or a trigger control associated with the audio processing function control, the trigger control being used to trigger display of the audio processing function control;
  • the audio processing function controls include one or more of the following:
  • Audio optimization controls for triggering editing of audio to optimize said audio
  • Style detachment controls for triggering the separation of vocals and/or accompaniment from the audio
  • Audio mashup controls that trigger the separation of the vocal from the first audio, the accompaniment from the second audio, and the mixing and editing of the separated vocal with the separated accompaniment.
  • an electronic device including: at least one processor and a memory;
  • the memory stores computer execution instructions
  • the at least one processor executes the computer execution instructions stored in the memory, so that the at least one processor executes the audio processing method described in the above first aspect and various possible designs of the first aspect.
  • a computer-readable storage medium is provided.
  • Computer-executable instructions are stored in the computer-readable storage medium.
  • a processor executes the computer-executed instructions, Implement the audio processing method described in the above first aspect and various possible designs of the first aspect.
  • a computer program product including a computer program that, when executed by a processor, implements the above first aspect and various possible designs of the first aspect.
  • the audio processing method is provided, including a computer program that, when executed by a processor, implements the above first aspect and various possible designs of the first aspect.
  • embodiments of the present disclosure provide a computer program that, when executed by a processor, implements the audio processing method described in the first aspect and various possible designs of the first aspect.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Electrophonic Musical Instruments (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

An audio processing method, apparatus, device, and storage medium. The method includes: in response to a first instruction, acquiring the vocals from one piece of audio uploaded by a user (201); in response to a second instruction, acquiring the accompaniment from another piece of audio uploaded by the user (202); and in response to a third instruction, mixing the vocals and the accompaniment to obtain target audio (203).

Description

Audio processing method, apparatus, device, and storage medium
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202210495456.4, filed with the China National Intellectual Property Administration on May 7, 2022 and entitled "Audio processing method, apparatus, device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical field
Embodiments of the present disclosure relate to the field of human-computer interaction, and in particular to an audio processing method, apparatus, device, and storage medium.
Background
With the continuous growth of media content and the rapid development of computer technology, users who consume media data increasingly want to interact with media and create media content in a personalized style. Audio editing is a common way of creating media content.
Existing audio editing functions are limited and cannot satisfy users' needs to process and create works based on different pieces of audio.
Summary
Embodiments of the present disclosure provide an audio processing method, apparatus, device, and storage medium, which improve audio processing efficiency and satisfy users' personalized audio production needs.
In a first aspect, an embodiment of the present disclosure provides an audio processing method, including:
in response to a first instruction, acquiring vocals;
in response to a second instruction, acquiring an accompaniment; and
in response to a third instruction, mixing the vocals and the accompaniment to obtain target audio.
In a second aspect, an embodiment of the present disclosure provides an audio processing apparatus, including:
an acquisition module configured to acquire vocals in response to a first instruction,
the acquisition module being further configured to acquire an accompaniment in response to a second instruction; and
a processing module configured to mix the vocals and the accompaniment in response to a third instruction to obtain target audio.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including a processor and a memory;
the memory stores computer-executable instructions; and
the processor executes the computer-executable instructions stored in the memory, causing the processor to perform the audio processing method described in the first aspect and its various possible designs.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the audio processing method described in the first aspect and its various possible designs.
In a fifth aspect, an embodiment of the present disclosure provides a computer program product including a computer program which, when executed by a processor, implements the audio processing method described in the first aspect and its various possible designs.
In a sixth aspect, an embodiment of the present disclosure provides a computer program which, when executed by a processor, implements the audio processing method described in the first aspect and its various possible designs.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present disclosure or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description illustrate some embodiments of the present disclosure; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application scenario of the audio processing method provided by an embodiment of the present disclosure;
FIG. 2 is a first schematic flowchart of the audio processing method provided by an embodiment of the present disclosure;
FIG. 3 is a second schematic flowchart of the audio processing method provided by an embodiment of the present disclosure;
FIG. 4 is a first schematic diagram of a user interface provided by an embodiment of the present disclosure;
FIG. 5 is a first schematic diagram of user interface changes provided by an embodiment of the present disclosure;
FIG. 6a is a second schematic diagram of a user interface provided by an embodiment of the present disclosure;
FIG. 6b is a third schematic diagram of a user interface provided by an embodiment of the present disclosure;
FIG. 7a is a second schematic diagram of user interface changes provided by an embodiment of the present disclosure;
FIG. 7b is a third schematic diagram of user interface changes provided by an embodiment of the present disclosure;
FIG. 7c is a fourth schematic diagram of user interface changes provided by an embodiment of the present disclosure;
FIG. 8 is a fifth schematic diagram of user interface changes provided by an embodiment of the present disclosure;
FIG. 9 is a fourth schematic diagram of a user interface provided by an embodiment of the present disclosure;
FIG. 10a is a sixth schematic diagram of user interface changes provided by an embodiment of the present disclosure;
FIG. 10b is a seventh schematic diagram of user interface changes provided by an embodiment of the present disclosure;
FIG. 11 is a structural block diagram of the audio processing apparatus provided by an embodiment of the present disclosure;
FIG. 12 is a structural block diagram of the electronic device provided by an embodiment of the present disclosure.
Detailed description of embodiments
To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some rather than all of the embodiments of the present disclosure. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
Aiming at, but not limited to, one or more of the above problems, embodiments of the present disclosure propose an audio processing method that provides a visualized, intelligent, end-to-end audio processing flow. The method can automatically fuse the vocals and the accompaniment in the target audio, allows audio editing directly after the intelligent processing, and can package and export the material bundle together with the audio processing file, satisfying the personalized needs of different users and improving the audio production experience.
To facilitate understanding of the technical solutions provided by the present disclosure, the application scenario of the audio processing method is briefly introduced first.
FIG. 1 is a schematic diagram of an application scenario of the audio processing method provided by an embodiment of the present disclosure. As shown in FIG. 1, the application scenario of this embodiment includes a terminal device 101 and a server 102 that are communicatively connected. The terminal device 101 is preloaded with an audio processing application (APP) that provides the user with one or more of the following functions: a recording-studio editing function, an accompaniment separation function, an audio mashup function, a style synthesis function, and an audio optimization function.
As an example, the user accesses the server 102 through the terminal device 101, for example by uploading two pieces of audio data. The server 102 first performs sound-source separation (separating vocals, instruments, and so on) on each piece of audio, for example acquiring the vocals from one piece and the accompaniment from the other; it then performs segment recognition on the vocals and the accompaniment to obtain their target segments (such as chorus segments); finally, it performs tempo detection and tempo alignment on the target segments of the vocals and the accompaniment and generates the mixed target audio. The server 102 sends the target audio to the terminal device 101 so that the user can audition, save, or share it, or post-process it.
The terminal device in this embodiment may be any electronic device with an information display function, including but not limited to a smartphone, a laptop, a tablet, a smart in-vehicle device, a smart wearable device, or a smart screen.
The server in this embodiment may be an ordinary server or a cloud server. A cloud server, also called a cloud computing server or cloud host, is a host product in the cloud computing service system. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be noted that the product form of the present disclosure is program code contained in platform software and deployed on an electronic device (which may also be hardware with computing capability, such as a computing cloud or a mobile terminal). Illustratively, the program code of the present disclosure may be stored inside the electronic device. At run time, the program code runs in the host memory and/or GPU memory of the electronic device.
It should also be noted that the technical solutions provided by the present disclosure may be applied to a server or a terminal device, or part of the processing may be performed by the terminal device and the other part by the server; this embodiment imposes no limitation in this respect.
The technical solutions provided by the present disclosure are described in detail below with reference to several specific embodiments. The following embodiments may be combined with one another, and the same or similar concepts or processes may not be repeated in some of them.
Embodiments of the present disclosure provide an audio processing method, apparatus, device, and storage medium. The method includes: in response to a first instruction, acquiring the vocals from one piece of audio uploaded by the user; in response to a second instruction, acquiring the accompaniment from another piece of audio uploaded by the user; and in response to a third instruction, automatically mashing up the vocals and the accompaniment of the two pieces of audio. This improves audio processing efficiency and satisfies users' personalized audio production needs.
FIG. 2 is a first schematic flowchart of the audio processing method provided by an embodiment of the present disclosure. As shown in FIG. 2, the method of this embodiment may be applied in a terminal device or a server and includes the following steps.
Step 201: in response to a first instruction, acquire vocals.
In this embodiment, acquiring the vocals in response to the first instruction includes: in response to the first instruction, acquiring audio data that contains only vocals.
In a possible implementation, the first instruction is generated in response to the user touching an on-screen control, and the vocals are acquired according to the first instruction.
In a possible implementation, the first instruction is generated in response to the user clicking an on-screen control with a mouse, and the vocals are acquired according to the first instruction.
In a possible implementation, the first instruction is generated in response to the user's voice control, and the vocals are acquired according to the first instruction.
It should be noted that the first instruction is used not only to instruct acquisition of audio data containing vocals, but also to trigger extraction of the vocal part from that audio data.
Step 202: in response to a second instruction, acquire an accompaniment.
In this embodiment, acquiring the accompaniment in response to the second instruction includes: in response to the second instruction, acquiring audio data that contains only an accompaniment.
In a possible implementation, the second instruction is generated in response to the user touching an on-screen control, and the accompaniment is acquired according to the second instruction.
In a possible implementation, the second instruction is generated in response to the user clicking an on-screen control with a mouse, and the accompaniment is acquired according to the second instruction.
In a possible implementation, the second instruction is generated in response to the user's voice control, and the accompaniment is acquired according to the second instruction.
It should be noted that the second instruction not only instructs acquisition of audio data containing an accompaniment, but also triggers extraction of the accompaniment part from that audio data.
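Purely as a non-authoritative illustration of the extraction triggered by the first and second instructions, the following Python sketch uses the open-source Spleeter library to split each uploaded file into a vocal stem and an accompaniment stem. The disclosure does not name a specific separation tool, so Spleeter, the two-stem configuration, and the file paths are assumptions made only for this example.

```python
# A minimal sketch of the vocal/accompaniment extraction step.
# Assumption: the open-source Spleeter library stands in for the
# unspecified sound-source separation used by the application.
from spleeter.separator import Separator

def extract_stems(audio_path: str, out_dir: str) -> None:
    # "spleeter:2stems" splits a mixture into vocals.wav and
    # accompaniment.wav under <out_dir>/<track name>/.
    separator = Separator("spleeter:2stems")
    separator.separate_to_file(audio_path, out_dir)

# First instruction: keep only the vocal stem of the first upload.
extract_stems("upload_one.mp3", "stems/first")
# Second instruction: keep only the accompaniment stem of the second upload.
extract_stems("upload_two.mp3", "stems/second")
```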
Step 203: in response to a third instruction, mix the vocals and the accompaniment to obtain target audio.
In a possible implementation, the third instruction is generated in response to the user touching an on-screen control, and the vocals and the accompaniment are mixed according to the third instruction to obtain the target audio.
In a possible implementation, the third instruction is generated in response to the user clicking an on-screen control with a mouse, and the vocals and the accompaniment are mixed according to the third instruction to obtain the target audio.
In a possible implementation, the third instruction is generated in response to the user's voice control, and the vocals and the accompaniment are mixed according to the third instruction to obtain the target audio.
In the embodiments of the present disclosure, mixing the vocals and the accompaniment may also be described as mashing them up. The user uploads, by interface touch or voice control, the two pieces of audio data to be mashed up; the vocals of one piece and the accompaniment of the other are extracted respectively, and the vocals and accompaniment of the two pieces are then mashed up automatically. This improves audio processing efficiency and satisfies users' personalized audio production needs.
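As a minimal, purely illustrative sketch of this mixing step (not the disclosed implementation), the following Python code overlays a separated vocal stem on a separated accompaniment stem. The file paths follow on from the separation sketch above, and the sample rate and gain values are assumptions.

```python
import numpy as np
import librosa
import soundfile as sf

def mix_tracks(vocal_path: str, accomp_path: str, out_path: str,
               vocal_gain: float = 1.0, accomp_gain: float = 0.8) -> None:
    """Naive mix for step 203: overlay the vocal stem on the accompaniment."""
    sr = 44100
    vocals, _ = librosa.load(vocal_path, sr=sr, mono=True)
    accomp, _ = librosa.load(accomp_path, sr=sr, mono=True)
    n = min(len(vocals), len(accomp))        # trim to the common length
    mix = vocal_gain * vocals[:n] + accomp_gain * accomp[:n]
    mix /= max(1.0, np.abs(mix).max())       # normalize to avoid clipping
    sf.write(out_path, mix, sr)

mix_tracks("stems/first/upload_one/vocals.wav",
           "stems/second/upload_two/accompaniment.wav",
           "target_audio.wav")
```

A real implementation would also align the two tracks before summing; one possible alignment step is sketched later in this document.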
On the basis of the above embodiment, the audio processing procedure is described in detail below through a specific embodiment.
FIG. 3 is a second schematic flowchart of the audio processing method provided by an embodiment of the present disclosure. As shown in FIG. 3, the method of this embodiment may be applied in a terminal device or a server and includes the following steps.
Step 301: in response to a touch operation on a first control on a first interface, import first audio and separate the vocals from the first audio.
Step 302: in response to a touch operation on a second control on the first interface, import second audio and separate the accompaniment from the second audio.
Step 303: in response to a touch operation on a third control on the first interface, mix the vocals and the accompaniment to obtain target audio.
In this embodiment, the first interface may also be described as an audio import interface. Optionally, the touch operations on the first, second, and third controls on the first interface include but are not limited to click operations.
Illustratively, FIG. 4 is a first schematic diagram of a user interface provided by an embodiment of the present disclosure. As shown in FIG. 4, the first interface 400 includes a first control 401, a second control 402, and a third control 403. The first control 401 is used to extract the vocals from audio data, the second control 402 is used to extract the accompaniment from audio data, and the third control 403 is used to automatically mash up (mix) the extracted vocals and accompaniment.
It should be noted that the interface controls in this embodiment include but are not limited to icons, buttons, drop-down boxes, and sliders, and the touch operations include but are not limited to click, long-press, double-click, and slide operations.
Optionally, in some embodiments, in response to a long-press operation on the first control on the first interface, a first voice input by the user is acquired, a first instruction is generated through speech recognition, the first audio is imported according to the first instruction, and the vocals are separated from the first audio.
Optionally, in some embodiments, in response to a long-press operation on the second control on the first interface, a second voice input by the user is acquired, a second instruction is generated through speech recognition, the second audio is imported according to the second instruction, and the accompaniment is separated from the second audio.
Optionally, in some embodiments, in response to a long-press operation on the third control on the first interface, a third voice input by the user is acquired, a third instruction is generated through speech recognition, and the vocals and the accompaniment are mixed according to the third instruction to obtain the target audio.
Optionally, in some embodiments, the user may also input a control voice through a physical button of the device, such as a side key of a smartphone, to import the first audio or the second audio or to perform audio mixing.
Based on the first interface shown in FIG. 4, the user's touch operations on the first interface and the resulting interface changes are illustrated below.
Illustratively, FIG. 5 is a first schematic diagram of user interface changes, and FIG. 6a is a second schematic diagram of a user interface, both provided by embodiments of the present disclosure. As shown in FIG. 5, the user first imports audio data by clicking the first control 401 of the first interface 400; the user may choose to import the audio data from a file or from a video album, for example by selecting audio 1 on a video album interface 404 in FIG. 5. While the audio data is imported, its vocal part is extracted to obtain the vocals, which are displayed visually on the first interface 400, for example as the vocal track shown in FIG. 5 or FIG. 6a, and the user can audition the extracted vocals. Next, the user imports another piece of audio data by clicking the second control 402 of the first interface 400; while it is imported, its accompaniment part is extracted to obtain the accompaniment, which is visualized on the first interface 400, for example as the accompaniment track shown in FIG. 6a, and the user can audition the extracted accompaniment. Finally, the user clicks the third control 403 of the first interface 400 to automatically mash up the extracted vocals and accompaniment and obtain the target audio.
Illustratively, the user uploads a recording of their own singing and playing while also uploading an existing finished music work; the vocals are extracted from the user's recording and the accompaniment from the finished work, and the two are mixed to obtain target audio that fuses the user's vocals with the existing accompaniment. This greatly facilitates personalized music creation and satisfies the creation needs of different users.
Illustratively, FIG. 6b is a third schematic diagram of a user interface provided by an embodiment of the present disclosure. The interface shown in FIG. 6b can be regarded as an optimized version of the interface in FIG. 6a, with more function controls. As shown in FIG. 6b, after the user uploads the two pieces of audio data, the first interface 400 includes: a first playback control, a first delete control, and a first replace control associated with the vocals, used respectively to audition, delete, and replace the vocals; and a second playback control, a second delete control, and a second replace control associated with the accompaniment, used respectively to audition, delete, and replace the accompaniment.
Optionally, the first interface 400 further includes a fourth control 405 and a fifth control 406. The fourth control 405 is used to trigger custom processing of the vocals and/or the accompaniment, including audio clipping of the vocals and/or the accompaniment. The fifth control 406 is used to trigger audio editing of the vocals and/or the accompaniment (going to the recording studio for audio editing or processing); see below for details.
In an optional embodiment of this embodiment, mixing the vocals and the accompaniment to obtain the target audio specifically includes: acquiring a vocal segment of the vocals and an accompaniment segment of the accompaniment, and mixing the vocal segment and the accompaniment segment to obtain the target audio. That is, before mixing, a mixable vocal segment and accompaniment segment may first be extracted from the vocals and the accompaniment respectively, and the audio mixing is then performed on those segments. Specifically, the vocal segment and the accompaniment segment may be obtained in the following ways.
In an optional implementation, the vocals and the accompaniment are separately input into a segment recognition model to obtain the vocal segment of the vocals and the accompaniment segment of the accompaniment. The segment recognition model is used to identify target segments of audio. Specifically, the vocals are input into the segment recognition model to obtain the target segment of the vocals, and the accompaniment is input into the model to obtain the target segment of the accompaniment. A target segment may be the refrain, the chorus (climax), or another segment of the audio, for example a repeated section of a song.
Optionally, the segment recognition model may be obtained by training a deep learning model; this embodiment does not limit the structure of the deep learning model. This implementation achieves intelligent extraction of vocal and accompaniment segments through a trained model, which improves audio processing efficiency and accuracy.
Optionally, the training process of the segment recognition model includes: acquiring a training data set that includes multiple audio samples and annotation information for each sample, the annotation information indicating the target segment of the corresponding sample; taking the audio samples as the input of the segment recognition model and the annotation information as its expected output; and training the model until its loss function converges, at which point training stops and the parameters of the trained model are obtained.
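The disclosure fixes neither the model architecture nor the framework; purely as an illustration of the loop just described, the PyTorch sketch below trains a hypothetical frame-wise segment classifier on precomputed log-mel features. Every name here (SegmentNet, the feature shape, the 0/1 per-frame labels) is an assumption for the example, not the patented model.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for the unspecified deep-learning model:
# classifies each audio frame as chorus (1) or non-chorus (0).
class SegmentNet(nn.Module):
    def __init__(self, n_mels: int = 64, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)

    def forward(self, x):            # x: (batch, frames, n_mels)
        out, _ = self.rnn(x)
        return self.head(out)        # (batch, frames, 2) logits

def train(features: torch.Tensor, labels: torch.Tensor, epochs: int = 20):
    """features: float (N, frames, n_mels); labels: long (N, frames), 0/1."""
    model = SegmentNet(features.shape[-1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    loader = DataLoader(TensorDataset(features, labels),
                        batch_size=8, shuffle=True)
    for epoch in range(epochs):      # proxy for "until the loss converges"
        total = 0.0
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x).reshape(-1, 2), y.reshape(-1))
            loss.backward()
            opt.step()
            total += loss.item()
        print(f"epoch {epoch}: loss {total / len(loader):.4f}")
    return model
```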
In this embodiment, the segment recognition model can analyze information such as the rhythm and loudness changes of the input audio, identify segments such as the intro, verse, chorus, interlude, bridge, outro, and silence, and extract the most probable chorus part, i.e., the climax. Specifically, the start and end timestamps of the different segments are extracted for subsequent clipping, and the target segment of the audio is finally output.
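The embodiment only states that the model relies on rhythm and loudness cues; as a rough, model-free approximation of that idea, the sketch below scores fixed-length windows by RMS loudness and exports the loudest one as a stand-in "chorus", together with its start and end timestamps. The window length and the loudness-only criterion are assumptions, not the patented method.

```python
import numpy as np
import librosa
import soundfile as sf

def loudest_window(audio_path: str, out_path: str, seconds: float = 15.0):
    """Crude chorus proxy: cut out the RMS-loudest window of the audio."""
    y, sr = librosa.load(audio_path, sr=None, mono=True)
    hop = 512
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]   # per-frame loudness
    frames_per_win = min(max(1, int(seconds * sr / hop)), len(rms))
    # Sliding-window loudness via a moving sum over RMS frames.
    scores = np.convolve(rms, np.ones(frames_per_win), mode="valid")
    start = int(scores.argmax()) * hop
    end = min(len(y), start + int(seconds * sr))
    sf.write(out_path, y[start:end], sr)
    return start / sr, end / sr                          # timestamps in seconds

t0, t1 = loudest_window("stems/first/upload_one/vocals.wav", "vocal_segment.wav")
print(f"candidate chorus: {t0:.1f}s - {t1:.1f}s")
```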
In an optional implementation, in response to a touch operation on a fourth control on the first interface, the audio tracks of the vocals and the accompaniment are displayed on a second interface; the vocal segment is acquired in response to a clipping operation on the vocal track, and the accompaniment segment is acquired in response to a clipping operation on the accompaniment track.
In this implementation, the target segments of the vocals and the accompaniment used for subsequent mixing are obtained by the user clipping them on the interface. This adds custom processing of the imported vocals and accompaniment, increases the user's involvement in audio production, and satisfies the production needs of different users.
In an optional embodiment of this embodiment, mixing the vocals and the accompaniment to obtain the target audio includes: acquiring a first tempo of the vocals and a second tempo of the accompaniment, aligning the first tempo of the vocals with the second tempo of the accompaniment, and mixing the aligned vocals and accompaniment to obtain the target audio.
In an optional implementation, the second tempo of the accompaniment is adjusted with the first tempo of the vocals as the reference, so that the first tempo of the vocals and the second tempo of the accompaniment are consistent.
In an optional implementation, the first tempo of the vocals is adjusted with the second tempo of the accompaniment as the reference, so that the first tempo of the vocals and the second tempo of the accompaniment are consistent.
In an optional embodiment of this embodiment, mixing the vocals and the accompaniment to obtain the target audio includes: mixing the vocal segment of the vocals and the accompaniment segment of the accompaniment to obtain the target audio. Specifically, a first tempo of the vocal segment and a second tempo of the accompaniment segment are acquired and aligned, and the aligned vocal segment and accompaniment segment are mixed to obtain the target audio.
In an optional implementation, the second tempo of the accompaniment segment is adjusted with the first tempo of the vocal segment as the reference, so that the two tempos are consistent.
In an optional implementation, the first tempo of the vocal segment is adjusted with the second tempo of the accompaniment segment as the reference, so that the two tempos are consistent.
From the above several implementations of tempo alignment, it follows that:
the target audio is obtained by acquiring a first tempo of third audio and a second tempo of fourth audio, aligning the first tempo of the third audio with the second tempo of the fourth audio, and obtaining the target audio based on the aligned third audio and fourth audio. Specifically, the second tempo of the fourth audio is adjusted with the first tempo of the third audio as the reference, so that the third audio and the fourth audio have the same tempo.
Here, the third audio may be one of the vocals and the accompaniment, in which case the fourth audio is the other of the vocals and the accompaniment; or the third audio may be one of the vocal segment and the accompaniment segment, in which case the fourth audio is the other of the vocal segment and the accompaniment segment.
It should be noted that the above embodiments involve tempo detection on the audio or the audio segments. Tempo detection detects the downbeat times within the beats and infers the speed of the whole audio or segment. Adjusting the tempo of audio includes stretching or compressing it; normally, the tempo of the vocal track is aligned to the accompaniment track, and the vocal track file is processed through audio stretching or compression.
By performing tempo detection and alignment on the two pieces of audio or their segments, the vocals and the accompaniment in the mixed target audio blend well, which improves the audio processing effect.
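As a hedged illustration of this detect-then-stretch step, the librosa sketch below estimates each track's tempo with a beat tracker and time-stretches the vocal to the accompaniment's tempo before mixing, as the passage above describes for the usual case. Treating a single global tempo ratio as sufficient is a simplifying assumption; the disclosure does not prescribe librosa or this exact procedure.

```python
import numpy as np
import librosa
import soundfile as sf

def align_and_mix(vocal_path: str, accomp_path: str, out_path: str) -> None:
    sr = 44100
    vocals, _ = librosa.load(vocal_path, sr=sr, mono=True)
    accomp, _ = librosa.load(accomp_path, sr=sr, mono=True)

    # Tempo detection: estimate beats-per-minute for each track.
    v_tempo, _ = librosa.beat.beat_track(y=vocals, sr=sr)
    a_tempo, _ = librosa.beat.beat_track(y=accomp, sr=sr)

    # Align the vocal track to the accompaniment tempo by stretching
    # or compressing it (rate > 1 speeds the audio up).
    rate = float(a_tempo) / float(v_tempo)
    vocals = librosa.effects.time_stretch(vocals, rate=rate)

    n = min(len(vocals), len(accomp))
    mix = vocals[:n] + 0.8 * accomp[:n]
    mix /= max(1.0, np.abs(mix).max())
    sf.write(out_path, mix, sr)

align_and_mix("vocal_segment.wav",
              "stems/second/upload_two/accompaniment.wav",
              "target_audio.wav")
```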
In an optional embodiment of this embodiment, in response to a touch operation on the third control on the first interface, the interface jumps to a third interface, which includes a third playback control used to trigger playback of the target audio. The third interface is the audio mix preview interface. The user interface changes involved in obtaining the target audio after importing the two pieces of audio are illustrated below.
Illustratively, FIG. 7a is a second schematic diagram of user interface changes provided by an embodiment of the present disclosure. As shown in FIG. 7a, after the user uploads the two pieces of audio data, the vocals and the accompaniment are visualized on the first interface 400; the user can directly click the third control 403 to automatically mash up the vocals and the accompaniment, and the mixed target audio is visualized on a third interface 701. On the third interface 701 the user can audition, export, or share the mixed target audio, choose to play again, or import it into the recording studio for further audio processing.
Illustratively, FIG. 7b is a third schematic diagram of user interface changes provided by an embodiment of the present disclosure. As shown in FIG. 7b, after uploading the two pieces of audio data, the user may also click the fourth control 405 of the first interface 400 to trigger custom processing of the vocals and/or the accompaniment, and the interface jumps to a second interface 700. On the second interface 700 the user can clip the vocals and/or the accompaniment, for example cutting out the chorus segment of each, and can also audition the clipped chorus segments. After finishing the audio clipping, clicking the "auto mashup" control of the second interface 700 jumps to the third interface 701.
Illustratively, FIG. 7c is a fourth schematic diagram of user interface changes provided by an embodiment of the present disclosure. As shown in FIG. 7c, after uploading the two pieces of audio data, the user directly clicks the third control 403 of the first interface 400 to automatically mash up the vocals and the accompaniment; on the third interface 701 shown in FIG. 7c the user can audition, export, or share the mixed target audio, choose to cancel, or import it into the recording studio for further processing, and can also set the cover of the target audio. The third interface 701 shown in FIG. 7c can be regarded as an optimized version of the third interface 701 shown in FIG. 7a.
Based on the third interface illustrated above, the function controls on the third interface are described in detail below through several specific embodiments.
In an optional embodiment of this embodiment, in response to a touch operation on a cover editing control on the third interface, a first window is displayed; in response to a control selection operation on the first window, a target cover is acquired. The first window includes a cover import control, one or more preset static cover controls, and one or more preset animation effect controls.
Optionally, the target cover is a static cover or a dynamic cover.
In an optional implementation, if the target cover is a dynamic cover, acquiring the target cover in response to the control selection operation on the first window includes: acquiring a static cover and an animation effect in response to the control selection operation; and generating, according to the audio characteristics of the target audio, the static cover, and the animation effect, a dynamic cover that changes with the audio characteristics of the target audio. The audio characteristics include the audio beat and/or the volume.
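The embodiment leaves open how the beat and volume drive the animation. One illustrative reading, sketched below, extracts beat times and a per-frame volume envelope with librosa and maps them to a per-video-frame zoom factor for the static cover. The 30 fps rate, the zoom mapping, and the pulse decay are invented for this example.

```python
import numpy as np
import librosa

def cover_animation_curve(audio_path: str, fps: int = 30) -> np.ndarray:
    """Per-video-frame scale factor for the static cover: the cover
    breathes with the volume and pulses on each detected beat."""
    y, sr = librosa.load(audio_path, sr=None, mono=True)
    hop = 512
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    rms = rms / (rms.max() + 1e-9)                    # volume envelope in [0, 1]
    _, beats = librosa.beat.beat_track(y=y, sr=sr, hop_length=hop)
    beat_times = librosa.frames_to_time(beats, sr=sr, hop_length=hop)

    n_frames = int(len(y) / sr * fps)
    t = np.arange(n_frames) / fps
    rms_t = np.interp(t, np.arange(len(rms)) * hop / sr, rms)
    pulse = np.zeros(n_frames)
    for bt in beat_times:                             # short pulse at each beat
        i = int(bt * fps)
        if i < n_frames:
            pulse[i:i + fps // 6] += np.linspace(0.1, 0.0, fps // 6)[: n_frames - i]
    return 1.0 + 0.05 * rms_t + pulse                 # zoom factor per frame

scale = cover_animation_curve("target_audio.wav")
```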
Illustratively, FIG. 8 is a fifth schematic diagram of user interface changes provided by an embodiment of the present disclosure. As shown in FIG. 8, when the user clicks a cover editing control 705 on the third interface 701, a first window 800 pops up at the bottom of the third interface 701. The first window 800 includes a cover import control 801, multiple preset static covers such as covers 1 to 3 in FIG. 8, and multiple animation effects such as animations 1 to 3 in FIG. 8. The user can import a custom picture from the local album by clicking the cover import control 801 and use it as the static cover, or directly select one of the preset static covers. The user can directly select one of the preset animations or set no animation. After finishing editing the cover, the target audio and the generated target cover can be exported to the album or file system, shared to a designated application, or imported into the recording studio for further audio processing.
Illustratively, FIG. 9 is a fourth schematic diagram of a user interface provided by an embodiment of the present disclosure. As shown in FIG. 9, after setting the static cover and animation effect of the target audio through FIG. 8, the user can preview the synthesized target cover on the audio mix preview interface. The target cover includes the static cover and an animation effect that changes with the audio characteristics of the target audio; the animation effect can be regarded as an animated special-effect layer added beneath the static cover, and it can change dynamically at any position around the static cover.
By providing the user with the function of setting an audio cover, this embodiment enables different users to edit covers in a personalized way and improves the audio production experience.
In an optional embodiment of this embodiment, in response to an export instruction on the third interface, the data associated with the target audio is exported to a target location. Optionally, the target location includes an album or a file system.
Illustratively, as shown in FIGS. 7a to 7c, the user clicks an export control 702 on the third interface 701 to trigger a first selection window, on which the user can choose to export the data associated with the target audio to the album or the file system.
Optionally, in some embodiments, in response to a long-press operation on the export control 702 on the third interface 701, a fourth voice input by the user is acquired, an export instruction is generated through speech recognition, and the data associated with the target audio is exported to the target location according to the export instruction.
In an optional embodiment of this embodiment, in response to a sharing instruction on the third interface, the data associated with the target audio is shared to a target application.
Illustratively, as shown in FIGS. 7a to 7c, the user clicks a sharing control 704 on the third interface 701 to trigger a second selection window, on which the user can choose to share the data associated with the target audio to a target application or to a designated user in the target application.
Optionally, in some embodiments, in response to a long-press operation on the sharing control 704 on the third interface 701, a fifth voice input by the user is acquired, a sharing instruction is generated through speech recognition, and the data associated with the target audio is shared, according to the sharing instruction, to the target application or to a designated user in the target application.
Optionally, the data associated with the target audio includes at least one of the following: the target audio, the vocals, the accompaniment, the vocal segment of the vocals, the accompaniment segment of the accompaniment, the static cover of the target audio, and the dynamic cover of the target audio.
In summary, the data the user exports or shares may contain only the target audio, or it may also contain all the intermediate data generated while obtaining the target audio. When the data to be exported or shared is too large, it can first be compressed, and the compressed data is then exported locally or shared with other users. If the shared data received by another user contains all the intermediate data, that user can, besides playing the target audio, inspect or re-edit the intermediate data to generate new target audio, enabling collaborative multi-user audio production, increasing interactivity between users, and improving the user experience.
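One plausible, non-normative way to realize such a material bundle is a single compressed archive. The sketch below zips the target audio together with whatever intermediate stems, segments, and covers exist; the file layout and names are assumptions carried over from the earlier sketches.

```python
import os
import zipfile

# Hypothetical material bundle: target audio plus any intermediate data.
BUNDLE_FILES = [
    "target_audio.wav",
    "stems/first/upload_one/vocals.wav",
    "stems/second/upload_two/accompaniment.wav",
    "vocal_segment.wav",
    "cover_static.png",
    "cover_dynamic.mp4",
]

def export_bundle(out_path: str = "mashup_bundle.zip") -> None:
    # ZIP_DEFLATED compresses the bundle before it is exported or shared.
    with zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in BUNDLE_FILES:
            if os.path.exists(path):      # include only what was produced
                zf.write(path)

export_bundle()
```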
In an optional embodiment of this embodiment, in response to a touch operation on an audio editing control on the third interface, the interface jumps from the third interface to a fourth interface, which includes audio processing function controls. The fourth interface is the interface for audio post-processing, and may also be described as the recording studio interface; on it the user can post-process the vocals and the accompaniment in the target audio.
In an optional embodiment of this embodiment, in response to a touch operation on the audio editing control on the third interface, the interface jumps from the third interface to a fourth interface that includes a trigger control associated with the audio processing function controls; the trigger control is used to trigger display of the audio processing function controls.
Optionally, the audio processing function controls include one or more of the following:
an audio optimization control, used to trigger editing of audio to optimize the audio;
an accompaniment separation control, used to trigger separation of vocals and/or an accompaniment from audio;
a style synthesis control, used to trigger separation of vocals from audio and mixing and editing of the separated vocals with a preset accompaniment; and
an audio mashup control, used to trigger separation of vocals from first audio and of an accompaniment from second audio, and mixing and editing of the separated vocals with the separated accompaniment.
Optionally, audio optimization includes optimizing the vocals and/or the accompaniment of the user's singing-and-playing audio; that is, audio optimization includes singing-and-playing optimization, such as male-guitar, female-guitar, male-piano, and female-piano presets.
Optionally, accompaniment separation includes separation processing such as removing the vocals or removing the instruments.
Optionally, style synthesis includes style optimizations such as in-car party songs, classic pop, heartbeat moments, relaxing moments, and childhood memories.
Optionally, audio mashup includes optimization processing such as tempo alignment and transposition.
Illustratively, FIG. 10a is a sixth schematic diagram of user interface changes provided by an embodiment of the present disclosure. As shown in FIG. 10a, when the user clicks an audio editing control 703 on the third interface 701, the interface jumps to a fourth interface 1000. On the fourth interface 1000, the target audio obtained by mixing the vocals and the accompaniment is displayed directly on track 1, and multiple selectable audio processing controls are displayed in an audio processing window 1004 of the fourth interface 1000.
Illustratively, FIG. 10b is a seventh schematic diagram of user interface changes provided by an embodiment of the present disclosure. As shown in FIG. 10b, when the user clicks the audio editing control 703 on the third interface 701, the interface jumps to the fourth interface 1000. On the fourth interface 1000 the user can post-process the vocals and the accompaniment in the target audio; for example, track 1 of the fourth interface 1000 corresponds to the vocals and track 2 to the accompaniment, and audio clipping can be performed on the vocals and the accompaniment. The user can also click an interface switching control 1001 on the fourth interface 1000, or swipe left or right, to enter a fifth interface 1002, which includes a trigger control 1003 associated with the audio processing function controls; the trigger control 1003 is used to trigger display of the audio processing function controls, such as the multiple selectable controls shown in the audio processing window 1004 of FIG. 10b. On the fifth interface 1002 the user can add effects to the vocals on track 1 and the accompaniment on track 2, perform further audio processing, and adjust their volumes, and can also adjust the overall volume of the vocals and the accompaniment. The effects include reverb, equalization, electronic voice, phaser, flanger, filtering, and so on.
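As a loose illustration of the per-track and master volume controls described for this studio view, the pydub sketch below applies independent gain to the vocal and accompaniment tracks, overlays them, and applies a master gain. The decibel values are arbitrary, and using pydub at all is an assumption made for the example, not part of the disclosure.

```python
from pydub import AudioSegment

# Track 1: vocals; track 2: accompaniment (paths are illustrative).
vocals = AudioSegment.from_file("stems/first/upload_one/vocals.wav")
accomp = AudioSegment.from_file("stems/second/upload_two/accompaniment.wav")

vocals = vocals + 2            # per-track gain: vocals up 2 dB
accomp = accomp - 3            # accompaniment down 3 dB

mix = accomp.overlay(vocals)   # play both tracks together
mix = mix - 1                  # overall (master) volume down 1 dB
mix.export("studio_mix.wav", format="wav")
```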
In the embodiments of the present disclosure, the user imports the first audio by touching the first control on the first interface, and the vocals are separated from the first audio; the user then imports the second audio by touching the second control on the first interface, and the accompaniment is separated from the second audio; finally, by touching the third control on the first interface, the vocals and the accompaniment are mixed to obtain the target audio. This process automatically mashes up the vocals and the accompaniment of the two pieces of audio, improves the audio processing effect, and satisfies users' personalized audio production needs.
Corresponding to the audio processing method of the above embodiments, FIG. 11 is a structural block diagram of the audio processing apparatus provided by an embodiment of the present disclosure. For ease of description, only the parts related to the embodiments of the present disclosure are shown. As shown in FIG. 11, the audio processing apparatus 1100 of this embodiment includes an acquisition module 1101 and a processing module 1102.
The acquisition module 1101 is configured to acquire vocals in response to a first instruction;
the acquisition module 1101 is further configured to acquire an accompaniment in response to a second instruction; and
the processing module 1102 is configured to mix the vocals and the accompaniment in response to a third instruction to obtain target audio.
In an optional embodiment of the present disclosure, the acquisition module 1101 is configured to import first audio in response to a touch operation on a first control on a first interface and separate the vocals from the first audio;
the acquisition module 1101 is further configured to import second audio in response to a touch operation on a second control on the first interface and separate the accompaniment from the second audio.
In an optional embodiment of the present disclosure, the processing module 1102 is configured to mix the vocals and the accompaniment in response to a touch operation on a third control on the first interface to obtain the target audio.
In an optional embodiment of the present disclosure, the processing module 1102 is configured to:
acquire a vocal segment of the vocals and an accompaniment segment of the accompaniment; and
mix the vocal segment and the accompaniment segment to obtain the target audio.
In an optional embodiment of the present disclosure, the processing module 1102 is configured to:
input the vocals and the accompaniment separately into a segment recognition model to obtain the vocal segment of the vocals and the accompaniment segment of the accompaniment;
wherein the segment recognition model is used to identify target segments of audio.
In an optional embodiment of the present disclosure, the audio processing apparatus 1100 further includes a display module 1103;
the display module 1103 is configured to display the audio tracks of the vocals and the accompaniment on a second interface in response to a touch operation on a fourth control on the first interface; and
the acquisition module 1101 is configured to acquire the vocal segment in response to a clipping operation on the vocal track, and to acquire the accompaniment segment in response to a clipping operation on the accompaniment track.
In an optional embodiment of the present disclosure, the processing module 1102 is configured to:
acquire a first tempo of third audio and a second tempo of fourth audio;
align the first tempo of the third audio with the second tempo of the fourth audio; and
obtain the target audio based on the aligned third audio and fourth audio;
wherein the third audio is one of the vocals and the accompaniment and the fourth audio is the other of the vocals and the accompaniment; or the third audio is one of the vocal segment and the accompaniment segment and the fourth audio is the other of the vocal segment and the accompaniment segment.
In an optional embodiment of the present disclosure, the processing module 1102 is configured to adjust the second tempo of the fourth audio with the first tempo of the third audio as the reference, so that the third audio and the fourth audio have the same tempo.
In an optional embodiment of the present disclosure, the first interface includes:
a first playback control, a first delete control, and a first replace control associated with the vocals, the first playback control being used to audition the vocals, the first delete control being used to delete the vocals, and the first replace control being used to replace the vocals; and
a second playback control, a second delete control, and a second replace control associated with the accompaniment, the second playback control being used to audition the accompaniment, the second delete control being used to delete the accompaniment, and the second replace control being used to replace the accompaniment.
In an optional embodiment of the present disclosure, the processing module 1102 is configured to jump to a third interface in response to a touch operation on the third control on the first interface, the third interface including a third playback control used to trigger playback of the target audio.
In an optional embodiment of the present disclosure, the display module 1103 is configured to display a first window in response to a touch operation on a cover editing control on the third interface, the first window including a cover import control, one or more preset static cover controls, and one or more preset animation effect controls; and
the acquisition module 1101 is configured to acquire a target cover in response to a control selection operation on the first window, the target cover being a static cover or a dynamic cover.
In an optional embodiment of the present disclosure, the acquisition module 1101 is configured to acquire a static cover and an animation effect in response to a control selection operation on the first window;
the processing module 1102 is configured to generate, according to the audio characteristics of the target audio, the static cover, and the animation effect, a dynamic cover that changes with the audio characteristics of the target audio;
wherein the audio characteristics include the audio beat and/or the volume.
In an optional embodiment of the present disclosure, the processing module 1102 is configured to export data associated with the target audio to a target location in response to an export instruction on the third interface; the target location includes an album or a file system.
In an optional embodiment of the present disclosure, the processing module 1102 is configured to share data associated with the target audio to a target application in response to a sharing instruction on the third interface.
In an optional embodiment of the present disclosure, the data associated with the target audio includes at least one of the following:
the target audio, the vocals, the accompaniment, the vocal segment of the vocals, the accompaniment segment of the accompaniment, the static cover of the target audio, and the dynamic cover of the target audio.
In an optional embodiment of the present disclosure, the processing module 1102 is configured to jump from the third interface to a fourth interface in response to a touch operation on an audio editing control on the third interface, the fourth interface including audio processing function controls or a trigger control associated with the audio processing function controls, the trigger control being used to trigger display of the audio processing function controls;
the audio processing function controls include one or more of the following:
an audio optimization control, used to trigger editing of audio to optimize the audio;
an accompaniment separation control, used to trigger separation of vocals and/or an accompaniment from audio;
a style synthesis control, used to trigger separation of vocals from audio and mixing and editing of the separated vocals with a preset accompaniment; and
an audio mashup control, used to trigger separation of vocals from first audio and of an accompaniment from second audio, and mixing and editing of the separated vocals with the separated accompaniment.
The audio processing apparatus provided in this embodiment can be used to execute the technical solutions of the above method embodiments; its implementation principles and technical effects are similar and are not repeated here.
FIG. 12 is a structural block diagram of the electronic device provided by an embodiment of the present disclosure. As shown in FIG. 12, the electronic device 1200 may be a terminal device or a server. The terminal device may include, but is not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, personal digital assistants (PDAs), tablets (PADs), portable multimedia players (PMPs), and in-vehicle terminals (for example, in-vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 12 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 12, the electronic device 1200 may include a processing apparatus (such as a central processing unit or a graphics processing unit) 1201, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 1202 or a program loaded from a storage apparatus 1208 into a random access memory (RAM) 1203. The RAM 1203 also stores various programs and data required for the operation of the electronic device 1200. The processing apparatus 1201, the ROM 1202, and the RAM 1203 are connected to one another through a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.
Generally, the following apparatuses may be connected to the I/O interface 1205: input apparatuses 1206 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output apparatuses 1207 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; storage apparatuses 1208 including, for example, a magnetic tape and a hard disk; and a communication apparatus 1209. The communication apparatus 1209 may allow the electronic device 1200 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 12 shows the electronic device 1200 with various apparatuses, it should be understood that it is not required to implement or have all the apparatuses shown; more or fewer apparatuses may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product that includes a computer program carried on a computer-readable medium, the computer program containing program code for executing the methods shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network through the communication apparatus 1209, installed from the storage apparatus 1208, or installed from the ROM 1202. When the computer program is executed by the processing apparatus 1201, the above functions defined in the methods of the embodiments of the present disclosure are executed.
It should be noted that the computer-readable medium of the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in combination with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any appropriate medium, including but not limited to electric wires, optical cables, RF (radio frequency), and the like, or any suitable combination of the above.
The computer-readable medium may be included in the electronic device, or may exist separately without being assembled into the electronic device.
The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to execute the methods shown in the above embodiments.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a part of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or by hardware. The name of a unit does not in some cases constitute a limitation on the unit itself; for example, a first acquisition unit may also be described as "a unit that acquires at least two Internet Protocol addresses".
The functions described above may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that can be used include field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in combination with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium, and may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
In a first aspect, according to one or more embodiments of the present disclosure, an audio processing method is provided, including:
in response to a first instruction, acquiring vocals;
in response to a second instruction, acquiring an accompaniment; and
in response to a third instruction, mixing the vocals and the accompaniment to obtain target audio.
According to one or more embodiments of the present disclosure, acquiring the vocals in response to the first instruction includes:
in response to a touch operation on a first control on a first interface, importing first audio and separating the vocals from the first audio; and
acquiring the accompaniment in response to the second instruction includes:
in response to a touch operation on a second control on the first interface, importing second audio and separating the accompaniment from the second audio.
According to one or more embodiments of the present disclosure, mixing the vocals and the accompaniment in response to the third instruction to obtain the target audio includes:
in response to a touch operation on a third control on the first interface, mixing the vocals and the accompaniment to obtain the target audio.
According to one or more embodiments of the present disclosure, mixing the vocals and the accompaniment to obtain the target audio includes:
acquiring a vocal segment of the vocals and an accompaniment segment of the accompaniment; and
mixing the vocal segment and the accompaniment segment to obtain the target audio.
According to one or more embodiments of the present disclosure, acquiring the vocal segment of the vocals and the accompaniment segment of the accompaniment includes:
inputting the vocals and the accompaniment separately into a segment recognition model to obtain the vocal segment of the vocals and the accompaniment segment of the accompaniment;
wherein the segment recognition model is used to identify target segments of audio.
According to one or more embodiments of the present disclosure, acquiring the vocal segment of the vocals and the accompaniment segment of the accompaniment includes:
in response to a touch operation on a fourth control on the first interface, displaying the audio tracks of the vocals and the accompaniment on a second interface;
acquiring the vocal segment in response to a clipping operation on the vocal track; and
acquiring the accompaniment segment in response to a clipping operation on the accompaniment track.
According to one or more embodiments of the present disclosure, mixing the vocals and the accompaniment to obtain the target audio includes:
acquiring a first tempo of third audio and a second tempo of fourth audio;
aligning the first tempo of the third audio with the second tempo of the fourth audio; and
obtaining the target audio based on the aligned third audio and fourth audio;
wherein the third audio is one of the vocals and the accompaniment and the fourth audio is the other of the vocals and the accompaniment; or the third audio is one of the vocal segment and the accompaniment segment and the fourth audio is the other of the vocal segment and the accompaniment segment.
According to one or more embodiments of the present disclosure, aligning the first tempo of the third audio with the second tempo of the fourth audio includes:
adjusting the second tempo of the fourth audio with the first tempo of the third audio as the reference, so that the third audio and the fourth audio have the same tempo.
According to one or more embodiments of the present disclosure, the first interface includes:
a first playback control, a first delete control, and a first replace control associated with the vocals, the first playback control being used to audition the vocals, the first delete control being used to delete the vocals, and the first replace control being used to replace the vocals; and
a second playback control, a second delete control, and a second replace control associated with the accompaniment, the second playback control being used to audition the accompaniment, the second delete control being used to delete the accompaniment, and the second replace control being used to replace the accompaniment.
According to one or more embodiments of the present disclosure, in response to a touch operation on the third control on the first interface, the interface jumps to a third interface, the third interface including a third playback control used to trigger playback of the target audio.
According to one or more embodiments of the present disclosure, in response to a touch operation on a cover editing control on the third interface, a first window is displayed, the first window including a cover import control, one or more preset static cover controls, and one or more preset animation effect controls;
in response to a control selection operation on the first window, a target cover is acquired;
the target cover being a static cover or a dynamic cover.
According to one or more embodiments of the present disclosure, if the target cover is a dynamic cover, acquiring the target cover in response to the control selection operation on the first window includes:
acquiring a static cover and an animation effect in response to the control selection operation on the first window; and
generating, according to the audio characteristics of the target audio, the static cover, and the animation effect, a dynamic cover that changes with the audio characteristics of the target audio;
wherein the audio characteristics include the audio beat and/or the volume.
According to one or more embodiments of the present disclosure, in response to an export instruction on the third interface, data associated with the target audio is exported to a target location; the target location includes an album or a file system.
According to one or more embodiments of the present disclosure, in response to a sharing instruction on the third interface, data associated with the target audio is shared to a target application.
According to one or more embodiments of the present disclosure, the data associated with the target audio includes at least one of the following:
the target audio, the vocals, the accompaniment, the vocal segment of the vocals, the accompaniment segment of the accompaniment, the static cover of the target audio, and the dynamic cover of the target audio.
According to one or more embodiments of the present disclosure, in response to a touch operation on an audio editing control on the third interface, the interface jumps from the third interface to a fourth interface, the fourth interface including audio processing function controls or a trigger control associated with the audio processing function controls, the trigger control being used to trigger display of the audio processing function controls;
the audio processing function controls include one or more of the following:
an audio optimization control, used to trigger editing of audio to optimize the audio;
an accompaniment separation control, used to trigger separation of vocals and/or an accompaniment from audio;
a style synthesis control, used to trigger separation of vocals from audio and mixing and editing of the separated vocals with a preset accompaniment; and
an audio mashup control, used to trigger separation of vocals from first audio and of an accompaniment from second audio, and mixing and editing of the separated vocals with the separated accompaniment.
In a second aspect, according to one or more embodiments of the present disclosure, an audio processing apparatus is provided, including:
an acquisition module configured to acquire vocals in response to a first instruction,
the acquisition module being further configured to acquire an accompaniment in response to a second instruction; and
a processing module configured to mix the vocals and the accompaniment in response to a third instruction to obtain target audio.
According to one or more embodiments of the present disclosure, the acquisition module is configured to import first audio in response to a touch operation on a first control on a first interface and separate the vocals from the first audio;
the acquisition module is further configured to import second audio in response to a touch operation on a second control on the first interface and separate the accompaniment from the second audio.
According to one or more embodiments of the present disclosure, the processing module is configured to mix the vocals and the accompaniment in response to a touch operation on a third control on the first interface to obtain the target audio.
According to one or more embodiments of the present disclosure, the processing module is configured to:
acquire a vocal segment of the vocals and an accompaniment segment of the accompaniment; and
mix the vocal segment and the accompaniment segment to obtain the target audio.
According to one or more embodiments of the present disclosure, the processing module is configured to:
input the vocals and the accompaniment separately into a segment recognition model to obtain the vocal segment of the vocals and the accompaniment segment of the accompaniment;
wherein the segment recognition model is used to identify target segments of audio.
According to one or more embodiments of the present disclosure, the audio processing apparatus further includes a display module;
the display module is configured to display the audio tracks of the vocals and the accompaniment on a second interface in response to a touch operation on a fourth control on the first interface; and
the acquisition module is configured to acquire the vocal segment in response to a clipping operation on the vocal track, and to acquire the accompaniment segment in response to a clipping operation on the accompaniment track.
According to one or more embodiments of the present disclosure, the processing module is configured to:
acquire a first tempo of third audio and a second tempo of fourth audio;
align the first tempo of the third audio with the second tempo of the fourth audio; and
obtain the target audio based on the aligned third audio and fourth audio;
wherein the third audio is one of the vocals and the accompaniment and the fourth audio is the other of the vocals and the accompaniment; or the third audio is one of the vocal segment and the accompaniment segment and the fourth audio is the other of the vocal segment and the accompaniment segment.
According to one or more embodiments of the present disclosure, the processing module is configured to adjust the second tempo of the fourth audio with the first tempo of the third audio as the reference, so that the third audio and the fourth audio have the same tempo.
According to one or more embodiments of the present disclosure, the first interface includes:
a first playback control, a first delete control, and a first replace control associated with the vocals, the first playback control being used to audition the vocals, the first delete control being used to delete the vocals, and the first replace control being used to replace the vocals; and
a second playback control, a second delete control, and a second replace control associated with the accompaniment, the second playback control being used to audition the accompaniment, the second delete control being used to delete the accompaniment, and the second replace control being used to replace the accompaniment.
According to one or more embodiments of the present disclosure, the processing module is configured to jump to a third interface in response to a touch operation on the third control on the first interface, the third interface including a third playback control used to trigger playback of the target audio.
According to one or more embodiments of the present disclosure, the display module is configured to display a first window in response to a touch operation on a cover editing control on the third interface, the first window including a cover import control, one or more preset static cover controls, and one or more preset animation effect controls; and
the acquisition module is configured to acquire a target cover in response to a control selection operation on the first window, the target cover being a static cover or a dynamic cover.
According to one or more embodiments of the present disclosure, the acquisition module is configured to acquire a static cover and an animation effect in response to a control selection operation on the first window;
the processing module is configured to generate, according to the audio characteristics of the target audio, the static cover, and the animation effect, a dynamic cover that changes with the audio characteristics of the target audio;
wherein the audio characteristics include the audio beat and/or the volume.
According to one or more embodiments of the present disclosure, the processing module is configured to export data associated with the target audio to a target location in response to an export instruction on the third interface; the target location includes an album or a file system.
According to one or more embodiments of the present disclosure, the processing module is configured to share data associated with the target audio to a target application in response to a sharing instruction on the third interface.
According to one or more embodiments of the present disclosure, the data associated with the target audio includes at least one of the following:
the target audio, the vocals, the accompaniment, the vocal segment of the vocals, the accompaniment segment of the accompaniment, the static cover of the target audio, and the dynamic cover of the target audio.
According to one or more embodiments of the present disclosure, the processing module is configured to jump from the third interface to a fourth interface in response to a touch operation on an audio editing control on the third interface, the fourth interface including audio processing function controls or a trigger control associated with the audio processing function controls, the trigger control being used to trigger display of the audio processing function controls;
the audio processing function controls include one or more of the following:
an audio optimization control, used to trigger editing of audio to optimize the audio;
an accompaniment separation control, used to trigger separation of vocals and/or an accompaniment from audio;
a style synthesis control, used to trigger separation of vocals from audio and mixing and editing of the separated vocals with a preset accompaniment; and
an audio mashup control, used to trigger separation of vocals from first audio and of an accompaniment from second audio, and mixing and editing of the separated vocals with the separated accompaniment.
In a third aspect, according to one or more embodiments of the present disclosure, an electronic device is provided, including at least one processor and a memory;
the memory stores computer-executable instructions; and
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the audio processing method described in the first aspect and its various possible designs.
In a fourth aspect, according to one or more embodiments of the present disclosure, a computer-readable storage medium is provided, storing computer-executable instructions which, when executed by a processor, implement the audio processing method described in the first aspect and its various possible designs.
In a fifth aspect, according to one or more embodiments of the present disclosure, a computer program product is provided, including a computer program which, when executed by a processor, implements the audio processing method described in the first aspect and its various possible designs.
In a sixth aspect, an embodiment of the present disclosure provides a computer program which, when executed by a processor, implements the audio processing method described in the first aspect and its various possible designs.
The above description presents only the preferred embodiments of the present disclosure and an explanation of the applied technical principles. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combinations of the above technical features, but also covers other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above disclosed concept, for example technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in the present disclosure.
In addition, although the operations are depicted in a specific order, this should not be understood as requiring that they be performed in the specific order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are contained in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments can also be implemented in combination in a single embodiment; conversely, various features described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological logical acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. On the contrary, the specific features and acts described above are merely example forms of implementing the claims.

Claims (21)

  1. An audio processing method, comprising:
    in response to a first instruction, acquiring vocals;
    in response to a second instruction, acquiring an accompaniment; and
    in response to a third instruction, mixing the vocals and the accompaniment to obtain target audio.
  2. The method according to claim 1, wherein
    acquiring the vocals in response to the first instruction comprises:
    in response to a touch operation on a first control on a first interface, importing first audio and separating the vocals from the first audio; and
    acquiring the accompaniment in response to the second instruction comprises:
    in response to a touch operation on a second control on the first interface, importing second audio and separating the accompaniment from the second audio.
  3. The method according to claim 2, wherein mixing the vocals and the accompaniment in response to the third instruction to obtain the target audio comprises:
    in response to a touch operation on a third control on the first interface, mixing the vocals and the accompaniment to obtain the target audio.
  4. The method according to any one of claims 1 to 3, wherein mixing the vocals and the accompaniment to obtain the target audio comprises:
    acquiring a vocal segment of the vocals and an accompaniment segment of the accompaniment; and
    mixing the vocal segment and the accompaniment segment to obtain the target audio.
  5. The method according to claim 4, wherein acquiring the vocal segment of the vocals and the accompaniment segment of the accompaniment comprises:
    inputting the vocals and the accompaniment separately into a segment recognition model to obtain the vocal segment of the vocals and the accompaniment segment of the accompaniment;
    wherein the segment recognition model is used to identify target segments of audio.
  6. The method according to claim 4, wherein acquiring the vocal segment of the vocals and the accompaniment segment of the accompaniment comprises:
    in response to a touch operation on a fourth control on the first interface, displaying the audio tracks of the vocals and the accompaniment on a second interface;
    acquiring the vocal segment in response to a clipping operation on the vocal track; and
    acquiring the accompaniment segment in response to a clipping operation on the accompaniment track.
  7. The method according to any one of claims 1 to 3, wherein mixing the vocals and the accompaniment to obtain the target audio comprises:
    acquiring a first tempo of third audio and a second tempo of fourth audio;
    aligning the first tempo of the third audio with the second tempo of the fourth audio; and
    obtaining the target audio based on the aligned third audio and fourth audio;
    wherein the third audio is one of the vocals and the accompaniment and the fourth audio is the other of the vocals and the accompaniment; or the third audio is one of the vocal segment of the vocals and the accompaniment segment of the accompaniment and the fourth audio is the other of the vocal segment and the accompaniment segment.
  8. The method according to claim 7, wherein aligning the first tempo of the third audio with the second tempo of the fourth audio comprises:
    adjusting the second tempo of the fourth audio with the first tempo of the third audio as the reference, so that the third audio and the fourth audio have the same tempo.
  9. The method according to any one of claims 2 to 3 and 6, wherein the first interface comprises:
    a first playback control, a first delete control, and a first replace control associated with the vocals, the first playback control being used to audition the vocals, the first delete control being used to delete the vocals, and the first replace control being used to replace the vocals; and
    a second playback control, a second delete control, and a second replace control associated with the accompaniment, the second playback control being used to audition the accompaniment, the second delete control being used to delete the accompaniment, and the second replace control being used to replace the accompaniment.
  10. The method according to claim 3, further comprising: in response to a touch operation on the third control on the first interface, jumping to a third interface, the third interface comprising a third playback control used to trigger playback of the target audio.
  11. The method according to claim 3, further comprising:
    in response to a touch operation on a cover editing control on a third interface, displaying a first window, the first window comprising a cover import control, one or more preset static cover controls, and one or more preset animation effect controls; and
    in response to a control selection operation on the first window, acquiring a target cover;
    the target cover being a static cover or a dynamic cover.
  12. The method according to claim 11, wherein, if the target cover is a dynamic cover, acquiring the target cover in response to the control selection operation on the first window comprises:
    acquiring a static cover and an animation effect in response to the control selection operation on the first window; and
    generating, according to the audio characteristics of the target audio, the static cover, and the animation effect, a dynamic cover that changes with the audio characteristics of the target audio;
    wherein the audio characteristics comprise the audio beat and/or the volume.
  13. The method according to claim 3, further comprising:
    in response to an export instruction on a third interface, exporting data associated with the target audio to a target location; the target location comprising an album or a file system.
  14. The method according to claim 3, further comprising:
    in response to a sharing instruction on a third interface, sharing data associated with the target audio to a target application.
  15. The method according to claim 13 or 14, wherein the data associated with the target audio comprises at least one of the following:
    the target audio, the vocals, the accompaniment, the vocal segment of the vocals, the accompaniment segment of the accompaniment, the static cover of the target audio, and the dynamic cover of the target audio.
  16. The method according to claim 3, further comprising:
    in response to a touch operation on an audio editing control on a third interface, jumping from the third interface to a fourth interface, the fourth interface comprising audio processing function controls or a trigger control associated with the audio processing function controls, the trigger control being used to trigger display of the audio processing function controls;
    the audio processing function controls comprising one or more of the following:
    an audio optimization control, used to trigger editing of audio to optimize the audio;
    an accompaniment separation control, used to trigger separation of vocals and/or an accompaniment from audio;
    a style synthesis control, used to trigger separation of vocals from audio and mixing and editing of the separated vocals with a preset accompaniment; and
    an audio mashup control, used to trigger separation of vocals from first audio and of an accompaniment from second audio, and mixing and editing of the separated vocals with the separated accompaniment.
  17. An audio processing apparatus, comprising:
    an acquisition module configured to acquire vocals in response to a first instruction,
    the acquisition module being further configured to acquire an accompaniment in response to a second instruction; and
    a processing module configured to mix the vocals and the accompaniment in response to a third instruction to obtain target audio.
  18. An electronic device, comprising at least one processor and a memory;
    the memory storing computer-executable instructions; and
    the at least one processor executing the computer-executable instructions stored in the memory, causing the at least one processor to perform the audio processing method according to any one of claims 1 to 16.
  19. A computer-readable storage medium, storing computer-executable instructions which, when executed by a processor, implement the audio processing method according to any one of claims 1 to 16.
  20. A computer program product, comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 16.
  21. A computer program which, when executed by a processor, implements the method according to any one of claims 1 to 16.
PCT/CN2023/092377 2022-05-07 2023-05-05 音频处理方法、装置、设备及存储介质 WO2023217003A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210495456.4 2022-05-07
CN202210495456.4A CN117059055A (zh) 2022-05-07 2022-05-07 音频处理方法、装置、设备及存储介质

Publications (1)

Publication Number Publication Date
WO2023217003A1 true WO2023217003A1 (zh) 2023-11-16

Family

ID=88655974

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/092377 WO2023217003A1 (zh) 2022-05-07 2023-05-05 音频处理方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN117059055A (zh)
WO (1) WO2023217003A1 (zh)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109714671A (zh) * 2017-10-26 2019-05-03 张德明 一种无线k歌音响系统
WO2020034227A1 (zh) * 2018-08-17 2020-02-20 华为技术有限公司 一种多媒体内容同步方法及电子设备
CN111554329A (zh) * 2020-04-08 2020-08-18 咪咕音乐有限公司 音频剪辑方法、服务器及存储介质
WO2022062979A1 (zh) * 2020-09-23 2022-03-31 华为技术有限公司 音频处理方法、计算机可读存储介质、及电子设备
CN112967705A (zh) * 2021-02-24 2021-06-15 腾讯音乐娱乐科技(深圳)有限公司 一种混音歌曲生成方法、装置、设备及存储介质

Also Published As

Publication number Publication date
CN117059055A (zh) 2023-11-14

Similar Documents

Publication Publication Date Title
US10062367B1 (en) Vocal effects control system
US9972297B2 (en) Sound processing device, sound data selecting method and sound data selecting program
WO2021103314A1 (zh) 一种构造听音场景的方法和相关装置
US20110041059A1 (en) Interactive Multimedia Content Playback System
US8887051B2 (en) Positioning a virtual sound capturing device in a three dimensional interface
WO2016112841A1 (zh) 一种信息处理方法及客户端、计算机存储介质
WO2016007677A1 (en) Clip creation and collaboration
JP2014520352A (ja) エンハンスされたメディア記録およびプレイバック
WO2023051246A1 (zh) 视频录制方法、装置、设备及存储介质
JP2023538943A (ja) オーディオデータの処理方法、装置、機器及び記憶媒体
WO2012021799A2 (en) Browser-based song creation
WO2022160603A1 (zh) 歌曲的推荐方法、装置、电子设备及存储介质
US11272136B2 (en) Method and device for processing multimedia information, electronic equipment and computer-readable storage medium
WO2023217003A1 (zh) 音频处理方法、装置、设备及存储介质
US9705953B2 (en) Local control of digital signal processing
WO2023216999A1 (zh) 音频处理方法、装置、设备及存储介质
WO2023217002A1 (zh) 音频处理方法、装置、设备及存储介质
WO2020124679A1 (zh) 视频处理参数信息的预配置方法、装置及电子设备
Jago Adobe Audition CC Classroom in a Book
WO2023160713A1 (zh) 音乐生成方法、装置、设备、存储介质及程序
WO2024012257A1 (zh) 音频处理方法、装置及电子设备
Adobe Creative Team Adobe Audition CS6 Classroom in a Book
WO2023010949A1 (zh) 一种音频数据的处理方法及装置
CN116403549A (zh) 音乐生成方法、装置、电子设备及存储介质
US20230050370A1 (en) A portable interactive music player

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23802775

Country of ref document: EP

Kind code of ref document: A1