WO2022143530A1 - Audio processing method, apparatus, computer device and storage medium - Google Patents

Audio processing method, apparatus, computer device and storage medium

Info

Publication number: WO2022143530A1
Application number: PCT/CN2021/141662
Authority: WO (WIPO (PCT))
Prior art keywords: audio, component, audio signal, target, separation model
Other languages: English (en), French (fr)
Inventor: 张超钢
Original Assignee: 广州酷狗计算机科技有限公司
Application filed by 广州酷狗计算机科技有限公司
Publication of WO2022143530A1 (zh)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 - Sound input; sound output
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/08 - Learning methods

Definitions

  • the embodiments of the present application relate to the field of computer technologies, and in particular, to an audio processing method, an apparatus, a computer device, and a storage medium.
  • For example, audio processing software has the function of playing audio, and people can listen to audio in their leisure time; audio processing software also has the function of adding sound effects, such as reverb and equalization (EQ), to audio.
  • Embodiments of the present application provide an audio processing method, apparatus, computer equipment, and storage medium, which improve the flexibility of audio processing.
  • the technical solution is as follows:
  • an audio processing method comprising:
  • for each target component, processing the first audio signal of the target component according to the playback parameters set for the target component to obtain the second audio signal of the target component;
  • the second audio signal of each target component is fused with the third audio signal of other components in the target audio except the at least one target component to obtain the processed target audio.
  • an audio processing method comprising:
  • the target audio is composed of multiple components, and the components are human voice components or any musical instrument sound components;
  • a time-domain separation model and a frequency-domain separation model are used to obtain the same type of components from the audio;
  • the time domain separation model and the frequency domain separation model are invoked to separate the first audio signal of each of at least one component from the fourth audio signal of the target audio.
  • an audio processing device comprising:
  • the display module is used to display the playback parameter setting options of the multiple components separated in the target audio through the playback parameter setting interface, and the components are human voice components or any musical instrument sound components;
  • a determination module configured to determine a playback parameter set for the at least one target component in response to a triggering operation of a playback parameter setting option for at least one target component, where the target component is any one of the multiple components;
  • a processing module configured to process the first audio signal of the target component according to the playback parameters set for the target component for each target component to obtain the second audio signal of the target component;
  • a fusion module configured to fuse the second audio signal of each target component with the third audio signal of other components in the target audio except the at least one target component to obtain the processed target audio.
  • the playback parameter includes a volume parameter
  • the processing module is configured to adjust the amplitude of the first audio signal of the target component according to the volume parameter set for the target component, to obtain the second audio signal of the target component;
  • the playback parameters include sound effect parameters
  • the processing module is configured to perform sound effect processing on the first audio signal of the target component according to the sound effect parameters set for the target component, to obtain the second audio signal of the target component; or,
  • the playback parameters include timbre parameters, where the timbre parameters indicate the timbre of the audio; the processing module is used to obtain musical score information corresponding to the target component, where the musical score information represents the pitch of the target component, and to generate the second audio signal of the target component according to the score information and the timbre parameters.
  • the apparatus further includes:
  • an obtaining module configured to obtain the first audio signals of the plurality of components from the server.
  • the apparatus further includes:
  • a separation module configured to invoke a time-domain separation model and a frequency-domain separation model, and separate the first audio signal of each of the multiple components from the fourth audio signal of the target audio;
  • the separation module is configured to determine a first real part signal and a first imaginary part signal of a first frequency spectrum, where the first frequency spectrum is the frequency spectrum of the target audio; invoke the frequency domain separation model to separate a second real part signal and a second imaginary part signal of each component from the first real part signal and the first imaginary part signal; and determine the first audio signal of each component based on the second real part signal and the second imaginary part signal of that component.
  • the separation module includes:
  • a time-domain separation unit configured to invoke the time-domain separation model, and separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time-domain information of the target audio;
  • a frequency domain separation unit configured to invoke the frequency domain separation model and, for each component, separate the first audio signal of the component from the fifth audio signal of the component based on the frequency domain information of the component, to obtain the first audio signal of each component;
  • the time-domain separation model and the frequency-domain separation model are used to obtain the same type of components from audio.
  • the separation module includes:
  • a frequency domain separation unit configured to call the frequency domain separation model, and separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio;
  • a time-domain separation unit for invoking the time-domain separation model and, for each component, separating the first audio signal of the component from the sixth audio signal of the component based on the time-domain information of the component, to obtain the first audio signal of each component;
  • the time-domain separation model and the frequency-domain separation model are used to obtain the same type of components from audio.
  • the separation module includes:
  • a frequency domain separation unit configured to call the frequency domain separation model, and separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio;
  • a time-domain separation unit configured to invoke the time-domain separation model, and separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time-domain information of the target audio;
  • the fusion unit is configured to, for each component, perform fusion processing on the fifth audio signal of the component and the sixth audio signal of the component to obtain the first audio signal of the component.
  • the apparatus further includes:
  • an acquisition module for acquiring sample data, the sample data comprising sample audio and a sample audio signal of each of the multiple components of the sample audio;
  • the separation module is configured to call the frequency domain separation model and, based on the frequency domain information of the sample audio, separate the first predicted audio signal of each of the multiple components from the sample audio signal of the sample audio;
  • the separation module is further configured to call the time-domain separation model, and based on the time-domain information of the sample audio, separate the second predicted audio signal of each component from the sample audio signal of the sample audio;
  • the separation module is further configured to, for each component, perform fusion processing on the first predicted audio signal of the component and the second predicted audio signal of the component to obtain the third predicted audio signal of the component;
  • a training module configured to train the frequency domain separation model and the time domain separation model according to the difference between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data.
  • an audio processing device comprising:
  • an audio acquisition module used for acquiring target audio
  • the target audio is composed of a plurality of components, and the components are human voice components or any musical instrument sound components;
  • a model acquisition module for acquiring a time-domain separation model and a frequency-domain separation model, the time-domain separation model and the frequency-domain separation model are used to obtain the same type of components from the audio;
  • a separation module configured to invoke the time-domain separation model and the frequency-domain separation model to separate the first audio signal of each component of at least one component from the fourth audio signal of the target audio.
  • the separation module includes:
  • a time-domain separation unit configured to invoke the time-domain separation model, and separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time-domain information of the target audio;
  • a frequency domain separation unit configured to invoke the frequency domain separation model and, for each component, separate the first audio signal of the component from the fifth audio signal of the component based on the frequency domain information of the component.
  • the separation module includes:
  • a frequency domain separation unit configured to call the frequency domain separation model, and separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio;
  • a time-domain separation unit for invoking the time-domain separation model and, for each component, separating the first audio signal of the component from the sixth audio signal of the component based on the time-domain information of the component, to obtain the first audio signal of each component.
  • the frequency domain separation unit is configured to determine a first real part signal and a first imaginary part signal of a first frequency spectrum, where the first frequency spectrum is the frequency spectrum corresponding to the target audio; invoke the frequency domain separation model to separate the second real part signal and the second imaginary part signal of each component from the first real part signal and the first imaginary part signal of the first spectrum; and determine the sixth audio signal of each component based on the second real part signal and the second imaginary part signal of that component.
  • the separation module includes:
  • a frequency domain separation unit configured to call the frequency domain separation model, and separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio;
  • a time-domain separation unit configured to invoke the time-domain separation model, and separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time-domain information of the target audio;
  • the fusion unit is configured to, for each component, perform fusion processing on the fifth audio signal of the component and the sixth audio signal of the component to obtain the first audio signal of the component.
  • the apparatus further includes:
  • a sample acquisition module for acquiring sample data, the sample data including sample audio and a sample audio signal of each component in at least one component of the sample audio;
  • the separation module is configured to call the frequency domain separation model and, based on the frequency domain information of the sample audio, separate the first predicted audio signal of each of the at least one component from the sample audio signal of the sample audio;
  • the separation module is configured to invoke the time-domain separation model, and based on the time-domain information of the sample audio, separate the second predicted audio signal of each component from the sample audio signal of the sample audio;
  • a fusion module configured to, for each component, perform fusion processing on the first predicted audio signal of the component and the second predicted audio signal of the component, to obtain the third predicted audio signal of the component;
  • a training module configured to train the frequency domain separation model and the time domain separation model according to the difference between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data.
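The joint training described in the modules above can be illustrated with a short sketch. This is a minimal illustration rather than the application's implementation: the model call signatures, the averaging fusion, and the L1 loss are assumptions.

```python
import torch
import torch.nn.functional as F

def train_step(freq_model, time_model, optimizer, sample_audio, sample_signals):
    """One joint update: both separation models predict every component from
    the sample audio, the per-component predictions are fused (here by
    averaging, one plausible fusion), and both models are trained on the
    difference between the fused prediction and the sample audio signals."""
    first_pred = freq_model(sample_audio)          # frequency-domain predictions
    second_pred = time_model(sample_audio)         # time-domain predictions
    third_pred = 0.5 * (first_pred + second_pred)  # fused (third) predictions
    loss = F.l1_loss(third_pred, sample_signals)   # difference to sample signals
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```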
  • In another aspect, a computer device is provided, including a processor and a memory, where the memory stores at least one piece of program code, and the at least one piece of program code is loaded and executed by the processor to implement the operations performed in the audio processing method described in the above aspects.
  • In another aspect, a computer-readable storage medium is provided, where at least one piece of program code is stored in the computer-readable storage medium, and the at least one piece of program code is loaded and executed by a processor to implement the operations performed in the audio processing method described in the above aspects.
  • In another aspect, a computer program product is provided, where at least one piece of program code is stored in the computer program product, and the at least one piece of program code is loaded and executed by a processor to implement the operations performed in the audio processing method described in the above aspects.
  • When processing audio, the audio processing method, apparatus, device, and medium provided by the embodiments of the present application can set playback parameters for one or more components in the audio through the playback parameter setting interface.
  • For each such component, the audio signal of the component is processed using the playback parameters set for that component, so the audio signal of an individual component in the audio can be processed separately. Therefore, different personalized playback effects can be set for different components in the same audio, which improves the flexibility of audio processing.
  • FIG. 1 is a schematic structural diagram of an implementation environment provided by an embodiment of the present application.
  • FIG. 2 is a flowchart of an audio processing method provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of an audio processing method provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a playback parameter setting interface provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a playback parameter setting interface provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a playback parameter setting interface provided by an embodiment of the present application.
  • FIG. 7 is a flowchart of an audio processing method provided by an embodiment of the present application.
  • FIG. 8 is a flowchart of an audio processing method provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of an audio processing apparatus provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of another audio processing apparatus provided by an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of an audio processing apparatus provided by an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of another audio processing apparatus provided by an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • The terms "first", "second", "third", "fourth", "fifth", "sixth", etc. used in this application may be used herein to describe various concepts, but these concepts are not limited by these terms unless otherwise specified. These terms are only used to distinguish one concept from another. For example, a first component may be referred to as a second component, and a second component may be referred to as a first component, without departing from the scope of this application.
  • "At least one" includes one, two, or more; "multiple" includes two or more; "each" refers to each one of the corresponding multiple; and "any" refers to any one of the multiple.
  • For example, if multiple components includes three components, "each" refers to each of the three components, and "any" refers to any one of the three components, which can be the first, the second, or the third.
  • the audio processing method provided by the embodiment of the present application is executed by a computer device.
  • the computer device is a terminal, such as a mobile phone, a tablet computer, a computer, and the like.
  • optionally, the computer device is a server, where the server is a single server, a server cluster composed of several servers, or a cloud computing service center.
  • the computer equipment includes a terminal and a server.
  • FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application.
  • the implementation environment includes at least one terminal 101 and a server 102.
  • the terminal 101 and the server 102 are connected through a wireless or wired network.
  • a target application provided by the server 102 is installed on the terminal 101, and the terminal 101 can realize functions such as data transmission and message interaction through the target application.
  • the target application is an application in the operating system of the terminal 101, or an application provided by a third party.
  • the target application is an audio processing application, and the audio processing application has a function of playing audio.
  • the audio processing application can also have other functions, such as a recording function, a live broadcast function, a sharing function, a sound effect adjustment function, and the like.
  • the server 102 is a background server of the target application or a cloud server that provides services such as cloud computing and cloud storage.
  • In one implementation, the terminal 101 sends the playback parameters set for at least one target component in the audio to the server 102, and the server 102 performs personalized processing on the audio signal of the at least one target component in the audio based on the received playback parameters, to obtain the processed audio.
  • The processed audio is sent to the terminal 101, and the terminal 101 plays the processed audio.
  • the terminal 101 acquires audio signals of multiple components of audio from the server, and performs personalized processing on the audio signal of at least one target component of the multiple components to obtain processed audio.
  • the terminal 101 obtains audio from the server, separates the audio signal of multiple components from the audio, and performs personalized processing on the audio signal of at least one target component among the multiple components to obtain the processed audio. It should be noted that, in a possible implementation manner, the terminal 101 plays the processed audio after obtaining the processed audio.
  • the audio is usually composed of human voice and musical instrument sound
  • the components in the audio refer to the vocal components, musical instrument sound components, etc. that make up the audio.
  • the user can personalize the components in the audio in the terminal interface, for example, increase the volume of the human voice, add sound effects to the bass in the accompaniment, replace the drum kit in the accompaniment with African drums, etc.
  • In a song recording scene, the terminal plays the original song so that the user sings along with it; the terminal records the user's singing, synthesizes the user's singing and the original song into a new audio, and can use the audio processing method provided by the embodiments of the present application to change the volume of the vocals in the original song. For example, by reducing the volume of the vocals in the original song before mixing them with the user's singing, the vocals in the original song serve as a harmony effect.
  • the embodiment of the present application only takes the audio playback scene and the song recording scene as examples to illustrate the audio processing scene, but does not limit the audio processing scene.
  • the audio processing method provided by the embodiments of the present application can also apply to any other audio processing scene.
  • FIG. 2 is a flowchart of an audio processing method provided by an embodiment of the present application.
  • the execution body of the embodiments of the present application is a computer device. Referring to FIG. 2, the method includes:
  • the playback parameter setting interface is an interface for setting the audio playback effect, and the playback parameter setting interface includes at least one playback parameter setting option for the user to adjust the audio playback parameters to change the audio playback effect.
  • the target audio is any audio in the computer device, for example, any song and so on.
  • Audio is usually composed of vocals and instrumental sounds, and the components in the audio refer to the vocal components, instrumental sound components, etc. that make up the audio.
  • the components included in the audio are vocal components and accompaniment components, wherein the accompaniment components refer to the other multiple musical instrument sound components in the audio except the vocal components.
  • optionally, the components included in the audio are vocal components, drum components, bass components, and remaining accompaniment components, where the remaining accompaniment components refer to the other components in the audio except the vocal, drum, and bass components.
  • the target component is any one of a plurality of components of the target audio, for example, a vocal component, a drum sound component, an accompaniment component, and the like.
  • Playback parameters are parameters used to control audio playback effects, such as volume parameters, sound effect parameters, timbre parameters, and the like.
  • the playback parameter of the target component is the parameter used to control the playback effect of the target component. It should be noted that, in this embodiment of the present application, the playback parameter of the target component is only used to control the playback effect of the target component; it does not control how other components are played.
  • the first audio signal is an audio signal of a target component in the target audio
  • the second audio signal is an audio signal obtained by processing the first audio signal according to the playback parameters set for the target component.
  • Compared with the first audio signal, the second audio signal has a changed playback effect, so the playback effect of the target component can be changed by setting the playback parameters of the target component.
  • the first audio signal is an audio signal of a target component in the target audio
  • the third audio signal is an audio signal of other components in the target audio except the target component.
  • the terms first audio signal and third audio signal are used only to distinguish the audio signal of the target component from the audio signals of the other components.
  • the components in the processed target audio are the same as the components in the target audio, but the audio signal of the target component in the processed target audio is not the same as the audio signal of the target component in the target audio.
  • the target component has a playback effect corresponding to the playback parameters set in step 202.
  • To sum up, when processing audio, playback parameters can be set for one or more components in the audio through a playback parameter setting interface, and for each component, the playback parameters set for the component are used to process the audio signal of the component, so the audio signal of a component in the audio can be processed separately. Therefore, different personalized playback effects can be set for different components in the same audio through the above method, which improves the flexibility of audio processing.
  • FIG. 3 is a flowchart of an audio processing method provided by an embodiment of the present application. In this embodiment, the terminal is taken as the execution subject for illustration. Referring to FIG. 3, the method includes:
  • the terminal acquires, from the server, a first audio signal of multiple components that have been separated from the target audio.
  • the terminal is installed with a target application
  • the server is a server that provides services for the target application.
  • the target application is an audio processing application, and the terminal can acquire audio from the server, process the audio, or play the audio.
  • In one implementation, the server stores multiple audios and the first audio signals of the multiple components separated from each audio, or the server stores only the first audio signals of the multiple components separated from each audio. Therefore, the terminal can directly acquire the first audio signals of the multiple components of the target audio from the server without performing separation processing on the target audio.
  • Optionally, the terminal acquires the first audio signals of the multiple components separated from the target audio from the server as follows: the terminal sends an audio acquisition request to the server, where the audio acquisition request carries the audio identifier of the target audio; the server receives the audio acquisition request and, based on the audio identifier of the target audio, sends the first audio signals of the multiple components separated from the target audio to the terminal, or sends the target audio together with the first audio signals of the multiple components separated from it to the terminal.
  • the audio identifier is any identifier such as the name of the audio, the author of the audio, the serial number of the audio, and the embodiment of the present application does not limit the audio identifier.
  • For example, when the user plays song A through the song playing application of the terminal, the terminal sends a song acquisition request to the server, where the song acquisition request carries the song name of song A; the server acquires the vocal component and each instrument sound component of song A according to the song name of song A, and sends the vocal component and each instrument sound component to the terminal.
  • song A is composed of vocals, piano sounds, drum sounds and bass sounds
  • each instrument sound component refers to piano sounds, drum sounds and bass sounds.
  • It should be noted that the above only takes the case where the server stores the first audio signals of the multiple components separated from the audio as an example to describe how the terminal obtains the first audio signals of the separated multiple components of the target audio. In another embodiment, only multiple audios are stored in the server; after the terminal acquires the audio from the server, the terminal performs separation processing on the acquired audio and separates the first audio signals of the multiple components from it. For the process of separating the first audio signals of multiple components from audio, refer to the embodiments shown in FIG. 7 and FIG. 8, and details are not repeated here.
  • the terminal stores the first audio signal of multiple components separated from the target audio, and the terminal directly acquires the first audio signal of the multiple components of the target audio locally.
  • the first audio signal locally stored by the terminal is acquired from a server; optionally, the first audio signal locally stored by the terminal is obtained by separating and processing the acquired audio.
  • the terminal displays, through a playback parameter setting interface, playback parameter setting options of multiple components that have been separated in the target audio.
  • the playback parameter setting interface is an interface for setting the audio playback effect
  • the playback parameter setting interface includes at least one playback parameter setting option.
  • the playback parameter setting options include at least one of a volume setting option, a sound effect setting option, or a timbre setting option.
  • a playback parameter setting option corresponding to each component is displayed in the playback parameter setting interface.
  • For example, the playback parameter setting interface includes a volume setting option for the vocal component, a volume setting option for the drum sound component, a volume setting option for the bass component, and a volume setting option for the other accompaniment; the volume of each component in the target audio can be adjusted through the playback parameter setting interface.
  • the playback parameter setting interface displays a plurality of playback parameter setting options corresponding to each component.
  • the playback parameter setting interface displays volume setting options and sound effect setting options for human voice components, volume setting options and sound effect setting options for drum components, volume setting options and sound effect setting options for bass components, and Volume setting options and sound effect setting options for other accompaniments.
  • the sound effect setting options are one or more options.
  • the sound effect setting options include reverberation options, soothing options, rock options, etc.; or, the sound effect setting option is used to trigger the display of a sound effect setting interface, and the sound effect setting interface includes multiple sound effect options such as reverberation options, soothing options, and rock options.
  • the playback parameter setting interface displays one or more playback parameter setting options corresponding to one component, that is, the playback parameter setting interface can only display playback parameter setting options for one component at a time.
  • optionally, the playback parameter setting interface includes a component selection option, where the component selection option is used to indicate which component's playback parameter setting options are displayed, or to indicate which component the playback parameters indicated by the current playback parameter setting options correspond to.
  • the playback parameter setting interface includes vocal options, drum sound options, bass sound options, other accompaniment options, and at least one playback parameter setting option.
  • When the vocal option is selected, the at least one playback parameter setting option is triggered to set playback parameters for the vocal component; when the bass sound option is selected, the at least one playback parameter setting option is triggered to set playback parameters for the bass sound component.
  • the above-mentioned terminal displays, through the playback parameter setting interface, playback parameter setting options of multiple components that have been separated in the target audio, including: acquiring the component identifiers of each component in the target audio, and according to the acquired component identifiers, through the playback parameter setting interface , to display the playback parameter setting options of the separated multiple components in the target audio, so that the components displayed on the playback parameter setting interface can be guaranteed to correspond to the components of the target audio.
  • the terminal determines, in response to a triggering operation of a playback parameter setting option of at least one target component, a playback parameter set for the at least one target component.
  • the target component is any one of multiple components separated from the target audio.
  • the target component refers to a component whose playback parameters are modified.
  • the playback parameter setting option is a volume adjustment option
  • the playback parameter is a volume parameter
  • the playback parameter setting option is a sound effect adjustment option
  • the playback parameter is a sound effect parameter
  • for example, the sound effect parameter is a sound effect name and the corresponding audio signal adjustment parameters, etc. Optionally, the playback parameter setting option is a timbre adjustment option, in which case the playback parameter is a timbre parameter; the timbre parameter is used to indicate that the timbre of the audio is adjusted to a target timbre, for example, the timbre parameter is the identifier of the target timbre.
  • each sound effect corresponds to at least one adjustment parameter for the audio signal, such as a frequency adjustment parameter, a phase adjustment parameter and the like of the audio signal.
  • It should be noted that the playback parameters set for each target component may be one or multiple; for example, the playback parameters set for a target component include a volume parameter and a sound effect parameter. Moreover, the playback parameters set for different target components can be the same or different.
  • For each target component, the terminal processes the first audio signal of the target component according to the playback parameters set for the target component, to obtain the second audio signal of the target component.
  • the types of playback parameters are different, and the processing methods for target components are also different.
  • The following takes the cases where the playback parameters are volume parameters, sound effect parameters, and timbre parameters as examples to illustrate the processing of target components.
  • Optionally, the playback parameters are other types of parameters; the embodiment of the present application limits neither the types of playback parameters nor the process of processing audio according to the playback parameters.
  • the playback parameters include volume parameters
  • processing the first audio signal of the target component to obtain the second audio signal of the target component includes: for each target component, adjusting the amplitude of the first audio signal of the target component according to the volume parameter set for the target component, to obtain the second audio signal of the target component.
  • The playback volume of audio is determined by the amplitude of its audio signal. Only the amplitude differs between the first audio signal and the second audio signal; the frequency, phase, and other information are the same. Therefore, after the amplitude of the first audio signal is adjusted, only the volume of the target component is changed, while the timbre, playback speed, etc. of the target component are unchanged; subsequently, playback is based on the second audio signal of the target component.
  • audio A includes vocal components, bass components and drum components.
  • the volume of audio A is 10. Now the volume of the vocal components is adjusted to 20, but the volume of the bass components and drum components is not adjusted.
  • the terminal adjusts the amplitude of the audio signal of the vocal component to obtain the adjusted vocal signal, and fuses the adjusted vocal signal with the audio signal of the bass component and the audio signal of the drum component to obtain the processed target audio. In the processed target audio, the volume of the human voice component is 20 while the volume of the bass component and the drum component remains 10, so the human voice in the processed target audio is louder during playback.
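A minimal sketch of the amplitude adjustment in the example above, assuming signals are floating-point arrays in [-1, 1] and that the volume parameter maps to a linear gain relative to a reference volume of 10 (both assumptions made for illustration):

```python
import numpy as np

def apply_volume(first_audio_signal, volume_param, reference_volume=10.0):
    """Scale only the amplitude of a component; frequency and phase are
    untouched, so timbre and playback speed do not change."""
    gain = volume_param / reference_volume
    second_audio_signal = first_audio_signal * gain
    return np.clip(second_audio_signal, -1.0, 1.0)  # avoid clipping on fusion

# Raising the vocal component from volume 10 to 20 doubles its amplitude:
# vocals_adjusted = apply_volume(vocals, volume_param=20.0)
```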
  • the playback parameters include sound effect parameters
  • processing the first audio signal of the target component to obtain the second audio signal of the target component includes: for each target component, performing sound effect processing on the first audio signal of the target component according to the sound effect parameters set for the target component, to obtain the second audio signal of the target component.
  • the sound effect parameter indicates the sound effect of the audio.
  • the sound effect parameters include volume parameters, playback speed parameters, frequency adjustment parameters, phase adjustment parameters, and the like.
  • the soothing sound effect includes a volume reduction parameter, a playback speed reduction parameter, a frequency reduction parameter, etc., wherein the reduction parameter is used to indicate a reduced amplitude.
  • For example, when the sound effect parameters include a volume parameter, a playback speed parameter, and a frequency adjustment parameter, performing sound effect processing on the first audio signal of the target component to obtain the second audio signal includes: adjusting the amplitude, duration, and frequency of the first audio signal to obtain the second audio signal, as sketched below.
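As a hedged sketch of such a chain, a "soothing" effect could adjust amplitude, duration, and frequency as follows; the concrete factors (0.8 gain, 0.9 speed, one semitone down) are assumptions rather than values from the application, and librosa's stretch/shift helpers stand in for whatever adjustment the sound effect parameters specify:

```python
import librosa  # pip install librosa

def apply_soothing(first_audio_signal, sr):
    """Soothing sound effect: lower volume, slower playback, lower frequency."""
    out = first_audio_signal * 0.8                               # amplitude down
    out = librosa.effects.time_stretch(out, rate=0.9)            # longer duration
    out = librosa.effects.pitch_shift(out, sr=sr, n_steps=-1.0)  # frequency down
    return out
```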
  • the playback parameters include timbre parameters, and the timbre parameters are used to indicate that the timbre of the audio is adjusted to the target timbre.
  • processing the first audio signal to obtain the second audio signal of the target component includes: for each target component in the at least one target component, obtaining musical score information corresponding to the target component, where the musical score information represents the pitch of the component; and generating, according to the musical score information and the timbre parameter, a second audio signal of the target component, the second audio signal having the target timbre.
  • the score information includes at least one note and the duration of each note.
  • the target component is the sound of a drum kit
  • the timbre parameter is the identifier of the African drum.
  • the drum sound component in the target audio is analyzed to obtain the score information of the drum sound component, and the African drum sound is generated according to the score information and the timbre parameters.
  • the African drum sound component has the same score information as the kit drum sound component, so the African drum sound component can be added to the target audio instead of the original drum kit sound component.
  • optionally, acquiring the musical score information corresponding to any component includes: determining the notes corresponding to the component according to the frequencies of the component's first audio signal; determining the duration of each note according to how long the component stays on the corresponding frequency; and generating the musical score information corresponding to the component from the multiple notes and the duration of each note.
  • For example, if the frequency of component A's audio signal from second 0 to second 1.5 is B, the corresponding note is C with a duration of 1.5 seconds; if the frequency from second 1.5 to second 2.5 is D, the corresponding note is E with a duration of 1 second; and if the frequency from second 2.5 to second 4 is F, the corresponding note is G with a duration of 1.5 seconds. The notes in the resulting score information of component A are thus C, E, and G in sequence, with durations of 1.5 seconds, 1 second, and 1.5 seconds.
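The frequency-to-note mapping in this example can be sketched as follows; the equal-tempered conversion and the (frequency, duration) segment representation are assumptions for illustration:

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def freq_to_note(freq_hz):
    """Map a frequency to the nearest equal-tempered note name."""
    midi = int(round(69 + 12 * np.log2(freq_hz / 440.0)))
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)

def extract_score(segments):
    """segments: (frequency_hz, duration_s) pairs taken from the component's
    first audio signal; returns (note, duration) pairs as score information."""
    return [(freq_to_note(f), d) for f, d in segments]

# extract_score([(261.6, 1.5), (329.6, 1.0), (392.0, 1.5)])
# -> [('C4', 1.5), ('E4', 1.0), ('G4', 1.5)]
```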
  • Timbre is determined by the waveform of the audio signal, eg, the harmonic amplitude, phase offset, etc. of the audio signal. Therefore, the audio with the target timbre can be obtained by generating the corresponding audio signal.
  • generating the second audio signal of the target component according to the score information and the timbre parameters including: inputting the score information and the timbre parameters into an audio signal synthesizer, and obtaining the second audio signal output by the audio signal synthesizer, The audio signal synthesizer is used to synthesize audio signals according to the input score information and the specified timbre.
  • Optionally, waveform features corresponding to multiple timbres are stored in the audio signal synthesizer. Inputting the score information and the timbre parameter into the audio signal synthesizer and obtaining the second audio signal output by the audio signal synthesizer includes: the audio signal synthesizer determines the waveform feature corresponding to the timbre parameter according to the timbre parameter, and synthesizes the corresponding second audio signal according to that waveform feature and the frequencies of the audio signal indicated by the score information and the duration of each frequency.
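A toy additive synthesizer in the spirit of this description; modeling the stored waveform feature as relative harmonic amplitudes is an assumption for illustration:

```python
import numpy as np

def synthesize(score, harmonic_amps, sr=44100):
    """Render a second audio signal from score information (a list of
    (frequency_hz, duration_s) pairs) and one timbre's waveform feature
    (here: relative harmonic amplitudes)."""
    chunks = []
    for f0, dur in score:
        t = np.arange(int(dur * sr)) / sr
        tone = sum(a * np.sin(2 * np.pi * (k + 1) * f0 * t)
                   for k, a in enumerate(harmonic_amps))
        chunks.append(tone / max(np.abs(tone).max(), 1e-9))  # per-note normalize
    return np.concatenate(chunks)

# A brighter timbre keeps stronger upper harmonics:
# second_audio_signal = synthesize([(261.6, 1.5)], harmonic_amps=[1.0, 0.5, 0.25])
```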
  • optionally, the terminal or the server stores musical instrument materials of multiple musical instruments, where each musical instrument material is audio corresponding to one musical instrument; by changing the pitch and rhythm of a musical instrument material, the musical instrument material can replace the target component.
  • optionally, target audio material whose timbre is the target timbre is obtained according to the timbre parameter; the target audio material is adjusted according to the score information to obtain the adjusted target audio material, and the adjusted target audio material is used as the target component.
  • adjusting the target audio material according to the musical score information to obtain the adjusted target audio material means: adjusting the pitch and rhythm of the target audio material according to the musical score information, so that the musical score information corresponding to the adjusted target audio material is the same as the musical score information corresponding to the target component.
  • adjusting the pitch of the target audio material refers to: adjusting the frequency of the audio signal of the target audio material.
  • Adjusting the tempo of the target audio material refers to: adjusting the duration of each frequency in the audio signal of the target audio material.
  • the terminal fuses the second audio signal of each target component with the third audio signal of other components in the target audio except at least one target component, to obtain the processed target audio.
  • Optionally, the terminal fuses the second audio signal of each target component with the third audio signals of the other components in the target audio except the at least one target component as follows: the terminal superimposes the second audio signal of each target component and the third audio signals of the other components to obtain an eighth audio signal, and the processed target audio is composed of the eighth audio signal; that is, obtaining the eighth audio signal amounts to obtaining the processed target audio.
  • the eighth audio signal is the audio signal of the processed target audio, and the embodiment of the present application only uses the eighth audio signal as an example to distinguish the processed audio signal of the target audio from other audio signals.
  • After the processed target audio is obtained, the processed target audio can be played, or the processed target audio can be stored; this embodiment of the present application does not limit this.
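A minimal superposition sketch, assuming all component signals are time-aligned numpy arrays of equal length passed in as lists:

```python
import numpy as np

def fuse(second_signals, third_signals):
    """Superimpose the processed target components (second audio signals)
    and the untouched components (third audio signals); the sum is the
    eighth audio signal, i.e. the processed target audio."""
    eighth_audio_signal = np.sum(second_signals + third_signals, axis=0)
    peak = np.abs(eighth_audio_signal).max()
    # keep the mix in range after superposition
    return eighth_audio_signal / peak if peak > 1.0 else eighth_audio_signal

# processed_audio = fuse([vocals_adjusted], [bass_signal, drum_signal])
```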
  • To sum up, when processing audio, playback parameters can be set for one or more components in the audio through a playback parameter setting interface, and for each component, the playback parameters set for the component are used to process the audio signal of the component, so the audio signal of a component in the audio can be processed separately. Therefore, different personalized playback effects can be set for different components in the same audio through the above method, which improves the flexibility of audio processing.
  • the terminal can directly obtain the first audio signal of multiple components separated from the audio from the server, without the need for the terminal to perform audio separation processing, which reduces the requirements for the terminal and improves the efficiency of the terminal in processing audio.
  • the audio processing method provided by the embodiment of the present application provides a variety of processing methods for components in the audio, and can process the volume, sound effect and timbre of any component, thereby improving the diversity and flexibility of processing.
  • It should be noted that the above embodiments of FIG. 2 to FIG. 3 are only exemplary descriptions of the processing of any one or more components in the audio.
  • The following embodiments of FIG. 7 to FIG. 8 exemplify the process of separating components from audio.
  • FIG. 7 is a flowchart of an audio processing method provided by an embodiment of the present application. Referring to FIG. 7, the method is applied in computer equipment, and the method includes:
  • Acquire target audio, where the target audio is composed of multiple components, and the components are human voice components or any musical instrument sound components.
  • the target audio is any audio, for example, the target audio is the audio of any song, the audio of any symphony, etc.
  • the embodiment of the present application does not limit the target audio.
  • the time-domain separation model is a model for separating audio based on the time-domain information of the audio.
  • optionally, the time-domain separation model is a model such as Wave-U-Net (a U-shaped neural network that operates on the signal waveform) or TasNet (Time-domain Audio Separation Network, a single-channel speech separation neural network).
  • the frequency domain separation model is a model used to separate audio based on the frequency domain information of the audio.
  • optionally, the frequency domain separation model is a model such as U-Net (a U-shaped neural network) or open-unmix (a frequency-domain separation model).
  • Because the time-domain separation model and the frequency-domain separation model separate audio based on different information in the audio, the two models are complementary; using them together to separate the audio allows the components to be separated more accurately.
  • To sum up, the time-domain separation model and the frequency-domain separation model can separate audio based on different information in the audio and are complementary, so having the time-domain separation model and the frequency-domain separation model separate the audio together can separate the components more accurately and improve the audio separation effect.
  • FIG. 8 is a flowchart of an audio processing method provided by an embodiment of the present application.
  • In this embodiment, the process in which the computer device separates audio is taken as an example for description. Referring to FIG. 8, the method includes:
  • the computer device acquires target audio, where the target audio is composed of multiple components, and the components are human voice components or any musical instrument sound components.
  • the step 801 is the same as the step 701, and details are not repeated here.
  • the computer device obtains a time-domain separation model and a frequency-domain separation model, where the time-domain separation model and the frequency-domain separation model are used to obtain the same type of components from audio.
  • the time-domain separation model and the frequency-domain separation model obtained in step 802 are trained models, and the time-domain separation model and the frequency-domain separation model have certain separation accuracy.
  • the time-domain separation model and the frequency-domain separation model are used to obtain the same type of components from the audio, which means to separate the audio signal of the same component from the audio.
  • For example, the time-domain separation model is used to separate the audio signal of the vocal component, the audio signal of the drum component, and the audio signal of the other accompaniment from the audio, and the frequency-domain separation model is likewise used to separate the audio signal of the vocal component, the audio signal of the drum component, and the audio signal of the other accompaniment from the audio.
  • Optionally, the time-domain separation model is used to separate the first audio signal of a target component from the audio, and the frequency-domain separation model is also used to separate the first audio signal of the same target component from the audio, where the target component is the human voice component, the accompaniment component, or any musical instrument sound component.
  • Optionally, the time-domain separation model is used to separate the audio signals of multiple components from the audio, and the frequency-domain separation model is also used to separate the audio signals of the same multiple components from the audio.
  • For example, the time-domain separation model is used to separate the vocal component and the bass component from the audio, and the frequency-domain separation model is also used to separate the vocal component and the bass component from the audio.
  • the computer device invokes the frequency domain separation model, and based on the frequency domain information of the target audio, separates the sixth audio signal of each component from the fourth audio signal of the target audio.
  • the audio signal of the audio represents the regularity of the waveform of the audio changing with time. Therefore, the audio signal is the time domain information of the audio.
  • the spectrum is the frequency distribution curve of the audio, which represents the frequency domain information of the audio.
  • Both the audio time-domain information and the frequency-domain information contain information of various components in the audio. Therefore, based on the audio time-domain information or frequency-domain information, the audio signal of each component can be separated from the audio.
  • The following first takes separating the audio signal of each component from the audio based on the frequency-domain information of the audio as an example, and then takes separating the audio signal of each component from the audio based on the time-domain information of the audio as an example.
  • In one possible implementation, invoking the frequency domain separation model and separating the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency-domain information of the target audio includes: generating a second frequency spectrum based on the fourth audio signal of the target audio; calling the frequency domain separation model to separate the amplitude information corresponding to each component from the second frequency spectrum; and generating the sixth audio signal of each component based on the amplitude information of each component.
  • the fourth audio signal is the audio signal of the target audio
  • the sixth audio signal is the audio signal of each component separated from the target audio.
  • the embodiment of the present application uses the terms fourth audio signal and sixth audio signal only to distinguish the audio signal of the target audio from the audio signals of the separated components.
  • the second frequency spectrum is a curve in which the amplitude of the fourth audio signal of the target audio is arranged according to frequency. Therefore, before calling the frequency domain separation model, the second frequency spectrum needs to be generated first.
  • optionally, generating the second frequency spectrum includes: performing a Fourier transform on the fourth audio signal of the target audio to obtain a complex signal; taking the sum of the squares of the real part information and the imaginary part information of the complex signal and then taking its square root to obtain the amplitude information of the fourth audio signal; and obtaining the curve of the amplitude information as a function of frequency, which is the second frequency spectrum, as sketched below.
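A sketch of that amplitude computation on a single frame (in practice the transform is applied frame by frame as a short-time Fourier transform; the framing detail is omitted here, and the sample rate is assumed):

```python
import numpy as np

def second_spectrum(fourth_audio_signal, sr=44100):
    """Fourier-transform the target audio, then take sqrt(real^2 + imag^2)
    per frequency bin to obtain the amplitude-versus-frequency curve."""
    complex_signal = np.fft.rfft(fourth_audio_signal)
    amplitude = np.sqrt(complex_signal.real ** 2 + complex_signal.imag ** 2)
    freqs = np.fft.rfftfreq(len(fourth_audio_signal), d=1.0 / sr)
    return freqs, amplitude
```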
  • Because, in this implementation, the frequency domain separation model can only separate amplitude information, the sixth audio signal of a component has to be generated from the separated amplitude information together with the phase information of the fourth audio signal of the target audio.
  • The embodiment of the present application therefore also provides another, more accurate separation method. In another possible implementation, invoking the frequency domain separation model and separating the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency-domain information of the target audio includes: determining the first real part signal and the first imaginary part signal of the first frequency spectrum; separating the second real part signal and the second imaginary part signal of each component from the first real part signal and the first imaginary part signal; and determining the sixth audio signal of each component based on the second real part signal and the second imaginary part signal of each component.
  • optionally, determining the first real part signal and the first imaginary part signal of the first frequency spectrum includes: performing a Fourier transform on the fourth audio signal of the target audio to obtain the first real part signal and the first imaginary part signal corresponding to the fourth audio signal, and obtaining the curve of the first real part signal and the first imaginary part signal as a function of frequency, which is the first frequency spectrum. Since the first spectrum is this curve, obtaining the first spectrum means determining the first real part signal and the first imaginary part signal in the first spectrum.
• The first real part signal and the first imaginary part signal contain both the amplitude information and the phase information of the audio signal. Therefore, determining the sixth audio signal of each component based on the second real part signal and the second imaginary part signal avoids introducing phase noise, and the obtained sixth audio signal is more accurate.
• In one possible implementation, determining the sixth audio signal of each component based on the second real part signal and the second imaginary part signal of each component includes: performing an inverse time-frequency transform on the second real part signal and the second imaginary part signal of each component to obtain the sixth audio signal of each component.
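The following sketch illustrates this complex-spectrum variant end to end; the `model` callable stands for the frequency domain separation model and is a hypothetical placeholder, not an API defined by the source:

```python
import numpy as np
from scipy.signal import stft, istft

def separate_complex(mixture, model, sample_rate, n_fft=2048):
    """Complex-spectrum separation as described above (sketch).

    `model` is a hypothetical frequency domain separation model that
    maps the mixture's real/imaginary spectrum to one component's
    real/imaginary spectrum, preserving phase information.
    """
    _, _, spec = stft(mixture, fs=sample_rate, nperseg=n_fft)
    # First real/imaginary part signals of the first frequency spectrum
    mix_real_imag = np.stack([spec.real, spec.imag])
    # Second real/imaginary part signals of the component
    comp_real, comp_imag = model(mix_real_imag)
    comp_spec = comp_real + 1j * comp_imag
    # Inverse time-frequency transform yields the sixth audio signal
    _, component = istft(comp_spec, fs=sample_rate, nperseg=n_fft)
    return component
```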
  • the computer device invokes the time-domain separation model, and for each component, separates the first audio signal of the component from the sixth audio signal of the component based on the time-domain information of the component to obtain the first audio signal of each component.
• That is, the computer device separates the output of the frequency domain separation model a second time using the time domain separation model. For example, the frequency domain separation model is used to separate the vocal component from the audio, but the separated vocal component may still contain some drum sounds; therefore, the vocal component separated by the frequency domain separation model is input into the time domain separation model, and the time domain separation model separates the vocal component again.
• In other words, the sixth audio signal of each component is directly input into the time domain separation model, and for each component the time domain separation model separates the first audio signal of the component from the sixth audio signal of the component, as sketched below.
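Schematically, this cascade might be sketched as follows, where `freq_model` and `time_model` are hypothetical callables standing in for the two separation models:

```python
def cascade_separation(mixture, freq_model, time_model):
    """Frequency-then-time cascade (sketch; both models are hypothetical).

    freq_model: maps the mixture waveform to a rough estimate of one
        component (the sixth audio signal).
    time_model: refines that rough waveform into the first audio signal
        of the component, using time domain information.
    """
    sixth_audio_signal = freq_model(mixture)
    first_audio_signal = time_model(sixth_audio_signal)
    return first_audio_signal
```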
• The above takes the case in which the frequency domain separation model is called first and the time domain separation model is called afterwards as an example, to illustrate the process of calling the time domain separation model and the frequency domain separation model to separate the first audio signal of each of at least one component from the fourth audio signal of the target audio.
• In another possible implementation, the time domain separation model is called first, and then the frequency domain separation model is called.
• In this case, calling the time domain separation model and the frequency domain separation model to separate the first audio signal of each of at least one component from the fourth audio signal of the target audio includes: calling the time domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time domain information of the target audio; and calling the frequency domain separation model to separate, for each component, the first audio signal of the component from the fifth audio signal of the component based on the frequency domain information of the component, obtaining the first audio signal of each component.
• That is, in this implementation the time domain separation model is called first, and the result of the time domain separation model is then separated again by the frequency domain separation model.
• In one possible implementation, calling the frequency domain separation model to separate, for each component, the first audio signal of the component from the fifth audio signal of the component based on the frequency domain information of the component includes: determining the third real part signal and the third imaginary part signal of the third frequency spectrum, where the third frequency spectrum is the frequency spectrum corresponding to the fifth audio signal of each component; calling the frequency domain separation model to separate, for each component, the fourth real part signal and the fourth imaginary part signal of the component from the third real part signal and the third imaginary part signal of the component; and determining the first audio signal of each component based on the fourth real part signal and the fourth imaginary part signal of each component.
  • the manner of acquiring the third spectrum is the same as the manner of acquiring the first spectrum, and details are not repeated here.
• In another possible implementation, the time domain separation model and the frequency domain separation model are called in parallel. In this case, calling the time domain separation model and the frequency domain separation model to separate the first audio signal of each of at least one component from the fourth audio signal of the target audio includes: calling the frequency domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio; calling the time domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time domain information of the target audio; and, for each component, fusing the fifth audio signal of the component with the sixth audio signal of the component to obtain the first audio signal of the component.
  • the time-domain separation model and the frequency-domain separation model are used in parallel to separate the audio.
• In one possible implementation, performing fusion processing on the fifth audio signal of the component and the sixth audio signal of the component to obtain the first audio signal of the component means: weighting the fifth audio signal and the sixth audio signal according to the weight of the fifth audio signal and the weight of the sixth audio signal, and summing the weighted signals to obtain the first audio signal.
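A minimal sketch of this weighted fusion follows; the equal default weights are an assumption for illustration, not values given by the source:

```python
import numpy as np

def fuse_estimates(fifth_audio, sixth_audio, w_time=0.5, w_freq=0.5):
    """Weighted fusion of the two estimates (sketch).

    fifth_audio: waveform estimated by the time domain model.
    sixth_audio: waveform estimated by the frequency domain model.
    """
    n = min(len(fifth_audio), len(sixth_audio))  # align lengths
    return w_time * np.asarray(fifth_audio)[:n] + w_freq * np.asarray(sixth_audio)[:n]
```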
  • the embodiment of the present application also provides a method for training the time-domain separation model and the frequency-domain separation model.
• In one possible implementation, before calling the frequency domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio, the method further includes: acquiring sample data, the sample data including sample audio and the sample audio signal of each of at least one component of the sample audio; calling the frequency domain separation model to separate the first predicted audio signal of each component from the sample audio signal of the sample audio based on the frequency domain information of the sample audio; calling the time domain separation model to separate the second predicted audio signal of each component from the sample audio signal of the sample audio based on the time domain information of the sample audio; for each component, fusing the first predicted audio signal of the component with the second predicted audio signal of the component to obtain the third predicted audio signal of the component; and training the frequency domain separation model and the time domain separation model according to the difference between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data, so that this difference converges.
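As a concrete illustration of this joint training scheme, a minimal PyTorch-style sketch might look as follows; the model objects, the equal-weight fusion, and the L1 loss are assumptions made for illustration, not details prescribed by the source:

```python
import torch

def train_step(freq_model, time_model, optimizer, sample_audio, target_components):
    """One joint training step (sketch; hypothetical models and loss).

    sample_audio: mixture waveform tensor, shape (batch, samples).
    target_components: ground-truth component waveforms,
        shape (batch, n_components, samples).
    """
    first_pred = freq_model(sample_audio)    # frequency domain prediction
    second_pred = time_model(sample_audio)   # time domain prediction
    third_pred = 0.5 * first_pred + 0.5 * second_pred  # fused prediction

    # Difference between the fused prediction and the sample audio signals
    loss = torch.nn.functional.l1_loss(third_pred, target_components)

    optimizer.zero_grad()
    loss.backward()   # gradients flow into both models
    optimizer.step()
    return loss.item()
```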
• The time domain separation model and the frequency domain separation model provided by the embodiments of the present application can separate at least one component from the audio; the embodiments of the present application also describe how to separate the audio when the time domain separation model and the frequency domain separation model can each separate only one component.
• In one possible implementation, the frequency domain separation model is a model for separating the audio signal of one component from the audio, and calling the frequency domain separation model to separate the first audio signal of the first component from the audio includes: determining the first real part signal and the first imaginary part signal of the first frequency spectrum, where the first frequency spectrum is the frequency spectrum corresponding to the target audio; calling the frequency domain separation model to separate the second real part signal and the second imaginary part signal of the first component from the first real part signal and the first imaginary part signal of the first frequency spectrum; and determining the first audio signal of the first component based on the second real part signal and the second imaginary part signal of the first component.
• Subsequently, based on the fourth audio signal of the target audio and the first audio signal of the first component, the first audio signal of the remaining components in the target audio is determined; the first component and the remaining components form the plurality of components, thereby separating the audio into multiple components.
• In another possible implementation, the frequency domain separation model and the time domain separation model are models for separating the audio signal of one component from the audio. In this case, the computer device separating the first audio signal of the first component from the audio includes: calling the time domain separation model and the frequency domain separation model to separate the first audio signal of the first component from the fourth audio signal of the target audio. Subsequently, based on the fourth audio signal of the target audio and the first audio signal of the first component, the first audio signal of the remaining components in the target audio is determined; the first component and the remaining components form the plurality of components, thereby separating the audio into multiple components.
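If the mixture is assumed to be the sample-wise sum of its components, the first audio signal of the remaining components can be derived by subtraction; the following sketch illustrates this under that assumption:

```python
import numpy as np

def remaining_components(fourth_audio, first_component):
    """Derive the residual signal of the remaining components (sketch,
    assuming the mixture is the sample-wise sum of its components).

    fourth_audio: waveform of the target audio (the mixture).
    first_component: separated waveform of the first component.
    """
    n = min(len(fourth_audio), len(first_component))
    return np.asarray(fourth_audio)[:n] - np.asarray(first_component)[:n]
```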
• In the audio processing method provided by the embodiments of the present application, because the time domain separation model and the frequency domain separation model separate the audio based on different information in the audio, the two models are complementary; separating the audio with the time domain separation model and the frequency domain separation model together can therefore separate the various components more accurately and improve the audio separation effect.
• FIG. 9 is a schematic structural diagram of an audio processing apparatus provided by an embodiment of the present application. Referring to FIG. 9, the apparatus includes:
  • the display module 901 is used to display the playback parameter setting options of multiple components separated in the target audio through the playback parameter setting interface, and the components are human voice components or any musical instrument sound components;
• a determining module 902, configured to determine, in response to a triggering operation on a playback parameter setting option of at least one target component, the playback parameters set for the at least one target component, where the target component is any one of the multiple components;
• a processing module 903, configured to process, for each target component, the first audio signal of the target component according to the playback parameters set for the target component to obtain the second audio signal of the target component;
  • the fusion module 904 is configured to fuse the second audio signal of each target component with the third audio signal of other components in the target audio except the at least one target component to obtain the processed target audio.
  • the playback parameter includes a volume parameter
• the processing module 903 is configured to adjust the amplitude of the first audio signal of the target component according to the volume parameter set for the target component to obtain the second audio signal of the target component; or,
  • the playback parameters include sound effect parameters
  • the processing module 903 is configured to perform sound effect processing on the first audio signal of the target component according to the sound effect parameters set for the target component to obtain the second audio signal of the target component; or,
  • the playback parameters include timbre parameters, which indicate the timbre of the audio.
  • the processing module 903 is configured to obtain musical score information corresponding to the target component, and the musical score information is used to represent the pitch of the target component; according to the musical score information and the The timbre parameter generates the second audio signal of the target component.
  • the device further includes:
  • the obtaining module 905 is configured to obtain the first audio signal of multiple components from the server.
  • the device further includes:
  • a separation module 906 configured to invoke a time-domain separation model and a frequency-domain separation model, and separate the first audio signal of each of the multiple components from the fourth audio signal of the target audio; or,
• the separation module 906 is configured to determine the first real part signal and the first imaginary part signal of the first frequency spectrum, where the first frequency spectrum is the frequency spectrum of the target audio; call the frequency domain separation model to separate the second real part signal and the second imaginary part signal of each of the multiple components from the first real part signal and the first imaginary part signal of the first frequency spectrum; and determine the first audio signal of each component based on the second real part signal and the second imaginary part signal of each component.
  • the separation module 906 includes:
• a time domain separation unit 9061, configured to call the time domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time domain information of the target audio;
• a frequency domain separation unit 9062, configured to call the frequency domain separation model to separate, for each component, the first audio signal of the component from the fifth audio signal of the component based on the frequency domain information of the component, obtaining the first audio signal of each component;
• wherein the time domain separation model and the frequency domain separation model are used to obtain the same type of components from the audio.
  • the separation module 906 includes:
• a frequency domain separation unit 9062, configured to call the frequency domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio;
• a time domain separation unit 9061, configured to call the time domain separation model to separate, for each component, the first audio signal of the component from the sixth audio signal of the component based on the time domain information of the component, obtaining the first audio signal of each component;
• wherein the time domain separation model and the frequency domain separation model are used to obtain the same type of components from the audio.
  • the separation module 906 includes:
• a frequency domain separation unit 9062, configured to call the frequency domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio;
• a time domain separation unit 9061, configured to call the time domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time domain information of the target audio;
• a fusion unit 9063, configured to, for each component, perform fusion processing on the fifth audio signal of the component and the sixth audio signal of the component to obtain the first audio signal of the component.
  • the device further includes:
• an acquisition module 905, configured to acquire sample data, the sample data including sample audio and a sample audio signal of each of the multiple components of the sample audio;
• the separation module 906 is configured to call the frequency domain separation model to separate the first predicted audio signal of each of the multiple components from the sample audio signal of the sample audio based on the frequency domain information of the sample audio;
• the separation module 906 is further configured to call the time domain separation model to separate the second predicted audio signal of each component from the sample audio signal of the sample audio based on the time domain information of the sample audio;
• the separation module 906 is further configured to perform, for each component, fusion processing on the first predicted audio signal of the component and the second predicted audio signal of the component to obtain the third predicted audio signal of the component;
• a training module 907, configured to train the frequency domain separation model and the time domain separation model according to the difference between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data.
  • FIG. 11 is a schematic structural diagram of an audio processing apparatus provided by an embodiment of the present application. Referring to FIG. 11 , the apparatus includes:
  • the audio acquisition module 1101 is used to acquire target audio, the target audio is composed of multiple components, and the component is a human voice component or any musical instrument sound component;
  • a model obtaining module 1102 configured to obtain a time-domain separation model and a frequency-domain separation model, where the time-domain separation model and the frequency-domain separation model are used to obtain components of the same type from the audio;
  • the separation module 1103 is configured to call the time domain separation model and the frequency domain separation model, and separate the first audio signal of each component of at least one component from the fourth audio signal of the target audio.
  • the separation module 1103 includes:
• a time domain separation unit 1113, configured to call the time domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time domain information of the target audio;
• a frequency domain separation unit 1123, configured to call the frequency domain separation model to separate, for each component, the first audio signal of the component from the fifth audio signal of the component based on the frequency domain information of the component.
  • the separation module 1103 includes:
• a frequency domain separation unit 1123, configured to call the frequency domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio;
• a time domain separation unit 1113, configured to call the time domain separation model to separate, for each component, the first audio signal of the component from the sixth audio signal of the component based on the time domain information of the component, obtaining the first audio signal of each component.
• In one possible implementation, the frequency domain separation unit 1123 is configured to determine the first real part signal and the first imaginary part signal of the first frequency spectrum, where the first frequency spectrum is the frequency spectrum corresponding to the target audio; call the frequency domain separation model to separate the second real part signal and the second imaginary part signal of each component from the first real part signal and the first imaginary part signal of the first frequency spectrum; and determine the sixth audio signal of each component based on the second real part signal and the second imaginary part signal of each component.
  • the separation module 1103 includes:
• a frequency domain separation unit 1123, configured to call the frequency domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio;
• a time domain separation unit 1113, configured to call the time domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time domain information of the target audio;
• a fusion unit 1133, configured to, for each component, perform fusion processing on the fifth audio signal of the component and the sixth audio signal of the component to obtain the first audio signal of the component.
  • the device further includes:
• a sample acquisition module 1104, configured to acquire sample data, the sample data including sample audio and a sample audio signal of each of at least one component of the sample audio;
• the separation module 1103 is configured to call the frequency domain separation model to separate the first predicted audio signal of each of the at least one component from the sample audio signal of the sample audio based on the frequency domain information of the sample audio;
• the separation module 1103 is further configured to call the time domain separation model to separate the second predicted audio signal of each component from the sample audio signal of the sample audio based on the time domain information of the sample audio;
• a fusion module 1105, configured to perform, for each component, fusion processing on the first predicted audio signal of the component and the second predicted audio signal of the component to obtain the third predicted audio signal of the component;
• a training module 1106, configured to train the frequency domain separation model and the time domain separation model according to the difference between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data.
• An embodiment of the present application further provides a computer device. The computer device includes a processor and a memory, the memory stores at least one piece of program code, and the at least one piece of program code is loaded and executed by the processor to implement the operations performed in the audio processing method of the above embodiments.
  • FIG. 13 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
• The terminal 1300 may be a portable mobile terminal, such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, or a desktop computer.
  • Terminal 1300 may also be called user equipment, portable terminal, laptop terminal, desktop terminal, and the like by other names.
  • the terminal 1300 includes: a processor 1301 and a memory 1302 .
  • the processor 1301 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
• The processor 1301 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array).
• The processor 1301 may also include a main processor and a coprocessor. The main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in the standby state.
• In some embodiments, the processor 1301 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen.
• In some embodiments, the processor 1301 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
• The memory 1302 may include one or more computer-readable storage media, which may be non-transitory. The memory 1302 may also include high-speed random access memory as well as non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 1302 is used to store at least one piece of program code, and the at least one piece of program code is to be executed by the processor 1301 to implement the audio processing methods provided by the method embodiments of this application.
  • the terminal 1300 may optionally further include: a peripheral device interface 1303 and at least one peripheral device.
  • the processor 1301, the memory 1302 and the peripheral device interface 1303 can be connected through a bus or a signal line.
  • Each peripheral device can be connected to the peripheral device interface 1303 through a bus, a signal line or a circuit board.
  • the peripheral device includes: at least one of a radio frequency circuit 1304 , a display screen 1305 , a camera assembly 1306 , an audio circuit 1307 , a positioning assembly 1308 and a power supply 1309 .
• Those skilled in the art will understand that the structure shown in FIG. 13 does not constitute a limitation on the terminal 1300; the terminal may include more or fewer components than shown, combine some components, or adopt a different component arrangement.
  • FIG. 14 is a schematic structural diagram of a server according to an exemplary embodiment.
  • the server 1400 may vary greatly due to different configurations or performance, and may include one or more processors (Central Processing Units, CPU) 1401 and one or more memories 1402, where at least one piece of program code is stored in the memory 1402, and at least one piece of program code is loaded and executed by the processor 1401 to implement the methods provided by the above method embodiments.
  • the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the server may also include other components for implementing device functions, which will not be described here.
  • the embodiment of the present application also provides a computer device, the computer device includes a processor and a memory, and at least one piece of program code is stored in the memory, and the at least one piece of program code is loaded by the processor and performs the following steps:
• display, through a playback parameter setting interface, playback parameter setting options of multiple components separated from target audio, the components being a human voice component or any musical instrument sound component; determine, in response to a triggering operation on a playback parameter setting option of at least one target component, the playback parameters set for the at least one target component; for each target component, process the first audio signal of the target component according to the playback parameters set for the target component to obtain the second audio signal of the target component;
  • the second audio signal of each target component is fused with the third audio signal of other components in the target audio except the at least one target component to obtain the processed target audio.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
  • the playback parameter includes a volume parameter, and according to the volume parameter set for the target component, the amplitude of the first audio signal of the target component is adjusted to obtain the second audio signal of the target component; or,
  • the playback parameters include sound effect parameters, and according to the sound effect parameters set for the target component, perform sound effect processing on the first audio signal of the target component to obtain the second audio signal of the target component; or,
• the playback parameters include timbre parameters, the timbre parameters indicate the timbre of the audio, and musical score information corresponding to the target component is obtained, the musical score information representing the pitch of the target component; and the second audio signal of the target component is generated according to the musical score information and the timbre parameters.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
  • the first audio signal of the plurality of components is obtained from the server.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
• determine the first real part signal and the first imaginary part signal of the first frequency spectrum, where the first frequency spectrum is the frequency spectrum of the target audio; call the frequency domain separation model to separate the second real part signal and the second imaginary part signal of each component from the first real part signal and the first imaginary part signal; and determine the first audio signal of each component based on the second real part signal and the second imaginary part signal of each component.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
• call the time domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time domain information of the target audio; call the frequency domain separation model to separate, for each component, the first audio signal of the component from the fifth audio signal of the component based on the frequency domain information of the component; wherein the time domain separation model and the frequency domain separation model are used to obtain the same type of components from audio.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
• call the frequency domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio; call the time domain separation model to separate, for each component, the first audio signal of the component from the sixth audio signal of the component based on the time domain information of the component; wherein the time domain separation model and the frequency domain separation model are used to obtain the same type of components from audio.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
• call the frequency domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio; call the time domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time domain information of the target audio; and, for each component, fuse the fifth audio signal of the component and the sixth audio signal of the component to obtain the first audio signal of the component.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
• acquire sample data, the sample data comprising sample audio and a sample audio signal of each of the multiple components of the sample audio; call the frequency domain separation model to separate the first predicted audio signal of each component from the sample audio signal of the sample audio based on the frequency domain information of the sample audio; call the time domain separation model to separate the second predicted audio signal of each component from the sample audio signal of the sample audio based on the time domain information of the sample audio;
• for each component, fuse the first predicted audio signal of the component and the second predicted audio signal of the component to obtain the third predicted audio signal of the component;
  • the frequency domain separation model and the time domain separation model are trained according to the difference between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data.
  • the embodiment of the present application also provides a computer device, the computer device includes a processor and a memory, and at least one piece of program code is stored in the memory, and the at least one piece of program code is loaded by the processor and performs the following steps:
• acquire target audio, where the target audio is composed of multiple components, and the components are human voice components or any musical instrument sound components;
• acquire a time domain separation model and a frequency domain separation model, where the time domain separation model and the frequency domain separation model are used to obtain the same type of components from the audio;
  • the time domain separation model and the frequency domain separation model are invoked to separate the first audio signal of each of at least one component from the fourth audio signal of the target audio.
• In one possible implementation, the at least one piece of program code is loaded by the processor and performs the following steps: call the time domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time domain information of the target audio; and call the frequency domain separation model to separate, for each component, the first audio signal of the component from the fifth audio signal of the component based on the frequency domain information of the component.
• In one possible implementation, the at least one piece of program code is loaded by the processor and performs the following steps: call the frequency domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio; and invoke the time domain separation model to separate, for each component, the first audio signal of the component from the sixth audio signal of the component based on the time domain information of the component.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
• determine the first real part signal and the first imaginary part signal of the first frequency spectrum, the first frequency spectrum being the frequency spectrum corresponding to the target audio; call the frequency domain separation model to separate the second real part signal and the second imaginary part signal of each component from the first real part signal and the first imaginary part signal; and determine the sixth audio signal of each component based on the second real part signal and the second imaginary part signal of each component.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
• acquire sample data, the sample data comprising sample audio and a sample audio signal for each of at least one component of the sample audio; call the frequency domain separation model to separate the first predicted audio signal of each component from the sample audio signal of the sample audio based on the frequency domain information of the sample audio; call the time domain separation model to separate the second predicted audio signal of each component from the sample audio signal of the sample audio based on the time domain information of the sample audio;
• for each component, fuse the first predicted audio signal of the component and the second predicted audio signal of the component to obtain the third predicted audio signal of the component;
  • the frequency domain separation model and the time domain separation model are trained according to the difference between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data.
• Embodiments of the present application further provide a computer-readable storage medium, where at least one piece of program code is stored in the computer-readable storage medium, and the at least one piece of program code is loaded and executed by a processor to implement the operations performed in the audio processing method of the foregoing embodiments.
  • At least one piece of program code is stored in the computer-readable storage medium, and the at least one piece of program code is loaded by the processor and executes the following steps:
• display, through a playback parameter setting interface, playback parameter setting options of multiple components separated from target audio, the components being a human voice component or any musical instrument sound component; determine, in response to a triggering operation on a playback parameter setting option of at least one target component, the playback parameters set for the at least one target component; for each target component, process the first audio signal of the target component according to the playback parameters set for the target component to obtain the second audio signal of the target component;
  • the second audio signal of each target component is fused with the third audio signal of other components in the target audio except the at least one target component to obtain the processed target audio.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
  • the playback parameter includes a volume parameter, and according to the volume parameter set for the target component, the amplitude of the first audio signal of the target component is adjusted to obtain the second audio signal of the target component; or,
  • the playback parameters include sound effect parameters, and according to the sound effect parameters set for the target component, perform sound effect processing on the first audio signal of the target component to obtain the second audio signal of the target component; or,
• the playback parameters include timbre parameters, the timbre parameters indicate the timbre of the audio, and musical score information corresponding to the target component is obtained, the musical score information representing the pitch of the target component; and the second audio signal of the target component is generated according to the musical score information and the timbre parameters.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
  • the first audio signal of the plurality of components is obtained from the server.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
• determine the first real part signal and the first imaginary part signal of the first frequency spectrum, where the first frequency spectrum is the frequency spectrum of the target audio; call the frequency domain separation model to separate the second real part signal and the second imaginary part signal of each component from the first real part signal and the first imaginary part signal; and determine the first audio signal of each component based on the second real part signal and the second imaginary part signal of each component.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
• call the time domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time domain information of the target audio; call the frequency domain separation model to separate, for each component, the first audio signal of the component from the fifth audio signal of the component based on the frequency domain information of the component; wherein the time domain separation model and the frequency domain separation model are used to obtain the same type of components from audio.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
• call the frequency domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio; call the time domain separation model to separate, for each component, the first audio signal of the component from the sixth audio signal of the component based on the time domain information of the component; wherein the time domain separation model and the frequency domain separation model are used to obtain the same type of components from audio.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
• call the frequency domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio; call the time domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time domain information of the target audio; and, for each component, fuse the fifth audio signal of the component and the sixth audio signal of the component to obtain the first audio signal of the component.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
• acquire sample data, the sample data comprising sample audio and a sample audio signal of each of the multiple components of the sample audio; call the frequency domain separation model to separate the first predicted audio signal of each component from the sample audio signal of the sample audio based on the frequency domain information of the sample audio; call the time domain separation model to separate the second predicted audio signal of each component from the sample audio signal of the sample audio based on the time domain information of the sample audio;
• for each component, fuse the first predicted audio signal of the component and the second predicted audio signal of the component to obtain the third predicted audio signal of the component;
  • the frequency domain separation model and the time domain separation model are trained according to the difference between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data.
  • Embodiments of the present application further provide a computer-readable storage medium, where at least one piece of program code is stored in the computer-readable storage medium, and the at least one piece of program code is loaded by a processor and performs the following operations:
• acquire target audio, where the target audio is composed of multiple components, and the components are human voice components or any musical instrument sound components;
• acquire a time domain separation model and a frequency domain separation model, where the time domain separation model and the frequency domain separation model are used to obtain the same type of components from the audio;
  • the time domain separation model and the frequency domain separation model are invoked to separate the first audio signal of each of at least one component from the fourth audio signal of the target audio.
• In one possible implementation, the at least one piece of program code is loaded by the processor and performs the following operations: call the time domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time domain information of the target audio; and call the frequency domain separation model to separate, for each component, the first audio signal of the component from the fifth audio signal of the component based on the frequency domain information of the component.
• In one possible implementation, the at least one piece of program code is loaded by the processor and performs the following operations: call the frequency domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio; and invoke the time domain separation model to separate, for each component, the first audio signal of the component from the sixth audio signal of the component based on the time domain information of the component.
  • the at least one piece of program code is loaded by the processor and performs the following operations:
• determine the first real part signal and the first imaginary part signal of the first frequency spectrum, the first frequency spectrum being the frequency spectrum corresponding to the target audio; call the frequency domain separation model to separate the second real part signal and the second imaginary part signal of each component from the first real part signal and the first imaginary part signal; and determine the sixth audio signal of each component based on the second real part signal and the second imaginary part signal of each component.
  • the at least one piece of program code is loaded by the processor and performs the following operations:
• call the frequency domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio; call the time domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time domain information of the target audio; and, for each component, fuse the fifth audio signal of the component and the sixth audio signal of the component to obtain the first audio signal of the component.
  • the at least one piece of program code is loaded by the processor and performs the following operations:
• acquire sample data, the sample data comprising sample audio and a sample audio signal for each of at least one component of the sample audio; call the frequency domain separation model to separate the first predicted audio signal of each component from the sample audio signal of the sample audio based on the frequency domain information of the sample audio; call the time domain separation model to separate the second predicted audio signal of each component from the sample audio signal of the sample audio based on the time domain information of the sample audio;
• for each component, fuse the first predicted audio signal of the component and the second predicted audio signal of the component to obtain the third predicted audio signal of the component;
  • the frequency domain separation model and the time domain separation model are trained according to the difference between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data.
• Embodiments of the present application further provide a computer program product, where at least one piece of program code is stored in the computer program product, and the at least one piece of program code is loaded and executed by a processor to implement the operations performed in the audio processing method of the above embodiments.
  • the embodiment of the present application also provides a computer program product, where at least one piece of program code is stored in the computer program product, and the at least one piece of program code is loaded by the processor and performs the following steps:
• display, through a playback parameter setting interface, playback parameter setting options of multiple components separated from target audio, the components being a human voice component or any musical instrument sound component; determine, in response to a triggering operation on a playback parameter setting option of at least one target component, the playback parameters set for the at least one target component; for each target component, process the first audio signal of the target component according to the playback parameters set for the target component to obtain the second audio signal of the target component;
  • the second audio signal of each target component is fused with the third audio signal of other components in the target audio except the at least one target component to obtain the processed target audio.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
  • the playback parameter includes a volume parameter, and according to the volume parameter set for the target component, the amplitude of the first audio signal of the target component is adjusted to obtain the second audio signal of the target component; or,
  • the playback parameters include sound effect parameters, and according to the sound effect parameters set for the target component, perform sound effect processing on the first audio signal of the target component to obtain the second audio signal of the target component; or,
• the playback parameters include timbre parameters, the timbre parameters indicate the timbre of the audio, and musical score information corresponding to the target component is obtained, the musical score information representing the pitch of the target component; and the second audio signal of the target component is generated according to the musical score information and the timbre parameters.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
  • the first audio signal of the plurality of components is obtained from the server.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
• determine the first real part signal and the first imaginary part signal of the first frequency spectrum, where the first frequency spectrum is the frequency spectrum of the target audio; call the frequency domain separation model to separate the second real part signal and the second imaginary part signal of each component from the first real part signal and the first imaginary part signal; and determine the first audio signal of each component based on the second real part signal and the second imaginary part signal of each component.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
• call the time domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time domain information of the target audio; call the frequency domain separation model to separate, for each component, the first audio signal of the component from the fifth audio signal of the component based on the frequency domain information of the component; wherein the time domain separation model and the frequency domain separation model are used to obtain the same type of components from audio.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
• call the frequency domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio; call the time domain separation model to separate, for each component, the first audio signal of the component from the sixth audio signal of the component based on the time domain information of the component; wherein the time domain separation model and the frequency domain separation model are used to obtain the same type of components from audio.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
• call the frequency domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio; call the time domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time domain information of the target audio; and, for each component, fuse the fifth audio signal of the component and the sixth audio signal of the component to obtain the first audio signal of the component.
  • the at least one piece of program code is loaded by the processor and performs the following steps:
• acquire sample data, the sample data comprising sample audio and a sample audio signal of each of the multiple components of the sample audio; call the frequency domain separation model to separate the first predicted audio signal of each component from the sample audio signal of the sample audio based on the frequency domain information of the sample audio; call the time domain separation model to separate the second predicted audio signal of each component from the sample audio signal of the sample audio based on the time domain information of the sample audio;
• for each component, fuse the first predicted audio signal of the component and the second predicted audio signal of the component to obtain the third predicted audio signal of the component;
  • the frequency domain separation model and the time domain separation model are trained according to the difference between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data.
  • the embodiment of the present application also provides a computer program product, where at least one piece of program code is stored in the computer program product, and the at least one piece of program code is loaded by the processor and performs the following steps:
• acquire target audio, where the target audio is composed of multiple components, and the components are human voice components or any musical instrument sound components;
• acquire a time domain separation model and a frequency domain separation model, where the time domain separation model and the frequency domain separation model are used to obtain the same type of components from the audio;
  • the time domain separation model and the frequency domain separation model are invoked to separate the first audio signal of each of at least one component from the fourth audio signal of the target audio.
• In one possible implementation, the at least one piece of program code is loaded by the processor and performs the following operations: call the time domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time domain information of the target audio; and call the frequency domain separation model to separate, for each component, the first audio signal of the component from the fifth audio signal of the component based on the frequency domain information of the component.
• In one possible implementation, the at least one piece of program code is loaded by the processor and performs the following operations: call the frequency domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio; and invoke the time domain separation model to separate, for each component, the first audio signal of the component from the sixth audio signal of the component based on the time domain information of the component.
  • the at least one piece of program code is loaded by the processor and performs the following operations:
• determine the first real part signal and the first imaginary part signal of the first frequency spectrum, the first frequency spectrum being the frequency spectrum corresponding to the target audio; call the frequency domain separation model to separate the second real part signal and the second imaginary part signal of each component from the first real part signal and the first imaginary part signal; and determine the sixth audio signal of each component based on the second real part signal and the second imaginary part signal of each component.
  • the at least one piece of program code is loaded by the processor and performs the following operations:
• call the frequency domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio; call the time domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time domain information of the target audio; and, for each component, fuse the fifth audio signal of the component and the sixth audio signal of the component to obtain the first audio signal of the component.
  • the at least one piece of program code is loaded by the processor and performs the following operations:
• acquire sample data, the sample data comprising sample audio and a sample audio signal for each of at least one component of the sample audio; call the frequency domain separation model to separate the first predicted audio signal of each component from the sample audio signal of the sample audio based on the frequency domain information of the sample audio; call the time domain separation model to separate the second predicted audio signal of each component from the sample audio signal of the sample audio based on the time domain information of the sample audio;
• for each component, fuse the first predicted audio signal of the component and the second predicted audio signal of the component to obtain the third predicted audio signal of the component;
  • the frequency domain separation model and the time domain separation model are trained according to the difference between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data.


Abstract

The embodiments of this application disclose an audio processing method and apparatus, a computer device, and a storage medium, belonging to the field of computer technology. The method includes: displaying playback parameter setting options of multiple components separated from target audio; determining, in response to a triggering operation on a playback parameter setting option of at least one target component, the playback parameters set for the at least one target component; for each target component, processing the first audio signal of the target component according to the playback parameters set for the target component to obtain the second audio signal of the target component; and fusing the second audio signal of each target component with the third audio signal of the other components in the target audio except the at least one target component to obtain processed target audio. This makes it possible to process the audio signals of individual components of the audio separately, so that different personalized playback effects can be set for different components of the same audio, improving the flexibility of audio processing.

Description

Audio processing method and apparatus, computer device, and storage medium
This application claims priority to Chinese Patent Application No. 202011603259.7, filed on December 30, 2020 and entitled "Audio processing method and apparatus, computer device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of this application relate to the field of computer technology, and in particular to an audio processing method and apparatus, a computer device, and a storage medium.
Background
With the continuous development of computer technology, audio processing software has been widely used and has become ever more closely tied to people's lives. For example, such audio processing software has the function of playing audio, and people can listen to the audio it plays in their leisure time; as another example, such software also has the function of adding sound effects to audio, and people can add sound effects such as reverberation and equalization to audio.
However, the above ways of processing audio can only process the audio as a whole in a uniform manner, so the manner of audio processing is rather limited and the flexibility of audio processing is poor.
Summary
The embodiments of this application provide an audio processing method and apparatus, a computer device, and a storage medium, which improve the flexibility of audio processing. The technical solutions are as follows:
In one aspect, an audio processing method is provided, the method including:
displaying, through a playback parameter setting interface, playback parameter setting options of multiple components separated from target audio, the components being a human voice component or any musical instrument sound component;
determining, in response to a triggering operation on a playback parameter setting option of at least one target component, the playback parameters set for the at least one target component, the target component being any one of the multiple components;
for each target component, processing the first audio signal of the target component according to the playback parameters set for the target component to obtain the second audio signal of the target component;
fusing the second audio signal of each target component with the third audio signal of the other components in the target audio except the at least one target component to obtain processed target audio.
In another aspect, an audio processing method is provided, the method including:
acquiring target audio, the target audio being composed of multiple components, the components being a human voice component or any musical instrument sound component;
acquiring a time domain separation model and a frequency domain separation model, the time domain separation model and the frequency domain separation model being used to obtain the same type of components from audio;
calling the time domain separation model and the frequency domain separation model to separate the first audio signal of each of at least one component from the fourth audio signal of the target audio.
In another aspect, an audio processing apparatus is provided, the apparatus including:
a display module, configured to display, through a playback parameter setting interface, playback parameter setting options of multiple components separated from target audio, the components being a human voice component or any musical instrument sound component;
a determining module, configured to determine, in response to a triggering operation on a playback parameter setting option of at least one target component, the playback parameters set for the at least one target component, the target component being any one of the multiple components;
a processing module, configured to process, for each target component, the first audio signal of the target component according to the playback parameters set for the target component to obtain the second audio signal of the target component;
a fusion module, configured to fuse the second audio signal of each target component with the third audio signal of the other components in the target audio except the at least one target component to obtain processed target audio.
In one possible implementation, the playback parameters include a volume parameter, and the processing module is configured to adjust the amplitude of the first audio signal of the target component according to the volume parameter set for the target component to obtain the second audio signal of the target component; or,
the playback parameters include a sound effect parameter, and the processing module is configured to perform sound effect processing on the first audio signal of the target component according to the sound effect parameter set for the target component to obtain the second audio signal of the target component; or,
the playback parameters include a timbre parameter indicating the timbre of the audio, and the processing module is configured to obtain musical score information corresponding to the target component, the musical score information representing the pitch of the target component, and to generate the second audio signal of the target component according to the musical score information and the timbre parameter.
In one possible implementation, the apparatus further includes:
an acquisition module, configured to obtain the first audio signals of the multiple components from a server.
In one possible implementation, the apparatus further includes:
a separation module, configured to call a time domain separation model and a frequency domain separation model to separate the first audio signal of each of the multiple components from the fourth audio signal of the target audio; or,
the separation module is configured to determine the first real part signal and the first imaginary part signal of the first frequency spectrum, the first frequency spectrum being the frequency spectrum of the target audio; call the frequency domain separation model to separate the second real part signal and the second imaginary part signal of each component from the first real part signal and the first imaginary part signal; and determine the first audio signal of each component based on the second real part signal and the second imaginary part signal of each component.
In one possible implementation, the separation module includes:
a time domain separation unit, configured to call the time domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time domain information of the target audio;
a frequency domain separation unit, configured to call the frequency domain separation model to separate, for each component, the first audio signal of the component from the fifth audio signal of the component based on the frequency domain information of the component, obtaining the first audio signal of each component;
wherein the time domain separation model and the frequency domain separation model are used to obtain the same type of components from audio.
In one possible implementation, the separation module includes:
a frequency domain separation unit, configured to call the frequency domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio;
a time domain separation unit, configured to call the time domain separation model to separate, for each component, the first audio signal of the component from the sixth audio signal of the component based on the time domain information of the component, obtaining the first audio signal of each component;
wherein the time domain separation model and the frequency domain separation model are used to obtain the same type of components from audio.
In one possible implementation, the separation module includes:
a frequency domain separation unit, configured to call the frequency domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio;
a time domain separation unit, configured to call the time domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time domain information of the target audio;
a fusion unit, configured to, for each component, perform fusion processing on the fifth audio signal of the component and the sixth audio signal of the component to obtain the first audio signal of the component.
In one possible implementation, the apparatus further includes:
an acquisition module, configured to acquire sample data, the sample data including sample audio and a sample audio signal of each of the multiple components of the sample audio;
the separation module is configured to call the frequency domain separation model to separate the first predicted audio signal of each of the multiple components from the sample audio signal of the sample audio based on the frequency domain information of the sample audio;
the separation module is further configured to call the time domain separation model to separate the second predicted audio signal of each component from the sample audio signal of the sample audio based on the time domain information of the sample audio;
the separation module is further configured to perform, for each component, fusion processing on the first predicted audio signal of the component and the second predicted audio signal of the component to obtain the third predicted audio signal of the component;
a training module, configured to train the frequency domain separation model and the time domain separation model according to the difference between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data.
In another aspect, an audio processing apparatus is provided, the apparatus including:
an audio acquisition module, configured to acquire target audio, the target audio being composed of multiple components, the components being a human voice component or any musical instrument sound component;
a model acquisition module, configured to acquire a time domain separation model and a frequency domain separation model, the time domain separation model and the frequency domain separation model being used to obtain the same type of components from audio;
a separation module, configured to call the time domain separation model and the frequency domain separation model to separate the first audio signal of each of at least one component from the fourth audio signal of the target audio.
In one possible implementation, the separation module includes:
a time domain separation unit, configured to call the time domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time domain information of the target audio;
a frequency domain separation unit, configured to call the frequency domain separation model to separate, for each component, the first audio signal of the component from the fifth audio signal of the component based on the frequency domain information of the component.
In one possible implementation, the separation module includes:
a frequency domain separation unit, configured to call the frequency domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio;
a time domain separation unit, configured to call the time domain separation model to separate, for each component, the first audio signal of the component from the sixth audio signal of the component based on the time domain information of the component, obtaining the first audio signal of each component.
In one possible implementation, the frequency domain separation unit is configured to determine the first real part signal and the first imaginary part signal of the first frequency spectrum, the first frequency spectrum being the frequency spectrum corresponding to the target audio; call the frequency domain separation model to separate the second real part signal and the second imaginary part signal of each component from the first real part signal and the first imaginary part signal of the first frequency spectrum; and determine the sixth audio signal of each component based on the second real part signal and the second imaginary part signal of each component.
In one possible implementation, the separation module includes:
a frequency domain separation unit, configured to call the frequency domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency domain information of the target audio;
a time domain separation unit, configured to call the time domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time domain information of the target audio;
a fusion unit, configured to, for each component, perform fusion processing on the fifth audio signal of the component and the sixth audio signal of the component to obtain the first audio signal of the component.
In one possible implementation, the apparatus further includes:
a sample acquisition module, configured to acquire sample data, the sample data including sample audio and a sample audio signal of each of at least one component of the sample audio;
the separation module is configured to call the frequency domain separation model to separate the first predicted audio signal of each of the at least one component from the sample audio signal of the sample audio based on the frequency domain information of the sample audio;
the separation module is configured to call the time domain separation model to separate the second predicted audio signal of each component from the sample audio signal of the sample audio based on the time domain information of the sample audio;
a fusion module, configured to perform, for each component, fusion processing on the first predicted audio signal of the component and the second predicted audio signal of the component to obtain the third predicted audio signal of the component;
a training module, configured to train the frequency domain separation model and the time domain separation model according to the difference between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data.
In another aspect, a computer device is provided, the computer device including a processor and a memory, the memory storing at least one piece of program code that is loaded and executed by the processor to implement the operations performed in the audio processing method of the above aspects.
In another aspect, a computer-readable storage medium is provided, the computer-readable storage medium storing at least one piece of program code that is loaded and executed by a processor to implement the operations performed in the audio processing method of the above aspects.
In yet another aspect, a computer program product is provided, the computer program product storing at least one piece of program code that is loaded and executed by a processor to implement the operations performed in the audio processing method of the above aspects.
With the audio processing method, apparatus, device, and medium provided by the embodiments of this application, when processing audio, playback parameters can be set for one or more components of the audio through a playback parameter setting interface, and for each component, the audio signal of the component is processed with the playback parameters set for it, so that the audio signals of individual components of the audio are processed separately. Different personalized playback effects can therefore be set for different components of the same audio, improving the flexibility of audio processing.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of this application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic structural diagram of an implementation environment provided by an embodiment of this application.
FIG. 2 is a flowchart of an audio processing method provided by an embodiment of this application.
FIG. 3 is a flowchart of an audio processing method provided by an embodiment of this application.
FIG. 4 is a schematic diagram of a playback parameter setting interface provided by an embodiment of this application.
FIG. 5 is a schematic diagram of a playback parameter setting interface provided by an embodiment of this application.
FIG. 6 is a schematic diagram of a playback parameter setting interface provided by an embodiment of this application.
FIG. 7 is a flowchart of an audio processing method provided by an embodiment of this application.
FIG. 8 is a flowchart of an audio processing method provided by an embodiment of this application.
FIG. 9 is a schematic structural diagram of an audio processing apparatus provided by an embodiment of this application.
FIG. 10 is a schematic structural diagram of another audio processing apparatus provided by an embodiment of this application.
FIG. 11 is a schematic structural diagram of an audio processing apparatus provided by an embodiment of this application.
FIG. 12 is a schematic structural diagram of another audio processing apparatus provided by an embodiment of this application.
FIG. 13 is a schematic structural diagram of a terminal provided by an embodiment of this application.
FIG. 14 is a schematic structural diagram of a server provided by an embodiment of this application.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the implementations of this application are described in further detail below with reference to the drawings.
It can be understood that the terms "first", "second", "third", "fourth", "fifth", "sixth", and so on used in this application may be used herein to describe various concepts, but unless otherwise specified, these concepts are not limited by these terms. These terms are only used to distinguish one concept from another. For example, without departing from the scope of this application, a first component may be called a second component, and a second component may be called a first component.
As used in this application, "at least one" includes one, two, or more; "multiple" includes two or more; "each" refers to each one of the corresponding multiple; and "any" refers to any one of the multiple. For example, if multiple components include three components, "each" refers to each one of these three components, and "any" refers to any one of the three, which may be the first, the second, or the third.
The audio processing method provided by the embodiments of this application is executed by a computer device. In one possible implementation, the computer device is a terminal, for example, a mobile phone, a tablet computer, or a computer. In another possible implementation, the computer device is a server, which is a single server, a server cluster composed of several servers, or a cloud computing service center. In another possible implementation, the computer device includes a terminal and a server.
FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of this application. Referring to FIG. 1, the implementation environment includes at least one terminal 101 and a server 102. The terminal 101 and the server 102 are connected through a wireless or wired network.
A target application served by the server 102 is installed on the terminal 101, through which the terminal 101 can implement functions such as data transmission and message interaction. Optionally, the target application is an application in the operating system of the terminal 101 or an application provided by a third party. For example, the target application is an audio processing application that has the function of playing audio; of course, the audio processing application can also have other functions, such as a recording function, a live streaming function, a sharing function, and a sound effect adjustment function. Optionally, the server 102 is a backend server of the target application or a cloud server providing services such as cloud computing and cloud storage.
Optionally, the terminal 101 sends the playback parameters set for at least one target component of the audio to the server, and the server 102 performs personalized processing on the audio signal of the at least one target component of the audio based on the received playback parameters to obtain processed audio, and sends the processed audio to the terminal 101, which plays it.
Optionally, the terminal 101 obtains the audio signals of multiple components of the audio from the server and performs personalized processing on the audio signal of at least one target component among the multiple components to obtain processed audio. Optionally, the terminal 101 obtains the audio from the server, separates the audio signals of multiple components from the audio, and performs personalized processing on the audio signal of at least one target component among the multiple components to obtain processed audio. It should be noted that, in one possible implementation, after obtaining the processed audio, the terminal 101 plays the processed audio.
Audio is usually composed of human voice and instrument sounds, and the components of audio refer to the human voice component, instrument sound components, and the like that make up the audio.
The audio processing method provided by the embodiments of this application can be applied in audio processing scenarios:
For example, in an audio playback scenario.
During audio playback, with the audio processing method provided by the embodiments of this application, a user can personalize the components of the audio in the terminal interface, for example, increasing the volume of the human voice, adding a sound effect to the bass in the accompaniment, or replacing the drum sound in the accompaniment with an African drum sound.
For example, in a song recording scenario.
When a user records a song, the terminal plays the original song so that the user sings along with it, records the user's singing, and synthesizes the user's singing with the original song into a new piece of audio. With the audio processing method provided by the embodiments of this application, the volume of the human voice in the original song can be changed: by lowering the volume of the human voice in the original song and mixing it with the user's singing, the human voice of the original song serves as a harmony.
It should be noted that the embodiments of this application only take the audio playback scenario and the song recording scenario as examples to illustrate audio processing scenarios, without limiting them; the audio processing method provided by the embodiments of this application can also be applied in any other audio processing scenario.
FIG. 2 is a flowchart of an audio processing method provided by an embodiment of this application. The execution body of this embodiment is a computer device. Referring to FIG. 2, the method includes:
201. Display, through a playback parameter setting interface, playback parameter setting options of multiple components separated from target audio, the components being a human voice component or any musical instrument sound component.
The playback parameter setting interface is an interface for setting the playback effect of audio and includes at least one playback parameter setting option for the user to adjust the playback parameters of the audio and thereby change its playback effect. The target audio is any audio in the computer device, for example, any song.
Audio is usually composed of human voice and instrument sounds, and the components of audio refer to the human voice component, instrument sound components, and the like that make up the audio. For example, the components of a piece of audio are a human voice component and an accompaniment component, where the accompaniment component refers to the remaining multiple instrument sound components of the audio other than the human voice component. As another example, the components of a piece of audio are a human voice component, a drum component, a bass component, and a remaining accompaniment component, where the remaining accompaniment component refers to the components of the audio other than the human voice, drum, and bass components.
202. Determine, in response to a triggering operation on a playback parameter setting option of at least one target component, the playback parameters set for the at least one target component.
The target component is any one of the multiple components of the target audio, for example, a human voice component, a drum component, or an accompaniment component. A playback parameter is a parameter for controlling the playback effect of audio, for example, a volume parameter, a sound effect parameter, or a timbre parameter. The playback parameter of the target component is a parameter for controlling the playback effect of that target component; it should be noted that, in the embodiments of this application, the playback parameter of the target component only controls the playback effect of that target component and does not control the playback effect of other components.
203. For each target component, process the first audio signal of the target component according to the playback parameters set for the target component to obtain the second audio signal of the target component.
The first audio signal is the audio signal of the target component in the target audio, and the second audio signal is the audio signal obtained by processing the first audio signal according to the playback parameters set for the target component.
After the first audio signal of the target component is processed according to the playback parameters set for the target component, the playback effect of the target component changes, so that the playback effect of the target component is changed by setting its playback parameters.
204. Fuse the second audio signal of each target component with the third audio signal of the other components in the target audio except the at least one target component to obtain processed target audio.
The first audio signal is the audio signal of the target component in the target audio, and the third audio signal is the audio signal of the components in the target audio other than the target component; the embodiments of this application use the terms "first audio signal" and "third audio signal" merely to distinguish the audio signal of the target component from the audio signals of the other components.
The components of the processed target audio are the same as those of the target audio, but the audio signal of the target component in the processed target audio differs from that in the original target audio; the target component in the processed target audio has the playback effect corresponding to the playback parameters set in step 202.
With the audio processing method provided by this embodiment of the application, when processing audio, playback parameters can be set for one or more components of the audio through the playback parameter setting interface, and for each component, the audio signal of the component is processed with the playback parameters set for it, so that the audio signals of individual components of the audio are processed separately. Different personalized playback effects can therefore be set for different components of the same audio, which improves the flexibility of audio processing.
FIG. 3 is a flowchart of an audio processing method provided by an embodiment of this application. In this embodiment, the case in which the execution body is a terminal is used only as an example for illustration. Referring to FIG. 3, the method includes:
301. The terminal obtains, from a server, the first audio signals of multiple components separated from target audio.
A target application is installed on the terminal, and the server is a server that provides services for the target application. The target application is an audio processing application, and the terminal can obtain audio from the server and process or play it.
In this embodiment of the application, the server stores multiple pieces of audio and the first audio signals of the multiple components separated from each piece of audio, or the server stores only the first audio signals of the multiple components separated from each piece of audio. Therefore, the terminal can obtain the first audio signals of the multiple components of the target audio directly from the server, without performing separation processing on the target audio.
In one possible implementation, the terminal obtaining from the server the first audio signals of multiple components separated from the target audio includes: the terminal sends an audio acquisition request to the server, the audio acquisition request carrying the audio identifier of the target audio; the server receives the audio acquisition request and, based on the audio identifier of the target audio, sends the first audio signals of the multiple components separated from the target audio to the terminal, or sends the target audio together with the first audio signals of the multiple components separated from it.
The audio identifier is any identifier such as the name of the audio, the author of the audio, or the serial number of the audio; the embodiments of this application do not limit the audio identifier.
For example, when a user plays song A through a song playing application of the terminal, the terminal sends a song acquisition request to the server carrying the song name of song A; the server obtains the human voice component and each instrument sound component of song A according to the song name and sends them to the terminal. If song A is composed of human voice, piano, drum kit, and bass, each instrument sound component refers to the piano, drum kit, and bass sounds.
It should be noted that this embodiment only takes the case in which the server stores the first audio signals of the components already separated from the audio as an example to illustrate how the terminal obtains the first audio signals of the separated components of the target audio. In another embodiment, the server stores only multiple pieces of audio, and after obtaining the audio from the server, the terminal performs separation processing on the obtained audio and separates the first audio signals of the multiple components from it; the process of separating the first audio signals of multiple components from audio is described in the embodiments shown in FIG. 7 and FIG. 8 and is not repeated here. In another embodiment, the terminal stores the first audio signals of the multiple components separated from the target audio and obtains them directly from local storage. Optionally, the first audio signals stored locally on the terminal are obtained from the server; optionally, they are obtained by performing separation processing on the obtained audio.
302. The terminal displays, through a playback parameter setting interface, playback parameter setting options of the multiple components separated from the target audio.
The playback parameter setting interface is an interface for setting the playback effect of audio and includes at least one playback parameter setting option. Optionally, the playback parameter setting options include at least one of a volume setting option, a sound effect setting option, or a timbre setting option.
Optionally, the playback parameter setting interface displays one playback parameter setting option for each component. For example, as shown in FIG. 4, the playback parameter setting interface includes a volume setting option for the human voice component, a volume setting option for the drum component, a volume setting option for the bass component, and a volume setting option for the other accompaniment. Through this interface, the volume of multiple components of the target audio can be adjusted.
Optionally, the playback parameter setting interface displays multiple playback parameter setting options for each component. For example, as shown in FIG. 5, the interface displays volume and sound effect setting options for the human voice component, the drum component, the bass component, and the other accompaniment. The sound effect setting option is one or more options; for example, the sound effect setting options include a reverberation option, a soothing option, a rock option, and so on; or, the sound effect setting option is used to trigger the display of a sound effect setting interface that includes multiple sound effect options such as reverberation, soothing, and rock.
Optionally, the playback parameter setting interface displays one or more playback parameter setting options corresponding to a single component; that is, the interface can display the playback parameter setting options of only one component at a time. In one possible implementation, the interface includes a component selection option indicating which component's playback parameter setting options are displayed, or indicating which component the playback parameters indicated by the current setting options correspond to. For example, as shown in FIG. 6, the interface includes a human voice option, a drum option, a bass option, an other-accompaniment option, and at least one playback parameter setting option; when the human voice option is selected, triggering the at least one playback parameter setting option sets playback parameters for the human voice component; when the bass option is selected, triggering the at least one playback parameter setting option sets playback parameters for the bass component.
It should be noted that different audio has different component compositions, so the components separated from the audio also differ. For example, the components separated from audio A are a human voice component, a drum component, and an other-accompaniment component, whereas audio B has no drum component, so the components separated from audio B are a human voice component and an accompaniment component. Therefore, the terminal displaying, through the playback parameter setting interface, the playback parameter setting options of the multiple components separated from the target audio includes: obtaining the component identifier of each component in the target audio, and displaying, through the playback parameter setting interface, the playback parameter setting options of the separated components according to the obtained component identifiers. In this way, the components displayed in the playback parameter setting interface are guaranteed to correspond to the components of the target audio.
303. The terminal determines, in response to a triggering operation on a playback parameter setting option of at least one target component, the playback parameters set for the at least one target component.
The target component is any one of the multiple components separated from the target audio; in this embodiment of the application, the target component refers to a component whose playback parameters are modified.
Optionally, the playback parameter setting option is a volume adjustment option, in which case the playback parameter is a volume parameter; optionally, the setting option is a sound effect adjustment option, in which case the playback parameter is a sound effect parameter, such as a sound effect name or adjustment parameters for the audio signal; optionally, the setting option is a timbre adjustment option, in which case the playback parameter is a timbre parameter indicating that the timbre of the audio is to be adjusted to a target timbre, for example, the timbre identifier of the target timbre.
Optionally, each sound effect corresponds to at least one adjustment parameter for the audio signal, for example, a frequency adjustment parameter or a phase adjustment parameter of the audio signal.
It should be noted that one or more playback parameters may be set for each target component; for example, the playback parameters set for a target component include a volume parameter and a sound effect parameter. Moreover, the playback parameters set for different target components may be the same or different.
304、对于每个目标成分,终端根据为该目标成分设置的播放参数,对该目标成分的第一音频信号进行处理,得到目标成分的第二音频信号。
其中,播放参数的类型不同,对目标成分的处理方式也不同,本申请实施例分别以播放参数为音量参数、音效参数和音色参数为例,对目标成分的处理过程进行示例性说明,在一些实施例中,播放参数是其他类型的参数,本申请实施例对播放参数的类型不做限定,对根据播放参数对音频进行处理的过程也不做限定。
In one possible implementation, the playback parameters include a volume parameter, and, for each target component, processing the first audio signal of the target component according to the playback parameters set for the target component to obtain the second audio signal of the target component includes: for each target component, adjusting the amplitude of the first audio signal of the target component according to the volume parameter set for the target component to obtain the second audio signal of the target component.
The playback volume of audio is determined by the amplitude of its audio signal. The first audio signal and the second audio signal differ only in amplitude; their frequency, phase, and other information are identical. Adjusting the amplitude of the first audio signal therefore changes only the volume of the target component, not its timbre, playback speed, and so on; playback subsequently uses the second audio signal of the target component. For example, audio A includes a vocal component, a bass component, and a drum component, and the volume of audio A is 10. If the vocal volume is adjusted to 20 while the bass and drum volumes are left unchanged, the terminal adjusts the amplitude of the vocal component's audio signal to obtain the adjusted vocal signal, then fuses the adjusted vocal signal with the bass and drum signals to obtain the processed target audio, in which the vocal volume is 20 and the bass and drum volumes remain 10; on playback, the vocals in the processed target audio sound louder.
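For illustration only, and not as the claimed implementation, the following minimal Python sketch shows the idea of this volume step: each separated component (stem) is scaled by a gain factor and the stems are then summed back together. The stem names and gain values are hypothetical.

```python
import numpy as np

def apply_volume(stems: dict[str, np.ndarray], gains: dict[str, float]) -> np.ndarray:
    """Scale each separated component (stem) by its gain, then mix by summation."""
    mixed = None
    for name, signal in stems.items():
        scaled = signal * gains.get(name, 1.0)  # amplitude scaling changes only loudness
        mixed = scaled if mixed is None else mixed + scaled
    return np.clip(mixed, -1.0, 1.0)  # guard against clipping after summation

# Hypothetical usage: double the vocal amplitude, keep bass and drums unchanged.
# stems = {"vocals": v, "bass": b, "drums": d}
# out = apply_volume(stems, {"vocals": 2.0})
```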
In one possible implementation, the playback parameters include a sound-effect parameter, and, for each target component, processing the first audio signal of the target component according to the playback parameters set for the target component to obtain the second audio signal of the target component includes: for each target component, performing sound-effect processing on the first audio signal of the target component according to the sound-effect parameter set for the target component to obtain the second audio signal of the target component.
The sound-effect parameter indicates the sound effect of the audio. Optionally, the sound-effect parameter includes a volume parameter, a playback speed parameter, a frequency adjustment parameter, a phase adjustment parameter, and the like. For example, a soothing sound effect includes a volume reduction parameter, a playback speed reduction parameter, a frequency reduction parameter, and so on, where each reduction parameter indicates the magnitude of the reduction.
Optionally, the sound-effect parameter includes a volume parameter, a playback speed parameter, and a frequency adjustment parameter, and performing sound-effect processing on the first audio signal of the target component to obtain the second audio signal of the target component includes: adjusting the amplitude, timing, and frequency of the first audio signal to obtain the second audio signal.
In one possible implementation, the playback parameters include a timbre parameter, the timbre parameter indicating that the timbre of the audio is to be adjusted to a target timbre, and, for each target component, processing the first audio signal of the target component according to the playback parameters set for the target component to obtain the second audio signal of the target component includes: for each of the at least one target component, obtaining the score information corresponding to the target component, the score information representing the pitch of the component; and generating the second audio signal of the target component according to the score information and the timbre parameter, the second audio signal having the target timbre.
The score information includes at least one note and the duration of each note. For example, the target component is a drum-kit component and the timbre parameter is the identifier of an African drum: the drum-kit component of the target audio is analyzed to obtain its score information, and an African-drum component is generated according to that score information and the timbre parameter. Because the African-drum component has the same score information as the drum-kit component, the African-drum component can be added to the target audio to replace the original drum-kit component.
Since notes indicate the pitch of audio, and pitch is determined by the frequency of the audio signal, in one possible implementation, obtaining the score information corresponding to any component includes: determining the notes corresponding to the component according to the frequencies of the component's first audio signal; determining the duration of each note according to how long the component stays at the corresponding frequency; and generating the score information of the component from the component's notes and the duration of each note.
For example, if the audio signal of component A has frequency B from second 0 to second 1.5, the note corresponding to that span is C with a duration of 1.5 seconds; if the signal has frequency D from second 1.5 to second 2.5, the note is E with a duration of 1 second; and if the signal has frequency F from second 2.5 to second 4, the note is G with a duration of 1.5 seconds. The resulting score information of component A contains the notes C, E, and G in order, with durations of 1.5 seconds, 1 second, and 1.5 seconds respectively.
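As a hedged sketch of this note-extraction idea, the dominant frequency of each constant-frequency segment can be mapped to a note and a duration. The standard equal-temperament mapping (A4 = 440 Hz) is an assumption here; the embodiments do not specify the formula.

```python
import math

def freq_to_note(freq_hz: float) -> str:
    """Map a frequency to the nearest equal-temperament note name (A4 = 440 Hz)."""
    names = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
    midi = round(69 + 12 * math.log2(freq_hz / 440.0))
    return f"{names[midi % 12]}{midi // 12 - 1}"

def score_from_segments(segments: list[tuple[float, float, float]]) -> list[tuple[str, float]]:
    """segments: (start_s, end_s, dominant_freq_hz) -> list of (note, duration_s)."""
    return [(freq_to_note(f), end - start) for start, end, f in segments]

# Hypothetical segments matching the example in the text:
# score_from_segments([(0.0, 1.5, 261.6), (1.5, 2.5, 329.6), (2.5, 4.0, 392.0)])
# -> [("C4", 1.5), ("E4", 1.0), ("G4", 1.5)]
```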
Timbre is determined by the waveform of the audio signal, for example, the harmonic amplitudes and phase offsets of the signal. Audio with a target timbre can therefore be obtained by generating a corresponding audio signal. In one possible implementation, generating the second audio signal of the target component according to the score information and the timbre parameter includes: inputting the score information and the timbre parameter into an audio signal synthesizer and obtaining the second audio signal output by the synthesizer, where the audio signal synthesizer synthesizes an audio signal according to the input score information and the specified timbre.
Optionally, the audio signal synthesizer stores the waveform characteristics corresponding to multiple timbres, and inputting the score information and the timbre parameter into the audio signal synthesizer and obtaining the second audio signal output by it includes: the score information and the timbre parameter are input into the synthesizer; the synthesizer determines the waveform characteristics corresponding to the timbre parameter; and the synthesizer synthesizes the corresponding second audio signal according to those waveform characteristics and the frequencies, and the duration of each frequency, indicated by the score information.
In another possible implementation, the terminal or the server stores instrument material for multiple instruments, each item of instrument material being the audio corresponding to one instrument; by changing the pitch and rhythm of the instrument material, the material can be made to replace the target component. Optionally: according to the timbre parameter, target audio material whose timbre is the target timbre is obtained; the target audio material is adjusted according to the score information to obtain adjusted target audio material; and the adjusted target audio material is used as the target component.
Adjusting the target audio material according to the score information to obtain the adjusted target audio material means: adjusting the pitch and rhythm of the target audio material according to the score information so that the score information corresponding to the adjusted target audio material is the same as the score information corresponding to the target component.
Adjusting the pitch of the target audio material means adjusting the frequencies of the audio signal of the target audio material; adjusting the rhythm of the target audio material means adjusting the duration of each frequency in the audio signal of the target audio material.
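One plausible realization of this pitch and rhythm adjustment, sketched with the librosa library rather than the technique claimed here, shifts the material in pitch and stretches it in time; the semitone shift and tempo ratio below are hypothetical inputs that would be derived from the two score informations.

```python
import librosa

def fit_material_to_score(material, sr, semitone_shift, tempo_ratio):
    """Shift the instrument material's pitch and stretch its timing so its
    note sequence can match the target component's score information."""
    shifted = librosa.effects.pitch_shift(material, sr=sr, n_steps=semitone_shift)
    return librosa.effects.time_stretch(shifted, rate=tempo_ratio)

# Hypothetical usage: raise the material by 2 semitones and slow it to 90% speed.
# y, sr = librosa.load("african_drum.wav", sr=None)
# adjusted = fit_material_to_score(y, sr, semitone_shift=2, tempo_ratio=0.9)
```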
305. The terminal fuses the second audio signal of each target component with the third audio signals of the components of the target audio other than the at least one target component, obtaining the processed target audio.
Fusing multiple components is the inverse of separating multiple components from audio. Fusing the second audio signal of each target component with the third audio signals of the components other than the at least one target component ensures that the resulting processed target audio has a complete set of components.
Optionally, the fusing includes: the terminal superimposes the second audio signal of each target component and the third audio signals of the components of the target audio other than the at least one target component to obtain an eighth audio signal, and the processed target audio consists of the eighth audio signal; obtaining the eighth audio signal is obtaining the processed target audio. In the embodiments of this application, the eighth audio signal is the audio signal of the processed target audio; the term eighth audio signal is used merely to distinguish the audio signal of the processed target audio from other audio signals.
Note that after the processed target audio is obtained, it can be played, stored, and so on; the embodiments of this application do not limit the subsequent handling of the processed target audio.
With the audio processing method provided by the embodiments of this application, playback parameters can be set for one or more components of audio through a playback parameter setting interface, and for each such component, the component's audio signal is processed using the playback parameters set for it. The audio signal of each component is thus processed individually, so different personalized playback effects can be set for different components of the same audio, improving the flexibility of audio processing.
Moreover, the terminal can obtain the first audio signals of the separated components directly from the server without separating the audio itself, which lowers the requirements on the terminal and improves the efficiency with which the terminal processes audio.
Furthermore, the audio processing method provided by the embodiments of this application offers multiple ways of processing the components of audio: the volume, sound effect, and timbre of any component can be processed, which increases the diversity and flexibility of the processing.
The embodiments shown in FIG. 2 and FIG. 3 above only illustrate the processing of one or more components of audio; the embodiments shown in FIG. 7 and FIG. 8 below illustrate the process of separating multiple components from audio.
FIG. 7 is a flowchart of an audio processing method provided by an embodiment of this application. Referring to FIG. 7, the method is applied to a computer device and includes:
701. Obtain target audio, the target audio being composed of multiple components, each component being a vocal component or any instrument component.
The target audio is any audio, for example, the audio of any song or any symphony; the embodiments of this application do not limit the target audio.
702. Obtain a time-domain separation model and a frequency-domain separation model, the time-domain separation model and the frequency-domain separation model being used to obtain the same types of components from audio.
The time-domain separation model is a model that separates audio based on the audio's time-domain information; for example, it is a model such as Wave-U-Net (Wave-U-Network) or TasNet (Time-domain audio separation Network). The frequency-domain separation model is a model that separates audio based on the audio's frequency-domain information; for example, it is a model such as U-Net (U-Network) or Open-Unmix (a frequency-domain separation model).
703. Invoke the time-domain separation model and the frequency-domain separation model to separate the first audio signal of each of at least one component from the fourth audio signal of the target audio.
Because the time-domain separation model and the frequency-domain separation model separate audio based on different information in it, the two models are complementary; invoking both models to separate the audio therefore separates the various components more accurately.
In the audio processing method provided by the embodiments of this application, the time-domain separation model and the frequency-domain separation model can separate audio based on different information in it and are therefore complementary; separating the audio with both models together separates the various components more accurately and improves the quality of the separation.
FIG. 8 is a flowchart of an audio processing method provided by an embodiment of this application. In this embodiment, separation of audio by a computer device is taken as the example. Referring to FIG. 8, the method includes:
801. The computer device obtains target audio, the target audio being composed of multiple components, each component being a vocal component or any instrument component.
Step 801 is the same as step 701 and is not repeated here.
802. The computer device obtains a time-domain separation model and a frequency-domain separation model, the time-domain separation model and the frequency-domain separation model being used to obtain the same types of components from audio.
The time-domain separation model and the frequency-domain separation model obtained in step 802 are trained models with a certain separation accuracy. That the two models are used to obtain the same types of components from audio means that they separate the audio signals of the same components from audio. For example, the time-domain separation model separates the audio signal of the vocal component, the audio signal of the drum component, and the audio signal of the remaining accompaniment from audio, and the frequency-domain separation model separates the same audio signals.
Optionally, the time-domain separation model separates the first audio signal of a target component from audio, and the frequency-domain separation model likewise separates the first audio signal of that target component, where the target component is a vocal component, an accompaniment component, or any instrument component.
Optionally, the time-domain separation model separates the audio signals of multiple components from audio, and the frequency-domain separation model likewise separates the audio signals of those components. For example, both models separate the vocal component and the bass component from audio.
803. The computer device invokes the frequency-domain separation model to separate, based on the frequency-domain information of the target audio, the sixth audio signal of each component from the fourth audio signal of the target audio.
The audio signal of audio represents how the audio's waveform varies over time, so the audio signal is the audio's time-domain information. The spectrum is the frequency distribution curve of the audio and represents its frequency-domain information. Applying a time-frequency transform to the audio signal converts the audio's time-domain information into frequency-domain information.
Both the time-domain information and the frequency-domain information of audio contain the information of the audio's various components, so the audio signal of each component can be separated from the audio based on either. Step 803 illustrates separating component audio signals from audio based on the audio's frequency-domain information; step 804 illustrates separating component audio signals based on the audio's time-domain information.
In one possible implementation, invoking the frequency-domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency-domain information of the target audio includes: based on the amplitude information in the second spectrum corresponding to the target audio, invoking the frequency-domain separation model to separate the amplitude information corresponding to each component from the second spectrum, and generating the sixth audio signal of each component based on the amplitude information of the component.
The fourth audio signal is the audio signal of the target audio, and the sixth audio signal is the audio signal of each component separated from the target audio; the embodiments of this application use the terms fourth audio signal and sixth audio signal merely to distinguish the audio signal of the target audio from the audio signals of the separated components.
The second spectrum is the curve of the amplitude of the target audio's fourth audio signal arranged by frequency, so the second spectrum must be generated before the frequency-domain separation model is invoked. Optionally, generating the second spectrum includes: applying a Fourier transform to the fourth audio signal of the target audio to obtain a complex signal; taking the sum of the squares of the real-part information and the imaginary-part information of the complex signal, and taking the square root of that sum, to obtain the amplitude information of the fourth audio signal; and obtaining the curve of the amplitude information of the audio signal over frequency to obtain the second spectrum.
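The amplitude computation just described corresponds directly to the following sketch (a single-frame version for clarity; a practical system would apply it per STFT frame):

```python
import numpy as np

def magnitude_spectrum(frame: np.ndarray) -> np.ndarray:
    """Fourier-transform one frame, then take the square root of the sum of the
    squares of the real and imaginary parts as its amplitude information."""
    spec = np.fft.rfft(frame)                        # complex signal
    return np.sqrt(spec.real ** 2 + spec.imag ** 2)  # equivalent to np.abs(spec)
```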
Because the frequency-domain separation model can separate only amplitude information, the sixth audio signal of a component has to be generated based on the phase information of the fourth audio signal of the target audio together with the separated amplitude information.
Generating the sixth audio signal of a component using the phase information of the target audio's fourth audio signal introduces phase noise into the sixth audio signal, so the embodiments of this application also provide a more accurate separation method. In another possible implementation, invoking the frequency-domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency-domain information of the target audio includes: determining the first real-part signal and the first imaginary-part signal of a first spectrum, the first spectrum being the spectrum corresponding to the target audio; invoking the frequency-domain separation model to separate the second real-part signal and the second imaginary-part signal of each component from the first real-part signal and the first imaginary-part signal of the first spectrum; and determining the sixth audio signal of each component based on the second real-part signal and the second imaginary-part signal of the component.
Optionally, determining the first real-part signal and the first imaginary-part signal of the first spectrum includes: applying a Fourier transform to the fourth audio signal of the target audio to obtain the first real-part signal and the first imaginary-part signal corresponding to the fourth audio signal, and obtaining the curves of the first real-part signal and the first imaginary-part signal over frequency to obtain the first spectrum. Since the first spectrum is exactly the curves of the first real-part signal and the first imaginary-part signal over frequency, obtaining the first spectrum determines the first real-part signal and the first imaginary-part signal in it.
In addition, from the way the amplitude information and phase information of an audio signal are obtained, the first real-part signal and the first imaginary-part signal contain both the amplitude information and the phase information of the audio signal; the sixth audio signal of each component can therefore be determined directly from the second real-part signal and the second imaginary-part signal of the component, which avoids introducing phase noise and makes the resulting sixth audio signal more accurate.
Determining the sixth audio signal of each component based on the second real-part signal and the second imaginary-part signal of the component includes: applying an inverse time-frequency transform to the second real-part signal and the second imaginary-part signal of each component to obtain the sixth audio signal of the component.
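A minimal sketch of this real/imaginary-part path (single-frame; in practice an STFT with overlap-add reconstruction is assumed) shows why no mixture phase needs to be reused:

```python
import numpy as np

def component_frame(real_part: np.ndarray, imag_part: np.ndarray) -> np.ndarray:
    """Rebuild a component's time-domain frame from the real and imaginary parts
    the frequency-domain model separated; amplitude and phase are both encoded
    in (real, imag), so no phase noise is carried over from the mixture."""
    return np.fft.irfft(real_part + 1j * imag_part)
```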
804. The computer device invokes the time-domain separation model and, for each component, separates the first audio signal of the component from the sixth audio signal of the component based on the time-domain information of the component, obtaining the first audio signal of each component.
After the frequency-domain separation model has separated the sixth audio signal of each component from the audio, the computer device separates the output of the frequency-domain separation model again with the time-domain separation model to guarantee the separation quality. For example, the frequency-domain separation model separates the vocal component from the audio, but the separated vocal component may still contain some drum sounds; the vocal component output by the frequency-domain separation model is therefore input into the time-domain separation model, which continues to separate it.
For the time-domain separation model, the sixth audio signal of each component is input directly into the model, and for each component the time-domain separation model separates the first audio signal of the component from the sixth audio signal of the component.
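This serial "frequency-domain first, then time-domain" cascade can be summarized by the following sketch, where freq_model and time_model are hypothetical callables standing in for the two trained separation models:

```python
def separate_serial(fourth_signal, freq_model, time_model):
    """Rough separation by the frequency-domain model, then refinement of each
    rough stem (the sixth audio signals) by the time-domain model."""
    sixth_signals = freq_model(fourth_signal)           # {component: sixth signal}
    return {name: time_model(signal)                    # {component: first signal}
            for name, signal in sixth_signals.items()}
```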
Note that the embodiments of this application describe the process of invoking the time-domain separation model and the frequency-domain separation model to separate the first audio signal of each of at least one component from the fourth audio signal of the target audio only by taking the order "frequency-domain separation model first, then time-domain separation model" as an example. In another embodiment, the time-domain separation model is invoked first and the frequency-domain separation model second. In that case, the separation includes: invoking the time-domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time-domain information of the target audio; and invoking the frequency-domain separation model and, for each component, separating the first audio signal of the component from the fifth audio signal of the component based on the frequency-domain information of the component, obtaining the first audio signal of each component.
That is, the time-domain separation model is invoked first and the frequency-domain separation model second. Invoking the frequency-domain separation model and, for each component, separating the first audio signal of the component from the fifth audio signal of the component based on the frequency-domain information of the component includes: determining the third real-part signal and the third imaginary-part signal of a third spectrum, the third spectrum being the spectrum corresponding to the fifth audio signal of each component; invoking the frequency-domain separation model and, for each component, separating the fourth real-part signal and the fourth imaginary-part signal of the component from the third real-part signal and the third imaginary-part signal of the component; and determining the first audio signal of each component based on the fourth real-part signal and the fourth imaginary-part signal of the component.
The third spectrum is obtained in the same way as the first spectrum, which is not repeated here.
In another embodiment, the time-domain separation model and the frequency-domain separation model are invoked in parallel; the separation then includes: invoking the frequency-domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency-domain information of the target audio; invoking the time-domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time-domain information of the target audio; and, for each component, fusing the fifth audio signal of the component with the sixth audio signal of the component to obtain the first audio signal of the component.
That is, the time-domain separation model and the frequency-domain separation model separate the audio in parallel. For each component, fusing the fifth audio signal of the component with the sixth audio signal of the component to obtain the first audio signal of the component means: weighting the fifth audio signal and the sixth audio signal according to the weight of the fifth audio signal and the weight of the sixth audio signal to obtain the first audio signal.
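For the parallel variant, the weighted fusion reduces to a per-component weighted sum. The equal 0.5/0.5 weights in this sketch are hypothetical, as the weights themselves are left unspecified here:

```python
import numpy as np

def fuse_parallel(fifth: np.ndarray, sixth: np.ndarray,
                  w_time: float = 0.5, w_freq: float = 0.5) -> np.ndarray:
    """Combine the time-domain result (fifth signal) and the frequency-domain
    result (sixth signal) into the component's first audio signal."""
    return w_time * fifth + w_freq * sixth
```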
Note that the embodiments of this application also provide a way of training the time-domain separation model and the frequency-domain separation model. In one possible implementation, before invoking the frequency-domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency-domain information of the target audio, the method further includes: obtaining sample data, the sample data including a sample audio and the sample audio signal of each of at least one component of the sample audio; invoking the frequency-domain separation model to separate the first predicted audio signal of each of the at least one component from the sample audio signal of the sample audio based on the frequency-domain information of the sample audio; invoking the time-domain separation model to separate the second predicted audio signal of each component from the sample audio signal of the sample audio based on the time-domain information of the sample audio; for each component, fusing the first predicted audio signal of the component with the second predicted audio signal of the component to obtain the third predicted audio signal of the component; and training the frequency-domain separation model and the time-domain separation model according to the differences between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data, so that those differences converge.
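A hedged PyTorch-style sketch of one such joint training step follows; freq_model, time_model, and the mean-squared-error criterion are assumptions made for illustration, since the text only requires that the difference between the fused prediction and the sample signal converge:

```python
import torch

def train_step(sample_signal, sample_components, freq_model, time_model, optimizer):
    """One joint update of both separation models on one sample audio."""
    pred_freq = freq_model(sample_signal)   # first predicted signals per component
    pred_time = time_model(sample_signal)   # second predicted signals per component
    loss = torch.zeros(())
    for name, target in sample_components.items():
        fused = 0.5 * pred_freq[name] + 0.5 * pred_time[name]  # third predicted signal
        loss = loss + torch.mean((fused - target) ** 2)        # assumed MSE difference
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```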
Note that the time-domain separation model and the frequency-domain separation model provided by the embodiments of this application can separate at least one component from audio; the embodiments of this application also provide a method for separating audio when the time-domain separation model and the frequency-domain separation model can each separate only one component.
Audio separation using only the frequency-domain separation model is described first. Optionally, the frequency-domain separation model is a model that separates the audio signal of one component from audio, and invoking the frequency-domain separation model to separate the first audio signal of a first component from the audio includes: determining the first real-part signal and the first imaginary-part signal of the first spectrum, the first spectrum being the spectrum corresponding to the target audio; invoking the frequency-domain separation model to separate the second real-part signal and the second imaginary-part signal of the first component from the first real-part signal and the first imaginary-part signal of the first spectrum; and determining the first audio signal of the first component based on the second real-part signal and the second imaginary-part signal of the first component. Subsequently, the first audio signals of the remaining components of the target audio are determined based on the fourth audio signal of the target audio and the first audio signal of the first component; the first component and the remaining components constitute the multiple components, so the audio is separated into multiple components.
Audio separation using the time-domain separation model and the frequency-domain separation model together is described next. Optionally, the frequency-domain separation model and the time-domain separation model are models that separate the audio signal of one component from audio, and the computer device separating the first audio signal of a first component from the audio includes: invoking the time-domain separation model and the frequency-domain separation model to separate the first audio signal of the first component from the fourth audio signal of the target audio. Subsequently, the first audio signals of the remaining components of the target audio are determined based on the fourth audio signal of the target audio and the first audio signal of the first component; the first component and the remaining components constitute the multiple components, so the audio is separated into multiple components.
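When the models output only one component, one simple reading of the "remaining components" step, offered here as an assumption rather than a formula stated in the text, is to subtract the separated component from the mixture:

```python
import numpy as np

def remaining_signal(fourth_signal: np.ndarray, first_component: np.ndarray) -> np.ndarray:
    """Treat everything the single-output model did not extract as the remaining components."""
    return fourth_signal - first_component
```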
In the audio processing method provided by the embodiments of this application, the time-domain separation model and the frequency-domain separation model can separate audio based on different information in it and are therefore complementary; separating the audio with both models together separates the various components more accurately and improves the quality of the separation.
FIG. 9 is a schematic structural diagram of an audio processing apparatus provided by this application. Referring to FIG. 9, the apparatus includes:
a display module 901, configured to display, through a playback parameter setting interface, playback parameter setting options of multiple components separated from target audio, each component being a vocal component or any instrument component;
a determination module 902, configured to determine, in response to a trigger operation on the playback parameter setting options of at least one target component, the playback parameters set for the at least one target component, the target component being any one of the multiple components;
a processing module 903, configured to process, for each target component, the first audio signal of the target component according to the playback parameters set for the target component to obtain the second audio signal of the target component; and
a fusion module 904, configured to fuse the second audio signal of each target component with the third audio signals of the components of the target audio other than the at least one target component to obtain processed target audio.
As shown in FIG. 10, in one possible implementation, the playback parameters include a volume parameter, and the processing module 903 is configured to adjust the amplitude of the first audio signal of the target component according to the volume parameter set for the target component to obtain the second audio signal of the target component; or,
the playback parameters include a sound-effect parameter, and the processing module 903 is configured to perform sound-effect processing on the first audio signal of the target component according to the sound-effect parameter set for the target component to obtain the second audio signal of the target component; or,
the playback parameters include a timbre parameter indicating the timbre of audio, and the processing module 903 is configured to obtain the score information corresponding to the target component, the score information representing the pitch of the target component, and to generate the second audio signal of the target component according to the score information and the timbre parameter.
In one possible implementation, the apparatus further includes:
an acquisition module 905, configured to obtain the first audio signals of the multiple components from a server.
In one possible implementation, the apparatus further includes:
a separation module 906, configured to invoke a time-domain separation model and a frequency-domain separation model to separate the first audio signal of each of the multiple components from the fourth audio signal of the target audio; or,
the separation module 906 is configured to determine the first real-part signal and the first imaginary-part signal of a first spectrum, the first spectrum being the spectrum of the target audio; invoke the frequency-domain separation model to separate the second real-part signal and the second imaginary-part signal of each of the multiple components from the first real-part signal and the first imaginary-part signal of the first spectrum; and determine the first audio signal of each component based on the second real-part signal and the second imaginary-part signal of the component.
In one possible implementation, the separation module 906 includes:
a time-domain separation unit 9061, configured to invoke the time-domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time-domain information of the target audio; and
a frequency-domain separation unit 9062, configured to invoke the frequency-domain separation model and, for each component, separate the first audio signal of the component from the fifth audio signal of the component based on the frequency-domain information of the component, obtaining the first audio signal of each component;
where the time-domain separation model and the frequency-domain separation model are used to obtain the same types of components from audio.
In one possible implementation, the separation module 906 includes:
a frequency-domain separation unit 9062, configured to invoke the frequency-domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency-domain information of the target audio; and
a time-domain separation unit 9061, configured to invoke the time-domain separation model and, for each component, separate the first audio signal of the component from the sixth audio signal of the component based on the time-domain information of the component, obtaining the first audio signal of each component;
where the time-domain separation model and the frequency-domain separation model are used to obtain the same types of components from audio.
In one possible implementation, the separation module 906 includes:
a frequency-domain separation unit 9062, configured to invoke the frequency-domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency-domain information of the target audio;
a time-domain separation unit 9061, configured to invoke the time-domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time-domain information of the target audio; and
a fusion unit 9063, configured to fuse, for each component, the fifth audio signal of the component with the sixth audio signal of the component to obtain the first audio signal of the component.
In one possible implementation, the apparatus further includes:
an acquisition module 905, configured to obtain sample data, the sample data including a sample audio and the sample audio signal of each of multiple components of the sample audio;
the separation module 906, configured to invoke the frequency-domain separation model to separate the first predicted audio signal of each of the multiple components from the sample audio signal of the sample audio based on the frequency-domain information of the sample audio;
the separation module 906, further configured to invoke the time-domain separation model to separate the second predicted audio signal of each component from the sample audio signal of the sample audio based on the time-domain information of the sample audio;
the separation module 906, further configured to fuse, for each component, the first predicted audio signal of the component with the second predicted audio signal of the component to obtain the third predicted audio signal of the component; and
a training module 907, configured to train the frequency-domain separation model and the time-domain separation model according to the differences between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data.
FIG. 11 is a schematic structural diagram of an audio processing apparatus provided by an embodiment of this application. Referring to FIG. 11, the apparatus includes:
an audio acquisition module 1101, configured to obtain target audio, the target audio being composed of multiple components, each component being a vocal component or any instrument component;
a model acquisition module 1102, configured to obtain a time-domain separation model and a frequency-domain separation model, the time-domain separation model and the frequency-domain separation model being used to obtain the same types of components from audio; and
a separation module 1103, configured to invoke the time-domain separation model and the frequency-domain separation model to separate the first audio signal of each of at least one component from the fourth audio signal of the target audio.
As shown in FIG. 12, in one possible implementation, the separation module 1103 includes:
a time-domain separation unit 1113, configured to invoke the time-domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time-domain information of the target audio; and
a frequency-domain separation unit 1123, configured to invoke the frequency-domain separation model and, for each component, separate the first audio signal of the component from the fifth audio signal of the component based on the frequency-domain information of the component.
In one possible implementation, the separation module 1103 includes:
a frequency-domain separation unit 1123, configured to invoke the frequency-domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency-domain information of the target audio; and
a time-domain separation unit 1113, configured to invoke the time-domain separation model and, for each component, separate the first audio signal of the component from the sixth audio signal of the component based on the time-domain information of the component, obtaining the first audio signal of each component.
In one possible implementation, the frequency-domain separation unit 1123 is configured to determine the first real-part signal and the first imaginary-part signal of a first spectrum, the first spectrum being the spectrum corresponding to the target audio; invoke the frequency-domain separation model to separate the second real-part signal and the second imaginary-part signal of each component from the first real-part signal and the first imaginary-part signal of the first spectrum; and determine the sixth audio signal of each component based on the second real-part signal and the second imaginary-part signal of the component.
In one possible implementation, the separation module 1103 includes:
a frequency-domain separation unit 1123, configured to invoke the frequency-domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency-domain information of the target audio;
a time-domain separation unit 1113, configured to invoke the time-domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time-domain information of the target audio; and
a fusion unit 1133, configured to fuse, for each component, the fifth audio signal of the component with the sixth audio signal of the component to obtain the first audio signal of the component.
In one possible implementation, the apparatus further includes:
a sample acquisition module 1104, configured to obtain sample data, the sample data including a sample audio and the sample audio signal of each of at least one component of the sample audio;
the separation module 1103, configured to invoke the frequency-domain separation model to separate the first predicted audio signal of each of the at least one component from the sample audio signal of the sample audio based on the frequency-domain information of the sample audio;
the separation module 1103, configured to invoke the time-domain separation model to separate the second predicted audio signal of each component from the sample audio signal of the sample audio based on the time-domain information of the sample audio;
a fusion module 1105, configured to fuse, for each component, the first predicted audio signal of the component with the second predicted audio signal of the component to obtain the third predicted audio signal of the component; and
a training module 1106, configured to train the frequency-domain separation model and the time-domain separation model according to the differences between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data.
An embodiment of this application further provides a computer device that includes a processor and a memory, the memory storing at least one piece of program code that is loaded and executed by the processor to implement the operations performed in the audio processing methods of the above embodiments.
Optionally, the computer device is provided as a terminal. FIG. 13 is a schematic structural diagram of a terminal provided by an embodiment of this application. The terminal 1300 may be a portable mobile terminal, such as a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 1300 may also be called user equipment, a portable terminal, a laptop terminal, a desktop terminal, or other names.
The terminal 1300 includes a processor 1301 and a memory 1302.
The processor 1301 may include one or more processing cores, for example, a 4-core or 8-core processor. The processor 1301 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 1301 may also include a main processor and a coprocessor: the main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 1301 may be integrated with a GPU (Graphics Processing Unit) responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1301 may also include an AI (Artificial Intelligence) processor for computing operations related to machine learning.
The memory 1302 may include one or more computer-readable storage media, which may be non-transitory. The memory 1302 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 1302 stores at least one piece of program code to be executed by the processor 1301 to implement the audio processing methods provided by the method embodiments of this application.
In some embodiments, the terminal 1300 optionally further includes a peripheral device interface 1303 and at least one peripheral device. The processor 1301, the memory 1302, and the peripheral device interface 1303 may be connected by a bus or signal lines, and each peripheral device may be connected to the peripheral device interface 1303 by a bus, a signal line, or a circuit board. Specifically, the peripheral devices include at least one of a radio frequency circuit 1304, a display screen 1305, a camera assembly 1306, an audio circuit 1307, a positioning assembly 1308, and a power supply 1309.
Those skilled in the art can understand that the structure shown in FIG. 13 does not limit the terminal 1300, which may include more or fewer components than shown, combine certain components, or adopt a different component arrangement.
Optionally, the computer device is provided as a server. FIG. 14 is a schematic structural diagram of a server according to an exemplary embodiment. The server 1400 may vary considerably with configuration or performance and may include one or more processors (Central Processing Units, CPU) 1401 and one or more memories 1402, where the memory 1402 stores at least one piece of program code that is loaded and executed by the processor 1401 to implement the methods provided by the above method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and may include other components for implementing device functions, which are not described here.
An embodiment of this application further provides a computer device that includes a processor and a memory, the memory storing at least one piece of program code that is loaded and executed by the processor to perform the following steps:
displaying, through a playback parameter setting interface, playback parameter setting options of multiple components separated from target audio, wherein each component is a vocal component or any instrument component;
determining, in response to a trigger operation on the playback parameter setting options of at least one target component, the playback parameters set for the at least one target component, wherein the target component is any one of the multiple components;
for each target component, processing the first audio signal of the target component according to the playback parameters set for the target component to obtain the second audio signal of the target component; and
fusing the second audio signal of each target component with the third audio signals of the components of the target audio other than the at least one target component to obtain processed target audio.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following steps:
the playback parameters include a volume parameter, and the amplitude of the first audio signal of the target component is adjusted according to the volume parameter set for the target component to obtain the second audio signal of the target component; or,
the playback parameters include a sound-effect parameter, and sound-effect processing is performed on the first audio signal of the target component according to the sound-effect parameter set for the target component to obtain the second audio signal of the target component; or,
the playback parameters include a timbre parameter indicating the timbre of audio; score information corresponding to the target component is obtained, the score information representing the pitch of the target component; and the second audio signal of the target component is generated according to the score information and the timbre parameter.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following step:
obtaining the first audio signals of the multiple components from a server.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following steps:
invoking a time-domain separation model and a frequency-domain separation model to separate the first audio signal of each of the multiple components from the fourth audio signal of the target audio; or,
determining the first real-part signal and the first imaginary-part signal of a first spectrum, the first spectrum being the spectrum of the target audio; invoking the frequency-domain separation model to separate the second real-part signal and the second imaginary-part signal of each component from the first real-part signal and the first imaginary-part signal; and determining the first audio signal of each component based on the second real-part signal and the second imaginary-part signal of the component.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following steps:
invoking the time-domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time-domain information of the target audio; and
invoking the frequency-domain separation model and, for each component, separating the first audio signal of the component from the fifth audio signal of the component based on the frequency-domain information of the component, obtaining the first audio signal of each component;
wherein the time-domain separation model and the frequency-domain separation model are used to obtain the same types of components from audio.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following steps:
invoking the frequency-domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency-domain information of the target audio; and
invoking the time-domain separation model and, for each component, separating the first audio signal of the component from the sixth audio signal of the component based on the time-domain information of the component, obtaining the first audio signal of each component;
wherein the time-domain separation model and the frequency-domain separation model are used to obtain the same types of components from audio.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following steps:
invoking the frequency-domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency-domain information of the target audio;
invoking the time-domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time-domain information of the target audio; and
for each component, fusing the fifth audio signal of the component with the sixth audio signal of the component to obtain the first audio signal of the component.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following steps:
obtaining sample data, the sample data including a sample audio and the sample audio signal of each of multiple components of the sample audio;
invoking the frequency-domain separation model to separate the first predicted audio signal of each of the multiple components from the sample audio signal of the sample audio based on the frequency-domain information of the sample audio;
invoking the time-domain separation model to separate the second predicted audio signal of each component from the sample audio signal of the sample audio based on the time-domain information of the sample audio;
for each component, fusing the first predicted audio signal of the component with the second predicted audio signal of the component to obtain the third predicted audio signal of the component; and
training the frequency-domain separation model and the time-domain separation model according to the differences between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data.
An embodiment of this application further provides a computer device that includes a processor and a memory, the memory storing at least one piece of program code that is loaded and executed by the processor to perform the following steps:
obtaining target audio, the target audio being composed of multiple components, wherein each component is a vocal component or any instrument component;
obtaining a time-domain separation model and a frequency-domain separation model, the time-domain separation model and the frequency-domain separation model being used to obtain the same types of components from audio; and
invoking the time-domain separation model and the frequency-domain separation model to separate the first audio signal of each of at least one component from the fourth audio signal of the target audio.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following steps:
invoking the time-domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time-domain information of the target audio; and
invoking the frequency-domain separation model and, for each component, separating the first audio signal of the component from the fifth audio signal of the component based on the frequency-domain information of the component, obtaining the first audio signal of each component.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following steps:
invoking the frequency-domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency-domain information of the target audio; and
invoking the time-domain separation model and, for each component, separating the first audio signal of the component from the sixth audio signal of the component based on the time-domain information of the component.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following steps:
determining the first real-part signal and the first imaginary-part signal of a first spectrum, the first spectrum being the spectrum corresponding to the target audio;
invoking the frequency-domain separation model to separate the second real-part signal and the second imaginary-part signal of each component from the first real-part signal and the first imaginary-part signal of the first spectrum; and
determining the sixth audio signal of each component based on the second real-part signal and the second imaginary-part signal of the component.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following steps:
obtaining sample data, the sample data including a sample audio and the sample audio signal of each of at least one component of the sample audio;
invoking the frequency-domain separation model to separate the first predicted audio signal of each of the at least one component from the sample audio signal of the sample audio based on the frequency-domain information of the sample audio;
invoking the time-domain separation model to separate the second predicted audio signal of each component from the sample audio signal of the sample audio based on the time-domain information of the sample audio;
for each component, fusing the first predicted audio signal of the component with the second predicted audio signal of the component to obtain the third predicted audio signal of the component; and
training the frequency-domain separation model and the time-domain separation model according to the differences between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data.
An embodiment of this application further provides a computer-readable storage medium storing at least one piece of program code that is loaded and executed by a processor to implement the operations performed in the audio processing methods of the above embodiments.
In one possible implementation, the computer-readable storage medium stores at least one piece of program code that is loaded and executed by the processor to perform the following steps:
displaying, through a playback parameter setting interface, playback parameter setting options of multiple components separated from target audio, wherein each component is a vocal component or any instrument component;
determining, in response to a trigger operation on the playback parameter setting options of at least one target component, the playback parameters set for the at least one target component, wherein the target component is any one of the multiple components;
for each target component, processing the first audio signal of the target component according to the playback parameters set for the target component to obtain the second audio signal of the target component; and
fusing the second audio signal of each target component with the third audio signals of the components of the target audio other than the at least one target component to obtain processed target audio.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following steps:
the playback parameters include a volume parameter, and the amplitude of the first audio signal of the target component is adjusted according to the volume parameter set for the target component to obtain the second audio signal of the target component; or,
the playback parameters include a sound-effect parameter, and sound-effect processing is performed on the first audio signal of the target component according to the sound-effect parameter set for the target component to obtain the second audio signal of the target component; or,
the playback parameters include a timbre parameter indicating the timbre of audio; score information corresponding to the target component is obtained, the score information representing the pitch of the target component; and the second audio signal of the target component is generated according to the score information and the timbre parameter.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following step:
obtaining the first audio signals of the multiple components from a server.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following steps:
invoking a time-domain separation model and a frequency-domain separation model to separate the first audio signal of each of the multiple components from the fourth audio signal of the target audio; or,
determining the first real-part signal and the first imaginary-part signal of a first spectrum, the first spectrum being the spectrum of the target audio; invoking the frequency-domain separation model to separate the second real-part signal and the second imaginary-part signal of each component from the first real-part signal and the first imaginary-part signal; and determining the first audio signal of each component based on the second real-part signal and the second imaginary-part signal of the component.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following steps:
invoking the time-domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time-domain information of the target audio; and
invoking the frequency-domain separation model and, for each component, separating the first audio signal of the component from the fifth audio signal of the component based on the frequency-domain information of the component, obtaining the first audio signal of each component;
wherein the time-domain separation model and the frequency-domain separation model are used to obtain the same types of components from audio.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following steps:
invoking the frequency-domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency-domain information of the target audio; and
invoking the time-domain separation model and, for each component, separating the first audio signal of the component from the sixth audio signal of the component based on the time-domain information of the component, obtaining the first audio signal of each component;
wherein the time-domain separation model and the frequency-domain separation model are used to obtain the same types of components from audio.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following steps:
invoking the frequency-domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency-domain information of the target audio;
invoking the time-domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time-domain information of the target audio; and
for each component, fusing the fifth audio signal of the component with the sixth audio signal of the component to obtain the first audio signal of the component.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following steps:
obtaining sample data, the sample data including a sample audio and the sample audio signal of each of multiple components of the sample audio;
invoking the frequency-domain separation model to separate the first predicted audio signal of each of the multiple components from the sample audio signal of the sample audio based on the frequency-domain information of the sample audio;
invoking the time-domain separation model to separate the second predicted audio signal of each component from the sample audio signal of the sample audio based on the time-domain information of the sample audio;
for each component, fusing the first predicted audio signal of the component with the second predicted audio signal of the component to obtain the third predicted audio signal of the component; and
training the frequency-domain separation model and the time-domain separation model according to the differences between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data.
An embodiment of this application further provides a computer-readable storage medium storing at least one piece of program code that is loaded and executed by a processor to perform the following operations:
obtaining target audio, the target audio being composed of multiple components, wherein each component is a vocal component or any instrument component;
obtaining a time-domain separation model and a frequency-domain separation model, the time-domain separation model and the frequency-domain separation model being used to obtain the same types of components from audio; and
invoking the time-domain separation model and the frequency-domain separation model to separate the first audio signal of each of at least one component from the fourth audio signal of the target audio.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following operations:
invoking the time-domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time-domain information of the target audio; and
invoking the frequency-domain separation model and, for each component, separating the first audio signal of the component from the fifth audio signal of the component based on the frequency-domain information of the component, obtaining the first audio signal of each component.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following operations:
invoking the frequency-domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency-domain information of the target audio; and
invoking the time-domain separation model and, for each component, separating the first audio signal of the component from the sixth audio signal of the component based on the time-domain information of the component.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following operations:
determining the first real-part signal and the first imaginary-part signal of a first spectrum, the first spectrum being the spectrum corresponding to the target audio;
invoking the frequency-domain separation model to separate the second real-part signal and the second imaginary-part signal of each component from the first real-part signal and the first imaginary-part signal of the first spectrum; and
determining the sixth audio signal of each component based on the second real-part signal and the second imaginary-part signal of the component.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following operations:
invoking the frequency-domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency-domain information of the target audio;
invoking the time-domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time-domain information of the target audio; and
for each component, fusing the fifth audio signal of the component with the sixth audio signal of the component to obtain the first audio signal of the component.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following operations:
obtaining sample data, the sample data including a sample audio and the sample audio signal of each of at least one component of the sample audio;
invoking the frequency-domain separation model to separate the first predicted audio signal of each of the at least one component from the sample audio signal of the sample audio based on the frequency-domain information of the sample audio;
invoking the time-domain separation model to separate the second predicted audio signal of each component from the sample audio signal of the sample audio based on the time-domain information of the sample audio;
for each component, fusing the first predicted audio signal of the component with the second predicted audio signal of the component to obtain the third predicted audio signal of the component; and
training the frequency-domain separation model and the time-domain separation model according to the differences between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data.
An embodiment of this application further provides a computer program product storing at least one piece of program code that is loaded and executed by a processor to implement the operations performed in the audio processing methods of the above embodiments.
An embodiment of this application further provides a computer program product storing at least one piece of program code that is loaded and executed by a processor to perform the following steps:
displaying, through a playback parameter setting interface, playback parameter setting options of multiple components separated from target audio, wherein each component is a vocal component or any instrument component;
determining, in response to a trigger operation on the playback parameter setting options of at least one target component, the playback parameters set for the at least one target component, wherein the target component is any one of the multiple components;
for each target component, processing the first audio signal of the target component according to the playback parameters set for the target component to obtain the second audio signal of the target component; and
fusing the second audio signal of each target component with the third audio signals of the components of the target audio other than the at least one target component to obtain processed target audio.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following steps:
the playback parameters include a volume parameter, and the amplitude of the first audio signal of the target component is adjusted according to the volume parameter set for the target component to obtain the second audio signal of the target component; or,
the playback parameters include a sound-effect parameter, and sound-effect processing is performed on the first audio signal of the target component according to the sound-effect parameter set for the target component to obtain the second audio signal of the target component; or,
the playback parameters include a timbre parameter indicating the timbre of audio; score information corresponding to the target component is obtained, the score information representing the pitch of the target component; and the second audio signal of the target component is generated according to the score information and the timbre parameter.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following step:
obtaining the first audio signals of the multiple components from a server.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following steps:
invoking a time-domain separation model and a frequency-domain separation model to separate the first audio signal of each of the multiple components from the fourth audio signal of the target audio; or,
determining the first real-part signal and the first imaginary-part signal of a first spectrum, the first spectrum being the spectrum of the target audio; invoking the frequency-domain separation model to separate the second real-part signal and the second imaginary-part signal of each component from the first real-part signal and the first imaginary-part signal; and determining the first audio signal of each component based on the second real-part signal and the second imaginary-part signal of the component.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following steps:
invoking the time-domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time-domain information of the target audio; and
invoking the frequency-domain separation model and, for each component, separating the first audio signal of the component from the fifth audio signal of the component based on the frequency-domain information of the component, obtaining the first audio signal of each component;
wherein the time-domain separation model and the frequency-domain separation model are used to obtain the same types of components from audio.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following steps:
invoking the frequency-domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency-domain information of the target audio; and
invoking the time-domain separation model and, for each component, separating the first audio signal of the component from the sixth audio signal of the component based on the time-domain information of the component, obtaining the first audio signal of each component;
wherein the time-domain separation model and the frequency-domain separation model are used to obtain the same types of components from audio.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following steps:
invoking the frequency-domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency-domain information of the target audio;
invoking the time-domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time-domain information of the target audio; and
for each component, fusing the fifth audio signal of the component with the sixth audio signal of the component to obtain the first audio signal of the component.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following steps:
obtaining sample data, the sample data including a sample audio and the sample audio signal of each of multiple components of the sample audio;
invoking the frequency-domain separation model to separate the first predicted audio signal of each of the multiple components from the sample audio signal of the sample audio based on the frequency-domain information of the sample audio;
invoking the time-domain separation model to separate the second predicted audio signal of each component from the sample audio signal of the sample audio based on the time-domain information of the sample audio;
for each component, fusing the first predicted audio signal of the component with the second predicted audio signal of the component to obtain the third predicted audio signal of the component; and
training the frequency-domain separation model and the time-domain separation model according to the differences between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data.
An embodiment of this application further provides a computer program product storing at least one piece of program code that is loaded and executed by a processor to perform the following steps:
obtaining target audio, the target audio being composed of multiple components, wherein each component is a vocal component or any instrument component;
obtaining a time-domain separation model and a frequency-domain separation model, the time-domain separation model and the frequency-domain separation model being used to obtain the same types of components from audio; and
invoking the time-domain separation model and the frequency-domain separation model to separate the first audio signal of each of at least one component from the fourth audio signal of the target audio.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following operations:
invoking the time-domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time-domain information of the target audio; and
invoking the frequency-domain separation model and, for each component, separating the first audio signal of the component from the fifth audio signal of the component based on the frequency-domain information of the component, obtaining the first audio signal of each component.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following operations:
invoking the frequency-domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency-domain information of the target audio; and
invoking the time-domain separation model and, for each component, separating the first audio signal of the component from the sixth audio signal of the component based on the time-domain information of the component.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following operations:
determining the first real-part signal and the first imaginary-part signal of a first spectrum, the first spectrum being the spectrum corresponding to the target audio;
invoking the frequency-domain separation model to separate the second real-part signal and the second imaginary-part signal of each component from the first real-part signal and the first imaginary-part signal of the first spectrum; and
determining the sixth audio signal of each component based on the second real-part signal and the second imaginary-part signal of the component.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following operations:
invoking the frequency-domain separation model to separate the sixth audio signal of each component from the fourth audio signal of the target audio based on the frequency-domain information of the target audio;
invoking the time-domain separation model to separate the fifth audio signal of each component from the fourth audio signal of the target audio based on the time-domain information of the target audio; and
for each component, fusing the fifth audio signal of the component with the sixth audio signal of the component to obtain the first audio signal of the component.
In one possible implementation, the at least one piece of program code is loaded and executed by the processor to perform the following operations:
obtaining sample data, the sample data including a sample audio and the sample audio signal of each of at least one component of the sample audio;
invoking the frequency-domain separation model to separate the first predicted audio signal of each of the at least one component from the sample audio signal of the sample audio based on the frequency-domain information of the sample audio;
invoking the time-domain separation model to separate the second predicted audio signal of each component from the sample audio signal of the sample audio based on the time-domain information of the sample audio;
for each component, fusing the first predicted audio signal of the component with the second predicted audio signal of the component to obtain the third predicted audio signal of the component; and
training the frequency-domain separation model and the time-domain separation model according to the differences between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data.
Those of ordinary skill in the art can understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are only optional embodiments of the embodiments of this application and are not intended to limit them; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the embodiments of this application shall fall within the protection scope of this application.

Claims (20)

  1. An audio processing method, the method comprising:
    displaying, through a playback parameter setting interface, playback parameter setting options of multiple components separated from target audio, wherein each component is a vocal component or any instrument component;
    determining, in response to a trigger operation on the playback parameter setting options of at least one target component, playback parameters set for the at least one target component, wherein the target component is any one of the multiple components;
    for each target component, processing a first audio signal of the target component according to the playback parameters set for the target component to obtain a second audio signal of the target component; and
    fusing the second audio signal of each target component with third audio signals of components of the target audio other than the at least one target component to obtain processed target audio.
  2. The method according to claim 1, wherein the processing a first audio signal of the target component according to the playback parameters set for the target component to obtain a second audio signal of the target component comprises:
    the playback parameters comprise a volume parameter, and the amplitude of the first audio signal of the target component is adjusted according to the volume parameter set for the target component to obtain the second audio signal of the target component; or,
    the playback parameters comprise a sound-effect parameter, and sound-effect processing is performed on the first audio signal of the target component according to the sound-effect parameter set for the target component to obtain the second audio signal of the target component; or,
    the playback parameters comprise a timbre parameter indicating the timbre of audio; score information corresponding to the target component is obtained, the score information representing the pitch of the target component; and the second audio signal of the target component is generated according to the score information and the timbre parameter.
  3. The method according to claim 1, wherein the method further comprises:
    obtaining the first audio signals of the multiple components from a server.
  4. The method according to claim 1, wherein the method further comprises:
    invoking a time-domain separation model and a frequency-domain separation model to separate the first audio signal of each of the multiple components from a fourth audio signal of the target audio; or,
    determining a first real-part signal and a first imaginary-part signal of a first spectrum, the first spectrum being the spectrum of the target audio; invoking the frequency-domain separation model to separate a second real-part signal and a second imaginary-part signal of each component from the first real-part signal and the first imaginary-part signal; and determining the first audio signal of each component based on the second real-part signal and the second imaginary-part signal of the component.
  5. The method according to claim 4, wherein the invoking a time-domain separation model and a frequency-domain separation model to separate the first audio signal of each of the multiple components from a fourth audio signal of the target audio comprises:
    invoking the time-domain separation model to separate a fifth audio signal of each component from the fourth audio signal of the target audio based on time-domain information of the target audio; and
    invoking the frequency-domain separation model and, for each component, separating the first audio signal of the component from the fifth audio signal of the component based on frequency-domain information of the component, obtaining the first audio signal of each component;
    wherein the time-domain separation model and the frequency-domain separation model are used to obtain the same types of components from audio.
  6. The method according to claim 4, wherein the invoking a time-domain separation model and a frequency-domain separation model to separate the first audio signal of each of the multiple components from a fourth audio signal of the target audio comprises:
    invoking the frequency-domain separation model to separate a sixth audio signal of each component from the fourth audio signal of the target audio based on frequency-domain information of the target audio; and
    invoking the time-domain separation model and, for each component, separating the first audio signal of the component from the sixth audio signal of the component based on time-domain information of the component, obtaining the first audio signal of each component;
    wherein the time-domain separation model and the frequency-domain separation model are used to obtain the same types of components from audio.
  7. The method according to claim 4, wherein the invoking a time-domain separation model and a frequency-domain separation model to separate the first audio signal of each of the multiple components from a fourth audio signal of the target audio comprises:
    invoking the frequency-domain separation model to separate a sixth audio signal of each component from the fourth audio signal of the target audio based on frequency-domain information of the target audio;
    invoking the time-domain separation model to separate a fifth audio signal of each component from the fourth audio signal of the target audio based on time-domain information of the target audio; and
    for each component, fusing the fifth audio signal of the component with the sixth audio signal of the component to obtain the first audio signal of the component.
  8. The method according to claim 7, wherein the method further comprises:
    obtaining sample data, the sample data including a sample audio and a sample audio signal of each of multiple components of the sample audio;
    invoking the frequency-domain separation model to separate a first predicted audio signal of each of the multiple components from the sample audio signal of the sample audio based on frequency-domain information of the sample audio;
    invoking the time-domain separation model to separate a second predicted audio signal of each component from the sample audio signal of the sample audio based on time-domain information of the sample audio;
    for each component, fusing the first predicted audio signal of the component with the second predicted audio signal of the component to obtain a third predicted audio signal of the component; and
    training the frequency-domain separation model and the time-domain separation model according to differences between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data.
  9. An audio processing method, the method comprising:
    obtaining target audio, the target audio being composed of multiple components, wherein each component is a vocal component or any instrument component;
    obtaining a time-domain separation model and a frequency-domain separation model, the time-domain separation model and the frequency-domain separation model being used to obtain the same types of components from audio; and
    invoking the time-domain separation model and the frequency-domain separation model to separate a first audio signal of each of at least one component from a fourth audio signal of the target audio.
  10. The method according to claim 9, wherein the invoking the time-domain separation model and the frequency-domain separation model to separate a first audio signal of each of at least one component from a fourth audio signal of the target audio comprises:
    invoking the time-domain separation model to separate a fifth audio signal of each component from the fourth audio signal of the target audio based on time-domain information of the target audio; and
    invoking the frequency-domain separation model and, for each component, separating the first audio signal of the component from the fifth audio signal of the component based on frequency-domain information of the component, obtaining the first audio signal of each component.
  11. The method according to claim 9, wherein the invoking the time-domain separation model and the frequency-domain separation model to separate a first audio signal of each of at least one component from a fourth audio signal of the target audio comprises:
    invoking the frequency-domain separation model to separate a sixth audio signal of each component from the fourth audio signal of the target audio based on frequency-domain information of the target audio; and
    invoking the time-domain separation model and, for each component, separating the first audio signal of the component from the sixth audio signal of the component based on time-domain information of the component.
  12. The method according to claim 11, wherein the invoking the frequency-domain separation model to separate a sixth audio signal of each component from the fourth audio signal of the target audio based on frequency-domain information of the target audio comprises:
    determining a first real-part signal and a first imaginary-part signal of a first spectrum, the first spectrum being the spectrum corresponding to the target audio;
    invoking the frequency-domain separation model to separate a second real-part signal and a second imaginary-part signal of each component from the first real-part signal and the first imaginary-part signal of the first spectrum; and
    determining the sixth audio signal of each component based on the second real-part signal and the second imaginary-part signal of the component.
  13. The method according to claim 9, wherein the invoking the time-domain separation model and the frequency-domain separation model to separate a first audio signal of each of at least one component from a fourth audio signal of the target audio comprises:
    invoking the frequency-domain separation model to separate a sixth audio signal of each component from the fourth audio signal of the target audio based on frequency-domain information of the target audio;
    invoking the time-domain separation model to separate a fifth audio signal of each component from the fourth audio signal of the target audio based on time-domain information of the target audio; and
    for each component, fusing the fifth audio signal of the component with the sixth audio signal of the component to obtain the first audio signal of the component.
  14. The method according to claim 13, wherein the method further comprises:
    obtaining sample data, the sample data including a sample audio and a sample audio signal of each of at least one component of the sample audio;
    invoking the frequency-domain separation model to separate a first predicted audio signal of each of the at least one component from the sample audio signal of the sample audio based on frequency-domain information of the sample audio;
    invoking the time-domain separation model to separate a second predicted audio signal of each component from the sample audio signal of the sample audio based on time-domain information of the sample audio;
    for each component, fusing the first predicted audio signal of the component with the second predicted audio signal of the component to obtain a third predicted audio signal of the component; and
    training the frequency-domain separation model and the time-domain separation model according to differences between the third predicted audio signal of each component and the corresponding sample audio signal in the sample data.
  15. An audio processing apparatus, the apparatus comprising:
    a display module, configured to display, through a playback parameter setting interface, playback parameter setting options of multiple components separated from target audio, wherein each component is a vocal component or any instrument component;
    a determination module, configured to determine, in response to a trigger operation on the playback parameter setting options of at least one target component, playback parameters set for the at least one target component, wherein the target component is any one of the multiple components;
    a processing module, configured to process, for each target component, a first audio signal of the target component according to the playback parameters set for the target component to obtain a second audio signal of the target component; and
    a fusion module, configured to fuse the second audio signal of each target component with third audio signals of components of the target audio other than the at least one target component to obtain processed target audio.
  16. An audio processing apparatus, the apparatus comprising:
    an audio acquisition module, configured to obtain target audio, the target audio being composed of multiple components, wherein each component is a vocal component or any instrument component;
    a model acquisition module, configured to obtain a time-domain separation model and a frequency-domain separation model, the time-domain separation model and the frequency-domain separation model being used to obtain the same types of components from audio; and
    a separation module, configured to invoke the time-domain separation model and the frequency-domain separation model to separate a first audio signal of each of at least one component from a fourth audio signal of the target audio.
  17. A computer device comprising a processor and a memory, the memory storing at least one piece of program code that is loaded and executed by the processor to perform the following operations:
    displaying, through a playback parameter setting interface, playback parameter setting options of multiple components separated from target audio, wherein each component is a vocal component or any instrument component;
    determining, in response to a trigger operation on the playback parameter setting options of at least one target component, playback parameters set for the at least one target component, wherein the target component is any one of the multiple components;
    for each target component, processing a first audio signal of the target component according to the playback parameters set for the target component to obtain a second audio signal of the target component; and
    fusing the second audio signal of each target component with third audio signals of components of the target audio other than the at least one target component to obtain processed target audio.
  18. A computer device comprising a processor and a memory, the memory storing at least one piece of program code that is loaded and executed by the processor to perform the following operations:
    obtaining target audio, the target audio being composed of multiple components, wherein each component is a vocal component or any instrument component;
    obtaining a time-domain separation model and a frequency-domain separation model, the time-domain separation model and the frequency-domain separation model being used to obtain the same types of components from audio; and
    invoking the time-domain separation model and the frequency-domain separation model to separate a first audio signal of each of at least one component from a fourth audio signal of the target audio.
  19. A computer-readable storage medium storing at least one piece of program code that is loaded and executed by a processor to implement the operations performed in the audio processing method according to any one of claims 1 to 8, or to implement the operations performed in the audio processing method according to any one of claims 9 to 14.
  20. A computer program product storing at least one piece of program code that is loaded and executed by a processor to implement the operations performed in the audio processing method according to any one of claims 1 to 8, or to implement the operations performed in the audio processing method according to any one of claims 9 to 14.
PCT/CN2021/141662 2020-12-30 2021-12-27 Audio processing method and apparatus, computer device, and storage medium WO2022143530A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011603259.7A CN112685000B (zh) 2020-12-30 2020-12-30 Audio processing method and apparatus, computer device, and storage medium
CN202011603259.7 2020-12-30

Publications (1)

Publication Number Publication Date
WO2022143530A1 true WO2022143530A1 (zh) 2022-07-07

Family

ID=75454467

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/141662 WO2022143530A1 (zh) 2020-12-30 2021-12-27 Audio processing method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN112685000B (zh)
WO (1) WO2022143530A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112685000B (zh) 2020-12-30 2024-09-10 Guangzhou Kugou Computer Technology Co., Ltd. Audio processing method and apparatus, computer device, and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190206417A1 (en) * 2017-12-28 2019-07-04 Knowles Electronics, Llc Content-based audio stream separation
CN111370018A (zh) * 2020-02-28 2020-07-03 Vivo Mobile Communication Co., Ltd. Audio data processing method, electronic device, and medium
CN111724807A (zh) * 2020-08-05 2020-09-29 ByteDance Ltd. Audio separation method and apparatus, electronic device, and computer-readable storage medium
CN112685000A (zh) * 2020-12-30 2021-04-20 Guangzhou Kugou Computer Technology Co., Ltd. Audio processing method and apparatus, computer device, and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1280138A1 (de) * 2001-07-24 2003-01-29 Empire Interactive Europe Ltd. Method for analyzing audio signals
CN110503976B (zh) * 2019-08-15 2021-11-23 Guangzhou Fanggui Information Technology Co., Ltd. Audio separation method and apparatus, electronic device, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI WEI-DONG, LI FANG: "The Function and Application of Muti-sound Track Frequency Editing Software in Rhythmic Sportive Gymnastics", JOURNAL OF BEIJING TEACHERS COLLEGE OF PHYSICAL EDUCATION, vol. 14, no. 2, 30 June 2002 (2002-06-30), pages 91 - 92, 95, XP009538061, ISSN: 1009-783X *

Also Published As

Publication number Publication date
CN112685000B (zh) 2024-09-10
CN112685000A (zh) 2021-04-20


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21914266

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21914266

Country of ref document: EP

Kind code of ref document: A1