WO2023010949A1 - Audio data processing method and apparatus - Google Patents

Audio data processing method and apparatus - Download PDF

Info

Publication number
WO2023010949A1
Authority
WO
WIPO (PCT)
Prior art keywords: audio, information, target, transition, clips
Application number: PCT/CN2022/093923
Other languages: English (en), French (fr)
Inventor
王卓
王萌
杜春晖
范泛
刘经纬
李贤胜
徐德著
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Priority to EP22851667.0A (published as EP4365888A1)
Publication of WO2023010949A1
Priority to US18/426,495 (published as US20240169962A1)

Classifications

    • G06F16/68 Information retrieval of audio data: retrieval characterised by using metadata
    • G06F16/632 Information retrieval of audio data: query formulation
    • G06F16/683 Information retrieval of audio data: retrieval using metadata automatically derived from the content
    • G06N3/02 Computing arrangements based on biological models: neural networks
    • G06N3/08 Neural networks: learning methods
    • G10H1/0025 Electrophonic musical instruments: automatic or semi-automatic music composition
    • G10H1/057 Electrophonic musical instruments: special musical effects by envelope-forming circuits
    • G10H1/0008 Electrophonic musical instruments: associated control or indicating means
    • G10H1/0066 Electrophonic musical instruments: transmission between separate instruments or components using a MIDI interface
    • G10H1/46 Electrophonic musical instruments: volume control
    • G10H2250/035 Electrophonic musical processing: crossfade (time-domain amplitude envelope control of transitions between musical sounds)
    • G10H2250/311 Electrophonic musical processing: neural networks for musical recognition, control, composition or improvisation

Definitions

  • the present application relates to the field of multimedia technologies, and in particular to a method and device for processing audio data.
  • the present application provides a method and device for processing audio data. Based on the method, richer and more diverse mixed audio can be obtained.
  • the present application provides a method for processing audio data.
  • the method includes: acquiring m audio segments, where m is an integer greater than or equal to 2.
  • according to the m audio segments, m-1 pieces of transition audio information are determined, and according to the m-1 pieces of transition audio information and the m audio segments, the target mixed audio is generated.
  • the m-1 pieces of transition audio information are used to connect the m audio segments.
  • the first transition audio information is used to connect the first audio segment and the second audio segment, which are consecutive in order among the m audio segments.
  • the ordering of the m audio segments is their mixing order.
  • determining the m-1 pieces of transition audio information based on the m audio segments includes: determining the first transition audio information according to first information of the first audio segment and second information of the second audio segment.
  • the first information includes the musical instrument digital interface (MIDI) information and the audio feature information of the first audio segment.
  • the second information includes the MIDI information and the audio feature information of the second audio segment.
  • the first transition audio information includes the MIDI information of the first transition audio corresponding to the first transition audio information.
  • the audio feature information includes at least one of the main melody track position information, style label, emotion label, rhythm information, beat information, or key signature information of the audio clip.
  • in the method provided in the present application, the transition audio information used to connect multiple audio segments is generated in the MIDI domain. Since the MIDI information of audio is the most primitive form of expression of the audio, it records information such as the pitch of each note, the strength of each note, and the duration of each note. Therefore, compared with directly splicing multiple audio segments in the time domain, the method provided by this application processes the MIDI information of the audio segments in the MIDI domain, so the transition audio information generated for connecting two audio segments is based on music theory. In this way, the mixed audio obtained based on the transition audio information sounds smoother and more natural. Moreover, processing data in the MIDI domain also makes the mixed audio more flexible and consistent in post-rendering.
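  • For illustration only, the first information and second information described above can be pictured as a bundle of per-note MIDI data plus the listed audio features. The sketch below uses hypothetical Python class and field names that are not part of this application:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Note:
    pitch: int        # MIDI note number, e.g. 60 = middle C
    velocity: int     # note strength, 0-127
    start: float      # note-on time in beats
    duration: float   # note length in beats

@dataclass
class ClipInfo:
    """Information of one audio segment: its MIDI content plus audio features."""
    midi_tracks: List[List[Note]]      # one list of notes per track (instrument/vocal)
    main_melody_track: Optional[int]   # index of the main melody track, if known
    style_label: str                   # e.g. "rock"
    emotion_label: str                 # e.g. "happy"
    tempo_bpm: float                   # rhythm/beat information
    time_signature: str                # e.g. "4/4"
    key_signature: str                 # e.g. "C major"

# first_info and second_info would be two ClipInfo instances; the first
# transition audio information would be derived from both of them.
```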
  • determining the first transition audio information according to the first information of the first audio segment and the second information of the second audio segment includes: determining the first transition audio information according to the first information of the first audio segment, the second information of the second audio segment, and a preset neural network model.
  • the first transition audio information is determined based on a feature vector used to characterize the first transition audio information, and the feature vector of the first transition audio information is determined based on a first vector and a second vector.
  • the first vector is a feature vector generated for the end of the first audio segment according to the first information.
  • the second vector is a feature vector generated for the beginning of the second audio segment according to the second information.
  • the method provided in this application processes the MIDI information of multiple audio clips through the neural network model in the MIDI domain, thereby obtaining the MIDI information of the transition audio used to join the multiple audio clips.
  • because the transition audio information used to connect multiple audio segments is obtained by this application based on learning music theory in the MIDI domain, it can connect multiple audio segments more naturally and smoothly.
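  • As a rough illustration of the first vector and second vector described above (not the actual model of this application), the end of the first audio segment and the beginning of the second audio segment can each be encoded into a summary vector, and the feature vector of the transition can be derived from both; the encoder and the combination rule below are assumptions:

```python
import numpy as np

def encode(segment_features: np.ndarray) -> np.ndarray:
    # Stand-in for a learned encoder: here just a fixed random projection.
    rng = np.random.default_rng(0)
    projection = rng.standard_normal((segment_features.shape[-1], 64))
    return segment_features @ projection

# Hypothetical per-frame features for the end of clip 1 and the start of clip 2.
end_of_first_clip = np.random.rand(32, 128)       # 32 frames, 128-dim features
start_of_second_clip = np.random.rand(32, 128)

first_vector = encode(end_of_first_clip).mean(axis=0)       # summary of clip 1's tail
second_vector = encode(start_of_second_clip).mean(axis=0)   # summary of clip 2's head

# One simple combination: the transition vector interpolates between both summaries.
transition_vector = 0.5 * (first_vector + second_vector)
print(transition_vector.shape)  # (64,)
```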
  • the acquisition of the m audio segments includes: determining k target audios in response to the user's first operation, and extracting m audio segments from the k target audios, where 2 ≤ k ≤ m and k is an integer.
  • the present application can mix audio clips among multiple target audios selected by the user based on the user's wishes, thereby improving the user's experience.
  • before determining the m-1 pieces of transition audio information according to the m audio segments, the above method further includes: determining a mixing order of the m audio segments.
  • the above method further includes: in response to the second operation of the user, re-determining the mixing order of the m audio clips. According to the re-determined mixing sequence and the m audio clips, re-determine m-1 pieces of transitional audio information. According to the re-determined m-1 pieces of transitional audio information and m pieces of audio clips, regenerate the target audio mix.
  • when the target mixed audio is generated by the method provided in this application and the user is not satisfied with it, the user can input a second operation to the terminal device, so that the terminal device, in response to the second operation, adjusts the mixing order of the m audio segments used to generate the target mixed audio and regenerates a new target mixed audio.
  • in this way, the user can obtain a satisfactory target mixed audio, which improves the user experience.
  • the above method further includes: rendering the above target mixed audio in response to a third user operation.
  • the above method further includes: outputting the above target mixed audio.
  • the present application provides an audio data processing device.
  • the processing device is configured to execute any one of the methods provided in the first aspect above.
  • the present application may divide the processing device into functional modules according to any one of the methods provided in the first aspect above.
  • each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
  • this application may divide the processing device into an acquisition unit, a determination unit, a generation unit, and the like according to functions.
  • the processing device includes: one or more processors and a transmission interface, where the one or more processors receive or send data through the transmission interface, and the one or more processors are configured to invoke program instructions stored in a memory, so that the processing device executes any one of the methods provided in the first aspect and any possible design thereof.
  • the present application provides a computer-readable storage medium, where the computer-readable storage medium includes program instructions, and when the program instructions are run on a computer or a processor, the computer or the processor executes any method provided in any possible implementation of the first aspect.
  • the present application provides a computer program product, which, when running on an audio data processing device, enables any method provided in any possible implementation manner in the first aspect to be executed.
  • the present application provides an audio data processing system, where the system includes a terminal device and a server.
  • the terminal device is used to execute the part, of any method provided in any possible implementation of the first aspect, that interacts with the user.
  • the server is used to execute the part, of any method provided in any possible implementation of the first aspect, that generates the target mixed audio.
  • any device, computer storage medium, computer program product or system provided above can be applied to the corresponding method provided above. Therefore, for the beneficial effects that they can achieve, reference may be made to the beneficial effects of the corresponding methods, which will not be repeated here.
  • FIG. 1 is a schematic diagram of a mobile phone hardware structure provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of an audio data processing system provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a method for processing audio data provided by an embodiment of the present application
  • FIG. 4 is a schematic diagram of a first operation input by a user on an audio editing interface of an audio editing application provided by an embodiment of the present application;
  • FIG. 5 is a schematic diagram of another first operation input by the user on the audio editing interface of the audio editing application provided by the embodiment of the present application;
  • FIG. 6 is a schematic diagram of another first operation input by the user on the audio editing interface of the audio editing application provided by the embodiment of the present application;
  • FIG. 7 is a schematic diagram of another first operation input by the user on the audio editing interface of the audio editing application provided by the embodiment of the present application;
  • FIG. 8 is a schematic structural diagram of a preset neural network model provided in an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of another preset neural network model provided by the embodiment of the present application.
  • Fig. 10 is a schematic diagram of the second operation provided by the embodiment of the present application.
  • FIG. 11 is a schematic diagram of rendering and outputting MIDI information of target serial audio provided by an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of an audio data processing device provided by an embodiment of the present application.
  • Fig. 13 is a schematic structural diagram of a signal carrying medium for carrying a computer program product provided by an embodiment of the present application.
  • MIDI is the most widely used music standard format in the music arrangement industry, and it can be called "a music score that computers can understand".
  • MIDI records music with digital control signals of musical notes. That is, what MIDI transmits is not the sound signal itself, but instructions such as musical notes and control parameters. These instructions may instruct the MIDI device to play music, for example, instruct the MIDI device to play a certain note at the volume level indicated in the instruction.
  • the instructions transmitted by MIDI can be uniformly referred to as MIDI messages or MIDI information.
  • MIDI information can be presented in the form of a map, or in the form of a data stream.
  • when MIDI information is presented in the form of a map, it may be referred to as a MIDI spectrum for short.
  • the MIDI information can be understood as the music waveform signal representation in the MIDI domain.
  • correspondingly, the time-domain representation refers to the representation of the music waveform signal over time.
  • the MIDI information may generally include multiple sound tracks, and each sound track is marked with the start position and end position of the note, the pitch of the note, the velocity information of the note, and the like. Among them, one track is used to represent one instrument sound/vocal voice. It can be understood that the size of a complete piece of music expressed through MIDI information is usually only tens of kilobytes (kilobyte, KB), but it may contain dozens of sound tracks.
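  • As a concrete example of the note-level information a MIDI track carries, the sketch below writes a tiny single-track MIDI file using the third-party mido library; the note values and file name are arbitrary examples rather than anything defined in this application:

```python
import mido

mid = mido.MidiFile()
track = mido.MidiTrack()
mid.tracks.append(track)

# Tempo and instrument for this track (program 0 = acoustic grand piano).
track.append(mido.MetaMessage('set_tempo', tempo=mido.bpm2tempo(120), time=0))
track.append(mido.Message('program_change', program=0, time=0))

# One note: pitch (note number), strength (velocity) and duration (delta time in ticks).
track.append(mido.Message('note_on', note=60, velocity=80, time=0))
track.append(mido.Message('note_off', note=60, velocity=0, time=480))

mid.save('example.mid')  # a complete file like this is typically only a few KB
```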
  • the timbre library (also called the sampling library) includes various sounds that human beings can hear and create, such as recordings of the performance of various musical instruments, the singing and chanting of various human voices, and various natural and artificial sounds.
  • the space in which the original data is represented by features after transformation through several neural network layers can be called the latent space.
  • the dimensions of the latent space are generally smaller than the spatial dimensions of the original data.
  • Latent space can also be understood as some abstract extraction and representation of original data features.
  • a model that contains sequence data in its input or output can be called a sequence model.
  • Sequence models are often used to deal with data that has some order relationship.
  • a neural network used to build a sequence model can be called a sequence model network.
  • common sequence model networks include the recurrent neural network (RNN), the long short-term memory network (LSTM), the gated recurrent unit (GRU), the transformer, etc.
  • the prediction result obtained by the sequence model network at time t is usually obtained after learning the input data before time t.
  • when a sequence model network predicts the result at time t based on learning both the input data before time t and the input data after time t, the sequence model network is called a bidirectional sequence model network.
  • it can be seen that when a bidirectional sequence model network makes a prediction on the input data, it combines the context information around time t in the input data to predict the result.
  • therefore, a bidirectional sequence model can predict the result at any moment of the input data.
  • bidirectional sequence model networks include the bidirectional recurrent neural network (Bi-RNN), the bidirectional long short-term memory network (Bi-LSTM), the bidirectional gated recurrent unit (Bi-GRU), the transformer, etc.
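  • As an illustration of how a bidirectional sequence model network sees context both before and after time t, the following PyTorch sketch runs a Bi-GRU over a toy sequence; the sizes are arbitrary and the snippet is not tied to the model of this application:

```python
import torch
import torch.nn as nn

seq_len, feature_dim, hidden_dim = 16, 32, 64

# bidirectional=True makes the GRU read the sequence forwards and backwards,
# so the output at time t depends on context both before and after t.
bi_gru = nn.GRU(input_size=feature_dim, hidden_size=hidden_dim,
                batch_first=True, bidirectional=True)

x = torch.randn(1, seq_len, feature_dim)      # (batch, time, features)
output, hidden = bi_gru(x)

print(output.shape)   # (1, 16, 128): forward and backward states concatenated
print(hidden.shape)   # (2, 1, 64): final state of each direction
```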
  • words such as “exemplary” or “for example” are used as examples, illustrations or illustrations. Any embodiment or design scheme described as “exemplary” or “for example” in the embodiments of the present application shall not be interpreted as being more preferred or more advantageous than other embodiments or design schemes. Rather, the use of words such as “exemplary” or “such as” is intended to present related concepts in a concrete manner.
  • first and second are used for description purposes only, and cannot be understood as indicating or implying relative importance or implicitly specifying the quantity of indicated technical features. Thus, a feature defined as “first” and “second” may explicitly or implicitly include one or more of these features.
  • the term “at least one” in this application means one or more. In the description of the present application, unless otherwise specified, the term “plurality” means two or more.
  • determining B according to A does not mean determining B only according to A, and B may also be determined according to A and/or other information.
  • the audio mixing described below in the embodiments of the present application refers to the process of extracting multiple audio segments from different audios and joining the multiple audio segments in series according to a preset order.
  • the preset order is the mixing order of the multiple audio segments.
  • An embodiment of the present application provides a method for processing audio data.
  • the method first determines m-1 pieces of transition audio information based on m pre-acquired audio segments, and then connects the m audio segments based on the m-1 pieces of transition audio information, thereby generating the target mixed audio in which the m audio segments are mixed.
  • one piece of transitional audio information is used to connect two adjacent audio clips in sequence.
  • the embodiment of the present application also provides an audio data processing apparatus, where the processing apparatus may be a terminal device.
  • the terminal device may be a portable device such as a mobile phone, a tablet computer, a notebook computer, a personal digital assistant (PDA), a netbook, or a wearable electronic device (such as a smart watch or smart glasses), or a device such as a desktop computer, a smart TV, or a vehicle-mounted device, or any other terminal device capable of implementing the embodiments of the present application, which is not limited in the present application.
  • FIG. 1 shows a schematic diagram of a hardware structure of a mobile phone 10 provided by an embodiment of the present application.
  • the mobile phone 10 may include a processor 110 , an internal memory 120 , an external memory interface 130 , a camera 140 , a touch screen 150 , an audio module 160 and a communication module 170 and so on.
  • the processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices, or may be integrated in one or more processors.
  • the controller may be the nerve center and command center of the mobile phone 10 .
  • the controller can generate an operation control signal according to the instruction opcode and timing signal, and complete the control of fetching and executing the instruction.
  • the NPU is a neural-network (NN) computing processor.
  • Applications such as intelligent cognition of the mobile phone 10 can be implemented through the NPU, such as character recognition, image recognition, face recognition, and the like.
  • a memory may also be provided in the processor 110 for storing instructions and data.
  • the memory in processor 110 is a cache memory.
  • the memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to use the instruction or data again, it can be called directly from the memory. Repeated access is avoided, and the waiting time of the processor 110 is reduced, thereby improving the efficiency of the system.
  • processor 110 may include one or more interfaces.
  • the interface may include an integrated circuit (inter-integrated circuit, I2C) interface, an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous transmitter (universal asynchronous receiver/transmitter, UART) interface, mobile industry processor interface (mobile industry processor interface, MIPI), general-purpose input and output (general-purpose input/output, GPIO) interface, subscriber identity module (subscriber identity module, SIM) interface, and /or universal serial bus (universal serial bus, USB) interface, etc.
  • the I2C interface is a bidirectional synchronous serial bus, including a serial data line (SDA) and a serial clock line (SCL).
  • the I2S interface can be used for audio communication.
  • the PCM interface can also be used for audio communication, sampling, quantizing and encoding the analog signal.
  • the UART interface is a universal serial data bus used for asynchronous communication.
  • the bus can be a bidirectional communication bus. It converts the data to be transmitted between serial communication and parallel communication.
  • the MIPI interface can be used to connect the processor 110 with peripheral devices such as the camera 140 and the touch screen 150 .
  • MIPI interface includes camera serial interface (camera serial interface, CSI), touch screen serial interface (display serial interface, DSI), etc.
  • the GPIO interface can be configured by software.
  • the GPIO interface can be configured as a control signal or as a data signal.
  • the internal memory 120 can be used to store computer executable program codes, and the executable program codes include instructions.
  • the processor 110 executes various functional applications and data processing of the mobile phone 10 by executing instructions stored in the internal memory 120 , for example, executing the audio data processing method provided in the embodiment of the present application.
  • the external memory interface 130 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the mobile phone 10.
  • the external memory card communicates with the processor 110 through the external memory interface 130 to implement a data storage function. For example, save music, video, picture and other files in the external memory card.
  • the camera 140 is used to acquire still images or videos.
  • the object generates an optical image through the lens and projects it to the photosensitive element.
  • Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. It should be understood that the mobile phone 10 may include n cameras 140, where n is a positive integer.
  • the touch screen 150 is used for interaction between the mobile phone 10 and the user.
  • the touch screen 150 includes a display panel 151 and a touch pad 152 .
  • the display panel 151 is used for displaying text, images, videos and the like.
  • the touch panel 152 is used to input user's instruction.
  • the audio module 160 is used to convert digital audio information into analog audio signal output, and is also used to convert analog audio input into digital audio signal.
  • the audio module 160 may include at least one of a speaker 161 , a receiver 162 , a microphone 163 , and an earphone jack 164 .
  • the speaker 161 also called “speaker” is used to convert audio electrical signals into sound signals.
  • the earphone interface 164 is used for connecting wired earphones.
  • the earphone interface 164 can be a USB interface, a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.
  • the mobile phone 10 can realize the audio function through the speaker 161 , the receiver 162 , the microphone 163 , the earphone interface 164 , and the application processor in the audio module 160 .
  • the communication module 170 is used to realize the communication function of the mobile phone 10 .
  • the communication module 170 may be implemented by an antenna, a mobile communication module, a wireless communication module, a modem processor, a baseband processor, and the like.
  • Antennas are used to transmit and receive electromagnetic wave signals.
  • Each antenna in handset 10 can be used to cover single or multiple communication frequency bands. Different antennas can also be multiplexed to improve the utilization of the antennas.
  • the antenna 1 used for the mobile communication module can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
  • the mobile communication module can provide wireless communication solutions including 2G/3G/4G/5G applied on the mobile phone 10.
  • the mobile communication module may include at least one filter, switch, power amplifier, low noise amplifier (low noise amplifier, LNA) and the like.
  • the mobile communication module can receive electromagnetic waves through the antenna, filter and amplify the received electromagnetic waves, and send them to the modem processor for demodulation.
  • the mobile communication module can also amplify the signal modulated by the modem processor, and convert it into electromagnetic wave and radiate it through the antenna.
  • at least part of the functional modules of the mobile communication module may be set in the processor 110 .
  • at least part of the functional modules of the mobile communication module and at least part of the modules of the processor 110 may be set in the same device.
  • a modem processor may include a modulator and a demodulator.
  • the wireless communication module can provide applications on the mobile phone 10 including wireless local area networks (wireless local area networks, WLAN) (such as wireless fidelity (wireless fidelity, Wi-Fi) network), bluetooth (bluetooth, BT), GNSS, frequency modulation (frequency modulation, FM), near field communication (near field communication, NFC), infrared technology (infrared, IR) and other wireless communication solutions.
  • the wireless communication module may be one or more devices integrating at least one communication processing module.
  • the wireless communication module receives electromagnetic waves through the antenna, frequency-modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110 .
  • the wireless communication module can also receive the signal to be sent from the processor 110, frequency-modulate it, amplify it, and convert it into electromagnetic wave through the antenna to radiate out.
  • the GNSS in the embodiment of the present application may include: the global positioning system (GPS), the global navigation satellite system (GLONASS), the BeiDou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS), and/or satellite based augmentation systems (SBAS), etc.
  • the structure shown in the embodiment of the present application does not constitute a specific limitation on the mobile phone 10 .
  • the mobile phone 10 may include more or fewer components than shown in the figure, or combine certain components, or separate certain components, or arrange different components.
  • the illustrated components can be realized in hardware, software or a combination of software and hardware.
  • the above-mentioned audio data processing method may be realized by an application program (application, App) installed on the terminal device.
  • the App has the function of editing audio.
  • the App may be a music clip App or the like.
  • the App may be an App with a manual intervention function.
  • manual intervention means that the App can receive instructions input by the user and be able to respond to the instructions input by the user.
  • the App can interact with the user.
  • the App may include an interactive interface for interacting with the user, and the interactive interface is displayed through a display screen of the terminal device (such as the display panel 151 shown in FIG. 1 ).
  • the terminal device includes a touch screen (such as the touch screen 150 shown in FIG. 1 )
  • the user can interact with the App by operating the touch screen of the terminal device (such as operating the touch panel 152 shown in FIG. 1 ).
  • the terminal device does not include a touch screen (for example, the terminal device is an ordinary desktop computer)
  • the user can interact with the App through input and output devices such as a mouse and a keyboard of the terminal device.
  • the aforementioned App may be an embedded application installed in the terminal device (that is, a system application of the terminal device), or may be a downloadable application.
  • the embedded application program is an application program provided by an operating system of a device (such as a mobile phone).
  • the embedded application program may be a music application program provided by the mobile phone when leaving the factory.
  • a downloadable application is an application that can provide its own communication connection.
  • the downloadable application is an application that can be pre-installed in the device, or it can be a third-party application that is downloaded by the user and installed in the device.
  • the downloadable application program may be a music editing App, which is not specifically limited in this embodiment of the present application.
  • the above-mentioned processing device may also be a server.
  • the embodiment of the present application further provides an audio data processing system.
  • the processing system includes a server and a terminal device, and the server and the terminal device may be connected and communicated in a wired or wireless manner.
  • FIG. 2 shows a schematic diagram of a processing system 20 provided by an embodiment of the present application.
  • the processing system 20 includes a terminal device 21 and a server 22 .
  • the terminal device 21 can interact with the user through a client App (for example, a client App for audio editing), for example, receiving an instruction input by the user, and transmitting the received instruction to the server 22 .
  • the server 22 is used to execute the audio data processing method provided by the embodiment of the present application based on the instruction received from the terminal device 21, and send the generated MIDI information of the target mixed audio and/or the target mixed audio to the terminal device 21.
  • the terminal device 21 can receive the MIDI information of the target mixed audio and/or the target mixed audio sent by the server 22, play the target mixed audio to the user through the audio module, and/or display the MIDI spectrum of the target mixed audio to the user through the display screen, which is not limited here.
  • FIG. 3 shows a schematic flowchart of a method for processing audio data provided by an embodiment of the present application.
  • the method is executed by the above-mentioned audio data processing device.
  • the method may include the following steps:
  • the m audio segments include segments in different audios.
  • the terminal device may first determine k target audios, and then extract m audio segments from the k target audios.
  • wherein 2 ≤ k ≤ m, and k is an integer.
  • the terminal device can extract at least one audio segment from a target audio.
  • the m-1 pieces of transition audio information are used to connect the m pieces of audio segments.
  • the first transition audio information is the transition audio information used to join the first audio segment and the second audio segment, which are consecutive in order among the m audio segments acquired by the terminal device.
  • the order described here is the mixing order of the m audio segments. It should be understood that the mixing order is predetermined by the terminal device.
  • the terminal device may determine m-1 pieces of transition audio information according to the preset neural network model and the information of the m audio segments acquired in S101.
  • the transition audio information is the MIDI information of the transition audio segment.
  • the terminal device connects the MIDI information of the m audio segments through the above m-1 pieces of transition audio information (i.e., the MIDI information of the transition audio segments), that is, generates the MIDI information of the target mixed audio.
  • the terminal device may also output the target mixed audio after determining the target mixed audio based on the MIDI information of the target mixed audio.
  • it can be seen that, based on the m audio segments, the terminal device can generate m-1 pieces of transition audio information for connecting the m audio segments.
  • in this way, the MIDI information of the m audio segments can be concatenated through the m-1 pieces of transition audio information, so as to obtain the MIDI information of the target mixed audio in which the m audio segments are mixed.
  • after the terminal device converts the MIDI information of the target mixed audio into an audio format, the target mixed audio is obtained.
  • since the terminal can generate a new transition audio segment for connecting the multiple audio segments, the method provided in the embodiment of the present application does not need to consider the similarity of the audio segments used to obtain the target mixed audio. That is to say, richer and more diverse mixed audio can be obtained through the method provided by the embodiment of the present application.
  • in addition, the process of generating transition audio information by the method provided in the embodiment of the present application is carried out in the MIDI domain. Since the MIDI information of audio is the most primitive form of expression of the audio, it records information such as the pitch of each note, the strength of each note, and the duration of each note. Therefore, compared with directly concatenating multiple audio segments in the time domain, the method provided by the embodiment of the present application can generate transition audio information for connecting two audio segments based on music theory, and the mixed audio obtained based on the transition audio information sounds smoother and more natural. Moreover, processing data in the MIDI domain is more conducive to the flexibility and consistency of the mixed audio in post-rendering.
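  • Read end to end, the above steps amount to: take m segments, predict m-1 transition MIDI segments, splice everything in the MIDI domain, and then synthesize audio. The sketch below outlines that flow; every helper function in it is a hypothetical placeholder, since this application does not define such APIs:

```python
def generate_mixed_audio(clips, model):
    """clips: m clip-information objects (e.g. ClipInfo above) in mixing order."""
    assert len(clips) >= 2

    # One transition per adjacent pair -> m-1 pieces of transition MIDI information.
    transitions = [
        model.predict_transition(clips[i], clips[i + 1])   # hypothetical model API
        for i in range(len(clips) - 1)
    ]

    # Splice clip MIDI and transition MIDI in the MIDI domain.
    mixed_midi = clips[0].midi_tracks
    for transition, nxt in zip(transitions, clips[1:]):
        mixed_midi = concat_midi(mixed_midi, transition)       # hypothetical splicing helper
        mixed_midi = concat_midi(mixed_midi, nxt.midi_tracks)

    # Finally, render the MIDI information of the mixed audio into an audio waveform.
    return render_midi_to_audio(mixed_midi)                    # hypothetical synthesizer call
```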
  • the audio mentioned in S101 may be a complete/incomplete song/music, and an audio clip is a section of audio intercepted from an audio. It should also be understood that audio or audio clips are time-sequential.
  • the terminal device may randomly determine the k audios in the media database or the locally stored music database as the k target audios.
  • the terminal device may first receive a first operation input by the user, and respond to the first operation, so as to determine k target audio.
  • an application program having an audio clipping function is installed in the terminal device, and the first operation is an operation performed by the user on an audio clipping interface in the application program.
  • the above-mentioned first operation may include a user's selection operation of the target audio on the audio editing interface.
  • the selection operation may include a selection operation of a music database, and an operation of selecting a target audio in the selected music database.
  • the music database may be a music database stored locally, or a music database based on the system classifying audio according to audio scene tags, emotion tags, style tags, etc., or a music database automatically recommended by the system. It may be a user-defined music database configured after the user deletes or adds audio in the music database recommended or classified based on the system, which is not limited in this embodiment of the present application.
  • the system may be any media system that communicates with an application program installed on the terminal device that has an audio editing function, which is not limited in this embodiment of the present application.
  • the music database recommended by the system may be the music database recommended by the system according to the current scene/state of the user detected by the sensor of the terminal device. For example, when the sensor of the terminal device detects that the current state of the user is running, the system may recommend a music database including dynamic music to the user.
  • the music database recommended by the system may also be a randomly displayed music database of streaming media, such as major music charts, or a music database including popular classic music. etc.
  • each audio can be marked with tags including but not limited to scene, style, emotion, etc. when it is produced.
  • the scene refers to a scene suitable for listening to the audio, such as a work scene, a study scene, a running scene, etc.
  • the style refers to the musical style of the audio, such as rock, electronic, light music, etc.
  • Emotion refers to the emotion expressed by the audio, such as sadness, longing, loneliness, etc. This will not be detailed.
  • FIG. 4 shows a schematic diagram of a first operation input by the user on the audio editing interface of the audio editing application according to an embodiment of the present application.
  • an audio editing interface 401 of an audio editing application is displayed on the touch screen of the mobile phone 10 .
  • the audio editing interface 401 is an interface for selecting a music database under the label of "Classified Music Library". In this way, the user can perform a music database selection operation on the audio editing interface 401 .
  • type labels when the system classifies audio based on different audio classification standards are displayed.
  • the audio editing interface 401 displays type labels for classifying audio based on scenes suitable for listening to the audio, such as "work” label, "running” label and so on.
  • the audio editing interface 401 also displays type tags for classifying audio based on the emotion expressed in the audio, such as "happy” tag, "excited” tag, and the like.
  • the audio editing interface 401 also displays genre tags for classifying audio based on the music style of the audio, such as "popular" tags, "rhythm and blues” tags, and the like. It is easy to understand that the audio type label and its display format shown in (a) in FIG. 4 are only exemplary descriptions, and are not intended to limit the protection scope of the embodiment of the present application.
  • after the user selects type tags, the mobile phone 10 can display an interface with all the audio that the system recommends based on the tags selected by the user, namely "running", "happiness", "excitement" and "rhythm and blues", such as the target audio selection interface 402 shown in (b) in FIG. 4. It can be understood that all the audios displayed on the target audio selection interface 402 constitute the music database selected by the user.
  • in a possible implementation, the audio displayed by the mobile phone 10 on the target audio selection interface 402 may be audio that the mobile phone 10 automatically recommends as suitable for the user to listen to in the current environment, according to the environment/state of the user currently operating the mobile phone 10 as detected by sensors configured on the mobile phone 10 (such as a gyroscope sensor or a noise sensor). This will not be described in detail here.
  • the user may perform a target audio selection operation on the target audio selection interface 402 , for example, the user may select k target audios on the target audio selection interface 402 based on his own needs/preferences.
  • the mobile phone 10 determines k target audios.
  • FIG. 5 shows a schematic diagram of another first operation input by a user on an audio editing interface of an audio editing application program according to an embodiment of the present application.
  • an audio editing interface 501 of an audio editing application is displayed on the touch screen of the mobile phone 10 .
  • the audio editing interface 501 is a music database selection interface displayed after the user selects the "recommended music library" label on the audio editing interface 401 . In this way, the user can perform a music database selection operation on the audio editing interface 501 .
  • on the audio editing interface 501, identifiers of multiple music databases recommended by the system are displayed.
  • for example, the "Popular Classic" music library identifier, the "Internet Sweet Songs" music library identifier, the "Light Music Collection" music library identifier, the "Golden Songs" music library identifier, etc.
  • the display format of the music database and its logo shown in (a) in Figure 5 is only an exemplary description, and is not intended as a limitation to the protection scope of the embodiment of the present application.
  • the mobile phone 10 may also be divided into multiple interfaces to display the identifications of different types of music databases, which is not limited.
  • the user can, based on their own needs or preferences, operate (for example, click with a finger or a stylus) on a music database identifier (such as the "Popular Classic" music library) displayed on the audio editing interface 501. In response to this operation, the mobile phone 10 can display an interface with the audio in the "Popular Classic" music library, such as the target audio selection interface 502 shown in (b) in FIG. 5.
  • the user may perform a target audio selection operation on the target audio selection interface 502 , for example, the user may select k target audios on the target audio selection interface 502 based on his own needs/preferences.
  • the mobile phone 10 determines k target audios.
  • FIG. 6 shows a schematic diagram of yet another first operation input by a user on an audio editing interface of an audio editing application provided in an embodiment of the present application.
  • an audio editing interface 601 of an audio editing application is displayed on the touch screen of the mobile phone 10. It can be seen that the audio editing interface 601 is the target audio selection interface displayed after the user selects the "local music library" label on the audio editing interface 401 or the audio editing interface 501. In this way, the user can select the target audio on the audio editing interface 601.
  • the operation in which the user selects the "local music library" label on the audio editing interface 401 or the audio editing interface 501 so that the audio editing interface 601 is displayed is equivalent to an operation in which the user selects the local music database on the audio editing interface 401 or the audio editing interface 501.
  • multiple audios stored locally are displayed on the audio editing interface 601 , for example, the multiple audios are displayed in a list form.
  • the user can select target audio on the audio editing interface 601 , for example, the user can select k target audio on the audio editing interface 601 based on his/her own needs/preferences.
  • the mobile phone 10 determines k target audios.
  • the display format of multiple locally stored audios shown in FIG. 6 is only an exemplary description, and is not intended as a limitation to the embodiment of the present application.
  • the mobile phone 10 may also divide the multiple audios stored locally into multiple groups, and display audio lists in different groups with multiple hierarchical interfaces, which is not limited.
  • the above-mentioned first operation may include an input operation of the user inputting the quantity of the target audio on the audio editing interface, and a selection operation of the music database.
  • FIG. 7 shows a schematic diagram of another first operation input by a user on an audio editing interface of an audio editing application provided in an embodiment of the present application.
  • the audio editing interface 701 shown in FIG. 7 includes an interface (i.e., the input box 702) for inputting the number of audios to be mixed, so that the user can input the number of audios to be mixed through the input box 702. Taking the value of k being 3 as an example, the user can input the value "3" in the input box 702.
  • users can select a music database according to their own needs/preferences by operating the "Music Library” button on the audio editing interface 701 .
  • the process of selecting the music database displayed after the user operates the "Music Library” button on the audio editing interface 701 can refer to (a) in Figure 4, (b) in Figure 5 and Figure 6 above to select the music database description, which will not be repeated here.
  • the mobile phone 10 can select k audios in the music database selected by the user as the target audios according to the k value input by the user in the input box 702.
  • the mobile phone 10 may select k audios from the music database selected by the user as target audios based on preset rules and according to the k value input by the user in the input box 702 .
  • the mobile phone 10 may randomly select k audios in the music database as the target audio, or the mobile phone 10 may use the first k audios in the music database as the target audio, etc., which is not limited in this embodiment of the present application.
  • the terminal device can extract m audio segments from the k target audios through a preset algorithm.
  • the preset algorithm may be an algorithm for extracting the chorus/climax part of a song, which is not limited in this embodiment of the present application.
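  • The extraction algorithm is left open here; purely as an illustration (and not the method of this application), a crude highlight picker could simply take the highest-energy window of each target audio:

```python
import numpy as np

def pick_highlight(samples: np.ndarray, sr: int, clip_seconds: float = 15.0) -> np.ndarray:
    """Return the contiguous window of `clip_seconds` with the highest RMS energy.

    samples: mono waveform as a float array, sr: sample rate in Hz.
    This is a naive stand-in for a real chorus-detection algorithm.
    """
    win = int(clip_seconds * sr)
    if len(samples) <= win:
        return samples
    energy = np.convolve(samples.astype(np.float64) ** 2, np.ones(win), mode='valid')
    start = int(np.argmax(energy))
    return samples[start:start + win]
```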
  • the terminal device is preset with the mixing order of the m audio segments, or the terminal device can further interact with the user to determine the mixing order of the m audio segments.
  • for example, the terminal device (i.e., the mobile phone 10) can display a mixing order selection interface 403 as shown in (c) in FIG. 4.
  • the mixing order selection interface 403 may include three options: "order", "random" and "custom".
  • the mobile phone 10 may mix the m audio segments extracted from the k target audios according to the order of the k target audios in the music database to which they belong.
  • the sequence of the k target audios in the music database to which they belong may be reflected by the numbers of the k target audios in the music database.
  • the mobile phone 10 may randomly mix m audio clips extracted from k target audio.
  • the user may further input k target audio identifications (such as numbers) in a preset order in the "custom" option box 4031 .
  • the mobile phone 10 may concatenate the m audio segments extracted from the k target audio in the preset order.
  • the preset order is a user-defined order.
  • the audio editing interface 701 shown in FIG. 7 may also include the three options for inputting the mixing order; for their specific description, reference may be made to the description of (c) in FIG. 4 above, which will not be repeated here.
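  • The three order options can be captured with a small helper; the option names and shuffling behaviour below are an illustrative reading of the interface described above, not code from this application:

```python
import random

def mixing_order(num_clips: int, option: str, custom_order=None):
    """Return clip indices in the order they should be mixed.

    option: "order" keeps the database order, "random" shuffles,
    "custom" uses the identifiers the user typed into the custom box.
    """
    indices = list(range(num_clips))
    if option == "order":
        return indices
    if option == "random":
        random.shuffle(indices)
        return indices
    if option == "custom":
        assert custom_order is not None and sorted(custom_order) == indices
        return list(custom_order)
    raise ValueError(f"unknown option: {option}")

print(mixing_order(4, "random"))
```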
  • the terminal device may determine audio feature information of the m audio segments.
  • the audio feature information of the audio clip may include at least one of information such as main melody track position information, style tag, emotion tag, rhythm information, beat information, or key signature information of the audio clip.
  • the beat mentioned here is the beat of the music
  • the key signature is the tone signature.
  • the embodiment of the present application does not specifically limit the specific implementation manner for the terminal device to obtain the audio feature information of the audio segment. In this embodiment of the present application, the process of determining the audio characteristic information of the m audio clips by the terminal device is not described in detail.
  • the terminal device may also perform music and human voice separation processing on each audio segment by using music vocal detection technology.
  • for example, the terminal device can separate the human voice in the first audio segment from the sounds of various musical instruments (such as piano, bass, drums, violin, etc.) through music vocal detection technology, and convert the separated multi-track instrument sounds and vocals into data in MIDI format; the data in MIDI format is the MIDI information of the first audio segment.
  • the embodiment of the present application does not describe the music vocal detection technology in detail.
  • the audio clip may not include human voice.
  • the terminal device may separate the multi-track instrument sounds in the first audio segment through the music human voice detection technology, and convert the multi-track instrument sounds into data in MIDI format.
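  • A sketch of this preprocessing step is shown below, with the separation and transcription functions written as hypothetical placeholders, since this application does not name specific libraries:

```python
def clip_to_midi_info(waveform, sample_rate):
    """Separate an audio clip into stems and transcribe each stem to a MIDI track."""
    # Hypothetical source-separation call: returns e.g. {"vocals": ..., "piano": ..., "drums": ...}
    stems = separate_sources(waveform, sample_rate)

    midi_tracks = {}
    for name, stem in stems.items():
        # Hypothetical transcription call: waveform of one stem -> list of MIDI notes.
        midi_tracks[name] = transcribe_to_midi(stem, sample_rate)

    # The resulting multi-track MIDI data is the MIDI information of the clip.
    return midi_tracks
```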
  • the terminal device may determine m-1 pieces of transition audio information according to the audio feature information and MIDI information of the m audio clips and a preset neural network model. One piece of transition audio information among the m-1 pieces is used to connect two audio clips that are consecutive in the ordering of the m audio clips acquired by the terminal device.
  • the ordering of the m audio clips refers to their medley order, and the transition audio information is the MIDI information of the transition audio segment.
  • for a detailed description of how the terminal device determines the medley order of the m audio clips, reference may be made to the related descriptions in S101, which are not repeated here.
  • the preset neural network model may be preset in the terminal device, or may be preset on a server having a communication connection with the terminal device, which is not limited in this embodiment of the present application.
  • the preset neural network model includes an encoder, an information extraction module, an information generation module and a decoder.
  • the encoder, the information extraction module, the information generation module, and the decoder are all sequence model network structures, and the information extraction module is a bidirectional sequence model network structure.
  • the encoder, information generation module, and decoder may be networks such as RNN, LSTM, GRU, and Transformer.
  • the information extraction module can be Bi-RNN, Bi-LSTM, Bi-GRU, Transformer and other networks.
  • the preset neural network model includes at least two encoders, at least two information extraction modules, at least one information generation module, and at least one decoder.
  • the network structure of the at least two encoders is the same
  • the network structure of the at least two information extraction modules is the same
  • the network structure of the at least one information generation module is the same
  • the network structure of the at least one decoder is the same.
  • the network structure of the encoder and decoder can also be the same. It should be noted that the data flows of the encoder and the decoder are reversed.
  • the input to the encoder can be the output of the decoder
  • the output of the encoder can be the input to the decoder.
  • the above-mentioned network parameters of at least two encoders, at least two information extraction modules, at least one information generation module and at least one decoder are all determined when training the preset neural network model.
  • the detailed description of training the preset neural network model can refer to the description of training the preset neural network model 80 shown in FIG. 8 below, and details are not repeated here.
  • when the number of audio segments that need to be processed by the preset neural network model is m, the preset neural network model includes m inputs and m-1 outputs.
  • the number of encoders in the preset neural network model is m
  • the number of information extraction modules is 2×(m-1)
  • the number of information generation modules and decoders are both m-1.
  • the preset neural network model can simultaneously process information of m input audio segments, and output MIDI information of m-1 transition audio.
  • Fig. 8 shows a schematic structural diagram of a preset neural network model provided by the embodiment of the present application.
  • the preset neural network model 80 includes two encoders for receiving two inputs, namely an encoder 811 and an encoder 812 .
  • the preset neural network model 80 includes two (i.e., 2×(2-1)) information extraction modules, which are respectively an information extraction module 821 and an information extraction module 822.
  • the preset neural network model 80 also includes 1 (ie (2-1)) information generating module (ie, information generating module 83 ) and 1 (ie (2-1)) decoder (ie, decoder 84 ).
  • FIG. 9 shows a schematic structural diagram of another preset neural network model provided by an embodiment of the present application.
  • the preset neural network model 90 includes four encoders for receiving four inputs, namely an encoder 911 , an encoder 912 , an encoder 913 , and an encoder 914 .
  • the preset neural network model 90 includes 6 (i.e., 2×(4-1)) information extraction modules, which are respectively information extraction module 921, information extraction module 922, information extraction module 923, information extraction module 924, information extraction module 925, and information extraction module 926.
  • the preset neural network model 90 includes 3 (ie (4-1)) information generation modules, which are an information generation module 931 , an information generation module 932 , and an information generation module 933 .
  • the preset neural network model 90 also includes 3 (ie (4-1)) decoders, namely a decoder 941 , a decoder 942 , and a decoder 943 .
  • in a possible implementation, the number of encoders and information extraction modules in the preset neural network model is 2, and the number of information generation modules and decoders is 1 (for example, the preset neural network model shown in Figure 8). That is to say, the preset neural network model includes 2 inputs and 1 output.
  • the preset neural network model can simultaneously process two inputs (that is, information of two audio clips) at a time, and output one transitional audio information.
  • in this case, the preset neural network model can serially process the information of the m audio segments m-1 times, thereby obtaining m-1 pieces of transition audio information. It should be understood that the two audio clips processed each time by the preset neural network model are two adjacent audio clips in the medley order.
  • for example, for 4 audio clips whose medley order is audio clip 1 → audio clip 4 → audio clip 3 → audio clip 2, the terminal device can serially process the information of the 4 audio clips 3 (i.e., 4-1) times to obtain 3 pieces of transition audio information.
  • the terminal device may use the information of the audio clip 1 and the audio clip 4 as two input information of the preset neural network, so as to obtain the transition audio information 1 for joining the audio clip 1 and the audio clip 4 .
  • the terminal device can use the information of the audio segment 4 and the audio segment 3 as two input information of the preset neural network, so that the transition audio information 2 for connecting the audio segment 4 and the audio segment 3 can be obtained.
  • the terminal device can also use the information of the audio segment 3 and the audio segment 2 as two input information of the preset neural network, so that the transition audio information 3 for connecting the audio segment 3 and the audio segment 2 can be obtained.
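  • in code form, this serial pairwise processing amounts to a simple loop over adjacent clips in the medley order (model is the 2-input/1-output preset network and is an assumed callable here):

      def generate_transitions(clip_infos, model):
          """clip_infos: audio feature + MIDI info of the m clips, already in
          medley order; the model is called m-1 times, once per adjacent pair."""
          transitions = []
          for front, back in zip(clip_infos[:-1], clip_infos[1:]):
              transitions.append(model(front, back))   # transition joining front -> back
          return transitions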
  • taking as an example that the value of m is 2, that is, the terminal device obtains 2 audio clips (for example, a first audio clip and a second audio clip) whose medley order is the first audio clip → the second audio clip, each module in the preset neural network model provided by the embodiment of the present application and the process of processing the information of the audio clips will be described in detail with reference to FIG. 8. It can be seen from the medley order that the first audio clip is the preceding segment in the target medley audio (referred to as the front segment for short), and the second audio clip is the subsequent segment in the target medley audio (referred to as the post segment for short).
  • the first audio segment serves as the pre-segment, and its audio characteristic information and MIDI information may be referred to as first information.
  • the second audio segment is used as a post-segment, and its audio characteristic information and MIDI information may be referred to as second information.
  • the first information can be used as an input of the preset neural network model 80
  • the second information can be used as another input of the preset neural network model 80 .
  • the number of tracks included in the MIDI information in the first information (that is, the MIDI information of the first audio clip) and the number of tracks included in the MIDI information in the second information (that is, the MIDI information of the second audio clip) are the same.
  • more specifically, the number and types of tracks included in the MIDI information in the first information are the same as the number and types of tracks included in the MIDI information in the second information.
  • for example, if the MIDI information in the first information includes 3 tracks, namely a human voice track, a piano track, and a violin track, then the MIDI information in the second information also includes these three tracks.
  • if this is not the case, the terminal device can add empty tracks to the MIDI information in the second information, so that the number of tracks included in the MIDI information in the first information is the same as the number of tracks included in the MIDI information in the second information.
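  • a minimal sketch of that track alignment, assuming the clips' MIDI is held in pretty_midi objects and tracks are matched by name (both assumptions for illustration only):

      import pretty_midi

      def align_tracks(pm_a, pm_b):
          """Give the two clips' MIDI the same set of track names by appending
          empty tracks where one side is missing an instrument."""
          names_a = {inst.name for inst in pm_a.instruments}
          names_b = {inst.name for inst in pm_b.instruments}
          for missing in names_a - names_b:
              pm_b.instruments.append(pretty_midi.Instrument(program=0, name=missing))
          for missing in names_b - names_a:
              pm_a.instruments.append(pretty_midi.Instrument(program=0, name=missing))
          return pm_a, pm_b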
  • the terminal device may input the first information into the encoder 811 as an input (for example, input 1) of the preset neural network model 80 .
  • the encoder 811 can process the received first information and output the first sequence corresponding to the first audio segment.
  • the first sequence is a sequence obtained after the encoder 811 performs feature extraction on the MIDI information and audio feature information in the first information.
  • the first sequence can be understood as a sequence obtained after the encoder 811 performs dimensionality reduction on the first information, or the first sequence can be understood as a sequence obtained after the encoder 811 compresses the first information into a latent space.
  • the first sequence is a one-dimensional sequence in time sequence, and the length of the first sequence is determined by the length of the first audio segment. It can be understood that the longer the first audio segment, the longer the first sequence; the shorter the first audio segment, the shorter the first sequence.
  • the first sequence may be expressed as "{P1, P2, ..., Ps}", where P represents a feature vector and s represents the number of feature vectors. It should be understood that since the audio clip itself has timing, the first sequence also has timing.
  • P1 may be the feature vector corresponding to the start time of the audio clip
  • Ps may be the feature vector corresponding to the end time of the audio clip.
  • the terminal device can input the second information as another input (for example, input 2) of the preset neural network model 80 into the encoder 812, so that the encoder 812 can process the received second information and output a second sequence corresponding to the second audio segment.
  • the description of the second sequence may refer to the first sequence, and details are not repeated here.
  • the second sequence may be expressed as "{F1, F2, ..., Ft}", where F represents a feature vector and t represents the number of feature vectors.
  • F1 may be the feature vector corresponding to the start time of the audio clip
  • Ft may be the feature vector corresponding to the end time of the audio clip.
  • the information extraction module 821 receives the first sequence output by the encoder 811. Since the audio clips themselves are time-sequential and the first audio clip is the preceding segment, the information extraction module 821 may output a first vector corresponding to the end moment of the first audio clip after learning the first sequence. This process can also be understood as the information extraction module 821 further performing dimensionality reduction on the first sequence. It should be noted that the first vector carries the characteristics of the first sequence and corresponds to the end moment of the first sequence.
  • the information extraction module 822 receives the second sequence output by the encoder 812. After the information extraction module 822 learns the second sequence, it may output a second vector corresponding to the start time of the second audio segment. This process can also be understood as the information extraction module 822 further performing dimensionality reduction on the second sequence. It should be noted that the second vector carries the characteristics of the second sequence and corresponds to the start time of the second sequence.
  • the sum (i.e., the third vector) of the latent space vector corresponding to the end moment of the first audio clip (i.e., the first vector) and the latent space vector corresponding to the beginning moment of the second audio clip (i.e., the second vector) can be used as the feature vector characterizing the transition audio used to connect the first audio clip and the second audio clip.
  • the preset neural network model 80 can determine a transition audio segment for connecting the first audio segment and the second audio segment. Specifically, the preset neural network model 80 inputs the third vector to the information generation network module 83, and the information generation network module 83 can learn the received third vector and output the third sequence. It should be understood that the third sequence is a sequence formed by feature vectors of transitional audio segments used to connect the first audio segment and the second audio segment.
  • the third sequence may be expressed as "{M1, M2, ..., Mj}", where M represents a feature vector and j represents the number of feature vectors.
  • M1 may be the feature vector corresponding to the beginning moment of the transition audio segment used to connect the first audio segment and the second audio segment
  • Mj may be the feature vector corresponding to the end moment of the transition audio segment.
  • the decoder 84 receives the third sequence output by the information generation module 83, learns the third sequence, and outputs the transition audio information for connecting the first audio segment and the second audio segment, that is, the MIDI information of the transition audio segment.
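  • the data flow described above (encoder -> bidirectional extraction module -> sum of the two boundary vectors -> information generation module -> decoder) can be sketched as follows; the use of GRU layers, the fixed per-step feature dimension, the repeated third vector as generator input, and the fixed transition length are all illustrative assumptions rather than choices made by the application:

      import torch.nn as nn

      class TransitionNet(nn.Module):
          """2-input / 1-output variant of model 80."""
          def __init__(self, feat_dim, hidden=256, out_len=64):
              super().__init__()
              self.enc_front = nn.GRU(feat_dim, hidden, batch_first=True)                       # encoder 811
              self.enc_back = nn.GRU(feat_dim, hidden, batch_first=True)                        # encoder 812
              self.extract_front = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True) # module 821
              self.extract_back = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)  # module 822
              self.generate = nn.GRU(2 * hidden, hidden, batch_first=True)                      # module 83
              self.decode = nn.GRU(hidden, feat_dim, batch_first=True)                          # decoder 84
              self.out_len = out_len

          def forward(self, front, back):
              # front/back: (batch, time, feat_dim) sequences built from MIDI + feature info
              seq1, _ = self.enc_front(front)            # "first sequence"
              seq2, _ = self.enc_back(back)              # "second sequence"
              h1, _ = self.extract_front(seq1)
              h2, _ = self.extract_back(seq2)
              v1 = h1[:, -1, :]                          # first vector: end of the front clip
              v2 = h2[:, 0, :]                           # second vector: start of the back clip
              v3 = v1 + v2                               # third vector joining the two clips
              steps = v3.unsqueeze(1).repeat(1, self.out_len, 1)
              seq3, _ = self.generate(steps)             # "third sequence" for the transition
              transition_feats, _ = self.decode(seq3)    # decoded transition representation
              return transition_feats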
  • the preset neural network model 80 shown in FIG. 8 may be a neural network model trained based on a plurality of training samples in advance.
  • a training sample includes MIDI information and audio feature information of two audio clips, and the label value of the training sample is the MIDI information of the transition audio segment constructed by domain experts based on the two audio clips.
  • the neural network model shown in FIG. 8 in the embodiment of the present application can be obtained by repeatedly iteratively training the neural network based on multiple training samples.
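  • a training loop over such samples might look like the hedged sketch below (the MSE loss on a continuous transition representation is an assumption; the application does not state which loss or optimizer is used):

      import torch
      import torch.nn as nn

      def train(model, samples, epochs=10, lr=1e-3):
          """samples: list of (front_info, back_info, expert_transition) tensors,
          where expert_transition is the label representation built from the MIDI
          written by a domain expert for that pair of clips."""
          opt = torch.optim.Adam(model.parameters(), lr=lr)
          loss_fn = nn.MSELoss()
          for _ in range(epochs):
              for front, back, target in samples:
                  pred = model(front, back)
                  loss = loss_fn(pred, target)
                  opt.zero_grad()
                  loss.backward()
                  opt.step()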
  • the terminal device inserts the m-1 pieces of transition audio information (i.e., the MIDI information of the transition audio segments) generated in step S102 into the MIDI information of the m audio segments, so as to concatenate the MIDI information of the m audio segments through the m-1 pieces of transition audio information, thereby generating the MIDI information of the target medley audio obtained after the m audio segments are concatenated.
  • for example, for audio clips concatenated in the order audio clip 1 → audio clip 3 → audio clip 2, the terminal device can insert transition audio information 1 between the MIDI information of audio clip 1 and audio clip 3, and can insert transition audio information 2 between the MIDI information of audio clip 3 and audio clip 2. In this way, the terminal device generates the MIDI information of the target medley audio obtained after audio clip 1, audio clip 2, and audio clip 3 are concatenated.
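  • if each clip and each transition segment is held as a pretty_midi object (an assumption for illustration), the insertion step reduces to shifting every segment behind the end of the previous one and collecting the tracks, for example:

      import pretty_midi

      def concatenate(segments):
          """segments: clip and transition MIDI objects interleaved in medley
          order, e.g. [clip1, transition1, clip3, transition2, clip2]."""
          medley = pretty_midi.PrettyMIDI()
          offset = 0.0
          for seg in segments:
              for inst in seg.instruments:
                  out = pretty_midi.Instrument(program=inst.program, is_drum=inst.is_drum, name=inst.name)
                  for note in inst.notes:
                      out.notes.append(pretty_midi.Note(velocity=note.velocity, pitch=note.pitch,
                                                        start=note.start + offset, end=note.end + offset))
                  medley.instruments.append(out)
              offset += seg.get_end_time()
          return medley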
  • after the terminal device generates the MIDI information of the target medley audio, it can play the target medley audio to the user.
  • the user may input a second operation to the terminal device.
  • in response to the second operation, the terminal device adjusts the medley order of the m audio clips used to generate the target medley audio, re-determines m-1 pieces of transition audio information based on the adjusted medley order and the m audio clips, and regenerates the MIDI information of the target medley audio.
  • the terminal device can then play the regenerated target medley audio to the user.
  • when the user is satisfied with the target medley audio, S104 can be executed; when the user is not satisfied with the target medley audio, the second operation can be input to the terminal device again, so that the terminal device adjusts the medley order of the m audio clips once more and regenerates the MIDI information of the target medley audio. It can be seen that, through repeated interactions between the terminal device and the user, a target medley audio that satisfies the user can be obtained, which improves the user experience.
  • the terminal device displays the MIDI spectrum of the target serial audio on the display panel in the form of a graph.
  • the second operation of the user may be a drag operation on the MIDI spectrum.
  • the terminal device may respond to the second operation to re-determine the mixing order of the m audio segments used for mixing to obtain the aforementioned target mixing audio.
  • the terminal device can re-determine m-1 pieces of transitional audio information based on the re-determined concatenation sequence of the m audio clips and the m audio clips, and by executing S102. Further, the terminal device can regenerate the target serial audio according to the re-determined m-1 transitional audio information and the m audio segments.
  • for the process of regenerating the target medley audio by the terminal device, reference may be made to the detailed description of the process of generating the target medley audio above, which is not repeated here.
  • take as an example that the terminal device is the mobile phone 10, the MIDI information of the target medley audio generated by the mobile phone 10 includes 3 tracks, and the target medley audio is obtained by concatenating audio clip 1, audio clip 2, and audio clip 3 in the order audio clip 1 → audio clip 3 → audio clip 2.
  • referring to (a) in Figure 10, (a) in Figure 10 shows a schematic diagram of a second operation.
  • the MIDI spectrum of the target medley audio can be displayed on the display panel, as shown in (a) in Figure 10.
  • the MIDI spectrum displayed on the medley audio editing interface 1001 includes three tracks, which are track 1 shown in black bars, track 2 shown in white bars, and track 3 shown in striped bars.
  • the start line on the MIDI spectrum displayed on the serial audio editing interface 1001 is used to mark the start of the target serial audio.
  • the MIDI spectrum displayed on the serial audio editing interface 1001 also includes a plurality of dividing lines, and the multiple dividing lines are used to distinguish different audio segments and transitional audio segments in the target serial audio.
  • the audio clip between the start line and dividing line 1 is audio clip 1
  • the audio clip between dividing line 1 and dividing line 2 is transition audio segment 1, which connects audio clip 1 and audio clip 3
  • the audio clip between dividing line 2 and dividing line 3 is audio clip 3
  • the audio clip between dividing line 3 and dividing line 4 is transition audio segment 2, which connects audio clip 3 and audio clip 2
  • the audio clip on the right side of dividing line 4 is audio clip 2
  • the MIDI spectrum in (a) in Figure 10 does not show the termination line used to mark the end of the target medley audio.
  • the name of each audio segment may also be displayed on the MIDI spectrum displayed on the serial audio editing interface 1001 , which is not limited in this embodiment of the present application.
  • the mobile phone 10 plays the target mixing audio to the user.
  • the second operation may be an operation in which the user drags the MIDI spectrum displayed on the medley audio editing interface 1001 (for example, an operation in which a finger or a touch pen slides on the display panel), for example, the user presses and holds audio clip 1 with a finger and drags it to the position of audio clip 2.
  • in response to this operation, the mobile phone 10 exchanges the order of audio clip 1 and audio clip 2, that is, the mobile phone 10 re-determines the medley order of audio clip 1, audio clip 2, and audio clip 3 as audio clip 2 → audio clip 3 → audio clip 1.
  • the mobile phone 10 may regenerate the target mixed audio according to the audio clip 1, the audio clip 2, the audio clip 3 and the re-determined mixing sequence.
  • the second operation may be an operation in which the user inputs the target serialization sequence on the audio editing interface displayed after the terminal device generates the MIDI information of the target serialization audio.
  • the terminal device receives the target sequence input by the user. That is to say, the terminal device re-determines the target serialization sequence of the m audio clips.
  • the terminal device can re-determine m-1 pieces of transitional audio information based on the received target sequence of m audio clips and the m audio clips, and by executing S102. Further, the terminal device can regenerate the target serial audio according to the re-determined m-1 transitional audio information and the m audio segments.
  • take as an example that the target medley audio is obtained by the mobile phone 10 concatenating audio clip 1, audio clip 2, and audio clip 3 in the order audio clip 1 → audio clip 3 → audio clip 2.
  • (b) in Figure 10 shows a schematic diagram of another second operation provided by the embodiment of the present application.
  • the medley audio editing interface 1001 as shown in (b) in Figure 10 can be displayed on the display panel.
  • the user can operate (for example, click with a finger or a touch pen) the playback icon 1002 on the medley audio editing interface 1001, and in response, the mobile phone 10 can play the target medley audio.
  • if the user is dissatisfied with the target medley audio, the user can input a second operation to the mobile phone 10.
  • for example, the user can input the desired target medley order in the input box 1003 of the target medley order on the medley audio editing interface 1001, for example, the user inputs "2, 3, 1" in the input box 1003, where "2" can be used to represent the identifier of audio clip 2, "3" can be used to represent the identifier of audio clip 3, and "1" can be used to represent the identifier of audio clip 1.
  • "2, 3, 1" can thus be used to indicate that the medley order of audio clip 1, audio clip 2, and audio clip 3 is audio clip 2 → audio clip 3 → audio clip 1.
  • the mobile phone 10 receives the user-input target sequence. In this way, the mobile phone 10 determines the target mix sequence of the audio clip 1 , the audio clip 2 and the audio clip 3 .
  • the mobile phone 10 can regenerate the target mix audio according to the audio clip 1, the audio clip 2, the audio clip 3, and the received target mix sequence.
  • the current medley order may also be displayed on the medley audio editing interface 1001 shown in (b) of FIG. 10, for example, "current medley order: 1, 3, 2". It should be understood that the current medley order can serve as a reference when the user inputs the target medley order.
  • the terminal device can directly save and output the MIDI information of the latest target serial audio generated in S103.
  • the terminal device may also synthesize the time-domain waveform of the target serialized audio according to the latest MIDI information of the target serialized audio generated in S103, so as to obtain the target serialized audio.
  • the terminal device can also save/output the target serial audio.
  • the embodiment of the present application does not limit the specific manner in which the terminal device synthesizes the time-domain waveform of the target serial audio according to the MIDI information of the target serial audio.
  • for example, the terminal device can synthesize the time-domain waveform of the target medley audio by loading a tone library for the MIDI information of the target medley audio; or it can synthesize the time-domain waveform according to the MIDI information of the target medley audio and a wavetable (the wavetable is a pre-recorded and stored collection of all the sounds that various real instruments can emit, including various ranges, timbres, and so on); or it can synthesize the time-domain waveform based on the MIDI information of the target medley audio and a physical model/neural network model; of course, it is not limited to these.
  • the physical model/neural network model here is a pre-built model for synthesizing audio time-domain waveforms from MIDI information.
  • it should be noted that when the terminal device synthesizes the time-domain waveform of the target medley audio according to the MIDI information of the target medley audio in this way, it only synthesizes the time-domain waveform of the target medley audio.
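  • as one concrete (and assumed) example of the "load a tone library" route, a MIDI file can be rendered with a SoundFont through pretty_midi/FluidSynth, falling back to plain sine-tone synthesis when no sample library is available:

      import pretty_midi

      def render_waveform(midi_path, sf2_path=None, fs=44100):
          """Render the medley MIDI to a time-domain waveform."""
          pm = pretty_midi.PrettyMIDI(midi_path)
          if sf2_path is not None:
              return pm.fluidsynth(fs=fs, sf2_path=sf2_path), fs   # tone-library route
          return pm.synthesize(fs=fs), fs                          # sine-tone fallback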
  • optionally, the terminal device may receive a third operation of the user, and in response to the third operation, render the MIDI information of the latest target medley audio generated in S103 and synthesize the time-domain waveform of the target medley audio according to the rendered MIDI information. Then, the terminal device further renders the synthesized time-domain waveform of the target medley audio to obtain the rendered target medley audio.
  • the terminal device can also save/output the rendered target medley audio.
  • for a detailed description of how the terminal device synthesizes the time-domain waveform of the target medley audio according to the rendered MIDI information, reference may be made to the above description of synthesizing the time-domain waveform according to the MIDI information of the target medley audio, which is not repeated here.
  • the third operation of the user may include the selection operation of the audio rendering processing mode input by the user on the audio rendering interface, and the selection operation of the processing mode input by the user on the audio rendering interface for synthesizing the time-domain waveform of the target serial audio.
  • rendering the MIDI information of the target medley audio may include, for example, sound source separation.
  • the processing method for synthesizing the time-domain waveform of the target medley audio may include, for example, loading a tone library, wavetable synthesis, physical model synthesis, and the like.
  • rendering the time-domain waveform of the target medley audio may include mixing, vocal style transfer, and the like, which is not limited in this embodiment of the present application.
  • the terminal device may save the time-domain waveform of the rendered target serial audio as an audio file in any audio format, which is not specifically limited in this embodiment of the present application.
  • for example, the terminal device can save the rendered target medley audio waveform as an audio file in WAV format, an audio file in free lossless audio codec (FLAC) format, an audio file in moving picture experts group audio layer III (MP3) format, an audio file in OGG Vorbis (ogg) format, and so on; of course, it is not limited to these.
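  • writing the rendered waveform to several of these containers is straightforward with a library such as soundfile (an assumption; the application does not name a tool), where the format follows from the file extension:

      import soundfile as sf

      def export_medley(waveform, fs, path="medley.wav"):
          """.wav, .flac and .ogg are supported by libsndfile; MP3 export
          would need a separate encoder and is not shown here."""
          sf.write(path, waveform, fs)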
  • the terminal device can also save the project for generating the target serial audio.
  • in this way, the terminal can subsequently reset the medley order of the m audio clips used to generate the target medley audio and perform the medley processing again, which improves efficiency when the m audio clips are made into a medley again in the future.
  • FIG. 11 shows a schematic diagram of rendering and outputting MIDI information of target serial audio provided by the embodiment of the present application.
  • the mobile phone 10 may display an audio rendering interface 1101 .
  • the user may input a third operation on the audio rendering interface 1101. The third operation may include: the user enables "Remove Human Voice" under the "Sound Source Separation" option (in (a) in FIG. 11, a black square means enabled and a white square means disabled), the user selects "Load Sound Library" under the "Audio Waveform Synthesis" option, the user selects "Record Vocal" under the "Mixing" option, and the user selects "Singer A" as the migration target under the "Vocal Style Migration" option.
  • in response to the third operation, the mobile phone 10 can delete or disable the human voice track in the MIDI information of the target medley audio, load the sound library for the MIDI information of the target medley audio to synthesize the time-domain waveform of the target medley audio, then open the recording interface to record vocals for the target medley audio, and migrate the human voice in the target medley audio to the voice of singer A.
  • the mobile phone 10 may display the Audio publishing interface 1102 .
  • the terminal device 10 can interact with the user through the audio publishing interface 1102, and export the target mix audio according to the instruction input by the user.
  • the mobile phone 10 may receive a selection operation of selecting an export audio format input by the user under the "export format” option of the audio publishing interface 1102, such as selecting "audio format 1".
  • the mobile phone 10 can receive an operation in which the user inputs the name (for example, name A) and the save path of the target medley audio under the "Export Path" option of the audio publishing interface 1102.
  • the mobile phone 10 may also receive an operation input by the user to enable the "Save Project" function under the "Save Project” option of the audio publishing interface 1102 .
  • the mobile phone 10 will save the target serialized audio according to the user's instruction.
  • the method part of determining transition audio information and generating target audio (ie steps S102-S104) in the method provided in the embodiment of the present application may also be performed during the process of the terminal device playing audio to the user in real time.
  • the method part of determining the transition audio information and generating the target audio in the method provided by the embodiment of the present application may be realized by a functional module of an App capable of providing audio listening.
  • an App capable of providing audio listening may be, for example, a cloud music App.
  • for brevity, the following description takes a cloud music App as an example of an App capable of providing audio listening.
  • when the cloud music App provides the user with music listening modes, it may provide a medley mode.
  • the medley mode can be realized by the terminal device running the cloud music App, or by a server in communication connection with the cloud music App, executing steps S102-S104 in the method provided in the embodiment of the present application.
  • for simplicity of description, the following takes as an example that the terminal device running the cloud music App executes steps S102-S104 in the method provided by the embodiment of the present application to realize the medley mode.
  • the music played by the terminal device can be music automatically recommended by the cloud music media library, or music in the local media library, which is not limited in this embodiment of the application.
  • when the terminal device determines, through interaction with the user, that music is to be played to the user in the above medley mode, the terminal device can use the current music being played to the user and the next piece of music to be played to the user as two target audios, and based on the 2 target audios and the preset medley order, execute the above S102-S104 to generate the first target medley audio obtained by making a medley of the 2 target audios.
  • the preset medley order is: the current music being played by the terminal device to the user → the next music to be played by the terminal device to the user.
  • the terminal device can determine the next piece of music automatically recommended for the user during the process of playing the current music to the user.
  • the recommended piece of music is the next music that the terminal device will play to the user.
  • the terminal device may complete the concatenation of the determined two target audios and obtain the first target concatenation audio.
  • the terminal device may play the first target audio for the user after playing the current music to the user. Further, the terminal device plays the original next music for the user after playing the first target mixed music to the user.
  • for example, if the current music is music 1 and the next music is music 2, the terminal device can play the first target medley audio to the user after playing music 1, and then play music 2 for the user after playing the first target medley audio.
  • after the terminal device plays the original next piece of music for the user, that piece of music becomes the new current music being played by the terminal device to the user.
  • the terminal device can repeat the above process to generate the second target medley audio obtained by making a medley of the new current music and the music following it.
  • the terminal device may play the second target medley audio for the user after playing the new current music to the user.
  • in other words, when the terminal device plays music to the user through the medley mode, the terminal device can dynamically generate the medley audio of the current music and the next music, and play the medley audio for the user, thereby improving the user experience.
  • optionally, when the terminal device plays music to the user through the medley mode, the terminal device may play the complete current music and next music for the user, or may play only the parts of the current music and the next music that are used to generate the medley audio, for example, only the chorus/climax parts of the current music and the next music, which is not limited in this embodiment of the present application.
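  • the dynamic behaviour of the medley mode can be summarized by the loose sketch below, in which extract_highlight, make_medley, and play stand for S101, S102-S104, and the device's audio output respectively (all three are assumed callables, not APIs defined by the application):

      def medley_playback(playlist, extract_highlight, make_medley, play):
          """Between each track and the next, generate and play the medley
          audio built from the two tracks' highlight clips."""
          for current, nxt in zip(playlist[:-1], playlist[1:]):
              play(current)                               # current music (or only its highlight)
              a = extract_highlight(current)
              b = extract_highlight(nxt)
              play(make_medley(a, b))                     # dynamically generated target medley audio
          play(playlist[-1])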
  • it should be noted that, when the terminal device executes the above S102-S104 based on the two determined target audios and the preset medley order to generate a target medley audio (such as the first target medley audio, the second target medley audio, ..., the qth target medley audio, where q is a positive integer), in S103 the terminal device only needs to execute the process of generating transition audio information once, and does not need to receive the second operation.
  • in addition, the medley mode of the cloud music App may be preset with at least one of a preset rendering mode and a preset export mode of the target medley audio.
  • the preset export mode includes the export format of the medley audio, an indication of whether to save the project of the target medley audio, and the like. Therefore, in S104, the terminal device does not need to interact with the user to obtain the rendering mode and export mode of the target medley audio.
  • the preset rendering mode and preset export mode of the target medley audio that are preset in the medley mode of the cloud music App can be pre-configured through interaction between the terminal device and the user, or can be configured through interaction with the user during the process of the terminal device playing music to the user, which is not limited here.
  • optionally, the terminal device can also update the pre-configured preset rendering mode and preset export mode through interaction with the user during the process of playing music to the user, which is not limited in this embodiment of the present application.
  • optionally, the above preset export mode may also not include an indication of whether to save the project of the target medley audio.
  • in this case, the terminal device may interact with the user to receive an instruction input by the user indicating whether to save the project of the target medley audio, and save the project of the target medley audio according to the instruction. It can be understood that, prior to this, the terminal device may cache all dynamically generated target medley audio projects.
  • it should be noted that the above description of the medley mode is only an example and is not intended to limit the embodiment of the present application.
  • the embodiment of the present application provides a method for processing audio data.
  • based on the m audio clips, the method provided in the embodiment of the present application can generate, in the MIDI domain, m-1 pieces of transition audio information for connecting the m audio clips.
  • the MIDI information of the m audio clips can then be concatenated through the m-1 pieces of transition audio information, so as to obtain the target medley audio after the m audio clips are made into a medley.
  • since the terminal device can generate brand-new transition audio segments for connecting the multiple audio clips, the method provided in the embodiment of the present application does not need to consider the similarity of the audio clips used to obtain the target medley audio. That is to say, richer and more diverse medley audio can be obtained through the method provided by the embodiment of the present application.
  • since the MIDI information of audio is the most elementary representation of the audio, it records information such as the pitch, velocity, and duration of the notes. Therefore, compared with directly concatenating multiple audio clips in the time domain, the transition audio information for connecting two audio clips generated in the method provided by the embodiment of the present application is generated on the basis of music theory. In this case, the medley audio obtained based on the transition audio information is smoother and more natural to the ear. Moreover, processing data in the MIDI domain is also more conducive to the flexibility and consistency of the medley audio in later rendering.
  • the user when the method provided by the embodiment of the present application is used to mix m audio clips, the user can have a high degree of participation, so that the mixed audio that satisfies the user can be obtained, that is, the user experience is high.
  • the embodiment of the present application may divide the audio data processing apparatus into functional modules according to the above method examples.
  • each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or in the form of software function modules. It should be noted that the division of modules in the embodiment of the present application is schematic, and is only a logical function division, and there may be other division methods in actual implementation.
  • FIG. 12 shows a schematic structural diagram of an audio data processing device 120 provided by an embodiment of the present application.
  • the processing device 120 may be used to execute the above audio data processing method, for example, to execute the method shown in FIG. 3 .
  • the processing device 120 may include an acquiring unit 121 , a determining unit 122 and a generating unit 123 .
  • the obtaining unit 121 is configured to obtain m audio clips.
  • the determining unit 122 is configured to determine m-1 pieces of transition audio information according to the m audio segments.
  • the generating unit 123 is configured to generate the target mix audio according to the m audio segments and the m-1 transition audio information.
  • the m-1 pieces of transition audio information are used to connect the m pieces of audio segments.
  • the first transition audio information is used to join the sequentially sorted first audio segment and the second audio segment among the m audio segments.
  • the sorting of the m audio clips refers to the concatenation order of the m audio clips.
  • the acquisition unit 121 may be used to execute S101
  • the determination unit 122 may be used to execute S102
  • the generation unit 123 may be used to execute S103-S104.
  • the determining unit 122 is specifically configured to determine the first transition audio information according to the first information of the first audio segment and the second information of the second audio segment.
  • the first information includes MIDI information and audio feature information of the first audio segment
  • the second information includes MIDI information and audio feature information of the second audio segment
  • the first transition audio information includes the corresponding MIDI message for the first transition audio.
  • the determining unit 122 may be used to execute S102.
  • the above-mentioned audio characteristic information includes at least one of main melody track position information, style label, emotion label, rhythm information, beat information, or key signature information of the audio clip.
  • the determining unit 122 is specifically configured to determine the first transition audio information according to the first information of the first audio segment, the second information of the second audio segment and a preset neural network model.
  • the determining unit 122 may be used to execute S102.
  • the above-mentioned first transitional audio information is determined based on a feature vector used to characterize the first transitional audio information
  • the feature vector of the first transition audio information is determined based on the first vector and the second vector.
  • the first vector is a feature vector generated at the end of the sequence of the first audio clip according to the first information
  • the second vector is a feature vector generated at the start of the sequence of the second audio clip according to the second information.
  • the determining unit 122 is further configured to determine k target audios in response to the user's first operation.
  • the acquiring unit 121 is specifically configured to extract m audio segments from the k target audio. Wherein, 2 ⁇ k ⁇ m, and k is an integer.
  • the determination unit 122 and the acquisition unit 121 may be used to perform S101,
  • the determining unit 122 is further configured to determine the concatenation order of the m audio segments before determining the m-1 pieces of transitional audio information according to the m audio segments.
  • the determining unit 122 is further configured to re-determine the sequence of the m audio clips in response to the second operation of the user.
  • the determining unit 122 is also used for re-determining m-1 pieces of transitional audio information according to the re-determined mixing sequence and m audio clips.
  • the generating unit 123 is also configured to regenerate the target medley audio according to the re-determined m-1 pieces of transition audio information and the m audio clips.
  • the processing device 120 further includes: a rendering unit 124, configured to render the above-mentioned target audio mix in response to a third user operation.
  • the rendering unit 124 may be used to execute S104.
  • the processing device 120 further includes: an output unit 125, configured to output the above-mentioned target serial audio.
  • the acquisition unit 121 and the output unit 125 in the processing device 120 may implement their functions through the touch screen 150 and the processor 110 in FIG. 1 .
  • the determining unit 122, the generating unit 123, and the rendering unit 124 may be implemented by the processor 110 in FIG. 1 executing the program code in the internal memory 120 in FIG. 1.
  • Fig. 13 shows a schematic structural diagram of a signal-carrying medium for carrying a computer program product provided by an embodiment of the present application.
  • the signal-carrying medium is used for storing a computer program product or a computer program for executing a computer process on a computing device.
  • signal-bearing medium 130 may include one or more program instructions that, when executed by one or more processors, may provide the functions or portions of the functions described above with respect to FIG. 3 .
  • one or more features referred to as S101 - S104 in FIG. 3 may be undertaken by one or more instructions associated with the signal-bearing medium 130 .
  • the program instructions in FIG. 13 also describe example instructions.
  • the signal bearing medium 130 may comprise a computer readable medium 131, such as, but not limited to, a hard drive, a compact disc (CD), a digital video disc (DVD), a digital tape, memory, read-only memory (ROM), or random access memory (RAM), and so on.
  • signal bearing media 130 may comprise computer recordable media 132 such as, but not limited to, memory, read/write (R/W) CDs, R/W DVDs, and the like.
  • signal bearing medium 130 may include communication media 133 such as, but not limited to, digital and/or analog communication media (eg, fiber optic cables, waveguides, wired communication links, wireless communication links, etc.).
  • the signal bearing medium 130 may be conveyed by a wireless form of communication medium 133 (e.g., a wireless communication medium complying with the IEEE 802.11 standard or another transmission protocol).
  • One or more program instructions may be, for example, computer-executable instructions or logic-implementing instructions.
  • an audio data processing apparatus such as the one described with respect to FIG. 12 may be configured to provide various operations, functions, or actions in response to one or more of the program instructions carried by the signal bearing medium 130.
  • all or part of them may be implemented by software, hardware, firmware or any combination thereof.
  • when implemented using a software program, the above may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • when the computer instructions are executed on a computer, the processes or functions according to the embodiments of the present application are generated in whole or in part.
  • a computer can be a general purpose computer, special purpose computer, a computer network, or other programmable apparatus.
  • computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or may contain one or more data storage devices such as servers and data centers that can be integrated with the medium.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (solid state disk, SSD)), etc.


Abstract

A method and an apparatus (120) for processing audio data, relating to the field of multimedia technology. The method includes: acquiring m audio clips, where m is an integer greater than or equal to 2 (S101); determining m-1 pieces of transition audio information according to the m audio clips (S102); generating MIDI information of a target medley audio according to the m audio clips and the m-1 pieces of transition audio information (S103); and determining the target medley audio according to the determined MIDI information of the target medley audio (S104). The m-1 pieces of transition audio information are used to join the m audio clips. For a first piece of transition audio information among the m-1 pieces, the first transition audio information is used to join a first audio clip and a second audio clip that are consecutive in the ordering of the m audio clips, where the ordering refers to the medley order of the m audio clips.

Description

Audio Data Processing Method and Apparatus
This application claims priority to Chinese Patent Application No. 202110876809.0, filed with the China National Intellectual Property Administration on July 31, 2021 and entitled "Audio Data Processing Method and Apparatus", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of multimedia technology, and in particular to an audio data processing method and apparatus.
Background
With the wide spread and application of digital streaming music, and with the popularization and development of wireless terminal devices such as mobile phones, tablets, and earphones, listening to music has become a daily necessity for most people in different environments, and people's demand for diverse music experiences keeps growing. For example, in addition to listening to a piece of audio from beginning to end, there is a growing demand for audio composed of multiple audio clips, that is, medley audio.
At present, when an audio medley is produced, usually only at least two audio clips with high similarity can be spliced into a medley. However, the style of medley audio produced in this way tends to be rather monotonous.
Therefore, how to obtain richer and more diverse medley audio is a technical problem to be urgently solved in the prior art.
Summary
This application provides an audio data processing method and apparatus, based on which richer and more diverse medley audio can be obtained.
To achieve the above objective, this application provides the following technical solutions:
According to a first aspect, this application provides an audio data processing method. The method includes: acquiring m audio clips, where m is an integer greater than or equal to 2; determining m-1 pieces of transition audio information according to the m audio clips; and generating a target medley audio according to the m audio clips and the m-1 pieces of transition audio information. The m-1 pieces of transition audio information are used to join the m audio clips. For a first piece of transition audio information among the m-1 pieces, the first transition audio information is used to join a first audio clip and a second audio clip that are consecutive in the ordering of the m audio clips, where the ordering of the m audio clips is their medley order.
It can be seen that, when multiple audio clips are made into a medley based on the method provided in this application, brand-new transition audio information for joining the multiple audio clips can be generated. Therefore, the method provided in this application does not need to consider the similarity of the multiple audio clips used to obtain the target medley audio. In other words, richer and more diverse medley audio can be obtained through the method provided in the embodiments of this application.
In a possible design, the determining of the m-1 pieces of transition audio information according to the m audio clips includes: determining the first transition audio information according to first information of the first audio clip and second information of the second audio clip. The first information includes musical instrument digital interface (MIDI) information and audio feature information of the first audio clip, the second information includes MIDI information and audio feature information of the second audio clip, and the first transition audio information includes MIDI information of the first transition audio corresponding to the first transition audio information.
In another possible design, the audio feature information includes at least one of main melody track position information, a style tag, an emotion tag, rhythm information, beat information, or key signature information of the audio clip.
Based on these two possible designs, the transition audio information generated by the method of this application for joining multiple audio clips is generated in the MIDI domain. Since the MIDI information of audio is the most elementary representation of the audio, recording information such as note pitch, note velocity, and note duration, the transition audio information generated by processing the MIDI information of the audio clips in the MIDI domain is generated on the basis of music theory, compared with directly splicing multiple audio clips in the time domain. In this way, the medley audio obtained based on the transition audio information sounds smoother and more natural. In addition, processing data in the MIDI domain is also more conducive to the flexibility and consistency of the medley audio in later rendering.
In another possible design, the determining of the first transition audio information according to the first information of the first audio clip and the second information of the second audio clip includes: determining the first transition audio information according to the first information of the first audio clip, the second information of the second audio clip, and a preset neural network model.
In another possible design, when the first audio clip precedes the second audio clip in the target medley audio: the first transition audio information is determined based on a feature vector used to characterize the first transition audio information, and the feature vector of the first transition audio information is determined based on a first vector and a second vector. The first vector is a feature vector generated at the temporal end of the first audio clip according to the first information, and the second vector is a feature vector generated at the temporal start of the second audio clip according to the second information.
Based on these two possible designs, the method of this application processes the MIDI information of multiple audio clips in the MIDI domain through a neural network model, thereby obtaining MIDI information of transition audio used to join the multiple audio clips. In this way, thanks to the strong learning capability of neural networks, the transition audio information obtained in the MIDI domain through learning of music theory can join the multiple audio clips more naturally and smoothly.
In another possible design, the acquiring of the m audio clips includes: determining k target audios in response to a first operation of the user, and extracting the m audio clips from the k target audios, where 2≤k≤m and k is an integer.
Based on this possible design, this application can make a medley of audio clips from multiple target audios selected by the user according to the user's wishes, thereby improving the user experience.
In another possible design, before the determining of the m-1 pieces of transition audio information according to the m audio clips, the method further includes: determining the medley order of the m audio clips.
In another possible design, the method further includes: re-determining the medley order of the m audio clips in response to a second operation of the user; re-determining m-1 pieces of transition audio information according to the re-determined medley order and the m audio clips; and regenerating the target medley audio according to the re-determined m-1 pieces of transition audio information and the m audio clips.
According to this possible design, after the target medley audio is generated by the method of this application, when the user is not satisfied with it, the user can input a second operation to the terminal device, so that the terminal device, in response to the second operation, adjusts the medley order of the m audio clips used to generate the target medley audio and regenerates a new target medley audio. In this way, through repeated interaction between the device and the user, a target medley audio that satisfies the user can be obtained, thereby improving the user experience.
In another possible design, the method further includes: rendering the target medley audio in response to a third operation of the user.
In another possible design, the method further includes: outputting the target medley audio.
According to a second aspect, this application provides an audio data processing apparatus.
In a possible design, the processing apparatus is configured to perform any one of the methods provided in the first aspect. In this application, the processing apparatus may be divided into functional modules according to any one of the methods provided in the first aspect. For example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. Exemplarily, this application may divide the processing apparatus into an acquiring unit, a determining unit, a generating unit, and the like by function. For descriptions of the possible technical solutions performed by the functional modules obtained through the division and of the beneficial effects, reference may be made to the technical solutions provided in the first aspect or its corresponding possible designs, and details are not repeated here.
In another possible design, the processing apparatus includes one or more processors and a transmission interface. The one or more processors receive or send data through the transmission interface, and the one or more processors are configured to invoke program instructions stored in a memory, so that the processing apparatus performs any one of the methods provided in the first aspect and any one of its possible designs.
According to a third aspect, this application provides a computer-readable storage medium. The computer-readable storage medium includes program instructions, and when the program instructions are run on a computer or a processor, the computer or the processor is caused to perform any one of the methods provided in any possible implementation of the first aspect.
According to a fourth aspect, this application provides a computer program product. When the computer program product runs on an audio data processing apparatus, any one of the methods provided in any possible implementation of the first aspect is performed.
According to a fifth aspect, this application provides an audio data processing system. The system includes a terminal device and a server. The terminal device is configured to perform the part of any method provided in any possible implementation of the first aspect that interacts with the user, and the server is configured to perform the part of any method provided in any possible implementation of the first aspect that generates the target medley audio.
It can be understood that any of the apparatuses, computer storage media, computer program products, or systems provided above can be applied to the corresponding methods provided above. Therefore, for the beneficial effects that they can achieve, reference may be made to the beneficial effects of the corresponding methods, and details are not repeated here.
In this application, the name of the audio data processing apparatus does not constitute a limitation on the device or functional module itself. In actual implementation, these devices or functional modules may appear under other names. As long as the functions of the devices or functional modules are similar to those in this application, they fall within the scope of the claims of this application and their equivalent technologies.
Brief Description of Drawings
FIG. 1 is a schematic diagram of a hardware structure of a mobile phone according to an embodiment of this application;
FIG. 2 is a schematic diagram of an audio data processing system according to an embodiment of this application;
FIG. 3 is a schematic flowchart of an audio data processing method according to an embodiment of this application;
FIG. 4 is a schematic diagram of a first operation input by a user on an audio clipping interface of an audio clipping application according to an embodiment of this application;
FIG. 5 is a schematic diagram of another first operation input by a user on an audio clipping interface of an audio clipping application according to an embodiment of this application;
FIG. 6 is a schematic diagram of still another first operation input by a user on an audio clipping interface of an audio clipping application according to an embodiment of this application;
FIG. 7 is a schematic diagram of yet another first operation input by a user on an audio clipping interface of an audio clipping application according to an embodiment of this application;
FIG. 8 is a schematic structural diagram of a preset neural network model according to an embodiment of this application;
FIG. 9 is a schematic structural diagram of another preset neural network model according to an embodiment of this application;
FIG. 10 is a schematic diagram of a second operation according to an embodiment of this application;
FIG. 11 is a schematic diagram of rendering and outputting MIDI information of a target medley audio according to an embodiment of this application;
FIG. 12 is a schematic structural diagram of an audio data processing apparatus according to an embodiment of this application;
FIG. 13 is a schematic structural diagram of a signal-bearing medium for carrying a computer program product according to an embodiment of this application.
Detailed Description
For a clearer understanding of the embodiments of this application, some terms or technologies involved in the embodiments of this application are described below:
1) Musical instrument digital interface (MIDI)
MIDI is the most widely used music standard format in the arranging world and can be regarded as "a musical score that computers can understand".
MIDI records music with digital control signals of notes. That is, what MIDI transmits is not the sound signal itself, but instructions such as notes and control parameters. These instructions can instruct a MIDI device to play music, for example, to play a certain note at the volume indicated in the instruction. The instructions transmitted by MIDI may be collectively referred to as MIDI messages or MIDI information.
Generally, MIDI information can be presented in the form of a graph or in the form of a data stream. When MIDI information is presented in the form of a graph, it may be referred to as a MIDI spectrum for short.
For a music waveform signal in the time domain stored in the waveform audio file format (WAV), when the music waveform signal is transcribed into MIDI information, the MIDI information can be understood as the representation of the music waveform signal in the MIDI domain. Here, the time domain refers to the temporal domain.
It can be understood that MIDI information generally contains multiple tracks, and each track is annotated with information such as the start position, end position, pitch, and velocity of the notes. One track is used to represent one instrument sound or the human voice. It can be understood that a complete piece of music expressed by MIDI information is often only tens of kilobytes (KB) in size, but can contain dozens of tracks.
At present, almost all modern music is produced and synthesized through MIDI information plus a tone library. The tone library (also called a sample library) includes all kinds of sounds that humans can hear and create, for example, recordings of performances of various instruments, singing and speech of various human voices, and various natural and artificial sounds.
2) Latent space
For a feature output by an intermediate layer of a neural network, the space in which the original data represented by the feature lies after being transformed by several neural network layers may be called the latent space. Generally, the dimension of the latent space is smaller than the spatial dimension of the original data.
The latent space can also be understood as a kind of abstract extraction and representation of the features of the original data.
3) Sequence model network and bidirectional sequence model network
Generally, a model whose input or output contains sequence data may be called a sequence model. Sequence models are usually used to process data with some kind of order relationship. A neural network used to build a sequence model may be called a sequence model network.
Common sequence model networks include the recurrent neural network (RNN), the long short-term memory (LSTM) network, the gated recurrent unit (GRU), the transformer, and the like.
It should be understood that the prediction result obtained by a sequence model network at time t is usually obtained based on learning of the input data before time t.
In some cases, the prediction result obtained by a sequence model network at time t is obtained based on learning of the input data both before time t and after time t. In this case, the sequence model network is called a bidirectional sequence model network. It can be seen that, when making predictions on the input data, a bidirectional sequence model network combines the context information of time t in the input data to predict the result.
It should be understood that a bidirectional sequence model can obtain a prediction result at any moment of the input data.
Common bidirectional sequence model networks include the (bidirectional) recurrent neural network ((Bi-)RNN), the (bidirectional) long short-term memory ((Bi-)LSTM) network, the (bidirectional) gated recurrent unit ((Bi-)GRU), the transformer, and the like.
4) Other terms
In the embodiments of this application, words such as "exemplary" or "for example" are used to represent giving an example, an illustration, or a description. Any embodiment or design described as "exemplary" or "for example" in the embodiments of this application shall not be construed as being more preferred or advantageous than other embodiments or designs. Rather, the use of words such as "exemplary" or "for example" is intended to present related concepts in a concrete manner.
In the embodiments of this application, the terms "first" and "second" are used for description purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Therefore, a feature defined with "first" or "second" may explicitly or implicitly include one or more such features. The term "at least one" in this application means one or more. In the description of this application, unless otherwise specified, "multiple" means two or more.
It should also be understood that the term "and/or" used herein refers to and covers any and all possible combinations of one or more of the associated listed items. The term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate the following three cases: only A exists, both A and B exist, and only B exists. In addition, the character "/" in this application generally indicates an "or" relationship between the associated objects.
It should also be understood that, in the embodiments of this application, the sequence numbers of the processes do not imply an order of execution. The order of execution of the processes should be determined by their functions and internal logic, and shall not constitute any limitation on the implementation of the embodiments of this application.
It should be understood that determining B according to A does not mean that B is determined only according to A; B may also be determined according to A and/or other information.
It should also be understood that the term "include" (also "includes", "including", "comprises", and/or "comprising"), when used in this specification, specifies the presence of the stated features, integers, steps, operations, elements, and/or components, but does not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should be understood that the medley described below in the embodiments of this application refers to a process of extracting multiple audio segments from different audios and combining the multiple audio segments in series in a preset order, where the preset order is the medley order of the multiple audio segments.
An embodiment of this application provides an audio data processing method. In the method, m-1 pieces of transition audio information are first determined according to m pre-acquired audio clips, and then the m audio clips are joined based on the m-1 pieces of transition audio information, thereby generating a target medley audio obtained after the m audio clips are made into a medley. One piece of transition audio information is used to join two audio clips that are adjacent in the medley order.
Since the method does not need to consider the feature similarity between the multiple audio clips when making them into a medley, rich medley audio with diverse styles can be obtained through the method of the embodiments of this application.
本申请实施例还提供一种音频数据的处理装置,该处理装置可以是终端设备。其中,该终端设备可以是手机、平板电脑、笔记本电脑、个人数字助理(personal digital assistant,PDA)、上网本、可穿戴电子设备(例如智能手表、智能眼镜)等便携式设备,也可以是台式计算机、智能电视、车载等设备,还可以是其他任一能够实现本申请实施例的终端设备,本申请对此不作限定。
以上述处理装置是手机为例,参考图1,图1示出了本申请实施例提供的一种手机10硬件结构示意图。如图1所示,手机10可以包括处理器110,内部存储器120,外部存储器接口130,摄像头140,触摸屏150,音频模块160以及通信模块170等。
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,存储器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。
其中,控制器可以是手机10的神经中枢和指挥中心。控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现手机10的智能认知等应用,例如:文字识别、图像识别、人脸识别等。
处理器110中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器110中的存储器为高速缓冲存储器。该存储器可以保存处理器110刚用过或循环使用的指令或数据。如果处理器110需要再次使用该指令或数据,可从所述存储器中直接调用。避免了重复存取,减少了处理器110的等待时间,因而提高了系统的效率。
在一些实施例中,处理器110可以包括一个或多个接口。接口可以包括集成电路(inter-integrated circuit,I2C)接口,集成电路内置音频(inter-integrated circuit sound,I2S)接口,脉冲编码调制(pulse code modulation,PCM)接口,通用异步收发传输器(universal asynchronous receiver/transmitter,UART)接口,移动产业处理器接口(mobile industry processor interface,MIPI),通用输入输出(general-purpose input/output,GPIO)接口,用户标识模块(subscriber identity module,SIM)接口,和/或通用串行总线(universal serial bus,USB)接口等。
I2C接口是一种双向同步串行总线，包括一根串行数据线(serial data line,SDA)和一根串行时钟线(serial clock line,SCL)。I2S接口可以用于音频通信。PCM接口也可以用于音频通信，将模拟信号抽样，量化和编码。UART接口是一种通用串行数据总线，用于异步通信。该总线可以为双向通信总线。它将要传输的数据在串行通信与并行通信之间转换。MIPI接口可以被用于连接处理器110与摄像头140、触摸屏150等外围器件。MIPI接口包括摄像头串行接口(camera serial interface,CSI)，显示屏串行接口(display serial interface,DSI)等。GPIO接口可以通过软件配置。GPIO接口可以被配置为控制信号，也可被配置为数据信号。
内部存储器120,可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。处理器110通过运行存储在内部存储器120的指令,从而执行手机10的各种功能应用以及数据处理,例如执行本申请实施例提供的音频数据的处理方法。
外部存储器接口130可以用于连接外部存储卡,例如Micro SD卡,实现扩展手机10的存储能力。外部存储卡通过外部存储器接口130与处理器110通信,实现数据存储功能。例如将音乐,视频、图片等文件保存在外部存储卡中。
摄像头140,用于获取静态图像或视频。物体通过镜头生成光学图像投射到感光元件。数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。应理解,手机10可以包括n个摄像头140,n是正整数。
触摸屏150,用于手机10和用户之间的交互。触摸屏150包括显示面板151和触摸板152。其中,显示面板151用于显示文字、图像、视频等。触摸板152用于输入用户的指令。
音频模块160用于将数字音频信息转换成模拟音频信号输出,也用于将模拟音频输入转换为数字音频信号。音频模块160可以包括扬声器161、受话器162、麦克风163以及耳机接口164中的至少一个。
其中，扬声器161，也称"喇叭"，用于将音频电信号转换为声音信号。受话器162，也称"听筒"，用于将音频电信号转换成声音信号。麦克风163，也称"话筒"，"传声器"，用于将声音信号转换为电信号。耳机接口164用于连接有线耳机。耳机接口164可以是USB接口，也可以是3.5mm的开放移动电子设备平台(open mobile terminal platform,OMTP)标准接口，或者美国蜂窝电信工业协会(cellular telecommunications industry association of the USA,CTIA)标准接口。
这样,手机10即可以通过音频模块160中的扬声器161,受话器162,麦克风163,耳机接口164,以及应用处理器等实现音频功能。例如用户的语音输入、语音/音乐的播放等。
通信模块170,用于实现手机10的通信功能。具体的,通信模块170可以通过天线、移动通信模块,无线通信模块,调制解调处理器以及基带处理器等实现。
天线用于发射和接收电磁波信号。手机10中的每个天线可用于覆盖单个或多个通信频带。不同的天线还可以复用，以提高天线的利用率。例如：可以将用于移动通信模块的天线1复用为无线局域网的分集天线。在另外一些实施例中，天线可以和调谐开关结合使用。
移动通信模块可以提供应用在手机10上的包括2G/3G/4G/5G等无线通信的解决方案。移动通信模块可以包括至少一个滤波器，开关，功率放大器，低噪声放大器(low noise amplifier,LNA)等。移动通信模块可以由天线接收电磁波，并对接收的电磁波进行滤波，放大等处理，传送至调制解调处理器进行解调。移动通信模块还可以对经调制解调处理器调制后的信号放大，经天线转为电磁波辐射出去。在一些实施例中，移动通信模块的至少部分功能模块可以被设置于处理器110中。在一些实施例中，移动通信模块的至少部分功能模块可以与处理器110的至少部分模块被设置在同一个器件中。调制解调处理器可以包括调制器和解调器。
无线通信模块可以提供应用在手机10上的包括无线局域网(wireless local area networks,WLAN)(如无线保真(wireless fidelity,Wi-Fi)网络),蓝牙(bluetooth,BT),GNSS,调频(frequency modulation,FM),近距离无线通信技术(near field communication,NFC),红外技术(infrared,IR)等无线通信的解决方案。无线通信模块可以是集成至少一个通信处理模块的一个或多个器件。无线通信模块经由天线接收电磁波,将电磁波信号调频以及滤波处理,将处理后的信号发送到处理器110。无线通信模块还可以从处理器110接收待发送的信号,对其进行调频,放大,经天线转为电磁波辐射出去。
示例性的,本申请实施例中的GNSS可以包括:全球卫星定位系统(global positioning system,GPS),全球导航卫星系统(global navigation satellite system,GLONASS),北斗卫星导航系统(beidou navigation satellite system,BDS),准天顶卫星系统(quasi-zenith satellite system,QZSS)和/或星基增强系统(satellite based augmentation systems,SBAS)等。
可以理解的是,本申请实施例示意的结构并不构成对手机10的具体限定。在本申请另一些实施例中,手机10可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
需要说明的是,当上述的处理装置为终端设备,上述音频数据的处理方法可以通过安装在终端设备上的应用程序(application,App)实现。其中,该App具有剪辑音频的功能。作为示例,该App可以是音乐剪辑App等。
其中,该App可以是具有人工介入功能的App。这里,人工介入是指App可以接收用户输入的指令,并能够响应用户输入的指令。也就是说,该App可以和用户进行交互。该App可以包括用于与用户进行交互的交互界面,该交互界面通过终端设备的显示屏(例如图1所示的显示面板151)显示。
应理解,如果终端设备包括有触摸屏(例如图1所示的触摸屏150),则用户可以通过操作终端设备的触摸屏(例如操作图1所示的触摸板152)实现和App的交互。如果终端设备不包括触摸屏(例如终端设备是普通的台式计算机),则用户可以通过终端设备的鼠标、键盘等输入输出器件和App进行交互。
还应理解,上述的App可以是安装在终端设备中的嵌入式应用程序(即终端设备的系统应用),也可以是可下载应用程序。
其中，嵌入式应用程序是设备(如手机)的操作系统提供的应用程序。例如，该嵌入式应用程序可以是手机出厂时提供的音乐应用程序等。可下载应用程序是一个可以提供自己的通信连接的应用程序，该可下载应用程序可以是预先安装在设备中的App，或者可以是由用户下载并安装在设备中的第三方App，例如，该可下载应用程序可以是音乐剪辑App，本申请实施例对此不作具体限定。
还需说明的是,上述的处理装置也可以是服务器。这种情况下,本申请实施例还提供一种音频数据的处理系统。其中,该处理系统包括服务器和终端设备,该服务器和终端设备之间可以通过有线或无线的方式连接通信。
如图2所示,图2示出了本申请实施例提供的一种处理系统20的示意图。处理系统20包括终端设备21和服务器22。其中,终端设备21可以通过客户端App(例如音频剪辑的客户端App)与用户进行交互,例如接收用户输入的指令,并将接收到的指令传输至服务器22。然后,服务器22用于基于从终端设备21接收到的指令执行本申请实施例提供的音频数据的处理方法,并将生成的目标串烧音频的MIDI信息和/或目标串烧音频发送至终端设备21。这样,终端设备21即可接收到服务器22发送的目标串烧音频的MIDI信息和/或目标串烧音频,并通过音频模块向用户播放该目标串烧音频,和/或,通过显示屏向用户显示该目标串烧音频的MIDI谱。对此不作限定。
下面结合附图,对本申请实施例提供的音频数据的处理方法进行详细描述。
参考图3,图3示出了本申请实施例提供的一种音频数据的处理方法的流程示意图。该方法由上文所述的音频数据的处理装置执行。下面,以上文所述的音频数据的处理装置是终端设备为例,该方法可以包括以下步骤:
S101、获取m个音频片段,m是大于等于2的整数。
其中,该m个音频片段包括不同音频中的片段。
具体的,终端设备可以先确定出k个目标音频,然后再从k个目标音频中提取m个音频片段。其中,2≤k≤m,且k是整数。
也就是说,终端设备可以从一个目标音频中提取至少一个音频片段。
S102、根据上述获取的m个音频片段确定m-1个过渡音频信息。
其中,该m-1个过渡音频信息用于衔接该m个音频片段。对于该m-1个过渡音频信息中的第一过渡音频信息,该第一过渡音频信息是用于衔接终端设备所获取的m个音频片段中排序连续的第一音频片段和第二音频片段的过渡音频信息。应理解,这里所述排序是m个音频片段的串烧顺序的排序。应理解,该串烧顺序是终端设备预先确定的。
具体的,终端设备可以根据预设神经网络模型和在S101获取的m个音频片段的信息,确定出m-1个过渡音频信息。这里,该过渡音频信息即为过渡音频段的MIDI信息。
S103、根据上述的m个音频片段和m-1个过渡音频信息,生成目标串烧音频的MIDI信息。
终端设备通过上述的m-1个过渡音频信息(即过渡音频段的MIDI信息)将m个音频片段的MIDI信息连接起来,即生成了目标串烧音频的MIDI信息。
S104、根据上述确定的目标串烧音频的MIDI信息确定目标串烧音频。
可选的,终端设备还可以基于目标串烧音频的MIDI信息确定出目标串烧音频后,输出该目标串烧音频。
通过上述本申请实施例提供的方法,终端设备可以基于m个音频片段,生成用于衔接该m个音频片段的m-1个过渡音频信息。这样,通过该m-1个过渡音频信息即可将m个音频片段的MIDI信息衔接起来,从而得到m个音频片段被串烧后的目标串烧音频的MIDI信息。这样的话,当终端设备将目标串烧音频的MIDI信息转换为音频格式,即得到了目标串烧音频。可以看出,通过这种方式对多个音频片段进行串烧时,终端可以生成全新的用于衔接该多个音频片段的过渡音频段,因此本申请实施例提供的方法无需考虑用于得到目标串烧音频的音频片段的相似度。也就是说,通过本申请实施例提供的方法能够获得更丰富、更具多样性的串烧音频。
并且,本申请实施例提供的方法生成过渡音频信息的过程是在MIDI域中进行的,由于音频的MIDI信息是音频最原始的表现形式,其记录有音频的音符音高、音符力度、音符时长等信息。因此,相比在时域上直接对多个音频片段进行串烧,本申请实施例提供的方法能够基于音频乐理生成用于衔接两个音频片段的过渡音频信息,且基于该过渡音频信息获得的串烧音频在听觉上更加流畅自然。并且,在MIDI域处理数据更有利于串烧音频在后期渲染时的灵活性和一致性。
下面对S101-S104进行详细描述:
在S101,可以理解,S101所述的一个音频可以是一首完整/不完整的歌曲/音乐,一个音频片段则是从一个音频中截取的一段音频。还应理解,音频或音频片段均具有时序性。
可选的,终端设备可以随机的将媒体数据库或本地存储的音乐数据库中的k个音频确定为k个目标音频。
可选的,终端设备可以先接收用户输入的第一操作,并响应该第一操作,从而确定出k个目标音频。可以理解,该终端设备中安装有具有音频剪辑功能的应用程序,该第一操作即为用户在该应用程序中的音频剪辑界面上的操作。
一种可能的实现方式,上述的第一操作可以包括用户在音频剪辑界面上的目标音频的选择操作。可选的,选择操作可以包括音乐数据库的选择操作,以及在选中的音乐数据库中选择目标音频的操作。
其中,音乐数据库可以是存储在本地的音乐数据库,也可以是基于系统根据音频的场景标签、情感标签、风格标签等对音频进行分类后的音乐数据库,也可以是系统自动推荐的音乐数据库,还可以是用户对基于系统推荐或分类的音乐数据库中的音频进行删除或增加后配置得到的自定义的音乐数据库等,本申请实施例对此不作限定。这里,系统可以是与终端设备上安装的具有音频剪辑功能的应用程序连接通信的任意媒体系统,本申请实施例对此不作限定。
作为示例，系统推荐的音乐数据库可以是系统根据终端设备的传感器检测的当前用户所处场景/状态推荐的音乐数据库。例如，当终端设备传感器检测到当前用户的状态为跑步状态，则系统可以向用户推荐包括动感音乐的音乐数据库。或者，系统推荐的音乐数据库也可以是随机展示的流媒体的音乐数据库，例如各大音乐榜单，或者包括流行经典音乐的音乐数据库，等等。
这里需要说明的是，每个音频在制作时，都可以标注有包括但不限于场景、风格、情感等标签。其中，场景是指适合聆听该音频的场景，例如可以是工作场景、学习场景、跑步场景等。风格是指音频的音乐风格，例如可以是摇滚、电子、轻音乐等。情感是指该音频表达的情感，例如可以是伤感、思恋、孤独等。对此不做详述。
以终端设备是手机10，手机10中安装有音频剪辑应用程序为例，在一个示例中，参考图4，图4示出了本申请实施例提供的一种用户在音频剪辑应用程序的音频剪辑界面上输入的第一操作的示意图。
如图4中的(a)所示,手机10的触摸屏上显示有音频剪辑应用程序的音频剪辑界面401。可以看出,音频剪辑界面401为“分类曲库”标签下的音乐数据库选择界面。这样,用户即可在音频剪辑界面401上进行音乐数据库的选择操作。
如图4中的(a)所示,在音频剪辑界面401上,显示有系统基于音频的不同分类标准对音频进行分类时的类型标签。如图所示,音频剪辑界面401上显示有基于适于聆听音频的场景对音频进行分类时的类型标签,如“工作”标签、“跑步”标签等。音频剪辑界面401上还显示有基于音频所表达情感对音频进行分类时的类型标签,例如“快乐”标签、“兴奋”标签等。音频剪辑界面401上还显示有基于音频的音乐风格对音频进行分类时的类型标签,例如“流行”标签、“节奏蓝调”标签等。容易理解,图4中的(a)所示出的音频的类型标签及其显示格式仅为示例性说明,并不作为对本申请实施例保护范围的限定。
这样,用户可以基于自身的需求/喜好,操作(例如用手指/触摸笔点击)音频剪辑界面401上显示的类型标签,例如用户可以用手指分别点击“跑步”标签、“快乐”标签、“兴奋”标签、以及“节奏蓝调”标签,并在操作(例如用手指/触摸笔点击)音频剪辑界面401上的“确定”按钮后,作为响应,手机10即可显示系统基于用户选择的类型标签推荐的所有具有“跑步”标签、“快乐”标签、“兴奋”标签、以及“节奏蓝调”标签的音频的界面,例如图4中的(b)所示的目标音频选择界面402。可以理解,目标音频选择界面402上显示的所有音频,即构成了用户所选择的音乐数据库。
可以理解,当用户在音频剪辑界面401上选择的类型标签为“自动”标签,则手机10在目标音频选择界面402上所显示的音频,即为手机10根据该手机10配置的传感器(例如陀螺仪传感器、噪声传感器等)检测到的当前操作手机10的用户所处的环境/状态,为用户自动推荐的适合用户在当前环境聆听的音频。对此不作赘述。
进一步的,用户可以在目标音频选择界面402上进行目标音频的选择操作,例如用户可以基于自身的需求/喜好,在目标音频选择界面402选中k个目标音频。作为响应,手机10即确定出k个目标音频。
在另一个示例中,参考图5,图5示出了本申请实施例提供的另一种用户在音频剪辑应用程序的音频剪辑界面上输入的第一操作的示意图。
如图5中的(a)所示,手机10的触摸屏上显示有音频剪辑应用程序的音频剪辑界面501。可以看出,音频剪辑界面501为用户在音频剪辑界面401上选中“推荐曲库”标签后显示的音乐数据库选择界面。这样,用户即可在音频剪辑界面501上进行音乐数据库的选择操作。
如图5中的(a)所示，在音频剪辑界面501上，显示有系统展示的多个音乐数据库的标识。例如"流行经典"音乐库标识、"网络甜歌"音乐库标识、"轻音乐集"音乐库标识、"金曲榜"音乐库标识等。容易理解，图5中的(a)所示出的音乐数据库及其标识的显示格式仅为示例性说明，并不作为对本申请实施例保护范围的限定。例如，手机10还可以分多个界面来显示不同类型的音乐数据库的标识，对此不作限定。
这样,用户即可基于自身需求或喜好,操作(例如用手指/触摸笔点击)音频剪辑界面501上所显示的一个音乐数据库(例如“流行经典”音乐库),并在操作(例如用手指/触摸笔点击)音频剪辑界面501上的“确定”按钮后,作为响应,手机10即可显示“流行经典”音乐库中音频的界面,例如图5中的(b)所示的目标音频选择界面502。
进一步的,用户可以在目标音频选择界面502上进行目标音频的选择操作,例如用户可以基于自身的需求/喜好,在目标音频选择界面502选中k个目标音频。作为响应,手机10即确定出k个目标音频。
在又一个示例中,参考图6,图6示出了本申请实施例提供的又一种用户在音频剪辑应用程序的音频剪辑界面上输入的第一操作的示意图。
如图6所示，手机10的触摸屏上显示有音频剪辑应用程序的音频剪辑界面601。可以看出，音频剪辑界面601为用户在音频剪辑界面401或音频剪辑界面501上选中"本地曲库"标签后显示的目标音频选择界面。这样，用户即可在音频剪辑界面601上进行目标音频的选择操作。
可以理解,用户在音频剪辑界面401或音频剪辑界面501上选中“本地曲库”标签后显示音频剪辑界面601的操作,相当于是用户在音频剪辑界面401或音频剪辑界面501选择本地音乐数据库的操作。
如图6所示,在音频剪辑界面601上显示有本地存储的多个音频,例如以列表的形式显示多个音频。这样,用户可以在音频剪辑界面601上进行目标音频的选择操作,例如用户可以基于自身的需求/喜好,在音频剪辑界面601选中k个目标音频。作为响应,手机10即确定出k个目标音频。
容易理解,图6中所示出的本地存储的多个音频的显示的格式仅为示例性说明,并不作为对本申请实施例的限定。例如,手机10还可以将本地存储的多个音频划分为多个组,并以多个层级界面来显示不同组别中的音频列表,对此不作限定。
另一种可能的实现方式,上述的第一操作可以包括用户在音频剪辑界面上输入目标音频的数量的输入操作,以及音乐数据库的选择操作。
作为示例,参考图7,图7示出了本申请实施例提供的又一种用户在音频剪辑应用程序的音频剪辑界面上输入的第一操作的示意图。
如图7所示,图7所示的音频剪辑界面701上包括用于输入串烧音频数目的接口(即输入框702),这样,用户即可通过输入框702输入串烧音频的数量,以k的取值是3为例,即用户可以在输入框702输入数值“3”。
此外,用户可以通过操作音频剪辑界面701上的“曲库”按钮,并根据自身的需求/喜好选择音乐数据库。这里,用户操作音频剪辑界面701上的“曲库”按钮后所显示的选择音乐数据库的过程,可以参考上文中图4中的(a)、图5中的(b)以及图6选择音乐数据库的描述,这里不作赘述。
当用户通过操作音频剪辑界面701上的“曲库”按钮选定音乐数据库后,手机10即可以根据用户在输入框702输入的k值,在用户选定的音乐数据库中选择k个音频作为目标音频。
可选的,手机10可以基于预设规则,并根据用户在输入框702输入的k值,在用户选定的音乐数据库中选择k个音频作为目标音频。例如,手机10可以在音乐数据库中随机选择k个音频作为目标音频,或者,手机10可以将音乐数据库中的前k个音频作为目标音频,等等,本申请实施例对此不作限定。
这样,当终端设备确定出k个目标音频后,终端设备可以通过预设算法从k个目标音频中提取m个音频片段。作为示例,该预设算法可以是用于提取歌曲中副歌/高潮部分的算法,本申请实施例对此不作限定。
可选的,终端设备预置有m个音频的串烧顺序,或者,终端设备可以进一步通过与用户进行交互,以确定出该m个音频片段在串烧时的串烧顺序。
一个示例,结合图4、图5以及图6,当用户在图4中的(b)所示的目标音频选择界面402上选中k个目标音频后,并对目标音频选择界面402上的“确定”按钮进行操作(例如用手指/触摸笔点击)后,或者,当用户在图5中的(b)所示的目标音频选择界面502上选中k个目标音频后,并对目标音频选择界面502上的“确定”按钮进行操作(例如用手指/触摸笔点击)后,或者,当用户在图6所示的音频剪辑界面601选中k个目标音频后,并对音频剪辑界面601上的“确定”按钮进行操作(例如用手指/触摸笔点击)后,终端设备(即手机10)即可显示如图4中的(c)所示的串烧顺序选择界面403。如图4中的(c)所示,串烧顺序选择界面403可以包括“顺序”、“随机”、以及“自定义”三个选项。
其中,当用户选择“顺序”选项时,作为响应,手机10可以按照k个目标音频在所属音乐数据库中的顺序,对从k个目标音频中提取的m个音频片段进行串烧。可选的,k个目标音频在所属音乐数据库中的顺序,可以以该k个目标音频在该音乐数据库中的编号体现。
当用户选择“随机”选项时,作为响应,手机10可以随机的对从k个目标音频中提取的m个音频片段进行串烧。
当用户选择“自定义”选项时,用户可以进一步的在“自定义”的选项框4031中以预设顺序依次输入k个目标音频的标识(例如编号)。这样,作为响应,手机10即可以该预设顺序对从k个目标音频中提取的m个音频片段进行串烧。其中,该预设顺序即为用户自定义的顺序。
另一个示例,在图7所示的音频剪辑界面701上,还可以包括三个用于输入串烧歌曲顺序的选项,其具体说明可以参考上文对图4中的(c)的说明,这里不再赘述。
在S102,终端设备在确定出m个音频片段后,可以确定该m个音频片段的音频特征信息。
其中,音频片段的音频特征信息可以包括音频片段的主旋律轨位置信息、风格标签、情感标签、节奏信息、节拍信息、或调号信息等信息中的至少一种。可以理解,这里所述的节拍即为音乐的节拍,调号即为音调号。其中,本申请实施例对终端设备获得音频片段的音频特征信息的具体实现方式不作具体限定。本申请实施例对终端设备确定该m个音频片段的音频特征信息的过程不作详述。
对于终端设备确定出的m个音频片段中的每个音频片段而言,终端设备还可以通过音乐人声检测技术,对每个音频片段进行音乐和人声分离处理。
以m个音频片段中的第一音频片段为例,终端设备可以通过音乐人声检测技术,将第一音频片段中的人声和各种乐器(例如钢琴、贝斯、鼓、小提琴等)声音分离,并将分离得到的多轨乐器和人声转换为MIDI格式的数据,该MIDI格式的数据即为第一音频片段的MIDI信息。这里,本申请实施例对音乐人声检测技术不作详述。
需要说明的是,音频片段中也可以不包括人声。这种情况下,终端设备可以通过音乐人声检测技术,分离出第一音频片段中的多轨乐器声音,并将该多轨乐器声音转换为MIDI格式的数据。
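上述"分离多轨并转换为MIDI"的流程可以用如下示意性代码表示，其中separate_sources与transcribe_to_midi均为本文假设的占位函数，分别代表音乐人声检测/音源分离和转录为MIDI两类处理，并非某个真实库的接口。

```python
from typing import Any, Dict

def separate_sources(clip_waveform: Any) -> Dict[str, Any]:
    """占位函数：将音频片段分离为人声与多轨乐器的波形，返回 {声轨名: 波形}。"""
    raise NotImplementedError

def transcribe_to_midi(track_waveform: Any) -> Any:
    """占位函数：将单个声轨的波形转录为该声轨的MIDI数据。"""
    raise NotImplementedError

def clip_to_midi_info(clip_waveform: Any) -> Dict[str, Any]:
    """对一个音频片段：先分轨，再逐轨转录为MIDI，得到该片段的MIDI信息。"""
    stems = separate_sources(clip_waveform)   # 例如 {"vocal": ..., "piano": ..., "bass": ...}
    return {name: transcribe_to_midi(wav) for name, wav in stems.items()}
```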
进一步的,终端设备可以根据m个音频片段的音频特征信息和MIDI信息,以及预设神经网络模型,确定出m-1个过渡音频信息。其中,m-1个过渡音频信息中的一个过渡音频信息,是用于衔接终端设备所获取的m个音频片段中排序连续的两个音频片段的过渡音频信息,其中,该m个音频片段的排序是指该m个音频片段的串烧顺序,过渡音频信息为过渡音频段的MIDI信息。这里,终端设备确定m个音频片段的串烧顺序的详细描述可以参考S101中的相关描述,这里不再赘述。
其中,预设神经网络模型可以预置在终端设备中,也可以预置在与终端设备具有通信连接的服务器上,本申请实施例对此不作限定。
其中,该预设神经网络模型包括编码器、信息提取模块、信息生成模块以及解码器。这里,编码器、信息提取模块、信息生成模块、以及解码器均为序列模型网络结构,信息提取模块是双向序列模型网络结构。
作为示例,其中,编码器、信息生成模块、以及解码器可以是RNN、LSTM、GRU、Transformer等网络。信息提取模块可以是Bi-RNN、Bi-LSTM、Bi-GRU、Transformer等网络。
应理解,该预设神经网络模型包括至少两个编码器、至少两个信息提取模块、至少一个信息生成模块以及至少一个解码器。其中,该至少两个编码器的网络结构相同,该至少两个信息提取模块的网络结构相同,该至少一个信息生成模块的网络结构相同,该至少一个解码器的网络结构相同。此外,在一个预设神经网络模型中,编码器和解码器的网络结构也可以是相同的。需要说明的是,编码器和解码器的数据流向是相反的。作为示例,编码器的输入可以作为解码器的输出,编码器的输出可以作为解码器的输入。
应理解,上述至少两个编码器、至少两个信息提取模块、至少一个信息生成模块以及至少一个解码器的网络参数均是在训练该预设神经网络模型时确定的。这里,训练预设神经网络模型的详细说明可以参考下文中训练图8所示预设神经网络模型80的描述,这里不作赘述。
在一种可能的实现方式中,当需要该预设神经网络模型处理的音频片段的数量为m,则该预设神经网络模型包括m个输入和m-1个输出。这样的话,该预设神经网络模型中编码器的数量为m,信息提取模块的数量为2×(m-1),信息生成模块和解码器的数量均为m-1。
这种情况下,该预设神经网络模型可以对输入的m个音频片段的信息同时进行处理,并输出m-1个过渡音频的MIDI信息。
一个示例，以m的取值是2为例，参考图8，图8示出了本申请实施例提供的一种预设神经网络模型的结构示意图。
如图8所示,预设神经网络模型80包括用于接收2个输入的2个编码器,分别为编码器811和编码器812。预设神经网络模型80包括2个(即2×(2-1)个)信息提取模块,分别为信息提取模块821和信息提取模块822。预设神经网络模型80还包括1个(即(2-1)个)信息生成模块(即信息生成模块83)和1个(即(2-1)个)解码器(即解码器84)。
另一个示例,以m的取值是4为例,参考图9,图9示出了本申请实施例提供的另一种预设神经网络模型的结构示意图。
如图9所示,预设神经网络模型90包括用于接收4个输入的4个编码器,分别为编码器911、编码器912、编码器913、以及编码器914。预设神经网络模型90包括6个(即2×(4-1)个)信息提取模块,分别为信息提取模块921、信息提取模块922、信息提取模块923、信息提取模块924、信息提取模块925、以及信息提取模块926。预设神经网络模型90包括3个(即(4-1)个)信息生成模块,分别为信息生成模块931、信息生成模块932、以及信息生成模块933。预设神经网络模型90还包括3个(即(4-1)个)解码器,分别为解码器941、解码器942、以及解码器943。
在另一种可能的实现方式中,预设神经网络模型中编码器和信息提取模块的数量均为2,信息生成模块和解码器的数量均为1(例如图8所示的预设神经网络模型)。也就是说,该预设神经网络模型包括2个输入、1个输出。
这种情况下,该预设神经网络模型一次可以对2个输入(即2个音频片段的信息)同时进行处理,并输出1个过渡音频信息。当需要该预设神经网络模型处理的音频片段的数量为m,且m大于2时,则该预设神经网络模型可以对m个音频片段的信息串行的进行m-1次处理,即可得到m-1个过渡音频信息。应理解,该预设神经网络模型每次处理的两个音频片段是串烧顺序相邻的两个音频片段。
作为示例,当需要该预设神经网络模型处理的音频片段包括音频片段1、音频片段2、音频片段3以及音频片段4,即m的取值为4时,假设该4个音频片段的串烧顺序为:音频片段1→音频片段4→音频片段3→音频片段2,则终端设备可以对该4个音频片段的信息串行的进行3(即(4-1))次处理,即可得到3个过渡音频信息。具体的,终端设备可以将音频片段1和音频片段4的信息作为该预设神经网络的两个输入信息,这样即可得到用于衔接音频片段1和音频片段4的过渡音频信息1。终端设备可以将音频片段4和音频片段3的信息作为该预设神经网络的两个输入信息,这样即可得到用于衔接音频片段4和音频片段3的过渡音频信息2。终端设备还可以将音频片段3和音频片段2的信息作为该预设神经网络的两个输入信息,这样即可得到用于衔接音频片段3和音频片段2的过渡音频信息3。
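上述串行处理的调用顺序可以用如下示意性代码表示：按照串烧顺序依次取相邻的两个音频片段，各执行一次推理，得到一个过渡音频信息。其中generate_transition为本文假设的占位函数，代表预设神经网络模型的一次前向推理。

```python
from typing import Any, Callable, List

def make_transitions(clips_in_order: List[Any],
                     generate_transition: Callable[[Any, Any], Any]) -> List[Any]:
    """对按串烧顺序排列的 m 个片段，串行生成 m-1 个过渡音频信息。"""
    transitions = []
    for prev_clip, next_clip in zip(clips_in_order, clips_in_order[1:]):
        transitions.append(generate_transition(prev_clip, next_clip))
    return transitions

# 用法示意：串烧顺序为 片段1 → 片段4 → 片段3 → 片段2
order = ["clip1", "clip4", "clip3", "clip2"]
fake_model = lambda a, b: f"transition({a},{b})"     # 仅作演示的占位模型
print(make_transitions(order, fake_model))
# ['transition(clip1,clip4)', 'transition(clip4,clip3)', 'transition(clip3,clip2)']
```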
下面以m的取值是2,即以终端设备获取到2个音频片段(例如该2个音频片段包括第一音频片段和第二音频片段),且串烧顺序为第一音频片段→第二音频片段为例,结合图8对本申请实施例提供的预设神经网络模型中的各个模块及其对音频片段的信息进行处理的过程予以详细说明。从串烧顺序可以看出,第一音频片段是目标串烧音频中在前的乐段(简称前乐段),第二音频片段是目标串烧音频中在后的乐段(简称后乐段)。
这里,第一音频片段作为前乐段,其音频特征信息和MIDI信息可以被称为第一信息。第二音频片段作为后乐段,其音频特征信息和MIDI信息可以被称为第二信息。其中,第一信息即可作为预设神经网络模型80的一个输入,第二信息即可作为预设神经网络模型80的另一个输入。
需要说明的是,第一信息中的MIDI信息(即第一音频片段的MIDI信息)中的多个声轨,和第二信息中的MIDI信息(即第二音频片段的MIDI信息)中的多个声轨是相同的。具体的,第一信息中的MIDI信息所包括的声轨数量及其类型,与第二信息中的MIDI信息所包括的声轨数量及其类型均相同。
例如,第一信息中的MIDI信息包括3个声轨,分别为人声声轨、钢琴声轨以及小提琴声轨。那么,第二信息中的MIDI信息也包括这3个声轨。
应理解,假设第一信息中的MIDI信息所包括的声轨数量和第二信息中的MIDI信息所包括的声轨数量不同时,例如第一信息中的MIDI信息所包括的声轨数量,大于第二信息中的MIDI信息所包括的声轨数量,且第一信息中的MIDI信息包括第二信息中的MIDI信息所包括的所有声轨类型,则终端设备可以为第二信息中的MIDI信息增加空轨,以使第一信息中的MIDI信息所包括的声轨数量和第二信息中的MIDI信息所包括的声轨数量相同。
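上述"补空轨"以对齐两段MIDI信息声轨数量的处理，可以用如下示意性代码表示。这里沿用前文示意中以{声轨名: 音符列表}字典表示MIDI信息的假设，仅用于说明处理逻辑。

```python
from typing import Dict, List

def align_tracks(midi_a: Dict[str, list], midi_b: Dict[str, list]) -> None:
    """为两段MIDI信息互相补齐缺失的声轨（补空轨），使两者声轨数量与类型一致。"""
    for name in set(midi_a) | set(midi_b):
        midi_a.setdefault(name, [])   # 空列表即"空轨"
        midi_b.setdefault(name, [])

first = {"vocal": ["..."], "piano": ["..."], "violin": ["..."]}
second = {"vocal": ["..."], "piano": ["..."]}
align_tracks(first, second)
print(sorted(second))   # ['piano', 'violin', 'vocal']，其中 violin 为补出的空轨
```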
这样,终端设备可以将第一信息作为预设神经网络模型80的一个输入(例如输入1)输入到编码器811中。编码器811即可对接收到的第一信息进行处理后输出第一音频片段对应的第一序列。这里,第一序列是编码器811对第一信息中的MIDI信息和音频特征信息进行特征提取后得到的序列。这里,第一序列可以被理解为编码器811对第一信息进行降维后得到的序列,或者,第一序列可以被理解为编码器811将第一信息压缩至隐空间后得到的序列。
应理解,第一序列是时序上的一维序列,且第一序列的长度由第一音频片段的长度确定。可以理解,第一音频片段越长,第一序列越长;第一音频片段越短,第一序列越短。
作为示例，第一序列可以表示为"{P1,P2,…,Ps}"，其中，P表示特征向量，s表示特征向量的个数。应理解，由于音频片段本身具有时序性，因此第一序列也具有时序性。这样的话，P1可以是音频片段始端时刻对应的特征向量，Ps可以是音频片段末端时刻对应的特征向量。
类似的,终端设备可以将第二信息作为预设神经网络模型80的另一个输入(例如输入2)输入到编码器812中,这样,编码器812即可对接收到的第二信息进行处理后输出第二音频片段对应的第二序列。这里,第二序列的描述可以参考第一序列,不再赘述。
作为示例，第二序列可以表示为"{F1,F2,…,Ft}"，其中，F表示特征向量，t表示特征向量的个数。其中，F1可以是音频片段始端时刻对应的特征向量，Ft可以是音频片段末端时刻对应的特征向量。
接着，信息提取模块821接收编码器811输出的第一序列。由于音频片段本身具有时序性，且第一音频片段是前乐段，因此信息提取模块821对第一序列进行学习后，可以输出与第一音频片段的末端时刻对应的第一向量。该过程也可以理解为信息提取模块821对第一序列进一步进行了降维。需要说明，第一向量携带有第一序列的特征，且与第一序列的末端时刻对应。
类似的，信息提取模块822接收编码器812输出的第二序列。信息提取模块822对第二序列进行学习后，可以输出与第二音频片段的始端时刻对应的第二向量。该过程也可以理解为信息提取模块822对第二序列进一步进行了降维。需要说明，第二向量携带有第二序列的特征，且与第二序列的始端时刻对应。
然后,预设神经网络模型80对第一向量和第二向量求和,得到第三向量。即第三向量=第一向量+第二向量。这里,第一音频片段末端时刻对应的隐空间向量(即第一向量)和第二音频片段始端时刻对应的隐空间向量(即第二向量)的和(即第三向量),可以作为衔接第一音频片段和第二音频片段的过渡音频段对应的隐空间向量。
这样，基于该第三向量(即过渡音频段对应的隐空间向量)，预设神经网络模型80可以确定出用于衔接第一音频片段和第二音频片段的过渡音频段。具体的，预设神经网络模型80将第三向量输入到信息生成模块83，信息生成模块83即可对接收到的第三向量进行学习，并输出第三序列。应理解，第三序列即为用于衔接第一音频片段和第二音频片段的过渡音频段的特征向量所构成的序列。
作为示例，第三序列可以表示为"{M1,M2,…,Mj}"，其中，M表示特征向量，j表示特征向量的个数。其中，M1可以是用于衔接第一音频片段和第二音频片段的过渡音频段的始端时刻对应的特征向量，Mj可以是该过渡音频段的末端时刻对应的特征向量。
然后，解码器84接收信息生成模块83输出的第三序列，对第三序列进行学习后，输出用于衔接第一音频片段和第二音频片段的过渡音频信息，即过渡音频段的MIDI信息。
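结合上文对编码器、信息提取模块、信息生成模块与解码器的描述，下面给出预设神经网络模型80前向过程的一个极简示意（假设使用PyTorch，且各模块统一以GRU/双向GRU代替，输入输出的张量形状亦为假设值）。该代码仅用于展示"编码→提取端点向量→求和→生成→解码"的数据流，并不代表实际实现。

```python
import torch
import torch.nn as nn

class TransitionModel(nn.Module):
    def __init__(self, in_dim=64, hid=128, out_len=32):
        super().__init__()
        self.enc1 = nn.GRU(in_dim, hid, batch_first=True)                    # 编码器811
        self.enc2 = nn.GRU(in_dim, hid, batch_first=True)                    # 编码器812
        self.ext1 = nn.GRU(hid, hid, batch_first=True, bidirectional=True)   # 信息提取模块821
        self.ext2 = nn.GRU(hid, hid, batch_first=True, bidirectional=True)   # 信息提取模块822
        self.gen = nn.GRU(hid, hid, batch_first=True)                        # 信息生成模块83
        self.dec = nn.GRU(hid, in_dim, batch_first=True)                     # 解码器84
        self.proj = nn.Linear(2 * hid, hid)                                  # 将双向输出压回 hid 维
        self.out_len = out_len                                               # 过渡段长度（假设值）

    def forward(self, first_info, second_info):
        seq1, _ = self.enc1(first_info)            # 第一序列：前乐段的隐空间序列
        seq2, _ = self.enc2(second_info)           # 第二序列：后乐段的隐空间序列
        v1 = self.proj(self.ext1(seq1)[0][:, -1])  # 第一向量：对应第一音频片段末端时刻
        v2 = self.proj(self.ext2(seq2)[0][:, 0])   # 第二向量：对应第二音频片段始端时刻
        v3 = v1 + v2                               # 第三向量：过渡音频段对应的隐空间向量
        # 将第三向量沿时间维重复，作为信息生成模块的输入，得到第三序列
        seq3, _ = self.gen(v3.unsqueeze(1).repeat(1, self.out_len, 1))
        midi_feat, _ = self.dec(seq3)              # 解码为过渡音频段MIDI信息的特征表示
        return midi_feat

model = TransitionModel()
first = torch.randn(1, 40, 64)     # 前乐段的MIDI信息+音频特征信息（张量化后的假设形状）
second = torch.randn(1, 50, 64)    # 后乐段
print(model(first, second).shape)  # torch.Size([1, 32, 64])
```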
应理解,上述图8所示的预设神经网络模型80可以是预先基于多个训练样本训练得到的神经网络模型。其中,一个训练样本包括两个音频片段的MIDI信息和音频特征信息,该训练样本的标签值为领域专家根据该两个音频片段构建的过渡音频段的MIDI信息。这样,基于多个训练样本对神经网络进行反复的迭代训练,即可得到本申请实施例中图8所示的预设神经网络模型。
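上述训练过程可以示意如下：每个训练样本由两个音频片段的信息构成，标签为专家构建的过渡音频段的MIDI信息（此处同样以张量近似表示），通过重构损失对上文示意的TransitionModel进行反复迭代训练。损失函数与优化器的选择均为本文假设。

```python
import torch
import torch.nn as nn

# 沿用上文示意的 TransitionModel；samples 为 (第一信息, 第二信息, 过渡MIDI标签) 的列表
model = TransitionModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                      # 假设以重构误差作为损失

samples = [(torch.randn(1, 40, 64), torch.randn(1, 50, 64), torch.randn(1, 32, 64))
           for _ in range(8)]               # 随机数据仅作演示，实际应为标注样本

for epoch in range(3):                      # 反复迭代训练
    for first_info, second_info, target_midi in samples:
        pred = model(first_info, second_info)
        loss = loss_fn(pred, target_midi)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```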
在S103，终端设备将在S102步骤生成的m-1个过渡音频信息(即过渡音频段的MIDI信息)分别插入m个音频片段的MIDI信息中，以实现通过该m-1个过渡音频信息来衔接该m个音频片段的MIDI信息的目的，也即生成了该m个音频片段串烧后得到的目标串烧音频的MIDI信息。
作为示例，假设m取值为3，m个音频片段包括音频片段1、音频片段2以及音频片段3，m-1个过渡音频信息包括过渡音频信息1和过渡音频信息2。则当该3个音频片段的串烧顺序为音频片段1→音频片段3→音频片段2，且过渡音频信息1是用于衔接音频片段1和音频片段3的过渡音频信息，过渡音频信息2是用于衔接音频片段3和音频片段2的过渡音频信息时，终端设备可以将过渡音频信息1插入音频片段1和音频片段3的MIDI信息之间，以及可以将过渡音频信息2插入音频片段3和音频片段2的MIDI信息之间。这样，终端设备即生成了音频片段1、音频片段2以及音频片段3串烧后目标串烧音频的MIDI信息。
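将m-1个过渡音频信息分别插入m个音频片段的MIDI信息之间的拼接逻辑，可以用如下示意性代码表示（按串烧顺序交替排列片段与过渡段）。

```python
from typing import Any, List

def splice_midi(clips_in_order: List[Any], transitions: List[Any]) -> List[Any]:
    """按串烧顺序交替拼接：片段1, 过渡1, 片段2, 过渡2, ..., 片段m。"""
    assert len(transitions) == len(clips_in_order) - 1
    result = [clips_in_order[0]]
    for transition, clip in zip(transitions, clips_in_order[1:]):
        result += [transition, clip]
    return result

# 示例：串烧顺序为 片段1 → 片段3 → 片段2
print(splice_midi(["clip1_midi", "clip3_midi", "clip2_midi"],
                  ["transition1_midi", "transition2_midi"]))
# ['clip1_midi', 'transition1_midi', 'clip3_midi', 'transition2_midi', 'clip2_midi']
```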
在实际应用中,终端设备在生成目标串烧音频的MIDI信息后,可以向用户播放该目标串烧音频。当用户认为该目标串烧音频不是自己想要的串烧音频时,用户可以向终端设备输入第二操作。这样,终端设备响应该第二操作,对用于串烧得到该目标串烧音频的m个音频片段的串烧顺序进行调整,并基于调整后的串烧顺序和m个音频片段,重新生成m-1个过渡音频信息,并重新生成目标串烧音频的MIDI信息。
然后,终端设备可以向用户播放重新生成的目标串烧音频。当用户对该目标串烧音频满意,则可以执行S104;当用户对该目标串烧音频不满意,则可以再次向终端设备输入第二操作,以使终端设备对m个音频片段的串烧顺序进行再次调整,并再次重新生成目标串烧音频的MIDI信息。可以看出,通过终端设备和用户之间的反复交互,可以获得令用户满意的目标串烧音频,提高了用户的体验。
在一种可能的实现方式中,终端设备在生成目标串烧音频的MIDI信息后,通过图谱的形式在显示面板上显示目标串烧音频的MIDI谱。这样,用户的第二操作即可以是对该MIDI谱的拖拽操作。这样,终端设备在接收到用户的第二操作后,可以响应该第二操作,以重新确定出用于串烧得到前述目标串烧音频的m个音频片段的串烧顺序。
这样的话，终端设备基于重新确定出的m个音频片段的串烧顺序，以及该m个音频片段，并通过执行S102，即可重新确定出m-1个过渡音频信息。进一步的，终端设备根据重新确定出的m-1个过渡音频信息和该m个音频片段，即可重新生成目标串烧音频。这里，终端设备重新生成目标串烧音频的过程，可以参考上文中终端设备生成目标串烧音频的过程的详细描述，这里不作赘述。
作为示例，以终端设备是手机10，且手机10生成的目标串烧音频的MIDI信息中包括3个声轨，且目标串烧音频是对音频片段1、音频片段2以及音频片段3，以音频片段1→音频片段3→音频片段2的串烧顺序进行串烧后得到的串烧音频为例，参考图10中的(a)，图10中的(a)示出了本申请实施例提供的一种第二操作的示意图。
如图10中的(a)所示,当手机10第一次生成目标串烧音频的MIDI信息后,可以在显示面板上显示该目标串烧音频的MIDI谱,例如图10中的(a)中所示的串烧音频剪辑界面1001上所显示的MIDI谱。
其中,串烧音频剪辑界面1001上显示的MIDI谱包括三个声轨,分别为黑色条带所示的声轨1、白色条带所示的声轨2、以及条纹条带所示的声轨3。
其中,串烧音频剪辑界面1001上显示的MIDI谱上的起始线用于标记目标串烧音频的起始。串烧音频剪辑界面1001上显示的MIDI谱上还包括多个分割线,该多个分割线用于区分目标串烧音频中的不同音频片段以及过渡音频段。
例如，基于目标串烧音频中音频片段的串烧顺序，位于起始线和分割线1之间的音频段为音频片段1，位于分割线1和分割线2之间的音频段为衔接音频片段1和音频片段3的过渡音频段1，位于分割线2和分割线3之间的音频段为音频片段3，位于分割线3和分割线4之间的音频段为衔接音频片段3和音频片段2的过渡音频段2，位于分割线4右侧的音频段为音频片段2(图10中的(a)中MIDI谱未示出用于标记目标串烧音频结束的终止线)。可以理解，在串烧音频剪辑界面1001上显示的MIDI谱上也可以显示各个音频段的名称，本申请实施例对此不作限定。
当用户对串烧音频剪辑界面1001上的播放图标1002进行操作(例如用手指或触摸笔点击)后，作为响应，手机10向用户播放目标串烧音频。当用户对该目标串烧音频不满意时，可以向手机10输入第二操作。这里，第二操作可以是用户对串烧音频剪辑界面1001上显示的MIDI谱进行拖拽的操作(例如通过手指或触摸笔在显示面板上滑动的操作)，例如用户通过手指按住音频片段1的MIDI谱(即图10中的(a)所示MIDI谱中起始线和分割线1之间的区域)，并沿着图10中的(a)中箭头所示方向滑动至音频片段2的MIDI谱(即图10中的(a)所示MIDI谱中分割线4右侧的区域)位置处。作为响应，手机10交换音频片段1和音频片段2的串烧顺序，即手机10重新确定出音频片段1、音频片段2以及音频片段3的串烧顺序为音频片段2→音频片段3→音频片段1。
进一步的,手机10可以根据音频片段1、音频片段2、音频片段3以及重新确定的串烧顺序,重新生成目标串烧音频。
在另一种实现方式中,第二操作可以是用户在终端设备生成目标串烧音频的MIDI信息后所显示的音频剪辑界面上,输入目标串烧顺序的操作。作为响应,终端设备即接收到用户输入的目标串烧顺序。也就是说,终端设备重新确定出m个音频片段的目标串烧顺序。
这样的话,终端设备基于接收到的m个音频片段的目标串烧顺序,以及该m个音频片段,并通过执行S102,即可重新确定出m-1个过渡音频信息。进一步的,终端设备根据重新确定出的m-1个过渡音频信息和该m个音频片段,即可重新生成目标串烧音频。
作为示例,以终端设备是手机10,且目标串烧音频是手机10对音频片段1、音频片段2以及音频片段3,以音频片段1→音频片段3→音频片段2的串烧顺序进行串烧后得到的串烧音频为例,参考图10中的(b),图10中的(b)示出了本申请实施例提供的另一种第二操作的示意图。
如图10中的(b)所示，当手机10第一次生成目标串烧音频的MIDI信息后，可以在显示面板上显示如图10中的(b)所示的串烧音频剪辑界面1001。当用户对串烧音频剪辑界面1001上的播放图标1002进行操作(例如用手指或触摸笔点击)后，作为响应，手机10可以播放该目标串烧音频。当用户对该目标串烧音频不满意时，可以向手机10输入第二操作。
具体的,用户可以在串烧音频剪辑界面1001上的目标串烧顺序的输入框1003内输入期望的目标串烧顺序,例如用户在输入框1003输入“2,3,1”,其中,“2”可以用于表示音频片段2的标识,“3”可以用于表示音频片段3的标识,“1”可以用于表示音频片段1的标识,“2,3,1”可以用于表示音频片段1、音频片段2以及音频片段3的串烧顺序为音频片段2→音频片段3→音频片段1。作为响应,手机10即接收到用户输入的目标串烧顺序。这样,手机10即确定出音频片段1、音频片段2以及音频片段3的目标串烧顺序。
然后,手机10可以根据音频片段1、音频片段2、音频片段3、以及接收到的目标串烧顺序,重新生成目标串烧音频。
可选的，图10中的(b)所示的串烧音频剪辑界面1001上也可以显示有当前串烧顺序，例如"当前串烧顺序：1，3，2"。应理解，该当前串烧顺序可以在用户输入目标串烧顺序时作为参考。
应理解,上述可能的实现第二操作的方式仅为示例性说明,并不构成对本申请实施例保护范围的限定。
在S104,可选的,终端设备可以直接将S103生成的最新的目标串烧音频的MIDI信息直接进行保存并输出。
可选的,终端设备还可以根据S103生成的最新的目标串烧音频的MIDI信息合成目标串烧音频的时域波形,从而得到目标串烧音频。可选的,终端设备还可以保存/输出该目标串烧音频。
其中，本申请实施例对终端设备根据目标串烧音频的MIDI信息合成目标串烧音频的时域波形的具体方式不作限定。作为示例，终端设备可以通过为目标串烧音频的MIDI信息加载音色库来合成目标串烧音频的时域波形，或者可以根据目标串烧音频的MIDI信息和波表(波表是预先将各种真实乐器所能发出的所有声音(包括各个音域、声调等)录制下来所存储的文件)合成目标串烧音频的时域波形，或者还可以根据目标串烧音频的MIDI信息和物理模型/神经网络模型来合成目标串烧音频的时域波形，当然不限于此。这里，物理模型/神经网络模型是预先构建的用于合成音频波形的模型，本申请实施例对此不作详述。
应理解,在终端设备根据目标串烧音频的MIDI信息合成目标串烧音频的时域波形时,仅根据目标串烧音频的MIDI信息中除人声声轨之外的所有乐器声轨来合成目标串烧音频的时域波形。
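以"加载音色库合成时域波形，且仅使用除人声声轨之外的乐器声轨"为例，下面给出一种可能的示意写法（假设使用开源库pretty_midi与SoundFont音色库；文件名、音色库路径以及以名称包含"vocal"来判断人声声轨的约定均为本文假设）。

```python
import pretty_midi
import soundfile as sf

# 读取目标串烧音频的MIDI信息（此处假设已写入 medley.mid 文件）
pm = pretty_midi.PrettyMIDI("medley.mid")

# 仅保留乐器声轨：按名称过滤掉人声声轨（命名约定为本文假设）
pm.instruments = [inst for inst in pm.instruments if "vocal" not in inst.name.lower()]

# 通过加载音色库（SoundFont，路径为假设值）合成时域波形
audio = pm.fluidsynth(fs=44100, sf2_path="GeneralUser.sf2")

# 保存为 WAV 格式的音频文件
sf.write("medley.wav", audio, 44100)
```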
可选的,终端设备可以接收用户的第三操作,并响应该第三操作,对S103生成的最新的目标串烧音频的MIDI信息进行渲染,并根据渲染后的MIDI信息合成目标串烧音频的时域波形。然后,终端设备对合成的目标串烧音频的时域波形进行进一步渲染后,得到渲染后的目标串烧音频。可选的,终端设备还可以保存/输出该渲染后的目标串烧音频。
其中,终端设备根据渲染后的MIDI信息合成目标串烧音频的时域波形的详细描述,可以参考上文中终端设备根据目标串烧音频的MIDI信息合成目标串烧音频的时域波形的说明,这里不再赘述。
其中,用户的第三操作,可以包括用户在音频渲染界面输入的音频渲染处理方式的选择操作,以及用户在音频渲染界面输入的用于合成目标串烧音频时域波形的处理方式的选择操作。其中,对目标串烧音频的MIDI信息进行渲染,例如可以包括音源分离。用于合成目标串烧音频时域波形的处理方式,例如可以包括加载音色库、波表合成、物理模型合成等。对目标串烧音频的时域波形进行渲染可以包括混音,人声风格迁移等,本申请实施例对此不作限定。
可选的，终端设备可以将渲染后的目标串烧音频的时域波形保存为任一种音频格式的音频文件，本申请实施例对此不作具体限定。作为示例，终端设备可以将渲染后的目标串烧音频波形保存为WAV格式的音频文件、无损音频压缩编码(free lossless audio codec,FLAC)格式的音频文件、动态影像专家压缩标准音频层面3(moving picture experts group audio layer III,MP3)格式的音频文件、音频压缩格式(OGG Vorbis,ogg)的音频文件等，当然不限于此。
可选的,终端设备还可以保存生成目标串烧音频的工程。这样的话,终端可以根据所保存的工程文件重新设置用于串烧得到该目标串烧音频的m个音频片段的串烧顺序,并重新进行串烧处理,这样可以提高未来对该m个音频片段再次进行串烧时的效率。
作为示例,以终端设备是手机10为例,参考图11,图11示出了本申请实施例提供的一种对目标串烧音频的MIDI信息进行渲染以及输出的示意图。
如图11中的(a)所示,在手机10最终确定目标串烧音频的MIDI信息后,手机10可以显示音频渲染界面1101。用户可以在音频渲染界面1101上输入第三操作,该第三操作可以包括:用户在“音源分离”选项下选择开启“去除人声”(图11中的(a)“去除人声”标签的黑色方块表示开启,白色方框表示关闭)的选择操作,用户在“音频波形合成”选项下选择“加载音色库”的选择操作,用户在“混音”选项下选择“录制人声”的选择操作,以及用户在“人声风格迁移”选项下选择“歌星A”作为迁移目标的选择操作。
作为响应,手机10接收到用户的第三操作,以及接收到用户对音频渲染界面1101上的“确定”按钮的操作(例如点击)后,手机10可以将目标串烧音频的MIDI信息中的人声声轨删除或置为无效,并对目标串烧音频的MIDI信息加载音色库以合成目标串烧音频的时域波形,然后开启录音界面为目标串烧音频录制人声,以及将目标串烧音频中的人声迁移为歌星A的声音。
进一步的，手机10可以在接收到用户的第三操作，并接收到用户对音频渲染界面1101上的"确定"按钮的操作(例如点击)后，显示如图11中的(b)所示的音频发布界面1102。这样，手机10可以通过音频发布界面1102与用户进行交互，并按照用户输入的指示导出目标串烧音频。
如图11中的(b)所示,手机10可以在音频发布界面1102的“导出格式”选项下接收用户输入的选择导出音频格式的选择操作,例如选择“音频格式1”的操作。手机10可以在音频发布界面1102的“导出路径”选项下接收用户输入的目标串烧音频的名称(如名称A)以及路径的操作。手机10还可以在音频发布界面1102的“保存工程”选项下接收用户输入的开启“保存工程”功能的操作。
这样,当用户对音频发布界面1102的“导出”按钮进行操作后(例如点击),作为响应,手机10即将目标串烧音频按照用户的指示进行保存。
在一些实施例中,本申请实施例所提供方法中确定过渡音频信息和生成目标音频的方法部分(即步骤S102-S104),也可以在终端设备向用户实时播放音频的过程中进行。这种情况下,本申请实施例所提供方法中确定过渡音频信息和生成目标音频的方法部分,可以通过能够提供音频聆听的App的一个功能模块实现。
作为示例,能够提供音频聆听的App例如可以是云音乐App。为简单叙述,下文中以能够提供音频聆听的App是云音乐App为例进行说明。
具体的，云音乐App在向用户提供音乐的聆听模式时，可以提供串烧模式。该串烧模式即可通过运行该云音乐App的终端设备、或者该云音乐App连接通信的服务器执行本申请实施例所提供方法中的步骤S102-S104来实现。为简单描述，下文中以通过运行该云音乐App的终端设备执行本申请实施例所提供方法中的步骤S102-S104来实现该串烧模式为例进行说明。
这样的话,可选的,当终端设备向用户通过该终端设备中运行的云音乐App播放音乐时,终端设备所播放的音乐可以是云音乐媒体库自动推荐的音乐,也可以是本地媒体库中的音乐,本申请实施例对此不作限定。
这样，当终端设备通过与用户的交互，确定通过上述串烧模式向用户播放音乐时，终端设备可以将正在向用户播放的当前音乐和将要向用户播放的下一首音乐作为2个目标音频。并基于该2个目标音频和预设串烧顺序，执行上文所述的S102-S104，以生成对该2个目标音频进行串烧后得到的第一目标串烧音频。其中，该预设串烧顺序为：终端设备正在向用户播放的当前音乐→终端设备将要向用户播放的下一首音乐。
这里应注意,当终端设备当前播放的音乐是云音乐媒体库自动推荐的音乐,则终端设备在向用户播放当前音乐的过程中,可以确定出为用户自动推荐的下一首音乐,该下一首音乐即为终端将要向用户播放的下一首音乐。
需要说明的是,在终端设备向用户播放完当前音乐之前,终端设备可以完成对确定出的2个目标音频的串烧并获得第一目标串烧音频。
可选的,终端设备可以在向用户播放完当前音乐后,为用户播放第一目标串烧音频。进一步的,终端设备在向用户播放完第一目标串烧音乐后,为用户播放原先的下一首音乐。
作为示例,如果终端设备正在向用户播放的当前音乐是音乐1,原先将要播放的下一首音乐为音乐2,则终端设备可以在向用户播放完音乐1后,向用户播放第一目标串烧音频。然后,终端设备在向用户播放完第一目标串烧音乐后,为用户播放音乐2。
类似的,当终端设备为用户播放原先的下一首音乐时,该原先的下一首音乐即变为终端设备正在向用户播放的新的当前音乐。这样,终端设备即可重复上述过程,生成对该新的当前音乐和该新的当前音乐的下一首音乐进行串烧后得到的第二目标串烧音频。并且,终端设备可以在向用户播放完该新的当前音乐后,为用户播放该第二串烧音频。
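串烧模式下"边播放当前音乐、边为当前音乐与下一首音乐生成串烧音频"的播放循环，可以用如下示意性代码表示。其中pick_next、make_medley、play均为本文假设的占位函数，分别代表确定下一首音乐、执行S102-S104生成串烧音频、以及播放音频三类操作。

```python
def run_medley_mode(first_song, pick_next, make_medley, play):
    """串烧模式的播放循环示意：当前音乐 → 串烧音频 → 下一首音乐（变为新的当前音乐）。"""
    current = first_song
    while current is not None:
        next_song = pick_next(current)            # 播放当前音乐期间即可确定下一首音乐
        if next_song is None:
            play(current)                         # 没有下一首音乐时直接播完结束
            break
        medley = make_medley(current, next_song)  # 在播放完当前音乐之前完成串烧音频的生成
        play(current)
        play(medley)                              # 播放完当前音乐后播放串烧音频
        current = next_song                       # 下一首音乐成为新的当前音乐，重复上述过程
```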
可以看出,在终端设备通过串烧模式向用户播放音乐时,终端设备可以动态的生成当前音乐和下一首音乐的串烧音频,并为用户播放该串烧音频,从而提高了用户体验。
还可以理解的是,在终端设备通过串烧模式向用户播放音乐时,终端设备为用户播放当前音乐和下一首音乐时,可以仅播放当前音乐和下一首音乐中用于生成串烧音频的音频片段,例如仅播放当前音乐和下一首音乐中的副歌/高潮部分,本申请实施例对此不作限定。
其中，还需说明的是，当终端设备基于确定出的2个目标音频和预设串烧顺序，执行上文所述步骤S102-S104以生成目标串烧音频(例如第一目标串烧音频、第二目标串烧音频、……、第q个目标串烧音频等，q是正整数)的过程中，终端设备只需执行一次生成过渡音频信息和目标串烧音频的MIDI信息的过程(即S102-S103)，而无需接收用户输入的第二操作。
另外，云音乐App的串烧模式中预置有目标串烧音频的预设渲染模式和预设导出模式，该预设渲染模式中包括音源分离处理方式、混音处理方式、声音迁移方式等中的至少一种。该预设导出模式包括串烧音频的导出格式，以及指示是否保存目标串烧音频的工程等。因此，在S104，终端设备无需通过与用户进行交互来获取对目标串烧音频的渲染模式和导出模式。
应理解,云音乐App的串烧模式中所预置的目标串烧音频的预设渲染模式和预设导出模式,可以通过终端设备和用户的交互预先配置得到,也可以在终端设备向用户播放音乐的过程中,通过与用户的交互配置得到,这里不作限定。还应理解,在对目标串烧音频的预设渲染模式和预设导出模式配置完成后,终端设备还可以在向用户播放音乐的过程中,通过与用户的交互来更新该预先配置的预设渲染模式和预设导出模式,本申请实施例对此不作限定。
当然,上述预设导出模式中也可以不包括指示是否保存目标串烧音频的工程。这种情况下,终端设备可以在通过与用户进行交互后停止为用户播放音乐时,通过与用户进行交互,接收用户输入的是否保存目标串烧音频的工程的指示,并基于用户输入的指示执行对目标串烧音频的工程的保存。可以理解,在此之前,终端设备可以将动态生成的所有目标串烧音频的工程进行缓存。
还应理解,上述“串烧模式”的名称仅为示例性说明,并不作为对本申请实施例的限定。
综上,本申请实施例提供了一种音频数据的处理方法,通过该方法,本申请实施例可以在MIDI域基于m个音频片段,生成用于衔接该m个音频片段的m-1个过渡音频信息。这样,通过该m-1个过渡音频信息即可将m个音频片段的MIDI信息衔接起来,从而得到该m个音频片段被串烧后的目标串烧音频。可以看出,通过本申请方法对多个音频片段进行串烧时,终端设备可以生成全新的用于衔接该多个音频片段的过渡音频段,因此本申请实施例提供的方法无需考虑用于得到目标串烧音频的音频片段的相似度。也就是说,通过本申请实施例提供的方法能够获得更丰富、更具多样性的串烧音频。
并且,由于音频的MIDI信息是音频最原始的表现形式,其记录有音频的音符音高、音符力度、音符时长等信息。因此,相比在时域上直接对多个音频片段进行串烧,本申请实施例提供的方法中所生成的用于衔接两个音频片段的过渡音频信息是基于音频乐理生成的,这样的话,基于该过渡音频信息获得的串烧音频在听觉上更加流畅自然。并且,在MIDI域处理数据也更有利于串烧音频在后期渲染时的灵活性和一致性。
此外，通过本申请实施例提供的方法对m个音频片段进行串烧时，用户参与度高，这样即可获得令用户满意的串烧音频，即用户体验度高。
上述主要从方法的角度对本申请实施例提供的方案进行了介绍。为了实现上述功能,其包含了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,本申请能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
本申请实施例可以根据上述方法示例对音频数据的处理装置进行功能模块的划分，例如，可以对应各个功能划分各个功能模块，也可以将两个或两个以上的功能集成在一个处理模块中。上述集成的模块既可以采用硬件的形式实现，也可以采用软件功能模块的形式实现。需要说明的是，本申请实施例中对模块的划分是示意性的，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式。
如图12所示,图12示出了本申请实施例提供的一种音频数据的处理装置120的结构示意图。处理装置120可以用于执行上述的音频数据的处理方法,例如用于执行图3所示的方法。其中,处理装置120可以包括获取单元121、确定单元122以及生成单元123。
获取单元121,用于获取m个音频片段。确定单元122,用于根据该m个音频片段确定m-1个过渡音频信息。生成单元123,用于根据该m个音频片段和该m-1个过渡音频信息,生成目标串烧音频。其中,该m-1个过渡音频信息用于衔接该m个音频片段。对于该m-1个过渡音频信息中的第一过渡音频信息而言,该第一过渡音频信息用于衔接m个音频片段中排序连续的第一音频片段和第二音频片段。这里,m个音频片段的排序是指m个音频片段的串烧顺序。
作为示例,结合图3,获取单元121可以用于执行S101,确定单元122以用于执行S102,生成单元123可以用于执行S103-S104。
可选的,确定单元122,具体用于根据第一音频片段的第一信息和第二音频片段的第二信息,确定第一过渡音频信息。其中,该第一信息包括第一音频片段的MIDI信息和音频特征信息,该第二信息包括第二音频片段的MIDI信息和音频特征信息,该第一过渡音频信息包括第一过渡音频信息对应的第一过渡音频的MIDI信息。
作为示例,结合图3,确定单元122可以用于执行S102。
可选的,上述的音频特征信息包括音频片段的主旋律轨位置信息、风格标签、情感标签、节奏信息、节拍信息、或调号信息中的至少一种。
可选的,确定单元122,具体用于根据第一音频片段的第一信息、第二音频片段的第二信息以及预设神经网络模型,确定第一过渡音频信息。
作为示例,结合图3,确定单元122可以用于执行S102。
可选的,当在目标串烧音频中,第一音频片段位于第二音频片段之前,则:上述第一过渡音频信息是基于用于表征第一过渡音频信息的特征向量确定的,该第一过渡音频信息的特征向量是基于第一向量和第二向量确定的。其中,该第一向量是根据第一信息在第一音频片段的时序末端生成的特征向量,该第二向量是根据第二信息在第二音频片段的时序始端生成的特征向量。
可选的,确定单元122,还用于响应于用户的第一操作,确定k个目标音频。获取单元121,具体用于从该k个目标音频中提取m个音频片段。其中,2≤k≤m,且k是整数。
作为示例,结合图3,确定单元122和获取单元121可以用于执行S101,
可选的,确定单元122,还用于在根据m个音频片段确定m-1个过渡音频信息之前,确定该m个音频片段的串烧顺序。
可选的，确定单元122还用于响应于用户的第二操作，重新确定m个音频片段的串烧顺序。确定单元122还用于根据重新确定的串烧顺序和m个音频片段，重新确定m-1个过渡音频信息。生成单元123，还用于根据重新确定的m-1个过渡音频信息和m个音频片段，重新生成目标串烧音频。
可选的,处理装置120还包括:渲染单元124,用于响应于用户的第三操作,对上述的目标串烧音频进行渲染。
作为示例,结合图3,渲染单元124可以用于执行S104。
可选的,处理装置120还包括:输出单元125,用于输出上述的目标串烧音频。
关于上述可选方式的具体描述可以参见前述的方法实施例,此处不再赘述。此外,上述提供的任一种处理装置120的解释以及有益效果的描述均可参考上述对应的方法实施例,不再赘述。
作为示例，结合图1，处理装置120中的获取单元121和输出单元125，可以通过图1中的触摸屏150和处理器110实现其功能。确定单元122、生成单元123、渲染单元124可以通过图1中的处理器110执行图1中的内部存储器120中的程序代码实现。
图13示出本申请实施例提供的用于承载计算机程序产品的信号承载介质的结构示意图,该信号承载介质用于存储计算机程序产品或用于存储计算设备上执行计算机进程的计算机程序。
如图13所示，信号承载介质130可以包括一个或多个程序指令，其当被一个或多个处理器运行时可以提供以上针对图3描述的功能或者部分功能。因此，例如，参考图3中S101~S104的一个或多个特征可以由与信号承载介质130相关联的一个或多个指令来承担。此外，图13中所示的程序指令也仅为示例性的指令。
在一些示例中,信号承载介质130可以包含计算机可读介质131,诸如但不限于,硬盘驱动器、紧密盘(CD)、数字视频光盘(DVD)、数字磁带、存储器、只读存储记忆体(read-only memory,ROM)或随机存储记忆体(random access memory,RAM)等等。
在一些实施方式中,信号承载介质130可以包含计算机可记录介质132,诸如但不限于,存储器、读/写(R/W)CD、R/W DVD、等等。
在一些实施方式中,信号承载介质130可以包含通信介质133,诸如但不限于,数字和/或模拟通信介质(例如,光纤电缆、波导、有线通信链路、无线通信链路、等等)。
信号承载介质130可以由无线形式的通信介质133(例如，遵守IEEE 802.11标准或者其它传输协议的无线通信介质)来传达。一个或多个程序指令可以是，例如，计算机可执行指令或者逻辑实施指令。
在一些示例中,诸如针对图3描述的音频数据的处理装置可以被配置为,响应于通过计算机可读介质131、计算机可记录介质132、和/或通信介质133中的一个或多个程序指令,提供各种操作、功能、或者动作。
应该理解,这里描述的布置仅仅是用于示例的目的。因而,本领域技术人员将理解,其它布置和其它元素(例如,机器、接口、功能、顺序、和功能组等等)能够被取而代之地使用,并且一些元素可以根据所期望的结果而一并省略。另外,所描述的元素中的许多是可以被实现为离散的或者分布式的组件的、或者以任何适当的组合和位置来结合其它组件实施的功能实体。
在上述实施例中，可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件程序实现时，可以全部或部分地以计算机程序产品的形式来实现。该计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行计算机执行指令时，全部或部分地产生按照本申请实施例的流程或功能。计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。计算机指令可以存储在计算机可读存储介质中，或者从一个计算机可读存储介质向另一个计算机可读存储介质传输，例如，计算机指令可以从一个网站站点、计算机、服务器或者数据中心通过有线(例如同轴电缆、光纤、数字用户线(digital subscriber line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。可用介质可以是磁性介质(例如，软盘、硬盘、磁带)，光介质(例如，DVD)、或者半导体介质(例如固态硬盘(solid state disk,SSD))等。
以上所述,仅为本发明的具体实施方式,但本发明的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本发明揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本发明的保护范围之内。因此,本发明的保护范围应以所述权利要求的保护范围为准。

Claims (23)

  1. 一种音频数据的处理方法,其特征在于,包括:
    获取m个音频片段,m是大于等于2的整数;
    根据所述m个音频片段确定m-1个过渡音频信息,所述m-1个过渡音频信息用于衔接所述m个音频片段;其中,所述m-1个过渡音频信息中的第一过渡音频信息,用于衔接所述m个音频片段中排序连续的第一音频片段和第二音频片段,所述排序为所述m个音频片段的串烧顺序;
    根据所述m个音频片段和所述m-1个过渡音频信息,生成目标串烧音频。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述m个音频片段确定m-1个过渡音频信息,包括:
    根据所述第一音频片段的第一信息和所述第二音频片段的第二信息,确定所述第一过渡音频信息,所述第一过渡音频信息包括所述第一过渡音频信息对应的第一过渡音频的乐器数字接口MIDI信息;
    其中,所述第一信息包括所述第一音频片段的MIDI信息和音频特征信息,所述第二信息包括所述第二音频片段的MIDI信息和音频特征信息。
  3. 根据权利要求2所述的方法,其特征在于,所述音频特征信息包括音频片段的主旋律轨位置信息、风格标签、情感标签、节奏信息、节拍信息、或调号信息中的至少一种。
  4. 根据权利要求2或3所述的方法,其特征在于,所述根据所述第一音频片段的第一信息和所述第二音频片段的第二信息,确定第一过渡音频信息,包括:
    根据所述第一音频片段的第一信息、所述第二音频片段的第二信息以及预设神经网络模型,确定第一过渡音频信息。
  5. 根据权利要求4所述的方法,其特征在于,当在所述目标串烧音频中,所述第一音频片段位于所述第二音频片段之前,则:
    所述第一过渡音频信息是基于用于表征所述第一过渡音频信息的特征向量确定的,所述第一过渡音频信息的特征向量是基于第一向量和第二向量确定的;其中,所述第一向量是根据所述第一信息在所述第一音频片段的时序末端生成的特征向量,所述第二向量是根据所述第二信息在所述第二音频片段的时序始端生成的特征向量。
  6. 根据权利要求1-5中任一项所述的方法,其特征在于,所述获取m个音频片段,包括:
    响应于用户的第一操作,确定k个目标音频,2≤k≤m,且k是整数;
    从所述k个目标音频中提取所述m个音频片段。
  7. 根据权利要求1-6中任一项所述的方法,其特征在于,在所述根据所述m个音频片段确定m-1个过渡音频信息之前,所述方法还包括:
    确定所述m个音频片段的串烧顺序。
  8. 根据权利要求1-7中任一项所述的方法,其特征在于,所述方法还包括:
    响应于用户的第二操作,重新确定所述m个音频片段的串烧顺序;
    根据重新确定的串烧顺序和所述m个音频片段,重新确定m-1个过渡音频信息;
    根据重新确定的m-1个过渡音频信息和所述m个音频片段，重新生成目标串烧音频。
  9. 根据权利要求1-8中任一项所述的方法,其特征在于,所述方法还包括:
    响应于用户的第三操作,对所述目标串烧音频进行渲染。
  10. 根据权利要求1-9中任一项所述的方法,其特征在于,所述方法还包括:
    输出所述目标串烧音频。
  11. 一种音频数据的处理装置,其特征在于,包括:
    获取单元,用于获取m个音频片段,m是大于等于2的整数;
    确定单元,用于根据所述m个音频片段确定m-1个过渡音频信息,所述m-1个过渡音频信息用于衔接所述m个音频片段;其中,所述m-1个过渡音频信息中的第一过渡音频信息,用于衔接所述m个音频片段中排序连续的第一音频片段和第二音频片段,所述排序为所述m个音频片段的串烧顺序;
    生成单元,用于根据所述m个音频片段和所述m-1个过渡音频信息,生成目标串烧音频。
  12. 根据权利要求11所述的装置,其特征在于,
    所述确定单元,具体用于根据所述第一音频片段的第一信息和所述第二音频片段的第二信息,确定所述第一过渡音频信息,所述第一过渡音频信息包括所述第一过渡音频信息对应的第一过渡音频的乐器数字接口MIDI信息;
    其中,所述第一信息包括所述第一音频片段的MIDI信息和音频特征信息,所述第二信息包括所述第二音频片段的MIDI信息和音频特征信息。
  13. 根据权利要求12所述的装置,其特征在于,所述音频特征信息包括音频片段的主旋律轨位置信息、风格标签、情感标签、节奏信息、节拍信息、或调号信息中的至少一种。
  14. 根据权利要求12或13所述的装置,其特征在于,
    所述确定单元,具体用于根据所述第一音频片段的第一信息、所述第二音频片段的第二信息以及预设神经网络模型,确定第一过渡音频信息。
  15. 根据权利要求14所述的装置,其特征在于,当在所述目标串烧音频中,所述第一音频片段位于所述第二音频片段之前,则:
    所述第一过渡音频信息是基于用于表征所述第一过渡音频信息的特征向量确定的,所述第一过渡音频信息的特征向量是基于第一向量和第二向量确定的;其中,所述第一向量是根据所述第一信息在所述第一音频片段的时序末端生成的特征向量,所述第二向量是根据所述第二信息在所述第二音频片段的时序始端生成的特征向量。
  16. 根据权利要求11-15中任一项所述的装置,其特征在于,
    所述确定单元,还用于响应于用户的第一操作,确定k个目标音频,2≤k≤m,且k是整数;
    所述获取单元,具体用于从所述k个目标音频中提取所述m个音频片段。
  17. 根据权利要求11-16中任一项所述的装置,其特征在于,
    所述确定单元,还用于在根据所述m个音频片段确定m-1个过渡音频信息之前,确定所述m个音频片段的串烧顺序。
  18. 根据权利要求11-17中任一项所述的装置,其特征在于,
    所述确定单元,还用于响应于用户的第二操作,重新确定所述m个音频片段的串烧顺序;以及,还用于根据重新确定的串烧顺序和所述m个音频片段,重新确定m-1个过渡音频信息;
    所述生成单元,还用于根据重新确定的m-1个过渡音频信息和所述m个音频片段,重新生成目标串烧音频。
  19. 根据权利要求11-18中任一项所述的装置,其特征在于,所述装置还包括:
    渲染单元,用于响应于用户的第三操作,对所述目标串烧音频进行渲染。
  20. 根据权利要求11-19中任一项所述的装置,其特征在于,所述装置还包括:
    输出单元,用于输出所述目标串烧音频。
  21. 一种音频数据的处理装置,其特征在于,包括:一个或多个处理器和传输接口,所述一个或多个处理器通过所述传输接口接收或发送数据,所述一个或多个处理器被配置为调用存储在存储器中的程序指令,以执行如权利要求1-10中任一项所述的方法。
  22. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质包括程序指令,当所述程序指令在计算机或处理器上运行时,使得所述计算机或所述处理器执行权利要求1-10中任一项所述的方法。
  23. 一种计算机程序产品,其特征在于,当所述计算机程序产品在音频数据的处理装置上运行时,使得所述装置执行如权利要求1-10中任一项所述的方法。
PCT/CN2022/093923 2021-07-31 2022-05-19 一种音频数据的处理方法及装置 WO2023010949A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22851667.0A EP4365888A1 (en) 2021-07-31 2022-05-19 Method and apparatus for processing audio data
US18/426,495 US20240169962A1 (en) 2021-07-31 2024-01-30 Audio data processing method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110876809.0 2021-07-31
CN202110876809.0A CN115700870A (zh) 2021-07-31 2021-07-31 一种音频数据的处理方法及装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/426,495 Continuation US20240169962A1 (en) 2021-07-31 2024-01-30 Audio data processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2023010949A1 true WO2023010949A1 (zh) 2023-02-09

Family

ID=85120762

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/093923 WO2023010949A1 (zh) 2021-07-31 2022-05-19 一种音频数据的处理方法及装置

Country Status (4)

Country Link
US (1) US20240169962A1 (zh)
EP (1) EP4365888A1 (zh)
CN (1) CN115700870A (zh)
WO (1) WO2023010949A1 (zh)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1938753A (zh) * 2004-03-31 2007-03-28 松下电器产业株式会社 乐曲数据编辑装置以及乐曲数据编辑方法
WO2009107137A1 (en) * 2008-02-28 2009-09-03 Technion Research & Development Foundation Ltd. Interactive music composition method and apparatus
US20130282388A1 (en) * 2010-12-30 2013-10-24 Dolby International Ab Song transition effects for browsing
US20140254831A1 (en) * 2013-03-05 2014-09-11 Nike, Inc. Adaptive Music Playback System
CN108766407A (zh) * 2018-05-15 2018-11-06 腾讯音乐娱乐科技(深圳)有限公司 音频连接方法及装置
US20190051272A1 (en) * 2017-08-08 2019-02-14 CommonEdits, Inc. Audio editing and publication platform
CN112037739A (zh) * 2020-09-01 2020-12-04 腾讯音乐娱乐科技(深圳)有限公司 一种数据处理方法、装置、电子设备
US20200410968A1 (en) * 2018-02-26 2020-12-31 Ai Music Limited Method of combining audio signals


Also Published As

Publication number Publication date
EP4365888A1 (en) 2024-05-08
CN115700870A (zh) 2023-02-07
US20240169962A1 (en) 2024-05-23

Similar Documents

Publication Publication Date Title
WO2020113733A1 (zh) 动画生成方法、装置、电子设备及计算机可读存储介质
WO2018059342A1 (zh) 一种双音源音频数据的处理方法及装置
CA2650612C (en) An adaptive user interface
CN111798821B (zh) 声音转换方法、装置、可读存储介质及电子设备
CN111402842A (zh) 用于生成音频的方法、装置、设备和介质
CN111292717B (zh) 语音合成方法、装置、存储介质和电子设备
WO2021259300A1 (zh) 音效添加方法和装置、存储介质和电子设备
CN113257218B (zh) 语音合成方法、装置、电子设备和存储介质
CN110324718A (zh) 音视频生成方法、装置、电子设备及可读介质
WO2022042634A1 (zh) 音频数据的处理方法、装置、设备及存储介质
WO2022160603A1 (zh) 歌曲的推荐方法、装置、电子设备及存储介质
US8682938B2 (en) System and method for generating personalized songs
WO2023217003A1 (zh) 音频处理方法、装置、设备及存储介质
WO2023010949A1 (zh) 一种音频数据的处理方法及装置
KR20180012397A (ko) 디지털 음원 관리 시스템 및 방법, 디지털 음원 재생 장치 및 방법
WO2022143530A1 (zh) 音频处理方法、装置、计算机设备及存储介质
KR101124798B1 (ko) 전자 그림책 편집 장치 및 방법
CN114974184A (zh) 音频制作方法、装置、终端设备及可读存储介质
Behrendt Telephones, music and history: From the invention era to the early smartphone days
Maz Music Technology Essentials: A Home Studio Guide
WO2022041177A1 (zh) 通信消息处理方法、设备及即时通信客户端
KR102598242B1 (ko) 전자 장치 및 이의 제어 방법
WO2023127486A1 (ja) プログラム及び情報処理装置
Costa et al. Internet of Musical Things Environments and Pure Data: A Perfect Match?
CN118447867A (zh) 音乐分离方法、音乐分离装置、电子设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22851667

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022851667

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022851667

Country of ref document: EP

Effective date: 20240129

NENP Non-entry into the national phase

Ref country code: DE