WO2020125372A1 - 混合声音信号的分离方法、装置、电子设备和可读介质 - Google Patents

混合声音信号的分离方法、装置、电子设备和可读介质 (Method, device, electronic device and readable medium for separating mixed sound signals)

Info

Publication number
WO2020125372A1
WO2020125372A1 · PCT/CN2019/121730 · CN2019121730W
Authority
WO
WIPO (PCT)
Prior art keywords
accompaniment
vocal
sound
hidden variable
encoder
Prior art date
Application number
PCT/CN2019/121730
Other languages
English (en)
French (fr)
Inventor
张宁
李岩
姜涛
Original Assignee
北京达佳互联信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京达佳互联信息技术有限公司
Publication of WO2020125372A1
Priority to US17/352,856 (US11430427B2)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating

Definitions

  • The present application relates to the field of computer software applications, and in particular to a method, device, electronic device and readable medium for separating mixed sound signals.
  • Popular music is generally a mix of vocals and accompaniment. Separating mixed music into vocals and accompaniment (vocal-accompaniment separation) is a challenging task. Vocal-accompaniment separation plays an important role in music editing and music retrieval, and improving the performance of a separation model can greatly improve the effect of downstream processing.
  • The current mainstream vocal-accompaniment separation models are end-to-end deterministic models: a mask is computed for each sound source in the time-frequency representation, the mask is multiplied by the time-frequency representation of the mixed sound to obtain the time-frequency features of the separated source, and the time-domain representation of the separated source is then recovered.
  • The inventors found that although the sound source signals separated by such end-to-end models have a high signal-to-noise ratio, the separated signals are almost never clean; they are more or less contaminated with residue from the other sources. Although this residual interference is weak, it severely affects subsequent steps such as lyric segmentation and song evaluation.
  • Industry experts are therefore continuously improving existing solutions and looking for new ones to gradually improve the separation of vocals and accompaniment in mixed sounds.
  • To overcome the problems in the related art, the present application discloses a method, device, electronic device and readable medium for separating mixed sound signals.
  • According to a first aspect of the embodiments of the present application, a method for separating mixed sound signals is provided, including:
  • extracting mixed sound feature data from a mixed sound signal; inputting the mixed sound feature data into a mixed sound coding model to obtain a first hidden variable representing vocal features and a second hidden variable representing accompaniment sound features; inputting the first hidden variable and the second hidden variable into a vocal decoding model and an accompaniment decoding model, respectively, to obtain vocal feature data and accompaniment sound feature data; and obtaining the vocals and the accompaniment based on the vocal feature data and the accompaniment sound feature data.
  • According to a second aspect of the embodiments of the present application, a device for separating mixed sound signals is provided, including:
  • a feature extraction module, used to extract mixed sound feature data from a mixed sound signal;
  • a hidden variable generation module, used to input the mixed sound feature data into a mixed sound coding model to obtain a first hidden variable and a second hidden variable, the first hidden variable representing vocal features and the second hidden variable representing accompaniment sound features;
  • a vocal feature generation module, used to input the first hidden variable into the vocal decoder to obtain vocal feature data;
  • an accompaniment feature generation module, used to input the second hidden variable into the accompaniment sound decoder to obtain accompaniment sound feature data;
  • a vocal generation module, used to obtain the vocals based on the vocal feature data; and
  • an accompaniment generation module, used to obtain the accompaniment based on the accompaniment sound feature data.
  • According to a third aspect of the embodiments of the present application, an electronic device is provided, including:
  • a processor; and a memory for storing processor-executable instructions;
  • wherein the processor is configured to perform any one of the above methods.
  • According to a fourth aspect of the embodiments of the present application, a non-transitory computer-readable storage medium is provided, which stores computer instructions that, when executed, implement any one of the above methods.
  • According to a fifth aspect of the embodiments of the present application, a computer program product is provided; the computer program includes program instructions that, when executed by an electronic device, cause the electronic device to perform any one of the above methods.
  • The technical solutions provided by the embodiments of the present application may have the following beneficial effects: the accompaniment and vocals are separated from the mixed sound by models obtained through training, and the resulting vocals and accompaniment contain relatively little residual noise. Further, the vocals and the accompaniment are trained separately, and the hidden variables obtained from the accompaniment coding model and the vocal coding model are used to construct the loss function of the mixed sound encoder, thereby improving training efficiency.
  • FIG. 1 is a flowchart of a method for separating mixed sound signals according to an exemplary embodiment of the present application;
  • FIGS. 2A-2C show specific embodiments of the training steps according to an exemplary embodiment of the present application;
  • FIG. 3 is a flowchart of a method for separating mixed sound signals according to an exemplary embodiment of the present application;
  • FIG. 4A is a schematic structural diagram of an autoencoder including an encoder and a decoder;
  • FIG. 4B is a schematic structural diagram of a prior-art neural network;
  • FIG. 5 is a schematic structural diagram of the vocal autoencoder, accompaniment autoencoder, and mixed sound encoder according to an embodiment of the present application;
  • FIG. 6 is a schematic structural diagram of a device for separating mixed sound signals according to an embodiment of the present application;
  • FIG. 7 is a block diagram of a first electronic device that performs a method for separating mixed sound signals according to an exemplary embodiment;
  • FIG. 8 is a block diagram of a second electronic device that performs a method for separating mixed sound signals according to an exemplary embodiment;
  • FIG. 9 is a block diagram of a third electronic device that performs a method for separating mixed sound signals according to an exemplary embodiment.
  • In this document, the vocal training samples, vocal verification samples, and vocal sound signals are all signals (or data) of pure vocals; correspondingly, the accompaniment training samples, accompaniment verification samples, and accompaniment sound signals are all signals (or data) of pure accompaniment.
  • The audio data are named training samples and verification samples only to distinguish the samples used in different steps.
  • Distinguishing the hidden variables as first, second, third, fourth, and so on is only meant to distinguish the hidden variables used in different scenarios; it does not mean that these hidden variables differ in their attributes.
  • FIG. 1 is a flowchart of a method for separating mixed sound signals according to an exemplary embodiment of the present application. This embodiment is applied to an application scenario in which human voice and accompaniment are separated from mixed sounds. It includes the following steps.
  • In step S101, mixed sound feature data is extracted from the mixed sound signal.
  • In step S102, the mixed sound feature data is input into the mixed sound coding model to obtain a first hidden variable and a second hidden variable.
  • In step S103, the first hidden variable and the second hidden variable are input into the vocal decoding model and the accompaniment decoding model, respectively, to obtain vocal feature data and accompaniment sound feature data.
  • In step S104, the vocals and the accompaniment are obtained based on the vocal feature data and the accompaniment sound feature data.
  • the mixed sound coding model, vocal decoding model, and accompaniment decoding model are all neural network models obtained by training.
  • In the embodiments of the present application, the mixed sound coding model receives the mixed sound feature data and outputs the first hidden variable and the second hidden variable; the vocals are obtained through the vocal decoding model based on the first hidden variable, and the accompaniment is obtained through the accompaniment decoding model based on the second hidden variable, thereby separating the accompaniment and the vocals from the mixed sound. The first hidden variable characterizes the vocal features, and the second hidden variable characterizes the accompaniment sound features.
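As an illustration of this two-hidden-variable structure, a mixed sound coding model could be built as a network with a shared trunk and two output heads. This is only a sketch: the patent does not specify the architecture, the feature dimension (513 magnitude-spectrogram bins are assumed), or the latent size (64 is assumed).

```python
import torch
import torch.nn as nn

class MixedSoundEncoder(nn.Module):
    """Sketch of a mixed sound coding model: one feature input, two hidden variables.

    The first head outputs the first hidden variable (vocal features) and the
    second head outputs the second hidden variable (accompaniment sound features).
    Layer sizes are assumptions for illustration only.
    """

    def __init__(self, feat_dim: int = 513, latent_dim: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.vocal_head = nn.Linear(256, latent_dim)    # first hidden variable
        self.accomp_head = nn.Linear(256, latent_dim)   # second hidden variable

    def forward(self, x: torch.Tensor):
        z = self.trunk(x)
        return self.vocal_head(z), self.accomp_head(z)
```

Under these assumed dimensions, a batch of mixed sound feature frames of shape (batch, 513) yields two hidden-variable tensors of shape (batch, 64), which would be passed to the vocal decoder and the accompaniment decoder respectively.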
  • The vocal coding model and the vocal decoding model are obtained by training a vocal encoder and a vocal decoder. Referring to FIG. 2A, the training includes the following steps:
  • In step S201, vocal training samples are constructed;
  • In step S202, a vocal training sample is input into the current vocal encoder to obtain an output third hidden variable, the third hidden variable characterizing the vocal features;
  • In step S203, the third hidden variable is input into the current vocal decoder to obtain a corresponding vocal verification sample;
  • In step S204, a first loss function is constructed based on the current vocal training sample and the corresponding vocal verification sample, and the weight parameters of the current vocal encoder and vocal decoder are updated by back-propagation of the first loss function;
  • In step S205, it is determined whether the first loss function has reached its minimum; if not, the process returns to step S202, otherwise the iteration ends;
  • In step S206, when the iteration ends, the current vocal encoder and vocal decoder are used as the vocal coding model and the vocal decoding model.
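Steps S201 to S206 amount to a standard autoencoder training loop. The sketch below is illustrative only: the patent does not prescribe a framework, network architecture, feature dimension, optimizer, or loss, so the use of PyTorch, small fully connected layers, Adam, and a mean-squared-error reconstruction loss are assumptions.

```python
import torch
import torch.nn as nn

FEAT_DIM = 513   # assumed: magnitude-spectrogram bins per frame
LATENT_DIM = 64  # assumed size of the hidden variable

class Encoder(nn.Module):
    """Maps a sound-feature frame to a hidden variable (e.g. the third hidden variable)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, LATENT_DIM))
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Reconstructs a sound-feature frame from a hidden variable."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, FEAT_DIM))
    def forward(self, h):
        return self.net(h)

def train_autoencoder(samples, num_epochs=10, lr=1e-3):
    """samples: list of (batch, FEAT_DIM) tensors of pure vocal (or accompaniment) features."""
    encoder, decoder = Encoder(), Decoder()
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.MSELoss()            # loss between training sample and verification sample
    for _ in range(num_epochs):
        for x in samples:             # step S202: feed one training sample
            h = encoder(x)            # third hidden variable
            x_hat = decoder(h)        # step S203: verification (reconstructed) sample
            loss = loss_fn(x_hat, x)  # step S204: first loss function
            optimizer.zero_grad()
            loss.backward()           # back-propagation
            optimizer.step()          # update encoder/decoder weight parameters
    return encoder, decoder           # step S206: final vocal coding/decoding models
```

The accompaniment encoder and decoder of steps S301 to S306 below would be trained with the same loop on pure accompaniment features.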
  • The accompaniment coding model and the accompaniment decoding model are obtained by training an accompaniment encoder and an accompaniment decoder. Referring to FIG. 2B, the training includes the following steps:
  • In step S301, accompaniment sound training samples are constructed;
  • In step S302, an accompaniment sound training sample is input into the current accompaniment encoder to obtain an output fourth hidden variable, the fourth hidden variable characterizing the accompaniment sound features;
  • In step S303, the fourth hidden variable is input into the current accompaniment decoder to obtain a corresponding accompaniment verification sample;
  • In step S304, a second loss function is constructed based on the current accompaniment training sample and the corresponding accompaniment verification sample, and the weight parameters of the current accompaniment encoder and accompaniment decoder are updated by back-propagation of the second loss function;
  • In step S305, it is determined whether the second loss function has reached its minimum; if not, the process returns to step S302, otherwise the iteration ends;
  • In step S306, when the iteration ends, the current accompaniment encoder and accompaniment decoder are used as the accompaniment coding model and the accompaniment decoding model.
  • The mixed sound coding model is obtained by training a mixed sound encoder. Referring to FIG. 2C, the training includes the following steps:
  • In step S401, mixed sound training samples are constructed based on the vocal training samples and the accompaniment sound training samples;
  • In step S402, a mixed sound training sample is input into the current mixed sound encoder to obtain an output fifth hidden variable and sixth hidden variable, the fifth hidden variable characterizing the vocal features and the sixth hidden variable characterizing the accompaniment sound features;
  • In step S403, a third loss function is constructed from the current fifth and sixth hidden variables, the third and fourth hidden variables previously obtained by training the vocal encoder and the accompaniment encoder, the first loss function formed from the vocal training samples and vocal verification samples, and the second loss function formed from the accompaniment training samples and accompaniment verification samples; the weight parameters of the current mixed sound encoder are updated by back-propagation of the third loss function;
  • In step S404, it is determined whether the third loss function has reached its minimum; if not, the process returns to step S402, otherwise the iteration ends;
  • In step S405, when the iteration ends, the current mixed sound encoder is used as the mixed sound coding model for the application scenario.
  • In the above model training, the vocal training samples are pure vocals, the accompaniment training samples are pure accompaniment sounds, and the mixed sound training samples are obtained by mixing each vocal training sample with each accompaniment training sample.
  • Moreover, the loss function of the mixed sound encoder is constructed from the loss functions and hidden variables of the vocal and accompaniment training processes. Therefore, when the loss functions obtained for the vocals and the accompaniment converge, the loss function over the hidden variables also tends to converge, finally yielding the mixed sound coding model.
  • It should be understood that the sound features involved in the above embodiments, including the mixed sound features, vocal features, and accompaniment sound features, are all taken from the original sound signal and represent the essential sound characteristics of the original sound.
  • A sound feature is, for example, a sound spectrogram.
  • Methods for extracting sound features are existing technology and are not repeated here.
  • FIG. 3 is a flowchart of a method for separating mixed sound signals according to an exemplary embodiment of the present application.
  • In step S501, mixed sound feature data is extracted from the mixed sound signal by Fourier transform.
  • In step S502, the mixed sound feature data is input into the mixed sound coding model to obtain the first hidden variable and the second hidden variable.
  • In step S503, the first hidden variable and the second hidden variable are input into the vocal decoding model and the accompaniment decoding model, respectively, to obtain vocal feature data and accompaniment sound feature data.
  • In step S504, the vocals and the accompaniment are obtained from the vocal feature data and the accompaniment sound feature data using the inverse Fourier transform.
  • In the embodiments of the present application, the spectral features of the mixed sound are obtained from the mixed sound signal by the Fourier transform and fed into the model, which separates out the first hidden variable representing the vocal spectrum and the second hidden variable representing the accompaniment spectrum; the vocals and the accompaniment are then reconstructed from the first and second hidden variables, thereby separating the vocals and the accompaniment from the mixed sound signal.
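Steps S501 to S504 can be combined into a short inference sketch. The patent states only that the features are obtained by Fourier transform and the waveforms by its inverse; the use of librosa's STFT/ISTFT, magnitude-only features, reuse of the mixture phase for reconstruction, and the model interfaces (e.g. a mixed encoder returning two hidden variables, like the MixedSoundEncoder sketched earlier) are illustrative assumptions.

```python
import numpy as np
import librosa
import torch

def separate(mix_wav_path, mix_encoder, vocal_decoder, accomp_decoder,
             sr=22050, n_fft=1024, hop=256):
    """Hypothetical inference pipeline corresponding to steps S501-S504."""
    y, _ = librosa.load(mix_wav_path, sr=sr)
    # Step S501: mixed sound features via Fourier transform (magnitude spectrogram).
    spec = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(spec), np.angle(spec)

    frames = torch.from_numpy(mag.T).float()        # (frames, n_fft // 2 + 1)
    with torch.no_grad():
        # Step S502: the mixed encoder outputs the first and second hidden variables.
        h_vocal, h_accomp = mix_encoder(frames)
        # Step S503: each hidden variable goes to its own decoder.
        vocal_mag = vocal_decoder(h_vocal).numpy().T
        accomp_mag = accomp_decoder(h_accomp).numpy().T

    # Step S504: inverse Fourier transform; reusing the mixture phase is an assumption.
    vocals = librosa.istft(vocal_mag * np.exp(1j * phase), hop_length=hop)
    accomp = librosa.istft(accomp_mag * np.exp(1j * phase), hop_length=hop)
    return vocals, accomp
```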
  • FIG. 4A is a schematic structural diagram of a prior-art autoencoder including an encoder and a decoder.
  • An autoencoder is a type of neural network that, after training, attempts to copy its input to its output. The autoencoder has an internal hidden layer that produces a code serving as the input to the decoder. As shown in FIG. 4A, the input signal 301 passes through the encoder to produce a hidden variable 302, which serves as the input to the decoder, and the hidden variable 302 passes through the decoder to produce a reconstructed signal 303. More specifically, the network can be viewed as consisting of two parts: an encoder represented by a function h = f(x) and a decoder r = g(h) that produces the reconstruction; training the autoencoder realizes weight parameters such that g(f(x)) = x. To obtain usable encoder and decoder models, a loss function is defined, and with the goal of minimizing this loss function, the weight parameters of the encoder and decoder are continuously updated through iterative training to obtain the final encoder model and decoder model.
  • When building the encoder and decoder, various types of neural networks can be used, such as a recurrent neural network (RNN), a deep neural network (DNN), a convolutional neural network (CNN), or a back-propagation (BP) network.
  • The structure of a typical neural network is shown in FIG. 4B: the input layer passes through multiple feature-mapping layers to produce the output layer.
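As one concrete, purely illustrative possibility, the encoder could be built as a small convolutional network over spectrogram patches; an RNN or a plain fully connected DNN could be substituted just as well, as the text notes. The channel counts, kernel sizes, and 64-dimensional hidden variable below are assumptions.

```python
import torch.nn as nn

# Hypothetical convolutional encoder over (batch, 1, freq, time) spectrogram patches.
conv_encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(64),  # maps the flattened feature maps to a 64-dimensional hidden variable
)
```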
  • Applying the above encoder and decoder structures to the embodiments of the present application yields the structure shown in FIG. 5, which includes the vocal autoencoder, the accompaniment autoencoder, and the mixed sound encoder.
  • Referring to FIG. 5, each vocal passes through encoder 1 (i.e., the vocal encoder) to obtain hidden variable 1, and hidden variable 1 is input into decoder 1 (i.e., the vocal decoder) to obtain a reconstructed vocal. The weight parameters of encoder 1 and decoder 1 are updated by back-propagation of the loss function between the reconstructed vocal and the input vocal. The above steps are repeated for each sample in the vocal training samples, and the resulting encoder 1 and decoder 1 are used as the vocal encoder model and the vocal decoder model.
  • Similarly, each accompaniment passes through encoder 2 (i.e., the accompaniment encoder) to obtain hidden variable 2, which is input into decoder 2 (i.e., the accompaniment decoder) to obtain a reconstructed accompaniment. The weight parameters of encoder 2 and decoder 2 are updated by back-propagation of the loss function between the reconstructed accompaniment and the input accompaniment. The above steps are repeated for each sample in the accompaniment training samples, and the resulting encoder 2 and decoder 2 are used as the accompaniment encoder model and the accompaniment decoder model.
  • Finally, the mixed sound training samples are obtained by mixing the vocal training samples and the accompaniment training samples; that is, each mixed sound sample is the mixture of one vocal training sample and one accompaniment training sample. Each mixed sound sample is input into the mixed sound encoder, and the loss function of the mixed sound encoder is constructed jointly from the mixed sound training sample together with the loss function of the corresponding vocal training sample and the loss function of the corresponding accompaniment training sample; the weight parameters of the mixed sound encoder are continuously updated with the goal of minimizing this loss function. The final mixed sound encoder is used as the mixed sound coding model.
  • To aid understanding, the loss function of the mixed sound encoder is described using the following notation: v denotes a vocal training sample and v̂ the corresponding vocal verification sample (the reconstructed vocal); a denotes an accompaniment training sample and â the corresponding accompaniment verification sample (the reconstructed accompaniment); h_v and h_a denote the two hidden variables output by the mixed sound encoder (hidden variables 3 and 4 in FIG. 5); a further pair of hidden variables are those output by the vocal encoder (hidden variable 1 in FIG. 5) and by the accompaniment encoder (hidden variable 2 in FIG. 5); and the loss function of the vocal autoencoder and the loss function of the accompaniment autoencoder are defined over (v, v̂) and (a, â) respectively. The loss function of the mixed sound encoder is expressed in terms of these quantities.
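The equation itself appears in the original filing as an image and is not reproduced in this text, so its exact form is not known here. The sketch below shows one plausible shape consistent with the variables just described: the vocal and accompaniment reconstruction losses plus terms that pull the mixed sound encoder's hidden variables toward the hidden variables produced by the separately trained vocal and accompaniment encoders. The equal weighting, the use of mean squared error, and the `encoder`/`decoder` attribute names are assumptions.

```python
import torch
import torch.nn.functional as F

def third_loss(mix_encoder, vocal_ae, accomp_ae, mix_feat, vocal_feat, accomp_feat):
    """Hypothetical mixed-sound-encoder loss for one (mixture, vocal, accompaniment) triple.

    vocal_ae / accomp_ae are the already-trained autoencoders; their `encoder` and
    `decoder` attribute names are assumed for illustration, not taken from the patent.
    """
    # Hidden variables 3 and 4: output of the mixed sound encoder for the mixture features.
    h_v, h_a = mix_encoder(mix_feat)

    # Hidden variables 1 and 2 and the verification samples, from the fixed autoencoders.
    with torch.no_grad():
        h_v_ref = vocal_ae.encoder(vocal_feat)
        h_a_ref = accomp_ae.encoder(accomp_feat)
        v_hat = vocal_ae.decoder(h_v_ref)      # vocal verification sample
        a_hat = accomp_ae.decoder(h_a_ref)     # accompaniment verification sample

    loss_v = F.mse_loss(v_hat, vocal_feat)     # first loss function over (v, v_hat)
    loss_a = F.mse_loss(a_hat, accomp_feat)    # second loss function over (a, a_hat)
    latent_match = F.mse_loss(h_v, h_v_ref) + F.mse_loss(h_a, h_a_ref)
    return loss_v + loss_a + latent_match      # equal weighting is an assumption
```

Because the vocal and accompaniment autoencoders are held fixed in this sketch, the first and second loss terms are constants with respect to the mixed sound encoder, and it is the latent-matching terms that actually drive its weight updates during back-propagation.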
  • The above embodiments separate the vocals and the accompaniment from the mixed sound signal, and the resulting sound signals contain relatively little noise.
  • The model training steps can be performed offline, saving terminal computing resources, while the model application steps can be performed online to separate mixed sound signals in real time.
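One common way to realize this offline/online split is to export the trained weights after offline training and load them once when the online service starts. The helpers below are a sketch with placeholder file names, not part of the patent.

```python
from typing import Dict

import torch
import torch.nn as nn

def export_models(models: Dict[str, nn.Module], directory: str = ".") -> None:
    """Offline: persist the trained weight parameters (file names are placeholders)."""
    for name, model in models.items():
        torch.save(model.state_dict(), f"{directory}/{name}.pt")

def load_models(models: Dict[str, nn.Module], directory: str = ".") -> None:
    """Online: load the weights once at start-up and switch to inference mode."""
    for name, model in models.items():
        model.load_state_dict(torch.load(f"{directory}/{name}.pt"))
        model.eval()
```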
  • FIG. 6 is a schematic structural diagram of a device for separating mixed sound signals according to an embodiment of the present application. Referring to FIG. 6, the device 800 includes a feature extraction module 801, a hidden variable generation module 802, a vocal feature generation module 803, an accompaniment feature generation module 805, a vocal generation module 804, and an accompaniment generation module 806.
  • the feature extraction module 801 is used to extract mixed sound feature data from the mixed sound signal.
  • The hidden variable generation module 802 is used to input the mixed sound feature data into the mixed sound coding model to obtain a first hidden variable and a second hidden variable; the first hidden variable represents the vocal features and the second hidden variable represents the accompaniment sound features.
  • The vocal feature generation module 803 is used to input the first hidden variable output by the hidden variable generation module 802 into the vocal decoder to obtain vocal feature data.
  • The accompaniment feature generation module 805 is used to input the second hidden variable output by the hidden variable generation module 802 into the accompaniment sound decoder to obtain accompaniment sound feature data.
  • the vocal generation module 804 is used to obtain vocals based on vocal characteristic data.
  • the accompaniment generation module 806 is used to obtain accompaniment based on the accompaniment sound feature data.
  • the above device further includes: a vocal sample collection module and a vocal model training module.
  • the vocal sample collection module is used to construct vocal training samples.
  • Each sample in the vocal training sample is a vocal feature extracted from pure vocals.
  • The vocal model training module is used to perform the following iterative processing until the loss function is minimized: input a vocal training sample into the current vocal encoder to obtain an output third hidden variable, the third hidden variable characterizing the vocal features; input the third hidden variable into the current vocal decoder to obtain a corresponding vocal verification sample; construct a first loss function based on the current vocal training sample and the corresponding vocal verification sample, and update the weight parameters of the current vocal encoder and vocal decoder by back-propagation of the first loss function. When the iterative processing ends, the current vocal encoder and vocal decoder are used as the vocal coding model and the vocal decoding model.
  • the above-mentioned device further includes an accompaniment sample collection module and an accompaniment model training module.
  • the accompaniment sample collection module is used to construct accompaniment sound training samples.
  • Each sample in the accompaniment sound training sample is the accompaniment sound feature extracted from the pure accompaniment sound.
  • The accompaniment model training module is used to perform the following iterative processing until the loss function is minimized: input an accompaniment sound training sample into the current accompaniment encoder to obtain an output fourth hidden variable, the fourth hidden variable characterizing the accompaniment sound features; input the fourth hidden variable into the current accompaniment decoder to obtain a corresponding accompaniment verification sample; construct a second loss function based on the current accompaniment training sample and the corresponding accompaniment verification sample, and update the weight parameters of the current accompaniment encoder and accompaniment decoder by back-propagation of the second loss function. When the iterative processing ends, the current accompaniment encoder and accompaniment decoder are used as the accompaniment coding model and the accompaniment decoding model.
  • the above device further includes: a mixed sound sample collection module and a mixed sound model training module.
  • the mixed sound sample collection module is used to construct a mixed sound training sample based on the vocal training sample and the accompaniment sound training sample.
  • Each sample of the mixed sound training samples consists of the mixed sound features extracted from a mixture of pure vocals and pure accompaniment.
  • The mixed sound model training module is used to construct the mixed sound training samples based on the vocal training samples and the accompaniment sound training samples, and to perform the following iterative processing until the loss function is minimized: input a mixed sound training sample into the current mixed sound encoder to obtain an output fifth hidden variable and sixth hidden variable; construct a third loss function from the current fifth and sixth hidden variables, the third and fourth hidden variables, and the first and second loss functions; and update the weight parameters of the current mixed sound encoder by back-propagation of the third loss function. When the iterative processing ends, the current mixed sound encoder is used as the mixed sound coding model.
  • In some embodiments, the mixed sound feature data, vocal feature data, and accompaniment sound feature data are all data extracted from the original sound signals by Fourier transform, characterizing the deep features of the sound signals.
  • Fig. 7 is a block diagram of an electronic device performing the above method according to an exemplary embodiment.
  • the electronic device 1200 may be a mobile phone, computer, digital broadcasting terminal, messaging device, game console, tablet device, medical device, fitness device, personal digital assistant, or the like.
  • Referring to FIG. 7, the electronic device 1200 may include one or more of the following components: a processing component 1202, a memory 1204, a power supply component 1206, a multimedia component 1208, an audio component 1210, an input/output (I/O) interface 1212, a sensor component 1214, and a communication component 1216.
  • the processing component 1202 generally controls the overall operations of the electronic device 1200, such as operations associated with display, telephone calls, data communication, camera operations, and recording operations.
  • the processing component 1202 may include one or more processors 1220 to execute instructions to complete all or part of the steps in the above method.
  • the processing component 1202 may include one or more modules to facilitate interaction between the processing component 1202 and other components.
  • the processing component 1202 may include a multimedia module to facilitate interaction between the multimedia component 1208 and the processing component 1202.
  • The memory 1204 is configured to store various types of data to support operation of the electronic device 1200. Examples of such data include instructions for any application or method operating on the electronic device 1200, contact data, phone book data, messages, pictures, videos, and so on.
  • The memory 1204 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • the power supply component 1206 provides power to various components of the electronic device 1200.
  • the power components 1206 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 1200.
  • the multimedia component 1208 includes a screen that provides an output interface between the electronic device 1200 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensor may not only sense the boundary of the touch or sliding action, but also detect the duration and pressure related to the touch or sliding operation.
  • the multimedia component 1208 includes a front camera and/or a rear camera. When the device 1200 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.
  • the audio component 1210 is configured to output and/or input audio signals.
  • For example, the audio component 1210 includes a microphone (MIC), which is configured to receive external audio signals when the electronic device 1200 is in an operation mode such as a call mode, a recording mode, or a speech recognition mode.
  • the received audio signal may be further stored in the memory 1204 or sent via the communication component 1216.
  • the audio component 1210 further includes a speaker for outputting audio signals.
  • the I/O interface 1212 provides an interface between the processing component 1202 and a peripheral interface module.
  • the peripheral interface module may be a keyboard, a click wheel, or a button. These buttons may include, but are not limited to: home button, volume button, enable button, and lock button.
  • the sensor assembly 1214 includes one or more sensors for providing the electronic device 1200 with various aspects of status assessment.
  • For example, the sensor component 1214 can detect the on/off state of the electronic device 1200 and the relative positioning of components (for example, the display and keypad of the electronic device 1200); the sensor component 1214 can also detect a change in position of the electronic device 1200 or one of its components, the presence or absence of user contact with the electronic device 1200, the orientation or acceleration/deceleration of the electronic device 1200, and changes in the temperature of the electronic device 1200.
  • the sensor assembly 1214 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact.
  • the sensor assembly 1214 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor assembly 1214 may further include an acceleration sensor, a gyro sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • the communication component 1216 is configured to facilitate wired or wireless communication between the electronic device 1200 and other devices.
  • the electronic device 1200 may access a wireless network based on a communication standard, such as WiFi, an operator network (such as 2G, 3G, 4G, or 5G), or a combination thereof.
  • the communication component 1216 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 1216 further includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • In an exemplary embodiment, the electronic device 1200 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.
  • a non-transitory computer-readable storage medium including instructions is also provided, such as a memory 1204 including instructions, which can be executed by the processor 1220 of the electronic device 1200 to complete the above method.
  • the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, or the like.
  • Fig. 8 is a block diagram of an electronic device performing the above method according to an exemplary embodiment.
  • For example, the electronic device 1300 may be provided as a server.
  • the electronic device 1300 includes a processing component 1322, which further includes one or more processors, and memory resources represented by the memory 1332, for storing instructions executable by the processing component 1322, such as application programs.
  • the application programs stored in the memory 1332 may include one or more modules each corresponding to a set of instructions.
  • Further, the processing component 1322 is configured to execute instructions to perform the above-described method.
  • the electronic device 1300 may also include a power supply component 1326 configured to perform power management of the electronic device 1300, a wired or wireless network interface 1350 configured to connect the electronic device 1300 to the network, and an input output (I/O) interface 1358 .
  • The electronic device 1300 can operate an operating system stored in the memory 1332, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
  • Fig. 9 is a block diagram of another electronic device performing the above method according to an exemplary embodiment.
  • Referring to FIG. 9, the electronic device 1400 includes:
  • a processor 1410; and
  • a memory 1420 for storing processor-executable instructions;
  • wherein the processor 1410 is configured to: extract mixed sound feature data from a mixed sound signal; input the mixed sound feature data into a mixed sound coding model to obtain a first hidden variable representing vocal features and a second hidden variable representing accompaniment sound features; input the first hidden variable and the second hidden variable into a vocal decoding model and an accompaniment decoding model, respectively, to obtain vocal feature data and accompaniment sound feature data; and obtain the vocals and the accompaniment based on the vocal feature data and the accompaniment sound feature data.
  • Optionally, the processor 1410 is further configured to construct vocal training samples and perform the iterative training described above until the loss function is minimized; when the iteration ends, the current vocal encoder and vocal decoder are used as the vocal coding model and the vocal decoding model.
  • Optionally, the processor 1410 is further configured to construct accompaniment sound training samples and perform the iterative training described above until the loss function is minimized; when the iteration ends, the current accompaniment encoder and accompaniment decoder are used as the accompaniment coding model and the accompaniment decoding model.
  • Optionally, the processor 1410 is further configured to construct mixed sound training samples and perform the iterative training described above until the loss function is minimized; when the iteration ends, the current mixed sound encoder is used as the mixed sound coding model.
  • Optionally, the vocal encoder, the vocal decoder, the accompaniment sound encoder, the accompaniment sound decoder, and the mixed sound encoder are each one of a CNN, DNN, or RNN neural network.
  • frequency domain features are extracted from the mixed sound signal as the mixed sound feature data based on Fourier transform.
  • the vocals and the accompaniment are obtained based on the inverse Fourier transform, respectively.
  • In an exemplary embodiment, a computer program product is also provided; the computer program includes program instructions that, when executed by an electronic device, cause the electronic device to perform the above method.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Electrophonic Musical Instruments (AREA)
  • Machine Translation (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The present application relates to a method, device, electronic device and readable medium for separating mixed sound signals. The method includes: extracting mixed sound feature data from a mixed sound signal; inputting the mixed sound feature data into a mixed sound coding model to obtain a first hidden variable and a second hidden variable, the first hidden variable representing vocal features and the second hidden variable representing accompaniment sound features; inputting the first hidden variable and the second hidden variable into a vocal decoding model and an accompaniment decoding model, respectively, to obtain vocal feature data and accompaniment sound feature data; and obtaining the vocals and the accompaniment based on the vocal feature data and the accompaniment sound feature data. The vocals and accompaniment obtained by this method contain relatively little noise.

Description

混合声音信号的分离方法、装置、电子设备和可读介质
相关申请的交叉引用
本申请要求在2018年12月20日提交中国专利局、申请号为201811564607.7、申请名称为“混合声音信号的分离方法、装置、电子设备和可读介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请属于计算机软件应用领域,尤其是一种混合声音信号的分离方法、装置、电子设备和可读介质。
背景技术
一般的流行音乐由人声和伴奏叠加混合而成。将混合好的音乐分离得到人声和伴奏(声伴分离)是一项有挑战性的工作。声伴分离对于音乐编辑,音乐检索有重要作用。声伴分离模型性能的改进能够极大地提高后续处理流程的效果。
当前主流的声伴分离模型是端到端的确定性模型,计算每个声源在时频图中的mask(掩码),再用mask乘以混合声音的时频图得到分离声源的时频特征,进而得到分离声源的时域表示。发明人发现虽然这种端到端的模型分离得到的声源信号有较高的信噪比,但是分离声源信号几乎不可能是干净的,或多或少都会掺杂有残留的其他声源信号。这些残留的干扰虽然微弱,但对后续的歌词切分,歌曲评价等步骤有非常严重的影响。目前业界专家也在持续地改进现有技术方案以及寻找新的技术方案,以逐步改善混合音中的人声和伴奏的分离效果。
发明内容
为克服相关技术中存在的问题,本申请公开一种混合声音信号的分离方 法、装置、电子设备和可读介质,以解决现有技术中存在的问题。
根据本申请实施例的第一方面,提供一种混合声音信号的分离方法,包括:
从混合声音信号中提取混合声音特征数据;
将混合声音特征数据输入到混合声音编码模型中,得到第一隐变量和第二隐变量,所述第一隐变量表征人声特征,所述第二隐变量表征伴奏声音特征;
将所述第一隐变量和所述第二隐变量分别输入到人声解码模型和伴奏解码模型,得到人声特征数据和伴奏声音特征数据;以及
基于所述人声特征数据和所述伴奏声音特征数据得到人声和伴奏。
根据本申请实施例的第二方面,提供一种混合声音信号的分离装置,包括:
特征提取模块,用于从混合声音信号中提取混合声音特征数据;
隐变量生成模块,用于将混合声音特征数据输入到混合声音编码模型中,得到第一隐变量和第二隐变量,所述第一隐变量表征人声特征,所述第二隐变量表征伴奏声音特征;
人声特征生成模块,用于将所述第一隐变量输入到人声解码器,得到人声特征数据;
伴奏特征生成模块,用于将所述第一隐变量输入到伴奏声音解码器,得到伴奏声音特征数据;
人声生成模块,用于基于所述人声特征数据得到人声;
伴奏生成模块,用于基于所述伴奏声音特征数据得到伴奏。
根据本申请实施例的第三方面,提供一种电子设备,包括:
处理器;
用于存储处理器可执行指令的存储器;
其中,所述处理器被配置为执行上述任意一项所述的方法。
根据本申请实施例的第四方面,提供一种非临时性计算机可读存储介质, 所述计算机可读存储介质存储有计算机指令,所述计算机指令被执行时实现如上述任一项所述的方法。
根据本申请实施例的第五方面,还提供了计算机程序产品,包括计算机程序产品,所述计算机程序包括程序指令,当所述程序指令被电子设备执行时,使所述电子设备执行上述任一项所述的方法。
本申请的实施例提供的技术方案可以包括以下有益效果:通过训练后得到的模型从混合声音中分离伴奏和人声,由此得到的人声和伴奏声音的信噪比较低。进一步地,对于人声和伴奏分别进行训练,将伴奏编码模型和人声编码模型得到的隐变量构建混合声音编码器的损失函数,从而提高了训练效率。
附图说明
图1是根据本申请一示例性实施例的一种混合声音信号的分离方法的流程图;
图2A-2C是根据本申请一示例性实施例的训练步骤的具体实施例;
图3是根据本申请一示例性实施例的一种混合声音信号的分离方法的流程图;
图4A所示是包含编码器和解码器的自编码器的结构示意图;
图4B所示是现有技术的神经网络的结构示意图;
图5所示是本申请实施例的人声自编码器、伴奏自编码器和混合声音编码器的结构示意图;
图6所示是本申请实施例的混合声音信号的分离装置的结构示意图;
图7是根据一示例性实施例示出的第一种执行一种混合声音信号的分离方法的电子设备的框图;
图8是根据一示例性实施例示出的第二种执行一种混合声音信号的分离方法的电子设备的框图;
图9是根据一示例性实施例示出的第三种执行一种混合声音信号的分离 方法的电子设备的框图。
具体实施方式
在本文中,人声训练样本、人声验证样本和人声声音信号均为纯净的人声的信号(或数据),相应地,伴奏音训练样本、伴奏声音验证样本和伴奏声音信号均为纯净的伴奏声音的信号(或数据)。另外,将音频数据命名为训练样本和验证样本只是为了区分在不同步骤中使用的样本。将隐变量区分为第一、第二、第三、第四……只是为了区分在不同场景下使用的隐变量,并不意味着这些隐变量在属性上有所区别。
图1是根据本申请一示例性实施例的一种混合声音信号的分离方法的流程图。该实施例应用于从混合声音中分离人声和伴奏的应用场景。具体包括以下步骤。
在步骤S101中,从混合声音信号中提取混合声音特征数据。
在步骤S102中,将混合声音特征数据输入到混合声音编码模型中,得到第一隐变量和第二隐变量。
在步骤S103中,将第一隐变量和第二隐变量分别输入到人声解码模型和伴奏解码模型,得到人声特征数据和伴奏声音特征数据。
在步骤S104中,基于人声特征数据和伴奏声音特征数据得到人声和伴奏。
在本申请实施例,混合声音编码模型、人声解码模型和伴奏解码模型均为训练得到的神经网络模型。混合声音编码模型接收混合声音特征数据,输出第一隐变量和第二隐变量,基于第一隐变量经由人声解码模型得到人声,基于第二隐变量经由伴奏解码模型得到伴奏,从而实现从混合声音中分离出伴奏和人声。第一隐变量表征人声特征,第二隐变量表征伴奏声音特征。
人声编码模型和人声解码模型经由人声编码器和人声解码器的训练得到,参见图2A,所述训练包括以下步骤:
在步骤S201中,构建人声训练样本;
利用步骤S202-S205进行迭代处理,直至第一损失函数最小化:
在步骤S202中,将一个人声训练样本输入到当前的人声编码器中,得到输出的第三隐变量,第三隐变量表征人声特征;
在步骤S203中,将第三隐变量输入到当前的人声解码器,得到对应的人声验证样本;
在步骤S204中,基于当前的人声训练样本和对应的人声验证样本构建第一损失函数,基于第一损失函数反向传播更新当前的人声编码器和人声解码器的权重参数;
在步骤S205中,判定第一损失函数是否最小,如果否,则调转到步骤S202,否则跳出迭代处理;
在步骤S206中,当迭代处理结束后,将当前的人声编码器和人声解码器作为人声编码模型和人声解码模型。
伴奏编码模型和伴奏解码模型经由伴奏编码器和伴奏解码器的训练得到,参见图2B,所述训练包括以下步骤:
在步骤S301中,构建伴奏声音训练样本;
利用步骤S302-S305步骤进行迭代处理步骤,直至第二损失函数最小化:
在步骤S302中,将一个伴奏声音训练样本输入到当前的伴奏编码器中,得到输出的第四隐变量,所述第四隐变量表征所述伴奏声音特征;
在步骤S303中,将第四隐变量输入到当前的伴奏解码器,得到对应的伴奏验证样本;
在步骤S304中,基于当前的伴奏训练样本和对应的伴奏验证样本构建第二损失函数,基于第二损失函数反向传播更新当前的伴奏编码器和伴奏解码器的权重参数;
在步骤S305中,判定第二损失函数是否最小,如果否,则调转到步骤S302,否则跳出迭代处理;
在步骤S306中,当迭代处理结束后,将当前的伴奏编码器和伴奏解码器作为所述伴奏编码模型和伴奏解码模型。
混合编码模型经由混合编码器的训练得到,参见图2C,所述训练包括以 下步骤:
在步骤S401中,基于人声训练样本和伴奏声音训练样本构建混合声音训练样本;
利用步骤S402-S403步骤进行迭代处理步骤,直至损失函数最小化:
在步骤S402中,将一个混合声音训练样本输入到当前的混合编码器中,得到输出的第五隐变量和第六隐变量,第五隐变量表征人声特征,第六隐变量表征伴奏声音特征;
在步骤S403中,采用当前的第五隐变量、第六隐变量和之前训练人声编码器和伴奏编码器得到的第三隐变量、第四隐变量以及人声验证样本和人声训练样本构成的第一损失函数和伴奏训练样本和伴奏验证样本构成的第二损失函数,构建第三损失函数,基于第三损失函数的反向传播更新当前的混合编码器的权重参数;
在步骤S404中,判定第三损失函数是否最小,如果否,则调转到步骤S402,否则跳出迭代处理;
在步骤S405中,当迭代处理结束后,将当前的混合编码器作为应用场景的混合声音编码模型。
在上述模型训练中,采用的人声训练样本为纯净的人声,伴奏训练样本为纯净的伴奏声音,混合声音训练样本为采用每一个人声训练样本和每一个伴奏训练样本混合得到。而且,基于人声和伴奏训练过程中的损失函数和隐变量构建混合声音中的损失函数,因此,当人声和伴奏得到的损失函数收敛时,隐变量的损失函数也趋于收敛,从而最终得到混合声音编码模型。
需要明白的是,在上述实施例中涉及的声音特征,包括混合声音特征、人声特征、伴奏声音特征,均取自原声音信号,表示原声音中本质的声音特征。声音特征例如是声音频谱图。声音特征的提取方式均为现有技术,这里就不再赘述。
图3是根据本申请一示例性实施例的一种混合声音信号的分离方法的流程图等。
在步骤S501中,通过傅里叶变换从混合声音信号中提取混合声音特征数据。
在步骤S502中,将混合声音特征数据输入到混合声音编码模型中,得到第一隐变量和第二隐变量。
在步骤S503中,将第一隐变量和第二隐变量分别输入到人声解码模型和伴奏解码模型,得到人声特征数据和伴奏声音特征数据。
在步骤S504中,基于傅里叶的逆变换基于人声特征数据和伴奏声音特征数据得到人声和伴奏。
在本申请实施例中,基于傅里叶变换从混合声音信号中得到混合声音的频谱特征,再将混合声音的频谱特征到模型中,分离出表示人声频谱的第一隐变量和表示伴奏频谱的第二隐变量,进而根据第一隐变量和第二隐变量重建人声和伴奏,从而实现了从混合声音信号中分离出人声和伴奏。
图4A所示是现有技术中包含编码器和解码器的自编码器的结构示意图。
自编码器是神经网络的一种,经过训练后尝试将输入复制到输出。自编码器内部有一个隐藏层,可以产生编码作为解码器的输入。参见图4A所示,输入信号301经由编码器产生隐变量302作为解码器的输入,隐变量302经由解码器产生重建信号303。要想得到一个编码器和解码器的可用模型,需要设定损失函数,然后基于损失函数最小化的目标,通过迭代训练不断更新编码器和解码器的权重参数,以得到最终的编码器模型和解码器模型。更具体地,该网络可以看作由两部分组成:一个由函数h=f(x)表示的编码器和一个生成重构的解码器r=g(h)。通过自编码器的训练实现g(f(x))=x的权重参数。
在搭建编码器和解码器时,可以采用循环神经网络(Recurrent Neural Network,RNN)、深度神经网络(Deep Neural Network,DNN)、卷积神经网络(Convolutional Neural Network,CNN)和反向传播(Back Propagation,BP)等多种类型的神经网络。一个典型的神经网络的结构如图4B所示。输入层经过多层特征映射层,得到输出层。
将上述编码器和解码器的结构应用到本申请实施例,能够得到如图5所述的包括人声编解码器、伴奏编解码器和混合声音编码器的结构示意图。
参见5,每个人声经过编码器1(即人声编码器),得到隐变量1,隐变量1输入到解码器1(即人声解码器),得到重建人声。根据重建人声和输入的人声之间的损失函数反向传播更新编码器1和解码器1的权重参数。将人声训练样本中的每个人声样本重复上述步骤,将最终得到的编码器1和解码器1作为人声编码器模型和人声解码器模型。
同理,每个伴奏经过编码器2(即伴奏编码器),得到隐变量2,隐变量2输入到解码器2(即伴奏解码器),得到重建伴奏声。根据重建伴奏和输入的伴奏之间的损失函数反向传播更新编码器2和解码器2的权重参数。将伴奏训练样本中的每个伴奏样本重复上述步骤,将最终得到的编码器2和解码器2作为伴奏编码器模型和伴奏解码器模型。
最终,基于人声训练样本和伴奏训练样本混合得到混合声音训练样本。即每个混合声音样本均有一个人声训练样本和人声伴奏样本混合而成。将每个混合声音样本输入到混合声音编码器中,得到重建后的混合声音,将重建后的混合声音和混合声音训练样本,以及对应的人声训练样本的损失函数和对应的伴奏训练样本的损失函数一起构建混合声音编码器的损失函数,并以损失函数最小化为目标,不断地更新混合声音编码器的权重参数。将最终得到的混合声音编码器作为混合声音编码模型。
为了帮助理解,下面采用数学公式描述混合声音编码器的损失函数。混合声音编码器的损失函数
Figure PCTCN2019121730-appb-000001
采用以下公式表示:
为了帮助理解,下面采用数学公式描述混合声音编码器的损失函数。混合声音编码器的损失函数
Figure PCTCN2019121730-appb-000002
采用以下公式表示:
Figure PCTCN2019121730-appb-000003
v表示人声训练样本,
Figure PCTCN2019121730-appb-000004
表示人声验证样本(重建后的人声),a表示伴奏音训练样本,a表示伴奏声音验证样本(重建后的伴奏),h v和h a表示混 合声音编码器输出的两个隐变量(上图中的隐变量3和4),
Figure PCTCN2019121730-appb-000005
表示人声编码器输出的隐变量(上图中的隐变量1),
Figure PCTCN2019121730-appb-000006
表示伴奏编码器输出的隐变量(上图中的隐变量2)。其中,
Figure PCTCN2019121730-appb-000007
表示人声自编码器的损失函数,
Figure PCTCN2019121730-appb-000008
表示伴奏自编码器的损失函数。
上述实施例实现了从混合声音信号中分离出人声和伴奏,由此得到的声音信号信噪比较低。模型训练步骤可以离线进行,从而节约终端计算资源,模型应用步骤可以放在线上进行,从而实时完成混合声音信号的分离工作。
图6所示是本申请实施例的混合声音信号的分离装置的结构示意图。参见图6,装置800包括特征提取模块801、隐变量生成模块802、人声特征生成模块803、伴奏特征生成模块805、人声生成模块804和伴奏生成模块806。
特征提取模块801用于从混合声音信号中提取混合声音特征数据。
隐变量生成模块802用于将混合声音特征数据输入到混合声音编码模型中,得到第一隐变量和第二隐变量,第一隐变量表征人声特征,第二隐变量表征伴奏声音特征。
人声特征生成模块803用于将采用隐变量生成模块802输出的第一隐变量输入到人声解码器,得到人声特征数据。
伴奏特征生成模块805用于将采用隐变量生成模块802输出的第二隐变量第一隐变量输入到伴奏声音解码器,得到伴奏声音特征数据。
人声生成模块804用于基于人声特征数据得到人声。
伴奏生成模块806用于基于伴奏声音特征数据得到伴奏。
在一些实施例中,上述装置还包括:人声样本收集模块和人声模型训练模块。
人声样本收集模块用于构建人声训练样本。人声训练样本中的每个样本均为纯净人声中提取的人声特征。
人声模型训练模块用于利用以下步骤进行迭代处理,直至损失函数最小化:将一个人声训练样本输入到当前的人声编码器中,得到输出的第三隐变 量,第三隐变量表征所述人声特征;将第三隐变量输入到当前的人声解码器,得到对应的人声验证样本;基于当前的人声训练样本和对应的人声验证样本构建第一损失函数,基于第一损失函数反向传播更新当前的人声编码器和人声解码器的权重参数,当迭代处理结束后,将当前的人声编码器和人声解码器作为所述人声编码模型和所述人声解码模型。
在一些实施例中,上述装置还包括:伴奏样本收集模块和伴奏模型训练模块。
伴奏样本收集模块用于构建伴奏声音训练样本。伴奏声音训练样本中的每个样本均为纯净伴奏声音中提取的伴奏声音特征。
伴奏模型训练模块用于利用以下步骤进行迭代处理,直至损失函数最小化:将一个伴奏声音训练样本输入到当前的人声编码器中,得到输出的第四隐变量,所述第四隐变量表征伴奏声音特征;将第四隐变量输入到当前的伴奏解码器,得到对应的伴奏验证样本;基于当前的伴奏训练样本和对应的伴奏验证样本构建第二损失函数,基于第二损失函数反向传播更新当前的伴奏编码器和伴奏解码器的权重参数,当迭代处理结束后,将当前的伴奏编码器和伴奏解码器作为伴奏编码模型和伴奏解码模型。
在一些实施例中,上述装置还包括:混合音样本收集模块和混合音模型训练模块。
混合音样本收集模块,用于基于人声训练样本和伴奏声音训练样本构建混合声音训练样本。混合声音训练样本的每个样本均为基于纯净人声和伴奏声音混合后从中提取的混合声音特征。
混合音模型训练模块,用于基于人声训练样本和伴奏声音训练样本构建混合声音训练样本;利用以下步骤进行迭代处理,直至损失函数最小化:将一个混合声音训练样本输入到当前的混合编码器中,得到输出的第五隐变量和第六隐变量;将当前的第五隐变量、第六隐变量、第三隐变量、第四隐变量以及第一损失函数和第二损失函数构建第三损失函数,基于第三损失函数的反向传播更新当前的混合编码器的权重参数,当迭代处理结束后,将当前 的混合编码器作为混合声音编码模型。
在一些实施例中,无论是混合声音特征数据、人声特征数据还是伴奏声音特征数据均为傅里叶变换从原声音信号中提取出的表征声音信号的深度特征的数据。
应该理解,上述装置和方法是对应的,因此,对装置以相应简略的方式进行描述。
图7是根据一示例性实施例示出的一种执行上述方法的电子设备的框图。例如,电子设备1200可以是移动电话,计算机,数字广播终端,消息收发设备,游戏控制台,平板设备,医疗设备,健身设备,个人数字助理等。
参照图7,电子设备1200可以包括以下一个或多个组件:处理组件1202,存储器1204,电源组件1206,多媒体组件1208,音频组件1210,输入/输出(I/O)的接口1212,传感器组件1214,以及通信组件1216。
处理组件1202通常控制电子设备1200的整体操作,诸如与显示,电话呼叫,数据通信,相机操作和记录操作相关联的操作。处理组件1202可以包括一个或多个处理器1220来执行指令,以完成上述的方法的全部或部分步骤。此外,处理组件1202可以包括一个或多个模块,便于处理组件1202和其他组件之间的交互。例如,处理组件1202可以包括多媒体模块,以方便多媒体组件1208和处理组件1202之间的交互。
存储器1204被配置为存储各种类型的数据以支持在设备1200的操作。这些数据的示例包括用于在电子设备1200上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。存储器1204可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。
电源组件1206为电子设备1200的各种组件提供电力。电源组件1206可以包括电源管理系统,一个或多个电源,及其他与为电子设备1200生成、管 理和分配电力相关联的组件。
多媒体组件1208包括在所述电子设备1200和用户之间的提供一个输出接口的屏幕。在一些实施例中,屏幕可以包括液晶显示器(LCD)和触摸面板(TP)。如果屏幕包括触摸面板,屏幕可以被实现为触摸屏,以接收来自用户的输入信号。触摸面板包括一个或多个触摸传感器以感测触摸、滑动和触摸面板上的手势。所述触摸传感器可以不仅感测触摸或滑动动作的边界,而且还检测与所述触摸或滑动操作相关的持续时间和压力。在一些实施例中,多媒体组件1208包括一个前置摄像头和/或后置摄像头。当设备1200处于操作模式,如拍摄模式或视频模式时,前置摄像头和/或后置摄像头可以接收外部的多媒体数据。每个前置摄像头和后置摄像头可以是一个固定的光学透镜系统或具有焦距和光学变焦能力。
音频组件1210被配置为输出和/或输入音频信号。例如,音频组件1210包括一个麦克风(MIC),当电子设备1200处于操作模式,如呼叫模式、记录模式和语音识别模式时,麦克风被配置为接收外部音频信号。所接收的音频信号可以被进一步存储在存储器1204或经由通信组件1216发送。在一些实施例中,音频组件1210还包括一个扬声器,用于输出音频信号。
I/O接口1212为处理组件1202和外围接口模块之间提供接口,上述外围接口模块可以是键盘,点击轮,按钮等。这些按钮可包括但不限于:主页按钮、音量按钮、启用按钮和锁定按钮。
传感器组件1214包括一个或多个传感器,用于为电子设备1200提供各个方面的状态评估。例如,传感器组件1214可以检测到设备1200的打开/关闭状态,组件的相对定位,例如所述组件为电子设备1200的显示器和小键盘,传感器组件1214还可以检测电子设备1200或电子设备1200一个组件的位置改变,用户与电子设备1200接触的存在或不存在,电子设备1200方位或加速/减速和电子设备1200的温度变化。传感器组件1214可以包括接近传感器,被配置用来在没有任何的物理接触时检测附近物体的存在。传感器组件1214还可以包括光传感器,如CMOS或CCD图像传感器,用于在成像应用中使用。 在一些实施例中,该传感器组件1214还可以包括加速度传感器,陀螺仪传感器,磁传感器,压力传感器或温度传感器。
通信组件1216被配置为便于电子设备1200和其他设备之间有线或无线方式的通信。电子设备1200可以接入基于通信标准的无线网络,如WiFi,运营商网络(如2G、3G、4G或5G),或它们的组合。在一个示例性实施例中,通信组件1216经由广播信道接收来自外部广播管理系统的广播信号或广播相关信息。在一个示例性实施例中,所述通信组件1216还包括近场通信(NFC)模块,以促进短程通信。例如,在NFC模块可基于射频识别(RFID)技术,红外数据协会(IrDA)技术,超宽带(UWB)技术,蓝牙(BT)技术和其他技术来实现。
在示例性实施例中,电子设备1200可以被一个或多个应用专用集成电路(ASIC)、数字信号处理器(DSP)、数字信号处理设备(DSPD)、可编程逻辑器件(PLD)、现场可编程门阵列(FPGA)、控制器、微控制器、微处理器或其他电子元件实现,用于执行上述方法。
在示例性实施例中,还提供了一种包括指令的非临时性计算机可读存储介质,例如包括指令的存储器1204,上述指令可由电子设备1200的处理器1220执行以完成上述方法。例如,所述非临时性计算机可读存储介质可以是ROM、随机存取存储器(RAM)、CD-ROM、磁带、软盘和光数据存储设备等。
图8是根据一示例性实施例示出的一种执行上述方法的电子设备的框图。例如,电子设备1300可以被提供为一服务器。参照图8,电子设备1300包括处理组件1322,其进一步包括一个或多个处理器,以及由存储器1332所代表的存储器资源,用于存储可由处理组件1322的执行的指令,例如应用程序。存储器1332中存储的应用程序可以包括一个或一个以上的每一个对应于一组指令的模块。此外,处理组件1322被配置为执行指令,以执行上述信息列表显示方法。
电子设备1300还可以包括一个电源组件1326被配置为执行电子设备 1300的电源管理,一个有线或无线网络接口1350被配置为将电子设备1300连接到网络,和一个输入输出(I/O)接口1358。电子设备1300可以操作基于存储在存储器1332的操作系统,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM或类似。
图9是根据一示例性实施例示出的另一种执行上述方法的电子设备的框图,参照图9,电子设备1400包括:
处理器1410;
用于存储处理器可执行指令的存储器1420;
其中,所述处理器1410被配置为执行:
从混合声音信号中提取混合声音特征数据;
将混合声音特征数据输入到混合声音编码模型中,得到第一隐变量和第二隐变量,所述第一隐变量表征人声特征,所述第二隐变量表征伴奏声音特征;
将所述第一隐变量和所述第二隐变量分别输入到人声解码模型和伴奏解码模型,得到人声特征数据和伴奏声音特征数据;以及
基于所述人声特征数据和所述伴奏声音特征数据得到人声和伴奏。
可选的,所述处理器1410还被配置为执行:
构建人声训练样本;
利用以下步骤进行迭代处理,直至损失函数最小化:
将人声训练样本输入到当前的人声编码器中,得到输出的第三隐变量,所述第三隐变量表征人声特征;
将所述第三隐变量输入到当前的人声解码器,得到对应的人声验证样本;
基于当前的人声训练样本和对应的人声验证样本构建第一损失函数,基于所述第一损失函数反向传播更新当前的人声编码器和人声解码器的权重参数;
当所述迭代处理结束后,将当前的人声编码器和人声解码器作为所述人声编码模型和所述人声解码模型。
可选的,所述处理器1410还被配置为执行:
构建伴奏声音训练样本;
利用以下步骤进行迭代处理,直至损失函数最小化:
将伴奏声音训练样本输入到当前的人声编码器中,得到输出的第四隐变量,所述第四隐变量表征伴奏声音特征;
将所述第四隐变量输入到当前的伴奏解码器,得到对应的伴奏验证样本;
基于当前的伴奏训练样本和对应的伴奏验证样本构建第二损失函数,基于第二损失函数反向传播更新当前的伴奏编码器和伴奏解码器的权重参数;
当所述迭代处理结束后,将当前的伴奏编码器和伴奏解码器作为所述伴奏编码模型和所述伴奏解码模型。
可选的,所述处理器1410还被配置为执行:
基于所述人声训练样本和所述伴奏声音训练样本构建混合声音训练样本;
利用以下步骤进行迭代处理,直至损失函数最小化:
将混合声音训练样本输入到当前的混合编码器中,得到输出的第五隐变量和第六隐变量,所述第五隐变量表征人声特征,所述第六隐变量表征伴奏声音特征;
基于当前的第五隐变量和第六隐变量、对应的第三隐变量和第四隐变量以及所述第一损失函数和所述第二损失函数构建第三损失函数,基于所述第三损失函数的反向传播更新当前的混合编码器的权重参数;
当所述迭代处理结束后,将当前的混合编码器作为所述混合声音编码模型。
可选的,所述人声编码器、所述人声解码器、所述伴奏声音编码器、所述伴奏声音解码器、所述混合声音解码器均为CNN、DNN和RNN神经网络中的一种。
可选的,基于傅里叶变换从所述混合声音信号中提取频域特征作为所述混合声音特征数据。
可选的,基于傅里叶的逆变换分别得到所述人声和所述伴奏。
在示例性实施例中,还提供了计算机程序产品,包括计算机程序产品,所述计算机程序包括程序指令,当所述程序指令被电子设备执行时,使所述电子设备执行上述方法。
本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本申请的其它实施方案。本申请旨在涵盖本申请的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本申请的一般性原理并包括本申请未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本申请的真正范围和精神由下面的权利要求指出。
应当理解的是,本申请并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本申请的范围仅由所附的权利要求来限制。

Claims (16)

  1. 一种混合声音信号的分离方法,包括:
    从混合声音信号中提取混合声音特征数据;
    将混合声音特征数据输入到混合声音编码模型中,得到第一隐变量和第二隐变量,所述第一隐变量表征人声特征,所述第二隐变量表征伴奏声音特征;
    将所述第一隐变量和所述第二隐变量分别输入到人声解码模型和伴奏解码模型,得到人声特征数据和伴奏声音特征数据;以及
    基于所述人声特征数据和所述伴奏声音特征数据得到人声和伴奏。
  2. 根据权利要求1所述的分离方法,还包括:
    构建人声训练样本;
    利用以下步骤进行迭代处理,直至损失函数最小化:
    将人声训练样本输入到当前的人声编码器中,得到输出的第三隐变量,所述第三隐变量表征人声特征;
    将所述第三隐变量输入到当前的人声解码器,得到对应的人声验证样本;
    基于当前的人声训练样本和对应的人声验证样本构建第一损失函数,基于所述第一损失函数反向传播更新当前的人声编码器和人声解码器的权重参数;
    当所述迭代处理结束后,将当前的人声编码器和人声解码器作为所述人声编码模型和所述人声解码模型。
  3. 根据权利要求2所述的分离方法,还包括:
    构建伴奏声音训练样本;
    利用以下步骤进行迭代处理,直至损失函数最小化:
    将伴奏声音训练样本输入到当前的伴奏编码器中,得到输出的第四隐变量,所述第四隐变量表征伴奏声音特征;
    将所述第四隐变量输入到当前的伴奏解码器,得到对应的伴奏验证样本;
    基于当前的伴奏训练样本和对应的伴奏验证样本构建第二损失函数,基于第二损失函数反向传播更新当前的伴奏编码器和伴奏解码器的权重参数;
    当所述迭代处理结束后,将当前的伴奏编码器和伴奏解码器作为所述伴奏编码模型和所述伴奏解码模型。
  4. 根据权利要求3所述的分离方法,还包括:
    基于所述人声训练样本和所述伴奏声音训练样本构建混合声音训练样本;
    利用以下步骤进行迭代处理,直至损失函数最小化:
    将混合声音训练样本输入到当前的混合编码器中,得到输出的第五隐变量和第六隐变量,所述第五隐变量表征人声特征,所述第六隐变量表征伴奏声音特征;
    基于当前的第五隐变量和第六隐变量、对应的第三隐变量和第四隐变量以及第一损失函数和第二损失函数构建第三损失函数,基于所述第三损失函数的反向传播更新当前的混合编码器的权重参数;
    当所述迭代处理结束后,将当前的混合编码器作为所述混合声音编码模型。
  5. 根据权利要求4所述的分离方法,所述人声编码器、所述人声解码器、所述伴奏声音编码器、所述伴奏声音解码器和所述混合声音编码器为CNN、DNN和RNN神经网络中的一种。
  6. 根据权利要求1所述的分离方法,基于傅里叶变换从所述混合声音信号中提取频域特征作为所述混合声音特征数据。
  7. 根据权利要求6所述的分离方法,基于傅里叶的逆变换得到所述人声和所述伴奏。
  8. 一种混合声音信号的分离装置,包括:
    特征提取模块,用于从混合声音信号中提取混合声音特征数据;
    隐变量生成模块,用于将所述混合声音特征数据输入到混合声音编码模型中,得到第一隐变量和第二隐变量,所述第一隐变量表征人声特征,所述第二隐变量表征伴奏声音特征;
    人声特征生成模块,用于将所述第一隐变量输入到人声解码模型,得到人声特征数据;
    伴奏特征生成模块,用于将所述第二隐变量输入到伴奏解码模型,得到伴奏声音特征数据;
    人声生成模块,用于基于所述人声特征数据得到人声;
    伴奏生成模块,用于基于所述伴奏声音特征数据得到伴奏。
  9. 一种电子设备,包括:
    处理器;
    用于存储处理器可执行指令的存储器;
    其中,所述处理器被配置为执行:
    从混合声音信号中提取混合声音特征数据;
    将混合声音特征数据输入到混合声音编码模型中,得到第一隐变量和第二隐变量,所述第一隐变量表征人声特征,所述第二隐变量表征伴奏声音特征;
    将所述第一隐变量和所述第二隐变量分别输入到人声解码模型和伴奏解码模型,得到人声特征数据和伴奏声音特征数据;以及
    基于所述人声特征数据和所述伴奏声音特征数据得到人声和伴奏。
  10. 根据权利要求9所述的电子设备,所述处理器还被配置为执行:
    构建人声训练样本;
    利用以下步骤进行迭代处理,直至损失函数最小化:
    将人声训练样本输入到当前的人声编码器中,得到输出的第三隐变量,所述第三隐变量表征人声特征;
    将所述第三隐变量输入到当前的人声解码器,得到对应的人声验证样本;
    基于当前的人声训练样本和对应的人声验证样本构建第一损失函数,基于所述第一损失函数反向传播更新当前的人声编码器和人声解码器的权重参数;
    当所述迭代处理结束后,将当前的人声编码器和人声解码器作为所述人 声编码模型和所述人声解码模型。
  11. 根据权利要求10所述的电子设备,所述处理器还被配置为执行:
    构建伴奏声音训练样本;
    利用以下步骤进行迭代处理,直至损失函数最小化:
    将伴奏声音训练样本输入到当前的人声编码器中,得到输出的第四隐变量,所述第四隐变量表征伴奏声音特征;
    将所述第四隐变量输入到当前的伴奏解码器,得到对应的伴奏验证样本;
    基于当前的伴奏训练样本和对应的伴奏验证样本构建第二损失函数,基于第二损失函数反向传播更新当前的伴奏编码器和伴奏解码器的权重参数;
    当所述迭代处理结束后,将当前的伴奏编码器和伴奏解码器作为所述伴奏编码模型和所述伴奏解码模型。
  12. 根据权利要求11所述的电子设备,所述处理器还被配置为执行:
    基于所述人声训练样本和所述伴奏声音训练样本构建混合声音训练样本;
    利用以下步骤进行迭代处理,直至损失函数最小化:
    将混合声音训练样本输入到当前的混合编码器中,得到输出的第五隐变量和第六隐变量,所述第五隐变量表征人声特征,所述第六隐变量表征伴奏声音特征;
    基于当前的第五隐变量和第六隐变量、对应的第三隐变量和第四隐变量以及所述第一损失函数和所述第二损失函数构建第三损失函数,基于所述第三损失函数的反向传播更新当前的混合编码器的权重参数;
    当所述迭代处理结束后,将当前的混合编码器作为所述混合声音编码模型。
  13. 根据权利要求12所述的电子设备,所述人声编码器、所述人声解码器、所述伴奏声音编码器、所述伴奏声音解码器、所述混合声音解码器均为CNN、DNN和RNN神经网络中的一种。
  14. 根据权利要求9所述的电子设备,基于傅里叶变换从所述混合声音信号中提取频域特征作为所述混合声音特征数据。
  15. 根据权利要求14所述的电子设备,基于傅里叶的逆变换分别得到所述人声和所述伴奏。
  16. 一种非临时性计算机可读存储介质,所述计算机可读存储介质存储有计算机指令,所述计算机指令被执行时实现如权利要求1至7任一项所述的分离方法。
PCT/CN2019/121730 2018-12-20 2019-11-28 混合声音信号的分离方法、装置、电子设备和可读介质 WO2020125372A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/352,856 US11430427B2 (en) 2018-12-20 2021-06-21 Method and electronic device for separating mixed sound signal

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811564607.7 2018-12-20
CN201811564607.7A CN109801644B (zh) 2018-12-20 2018-12-20 混合声音信号的分离方法、装置、电子设备和可读介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/352,856 Continuation US11430427B2 (en) 2018-12-20 2021-06-21 Method and electronic device for separating mixed sound signal

Publications (1)

Publication Number Publication Date
WO2020125372A1 true WO2020125372A1 (zh) 2020-06-25

Family

ID=66557280

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/121730 WO2020125372A1 (zh) 2018-12-20 2019-11-28 混合声音信号的分离方法、装置、电子设备和可读介质

Country Status (3)

Country Link
US (1) US11430427B2 (zh)
CN (1) CN109801644B (zh)
WO (1) WO2020125372A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113012667A (zh) * 2021-03-17 2021-06-22 平安科技(深圳)有限公司 基于佛乐的音轨分离方法、装置、设备及存储介质
CN113314140A (zh) * 2021-05-31 2021-08-27 哈尔滨理工大学 一种端到端时域多尺度卷积神经网络的音源分离算法

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108847238B (zh) * 2018-08-06 2022-09-16 东北大学 一种服务机器人语音识别方法
CN109801644B (zh) * 2018-12-20 2021-03-09 北京达佳互联信息技术有限公司 混合声音信号的分离方法、装置、电子设备和可读介质
CN110164470A (zh) * 2019-06-12 2019-08-23 成都嗨翻屋科技有限公司 人声分离方法、装置、用户终端及存储介质
CN110335622B (zh) * 2019-06-13 2024-03-01 平安科技(深圳)有限公司 音频单音色分离方法、装置、计算机设备及存储介质
CN110265052B (zh) * 2019-06-24 2022-06-10 秒针信息技术有限公司 收音设备的信噪比确定方法、装置、存储介质及电子装置
CN110322894B (zh) * 2019-06-27 2022-02-11 电子科技大学 一种基于声音的波形图生成及大熊猫检测方法
CN110503976B (zh) * 2019-08-15 2021-11-23 广州方硅信息技术有限公司 音频分离方法、装置、电子设备及存储介质
CN110491412B (zh) * 2019-08-23 2022-02-25 北京市商汤科技开发有限公司 声音分离方法和装置、电子设备
CN110853618B (zh) * 2019-11-19 2022-08-19 腾讯科技(深圳)有限公司 一种语种识别的方法、模型训练的方法、装置及设备
CN110992966B (zh) * 2019-12-25 2022-07-01 开放智能机器(上海)有限公司 一种人声分离方法及系统
CN111161695B (zh) * 2019-12-26 2022-11-04 北京百度网讯科技有限公司 歌曲生成方法和装置
CN111243620B (zh) * 2020-01-07 2022-07-19 腾讯科技(深圳)有限公司 语音分离模型训练方法、装置、存储介质和计算机设备
CN111370032B (zh) * 2020-02-20 2023-02-14 厦门快商通科技股份有限公司 语音分离方法、系统、移动终端及存储介质
CN113055809B (zh) * 2021-03-12 2023-02-28 腾讯音乐娱乐科技(深圳)有限公司 一种5.1声道信号生成方法、设备及介质
US11947628B2 (en) * 2021-03-30 2024-04-02 Snap Inc. Neural networks for accompaniment extraction from songs
CN113393857A (zh) * 2021-06-10 2021-09-14 腾讯音乐娱乐科技(深圳)有限公司 一种音乐信号的人声消除方法、设备及介质
CN114255737B (zh) * 2022-02-28 2022-05-17 北京世纪好未来教育科技有限公司 语音生成方法、装置、电子设备
CN116034425A (zh) * 2022-11-16 2023-04-28 广州酷狗计算机科技有限公司 人声音符识别模型的训练方法、人声音符识别方法及设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160284346A1 (en) * 2015-03-27 2016-09-29 Qualcomm Incorporated Deep neural net based filter prediction for audio event classification and extraction
CN107680611A (zh) * 2017-09-13 2018-02-09 电子科技大学 基于卷积神经网络的单通道声音分离方法
CN108847238A (zh) * 2018-08-06 2018-11-20 东北大学 一种新型服务机器人语音识别方法
CN109801644A (zh) * 2018-12-20 2019-05-24 北京达佳互联信息技术有限公司 混合声音信号的分离方法、装置、电子设备和可读介质

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0351899A (ja) * 1989-07-19 1991-03-06 Matsushita Electric Ind Co Ltd カラオケ装置
WO2006120829A1 (ja) 2005-05-13 2006-11-16 Matsushita Electric Industrial Co., Ltd. 混合音分離装置
KR101121505B1 (ko) * 2010-05-31 2012-03-06 동의대학교 산학협력단 스테레오 음원으로부터의 비보컬 신호 추출 방법
EP2960899A1 (en) * 2014-06-25 2015-12-30 Thomson Licensing Method of singing voice separation from an audio mixture and corresponding apparatus
CN106971741B (zh) * 2016-01-14 2020-12-01 芋头科技(杭州)有限公司 实时将语音进行分离的语音降噪的方法及系统
CN106024005B (zh) 2016-07-01 2018-09-25 腾讯科技(深圳)有限公司 一种音频数据的处理方法及装置
CN106653048B (zh) * 2016-12-28 2019-10-15 云知声(上海)智能科技有限公司 基于人声模型的单通道声音分离方法
CN108962277A (zh) * 2018-07-20 2018-12-07 广州酷狗计算机科技有限公司 语音信号分离方法、装置、计算机设备以及存储介质
US10991385B2 (en) * 2018-08-06 2021-04-27 Spotify Ab Singing voice separation with deep U-Net convolutional networks

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160284346A1 (en) * 2015-03-27 2016-09-29 Qualcomm Incorporated Deep neural net based filter prediction for audio event classification and extraction
CN107680611A (zh) * 2017-09-13 2018-02-09 电子科技大学 基于卷积神经网络的单通道声音分离方法
CN108847238A (zh) * 2018-08-06 2018-11-20 东北大学 一种新型服务机器人语音识别方法
CN109801644A (zh) * 2018-12-20 2019-05-24 北京达佳互联信息技术有限公司 混合声音信号的分离方法、装置、电子设备和可读介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113012667A (zh) * 2021-03-17 2021-06-22 平安科技(深圳)有限公司 基于佛乐的音轨分离方法、装置、设备及存储介质
CN113314140A (zh) * 2021-05-31 2021-08-27 哈尔滨理工大学 一种端到端时域多尺度卷积神经网络的音源分离算法

Also Published As

Publication number Publication date
CN109801644A (zh) 2019-05-24
US11430427B2 (en) 2022-08-30
CN109801644B (zh) 2021-03-09
US20210312902A1 (en) 2021-10-07

Similar Documents

Publication Publication Date Title
WO2020125372A1 (zh) 混合声音信号的分离方法、装置、电子设备和可读介质
US11620984B2 (en) Human-computer interaction method, and electronic device and storage medium thereof
WO2020134556A1 (zh) 图像风格迁移方法、装置、电子设备及存储介质
WO2016188494A1 (zh) 基于语音输入的表情曲线生成方法及其装置
CN108346433A (zh) 一种音频处理方法、装置、设备及可读存储介质
TW202044113A (zh) 影像處理方法、影像處理裝置、電子設備,及電腦可讀式儲存介質
CN111583944A (zh) 变声方法及装置
CN110753238B (zh) 视频处理方法、装置、终端及存储介质
CN111128183B (zh) 语音识别方法、装置和介质
CN113362812B (zh) 一种语音识别方法、装置和电子设备
CN107146631B (zh) 音乐识别方法、音符识别模型建立方法、装置及电子设备
KR20210001859A (ko) 3차원 가상 인물 입모양 변화 제어 방법 및 장치
CN115273831A (zh) 语音转换模型训练方法、语音转换方法和装置
CN110931028B (zh) 一种语音处理方法、装置和电子设备
WO2021051588A1 (zh) 一种数据处理方法、装置和用于数据处理的装置
CN107437412B (zh) 一种声学模型处理方法、语音合成方法、装置及相关设备
CN110970015B (zh) 一种语音处理方法、装置和电子设备
CN109784537A (zh) 广告点击率的预估方法、装置及服务器和存储介质
CN112512649A (zh) 用于提供音频和视频效果的技术
CN111984765B (zh) 知识库问答过程关系检测方法及装置
CN112988956A (zh) 自动生成对话的方法及装置、信息推荐效果检测方法及装置
CN114356068B (zh) 一种数据处理方法、装置和电子设备
CN112002313B (zh) 交互方法及装置、音箱、电子设备和存储介质
CN111696566A (zh) 语音处理方法、装置和介质
CN111950266A (zh) 一种数据处理方法、装置和用于数据处理的装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19899139

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19899139

Country of ref document: EP

Kind code of ref document: A1