WO2020015270A1 - Speech signal separation method and apparatus, computer device, and storage medium - Google Patents

Speech signal separation method and apparatus, computer device, and storage medium

Info

Publication number
WO2020015270A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
spectrum
audio signal
frequency
frame
Prior art date
Application number
PCT/CN2018/118293
Other languages
English (en)
French (fr)
Inventor
张超钢
Original Assignee
广州酷狗计算机科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州酷狗计算机科技有限公司
Publication of WO2020015270A1 publication Critical patent/WO2020015270A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L21/0232Processing in the frequency domain
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Definitions

  • the present invention relates to the field of speech signal processing, and in particular, to a method, a device, a computer device, and a storage medium for separating speech signals.
  • speech signal separation typically involves converting the audio signal from the time domain to the frequency domain using a Fourier transform, which yields a complex spectrum. The accompaniment spectrum and the human voice spectrum can then be obtained by decomposing this complex spectrum, and the inverse Fourier transform recovers the accompaniment audio and the human voice audio.
  • the inventors found that the prior art has at least the following problem: because only the amplitude spectrum is used when the complex spectrum is decomposed, the separated accompaniment audio suffers from phase distortion.
  • Embodiments of the present invention provide a method, an apparatus, a computer device, and a storage medium for separating a speech signal, which can solve the phase-distortion problem in speech signal separation.
  • the technical scheme is as follows:
  • a method for separating speech signals includes: sampling the sound wave waveform of an audio file to be separated to obtain an audio signal; converting the audio signal from the time domain to the frequency domain to obtain a frequency spectrum that represents only the amplitude of the audio signal, the amplitude being a real number; decomposing the frequency spectrum to obtain an accompaniment spectrum and a human voice spectrum; and converting the accompaniment spectrum and the human voice spectrum from the frequency domain to the time domain to obtain accompaniment audio and human voice audio.
  • the converting the audio signal from the time domain to the frequency domain to obtain a frequency spectrum of the audio signal includes:
  • the spectrums of the multiple audio frames are combined to obtain the spectrum of the audio signal.
  • performing frame processing on the audio signal to obtain multiple audio frames includes:
  • window processing is performed on the audio signal to obtain multiple audio frames.
  • the length of the preset window function is the same as the number of sampling points of each audio frame.
  • the number of sampling points of each audio frame is twice the number of frame overlapping sampling points.
  • the frequency spectrum of the audio signal is decomposed to obtain an accompaniment spectrum and a human voice spectrum, including:
  • the frequency spectrum of the audio signal is input into the preset decomposition model, and the accompaniment spectrum and the human voice spectrum are output.
  • a voice signal separation device includes:
  • a sampling module for sampling the sound wave waveform of the audio file to be separated to obtain an audio signal
  • a first conversion module configured to convert the audio signal from the time domain to the frequency domain to obtain a frequency spectrum of the audio signal, where the frequency spectrum is only used to represent an amplitude of the audio signal and the amplitude is a real number;
  • a decomposition module configured to decompose the frequency spectrum of the audio signal to obtain an accompaniment spectrum and a human voice spectrum
  • the second conversion module is configured to convert the accompaniment spectrum and the human voice spectrum from the frequency domain to the time domain to obtain accompaniment audio and human voice audio.
  • the first conversion module includes:
  • a frame framing unit for framing the audio signal to obtain multiple audio frames
  • a time-frequency conversion unit configured to convert the multiple audio frames from the time domain to the frequency domain, respectively, to obtain the spectrum of the multiple audio frames.
  • the spectrum of each audio frame is only used to represent the amplitude of the audio frame and the amplitude is a real number ;
  • a combining unit is configured to combine the spectrums of the multiple audio frames to obtain the spectrum of the audio signal.
  • the framing unit is used for:
  • window processing is performed on the audio signal to obtain multiple audio frames.
  • the length of the preset window function is the same as the number of sampling points of each audio frame.
  • the number of sampling points of each audio frame is twice the number of frame overlapping sampling points.
  • the decomposition module is used to call a preset decomposition model, which performs spectrum separation based on a signal spectrum; the frequency spectrum of the audio signal is input into the preset decomposition model, and the accompaniment spectrum and the human voice spectrum are output.
  • in one aspect, a computer device includes a processor and a memory.
  • the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement operations performed by the following voice signal separation method:
  • the accompaniment spectrum and human voice spectrum are converted from the frequency domain to the time domain to obtain accompaniment audio and human voice audio.
  • the processor is further configured to perform the following operations:
  • window processing is performed on the audio signal to obtain multiple audio frames.
  • the length of the preset window function is the same as the number of sampling points of each audio frame.
  • the number of sampling points of each audio frame is twice the number of frame overlapping sampling points.
  • the processor is further configured to perform the following operations:
  • a preset decomposition model is invoked, the preset decomposition model is used to perform spectrum separation based on a signal spectrum; the spectrum of the audio signal is input to the preset decomposition model, and an accompaniment spectrum and a human voice spectrum are output.
  • a computer-readable storage medium stores at least one instruction, and the instruction is loaded and executed by a processor to implement the operations performed by the voice signal separation method.
  • the method provided by the embodiments of the present invention performs the time-to-frequency and frequency-to-time conversions with a transform algorithm that represents the amplitude of each audio frame using only real numbers. Because the phase is not altered by either the forward or the inverse transform, no phase information is lost; separating the accompaniment and the human voice from the audio file based on this conversion therefore avoids the phase-distortion problem of Fourier-transform spectral decomposition.
  • FIG. 1 is an implementation scenario diagram of a speech signal separation method according to an embodiment of the present invention
  • FIG. 2 is a flowchart of a method for separating a voice signal according to an embodiment of the present invention
  • FIG. 3 is a schematic structural diagram of a voice signal separation device according to an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
  • FIG. 1 is an implementation scenario diagram of a speech signal separation method provided by an embodiment of the present invention.
  • this implementation scenario may include at least one terminal 101 and at least one server 102. The at least one terminal 101 may serve as a sound-signal collection terminal or an audio-file playback terminal, and the at least one server 102 provides audio services for the at least one terminal 101, for example supplying audio files to be played and offering a signal separation function corresponding to the method provided by the embodiment of the present invention, so as to perform speech signal separation on audio files provided or selected by a terminal.
  • the at least one server 102 may further provide a video file to be played; the video file includes picture data and an audio file, and the server 102 may extract the audio file from the video file to apply the signal separation function corresponding to the method provided by the embodiment of the present invention.
  • FIG. 2 is a flowchart of a method for separating a voice signal according to an embodiment of the present invention. Taking the execution subject of this embodiment as a computer device as an example, referring to FIG. 2, this embodiment specifically includes:
  • the computer device samples the sound wave waveform of the audio file to be separated to obtain an audio signal.
  • the audio file to be separated may be an audio file uploaded by a terminal, an audio file stored on a computer device, or an audio file included in a video file stored on the computer device.
  • the computer device may be a server or any terminal, which is not limited in the embodiments of the present invention.
  • the computer device can acquire the sound wave waveform of the audio file and sample the sound wave waveform at a preset sampling rate to obtain an audio signal.
  • the preset sampling rate may correspond to the format of the audio file, and different audio file formats may correspond to different preset sampling rates; sampling the sound wave waveform of the audio file at the rate corresponding to its format ensures that the resulting audio signals are consistent.
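To make the sampling step concrete, here is a minimal NumPy sketch. The format-to-rate table and the function name are illustrative assumptions; the patent only states that the preset sampling rate corresponds to the audio file's format, without fixing concrete values.

```python
import numpy as np

# Hypothetical format-to-rate table: these concrete values are
# illustrative assumptions, not taken from the patent.
PRESET_SAMPLE_RATES = {"wav": 44100, "flac": 48000}

def sample_waveform(waveform_fn, duration_s, file_format):
    """Sample a continuous sound-wave waveform at the preset rate
    that corresponds to the audio file's format."""
    rate = PRESET_SAMPLE_RATES[file_format]
    t = np.arange(int(duration_s * rate)) / rate  # sample instants in seconds
    return waveform_fn(t), rate

# Example: a 440 Hz tone sampled for 10 ms at the "wav" preset rate
signal, rate = sample_waveform(lambda t: np.sin(2 * np.pi * 440 * t),
                               0.010, "wav")
```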
  • the computer device performs windowing processing on the audio signal based on a preset window function to obtain multiple audio frames.
  • the sampled audio signal can be framed according to a preset frame length to obtain multiple original audio frames.
  • the preset frame length should be short enough, and can generally be taken as 20 to 50 milliseconds. In a short enough time, the original audio frame can be regarded as an approximately stable periodic signal to facilitate the implementation of subsequent steps.
  • the number of sampling points of each audio frame should be selected within a reasonable range to improve the spectral resolution of the audio frame.
  • the range of sampling points of each original audio frame can be selected between 512 and 8192 points.
  • the number of sampling points of each audio frame may be selected as 2048 points, and accordingly, the number of frame overlapping sampling points may be selected as 1024 points.
  • a preset frame length and the number of sampling points included in each audio frame may be considered, so that both of them meet the above conditions, thereby achieving the best framing effect.
  • a windowing method may be adopted, that is, windowing is applied to the multiple original audio frames to obtain multiple audio frames, so that the frames better satisfy the periodicity requirement of the time-frequency conversion in the subsequent steps, reducing spectral leakage of the audio frames and improving spectral resolution.
  • the preset window function may select a Hanning window or a Hamming window.
  • the length of the preset window function may be the same as the number of sampling points of each audio frame, and the number of sampling points of each audio frame is twice the number of overlapping sampling points of the frame.
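The framing and windowing just described can be sketched as follows, a minimal NumPy illustration using the example values from this section: 2048-point frames, 1024-point overlap, and a Hann window whose length equals the frame length. Function and constant names are illustrative.

```python
import numpy as np

FRAME_LEN = 2048        # sampling points per frame (within the 512-8192 range)
HOP = FRAME_LEN // 2    # step between frames: overlap is half a frame (1024)

def frame_and_window(signal):
    """Split the signal into half-overlapping frames and multiply each
    frame by a Hann window of the same length, which reduces spectral
    leakage in the later time-frequency conversion."""
    window = np.hanning(FRAME_LEN)
    n_frames = 1 + (len(signal) - FRAME_LEN) // HOP
    frames = np.stack([signal[i * HOP : i * HOP + FRAME_LEN]
                       for i in range(n_frames)])
    return frames * window

frames = frame_and_window(np.ones(8192))   # yields 7 frames of 2048 points
```

With 50% overlap, every sample (away from the signal edges) appears in two frames, which is what later allows overlap-add reconstruction without amplitude gaps.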
  • the computer device converts the multiple audio frames from the time domain to the frequency domain, respectively, to obtain the frequency spectrum of the multiple audio frames.
  • the frequency spectrum of each audio frame is only used to represent the amplitude of the audio frame and the amplitude is a real number.
  • when performing the time-frequency conversion, the multiple audio frames may each be converted from the time domain to the frequency domain through a Hartley transform to obtain their spectra.
  • because the Hartley transform is a real transform, the resulting spectra of the multiple audio frames are real spectra; a real spectrum represents only the amplitude of the audio spectrum and does not involve the phase.
  • the Hartley transform can be implemented by applying the following formula:

    H_k = Σ_{n=0}^{N-1} x_n [cos(2πnk/N) + sin(2πnk/N)],  k = 0, 1, ..., N-1

  • where N is the number of sampling points of each audio frame (a positive integer), M is the number of frame-overlap sampling points with M = N/2, x_n is the amplitude of the n-th sampling point of each frame (n = 0, 1, 2, ..., N-1), H_k is the spectrum after the Hartley transform, and k is the frequency bin index.
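Using the definitions above, the discrete Hartley transform is H_k = Σ_n x_n · cas(2πnk/N) with cas(t) = cos(t) + sin(t). A direct O(N²) NumPy implementation, for illustration only (a production version would use a fast algorithm):

```python
import numpy as np

def dht(x):
    """Discrete Hartley transform of a real frame x of length N:
    H_k = sum_n x_n * (cos(2*pi*n*k/N) + sin(2*pi*n*k/N)).
    The output is real-valued: amplitude is kept, and no complex
    phase component is ever introduced."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    cas = np.cos(2 * np.pi * k * n / N) + np.sin(2 * np.pi * k * n / N)
    return cas @ x

H = dht(np.array([1.0, 2.0, 3.0, 4.0]))   # real spectrum of a toy frame
```

A useful property, relied on for the frequency-to-time direction later, is that applying the transform twice returns the input scaled by N.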
  • the computer device combines the frequency spectra of the multiple audio frames to obtain the frequency spectrum of the audio signal.
  • when the spectrum of each audio frame has been obtained, the per-frame spectra are spliced in order, end to end, to form an N × L two-dimensional array, where N equals the number of sampling points of each audio frame and L is the total number of frames.
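A sketch of this combination step, assuming per-frame real spectra of N points each; splicing L of them in order yields an N × L array with one column per audio frame:

```python
import numpy as np

N, L = 2048, 7                                    # points per frame, frame count
frame_spectra = [np.zeros(N) for _ in range(L)]   # one real spectrum per frame

# Splice the per-frame spectra in order into an N x L two-dimensional
# array, one column per audio frame.
signal_spectrum = np.stack(frame_spectra, axis=1)
```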
  • the computer device calls a preset decomposition model, which is used to perform spectrum separation based on a signal spectrum; inputs the frequency spectrum of the audio signal into the preset decomposition model, and outputs an accompaniment spectrum and a human voice spectrum.
  • a preset decomposition model which is used to perform spectrum separation based on a signal spectrum
  • the preset decomposition model may be obtained in advance by training on the spectra of multiple audio signals together with the accompaniment spectra and human voice spectra of those audio signals.
  • the preset decomposition model may be used to represent a separation rule of the accompaniment spectrum and the human voice spectrum, so that the frequency spectrum of the audio signal is decomposed based on the separation rule.
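The patent does not fix the decomposition model's architecture, only that it applies a learned separation rule to the real spectrum. As one hedged illustration of such a rule, a trained model could predict a per-bin mask giving the fraction of amplitude belonging to the human voice; the mask-based split below is a stand-in, not the patent's specified method.

```python
import numpy as np

def decompose_spectrum(spectrum, vocal_mask):
    """Illustrative decomposition: assume a trained model predicts, per
    time-frequency bin, the fraction of amplitude belonging to the human
    voice; the accompaniment keeps the remainder. The mask is a stand-in
    for whatever separation rule the preset model has learned."""
    vocal = spectrum * vocal_mask
    accompaniment = spectrum * (1.0 - vocal_mask)
    return accompaniment, vocal

spec = np.full((4, 3), 2.0)    # toy N x L real spectrum
mask = np.full((4, 3), 0.25)   # toy model output in [0, 1]
acc, voc = decompose_spectrum(spec, mask)
```

Note that because the spectrum is real, both outputs remain real, so the later inverse transform needs no phase estimate.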
  • the computer device converts the accompaniment spectrum and the human voice spectrum from the frequency domain to the time domain to obtain accompaniment audio and human voice audio.
  • the accompaniment spectrum and the human voice spectrum can be converted from the frequency domain to the time domain through the Hartley inverse transform to obtain the accompaniment audio and the human voice audio.
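The frequency-to-time step can be sketched by inverting the Hartley transform (the forward transform scaled by 1/N, since the transform applied twice returns the input times N) and overlap-adding the per-frame results back into a signal. Function names are illustrative.

```python
import numpy as np

def idht(H):
    """Inverse discrete Hartley transform: the DHT applied twice returns
    the input scaled by N, so the inverse is the forward transform
    divided by N. No phase is altered in either direction."""
    N = len(H)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    cas = np.cos(2 * np.pi * k * n / N) + np.sin(2 * np.pi * k * n / N)
    return cas @ H / N

def overlap_add(frames, hop):
    """Rebuild a time-domain signal from half-overlapping frames by
    summing each frame into its position in the output."""
    n_frames, frame_len = frames.shape
    out = np.zeros(hop * (n_frames - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + frame_len] += frame
    return out
```

With a Hann window and 50% overlap as in the framing step, overlapping windowed frames sum to a constant away from the edges, so overlap-add recovers the original amplitude.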
  • the method provided by the embodiments of the present invention performs the time-to-frequency and frequency-to-time conversions with a transform algorithm that represents the amplitude of each audio frame using only real numbers. Because the transformed spectrum is a real spectrum, it carries no phase information, and after the inverse transform the original phase is unchanged; separating the accompaniment and the human voice from the audio file based on this conversion therefore avoids the phase-distortion problem of Fourier-transform spectral decomposition.
  • FIG. 3 is a schematic structural diagram of a voice signal separation device according to an embodiment of the present invention.
  • the device includes:
  • a sampling module 301 configured to sample a sound wave waveform of an audio file to be separated to obtain an audio signal
  • a first conversion module 302 configured to convert the audio signal from the time domain to the frequency domain to obtain a frequency spectrum of the audio signal, where the frequency spectrum is only used to represent an amplitude of the audio signal and the amplitude is a real number;
  • a decomposition module 303 configured to decompose the frequency spectrum of the audio signal to obtain an accompaniment spectrum and a human voice spectrum
  • the second conversion module 304 is configured to convert the accompaniment spectrum and the human voice spectrum from the frequency domain to the time domain to obtain accompaniment audio and human voice audio.
  • the first conversion module 302 includes:
  • a frame framing unit configured to perform frame processing on the audio signal to obtain multiple audio frames
  • a time-frequency conversion unit is configured to convert the multiple audio frames from the time domain to the frequency domain, respectively, to obtain the frequency spectrum of the multiple audio frames, and the frequency spectrum of each audio frame is only used to represent the amplitude of the audio frame and Amplitude is real
  • a combining unit is configured to combine the frequency spectra of the multiple audio frames to obtain the frequency spectrum of the audio signal.
  • the framing unit is configured to:
  • window processing is performed on the audio signal to obtain multiple audio frames.
  • the length of the preset window function is the same as the number of sampling points of each audio frame.
  • the number of sampling points of each audio frame is twice the number of frame overlapping sampling points.
  • the decomposition module is configured to call a preset decomposition model, the preset decomposition model being used to perform spectrum separation based on a signal spectrum, to input the frequency spectrum of the audio signal into the preset decomposition model, and to output the accompaniment spectrum and the human voice spectrum.
  • when the voice signal separation device provided in the foregoing embodiment separates voice signals, the division into the functional modules described above is only an example; in practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
  • the voice signal separation device and the voice signal separation method provided by the foregoing embodiments belong to the same concept. For specific implementation processes, refer to the method embodiment, and details are not described herein again.
  • the computer device 400 may vary considerably in configuration or performance and may include one or more processors (central processing units, CPUs) 401 and one or more memories 402, where at least one instruction is stored in the memory 402, and the at least one instruction is loaded and executed by the processor 401 to implement the following method:
  • sampling the sound wave waveform of the audio file to be separated to obtain an audio signal; converting the audio signal from the time domain to the frequency domain to obtain the frequency spectrum of the audio signal, the frequency spectrum representing only the amplitude of the audio signal, the amplitude being a real number; decomposing the frequency spectrum of the audio signal to obtain an accompaniment spectrum and a human voice spectrum; and converting the accompaniment spectrum and the human voice spectrum from the frequency domain to the time domain to obtain accompaniment audio and human voice audio.
  • the processor 401 is further configured to perform the following steps: framing the audio signal to obtain multiple audio frames; converting the multiple audio frames from the time domain to the frequency domain, respectively, to obtain the spectra of the multiple audio frames, where the spectrum of each audio frame represents only the amplitude of the audio frame and the amplitude is a real number; and combining the spectra of the multiple audio frames to obtain the spectrum of the audio signal.
  • the processor 401 is further configured to perform the following steps: performing windowing processing on the audio signal based on a preset window function to obtain multiple audio frames.
  • the length of the preset window function is the same as the number of sampling points of each audio frame.
  • the number of sampling points of each audio frame is twice the number of frame overlapping sampling points.
  • the processor 401 is further configured to perform the following steps: calling a preset decomposition model, where the preset decomposition model is used to perform spectrum separation based on a signal spectrum;
  • the frequency spectrum of the audio signal is input to the preset decomposition model, and an accompaniment spectrum and a human voice spectrum are output.
  • the computer device may also have components such as a wired or wireless network interface, a keyboard, and an input-output interface for input and output.
  • the computer device may also include other components for implementing the functions of the device, and details are not described herein.
  • a computer-readable storage medium such as a memory including instructions, and the foregoing instructions may be executed by a processor in a terminal to complete the speech signal separation method in the following embodiments:
  • sampling the sound wave waveform of the audio file to be separated to obtain an audio signal; converting the audio signal from the time domain to the frequency domain to obtain the frequency spectrum of the audio signal, the frequency spectrum representing only the amplitude of the audio signal, the amplitude being a real number; decomposing the frequency spectrum of the audio signal to obtain an accompaniment spectrum and a human voice spectrum; and converting the accompaniment spectrum and the human voice spectrum from the frequency domain to the time domain to obtain accompaniment audio and human voice audio.
  • the processor is further configured to perform the following steps: framing the audio signal to obtain multiple audio frames; converting the multiple audio frames from the time domain to the frequency domain, respectively, to obtain the spectra of the multiple audio frames, where the spectrum of each audio frame represents only the amplitude of the audio frame and the amplitude is a real number; and combining the spectra of the multiple audio frames to obtain the spectrum of the audio signal.
  • the processor is further configured to perform the following steps: performing windowing processing on the audio signal based on a preset window function to obtain multiple audio frames.
  • the length of the preset window function is the same as the number of sampling points of each audio frame.
  • the number of sampling points of each audio frame is twice the number of frame overlapping sampling points.
  • the processor is further configured to perform the following steps: calling a preset decomposition model, where the preset decomposition model is used to perform spectrum separation based on a signal spectrum;
  • the frequency spectrum of the audio signal is input to the preset decomposition model, and an accompaniment spectrum and a human voice spectrum are output.
  • the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
  • the program may be stored in a computer-readable storage medium, and the storage medium mentioned may be a read-only memory, a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Electrophonic Musical Instruments (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

A speech signal separation method and apparatus, a computer device, and a storage medium, belonging to the field of speech signal processing. The method comprises: sampling the sound-wave waveform of an audio file to be separated to obtain an audio signal (201); converting the audio signal from the time domain to the frequency domain to obtain the spectrum of the audio signal, the spectrum representing only the amplitude of the audio signal, the amplitude being a real number; decomposing the spectrum of the audio signal to obtain an accompaniment spectrum and a vocal spectrum; and converting the accompaniment spectrum and the vocal spectrum from the frequency domain to the time domain to obtain accompaniment audio and vocal audio (206). A transform algorithm that represents the amplitude of each audio frame using only real numbers performs the time-to-frequency and frequency-to-time conversions; since the phase is not transformed before or after the conversion, no phase information is lost, so separating the accompaniment and the human voice from the audio file on the basis of this conversion avoids the phase-distortion problem of Fourier-transform spectral decomposition.

Description

Speech signal separation method and apparatus, computer device, and storage medium
This application claims priority to Chinese Patent Application No. 201810802835.7, entitled "Speech signal separation method and apparatus, computer device, and storage medium", filed on July 20, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of speech signal processing, and in particular to a speech signal separation method and apparatus, a computer device, and a storage medium.
Background
With the continuous development of speech signal processing technology, speech signal separation has been widely applied in daily life. For example, when a user of a karaoke application wants to record a song sung over an accompaniment, the song accompaniment provided by a server is needed, and the quality of the accompaniment directly affects the final recording. Therefore, how to perform speech signal separation to obtain accompaniment audio and vocal audio is crucial to improving the quality of the accompaniment audio.
At present, speech signal separation involves converting the audio signal from the time domain to the frequency domain using a Fourier transform, which yields a complex spectrum. The separated accompaniment spectrum and vocal spectrum can then be obtained by decomposing the complex spectrum, and the inverse Fourier transform recovers the accompaniment audio and the vocal audio.
In the course of implementing the present invention, the inventors found that the prior art has at least the following problem: because only the amplitude spectrum is used when the complex spectrum is decomposed, the separated accompaniment audio suffers from phase distortion.
Summary
Embodiments of the present invention provide a speech signal separation method and apparatus, a computer device, and a storage medium, which can solve the phase-distortion problem in speech signal separation. The technical solution is as follows:
In one aspect, a speech signal separation method is provided, the method comprising:
sampling the sound-wave waveform of an audio file to be separated to obtain an audio signal;
converting the audio signal from the time domain to the frequency domain to obtain the spectrum of the audio signal, the spectrum representing only the amplitude of the audio signal, the amplitude being a real number;
decomposing the spectrum of the audio signal to obtain an accompaniment spectrum and a vocal spectrum;
converting the accompaniment spectrum and the vocal spectrum from the frequency domain to the time domain to obtain accompaniment audio and vocal audio.
In a possible implementation, converting the audio signal from the time domain to the frequency domain to obtain the spectrum of the audio signal comprises:
framing the audio signal to obtain multiple audio frames;
converting the multiple audio frames from the time domain to the frequency domain, respectively, to obtain the spectra of the multiple audio frames, the spectrum of each audio frame representing only the amplitude of that audio frame, the amplitude being a real number;
combining the spectra of the multiple audio frames to obtain the spectrum of the audio signal.
In a possible implementation, framing the audio signal to obtain multiple audio frames comprises:
windowing the audio signal based on a preset window function to obtain multiple audio frames.
In a possible implementation, the length of the preset window function is the same as the number of sampling points of each audio frame.
In a possible implementation, the number of sampling points of each audio frame is twice the number of frame-overlap sampling points.
In a possible implementation, decomposing the spectrum of the audio signal to obtain an accompaniment spectrum and a vocal spectrum comprises:
invoking a preset decomposition model, the preset decomposition model being used to perform spectrum separation based on a signal spectrum;
inputting the spectrum of the audio signal into the preset decomposition model and outputting the accompaniment spectrum and the vocal spectrum.
In one aspect, a speech signal separation apparatus is provided, the apparatus comprising:
a sampling module, configured to sample the sound-wave waveform of an audio file to be separated to obtain an audio signal;
a first conversion module, configured to convert the audio signal from the time domain to the frequency domain to obtain the spectrum of the audio signal, the spectrum representing only the amplitude of the audio signal, the amplitude being a real number;
a decomposition module, configured to decompose the spectrum of the audio signal to obtain an accompaniment spectrum and a vocal spectrum;
a second conversion module, configured to convert the accompaniment spectrum and the vocal spectrum from the frequency domain to the time domain to obtain accompaniment audio and vocal audio.
In a possible implementation, the first conversion module comprises:
a framing unit, configured to frame the audio signal to obtain multiple audio frames;
a time-frequency conversion unit, configured to convert the multiple audio frames from the time domain to the frequency domain, respectively, to obtain the spectra of the multiple audio frames, the spectrum of each audio frame representing only the amplitude of that audio frame, the amplitude being a real number;
a combining unit, configured to combine the spectra of the multiple audio frames to obtain the spectrum of the audio signal.
In a possible implementation, the framing unit is configured to:
window the audio signal based on a preset window function to obtain multiple audio frames.
In a possible implementation, the length of the preset window function is the same as the number of sampling points of each audio frame.
In a possible implementation, the number of sampling points of each audio frame is twice the number of frame-overlap sampling points.
In a possible implementation, the decomposition module is configured to invoke a preset decomposition model, the preset decomposition model being used to perform spectrum separation based on a signal spectrum, and to input the spectrum of the audio signal into the preset decomposition model and output the accompaniment spectrum and the vocal spectrum.
In one aspect, a computer device is provided, the computer device comprising a processor and a memory, the memory storing at least one instruction that is loaded and executed by the processor to implement the operations performed by the following speech signal separation method:
sampling the sound-wave waveform of an audio file to be separated to obtain an audio signal;
converting the audio signal from the time domain to the frequency domain to obtain the spectrum of the audio signal, the spectrum representing only the amplitude of the audio signal, the amplitude being a real number;
decomposing the spectrum of the audio signal to obtain an accompaniment spectrum and a vocal spectrum;
converting the accompaniment spectrum and the vocal spectrum from the frequency domain to the time domain to obtain accompaniment audio and vocal audio.
In a possible implementation, the processor is further configured to perform the following operations:
framing the audio signal to obtain multiple audio frames;
converting the multiple audio frames from the time domain to the frequency domain, respectively, to obtain the spectra of the multiple audio frames, the spectrum of each audio frame representing only the amplitude of that audio frame, the amplitude being a real number;
combining the spectra of the multiple audio frames to obtain the spectrum of the audio signal.
In a possible implementation, the processor is further configured to perform the following operation:
windowing the audio signal based on a preset window function to obtain multiple audio frames.
In a possible implementation, the length of the preset window function is the same as the number of sampling points of each audio frame.
In a possible implementation, the number of sampling points of each audio frame is twice the number of frame-overlap sampling points.
In a possible implementation, the processor is further configured to perform the following operations:
invoking a preset decomposition model, the preset decomposition model being used to perform spectrum separation based on a signal spectrum; and inputting the spectrum of the audio signal into the preset decomposition model and outputting the accompaniment spectrum and the vocal spectrum.
In one aspect, a computer-readable storage medium is provided, the storage medium storing at least one instruction that is loaded and executed by a processor to implement the operations performed by the speech signal separation method described above.
The method provided by the embodiments of the present invention performs the time-to-frequency and frequency-to-time conversions with a transform algorithm that represents the amplitude of each audio frame using only real numbers. Because the phase is not transformed before or after the conversion, no phase information is lost; separating the accompaniment and the human voice from the audio file on the basis of this conversion therefore avoids the phase-distortion problem of Fourier-transform spectral decomposition.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a diagram of an implementation scenario of a speech signal separation method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a speech signal separation method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a speech signal separation apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
FIG. 1 is a diagram of an implementation scenario of a speech signal separation method according to an embodiment of the present invention. Referring to FIG. 1, the scenario may include at least one terminal 101 and at least one server 102. The at least one terminal 101 may serve as a sound-signal collection terminal or an audio-file playback terminal, and the at least one server 102 provides audio services for the at least one terminal 101, for example supplying audio files to be played and offering a signal separation function corresponding to the method provided by the embodiment of the present invention, so as to perform speech signal separation on audio files provided or selected by a terminal. As another example, the at least one server 102 may further provide a video file to be played; the video file includes picture data and an audio file, and the server 102 may extract the audio file from the video file to apply the signal separation function corresponding to the method provided by the embodiment of the present invention.
FIG. 2 is a flowchart of a speech signal separation method according to an embodiment of the present invention. Taking a computer device as the execution subject of this embodiment as an example, and referring to FIG. 2, the embodiment specifically includes the following steps:
201. The computer device samples the sound-wave waveform of an audio file to be separated to obtain an audio signal.
The audio file to be separated may be an audio file uploaded by a terminal, an audio file stored on the computer device, or an audio file contained in a video file stored on the computer device. The computer device may of course be a server or any terminal, which is not limited in the embodiments of the present invention. After obtaining the audio file to be processed, the computer device may acquire the sound-wave waveform of the audio file and sample the waveform at a preset sampling rate to obtain an audio signal.
The preset sampling rate may correspond to the format of the audio file, and different audio file formats may correspond to different preset sampling rates; sampling the sound-wave waveform of the audio file at the rate corresponding to its format ensures that the resulting audio signals are consistent.
202. The computer device windows the audio signal based on a preset window function to obtain multiple audio frames.
The sampled audio signal may be framed according to a preset frame length to obtain multiple original audio frames. The preset frame length should be short enough, generally 20 to 50 milliseconds; over a sufficiently short interval, an original audio frame can be regarded as an approximately stationary periodic signal, which facilitates the subsequent steps.
During framing, the number of sampling points of each audio frame should be chosen within a reasonable range to improve the spectral resolution of the audio frame. In a possible implementation, consecutive original audio frames should overlap so that each original audio frame contains a portion of the previous frame, preventing discontinuities between adjacent original audio frames. In general, the number of sampling points of each original audio frame can be chosen between 512 and 8192. For example, in the embodiments of the present invention, the number of sampling points of each audio frame may be set to 2048 and, correspondingly, the number of frame-overlap sampling points to 1024.
In the above framing process, the preset frame length and the number of sampling points contained in each audio frame may both be considered so that both satisfy the above conditions, thereby achieving the best framing effect.
In practice, framing can be carried out with windowing, that is, applying a window to each of the multiple original audio frames to obtain multiple audio frames, so that the frames better satisfy the periodicity requirement of the time-frequency conversion in the subsequent steps, reducing spectral leakage of the audio frames and improving spectral resolution. For example, the preset window function may be a Hanning window or a Hamming window, where the length of the preset window function may be the same as the number of sampling points of each audio frame, and the number of sampling points of each audio frame is twice the number of frame-overlap sampling points.
203. The computer device converts the multiple audio frames from the time domain to the frequency domain, respectively, to obtain the spectra of the multiple audio frames, the spectrum of each audio frame representing only the amplitude of that audio frame, the amplitude being a real number.
In the embodiments of the present invention, the time-frequency conversion may convert the multiple audio frames from the time domain to the frequency domain through a Hartley transform to obtain their spectra. Because the Hartley transform is a real transform, the resulting spectra are real spectra, and a real spectrum represents only the amplitude of the audio spectrum and does not involve the phase. Specifically, the Hartley transform can be implemented with the following formula:

H_k = Σ_{n=0}^{N-1} x_n [cos(2πnk/N) + sin(2πnk/N)],  k = 0, 1, ..., N-1

where N is the number of sampling points of each audio frame, M is the number of frame-overlap sampling points with M = N/2, x_n is the amplitude of the n-th sampling point of each frame (n = 0, 1, 2, ..., N-1), H_k is the spectrum after the Hartley transform, and k is the frequency bin index (k = 0, 1, 2, ..., N-1), N being a positive integer.
It should be noted that the embodiments of the present invention take the Hartley transform only as an example; in practice, other transforms that do not damage the phase may also be used, which is not limited in the embodiments of the present invention.
204. The computer device combines the spectra of the multiple audio frames to obtain the spectrum of the audio signal.
When the spectrum of each audio frame has been obtained, the per-frame spectra are spliced in order, end to end, to form an N × L two-dimensional array, where N equals the number of sampling points of each audio frame and L is the total number of frames.
205. The computer device invokes a preset decomposition model, the preset decomposition model being used to perform spectrum separation based on a signal spectrum, inputs the spectrum of the audio signal into the preset decomposition model, and outputs an accompaniment spectrum and a vocal spectrum.
The preset decomposition model may be obtained in advance by training on the spectra of multiple audio signals together with the accompaniment spectra and vocal spectra of those audio signals. For example, the preset decomposition model may represent the separation rule between the accompaniment spectrum and the vocal spectrum, so that the spectrum of the audio signal is decomposed based on this rule.
206. The computer device converts the accompaniment spectrum and the vocal spectrum from the frequency domain to the time domain to obtain accompaniment audio and vocal audio.
When the accompaniment spectrum and the vocal spectrum have been obtained, they can be converted from the frequency domain to the time domain through the inverse Hartley transform to obtain the accompaniment audio and the vocal audio.
The method provided by the embodiments of the present invention performs the time-to-frequency and frequency-to-time conversions with a transform algorithm that represents the amplitude of each audio frame using only real numbers. Because the transformed spectrum is a real spectrum, it carries no phase information, and after the inverse transform the phase is the original phase, so no phase information is lost; separating the accompaniment and the human voice from the audio file on the basis of this conversion therefore avoids the phase-distortion problem of Fourier-transform spectral decomposition.
All of the above optional technical solutions may be combined in any manner to form optional embodiments of the present disclosure, which are not described here one by one.
FIG. 3 is a schematic structural diagram of a speech signal separation apparatus according to an embodiment of the present invention. Referring to FIG. 3, the apparatus includes:
a sampling module 301, configured to sample the sound-wave waveform of an audio file to be separated to obtain an audio signal;
a first conversion module 302, configured to convert the audio signal from the time domain to the frequency domain to obtain the spectrum of the audio signal, the spectrum representing only the amplitude of the audio signal, the amplitude being a real number;
a decomposition module 303, configured to decompose the spectrum of the audio signal to obtain an accompaniment spectrum and a vocal spectrum;
a second conversion module 304, configured to convert the accompaniment spectrum and the vocal spectrum from the frequency domain to the time domain to obtain accompaniment audio and vocal audio.
In a possible implementation, the first conversion module 302 comprises:
a framing unit, configured to frame the audio signal to obtain multiple audio frames;
a time-frequency conversion unit, configured to convert the multiple audio frames from the time domain to the frequency domain, respectively, to obtain the spectra of the multiple audio frames, the spectrum of each audio frame representing only the amplitude of that audio frame, the amplitude being a real number;
a combining unit, configured to combine the spectra of the multiple audio frames to obtain the spectrum of the audio signal.
In a possible implementation, the framing unit is configured to:
window the audio signal based on a preset window function to obtain multiple audio frames.
In a possible implementation, the length of the preset window function is the same as the number of sampling points of each audio frame.
In a possible implementation, the number of sampling points of each audio frame is twice the number of frame-overlap sampling points.
In a possible implementation, the decomposition module is configured to invoke a preset decomposition model, the preset decomposition model being used to perform spectrum separation based on a signal spectrum, and to input the spectrum of the audio signal into the preset decomposition model and output the accompaniment spectrum and the vocal spectrum.
It should be noted that when the speech signal separation apparatus provided in the foregoing embodiments separates speech signals, the division into the functional modules described above is only an example; in practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the speech signal separation apparatus provided in the foregoing embodiments and the speech signal separation method embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, which are not repeated here.
FIG. 4 is a schematic structural diagram of a computer device provided by an embodiment of the present invention. The computer device 400 may vary considerably in configuration or performance and may include one or more processors (central processing units, CPU) 401 and one or more memories 402, where the memory 402 stores at least one instruction that is loaded and executed by the processor 401 to implement the following method:
sampling the sound waveform of an audio file to be separated to obtain an audio signal; converting the audio signal from the time domain to the frequency domain to obtain the spectrum of the audio signal, the spectrum representing only the amplitude of the audio signal and the amplitude being a real number; decomposing the spectrum of the audio signal to obtain an accompaniment spectrum and a vocal spectrum; and converting the accompaniment spectrum and the vocal spectrum from the frequency domain to the time domain to obtain accompaniment audio and vocal audio.
In one possible implementation, the processor 401 is further configured to perform the following steps: dividing the audio signal into frames to obtain multiple audio frames; converting each of the multiple audio frames from the time domain to the frequency domain to obtain the spectra of the multiple audio frames, the spectrum of each audio frame representing only the amplitude of that frame and the amplitude being a real number; and combining the spectra of the multiple audio frames to obtain the spectrum of the audio signal.
In one possible implementation, the processor 401 is further configured to perform the following step: windowing the audio signal based on a preset window function to obtain multiple audio frames.
In one possible implementation, the length of the preset window function equals the number of sample points in each audio frame.
In one possible implementation, the number of sample points in each audio frame is twice the number of overlapping sample points.
In one possible implementation, the processor 401 is further configured to perform the following steps: invoking a preset decomposition model used to perform spectrum separation based on a signal spectrum;
and inputting the spectrum of the audio signal into the preset decomposition model and outputting an accompaniment spectrum and a vocal spectrum.
Of course, the computer device may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and may further include other components for implementing device functions, which are not described here.
In an exemplary embodiment, a computer-readable storage medium is further provided, such as a memory including instructions executable by a processor in a terminal to complete the voice signal separation method in the following embodiments:
sampling the sound waveform of an audio file to be separated to obtain an audio signal; converting the audio signal from the time domain to the frequency domain to obtain the spectrum of the audio signal, the spectrum representing only the amplitude of the audio signal and the amplitude being a real number; decomposing the spectrum of the audio signal to obtain an accompaniment spectrum and a vocal spectrum; and converting the accompaniment spectrum and the vocal spectrum from the frequency domain to the time domain to obtain accompaniment audio and vocal audio.
In one possible implementation, the processor is further configured to perform the following steps: dividing the audio signal into frames to obtain multiple audio frames; converting each of the multiple audio frames from the time domain to the frequency domain to obtain the spectra of the multiple audio frames, the spectrum of each audio frame representing only the amplitude of that frame and the amplitude being a real number; and combining the spectra of the multiple audio frames to obtain the spectrum of the audio signal.
In one possible implementation, the processor is further configured to perform the following step: windowing the audio signal based on a preset window function to obtain multiple audio frames.
In one possible implementation, the length of the preset window function equals the number of sample points in each audio frame.
In one possible implementation, the number of sample points in each audio frame is twice the number of overlapping sample points.
In one possible implementation, the processor is further configured to perform the following steps: invoking a preset decomposition model used to perform spectrum separation based on a signal spectrum;
and inputting the spectrum of the audio signal into the preset decomposition model and outputting an accompaniment spectrum and a vocal spectrum.
For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Persons of ordinary skill in the art may understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing descriptions are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (19)

  1. A voice signal separation method, characterized in that the method comprises:
    sampling a sound waveform of an audio file to be separated to obtain an audio signal;
    converting the audio signal from a time domain to a frequency domain to obtain a spectrum of the audio signal, wherein the spectrum represents only an amplitude of the audio signal and the amplitude is a real number;
    decomposing the spectrum of the audio signal to obtain an accompaniment spectrum and a vocal spectrum; and
    converting the accompaniment spectrum and the vocal spectrum from the frequency domain to the time domain to obtain accompaniment audio and vocal audio.
  2. The method according to claim 1, characterized in that said converting the audio signal from the time domain to the frequency domain to obtain the spectrum of the audio signal comprises:
    dividing the audio signal into frames to obtain a plurality of audio frames;
    converting each of the plurality of audio frames from the time domain to the frequency domain to obtain spectra of the plurality of audio frames, wherein the spectrum of each audio frame represents only an amplitude of that audio frame and the amplitude is a real number; and
    combining the spectra of the plurality of audio frames to obtain the spectrum of the audio signal.
  3. The method according to claim 2, characterized in that said dividing the audio signal into frames to obtain a plurality of audio frames comprises:
    windowing the audio signal based on a preset window function to obtain the plurality of audio frames.
  4. The method according to claim 3, characterized in that a length of the preset window function equals the number of sample points in each audio frame.
  5. The method according to claim 2, characterized in that the number of sample points in each audio frame is twice the number of overlapping sample points between frames.
  6. The method according to claim 1, characterized in that said decomposing the spectrum of the audio signal to obtain an accompaniment spectrum and a vocal spectrum comprises:
    invoking a preset decomposition model, the preset decomposition model being used to perform spectrum separation based on a signal spectrum; and
    inputting the spectrum of the audio signal into the preset decomposition model and outputting the accompaniment spectrum and the vocal spectrum.
  7. A voice signal separation apparatus, characterized in that the apparatus comprises:
    a sampling module, configured to sample a sound waveform of an audio file to be separated to obtain an audio signal;
    a first conversion module, configured to convert the audio signal from a time domain to a frequency domain to obtain a spectrum of the audio signal, wherein the spectrum represents only an amplitude of the audio signal and the amplitude is a real number;
    a decomposition module, configured to decompose the spectrum of the audio signal to obtain an accompaniment spectrum and a vocal spectrum; and
    a second conversion module, configured to convert the accompaniment spectrum and the vocal spectrum from the frequency domain to the time domain to obtain accompaniment audio and vocal audio.
  8. The apparatus according to claim 7, characterized in that the first conversion module comprises:
    a framing unit, configured to divide the audio signal into frames to obtain a plurality of audio frames;
    a time-frequency conversion unit, configured to convert each of the plurality of audio frames from the time domain to the frequency domain to obtain spectra of the plurality of audio frames, wherein the spectrum of each audio frame represents only an amplitude of that audio frame and the amplitude is a real number; and
    a combining unit, configured to combine the spectra of the plurality of audio frames to obtain the spectrum of the audio signal.
  9. The apparatus according to claim 8, characterized in that the framing unit is configured to:
    window the audio signal based on a preset window function to obtain the plurality of audio frames.
  10. The apparatus according to claim 9, characterized in that a length of the preset window function equals the number of sample points in each audio frame.
  11. The apparatus according to claim 8, characterized in that the number of sample points in each audio frame is twice the number of overlapping sample points between frames.
  12. The apparatus according to claim 7, characterized in that the decomposition module is configured to invoke a preset decomposition model, the preset decomposition model being used to perform spectrum separation based on a signal spectrum, input the spectrum of the audio signal into the preset decomposition model, and output the accompaniment spectrum and the vocal spectrum.
  13. A computer device, characterized in that the computer device comprises a processor and a memory, the memory storing at least one instruction that is loaded and executed by the processor to perform the operations of the following voice signal separation method:
    sampling a sound waveform of an audio file to be separated to obtain an audio signal;
    converting the audio signal from a time domain to a frequency domain to obtain a spectrum of the audio signal, wherein the spectrum represents only an amplitude of the audio signal and the amplitude is a real number;
    decomposing the spectrum of the audio signal to obtain an accompaniment spectrum and a vocal spectrum; and
    converting the accompaniment spectrum and the vocal spectrum from the frequency domain to the time domain to obtain accompaniment audio and vocal audio.
  14. The computer device according to claim 13, characterized in that the processor is further configured to perform the following operations:
    dividing the audio signal into frames to obtain a plurality of audio frames;
    converting each of the plurality of audio frames from the time domain to the frequency domain to obtain spectra of the plurality of audio frames, wherein the spectrum of each audio frame represents only an amplitude of that audio frame and the amplitude is a real number; and
    combining the spectra of the plurality of audio frames to obtain the spectrum of the audio signal.
  15. The computer device according to claim 14, characterized in that the processor is further configured to perform the following operation:
    windowing the audio signal based on a preset window function to obtain the plurality of audio frames.
  16. The computer device according to claim 15, characterized in that a length of the preset window function equals the number of sample points in each audio frame.
  17. The computer device according to claim 14, characterized in that the number of sample points in each audio frame is twice the number of overlapping sample points between frames.
  18. The computer device according to claim 13, characterized in that the processor is further configured to perform the following operations:
    invoking a preset decomposition model, the preset decomposition model being used to perform spectrum separation based on a signal spectrum; and inputting the spectrum of the audio signal into the preset decomposition model and outputting the accompaniment spectrum and the vocal spectrum.
  19. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction that is loaded and executed by a processor to perform the operations of the voice signal separation method according to any one of claims 1 to 7.
PCT/CN2018/118293 2018-07-20 2018-11-29 Voice signal separation method and apparatus, computer device, and storage medium WO2020015270A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810802835.7 2018-07-20
CN201810802835.7A CN108962277A (zh) 2018-07-20 2018-07-20 Voice signal separation method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2020015270A1 true WO2020015270A1 (zh) 2020-01-23

Family

ID=64482037

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/118293 WO2020015270A1 (zh) 2018-07-20 2018-11-29 Voice signal separation method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN108962277A (zh)
WO (1) WO2020015270A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109801644B (zh) 2018-12-20 2021-03-09 北京达佳互联信息技术有限公司 Separation method and apparatus for mixed sound signals, electronic device, and readable medium
CN109767760A (zh) * 2019-02-23 2019-05-17 天津大学 Far-field speech recognition method based on multi-target learning of amplitude and phase information
CN110085251B (zh) * 2019-04-26 2021-06-25 腾讯音乐娱乐科技(深圳)有限公司 Vocal extraction method, vocal extraction apparatus, and related products
CN110277105B (zh) * 2019-07-05 2021-08-13 广州酷狗计算机科技有限公司 Method, apparatus, and system for eliminating background audio data
CN111192594B (zh) * 2020-01-10 2022-12-09 腾讯音乐娱乐科技(深圳)有限公司 Method for separating vocals and accompaniment, and related products
CN111429942B (zh) * 2020-03-19 2023-07-14 北京火山引擎科技有限公司 Audio data processing method and apparatus, electronic device, and storage medium
CN115240709B (zh) * 2022-07-25 2023-09-19 镁佳(北京)科技有限公司 Sound field analysis method and apparatus for audio files

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130121511A1 (en) * 2009-03-31 2013-05-16 Paris Smaragdis User-Guided Audio Selection from Complex Sound Mixtures
CN103943113A (zh) * 2014-04-15 2014-07-23 福建星网视易信息系统有限公司 Method and apparatus for removing accompaniment from songs
CN104078051A (zh) * 2013-03-29 2014-10-01 中兴通讯股份有限公司 Vocal extraction method and system, and vocal audio playback method and apparatus
CN104134444A (zh) * 2014-07-11 2014-11-05 福建星网视易信息系统有限公司 MMSE-based method and apparatus for removing accompaniment from songs
CN106024005A (zh) * 2016-07-01 2016-10-12 腾讯科技(深圳)有限公司 Audio data processing method and apparatus

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1945689B (zh) * 2006-10-24 2011-04-27 北京中星微电子有限公司 Method and apparatus for extracting accompaniment music from songs
CN101944355B (zh) * 2009-07-03 2013-05-08 深圳Tcl新技术有限公司 Accompaniment music generation apparatus and implementation method thereof
CN102402977B (zh) * 2010-09-14 2015-12-09 无锡中星微电子有限公司 Method and apparatus for extracting accompaniment and vocals from stereo music
CN104053120B (zh) * 2014-06-13 2016-03-02 福建星网视易信息系统有限公司 Stereo audio processing method and apparatus

Also Published As

Publication number Publication date
CN108962277A (zh) 2018-12-07

Similar Documents

Publication Publication Date Title
WO2020015270A1 (zh) Voice signal separation method and apparatus, computer device, and storage medium
WO2022033327A1 (zh) Video generation method, generative model training method, apparatus, medium, and device
WO2021196905A1 (zh) Speech signal dereverberation processing method and apparatus, computer device, and storage medium
US20210193149A1 (en) Method, apparatus and device for voiceprint recognition, and medium
WO2022166710A1 (zh) Speech enhancement method, apparatus, device, and storage medium
JP6482173B2 (ja) Acoustic signal processing apparatus and method
WO2017044370A1 (en) System and method for providing words or phrases to be uttered by members of a crowd and processing the utterances in crowd-sourced campaigns to facilitate speech analysis
CN114203163A (zh) Audio signal processing method and apparatus
CN111863015A (zh) Audio processing method and apparatus, electronic device, and readable storage medium
WO2024055752A9 (zh) Training method for speech synthesis model, speech synthesis method, and related apparatus
US20230050519A1 (en) Speech enhancement method and apparatus, device, and storage medium
Li et al. Filtering and refining: A collaborative-style framework for single-channel speech enhancement
WO2024027295A1 (zh) Training and enhancement methods and apparatus for speech enhancement model, electronic device, storage medium, and program product
CN111739544A (zh) Speech processing method and apparatus, electronic device, and storage medium
Shankar et al. Efficient two-microphone speech enhancement using basic recurrent neural network cell for hearing and hearing aids
WO2022005615A1 (en) Speech enhancement
Wu et al. Self-supervised speech denoising using only noisy audio signals
Hussain et al. Bone-conducted speech enhancement using hierarchical extreme learning machine
CN114333874A (zh) Method for processing audio signals
WO2024055751A1 (zh) Audio data processing method, apparatus, device, storage medium, and program product
WO2023102932A1 (zh) Audio conversion method, electronic device, program product, and storage medium
WO2022227932A1 (zh) Sound signal processing method and apparatus, and electronic device
CN115188363A (zh) Speech processing method, system, device, and storage medium
CN113707163A (zh) Speech processing method and apparatus, and model training method and apparatus
Kuang et al. A lightweight speech enhancement network fusing bone- and air-conducted speech

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18926812

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 31.05.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18926812

Country of ref document: EP

Kind code of ref document: A1