WO2014153922A1 - Human voice extracting method and system, and audio playing method and device for human voice - Google Patents

Human voice extracting method and system, and audio playing method and device for human voice

Info

Publication number
WO2014153922A1
WO2014153922A1 (PCT/CN2013/082328)
Authority
WO
WIPO (PCT)
Prior art keywords
human voice
main pitch
sound
frequency
sample
Prior art date
Application number
PCT/CN2013/082328
Other languages
French (fr)
Chinese (zh)
Inventor
佘海波
王进军
刘书昌
张欣
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2014153922A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272 - Voice signal separating
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00, specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00, specially adapted for particular use, for comparison or discrimination

Definitions

  • the present invention relates to the field of mixed audio separation and extraction, and in particular, to a vocal extraction method and system, and a vocal audio playing method and apparatus.
  • in Auditory Scene Analysis (ASA), the auditory system uses various characteristics of sound (time domain, frequency domain, spatial position, etc.) to decompose one mixed sound signal into multiple signals, each belonging to a different physical sound source.
  • the Computational Auditory Scene Analysis (CASA) technique uses computer technology to simulate the human auditory system, ultimately giving the computer a sound-discrimination ability similar to that of the human ear.
  • a conventional CASA system first divides the sound into a part where the human voice and the background sound occur together and a part with only background sound; the signals of the part where the two occur together are then decomposed by a multi-channel filter; finally, the signal of each channel is classified to judge whether it belongs to the human voice or the background sound.
  • the invention provides a human voice extraction method and system, and a human voice audio playing method and device, to solve the technical problem of how to extract the human voice from mixed audio simply and conveniently.
  • the present invention provides a human voice extraction method, the method comprising: extracting, from the beginning of an original sound signal, a sound signal in which the human voice and the background sound occur together, as a sample; detecting a main pitch from the sample; and, taking the main pitch as a reference frequency, comparing the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency, to determine whether that sound source is a human voice.
  • each frame of the sound signal is passed through a multi-channel filter to obtain multiple time-frequency units, and adjacent time-frequency units belonging to the same sound source are merged into one segment;
  • if the fundamental frequency of more than half of the time-frequency units within a segment equals the reference frequency, the segment is a human voice segment.
  • the method further includes:
  • after judging whether all segments of each frame are human voice segments, the main pitch continues to be detected from subsequent adjacent frames; if the main pitch changes, the changed main pitch is taken as the reference frequency, and the segments in those frames continue to be judged as human voice segments or not.
  • taking the changed main pitch as the reference frequency includes: if the main pitch changes, continuing to check whether the main pitch of subsequent frames equals the changed value; only if the main pitch of several consecutive subsequent frames equals the changed value is the changed main pitch adopted as the reference frequency.
  • the present invention also provides a human voice audio playing method, the method comprising:
  • the human voice signal is linearly combined with the original sound signal and played.
  • the present invention also provides a human voice extraction system, the system comprising a sample extraction unit, a main pitch detection unit, and a human voice detection unit, wherein
  • the sample extraction unit is configured to: extract, from the beginning of the original sound signal, a sound signal in which the human voice and the background sound occur together, as a sample, and send the sample to the main pitch detection unit; the main pitch detection unit is configured to: detect a main pitch from the sample, and send the main pitch to the human voice detection unit;
  • the human voice detection unit is configured to: take the main pitch as a reference frequency, and compare the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency, to determine whether that sound source is a human voice.
  • the human voice detection unit being configured to make the above comparison includes:
  • the human voice detection unit divides the portion of the original sound signal other than the sample into multiple frames; passes each frame of the sound signal through a multi-channel filter to obtain multiple time-frequency units; merges adjacent time-frequency units belonging to the same sound source into one segment; and, if the fundamental frequency of more than half of the time-frequency units within a segment equals the reference frequency, judges the segment to be a human voice segment.
  • the main pitch detection unit is further configured to: after the human voice detection unit finishes a frame, continue to detect the main pitch from subsequent adjacent frames, and, if the main pitch changes, send the changed main pitch to the human voice detection unit as the reference frequency.
  • the main pitch detection unit being configured to take the changed main pitch as the reference frequency when the main pitch changes includes:
  • when the main pitch changes, the main pitch detection unit continues to check whether the main pitch of subsequent frames equals the changed value; if the main pitch of several consecutive subsequent frames equals the changed value, it takes the changed main pitch as the reference frequency.
  • the present invention also provides a human voice audio playing device, the device comprising a human voice extraction system and a playing system, wherein: the human voice extraction system extracts a human voice signal from the original sound signal using the system described above, and sends the human voice signal to the playing system;
  • the playing system is configured to: linearly combine the human voice signal with the original sound signal and play the result.
  • the above technical solution determines whether a sound is a human voice by taking the main pitch of the sound signal as the reference frequency, which is simpler to implement than existing human voice extraction solutions; moreover, it only needs to find, from the beginning of the original sound signal, a sound signal in which the human voice and the background sound occur together,
  • and does not need to divide the whole original sound signal into a part where both occur simultaneously and a part with only background sound, which reduces the amount of sound data to be preprocessed.
  • FIG. 1 is a flow chart of a voice extraction method according to an embodiment of the present invention.
  • S101: extract, from the beginning of the original sound signal, a sound signal in which the human voice and the background sound occur together, as a sample;
  • for example, a segment of about 10 s may be read from the beginning of the original sound signal and separated into the part where the human voice and the background sound coexist and the part with only background sound; if no part where the human voice and the background sound coexist is found within these 10 s, the next 10 s may be read, and so on until the human voice is found;
  • main pitch detection is also called fundamental frequency (pitch) detection
  • Specific detection steps may include:
  • where $l$ is the filter order, $b$ is the filter bandwidth, and $f$ is the filter center frequency;
  • the data of each channel obtained after the frame passes through the Gammatone filterbank constitutes a basic time-frequency (T-F) unit; according to the auditory characteristics of the human ear, each T-F unit belongs to a single sound source (either the background sound or the human voice);
  • the autocorrelation of each channel is computed to obtain a correlogram; on the correlogram, the highest-intensity peak information of the low-frequency channels and the envelope information of the high-frequency channels are used to determine the fundamental frequency of the frame;
  • the autocorrelation calculation formula is:
  • $A_H(c, m, t) = \frac{1}{N_c} \sum_{n=0}^{N_c - 1} h(c, mT - n)\, h(c, mT - n - t)$
  • where $N_c$ is the frame period (the autocorrelation window size); $h(c, n)$, $n \in [0, N_c)$, is the value of the signal output at channel $c$ and time $n$; $c$ indexes the channel, $m$ the frame, and $t$ the delay, determined by the signal frequency corresponding to the maximum delay of the window; $t$ takes values in 0-12.5 ms, and $T$ is the number of samples corresponding to the frame shift;
  • the multi-channel filter may be a Gammatone filter
  • when merging adjacent time-frequency units belonging to the same sound source, the cross-correlation of adjacent time-frequency units is computed first; if the cross-correlation value of two adjacent time-frequency units is greater than a preset threshold, the adjacent units belong to the same sound source;
  • the cross-correlation calculation formula is: $C_H(c, m) = \sum_{t} \hat{A}_H(c, m, t)\, \hat{A}_H(c + 1, m, t)$, where $\hat{A}_H(c, m, t)$ denotes the normalized $A_H(c, m, t)$
  • because the main pitch changes constantly while a person sings, to ensure that the main pitch used as the reference frequency accurately reflects the human voice, the main pitch must be corrected continually: after all segments of each frame have been judged as human voice segments or not, the main pitch continues to be detected from subsequent adjacent frames; if the main pitch changes, the changed main pitch is taken as the reference frequency, and the segments in those frames continue to be judged. Further, to avoid transient jumps of the main pitch, when checking whether the main pitch of subsequent frames equals the changed value, the changed main pitch is adopted as the reference frequency only if the main pitch of several consecutive subsequent frames equals it. If, after all segments of a frame have been judged, no main pitch can be detected in subsequent adjacent frames (for example, the human voice has disappeared), a sound signal in which the human voice and the background sound occur together is re-extracted as a sample, starting from the current frame.
  • this embodiment also provides a vocal audio playing method.
  • the human voice signal is extracted from the original sound signal by the human voice extraction method described above, and the human voice signal is then linearly combined with the original sound signal and played.
  • superimposing the separated human voice on the original sound achieves a speech-enhancement effect.
  • FIG. 2 is a composition diagram of the human voice extraction system of this embodiment.
  • the system includes a sample extraction unit, a main pitch detection unit, and a human voice detection unit, wherein: the sample extraction unit is configured to extract, from the beginning of the original sound signal, a sound signal in which the human voice and the background sound occur together, as a sample, and to send the sample to the main pitch detection unit;
  • the main pitch detecting unit is configured to detect a main pitch from the sample, and send the main pitch to the vocal detecting unit;
  • the human voice detection unit is configured to take the main pitch as the reference frequency and compare the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency, to determine whether that sound source is a human voice;
  • the human voice detection unit divides the portion of the original sound signal other than the sample into multiple frames, for example one frame every 28 ms; passes each frame of the sound signal through a multi-channel filter to obtain multiple time-frequency units; merges adjacent time-frequency units belonging to the same sound source into one segment; and, if the fundamental frequency of more than half of the time-frequency units within a segment equals the reference frequency, judges the segment to be a human voice segment.
  • because the main pitch changes constantly while a person sings, to ensure that the main pitch used as the reference frequency accurately reflects the human voice, the above-mentioned main pitch detection unit is further configured to continue detecting the main pitch from subsequent adjacent frames after the human voice detection unit finishes a frame, and, if the main pitch changes, to send the changed main pitch to the human voice detection unit as the reference frequency; to avoid transient jumps of the main pitch, when the main pitch detection unit detects in subsequent adjacent frames that the main pitch has changed, it continues to check whether the main pitch of further frames equals the changed value, and only if the main pitch of several consecutive subsequent frames equals the changed value does it send the changed main pitch to the human voice detection unit as the reference frequency.
  • the above-mentioned main pitch detection unit is further configured to, when no main pitch can be detected in subsequent adjacent frames (for example, the human voice has disappeared), re-trigger the sample extraction unit to re-extract, starting from the current frame, a sound signal in which the human voice and the background sound occur together, as a sample.
  • this embodiment also provides a human voice audio playback device.
  • the device comprises the above-mentioned vocal sound extraction system and a playback system;
  • a vocal sound extraction system configured to extract a vocal signal from the original sound signal, and send the vocal signal to the playing system
  • the playing system is configured to linearly combine the vocal signal and the original sound signal to play.
  • the device superimposes the separated human voice on the original sound to achieve a speech-enhancement effect.
  • each module/unit in the foregoing embodiments may be implemented in the form of hardware or in the form of software function modules. The invention is not limited to any specific combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

A human voice extracting method and system, and an audio playing method and device for a human voice. The method comprises: extracting, from the beginning of an original sound signal, a sound signal in which both a human voice and background sound appear, as a sample (S101); detecting a main pitch from the sample (S102); and, using the main pitch as a reference frequency, comparing the fundamental frequency of sound belonging to a same sound source in the portion of the original sound signal other than the sample with the reference frequency, to determine whether the sound source is a human voice (S103). With this method, a human voice can be extracted from mixed audio simply and conveniently.

Description

Human Voice Extraction Method and System, and Human Voice Audio Playing Method and Device

Technical Field

The present invention relates to the field of mixed-audio separation and extraction, and in particular to a human voice extraction method and system, and a human voice audio playing method and device.
Background

To extract the human voice from audio such as two-channel stereo and enhance it, so as to make speech clearer and effectively reduce noise, a sound-separation technique capable of extracting a single audio source from mixed audio is needed. The technology currently able to meet this requirement is mainly audio separation based on Computational Auditory Scene Analysis (CASA).

In Auditory Scene Analysis (ASA), the auditory system uses various characteristics of sound (time domain, frequency domain, spatial position, etc.) to decompose one mixed sound signal into multiple signals, each belonging to a different physical sound source. Computational Auditory Scene Analysis (CASA) uses computer technology to simulate the human auditory system, ultimately giving the computer a sound-discrimination ability similar to that of the human ear. A conventional CASA system first divides the sound into a part where the human voice and the background sound occur together and a part with only background sound; the signals of the part where the two occur together are then decomposed by a multi-channel filter; finally, the signal of each channel is classified to judge whether it belongs to the human voice or the background sound.

However, current methods that use CASA to classify each channel's signal and extract the human voice must jointly consider many features of the audio signal, such as the main pitch, higher harmonics, energy, amplitude modulation, onsets, and offsets; the extraction algorithm is complex and the computation load is large.
Summary of the Invention

The present invention provides a human voice extraction method and system, and a human voice audio playing method and device, to solve the technical problem of how to extract the human voice from mixed audio simply and conveniently.

To solve the above technical problem, the present invention provides a human voice extraction method, the method comprising: extracting, from the beginning of an original sound signal, a sound signal in which the human voice and the background sound occur together, as a sample; detecting a main pitch from the sample; and, taking the main pitch as a reference frequency, comparing the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency, to determine whether that sound source is a human voice.
Preferably,

taking the main pitch as the reference frequency and comparing the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency to determine whether that sound source is a human voice includes:

dividing the portion of the original sound signal other than the sample into multiple frames;

passing each frame of the sound signal through a multi-channel filter to obtain multiple time-frequency units, and merging adjacent time-frequency units belonging to the same sound source into one segment;

if the fundamental frequency of more than half of the time-frequency units within a segment equals the reference frequency, the segment is a human voice segment.

Preferably, the method further includes:

after judging whether all segments of each frame are human voice segments, continuing to detect the main pitch from subsequent adjacent frames; if the main pitch changes, taking the changed main pitch as the reference frequency and continuing to judge whether the segments in the frames are human voice segments.

Preferably,

taking the changed main pitch as the reference frequency if the main pitch changes includes: if the main pitch changes, continuing to check whether the main pitch of subsequent frames equals the changed value; if the main pitch of several consecutive subsequent frames equals the changed value, taking the changed main pitch as the reference frequency.
To solve the above technical problem, the present invention also provides a human voice audio playing method, the method comprising:

extracting a human voice signal from an original sound signal using the method described above;

linearly combining the human voice signal with the original sound signal and playing the result.
To solve the above technical problem, the present invention also provides a human voice extraction system, the system comprising a sample extraction unit, a main pitch detection unit, and a human voice detection unit, wherein,

the sample extraction unit is configured to: extract, from the beginning of an original sound signal, a sound signal in which the human voice and the background sound occur together, as a sample, and send the sample to the main pitch detection unit; the main pitch detection unit is configured to: detect a main pitch from the sample, and send the main pitch to the human voice detection unit;

the human voice detection unit is configured to: take the main pitch as the reference frequency, and compare the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency, to determine whether that sound source is a human voice.
Preferably,

the human voice detection unit being configured to take the main pitch as the reference frequency and compare the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency to determine whether that sound source is a human voice includes:

the human voice detection unit divides the portion of the original sound signal other than the sample into multiple frames; passes each frame of the sound signal through a multi-channel filter to obtain multiple time-frequency units; merges adjacent time-frequency units belonging to the same sound source into one segment; and, if the fundamental frequency of more than half of the time-frequency units within a segment equals the reference frequency, judges the segment to be a human voice segment.

Preferably,

the main pitch detection unit is further configured to: after the human voice detection unit finishes a frame, continue to detect the main pitch from subsequent adjacent frames, and, if the main pitch changes, send the changed main pitch to the human voice detection unit as the reference frequency.

Preferably,

the main pitch detection unit being configured to take the changed main pitch as the reference frequency when the main pitch changes includes:

when the main pitch changes, the main pitch detection unit continues to check whether the main pitch of subsequent frames equals the changed value; if the main pitch of several consecutive subsequent frames equals the changed value, it takes the changed main pitch as the reference frequency.

To solve the above technical problem, the present invention also provides a human voice audio playing device, the device comprising a human voice extraction system and a playing system, wherein: the human voice extraction system extracts a human voice signal from the original sound signal using the system described above, and sends the human voice signal to the playing system;

the playing system is configured to: linearly combine the human voice signal with the original sound signal and play the result.
The above technical solution determines whether a sound is a human voice by taking the main pitch of the sound signal as the reference frequency, which is simpler to implement than existing human voice extraction solutions. Moreover, it only needs to find, from the beginning of the original sound signal, a sound signal in which the human voice and the background sound occur together; it does not need to divide the whole original sound signal into a part where both occur simultaneously and a part with only background sound, which reduces the amount of sound data to be preprocessed.

Brief Description of the Drawings
FIG. 1 is a flowchart of the human voice extraction method of this embodiment;

FIG. 2 is a composition diagram of the human voice extraction system of this embodiment.

Preferred Embodiments of the Invention
Embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be noted that, provided no conflict arises, the embodiments of this application and the features in those embodiments may be combined with one another arbitrarily.

FIG. 1 is a flowchart of the human voice extraction method of this embodiment.

S101: extract, from the beginning of the original sound signal, a sound signal in which the human voice and the background sound occur together, as a sample.

For example, a segment of about 10 s may be read from the beginning of the original sound signal and separated into the part where the human voice and the background sound coexist and the part with only background sound; if no part where the human voice and the background sound coexist is found within these 10 s, the next 10 s may be read, and so on until the human voice is found.
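As a rough illustration of this scanning loop, the sketch below assumes a caller-supplied `has_voice` predicate, since the patent does not specify how the coexistence of voice and background sound within a chunk is detected; any voice-activity detector could be substituted.

```python
import numpy as np

def find_vocal_sample(signal, sr, chunk_seconds=10.0, has_voice=None):
    """Scan ~10 s chunks from the start of the signal until one containing
    both human voice and background sound is found (step S101).

    `has_voice(chunk, sr)` is a hypothetical predicate, not defined by the
    patent, that reports whether a chunk contains voice mixed with
    background sound.
    """
    chunk = int(chunk_seconds * sr)
    for start in range(0, len(signal), chunk):
        candidate = signal[start:start + chunk]
        if has_voice(candidate, sr):
            return candidate, start  # the sample and its offset in samples
    raise ValueError("no segment with voice and background sound was found")
```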
S102: detect the main pitch from the sample.

Main pitch detection is also called fundamental frequency (pitch) detection.

The specific detection steps may include:

1) dividing the sample into frames in the time domain, for example with a 20 ms frame length and a 10 ms frame shift;

2) for each frame:
First, auditory peripheral processing is performed: the frame is filtered with a Gammatone filterbank with N = 128 channels, whose impulse response has the time-domain form

$g(t) = t^{\,l-1}\, e^{-2\pi b t} \cos(2\pi f t), \quad t \ge 0,$

where $l$ is the filter order, $b$ is the filter bandwidth, and $f$ is the filter center frequency. The data of each channel obtained after the frame passes through the Gammatone filterbank constitutes a basic time-frequency (T-F) unit; according to the auditory characteristics of the human ear, each T-F unit belongs to a single sound source (either the background sound or the human voice).
Next, the autocorrelation of each channel is computed to obtain a correlogram; on the correlogram, the highest-intensity peak information of the low-frequency channels and the envelope information of the high-frequency channels are used to determine the fundamental frequency of the frame.

The autocorrelation calculation formula is:

$A_H(c, m, t) = \frac{1}{N_c} \sum_{n=0}^{N_c - 1} h(c, mT - n)\, h(c, mT - n - t)$

where $N_c$ is the frame period (the autocorrelation window size); $h(c, n)$, $n \in [0, N_c)$, is the value of the signal output at channel $c$ and time $n$; $c$ indexes the channel and $m$ the frame; $t$ is the delay, determined by the signal frequency corresponding to the maximum delay of the window, and takes values in 0-12.5 ms; $T$ is the number of samples corresponding to the frame shift.

3) After the fundamental frequency of each frame is obtained, fundamental frequencies with large deviations are excluded, and the average of the remaining fundamental frequencies is taken as the main pitch.
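A minimal NumPy sketch of S102 follows, under stated assumptions: a 4th-order Gammatone with ERB-scale bandwidths (the patent fixes only N = 128 and the variable names l, b, f), simple peak-picking on the channel-summed correlogram in place of the low/high-frequency channel analysis, and a median-based outlier rule for step 3.

```python
import numpy as np

def gammatone_ir(f, sr, l=4, duration=0.04):
    """g(t) = t^(l-1) * exp(-2*pi*b*t) * cos(2*pi*f*t); the ERB-scale
    bandwidth b and the order l = 4 are common choices, assumed here."""
    t = np.arange(int(duration * sr)) / sr
    b = 1.019 * (24.7 + 0.108 * f)
    g = t ** (l - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f * t)
    return g / (np.abs(g).sum() + 1e-12)

def filterbank(frame, sr, n_channels=128, fmin=80.0, fmax=5000.0):
    """Pass one frame through the N = 128 Gammatone channels; each
    channel's output is one basic time-frequency (T-F) unit."""
    freqs = np.geomspace(fmin, fmax, n_channels)
    return np.stack([np.convolve(frame, gammatone_ir(f, sr))[:len(frame)]
                     for f in freqs])

def autocorrelation(h, max_lag):
    """A_H(c, t) = (1/Nc) * sum_n h(c, n) * h(c, n - t), per channel."""
    n_ch, n = h.shape
    A = np.zeros((n_ch, max_lag))
    for t in range(max_lag):
        A[:, t] = (h[:, t:] * h[:, :n - t]).sum(axis=1) / n
    return A

def main_pitch(sample, sr, frame_ms=20, shift_ms=10, f0_max=500.0):
    """Per-frame F0 from the summed correlogram, then the average of the
    frames' F0s after discarding outliers (step 3)."""
    flen = int(sr * frame_ms / 1000)
    shift = int(sr * shift_ms / 1000)
    max_lag = int(0.0125 * sr)            # delay t ranges over 0..12.5 ms
    f0s = []
    for start in range(0, len(sample) - flen, shift):
        units = filterbank(sample[start:start + flen], sr)
        A = autocorrelation(units, max_lag)
        summary = A.sum(axis=0)           # pool channels into one correlogram
        lag_min = int(sr / f0_max)        # ignore lags above f0_max
        lag = lag_min + np.argmax(summary[lag_min:])
        f0s.append(sr / lag)
    f0s = np.asarray(f0s)
    med = np.median(f0s)
    return f0s[np.abs(f0s - med) < 0.2 * med].mean()  # drop large deviations
```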
S103: taking the main pitch as the reference frequency, the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample is compared with the reference frequency to determine whether that sound source is a human voice. This includes:

1) dividing the portion of the original sound signal other than the sample into multiple frames. On the Android platform, sound is treated as a "stream" for input and output: the sound stream is read into a buffer, handed to the relevant functions for processing, and the processed stream is then played; roughly 28 ms elapse from reading the stream into the buffer until it is played, so the portion of the original sound signal other than the sample can be divided into frames of 28 ms each;

2) passing each frame of the sound signal through a multi-channel filter to obtain multiple time-frequency units, and merging adjacent time-frequency units belonging to the same sound source into one segment; in this way, through the merging of time-frequency units, one frame of the signal may contain several segments, and this process is called segmentation;

the multi-channel filter may be a Gammatone filter;

when merging adjacent time-frequency units belonging to the same sound source, the cross-correlation of adjacent time-frequency units is computed first; if the cross-correlation value of two adjacent time-frequency units is greater than a preset threshold, the adjacent units belong to the same sound source;

the cross-correlation calculation formula is:
$C_H(c, m) = \sum_{t} \hat{A}_H(c, m, t)\, \hat{A}_H(c + 1, m, t)$

where $\hat{A}_H(c, m, t)$ denotes the normalized $A_H(c, m, t)$.

3) If the fundamental frequency of more than half of the time-frequency units within a segment equals the reference frequency, the segment is a human voice segment.
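Continuing the NumPy sketch above, the following illustrates steps 2) and 3). Several choices are assumptions: the merge threshold of 0.95 (the patent leaves the preset threshold unspecified), merging across adjacent frequency channels within one frame only (the patent's merging also spans adjacent frames), and a small tolerance band in place of exact equality with the reference frequency, since exact equality rarely holds for sampled signals.

```python
import numpy as np

def merge_into_segments(A, threshold=0.95):
    """Merge adjacent channels whose normalized autocorrelations are
    strongly cross-correlated: C_H(c) = sum_t Ahat(c,t) * Ahat(c+1,t)."""
    Ahat = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-12)
    segments, current = [], [0]
    for c in range(A.shape[0] - 1):
        if float((Ahat[c] * Ahat[c + 1]).sum()) > threshold:
            current.append(c + 1)        # same sound source: extend segment
        else:
            segments.append(current)     # start a new segment
            current = [c + 1]
    segments.append(current)
    return segments

def is_vocal_segment(A, channels, sr, ref_freq, tol=0.05, f0_max=500.0):
    """Step 3: vocal if more than half of the segment's T-F units have a
    fundamental frequency matching the reference frequency."""
    lag_min = int(sr / f0_max)
    hits = 0
    for c in channels:
        lag = lag_min + np.argmax(A[c, lag_min:])
        if abs(sr / lag - ref_freq) <= tol * ref_freq:
            hits += 1
    return hits > len(channels) / 2
```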
Because the main pitch changes constantly while a person sings, to ensure that the main pitch used as the reference frequency accurately reflects the human voice, the main pitch must be corrected continually: after all segments of each frame have been judged as human voice segments or not, the main pitch continues to be detected from subsequent adjacent frames; if the main pitch changes, the changed main pitch is taken as the reference frequency, and the segments in those frames continue to be judged. Further, to avoid transient jumps of the main pitch, when checking whether the main pitch of subsequent frames equals the changed value, the changed main pitch is adopted as the reference frequency only if the main pitch of several consecutive subsequent frames equals it. If, after all segments of a frame have been judged, no main pitch can be detected in subsequent adjacent frames (for example, the human voice has disappeared), a sound signal in which the human voice and the background sound occur together is re-extracted as a sample, starting from the current frame.

This iterative correction of the main pitch can meet real-time processing requirements while keeping the algorithm complexity low.
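A sketch of this iterative update follows; the confirmation count of 3 frames and the 5% change tolerance are assumptions, since the patent requires only that "several consecutive subsequent frames" show the changed value.

```python
class PitchTracker:
    """Track the reference frequency, accepting a changed main pitch only
    after it persists for `confirm` consecutive frames."""

    def __init__(self, ref_freq, confirm=3, tol=0.05):
        self.ref = ref_freq          # current reference frequency
        self.confirm = confirm       # frames needed to confirm a change
        self.tol = tol               # relative tolerance for "equal"
        self._pending = None         # candidate new pitch
        self._count = 0

    def update(self, frame_pitch):
        """Feed the main pitch detected in the next frame. Returns the
        reference frequency to use; None signals that the voice has
        disappeared and a new sample must be extracted from this frame."""
        if frame_pitch is None:      # no main pitch detected
            return None
        if abs(frame_pitch - self.ref) <= self.tol * self.ref:
            self._pending, self._count = None, 0   # no change
        elif (self._pending is not None
              and abs(frame_pitch - self._pending) <= self.tol * self._pending):
            self._count += 1
            if self._count >= self.confirm:        # change confirmed
                self.ref, self._pending, self._count = frame_pitch, None, 0
        else:
            self._pending, self._count = frame_pitch, 1  # new candidate
        return self.ref
```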
Based on the above human voice extraction method, this embodiment also provides a human voice audio playing method. In this method, the human voice signal is first extracted from the original sound signal using the human voice extraction method described above, and the human voice signal is then linearly combined with the original sound signal and played. Superimposing the separated human voice on the original sound achieves a speech-enhancement effect.
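The linear combination itself is a simple weighted mix; the gains below are illustrative assumptions, as the patent does not fix the combination weights.

```python
import numpy as np

def enhance(original, voice, voice_gain=1.0, original_gain=1.0):
    """Play-out signal = a linear combination of the extracted human voice
    and the original sound signal."""
    n = min(len(original), len(voice))
    mixed = original_gain * original[:n] + voice_gain * voice[:n]
    return mixed / max(1.0, float(np.abs(mixed).max()))  # guard against clipping
```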
FIG. 2 is a composition diagram of the human voice extraction system of this embodiment.

The system includes a sample extraction unit, a main pitch detection unit, and a human voice detection unit, wherein: the sample extraction unit is configured to extract, from the beginning of the original sound signal, a sound signal in which the human voice and the background sound occur together, as a sample, and to send the sample to the main pitch detection unit;

the main pitch detection unit is configured to detect the main pitch from the sample and send it to the human voice detection unit;

the human voice detection unit is configured to take the main pitch as the reference frequency and compare the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency, to determine whether that sound source is a human voice.

The human voice detection unit divides the portion of the original sound signal other than the sample into multiple frames, for example one frame every 28 ms to suit the sound-processing mechanism of the Android platform; passes each frame of the sound signal through a multi-channel filter to obtain multiple time-frequency units; merges adjacent time-frequency units belonging to the same sound source into one segment; and, if the fundamental frequency of more than half of the time-frequency units within a segment equals the reference frequency, judges the segment to be a human voice segment.

Because the main pitch changes constantly while a person sings, to ensure that the main pitch used as the reference frequency accurately reflects the human voice, the main pitch detection unit is further configured to continue detecting the main pitch from subsequent adjacent frames after the human voice detection unit finishes a frame, and, if the main pitch changes, to send the changed main pitch to the human voice detection unit as the reference frequency. To avoid transient jumps of the main pitch, when the main pitch detection unit detects in subsequent adjacent frames that the main pitch has changed, it continues to check whether the main pitch of further frames equals the changed value, and only if the main pitch of several consecutive subsequent frames equals the changed value does it send the changed main pitch to the human voice detection unit as the reference frequency.

The main pitch detection unit is further configured to, when no main pitch can be detected in subsequent adjacent frames (for example, the human voice has disappeared), re-trigger the sample extraction unit to re-extract, starting from the current frame, a sound signal in which the human voice and the background sound occur together, as a sample.
Based on the above human voice extraction system, this embodiment also provides a human voice audio playing device. The device includes the above human voice extraction system and a playing system;

the human voice extraction system is configured to extract a human voice signal from the original sound signal and send the human voice signal to the playing system;

the playing system is configured to linearly combine the human voice signal with the original sound signal and play the result. By superimposing the separated human voice on the original sound, the device achieves a speech-enhancement effect.
Those of ordinary skill in the art will appreciate that all or some of the steps of the above methods may be carried out by a program instructing the relevant hardware, and the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc. Optionally, all or some of the steps of the above embodiments may also be implemented using one or more integrated circuits; accordingly, each module/unit in the above embodiments may be implemented in the form of hardware or in the form of software function modules. The present invention is not limited to any specific combination of hardware and software.

It should be noted that the present invention may have various other embodiments, and corresponding changes and variations may be made without departing from the spirit and essence of the present invention; all such changes and variations shall fall within the protection scope of the claims appended to the present invention.
Industrial Applicability

The above technical solution determines whether a sound is a human voice by taking the main pitch of the sound signal as the reference frequency, which is simpler to implement than existing human voice extraction solutions. Moreover, it only needs to find, from the beginning of the original sound signal, a sound signal in which the human voice and the background sound occur together; it does not need to divide the whole original sound signal into a part where both occur simultaneously and a part with only background sound, which reduces the amount of sound data to be preprocessed.

Claims

1. A human voice extraction method, the method comprising:

extracting, from the beginning of an original sound signal, a sound signal in which a human voice and a background sound occur together, as a sample; detecting a main pitch from the sample;

taking the main pitch as a reference frequency, comparing the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency, to determine whether that sound source is a human voice.
2. The method according to claim 1, wherein,

taking the main pitch as the reference frequency and comparing the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency to determine whether that sound source is a human voice comprises:

dividing the portion of the original sound signal other than the sample into multiple frames;

passing each frame of the sound signal through a multi-channel filter to obtain multiple time-frequency units, and merging adjacent time-frequency units belonging to the same sound source into one segment;

if the fundamental frequency of more than half of the time-frequency units within a segment equals the reference frequency, the segment is a human voice segment.
3. The method according to claim 2, wherein the method further comprises:

after judging whether all segments of each frame are human voice segments, continuing to detect the main pitch from subsequent adjacent frames; if the main pitch changes, taking the changed main pitch as the reference frequency and continuing to judge whether the segments in the frames are human voice segments.

4. The method according to claim 3, wherein,

taking the changed main pitch as the reference frequency if the main pitch changes comprises: if the main pitch changes, continuing to check whether the main pitch of subsequent frames equals the changed value; if the main pitch of several consecutive subsequent frames equals the changed value, taking the changed main pitch as the reference frequency.
5. A human voice audio playing method, the method comprising:

extracting a human voice signal from an original sound signal using the method according to any one of claims 1 to 4; linearly combining the human voice signal with the original sound signal and playing the result.
6. A human voice extraction system, the system comprising a sample extraction unit, a main pitch detection unit, and a human voice detection unit, wherein,

the sample extraction unit is configured to: extract, from the beginning of an original sound signal, a sound signal in which a human voice and a background sound occur together, as a sample, and send the sample to the main pitch detection unit; the main pitch detection unit is configured to: detect a main pitch from the sample, and send the main pitch to the human voice detection unit;

the human voice detection unit is configured to: take the main pitch as a reference frequency, and compare the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency, to determine whether that sound source is a human voice.
7. The system according to claim 6, wherein,

the human voice detection unit being configured to take the main pitch as the reference frequency and compare the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency to determine whether that sound source is a human voice comprises:

the human voice detection unit divides the portion of the original sound signal other than the sample into multiple frames; passes each frame of the sound signal through a multi-channel filter to obtain multiple time-frequency units; merges adjacent time-frequency units belonging to the same sound source into one segment; and, if the fundamental frequency of more than half of the time-frequency units within a segment equals the reference frequency, judges the segment to be a human voice segment.
8. The system according to claim 7, wherein,

the main pitch detection unit is further configured to: after the human voice detection unit finishes a frame, continue to detect the main pitch from subsequent adjacent frames, and, if the main pitch changes, send the changed main pitch to the human voice detection unit as the reference frequency.

9. The system according to claim 8, wherein,

the main pitch detection unit being configured to take the changed main pitch as the reference frequency when the main pitch changes comprises:

when the main pitch changes, the main pitch detection unit continues to check whether the main pitch of subsequent frames equals the changed value; if the main pitch of several consecutive subsequent frames equals the changed value, it takes the changed main pitch as the reference frequency.
10. A human voice audio playing device, the device comprising a human voice extraction system and a playing system, wherein:

the human voice extraction system extracts a human voice signal from an original sound signal using the system according to any one of claims 6 to 9, and sends the human voice signal to the playing system;

the playing system is configured to: linearly combine the human voice signal with the original sound signal and play the result.
PCT/CN2013/082328 2013-03-29 2013-08-27 Human voice extracting method and system, and audio playing method and device for human voice WO2014153922A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310108032.9A CN104078051B (en) 2013-03-29 2013-03-29 A kind of voice extracting method, system and voice audio frequency playing method and device
CN201310108032.9 2013-03-29

Publications (1)

Publication Number Publication Date
WO2014153922A1 true WO2014153922A1 (en) 2014-10-02

Family

ID=51599272

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/082328 WO2014153922A1 (en) 2013-03-29 2013-08-27 Human voice extracting method and system, and audio playing method and device for human voice

Country Status (2)

Country Link
CN (1) CN104078051B (en)
WO (1) WO2014153922A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105321526B (en) * 2015-09-23 2020-07-24 联想(北京)有限公司 Audio processing method and electronic equipment
CN106571150B (en) * 2015-10-12 2021-04-16 阿里巴巴集团控股有限公司 Method and system for recognizing human voice in music
CN105632489A (en) * 2016-01-20 2016-06-01 曾戟 Voice playing method and voice playing device
CN105719657A (en) * 2016-02-23 2016-06-29 惠州市德赛西威汽车电子股份有限公司 Human voice extracting method and device based on microphone
CN105810212B (en) * 2016-03-07 2019-04-23 合肥工业大学 A kind of train under complicated noise is blown a whistle recognition methods
CN108962277A (en) * 2018-07-20 2018-12-07 广州酷狗计算机科技有限公司 Speech signal separation method, apparatus, computer equipment and storage medium
CN109036455B (en) * 2018-09-17 2020-11-06 中科上声(苏州)电子有限公司 Direct sound and background sound extraction method, loudspeaker system and sound reproduction method thereof
CN109524016B (en) * 2018-10-16 2022-06-28 广州酷狗计算机科技有限公司 Audio processing method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05210397A (en) * 1992-01-30 1993-08-20 Fujitsu Ltd Voice recognizing device
JP2003058186A (en) * 2001-08-13 2003-02-28 Yrp Kokino Idotai Tsushin Kenkyusho:Kk Method and device for suppressing noise
CN1808571A (en) * 2005-01-19 2006-07-26 松下电器产业株式会社 Acoustical signal separation system and method
CN101601088A (en) * 2007-09-11 2009-12-09 松下电器产业株式会社 Sound judgment means, sound detection device and sound determination methods
CN102054480A (en) * 2009-10-29 2011-05-11 北京理工大学 Method for separating monaural overlapping speeches based on fractional Fourier transform (FrFT)
CN102402977A (en) * 2010-09-14 2012-04-04 无锡中星微电子有限公司 Method for extracting accompaniment and human voice from stereo music and device of method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1271593C (en) * 2004-12-24 2006-08-23 北京中星微电子有限公司 Voice signal detection method
CN1945689B (en) * 2006-10-24 2011-04-27 北京中星微电子有限公司 Method and its device for extracting accompanying music from songs
CN101193460B (en) * 2006-11-20 2011-09-28 松下电器产业株式会社 Sound detection device and method
KR101459766B1 (en) * 2008-02-12 2014-11-10 삼성전자주식회사 Method for recognizing a music score image with automatic accompaniment in a mobile device
CN101577117B (en) * 2009-03-12 2012-04-11 无锡中星微电子有限公司 Extracting method of accompaniment music and device
CN102945675A (en) * 2012-11-26 2013-02-27 江苏物联网研究发展中心 Intelligent sensing network system for detecting outdoor sound of calling for help

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05210397A (en) * 1992-01-30 1993-08-20 Fujitsu Ltd Voice recognizing device
JP2003058186A (en) * 2001-08-13 2003-02-28 Yrp Kokino Idotai Tsushin Kenkyusho:Kk Method and device for suppressing noise
CN1808571A (en) * 2005-01-19 2006-07-26 松下电器产业株式会社 Acoustical signal separation system and method
CN101601088A (en) * 2007-09-11 2009-12-09 松下电器产业株式会社 Sound judgment means, sound detection device and sound determination methods
CN102054480A (en) * 2009-10-29 2011-05-11 北京理工大学 Method for separating monaural overlapping speeches based on fractional Fourier transform (FrFT)
CN102402977A (en) * 2010-09-14 2012-04-04 无锡中星微电子有限公司 Method for extracting accompaniment and human voice from stereo music and device of method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI, YIPENG ET AL.: "Separation of Singing Voice From Music Accompaniment for Monaural Recordings", IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, vol. 15, no. 4, May 2007 (2007-05-01), pages 1475 - 1487 *

Also Published As

Publication number Publication date
CN104078051B (en) 2018-09-25
CN104078051A (en) 2014-10-01

Similar Documents

Publication Publication Date Title
WO2014153922A1 (en) Human voice extracting method and system, and audio playing method and device for human voice
KR101726208B1 (en) Volume leveler controller and controlling method
JP4906230B2 (en) A method for time adjustment of audio signals using characterization based on auditory events
US7974838B1 (en) System and method for pitch adjusting vocals
CN105518778B (en) Wobble buffer controller, audio decoder, method and computer readable storage medium
JP6212567B2 (en) System, computer-readable storage medium and method for recovering compressed audio signals
TW202004736A (en) Systems and methods for intelligent voice activation for auto mixing
CN112400325A (en) Data-driven audio enhancement
JP5737808B2 (en) Sound processing apparatus and program thereof
US8489404B2 (en) Method for detecting audio signal transient and time-scale modification based on same
JP2009511954A (en) Neural network discriminator for separating audio sources from mono audio signals
US9502047B2 (en) Talker collisions in an auditory scene
CN103155030A (en) Method and apparatus for processing a multi-channel audio signal
Vestergaard et al. The mutual roles of temporal glimpsing and vocal characteristics in cocktail-party listening
Hummersone A psychoacoustic engineering approach to machine sound source separation in reverberant environments
Buyens et al. A stereo music preprocessing scheme for cochlear implant users
CN108965904B (en) Volume adjusting method and client of live broadcast room
CN112053669B (en) Method, device, equipment and medium for eliminating human voice
RU2009116279A (en) METHODS AND DEVICES FOR CODING AND DECODING OF OBJECT-ORIENTED AUDIO SIGNALS
JP2008102551A (en) Apparatus for processing voice signal and processing method thereof
CN106328159B (en) Audio stream processing method and device
JP2011013383A (en) Audio signal correction device and audio signal correction method
JP4826814B2 (en) Audio signal processing device
JP6313619B2 (en) Audio signal processing apparatus and program
CN110475144A (en) The extracting method of 16 channel audios in a kind of 12G-SDI data flow based on FPGA

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13879827

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13879827

Country of ref document: EP

Kind code of ref document: A1