WO2010037251A1 - Method and device for human voice discrimination - Google Patents

Method and device for human voice discrimination

Info

Publication number
WO2010037251A1
Authority
WO
WIPO (PCT)
Prior art keywords
segment
current frame
vocal
discrimination
sliding
Prior art date
Application number
PCT/CN2009/001037
Other languages
English (en)
French (fr)
Inventor
谢湘勇
陈展
Original Assignee
炬力集成电路设计有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 炬力集成电路设计有限公司
Priority to US13/001,596 (published as US20110166857A1)
Priority to EP09817165.5A (granted as EP2328143B8)
Publication of WO2010037251A1

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals

Definitions

  • The present invention relates to the field of audio processing technologies, and in particular to a method and apparatus for vocal (human voice) discrimination.
  • Vocal discrimination determines whether human speech is present in an audio signal.
  • Vocal discrimination has its own particular operating environment and requirements. On the one hand, there is no need to know what the speaker is saying, only whether someone is speaking; on the other hand, the discrimination must be performed in real time. In addition, the overhead on system hardware and software must be considered, and the hardware and software requirements kept as low as possible.
  • Existing vocal discrimination techniques mainly fall into two categories. One extracts characteristic parameters of the audio signal and detects the human voice by exploiting the differences between those parameters when a human voice is present and when it is absent.
  • The characteristic parameters mainly used for vocal discrimination include the energy value, the zero-crossing rate, autocorrelation coefficients, and the cepstrum.
  • The other technique applies linguistic principles: linear predictive cepstral coefficients or Mel-frequency cepstral coefficients are extracted from the audio signal as features, and vocal discrimination is then performed by template matching.
  • Existing vocal discrimination techniques have the following shortcomings:
  • Characteristic parameters such as the energy value, the zero-crossing rate, and autocorrelation coefficients do not reflect the difference between human voice and non-human sound well, leading to poor detection results; and computing cepstral coefficients followed by template matching is overly complex and computationally expensive, occupying too many hardware and software resources.
  • Embodiments of the present invention provide a method and apparatus for vocal discrimination that can discriminate the human voice in an audio signal accurately with little computational overhead.
  • The method takes every n sample points of the current frame of the audio signal as one segment (n a positive integer), judges whether two adjacent segments in the current frame undergo a transition relative to a discrimination threshold (their sliding maximum absolute values being respectively greater than and less than the threshold), and, if so, determines that the current frame is a human voice. The sliding maximum absolute value of a segment is obtained by the method described below.
  • A vocal discrimination device, configured to discriminate the human voice in an externally input audio signal, includes:
  • a segmentation module, configured to take every n sample points of the current frame of the audio signal as one segment, where n is a positive integer;
  • a sliding maximum absolute value module, configured to obtain the sliding maximum absolute value of each segment; the sliding maximum absolute value of any segment is obtained by taking the maximum of the absolute intensities of the sample points in the segment as the initial maximum absolute value of the segment, and then taking the maximum among the initial maximum absolute values of the segment and of the m segments after it, where m is a positive integer;
  • a transition judging module, configured to judge whether there are two adjacent segments in the current frame that undergo a transition relative to the discrimination threshold, their sliding maximum absolute values being respectively greater than and less than the discrimination threshold; and
  • a vocal discrimination module, configured to determine that the current frame is a human voice when the judgment result of the transition judging module is yes.
  • Figure 1 shows, as an example, the time-domain waveform of pure human voice;
  • Figure 2 shows, as an example, the time-domain waveform of pure music;
  • Figure 3 shows, as an example, the time-domain waveform of pop music with singing;
  • Figure 4 is the sliding maximum absolute value curve obtained from the pure human voice shown in Figure 1;
  • Figure 5 is the sliding maximum absolute value curve obtained from the pure music shown in Figure 2;
  • Figure 6 is the sliding maximum absolute value curve obtained from the pop music with singing shown in Figure 3;
  • Figure 7 is the time-domain waveform of a recorded broadcast program;
  • Figure 8 is the sliding maximum absolute value curve obtained from the time-domain waveform shown in Figure 7, with the discrimination threshold included;
  • Figure 9 is a flowchart of human voice discrimination according to an embodiment of the present invention;
  • Figure 10 shows the relationship between the sliding maximum absolute value of typical human voice and the discrimination threshold;
  • Figure 11 shows the relationship between the sliding maximum absolute value of typical non-human sound and the discrimination threshold;
  • Figure 12 is a block diagram of the vocal discrimination device according to an embodiment of the present invention.
  • Figures 1 to 3 give three example time-domain waveforms.
  • The abscissa is the index of the audio signal sample point and the ordinate is the intensity of the sample point; the sampling rate is 44100 Hz, as it is in all of the following figures.
  • Figure 1 is the time-domain waveform of pure human voice;
  • Figure 2 is the time-domain waveform of pure music;
  • Figure 3 is the time-domain waveform of pop music with singing, which can be regarded as the superposition of human voice and music.
  • Vocal discrimination judges whether human speech appears in the audio signal; if the audio signal is such a superposition of voice and music, it is still regarded as containing no human voice.
  • Human speech rises and falls, with pauses between syllables where the sound intensity is very weak, so its time-domain waveform varies sharply; non-human sound lacks this typical feature. To bring this feature out more clearly, Figures 1 to 3 are converted into curves of the sliding maximum absolute value, shown in Figures 4 to 6 respectively; the abscissa is still the index of the audio signal sample point.
  • The ordinate is the sliding maximum absolute intensity of the audio signal sample point (i.e., the sliding maximum absolute value).
  • The largest absolute intensity (the absolute intensity being the absolute value of the intensity) among m consecutive audio signal sample points is taken as the sliding maximum absolute value of the first of those m points, where m is a positive integer; m is called the sliding length. The biggest difference between Figure 4 and Figures 5 and 6 is whether zero values appear in the curve.
  • The waveform characteristics of the human voice cause its sliding maximum absolute value to reach zero, whereas non-human sound such as music never does.
  • Alternatively, n consecutive sample points can be treated as one segment, with the absolute intensity of that segment of the audio signal represented by the maximum absolute intensity of its sample points, and the sliding maximum absolute value of the segment represented by the maximum absolute intensity over the segment and the m consecutive segments after it, where n and m are positive integers. The abscissa of the sliding maximum absolute value curve can therefore also denote the segment number after segmentation, and the ordinate the sliding maximum absolute value of each segment; Figures 4 to 6 correspond to the special case n = 1.
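  • As a concrete illustration, the following Python sketch computes the per-segment initial maximum absolute value and the sliding maximum absolute value described above. It is a minimal sketch assuming NumPy and one frame of PCM data; the function and parameter names are illustrative, not part of the patent.

```python
import numpy as np

def sliding_max_abs(frame: np.ndarray, n: int, m: int) -> np.ndarray:
    """Per-segment sliding maximum absolute value (illustrative sketch).

    frame: 1-D array of PCM sample intensities
    n:     sample points per segment
    m:     number of following segments in the sliding window
    """
    num_segments = len(frame) // n
    # Initial maximum absolute value: the largest absolute intensity
    # among the n sample points of each segment.
    initial_max = np.abs(frame[:num_segments * n]).reshape(num_segments, n).max(axis=1)
    # Sliding maximum absolute value: the largest initial maximum over
    # the segment itself and the m segments after it (fewer remain near
    # the end of the frame).
    return np.array([initial_max[i:i + m + 1].max() for i in range(num_segments)])
```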
  • The present invention achieves vocal discrimination by exploiting the property that the sliding maximum absolute value of the human voice reaches zero.
  • In practice, the environment in which a person speaks is never absolutely quiet and is always mixed with some non-human sound. An appropriate discrimination threshold must therefore be determined; if the sliding maximum absolute value curve crosses the discrimination threshold curve, a human voice is indicated.
  • Figure 7 is the time-domain waveform of a recorded broadcast program; the first part is the host speaking, followed by a pop song.
  • Its sliding maximum absolute value curve is shown in Figure 8.
  • The abscissa in Figures 7 and 8 is the index of the audio signal sample point; the ordinate of Figure 7 is the intensity of the sample point, and the ordinate of Figure 8 is its sliding maximum absolute value.
  • Human voice and non-human sound can be distinguished by choosing an appropriate discrimination threshold.
  • The horizontal solid line in Figure 8 indicates the discrimination threshold.
  • In the part where the host speaks, the sliding maximum absolute value curve intersects the horizontal solid line; in the part where the pop song plays, it no longer does.
  • In this application, an intersection of the sliding maximum value curve with the discrimination threshold curve is called a transition of the sliding maximum absolute value relative to the discrimination threshold, or simply a transition.
  • The number of times the sliding maximum value curve intersects the discrimination threshold curve is called the number of transitions.
  • The discrimination threshold in Figure 8 is a constant value; in practical applications it can be adjusted dynamically according to the intensity of the audio signal.
  • A human voice discrimination method according to the first embodiment of the present invention, for discriminating the human voice in an externally input audio signal, takes every n sample points of the current frame as one segment (n a positive integer), judges whether two adjacent segments in the current frame undergo a transition relative to the discrimination threshold, and, if so, determines that the current frame is a human voice.
  • The sliding maximum absolute value of a segment is obtained by the following method:
  • take the maximum of the absolute intensities of the sample points in the segment as the initial maximum absolute value of the segment, then take the maximum among the initial maximum absolute values of the segment and of the m segments after it as the sliding maximum absolute value of the segment, where m is a positive integer.
  • Step 901: Initialize parameters.
  • The initialized parameters may include the frame length of the audio signal, the discrimination threshold, the sliding length, the number of transitions, and the number of delay frames. The initial values of the number of delay frames and the number of transitions may be zero.
  • As for choosing the discrimination threshold, one may start from the maximum absolute intensity: take one K-th of the maximum absolute intensity of the pulse code modulation (PCM) data points (i.e., signal sample points) in the current frame of the audio signal and before it, where K is a positive number. Different values of K give different discrimination power; K = 8 is recommended as giving good results.
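  • A minimal sketch of this dynamic threshold, assuming the running maximum is tracked across frames (the names are illustrative):

```python
import numpy as np

def update_threshold(running_max: float, frame: np.ndarray, K: float = 8.0):
    """Return the updated running maximum absolute intensity over the
    current frame and all earlier frames, and the discrimination
    threshold as one K-th of it (K = 8 is the suggested value)."""
    running_max = max(running_max, float(np.abs(frame).max()))
    return running_max, running_max / K
```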
  • Figure 10 shows the relationship between the sliding maximum absolute value of typical human voice and the discrimination threshold.
  • Figure 11 shows the same relationship for typical non-human sound; experiments show that non-human sound may also cross the threshold line.
  • The distributions of vocal and non-vocal transitions differ, however.
  • The time interval between two adjacent transitions of the human voice is large, while that of non-human sound is small. Therefore, to further avoid misjudgment, the time interval between two adjacent transitions may be called the transition length, and the current frame is considered a human voice only when a transition occurs and the transition length is greater than a preset transition length.
  • The solution of the invention can be applied to real-time processing.
  • After the current audio signal has been discriminated it has already been played, so it can no longer be processed; only the audio signal after it can be.
  • Human speech has a certain continuity, so a delay frame number k can be set.
  • Once the current frame is judged to be a human voice, the audio signals of the k consecutive frames after it can be directly regarded as human voice, and those k frames are processed as human voice, where k is a positive integer, for example 5. The human voice in the audio signal can thereby be processed in real time.
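  • A sketch of this delay-frame mechanism (the class and method names are illustrative assumptions):

```python
class DelayGate:
    """Once a frame is judged vocal, treat the next k frames as vocal
    without re-testing, so later frames can be processed in real time."""

    def __init__(self, k: int = 5):
        self.k = k
        self.remaining = 0

    def treat_as_vocal(self, detected_vocal: bool) -> bool:
        if detected_vocal:
            self.remaining = self.k
            return True
        if self.remaining > 0:
            self.remaining -= 1
            return True  # still within the k-frame continuation window
        return False
```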
  • Step 902: Take every n sample points of the current frame as one segment, where n is a positive integer, and take the maximum of the absolute intensities of the sample points in each segment as the initial maximum absolute value of the segment.
  • The audio sampling rate commonly used for pop music is 44100 Hz, i.e., 44100 sample points per second; for other sampling rates, n can be adjusted accordingly. Keeping a sliding maximum for every sample point would cost too much storage: with a frame length of 4096 and a sliding length of 2048, 4096 + 2048 storage units would be needed. With n = 256 (a resolution found experimentally to work well) and the sliding length still 2048, a frame comprises 16 segments and the sliding length 8 segments, requiring only 16 + 8 = 24 storage units.
  • Step 903: For each segment, take the maximum among the initial maximum absolute values of the segment and of the segments within the sliding length after it as the sliding maximum absolute value of the segment.
  • Step 904: Update the discrimination threshold according to the maximum absolute intensity of the PCM data points in the current frame of the audio signal and before it; and check whether the number of delay frames is zero. If it is zero, go directly to step 905; if it is non-zero, decrement it by 1 and process the current frame of the audio signal as a human voice.
  • The processing depends on the specific application, for example a muting process.
  • After the audio signal within the delay frames has been processed as a human voice, the flow may return to step 902 to continue the discrimination for the next frame (not shown in the figure).
  • Step 905: According to the sliding maximum absolute values of the segments in the current frame of the audio signal and the discrimination threshold, judge whether the sliding maximum absolute value in the current frame undergoes a transition relative to the discrimination threshold.
  • Specifically, for every segment of the current frame except the first, compute (sliding maximum absolute value of the current segment − discrimination threshold) × (sliding maximum absolute value of the previous segment − discrimination threshold); if the product is less than 0, a transition has occurred and the number of transitions is incremented by 1, otherwise there is no transition.
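  • A sketch of this sign-change test over one frame's segments (the function name is illustrative):

```python
import numpy as np

def count_transitions(sliding_max: np.ndarray, threshold: float) -> int:
    """Count threshold crossings: for each segment after the first, the
    product (current - threshold) * (previous - threshold) is negative
    exactly when the curve crosses the discrimination threshold."""
    d = sliding_max - threshold
    return int(np.sum(d[1:] * d[:-1] < 0))
```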
  • Step 906: Judge whether the audio signal is a human voice according to the distribution of the transitions.
  • The transition density is the number of transitions occurring per unit time. Check whether the transition density over the recent period meets a predetermined criterion.
  • The predetermined criterion includes a maximum transition density and a minimum transition density, which define the upper and lower bounds of the transition density.
  • The predetermined criterion can be derived by training on standard vocal signals. If the transition density is below the upper bound and above the lower bound, and the transition length is greater than the standard transition length, the current frame of the audio signal is a human voice; otherwise it is not.
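  • One way this decision could look in code; the window size and the density bounds are illustrative assumptions that, as the text notes, would be trained on standard vocal signals:

```python
def is_vocal(transition_positions: list[int], now: int, window: int,
             min_density: float, max_density: float, min_length: int) -> bool:
    """Decide from the transition distribution: the density over a
    recent window must lie between trained bounds, and the gap between
    the last two transitions (the transition length) must exceed a
    preset minimum."""
    recent = [p for p in transition_positions if now - p <= window]
    density = len(recent) / window
    length_ok = len(recent) >= 2 and recent[-1] - recent[-2] > min_length
    return min_density < density < max_density and length_ok
```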
  • If the current frame of the audio signal is judged to be a human voice, the number of delay frames is set to the predetermined value and then step 907 is performed; if it is judged not to be a human voice, step 907 is performed directly.
  • Step 907: Judge whether the vocal discrimination is finished; if so, end the flow, otherwise return to step 902 and continue the discrimination for the next frame.
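  • Tying steps 901-907 together, here is a minimal per-frame pipeline sketch. All names and parameter values are illustrative assumptions, not the patent's reference implementation:

```python
import numpy as np

def discriminate(frames, n=256, m=8, K=8.0, k_delay=5,
                 window=64, dmin=0.02, dmax=0.5, min_length=4):
    """Yield (frame_index, is_vocal) for an iterable of 1-D PCM frames,
    following steps 901-907."""
    running_max, delay, transitions, seg_count = 0.0, 0, [], 0
    for idx, frame in enumerate(frames):
        running_max = max(running_max, float(np.abs(frame).max()))
        threshold = running_max / K                       # step 904: threshold
        if delay > 0:                                     # step 904: delay frames
            delay -= 1
            yield idx, True
            continue
        segs = len(frame) // n                            # step 902: segment
        init = np.abs(frame[:segs * n]).reshape(segs, n).max(axis=1)
        smax = np.array([init[i:i + m + 1].max() for i in range(segs)])  # step 903
        d = smax - threshold                              # step 905: transitions
        transitions += [seg_count + i + 1 for i in np.flatnonzero(d[1:] * d[:-1] < 0)]
        seg_count += segs
        recent = [p for p in transitions if seg_count - p <= window]
        density = len(recent) / window                    # step 906: distribution
        vocal = (dmin < density < dmax and len(recent) >= 2
                 and recent[-1] - recent[-2] > min_length)
        if vocal:
            delay = k_delay                               # next k frames are vocal
        yield idx, vocal                                  # step 907 loops on
```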
  • An embodiment of the present invention further provides a device for vocal discrimination, shown in Figure 12.
  • It includes: a segmentation module 1201, configured to take every n sample points of the current frame of the audio signal as one segment, where n is a positive integer;
  • a sliding maximum absolute value module 1202, configured to obtain the sliding maximum absolute value of each segment; the sliding maximum absolute value of any segment is obtained by taking the maximum of the absolute intensities of the sample points in the segment as the initial maximum absolute value of the segment, and then taking the maximum among the initial maximum absolute values of the segment and of the m segments after it, where m is a positive integer;
  • a transition judging module 1203, configured to judge whether there are two adjacent segments in the current frame that undergo a transition relative to the discrimination threshold, their sliding maximum absolute values being respectively greater than and less than the discrimination threshold; and
  • a vocal discrimination module 1204, configured to determine that the current frame is a human voice when the transition judging module finds two adjacent segments in which a transition occurs.
  • The vocal discrimination device may further include a transition count judging module, configured to judge whether the number of transitions of adjacent segments in the current frame per unit time lies within a preset range;
  • the vocal discrimination module is then configured to determine that the current frame is a human voice when the judgment results of both the transition judging module and the transition count judging module are yes.
  • The vocal discrimination device may further include a transition interval judging module, configured to judge whether the time interval between two adjacent transitions in the current frame is greater than a preset value;
  • the vocal discrimination module is then configured to determine that the current frame is a human voice when the judgment results of both the transition judging module and the transition interval judging module are yes.
  • The transition judging module 1203 includes: a calculation unit 12031, configured to compute, for every segment of the current frame except the first, the difference between the sliding maximum absolute value of the segment and the discrimination threshold, and the difference between the sliding maximum absolute value of the previous segment and the discrimination threshold, and to multiply the two differences;
  • a judging unit 12032, configured to judge whether there is at least one segment in the current frame for which the computed product is less than 0; if so, two adjacent segments undergoing a transition exist; otherwise they do not.
  • The vocal discrimination module 1204 is further configured to, after determining that the current frame is a human voice, directly determine the k frames after the current frame to be human voice, where k is a preset positive integer.
  • Embodiments of the invention provide a vocal discrimination scheme suitable for portable multimedia players, requiring little computation and little storage space.
  • Taking the sliding maximum over time-domain data reflects the characteristics of human and non-human sound well, and using transitions as the judgment criterion avoids the problem of inconsistent criteria caused by different volumes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Telephone Function (AREA)

Description

Method and device for human voice discrimination

Technical Field

The present invention relates to the field of audio processing technologies, and in particular to a method and device for human voice discrimination.

Background Art

Human voice discrimination, as the name suggests, determines whether human speech appears in an audio signal. It has its own particular operating environment and requirements. On the one hand, there is no need to know what the speaker is saying; the only concern is whether someone is speaking. On the other hand, the discrimination must be performed in real time. In addition, the overhead on system hardware and software must be considered, and the hardware and software requirements kept as low as possible.

Existing human voice discrimination techniques mainly fall into two categories. One starts from extracting characteristic parameters of the audio signal and detects the human voice by exploiting the differences between those parameters when a human voice is present and when it is absent; the characteristic parameters mainly used at present include the energy value, the zero-crossing rate, autocorrelation coefficients, and the cepstrum. The other technique applies linguistic principles: linear predictive cepstral coefficients or Mel-frequency cepstral coefficients are extracted from the audio signal as features, and human voice discrimination is then performed by template matching.

Existing human voice discrimination techniques have the following shortcomings:

1. Characteristic parameters such as the energy value, the zero-crossing rate, and autocorrelation coefficients do not reflect the difference between human voice and non-human sound well, leading to poor detection results;

2. Computing linear predictive cepstral coefficients or Mel-frequency cepstral coefficients and then performing discrimination by template matching is overly complex and computationally expensive, occupies too many hardware and software resources, and is therefore impractical.

Summary of the Invention
In view of this, embodiments of the present invention provide a method and device for human voice discrimination that can discriminate the human voice in an audio signal fairly accurately with very little computational overhead.

A human voice discrimination method provided by an embodiment of the present invention, for discriminating the human voice in an externally input audio signal, includes:

taking every n sample points of the current frame of the audio signal as one segment, where n is a positive integer;

judging whether there are, in the current frame, two adjacent segments that undergo a transition relative to a discrimination threshold, the sliding maximum absolute values of the two adjacent segments being respectively greater than and less than the discrimination threshold; and if so, determining that the current frame is human voice;

where the sliding maximum absolute value of a segment is obtained as follows:

take the maximum of the absolute intensities of the sample points in the segment as the initial maximum absolute value of the segment;

take the maximum among the initial maximum absolute values of the segment and of the m segments after it as the sliding maximum absolute value of the segment, where m is a positive integer.

A human voice discrimination device provided by an embodiment of the present invention, for discriminating the human voice in an externally input audio signal, includes:

a segmentation module, configured to take every n sample points of the current frame of the audio signal as one segment, where n is a positive integer;

a sliding maximum absolute value module, configured to obtain the sliding maximum absolute value of each segment; the sliding maximum absolute value of any segment is obtained by taking the maximum of the absolute intensities of the sample points in the segment as the initial maximum absolute value of the segment, and then taking the maximum among the initial maximum absolute values of the segment and of the m segments after it, where m is a positive integer;

a transition judging module, configured to judge whether there are, in the current frame, two adjacent segments that undergo a transition relative to the discrimination threshold, the sliding maximum absolute values of the two adjacent segments being respectively greater than and less than the discrimination threshold; and

a human voice discrimination module, configured to determine that the current frame is human voice when the judgment result of the transition judging module is yes.

As can be seen from the above technical solutions, distinguishing human voice from non-human sound by transitions of the sliding maximum absolute value of the audio signal relative to a discrimination threshold reflects the characteristics of human and non-human sound well, and requires little computation and storage space.

Brief Description of the Drawings
Figure 1 shows, as an example, the time-domain waveform of pure human voice;

Figure 2 shows, as an example, the time-domain waveform of pure music;

Figure 3 shows, as an example, the time-domain waveform of pop music with singing;

Figure 4 is the sliding maximum absolute value curve obtained from the pure human voice shown in Figure 1;

Figure 5 is the sliding maximum absolute value curve obtained from the pure music shown in Figure 2;

Figure 6 is the sliding maximum absolute value curve obtained from the pop music with singing shown in Figure 3;

Figure 7 is the time-domain waveform of a recorded broadcast program;

Figure 8 is the sliding maximum absolute value curve obtained from the time-domain waveform shown in Figure 7, with the discrimination threshold included;

Figure 9 is a flowchart of human voice discrimination according to an embodiment of the present invention;

Figure 10 shows the relationship between the sliding maximum absolute value of typical human voice and the discrimination threshold;

Figure 11 shows the relationship between the sliding maximum absolute value of typical non-human sound and the discrimination threshold;

Figure 12 is a block diagram of the human voice discrimination device according to an embodiment of the present invention.

Detailed Description
Before describing specific embodiments of the present invention, the principle on which the solution is based is first introduced. Figures 1 to 3 give three example time-domain waveforms; the abscissa is the index of the audio signal sample point and the ordinate is the intensity of the sample point, with a sampling rate of 44100 Hz. The sampling rate in all of the following figures is likewise 44100 Hz. Figure 1 is the time-domain waveform of pure human voice; Figure 2 is the time-domain waveform of pure music; Figure 3 is the time-domain waveform of pop music with singing, which can be regarded as the superposition of human voice and music. Human voice discrimination judges whether human speech appears in the audio signal; if the audio signal is such a superposition of voice and music, it is still regarded as containing no human voice.

Observing the waveforms in Figures 1 to 3, the time-domain plot of human voice differs markedly from that of non-human sound. Human speech rises and falls, with pauses between syllables where the sound intensity is very weak; in the time-domain waveform this appears as very sharp variation, whereas non-human sound lacks this typical feature. To bring out this feature more clearly, Figures 1 to 3 are converted into curves of the sliding maximum absolute value, shown in Figures 4 to 6 respectively. The abscissa is still the index of the audio signal sample point, while the ordinate is the sliding maximum absolute intensity of the sample point (that is, the sliding maximum absolute value). Here, the largest absolute intensity (the absolute intensity being the absolute value of the intensity) among m consecutive sample points is taken as the sliding maximum absolute value of the first of those m points, where m is a positive integer; m is called the sliding length. The biggest difference between Figure 4 and Figures 5 and 6 is whether zero values appear in the curve: the waveform characteristics of human voice cause its sliding maximum absolute value to reach zero, whereas non-human sound such as music never does. Of course, n consecutive sample points may be treated as one segment, with the absolute intensity of that segment of the audio signal represented by the maximum absolute intensity of its sample points, and the sliding maximum absolute value of the segment represented by the maximum absolute intensity over the segment and the m consecutive segments after it, where n and m are positive integers. The abscissa of the sliding maximum absolute value curve can therefore also denote the segment number after segmentation, and the ordinate the sliding maximum absolute value of each segment. Figures 4, 5, and 6 can be viewed as the special case in which each sample point forms its own segment, i.e., n = 1.

The solution of the present invention exploits this property, that the sliding maximum absolute value of the human voice reaches zero, to discriminate the human voice. In practice, however, the environment in which a person speaks cannot be absolutely quiet and is always mixed with some non-human sound. An appropriate discrimination threshold must therefore be determined; if the sliding maximum absolute value curve crosses the discrimination threshold curve, a human voice is indicated.

Figure 7 is the time-domain waveform of a recorded broadcast program; the first part is the host speaking, followed by a pop song. Its sliding maximum absolute value curve is shown in Figure 8. The abscissa in Figures 7 and 8 is the index of the audio signal sample point; the ordinate of Figure 7 is the intensity of the sample point, and the ordinate of Figure 8 is its sliding maximum absolute value. Human voice and non-human sound can be distinguished by choosing an appropriate discrimination threshold, shown as the horizontal solid line in Figure 8. In the part where the host speaks, the sliding maximum absolute value curve intersects this line; in the part where the pop song plays, it no longer does. In this patent application, an intersection of the sliding maximum value curve with the discrimination threshold curve is called a transition of the sliding maximum absolute value relative to the discrimination threshold, or simply a transition; the number of such intersections is called the number of transitions. Note that the discrimination threshold in Figure 8 is a constant value; in practical applications it can be adjusted dynamically according to the intensity of the audio signal.
A human voice discrimination method according to the first embodiment of the present invention, for discriminating the human voice in an externally input audio signal, includes:

taking every n sample points of the current frame of the audio signal as one segment, where n is a positive integer;

judging whether there are, in the current frame, two adjacent segments that undergo a transition relative to the discrimination threshold, the sliding maximum absolute values of the two adjacent segments being respectively greater than and less than the discrimination threshold; and if so, determining that the current frame is human voice;

where the sliding maximum absolute value of a segment is obtained as follows:

take the maximum of the absolute intensities of the sample points in the segment as the initial maximum absolute value of the segment;

take the maximum among the initial maximum absolute values of the segment and of the m segments after it as the sliding maximum absolute value of the segment, where m is a positive integer.
The specific flow of human voice discrimination in the second embodiment of the present invention, shown in Figure 9, includes the following steps.

Step 901: Initialize parameters. The initialized parameters may include the frame length of the audio signal, the discrimination threshold, the sliding length, the number of transitions, and the number of delay frames. The initial values of the number of delay frames and the number of transitions may be zero.

As for choosing the discrimination threshold, one may start from the maximum absolute intensity and take one K-th of the maximum absolute intensity of the pulse code modulation (PCM) data points (i.e., signal sample points) in the current frame of the audio signal and before it, where K is a positive number. Different values of K give different discrimination power; K = 8 is recommended as giving good results. Experiments show that non-human sound may also cross this line. Figure 10 shows the relationship between the sliding maximum absolute value of typical human voice and the discrimination threshold, and Figure 11 shows the same for typical non-human sound; in both, the abscissa is the sample point index and the ordinate the sliding maximum absolute value. It can be seen that the distributions of vocal and non-vocal transitions differ: the time interval between two adjacent transitions of human voice is large, while that of non-human sound is small. Therefore, to further avoid misjudgment, the time interval between two adjacent transitions is called the transition length, and the current frame is considered human voice only when a transition occurs and the transition length is greater than a preset transition length.

The solution of the present invention can be applied to real-time processing. After the current audio signal has been discriminated it has already been played, so it can no longer be processed; only the audio signal after it can be. Since human speech has a certain continuity, a delay frame number k can be set: once the current frame is judged to be human voice, the k consecutive frames after it can be directly regarded and processed as human voice, where k is a positive integer, for example 5. The human voice in the audio signal can thereby be processed in real time.

Step 902: Take every n sample points of the current frame as one segment, where n is a positive integer, and take the maximum of the absolute intensities of the sample points in each segment as the initial maximum absolute value of the segment.

The audio sampling rate commonly used for pop music is 44100 Hz, i.e., 44100 sample points per second; for other sampling rates the parameter n can be adjusted accordingly. The following takes the 44100 Hz rate as an example. If a sliding maximum absolute value were kept for every sample point, the storage cost would be too high: with a frame length of 4096 and a sliding length of 2048, 4096 + 2048 storage units would be needed, which is clearly far too many. The inventors found experimentally that a resolution of 256 points works well. Preferably, therefore, n may be set to 256 with the sliding length still 2048, so that a frame comprises 16 segments and the sliding length 8 segments, requiring only 16 + 8 = 24 storage units.
Step 903: For each segment, take the maximum among the initial maximum absolute values of the segment and of the segments within the sliding length after it as the sliding maximum absolute value of the segment.

For example, the maximum of the initial maximum absolute values of segments 1 to 9 is taken as the sliding maximum absolute value of segment 1; the maximum of those of segments 2 to 10 as that of segment 2; and so on.

Step 904: Update the discrimination threshold according to the maximum absolute intensity of the PCM data points in the current frame and before it; and check whether the number of delay frames is zero. If it is zero, go directly to step 905; if it is non-zero, decrement it by 1 and process the current frame as human voice. The processing depends on the specific application, for example muting.

After the audio signal within the delay frames has been processed as human voice, the flow may return to step 902 to continue the discrimination for the next frame (not shown in the figure).

Step 905: According to the sliding maximum absolute values of the segments in the current frame and the discrimination threshold, judge whether the sliding maximum absolute value in the current frame undergoes a transition relative to the threshold. Specifically, for every segment of the current frame except the first, compute:

(sliding maximum absolute value of the current segment − discrimination threshold) × (sliding maximum absolute value of the previous segment − discrimination threshold);

if the product is less than 0, a transition has occurred and the number of transitions is incremented by 1; otherwise there is no transition.

Step 906: Judge whether the audio signal is human voice according to the distribution of the transitions.

Specifically, this may include:

judging whether the transition density and the transition length meet the requirements. The transition density is the number of transitions occurring per unit time. Check whether the transition density over the recent period meets a predetermined criterion. The criterion includes a maximum and a minimum transition density, i.e., upper and lower bounds on the transition density, and can be obtained by training on standard human voice signals. If the transition density is below the upper bound and above the lower bound, and the transition length is greater than the standard transition length, the current frame of the audio signal is human voice; otherwise it is not.

If the current frame is judged to be human voice, set the number of delay frames to the predetermined value and then go to step 907; if it is judged not to be human voice, go directly to step 907.

Step 907: Judge whether the human voice discrimination is finished. If so, end this flow; otherwise return to step 902 and continue the discrimination for the next frame.
An embodiment of the present invention further provides a device for human voice discrimination, shown in Figure 12, including:

a segmentation module 1201, configured to take every n sample points of the current frame of the audio signal as one segment, where n is a positive integer;

a sliding maximum absolute value module 1202, configured to obtain the sliding maximum absolute value of each segment; the sliding maximum absolute value of any segment is obtained by taking the maximum of the absolute intensities of the sample points in the segment as the initial maximum absolute value of the segment, and then taking the maximum among the initial maximum absolute values of the segment and of the m segments after it, where m is a positive integer;

a transition judging module 1203, configured to judge whether there are, in the current frame, two adjacent segments that undergo a transition relative to the discrimination threshold, the sliding maximum absolute values of the two adjacent segments being respectively greater than and less than the discrimination threshold; and

a human voice discrimination module 1204, configured to determine that the current frame is human voice when the transition judging module finds two adjacent segments in which a transition occurs.

In further embodiments of the human voice discrimination device of the present invention, the device also includes a transition count judging module, configured to judge whether the number of transitions of adjacent segments in the current frame per unit time lies within a preset range; the human voice discrimination module is then configured to determine that the current frame is human voice when the judgment results of both the transition judging module and the transition count judging module are yes.

In further embodiments, the device also includes a transition interval judging module, configured to judge whether the time interval between two adjacent transitions in the current frame is greater than a preset value; the human voice discrimination module is then configured to determine that the current frame is human voice when the judgment results of both the transition judging module and the transition interval judging module are yes.

In further embodiments, the transition judging module 1203 includes: a calculation unit 12031, configured to compute, for every segment of the current frame except the first, the difference between the sliding maximum absolute value of the segment and the discrimination threshold, and the difference between the sliding maximum absolute value of the previous segment and the discrimination threshold, and to multiply the two differences; and a judging unit 12032, configured to judge whether there is at least one segment in the current frame for which the computed product is less than 0; if so, two adjacent segments undergoing a transition exist; otherwise they do not.

The human voice discrimination module 1204 is further configured to, after determining that the current frame is human voice, directly determine the k frames after the current frame to be human voice, where k is a preset positive integer.

From the description of the above embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software on the necessary hardware platform, or entirely in hardware, though in many cases the former is the better implementation. On this understanding, all or part of the contribution of the technical solution of the present invention over the background art can be embodied in the form of a software product. The computer software product can be stored on a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a portable media player, or another electronic product with a media playback function) to execute the methods described in the embodiments of the present invention, or in certain parts of the embodiments.
Embodiments of the present invention provide a human voice discrimination scheme suitable for portable multimedia players, requiring little computation and little storage space. In the solution of the embodiments, taking the sliding maximum over time-domain data reflects the characteristics of human and non-human sound well, and using transitions as the judgment criterion avoids the problem of inconsistent criteria caused by different volumes.

The above are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

Claims

1. A human voice discrimination method for discriminating the human voice in an externally input audio signal, characterized by comprising:

taking every n sample points of the current frame of the audio signal as one segment, where n is a positive integer;

judging whether there are, in the current frame, two adjacent segments that undergo a transition relative to a discrimination threshold, the sliding maximum absolute values of the two adjacent segments being respectively greater than and less than the discrimination threshold; and if so, determining that the current frame is human voice;

wherein the sliding maximum absolute value of a segment is obtained as follows:

taking the maximum of the absolute intensities of the sample points in the segment as the initial maximum absolute value of the segment;

taking the maximum among the initial maximum absolute values of the segment and of the m segments after it as the sliding maximum absolute value of the segment, where m is a positive integer.
2. The human voice discrimination method according to claim 1, wherein determining that the current frame is human voice comprises:

judging whether the number of transitions of adjacent segments in the current frame per unit time lies within a preset range; and if so, determining that the current frame is human voice.

3. The human voice discrimination method according to claim 1, wherein determining that the current frame is human voice comprises:

judging whether the time interval between two adjacent transitions in the current frame is greater than a preset value; and if so, determining that the current frame is human voice.

4. The human voice discrimination method according to claim 1, wherein, when the sampling rate of the audio signal is 44100, n is set to 256.

5. The human voice discrimination method according to claim 1, wherein judging whether there are, in the current frame, two adjacent segments that undergo a transition relative to the discrimination threshold specifically comprises:

for every segment of the current frame except the first, computing the difference between the sliding maximum absolute value of the segment and the discrimination threshold, and the difference between the sliding maximum absolute value of the previous segment and the discrimination threshold, and multiplying the two differences;

judging whether there is at least one segment in the current frame for which the computed product is less than 0; if so, two adjacent segments undergoing a transition exist; otherwise they do not.

6. The human voice discrimination method according to any one of claims 1-5, wherein the discrimination threshold of each frame of the audio signal is a constant value.

7. The human voice discrimination method according to any one of claims 1-5, wherein the discrimination threshold of each frame of the audio signal is adjustable.

8. The human voice discrimination method according to any one of claims 1-5, wherein the discrimination threshold of the current frame is one K-th of the maximum absolute intensity of the sample points in the current frame and before it, where K is a positive number.

9. The human voice discrimination method according to claim 8, wherein K is 8.

10. The human voice discrimination method according to any one of claims 1-5, further comprising, after determining that the current frame is human voice:

directly determining the k frames after the current frame to be human voice, where k is a preset positive integer.
11. A human voice discrimination device for discriminating the human voice in an externally input audio signal, characterized by comprising:

a segmentation module, configured to take every n sample points of the current frame of the audio signal as one segment, where n is a positive integer;

a sliding maximum absolute value module, configured to obtain the sliding maximum absolute value of each segment; the sliding maximum absolute value of any segment is obtained by taking the maximum of the absolute intensities of the sample points in the segment as the initial maximum absolute value of the segment, and then taking the maximum among the initial maximum absolute values of the segment and of the m segments after it, where m is a positive integer;

a transition judging module, configured to judge whether there are, in the current frame, two adjacent segments that undergo a transition relative to the discrimination threshold, the sliding maximum absolute values of the two adjacent segments being respectively greater than and less than the discrimination threshold; and

a human voice discrimination module, configured to determine that the current frame is human voice when the judgment result of the transition judging module is yes.

12. The human voice discrimination device according to claim 11, further comprising a transition count judging module, configured to judge whether the number of transitions of adjacent segments in the current frame per unit time lies within a preset range;

wherein the human voice discrimination module is configured to determine that the current frame is human voice when the judgment results of both the transition judging module and the transition count judging module are yes.

13. The human voice discrimination device according to claim 11, further comprising a transition interval judging module, configured to judge whether the time interval between two adjacent transitions in the current frame is greater than a preset value;

wherein the human voice discrimination module is configured to determine that the current frame is human voice when the judgment results of both the transition judging module and the transition interval judging module are yes.

14. The human voice discrimination device according to claim 11, wherein the transition judging module comprises:

a calculation unit, configured to compute, for every segment of the current frame except the first, the difference between the sliding maximum absolute value of the segment and the discrimination threshold, and the difference between the sliding maximum absolute value of the previous segment and the discrimination threshold, and to multiply the two differences;

a judging unit, configured to judge whether there is at least one segment in the current frame for which the computed product is less than 0; if so, two adjacent segments undergoing a transition exist; otherwise they do not.

15. The human voice discrimination device according to any one of claims 11-14, wherein the human voice discrimination module is further configured to, after determining that the current frame is human voice, directly determine the k frames after the current frame to be human voice, where k is a preset positive integer.
PCT/CN2009/001037 2008-09-26 2009-09-15 Method and device for human voice discrimination WO2010037251A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US13/001,596 US20110166857A1 (en) 2008-09-26 2009-09-15 Human Voice Distinguishing Method and Device
EP09817165.5A EP2328143B8 (en) 2008-09-26 2009-09-15 Human voice distinguishing method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200810167142.1A CN101359472B (zh) 2008-09-26 2008-09-26 一种人声判别的方法和装置
CN200810167142.1 2008-09-26

Publications (1)

Publication Number Publication Date
WO2010037251A1 true WO2010037251A1 (zh) 2010-04-08

Family

ID=40331902

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2009/001037 WO2010037251A1 (zh) 2008-09-26 2009-09-15 一种人声判别的方法和装置

Country Status (4)

Country Link
US (1) US20110166857A1 (zh)
EP (1) EP2328143B8 (zh)
CN (1) CN101359472B (zh)
WO (1) WO2010037251A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101359472B (zh) * 2008-09-26 2011-07-20 炬力集成电路设计有限公司 Method and device for human voice discrimination
CN104916288B (zh) * 2014-03-14 2019-01-18 深圳Tcl新技术有限公司 Method and device for emphasizing the human voice in audio
CN109545191B (zh) * 2018-11-15 2022-11-25 电子科技大学 Real-time detection method for the starting position of the human voice in a song
CN110890104B (zh) * 2019-11-26 2022-05-03 思必驰科技股份有限公司 Voice endpoint detection method and system
CN113131965B (zh) * 2021-04-16 2023-11-07 成都天奥信息科技有限公司 Remote control device for a civil aviation VHF air-ground communication radio and human voice discrimination method


Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6236964B1 (en) * 1990-02-01 2001-05-22 Canon Kabushiki Kaisha Speech recognition apparatus and method for matching inputted speech and a word generated from stored referenced phoneme data
US6411928B2 (en) * 1990-02-09 2002-06-25 Sanyo Electric Apparatus and method for recognizing voice with reduced sensitivity to ambient noise
US6314392B1 (en) * 1996-09-20 2001-11-06 Digital Equipment Corporation Method and apparatus for clustering-based signal segmentation
US6507814B1 (en) * 1998-08-24 2003-01-14 Conexant Systems, Inc. Pitch determination using speech classification and prior pitch estimation
US7127392B1 (en) * 2003-02-12 2006-10-24 The United States Of America As Represented By The National Security Agency Device for and method of detecting voice activity
JP3963850B2 * 2003-03-11 2007-08-22 富士通株式会社 Voice segment detection device
DE10327239A1 * 2003-06-17 2005-01-27 Opticom Dipl.-Ing. Michael Keyhl Gmbh Device and method for extracting a test signal section from an audio signal
FI118704B * 2003-10-07 2008-02-15 Nokia Corp Method and device for performing source coding
US20050096900A1 (en) * 2003-10-31 2005-05-05 Bossemeyer Robert W. Locating and confirming glottal events within human speech signals
US7672835B2 (en) * 2004-12-24 2010-03-02 Casio Computer Co., Ltd. Voice analysis/synthesis apparatus and program
WO2006135986A1 (en) * 2005-06-24 2006-12-28 Monash University Speech analysis system
US8175868B2 (en) * 2005-10-20 2012-05-08 Nec Corporation Voice judging system, voice judging method and program for voice judgment
US8121835B2 (en) * 2007-03-21 2012-02-21 Texas Instruments Incorporated Automatic level control of speech signals
GB2450886B (en) * 2007-07-10 2009-12-16 Motorola Inc Voice activity detector and a method of operation
US8630848B2 (en) * 2008-05-30 2014-01-14 Digital Rise Technology Co., Ltd. Audio signal transient detection
US20100017203A1 (en) * 2008-07-15 2010-01-21 Texas Instruments Incorporated Automatic level control of speech signals
JP2011065093A * 2009-09-18 2011-03-31 Toshiba Corp Audio signal correction device and audio signal correction method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5457769A * 1993-03-30 1995-10-10 Earmark, Inc. Method and apparatus for detecting the presence of human voice signals in audio signals
JPH07287589A * 1994-04-15 1995-10-31 Toyo Commun Equip Co Ltd Voice segment detection device
US5991277A * 1995-10-20 1999-11-23 Vtel Corporation Primary transmission site switching in a multipoint videoconference environment based on human voice
JP2001166783A * 1999-12-10 2001-06-22 Sanyo Electric Co Ltd Voice segment detection method
CN1584974A * 2003-08-19 2005-02-23 扬智科技股份有限公司 Method and related device for judging whether a low-frequency sound signal is mixed into a sound signal
CN101359472A * 2008-09-26 2009-02-04 炬力集成电路设计有限公司 Method and device for human voice discrimination

Also Published As

Publication number Publication date
EP2328143B1 (en) 2016-04-13
CN101359472A (zh) 2009-02-04
EP2328143B8 (en) 2016-06-22
EP2328143A1 (en) 2011-06-01
US20110166857A1 (en) 2011-07-07
EP2328143A4 (en) 2012-06-13
CN101359472B (zh) 2011-07-20

Similar Documents

Publication Publication Date Title
JP7150939B2 Volume leveler controller and control method
US10789290B2 (en) Audio data processing method and apparatus, and computer storage medium
JP6185457B2 Efficient content classification and loudness estimation
US8193436B2 (en) Segmenting a humming signal into musical notes
JP2000511651A Non-uniform time-scale modification of recorded audio signals
Molina et al. SiPTH: Singing transcription based on hysteresis defined on the pitch-time curve
JP5593244B2 Speech rate conversion ratio determination device, speech rate conversion device, program, and recording medium
US9892758B2 (en) Audio information processing
JPH06332492A Voice detection method and detection device
JPH0990974A Signal processing method
WO2010037251A1 Method and device for human voice discrimination
Rossignol et al. Feature extraction and temporal segmentation of acoustic signals
CN105706167B Method and apparatus for voiced speech detection
JP3607450B2 Audio information classification device
JP4696418B2 Information detection device and method
Jang et al. Enhanced Feature Extraction for Speech Detection in Media Audio.
CN114972592A Method, device, and electronic equipment for generating singing mouth shapes and facial animation
JP2011013383A Audio signal correction device and audio signal correction method
CN112786071A Data annotation method for speech segments in voice interaction scenarios
WO2004077381A1 (en) A voice playback system
Zeng et al. Adaptive context recognition based on audio signal
JP2006154531A Speech speed conversion device, speech speed conversion method, and speech speed conversion program
JP2004341340A Speaker recognition device
JPH10133678A Voice playback device
JPH09146575A Speaking rate detection method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09817165

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2009817165

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE