WO2013060079A1 - Method and system for detecting recording playback attacks based on channel mode noise - Google Patents

Method and system for detecting recording playback attacks based on channel mode noise

Info

Publication number
WO2013060079A1
Authority
WO
WIPO (PCT)
Prior art keywords
mode noise
channel mode
channel
speech signal
noise
Prior art date
Application number
PCT/CN2011/084868
Other languages
English (en)
French (fr)
Inventor
贺前华
王志锋
罗海宇
陈芬
Original Assignee
华南理工大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华南理工大学
Publication of WO2013060079A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Definitions

  • The present invention relates to the fields of intelligent speech signal processing, pattern recognition and artificial intelligence, and more particularly to a method and system for detecting recording playback attacks in a speaker recognition system based on channel mode noise.
  • Speaker recognition systems are widely used, for example in judicial forensics, e-commerce and financial systems.
  • Two common attacks faced by speaker recognition systems are speaker impersonation attacks and recording playback attacks.
  • In a speaker impersonation attack, the attacker attacks the system by imitating the voice of a user enrolled in the speaker recognition system.
  • Speaker recognition experiments on twin speech corpora show that existing speaker recognition techniques can distinguish between twin voices with similar acoustic characteristics. An impersonation attack therefore requires very good imitation skills, so that the attacker's voice is highly similar to the system user's voice, which makes impersonation attacks hard to carry out.
  • In a recording playback attack, the attacker secretly records the voice of a user of the speaker recognition system with a high-fidelity recording device in advance, and then plays it back at the system input through a high-fidelity power amplifier to attack the speaker recognition system.
  • For a text-dependent speaker recognition system, a playback attack can be carried out by secretly recording the user's voice while the user enters the system, or by secretly recording a large amount of the user's speech and splicing syllables.
  • For a text-independent system, a playback attack can be carried out with only part of the user's speech. Compared with impersonated speech, recorded and played-back speech genuinely comes from the user, so it poses a greater threat to the speaker recognition system.
  • In addition, high-fidelity recording and playback devices with good performance keep emerging, are becoming cheaper and smaller, and are easy to carry and hard to notice, which makes recording playback attacks ever easier.
  • the object of the present invention is to overcome the defects and deficiencies of the prior art, and to provide a recording and playback attack detection method based on channel mode noise, which can be used in a speaker recognition system to improve the success rate of recording and playback attack detection.
  • A recording playback attack detection method based on channel mode noise, characterized in that the method comprises the following steps:
  • (1) input the speech signal to be recognized;
  • (2) pre-process the speech signal;
  • (3) extract the channel mode noise from the pre-processed speech signal;
  • (4) extract long-term statistical features based on the channel mode noise;
  • (5) classify the long-term statistical features according to the channel noise classification decision model to obtain the decision result of the recording playback attack detection.
  • the step (2) preprocessing includes pre-emphasis, framing, and windowing.
  • the step (3) includes the following steps:
  • The statistical frame is obtained by performing a discrete Fourier transform on the short-time frames of the speech signal and taking the average of the same frequency components.
  • the step (4) includes the following steps:
  • the six statistical characteristics of the step (42) are the minimum, maximum, mean, median, standard deviation, and difference between the maximum and minimum values of the channel mode noise.
  • the establishment of the channel noise classification decision model of the step (5) includes the following steps:
  • a system for implementing the above method comprising:
  • An input module for inputting a training or to be recognized voice signal
  • a pre-processing module for pre-processing the voice signal, including pre-emphasis, framing, and windowing unit;
  • a channel mode noise extraction module configured to extract channel mode noise in the preprocessed speech signal
  • a long-term statistical feature extraction module for extracting long-term statistical features based on channel mode noise
  • a channel noise model module for classifying long-term statistical features of the training by using a support vector machine to establish a channel noise classification decision model
  • An identification decision module configured to classify the long-term statistical features of the speech signal to be recognized by using the channel noise classification decision model, and to obtain the decision result of the recording playback attack detection;
  • An output module configured to output a decision result of the voice signal to be recognized.
  • the basic principle of the present invention is to perform recording playback attack detection by extracting channel mode noise of a speech signal.
  • In a speaker recognition system, the original speech is the user's speech as collected directly by the system;
  • the playback speech is the speech used in a recording playback attack.
  • Before entering the recording channel of the speaker recognition system, the playback speech has additionally gone through one recording and one playback process.
  • Different recording and playback devices introduce their own channel noise (microphones, speakers, dither circuits, preamplifiers, power amplifiers, input and output filters, A/D and D/A converters, sample-and-hold circuits and so on all introduce noise). This channel noise is superimposed on the playback speech, leaving subtle differences between the playback speech and the original speech.
  • The present invention refers to the noise introduced by the transducers (microphones, speakers) and the various circuits in different recording and playback devices as channel mode noise.
  • The original speech contains the channel mode noise of the system's recording device, whereas the playback speech contains not only the channel mode noise of the system but also that of the covert recording device and the playback device. Recording playback attack detection can therefore be performed by extracting the channel mode noise from the speech to be recognized.
  • The present invention extracts channel mode noise with a denoising filter, extracts long-term statistical features from the channel mode noise, and then uses a support vector machine to build a channel noise model that decides whether the input of the speaker recognition system is a recording playback attack.
  • Compared with existing recording playback attack detection methods, the invention has the following advantages and beneficial effects: (1) It can be applied to text-dependent as well as text-independent speaker recognition systems.
  • (2) The classification of original and playback speech can take place either before or after speaker recognition, so the channel noise model can be used to build a front-end or a back-end recording playback attack detector, which makes the application of the detection algorithm more flexible.
  • Figure 1 is a block diagram of the system of the present invention.
  • Figure 2 is a flow chart of channel mode noise extraction and long-term feature extraction based on channel mode noise.
  • Figure 3 is a flow chart of statistical frame extraction.
  • Figure 4 is a comparison diagram after connecting the speaker recognition system.
  • the recording playback attack detection method of the present invention can be implemented in an embedded system as follows:
  • a training voice is input, which includes an original voice signal and a playback voice signal.
  • Step (2): pre-process the input speech signal, including pre-emphasis, framing and windowing.
  • Pre-emphasis is high-pass filtering of the speech signal with the transfer function $H(z) = 1 - \alpha z^{-1}$, where $\alpha = 0.975$.
  • For framing, the frame length is 512 points and the frame shift is 256 points.
  • The window applied to the speech signal is a Hamming window, defined as $\omega_H(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right)$ for $0 \le n \le N-1$ and $0$ otherwise.
  • step (3) the channel mode noise in the pre-processed speech signal is extracted, and the extraction step is as shown in FIG. 2 .
  • the extraction of channel mode noise is divided into the following steps:
  • Step S301 the pre-processed voice in step (2) is input to the channel mode noise extraction module 300;
  • Step S302 the signal in step S301 is subjected to denoising filtering processing through a denoising filter, and the denoising filter is designed as follows:
  • Step S303: perform statistical frame analysis separately on the signal denoised in step S302 and on the speech signal of step S301 that has not been denoised.
  • The statistical frame is the average value of the same frequency components over the short-time frames of the speech signal.
  • Step S3031: perform a discrete Fourier transform on the signals from steps S301 and S302. Step S3032: for the transformed signal of step S3031, superimpose the same frequency components across all frames. Step S3033: average the superimposed spectrum of step S3032 to obtain the statistical frame of the input signal.
  • Step S304: compute the log power spectrum of each of the two statistical frames from step S303, then subtract the denoised signal from the signal that has not been denoised to obtain the channel mode noise of the input speech signal.
  • Here Defilt(·) is the denoising filter designed in step S302.
  • Step (4): on the basis of the channel mode noise obtained in the previous step, extract two groups of long-term statistical features: the Legendre polynomial coefficients of orders 0 to 5, and six statistical features of the channel mode noise.
  • Step S401: extract the Legendre polynomial coefficients by fitting Legendre polynomials of orders 0 to 5 to the extracted channel mode noise.
  • The Legendre polynomial expansion has the form $f(x) = \sum_{n=0}^{5} L_n P_n(x)$, where $L_n$ are the Legendre polynomial coefficients.
  • A Legendre polynomial expansion of the extracted channel mode noise yields the coefficients L0 to L5.
  • Each Legendre polynomial coefficient captures one aspect of the channel mode noise: L0, its DC component; L1, the slope of its distribution curve; L2, the curvature of the curve; L3, the S-curvature of the curve; L4 and L5, finer details of the curve.
  • Step S402 extracting statistical features based on channel mode noise, and the set of statistical features includes the following six characteristics:
  • PN_min: the minimum value of the channel mode noise;
  • PN_max: the maximum value of the channel mode noise;
  • PN_mean: the mean of the channel mode noise;
  • PN_median: the median of the channel mode noise;
  • PN_diff: the difference between the maximum and minimum values;
  • PN_stdev: the standard deviation of the channel mode noise.
  • the two sets of long-term statistical features are combined into a set of 12-dimensional long-term statistical feature vectors, which are used as feature vectors for recording playback attack detection.
  • Step (5) establishing a support vector machine channel noise classification decision model for distinguishing whether the input speech to be recognized is original speech or playback speech.
  • The support vector machine builds the channel noise model from positive and negative samples.
  • The positive samples are the long-term statistical features based on channel mode noise obtained from original speech signals through steps (2) to (4) above.
  • The negative samples are the long-term statistical features based on channel mode noise obtained from playback speech signals through steps (2) to (4) above.
  • The classification margin equals $2/\|w\|$, so maximizing the margin is equivalent to minimizing $\|w\|^2$. The separating surface that satisfies the normalization constraint and minimizes $\|w\|^2$ is called the optimal separating surface, and the training samples lying on it are called support vectors.
  • In practice, speech samples cannot be completely noise-free and completely linearly separable, so the support vector machine classifier is used in the linearly non-separable case.
  • The penalty factor C and the kernel parameter γ are determined by the SMO (Sequential Minimal Optimization) algorithm and a grid search, and are used to train the channel noise model.
  • Step (6): classify original and playback speech. The speech signal to be recognized is input, its long-term statistical features based on channel mode noise are obtained through steps (2) to (4) above, the channel noise model established in step (5) is used for recording playback attack detection, and finally the decision result is output.
  • a recording playback attack detection system of the present invention includes:
  • An input module 100 configured to input a training or to be recognized voice signal
  • a preprocessing module 200 configured to preprocess the voice signal, including pre-emphasis, framing, and windowing unit;
  • A channel mode noise extraction module 300 configured to extract channel mode noise from the pre-processed speech signal;
  • A long-term statistical feature extraction module 400 configured to extract long-term statistical features based on channel mode noise;
  • A channel noise model module 500 configured to classify the long-term statistical features of the training speech by using a support vector machine to establish a channel noise classification decision model;
  • An identification decision module 600 configured to use the channel noise model module to determine whether the input speech to be recognized is recording playback attack speech;
  • the output module 700 is configured to output a determination result of the voice signal to be recognized.
  • The recording playback attack detection method based on channel mode noise provided by the invention was compared with a sentence similarity comparison method on the Authentic and Playback Speech Database (APSD). As shown in Table 1, the method based on channel mode noise has a lower error rate.
  • The recording playback attack detectors built with the two methods were each connected to an actual speaker recognition system.
  • For data containing playback attack speech, the speaker recognition system without a playback attack detection module has a high error rate and low security.
  • With the playback attack detection module based on channel mode noise loaded, the system has the lowest equal error rate, 10.2564%.
  • With the playback attack detection module based on sentence similarity comparison loaded, the equal error rate of the system is 29.0598%.
  • The recording playback attack detection method based on channel mode noise proposed by the invention is simple and easy to implement, computationally efficient, and has a low error rate. It will be more efficient when used in embedded recognition and other smart devices.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

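The present invention relates to the technical fields of intelligent speech signal processing, pattern recognition and artificial intelligence, and in particular to a method and system for detecting recording playback attacks in a speaker recognition system based on channel mode noise. The invention discloses a simpler and more efficient recording playback attack detection method for speaker recognition systems, with the following steps: (1) input the speech signal to be recognized; (2) pre-process the speech signal; (3) extract the channel mode noise from the pre-processed speech signal; (4) extract long-term statistical features based on the channel mode noise; (5) classify the long-term statistical features according to a channel noise classification decision model. The invention uses channel mode noise for recording playback attack detection; the extracted features have low dimensionality, the computational complexity is low, and the misrecognition rate is low. The security of speaker recognition systems can therefore be greatly improved, and the method is easier to use in practice.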

Description

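Specification
Method and system for detecting recording playback attacks based on channel mode noise
Technical Field
The present invention relates to the technical fields of intelligent speech signal processing, pattern recognition and artificial intelligence, and in particular to a method and system for detecting recording playback attacks in a speaker recognition system based on channel mode noise.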
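Background Art
With the continuous development of speaker recognition technology, speaker recognition systems have come into very wide use, for example in judicial forensics, e-commerce and financial systems. At the same time, the security problems that speaker recognition systems face restrict their development and application. Two common attacks on speaker recognition systems are speaker impersonation attacks and recording playback attacks. In a speaker impersonation attack, the attacker attacks the system by imitating the voice of a user enrolled in the speaker recognition system. Speaker recognition experiments on twin speech corpora show that existing speaker recognition techniques can distinguish twin voices with similar acoustic characteristics, so an impersonation attack requires very good imitation skills for the attacker's voice to become highly similar to the system user's voice, which makes impersonation attacks hard to carry out. In a recording playback attack, the attacker secretly records the voice of a user of the speaker recognition system with a high-fidelity recording device in advance and then plays it back at the system input through a high-fidelity power amplifier, thereby attacking the speaker recognition system. For a text-dependent speaker recognition system, a playback attack can be carried out by secretly recording the user's voice while the user enters the system, or by secretly recording a large amount of the user's speech and splicing syllables. For a text-independent system, a playback attack can be carried out with only part of the user's speech. Compared with impersonated speech, recorded and played-back speech genuinely comes from the user, so it poses a greater threat to the speaker recognition system. In addition, high-fidelity recording and playback devices with good performance keep emerging, are becoming cheaper and smaller, and are easy to carry and hard to notice, which makes recording playback attacks ever easier.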
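One strategy for preventing recording playback attacks is to have the system randomly select sentences for the user to read aloud, and to check, while performing speaker recognition, whether the user has read them as required. This approach requires a rich speech corpus to be prepared in advance and requires the user to follow the prompted content. Users who read in their own pronunciation habits may fail speaker recognition, and such an unfriendly interaction is not easily accepted by users. Moreover, this method sacrifices the security protection that the speaker recognition system offers for a specific text of a specific user, and introduces other security problems. In practice, the method can only be used with text-dependent speaker recognition systems, and text recognition of the speech must be performed in addition to speaker recognition, which lowers the overall efficiency of the speaker recognition system.
Another approach compares sentence similarity. Although the passphrase text a user enters is the same each time, it is impossible to collect exactly the same sample twice, so if the similarity between the input sentence and a stored sentence exceeds a certain range, the input can be judged to be a recording playback attack. This method has obvious drawbacks: (1) the algorithm can only be applied to text-dependent speaker recognition systems; (2) every sample with which the user enters the system must be stored, which requires a large amount of storage; (3) every new sample must be compared for similarity with all stored samples, which is computationally very expensive; (4) if the played-back recording was not made while the user was entering the system, for example if it was recorded privately or obtained by syllable splicing, the method fails; (5) the method depends strongly on the threshold setting: speaker recognition itself is a similarity comparison in which high similarity is judged as the same speaker, so the boundary between the similarity threshold for playback attacks and that for genuine speaker recognition is hard to determine.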
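Summary of the Invention
The object of the present invention is to overcome the defects and deficiencies of the prior art and to provide a recording playback attack detection method based on channel mode noise which, when used in a speaker recognition system, improves the success rate of recording playback attack detection.
Another object of the invention is to provide a system implementing the above method.
The objects of the invention are achieved by the following technical solution:
A recording playback attack detection method based on channel mode noise, characterized in that it comprises the following steps:
(1) input the speech signal to be recognized;
(2) pre-process the speech signal;
(3) extract the channel mode noise from the pre-processed speech signal;
(4) extract long-term statistical features based on the channel mode noise;
(5) classify the long-term statistical features according to the channel noise classification decision model to obtain the decision result of the recording playback attack detection.
The pre-processing of step (2) comprises pre-emphasis, framing and windowing.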
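The pre-processing of step (2) can be sketched as follows. This is a minimal sketch that uses the embodiment's values (pre-emphasis coefficient 0.975, frame length 512, frame shift 256); the function names are illustrative and not taken from the patent, and dropping trailing samples that do not fill a whole frame is an assumption.

```python
import numpy as np

def pre_emphasis(x, alpha=0.975):
    """High-pass filter y[n] = x[n] - alpha * x[n-1], i.e. H(z) = 1 - alpha * z^-1."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x, frame_len=512, frame_shift=256):
    """Split x into overlapping frames; leftover samples at the end are dropped (assumption)."""
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    return x[idx]  # shape (n_frames, frame_len)

def hamming(n_points):
    """Hamming window 0.54 - 0.46 * cos(2*pi*n / (N - 1)) for n = 0 .. N-1."""
    n = np.arange(n_points)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (n_points - 1))

def preprocess(x, alpha=0.975, frame_len=512, frame_shift=256):
    """Pre-emphasis, framing and windowing, returning one windowed frame per row."""
    frames = frame_signal(pre_emphasis(x, alpha), frame_len, frame_shift)
    return frames * hamming(frame_len)
```

With a 50% frame overlap, a 2048-sample signal yields 7 frames of 512 points each.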
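Step (3) comprises the following steps:
(31) apply denoising filtering to the pre-processed speech signal;
(32) perform statistical frame analysis separately on the signals before and after denoising filtering;
(33) extract the log power spectrum of the two signals after statistical frame analysis and subtract one from the other to obtain the channel mode noise of the input speech signal.
The statistical frame is obtained by applying a discrete Fourier transform to the short-time frames of the speech signal and taking the average of the same frequency components.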
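Steps (31) to (33) can be illustrated with the sketch below. Two points are assumptions rather than the patent's method: the statistical frame averages the magnitude of each DFT bin over frames (the text only says "average of the same frequency components"), and `moving_average_denoise` is a hypothetical stand-in, because the transfer function of the patent's denoising filter is not recoverable from this text.

```python
import numpy as np

def statistical_frame(frames):
    """Average the magnitude of each DFT bin over all frames of a (T, N) array (assumption)."""
    spectra = np.fft.rfft(frames, axis=1)  # one DFT per short-time frame
    return np.abs(spectra).mean(axis=0)

def log_power(stat_frame, eps=1e-12):
    """Log power spectrum; eps avoids log(0) for empty bins."""
    return np.log(stat_frame ** 2 + eps)

def moving_average_denoise(frames, taps=5):
    """Hypothetical stand-in for the denoising filter: moving average along each frame."""
    kernel = np.ones(taps) / taps
    return np.apply_along_axis(lambda f: np.convolve(f, kernel, mode="same"), 1, frames)

def channel_mode_noise(frames, denoise=moving_average_denoise):
    """Log power spectrum of the un-denoised signal minus that of the denoised signal."""
    original = log_power(statistical_frame(frames))
    denoised = log_power(statistical_frame(denoise(frames)))
    return original - denoised
```

Passing a different callable as `denoise` swaps in another filter without touching the rest of the computation; with an identity filter the result is zero in every bin, which is a quick sanity check.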
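Step (4) comprises the following steps:
(41) extract the 0th- to 5th-order Legendre polynomial expansion coefficients of the channel mode noise;
(42) extract six statistical features of the channel mode noise;
(43) combine the values obtained in the above steps into a 12-dimensional long-term statistical feature vector, used as the feature vector for recording playback attack detection.
The six statistical features of step (42) are the minimum, maximum, mean, median and standard deviation of the channel mode noise, and the difference between its maximum and minimum.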
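A minimal sketch of steps (41) to (43) follows. It assumes the channel mode noise is a one-dimensional curve over frequency bins, that the bins are mapped onto [-1, 1] before fitting, and that the coefficients come from a least-squares fit with NumPy's `legfit`; the patent does not state how the fit is computed, and the use of the population standard deviation is also an assumption.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_coefficients(noise, order=5):
    """Least-squares fit of sum_n L_n * P_n(x) to the noise curve on [-1, 1]."""
    x = np.linspace(-1.0, 1.0, len(noise))
    return legendre.legfit(x, noise, order)  # L0 .. L5

def statistical_features(noise):
    """PN_min, PN_max, PN_mean, PN_median, PN_diff, PN_stdev (population std, an assumption)."""
    return np.array([
        noise.min(), noise.max(), noise.mean(), np.median(noise),
        noise.max() - noise.min(), noise.std(),
    ])

def long_term_features(noise):
    """Concatenate both groups into the 12-dimensional feature vector."""
    noise = np.asarray(noise, dtype=float)
    return np.concatenate([legendre_coefficients(noise), statistical_features(noise)])
```

For a purely linear curve such as 2 + 3x, the fit returns L0 = 2 and L1 = 3 with the higher-order coefficients near zero, which matches the reading of L0 as the DC part and L1 as the slope.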
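Building the channel noise classification decision model of step (5) comprises the following steps:
(51) input training speech signals;
(52) repeat steps (2) to (4) to obtain the long-term statistical features of the channel mode noise of the training speech;
(53) classify with a support vector machine (SVM) to build the channel noise classification decision model.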
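A system implementing the above method comprises:
- an input module for inputting training speech signals or speech signals to be recognized;
- a pre-processing module for pre-processing the speech signal, comprising pre-emphasis, framing and windowing units;
- a channel mode noise extraction module for extracting the channel mode noise from the pre-processed speech signal;
- a long-term statistical feature extraction module for extracting long-term statistical features based on the channel mode noise;
- a channel noise model module for classifying the long-term statistical features of the training speech with a support vector machine to build the channel noise classification decision model;
- a recognition decision module for classifying the long-term statistical features of the speech signal to be recognized with the channel noise classification decision model to obtain the decision result of the recording playback attack detection;
- an output module for outputting the decision result for the speech signal to be recognized.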
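The basic principle of the invention is to perform recording playback attack detection by extracting the channel mode noise of the speech signal. In a speaker recognition system, the original speech is the user's speech as collected directly by the system, and the playback speech is the speech used in a recording playback attack. Before entering the recording channel of the speaker recognition system, the playback speech has additionally gone through one recording and one playback process. Different recording and playback devices introduce their own channel noise (microphones, speakers, dither circuits, preamplifiers, power amplifiers, input and output filters, A/D and D/A converters, sample-and-hold circuits and so on all introduce noise). This channel noise is superimposed on the playback speech, so that the playback speech differs subtly from the original speech. The invention refers to the noise introduced by the transducers (microphones, speakers) and the various circuits in different recording and playback devices as channel mode noise. The original speech contains the channel mode noise of the system's recording device, whereas the playback speech contains not only the channel mode noise of the system but also that of the covert recording device and the playback device, so recording playback attack detection can be performed by extracting the channel mode noise from the speech to be recognized. The invention extracts channel mode noise with a denoising filter, extracts long-term statistical features from the channel mode noise, and then uses a support vector machine to build a channel noise model that decides whether the input of the speaker recognition system is a recording playback attack.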
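Compared with existing recording playback attack detection methods, the invention has the following advantages and beneficial effects:
(1) It can be applied to text-dependent as well as text-independent speaker recognition systems.
(2) The classification of original and playback speech can take place either before or after speaker recognition, so the channel noise model can be used to build a front-end or a back-end recording playback attack detector, which makes the application of the detection algorithm more flexible.
(3) Compared with MFCC (Mel Frequency Cepstrum Coefficient) features, the long-term statistical features have a markedly lower dimensionality, so feature extraction in the training stage is markedly more efficient. In addition, the samples with which users enter the system need not be stored, which saves a large amount of storage space and computing resources.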
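Brief Description of the Drawings
Fig. 1 is a structural diagram of the system of the invention.
Fig. 2 is a flow chart of channel mode noise extraction and of long-term feature extraction based on channel mode noise.
Fig. 3 is a flow chart of statistical frame extraction.
Fig. 4 is a comparison chart after connection to a speaker recognition system.
Detailed Description
The implementation of the invention is described further below with reference to the drawings and an embodiment, but is not limited thereto.
The recording playback attack detection method of the invention can be implemented in an embedded system in the following steps: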
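Step (1): input training speech, comprising original speech signals and playback speech signals.
Step (2): pre-process the input speech signal by pre-emphasis, framing and windowing. Pre-emphasis is high-pass filtering of the speech signal with the transfer function $H(z) = 1 - \alpha z^{-1}$, where $\alpha = 0.975$. For framing, the frame length is 512 points and the frame shift is 256 points. The window applied to the speech signal is a Hamming window, whose function is:
$$\omega_H(n) = \begin{cases} 0.54 - 0.46\cos\left(\dfrac{2\pi n}{N-1}\right), & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}$$
Step (3): extract the channel mode noise from the pre-processed speech signal, as shown in Fig. 2. The extraction of channel mode noise consists of the following steps:
Step S301: input the speech pre-processed in step (2) to the channel mode noise extraction module 300.
Step S302: pass the signal of step S301 through a denoising filter. The transfer function $H(z)$ of the denoising filter appears in the original only as an image (Figure imgf000009_0001) and is not recoverable here; its parameters are $N = 32$ and $a = 0.94$.
Step S303: perform statistical frame analysis separately on the speech signal denoised in step S302 and on the speech signal of step S301 that has not been denoised. The statistical frame is the average of the same frequency components over the short-time frames of the speech signal. Let $x = \{x_i[n]\}$ denote a speech signal of $T$ frames; the discrete Fourier transform of the $i$-th frame $x_i[n]$ ($1 \le i \le T$, $0 \le n \le N-1$) is:
$$X_i[k] = \sum_{n=0}^{N-1} x_i[n]\, e^{-j 2\pi k n / N}$$
The statistical frame $\bar{X}[k]$ is then:
$$\bar{X}[k] = \frac{1}{T}\sum_{i=1}^{T} X_i[k] = \frac{1}{T}\sum_{i=1}^{T}\sum_{n=0}^{N-1} x_i[n]\, e^{-j 2\pi k n / N}$$
As shown in Fig. 3, the statistical frame of step S303 is extracted in the following steps:
Step S3031: apply a discrete Fourier transform to the signals processed in steps S301 and S302.
Step S3032: for the transformed signal of step S3031, superimpose the same frequency components across all frames.
Step S3033: average the superimposed spectrum of step S3032 to obtain the statistical frame of the input signal.
Step S304: compute the log power spectrum. The log power spectrum is extracted from each of the two signals after the statistical frame analysis of step S303, and the signal that passed through the denoising filter is subtracted from the signal that did not, which gives the channel mode noise of the input speech signal:
$$PN[k] = \log\left|\bar{X}[k]\right|^{2} - \log\left|\bar{X}_{\mathrm{Defilt}}[k]\right|^{2}$$
where $\bar{X}_{\mathrm{Defilt}}[k]$ is the statistical frame of $\mathrm{Defilt}(x)$ and Defilt(·) is the denoising filter designed in step S302.
Step (4): on the basis of the channel mode noise obtained in the previous step, extract two groups of long-term statistical features: the Legendre polynomial coefficients of orders 0 to 5, and six statistical features of the channel mode noise.
Step S401: extract the Legendre polynomial coefficients by fitting Legendre polynomials of orders 0 to 5 to the extracted channel mode noise. The Legendre polynomial expansion has the form:
$$f(x) = \sum_{n=0}^{5} L_n P_n(x)$$
where $L_n$ are the Legendre polynomial coefficients. A Legendre polynomial expansion of the extracted channel mode noise yields the coefficients $L_0$ to $L_5$. Each coefficient captures one aspect of the channel mode noise: $L_0$, its DC component; $L_1$, the slope of its distribution curve; $L_2$, the curvature of the curve; $L_3$, the S-curvature of the curve; $L_4$ and $L_5$, finer details of the curve.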
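Step S402: extract statistical features based on the channel mode noise. This group comprises the following six features:
• PN_min: the minimum value of the channel mode noise;
• PN_max: the maximum value of the channel mode noise;
• PN_mean: the mean of the channel mode noise;
• PN_median: the median of the channel mode noise;
• PN_diff: the difference between the maximum and minimum values;
• PN_stdev: the standard deviation of the channel mode noise.
The two groups of long-term statistical features are combined into a 12-dimensional long-term statistical feature vector, which serves as the feature vector for recording playback attack detection.
Step (5): build the support vector machine channel noise classification decision model, used to decide whether the input speech to be recognized is original speech or playback speech. The support vector machine builds the channel noise model from positive and negative samples. The positive samples are the long-term statistical features based on channel mode noise obtained from original speech signals through steps (2) to (4) above. The negative samples are the long-term statistical features based on channel mode noise obtained from playback speech signals through steps (2) to (4) above.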
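Support vector machine classification requires that the separating surface not only separates the two classes of samples correctly but also maximizes the classification margin. The sample set $(x_i, y_i)$, $i = 1, \dots, n$, $x \in R^d$, $y \in \{-1, +1\}$, can be normalized so that it satisfies:
$$y_i[(w \cdot x_i) + b] - 1 \ge 0, \quad i = 1, \dots, n$$
The classification margin then equals $2/\|w\|$, so maximizing the margin is equivalent to minimizing $\|w\|^2$. The separating surface that satisfies the above constraint and minimizes $\|w\|^2$ is called the optimal separating surface, and the training samples lying on it are called support vectors.
The problem is solved with the Lagrange optimization method. The Lagrange function is:
$$L(w, b, \alpha) = \frac{1}{2}(w \cdot w) - \sum_{i=1}^{n} \alpha_i \left\{ y_i[(w \cdot x_i) + b] - 1 \right\}$$
This is converted into the Wolfe dual problem: under the constraints $\sum_{i=1}^{n} y_i \alpha_i = 0$ and $\alpha_i \ge 0$, $i = 1, \dots, n$, maximize the following function over $\alpha_i$:
$$Q(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$
Here $\alpha_i$ is the Lagrange multiplier corresponding to each constraint $y_i[(w \cdot x_i) + b] - 1 \ge 0$, $i = 1, \dots, n$, of the original problem. After solving this problem, let the optimal solution be $\alpha^*$ and $b^*$, and let $x$ be the input data to be classified. The optimal classification function (that is, the output function of the support vector machine) is:
$$f(x) = \mathrm{sgn}\left\{ \sum_{i=1}^{n} \alpha_i^* y_i (x_i \cdot x) + b^* \right\}$$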
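In practice, speech samples cannot be completely noise-free and completely linearly separable, so the support vector machine classifier is used in the linearly non-separable case. A slack variable $\xi_i \ge 0$ is added to the constraint $y_i[(w \cdot x_i) + b] - 1 \ge 0$, $i = 1, \dots, n$, which becomes:
$$y_i[(w \cdot x_i) + b] - 1 + \xi_i \ge 0, \quad i = 1, \dots, n$$
The Lagrange function is then:
$$L(w, b, \xi) = \frac{1}{2}(w \cdot w) + C\left(\sum_{i=1}^{n} \xi_i\right)$$
Converting to the Wolfe dual problem gives: under the conditions $\sum_{i=1}^{n} y_i \alpha_i = 0$ and $0 \le \alpha_i \le C$, $i = 1, \dots, n$, maximize:
$$Q(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$
where $C$ is a constant that controls the degree of penalty for misclassified samples and is called the penalty factor. In the linearly non-separable case, the output function of the support vector machine can therefore be expressed as:
$$f(x) = \mathrm{sgn}\left\{ \sum_{i=1}^{n} \alpha_i^* y_i K(x_i, x) + b^* \right\}$$
where $0 \le \alpha_i \le C$, $i = 1, \dots, n$, $\mathrm{sgn}(\cdot)$ is the sign function, and $K(x, x_i)$ is the radial basis inner-product function, which can serve as the kernel function of the support vector machine:
$$K(x, x_i) = \exp\left(-\gamma \|x - x_i\|^2\right), \quad \gamma > 0$$
In practice, different kernel functions may be chosen.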
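The output function of the soft-margin support vector machine with a radial basis kernel can be evaluated as in the sketch below. It is only an illustration of the decision rule, not a trainer: the support vectors, multipliers and bias are assumed to come from a solver run beforehand, and the squared norm in the kernel and the symbol `gamma` follow the conventional RBF form, which is an assumption where the source formula is garbled.

```python
import numpy as np

def rbf_kernel(x, support_vectors, gamma):
    """K(x, x_i) = exp(-gamma * ||x - x_i||^2) for every support vector x_i."""
    diff = np.asarray(support_vectors, dtype=float) - np.asarray(x, dtype=float)
    return np.exp(-gamma * np.sum(diff * diff, axis=1))

def svm_decide(x, support_vectors, alphas, labels, bias, gamma=0.0078125):
    """Return +1 (original speech) or -1 (playback) from sgn(sum_i alpha_i*y_i*K(x_i, x) + b)."""
    weights = np.asarray(alphas, dtype=float) * np.asarray(labels, dtype=float)
    score = np.dot(weights, rbf_kernel(x, support_vectors, gamma)) + bias
    return 1 if score >= 0 else -1
```

In the patent's setting `x` would be the 12-dimensional long-term statistical feature vector; positive labels correspond to original speech and negative labels to playback speech, as in the definition of the training samples.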
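The penalty factor $C$ and the kernel parameter $\gamma$ are determined with the SMO (Sequential Minimal Optimization) algorithm and a grid search, and are used to train the channel noise model. One setting obtained by optimizing on actual data is $C = 0.03125$, $\gamma = 0.0078125$.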
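The reported values are exact powers of two (0.03125 = 2^-5 and 0.0078125 = 2^-7), which is consistent with a grid search over exponents. The sketch below shows such a search in generic form; the exponent ranges and the `score_fn` callback (for example, cross-validated accuracy of an SVM trained with the given pair) are hypothetical and not specified in the patent.

```python
import itertools

def grid_search(score_fn, c_exponents, gamma_exponents):
    """Return (best_score, C, gamma) over a grid of powers of two.

    score_fn(C, gamma) is supplied by the caller and returns a value to maximize.
    """
    best = None
    for c_exp, g_exp in itertools.product(c_exponents, gamma_exponents):
        c, gamma = 2.0 ** c_exp, 2.0 ** g_exp
        score = score_fn(c, gamma)
        if best is None or score > best[0]:
            best = (score, c, gamma)
    return best
```

Searching exponents rather than raw values covers several orders of magnitude with few evaluations; a finer grid around the best pair can follow if needed.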
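Step (6): classify original and playback speech. The speech signal to be recognized is input, its long-term statistical features based on channel mode noise are obtained through steps (2) to (4) above, the channel noise model built in step (5) is used for recording playback attack detection, and finally the decision result is output.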
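As shown in Fig. 1, a recording playback attack detection system of the invention comprises:
- an input module 100 for inputting training speech signals or speech signals to be recognized;
- a pre-processing module 200 for pre-processing the speech signal, comprising pre-emphasis, framing and windowing units;
- a channel mode noise extraction module 300 for extracting the channel mode noise from the pre-processed speech signal;
- a long-term statistical feature extraction module 400 for extracting long-term statistical features based on the channel mode noise;
- a channel noise model module 500 for classifying the long-term statistical features of the training speech with a support vector machine to build the channel noise classification decision model;
- a recognition decision module 600 for using the channel noise model module to decide whether the input speech to be recognized is recording playback attack speech;
- an output module 700 for outputting the decision result for the speech signal to be recognized.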
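The recording playback attack detection method based on channel mode noise provided by the invention was compared with a method based on sentence similarity comparison on the Authentic and Playback Speech Database (APSD). As shown in Table 1, the method based on channel mode noise has a lower error rate.
Table 1
Error rate              Channel mode noise method   Sentence similarity comparison method
False rejection rate    2.8619%                     15.6732%
False acceptance rate   2.4507%                     15.6732%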
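The two error rates in Table 1 can be computed from detector decisions as in this sketch. It assumes the usual reading of the terms, which the patent does not spell out: a false rejection is original speech judged to be playback, and a false acceptance is playback speech judged to be original. The label convention (+1 original, -1 playback) is likewise an assumption.

```python
def error_rates(true_labels, decisions):
    """Return (false rejection rate, false acceptance rate) as fractions.

    Labels and decisions use +1 for original speech and -1 for playback speech.
    """
    genuine = [d for t, d in zip(true_labels, decisions) if t == 1]
    attacks = [d for t, d in zip(true_labels, decisions) if t == -1]
    frr = sum(d == -1 for d in genuine) / len(genuine)  # original rejected
    far = sum(d == 1 for d in attacks) / len(attacks)   # playback accepted
    return frr, far
```

For example, rejecting one of four original utterances and accepting two of four playback utterances gives a false rejection rate of 0.25 and a false acceptance rate of 0.5.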
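As shown in Fig. 4, the recording playback attack detectors built with the two methods were each connected to an actual speaker recognition system. For data containing playback attack speech, the speaker recognition system without a playback attack detection module has a very high error rate and very low security. With the playback attack detection module based on channel mode noise loaded, the system has the lowest equal error rate, 10.2564%. With the playback attack detection module based on sentence similarity comparison loaded, the equal error rate of the system is 29.0598%.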
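An equal error rate is the operating point at which the false rejection rate and the false acceptance rate coincide. The sketch below estimates it from two lists of scores by sweeping the threshold over the observed values; how the patent's experiment actually computed its equal error rates is not stated, so the score convention (higher means more likely genuine) and the threshold sweep are assumptions.

```python
import numpy as np

def equal_error_rate(genuine_scores, attack_scores):
    """Estimate the EER by sweeping a threshold over all observed scores."""
    genuine = np.asarray(genuine_scores, dtype=float)
    attack = np.asarray(attack_scores, dtype=float)
    thresholds = np.sort(np.concatenate([genuine, attack]))
    # A trial is accepted when its score is at or above the threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    far = np.array([(attack >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(frr - far))  # threshold where the two rates are closest
    return (frr[i] + far[i]) / 2.0
```

With few trials the two rates may never be exactly equal, so the sketch returns their average at the closest threshold.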
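The recording playback attack detection method based on channel mode noise proposed by the invention is simple and easy to implement, computationally efficient, and has a low error rate. It will be more efficient when used in embedded recognition and other smart devices.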

Claims

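Claims
1. A recording playback attack detection method based on channel mode noise, characterized by comprising the following steps:
(1) input the speech signal to be recognized;
(2) pre-process the speech signal;
(3) extract the channel mode noise from the pre-processed speech signal;
(4) extract long-term statistical features based on the channel mode noise;
(5) classify the long-term statistical features according to the channel noise classification decision model to obtain the decision result of the recording playback attack detection.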
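2. The recording playback attack detection method according to claim 1, characterized in that the pre-processing in step (2) comprises pre-emphasis, framing and windowing.
3. The recording playback attack detection method according to claim 1, characterized in that step (3) further comprises the following steps:
(31) apply denoising filtering to the pre-processed speech signal;
(32) perform statistical frame analysis separately on the signals before and after denoising filtering;
(33) extract the log power spectrum of the two signals after statistical frame analysis and subtract one from the other to obtain the channel mode noise of the input speech signal.
4. The recording playback attack detection method according to claim 3, characterized in that the statistical frame is obtained by applying a discrete Fourier transform to the short-time frames of the speech signal and taking the average of the same frequency components.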
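5. The recording playback attack detection method according to claim 1, characterized in that step (4) further comprises the following steps:
(41) extract the 0th- to 5th-order Legendre polynomial expansion coefficients of the channel mode noise;
(42) extract six statistical features of the channel mode noise;
(43) combine the values obtained in the above steps into a 12-dimensional long-term statistical feature vector, used as the feature vector for recording playback attack detection.
6. The recording playback attack detection method according to claim 5, characterized in that the six statistical features of step (42) are the minimum, maximum, mean, median and standard deviation of the channel mode noise, and the difference between its maximum and minimum.
7. The recording playback attack detection method according to claim 1, characterized in that building the channel noise classification decision model of step (5) comprises the following steps:
(51) input training speech signals;
(52) repeat steps (2) to (4) to obtain the long-term statistical features of the channel mode noise of the training speech;
(53) classify with a support vector machine to build the channel noise classification decision model.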
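8. A recording playback attack detection system based on channel mode noise, characterized by comprising:
- an input module (100) for inputting training speech signals or speech signals to be recognized;
- a pre-processing module (200) for pre-processing the training speech signal or the speech signal to be recognized, comprising pre-emphasis, framing and windowing units;
- a channel mode noise extraction module (300) for extracting the channel mode noise from the pre-processed training speech signal or speech signal to be recognized;
- a long-term statistical feature extraction module (400) for extracting the long-term statistical features, based on channel mode noise, of the training speech signal or the speech signal to be recognized;
- a channel noise model module (500) for classifying the long-term statistical features of the training speech signal with a support vector machine to build the channel noise classification decision model;
- a recognition decision module (600) for classifying the long-term statistical features of the speech signal to be recognized with the channel noise classification decision model to obtain the decision result of the recording playback attack detection;
- an output module (700) for outputting the decision result for the speech signal to be recognized.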
PCT/CN2011/084868 2011-10-26 2011-12-29 Method and system for detecting recording playback attacks based on channel mode noise WO2013060079A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201110330598.7 2011-10-26
CN2011103305987A CN102436810A (zh) 2011-10-26 2011-10-26 一种基于信道模式噪声的录音回放攻击检测方法和系统

Publications (1)

Publication Number Publication Date
WO2013060079A1 true WO2013060079A1 (zh) 2013-05-02

Family

ID=45984833

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/084868 WO2013060079A1 (zh) 2011-10-26 2011-12-29 一种基于信道模式噪声的录音回放攻击检测方法和系统

Country Status (2)

Country Link
CN (1) CN102436810A (zh)
WO (1) WO2013060079A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105044478A (zh) * 2015-07-23 2015-11-11 国家电网公司 一种输电线路可听噪声的多通道信号提取方法
WO2016046652A1 (pt) * 2014-09-24 2016-03-31 FUNDAÇÃO CPQD - Centro de Pesquisa e Desenvolvimento em Telecomunicações Método e sistema para detecção de fraudes em aplicações baseadas em processamento de voz

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102820034B (zh) * 2012-07-16 2014-05-21 中国民航大学 一种民用航空器噪声感知与识别装置及其方法
CN104569551B (zh) * 2015-01-08 2016-03-23 漳州科华技术有限责任公司 一种应用于逆变电压的直流分量检测方法
CN106328152B (zh) * 2015-06-30 2020-01-31 芋头科技(杭州)有限公司 一种室内噪声污染自动识别监测系统
CN105023571A (zh) * 2015-07-28 2015-11-04 苏州宏展信息科技有限公司 一种用于录音笔的语音特征提取控制方法
CN105513598B (zh) * 2016-01-14 2019-04-23 宁波大学 一种基于频域信息量分布的回放语音检测方法
CN105913855B (zh) * 2016-04-11 2019-11-22 宁波大学 一种基于长窗比例因子的回放语音攻击检测算法
CN105869630B (zh) * 2016-06-27 2019-08-02 上海交通大学 基于深度学习的说话人语音欺骗攻击检测方法及系统
CN106297772B (zh) * 2016-08-24 2019-06-25 武汉大学 基于扬声器引入的语音信号失真特性的回放攻击检测方法
CN106409298A (zh) * 2016-09-30 2017-02-15 广东技术师范学院 一种声音重录攻击的识别方法
CN106531172B (zh) * 2016-11-23 2019-06-14 湖北大学 基于环境噪声变化检测的说话人语音回放鉴别方法及系统
CN109754817A (zh) * 2017-11-02 2019-05-14 北京三星通信技术研究有限公司 信号处理方法及终端设备
CN108039176B (zh) * 2018-01-11 2021-06-18 广州势必可赢网络科技有限公司 一种防录音攻击的声纹认证方法、装置及门禁系统
CN108281158A (zh) * 2018-01-12 2018-07-13 平安科技(深圳)有限公司 基于深度学习的语音活体检测方法、服务器及存储介质
CN109599117A (zh) * 2018-11-14 2019-04-09 厦门快商通信息技术有限公司 一种音频数据识别方法及人声语音防重放识别系统
CN109243487B (zh) * 2018-11-30 2022-12-27 宁波大学 一种归一化常q倒谱特征的回放语音检测方法
CN111445904A (zh) * 2018-12-27 2020-07-24 北京奇虎科技有限公司 基于云端的语音控制方法、装置及电子设备
CN110299141B (zh) * 2019-07-04 2021-07-13 苏州大学 一种声纹识别中录音回放攻击检测的声学特征提取方法
CN110459226A (zh) * 2019-08-19 2019-11-15 效生软件科技(上海)有限公司 一种通过声纹引擎检测人声或机器音进行身份核验的方法
CN110718229A (zh) * 2019-11-14 2020-01-21 国微集团(深圳)有限公司 录音回放攻击的检测方法及对应检测模型的训练方法
CN111462737B (zh) * 2020-03-26 2023-08-08 中国科学院计算技术研究所 一种训练用于语音分组的分组模型的方法和语音降噪方法
CN112599149A (zh) * 2020-12-10 2021-04-02 中国传媒大学 回放攻击语音的检测方法和装置
CN113012684B (zh) * 2021-03-04 2022-05-31 电子科技大学 一种基于语音分割的合成语音检测方法
CN114441029A (zh) * 2022-01-20 2022-05-06 深圳壹账通科技服务有限公司 语音标注系统的录音噪音检测方法、装置、设备及介质

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1123863C (zh) * 2000-11-10 2003-10-08 清华大学 基于语音识别的信息校核方法
US20100106503A1 (en) * 2008-10-24 2010-04-29 Nuance Communications, Inc. Speaker verification methods and apparatus

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7089182B2 (en) * 2000-04-18 2006-08-08 Matsushita Electric Industrial Co., Ltd. Method and apparatus for feature domain joint channel and additive noise compensation
US7346504B2 (en) * 2005-06-20 2008-03-18 Microsoft Corporation Multi-sensory speech enhancement using a clean speech prior
CN100580770C (zh) * 2005-08-08 2010-01-13 中国科学院声学研究所 基于能量及谐波的语音端点检测方法
KR100738341B1 (ko) * 2005-12-08 2007-07-12 한국전자통신연구원 성대신호를 이용한 음성인식 장치 및 그 방법

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WANG, ZHIFENG ET AL.: "CHANNEL PATTERN NOISE BASED PLAYBACK ATTACK DETECTION ALGORITHM FOR SPEAKER RECOGNITION", PROCEEDINGS OF THE 2011 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, 13 July 2011 (2011-07-13) *
WANG, ZHIFENG ET AL.: "Playback Attack Detection Based on Channel Pattern Noise", JOURNAL OF SOUTH CHINA UNIVERSITY OF TECHNOLOGY (NATURAL SCIENCE EDITION), vol. 39, no. 10, 31 October 2011 (2011-10-31), pages 8 - 9 *
ZHANG, LIPENG ET AL.: "Prevention of impostors entering speaker recognition systems", JOURNAL OF TSINGHUA UNIVERSITY (SCIENCE AND TECHNOLOGY), vol. 48, no. SI, 31 December 2008 (2008-12-31), pages 699 - 703 *

Also Published As

Publication number Publication date
CN102436810A (zh) 2012-05-02

Similar Documents

Publication Publication Date Title
WO2013060079A1 (zh) 一种基于信道模式噪声的录音回放攻击检测方法和系统
CN108305615B (zh) 一种对象识别方法及其设备、存储介质、终端
Dinkel et al. End-to-end spoofing detection with raw waveform CLDNNS
TWI473080B (zh) The use of phonological emotions or excitement to assist in resolving the gender or age of speech signals
CN105405439B (zh) 语音播放方法及装置
US8589167B2 (en) Speaker liveness detection
WO2020181824A1 (zh) 声纹识别方法、装置、设备以及计算机可读存储介质
CN108986824B (zh) 一种回放语音检测方法
Sahidullah et al. Robust voice liveness detection and speaker verification using throat microphones
WO2021051608A1 (zh) 一种基于深度学习的声纹识别方法、装置及设备
Chetty Biometric liveness checking using multimodal fuzzy fusion
CN107507626B (zh) 一种基于语音频谱融合特征的手机来源识别方法
CN109711350B (zh) 一种基于唇部运动和语音融合的身份认证方法
WO2018095167A1 (zh) 声纹识别方法和声纹识别系统
CN110232928B (zh) 文本无关说话人验证方法和装置
CN107533415B (zh) 声纹检测的方法和装置
CN116312559A (zh) 跨信道声纹识别模型的训练方法、声纹识别方法及装置
Aloradi et al. Speaker verification in multi-speaker environments using temporal feature fusion
WO2021139425A1 (zh) 语音端点检测方法、装置、设备及存储介质
Alam On the use of fisher vector encoding for voice spoofing detection
Saleh et al. Multimodal person identification through the fusion of face and voice biometrics
CN112992131A (zh) 一种在复杂场景下提取目标人声的乒乓球指令的方法
Hajipour et al. Listening to sounds of silence for audio replay attack detection
Chaudhari et al. Countermeasures and Challenges for Detection of Spoofing Attacks in Automatic Speaker Verification System
Korshunov et al. Presentation attack detection in voice biometrics

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11874763

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11874763

Country of ref document: EP

Kind code of ref document: A1