CN113436646B - Camouflage voice detection method adopting combined features and random forest
- Publication number
- CN113436646B (application CN202110648176.8A)
- Authority
- CN
- China
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
Abstract
The invention relates to a method for detecting disguised voice using combined features and a random forest, comprising the following steps: S1, randomly selecting genuine and fake voices from a training voice library, extracting the LBP local texture features and CQCC acoustic features of each selected voice, and combining them into a joint feature vector, thereby obtaining a training data set; S2, training a random forest with the training data set to generate a random forest classifier; S3, extracting the LBP local texture features and CQCC acoustic features of the voice to be detected, combining them into a joint feature vector, and feeding this vector to the random forest classifier to judge whether the voice to be detected is genuine. The invention can detect the authenticity of the voice to be detected and effectively improves the security of ASV systems.
Description
Technical Field
The invention belongs to the technical field of disguised voice detection, and particularly relates to a disguised voice detection method adopting combined features and random forests.
Background
An Automatic Speaker Verification (ASV) system analyzes a speaker's voice signal to verify the speaker's claimed identity. ASV is a contactless form of identity authentication whose main advantages are low equipment cost and convenient operation. Although current ASV systems recognize target voices with high accuracy, malicious spoofing attacks that impersonate the real identity of a target speaker greatly reduce their security.
The main types of spoofing attack are speech synthesis, voice conversion, human impersonation, and speech replay. To cope with these different kinds of attack, the detection performance of the speaker recognition system under spoofing must be improved so that the ASV system can resist them. With such anti-spoofing protection in place, only samples that pass spoofing detection and are judged to be genuine speech are passed on to the ASV system for further identity verification.
Disclosure of Invention
Based on the above-mentioned shortcomings in the prior art, the present invention aims to provide a method for detecting a disguised voice by using a combination feature and a random forest.
In order to achieve the above purpose of the present invention, the following technical solutions are adopted:
a method for detecting disguised voice by adopting combined characteristics and random forests comprises the following steps:
s1, randomly selecting genuine and fake voices from a training voice library, extracting the LBP local texture features and CQCC acoustic features of each selected voice, and combining them into a joint feature vector, thereby obtaining a training data set;
s2, training the random forest by using the training data set to generate a random forest classifier;
s3, extracting LBP local texture features and CQCC acoustic features of the voice to be detected, combining the LBP local texture features and the CQCC acoustic features to form a combined feature vector to be detected, and inputting the combined feature vector to be detected into a random forest classifier to perform authenticity detection on the voice to be detected.
Preferably, the extracting of the LBP local texture features includes:
acquiring a spectrogram of the voice to be extracted, and analyzing the spectrogram of the voice to be extracted by using an LBP algorithm to obtain LBP local texture features;
the voice to be extracted is randomly selected voice or voice to be detected.
As a preferred scheme, before the spectrogram of the voice to be extracted is analyzed with the LBP algorithm, the spectrogram is partitioned into blocks; the LBP algorithm is then applied to each block, yielding an LBP local texture feature vector composed of the LBP local texture features of all blocks.
Preferably, the extracting of the CQCC acoustic features includes:
firstly, a constant Q transform is applied to the voice to be extracted to obtain the spectrum X^CQ(k); the log power spectrum log|X^CQ(k)|^2 is then computed; the log power spectrum is then resampled to log|X^CQ(l)|^2; finally, a discrete cosine transform is applied to the resampled log power spectrum to obtain the CQCC acoustic features of the voice to be extracted;
and k and l are frequency band serial numbers before and after resampling respectively, and the voice to be extracted is randomly selected voice or voice to be detected.
As a preferred scheme, the combining LBP local texture features and CQCC acoustic features into a joint feature vector includes:
and respectively reducing the dimensions of the LBP local texture features and the CQCC acoustic features by adopting a principal component analysis algorithm, and then splicing the features after dimension reduction so as to generate a joint feature vector.
Preferably, the step S2 includes the following steps:
s21, assuming the training data set contains N vector samples, randomly drawing N' vector samples with replacement from the training data set as the training samples for one decision tree, where N' ≤ N;
s22, each vector sample contains M attributes, M being the dimension of the joint feature vector; at each split of the decision tree, randomly selecting M' attributes and performing the split according to the Gini index, then judging whether the tree can be split further; if not, going to step S23; otherwise, continuing to split according to the Gini index;
s23, the decision tree is complete; judging whether the number of decision trees is still less than the target number; if so, returning to step S21; if not, outputting the random forest classifier.
Compared with the prior art, the invention has the following technical effects:
the method comprises the steps of extracting texture features in a speech signal spectrogram by using a Local Binary Pattern (LBP), obtaining combined features by combining acoustic features of a Constant Q Cepstrum Coefficient (CQCC), and training a Random Forest (RF) classifier to perform authenticity detection and classification on speech to be detected by using the obtained combined feature vectors, so that the safety of an ASV system is effectively improved.
Drawings
FIG. 1 is a flow chart of a method for detecting disguised speech using a combination of features and random forests in accordance with an embodiment of the present invention;
FIG. 2 is an exemplary LBP solving process in accordance with an embodiment of the present invention;
fig. 3 is a schematic diagram of an LBP texture feature extraction process according to an embodiment of the present invention;
fig. 4 is a schematic diagram of an extraction flow of CQCC acoustic features according to an embodiment of the present invention;
- FIG. 5 is a flow chart of joint feature combination according to an embodiment of the present invention;
FIG. 6 is a flow chart of training a random forest according to an embodiment of the present invention.
Detailed Description
The technical solution of the present invention is further explained by the following specific examples.
As shown in fig. 1, the method for detecting a disguised voice by using a combination feature and a random forest according to the embodiment of the present invention includes the following steps:
s1, randomly selecting true voice and pseudo voice from a training voice library, respectively extracting LBP local texture characteristics and CQCC acoustic characteristics of each randomly selected voice, and combining the LBP local texture characteristics and the CQCC acoustic characteristics to form a combined characteristic vector so as to obtain a training data set;
specifically, a spectrogram of the randomly selected voice is obtained, and the spectrogram of the voice to be extracted is analyzed by using an LBP algorithm to obtain LBP local texture features. In order to improve the detection efficiency and accuracy of the disguised voice detection method, the spectrogram is subjected to blocking processing, and then the spectrogram of the voice to be extracted is analyzed by using an LBP algorithm for each block spectrogram, so that an LBP local texture feature vector consisting of LBP local texture features of each block spectrogram is obtained.
Extraction of CQCC acoustic features, comprising:
firstly, a constant Q transform is applied to the randomly selected voice to obtain the spectrum X^CQ(k); the log power spectrum log|X^CQ(k)|^2 is then computed; the log power spectrum is then resampled to log|X^CQ(l)|^2; finally, a discrete cosine transform is applied to the resampled log power spectrum to obtain the CQCC acoustic features of the voice to be extracted;
where k and l are the frequency-band indices before and after resampling, respectively.
After the two kinds of features are extracted, the LBP local texture features and the CQCC acoustic features are combined into a joint feature vector. Because the two features have different dimensions, they cannot be concatenated directly; moreover, an over-large feature dimension would make the computation in the spoofing detection stage too heavy and hurt the efficiency of disguised voice detection. The combination therefore proceeds as follows:
and respectively reducing the dimensions of the LBP local texture features and the CQCC acoustic features by adopting a principal component analysis algorithm, and then splicing the reduced features to generate a joint feature vector.
S2, training the random forest by using the training data set to generate a random forest classifier;
the training of the random forest specifically comprises the following steps:
s21, assuming the training data set contains N vector samples, randomly drawing N' vector samples with replacement from the training data set as the training samples for one decision tree, where N' ≤ N;
s22, each vector sample contains M attributes, M being the dimension of the joint feature vector; at each split of the decision tree, M' attributes are randomly selected and the split is performed according to the Gini index, after which it is judged whether the tree can be split further; if not, go to step S23; otherwise, continue splitting according to the Gini index;
s23, the decision tree is complete; judge whether the number of decision trees is still less than the target number; if so, return to step S21; if not, output the random forest classifier.
S3, extracting LBP local texture characteristics and CQCC acoustic characteristics of the voice to be detected (namely the voice to be verified), combining the LBP local texture characteristics and the CQCC acoustic characteristics to form a combined characteristic vector to be detected, and inputting the combined characteristic vector to be detected into a random forest classifier to perform authenticity detection on the voice to be detected.
For the specific process of extracting the LBP local texture feature and the CQCC acoustic feature and combining the two features, reference may be made to step S1, which is not described herein again.
The working principle and the illustration of each step are described in detail below:
(1) extraction of LBP local texture features
LBP feature parameters currently perform well in the field of image recognition; they are efficient texture features with good classification power. LBP describes the texture of an object by comparing the gray values of adjacent pixels in its image. Taking the gray value g_c of the center pixel as the reference, the gray value g_i of each neighboring pixel is compared with it: a neighbor whose gray value is greater than or equal to g_c is coded as 1, and one whose gray value is less than g_c is coded as 0. With the top-left neighbor taken as the first digit and the others recorded clockwise, a sequence of 0s and 1s is obtained; converting this binary number to decimal completes the basic LBP operation.
The LBP operation is given by:

LBP_{N,R} = Σ_{i=1}^{N} s(g_i − g_c) · 2^{i−1},  with s(x) = 1 if x ≥ 0 and s(x) = 0 otherwise,

where g_c is the gray value of the center pixel, R is the radius, and g_i (i = 1, …, N) are the gray values of the N pixels distributed on a circle of radius R around the center; the resulting LBP_{N,R} is the LBP value of the center pixel. FIG. 2 shows an example of solving the LBP value of one image pixel for R = 1.
The LBP algorithm is used to perform texture analysis on the spectrogram of each speech signal to be detected. To improve the overall performance of the disguised voice detection system, the spectrogram is first partitioned: the invention divides the whole spectrogram evenly into 16 blocks in a 4 × 4 layout, and LBP features are extracted from each block separately. For the LBP extraction, the sampling radius R is 1 and the number of sampling points N is 8.
In the LBP_{8,1} mode, after the spectrogram has been processed with the 3 × 3 LBP of FIG. 2, each pixel takes a value between 0 and 255; counting these values with a statistical histogram gives a 1 × 256-dimensional feature vector per spectrogram block. For a 3 × 3 LBP, many entries of this vector are empty, which adds useless information and computation, so the vector is compressed: among the 8-bit binary codes, all codes whose circular 0/1 transitions exceed 2 are merged into a single class. After this processing, the original 256 cases of the 3 × 3 LBP reduce to 59, and a statistical histogram over these 59 possible LBP values yields a 59-dimensional feature vector. FIG. 3 illustrates the extraction of the 3 × 3 LBP texture feature vector. Performing the same operation on all 16 blocks of the spectrogram yields a feature matrix of dimension 16 × 59.
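The 3 × 3 LBP coding and the uniform-pattern merging described above can be sketched in a few lines of NumPy; this is an illustrative sketch of the standard operations, not the patent's implementation:

```python
import numpy as np

def lbp_code(patch):
    """Basic 3x3 LBP: threshold the 8 neighbours against the centre pixel,
    reading them clockwise with the top-left neighbour as the first digit."""
    c = patch[1, 1]
    neigh = [patch[0, 0], patch[0, 1], patch[0, 2],
             patch[1, 2], patch[2, 2], patch[2, 1],
             patch[2, 0], patch[1, 0]]
    bits = [1 if g >= c else 0 for g in neigh]            # >= c -> 1, < c -> 0
    return sum(b << (7 - i) for i, b in enumerate(bits))  # binary -> decimal

def is_uniform(code):
    """'Uniform' pattern: at most 2 circular 0/1 transitions among the 8 bits."""
    bits = [(code >> i) & 1 for i in range(8)]
    return sum(bits[i] != bits[(i + 1) % 8] for i in range(8)) <= 2

# 58 uniform patterns plus one catch-all bin for the rest -> 59 histogram bins
assert sum(is_uniform(c) for c in range(256)) == 58
```

Applying `lbp_code` to every 3 × 3 neighbourhood of a spectrogram block and histogramming the codes into the 58 uniform bins plus one catch-all bin yields the 59-dimensional vector per block.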
(2) CQCC acoustic feature extraction
The CQCC is similar to the conventional acoustic feature extraction method, but replaces the Short-time Fourier Transform (STFT) used in the conventional feature extraction with a Constant Q Transform (CQT). CQTs were initially used for musical tone analysis in music recognition, and have the major advantage of having different frequency and time resolutions in the low and high frequency bands, thereby avoiding the disadvantage of uniform distribution of STFT time-frequency resolution.
CQCC features are extracted on the basis of the CQT. Let x(n) denote one frame of the speech signal; its CQT is

X^CQ(k) = Σ_{n=0}^{⌊N_k⌋−1} x(n) · a_k*(n),  k = 1, 2, …, K,

where k is the index of the frequency bin, K is the number of bins, and ⌊·⌋ denotes rounding down. The analyzed band runs from the minimum frequency f_min to the maximum frequency f_max and is divided into N_O octaves following a geometric distribution; each octave is subdivided into B bands, i.e. K = B · N_O. a_k*(n) is the complex conjugate of a_k(n), and N_k is the variable window length used in the time-frequency analysis. a_k(n) is defined as

a_k(n) = (C / N_k) · w(n / N_k) · exp[ j (2π n f_k / f_s + φ_k) ],

where f_k is the center frequency of the k-th filter band, f_s is the sampling frequency, and φ_k is a phase offset. C is a normalization factor computed from the window function w(t). The widths of the K bands follow the twelve-tone equal temperament, so

f_k = f_1 · 2^{(k−1)/B},

where f_1 is the center frequency of the lowest band.

The parameter Q, which trades off time resolution against frequency resolution, is

Q = f_k / (f_{k+1} − f_k) = (2^{1/B} − 1)^{−1}.

Since Q depends only on B and remains constant throughout the CQT, the window length is

N_k = Q · f_s / f_k.
as shown in FIG. 4, the extraction of CQCC acoustic features is implemented by first performing constant Q transformation on a speech signal X (n) to obtain a frequency spectrum X CQ (k) Then obtaining an energy spectrum | X CQ (k)| 2 Log of (2) and log power spectrum log | X CQ (k)| 2 Then resampled to log | X CQ (l)| 2 And finally, performing discrete cosine transform on the resampled logarithmic energy spectrum to extract a CQCC coefficient of the voice signal, and finally obtaining a CQCC feature vector, namely:
where, p is 0, 1.., and L-1, L is the frequency band number after resampling.
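The FIG. 4 pipeline after the CQT — log power spectrum, uniform resampling, discrete cosine transform — can be sketched as follows; linear interpolation stands in for the resampling step, and the sizes (L = 128, 60 coefficients) are assumptions of this sketch:

```python
import numpy as np

def cqcc(X_cq, L=128, n_coeffs=60):
    """CQCC post-processing for one frame of an already-computed constant-Q
    spectrum X_cq: log power spectrum -> uniform resampling -> DCT."""
    log_power = np.log(np.abs(X_cq) ** 2 + 1e-12)   # log|X^CQ(k)|^2 (eps avoids log 0)
    k = np.arange(len(log_power))
    l = np.linspace(0, len(log_power) - 1, L)
    uniform = np.interp(l, k, log_power)            # resampled log|X^CQ(l)|^2
    p = np.arange(n_coeffs)[:, None]
    ll = np.arange(1, L + 1)[None, :]
    basis = np.cos(p * (ll - 0.5) * np.pi / L)      # DCT basis cos[p(l-1/2)pi/L]
    return basis @ uniform                          # CQCC(p), p = 0..n_coeffs-1
```

The DCT step matches the summation formula above term by term; a production system would resample on the log-frequency axis rather than on the raw bin index.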
(3) Combined features
In a spoofing attack scenario, combined features carry more voice information and perform better. The acoustic CQCC features and the LBP texture features are therefore merged into combined features. Because the texture and acoustic features have different dimensions, they cannot be concatenated directly; moreover, an over-large feature dimension would make the computation in the spoofing detection stage too heavy and degrade the overall performance of the disguised voice detection system.
The method adopts Principal Component Analysis (PCA) to respectively perform dimensionality reduction on CQCC and LBP characteristics, and then splices the dimensionality-reduced characteristics to generate combined characteristics.
The specific flow of the PCA dimension reduction algorithm is as follows:
(a) First, a data set X = {x_1, x_2, …, x_i, …, x_N} of N M-dimensional vectors is input. The mean vector μ = (1/N) Σ_{i=1}^{N} x_i is subtracted from each vector x_i in X, i.e. x̃_i = x_i − μ, which yields the de-centered data set X̃ = {x̃_1, x̃_2, …, x̃_N}.
(b) The covariance matrix C = (1/N) Σ_{i=1}^{N} x̃_i x̃_i^T is constructed and its eigenvalues are decomposed; the eigenvectors w_1, w_2, …, w_{N′} corresponding to the N′ largest eigenvalues are selected in turn, giving the eigenvector matrix W = {w_1, w_2, …, w_{N′}}. Here T denotes the matrix transpose.
(c) Each sample vector x̃_i of the data set X̃ is reduced in dimension, i.e. z_i = W^T x̃_i, yielding the N′-dimensional data set Z = {z_1, z_2, …, z_N}.
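The PCA steps above translate almost line by line into NumPy; a minimal illustrative sketch (function name and shapes are not from the patent):

```python
import numpy as np

def pca_reduce(X, n_prime):
    """De-centre the data, eigendecompose the covariance matrix, keep the
    eigenvectors of the n' largest eigenvalues, and project. X holds one
    M-dimensional sample per row."""
    X_tilde = X - X.mean(axis=0)                # (a) subtract the mean vector
    C = X_tilde.T @ X_tilde / X.shape[0]        # (b) covariance matrix
    eigval, eigvec = np.linalg.eigh(C)          # eigendecomposition (ascending)
    order = np.argsort(eigval)[::-1][:n_prime]  # n' largest eigenvalues
    W = eigvec[:, order]                        # eigenvector matrix W
    return X_tilde @ W                          # (c) n'-dimensional data set Z
```

`np.linalg.eigh` is used because the covariance matrix is symmetric; the returned components are ordered so that the first column of Z carries the largest variance.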
The extraction of the combined features relies mainly on PCA dimension reduction. Suppose x(n) is a speech signal of L frames. Its spectrogram is first computed and analyzed with the LBP algorithm described above to obtain a 16 × 59 texture feature matrix LBP. At the same time, the 60-dimensional CQCC feature vector of each frame is extracted, so a segment of L frames yields an acoustic feature matrix CQCC of dimension 60 × L. PCA is applied to the matrices LBP and CQCC^T separately with N′ = 1, giving LBP′ of dimension 16 × 1 and CQCC′ of dimension 60 × 1, respectively. Finally, the 16 × 1 LBP′ and the 60 × 1 CQCC′ are concatenated end to end into a joint feature vector of dimension 76 × 1. Thus, for a speech signal of any duration, the joint feature extraction process ultimately produces a 76 × 1 joint feature vector; the specific flow is shown in FIG. 5.
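The 16 × 59 → 16 and 60 × L → 60 reductions followed by end-to-end splicing can be sketched as follows; treating the rows of each matrix as the samples handed to PCA is an assumption of this sketch, as are the helper names:

```python
import numpy as np

def _pca1(X):
    """Scores of the rows of X on the first principal component (N' = 1)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[0]                       # project onto top principal direction

def joint_feature(lbp_mat, cqcc_mat):
    """Splice a 16x59 LBP texture matrix and a 60xL CQCC matrix into the
    76-dimensional joint vector described above."""
    lbp_vec = _pca1(lbp_mat)                # 16 x 59 -> 16-dimensional LBP'
    cqcc_vec = _pca1(cqcc_mat)              # 60 x L  -> 60-dimensional CQCC'
    return np.concatenate([lbp_vec, cqcc_vec])  # 76-dimensional joint feature
```

The SVD route gives the same first-component scores as an explicit eigendecomposition of the covariance matrix, and avoids forming that matrix when L is large.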
(4) Random forest classifier
Random Forest (RF) is a machine learning algorithm that handles both classification and regression effectively; it is a powerful supervised learning algorithm built on the decision tree model. RF adopts the idea of ensemble learning, combining many weak learners into one strong learner: it randomly draws data samples to grow multiple decision trees that together form a forest, and each tree produces its own classification result. Following majority rule, RF takes the result with the most votes as the classification of the whole forest. As shown in FIG. 6, the RF training process is as follows:
(a) Genuine and fake voices are randomly selected from the voice library; assuming N speech segments in total, a 76-dimensional joint feature vector is extracted from each segment, giving a data set of N vector samples. N′ (N′ ≤ N) vector samples are then drawn from the data set by sampling with replacement and used as the training set for one decision tree. In this process some samples may be drawn repeatedly and others not at all.
(b) Each sample contains M attributes, M being the dimension of the joint feature vector. When the decision tree starts to split, M′ attributes are randomly selected from the M, where M′ should be far smaller than M. The splitting attribute among the M′ candidates is chosen with the Gini index as the splitting strategy: following the splitting rule, the candidate attributes whose information gain is above average are identified first, and among them the attribute with the highest information gain rate is selected.
(c) The nodes of the decision tree are split until all possible values have been used, so the tree grows to its maximum extent without pruning. This yields one decision tree; the process is repeated T times (the target number) to grow further decision trees and form the random forest classifier.
Since all decision trees are independent of one another, each is equally important; when the random forest classifies in this patent, every tree carries the same weight and the final classification is decided by the vote. The method uses a random forest classification algorithm as the classifier for separating genuine from fake voice: the random forest is trained on a data set containing the joint features of genuine and fake voices and then tested on the set of voices to be authenticated, thereby achieving classification and recognition of genuine and fake voice.
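A sketch of such a classifier with scikit-learn; the synthetic Gaussian data, the labels, and the hyperparameter values are illustrative assumptions, not the patent's corpus or settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in for real 76-dimensional joint feature vectors (1 = genuine, 0 = fake).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 76)),   # "genuine" feature vectors
               rng.normal(0.8, 1.0, (100, 76))])  # "fake" feature vectors
y = np.array([1] * 100 + [0] * 100)

clf = RandomForestClassifier(
    n_estimators=100,     # target number of decision trees T
    criterion='gini',     # Gini index as the splitting rule
    max_features='sqrt',  # M' randomly chosen attributes per split, M' << M
    bootstrap=True,       # draw N' samples with replacement per tree
    random_state=0,
).fit(X, y)
pred = clf.predict(X[:2])  # majority vote over the individual trees
```

`predict` returns the label with the most votes across the trees, matching the majority-rule decision described above.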
The foregoing has outlined rather broadly the preferred embodiments and principles of the present invention and it will be appreciated that those skilled in the art may devise variations of the present invention that are within the spirit and scope of the appended claims.
Claims (4)
1. A method for detecting disguised voice by adopting combined features and random forests is characterized by comprising the following steps:
s1, selecting true voice and false voice from the training voice library randomly, extracting LBP local texture characteristics and CQCC acoustic characteristics of each voice selected randomly respectively, and combining the LBP local texture characteristics and the CQCC acoustic characteristics to form a combined characteristic vector to obtain a training data set;
s2, training the random forest by using the training data set to generate a random forest classifier;
s3, extracting LBP local texture features and CQCC acoustic features of the voice to be detected, combining the LBP local texture features and the CQCC acoustic features to form a combined feature vector to be detected, and inputting the combined feature vector to be detected into a random forest classifier to perform authenticity detection on the voice to be detected;
the LBP local texture feature extraction comprises the following steps:
acquiring a spectrogram of the voice to be extracted, and analyzing the spectrogram of the voice to be extracted by using an LBP algorithm to obtain LBP local texture features;
wherein, the voice to be extracted is randomly selected voice or voice to be detected;
before analyzing the spectrogram of the voice to be extracted with the LBP algorithm, partitioning the spectrogram into blocks, and then applying the LBP algorithm to each block to obtain an LBP local texture feature vector consisting of the LBP local texture features of each block.
2. The method for detecting disguised voice using combined features and a random forest as claimed in claim 1, wherein the extraction of the CQCC acoustic features comprises:
constant Q transform is first performed on the voice to be extracted to obtain the spectrum X_CQT(k); the logarithmic power spectrum log|X_CQT(k)|^2 is then obtained; the logarithmic power spectrum is then resampled so that it is converted into a uniformly sampled logarithmic power spectrum log|X_CQT(l)|^2; finally, discrete cosine transform is performed on the resampled logarithmic power spectrum to obtain the CQCC acoustic features of the voice to be extracted;
wherein k and l are the frequency band serial numbers before and after resampling, respectively, and the voice to be extracted is a randomly selected voice or the voice to be detected.
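A minimal sketch of the CQCC chain described in claim 2 (log power spectrum, uniform resampling k -> l, then DCT) is given below. The 12-bins-per-octave spacing, the bin counts, and the dependency-free DCT-II are illustrative assumptions, and the constant Q transform itself is replaced by a stand-in magnitude vector.

```python
import numpy as np

def cqcc_from_cqt(cqt_mag, n_uniform=64, n_coeffs=20):
    """Sketch of the CQCC chain after the constant Q transform:
    log power spectrum -> uniform resampling (k -> l) -> DCT."""
    log_power = np.log(np.abs(cqt_mag) ** 2 + 1e-10)
    k = np.arange(len(log_power))            # geometric band index k
    # CQT bins are geometrically spaced in frequency; map them onto a
    # uniformly spaced axis l by linear interpolation
    f_geo = 2.0 ** (k / 12.0)                # assumed 12 bins per octave
    f_lin = np.linspace(f_geo[0], f_geo[-1], n_uniform)
    resampled = np.interp(f_lin, f_geo, log_power)
    # DCT-II written out explicitly to stay dependency-free
    l = np.arange(n_uniform)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * l + 1)
                   / (2 * n_uniform))
    return basis @ resampled

cqt_frame = np.abs(np.random.randn(96)) + 0.1  # stand-in CQT magnitudes
print(cqcc_from_cqt(cqt_frame).shape)          # (20,)
```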
3. The method for detecting disguised speech using combined features and random forests as claimed in claim 2, wherein said combining LBP local texture features and CQCC acoustic features into a combined feature vector comprises:
reducing the dimensions of the LBP local texture features and the CQCC acoustic features respectively with a principal component analysis algorithm, and then concatenating the dimension-reduced features to generate the joint feature vector.
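The PCA reduction and concatenation in claim 3 could look like the following sketch; the component counts (32 and 16) and the feature dimensions are arbitrary assumptions for illustration.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the row vectors of X onto their top principal components."""
    Xc = X - X.mean(axis=0)
    # rows of Vt are principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
lbp = rng.normal(size=(50, 1024))   # 50 utterances x LBP texture dims
cqcc = rng.normal(size=(50, 60))    # 50 utterances x CQCC dims
# reduce each feature set separately, then splice into joint vectors
joint = np.hstack([pca_reduce(lbp, 32), pca_reduce(cqcc, 16)])
print(joint.shape)                  # (50, 48) joint feature vectors
```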
4. The method for detecting disguised voice using combined features and a random forest as claimed in claim 3, wherein said step S2 comprises the following steps:
S21, assuming the training data set contains N vector samples in total, randomly drawing n vector samples from the training data set with replacement as the training set for one decision tree, where n ≤ N;
S22, each vector sample contains M attributes, M being the dimension of the joint feature vector; when the decision tree is split, randomly selecting m of the M attributes (m ≤ M) to complete the decision tree split according to the Gini index, and judging whether the decision tree can no longer be split; if so, executing step S23; if not, continuing to complete decision tree splits according to the Gini index;
S23, generating a decision tree and judging whether the number of decision trees is less than the target number; if so, returning to step S21; if not, generating the random forest classifier.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110648176.8A CN113436646B (en) | 2021-06-10 | 2021-06-10 | Camouflage voice detection method adopting combined features and random forest |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113436646A CN113436646A (en) | 2021-09-24 |
CN113436646B true CN113436646B (en) | 2022-09-23 |
Family
ID=77755642
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110648176.8A Active CN113436646B (en) | 2021-06-10 | 2021-06-10 | Camouflage voice detection method adopting combined features and random forest |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113436646B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113724693B (en) * | 2021-11-01 | 2022-04-01 | 中国科学院自动化研究所 | Voice judging method and device, electronic equipment and storage medium |
CN114822589B (en) * | 2022-04-02 | 2023-07-04 | 中科猷声(苏州)科技有限公司 | Indoor acoustic parameter determination method, model construction method, device and electronic equipment |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016046652A1 (en) * | 2014-09-24 | 2016-03-31 | FUNDAÇÃO CPQD - Centro de Pesquisa e Desenvolvimento em Telecomunicações | Method and system for detecting fraud in applications based on voice processing |
CN110148425A (en) * | 2019-05-14 | 2019-08-20 | 杭州电子科技大学 | A kind of camouflage speech detection method based on complete local binary pattern |
EP3608907A1 (en) * | 2018-08-10 | 2020-02-12 | Visa International Service Association | Replay spoofing detection for automatic speaker verification system |
CN110797031A (en) * | 2019-09-19 | 2020-02-14 | 厦门快商通科技股份有限公司 | Voice change detection method, system, mobile terminal and storage medium |
CN111611566A (en) * | 2020-05-12 | 2020-09-01 | 珠海造极声音科技有限公司 | Speaker verification system and replay attack detection method thereof |
CN112927694A (en) * | 2021-03-08 | 2021-06-08 | 中国地质大学(武汉) | Voice instruction validity judging method based on fusion voiceprint features |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU2018226844B2 (en) * | 2017-03-03 | 2021-11-18 | Pindrop Security, Inc. | Method and apparatus for detecting spoofing conditions |
Non-Patent Citations (3)
Title |
---|
Local Binary Pattern with Random Forest for Acoustic Scene Classification; Shamsiah Abidin et al.; 2018 IEEE International Conference on Multimedia and Expo (ICME); 2018-10-11; entire document *
Spectrotemporal Analysis Using Local Binary Pattern Variants for Acoustic Scene Classification; Shamsiah Abidin et al.; IEEE/ACM Transactions on Audio, Speech, and Language Processing; 2018-07-12; entire document *
Time-frequency image features for acoustic scene classification; Gao Min et al.; Technical Acoustics (声学技术); 2017-10; entire document *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108281146B (en) | Short voice speaker identification method and device | |
CN108986824B (en) | Playback voice detection method | |
CN113436646B (en) | Camouflage voice detection method adopting combined features and random forest | |
CN110120230B (en) | Acoustic event detection method and device | |
CN106991312B (en) | Internet anti-fraud authentication method based on voiceprint recognition | |
CN113223536B (en) | Voiceprint recognition method and device and terminal equipment | |
Chen et al. | Towards understanding and mitigating audio adversarial examples for speaker recognition | |
CN111816185A (en) | Method and device for identifying speaker in mixed voice | |
CN110767239A (en) | Voiceprint recognition method, device and equipment based on deep learning | |
CN114596879B (en) | False voice detection method and device, electronic equipment and storage medium | |
Gao et al. | Generalized spoofing detection inspired from audio generation artifacts | |
CN114495950A (en) | Voice deception detection method based on deep residual shrinkage network | |
Chen et al. | SEC4SR: a security analysis platform for speaker recognition | |
CN111243600A (en) | Voice spoofing attack detection method based on sound field and field pattern | |
de Almeida et al. | Use of paraconsistent feature engineering to support the long term feature choice for speaker verification | |
CN110808067A (en) | Low signal-to-noise ratio sound event detection method based on binary multiband energy distribution | |
WO2013008956A1 (en) | Sound processing method, sound processing system, video processing method, video processing system, sound processing device, and method and program for controlling same | |
CN115293214A (en) | Underwater sound target recognition model optimization method based on sample expansion network | |
KR101094763B1 (en) | Apparatus and method for extracting feature vector for user authentication | |
CN114898773A (en) | Synthetic speech detection method based on deep self-attention neural network classifier | |
CN113627327A (en) | Singing voice detection method based on multi-scale time-frequency graph parallel input convolution neural network | |
Alam | On the use of fisher vector encoding for voice spoofing detection | |
CN114639387A (en) | Voiceprint fraud detection method based on reconstructed group delay-constant Q transform spectrogram | |
CN113870896A (en) | Motion sound false judgment method and device based on time-frequency graph and convolutional neural network | |
Zhang et al. | Improving robustness of speech anti-spoofing system using resnext with neighbor filters |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||