CN113436646B - Camouflage voice detection method adopting combined features and random forest
- Publication number
- CN113436646B (application CN202110648176.8A)
- Authority
- CN
- China
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
Abstract
The invention relates to a method for detecting disguised voice using combined features and a random forest, comprising the following steps: S1, randomly selecting genuine and fake voices from a training voice library, extracting the LBP local texture features and CQCC acoustic features of each selected voice, and combining them into a joint feature vector, thereby obtaining a training data set; S2, training a random forest with the training data set to generate a random forest classifier; S3, extracting the LBP local texture features and CQCC acoustic features of the voice to be detected, combining them into a joint feature vector, and feeding this vector to the random forest classifier to judge whether the voice to be detected is genuine. The invention can detect the authenticity of the voice to be detected and effectively improves the security of ASV systems.
Description
Technical Field
The invention belongs to the technical field of disguised voice detection, and particularly relates to a disguised voice detection method adopting combined features and random forests.
Background
An Automatic Speaker Verification (ASV) system analyzes a speaker's voice signal to verify the speaker's claimed identity. ASV is a contactless form of identity authentication whose main advantages are low equipment cost and convenient operation. Although current ASV systems recognize target voices with high accuracy, malicious spoofing attacks that impersonate the real identity of a target speaker greatly reduce their security.
The main types of spoofing attack are speech synthesis, voice conversion, human impersonation, and speech replay. To cope with these different kinds of attack, the detection performance of the speaker recognition system under spoofing must be improved so that the ASV system can resist them. With such anti-spoofing protection in place, only samples that pass spoofing detection and are judged to be genuine speech are passed on to the ASV system for further identity verification.
Disclosure of Invention
Based on the above-mentioned shortcomings in the prior art, the present invention aims to provide a method for detecting a disguised voice by using a combination feature and a random forest.
In order to achieve the above purpose of the present invention, the following technical solutions are adopted:
a method for detecting disguised voice by adopting combined characteristics and random forests comprises the following steps:
s1, randomly selecting genuine and fake voices from a training voice library, extracting the LBP local texture features and CQCC acoustic features of each selected voice, and combining them into a joint feature vector, thereby obtaining a training data set;
s2, training the random forest by using the training data set to generate a random forest classifier;
s3, extracting LBP local texture features and CQCC acoustic features of the voice to be detected, combining the LBP local texture features and the CQCC acoustic features to form a combined feature vector to be detected, and inputting the combined feature vector to be detected into a random forest classifier to perform authenticity detection on the voice to be detected.
Preferably, the extracting of the LBP local texture features includes:
acquiring a spectrogram of the voice to be extracted, and analyzing the spectrogram of the voice to be extracted by using an LBP algorithm to obtain LBP local texture features;
the voice to be extracted is randomly selected voice or voice to be detected.
As a preferred scheme, before the spectrogram of the voice to be extracted is analyzed with the LBP algorithm, the spectrogram is partitioned into blocks; the LBP algorithm is then applied to each block, yielding an LBP local texture feature vector composed of the LBP local texture features of all blocks.
Preferably, the extracting of the CQCC acoustic features includes:
firstly, a constant Q transform is applied to the voice to be extracted to obtain the spectrum X^CQ(k); the log power spectrum log|X^CQ(k)|^2 is then computed; the log power spectrum is then resampled to log|X^CQ(l)|^2; finally, a discrete cosine transform is applied to the resampled log power spectrum to obtain the CQCC acoustic features of the voice to be extracted;
and k and l are frequency band serial numbers before and after resampling respectively, and the voice to be extracted is randomly selected voice or voice to be detected.
As a preferred scheme, the combining LBP local texture features and CQCC acoustic features into a joint feature vector includes:
and respectively reducing the dimensions of the LBP local texture features and the CQCC acoustic features by adopting a principal component analysis algorithm, and then splicing the features after dimension reduction so as to generate a joint feature vector.
Preferably, the step S2 includes the following steps:
s21, assuming the training data set contains N vector samples, randomly drawing N' vector samples with replacement from the training data set as the training samples for one decision tree, where N' ≤ N;
s22, each vector sample contains M attributes, M being the dimension of the joint feature vector; at each split of the decision tree, randomly selecting M' attributes and performing the split according to the Gini index, then judging whether the tree can be split further; if not, going to step S23; otherwise, continuing to split according to the Gini index;
s23, the decision tree is complete; judging whether the number of decision trees is still less than the target number; if so, returning to step S21; if not, outputting the random forest classifier.
Compared with the prior art, the invention has the following technical effects:
the method comprises the steps of extracting texture features in a speech signal spectrogram by using a Local Binary Pattern (LBP), obtaining combined features by combining acoustic features of a Constant Q Cepstrum Coefficient (CQCC), and training a Random Forest (RF) classifier to perform authenticity detection and classification on speech to be detected by using the obtained combined feature vectors, so that the safety of an ASV system is effectively improved.
Drawings
FIG. 1 is a flow chart of a method for detecting disguised speech using a combination of features and random forests in accordance with an embodiment of the present invention;
FIG. 2 is an exemplary LBP solving process in accordance with an embodiment of the present invention;
fig. 3 is a schematic diagram of an LBP texture feature extraction process according to an embodiment of the present invention;
fig. 4 is a schematic diagram of an extraction flow of CQCC acoustic features according to an embodiment of the present invention;
- FIG. 5 is a flow chart of joint feature combination according to an embodiment of the present invention;
FIG. 6 is a flow chart of training a random forest according to an embodiment of the present invention.
Detailed Description
The technical solution of the present invention is further explained by the following specific examples.
As shown in fig. 1, the method for detecting a disguised voice by using a combination feature and a random forest according to the embodiment of the present invention includes the following steps:
s1, randomly selecting true voice and pseudo voice from a training voice library, respectively extracting LBP local texture characteristics and CQCC acoustic characteristics of each randomly selected voice, and combining the LBP local texture characteristics and the CQCC acoustic characteristics to form a combined characteristic vector so as to obtain a training data set;
specifically, a spectrogram of the randomly selected voice is obtained, and the spectrogram of the voice to be extracted is analyzed by using an LBP algorithm to obtain LBP local texture features. In order to improve the detection efficiency and accuracy of the disguised voice detection method, the spectrogram is subjected to blocking processing, and then the spectrogram of the voice to be extracted is analyzed by using an LBP algorithm for each block spectrogram, so that an LBP local texture feature vector consisting of LBP local texture features of each block spectrogram is obtained.
Extraction of CQCC acoustic features, comprising:
firstly, a constant Q transform is applied to the randomly selected voice to obtain the spectrum X^CQ(k); the log power spectrum log|X^CQ(k)|^2 is then computed; the log power spectrum is then resampled to log|X^CQ(l)|^2; finally, a discrete cosine transform is applied to the resampled log power spectrum to obtain the CQCC acoustic features of the voice to be extracted;
where k and l are the frequency-band indices before and after resampling, respectively.
After the two kinds of features are extracted, the LBP local texture features and the CQCC acoustic features are combined into a joint feature vector. Because the two features have different dimensions, they cannot be concatenated directly; moreover, an over-large feature dimension would make the computation in the spoofing detection stage too heavy and hurt the efficiency of disguised voice detection. The combination therefore proceeds as follows:
and respectively reducing the dimensions of the LBP local texture features and the CQCC acoustic features by adopting a principal component analysis algorithm, and then splicing the reduced features to generate a joint feature vector.
S2, training the random forest by using the training data set to generate a random forest classifier;
the training of the random forest specifically comprises the following steps:
s21, assuming the training data set contains N vector samples, randomly drawing N' vector samples with replacement from the training data set as the training samples for one decision tree, where N' ≤ N;
s22, each vector sample contains M attributes, M being the dimension of the joint feature vector; at each split of the decision tree, M' attributes are randomly selected and the split is performed according to the Gini index, after which it is judged whether the tree can be split further; if not, go to step S23; otherwise, continue splitting according to the Gini index;
s23, the decision tree is complete; judge whether the number of decision trees is still less than the target number; if so, return to step S21; if not, output the random forest classifier.
S3, extracting LBP local texture characteristics and CQCC acoustic characteristics of the voice to be detected (namely the voice to be verified), combining the LBP local texture characteristics and the CQCC acoustic characteristics to form a combined characteristic vector to be detected, and inputting the combined characteristic vector to be detected into a random forest classifier to perform authenticity detection on the voice to be detected.
For the specific process of extracting the LBP local texture feature and the CQCC acoustic feature and combining the two features, reference may be made to step S1, which is not described herein again.
The working principle and the illustration of each step are described in detail below:
(1) extraction of LBP local texture features
LBP feature parameters currently perform well in the field of image recognition; they are efficient texture features with good classification power. LBP describes the texture of an object by comparing the gray values of adjacent pixels in its image. Taking the gray value g_c of the center pixel as the reference, the gray value g_i of each neighboring pixel is compared with it: a neighbor whose gray value is greater than or equal to g_c is coded as 1, and one whose gray value is less than g_c is coded as 0. With the top-left neighbor taken as the first digit and the others recorded clockwise, a sequence of 0s and 1s is obtained; converting this binary number to decimal completes the basic LBP operation.
The LBP operation is given by:

LBP_{N,R} = Σ_{i=1}^{N} s(g_i − g_c) · 2^{i−1},  with s(x) = 1 if x ≥ 0 and s(x) = 0 otherwise,

where g_c is the gray value of the center pixel, R is the radius, and g_i (i = 1, …, N) are the gray values of the N pixels distributed on a circle of radius R around the center; the resulting LBP_{N,R} is the LBP value of the center pixel. FIG. 2 shows an example of solving the LBP value of one image pixel for R = 1.
The LBP algorithm is used to perform texture analysis on the spectrogram of each speech signal to be detected. To improve the overall performance of the disguised voice detection system, the spectrogram is first partitioned: the invention divides the whole spectrogram evenly into 16 blocks in a 4 × 4 layout, and LBP features are extracted from each block separately. For the LBP extraction, the sampling radius R is 1 and the number of sampling points N is 8.
In the LBP_{8,1} mode, after the spectrogram has been processed with the 3 × 3 LBP of FIG. 2, each pixel takes a value between 0 and 255; counting these values with a statistical histogram gives a 1 × 256-dimensional feature vector per spectrogram block. For a 3 × 3 LBP, many entries of this vector are empty, which adds useless information and computation, so the vector is compressed: among the 8-bit binary codes, all codes whose circular 0/1 transitions exceed 2 are merged into a single class. After this processing, the original 256 cases of the 3 × 3 LBP reduce to 59, and a statistical histogram over these 59 possible LBP values yields a 59-dimensional feature vector. FIG. 3 illustrates the extraction of the 3 × 3 LBP texture feature vector. Performing the same operation on all 16 blocks of the spectrogram yields a feature matrix of dimension 16 × 59.
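The 3 × 3 LBP coding and the uniform-pattern merging described above can be sketched in a few lines of NumPy; this is an illustrative sketch of the standard operations, not the patent's implementation:

```python
import numpy as np

def lbp_code(patch):
    """Basic 3x3 LBP: threshold the 8 neighbours against the centre pixel,
    reading them clockwise with the top-left neighbour as the first digit."""
    c = patch[1, 1]
    neigh = [patch[0, 0], patch[0, 1], patch[0, 2],
             patch[1, 2], patch[2, 2], patch[2, 1],
             patch[2, 0], patch[1, 0]]
    bits = [1 if g >= c else 0 for g in neigh]            # >= c -> 1, < c -> 0
    return sum(b << (7 - i) for i, b in enumerate(bits))  # binary -> decimal

def is_uniform(code):
    """'Uniform' pattern: at most 2 circular 0/1 transitions among the 8 bits."""
    bits = [(code >> i) & 1 for i in range(8)]
    return sum(bits[i] != bits[(i + 1) % 8] for i in range(8)) <= 2

# 58 uniform patterns plus one catch-all bin for the rest -> 59 histogram bins
assert sum(is_uniform(c) for c in range(256)) == 58
```

Applying `lbp_code` to every 3 × 3 neighbourhood of a spectrogram block and histogramming the codes into the 58 uniform bins plus one catch-all bin yields the 59-dimensional vector per block.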
(2) CQCC acoustic feature extraction
The CQCC is similar to the conventional acoustic feature extraction method, but replaces the Short-time Fourier Transform (STFT) used in the conventional feature extraction with a Constant Q Transform (CQT). CQTs were initially used for musical tone analysis in music recognition, and have the major advantage of having different frequency and time resolutions in the low and high frequency bands, thereby avoiding the disadvantage of uniform distribution of STFT time-frequency resolution.
CQCC features are extracted on the basis of the CQT. Let x(n) denote one frame of the speech signal; its CQT is

X^CQ(k) = Σ_{n=0}^{⌊N_k⌋−1} x(n) · a_k*(n),  k = 1, 2, …, K,

where k is the index of the frequency bin, K is the number of bins, and ⌊·⌋ denotes rounding down. The analyzed band runs from the minimum frequency f_min to the maximum frequency f_max and is divided into N_O octaves following a geometric distribution; each octave is subdivided into B bands, i.e. K = B · N_O. a_k*(n) is the complex conjugate of a_k(n), and N_k is the variable window length used in the time-frequency analysis. a_k(n) is defined as

a_k(n) = (C / N_k) · w(n / N_k) · exp[ j (2π n f_k / f_s + φ_k) ],

where f_k is the center frequency of the k-th filter band, f_s is the sampling frequency, and φ_k is a phase offset. C is a normalization factor computed from the window function w(t). The widths of the K bands follow the twelve-tone equal temperament, so

f_k = f_1 · 2^{(k−1)/B},

where f_1 is the center frequency of the lowest band.

The parameter Q, which trades off time resolution against frequency resolution, is

Q = f_k / (f_{k+1} − f_k) = (2^{1/B} − 1)^{−1}.

Since Q depends only on B and remains constant throughout the CQT, the window length is

N_k = Q · f_s / f_k.
as shown in FIG. 4, the extraction of CQCC acoustic features is implemented by first performing constant Q transformation on a speech signal X (n) to obtain a frequency spectrum X CQ (k) Then obtaining an energy spectrum | X CQ (k)| 2 Log of (2) and log power spectrum log | X CQ (k)| 2 Then resampled to log | X CQ (l)| 2 And finally, performing discrete cosine transform on the resampled logarithmic energy spectrum to extract a CQCC coefficient of the voice signal, and finally obtaining a CQCC feature vector, namely:
where, p is 0, 1.., and L-1, L is the frequency band number after resampling.
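The FIG. 4 pipeline after the CQT — log power spectrum, uniform resampling, discrete cosine transform — can be sketched as follows; linear interpolation stands in for the resampling step, and the sizes (L = 128, 60 coefficients) are assumptions of this sketch:

```python
import numpy as np

def cqcc(X_cq, L=128, n_coeffs=60):
    """CQCC post-processing for one frame of an already-computed constant-Q
    spectrum X_cq: log power spectrum -> uniform resampling -> DCT."""
    log_power = np.log(np.abs(X_cq) ** 2 + 1e-12)   # log|X^CQ(k)|^2 (eps avoids log 0)
    k = np.arange(len(log_power))
    l = np.linspace(0, len(log_power) - 1, L)
    uniform = np.interp(l, k, log_power)            # resampled log|X^CQ(l)|^2
    p = np.arange(n_coeffs)[:, None]
    ll = np.arange(1, L + 1)[None, :]
    basis = np.cos(p * (ll - 0.5) * np.pi / L)      # DCT basis cos[p(l-1/2)pi/L]
    return basis @ uniform                          # CQCC(p), p = 0..n_coeffs-1
```

The DCT step matches the summation formula above term by term; a production system would resample on the log-frequency axis rather than on the raw bin index.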
(3) Combined features
In a spoofing attack scenario, combined features carry more voice information and perform better. The acoustic CQCC features and the LBP texture features are therefore merged into combined features. Because the texture and acoustic features have different dimensions, they cannot be concatenated directly; moreover, an over-large feature dimension would make the computation in the spoofing detection stage too heavy and degrade the overall performance of the disguised voice detection system.
The method adopts Principal Component Analysis (PCA) to respectively perform dimensionality reduction on CQCC and LBP characteristics, and then splices the dimensionality-reduced characteristics to generate combined characteristics.
The specific flow of the PCA dimension reduction algorithm is as follows:
(a) First, a data set X = {x_1, x_2, …, x_i, …, x_N} of N M-dimensional vectors is input. The mean vector μ = (1/N) Σ_{i=1}^{N} x_i is subtracted from each vector x_i in X, i.e. x̃_i = x_i − μ, which yields the de-centered data set X̃ = {x̃_1, x̃_2, …, x̃_N}.
(b) The covariance matrix C = (1/N) Σ_{i=1}^{N} x̃_i x̃_i^T is constructed and its eigenvalues are decomposed; the eigenvectors w_1, w_2, …, w_{N′} corresponding to the N′ largest eigenvalues are selected in turn, giving the eigenvector matrix W = {w_1, w_2, …, w_{N′}}. Here T denotes the matrix transpose.
(c) Each sample vector x̃_i of the data set X̃ is reduced in dimension, i.e. z_i = W^T x̃_i, yielding the N′-dimensional data set Z = {z_1, z_2, …, z_N}.
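The PCA steps above translate almost line by line into NumPy; a minimal illustrative sketch (function name and shapes are not from the patent):

```python
import numpy as np

def pca_reduce(X, n_prime):
    """De-centre the data, eigendecompose the covariance matrix, keep the
    eigenvectors of the n' largest eigenvalues, and project. X holds one
    M-dimensional sample per row."""
    X_tilde = X - X.mean(axis=0)                # (a) subtract the mean vector
    C = X_tilde.T @ X_tilde / X.shape[0]        # (b) covariance matrix
    eigval, eigvec = np.linalg.eigh(C)          # eigendecomposition (ascending)
    order = np.argsort(eigval)[::-1][:n_prime]  # n' largest eigenvalues
    W = eigvec[:, order]                        # eigenvector matrix W
    return X_tilde @ W                          # (c) n'-dimensional data set Z
```

`np.linalg.eigh` is used because the covariance matrix is symmetric; the returned components are ordered so that the first column of Z carries the largest variance.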
The extraction of the combined features relies mainly on PCA dimension reduction. Suppose x(n) is a speech signal of L frames. Its spectrogram is first computed and analyzed with the LBP algorithm described above to obtain a 16 × 59 texture feature matrix LBP. At the same time, the 60-dimensional CQCC feature vector of each frame is extracted, so a segment of L frames yields an acoustic feature matrix CQCC of dimension 60 × L. PCA is applied to the matrices LBP and CQCC^T separately with N′ = 1, giving LBP′ of dimension 16 × 1 and CQCC′ of dimension 60 × 1, respectively. Finally, the 16 × 1 LBP′ and the 60 × 1 CQCC′ are concatenated end to end into a joint feature vector of dimension 76 × 1. Thus, for a speech signal of any duration, the joint feature extraction process ultimately produces a 76 × 1 joint feature vector; the specific flow is shown in FIG. 5.
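The 16 × 59 → 16 and 60 × L → 60 reductions followed by end-to-end splicing can be sketched as follows; treating the rows of each matrix as the samples handed to PCA is an assumption of this sketch, as are the helper names:

```python
import numpy as np

def _pca1(X):
    """Scores of the rows of X on the first principal component (N' = 1)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[0]                       # project onto top principal direction

def joint_feature(lbp_mat, cqcc_mat):
    """Splice a 16x59 LBP texture matrix and a 60xL CQCC matrix into the
    76-dimensional joint vector described above."""
    lbp_vec = _pca1(lbp_mat)                # 16 x 59 -> 16-dimensional LBP'
    cqcc_vec = _pca1(cqcc_mat)              # 60 x L  -> 60-dimensional CQCC'
    return np.concatenate([lbp_vec, cqcc_vec])  # 76-dimensional joint feature
```

The SVD route gives the same first-component scores as an explicit eigendecomposition of the covariance matrix, and avoids forming that matrix when L is large.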
(4) Random forest classifier
Random Forest (RF) is a machine learning algorithm that handles both classification and regression effectively; it is a powerful supervised learning algorithm built on the decision tree model. RF adopts the idea of ensemble learning, combining many weak learners into one strong learner: it randomly draws data samples to grow multiple decision trees that together form a forest, and each tree produces its own classification result. Following majority rule, RF takes the result with the most votes as the classification of the whole forest. As shown in FIG. 6, the RF training process is as follows:
(a) Genuine and fake voices are randomly selected from the voice library; assuming N speech segments in total, a 76-dimensional joint feature vector is extracted from each segment, giving a data set of N vector samples. N′ (N′ ≤ N) vector samples are then drawn from the data set by sampling with replacement and used as the training set for one decision tree. In this process some samples may be drawn repeatedly and others not at all.
(b) Each sample contains M attributes, M being the dimension of the joint feature vector. When the decision tree starts to split, M′ attributes are randomly selected from the M, where M′ should be far smaller than M. The splitting attribute among the M′ candidates is chosen with the Gini index as the splitting strategy: following the splitting rule, the candidate attributes whose information gain is above average are identified first, and among them the attribute with the highest information gain rate is selected.
(c) The nodes of the decision tree are split until all possible values have been used, so the tree grows to its maximum extent without pruning. This yields one decision tree; the process is repeated T times (the target number) to grow further decision trees and form the random forest classifier.
Since all decision trees are independent of one another, each is equally important; when the random forest classifies in this patent, every tree carries the same weight and the final classification is decided by the vote. The method uses a random forest classification algorithm as the classifier for separating genuine from fake voice: the random forest is trained on a data set containing the joint features of genuine and fake voices and then tested on the set of voices to be authenticated, thereby achieving classification and recognition of genuine and fake voice.
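A sketch of such a classifier with scikit-learn; the synthetic Gaussian data, the labels, and the hyperparameter values are illustrative assumptions, not the patent's corpus or settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in for real 76-dimensional joint feature vectors (1 = genuine, 0 = fake).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 76)),   # "genuine" feature vectors
               rng.normal(0.8, 1.0, (100, 76))])  # "fake" feature vectors
y = np.array([1] * 100 + [0] * 100)

clf = RandomForestClassifier(
    n_estimators=100,     # target number of decision trees T
    criterion='gini',     # Gini index as the splitting rule
    max_features='sqrt',  # M' randomly chosen attributes per split, M' << M
    bootstrap=True,       # draw N' samples with replacement per tree
    random_state=0,
).fit(X, y)
pred = clf.predict(X[:2])  # majority vote over the individual trees
```

`predict` returns the label with the most votes across the trees, matching the majority-rule decision described above.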
The foregoing has outlined rather broadly the preferred embodiments and principles of the present invention and it will be appreciated that those skilled in the art may devise variations of the present invention that are within the spirit and scope of the appended claims.
Claims (4)
1. A method for detecting disguised voice by adopting combined features and random forests is characterized by comprising the following steps:
s1, selecting true voice and false voice from the training voice library randomly, extracting LBP local texture characteristics and CQCC acoustic characteristics of each voice selected randomly respectively, and combining the LBP local texture characteristics and the CQCC acoustic characteristics to form a combined characteristic vector to obtain a training data set;
s2, training the random forest by using the training data set to generate a random forest classifier;
s3, extracting LBP local texture features and CQCC acoustic features of the voice to be detected, combining the LBP local texture features and the CQCC acoustic features to form a combined feature vector to be detected, and inputting the combined feature vector to be detected into a random forest classifier to perform authenticity detection on the voice to be detected;
the LBP local texture feature extraction comprises the following steps:
acquiring a spectrogram of the voice to be extracted, and analyzing the spectrogram of the voice to be extracted by using an LBP algorithm to obtain LBP local texture features;
wherein, the voice to be extracted is randomly selected voice or voice to be detected;
before analyzing the spectrogram of the voice to be extracted with the LBP algorithm, partitioning the spectrogram into blocks, and then applying the LBP algorithm to each block to obtain an LBP local texture feature vector consisting of the LBP local texture features of each block.
2. The method for detecting disguised voice using combined features and a random forest as claimed in claim 1, wherein the extraction of the CQCC acoustic features comprises:
constant Q transform is first performed on the voice to be extracted to obtain the spectrum X_CQT(k); the logarithmic power spectrum log|X_CQT(k)|^2 is then obtained; the logarithmic power spectrum is then resampled so that it is converted into a uniformly sampled logarithmic power spectrum log|X_CQT(l)|^2; finally, discrete cosine transform is performed on the resampled logarithmic power spectrum to obtain the CQCC acoustic features of the voice to be extracted;
wherein k and l are the frequency band serial numbers before and after resampling, respectively, and the voice to be extracted is a randomly selected voice or the voice to be detected.
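A minimal sketch of the CQCC chain described in claim 2 (log power spectrum, uniform resampling k -> l, then DCT) is given below. The 12-bins-per-octave spacing, the bin counts, and the dependency-free DCT-II are illustrative assumptions, and the constant Q transform itself is replaced by a stand-in magnitude vector.

```python
import numpy as np

def cqcc_from_cqt(cqt_mag, n_uniform=64, n_coeffs=20):
    """Sketch of the CQCC chain after the constant Q transform:
    log power spectrum -> uniform resampling (k -> l) -> DCT."""
    log_power = np.log(np.abs(cqt_mag) ** 2 + 1e-10)
    k = np.arange(len(log_power))            # geometric band index k
    # CQT bins are geometrically spaced in frequency; map them onto a
    # uniformly spaced axis l by linear interpolation
    f_geo = 2.0 ** (k / 12.0)                # assumed 12 bins per octave
    f_lin = np.linspace(f_geo[0], f_geo[-1], n_uniform)
    resampled = np.interp(f_lin, f_geo, log_power)
    # DCT-II written out explicitly to stay dependency-free
    l = np.arange(n_uniform)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * l + 1)
                   / (2 * n_uniform))
    return basis @ resampled

cqt_frame = np.abs(np.random.randn(96)) + 0.1  # stand-in CQT magnitudes
print(cqcc_from_cqt(cqt_frame).shape)          # (20,)
```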
3. The method for detecting disguised speech using combined features and random forests as claimed in claim 2, wherein said combining LBP local texture features and CQCC acoustic features into a combined feature vector comprises:
reducing the dimensions of the LBP local texture features and the CQCC acoustic features respectively with a principal component analysis algorithm, and then concatenating the dimension-reduced features to generate the joint feature vector.
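The PCA reduction and concatenation in claim 3 could look like the following sketch; the component counts (32 and 16) and the feature dimensions are arbitrary assumptions for illustration.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the row vectors of X onto their top principal components."""
    Xc = X - X.mean(axis=0)
    # rows of Vt are principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
lbp = rng.normal(size=(50, 1024))   # 50 utterances x LBP texture dims
cqcc = rng.normal(size=(50, 60))    # 50 utterances x CQCC dims
# reduce each feature set separately, then splice into joint vectors
joint = np.hstack([pca_reduce(lbp, 32), pca_reduce(cqcc, 16)])
print(joint.shape)                  # (50, 48) joint feature vectors
```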
4. The method for detecting disguised voice using combined features and a random forest as claimed in claim 3, wherein said step S2 comprises the following steps:
S21, assuming the training data set contains N vector samples in total, randomly drawing n vector samples from the training data set with replacement as the training set for one decision tree, where n ≤ N;
S22, each vector sample contains M attributes, M being the dimension of the joint feature vector; when the decision tree is split, randomly selecting m of the M attributes (m ≤ M) to complete the decision tree split according to the Gini index, and judging whether the decision tree can no longer be split; if so, executing step S23; if not, continuing to complete decision tree splits according to the Gini index;
S23, generating a decision tree and judging whether the number of decision trees is less than the target number; if so, returning to step S21; if not, generating the random forest classifier.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110648176.8A CN113436646B (en) | 2021-06-10 | 2021-06-10 | Camouflage voice detection method adopting combined features and random forest |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113436646A CN113436646A (en) | 2021-09-24 |
CN113436646B true CN113436646B (en) | 2022-09-23 |
Family
ID=77755642
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110648176.8A Active CN113436646B (en) | 2021-06-10 | 2021-06-10 | Camouflage voice detection method adopting combined features and random forest |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113436646B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113724693B (en) * | 2021-11-01 | 2022-04-01 | 中国科学院自动化研究所 | Voice judging method and device, electronic equipment and storage medium |
CN114822589B (en) * | 2022-04-02 | 2023-07-04 | 中科猷声(苏州)科技有限公司 | Indoor acoustic parameter determination method, model construction method, device and electronic equipment |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016046652A1 (en) * | 2014-09-24 | 2016-03-31 | FUNDAÇÃO CPQD - Centro de Pesquisa e Desenvolvimento em Telecomunicações | Method and system for detecting fraud in applications based on voice processing |
CN110148425A (en) * | 2019-05-14 | 2019-08-20 | 杭州电子科技大学 | A kind of camouflage speech detection method based on complete local binary pattern |
EP3608907A1 (en) * | 2018-08-10 | 2020-02-12 | Visa International Service Association | Replay spoofing detection for automatic speaker verification system |
CN110797031A (en) * | 2019-09-19 | 2020-02-14 | 厦门快商通科技股份有限公司 | Voice change detection method, system, mobile terminal and storage medium |
CN111611566A (en) * | 2020-05-12 | 2020-09-01 | 珠海造极声音科技有限公司 | Speaker verification system and replay attack detection method thereof |
CN112927694A (en) * | 2021-03-08 | 2021-06-08 | 中国地质大学(武汉) | Voice instruction validity judging method based on fusion voiceprint features |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU2018226844B2 (en) * | 2017-03-03 | 2021-11-18 | Pindrop Security, Inc. | Method and apparatus for detecting spoofing conditions |
Non-Patent Citations (3)
Title |
---|
Local Binary Pattern with Random Forest for Acoustic Scene Classification; Shamsiah Abidin et al.; 2018 IEEE International Conference on Multimedia and Expo (ICME); 2018-10-11; entire document *
Spectrotemporal Analysis Using Local Binary Pattern Variants for Acoustic Scene Classification; Shamsiah Abidin et al.; IEEE/ACM Transactions on Audio, Speech, and Language Processing; 2018-07-12; entire document *
Time-frequency image features for acoustic scene classification; Gao Min et al.; Technical Acoustics (声学技术); 2017-10; entire document *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108281146B (en) | Short voice speaker identification method and device | |
CN108986824B (en) | Playback voice detection method | |
CN113436646B (en) | Camouflage voice detection method adopting combined features and random forest | |
CN110120230B (en) | Acoustic event detection method and device | |
CN106991312B (en) | Internet anti-fraud authentication method based on voiceprint recognition | |
CN113223536B (en) | Voiceprint recognition method and device and terminal equipment | |
Chen et al. | Towards understanding and mitigating audio adversarial examples for speaker recognition | |
CN111816185A (en) | Method and device for identifying speaker in mixed voice | |
CN110767239A (en) | Voiceprint recognition method, device and equipment based on deep learning | |
CN114596879B (en) | False voice detection method and device, electronic equipment and storage medium | |
Gao et al. | Generalized spoofing detection inspired from audio generation artifacts | |
CN114495950A (en) | Voice deception detection method based on deep residual shrinkage network | |
Chen et al. | SEC4SR: a security analysis platform for speaker recognition | |
CN111243600A (en) | Voice spoofing attack detection method based on sound field and field pattern | |
de Almeida et al. | Use of paraconsistent feature engineering to support the long term feature choice for speaker verification | |
CN110808067A (en) | Low signal-to-noise ratio sound event detection method based on binary multiband energy distribution | |
WO2013008956A1 (en) | Sound processing method, sound processing system, video processing method, video processing system, sound processing device, and method and program for controlling same | |
CN115293214A (en) | Underwater sound target recognition model optimization method based on sample expansion network | |
KR101094763B1 (en) | Apparatus and method for extracting feature vector for user authentication | |
CN114898773A (en) | Synthetic speech detection method based on deep self-attention neural network classifier | |
CN113627327A (en) | Singing voice detection method based on multi-scale time-frequency graph parallel input convolution neural network | |
Alam | On the use of fisher vector encoding for voice spoofing detection | |
CN114639387A (en) | Voiceprint fraud detection method based on reconstructed group delay-constant Q transform spectrogram | |
CN113870896A (en) | Motion sound false judgment method and device based on time-frequency graph and convolutional neural network | |
Zhang et al. | Improving robustness of speech anti-spoofing system using resnext with neighbor filters |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||