CN113436646B - Camouflage voice detection method adopting combined features and random forest - Google Patents

Camouflage voice detection method adopting combined features and random forest

Info

Publication number
CN113436646B
CN113436646B (application CN202110648176.8A)
Authority
CN
China
Prior art keywords
voice
features
lbp
cqcc
local texture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110648176.8A
Other languages
Chinese (zh)
Other versions
CN113436646A (en
Inventor
简志华
于佳祺
朱雅楠
徐嘉
韦凤瑜
吴超
游林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110648176.8A priority Critical patent/CN113436646B/en
Publication of CN113436646A publication Critical patent/CN113436646A/en
Application granted granted Critical
Publication of CN113436646B publication Critical patent/CN113436646B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/004 - Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006 - Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for detecting disguised voice using combined features and a random forest, which comprises the following steps: S1, randomly selecting true voice and false voice from a training voice library, extracting the LBP local texture features and CQCC acoustic features of each randomly selected voice, and combining the two into a joint feature vector to obtain a training data set; S2, training a random forest with the training data set to generate a random forest classifier; S3, extracting the LBP local texture features and CQCC acoustic features of the voice to be detected, combining them into a joint feature vector to be detected, and inputting this vector into the random forest classifier to detect the authenticity of the voice. The invention can detect whether the voice to be tested is true or false, and effectively improves the security of the ASV system.

Description

Camouflage voice detection method adopting combined features and random forest
Technical Field
The invention belongs to the technical field of disguised voice detection, and particularly relates to a disguised voice detection method adopting combined features and random forests.
Background
An Automatic Speaker Verification (ASV) system analyzes a speaker's voice signal to verify the identity of the speaker to be recognized. ASV is a form of identity authentication that requires no direct contact; its main advantages are the low cost of the detection equipment and convenient operation. Although current ASV systems recognize target voices with high accuracy, malicious spoofing attacks that impersonate the real identity of a target speaker greatly reduce their security.
The main types of spoofing attacks are speech synthesis, voice conversion, human impersonation and voice replay. To cope with these different kinds of spoofing attacks, the detection performance of the speaker recognition system under spoofing must be improved so that the ASV system can resist them. With such an anti-spoofing technique in place, only samples that pass spoofing detection and are judged to be true speech are passed on to the ASV system for further identity verification.
Disclosure of Invention
Based on the above-mentioned shortcomings in the prior art, the present invention aims to provide a method for detecting a disguised voice by using a combination feature and a random forest.
In order to achieve the above purpose of the present invention, the following technical solutions are adopted:
a method for detecting disguised voice by adopting combined characteristics and random forests comprises the following steps:
s1, selecting true voice and false voice from the training voice library randomly, extracting LBP local texture characteristics and CQCC acoustic characteristics of each voice selected randomly respectively, and combining the LBP local texture characteristics and the CQCC acoustic characteristics to form a combined characteristic vector to obtain a training data set;
s2, training the random forest by using the training data set to generate a random forest classifier;
s3, extracting LBP local texture features and CQCC acoustic features of the voice to be detected, combining the LBP local texture features and the CQCC acoustic features to form a combined feature vector to be detected, and inputting the combined feature vector to be detected into a random forest classifier to perform authenticity detection on the voice to be detected.
Preferably, the extracting of the LBP local texture features includes:
acquiring a spectrogram of the voice to be extracted, and analyzing the spectrogram of the voice to be extracted by using an LBP algorithm to obtain LBP local texture features;
the voice to be extracted is randomly selected voice or voice to be detected.
As a preferred scheme, before the spectrogram of the voice to be extracted is analyzed with the LBP algorithm, the spectrogram is divided into blocks; the LBP algorithm is then applied to each block, yielding an LBP local texture feature vector composed of the LBP local texture features of all blocks.
Preferably, the extracting of the CQCC acoustic features includes:
firstly, a constant Q transform is applied to the voice to be extracted to obtain the frequency spectrum $X_{CQ}(k)$; then the logarithmic power spectrum $\log|X_{CQ}(k)|^2$ is computed and resampled to $\log|X_{CQ}(l)|^2$; finally, a discrete cosine transform is applied to the resampled logarithmic power spectrum to obtain the CQCC acoustic features of the voice to be extracted;
where k and l are the frequency band indices before and after resampling, respectively, and the voice to be extracted is a randomly selected voice or the voice to be detected.
As a preferred scheme, the combining LBP local texture features and CQCC acoustic features into a joint feature vector includes:
and respectively reducing the dimensions of the LBP local texture features and the CQCC acoustic features by adopting a principal component analysis algorithm, and then splicing the features after dimension reduction so as to generate a joint feature vector.
Preferably, the step S2 includes the following steps:
s21, assuming that the training data set has N vector samples, and randomly extracting N 'vector samples from the training data set in a returning mode to be used as training set samples to train a decision tree, wherein N' is less than or equal to N;
s22, each vector sample contains M attributes, and M is the dimension of the joint feature vector; when the decision tree is split, randomly selecting M' attributes, finishing the decision tree splitting according to the Gini index, and judging whether the splitting can not be continued; if yes, go to step S23; if not, continuing to finish decision tree splitting according to the Gini indexes;
s23, generating decision trees and judging whether the number of the decision trees is less than the target number; if yes, return to step S21; if not, generating a random forest classifier.
Compared with the prior art, the invention has the following technical effects:
the method comprises the steps of extracting texture features in a speech signal spectrogram by using a Local Binary Pattern (LBP), obtaining combined features by combining acoustic features of a Constant Q Cepstrum Coefficient (CQCC), and training a Random Forest (RF) classifier to perform authenticity detection and classification on speech to be detected by using the obtained combined feature vectors, so that the safety of an ASV system is effectively improved.
Drawings
FIG. 1 is a flow chart of a method for detecting disguised speech using a combination of features and random forests in accordance with an embodiment of the present invention;
FIG. 2 is an exemplary LBP solving process in accordance with an embodiment of the present invention;
fig. 3 is a schematic diagram of an LBP texture feature extraction process according to an embodiment of the present invention;
fig. 4 is a schematic diagram of an extraction flow of CQCC acoustic features according to an embodiment of the present invention;
FIG. 5 is a flow chart of joint feature extraction according to an embodiment of the present invention;
FIG. 6 is a flow chart of training a random forest according to an embodiment of the present invention.
Detailed Description
The technical solution of the present invention is further explained by the following specific examples.
As shown in fig. 1, the method for detecting a disguised voice by using a combination feature and a random forest according to the embodiment of the present invention includes the following steps:
s1, randomly selecting true voice and pseudo voice from a training voice library, respectively extracting LBP local texture characteristics and CQCC acoustic characteristics of each randomly selected voice, and combining the LBP local texture characteristics and the CQCC acoustic characteristics to form a combined characteristic vector so as to obtain a training data set;
specifically, a spectrogram of the randomly selected voice is obtained, and the spectrogram of the voice to be extracted is analyzed by using an LBP algorithm to obtain LBP local texture features. In order to improve the detection efficiency and accuracy of the disguised voice detection method, the spectrogram is subjected to blocking processing, and then the spectrogram of the voice to be extracted is analyzed by using an LBP algorithm for each block spectrogram, so that an LBP local texture feature vector consisting of LBP local texture features of each block spectrogram is obtained.
Extraction of CQCC acoustic features, comprising:
firstly, a constant Q transform is applied to the randomly selected voice to obtain the frequency spectrum $X_{CQ}(k)$; then the logarithmic power spectrum $\log|X_{CQ}(k)|^2$ is computed and resampled to $\log|X_{CQ}(l)|^2$; finally, a discrete cosine transform is applied to the resampled logarithmic power spectrum to obtain the CQCC acoustic features of the voice to be extracted;
where k and l are the frequency band indices before and after resampling, respectively.
After the two kinds of features are extracted, the LBP local texture features and the CQCC acoustic features are combined into a joint feature vector. Because the texture features and the acoustic features have different dimensions, the joint features cannot be generated directly; moreover, an overly large feature dimension would make the computation in the spoofing detection stage too heavy and degrade the efficiency of disguised voice detection. The specific process of combining the two features therefore includes:
and respectively reducing the dimensions of the LBP local texture features and the CQCC acoustic features by adopting a principal component analysis algorithm, and then splicing the reduced features to generate a joint feature vector.
S2, training the random forest by using the training data set to generate a random forest classifier;
the training of the random forest specifically comprises the following steps:
s21, assuming that the training data set has N vector samples, and randomly extracting N 'vector samples from the training data set in a returning mode to be used as training set samples to train a decision tree, wherein N' is less than or equal to N;
s22, each vector sample contains M attributes, and M is the dimension of the joint feature vector; when the decision tree is split, M' attributes are randomly selected, the decision tree is split according to the Gini index, and whether the splitting can be continued or not is judged; if yes, go to step S23; if not, continuing to finish decision tree splitting according to the Gini index;
s23, generating decision trees and judging whether the number of the decision trees is less than the target number; if yes, go back to step S21; and if not, generating a random forest classifier.
S3, extracting LBP local texture characteristics and CQCC acoustic characteristics of the voice to be detected (namely the voice to be verified), combining the LBP local texture characteristics and the CQCC acoustic characteristics to form a combined characteristic vector to be detected, and inputting the combined characteristic vector to be detected into a random forest classifier to perform authenticity detection on the voice to be detected.
For the specific process of extracting the LBP local texture feature and the CQCC acoustic feature and combining the two features, reference may be made to step S1, which is not described herein again.
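By way of orientation, the overall flow of steps S1 to S3 can be written as the following minimal Python sketch. It assumes a helper extract_joint_feature() (developed in the sections below) that maps one utterance to the 76-dimensional joint vector; the label convention and parameter values are illustrative, not prescribed by the patent.

```python
# Illustrative end-to-end sketch of S1-S3; extract_joint_feature() is the
# hypothetical helper sketched in the following sections.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_detector(true_utts, false_utts):
    """S1 + S2: inputs are lists of (waveform, sample_rate) pairs."""
    X = np.vstack([extract_joint_feature(y, sr) for y, sr in true_utts + false_utts])
    labels = [1] * len(true_utts) + [0] * len(false_utts)  # 1 = true voice
    rf = RandomForestClassifier(n_estimators=100, criterion="gini")
    return rf.fit(X, labels)

def detect(rf, y, sr):
    """S3: authenticity decision for one utterance to be verified."""
    feat = extract_joint_feature(y, sr).reshape(1, -1)
    return "true voice" if rf.predict(feat)[0] == 1 else "false voice"
```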
The working principle and the illustration of each step are described in detail below:
(1) Extraction of LBP local texture features
LBP feature parameters currently perform well in the field of image recognition; they are efficient texture features with a good classification effect. LBP describes the texture characteristics of an object by comparing the gray values of adjacent pixels in the object's image. Taking the gray value $g_c$ of the center pixel as the reference, the gray value $g_i$ of each neighboring pixel is compared with it: a point whose gray value is greater than or equal to $g_c$ is coded as 1, and a point whose gray value is less than $g_c$ is coded as 0. With the upper-left corner defined as the first digit and the neighbors recorded clockwise, a sequence of 0s and 1s is obtained; converting this binary number to decimal completes the basic LBP operation.
The LBP operation formula is as follows:

$$LBP_{N,R} = \sum_{i=1}^{N} s(g_i - g_c)\,2^{\,i-1}, \qquad s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} \tag{1}$$

where $g_c$ is the gray value of the center pixel, $R$ is the sampling radius, $g_i$ $(i = 1, \ldots, N)$ are the gray values of the $N$ pixels distributed on the circular neighborhood, and the resulting $LBP_{N,R}$ is the LBP value of the center pixel. Fig. 2 shows an example of the LBP value solving process for one image pixel when $R = 1$.
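As a concrete illustration of equation (1), the following Python sketch computes the LBP value of a single pixel for R = 1 and N = 8, with the neighbors read clockwise from the upper-left corner as described above; the function and array names are assumptions made for illustration.

```python
# A sketch of the basic 3x3 LBP operation of equation (1) for one pixel;
# `img` is a 2-D gray-level array, (r, c) must not lie on the image border.
import numpy as np

# offsets of the 8 neighbours, clockwise starting at the upper-left corner
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
           (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_pixel(img, r, c):
    g_c = img[r, c]
    code = 0
    for i, (dr, dc) in enumerate(OFFSETS):
        if img[r + dr, c + dc] >= g_c:   # s(g_i - g_c) = 1 when g_i >= g_c
            code |= 1 << i               # weight 2^(i-1) for the i-th neighbour
        # bit stays 0 when g_i < g_c
    return code                          # decimal LBP value in 0..255
```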
The LBP algorithm is used to perform texture analysis on the spectrogram of each speech signal to be detected. To improve the overall performance of the disguised voice detection system, the spectrogram is first divided into blocks: the invention equally divides the whole spectrogram into 16 blocks in a 4 × 4 grid, and LBP features are extracted from each block separately. When extracting LBP, the sampling radius is R = 1 and the number of sampling points is N = 8.
In the $LBP_{8,1}$ mode, after the spectrogram is processed with the 3 × 3 LBP of Fig. 2, each pixel takes a value between 0 and 255; counting these values with a statistical histogram would give each spectrogram block a 1 × 256-dimensional feature vector. For a 3 × 3 LBP, however, many dimensions of this 1 × 256 vector are empty, which adds useless information and computation, so compression is needed: every 8-bit binary pattern of the 3 × 3 LBP whose number of 0/1 transitions exceeds 2 is assigned to one and the same class. After this processing, the original 256 cases of the 3 × 3 LBP are reduced to 59 cases, and a statistical histogram over these 59 possible LBP values yields a 59-dimensional feature vector. Fig. 3 illustrates the extraction of the texture feature vector with the 3 × 3 LBP. Performing the same operation on all 16 blocks of the spectrogram yields a feature matrix of dimension 16 × 59.
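The block-wise uniform-LBP histogram described above can be sketched as follows, reusing lbp_pixel() from the previous sketch. The mapping of the 8-bit patterns with at most two 0/1 transitions onto 58 classes plus one shared class is the standard uniform-pattern scheme, assumed here since the patent does not spell it out.

```python
import numpy as np

def transitions(code):
    """Number of 0/1 transitions in the circular 8-bit pattern."""
    bits = [(code >> i) & 1 for i in range(8)]
    return sum(bits[i] != bits[(i + 1) % 8] for i in range(8))

# the 58 uniform patterns (at most 2 transitions) get bins 0..57,
# all remaining patterns share bin 58, giving 59 histogram bins
UNIFORM = {p: i for i, p in enumerate(q for q in range(256) if transitions(q) <= 2)}

def block_lbp_histograms(spec, grid=4):
    """Split a spectrogram into grid x grid blocks; return a (grid*grid, 59) matrix."""
    h, w = spec.shape
    bh, bw = h // grid, w // grid
    feats = np.zeros((grid * grid, 59))
    for b in range(grid * grid):
        r0, c0 = (b // grid) * bh, (b % grid) * bw
        for r in range(r0 + 1, r0 + bh - 1):      # block borders skipped for simplicity
            for c in range(c0 + 1, c0 + bw - 1):
                code = lbp_pixel(spec, r, c)      # from the previous sketch
                feats[b, UNIFORM.get(code, 58)] += 1
    return feats                                  # 16 x 59 for the 4 x 4 grid
```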
(2) CQCC acoustic feature extraction
The CQCC pipeline is similar to conventional acoustic feature extraction, but replaces the Short-Time Fourier Transform (STFT) with the Constant Q Transform (CQT). The CQT was originally used for musical tone analysis in music recognition; its main advantage is that it provides different frequency and time resolutions in the low and high frequency bands, avoiding the uniformly distributed time-frequency resolution of the STFT.
CQCC performs feature extraction based on the CQT. Let x(n) denote a frame of the speech signal; its CQT is expressed as:

$$X_{CQ}(k) = \sum_{n=0}^{N_k - 1} x(n)\, a_k^{*}(n), \qquad k = 1, 2, \ldots, K \tag{2}$$

where k = 1, 2, …, K is the index of the frequency bin and $\lfloor \cdot \rfloor$, used below, denotes rounding down. The analyzed band has minimum frequency $f_{min}$ and maximum frequency $f_{max}$ and is divided into $N_O$ octaves following an exponential distribution; each octave is subdivided into B bands, i.e.

$$K = B \cdot N_O = B \log_2\!\left(\frac{f_{max}}{f_{min}}\right)$$

$a_k^{*}(n)$ is the complex conjugate of $a_k(n)$, and $N_k$ is the variable window length, i.e. the length of the dynamic window in the time-frequency analysis. The mathematical expression of $a_k(n)$ is:

$$a_k(n) = \frac{1}{C}\, w\!\left(\frac{n}{N_k}\right) \exp\!\left[\, j \left( 2\pi n \frac{f_k}{f_s} + \phi_k \right) \right] \tag{3}$$

where $f_k$ is the center frequency of the k-th filter band, $f_s$ denotes the sampling frequency, and $\phi_k$ is a phase offset. $C$ is a normalization factor:

$$C = \sum_{l = -\lfloor N_k/2 \rfloor}^{\lfloor N_k/2 \rfloor} w\!\left( \frac{l + N_k/2}{N_k} \right) \tag{4}$$

where $w(t)$ is a window function. The width distribution of the K frequency bands follows the twelve-tone equal temperament, so $f_k$ is given by:

$$f_k = f_1 \cdot 2^{\frac{k-1}{B}} \tag{5}$$

where $f_1$ is the center frequency of the lowest frequency band.

The parameter Q, which trades off time resolution against frequency resolution, is:

$$Q = \frac{f_k}{f_{k+1} - f_k} = \left( 2^{1/B} - 1 \right)^{-1} \tag{6}$$

The value of Q in equation (6) depends only on B and does not change during the CQT; the window length $N_k$ is then:

$$N_k = \frac{f_s}{f_k}\, Q \tag{7}$$

As shown in Fig. 4, CQCC acoustic feature extraction first applies the constant Q transform to the speech signal x(n) to obtain the frequency spectrum $X_{CQ}(k)$, then computes the logarithm of the energy spectrum $|X_{CQ}(k)|^2$ to obtain the logarithmic power spectrum $\log|X_{CQ}(k)|^2$, which is resampled to $\log|X_{CQ}(l)|^2$; finally, a discrete cosine transform of the resampled logarithmic power spectrum extracts the CQCC coefficients of the speech signal, yielding the CQCC feature vector:

$$CQCC(p) = \sum_{l=1}^{L} \log \left| X_{CQ}(l) \right|^{2} \cos\!\left[ \frac{p \left( l - \frac{1}{2} \right) \pi}{L} \right], \qquad p = 0, 1, \ldots, L-1 \tag{8}$$

where L is the number of frequency bands after resampling.
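The extraction chain of Fig. 4 can be sketched in Python as follows, with librosa's CQT standing in for equation (2) and scipy's FFT-based resampling as a rough stand-in for the spectral resampling step; all parameter values (96 bins per octave, 9 octaves, 60 retained coefficients, f_min = 15 Hz for 16 kHz speech) are illustrative assumptions, not the patent's.

```python
# A sketch of CQT -> log power spectrum -> resampling -> DCT.
import numpy as np
import librosa
from scipy.signal import resample
from scipy.fftpack import dct

def cqcc(y, sr, bins_per_octave=96, n_octaves=9, n_resample=128, n_coef=60):
    X = librosa.cqt(y, sr=sr, fmin=15.0,
                    n_bins=bins_per_octave * n_octaves,   # K = B * N_O bins
                    bins_per_octave=bins_per_octave)      # X_CQ(k), eq. (2)
    log_pow = np.log(np.abs(X) ** 2 + 1e-10)              # log|X_CQ(k)|^2
    log_pow_l = resample(log_pow, n_resample, axis=0)     # log|X_CQ(l)|^2
    return dct(log_pow_l, type=2, axis=0, norm='ortho')[:n_coef]  # 60 x L frames
```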
(3) Combined features
In spoofing attack scenarios, joint features carry more voice information and perform better. The acoustic CQCC features and the LBP texture features are therefore combined into joint features. Because the texture features and the acoustic features have different dimensions, the joint features cannot be generated directly; moreover, an overly large feature dimension would make the computation in the spoofing detection stage too heavy and degrade the overall performance of the disguised voice detection system.
The method therefore adopts Principal Component Analysis (PCA) to reduce the dimensionality of the CQCC and LBP features separately, and then concatenates the reduced features to generate the joint features.
The specific flow of the PCA dimension reduction algorithm is as follows:
(a) First, input a data set $X = \{x_1, x_2, \ldots, x_i, \ldots, x_N\}$ composed of N M-dimensional vectors, and subtract the mean vector from each vector $x_i$ in X, i.e.

$$\tilde{x}_i = x_i - \frac{1}{N} \sum_{j=1}^{N} x_j$$

which yields the de-centered data set $\tilde{X} = \{\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_N\}$.

(b) Construct the covariance matrix $\tilde{X}\tilde{X}^T$ and decompose it into eigenvalues, selecting the eigenvectors $w_1, w_2, \ldots, w_{N'}$ corresponding to the N′ largest eigenvalues to obtain the eigenvector matrix $W = \{w_1, w_2, \ldots, w_{N'}\}$, where T denotes the matrix transpose.

(c) Reduce the dimensionality of each sample vector $\tilde{x}_i$ in the data set $\tilde{X}$, i.e.

$$z_i = W^T \tilde{x}_i$$

obtaining the reduced N′-dimensional data set $Z = \{z_1, z_2, \ldots, z_N\}$.
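A compact numpy sketch of steps (a) to (c), under the convention that the rows of X are the N sample vectors:

```python
import numpy as np

def pca_reduce(X, n_keep):
    """X: (N, M) array of N M-dimensional sample vectors; returns (N, n_keep)."""
    X_c = X - X.mean(axis=0)                  # step (a): subtract the mean vector
    cov = X_c.T @ X_c / len(X)                # step (b): M x M covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    W = eigvecs[:, np.argsort(eigvals)[::-1][:n_keep]]  # top-N' eigenvectors
    return X_c @ W                            # step (c): projected data set Z
```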
The extraction of the joint features relies mainly on the PCA dimension reduction method. Suppose x(n) is a speech signal of L frames. First, its spectrogram is computed and analyzed with the LBP algorithm described above to obtain a 16 × 59 texture feature matrix LBP. At the same time, the CQCC feature vector of each frame is extracted with dimension 60, so a speech signal of L frames yields an acoustic feature matrix CQCC of dimension 60 × L. PCA is applied to the matrices LBP and CQCC^T respectively with N′ = 1, yielding a 16 × 1 matrix LBP′ and a 60 × 1 matrix CQCC′. Finally, the 16 × 1 LBP′ and the 60 × 1 CQCC′ are concatenated end to end to obtain a joint feature vector of dimension 76 × 1. Thus, for a speech signal of any duration, the joint feature extraction process finally generates a 76 × 1-dimensional joint feature vector; the specific flow is shown in Fig. 5.
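Putting the pieces together, the joint feature extraction of Fig. 5 can be sketched as follows, reusing block_lbp_histograms(), cqcc() and pca_reduce() from the earlier sketches. Scipy's spectrogram routine stands in for the unspecified spectrogram computation, and the 60 CQCC coefficient rows are treated as PCA samples so that N′ = 1 yields the 60 × 1 CQCC′ of the text.

```python
# Joint 76-dimensional feature of one utterance; the spectrogram routine
# and its parameters are assumptions, the shapes follow the text.
import numpy as np
from scipy.signal import spectrogram

def extract_joint_feature(y, sr):
    _, _, S = spectrogram(y, fs=sr)                      # power spectrogram
    lbp = block_lbp_histograms(np.log(S + 1e-10))        # 16 x 59 matrix LBP
    lbp1 = pca_reduce(lbp, 1)                            # 16 x 1 LBP'
    cq = cqcc(y, sr)                                     # 60 x L matrix CQCC
    cq1 = pca_reduce(cq, 1)                              # 60 x 1 CQCC' (rows as samples)
    return np.concatenate([lbp1.ravel(), cq1.ravel()])   # 76-dim joint vector
```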
(4) Random forest classifier
Random Forest (RF) is a machine learning algorithm that handles both classification and regression effectively; it is a powerful supervised learning algorithm built on the decision tree model. RF follows the idea of ensemble learning, combining multiple weak learners into one strong learner: it randomly draws data samples to build many decision trees, forming a forest in which each tree produces a classification result, and it selects the result with the most votes as the classification of the whole forest, following the majority rule. As shown in Fig. 6, the RF training process is as follows:
(a) True and false voices are randomly selected from the voice library; assuming N voice segments in total, a 76-dimensional joint feature vector is extracted from each segment, giving a data set of N vector samples. N′ (N′ ≤ N) vector samples are randomly drawn from the data set with replacement to serve as the training set for one decision tree; in this process some samples may be drawn repeatedly and others not at all.
(b) Each sample contains M attributes, M being the dimension of the joint feature vector. When the decision tree starts to split, M′ attributes are randomly selected from the M attributes, where M′ should be far smaller than M; the splitting attribute is then chosen among the M′ attributes using the Gini index as the splitting strategy, i.e., according to the splitting rule, the candidate attributes whose information gain is above the average are identified first, and the attribute with the highest information gain rate among them is selected.
(c) The nodes of the decision tree are split until all possible values have been used and growth stops, so that each tree grows to its maximum extent without pruning. This yields one decision tree; the process is repeated the target number of times T, growing more decision trees and forming the random forest classifier.
Because the decision trees are mutually independent and equally important, each tree carries the same weight when the random forest performs classification in this patent, and the final classification result is decided by vote. The method uses the random forest classification algorithm as the classifier for true/false voice classification: the random forest system is trained on a data set containing the joint features of true and false voices and then tested on the set of voices to be verified, thereby achieving classification and recognition of true and false speech.
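Steps (a) to (c) and the voting rule map directly onto scikit-learn's RandomForestClassifier, as the following sketch shows; the hyperparameter values and the synthetic data are illustrative, not taken from the patent.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 76))     # stand-in 76-dim joint features
y_train = rng.integers(0, 2, size=200)   # stand-in true/false labels

rf = RandomForestClassifier(
    n_estimators=100,        # target number of trees T (illustrative)
    criterion="gini",        # Gini index as the splitting strategy
    max_features="sqrt",     # M' << M randomly chosen attributes per split
    bootstrap=True,          # step (a): draw samples with replacement per tree
    max_samples=None,        # set to N' < N to bootstrap fewer samples per tree
    max_depth=None,          # step (c): grow each tree fully, no pruning
)
rf.fit(X_train, y_train)
votes = rf.predict(X_train[:5])  # majority vote of all trees per sample
```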
The foregoing has outlined rather broadly the preferred embodiments and principles of the present invention and it will be appreciated that those skilled in the art may devise variations of the present invention that are within the spirit and scope of the appended claims.

Claims (4)

1. A method for detecting disguised voice by adopting combined features and random forests is characterized by comprising the following steps:
s1, selecting true voice and false voice from the training voice library randomly, extracting LBP local texture characteristics and CQCC acoustic characteristics of each voice selected randomly respectively, and combining the LBP local texture characteristics and the CQCC acoustic characteristics to form a combined characteristic vector to obtain a training data set;
s2, training the random forest by using the training data set to generate a random forest classifier;
s3, extracting LBP local texture features and CQCC acoustic features of the voice to be detected, combining the LBP local texture features and the CQCC acoustic features to form a combined feature vector to be detected, and inputting the combined feature vector to be detected into a random forest classifier to perform authenticity detection on the voice to be detected;
the LBP local texture feature extraction comprises the following steps:
acquiring a spectrogram of the voice to be extracted, and analyzing the spectrogram of the voice to be extracted by using an LBP algorithm to obtain LBP local texture features;
wherein, the voice to be extracted is randomly selected voice or voice to be detected;
before the spectrogram of the voice to be extracted is analyzed with the LBP algorithm, the spectrogram is divided into blocks; the LBP algorithm is then applied to each block to obtain an LBP local texture feature vector composed of the LBP local texture features of all blocks.
2. The method for detecting disguised voice using united features and random forest as claimed in claim 1, wherein the extracting of the CQCC acoustic features comprises:
a constant Q transform is first applied to the voice to be extracted to obtain the frequency spectrum $X_{CQ}(k)$; the logarithmic power spectrum $\log|X_{CQ}(k)|^2$ is then obtained and resampled to $\log|X_{CQ}(l)|^2$; finally, a discrete cosine transform is applied to the resampled logarithmic power spectrum to obtain the CQCC acoustic features of the voice to be extracted;
where k and l are the frequency band indices before and after resampling, respectively, and the voice to be extracted is a randomly selected voice or the voice to be detected.
3. The method for detecting disguised speech using combined features and random forests as claimed in claim 2, wherein said combining LBP local texture features and CQCC acoustic features into a combined feature vector comprises:
and respectively reducing the dimensions of the LBP local texture features and the CQCC acoustic features by adopting a principal component analysis algorithm, and then splicing the reduced features to generate a joint feature vector.
4. The method for detecting disguised speech using united features and random forests as claimed in claim 3, wherein said step S2 comprises the steps of:
s21, assuming common in training data setNVector samples are randomly drawn from the training data set with a trade-off
Figure 949953DEST_PATH_IMAGE004
The vector samples are used as training set samples to train a decision tree, wherein,
Figure 497609DEST_PATH_IMAGE005
s22, each vector sample containsMThe number of the attributes is one,Mdimension of the joint feature vector; when the decision tree is split, random selection is performed
Figure 464428DEST_PATH_IMAGE006
The attribute completes decision tree splitting according to the Gini index and judges whether the decision tree can not be continuedSplitting; if yes, go to step S23; if not, continuing to finish decision tree splitting according to the Gini index;
s23, generating decision trees and judging whether the number of the decision trees is less than the target number; if yes, go back to step S21; and if not, generating a random forest classifier.
CN202110648176.8A 2021-06-10 2021-06-10 Camouflage voice detection method adopting combined features and random forest Active CN113436646B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110648176.8A CN113436646B (en) 2021-06-10 2021-06-10 Camouflage voice detection method adopting combined features and random forest

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110648176.8A CN113436646B (en) 2021-06-10 2021-06-10 Camouflage voice detection method adopting combined features and random forest

Publications (2)

Publication Number Publication Date
CN113436646A CN113436646A (en) 2021-09-24
CN113436646B true CN113436646B (en) 2022-09-23

Family

ID=77755642

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110648176.8A Active CN113436646B (en) 2021-06-10 2021-06-10 Camouflage voice detection method adopting combined features and random forest

Country Status (1)

Country Link
CN (1) CN113436646B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113724693B (en) * 2021-11-01 2022-04-01 中国科学院自动化研究所 Voice judging method and device, electronic equipment and storage medium
CN114822589B (en) * 2022-04-02 2023-07-04 中科猷声(苏州)科技有限公司 Indoor acoustic parameter determination method, model construction method, device and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016046652A1 (en) * 2014-09-24 2016-03-31 FUNDAÇÃO CPQD - Centro de Pesquisa e Desenvolvimento em Telecomunicações Method and system for detecting fraud in applications based on voice processing
CN110148425A (en) * 2019-05-14 2019-08-20 杭州电子科技大学 A kind of camouflage speech detection method based on complete local binary pattern
EP3608907A1 (en) * 2018-08-10 2020-02-12 Visa International Service Association Replay spoofing detection for automatic speaker verification system
CN110797031A (en) * 2019-09-19 2020-02-14 厦门快商通科技股份有限公司 Voice change detection method, system, mobile terminal and storage medium
CN111611566A (en) * 2020-05-12 2020-09-01 珠海造极声音科技有限公司 Speaker verification system and replay attack detection method thereof
CN112927694A (en) * 2021-03-08 2021-06-08 中国地质大学(武汉) Voice instruction validity judging method based on fusion voiceprint features

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2018226844B2 (en) * 2017-03-03 2021-11-18 Pindrop Security, Inc. Method and apparatus for detecting spoofing conditions

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016046652A1 (en) * 2014-09-24 2016-03-31 FUNDAÇÃO CPQD - Centro de Pesquisa e Desenvolvimento em Telecomunicações Method and system for detecting fraud in applications based on voice processing
EP3608907A1 (en) * 2018-08-10 2020-02-12 Visa International Service Association Replay spoofing detection for automatic speaker verification system
CN110148425A (en) * 2019-05-14 2019-08-20 杭州电子科技大学 A kind of camouflage speech detection method based on complete local binary pattern
CN110797031A (en) * 2019-09-19 2020-02-14 厦门快商通科技股份有限公司 Voice change detection method, system, mobile terminal and storage medium
CN111611566A (en) * 2020-05-12 2020-09-01 珠海造极声音科技有限公司 Speaker verification system and replay attack detection method thereof
CN112927694A (en) * 2021-03-08 2021-06-08 中国地质大学(武汉) Voice instruction validity judging method based on fusion voiceprint features

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Local Binary Pattern with Random Forest for Acoustic Scene Classification; Shamsiah Abidin et al.; 2018 IEEE International Conference on Multimedia and Expo (ICME); 2018-10-11; full text *
Spectrotemporal Analysis Using Local Binary Pattern Variants for Acoustic Scene Classification; Shamsiah Abidin et al.; IEEE/ACM Transactions on Audio, Speech, and Language Processing; 2018-07-12; full text *
Time-frequency image features for acoustic scene classification; Gao Min et al.; Technical Acoustics (声学技术); 2017-10-31; full text *

Also Published As

Publication number Publication date
CN113436646A (en) 2021-09-24

Similar Documents

Publication Publication Date Title
CN108281146B (en) Short voice speaker identification method and device
CN108986824B (en) Playback voice detection method
CN113436646B (en) Camouflage voice detection method adopting combined features and random forest
CN110120230B (en) Acoustic event detection method and device
CN106991312B (en) Internet anti-fraud authentication method based on voiceprint recognition
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
Chen et al. Towards understanding and mitigating audio adversarial examples for speaker recognition
CN111816185A (en) Method and device for identifying speaker in mixed voice
CN110767239A (en) Voiceprint recognition method, device and equipment based on deep learning
CN114596879B (en) False voice detection method and device, electronic equipment and storage medium
Gao et al. Generalized spoofing detection inspired from audio generation artifacts
CN114495950A (en) Voice deception detection method based on deep residual shrinkage network
Chen et al. SEC4SR: a security analysis platform for speaker recognition
CN111243600A (en) Voice spoofing attack detection method based on sound field and field pattern
de Almeida et al. Use of paraconsistent feature engineering to support the long term feature choice for speaker verification
CN110808067A (en) Low signal-to-noise ratio sound event detection method based on binary multiband energy distribution
WO2013008956A1 (en) Sound processing method, sound processing system, video processing method, video processing system, sound processing device, and method and program for controlling same
CN115293214A (en) Underwater sound target recognition model optimization method based on sample expansion network
KR101094763B1 (en) Apparatus and method for extracting feature vector for user authentication
CN114898773A (en) Synthetic speech detection method based on deep self-attention neural network classifier
CN113627327A (en) Singing voice detection method based on multi-scale time-frequency graph parallel input convolution neural network
Alam On the use of fisher vector encoding for voice spoofing detection
CN114639387A (en) Voiceprint fraud detection method based on reconstructed group delay-constant Q transform spectrogram
CN113870896A (en) Motion sound false judgment method and device based on time-frequency graph and convolutional neural network
Zhang et al. Improving robustness of speech anti-spoofing system using resnext with neighbor filters

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant