CN112967712A - Synthetic speech detection method based on autoregressive model coefficient - Google Patents
Synthetic speech detection method based on autoregressive model coefficients
- Publication number
- CN112967712A (application CN202110212380.5A)
- Authority
- CN
- China
- Prior art keywords
- voice
- feature
- speech
- dimensional
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques characterised by the analysis technique using neural networks
- G10L25/48—Speech or voice analysis techniques specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
Abstract
The invention provides a synthetic voice detection method based on autoregressive (AR) model coefficients, and relates to the technical field of voice detection. It solves the problem that directly applying an existing voice feature extraction algorithm to voice detection cannot achieve detection efficiency and detection accuracy at the same time. Firstly, the voice segments of the training set, verification set and test set in a database are fixed to a uniform length; after segmentation, the AR coefficients of each section of the voice signal are extracted to form two-dimensional AR voice features, from which a training feature set, a verification feature set and a test feature set are constructed, and a convolutional neural network classifier is trained. The trained classifier then classifies the test feature set to confirm whether the voice signal to be detected has undergone a voice synthesis or voice conversion tampering operation. Because no existing voice feature extraction algorithm is applied directly, the amount of calculation in the detection process is reduced; finally, through fusion, both the detection efficiency and the detection accuracy of voice detection are achieved.
Description
Technical Field
The invention relates to the technical field of voice detection, in particular to a synthetic voice detection method based on autoregressive model coefficients.
Background
Automatic Speaker Verification (ASV) is deployed in an increasing number of applications and services, such as mobile phones, smart speakers and call centers, to provide a low-cost and flexible biometric solution for personal identity verification. Although the performance of ASV systems has gradually improved in recent years, they remain vulnerable to spoofing attacks. In terms of audio tampering, which is of particular concern, there are mainly two kinds of spoofing attacks: speech synthesis (SS) attacks and voice conversion attacks, both of which constitute a significant threat to ASV systems. Text To Speech (TTS) is a technology for converting text into speech: a TTS system can generate a completely artificial speech signal that utters the content to be expressed in different tones. Voice Conversion (VC) operates on natural speech: it takes a piece of speech as input and makes it sound as if spoken by another person while keeping the content of the utterance unchanged. Both SS and VC technologies can produce high-quality speech signals that mimic the speech of a particular target.
Currently, most research on the detection of synthesized speech requires extracting features from speech, and commonly used speech features include MFCC, CQCC and Spec. MFCC is one of the most commonly used features in speech processing: it performs cepstral analysis on the logarithmic spectrum on the Mel scale and is suitable for distinguishing tampered speech from human speech. CQCC is an amplitude-based feature that uses the Constant Q Transform (CQT) in combination with traditional cepstral analysis. The Spec (spectrogram) feature is more primitive than MFCC and CQCC, since it is obtained by computing the STFT over a Hamming window and then taking the magnitude of each component.
On November 16, 2018, Chinese patent CN108831506A disclosed a GMM-BIC-based digital audio tampering point detection method and system, which also belongs to the technical field of voice detection. The method proposed in that patent extracts the MFCC features of silence frames after segmenting them from the voice signal, and then uses the GMM-BIC method in place of the traditional SGM-BIC for digital audio tampering point detection, so that digital audio tampering localization is automatic, adaptable and robust, while the detection accuracy is ensured.
An Autoregressive Model (AR) is one of the most common stationary time-series models and is a statistical method for processing time series. Speech also belongs to one-dimensional data, and the relations within a speech sequence can be evaluated through an AR linear prediction model, so researching how to detect voice tampering based on AR coefficients is of great significance.
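To make this background concrete, the following is a minimal sketch, not taken from the patent itself, of fitting an AR(p) model to a one-dimensional signal via the Yule-Walker equations and measuring how well the coefficients linearly predict each sample from its p predecessors (the function name and the test signal are our assumptions):

```python
import numpy as np

def ar_coefficients(x, p):
    """Estimate AR(p) coefficients a_1..a_p via the Yule-Walker equations."""
    x = x - np.mean(x)
    n = len(x)
    # Biased autocorrelation estimates r[0..p]
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(p + 1)])
    # Solve the p x p Toeplitz system R a = r[1:]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])

rng = np.random.default_rng(0)
x = np.sin(0.1 * np.arange(1000)) + 0.05 * rng.standard_normal(1000)
a = ar_coefficients(x, p=8)
# One-step linear prediction: x_hat[n] = sum_k a_k * x[n - k]
x_hat = np.array([np.dot(a, x[n - 1::-1][:8]) for n in range(8, len(x))])
print("AR(8) coefficients:", np.round(a, 3))
print("prediction RMSE:", float(np.sqrt(np.mean((x[8:] - x_hat) ** 2))))
```

A small prediction error indicates that the AR coefficients capture the linear dependencies within the sequence, which is exactly the property the detection method exploits.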
Disclosure of Invention
In order to solve the prior-art problem that the detection efficiency and the detection accuracy of voice detection cannot both be achieved when an existing voice feature extraction algorithm is directly applied to voice detection, the invention provides a synthetic voice detection method based on autoregressive model coefficients, which reduces the amount of calculation in the detection process and improves the detection accuracy.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
a synthetic speech detection method based on autoregressive model coefficients at least comprises the following steps:
s1, fixing the voice segments of a training set, a verification set and a test set of a known database to be a uniform length a;
s2, segmenting the voices of the training set, the verification set and the test set, extracting AR coefficients of different orders of each segmented voice, arranging the extracted AR coefficients to form two-dimensional AR voice features, and forming a training feature set, a verification feature set and a test feature set of the two-dimensional AR voice features;
s3, training a convolutional neural network classifier by utilizing a training feature set and a verification feature set of the two-dimensional AR voice feature, and storing the optimal parameters of the convolutional neural network classifier;
s4, classifying the test feature set by using the trained convolutional neural network classifier, and determining whether the voice is subjected to voice synthesis or voice conversion tampering operation;
and S5, fusing the two-dimensional AR voice features with different orders to confirm whether the voice is subjected to voice synthesis or voice conversion tampering operation.
In this technical scheme, AR coefficients are extracted from each frame of the voice signals after the voice of the training set, verification set and test set of a known database is segmented, and the arranged AR coefficients form two-dimensional AR voice features. A convolutional neural network classifier is then trained for classification, so that it can be confirmed whether the voice has undergone a voice synthesis or voice conversion tampering operation. No existing voice feature extraction algorithm is applied directly, which reduces the amount of calculation in the detection process; finally, fusing two-dimensional AR features of different orders further improves the detection precision, so that both the detection efficiency and the detection accuracy of voice detection are achieved.
Preferably, the process of fixing the speech segments of the training set, the verification set and the test set of the known database to the uniform length a in step S1 is:
s101, selecting a known database, and fixing the lengths of the voice segments of a training set, a verification set and a test set of the known database as a sampling points;
s102, judging whether the voice length of any one of the training set, the verification set and the test set before fixation is larger than or equal to a, if so, performing truncation operation, and fixing the voice length as a; otherwise, the voice length before fixing is expanded by copying the voice, and then the voice length is fixed as a.
Preferably, the specific process of constructing the training feature set, the verification feature set and the test feature set of the two-dimensional AR speech feature described in step S2 is as follows:
s201, dividing the voice fragments with the fixed uniform length of a in the training set, the verification set and the test set into b sections;
s202, extracting the AR coefficient with the order of h of each segmented voice, and arranging the extracted b-dimensional AR coefficients to form a two-dimensional AR voice feature;
and S203, respectively forming a training feature set, a verification feature set and a test feature set of the two-dimensional AR voice features with the dimension of b multiplied by h.
Preferably, h in step S202 satisfies:
8 ≤ h ≤ 150, wherein h represents a positive integer; this range ensures the effectiveness of the AR coefficients in application.
Preferably, the training feature set of the two-dimensional AR speech feature includes an original speech feature and a tampered speech feature, and in step S3, the convolutional neural network classifier is obtained by training through a gradient descent method; and taking the parameter of the convolutional neural network classifier corresponding to the best accuracy index of the verification feature set as an optimal parameter and storing the optimal parameter.
Preferably, the specific process of classifying the test feature set by using the trained convolutional neural network classifier and determining whether the speech is subjected to speech synthesis or speech conversion tampering operation in step S4 includes:
s401, inputting a test feature set of two-dimensional AR voice features with dimension b × h into the trained convolutional neural network classifier;
s402, according to a calculation formula of the voice score CM:
CM(f)=log(p(bonafide|f;θ))-log(p(spoof|f;θ))
calculating the score of each voice and storing a score file, wherein f represents the currently input voice feature and θ represents the stored optimal parameters; p(bonafide|f;θ) represents the probability that the input feature f is an original (bonafide) voice feature, and p(spoof|f;θ) represents the probability that the input feature f is a synthetic tampered (spoof) voice feature;
s403, judging whether the value of CM (f) is larger than or equal to a judgment threshold value T, if so, the sent voice feature is original voice, otherwise, the sent voice feature is synthesized voice or converted tampered voice;
and S404, further determining the t-DCF index and the equal error rate EER index of the current test feature set according to the stored score file and the score file given by the known database.
Here, the value of CM(f) represents the likelihood that the voice signal to be detected is original voice, and the t-DCF index and the equal error rate EER index are introduced in step S404 to further judge the effectiveness of the two-dimensional AR voice feature.
Preferably, in step S5, the process of fusing two-dimensional AR speech features of different orders and determining whether speech is subjected to speech synthesis or speech conversion tampering operation includes:
s501, respectively inputting test feature sets of two-dimensional AR voice features of different orders into a trained convolutional neural network classifier to obtain different score files and storing the score files;
s502, averaging different score files to obtain a fused score file;
s503, judging whether the score value after fusion is larger than or equal to a judgment threshold value T, if so, the voice feature after fusion is original voice, otherwise, the voice feature after fusion is synthesized voice or converted and tampered voice;
and S504, further determining the fused t-DCF index and equal error rate EER index according to the fused score file and the score file given by the known database.
Preferably, the larger the CM(f) score, the greater the probability that the currently input two-dimensional AR voice feature is original voice; the smaller the CM(f) score, the smaller that probability.
Preferably, the smaller the t-DCF index and the equal error rate EER index, the more effective the two-dimensional AR voice feature.
Preferably, the known database is the ASVspoof 2019 speech data set.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides a synthetic voice detection method based on autoregressive model coefficients, firstly fixing voice segments of a training set, a verification set and a test set in a known database to be uniform in length, extracting AR coefficients of each section of voice signals after segmentation to form two-dimensional AR voice characteristics, thereby constructing a training characteristic set, a verification characteristic set and a test characteristic set, training a convolutional neural network classifier by utilizing the training characteristic set and the verification characteristic set, classifying the test characteristic set by the trained convolutional neural network classifier, confirming whether the voice signal to be tested is subjected to voice synthesis or voice conversion tampering operation, compared with the prior art, the existing voice feature extraction algorithm is not directly applied, the calculated amount in the detection process is reduced, the detectable precision is further improved by fusing the two-dimensional AR features with different orders, and the detection efficiency and the detection accuracy of voice detection are considered.
Drawings
Fig. 1 is a flowchart illustrating a method for detecting synthesized speech based on autoregressive model coefficients according to an embodiment of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for better illustration of the present embodiment, certain parts of the drawings may be omitted, enlarged or reduced, and do not represent actual dimensions;
it will be understood by those skilled in the art that certain well-known descriptions of the figures may be omitted.
The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
the technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
Fig. 1 shows a flowchart of a method for detecting synthesized speech based on autoregressive model coefficients. With reference to fig. 1, the method comprises:
s1, fixing the voice segments of a training set, a verification set and a test set of a known database to a uniform length a; in this embodiment, the known database is the ASVspoof 2019 speech data set and the uniform length is 64000. The specific fixing process is as follows:
s101, selecting an ASVspoof 2019 voice data set as a known database, and fixing the lengths of voice segments of a training set, a verification set and a test set of the known database to be 64000 sampling points respectively;
s102, judging whether the pre-fixing voice length of any voice in the training set, verification set or test set is greater than or equal to 64000; if so, performing a truncation operation so that the fixed voice length is 64000; otherwise, expanding the voice by copying it and then fixing the length to 64000. That is, when the pre-fixing length of a voice in the training set, verification set or test set exceeds 64000, a truncation operation is carried out directly by the prior art so that the voice length meets the uniform-length requirement of 64000; if the pre-fixing length is less than 64000, the voice fragment is copied to expand the original voice so that its length meets the uniform-length requirement of 64000;
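As a minimal sketch of this fixing step (the helper name is our assumption, not the patent's), assuming each utterance is already loaded as a numpy waveform:

```python
import numpy as np

def fix_length(x, a=64000):
    """Return exactly `a` samples: truncate long speech, tile short speech."""
    if len(x) >= a:
        return x[:a]                   # truncation operation
    reps = int(np.ceil(a / len(x)))    # expand the speech by copying it
    return np.tile(x, reps)[:a]
```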
s2, segmenting the voices of the training set, the verification set and the test set, extracting AR coefficients of different orders of each segmented voice, arranging the extracted AR coefficients to form two-dimensional AR voice features, and forming a training feature set, a verification feature set and a test feature set of the two-dimensional AR voice features;
the extraction of the AR coefficient can be realized by different mature technical means, and other specific processes are as follows:
s201, dividing the fixed voice fragments with the uniform length of 64000 in a training set, a verification set and a test set into 400 sections;
s202, extracting the AR coefficient with the order of h of each segmented voice, and arranging the extracted 400-dimensional AR coefficients to form a two-dimensional AR voice feature;
In this embodiment, to ensure the effectiveness of the AR coefficients in application, the order h satisfies 8 ≤ h ≤ 150, where h is a positive integer; that is, h may take the endpoint value 8 or 150, or any positive integer in between. Each of the 400 divided voice segments contains 160 sampling points. The training feature set of the two-dimensional AR voice features comprises original voice features and tampered voice features;
s203, arranging the extracted order-h AR coefficients of each voice section according to the 400 divided segments, respectively forming a training feature set, a verification feature set and a test feature set of two-dimensional AR voice features with dimension 400 × h.
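The patent does not name a specific AR estimator, so the sketch below uses librosa.lpc (Burg's method) as one readily available stand-in; the function name, default order and float cast are our assumptions:

```python
import numpy as np
import librosa

def ar_feature(x, b=400, h=10):
    """Build the b x h two-dimensional AR feature from a 64000-sample voice."""
    segments = x.reshape(b, -1)        # 64000 samples -> 400 segments of 160
    # librosa.lpc returns order+1 coefficients with a leading 1; drop it
    return np.stack([librosa.lpc(seg.astype(float), order=h)[1:]
                     for seg in segments])   # shape (400, h)
```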
S3, training a convolutional neural network classifier by utilizing a training feature set and a verification feature set of the two-dimensional AR voice feature, and storing the optimal parameters of the convolutional neural network classifier; in this embodiment, the convolutional neural network classifier is obtained by training through a gradient descent method, and is not limited to a specific convolutional neural network classifier, and then the parameters of the convolutional neural network classifier corresponding to the best accuracy index of the verification feature set are used as the optimal parameters and stored.
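The patent fixes only that the classifier is a convolutional neural network trained by gradient descent, so the PyTorch model below is just one plausible instantiation; its architecture, optimizer settings and the [spoof, bonafide] output order are assumptions:

```python
import torch
import torch.nn as nn

class ARCNN(nn.Module):
    """A small CNN over b x h two-dimensional AR features."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.fc = nn.Linear(32 * 4 * 4, 2)   # logits for [spoof, bonafide]

    def forward(self, x):                    # x: (batch, 1, 400, h)
        return torch.log_softmax(self.fc(self.conv(x).flatten(1)), dim=1)

model = ARCNN()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)  # gradient descent method
loss_fn = nn.NLLLoss()                              # pairs with log_softmax

x = torch.randn(8, 1, 400, 10)           # a dummy batch of 400 x h features
y = torch.randint(0, 2, (8,))            # 0 = spoof, 1 = bonafide
opt.zero_grad(); loss_fn(model(x), y).backward(); opt.step()
```

The parameters giving the best accuracy on the verification feature set would then be saved, e.g. with torch.save(model.state_dict(), ...).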
S4, classifying the test feature set by using the trained convolutional neural network classifier, and determining whether the voice is subjected to voice synthesis or voice conversion tampering operation;
the specific process is as follows:
s401, inputting the test feature set of two-dimensional AR voice features with dimension 400 × h into the trained convolutional neural network classifier;
s402, according to a calculation formula of the voice score CM:
CM(f)=log(p(bonafide|f;θ))-log(p(spoof|f;θ))
calculating the score of each voice and storing a score file, wherein f represents the currently input voice feature and θ represents the stored optimal parameters; p(bonafide|f;θ) represents the probability that the input feature f is an original (bonafide) voice feature, and p(spoof|f;θ) represents the probability that the input feature f is a synthetic tampered (spoof) voice feature;
s403, judging whether the value of CM (f) is larger than or equal to a judgment threshold value T, if so, the sent voice feature is original voice, otherwise, the sent voice feature is synthesized voice or converted tampered voice;
s404, according to the stored score file and the score file given by the known database, further determining the t-DCF index and the equal error rate EER index of the current test feature set. The value of the CM(f) score represents the likelihood that the voice signal to be detected is original voice, and the t-DCF index and the EER index are introduced in step S404 to further judge the effectiveness of the two-dimensional AR voice feature. The t-DCF (tandem detection cost function) is a tandem cost-function index determined by the two systems (ASV, CM), and its calculation formula is:
t-DCF(s,t) = C_miss^ASV·π_tar·P_miss^ASV(t) + C_fa^ASV·π_non·P_fa^ASV(t) + C_miss^CM·π_tar·P_miss^CM(s) + C_fa^CM·π_spoof·P_fa,spoof^ASV(t)·P_fa^CM(s)
wherein C_miss^ASV is the cost of the ASV system rejecting target voice; C_fa^ASV is the cost of the ASV system accepting non-target voice; C_miss^CM is the cost of the CM system rejecting real voice; C_fa^CM is the cost of the CM system accepting tampered voice; and π_tar, π_non and π_spoof are prior probabilities. P_miss^ASV(t) is the false rejection rate of the ASV system at threshold t; P_fa^ASV(t) is the false acceptance rate of the ASV system at threshold t; P_fa,spoof^ASV(t) is the probability that a tampered sample is not missed (i.e. is accepted) by the ASV system; P_miss^CM(s) is the false rejection rate of the CM system at threshold s; and P_fa^CM(s) is the false acceptance rate of the CM system at threshold s.
The equal error rate (EER) is the operating point at which the false rejection rate (FRR) equals the false acceptance rate (FAR); the common value of FAR and FRR at this point is called the equal error rate.
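To make steps S402-S403 and the EER definition above concrete, here is a sketch with our own helper names, where `log_probs` is assumed to be the classifier's log-softmax output with columns ordered [spoof, bonafide]:

```python
import numpy as np

def cm_score(log_probs):
    """CM(f) = log p(bonafide|f;theta) - log p(spoof|f;theta)."""
    return log_probs[:, 1] - log_probs[:, 0]

def eer(bonafide_scores, spoof_scores):
    """Equal error rate: the operating point where FRR and FAR coincide."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])  # miss
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])    # false accept
    i = np.argmin(np.abs(frr - far))
    return (frr[i] + far[i]) / 2
```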
S5, fusing two-dimensional AR voice features with different orders, and confirming whether voice is subjected to voice synthesis or voice conversion tampering operation, wherein the process is as follows:
s501, respectively inputting test feature sets of two-dimensional AR voice features of different orders into a trained convolutional neural network classifier to obtain different score files and storing the score files;
s502, averaging different score files to obtain a fused score file;
s503, judging whether the score value after fusion is larger than or equal to a judgment threshold value T, if so, the voice feature after fusion is original voice, otherwise, the voice feature after fusion is synthesized voice or converted and tampered voice;
and S504, further determining the fused t-DCF index and equal error rate EER index according to the fused score file and the score file given by the known database.
In this embodiment, the larger the CM(f) score, the greater the probability that the currently input two-dimensional AR voice feature is original voice, and the smaller the CM(f) score, the smaller that probability; the smaller the t-DCF index and the equal error rate EER index, the more effective the two-dimensional AR voice feature.
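A sketch of the fusion in S501-S504, assuming the per-utterance CM scores of two orders (here AR(10) and AR(50)) were saved as plain-text score files; the file names and threshold value are illustrative:

```python
import numpy as np

scores_10 = np.loadtxt("scores_ar10.txt")   # one CM score per test utterance
scores_50 = np.loadtxt("scores_ar50.txt")
fused = (scores_10 + scores_50) / 2         # averaging the score files

T = 0.0                                     # decision threshold (assumed)
is_bonafide = fused >= T                    # True -> original voice
```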
The specific comprehensive implementation process is as follows:
Firstly, the speech of the training set, verification set and test set in the known database is fixed to length 64000 and segmented so that each segment contains 160 sampling points, and the AR coefficients of each speech segment are extracted, with the AR order selected between 8 and 150. Experiments were conducted with AR coefficients of orders 8, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140 and 150. The resulting training feature set and verification feature set are used to train the convolutional neural network classifier, the optimal parameters are stored, and the test feature set is used for testing. The two indexes t-DCF and EER are used to evaluate the feature effect; although the effect fluctuates across orders on the whole, a good voice tampering detection effect is reflected throughout. The experimental results are shown in Table 1.
TABLE 1: t-DCF and EER results for each AR order on the verification (Development) and test (Evaluation) sets (the table is reproduced as an image in the original publication).
Wherein Development denotes the result items on the verification set and Evaluation denotes the result items on the test set. The experimental results in Table 1 show that the feature performs best at order 10 [AR(10)]; moreover, in the last row of Table 1, the order-10 and order-50 features are fused, which further improves the feature's effectiveness.
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.
Claims (10)
1. A synthetic speech detection method based on autoregressive model coefficients is characterized by at least comprising the following steps:
s1, fixing the voice segments of a training set, a verification set and a test set of a known database to be a uniform length a;
s2, segmenting the voices of the training set, the verification set and the test set, extracting AR coefficients of different orders of each segmented voice, arranging the extracted AR coefficients to form two-dimensional AR voice features, and forming a training feature set, a verification feature set and a test feature set of the two-dimensional AR voice features;
s3, training a convolutional neural network classifier by utilizing a training feature set and a verification feature set of the two-dimensional AR voice feature, and storing the optimal parameters of the convolutional neural network classifier;
s4, classifying the test feature set by using the trained convolutional neural network classifier, and determining whether the voice is subjected to voice synthesis or voice conversion tampering operation;
and S5, fusing the two-dimensional AR voice features with different orders to confirm whether the voice is subjected to voice synthesis or voice conversion tampering operation.
2. The method for detecting synthesized speech based on autoregressive model coefficients according to claim 1, wherein the process of fixing the speech segments of the training set, the verification set and the test set of the known database to a uniform length a in step S1 is:
s101, selecting a known database, and fixing the lengths of the voice segments of a training set, a verification set and a test set of the known database as a sampling points;
s102, judging whether the voice length of any one of the training set, the verification set and the test set before fixation is larger than or equal to a, if so, performing truncation operation, and fixing the voice length as a; otherwise, the voice length before fixing is expanded by copying the voice, and then the voice length is fixed as a.
3. The method for detecting synthesized speech based on autoregressive model coefficients according to claim 2, wherein the specific process of constructing the training feature set, the verification feature set and the test feature set of the two-dimensional AR speech feature described in step S2 is as follows:
s201, dividing the voice fragments with the fixed uniform length of a in the training set, the verification set and the test set into b sections;
s202, extracting the AR coefficient with the order of h of each segmented voice, and arranging the extracted b-dimensional AR coefficients to form a two-dimensional AR voice feature;
and S203, respectively forming a training feature set, a verification feature set and a test feature set of the two-dimensional AR voice features with the dimension of b multiplied by h.
4. The method according to claim 3, wherein h in step S202 satisfies the following condition:
8 ≤ h ≤ 150, wherein h represents a positive integer.
5. The method for detecting the synthesized speech based on the autoregressive model coefficient as claimed in claim 1 or 3, wherein the training feature set of the two-dimensional AR speech feature comprises an original speech feature and a tampered speech feature, and in step S3, the convolutional neural network classifier is obtained by training through a gradient descent method; and taking the parameter of the convolutional neural network classifier corresponding to the best accuracy index of the verification feature set as an optimal parameter and storing the optimal parameter.
6. The method according to claim 5, wherein the step S4 of classifying the test feature set by using the trained convolutional neural network classifier to determine whether the speech is subjected to speech synthesis or speech conversion tampering operation comprises the following specific steps:
s401, inputting a test feature set of two-dimensional AR voice features with dimension b × h into the trained convolutional neural network classifier;
s402, according to a calculation formula of the voice score CM:
CM(f)=log(p(bonafide|f;θ))-log(p(spoof|f;θ))
calculating the score of each voice and storing a score file, wherein f represents the currently input voice feature and θ represents the stored optimal parameters; p(bonafide|f;θ) represents the probability that the input feature f is an original (bonafide) voice feature, and p(spoof|f;θ) represents the probability that the input feature f is a synthetic tampered (spoof) voice feature;
s403, judging whether the value of CM (f) is larger than or equal to a judgment threshold value T, if so, the sent voice feature is original voice, otherwise, the sent voice feature is synthesized voice or converted tampered voice;
and S404, further determining the t-DCF index and the equal error rate EER index of the current test feature set according to the stored score file and the score file given by the known database.
7. The method of claim 6, wherein the step S5 of fusing the two-dimensional AR speech features of different orders to determine whether the speech is subjected to speech synthesis or speech conversion falsification comprises:
s501, respectively inputting test feature sets of two-dimensional AR voice features of different orders into a trained convolutional neural network classifier to obtain different score files and storing the score files;
s502, averaging different score files to obtain a fused score file;
s503, judging whether the score value after fusion is larger than or equal to a judgment threshold value T, if so, the voice feature after fusion is original voice, otherwise, the voice feature after fusion is synthesized voice or converted and tampered voice;
and S504, further determining the fused t-DCF index and equal error rate EER index according to the fused score file and the score file given by the known database.
8. The method of claim 7, wherein the larger the CM(f) score, the greater the probability that the currently input two-dimensional AR speech feature is original speech; the smaller the CM(f) score, the smaller that probability.
9. The method of claim 8, wherein the smaller the t-DCF metric and the EER metric, the more efficient the two-dimensional AR speech features are.
10. The method of auto-regressive model coefficient-based synthesized speech detection according to claim 9, wherein the known database is an ASVspoof 2019 speech data set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110212380.5A CN112967712A (en) | 2021-02-25 | 2021-02-25 | Synthetic speech detection method based on autoregressive model coefficient |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110212380.5A CN112967712A (en) | 2021-02-25 | 2021-02-25 | Synthetic speech detection method based on autoregressive model coefficient |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112967712A (en) | 2021-06-15
Family
ID=76286141
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110212380.5A Pending CN112967712A (en) | 2021-02-25 | 2021-02-25 | Synthetic speech detection method based on autoregressive model coefficient |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112967712A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2860706A2 (en) * | 2013-09-24 | 2015-04-15 | Agnitio S.L. | Anti-spoofing |
CN103730121A (en) * | 2013-12-24 | 2014-04-16 | 中山大学 | Method and device for recognizing disguised sounds |
CN110491391A (en) * | 2019-07-02 | 2019-11-22 | 厦门大学 | A kind of deception speech detection method based on deep neural network |
CN111243621A (en) * | 2020-01-14 | 2020-06-05 | 四川大学 | Construction method of GRU-SVM deep learning model for synthetic speech detection |
CN111445924A (en) * | 2020-03-18 | 2020-07-24 | 中山大学 | Method for detecting and positioning smooth processing in voice segment based on autoregressive model coefficient |
CN111798828A (en) * | 2020-05-29 | 2020-10-20 | 厦门快商通科技股份有限公司 | Synthetic audio detection method, system, mobile terminal and storage medium |
CN112349267A (en) * | 2020-10-28 | 2021-02-09 | 天津大学 | Synthesized voice detection method based on attention mechanism characteristics |
Non-Patent Citations (1)
Title |
---|
YU HONG: "Research on Synthetic Speech Detection Algorithms" (合成语音检测算法研究), China Doctoral Dissertations Full-text Database, Information Science and Technology Series *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113488074A (en) * | 2021-08-20 | 2021-10-08 | 四川大学 | Long-time variable Q time-frequency conversion algorithm of audio signal and application thereof |
CN113488074B (en) * | 2021-08-20 | 2023-06-23 | 四川大学 | Two-dimensional time-frequency characteristic generation method for detecting synthesized voice |
Similar Documents
Publication | Title
---|---
CN108900725B (en) | Voiceprint recognition method and device, terminal equipment and storage medium
WO2020211354A1 (en) | Speaker identity recognition method and device based on speech content, and storage medium
KR100636317B1 (en) | Distributed Speech Recognition System and method
US8160877B1 (en) | Hierarchical real-time speaker recognition for biometric VoIP verification and targeting
CN103179122B (en) | A kind of anti-telecommunications telephone fraud method and system based on voice semantic content analysis
EP1083542A2 (en) | A method and apparatus for speech detection
CN112735383A (en) | Voice signal processing method, device, equipment and storage medium
US9043207B2 (en) | Speaker recognition from telephone calls
CN112712809B (en) | Voice detection method and device, electronic equipment and storage medium
CN110120230B (en) | Acoustic event detection method and device
CN106910495A (en) | Audio classification system and method applied to abnormal sound detection
Kim et al. | Hierarchical approach for abnormal acoustic event classification in an elevator
CN102915740B (en) | Phonetic empathy Hash content authentication method capable of implementing tamper localization
JP5050698B2 (en) | Voice processing apparatus and program
CN112466287A (en) | Voice segmentation method and device and computer readable storage medium
Dhanalakshmi et al. | Pattern classification models for classifying and indexing audio signals
Efanov et al. | The BiLSTM-based synthesized speech recognition
Birla | A robust unsupervised pattern discovery and clustering of speech signals
CN110246509A (en) | A kind of stack denoising self-encoding encoder and deep neural network structure for voice lie detection
CN112967712A (en) | Synthetic speech detection method based on autoregressive model coefficient
Mandalapu et al. | Multilingual voice impersonation dataset and evaluation
Alex et al. | Variational autoencoder for prosody-based speaker recognition
CN109920447 (en) | Recording fraud detection method based on sef-adapting filter Amplitude & Phase feature extraction
CN116153336B (en) | Synthetic voice detection method based on multi-domain information fusion
Kanrar | Dimension compactness in speaker identification
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||