CN112967712A - Synthetic speech detection method based on autoregressive model coefficient - Google Patents

Synthetic speech detection method based on autoregressive model coefficient

Info

Publication number
CN112967712A
CN112967712A
Authority
CN
China
Prior art keywords
voice
feature
speech
dimensional
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110212380.5A
Other languages
Chinese (zh)
Inventor
王铮
康显桂
李中华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202110212380.5A
Publication of CN112967712A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L 25/60 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Storage Device Security (AREA)

Abstract

The invention provides a synthetic speech detection method based on autoregressive (AR) model coefficients, relating to the technical field of voice detection. It addresses the problem that directly applying existing voice feature extraction algorithms to voice detection cannot achieve detection efficiency and detection accuracy at the same time. First, the voice segments of the training set, verification set and test set in a database are fixed to a uniform length; after segmentation, the AR coefficients of each section of the voice signal are extracted to form two-dimensional AR voice features, from which a training feature set, a verification feature set and a test feature set are constructed and a convolutional neural network classifier is trained. The trained classifier then classifies the test feature set to confirm whether the voice signal to be tested has undergone voice synthesis or voice conversion tampering. Because no existing voice feature extraction algorithm is applied directly, the amount of computation in the detection process is reduced, and by finally fusing features of different orders the method takes both the detection efficiency and the detection accuracy of voice detection into account.

Description

Synthetic speech detection method based on autoregressive model coefficient
Technical Field
The invention relates to the technical field of voice detection, in particular to a synthetic voice detection method based on autoregressive model coefficients.
Background
Automatic Speaker Verification (ASV) is deployed in an increasing number of applications and services, such as mobile phones, smart speakers and call centers, to provide a low-cost and flexible biometric solution for personal identity verification. Although the performance of ASV systems has gradually improved in recent years, they remain vulnerable to spoofing attacks. In terms of audio tampering, which is of particular concern, there are mainly two kinds of spoofing attack: speech synthesis attacks and voice conversion attacks, both of which pose a significant threat to ASV systems. Text-to-Speech (TTS) is a technology that converts text into speech: much like a human speaker, a TTS system can utter the content to be expressed in different tones and thereby generate a completely artificial speech signal. Voice Conversion (VC) operates on natural speech: given an input utterance, it makes the speech sound as if it were spoken by another person while keeping the spoken content unchanged. Both speech synthesis (SS) and VC technologies can produce high-quality speech signals that mimic the voice of a specific target.
Currently, most research on the detection of synthesized speech requires extracting features from speech; commonly used speech features include MFCC, CQCC and Spec. MFCC is one of the most commonly used magnitude-based features in speech processing; it performs cepstral analysis on the logarithmic spectrum on the Mel scale and is suitable for distinguishing tampered speech from human speech. CQCC is a magnitude-based feature that combines the Constant Q Transform (CQT) with traditional cepstral analysis. The Spec (spectrogram) feature is more primitive than MFCC and CQCC, since it is obtained by computing the STFT over a Hamming window and then taking the magnitude of each component.
On 16 November 2018, Chinese patent CN108831506A disclosed a GMM-BIC-based digital audio tampering point detection method and system, which also belongs to the technical field of voice detection. The method proposed in that patent first segments the silent frames in a voice signal and extracts their MFCC features, and then uses a GMM-BIC method in place of the traditional SGM-BIC for digital audio tampering point detection, so that the localization of digital audio tampering is automatic, adaptable and robust while the detection accuracy is maintained.
An autoregressive (AR) model is one of the most common stationary time-series models and is a statistical method for processing time series. Speech likewise belongs to one-dimensional data, and the relations within a speech sequence can be evaluated by an AR linear prediction model, so studying how to detect speech tampering on the basis of AR coefficients is of great significance.
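For a single section of speech, order-h AR coefficients can be estimated with the classical autocorrelation (Levinson-Durbin) method. The following is a minimal numpy sketch of that computation; the function name and the biased autocorrelation estimate are illustrative choices and are not details fixed by the patent.

```python
import numpy as np

def ar_coefficients(frame, order):
    """Estimate order-`order` AR (linear prediction) coefficients of one
    speech segment via the autocorrelation method / Levinson-Durbin recursion."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Biased autocorrelation for lags 0 .. order
    r = np.array([frame[: n - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order)                 # prediction coefficients a_1 .. a_order
    err = r[0] + 1e-12                  # small guard against all-zero segments
    for i in range(order):
        # Reflection coefficient of the (i+1)-th order model
        acc = r[i + 1] - a[:i] @ r[1:i + 1][::-1]
        k = acc / err
        a_new = a.copy()
        a_new[i] = k
        a_new[:i] = a[:i] - k * a[:i][::-1]
        a = a_new
        err *= (1.0 - k * k)
    return a
```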
Disclosure of Invention
In order to solve the prior-art problem that directly applying existing voice feature extraction algorithms to voice detection cannot achieve detection efficiency and detection accuracy at the same time, the invention provides a synthetic speech detection method based on autoregressive model coefficients, which reduces the amount of computation in the detection process and improves the detection accuracy.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
a synthetic speech detection method based on autoregressive model coefficients at least comprises the following steps:
s1, fixing the voice segments of a training set, a verification set and a test set of a known database to be a uniform length a;
s2, segmenting the voices of the training set, the verification set and the test set, extracting AR coefficients of different orders of each segmented voice, arranging the extracted AR coefficients to form two-dimensional AR voice features, and forming a training feature set, a verification feature set and a test feature set of the two-dimensional AR voice features;
s3, training a convolutional neural network classifier by utilizing a training feature set and a verification feature set of the two-dimensional AR voice feature, and storing the optimal parameters of the convolutional neural network classifier;
s4, classifying the test feature set by using the trained convolutional neural network classifier, and determining whether the voice is subjected to voice synthesis or voice conversion tampering operation;
and S5, fusing the two-dimensional AR voice features with different orders to confirm whether the voice is subjected to voice synthesis or voice conversion tampering operation.
In this technical scheme, after the voices of the training set, verification set and test set of a known database are segmented, AR coefficients are extracted from each section of the voice signal and arranged to form two-dimensional AR voice features; a convolutional neural network classifier is then trained for classification so that it can be confirmed whether the voice has undergone voice synthesis or voice conversion tampering. No existing voice feature extraction algorithm is applied directly, which reduces the amount of computation in the detection process; finally, fusing two-dimensional AR features of different orders further improves the attainable detection precision, so both the detection efficiency and the detection accuracy of voice detection are taken into account.
Preferably, the process of fixing the speech segments of the training set, the verification set and the test set of the known database to the uniform length a in step S1 is:
s101, selecting a known database, and fixing the lengths of the voice segments of a training set, a verification set and a test set of the known database as a sampling points;
s102, judging whether the voice length of any one of the training set, the verification set and the test set before fixation is larger than or equal to a, if so, performing truncation operation, and fixing the voice length as a; otherwise, the voice length before fixing is expanded by copying the voice, and then the voice length is fixed as a.
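A minimal sketch of this length-normalization step (steps S101 and S102) is given below. It assumes that "expanding the voice by copying" simply tiles the whole utterance until a samples are reached; the patent itself only states that the voice is extended by copying.

```python
import numpy as np

def fix_length(speech, a):
    """Force a speech signal to exactly `a` samples: truncate if it is
    longer, tile (copy) the utterance and cut if it is shorter."""
    speech = np.asarray(speech)
    if len(speech) >= a:
        return speech[:a]
    reps = int(np.ceil(a / len(speech)))
    return np.tile(speech, reps)[:a]
```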
Preferably, the specific process of constructing the training feature set, the verification feature set and the test feature set of the two-dimensional AR speech feature described in step S2 is as follows:
s201, dividing the voice fragments with the fixed uniform length of a in the training set, the verification set and the test set into b sections;
s202, extracting order-h AR coefficients from each segmented section of voice, and arranging the AR coefficients extracted from the b segments to form a two-dimensional AR voice feature;
and S203, respectively forming a training feature set, a verification feature set and a test feature set of the two-dimensional AR voice features with the dimension of b multiplied by h.
Preferably, h in step S202 satisfies:
8 ≤ h ≤ 150, wherein h represents a positive integer; this range ensures the effectiveness of the AR coefficients in application. One possible construction of the b × h feature of steps S201 to S203 is sketched below.
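Combining the two previous sketches, the b × h two-dimensional AR feature of a single utterance could be assembled as follows; the helper names come from the earlier illustrative sketches and are assumptions of this example, not requirements of the patent.

```python
import numpy as np

def two_dim_ar_feature(speech, a, b, h):
    """Fix the utterance to `a` samples, split it into `b` equal segments
    and stack the order-`h` AR coefficients of every segment into a
    (b, h) two-dimensional AR speech feature."""
    x = fix_length(speech, a)            # length normalization (step S1)
    segments = x.reshape(b, a // b)      # assumes a is divisible by b
    return np.stack([ar_coefficients(seg, h) for seg in segments])
```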
Preferably, the training feature set of the two-dimensional AR voice features includes original voice features and tampered voice features; in step S3, the convolutional neural network classifier is obtained by training with a gradient descent method, and the parameters of the convolutional neural network classifier corresponding to the best accuracy index on the verification feature set are taken as the optimal parameters and stored.
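No particular network topology or training schedule is prescribed here, so the following PyTorch sketch is only one plausible realization of step S3: a small CNN trained by gradient descent, keeping the parameters that give the best accuracy on the verification feature set. The architecture and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ARFeatureCNN(nn.Module):
    """Illustrative CNN over b x h two-dimensional AR features."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
        )
        self.fc = nn.Linear(32 * 4 * 4, num_classes)

    def forward(self, x):                          # x: (batch, 1, b, h)
        return self.fc(self.conv(x).flatten(1))

def train_classifier(model, train_loader, val_loader, epochs=30, lr=1e-3):
    """Gradient-descent training; returns the parameters with the best
    accuracy on the verification (validation) feature set."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    best_acc, best_state = 0.0, None
    for _ in range(epochs):
        model.train()
        for feats, labels in train_loader:
            optimizer.zero_grad()
            loss_fn(model(feats), labels).backward()
            optimizer.step()
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for feats, labels in val_loader:
                correct += (model(feats).argmax(dim=1) == labels).sum().item()
                total += labels.numel()
        if total and correct / total > best_acc:
            best_acc = correct / total
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
    return best_state
```

The returned state dictionary can then be saved (for example with torch.save) as the stored optimal parameters of step S3.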
Preferably, the specific process of classifying the test feature set by using the trained convolutional neural network classifier and determining whether the speech is subjected to speech synthesis or speech conversion tampering operation in step S4 includes:
s401, inputting the test feature set of two-dimensional AR voice features with dimension b × h into the trained convolutional neural network classifier;
s402, according to a calculation formula of the voice score CM:
CM(f)=log(p(bonafide|f;θ))-log(p(spoof|f;θ))
calculating the score of each voice item and storing a score file, wherein f represents the currently input voice feature and θ represents the stored optimal parameters; p(bonafide|f;θ) represents the probability that the input feature f is an original (bonafide) voice feature, and p(spoof|f;θ) represents the probability that the input feature f is a synthesized or tampered (spoof) voice feature;
s403, judging whether the value of CM (f) is larger than or equal to a judgment threshold value T, if so, the sent voice feature is original voice, otherwise, the sent voice feature is synthesized voice or converted tampered voice;
and S404, further determining the t-DCF index and the equal error rate (EER) index of the current test feature set according to the stored score file and the score file given by the known database.
Here, the value of CM(f) represents the likelihood that the voice signal to be tested is original voice, and the t-DCF index and the equal error rate (EER) index are introduced in step S404 to further judge the effectiveness of the two-dimensional AR voice feature.
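A sketch of how the CM(f) score and the threshold decision of steps S402 and S403 could be computed from such a classifier is shown below. Mapping class index 0 to bonafide and 1 to spoof, and applying a softmax to the network output, are assumptions of this sketch rather than details stated in the patent.

```python
import torch
import torch.nn.functional as F

def cm_score(model, feature):
    """CM(f) = log p(bonafide|f;theta) - log p(spoof|f;theta), taken from
    the classifier's softmax output (class 0 = bonafide, 1 = spoof)."""
    model.eval()
    with torch.no_grad():
        x = feature.unsqueeze(0).unsqueeze(0)      # (b, h) tensor -> (1, 1, b, h)
        log_p = F.log_softmax(model(x), dim=1)[0]
    return (log_p[0] - log_p[1]).item()

def is_original(score, threshold_t=0.0):
    """Step S403: original voice if CM(f) >= T, otherwise synthesized or
    converted (tampered) voice.  T is a tunable operating point."""
    return score >= threshold_t
```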
Preferably, in step S5, the process of fusing two-dimensional AR speech features of different orders and determining whether speech is subjected to speech synthesis or speech conversion tampering operation includes:
s501, respectively inputting test feature sets of two-dimensional AR voice features of different orders into a trained convolutional neural network classifier to obtain different score files and storing the score files;
s502, averaging different score files to obtain a fused score file;
s503, judging whether the score value after fusion is larger than or equal to a judgment threshold value T, if so, the voice feature after fusion is original voice, otherwise, the voice feature after fusion is synthesized voice or converted and tampered voice;
and S504, further determining the fused t-DCF index and the fused equal error rate (EER) index according to the fused score file and the score file given by the known database.
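As a sketch of steps S501 and S502, the score files produced with AR features of different orders can be fused by simple averaging; representing each score file as a dict mapping utterance id to CM score is an assumption made only for this illustration.

```python
import numpy as np

def fuse_scores(score_files):
    """Average the per-utterance CM scores of several score files
    (one per AR order) into a single fused score file."""
    utt_ids = score_files[0].keys()
    return {u: float(np.mean([sf[u] for sf in score_files])) for u in utt_ids}
```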
Preferably, the larger the CM(f) score, the greater the probability that the currently input two-dimensional AR voice feature is original voice; the smaller the CM(f) score, the smaller that probability.
Preferably, the smaller the t-DCF index and the equal error rate (EER) index, the more effective the two-dimensional AR voice feature.
Preferably, the known database is the ASVspoof 2019 speech data set.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides a synthetic voice detection method based on autoregressive model coefficients, firstly fixing voice segments of a training set, a verification set and a test set in a known database to be uniform in length, extracting AR coefficients of each section of voice signals after segmentation to form two-dimensional AR voice characteristics, thereby constructing a training characteristic set, a verification characteristic set and a test characteristic set, training a convolutional neural network classifier by utilizing the training characteristic set and the verification characteristic set, classifying the test characteristic set by the trained convolutional neural network classifier, confirming whether the voice signal to be tested is subjected to voice synthesis or voice conversion tampering operation, compared with the prior art, the existing voice feature extraction algorithm is not directly applied, the calculated amount in the detection process is reduced, the detectable precision is further improved by fusing the two-dimensional AR features with different orders, and the detection efficiency and the detection accuracy of voice detection are considered.
Drawings
Fig. 1 is a flowchart illustrating a method for detecting synthesized speech based on auto-regression model coefficients according to an embodiment of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for better illustration of the present embodiment, certain parts of the drawings may be omitted, enlarged or reduced, and do not represent actual dimensions;
it will be understood by those skilled in the art that certain well-known descriptions of the figures may be omitted.
The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
the technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
A flow chart of a method for detecting synthesized speech based on autoregressive model coefficients, as shown in fig. 1, with reference to fig. 1, the method comprises:
s1, fixing the voice segments of the training set, verification set and test set of a known database to a uniform length a; in this embodiment, the ASVspoof 2019 speech data set is selected as the known database and the uniform length is 64000. The specific fixing process is as follows:
s101, selecting the ASVspoof 2019 voice data set as the known database, and fixing the lengths of the voice segments of its training set, verification set and test set to 64000 sampling points;
s102, judging whether the length of any voice item of the training set, verification set or test set before fixing is greater than or equal to 64000; if so, a truncation operation is performed and the voice length is fixed to 64000; otherwise the voice is extended by copying itself and the length is then fixed to 64000. In other words, when the length of a voice item before fixing exceeds 64000, it is directly truncated by conventional means so that it meets the uniform length of 64000; when the length before fixing is less than 64000, the voice fragment is copied to extend the original voice until it meets the uniform length of 64000;
s2, segmenting the voices of the training set, the verification set and the test set, extracting AR coefficients of different orders of each segmented voice, arranging the extracted AR coefficients to form two-dimensional AR voice features, and forming a training feature set, a verification feature set and a test feature set of the two-dimensional AR voice features;
The extraction of the AR coefficients can be realized by various mature technical means; the remaining specific process is as follows:
s201, dividing the fixed voice fragments with the uniform length of 64000 in a training set, a verification set and a test set into 400 sections;
s202, extracting the AR coefficient with the order of h of each segmented voice, and arranging the extracted 400-dimensional AR coefficients to form a two-dimensional AR voice feature;
in this embodiment, to ensure the effectiveness of the AR coefficients in application, the order h satisfies 8 ≤ h ≤ 150, wherein h represents a positive integer, that is, h may take the end point 8 or 150 or any positive integer between them, and each of the 400 divided voice segments contains 160 sampling points; the training feature set of the two-dimensional AR voice features comprises original voice features and tampered voice features;
s203, arranging the order-h AR coefficients extracted from the 400 divided segments to form, respectively, a training feature set, a verification feature set and a test feature set of two-dimensional AR voice features with dimensionality 400 × h.
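As a usage illustration only, the feature-building sketch given earlier can be applied with this embodiment's values (a = 64000 samples, b = 400 segments of 160 samples each, for example h = 10); the random signal below merely stands in for a real utterance.

```python
import numpy as np

speech = np.random.randn(52300)                        # stand-in for one utterance
feature = two_dim_ar_feature(speech, a=64000, b=400, h=10)
print(feature.shape)                                   # -> (400, 10)
```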
S3, training a convolutional neural network classifier by utilizing the training feature set and the verification feature set of the two-dimensional AR voice features, and storing the optimal parameters of the convolutional neural network classifier; in this embodiment, the convolutional neural network classifier is obtained by training with a gradient descent method and is not limited to a specific architecture; the parameters of the classifier corresponding to the best accuracy index on the verification feature set are then taken as the optimal parameters and stored.
S4, classifying the test feature set by using the trained convolutional neural network classifier, and determining whether the voice is subjected to voice synthesis or voice conversion tampering operation;
the specific process is as follows:
s401, inputting the test feature set of two-dimensional AR voice features with dimensionality 400 × h into the trained convolutional neural network classifier;
s402, according to a calculation formula of the voice score CM:
CM(f)=log(p(bonafide|f;θ))-log(p(spoof|f;θ))
calculating the score of each voice item and storing a score file, wherein f represents the currently input voice feature and θ represents the stored optimal parameters; p(bonafide|f;θ) represents the probability that the input feature f is an original (bonafide) voice feature, and p(spoof|f;θ) represents the probability that the input feature f is a synthesized or tampered (spoof) voice feature;
s403, judging whether the value of CM (f) is larger than or equal to a judgment threshold value T, if so, the sent voice feature is original voice, otherwise, the sent voice feature is synthesized voice or converted tampered voice;
s404, further determining the t-DCF index and the equal error rate (EER) index of the current test feature set according to the stored score file and the score file given by the known database. Here the value of the CM(f) score represents the likelihood that the voice signal to be tested is original voice, and the t-DCF and EER indices are introduced in step S404 to further judge the effectiveness of the two-dimensional AR voice feature. The t-DCF (tandem detection cost function) is a tandem cost-function index determined jointly by the two systems (ASV and CM); in the standard ASVspoof formulation it is calculated as

t-DCF(s) = C1 · P_miss^cm(s) + C2 · P_fa^cm(s)

with

C1 = π_tar · (C_miss^cm − C_miss^asv · P_miss^asv(t)) − π_non · C_fa^asv · P_fa^asv(t)

C2 = C_fa^cm · π_spoof · (1 − P_miss,spoof^asv(t))

wherein:
C_miss^asv is the cost of the ASV system rejecting a target speaker's voice;
C_fa^asv is the cost of the ASV system accepting non-target voice;
C_miss^cm is the cost of the CM system rejecting real (bonafide) voice;
C_fa^cm is the cost of the CM system accepting tampered (spoof) voice;
π_tar, π_non and π_spoof are the prior probabilities of target, non-target and spoofed voice;
P_miss^asv(t) is the false rejection rate of the ASV system at threshold t;
P_fa^asv(t) is the false acceptance rate of the ASV system at threshold t;
P_miss,spoof^asv(t) is the probability that a spoofed sample is rejected by the ASV system, so that 1 − P_miss,spoof^asv(t) is the probability that a spoofed sample is not stopped by the ASV system;
P_miss^cm(s) is the false rejection rate of the CM system at threshold s;
P_fa^cm(s) is the false acceptance rate of the CM system at threshold s.
The equal error rate (EER) is the operating point at which the False Rejection Rate (FRR) equals the False Acceptance Rate (FAR); the common value of FAR and FRR at this point is called the equal error rate.
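A simple brute-force sketch of computing the EER from a stored score file and its ground-truth labels is given below (labels: 1 = bonafide, 0 = spoof); sweeping the threshold over the observed scores is an approximation of the exact EER and only an illustration.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Sweep the decision threshold over all observed scores and return
    the error rate at the point where FRR (bonafide rejected) and FAR
    (spoof accepted) are closest to each other."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        frr = np.mean(scores[labels == 1] < t)    # bonafide below threshold
        far = np.mean(scores[labels == 0] >= t)   # spoof at/above threshold
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2
    return float(eer)
```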
S5, fusing two-dimensional AR voice features with different orders, and confirming whether voice is subjected to voice synthesis or voice conversion tampering operation, wherein the process is as follows:
s501, respectively inputting test feature sets of two-dimensional AR voice features of different orders into a trained convolutional neural network classifier to obtain different score files and storing the score files;
s502, averaging different score files to obtain a fused score file;
s503, judging whether the score value after fusion is larger than or equal to a judgment threshold value T, if so, the voice feature after fusion is original voice, otherwise, the voice feature after fusion is synthesized voice or converted and tampered voice;
and S504, further determining the fused t-DCF index and the fused equal error rate (EER) index according to the fused score file and the score file given by the known database.
In this embodiment, the larger the CM(f) score, the larger the probability that the currently input two-dimensional AR voice feature is original voice; the smaller the CM(f) score, the smaller that probability. The smaller the t-DCF index and the equal error rate (EER) index, the more effective the two-dimensional AR voice feature.
The overall implementation process is as follows: first, the voices of the training set, verification set and test set in the known database are fixed to a length of 64000 samples and divided into segments of 160 sampling points each, and the AR coefficients of each segment of voice are extracted, the AR order being chosen between 8 and 150.
Experiments were conducted with AR coefficients of orders 8, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140 and 150. The resulting training feature set and verification feature set are used to train the convolutional neural network classifier, the optimal parameters are stored, and the test feature set is used for testing. In addition, the two indices t-DCF and EER are used to evaluate the effectiveness of the features; although the results fluctuate somewhat across the orders, they consistently reflect a good voice tampering detection effect. The experimental results are shown in Table 1.
TABLE 1
(Table 1, which reports the t-DCF and EER of each AR order on the development and evaluation sets, is provided as an image in the original publication and is not reproduced here.)
Here, Development denotes the results on the verification set and Evaluation denotes the results on the test set. The experimental results in Table 1 show that the feature performs best at order 10 [AR(10)]; in the last row of Table 1, the order-10 and order-50 features are fused, which improves the feature effect further.
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. A synthetic speech detection method based on autoregressive model coefficients is characterized by at least comprising the following steps:
s1, fixing the voice segments of a training set, a verification set and a test set of a known database to be a uniform length a;
s2, segmenting the voices of the training set, the verification set and the test set, extracting AR coefficients of different orders of each segmented voice, arranging the extracted AR coefficients to form two-dimensional AR voice features, and forming a training feature set, a verification feature set and a test feature set of the two-dimensional AR voice features;
s3, training a convolutional neural network classifier by utilizing a training feature set and a verification feature set of the two-dimensional AR voice feature, and storing the optimal parameters of the convolutional neural network classifier;
s4, classifying the test feature set by using the trained convolutional neural network classifier, and determining whether the voice is subjected to voice synthesis or voice conversion tampering operation;
and S5, fusing the two-dimensional AR voice features with different orders to confirm whether the voice is subjected to voice synthesis or voice conversion tampering operation.
2. The method for detecting synthesized speech based on autoregressive model coefficients as claimed in claim 1, wherein the step S1 is performed by fixing the speech segments of the training set, the validation set and the test set of the known database to a uniform length a:
s101, selecting a known database, and fixing the lengths of the voice segments of a training set, a verification set and a test set of the known database as a sampling points;
s102, judging whether the voice length of any one of the training set, the verification set and the test set before fixation is larger than or equal to a, if so, performing truncation operation, and fixing the voice length as a; otherwise, the voice length before fixing is expanded by copying the voice, and then the voice length is fixed as a.
3. The method for detecting synthesized speech based on autoregressive model coefficients according to claim 2, wherein the specific process of constructing the training feature set, the verification feature set and the test feature set of the two-dimensional AR speech feature described in step S2 is as follows:
s201, dividing the voice fragments with the fixed uniform length of a in the training set, the verification set and the test set into b sections;
s202, extracting order-h AR coefficients from each segmented section of voice, and arranging the AR coefficients extracted from the b segments to form a two-dimensional AR voice feature;
and S203, respectively forming a training feature set, a verification feature set and a test feature set of the two-dimensional AR voice features with the dimension of b multiplied by h.
4. The method according to claim 3, wherein h in step S202 satisfies the following condition:
8 ≤ h ≤ 150, wherein h represents a positive integer.
5. The method for detecting the synthesized speech based on the autoregressive model coefficient as claimed in claim 1 or 3, wherein the training feature set of the two-dimensional AR speech feature comprises an original speech feature and a tampered speech feature, and in step S3, the convolutional neural network classifier is obtained by training through a gradient descent method; and taking the parameter of the convolutional neural network classifier corresponding to the best accuracy index of the verification feature set as an optimal parameter and storing the optimal parameter.
6. The method according to claim 5, wherein the step S4 of classifying the test feature set by using the trained convolutional neural network classifier to determine whether the speech is subjected to speech synthesis or speech conversion tampering operation comprises the following specific steps:
s401, inputting the test feature set of two-dimensional AR voice features with dimension b × h into the trained convolutional neural network classifier;
s402, according to a calculation formula of the voice score CM:
CM(f)=log(p(bonafide|f;θ))-log(p(spoof|f;θ))
calculating the score of each voice item and storing a score file, wherein f represents the currently input voice feature and θ represents the stored optimal parameters; p(bonafide|f;θ) represents the probability that the input feature f is an original (bonafide) voice feature, and p(spoof|f;θ) represents the probability that the input feature f is a synthesized or tampered (spoof) voice feature;
s403, judging whether the value of CM (f) is larger than or equal to a judgment threshold value T, if so, the sent voice feature is original voice, otherwise, the sent voice feature is synthesized voice or converted tampered voice;
and S404, further determining the t-DCF index and the equal error probability EER index of the current test feature set according to the stored score file and the score file given by the known database.
7. The method of claim 6, wherein the step S5 of fusing the two-dimensional AR speech features of different orders to determine whether the speech is subjected to speech synthesis or speech conversion falsification comprises:
s501, respectively inputting test feature sets of two-dimensional AR voice features of different orders into a trained convolutional neural network classifier to obtain different score files and storing the score files;
s502, averaging different score files to obtain a fused score file;
s503, judging whether the score value after fusion is larger than or equal to a judgment threshold value T, if so, the voice feature after fusion is original voice, otherwise, the voice feature after fusion is synthesized voice or converted and tampered voice;
and S504, further determining the fused t-DCF index and the equal error probability EER index according to the fused score file and the score file given by the known database.
8. The method of claim 7, wherein the larger the CM(f) score, the greater the probability that the currently input two-dimensional AR speech feature is the original speech; the smaller the CM(f) score, the smaller the probability that the currently input two-dimensional AR speech feature is the original speech.
9. The method of claim 8, wherein the smaller the t-DCF metric and the EER metric, the more efficient the two-dimensional AR speech features are.
10. The method of auto-regressive model coefficient-based synthesized speech detection according to claim 9, wherein the known database is an ASVspoof 2019 speech data set.
CN202110212380.5A 2021-02-25 2021-02-25 Synthetic speech detection method based on autoregressive model coefficient Pending CN112967712A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110212380.5A CN112967712A (en) 2021-02-25 2021-02-25 Synthetic speech detection method based on autoregressive model coefficient

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110212380.5A CN112967712A (en) 2021-02-25 2021-02-25 Synthetic speech detection method based on autoregressive model coefficient

Publications (1)

Publication Number Publication Date
CN112967712A true CN112967712A (en) 2021-06-15

Family

ID=76286141

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110212380.5A Pending CN112967712A (en) 2021-02-25 2021-02-25 Synthetic speech detection method based on autoregressive model coefficient

Country Status (1)

Country Link
CN (1) CN112967712A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2860706A2 (en) * 2013-09-24 2015-04-15 Agnitio S.L. Anti-spoofing
CN103730121A (en) * 2013-12-24 2014-04-16 中山大学 Method and device for recognizing disguised sounds
CN110491391A (en) * 2019-07-02 2019-11-22 厦门大学 A kind of deception speech detection method based on deep neural network
CN111243621A (en) * 2020-01-14 2020-06-05 四川大学 Construction method of GRU-SVM deep learning model for synthetic speech detection
CN111445924A (en) * 2020-03-18 2020-07-24 中山大学 Method for detecting and positioning smooth processing in voice segment based on autoregressive model coefficient
CN111798828A (en) * 2020-05-29 2020-10-20 厦门快商通科技股份有限公司 Synthetic audio detection method, system, mobile terminal and storage medium
CN112349267A (en) * 2020-10-28 2021-02-09 天津大学 Synthesized voice detection method based on attention mechanism characteristics

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
于泓: "Research on Synthetic Speech Detection Algorithms" (合成语音检测算法研究), China Doctoral Dissertations Full-text Database, Information Science and Technology Series *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488074A (en) * 2021-08-20 2021-10-08 四川大学 Long-time variable Q time-frequency conversion algorithm of audio signal and application thereof
CN113488074B (en) * 2021-08-20 2023-06-23 四川大学 Two-dimensional time-frequency characteristic generation method for detecting synthesized voice

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination