CN112967712A - Synthetic speech detection method based on autoregressive model coefficient - Google Patents

Synthetic speech detection method based on autoregressive model coefficient

Info

Publication number
CN112967712A
CN112967712A
Authority
CN
China
Prior art keywords
voice
feature
speech
dimensional
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110212380.5A
Other languages
Chinese (zh)
Inventor
王铮
康显桂
李中华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202110212380.5A
Publication of CN112967712A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L 25/60 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Storage Device Security (AREA)

Abstract

The invention provides a synthetic speech detection method based on autoregressive (AR) model coefficients, relating to the technical field of voice detection. It addresses the problem that directly applying existing voice feature extraction algorithms to voice detection cannot achieve detection efficiency and detection accuracy at the same time. First, the voice segments of the training set, verification set and test set in a database are fixed to a uniform length; after segmentation, the AR coefficients of each section of the voice signal are extracted to form two-dimensional AR voice features, from which a training feature set, a verification feature set and a test feature set are constructed and a convolutional neural network classifier is trained. The trained classifier then classifies the test feature set to confirm whether the voice signal to be tested has undergone voice synthesis or voice conversion tampering. Because no existing voice feature extraction algorithm is applied directly, the amount of computation in the detection process is reduced, and by finally fusing features of different orders the method takes both the detection efficiency and the detection accuracy of voice detection into account.

Description

Synthetic speech detection method based on autoregressive model coefficient
Technical Field
The invention relates to the technical field of voice detection, in particular to a synthetic voice detection method based on autoregressive model coefficients.
Background
Automatic Speaker Verification (ASV) is deployed in an increasing number of applications and services, such as mobile phones, smart speakers and call centers, to provide a low-cost and flexible biometric solution for personal identity verification. Although the performance of ASV systems has gradually improved in recent years, they remain vulnerable to spoofing attacks. In terms of audio tampering, which is of particular concern, there are mainly two kinds of spoofing attack: speech synthesis attacks and voice conversion attacks, both of which pose a significant threat to ASV systems. Text-to-Speech (TTS) is a technology that converts text into speech: much like a human speaker, a TTS system can utter the content to be expressed in different tones and thereby generate a completely artificial speech signal. Voice Conversion (VC) operates on natural speech: given an input utterance, it makes the speech sound as if it were spoken by another person while keeping the spoken content unchanged. Both speech synthesis (SS) and VC technologies can produce high-quality speech signals that mimic the voice of a specific target.
Currently, most research on the detection of synthesized speech requires extracting features from speech; commonly used speech features include MFCC, CQCC and Spec. MFCC is one of the most commonly used magnitude-based features in speech processing; it performs cepstral analysis on the logarithmic spectrum on the Mel scale and is suitable for distinguishing tampered speech from human speech. CQCC is a magnitude-based feature that combines the Constant Q Transform (CQT) with traditional cepstral analysis. The Spec (spectrogram) feature is more primitive than MFCC and CQCC, since it is obtained by computing the STFT over a Hamming window and then taking the magnitude of each component.
On 16 November 2018, Chinese patent CN108831506A disclosed a GMM-BIC-based digital audio tampering point detection method and system, which also belongs to the technical field of voice detection. The method proposed in that patent first segments the silent frames in a voice signal and extracts their MFCC features, and then uses a GMM-BIC method in place of the traditional SGM-BIC for digital audio tampering point detection, so that the localization of digital audio tampering is automatic, adaptable and robust while the detection accuracy is maintained.
An autoregressive (AR) model is one of the most common stationary time-series models and is a statistical method for processing time series. Speech likewise belongs to one-dimensional data, and the relations within a speech sequence can be evaluated by an AR linear prediction model, so studying how to detect speech tampering on the basis of AR coefficients is of great significance.
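For a single section of speech, order-h AR coefficients can be estimated with the classical autocorrelation (Levinson-Durbin) method. The following is a minimal numpy sketch of that computation; the function name and the biased autocorrelation estimate are illustrative choices and are not details fixed by the patent.

```python
import numpy as np

def ar_coefficients(frame, order):
    """Estimate order-`order` AR (linear prediction) coefficients of one
    speech segment via the autocorrelation method / Levinson-Durbin recursion."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Biased autocorrelation for lags 0 .. order
    r = np.array([frame[: n - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order)                 # prediction coefficients a_1 .. a_order
    err = r[0] + 1e-12                  # small guard against all-zero segments
    for i in range(order):
        # Reflection coefficient of the (i+1)-th order model
        acc = r[i + 1] - a[:i] @ r[1:i + 1][::-1]
        k = acc / err
        a_new = a.copy()
        a_new[i] = k
        a_new[:i] = a[:i] - k * a[:i][::-1]
        a = a_new
        err *= (1.0 - k * k)
    return a
```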
Disclosure of Invention
In order to solve the prior-art problem that directly applying existing voice feature extraction algorithms to voice detection cannot achieve detection efficiency and detection accuracy at the same time, the invention provides a synthetic speech detection method based on autoregressive model coefficients, which reduces the amount of computation in the detection process and improves the detection accuracy.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
a synthetic speech detection method based on autoregressive model coefficients at least comprises the following steps:
s1, fixing the voice segments of a training set, a verification set and a test set of a known database to be a uniform length a;
s2, segmenting the voices of the training set, the verification set and the test set, extracting AR coefficients of different orders of each segmented voice, arranging the extracted AR coefficients to form two-dimensional AR voice features, and forming a training feature set, a verification feature set and a test feature set of the two-dimensional AR voice features;
s3, training a convolutional neural network classifier by utilizing a training feature set and a verification feature set of the two-dimensional AR voice feature, and storing the optimal parameters of the convolutional neural network classifier;
s4, classifying the test feature set by using the trained convolutional neural network classifier, and determining whether the voice is subjected to voice synthesis or voice conversion tampering operation;
and S5, fusing the two-dimensional AR voice features with different orders to confirm whether the voice is subjected to voice synthesis or voice conversion tampering operation.
In this technical scheme, after the voices of the training set, verification set and test set of a known database are segmented, AR coefficients are extracted from each section of the voice signal and arranged to form two-dimensional AR voice features; a convolutional neural network classifier is then trained for classification so that it can be confirmed whether the voice has undergone voice synthesis or voice conversion tampering. No existing voice feature extraction algorithm is applied directly, which reduces the amount of computation in the detection process; finally, fusing two-dimensional AR features of different orders further improves the attainable detection precision, so both the detection efficiency and the detection accuracy of voice detection are taken into account.
Preferably, the process of fixing the speech segments of the training set, the verification set and the test set of the known database to the uniform length a in step S1 is:
s101, selecting a known database, and fixing the lengths of the voice segments of a training set, a verification set and a test set of the known database as a sampling points;
s102, judging whether the voice length of any one of the training set, the verification set and the test set before fixation is larger than or equal to a, if so, performing truncation operation, and fixing the voice length as a; otherwise, the voice length before fixing is expanded by copying the voice, and then the voice length is fixed as a.
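A minimal sketch of this length-normalization step (steps S101 and S102) is given below. It assumes that "expanding the voice by copying" simply tiles the whole utterance until a samples are reached; the patent itself only states that the voice is extended by copying.

```python
import numpy as np

def fix_length(speech, a):
    """Force a speech signal to exactly `a` samples: truncate if it is
    longer, tile (copy) the utterance and cut if it is shorter."""
    speech = np.asarray(speech)
    if len(speech) >= a:
        return speech[:a]
    reps = int(np.ceil(a / len(speech)))
    return np.tile(speech, reps)[:a]
```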
Preferably, the specific process of constructing the training feature set, the verification feature set and the test feature set of the two-dimensional AR speech feature described in step S2 is as follows:
s201, dividing the voice fragments with the fixed uniform length of a in the training set, the verification set and the test set into b sections;
s202, extracting order-h AR coefficients from each segmented section of voice, and arranging the AR coefficients extracted from the b segments to form a two-dimensional AR voice feature;
and S203, respectively forming a training feature set, a verification feature set and a test feature set of the two-dimensional AR voice features with the dimension of b multiplied by h.
Preferably, h in step S202 satisfies:
8 ≤ h ≤ 150, wherein h represents a positive integer; this range ensures the effectiveness of the AR coefficients in application. One possible construction of the b × h feature of steps S201 to S203 is sketched below.
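Combining the two previous sketches, the b × h two-dimensional AR feature of a single utterance could be assembled as follows; the helper names come from the earlier illustrative sketches and are assumptions of this example, not requirements of the patent.

```python
import numpy as np

def two_dim_ar_feature(speech, a, b, h):
    """Fix the utterance to `a` samples, split it into `b` equal segments
    and stack the order-`h` AR coefficients of every segment into a
    (b, h) two-dimensional AR speech feature."""
    x = fix_length(speech, a)            # length normalization (step S1)
    segments = x.reshape(b, a // b)      # assumes a is divisible by b
    return np.stack([ar_coefficients(seg, h) for seg in segments])
```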
Preferably, the training feature set of the two-dimensional AR voice features includes original voice features and tampered voice features; in step S3, the convolutional neural network classifier is obtained by training with a gradient descent method, and the parameters of the convolutional neural network classifier corresponding to the best accuracy index on the verification feature set are taken as the optimal parameters and stored.
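No particular network topology or training schedule is prescribed here, so the following PyTorch sketch is only one plausible realization of step S3: a small CNN trained by gradient descent, keeping the parameters that give the best accuracy on the verification feature set. The architecture and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ARFeatureCNN(nn.Module):
    """Illustrative CNN over b x h two-dimensional AR features."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
        )
        self.fc = nn.Linear(32 * 4 * 4, num_classes)

    def forward(self, x):                          # x: (batch, 1, b, h)
        return self.fc(self.conv(x).flatten(1))

def train_classifier(model, train_loader, val_loader, epochs=30, lr=1e-3):
    """Gradient-descent training; returns the parameters with the best
    accuracy on the verification (validation) feature set."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    best_acc, best_state = 0.0, None
    for _ in range(epochs):
        model.train()
        for feats, labels in train_loader:
            optimizer.zero_grad()
            loss_fn(model(feats), labels).backward()
            optimizer.step()
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for feats, labels in val_loader:
                correct += (model(feats).argmax(dim=1) == labels).sum().item()
                total += labels.numel()
        if total and correct / total > best_acc:
            best_acc = correct / total
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
    return best_state
```

The returned state dictionary can then be saved (for example with torch.save) as the stored optimal parameters of step S3.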
Preferably, the specific process of classifying the test feature set by using the trained convolutional neural network classifier and determining whether the speech is subjected to speech synthesis or speech conversion tampering operation in step S4 includes:
s401, inputting the test feature set of two-dimensional AR voice features with dimension b × h into the trained convolutional neural network classifier;
s402, according to a calculation formula of the voice score CM:
CM(f)=log(p(bonafide|f;θ))-log(p(spoof|f;θ))
calculating the score of each voice item and storing a score file, wherein f represents the currently input voice feature and θ represents the stored optimal parameters; p(bonafide|f;θ) represents the probability that the input feature f is an original (bonafide) voice feature, and p(spoof|f;θ) represents the probability that the input feature f is a synthesized or tampered (spoof) voice feature;
s403, judging whether the value of CM (f) is larger than or equal to a judgment threshold value T, if so, the sent voice feature is original voice, otherwise, the sent voice feature is synthesized voice or converted tampered voice;
and S404, further determining the t-DCF index and the equal error rate (EER) index of the current test feature set according to the stored score file and the score file given by the known database.
Here, the value of CM(f) represents the likelihood that the voice signal to be tested is original voice, and the t-DCF index and the equal error rate (EER) index are introduced in step S404 to further judge the effectiveness of the two-dimensional AR voice feature.
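A sketch of how the CM(f) score and the threshold decision of steps S402 and S403 could be computed from such a classifier is shown below. Mapping class index 0 to bonafide and 1 to spoof, and applying a softmax to the network output, are assumptions of this sketch rather than details stated in the patent.

```python
import torch
import torch.nn.functional as F

def cm_score(model, feature):
    """CM(f) = log p(bonafide|f;theta) - log p(spoof|f;theta), taken from
    the classifier's softmax output (class 0 = bonafide, 1 = spoof)."""
    model.eval()
    with torch.no_grad():
        x = feature.unsqueeze(0).unsqueeze(0)      # (b, h) tensor -> (1, 1, b, h)
        log_p = F.log_softmax(model(x), dim=1)[0]
    return (log_p[0] - log_p[1]).item()

def is_original(score, threshold_t=0.0):
    """Step S403: original voice if CM(f) >= T, otherwise synthesized or
    converted (tampered) voice.  T is a tunable operating point."""
    return score >= threshold_t
```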
Preferably, in step S5, the process of fusing two-dimensional AR speech features of different orders and determining whether speech is subjected to speech synthesis or speech conversion tampering operation includes:
s501, respectively inputting test feature sets of two-dimensional AR voice features of different orders into a trained convolutional neural network classifier to obtain different score files and storing the score files;
s502, averaging different score files to obtain a fused score file;
s503, judging whether the score value after fusion is larger than or equal to a judgment threshold value T, if so, the voice feature after fusion is original voice, otherwise, the voice feature after fusion is synthesized voice or converted and tampered voice;
and S504, further determining the fused t-DCF index and the fused equal error rate (EER) index according to the fused score file and the score file given by the known database.
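As a sketch of steps S501 and S502, the score files produced with AR features of different orders can be fused by simple averaging; representing each score file as a dict mapping utterance id to CM score is an assumption made only for this illustration.

```python
import numpy as np

def fuse_scores(score_files):
    """Average the per-utterance CM scores of several score files
    (one per AR order) into a single fused score file."""
    utt_ids = score_files[0].keys()
    return {u: float(np.mean([sf[u] for sf in score_files])) for u in utt_ids}
```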
Preferably, the larger the CM(f) score, the greater the probability that the currently input two-dimensional AR voice feature is original voice; the smaller the CM(f) score, the smaller that probability.
Preferably, the smaller the t-DCF index and the equal error rate (EER) index, the more effective the two-dimensional AR voice feature.
Preferably, the known database is the ASVspoof 2019 speech data set.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides a synthetic voice detection method based on autoregressive model coefficients, firstly fixing voice segments of a training set, a verification set and a test set in a known database to be uniform in length, extracting AR coefficients of each section of voice signals after segmentation to form two-dimensional AR voice characteristics, thereby constructing a training characteristic set, a verification characteristic set and a test characteristic set, training a convolutional neural network classifier by utilizing the training characteristic set and the verification characteristic set, classifying the test characteristic set by the trained convolutional neural network classifier, confirming whether the voice signal to be tested is subjected to voice synthesis or voice conversion tampering operation, compared with the prior art, the existing voice feature extraction algorithm is not directly applied, the calculated amount in the detection process is reduced, the detectable precision is further improved by fusing the two-dimensional AR features with different orders, and the detection efficiency and the detection accuracy of voice detection are considered.
Drawings
Fig. 1 is a flowchart illustrating a method for detecting synthesized speech based on auto-regression model coefficients according to an embodiment of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for better illustration of the present embodiment, certain parts of the drawings may be omitted, enlarged or reduced, and do not represent actual dimensions;
it will be understood by those skilled in the art that certain well-known descriptions of the figures may be omitted.
The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
the technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
A flow chart of a method for detecting synthesized speech based on autoregressive model coefficients, as shown in fig. 1, with reference to fig. 1, the method comprises:
s1, fixing the voice segments of the training set, verification set and test set of a known database to a uniform length a; in this embodiment, the ASVspoof 2019 speech data set is selected as the known database and the uniform length is 64000. The specific fixing process is as follows:
s101, selecting the ASVspoof 2019 voice data set as the known database, and fixing the lengths of the voice segments of its training set, verification set and test set to 64000 sampling points;
s102, judging whether the length of any voice item of the training set, verification set or test set before fixing is greater than or equal to 64000; if so, a truncation operation is performed and the voice length is fixed to 64000; otherwise the voice is extended by copying itself and the length is then fixed to 64000. In other words, when the length of a voice item before fixing exceeds 64000, it is directly truncated by conventional means so that it meets the uniform length of 64000; when the length before fixing is less than 64000, the voice fragment is copied to extend the original voice until it meets the uniform length of 64000;
s2, segmenting the voices of the training set, the verification set and the test set, extracting AR coefficients of different orders of each segmented voice, arranging the extracted AR coefficients to form two-dimensional AR voice features, and forming a training feature set, a verification feature set and a test feature set of the two-dimensional AR voice features;
The extraction of the AR coefficients can be realized by various mature technical means; the remaining specific process is as follows:
s201, dividing the fixed voice fragments with the uniform length of 64000 in a training set, a verification set and a test set into 400 sections;
s202, extracting the AR coefficient with the order of h of each segmented voice, and arranging the extracted 400-dimensional AR coefficients to form a two-dimensional AR voice feature;
in this embodiment, to ensure the effectiveness of the AR coefficients in application, the order h satisfies 8 ≤ h ≤ 150, wherein h represents a positive integer, that is, h may take the end point 8 or 150 or any positive integer between them, and each of the 400 divided voice segments contains 160 sampling points; the training feature set of the two-dimensional AR voice features comprises original voice features and tampered voice features;
s203, arranging the order-h AR coefficients extracted from the 400 divided segments to form, respectively, a training feature set, a verification feature set and a test feature set of two-dimensional AR voice features with dimensionality 400 × h.
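As a usage illustration only, the feature-building sketch given earlier can be applied with this embodiment's values (a = 64000 samples, b = 400 segments of 160 samples each, for example h = 10); the random signal below merely stands in for a real utterance.

```python
import numpy as np

speech = np.random.randn(52300)                        # stand-in for one utterance
feature = two_dim_ar_feature(speech, a=64000, b=400, h=10)
print(feature.shape)                                   # -> (400, 10)
```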
S3, training a convolutional neural network classifier by utilizing the training feature set and the verification feature set of the two-dimensional AR voice features, and storing the optimal parameters of the convolutional neural network classifier; in this embodiment, the convolutional neural network classifier is obtained by training with a gradient descent method and is not limited to a specific architecture; the parameters of the classifier corresponding to the best accuracy index on the verification feature set are then taken as the optimal parameters and stored.
S4, classifying the test feature set by using the trained convolutional neural network classifier, and determining whether the voice is subjected to voice synthesis or voice conversion tampering operation;
the specific process is as follows:
s401, inputting the test feature set of two-dimensional AR voice features with dimensionality 400 × h into the trained convolutional neural network classifier;
s402, according to a calculation formula of the voice score CM:
CM(f)=log(p(bonafide|f;θ))-log(p(spoof|f;θ))
calculating the score of each voice item and storing a score file, wherein f represents the currently input voice feature and θ represents the stored optimal parameters; p(bonafide|f;θ) represents the probability that the input feature f is an original (bonafide) voice feature, and p(spoof|f;θ) represents the probability that the input feature f is a synthesized or tampered (spoof) voice feature;
s403, judging whether the value of CM (f) is larger than or equal to a judgment threshold value T, if so, the sent voice feature is original voice, otherwise, the sent voice feature is synthesized voice or converted tampered voice;
s404, further determining the t-DCF index and the equal error rate (EER) index of the current test feature set according to the stored score file and the score file given by the known database. Here the value of the CM(f) score represents the likelihood that the voice signal to be tested is original voice, and the t-DCF and EER indices are introduced in step S404 to further judge the effectiveness of the two-dimensional AR voice feature. The t-DCF (tandem detection cost function) is a tandem cost-function index determined jointly by the two systems (ASV and CM); in the standard ASVspoof formulation it is calculated as

t-DCF(s) = C1 · P_miss^cm(s) + C2 · P_fa^cm(s)

with

C1 = π_tar · (C_miss^cm − C_miss^asv · P_miss^asv(t)) − π_non · C_fa^asv · P_fa^asv(t)

C2 = C_fa^cm · π_spoof · (1 − P_miss,spoof^asv(t))

wherein:
C_miss^asv is the cost of the ASV system rejecting a target speaker's voice;
C_fa^asv is the cost of the ASV system accepting non-target voice;
C_miss^cm is the cost of the CM system rejecting real (bonafide) voice;
C_fa^cm is the cost of the CM system accepting tampered (spoof) voice;
π_tar, π_non and π_spoof are the prior probabilities of target, non-target and spoofed voice;
P_miss^asv(t) is the false rejection rate of the ASV system at threshold t;
P_fa^asv(t) is the false acceptance rate of the ASV system at threshold t;
P_miss,spoof^asv(t) is the probability that a spoofed sample is rejected by the ASV system, so that 1 − P_miss,spoof^asv(t) is the probability that a spoofed sample is not stopped by the ASV system;
P_miss^cm(s) is the false rejection rate of the CM system at threshold s;
P_fa^cm(s) is the false acceptance rate of the CM system at threshold s.
The equal error rate (EER) is the operating point at which the False Rejection Rate (FRR) equals the False Acceptance Rate (FAR); the common value of FAR and FRR at this point is called the equal error rate.
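A simple brute-force sketch of computing the EER from a stored score file and its ground-truth labels is given below (labels: 1 = bonafide, 0 = spoof); sweeping the threshold over the observed scores is an approximation of the exact EER and only an illustration.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Sweep the decision threshold over all observed scores and return
    the error rate at the point where FRR (bonafide rejected) and FAR
    (spoof accepted) are closest to each other."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        frr = np.mean(scores[labels == 1] < t)    # bonafide below threshold
        far = np.mean(scores[labels == 0] >= t)   # spoof at/above threshold
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2
    return float(eer)
```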
S5, fusing two-dimensional AR voice features with different orders, and confirming whether voice is subjected to voice synthesis or voice conversion tampering operation, wherein the process is as follows:
s501, respectively inputting test feature sets of two-dimensional AR voice features of different orders into a trained convolutional neural network classifier to obtain different score files and storing the score files;
s502, averaging different score files to obtain a fused score file;
s503, judging whether the score value after fusion is larger than or equal to a judgment threshold value T, if so, the voice feature after fusion is original voice, otherwise, the voice feature after fusion is synthesized voice or converted and tampered voice;
and S504, further determining the fused t-DCF index and the fused equal error rate (EER) index according to the fused score file and the score file given by the known database.
In this embodiment, the larger the CM(f) score, the larger the probability that the currently input two-dimensional AR voice feature is original voice; the smaller the CM(f) score, the smaller that probability. The smaller the t-DCF index and the equal error rate (EER) index, the more effective the two-dimensional AR voice feature.
The overall implementation process is as follows: first, the voices of the training set, verification set and test set in the known database are fixed to a length of 64000 samples and divided into segments of 160 sampling points each, and the AR coefficients of each segment of voice are extracted, the AR order being chosen between 8 and 150.
Experiments were conducted with AR coefficients of orders 8, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140 and 150. The resulting training feature set and verification feature set are used to train the convolutional neural network classifier, the optimal parameters are stored, and the test feature set is used for testing. In addition, the two indices t-DCF and EER are used to evaluate the effectiveness of the features; although the results fluctuate somewhat across the orders, they consistently reflect a good voice tampering detection effect. The experimental results are shown in Table 1.
TABLE 1
(Table 1, which reports the t-DCF and EER of each AR order on the development and evaluation sets, is provided as an image in the original publication and is not reproduced here.)
Here, Development denotes the results on the verification set and Evaluation denotes the results on the test set. The experimental results in Table 1 show that the feature performs best at order 10 [AR(10)]; in the last row of Table 1, the order-10 and order-50 features are fused, which improves the feature effect further.
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. A synthetic speech detection method based on autoregressive model coefficients is characterized by at least comprising the following steps:
s1, fixing the voice segments of a training set, a verification set and a test set of a known database to be a uniform length a;
s2, segmenting the voices of the training set, the verification set and the test set, extracting AR coefficients of different orders of each segmented voice, arranging the extracted AR coefficients to form two-dimensional AR voice features, and forming a training feature set, a verification feature set and a test feature set of the two-dimensional AR voice features;
s3, training a convolutional neural network classifier by utilizing a training feature set and a verification feature set of the two-dimensional AR voice feature, and storing the optimal parameters of the convolutional neural network classifier;
s4, classifying the test feature set by using the trained convolutional neural network classifier, and determining whether the voice is subjected to voice synthesis or voice conversion tampering operation;
and S5, fusing the two-dimensional AR voice features with different orders to confirm whether the voice is subjected to voice synthesis or voice conversion tampering operation.
2. The method for detecting synthesized speech based on autoregressive model coefficients as claimed in claim 1, wherein the step S1 is performed by fixing the speech segments of the training set, the validation set and the test set of the known database to a uniform length a:
s101, selecting a known database, and fixing the lengths of the voice segments of a training set, a verification set and a test set of the known database as a sampling points;
s102, judging whether the voice length of any one of the training set, the verification set and the test set before fixation is larger than or equal to a, if so, performing truncation operation, and fixing the voice length as a; otherwise, the voice length before fixing is expanded by copying the voice, and then the voice length is fixed as a.
3. The method for detecting synthesized speech based on autoregressive model coefficients according to claim 2, wherein the specific process of constructing the training feature set, the verification feature set and the test feature set of the two-dimensional AR speech feature described in step S2 is as follows:
s201, dividing the voice fragments with the fixed uniform length of a in the training set, the verification set and the test set into b sections;
s202, extracting order-h AR coefficients from each segmented section of voice, and arranging the AR coefficients extracted from the b segments to form a two-dimensional AR voice feature;
and S203, respectively forming a training feature set, a verification feature set and a test feature set of the two-dimensional AR voice features with the dimension of b multiplied by h.
4. The method according to claim 3, wherein h in step S202 satisfies the following condition:
8 ≤ h ≤ 150, wherein h represents a positive integer.
5. The method for detecting the synthesized speech based on the autoregressive model coefficient as claimed in claim 1 or 3, wherein the training feature set of the two-dimensional AR speech feature comprises an original speech feature and a tampered speech feature, and in step S3, the convolutional neural network classifier is obtained by training through a gradient descent method; and taking the parameter of the convolutional neural network classifier corresponding to the best accuracy index of the verification feature set as an optimal parameter and storing the optimal parameter.
6. The method according to claim 5, wherein the step S4 of classifying the test feature set by using the trained convolutional neural network classifier to determine whether the speech is subjected to speech synthesis or speech conversion tampering operation comprises the following specific steps:
s401, inputting the test feature set of two-dimensional AR voice features with dimension b × h into the trained convolutional neural network classifier;
s402, according to a calculation formula of the voice score CM:
CM(f)=log(p(bonafide|f;θ))-log(p(spoof|f;θ))
calculating the score of each voice item and storing a score file, wherein f represents the currently input voice feature and θ represents the stored optimal parameters; p(bonafide|f;θ) represents the probability that the input feature f is an original (bonafide) voice feature, and p(spoof|f;θ) represents the probability that the input feature f is a synthesized or tampered (spoof) voice feature;
s403, judging whether the value of CM (f) is larger than or equal to a judgment threshold value T, if so, the sent voice feature is original voice, otherwise, the sent voice feature is synthesized voice or converted tampered voice;
and S404, further determining the t-DCF index and the equal error probability EER index of the current test feature set according to the stored score file and the score file given by the known database.
7. The method of claim 6, wherein the step S5 of fusing the two-dimensional AR speech features of different orders to determine whether the speech is subjected to speech synthesis or speech conversion falsification comprises:
s501, respectively inputting test feature sets of two-dimensional AR voice features of different orders into a trained convolutional neural network classifier to obtain different score files and storing the score files;
s502, averaging different score files to obtain a fused score file;
s503, judging whether the score value after fusion is larger than or equal to a judgment threshold value T, if so, the voice feature after fusion is original voice, otherwise, the voice feature after fusion is synthesized voice or converted and tampered voice;
and S504, further determining the fused t-DCF index and the equal error probability EER index according to the fused score file and the score file given by the known database.
8. The method of claim 7, wherein the larger the CM(f) score, the greater the probability that the currently input two-dimensional AR speech feature is the original speech; the smaller the CM(f) score, the smaller the probability that the currently input two-dimensional AR speech feature is the original speech.
9. The method of claim 8, wherein the smaller the t-DCF metric and the EER metric, the more efficient the two-dimensional AR speech features are.
10. The method of auto-regressive model coefficient-based synthesized speech detection according to claim 9, wherein the known database is an ASVspoof 2019 speech data set.
CN202110212380.5A 2021-02-25 2021-02-25 Synthetic speech detection method based on autoregressive model coefficient Pending CN112967712A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110212380.5A CN112967712A (en) 2021-02-25 2021-02-25 Synthetic speech detection method based on autoregressive model coefficient

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110212380.5A CN112967712A (en) 2021-02-25 2021-02-25 Synthetic speech detection method based on autoregressive model coefficient

Publications (1)

Publication Number Publication Date
CN112967712A true CN112967712A (en) 2021-06-15

Family

ID=76286141

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110212380.5A Pending CN112967712A (en) 2021-02-25 2021-02-25 Synthetic speech detection method based on autoregressive model coefficient

Country Status (1)

Country Link
CN (1) CN112967712A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2860706A2 (en) * 2013-09-24 2015-04-15 Agnitio S.L. Anti-spoofing
CN103730121A (en) * 2013-12-24 2014-04-16 中山大学 Method and device for recognizing disguised sounds
CN110491391A (en) * 2019-07-02 2019-11-22 厦门大学 A kind of deception speech detection method based on deep neural network
CN111243621A (en) * 2020-01-14 2020-06-05 四川大学 Construction method of GRU-SVM deep learning model for synthetic speech detection
CN111445924A (en) * 2020-03-18 2020-07-24 中山大学 Method for detecting and positioning smooth processing in voice segment based on autoregressive model coefficient
CN111798828A (en) * 2020-05-29 2020-10-20 厦门快商通科技股份有限公司 Synthetic audio detection method, system, mobile terminal and storage medium
CN112349267A (en) * 2020-10-28 2021-02-09 天津大学 Synthesized voice detection method based on attention mechanism characteristics

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
于泓: "Research on Synthetic Speech Detection Algorithms" (合成语音检测算法研究), China Doctoral Dissertations Full-text Database, Information Science and Technology Series *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488074A (en) * 2021-08-20 2021-10-08 四川大学 Long-time variable Q time-frequency conversion algorithm of audio signal and application thereof
CN113488074B (en) * 2021-08-20 2023-06-23 四川大学 Two-dimensional time-frequency characteristic generation method for detecting synthesized voice

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination