CN111653289B - Playback voice detection method - Google Patents

Playback voice detection method

Info

Publication number
CN111653289B
CN111653289B
Authority
CN
China
Prior art keywords
speech
training
cepstrum coefficient
features
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010479392.XA
Other languages
Chinese (zh)
Other versions
CN111653289A (en)
Inventor
王让定
胡君
严迪群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huzhou Chuangguan Technology Co ltd
Kong Fanbin
Original Assignee
Ningbo University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo University
Priority to CN202010479392.XA
Publication of CN111653289A
Application granted
Publication of CN111653289B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Complex Calculations (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The invention discloses a playback voice detection method comprising the following steps: 1) a training stage: 1.1) input training speech samples, which comprise original speech and playback speech; 1.2) extract the cepstral features of the training speech samples; 1.3) train a residual network model on the extracted features to obtain the network model parameters; 2) a testing stage: 2.1) input a test speech sample; 2.2) extract the cepstral features of the test speech sample; 2.3) use the residual network trained in step 1) to score the extracted features of the test speech sample; 2.4) determine whether the test speech sample is playback speech. Compared with the prior art, the invention has the advantage that, based on deep learning, combining the cepstral features of the speech signal with a deep residual network effectively improves the detection performance of the system, and the algorithm has better robustness.

Description

Playback voice detection method
Technical Field
The invention relates to a voice detection technology, in particular to a playback voice detection method.
Background
With the continuous development of voiceprint authentication technology, fraudulent voice attacks against it are becoming more and more serious. A playback voice attack relies on using a recording device to covertly record a legitimate user's voice when the user enters the system, and then playing it back with high-fidelity playback equipment to attack the voiceprint authentication system. Playback voice attacks are easy to mount: the voice samples come from real speech, the operation is simple, the recordings are easy to obtain, and the attacker needs no specialized knowledge.
Early playback voice detection technologies were based on traditional hand-crafted features. However, traditional hand-crafted feature extraction cannot express the high-level semantics of speech well, so most current playback voice detection techniques are based on deep learning, using neural network models to extract deeper representations from shallow features and thereby improve the performance of playback detection algorithms. Prior art 1: Lavrentyeva G, Novoselov S, Malykh E, et al. Audio Replay Attack Detection with Deep Learning Frameworks [C] // Interspeech. 2017: 82-86. The authors propose a lightweight convolutional neural network (LCNN) that applies a max-out operation to each convolutional layer to build Max-Feature-Map (MFM) layers, eliminating the weaker outputs through competitive learning. Prior art 2: Jung J, Shim H, Heo H S, et al. Replay attack detection with complementary high-resolution information using end-to-end DNN for the ASVspoof 2019 Challenge [J]. arXiv preprint arXiv:1904.10134, 2019. Prior art 3: Chettri B, Stoller D, Morfi V, et al. Ensemble models for spoofing detection in automatic speaker verification [J]. arXiv preprint arXiv:1904.04589, 2019.
Prior art 1 suffers from weak robustness: the detection algorithm does not perform well on the evaluation set, mainly because the model's generalization ability is limited, so detection performance drops significantly when facing the unknown attacks in the evaluation set. Prior art 2 and prior art 3 suffer from complex detection algorithms. In prior art 2, the authors re-partition the data set so that the playback attacks in the test set and the training set do not overlap, but the partitioning process is complicated and therefore time-consuming. In prior art 3, the authors propose a multi-feature, multi-task learning scheme that performs auxiliary classification tasks during network training to assist the binary decision, but the detection algorithm is relatively complex.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide, in view of the defects of the prior art, a playback voice detection method that improves the detection performance and robustness of the playback voice detection algorithm.
The technical solution adopted by the invention to solve this technical problem is as follows: a playback voice detection method, characterized in that it comprises the following steps:
1) A training stage:
1.1) input training speech samples, the training speech samples comprising original speech and playback speech;
1.2) extract the cepstral features of the training speech samples;
1.3) train a residual network model on the extracted features to obtain the network model parameters;
2) A testing stage:
2.1) input a test speech sample;
2.2) extract the cepstral features of the test speech sample;
2.3) use the residual network trained in step 1) to score the extracted features of the test speech sample;
2.4) determine whether the test speech sample is playback speech.
Preferably, in order to keep more detailed information on the frequency spectrum, in step 1.2) and step 2.2), the full-frequency cepstral coefficient features are extracted.
Further, the full-frequency cepstral coefficient features are extracted as follows: 1) the speech signal of the training speech sample or the test speech sample is framed and windowed, and a Fourier transform is applied to each framed signal to obtain its spectral coefficients X_i(k):

$$X_i(k) = \sum_{m=0}^{N-1} x_i(m)\, e^{-j 2\pi k m / N}, \quad k = 0, 1, \ldots, N-1$$

where i denotes the i-th frame after framing, k denotes a frequency bin in the i-th frame, j is the imaginary unit, m indexes the samples of the framed speech signal, and N is the number of Fourier transform points;

2) the absolute value is then taken to obtain the corresponding amplitude spectrum coefficients E_i(k):

$$E_i(k) = \left| X_i(k) \right|$$

3) a logarithm operation and a DCT are then applied to obtain the full-frequency cepstral coefficients BFCC(i) of the i-th frame:

$$\mathrm{BFCC}(i) = \mathrm{DCT}\big(\log E_i(k)\big)$$
In order to enable the training models for different features to cooperate with each other and obtain a better fusion result, in step 1.2) and step 2.2) the Mel-frequency cepstral coefficient features and the constant-Q cepstral coefficient features are also extracted. In the training stage, three corresponding residual networks are obtained from the full-frequency cepstral coefficient features, the Mel-frequency cepstral coefficient features and the constant-Q cepstral coefficient features, respectively; in the testing stage, the residual network recognition results obtained from these three features are correspondingly obtained, and the fusion of the three recognition results is used for the overall decision.
According to one aspect of the invention, in step 1.2) and step 2.2), the Mel-frequency cepstral coefficient features are extracted.
According to another aspect of the invention, in step 1.2) and step 2.2), the constant-Q cepstral coefficient features are extracted.
Preferably, the residual network comprises, connected in sequence, a two-dimensional convolutional layer, a sequence of residual blocks, a Dropout layer, a first fully connected layer, an activation function layer, a GRU layer, a second fully connected layer and a network output layer.
Preferably, the activation function layer adopts a leaky rectified linear unit (Leaky ReLU).
Preferably, in order to increase the convergence rate of learning, a batch normalization processing layer is further provided between the two-dimensional convolution layer and the activation function layer.
To improve the detection accuracy, in step 2.4), the score of the residual network output is combined with the ASV system score to determine whether the test speech sample is original speech or playback speech.
Compared with the prior art, the invention has the following advantages: the neural network can extract deeper features and fully represent the detailed information in the speech signal, and combining the cepstral features of the speech signal with a deep residual network in a deep-learning framework effectively improves the detection performance of the system and makes the algorithm more robust; the residual network models distortions in the time and frequency domains well, improving the classification accuracy of the neural network; and extracting the full-frequency cepstral coefficient features requires no filter bank, so more detail in the frequency spectrum is preserved.
Drawings
FIG. 1 is a flowchart of a playback voice detection method according to an embodiment of the present invention;
FIG. 2 is a flowchart of MFCC extraction for a playback voice detection method in an embodiment of the present invention;
FIG. 3 is a flow chart of CQCC extraction of the playback voice detection method according to the embodiment of the present invention;
FIG. 4 is a flow chart of BFCC extraction of the playback voice detection method according to the embodiment of the present invention;
FIG. 5 is a schematic diagram of the residual network of the playback voice detection method according to the embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar functions.
In the description of the present invention, it is to be understood that the terms "center," "longitudinal," "transverse," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," "circumferential," and the like are used in the orientations and positional relationships indicated in the drawings for convenience in describing the invention and simplicity in description, but do not indicate or imply that the devices or elements so referred to must have a particular orientation, be constructed and operated in a particular orientation, and that the directional terms are illustrative only and are not to be construed as limiting since the disclosed embodiments of the invention can be positioned in different orientations, e.g., "upper" and "lower" are not necessarily limited to directions opposite or coincident with the direction of gravity. Furthermore, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.
Referring to fig. 1, a playback voice detection method includes the steps of:
1) A training stage:
1.1) input training speech samples;
1.2) extract cepstral features;
1.3) train a residual network model to obtain the network model parameters;
2) A testing stage:
2.1) input a test speech sample;
2.2) extract cepstral features;
2.3) perform recognition and scoring with the residual network trained in step 1);
2.4) combine the ASV system score with the score from step 2.3) to arrive at a decision.
In the above steps, the feature extraction for the training and test speech samples may be implemented with a filter bank, and the features of a speech sample are determined by the filter bank used. The filter bank can be preconfigured according to actual requirements and is used to extract features from the speech sample, such as the traditional Mel-frequency cepstral coefficient features and constant-Q cepstral coefficient features.
1. Mel-frequency cepstral coefficient features: Mel-Frequency Cepstral Coefficients (MFCC) are speech feature parameters commonly used in the field of speaker recognition. They conform to the auditory characteristics of the human ear, which has different sensitivities to sound waves of different frequencies. The Mel-frequency cepstral coefficients are cepstral coefficients extracted in the Mel-scale frequency domain, where the Mel scale reflects the nonlinear frequency perception of the human ear. The relationship between the Mel scale and frequency is:

$$F_{\mathrm{mel}} = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$

where F_mel is the perceived frequency in Mel and f is the actual frequency in hertz (Hz). Converting the speech signal to this perceptual frequency domain, rather than simply taking its Fourier transform, generally better simulates human auditory processing.
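For illustration, the Mel-scale mapping above can be computed directly as a minimal Python sketch; the helper function names are illustrative and not part of the patent text:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Map physical frequency (Hz) to the Mel scale, per F_mel = 2595*log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse mapping, used when placing Mel filter-bank center frequencies."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

# Example: 1 kHz corresponds to roughly 1000 Mel.
print(hz_to_mel(1000.0))   # ~1000.0
```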
Referring to fig. 2, a flow chart of a specific extraction process of MFCC is shown, comprising the following steps:
1) The signal x(n) of the speech sample (training or test speech sample) is preprocessed into frames x_i(m), and a short-time Fourier transform (STFT) is applied to each frame obtained after framing to obtain its spectral coefficients:

$$X_i(k) = \sum_{m=0}^{N-1} x_i(m)\, e^{-j 2\pi k m / N}$$

where i denotes the i-th frame after framing and k denotes a frequency bin in the i-th frame, k = 0, 1, ..., N-1;

2) the energy spectrum coefficients of each frame are computed from the spectral coefficients:

$$E_i(k) = \left| X_i(k) \right|^2$$

3) the resulting energy spectrum coefficients are fed to a bank of Mel filters, where the energy in each band is computed. The Mel spectrum coefficients are obtained by multiplying the energy spectrum coefficients by the frequency response H_m(k) of each Mel filter and summing, i.e.:

$$S(i, m) = \sum_{k=0}^{N-1} E_i(k)\, H_m(k), \quad 0 \le m \le M$$

where m refers to the m-th Mel filter and there are M filters in total;

4) a logarithm operation and a DCT (discrete cosine transform) are then applied to the Mel spectrum coefficients to obtain the Mel-frequency cepstral coefficients:

$$\mathrm{MFCC}(i, n) = \sum_{m=0}^{M-1} \log S(i, m)\, \cos\!\left(\frac{\pi n (2m + 1)}{2M}\right)$$
The standard MFCC only reflects the static characteristics of the speech signal; its dynamic characteristics can be obtained from the difference (delta) coefficients of the static features. Preferably, the static MFCC coefficients are combined with the first- and second-order delta coefficients to improve the recognition performance of the system.
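A minimal Python sketch of the MFCC pipeline in steps 1)-4) above is given below; the frame length, hop size, number of Mel filters and number of retained coefficients are assumptions, not values specified by the embodiment:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr, frame_len=400, hop=160, n_fft=512, n_mels=26, n_ceps=13):
    """Minimal MFCC sketch: frame/window -> |FFT|^2 -> Mel filter bank -> log -> DCT.
    `signal` is a 1-D numpy array of samples."""
    # 1) Framing and Hamming windowing, then per-frame FFT (X_i(k)).
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    spec = np.fft.rfft(frames, n_fft)
    # 2) Energy (power) spectrum E_i(k).
    energy = np.abs(spec) ** 2
    # 3) Triangular Mel filter bank H_m(k), spaced uniformly on the Mel scale.
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_energy = energy @ fbank.T                     # S(i, m)
    # 4) Log compression followed by a DCT gives the static MFCCs.
    return dct(np.log(mel_energy + 1e-10), type=2, axis=1, norm='ortho')[:, :n_ceps]
```

First- and second-order deltas can then be appended to these static coefficients to capture the dynamic characteristics mentioned above.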
2. Constant-Q cepstral coefficient features: Constant-Q Cepstral Coefficients (CQCC) are obtained from the speech signal via the constant-Q transform (CQT), a time-frequency transform that uses a set of filters whose ratio Q of center frequency to bandwidth is constant. The frequency axis of the spectrum obtained by the CQT is nonlinear, because the center frequencies follow an exponential distribution and the window length used by the filter bank varies with frequency. When the time-domain speech signal is converted to the frequency domain, the CQT therefore provides higher frequency resolution in the low-frequency region and higher time resolution in the high-frequency region.
Referring to fig. 3, a flowchart of a specific extraction process of the constant Q cepstrum coefficient includes the following steps:
1) The perceptual frequency-domain representation of the speech sample signal x(n) (training or test speech sample) after the CQT is X^{CQ}(k), computed as:

$$X^{CQ}(k) = \frac{1}{N_k} \sum_{n=0}^{N_k - 1} w_{N_k}(n)\, x(n)\, e^{-j 2\pi n f_k / f_s}$$

where k denotes a frequency bin, k = 1, 2, ..., K, f_s is the sampling rate of the speech samples, and f_k is the center frequency of the k-th filter, which follows an exponential distribution defined as:

$$f_k = f_1 \cdot 2^{\frac{k-1}{B}}$$

where B is the number of frequency bins per octave and f_1 is the center frequency of the lowest bin. The bandwidth of the k-th bin is:

$$B_k = f_{k+1} - f_k = f_k \left( 2^{1/B} - 1 \right)$$

The Q factor is the ratio of the center frequency f_k to the bandwidth B_k and is a constant independent of k:

$$Q = \frac{f_k}{B_k} = \left( 2^{1/B} - 1 \right)^{-1}$$

The window function adopts a Hanning window. Since the time resolution gradually decreases as the frequency resolution increases, the window length N_k varies with k, with N_k = Q f_s / f_k, and the window function is defined as:

$$w_{N_k}(n) = \frac{1}{2}\left(1 - \cos\frac{2\pi n}{N_k - 1}\right), \quad 0 \le n \le N_k - 1$$
2) The amplitude of the k-th frequency bin of the i-th frame is computed:

$$\left| X_i^{CQ}(k) \right| = \sqrt{\mathrm{Re}\!\left[X_i^{CQ}(k)\right]^2 + \mathrm{Im}\!\left[X_i^{CQ}(k)\right]^2}$$
3) The logarithmic power spectrum coefficients are computed from the amplitudes:

$$\log \left| X_i^{CQ}(k) \right|^2, \quad i = 1, 2, \ldots, T_k,\; k = 1, 2, \ldots, K$$

where T_k denotes the total number of frames of the speech signal in the k-th band and K is the total number of frequency bins; here K is taken to be 420.
4) The logarithmic spectrum coefficients, which lie on a geometrically spaced frequency scale, are uniformly resampled onto a linear frequency scale; the resampled coefficients are denoted $\log\left|\tilde{X}^{CQ}(l)\right|^2$, where l = 0, 1, ..., L-1 indexes the resampled frequency bins.
5) A DCT is applied to the resampled log spectrum coefficients to obtain the constant-Q cepstral coefficients:

$$\mathrm{CQCC}(p) = \sum_{l=0}^{L-1} \log\left|\tilde{X}^{CQ}(l)\right|^2 \cos\!\left[\frac{\pi p \left(l + \tfrac{1}{2}\right)}{L}\right], \quad p = 0, 1, \ldots, L-1$$

where L is the number of frequency bins after resampling.
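A minimal Python sketch of the CQCC pipeline in steps 1)-5) above is given below. It uses librosa's constant-Q transform as a stand-in for the hand-written CQT; n_bins = 420 follows the K = 420 of this embodiment, while bins_per_octave, the resampling factor and the number of retained coefficients are assumptions:

```python
import numpy as np
import librosa
from scipy.fftpack import dct
from scipy.signal import resample

def cqcc(signal, sr, bins_per_octave=96, n_bins=420, n_ceps=20):
    """CQCC sketch: CQT -> log power spectrum -> uniform resampling -> DCT."""
    # 1)-2) Constant-Q transform magnitudes |X^CQ_i(k)|, shape (n_bins, n_frames).
    cq = np.abs(librosa.cqt(signal, sr=sr, n_bins=n_bins,
                            bins_per_octave=bins_per_octave))
    # 3) Log power spectrum on the geometric frequency scale.
    log_power = np.log(cq ** 2 + 1e-10)
    # 4) Uniform resampling of the geometrically spaced bins onto a linear scale.
    lin = resample(log_power, 2 * n_bins, axis=0)
    # 5) DCT along the resampled frequency axis; keep the first n_ceps coefficients.
    return dct(lin, type=2, axis=0, norm='ortho')[:n_ceps].T   # (frames, n_ceps)
```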
The two features above are existing feature-extraction methods. In the invention, the features of a speech sample may also be extracted without using a filter bank, as follows:
3. Full-frequency cepstral coefficients: compared with traditional features such as CQCC and MFCC, Full-Frequency Cepstral Coefficients (BFCC) abandon the filter bank used in the traditional features; the logarithm operation and the DCT are applied directly to the spectral coefficients obtained by the Fourier transform. The advantage is that more detail in the frequency spectrum is preserved.
Referring to fig. 4, a flowchart of a specific extraction process of BFCC is shown, which comprises the following steps:
1) A segment of speech is framed and windowed, and a Fourier transform is applied to each framed speech signal to obtain its spectral coefficients:

$$X_i(k) = \sum_{m=0}^{N-1} x_i(m)\, e^{-j 2\pi k m / N}, \quad k = 0, 1, \ldots, N-1$$

where i denotes the i-th frame after framing, k denotes a frequency bin in the i-th frame, j is the imaginary unit, m indexes the samples of the framed speech signal, and N is the number of Fourier transform points; in this embodiment, N = 512;

2) the absolute value is then taken to obtain the corresponding amplitude spectrum coefficients:

$$E_i(k) = \left| X_i(k) \right|$$

3) a logarithm operation and a DCT are then applied to obtain the cepstral coefficients:

$$\mathrm{BFCC}(i) = \mathrm{DCT}\big(\log E_i(k)\big)$$
In the BFCC feature, preferably, the log-energy coefficient and the first- and second-order difference coefficients may also be appended to form the final feature vector.
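A minimal Python sketch of this filter-free pipeline follows; N = 512 matches the embodiment, while the frame length, hop size and number of retained coefficients are assumptions:

```python
import numpy as np
from scipy.fftpack import dct

def bfcc(signal, frame_len=400, hop=160, n_fft=512, n_ceps=20):
    """BFCC sketch: frame/window -> FFT -> |.| -> log -> DCT, with no filter bank."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    spec = np.fft.rfft(frames, n_fft)          # X_i(k), N = 512 as in the embodiment
    amp = np.abs(spec)                         # amplitude spectrum E_i(k)
    static = dct(np.log(amp + 1e-10), type=2, axis=1, norm='ortho')[:, :n_ceps]
    # Log-energy and first/second-order deltas may optionally be appended here.
    return static
```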
In the invention, the network model adopts a residual network, whose overall architecture is shown in fig. 5. When different features are used as input, the overall architecture of the network model is unchanged; only the dimensionality of the input features at the input end changes.
The residual network comprises, connected in sequence, a two-dimensional convolutional layer, four identical residual block sequences, a Dropout layer, a first fully connected layer, an activation function layer, a GRU layer, a second fully connected layer and a network output layer (classifier). The Dropout rate is set to 0.5, and the network output layer preferably adopts a Softmax layer.
The activation function layer adopts a leaky rectified linear unit (Leaky ReLU), which evolved from the rectified linear unit (ReLU) activation function. The ReLU activation function activates a neuron only when the input exceeds a threshold, setting all negative values to 0 while leaving positive values unchanged. However, ReLU has a significant drawback during training: when the input is negative, learning becomes very slow or the neuron stops working entirely, its weights can no longer be updated, the neuron no longer activates on any data, and its gradient stays 0. To address this drawback of the ReLU function, the invention adopts the Leaky ReLU activation function in the network. Unlike ReLU, which sets all negative values to 0, Leaky ReLU assigns a very small non-zero slope to all negative values, which solves the problem that the neuron does not learn when the ReLU input is negative. Its mathematical expression is:
$$f(x) = \begin{cases} x, & x \ge 0 \\ a x, & x < 0 \end{cases}$$

where a is a small positive constant slope.
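A one-line numerical illustration of this piecewise definition (the slope value a = 0.01 is an assumption):

```python
import numpy as np

def leaky_relu(x, a=0.01):
    """Leaky ReLU: identity for x >= 0, small non-zero slope a for x < 0."""
    return np.where(x >= 0, x, a * x)

print(leaky_relu(np.array([-2.0, 0.0, 3.0])))   # [-0.02  0.    3.  ]
```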
In addition, the GRU layer not only re-aggregates the frame-level features extracted by the preceding layers into a single utterance-level feature, but is also a simple model, which makes it well suited to building deeper networks. Within the residual network, its two gates also make the whole network more efficient and the computation less time-consuming, and noticeably speed up model convergence during training.
The three extracted cepstral features are fed separately to the input of a residual network. Frame-level time-frequency features are first extracted by the two-dimensional convolutional layer; the output of that layer is then fed to four identical residual block sequences to enable deeper training of the network, and the output of the last residual block is fed, in order, to the Dropout layer, the first fully connected layer, the activation function layer and the GRU layer. After the GRU layer, the utterance-level features are mapped to a new space by the second fully connected layer, which has only two node units and produces the class logits; finally, these logits are fed to the Softmax layer to be converted into a score probability distribution.
Because the gradient may vanish during training when the network has too many layers, a batch normalization (BN) layer is added to the residual network. The BN layer pulls distributions that have drifted out of the normal range back into a standardized range; it is placed between the two-dimensional convolutional layer and the activation function layer, so that the data lie in a region where the activation function is sensitive, increasing the gradients and accelerating the convergence of learning.
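A minimal PyTorch sketch of a network with the described layer order is given below; the channel count, hidden size, kernel sizes and the internal structure of each residual block are assumptions not specified in the text:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One residual block: two conv-BN stages with Leaky ReLU and an identity shortcut (structure assumed)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.LeakyReLU(0.01),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )
        self.act = nn.LeakyReLU(0.01)

    def forward(self, x):
        return self.act(x + self.body(x))

class ReplayDetector(nn.Module):
    """Sketch of the described stack: Conv2d -> BN -> LeakyReLU -> 4 residual blocks
    -> Dropout(0.5) -> FC -> LeakyReLU -> GRU -> FC (2 units) -> Softmax."""
    def __init__(self, feat_dim=20, ch=32, hidden=64):
        super().__init__()
        self.front = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.LeakyReLU(0.01),
            *[ResBlock(ch) for _ in range(4)],
            nn.Dropout(0.5),
        )
        self.fc1 = nn.Linear(ch * feat_dim, hidden)
        self.act = nn.LeakyReLU(0.01)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.fc2 = nn.Linear(hidden, 2)

    def forward(self, x):                         # x: (batch, 1, frames, feat_dim)
        h = self.front(x)                         # (batch, ch, frames, feat_dim)
        h = h.permute(0, 2, 1, 3).flatten(2)      # frame-level features: (batch, frames, ch*feat_dim)
        h = self.act(self.fc1(h))
        _, h_n = self.gru(h)                      # aggregate frames into one utterance-level vector
        logits = self.fc2(h_n[-1])                # (batch, 2)
        return torch.softmax(logits, dim=-1)      # score probability distribution
```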
In the training phase, the Adam algorithm can be used to optimize the residual network. A learning rate of 10^-4 is used, the batch size is 32, and the training process is stopped after 50 epochs. The loss function is the binary cross-entropy between the predicted value and the target value, and the node outputs of the last fully connected layer are used as the prediction score.
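A minimal training-loop sketch under those stated settings might look as follows; the DataLoader, the label convention (1 = playback) and the device handling are assumptions:

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=50, lr=1e-4, device="cpu"):
    """Training sketch: Adam, learning rate 1e-4, batch size 32 (set in the DataLoader),
    50 epochs, binary cross-entropy on the two-class output."""
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()                      # binary cross-entropy between prediction and target
    for epoch in range(epochs):
        for feats, labels in train_loader:      # feats: (32, 1, frames, feat_dim); labels: 0/1
            feats, labels = feats.to(device), labels.to(device)
            probs = model(feats)[:, 1]          # probability assigned to the "playback" class
            loss = loss_fn(probs, labels.float())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```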
In the training stage of step 1), the features are extracted from the training speech samples (a training set containing both original speech and playback speech) and fed to the neural network to train the residual network models for original and playback speech. In the testing stage, the features of the test speech sample are extracted and fed to the residual network model trained in the training stage, the test speech is classified according to the score produced by the network output layer, and this result is combined with the score of an ASV (Automatic Speaker Verification) system as the final decision on whether the test speech sample is playback speech.
Through the above training and testing processes, a subsystem is obtained for each of the three cepstral features. The scores of the three subsystems are fused at the score level according to the following formula:

$$S = i \cdot S_{\mathrm{BFCC}} + j \cdot S_{\mathrm{MFCC}} + k \cdot S_{\mathrm{CQCC}}$$

where i, j and k are the weight coefficients of the three subsystem scores, subject to the constraint i + j + k = 1, and S_BFCC, S_MFCC and S_CQCC are the normalized scores of the respective subsystems. By making the training models of different features cooperate with each other, a better fusion result is obtained.
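As a sketch, the score-level fusion could be computed as below; the weight values are illustrative placeholders subject only to the stated constraint i + j + k = 1:

```python
def fuse_scores(s_bfcc, s_mfcc, s_cqcc, w=(0.4, 0.3, 0.3)):
    """Weighted fusion S = i*S_BFCC + j*S_MFCC + k*S_CQCC with i + j + k = 1.
    The subsystem scores are assumed to be normalized; the weights are illustrative."""
    i, j, k = w
    assert abs(i + j + k - 1.0) < 1e-9
    return i * s_bfcc + j * s_mfcc + k * s_cqcc

# The fused countermeasure score can then be combined with the ASV system score
# to decide whether the test utterance is playback speech, as described above.
```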

Claims (6)

1. A playback voice detection method, characterized in that it comprises the following steps:
1) A training stage:
1.1) input training speech samples, the training speech samples comprising original speech and playback speech;
1.2) extract the cepstral features of the training speech samples, the cepstral features comprising full-frequency cepstral coefficient features, Mel-frequency cepstral coefficient features and constant-Q cepstral coefficient features, and obtain three corresponding residual networks from the full-frequency cepstral coefficient features, the Mel-frequency cepstral coefficient features and the constant-Q cepstral coefficient features, respectively;
1.3) train a residual network model on the extracted features to obtain the network model parameters;
2) A testing stage:
2.1) input a test speech sample;
2.2) extract the cepstral features of the test speech sample, the cepstral features comprising full-frequency cepstral coefficient features, Mel-frequency cepstral coefficient features and constant-Q cepstral coefficient features, and obtain residual network recognition results from the full-frequency cepstral coefficient features, the Mel-frequency cepstral coefficient features and the constant-Q cepstral coefficient features, respectively;
2.3) use the residual network trained in step 1) to score the extracted features of the test speech sample: through step 1.2) and step 2.2), subsystems for the three cepstral features are obtained, and the scores of the three subsystems are fused according to the following formula:

$$S = i \cdot S_{\mathrm{BFCC}} + j \cdot S_{\mathrm{MFCC}} + k \cdot S_{\mathrm{CQCC}}$$

where i, j and k are the weight coefficients of the three subsystem scores, subject to the constraint i + j + k = 1, and S_BFCC, S_MFCC and S_CQCC are the normalized scores of the full-frequency cepstral coefficient subsystem, the Mel-frequency cepstral coefficient subsystem and the constant-Q cepstral coefficient subsystem, respectively;
2.4) determine whether the test speech sample is playback speech.
2. The playback voice detection method according to claim 1, characterized in that the full-frequency cepstral coefficient features are extracted as follows: 1) the speech signal of the training speech sample or the test speech sample is framed and windowed, and a Fourier transform is applied to each framed speech signal to obtain its spectral coefficients X_i(k):

$$X_i(k) = \sum_{m=0}^{N-1} x_i(m)\, e^{-j 2\pi k m / N}, \quad k = 0, 1, \ldots, N-1$$

where i denotes the i-th frame after framing, k denotes a frequency bin in the i-th frame, j is the imaginary unit, m indexes the samples of the framed speech signal, and N is the number of Fourier transform points;

2) the absolute value is then taken to obtain the corresponding amplitude spectrum coefficients E_i(k):

$$E_i(k) = \left| X_i(k) \right|$$

3) a logarithm operation and a DCT are then applied to obtain the full-frequency cepstral coefficients BFCC(i) of the i-th frame:

$$\mathrm{BFCC}(i) = \mathrm{DCT}\big(\log E_i(k)\big)$$
3. The playback voice detection method according to claim 1 or 2, characterized in that the residual network comprises, connected in sequence, a two-dimensional convolutional layer, a sequence of residual blocks, a Dropout layer, a first fully connected layer, an activation function layer, a GRU layer, a second fully connected layer and a network output layer.
4. The playback voice detection method according to claim 3, characterized in that the activation function layer adopts a leaky rectified linear unit (Leaky ReLU).
5. The playback voice detection method according to claim 3, characterized in that: there is also a batch normalization layer between the two-dimensional convolution layer and the activation function layer.
6. The playback voice detection method according to claim 1 or 2, characterized in that: in step 2.4), the score of the residual network output is combined with the ASV system score to determine whether the test speech sample is original speech or played back speech.
CN202010479392.XA 2020-05-29 2020-05-29 Playback voice detection method Active CN111653289B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010479392.XA CN111653289B (en) 2020-05-29 2020-05-29 Playback voice detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010479392.XA CN111653289B (en) 2020-05-29 2020-05-29 Playback voice detection method

Publications (2)

Publication Number Publication Date
CN111653289A CN111653289A (en) 2020-09-11
CN111653289B true CN111653289B (en) 2022-12-27

Family

ID=72344774

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010479392.XA Active CN111653289B (en) 2020-05-29 2020-05-29 Playback voice detection method

Country Status (1)

Country Link
CN (1) CN111653289B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114822587B (en) * 2021-01-19 2023-07-14 四川大学 Audio characteristic compression method based on constant Q transformation
CN113012684B (en) * 2021-03-04 2022-05-31 电子科技大学 Synthesized voice detection method based on voice segmentation
CN113096692B (en) * 2021-03-19 2024-05-28 招商银行股份有限公司 Voice detection method and device, equipment and storage medium
CN113506583B (en) * 2021-06-28 2024-01-05 杭州电子科技大学 Camouflage voice detection method using residual error network
CN113284486B (en) * 2021-07-26 2021-11-16 中国科学院自动化研究所 Robust voice identification method for environmental countermeasure
CN113488074B (en) * 2021-08-20 2023-06-23 四川大学 Two-dimensional time-frequency characteristic generation method for detecting synthesized voice
CN115022087B (en) * 2022-07-20 2024-02-27 中国工商银行股份有限公司 Voice recognition verification processing method and device
CN117153190B (en) * 2023-10-27 2024-01-19 广东技术师范大学 Playback voice detection method based on attention mechanism combination characteristics

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107464568B (en) * 2017-09-25 2020-06-30 四川长虹电器股份有限公司 Speaker identification method and system based on three-dimensional convolution neural network text independence
CN109920447B (en) * 2019-01-29 2021-07-13 天津大学 Recording fraud detection method based on adaptive filter amplitude phase characteristic extraction
CN109935233A (en) * 2019-01-29 2019-06-25 天津大学 A kind of recording attack detection method based on amplitude and phase information

Also Published As

Publication number Publication date
CN111653289A (en) 2020-09-11

Similar Documents

Publication Publication Date Title
CN111653289B (en) Playback voice detection method
CN108369813B (en) Specific voice recognition method, apparatus and storage medium
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
WO2019232829A1 (en) Voiceprint recognition method and apparatus, computer device and storage medium
Kumar et al. Design of an automatic speaker recognition system using MFCC, vector quantization and LBG algorithm
CN108831443B (en) Mobile recording equipment source identification method based on stacked self-coding network
CN110459241B (en) Method and system for extracting voice features
CN106782511A (en) Amendment linear depth autoencoder network audio recognition method
CN111785285A (en) Voiceprint recognition method for home multi-feature parameter fusion
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN109378014A (en) A kind of mobile device source discrimination and system based on convolutional neural networks
WO2019232833A1 (en) Speech differentiating method and device, computer device and storage medium
CN112541533A (en) Modified vehicle identification method based on neural network and feature fusion
CN111048097A (en) Twin network voiceprint recognition method based on 3D convolution
WO2019232867A1 (en) Voice discrimination method and apparatus, and computer device, and storage medium
CN111489763A (en) Adaptive method for speaker recognition in complex environment based on GMM model
CN118486297B (en) Response method based on voice emotion recognition and intelligent voice assistant system
CN112466276A (en) Speech synthesis system training method and device and readable storage medium
CN109300470A (en) Audio mixing separation method and audio mixing separator
CN116778956A (en) Transformer acoustic feature extraction and fault identification method
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
CN113516987B (en) Speaker recognition method, speaker recognition device, storage medium and equipment
CN114283835A (en) Voice enhancement and detection method suitable for actual communication condition
CN118173092A (en) Online customer service platform based on AI voice interaction
CN118098247A (en) Voiceprint recognition method and system based on parallel feature extraction model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230821

Address after: Room 502, No. 3 Pulan 1st Street, Chancheng District, Foshan City, Guangdong Province, 528000

Patentee after: Kong Fanbin

Address before: Room 337, Building 3, No. 266, Zhenxing Road, Yuyue Town, Deqing County, Huzhou City, Zhejiang Province, 313000

Patentee before: Huzhou Chuangguan Technology Co.,Ltd.

Effective date of registration: 20230821

Address after: Room 337, Building 3, No. 266, Zhenxing Road, Yuyue Town, Deqing County, Huzhou City, Zhejiang Province, 313000

Patentee after: Huzhou Chuangguan Technology Co.,Ltd.

Address before: 315211, Fenghua Road, Jiangbei District, Zhejiang, Ningbo 818

Patentee before: Ningbo University