CN111653289A - Playback voice detection method - Google Patents

Playback voice detection method

Info

Publication number
CN111653289A
Authority
CN
China
Prior art keywords
voice
training
detection method
playback
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010479392.XA
Other languages
Chinese (zh)
Other versions
CN111653289B (en)
Inventor
王让定
胡君
严迪群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huzhou Chuangguan Technology Co ltd
Kong Fanbin
Original Assignee
Ningbo University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo University filed Critical Ningbo University
Priority to CN202010479392.XA priority Critical patent/CN111653289B/en
Publication of CN111653289A publication Critical patent/CN111653289A/en
Application granted granted Critical
Publication of CN111653289B publication Critical patent/CN111653289B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Complex Calculations (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The invention discloses a playback voice detection method comprising the following steps: 1) a training stage: 1.1) input training voice samples, which include both original voice and playback voice; 1.2) extract cepstral features from the training voice samples; 1.3) train a residual network model on the extracted features to obtain the network model parameters; 2) a testing stage: 2.1) input a test voice sample; 2.2) extract the cepstral features of the test voice sample; 2.3) use the residual network trained in step 1) to classify and score the extracted features of the test voice sample; 2.4) decide whether the test voice sample is playback voice. Compared with the prior art, the invention has the advantage that, based on deep learning, the cepstral features of the voice signal are combined with a deep residual network, which effectively improves the detection performance of the system and gives the algorithm better robustness.

Description

Playback voice detection method
Technical Field
The invention relates to a voice detection technology, in particular to a playback voice detection method.
Background
With the continuous development of voiceprint authentication technology, spoofing voice attacks against it are becoming more and more serious. A playback voice attack mainly consists in using a recording device to capture the voice of a legitimate user as the user accesses the system, and then replaying that recording with a high-fidelity playback device to attack the voiceprint authentication system. Because the voice sample comes from real speech, the operation is simple and the recording is easy to obtain, a playback attack is easy to carry out and the attacker needs no specialized knowledge.
Early playback voice detection technologies were all based on traditional hand-crafted features. However, traditional hand-crafted feature extraction cannot express the high-level semantics of speech well, so most current playback voice detection technologies are based on deep learning, in which shallow features are processed more deeply by a neural network model to improve the performance of the playback voice detection algorithm. Prior art 1: Lavrentyeva G, Novoselov S, Malykh E, et al. Audio replay attack detection with deep learning frameworks [C]// Interspeech. 2017: 82-86. Prior art 2: Jung J, Shim H, Heo H S, et al. Replay attack detection with complementary high-resolution information using end-to-end DNN for the ASVspoof 2019 Challenge [J]. arXiv preprint arXiv:1904.10134, 2019. Prior art 3: Chettri B, Stoller D, Morfi V, et al. Ensemble models for spoofing detection in automatic speaker verification [J]. arXiv preprint arXiv:1904.04589, 2019. In prior art 3, the authors propose a framework based on multi-feature integration and multi-task learning, which brings the complementary information of several spectral features into the network; because a single traditional feature is not sufficient to summarize the information of a speech signal over the full frequency band, the speech information is better expressed through multi-feature integration. In addition, the authors propose a butterfly unit for multi-task learning to facilitate parameter sharing between the binary classification task and the other auxiliary classification tasks during propagation.
Prior art 1 suffers from weak robustness: the performance of the detection algorithm on the evaluation set is not high, mainly because the generalization ability of the model is not strong, so the detection performance drops noticeably when facing the unknown attacks in the evaluation set. Prior art 2 and prior art 3 suffer from complex detection algorithms. In prior art 2, the authors partition the data set so that the playback voice attacks in the test set and the training set do not overlap, but this partitioning process is complicated and therefore time-consuming. In prior art 3, the authors assist the binary classification decision by proposing a multi-feature, multi-task learning scheme in which other auxiliary classification tasks are also performed during network training, but the resulting detection algorithm is relatively complex.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide, in view of the defects of the prior art, a playback voice detection method that improves the detection performance and robustness of the playback voice detection algorithm.
The technical solution adopted by the invention to solve the above technical problem is as follows: a playback voice detection method, characterized by comprising the following steps:
1) a training stage:
1.1) inputting training voice samples, wherein the training voice samples comprise original voice and playback voice;
1.2) extracting cepstrum characteristics of a training voice sample;
1.3) training a residual network model according to the extracted features to obtain network model parameters;
2) a testing stage:
2.1) inputting a test voice sample;
2.2) extracting cepstrum characteristics of the test voice sample;
2.3) using the residual network trained in step 1) to classify and score the extracted features of the test voice sample;
2.4) judging whether the test voice sample is playback voice.
Preferably, in order to retain more detailed information of the spectrum, in step 1.2) and step 2.2) the full-frequency cepstral coefficient features are extracted.
Further, the full-frequency cepstral coefficient features are extracted as follows: 1) framing and windowing are performed on the speech signal of the training voice sample or the test voice sample, and a Fourier transform is applied to the framed speech signal to obtain its spectral coefficients X_i(k):

X_i(k) = \sum_{m=0}^{N-1} x_i(m) e^{-j 2\pi k m / N},  k = 0, 1, 2, ..., N-1

where i denotes the i-th frame after framing, k denotes the frequency bin within the i-th frame, x_i(m) is the m-th sample of the i-th frame of the framed speech signal, j is the imaginary unit, and N is the number of Fourier transform points;

2) the absolute value is then taken to obtain the corresponding amplitude spectral coefficients E_i(k):

E_i(k) = \left| X_i(k) \right|

3) a logarithm operation and a DCT transform are then applied to obtain the frequency cepstral coefficients BFCC(i) of the i-th frame:

\mathrm{BFCC}(i) = \mathrm{DCT}\big( \log E_i(k) \big)
In order to enable the training models of different features to cooperate with each other and obtain a better fusion result, in step 1.2) and step 2.2) the Mel-frequency cepstral coefficient features and the constant-Q cepstral coefficient features are preferably also extracted. In the training stage, three corresponding residual networks are obtained from the full-frequency cepstral coefficient features, the Mel-frequency cepstral coefficient features and the constant-Q cepstral coefficient features respectively; in the testing stage, the recognition results of the residual networks corresponding to these three features are obtained, and the three recognition results are fused for a comprehensive decision.
According to one aspect of the invention, in step 1.2) and step 2.2), the extracted features are mel-frequency cepstral coefficients.
According to another aspect of the invention, in step 1.2) and step 2.2), the constant-Q cepstral coefficient features are extracted.
Preferably, the residual network comprises a two-dimensional convolution layer, a residual block sequence, a Dropout layer, a first fully connected layer, an activation function layer, a GRU layer, a second fully connected layer and a network output layer, connected in sequence.
Preferably, the activation function layer adopts a leaky rectified linear unit (LeakyReLU).
Preferably, in order to increase the convergence rate of learning, a batch normalization layer is further provided between the two-dimensional convolution layer and the activation function layer.
To improve the detection accuracy, in step 2.4), the score of the residual network output is combined with the ASV system score to determine whether the test speech sample is original speech or played back speech.
Compared with the prior art, the invention has the following advantages: the neural network can extract deeper features and represent the detailed information in the voice signal more completely, and, based on deep learning, the cepstral features of the voice signal are combined with a deep residual network, which effectively improves the detection performance of the system and gives the algorithm better robustness; the residual network can model distortions in both the time domain and the frequency domain well, which improves the classification accuracy of the neural network; and by extracting the full-frequency cepstral coefficient features, no filter bank is needed and more detailed information of the spectrum can be retained.
Drawings
FIG. 1 is a flowchart of a playback voice detection method according to an embodiment of the present invention;
FIG. 2 is a MFCC extraction flow diagram for a playback voice detection method in an embodiment of the present invention;
FIG. 3 is a flow chart of CQCC extraction of the playback voice detection method according to the embodiment of the present invention;
fig. 4 is a BFCC extraction flow chart of the playback voice detection method of the embodiment of the present invention;
fig. 5 is a schematic diagram of the residual network of the playback voice detection method according to the embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar functions.
In the description of the present invention, it is to be understood that the terms "central," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," "circumferential," and the like are used in the orientations and positional relationships indicated in the drawings for convenience in describing the present invention and to simplify the description, but are not intended to indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and that the directional terms are used for purposes of illustration and are not to be construed as limiting, for example, because the disclosed embodiments of the present invention may be oriented in different directions, "lower" is not necessarily limited to a direction opposite to or coincident with the direction of gravity. Furthermore, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.
Referring to fig. 1, a playback voice detection method includes the steps of:
1) a training stage:
1.1) inputting a training voice sample;
1.2) extracting cepstrum characteristics;
1.3) training a residual network model to obtain network model parameters;
2) a testing stage:
2.1) inputting a test voice sample;
2.2) extracting cepstrum characteristics;
2.3) using the residual network trained in step 1) for classification and scoring;
2.4) combining the ASV system score with the score from step 2.3) to obtain the decision.
In the above steps, the feature extraction for the training and test voice samples may be implemented with filters: the features of the voice sample are determined by a filter bank that is preconfigured according to actual requirements and used to extract features from the voice sample, such as the traditional Mel-frequency cepstral coefficient features and the constant-Q cepstral coefficient features.
1. Mel-frequency cepstral coefficient features: Mel-frequency cepstral coefficients (MFCC) are speech feature parameters commonly used in the field of speaker recognition. They conform to the auditory characteristics of the human ear, which has different sensitivities to sound waves of different frequencies. The Mel-frequency cepstral coefficients are cepstral coefficients extracted in the Mel-scale frequency domain; the Mel scale reflects the nonlinear characteristic of human hearing with respect to frequency, and its relationship with physical frequency is given by:

F_mel = 2595 \log_{10}\left( 1 + \frac{f}{700} \right)

where F_mel is the perceived frequency in Mel and f is the actual frequency in hertz (Hz). Converting the speech signal to this perceptual frequency domain, rather than simply representing it by its Fourier transform, generally simulates the processing of the human auditory system better.
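For illustration only (not part of the patent text), the Mel relationship above can be written as a pair of small Python helpers; the function names are arbitrary:

    import numpy as np

    def hz_to_mel(f_hz):
        """Perceived Mel frequency for a physical frequency in Hz (2595*log10 form)."""
        return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

    def mel_to_hz(f_mel):
        """Inverse mapping: Mel back to Hz."""
        return 700.0 * (10.0 ** (np.asarray(f_mel, dtype=float) / 2595.0) - 1.0)

    print(hz_to_mel([100, 1000, 4000]))  # Mel values grow roughly logarithmically with Hz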
Referring to fig. 2, a flow chart of a specific extraction process of MFCC is shown, comprising the following steps:
1) The signal x(n) of the voice sample (training voice sample or test voice sample) is preprocessed into frames x_i(m), and a short-time Fourier transform (STFT) is applied to each frame obtained after framing to get its spectral coefficients:

X_i(k) = \sum_{m=0}^{N-1} x_i(m) e^{-j 2\pi k m / N},  k = 0, 1, ..., N-1

where i denotes the i-th frame after framing, k denotes the frequency bin within the i-th frame, x_i(m) is the m-th sample of that frame, and N is the number of Fourier transform points;

2) The energy spectral coefficients of each frame are computed from the spectral coefficients:

E_i(k) = \left| X_i(k) \right|^2

3) The resulting energy spectral coefficients are fed into a bank of Mel filters, where the filter-bank energies are computed: the Mel spectral coefficients are obtained by multiplying the energy spectral coefficients by the frequency response H_m(k) of each Mel filter and summing, i.e.:

S_i(m) = \sum_{k=0}^{N-1} E_i(k) H_m(k),  0 \le m \le M-1

where m refers to the m-th Mel filter and there are M filters in total;

4) A logarithm operation and a DCT (discrete cosine transform) are then applied to the Mel spectral coefficients to obtain the Mel-frequency cepstral coefficients:

\mathrm{MFCC}_i(n) = \sum_{m=0}^{M-1} \log\big( S_i(m) \big) \cos\left( \frac{\pi n (m + 0.5)}{M} \right)
the standard MFCC only reflects the static characteristics of parameters in a voice signal, the dynamic characteristics of the voice signal can be obtained through the differential spectral coefficients of the static characteristics, and the static coefficients of the MFCC characteristics can be preferably combined with the dynamic characteristics of first-order coefficients and second-order coefficients to improve the recognition performance of the system.
2. Constant-Q cepstral coefficient features: the constant-Q cepstral coefficients (CQCC) are features obtained from the CQT time-frequency transform of the speech signal. The CQT uses a set of filters whose ratio of center frequency to bandwidth is a constant Q. The frequency axis of the spectrum obtained by the CQT is non-linear, because the center frequencies follow an exponential distribution and the window length used by the filter bank varies with frequency. When the time-domain speech signal is converted into the frequency domain, the CQT therefore provides higher frequency resolution in the low-frequency region and higher time resolution in the high-frequency region.
Referring to fig. 3, a flowchart of a specific extraction process of the constant Q cepstrum coefficient includes the following steps:
1) The signal x(n) of the voice sample (training voice sample or test voice sample) is transformed by the CQT into the perceptual-frequency-domain signal X^{CQ}(k), which is computed as:

X^{CQ}(k) = \frac{1}{N_k} \sum_{n=0}^{N_k - 1} x(n) w_{N_k}(n) e^{-j 2\pi Q n / N_k}

where k = 1, 2, ..., K denotes the frequency bin, f_s is the sampling rate of the speech sample, and f_k is the center frequency of the k-th filter, which follows an exponential distribution defined as:

f_k = f_1 \cdot 2^{(k-1)/B}

where B is the number of frequency bins per octave and f_1 is the center frequency of the lowest frequency bin (determined from the analysed frequency range and B); the bandwidth of the k-th bin is

B_k = f_{k+1} - f_k = f_k \left( 2^{1/B} - 1 \right)

The Q factor, the ratio of the center frequency f_k to the bandwidth B_k, is a constant independent of k and is defined as:

Q = \frac{f_k}{B_k} = \left( 2^{1/B} - 1 \right)^{-1}

A Hanning window is used as the window function; since the time resolution gradually decreases as the frequency resolution increases, the window length N_k is a function of k and inversely proportional to the center frequency, N_k = Q f_s / f_k, and the window function is defined as:

w_{N_k}(n) = \frac{1}{2} \left( 1 - \cos \frac{2\pi n}{N_k - 1} \right),  0 \le n \le N_k - 1

2) The magnitude of the spectral value X_i^{CQ}(k) at the k-th frequency bin of the i-th frame is computed:

E_i(k) = \left| X_i^{CQ}(k) \right|^2

3) The logarithmic spectral coefficients are obtained from these magnitudes:

\log E_i(k),  i = 1, 2, ..., T_k

where T_k denotes the total number of frames of the speech signal in the k-th frequency band, k = 1, 2, ..., K;

4) The logarithmic spectral coefficients, which lie on the geometrically spaced frequencies f_k, are uniformly resampled so that the new frequency representation has linearly spaced points; the resampled log spectrum \log \bar{E}_i(l), l = 1, 2, ..., L, is the interpolation of \log E_i(k) at L uniformly spaced frequency points;

5) A DCT is applied to the resampled spectral coefficients, i.e.:

\mathrm{CQCC}_i(p) = \sum_{l=1}^{L} \log \bar{E}_i(l) \cos\left( \frac{p (l - 1/2) \pi}{L} \right),  p = 0, 1, ..., L-1

where L denotes the number of frequency points after resampling.
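The CQCC pipeline above (CQT, log power spectrum, uniform resampling of the geometric frequency axis, DCT) can be sketched in Python roughly as follows. This is only an approximation built on librosa's CQT; fmin, the bins-per-octave value, the Nyquist safety margin and the number of retained coefficients are all assumptions, not values specified by the patent:

    import numpy as np
    import librosa
    from scipy.fftpack import dct

    def extract_cqcc(wav_path, sr=16000, fmin=15.625, bins_per_octave=96, n_cqcc=20):
        """Approximate CQCC: CQT -> log power -> geometric-to-linear resampling -> DCT."""
        y, sr = librosa.load(wav_path, sr=sr)
        # stay safely below Nyquist when choosing the number of CQT bins
        n_bins = int(bins_per_octave * np.log2(0.95 * (sr / 2) / fmin))
        C = librosa.cqt(y, sr=sr, fmin=fmin, n_bins=n_bins,
                        bins_per_octave=bins_per_octave, hop_length=512)
        log_power = np.log(np.abs(C) ** 2 + 1e-12)            # log |X^CQ(k)|^2 per frame
        # interpolate from the geometric frequency grid onto a uniform grid of the same size
        freqs_geo = fmin * 2.0 ** (np.arange(n_bins) / bins_per_octave)
        freqs_lin = np.linspace(freqs_geo[0], freqs_geo[-1], num=n_bins)
        uniform = np.stack([np.interp(freqs_lin, freqs_geo, log_power[:, t])
                            for t in range(log_power.shape[1])], axis=1)
        return dct(uniform, type=2, axis=0, norm='ortho')[:n_cqcc]   # (n_cqcc, num_frames)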
The two feature types above are existing, filter-based features. In addition, in the present invention, the features of the speech sample may also be extracted without using a filter bank, specifically as follows:
3. Full-frequency cepstral coefficients: compared with other traditional features such as CQCC and MFCC, the full-frequency cepstral coefficients (BFCC) abandon the filter banks used by those features; that is, the logarithm operation and the DCT transform are applied directly to the spectral coefficients obtained by the Fourier transform, which has the advantage of retaining more detailed information of the spectrum.
Referring to fig. 4, a flow chart of a specific extraction process of BFCC is shown, which includes the following steps:
1) A segment of speech is framed and windowed, and a Fourier transform is applied to the framed speech signal to obtain its spectral coefficients:

X_i(k) = \sum_{m=0}^{N-1} x_i(m) e^{-j 2\pi k m / N},  k = 0, 1, 2, ..., N-1

where i denotes the i-th frame after framing, k denotes the frequency bin within the i-th frame, x_i(m) is the m-th sample of the i-th frame of the framed speech signal, j is the imaginary unit, and N is the number of Fourier transform points; in this embodiment, N = 512;

2) The absolute value is then taken to obtain the corresponding amplitude spectral coefficients:

E_i(k) = \left| X_i(k) \right|

3) A logarithm operation and a DCT transform are then applied to obtain the cepstral coefficients:

\mathrm{BFCC}(i) = \mathrm{DCT}\big( \log E_i(k) \big)
For the BFCC features, the logarithmic energy coefficient and the first- and second-order difference coefficients of the features may preferably also be appended to form the final feature vector.
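A minimal NumPy sketch of the BFCC extraction described above (framing and windowing, FFT, then logarithm and DCT applied directly to the magnitude spectrum, with no filter bank) might look as follows; the frame length, hop size, window type and number of retained coefficients are illustrative assumptions, and only the non-redundant half of the spectrum (rfft) is kept:

    import numpy as np
    from scipy.fftpack import dct

    def extract_bfcc(x, n_fft=512, frame_len=400, hop=160, n_coeff=20):
        """Full-frequency cepstral coefficients: frame + window + FFT, then log and DCT
        applied directly to the magnitude spectrum (no filter bank)."""
        assert len(x) >= frame_len, "signal shorter than one frame"
        window = np.hamming(frame_len)
        n_frames = 1 + (len(x) - frame_len) // hop
        feats = np.empty((n_coeff, n_frames))
        for i in range(n_frames):
            frame = x[i * hop:i * hop + frame_len] * window
            spec = np.fft.rfft(frame, n=n_fft)              # X_i(k), k = 0 .. N/2
            log_mag = np.log(np.abs(spec) + 1e-12)          # log E_i(k)
            feats[:, i] = dct(log_mag, type=2, norm='ortho')[:n_coeff]
        return feats                                        # (n_coeff, n_frames)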
In the invention, the network model is a residual network, whose overall architecture is shown in fig. 5. When different features are used as input, the overall architecture of the network model is unchanged; only the dimension of the input features at the input end changes.
The residual network comprises, connected in sequence, a two-dimensional convolution layer, four identical residual block sequences, a Dropout layer (with the dropout rate set to 0.5), a first fully connected layer, an activation function layer, a GRU layer, a second fully connected layer and a network output layer (classifier), the network output layer preferably being a Softmax layer.
The activation function layer uses a leaky rectified linear unit (LeakyReLU), which evolved from the rectified linear unit (ReLU) activation function. The ReLU activation function activates a neuron only when its input exceeds a threshold: it sets all negative values to 0 and leaves positive values unchanged. However, its weakness shows during training, especially when the input is negative: learning becomes very slow or the neuron simply stops functioning, its weights can no longer be updated, the neuron no longer activates on any data, and its gradient stays at 0. To overcome this drawback of the ReLU function, the invention uses the LeakyReLU activation function in the network. Unlike ReLU, which sets all negative values to 0, LeakyReLU gives all negative values a very small non-zero slope, which solves the problem that a neuron stops learning when ReLU receives negative inputs. Its mathematical expression is:

f(x) = \begin{cases} x, & x \ge 0 \\ \alpha x, & x < 0 \end{cases}

where \alpha is a small positive constant (the leak slope).
In addition, the GRU layer can re-aggregate the frame-level features extracted by the preceding layers into a single utterance-level speech feature; it also has a simple structure and is well suited to building deeper networks. Within the residual network, the GRU's two gates also make the whole network more efficient and the computation less time-consuming, and they noticeably speed up the convergence of the model during training.
The three extracted cepstral features are each fed to the input of a residual network. Frame-level time-frequency features are first extracted by the two-dimensional convolution layer, the output of this layer is then fed to the 4 identical residual block sequences to allow deeper training of the network, and the output of the last residual block is sent in turn to the Dropout layer, the first fully connected layer, the activation function layer and the GRU layer. After the GRU layer, the utterance-level features are mapped to a new space by the second fully connected layer, converted into new features and fed into an output layer with only two node units to produce the classification logits; finally, the output of the second fully connected layer is fed into the Softmax layer, which converts the logits into a probability distribution used as the score.
Because the gradient may vanish when the network has too many layers during training, a batch normalization (BN) layer is added to the residual network. The BN layer uses a standardization step to pull a distribution that deviates from the normal range back into a standardized range; it is located between the two-dimensional convolution layer and the activation function layer, so that the data fall in a region where the activation function is sensitive, which enlarges the gradients and accelerates the convergence of learning.
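A PyTorch sketch of a residual network with this layer ordering (two-dimensional convolution, batch normalization, LeakyReLU, four identical residual blocks, Dropout of 0.5, fully connected layer, LeakyReLU, GRU, and a two-node output layer) is given below. The channel counts, kernel sizes and hidden sizes are assumptions, since the patent does not specify them, and the internal structure of the residual block is likewise an assumed standard conv-BN-LeakyReLU design:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """A simple 2-D residual block: two conv-BN stages plus an identity shortcut."""
        def __init__(self, channels):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.LeakyReLU(0.01),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
            )
            self.act = nn.LeakyReLU(0.01)

        def forward(self, x):
            return self.act(self.body(x) + x)

    class PlaybackDetector(nn.Module):
        """Conv2d -> BN -> LeakyReLU -> 4 residual blocks -> Dropout -> FC -> LeakyReLU
        -> GRU -> FC(2); sizes are illustrative assumptions."""
        def __init__(self, n_feat_bins, channels=32, hidden=64):
            super().__init__()
            self.front = nn.Sequential(
                nn.Conv2d(1, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.LeakyReLU(0.01),
            )
            self.res_blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(4)])
            self.dropout = nn.Dropout(0.5)
            self.fc1 = nn.Linear(channels * n_feat_bins, hidden)
            self.act = nn.LeakyReLU(0.01)
            self.gru = nn.GRU(hidden, hidden, batch_first=True)
            self.fc2 = nn.Linear(hidden, 2)             # two output nodes: genuine / playback

        def forward(self, x):                           # x: (batch, 1, n_feat_bins, n_frames)
            h = self.res_blocks(self.front(x))
            h = self.dropout(h)                         # (batch, C, F, T)
            h = h.permute(0, 3, 1, 2).flatten(2)        # (batch, T, C*F): frame-level features
            h = self.act(self.fc1(h))
            _, h_n = self.gru(h)                        # aggregate frames into one utterance vector
            return self.fc2(h_n[-1])                    # (batch, 2) classification logits

At scoring time the two logits are passed through Softmax (or implicitly inside the cross-entropy loss), matching the Softmax output layer described above.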
In the training phase, the residual network can be optimized with the Adam algorithm, using a learning rate of 10^-4 and a batch size of 32; the training process is stopped after 50 epochs. The loss function is the binary cross-entropy between the predicted value and the target value, and the node outputs of the last fully connected layer are used as the prediction scores.
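A training-loop sketch consistent with this setup (Adam, learning rate 1e-4, batch size 32 configured in the data loader, 50 epochs, cross-entropy over the two output nodes, which for two classes is equivalent to the stated binary cross-entropy) might look as follows; the PlaybackDetector model and the data loader are assumed to come from the sketches above:

    import torch
    import torch.nn as nn

    def train(model, train_loader, epochs=50, lr=1e-4, device="cpu"):
        """Training sketch; train_loader is assumed to be a DataLoader with batch_size=32."""
        model.to(device)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()              # 2-class cross-entropy on the two logits
        for epoch in range(epochs):
            model.train()
            for feats, labels in train_loader:         # feats: (32, 1, F, T), labels: 0 = genuine, 1 = playback
                feats, labels = feats.to(device), labels.to(device)
                optimizer.zero_grad()
                loss = criterion(model(feats), labels)
                loss.backward()
                optimizer.step()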
In the training stage of step 1), the features are extracted from the training voice samples (training sets containing original voice and playback voice) and fed into the neural network to train the residual network models on original voice and playback voice; in the testing stage, the features of the test voice sample are extracted and fed into the residual network model trained in the training stage, the test voice is classified and judged according to the score from the network output layer, and this result is combined with the score of an ASV (automatic speaker verification) system as the final decision on whether the test voice sample is playback voice.
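The patent states only that the network score is combined with the ASV system score; as one possible (assumed) realization, the sketch below applies the countermeasure decision first and then the ASV decision, with both thresholds chosen arbitrarily:

    import torch

    def decide(model, feats, asv_score, cm_threshold=0.5, asv_threshold=0.0):
        """Test-stage decision sketch; the cascaded rule and thresholds are assumptions."""
        model.eval()
        with torch.no_grad():
            prob = torch.softmax(model(feats), dim=-1)   # (1, 2): [P(genuine), P(playback)]
        genuine_score = prob[0, 0].item()
        is_genuine = genuine_score >= cm_threshold       # playback-voice (countermeasure) decision
        accept = is_genuine and (asv_score >= asv_threshold)
        return is_genuine, accept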
Through the above training and testing processes, one subsystem is obtained for each of the three cepstral features, and the scores of the three subsystems are fused. The fusion is a weighted sum of the three subsystem scores:

S = i · S_BFCC + j · S_MFCC + k · S_CQCC

where i, j and k are the weight coefficients of the three subsystem scores, under the constraint i + j + k = 1, and S_BFCC, S_MFCC and S_CQCC are the normalized scores of the respective subsystems. By making the training models of the different features cooperate with each other, a better fusion result is obtained.
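A small sketch of this weighted score fusion is given below; the min-max normalization and the equal default weights are illustrative assumptions, not choices made by the patent:

    import numpy as np

    def fuse_scores(s_bfcc, s_mfcc, s_cqcc, w=(1/3, 1/3, 1/3)):
        """Weighted fusion S = i*S_BFCC + j*S_MFCC + k*S_CQCC with i + j + k = 1."""
        def normalize(s):
            s = np.asarray(s, dtype=float)
            return (s - s.min()) / (s.max() - s.min() + 1e-12)
        i, j, k = w
        assert abs(i + j + k - 1.0) < 1e-6
        return i * normalize(s_bfcc) + j * normalize(s_mfcc) + k * normalize(s_cqcc)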

Claims (10)

1. A playback voice detection method, characterized by: the method comprises the following steps:
1) a training stage:
1.1) inputting training voice samples, wherein the training voice samples comprise original voice and playback voice;
1.2) extracting cepstrum characteristics of a training voice sample;
1.3) training a residual network model according to the extracted features to obtain network model parameters;
2) a testing stage:
2.1) inputting a test voice sample;
2.2) extracting cepstrum characteristics of the test voice sample;
2.3) using the residual network trained in step 1) to classify and score the extracted features of the test voice sample;
2.4) judging whether the test voice sample is playback voice.
2. The playback voice detection method according to claim 1, characterized in that: in step 1.2) and step 2.2), the full-frequency cepstral coefficient features are extracted.
3. The playback voice detection method according to claim 2, characterized in that: the full-frequency cepstral coefficient features are extracted as follows: 1) framing and windowing are performed on the speech signal of the training voice sample or the test voice sample, and a Fourier transform is applied to the framed speech signal to obtain its spectral coefficients X_i(k):

X_i(k) = \sum_{m=0}^{N-1} x_i(m) e^{-j 2\pi k m / N},  k = 0, 1, 2, ..., N-1

where i denotes the i-th frame after framing, k denotes the frequency bin within the i-th frame, x_i(m) is the m-th sample of the i-th frame of the framed speech signal, j is the imaginary unit, and N is the number of Fourier transform points;

2) the absolute value is then taken to obtain the corresponding amplitude spectral coefficients E_i(k):

E_i(k) = \left| X_i(k) \right|

3) a logarithm operation and a DCT transform are then applied to obtain the frequency cepstral coefficients BFCC(i) of the i-th frame:

\mathrm{BFCC}(i) = \mathrm{DCT}\big( \log E_i(k) \big)
4. The playback voice detection method according to claim 2, characterized in that: in step 1.2) and step 2.2), Mel-frequency cepstral coefficient features and constant-Q cepstral coefficient features are further extracted respectively; in the training stage, three corresponding residual networks are obtained from the full-frequency cepstral coefficient features, the Mel-frequency cepstral coefficient features and the constant-Q cepstral coefficient features respectively; in the testing stage, the recognition results of the residual networks corresponding to these three features are obtained, and the three recognition results are fused for a comprehensive decision.
5. The playback voice detection method according to claim 1, characterized in that: in step 1.2) and step 2.2), the Mel-frequency cepstral coefficient features are extracted.
6. The playback voice detection method according to claim 1, characterized in that: in step 1.2) and step 2.2), the constant-Q cepstral coefficient features are extracted.
7. The playback voice detection method according to any one of claims 1 to 6, characterized in that: the residual network comprises a two-dimensional convolution layer, a residual block sequence, a Dropout layer, a first fully connected layer, an activation function layer, a GRU layer, a second fully connected layer and a network output layer, connected in sequence.
8. The playback voice detection method according to claim 7, characterized in that: the activation function layer adopts a leaky rectified linear unit (LeakyReLU).
9. The playback voice detection method according to claim 7, characterized in that: there is also a batch normalization layer between the two-dimensional convolution layer and the activation function layer.
10. The playback voice detection method according to any one of claims 1 to 6, characterized in that: in step 2.4), the score of the residual network output is combined with the ASV system score to determine whether the test speech sample is original speech or played back speech.
CN202010479392.XA 2020-05-29 2020-05-29 Playback voice detection method Active CN111653289B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010479392.XA CN111653289B (en) 2020-05-29 2020-05-29 Playback voice detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010479392.XA CN111653289B (en) 2020-05-29 2020-05-29 Playback voice detection method

Publications (2)

Publication Number Publication Date
CN111653289A true CN111653289A (en) 2020-09-11
CN111653289B CN111653289B (en) 2022-12-27

Family

ID=72344774

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010479392.XA Active CN111653289B (en) 2020-05-29 2020-05-29 Playback voice detection method

Country Status (1)

Country Link
CN (1) CN111653289B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113012684A (en) * 2021-03-04 2021-06-22 电子科技大学 Synthesized voice detection method based on voice segmentation
CN113096692A (en) * 2021-03-19 2021-07-09 招商银行股份有限公司 Voice detection method and device, equipment and storage medium
CN113284486A (en) * 2021-07-26 2021-08-20 中国科学院自动化研究所 Robust voice identification method for environmental countermeasure
CN113488074A (en) * 2021-08-20 2021-10-08 四川大学 Long-time variable Q time-frequency conversion algorithm of audio signal and application thereof
CN113506583A (en) * 2021-06-28 2021-10-15 杭州电子科技大学 Disguised voice detection method using residual error network
CN114822587A (en) * 2021-01-19 2022-07-29 四川大学 Audio feature compression method based on constant Q transformation
CN115022087A (en) * 2022-07-20 2022-09-06 中国工商银行股份有限公司 Voice recognition verification processing method and device
CN117153190A (en) * 2023-10-27 2023-12-01 广东技术师范大学 Playback voice detection method based on attention mechanism combination characteristics


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107464568A (en) * 2017-09-25 2017-12-12 四川长虹电器股份有限公司 Text-independent speaker recognition method and system based on three-dimensional convolutional neural network
CN109920447A (en) * 2019-01-29 2019-06-21 天津大学 Recording fraud detection method based on sef-adapting filter Amplitude & Phase feature extraction
CN109935233A (en) * 2019-01-29 2019-06-25 天津大学 A kind of recording attack detection method based on amplitude and phase information

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HANILCI C: "Optimizing acoustic features for source cell-phone recognition using speech signals", 《2013 ACM WORKSHOP ON INFORMATION HIDING AND MULTIMEDIA SECURITY》 *
LI Yong et al.: "An intrusion detection algorithm based on deep CNN" (一种基于深度CNN的入侵检测算法), 《计算机应用与软件》 (Computer Applications and Software) *
LIN Lang et al.: "Playback speech detection algorithm based on modified cepstral features" (基于修正倒谱特征的回放语音检测算法), 《计算机应用》 (Journal of Computer Applications) *
PEI Anshan et al.: "Mobile phone source identification method based on features of the silent segments of speech" (基于语音静音段特征的手机来源识别方法), 《电信科学》 (Telecommunications Science) *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114822587A (en) * 2021-01-19 2022-07-29 四川大学 Audio feature compression method based on constant Q transformation
CN113012684A (en) * 2021-03-04 2021-06-22 电子科技大学 Synthesized voice detection method based on voice segmentation
CN113012684B (en) * 2021-03-04 2022-05-31 电子科技大学 Synthesized voice detection method based on voice segmentation
CN113096692A (en) * 2021-03-19 2021-07-09 招商银行股份有限公司 Voice detection method and device, equipment and storage medium
CN113096692B (en) * 2021-03-19 2024-05-28 招商银行股份有限公司 Voice detection method and device, equipment and storage medium
CN113506583B (en) * 2021-06-28 2024-01-05 杭州电子科技大学 Camouflage voice detection method using residual error network
CN113506583A (en) * 2021-06-28 2021-10-15 杭州电子科技大学 Disguised voice detection method using residual error network
CN113284486A (en) * 2021-07-26 2021-08-20 中国科学院自动化研究所 Robust voice identification method for environmental countermeasure
CN113488074A (en) * 2021-08-20 2021-10-08 四川大学 Long-time variable Q time-frequency conversion algorithm of audio signal and application thereof
CN113488074B (en) * 2021-08-20 2023-06-23 四川大学 Two-dimensional time-frequency characteristic generation method for detecting synthesized voice
CN115022087B (en) * 2022-07-20 2024-02-27 中国工商银行股份有限公司 Voice recognition verification processing method and device
CN115022087A (en) * 2022-07-20 2022-09-06 中国工商银行股份有限公司 Voice recognition verification processing method and device
CN117153190A (en) * 2023-10-27 2023-12-01 广东技术师范大学 Playback voice detection method based on attention mechanism combination characteristics
CN117153190B (en) * 2023-10-27 2024-01-19 广东技术师范大学 Playback voice detection method based on attention mechanism combination characteristics

Also Published As

Publication number Publication date
CN111653289B (en) 2022-12-27

Similar Documents

Publication Publication Date Title
CN111653289B (en) Playback voice detection method
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
CN109584884B (en) Voice identity feature extractor, classifier training method and related equipment
CN108831443B (en) Mobile recording equipment source identification method based on stacked self-coding network
CN111785285A (en) Voiceprint recognition method for home multi-feature parameter fusion
CN110459241B (en) Method and system for extracting voice features
CN108198561A (en) A kind of pirate recordings speech detection method based on convolutional neural networks
Mallidi et al. Novel neural network based fusion for multistream ASR
CN109378014A (en) A kind of mobile device source discrimination and system based on convolutional neural networks
WO2019232833A1 (en) Speech differentiating method and device, computer device and storage medium
CN112541533A (en) Modified vehicle identification method based on neural network and feature fusion
CN112735477B (en) Voice emotion analysis method and device
CN108922514B (en) Robust feature extraction method based on low-frequency log spectrum
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN110570871A (en) TristouNet-based voiceprint recognition method, device and equipment
CN116778956A (en) Transformer acoustic feature extraction and fault identification method
CN109300470A (en) Audio mixing separation method and audio mixing separator
CN112466276A (en) Speech synthesis system training method and device and readable storage medium
CN113516987B (en) Speaker recognition method, speaker recognition device, storage medium and equipment
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
Cui et al. Research on audio recognition based on the deep neural network in music teaching
CN114283835A (en) Voice enhancement and detection method suitable for actual communication condition
CN118098247A (en) Voiceprint recognition method and system based on parallel feature extraction model
CN111261192A (en) Audio detection method based on LSTM network, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230821

Address after: Room 502, No. 3 Pulan 1st Street, Chancheng District, Foshan City, Guangdong Province, 528000

Patentee after: Kong Fanbin

Address before: Room 337, Building 3, No. 266, Zhenxing Road, Yuyue Town, Deqing County, Huzhou City, Zhejiang Province, 313000

Patentee before: Huzhou Chuangguan Technology Co.,Ltd.

Effective date of registration: 20230821

Address after: Room 337, Building 3, No. 266, Zhenxing Road, Yuyue Town, Deqing County, Huzhou City, Zhejiang Province, 313000

Patentee after: Huzhou Chuangguan Technology Co.,Ltd.

Address before: 315211, Fenghua Road, Jiangbei District, Zhejiang, Ningbo 818

Patentee before: Ningbo University

TR01 Transfer of patent right