CN116778910A - Voice detection method - Google Patents

Voice detection method

Info

Publication number
CN116778910A
Authority
CN
China
Prior art keywords: feature, voice, characteristic, features, channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310505872.2A
Other languages
Chinese (zh)
Inventor
张鹏远
张震
陆镜泽
孙旭东
王文超
刘睿霖
王丽
杜金浩
陈树丽
计哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS and National Computer Network and Information Security Management Center
Priority to CN202310505872.2A
Publication of CN116778910A
Legal status: Pending (current)

Classifications

    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G06F18/2135: Feature extraction by transforming the feature space, based on approximation criteria, e.g. principal component analysis
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G06N3/08: Neural network learning methods
    • G10L15/063: Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L15/28: Constructional details of speech recognition systems
    • G10L2015/0631: Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application provides a voice detection method, comprising the following steps: obtaining a target voice and preprocessing it, where the preprocessing includes pre-emphasis, framing and windowing; determining a first channel feature, a first sound source wave feature and a plurality of first related features of the preprocessed target voice; determining a first principal component feature based on the first channel feature, the first sound source wave feature and the plurality of first related features; and inputting the first principal component feature into a trained classifier and outputting a classification result, the classification result being fake voice or natural voice. The application exploits the trace information that fake voice leaves at the fundamental frequency and the differences between fake voice and natural voice in sound source and vocal tract (channel) characteristics to detect fake voice. Principal component analysis is used to screen the sound source and channel features, selecting the most relevant principal components as features, which reduces feature dimensionality and redundancy and improves the generalization ability and efficiency of the model.

Description

Voice detection method
Technical Field
The application relates to the field of voice detection, in particular to a voice detection method.
Background
With the continuous progress of technology, speech technologies such as speech recognition and speech synthesis have been widely deployed. With the rapid development of deep learning, artificial intelligence techniques have been introduced into many tasks in the speech field to improve performance. However, speech technology has also introduced new risks. To cope with the serious threat of spoofing attacks, the development of fake voice detection systems that resist such attacks has attracted much attention in recent years. Although many fake voice detection methods have been proposed, only a few have been put into practice. Existing fake voice detection systems are not designed around the characteristics of fake voice itself, so their generalization ability is low and they struggle to earn users' trust.
Current fake voice detection (anti-spoofing) systems rely too heavily on traditional hand-crafted features and the classification power of deep neural networks, rather than being designed around the characteristics of fake voice. As a result, the model becomes tightly coupled to its training data set, generalizes poorly in real application scenarios, and is difficult to deploy in practice.
Disclosure of Invention
In a first aspect, an embodiment of the present application provides a voice detection method, the method including: obtaining a target voice and preprocessing it, where the preprocessing includes pre-emphasis, framing and windowing; determining a plurality of first channel features of the preprocessed target voice; determining a first sound source wave feature of the preprocessed target voice, the first sound source wave feature being extracted by an inverse filter; determining a first principal component feature based on the plurality of first channel features and the first sound source wave feature; and inputting the first principal component feature into a trained classifier and outputting a classification result, the classification result being fake voice or natural voice.
Thus, the voice detection method provided by the embodiment of the present application detects fake voice by exploiting the trace information that fake voice leaves at the fundamental frequency, and the differences in sound source and vocal tract (channel) characteristics between fake voice and natural voice caused by their different generation processes. Meanwhile, principal component analysis is used to screen the sound source and channel features, selecting the most relevant principal components as features, which reduces feature dimensionality and redundancy and improves the generalization ability and efficiency of the model.
In some implementations, the plurality of first channel features includes the amplitude-frequency characteristic of a channel (vocal tract) filter and the amplitude-frequency characteristic of its inverse filter, and determining the plurality of first channel features of the preprocessed target voice includes: predicting the filter parameters of each frame of the preprocessed target voice using linear predictive coding; and computing the amplitude-frequency characteristic of the channel filter and the amplitude-frequency characteristic of the inverse filter from the filter parameters of each frame.
Thus, the embodiment of the application introduces various channel-related features to enhance the generalization capability of the classifier.
In some implementations, the plurality of first channel features further includes a fundamental frequency feature, a fundamental frequency perturbation feature, an amplitude perturbation feature, and a mel-frequency cepstral coefficient.
Thus, the embodiment of the application introduces various channel-related features to enhance the generalization capability of the classifier.
In some implementations, determining the first source wave characteristic of the pre-processed target speech includes: acquiring short-time Fourier features of the target voice after framing; and carrying out inverse filtering on the short-time Fourier features through an inverse filter to obtain first sound source wave features of the preprocessed target voice.
Thus, the embodiment of the present application separates sound source features from channel features by inverse filtering, so that the classifier can exploit the distinctive characteristics introduced by the different generation mechanisms of fake and natural voice. The information is used more fully, while no excessive data preprocessing steps are introduced.
In some implementations, determining the first principal component feature based on the plurality of first channel features and the first sound source wave feature includes: stitching the plurality of first channel features and the first sound source wave feature to obtain a first stitched feature, the number of features in the first stitched feature being n; de-centering each feature in the first stitched feature; computing the covariance matrix of the de-centered first stitched feature; performing eigenvalue decomposition on the covariance matrix to obtain a plurality of eigenvalues and corresponding eigenvectors; selecting the first k eigenvectors, in descending order of eigenvalue, to form a conversion matrix, where k is smaller than n; and multiplying the first stitched feature by the conversion matrix to obtain the first principal component feature.
Therefore, the embodiment of the application introduces principal component analysis in the process of feature splicing to realize dimension reduction, and selects the high-correlation features with higher contribution to the task. Redundancy is reduced to achieve an improvement in efficiency.
In some implementations, the method further includes the step of training the classifier: obtaining labeled training voice from a training set and preprocessing the training voice, where the preprocessing includes pre-emphasis, framing and windowing; determining a plurality of second channel features of the preprocessed training voice; determining a second sound source wave feature of the preprocessed training voice, the second sound source wave feature being extracted by a channel inverse filter; determining a second principal component feature based on the plurality of second channel features and the second sound source wave feature; and inputting the second principal component feature into the classifier for iterative training, the trained classifier being obtained when the loss function converges.
Therefore, the classifier obtained by training in the embodiment of the application has simple structure and good portability. Other beneficial effects are as described previously and will not be described in detail herein.
In some implementations, the plurality of second channel features includes the amplitude-frequency characteristic of a channel filter and the amplitude-frequency characteristic of its inverse filter, and determining the plurality of second channel features of the preprocessed training speech includes: predicting the filter parameters of each frame of the preprocessed training speech using linear predictive coding; and computing the amplitude-frequency characteristic of the channel filter and the amplitude-frequency characteristic of the inverse filter from the filter parameters of each frame of training speech.
In some implementations, the plurality of second channel features further includes a fundamental frequency feature, a fundamental frequency perturbation feature, an amplitude perturbation feature, and a mel-frequency cepstral coefficient.
In some implementations, determining the second sound source wave feature of the preprocessed training speech includes: obtaining the short-time Fourier features of the framed training speech; and inverse filtering the short-time Fourier features with the inverse filter to obtain the second sound source wave feature of the preprocessed training speech.
In some implementations, determining the second principal component feature based on the plurality of second channel features and the second sound source wave feature includes: stitching the plurality of second channel features and the second sound source wave feature to obtain a second stitched feature, the number of features in the second stitched feature being n; de-centering each feature in the second stitched feature; computing the covariance matrix of the de-centered second stitched feature; performing eigenvalue decomposition on the covariance matrix to obtain n eigenvalues and corresponding eigenvectors; selecting the first k eigenvectors, in descending order of eigenvalue, to form a conversion matrix, where k is smaller than n; and multiplying the second stitched feature by the conversion matrix to obtain the second principal component feature.
In a second aspect, an embodiment of the present application provides an electronic device, including: at least one memory for storing a program;
at least one processor for executing the program stored in the memory, the processor being configured to perform the method according to any one of the first aspects when the program stored in the memory is executed.
In a third aspect, embodiments of the present application provide a computer storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method provided in the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments disclosed in the present specification, the drawings required for describing the embodiments are briefly introduced below. It is apparent that the drawings in the following description are only examples of the embodiments disclosed in the present specification, and that other drawings may be obtained from them by a person skilled in the art without inventive effort.
The drawings that accompany the detailed description can be briefly described as follows.
Fig. 1 is a system architecture diagram of a voice detection method according to an embodiment of the present application;
fig. 2 is a flowchart of a voice detection method according to an embodiment of the present application;
fig. 3 is a training flowchart of a voice detection method according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be described below with reference to the accompanying drawings.
In describing embodiments of the present application, words such as "exemplary," "such as" or "for example" are used to mean serving as examples, illustrations or explanations. Any embodiment or design described herein as "exemplary," "such as" or "for example" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary," "such as" or "for example," etc., is intended to present related concepts in a concrete fashion.
In the description of the embodiments of the present application, the term "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A alone, B alone, or both A and B. In addition, unless otherwise indicated, the term "plurality" means two or more. For example, a plurality of systems means two or more systems, and a plurality of terminals means two or more terminals.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating an indicated technical feature. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
In the description of embodiments of the application reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with each other without conflict.
In the description of the embodiments of the present application, the terms "first\second\third, etc." or module a, module B, module C, etc. are used merely to distinguish similar objects and do not represent a particular ordering for the objects, it being understood that particular orders or precedence may be interchanged as allowed so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described herein.
In the description of the embodiment of the present application, reference numerals indicating steps, such as S110, S120, … …, etc., do not necessarily indicate that the steps are performed in this order, and the order of the steps may be interchanged or performed simultaneously as allowed.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Fig. 1 is a system architecture diagram of a voice detection method according to an embodiment of the present application. As shown in fig. 1, the voice preprocessing module 11 first preprocesses a target voice, which may be a captured real or fake voice; the preprocessing includes pre-emphasis, framing and windowing. The channel feature extraction module 121 extracts the channel (vocal tract) filter coefficients of each frame of the speech signal using linear predictive coding (LPC), and from these coefficients estimates the amplitude-frequency characteristic of each frame's channel filter and the amplitude-frequency characteristic of its inverse filter, yielding the channel features. The sound source wave feature extraction module 122 computes the short-time Fourier transform (STFT) spectrum of each preprocessed frame and inverse filters it with the channel inverse filter, yielding the sound source wave features. The source-channel correlation feature extraction module 123 extracts, for each preprocessed frame, correlation features such as the fundamental frequency feature, the fundamental frequency perturbation feature and the mel-frequency cepstral coefficients (MFCC). The principal component determination module 13 stitches the channel features, the sound source wave features, the fundamental frequency feature, the fundamental frequency perturbation feature, the mel-frequency cepstral coefficients and so on, and removes redundant features using principal component analysis (PCA), obtaining highly correlated principal component features. The classification module 14 feeds the principal component features into a classifier for binary classification and outputs a result of natural voice or fake voice. The embodiment of the present application thus provides a voice detection method that, for voice whose authenticity is difficult to judge, fully exploits the differences between the sound source and the vocal tract to achieve fake voice detection that generalizes to real usage scenarios, and uses the feature-selection ability of principal component analysis to reduce redundancy and improve efficiency.
Fig. 2 is a schematic flowchart of a voice detection method according to an embodiment of the present application. As shown in fig. 2, the voice detection method includes: S11, obtaining a target voice and preprocessing it, where the preprocessing includes pre-emphasis, framing and windowing; S12, determining a first channel feature, a first sound source wave feature and a plurality of first related features of the preprocessed target voice; S13, determining a first principal component feature based on the first channel feature, the first sound source wave feature and the plurality of first related features; S14, inputting the first principal component feature into a trained classifier and outputting a classification result, the classification result being fake voice or natural voice.
The following describes each step of the voice detection method provided by the present application in detail with reference to the embodiment.
S11, acquiring target voice, and preprocessing the target voice, wherein the preprocessing comprises pre-emphasis, framing and windowing.
Because the high-frequency part of speech has a smaller amplitude than the lower-frequency part, pre-emphasis balances the spectrum, avoids numerical problems during the Fourier transform, improves the signal-to-noise ratio (SNR), compensates the high-frequency part of the speech signal that is suppressed by the articulation system (vocal cords and lips), and highlights the high-frequency formants.
In embodiments of the present application, a high-pass filter may be used to pre-emphasize the target speech. The pre-emphasized speech signal y(n) is:
y(n)=x(n)-0.79·x(n-1) (1)
where x(n) represents the nth sample of the target speech and x(n-1) represents the (n-1)th sample.
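As a minimal illustrative sketch (not prescribed by the patent itself), the pre-emphasis of equation (1) might be implemented in Python with numpy as follows; the function name and the handling of the first sample (x(-1) taken as 0) are assumptions.

    import numpy as np

    def pre_emphasis(x: np.ndarray, alpha: float = 0.79) -> np.ndarray:
        """Apply the pre-emphasis filter y(n) = x(n) - alpha * x(n-1) of equation (1)."""
        y = np.empty_like(x, dtype=np.float64)
        y[0] = x[0]                      # first sample kept as-is (x(-1) assumed to be 0)
        y[1:] = x[1:] - alpha * x[:-1]   # y(n) = x(n) - 0.79 * x(n-1)
        return y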
In most cases the speech signal is non-stationary, so applying a Fourier transform to the entire signal is not meaningful. The Fourier transform can instead be applied to short-time frames, and concatenating adjacent frames gives a good approximation of the time-frequency transform of the signal.
In an embodiment of the present application, after pre-emphasis, the target speech is divided into a plurality of short-time frames, each of which is a short-time stationary speech signal.
Illustratively, the frame length may be set to 25 ms and the frame shift to 10 ms, and the original speech is sliced into a plurality of speech frames for computing short-time time-frequency features.
After dividing the target speech into short-time frames, each frame is multiplied by a window function to increase the continuity at the left and right ends of the frame and reduce spectral leakage. The window function may be a Hamming window, whose formula w(n) is:
w(n) = 0.54 - 0.46·cos( 2πn / (N-1) ), 0 ≤ n ≤ N-1
where N represents the number of sampling points within the window.
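A framing-and-windowing step matching the 25 ms frame length, 10 ms frame shift and Hamming window described above could look roughly like the sketch below; the 16 kHz sampling rate and the function name are assumptions.

    import numpy as np

    def frame_and_window(y: np.ndarray, sr: int = 16000,
                         frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
        """Slice the pre-emphasized signal into overlapping frames and apply a Hamming window."""
        frame_len = int(sr * frame_ms / 1000)    # 25 ms -> 400 samples at the assumed 16 kHz rate
        frame_shift = int(sr * shift_ms / 1000)  # 10 ms -> 160 samples
        if len(y) < frame_len:
            y = np.pad(y, (0, frame_len - len(y)))
        n_frames = 1 + (len(y) - frame_len) // frame_shift
        window = np.hamming(frame_len)           # w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
        frames = np.stack([y[i * frame_shift: i * frame_shift + frame_len]
                           for i in range(n_frames)])
        return frames * window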
S12, determining a first sound channel characteristic, a first sound source wave characteristic and a plurality of first related characteristics of the preprocessed target voice.
The first channel characteristic, the first source wave characteristic and the plurality of first correlation characteristics of the target speech are described below, respectively.
S121, determining the first channel characteristics of the preprocessed target voice.
The first channel features include the amplitude-frequency characteristic of a channel filter and the amplitude-frequency characteristic of its inverse filter. The channel filter parameters of each frame of the preprocessed target voice can be predicted using linear predictive coding (LPC), and the amplitude-frequency characteristic of the channel filter and the amplitude-frequency characteristic of the inverse filter of each frame are then computed from those filter parameters.
In the embodiment of the present application, the zeros and poles of each frame's channel filter under the z-transform can be computed from the linear predictive coding (LPC) coefficients, and the amplitude-frequency characteristic of the channel filter is determined from those zeros and poles. Further, for each frame of the target voice, the zeros and poles of the channel filter are interchanged to obtain the channel inverse filter, and the amplitude-frequency characteristic of the inverse filter is computed from the interchanged zeros and poles.
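One possible sketch of step S121, assuming librosa and scipy are available: librosa.lpc estimates the LPC coefficients A(z) of a windowed frame, and scipy.signal.freqz gives the amplitude-frequency responses of the channel filter 1/A(z) and its inverse filter A(z). The LPC order and the number of frequency points are illustrative choices, not values taken from the patent.

    import numpy as np
    import librosa
    from scipy.signal import freqz

    def vocal_tract_features(frame: np.ndarray, order: int = 16, n_freq: int = 257):
        """Amplitude-frequency responses of the channel filter 1/A(z) and the
        inverse filter A(z) for one windowed frame (order and n_freq are assumed)."""
        a = librosa.lpc(frame, order=order)      # A(z) coefficients, a[0] == 1
        _, h_inv = freqz(a, 1, worN=n_freq)      # inverse filter A(z)
        _, h_tract = freqz(1, a, worN=n_freq)    # channel (vocal tract) filter 1/A(z)
        return np.abs(h_tract), np.abs(h_inv)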
S122, determining a first sound source wave feature of the preprocessed target voice; the first sound source wave feature is obtained by multiplying the amplitude-frequency characteristic of the inverse filter with the original voice in the frequency domain, thereby filtering out the channel (vocal tract) contribution.
In the embodiment of the application, the short-time Fourier characteristic of each frame of the target voice is acquired; and carrying out inverse filtering on the short-time Fourier features through an inverse filter to obtain first sound source wave features.
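Continuing the sketch, the inverse filtering of step S122 could be approximated by multiplying a frame's short-time Fourier magnitude with the inverse-filter amplitude response inv_mag obtained above; the FFT length is an assumption chosen to match the number of frequency points.

    import numpy as np

    def source_wave_feature(frame: np.ndarray, inv_mag: np.ndarray) -> np.ndarray:
        """Suppress the vocal tract contribution by weighting the frame's short-time
        Fourier magnitude with the inverse-filter amplitude-frequency response."""
        n_fft = 2 * (len(inv_mag) - 1)                 # e.g. 512 when inv_mag has 257 points
        spec = np.abs(np.fft.rfft(frame, n=n_fft))     # short-time Fourier magnitude
        return spec * inv_mag                          # frequency-domain inverse filtering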
S123, determining a plurality of first relevant characteristics of the preprocessed target voice, wherein the first relevant characteristics comprise a Mel cepstrum coefficient, a fundamental frequency characteristic, a fundamental frequency perturbation characteristic and an amplitude perturbation characteristic.
The mel-frequency cepstral coefficients are obtained by filtering the short-time Fourier spectrum of each frame of the target voice with a mel filter bank.
Specifically, each frame of the target voice is windowed and analyzed in the time-frequency domain by the short-time Fourier transform, and the log-magnitude spectrum is then taken to obtain a spectrogram. The short-time Fourier transform is defined as:
STFT(t, f) = ∫ x(τ)·h(τ - t)·e^(-j2πfτ) dτ
where x(τ) is a single-frame speech signal, h(τ - t) is the analysis window function, and τ is the time offset.
The spectrogram is filtered through an 80-dimensional mel filter bank to obtain mel spectral features.
The mel filter bank models the human auditory perception system, whose sensitivity differs for signals of different frequencies. The mel filters are not uniformly distributed along the frequency axis: they are densely spaced in the low-frequency region and sparsely spaced in the high-frequency region. The mel filter bank can therefore simulate the nonlinear perception of sound by the human ear, with finer resolution at lower frequencies and coarser resolution at higher frequencies. The conversion formula between the mel frequency f_mel and the linear frequency f is:
f_mel = 2595·log10(1 + f/700)
according to the formula, a Mel filter group can be calculated, and each filter in the filter group is multiplied and accumulated with a short time Fourier Spectrum (STFT) in sequence, so that the characteristics of the Mel filter group are obtained. And carrying out cepstrum analysis on the calculated Mel filter group characteristics to obtain final Mel cepstrum (MFCC) characteristics.
Cepstral analysis extracts the envelope of the spectrum, i.e. the formant (vocal tract) characteristics of the voice as represented in its time-frequency features. Cepstral analysis can be performed on the filter bank (fbank) features using the discrete cosine transform (DCT) to obtain the mel-frequency cepstral coefficients MFCC_i:
MFCC_i = Σ_{j=1..N} log(S_j)·cos( π·i·(j - 0.5) / N )
where S_j is the output of the j-th filter in the fbank, MFCC_i is the i-th mel-frequency cepstral coefficient, and N is the number of mel filters.
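A rough sketch of the mel filter bank and DCT pipeline described above, using librosa's mel filter construction and scipy's DCT; the sampling rate, FFT length and number of retained cepstral coefficients are assumptions, while the 80 mel filters follow the description.

    import numpy as np
    import librosa
    from scipy.fftpack import dct

    def mfcc_from_frames(frames: np.ndarray, sr: int = 16000, n_fft: int = 512,
                         n_mels: int = 80, n_mfcc: int = 20) -> np.ndarray:
        """Mel filter bank (80 filters) -> log energies S_j -> DCT, per the formulas above."""
        spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))           # short-time Fourier magnitude
        mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
        fbank = np.maximum(spec @ mel_fb.T, 1e-10)                    # mel filter bank outputs S_j
        return dct(np.log(fbank), type=2, axis=1, norm='ortho')[:, :n_mfcc]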
The fundamental frequency feature F_0 is obtained by computing the pitch period of each frame of the signal.
The fundamental frequency perturbation feature characterizes the aperiodic variation of the fundamental frequency. Fundamental frequency perturbation (jitter) measures the amount of variation between adjacent cycles of the pitch frequency and is expressed as a percentage (%). The jitter is given by:
Jitter = [ (1/(J-1))·Σ_{i=1..J-1} |F_{0,i} - F_{0,i+1}| ] / [ (1/J)·Σ_{i=1..J} F_{0,i} ] × 100%
where F_0 is the fundamental frequency of each frame of speech and J is the number of pitch periods per frame of speech.
The amplitude perturbation feature characterizes the aperiodic variation of the amplitude. Amplitude perturbation (shimmer) measures the amount of change in amplitude between adjacent periods of the speech signal and is expressed in decibels (dB). The shimmer is given by:
Shimmer = (1/(L-1))·Σ_{i=1..L-1} | 20·log10( A_{i+1} / A_i ) |
where A is the peak-to-peak amplitude of each pitch period and L is the number of amplitude values.
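Given per-frame F0 values and per-period peak-to-peak amplitudes, jitter and shimmer could be computed roughly as below; these follow common textbook definitions consistent with the formulas above, and the handling of unvoiced (zero) values is an assumption.

    import numpy as np

    def jitter_percent(f0: np.ndarray) -> float:
        """Relative jitter (%): mean absolute difference of adjacent F0 values over the mean F0."""
        f0 = f0[f0 > 0]                   # keep voiced frames only (assumption)
        if len(f0) < 2:
            return 0.0
        return float(100.0 * np.mean(np.abs(np.diff(f0))) / np.mean(f0))

    def shimmer_db(amps: np.ndarray) -> float:
        """Shimmer (dB): mean absolute 20*log10 ratio of adjacent peak-to-peak amplitudes."""
        amps = amps[amps > 0]
        if len(amps) < 2:
            return 0.0
        return float(np.mean(np.abs(20.0 * np.log10(amps[1:] / amps[:-1]))))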
The first channel feature, the first sound source wave feature and the plurality of first related features all characterize the sound source and vocal tract information of the voice, and this information reflects the differences between the generation processes of fake and real voice.
The embodiment of the present application obtains multiple related features during feature extraction; however, an excess of features contains redundant information and harms both the generalization of the model and its computation speed. The features therefore need to be screened to obtain those most relevant to the voice anti-spoofing task.
S13, determining a first principal component feature based on the first channel feature, the first sound source wave feature and the plurality of first correlation features.
The embodiment of the present application uses principal component analysis (PCA) to screen the first channel feature, the first sound source wave feature and the plurality of first related features, as follows:
s131, splicing the first sound channel characteristic, the first sound source wave characteristic and a plurality of first related characteristics to obtain a first splicing characteristic; the feature number of the splicing features is n.
In the embodiment of the present application, the first stitched feature is obtained by stitching together n features of each frame of the target voice, including the amplitude-frequency characteristic of the channel filter, the amplitude-frequency characteristic of the inverse filter, the fundamental frequency feature, the fundamental frequency perturbation feature, the amplitude perturbation feature, the mel-frequency cepstral coefficients and the first sound source wave feature.
S132, de-centering each feature in the first stitched feature.
In an embodiment of the present application, de-centering consists of subtracting, from each feature in the first stitched feature, the mean of that feature dimension.
S133, calculating a covariance matrix based on the de-centered first stitched feature.
S134, performing eigenvalue decomposition on the covariance matrix to obtain a plurality of first eigenvalues and corresponding first eigenvectors.
S135, selecting the first k eigenvectors to form a first conversion matrix according to the sequence of the eigenvalues from large to small, wherein k is the feature quantity after dimension reduction, and k is less than n.
S136, multiplying the first splicing characteristic by the first conversion matrix to obtain a first principal component characteristic.
The first principal component features are high-correlation features after dimension reduction.
In the embodiment of the present application, principal component analysis projects the high-dimensional features onto a lower-dimensional space while retaining, as far as possible, the components of the high-dimensional features that are most relevant to the task.
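Steps S131 to S136 amount to standard PCA and could be sketched with numpy as follows; features is assumed to be a matrix whose rows are frames and whose columns are the n stitched feature dimensions.

    import numpy as np

    def pca_reduce(features: np.ndarray, k: int):
        """S132-S136: center the stitched features, eigendecompose the covariance,
        keep the top-k eigenvectors as the conversion matrix, and project."""
        centered = features - features.mean(axis=0)   # S132: de-center each feature dimension
        cov = np.cov(centered, rowvar=False)          # S133: covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)        # S134: eigenvalue decomposition
        order = np.argsort(eigvals)[::-1]             # S135: sort eigenvalues, largest first
        transform = eigvecs[:, order[:k]]             # conversion matrix (n x k), k < n
        return centered @ transform, transform        # S136: principal component features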
S14, inputting the first principal component characteristics into a trained classifier, and outputting a classification result, wherein the classification result is fake voice or natural voice.
The voice detection method provided by the embodiment of the application further comprises the step of training the classifier.
Fig. 3 is a flowchart of training the classifier in a voice detection method according to an embodiment of the present application. As shown in fig. 3, the training process of the classifier includes: S31, obtaining labeled training voice from the training set and preprocessing it, where the preprocessing includes pre-emphasis, framing and windowing; S32, determining a second channel feature, a second sound source wave feature and a plurality of second related features of the preprocessed training voice; S33, determining a second principal component feature based on the second channel feature, the second sound source wave feature and the plurality of second related features of the training voice; S34, inputting the second principal component feature into the classifier for iterative training, the trained classifier being obtained when the loss function converges.
The individual steps of training the classifier are described in detail below.
S31, training voice with a label in a training set is obtained, and the training voice is preprocessed, wherein the preprocessing comprises pre-emphasis, framing and windowing.
The embodiments of pre-emphasis, framing and windowing may refer to the embodiment of step S11, and will not be described herein.
S32, determining a second channel feature, a second sound source wave feature and a plurality of second related features of the preprocessed training voice.
The second channel feature, the second sound source wave feature and the plurality of second related features of the training voice are described below.
S321, determining the second channel features of the preprocessed training voice.
The second channel features include the amplitude-frequency characteristic of the channel filter of the training voice and the amplitude-frequency characteristic of its inverse filter. The filter parameters of each frame of the preprocessed training voice can be predicted using linear predictive coding (LPC), and the amplitude-frequency characteristic of the channel filter and of the inverse filter of each frame of training voice are then computed from those parameters.
The relation and calculation manner between the amplitude-frequency characteristic of the inverse filter and the amplitude-frequency characteristic of the filter of each frame of training speech can refer to step S121, and will not be described here again.
S322, determining the second sound source wave feature of the preprocessed training voice; the second sound source wave feature is obtained by multiplying the amplitude-frequency characteristic of the inverse filter of the training voice with the original voice in the frequency domain, thereby filtering out the channel contribution.
S323, determining a plurality of second correlation features of the preprocessed training voice, wherein the second correlation features comprise a Mel cepstrum coefficient, a fundamental frequency feature, a fundamental frequency perturbation feature and an amplitude perturbation feature.
Various relevant features and calculation manners of each frame of the training speech can refer to step S123, which is not described herein.
The second channel feature, the second sound source wave feature and the plurality of second related features characterize the sound source and vocal tract information of the training voice, and this information reflects the differences between fake and real voice in the generation process.
Likewise, the embodiment of the present application obtains multiple source and channel related features during feature extraction of the training voice. An excess of features contains redundant information and harms the generalization of the model and its computation speed, so the features need to be screened to obtain those most relevant to the voice anti-spoofing task.
S33, determining a second principal component feature based on the second channel feature, the second sound source wave feature and the plurality of second related features.
The embodiment of the present application uses principal component analysis (PCA) to screen the second channel feature, the second sound source wave feature and the plurality of second related features, as follows:
s331, splicing the second sound source feature, the second sound source feature and various related features to obtain a second spliced feature; the number of features of the second stitching feature is n.
In the embodiment of the present application, the second stitched feature is obtained by stitching together n features of each frame of training voice, including the amplitude-frequency characteristic of the channel filter, the amplitude-frequency characteristic of the inverse filter, the fundamental frequency feature, the fundamental frequency perturbation feature, the amplitude perturbation feature, the mel-frequency cepstral coefficients and the second sound source wave feature.
S332, de-centering each feature in the second stitched feature.
In an embodiment of the present application, de-centering consists of subtracting, from each feature in the second stitched feature, the mean of that feature dimension.
S333, calculating a covariance matrix based on the de-centered second stitched feature.
And S334, performing eigenvalue decomposition on the covariance matrix to obtain a plurality of second eigenvalues and corresponding second eigenvectors.
S335, selecting the first k second eigenvectors to form a second conversion matrix according to the sequence of the second eigenvalues from large to small, wherein k is the dimension number after dimension reduction, and k is less than n.
S336, multiplying the second stitched feature by the second conversion matrix to obtain the second principal component feature. The second principal component features are the high-correlation features after dimension reduction.
S34, inputting the second principal component characteristics into the classifier, performing iterative training, and obtaining the trained classifier under the condition that the loss function converges.
In an embodiment of the present application, the classifier uses a residual network (ResNet) with squeeze-and-excitation modules (SE blocks) as its hidden-layer feature extractor.
The residual network is a variant of the convolutional neural network (CNN) that solves the degradation problem caused by deepening the network by adding residual connections between convolutional layers. Each residual module typically contains a multi-layer structure, which may be composed of layers of any neural network, giving the residual network good scalability. After the input x of a residual module passes through the forward computation F(x), the original input, which has not gone through the forward computation, is added to the output, forming a shortcut connection. This computation ensures that even if the forward branch outputs 0, the residual module simply performs an identity mapping, so network performance does not degrade. The residual module therefore lets the convolutional neural network deepen while avoiding the degradation problem, so that the deep neural network achieves better performance.
The squeeze-and-excitation module is an extension module for convolutional neural networks that explicitly models the interdependence between feature channels and improves system performance from the channel perspective. The squeeze-and-excitation module first squeezes the input feature map, compressing the spatial dimensions through a global average pooling layer so that each two-dimensional feature channel becomes a single real number with a global receptive field. The resulting one-dimensional channel descriptor is then excited: a weight is generated for each feature channel by learned parameters, explicitly modeling the correlations between the feature channels. Finally, the computed weights are applied back to the original two-dimensional channel features, completing the recalibration of the original features in the channel dimension.
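A minimal PyTorch sketch of a squeeze-and-excitation block as described above; the reduction ratio of 16 and the module name are assumptions, not values given in the text.

    import torch
    import torch.nn as nn

    class SEBlock(nn.Module):
        """Squeeze spatial dimensions by global average pooling, excite with a small
        bottleneck MLP, then re-weight (recalibrate) the feature channels."""
        def __init__(self, channels: int, reduction: int = 16):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
                nn.Sigmoid(),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, c, _, _ = x.shape
            w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)  # per-channel weights
            return x * w                                           # channel recalibration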
The hidden-layer features extracted by the hidden-layer feature extractor are passed through a linear layer for binary classification, yielding the final fake voice detection result.
In the embodiment of the present application, the classifier is trained with the Adam optimizer and the additive angular margin loss (Additive Angular Margin Softmax, AAM-Softmax). The AAM-Softmax loss is computed as:
L_AAM = -(1/n)·Σ_{i=1..n} log[ e^(s·cos(θ_{y_i} + m)) / ( e^(s·cos(θ_{y_i} + m)) + Σ_{j≠y_i} e^(s·cos θ_j) ) ]
where m is the angular margin, with m = 0.2, and s = 40 is the scale parameter; θ_j is the angle between the input embedding f_i and the j-th column of the weight matrix W; y_i is the label carried by the i-th utterance; c is the number of classes, here 2; and n is the number of utterances in the training set. The optimizer parameters are β1 = 0.9, β2 = 0.999, ε = 10^-8, and the weight decay is 10^-4.
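A simplified PyTorch sketch of the AAM-Softmax loss with the stated m = 0.2 and s = 40; it omits the numerical-stability and easy-margin refinements found in fuller implementations, and the class name is an assumption.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AAMSoftmaxLoss(nn.Module):
        """Additive angular margin softmax: add margin m to the target-class angle,
        scale by s, then apply cross-entropy."""
        def __init__(self, embed_dim: int, n_classes: int = 2, m: float = 0.2, s: float = 40.0):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(n_classes, embed_dim))
            self.m, self.s = m, s

        def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
            cos = F.linear(F.normalize(emb), F.normalize(self.weight))   # cos(theta_j)
            theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
            one_hot = F.one_hot(labels, cos.size(1)).float()
            logits = self.s * torch.cos(theta + self.m * one_hot)        # margin on target class only
            return F.cross_entropy(logits, labels)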
The voice detection method provided by the embodiment of the present application can be deployed on an electronic device configured with a PyTorch runtime environment.
The voice detection method provided by the embodiment of the present application separates sound source features from channel features by inverse filtering and exploits the difference between the generation mechanisms of fake and natural voice, so that the classifier can use this distinctive information. The specific features are used fully without introducing excessive data preprocessing steps, which keeps resource consumption low and saves training time.
The voice detection method provided by the embodiment of the application introduces various sound source and sound channel related characteristics to strengthen the generalization capability of the model.
According to the voice detection method provided by the embodiment of the application, principal component analysis is introduced in the process of feature stitching to realize dimension reduction, and high-correlation features with higher contribution to tasks are selected. Redundancy is reduced to achieve an improvement in efficiency.
The voice detection method provided by the embodiment of the present application has a simple structure and good portability; the classifier used may be replaced by any other binary-classification convolutional neural network.
The embodiment of the application provides electronic equipment, which comprises: at least one memory for storing a program; at least one processor for executing the program stored in the memory, the processor for executing the voice detection method when the program stored in the memory is executed.
The embodiment of the application provides a computer storage medium, wherein instructions are stored in the computer storage medium, and when the instructions run on a computer, the instructions cause the computer to execute the voice detection method.
It is to be appreciated that the processor in embodiments of the application may be a central processing unit (central processing unit, CPU), other general purpose processor, digital signal processor (digital signal processor, DSP), application specific integrated circuit (application specific integrated circuit, ASIC), field programmable gate array (field programmable gate array, FPGA) or other programmable logic device, transistor logic device, hardware components, or any combination thereof. The general purpose processor may be a microprocessor, but in the alternative, it may be any conventional processor.
The method steps in the embodiments of the present application may be implemented by hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules that may be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present application, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in or transmitted across a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital Subscriber Line (DSL)), or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
It will be appreciated that the various numerical numbers referred to in the embodiments of the present application are merely for ease of description and are not intended to limit the scope of the embodiments of the present application.

Claims (10)

1. A method of voice detection, the method comprising:
obtaining target voice, and preprocessing the target voice, wherein the preprocessing comprises pre-emphasis, framing and windowing;
determining a first channel characteristic, a first sound source wave characteristic and a plurality of first related characteristics of the preprocessed target voice;
determining a first principal component feature based on the first channel feature, the first sound source wave feature and the plurality of first related features;
inputting the first principal component characteristics into a trained classifier, and outputting a classification result, wherein the classification result is fake voice or natural voice.
2. The method of claim 1, wherein the first channel characteristics include an amplitude-frequency characteristic of a channel filter and an amplitude-frequency characteristic of an inverse filter, and wherein determining the first channel characteristics of the preprocessed target speech comprises:
predicting the filter parameters of each frame of the target voice after pretreatment by using linear prediction coding;
and calculating and determining the amplitude-frequency characteristic of the sound channel filter and the amplitude-frequency characteristic of the inverse filter according to the filter parameters of each frame.
3. The method of claim 1, wherein the plurality of first correlation features includes a fundamental frequency feature, a fundamental frequency perturbation feature, an amplitude perturbation feature, and a mel-frequency cepstral coefficient.
4. The method of claim 1, wherein said determining the first channel characteristic, the first source wave characteristic, and the plurality of first correlation characteristics of the pre-processed target speech comprises:
acquiring short-time Fourier features of the target voice after framing;
and carrying out inverse filtering on the short-time Fourier features through an inverse filter to obtain first sound source wave features of the preprocessed target voice.
5. The method of claim 1, wherein the determining the first principal component feature based on the first channel feature, the first source wave feature, and the plurality of first correlation features comprises:
splicing the first sound channel characteristic, the first sound source wave characteristic and a plurality of first related characteristics to obtain a first splicing characteristic; the feature quantity of the first splicing features is n;
decentralizing each of the first stitching features; calculating a covariance matrix of the first decentralised stitching feature;
performing eigenvalue decomposition on the covariance matrix to obtain a plurality of eigenvalues and corresponding eigenvectors;
selecting the first k eigenvectors to form a conversion matrix according to the sequence of the eigenvalues from large to small, wherein k is smaller than n;
multiplying the first splicing characteristic by the conversion matrix to obtain the first principal component characteristic.
6. The method according to any one of claims 1-5, further comprising the step of training a classifier:
acquiring training voice of a training set with a label, and preprocessing the training voice, wherein the preprocessing comprises pre-emphasis, framing and windowing;
determining a second channel feature, a second sound source wave feature and a plurality of second related features of the preprocessed training speech;
determining a second principal component feature based on the second channel feature, the second sound source wave feature and the plurality of second related features of the training speech;
and inputting the second principal component characteristics into a classifier, performing iterative training, and obtaining the trained classifier under the condition of convergence of a loss function.
7. The method of claim 6, wherein the plurality of second channel characteristics includes an amplitude-frequency characteristic of a channel filter and an amplitude-frequency characteristic of an inverse filter, and wherein determining the second channel characteristics of the pre-processed training speech comprises:
predicting filter parameters for each frame of the pre-processed training speech using linear predictive coding;
and calculating and determining the amplitude-frequency characteristic of the second channel filter and the amplitude-frequency characteristic of the inverse filter according to the filter parameters of each frame of the training voice.
8. The method of claim 6, wherein the plurality of second correlation features further comprises a fundamental frequency feature, a fundamental frequency perturbation feature, an amplitude perturbation feature, and a mel-frequency cepstral coefficient for each frame of the training speech.
9. The method of claim 6, wherein the determining the second channel feature, the second sound source wave feature, and the plurality of second related features of the pre-processed training speech comprises:
acquiring short-time Fourier features of the training voice after framing;
and carrying out inverse filtering on the short-time Fourier features through the inverse filter to obtain second sound source wave features of the preprocessed training voice.
10. The method of claim 6, wherein the determining the second principal component feature based on the second channel feature, the second sound source wave feature, and the plurality of second related features of the training speech comprises:
splicing the second channel characteristics, the second sound source wave characteristics and a plurality of second correlation characteristics of the training voice to obtain second splicing characteristics; the feature quantity of the second splicing features is n;
decentralizing each of the second stitching features; calculating a covariance matrix of the second decentralised stitching feature;
performing eigenvalue decomposition on the covariance matrix to obtain n eigenvalues and corresponding eigenvectors;
selecting the first k eigenvectors to form a conversion matrix according to the sequence of the eigenvalues from large to small, wherein k is smaller than n;
multiplying the second stitching feature by the transformation matrix to obtain the second principal component feature.
CN202310505872.2A 2023-05-06 2023-05-06 Voice detection method Pending CN116778910A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310505872.2A CN116778910A (en) 2023-05-06 2023-05-06 Voice detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310505872.2A CN116778910A (en) 2023-05-06 2023-05-06 Voice detection method

Publications (1)

Publication Number Publication Date
CN116778910A (en) 2023-09-19

Family

ID=88012343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310505872.2A Pending CN116778910A (en) 2023-05-06 2023-05-06 Voice detection method

Country Status (1)

Country Link
CN (1) CN116778910A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117133295A (en) * 2023-10-24 2023-11-28 清华大学 Fake voice detection method, device and equipment based on brain-like perception and decision
CN117133295B (en) * 2023-10-24 2023-12-29 清华大学 Fake voice detection method, device and equipment based on brain-like perception and decision


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination