CN109872720A - A convolutional-neural-network-based re-recorded speech detection algorithm robust to different scenes - Google Patents

A convolutional-neural-network-based re-recorded speech detection algorithm robust to different scenes

Info

Publication number
CN109872720A
Authority
CN
China
Prior art keywords
frequency
voice
time
neural networks
pooling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910085725.8A
Other languages
Chinese (zh)
Other versions
CN109872720B (en)
Inventor
王泳
赵雅珺
张梦鸽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Polytechnic Normal University
Original Assignee
Guangdong Polytechnic Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Polytechnic Normal University filed Critical Guangdong Polytechnic Normal University
Priority to CN201910085725.8A priority Critical patent/CN109872720B/en
Publication of CN109872720A publication Critical patent/CN109872720A/en
Application granted Critical
Publication of CN109872720B publication Critical patent/CN109872720B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Abstract

The invention discloses a convolutional-neural-network-based re-recorded speech detection algorithm robust to different scenes, and relates in particular to the field of speech detection algorithms. A speech time-frequency map is fed into the algorithm model, which comprises seven layers, each containing one convolutional layer and one pooling layer; the output of each convolutional layer passes through a rectified linear unit, residual connections are added between layers, the final features are extracted by global pooling, and a sigmoid predicts the detection result. The invention uses the time-frequency map as the network's input: compared with feeding raw speech data directly, the time-frequency map gives the characteristic information introduced by the re-recording device a relatively dense distribution, which favors feature extraction by the neural network, thereby speeding up training and improving accuracy.

Description

A convolutional-neural-network-based re-recorded speech detection algorithm robust to different scenes
Technical field
The present invention relates to the field of speech detection algorithms, and more particularly to a convolutional-neural-network-based re-recorded speech detection algorithm robust to different scenes.
Background
Existing research shows that deceptive speech such as voice conversion (VC), speech synthesis (SS) and re-recorded speech can effectively fool automatic speaker verification (ASV) systems and thereby gain fraudulent access. Re-recorded speech in particular can drive an ASV system's false acceptance rate up, posing a serious threat to social security. VC and SS require substantial information about and features of the target speaker, and since the existing algorithms are not yet fully mature, their cost and difficulty of implementation are relatively high. Re-recorded speech, by contrast, is easily obtained with cheap recording equipment and essentially contains all the features of the target speaker's voice; it is therefore a greater threat than VC and SS, and its detection deserves attention.
ASV (automatic speaker verification) systems are used more and more in practice, for example in access control, telephone banking and military applications. Because speaker verification requires no face-to-face contact, ASV systems are highly vulnerable to attack by deceptive speech. Deceptive speech generated by audio equipment threatens ASV systems and undermines their security. Over the past decade or so, digital audio products have not only multiplied in variety, but individual products have also integrated ever more, and ever stronger, functions. A PC equipped with audio-processing software, or a relatively inexpensive device with audio-processing capability such as a PDA, can now achieve the same or similar effects. For example, a high-quality, low-cost recording device such as a smartphone can produce deceptive speech that puts ASV systems at risk. Deceptive speech includes replay attacks, voice conversion, speech synthesis and so on. An attacker can use fraudulent speech to forge characteristic data, obtain illegitimate access to a system, and then steal the user's files and private data, causing losses that are hard to repair. Among these attacks, replay is a greater threat than voice conversion and speech synthesis. A replay attack uses speech samples captured from the real target speaker, in the form of continuous pre-recorded samples. A replay-based spoofing attack requires no technical processing of the speech, and the replayed speech shares the same spectrum and high-level features as the real target speaker's voice, making it the easiest type of voice attack to mount. Synthesized and converted speech, by contrast, differ from the real target speaker's voice by certain errors and variations rather than being identical, so detecting replay attacks is harder than detecting synthesized or converted speech.
Summary of the invention
In order to overcome the above drawbacks of the prior art, embodiments of the present invention provide a convolutional-neural-network-based re-recorded speech detection algorithm robust to different scenes. By using the time-frequency map as the network's input, rather than feeding raw speech data directly, the characteristic information introduced by the re-recording device takes on a relatively dense distribution, which favors feature extraction by the neural network, speeding up training and improving accuracy; detection of speech re-recorded with different recording devices, in different recording environments and at different recording distances achieves very high accuracy.
To achieve the above object, the invention provides the following technical scheme: a convolutional-neural-network-based re-recorded speech detection algorithm robust to different scenes, specifically comprising the following steps:
A. The original speech is captured with a recording device and undergoes D/A and A/D conversion, yielding the re-recorded speech;
B. The original speech is distorted during this conversion, and the distortion data of the original speech is computed with a distortion model whose expression is: y(t) = λ·x(αt) + η
where y(t) is the re-recorded speech, x(t) is the original speech, λ is the amplitude transformation factor, α is the time-axis linear stretching factor, and η is superimposed noise;
The corresponding frequency-domain expression is: Y(jω) = (λ/α)·X(jω/α) + N(jω)
where Y(jω), X(jω) and N(jω) are the frequency-domain representations of y(t), x(t) and η respectively; for a fixed recording device these characteristics are highly stable, i.e. λ and α are constants;
C. The re-recorded speech is converted into a speech time-frequency map by the short-time Fourier transform;
D. The speech time-frequency map is fed into the algorithm model, which comprises seven layers, each containing one convolutional layer and one pooling layer; the output of each convolutional layer passes through a rectified linear unit, residual connections are added between layers, the final features are extracted by global pooling, and a sigmoid predicts the detection result.
In a preferred embodiment, when the re-recorded speech is transformed, the short-time Fourier transform uses a 126-point Hanning (hanning) window with a hop of 50, and the time-frequency map has size 64x62; a sketch of this transform follows.
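A minimal sketch of this transform, assuming scipy and the 16 kHz sampling rate stated in Embodiment 3 below; the random array merely stands in for a real 0.2-second speech segment:

import numpy as np
from scipy.signal import stft

fs = 16000                                  # sampling rate taken from Embodiment 3
x = np.random.randn(int(0.2 * fs))          # stand-in for a 0.2 s speech segment

# 126-point Hanning window, hop 50 (overlap 126 - 50), no edge padding
f, t, Z = stft(x, fs=fs, window="hann", nperseg=126,
               noverlap=126 - 50, boundary=None, padded=False)
spectrogram = np.abs(Z)
print(spectrogram.shape)                    # (64, 62): 64 frequency bins x 62 frames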
In a preferred embodiment, the algorithm model convolves along the frequency dimension and pools along the time dimension, specifically using a 3x1 convolution kernel and 1x2 pooling, which agrees with the feature-distribution characteristics of the time-frequency map: the features of the speech time-frequency map are independent between adjacent speech frames yet consistent within a specific frequency band.
In a preferred embodiment, the algorithm model uses deep learning as its data-driven technique.
In a preferred embodiment, the re-recording device introduces variations in the frequency domain of the original sound signal, and the deep learning model derives the network's input data from the original audio signal.
In a preferred embodiment, when the algorithm model convolves along the frequency dimension it does not consider correlation along the time dimension, and while convolving along the frequency dimension it simultaneously pools along the time dimension.
In a preferred embodiment, the convolution kernels share parameters, so the characteristic information of identically distributed devices along the time dimension repeatedly trains the kernel parameters; the pooling layer uses 1x2 pooling of the time dimension, with no pooling of the frequency dimension.
Technical effects and advantages of the invention:
1. The invention uses the time-frequency map as the network's input. Compared with feeding raw speech data directly, the time-frequency map gives the characteristic information introduced by the re-recording device a relatively dense distribution, which favors feature extraction by the neural network, speeding up training and improving accuracy;
2. The invention convolves along the frequency dimension and pools along the time dimension, specifically using a 3x1 convolution kernel and 1x2 pooling. Convolving only along the frequency dimension, without considering correlation along the time dimension, greatly reduces the number of kernel parameters, so the model resists over-fitting more strongly and depends less heavily on data volume; meanwhile, because the kernels share parameters during training, the characteristic information of identically distributed devices along the time dimension repeatedly trains the kernel parameters, making training more thorough;
3. Unlike traditional machine-learning methods, the invention does not need one or more specific features to be chosen manually and then classified with a separate classifier; it spontaneously extracts the relevant features, including shallow edge features and deep features, and then classifies, simplifying the whole pipeline and achieving better results;
4. The algorithm of the invention detects speech re-recorded with different recording devices, in different recording environments and at different recording distances with very high accuracy.
Description of the drawings
Fig. 1 is a schematic diagram of the algorithm model structure of the invention.
Fig. 2 is a schematic diagram of the speech re-recording process of the invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will now be described clearly and completely with reference to the drawings. The described embodiments are plainly only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Embodiment 1
Fig. 1 shows a convolutional-neural-network-based re-recorded speech detection algorithm robust to different scenes. The algorithm model has 7 layers in total, each containing one convolutional layer and one pooling layer; the output of each convolutional layer passes through a rectified linear unit, residual connections are added between layers, the final features are extracted by global pooling, and a sigmoid predicts the detection result. Convolution along the frequency dimension with pooling along the time dimension, specifically a 3x1 convolution kernel with 1x2 pooling, minimizes the model capacity, greatly reducing the risk of over-fitting and the model's dependence on data volume, while agreeing closely with the feature-distribution characteristics of the time-frequency map, so the training parameters are allocated more reasonably and a more compact set of parameters is trained on more effective features;
The speech time-frequency map is generated by the short-time Fourier transform. Compared with feeding raw speech data directly, the time-frequency map gives the characteristic information introduced by the re-recording device a relatively dense distribution, which favors feature extraction by the neural network, speeding up training and improving accuracy. The re-recording device introduces variations in the frequency domain of the original sound signal, and the performance of a deep learning model depends heavily on its data; with the raw audio signal as the network's input, the feature distribution is too sparse, greatly increasing the difficulty of extracting effective features. A sketch of the architecture follows;
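The following is a minimal PyTorch sketch of the architecture just described, offered as an illustration rather than the patented implementation: the patent does not publish channel counts, padding or the exact residual wiring, so the 32 channels, the frequency padding of 1 and the channel-expanding stem here are assumptions.

import torch
import torch.nn as nn

class RerecordDetector(nn.Module):
    # 7 layers of conv(3x1) + pool(1x2), ReLU, residual additions,
    # global pooling over time, sigmoid output
    def __init__(self, channels=32, layers=7, freq_bins=64):
        super().__init__()
        # stem expands 1 -> channels so the residual additions type-check
        self.stem = nn.Conv2d(1, channels, kernel_size=(3, 1), padding=(1, 0))
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0))
            for _ in range(layers))
        self.pool = nn.MaxPool2d(kernel_size=(1, 2))  # pools the time axis only
        self.relu = nn.ReLU()
        self.fc = nn.Linear(channels * freq_bins, 1)

    def forward(self, x):                    # x: (batch, 1, 64 freq, 62 time)
        x = self.relu(self.stem(x))
        for conv in self.convs:
            x = x + self.relu(conv(x))       # residual connection between layers
            if x.size(-1) > 1:               # 1x2 pooling halves the time axis
                x = self.pool(x)
        x = x.mean(dim=-1).flatten(1)        # global pooling over the remaining time axis
        return torch.sigmoid(self.fc(x))     # probability that the input is re-recorded

model = RerecordDetector()
prob = model(torch.randn(8, 1, 64, 62))      # -> shape (8, 1)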
Embodiment 2
Fig. 2 shows a convolutional-neural-network-based re-recorded speech detection algorithm robust to different scenes. Re-recording distorts the speech data to some degree, including amplitude distortion and linear stretching along the time axis, where the distortion model's expression is: y(t) = λ·x(αt) + η
where y(t) is the re-recorded speech, x(t) is the original speech, λ is the amplitude transformation factor, α is the time-axis linear stretching factor, and η is superimposed noise;
The corresponding frequency-domain expression is: Y(jω) = (λ/α)·X(jω/α) + N(jω)
where Y(jω), X(jω) and N(jω) are the frequency-domain representations of y(t), x(t) and η respectively; for a fixed recording device these characteristics are highly stable, i.e. λ and α are constants. A simulation sketch of this model follows;
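A small numpy sketch of this distortion model; the values of λ, α and the noise level below are purely illustrative assumptions:

import numpy as np

def simulate_rerecording(x, fs, lam=0.8, alpha=1.01, noise_std=0.005):
    # y(t) = lam * x(alpha * t) + eta, evaluated on the same sample grid as x
    t = np.arange(len(x)) / fs               # output sample instants
    y = lam * np.interp(alpha * t, t, x, left=0.0, right=0.0)
    return y + np.random.normal(0.0, noise_std, len(x))

fs = 16000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)   # 1 s, 440 Hz test tone
y = simulate_rerecording(x, fs)              # amplitude-scaled, time-stretched, noisy copy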
Embodiment 3
In this embodiment, 0.2-second speech segments serve as the experimental data; the short-time Fourier transform uses a 126-point Hanning (hanning) window with a hop of 50, and the time-frequency map has size 64x62;
Further, in the above technical solution, convolution is carried out along the frequency dimension while pooling is carried out along the time dimension. Convolving only along the frequency dimension, without considering correlation along the time dimension, greatly reduces the number of kernel parameters, so the model resists over-fitting more strongly and depends less heavily on data volume; meanwhile, because the kernels share parameters during training, the characteristic information of identically distributed devices along the time dimension repeatedly trains the kernel parameters, making training more thorough. The pooling layer uses 1x2 pooling of the time dimension, with no pooling of the frequency dimension. Pooling reduces feature dimensionality, speeds up network computation, and makes the network structure more robust to stretching and deformation of the data features; since the feature distribution of a time-frequency map exhibits no such stretching or deformation, pooling only along the time dimension both reduces the feature dimensionality and avoids losing frequency-dimension features. Through multiple layers of convolution and pooling, the features eventually become one-dimensional, with length equal to the frequency size of the time-frequency map, as the trace below illustrates;
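The arithmetic behind that last claim can be traced in a few lines (illustrative only; floor division stands in for a 1x2 max pool):

time_frames = 62                             # time axis of the 64x62 time-frequency map
for layer in range(1, 8):
    time_frames = max(1, time_frames // 2)   # each 1x2 pool halves the time axis
    print(f"layer {layer}: freq=64, time={time_frames}")
# the time axis shrinks 31, 15, 7, 3, 1, ... while all 64 frequency bins survive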
Further, in the above technical solution, the original speech library consists of 30000 speech segments recorded by 60 people in total, with a sampling frequency of 16 kHz and a quantization precision of 16 bits;
The speech of 10 randomly selected speakers serves as the test data and the speech of the remaining 50 people for training, guaranteeing the independence of the training and test data and preventing recordings of the same speaker from appearing in different data sets;
The specific recording procedure is as follows: for the training set, the original speech library was re-recorded 4 times in a quiet environment with different combinations of distance and equipment, yielding 4 re-recorded speech libraries, each containing 25000 speech segments. A total of 25000 segments were randomly extracted from the 4 libraries as negative samples, which together with the original speech constitute a training data set of 50000 segments in total. The original speech was played through a Lenovo Y40-70AT-IFI laptop; the re-recording devices were a Dell Inspiron 14 (Ins14VD-258) laptop and a Xiaomi 2S smartphone;
The 4 recording configurations are shown in Table 1:
Table 1: Recorded speech
For the test data, the same recording settings as in Table 1 were used. To verify the model's robustness to random environmental noise, recordings were made both in a quiet environment and in an environment with a certain amount of random noise; the test set comprises 4 speech libraries in total, each containing 10000 test utterances recorded under that library's configuration in the quiet and noise-bearing environments;
Further, in the above technical solution, the network's error function is the cross-entropy loss function, trained with the Adam optimization algorithm. The initial learning rate is set to 0.001 and adjusted dynamically during training, halving after every 10000 training steps, with a batch size of 32. To monitor the training effect during training, 2000 samples are randomly chosen from the training data for validation and the training-data loss is compared against the validation-data loss; adding a regularization term to the loss function with a regularization coefficient of 0.0001 effectively prevents over-fitting. A training-loop sketch follows;
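A minimal PyTorch training-loop sketch under these settings; train_loader is assumed to yield (spectrogram, label) batches of size 32, RerecordDetector is the architecture sketch from Embodiment 1, the Adam β1 and β2 are left at their common defaults since the Table 2 values are not reproduced here, and weight decay stands in for the regularization term added to the loss:

import torch
import torch.nn as nn

model = RerecordDetector()                   # architecture sketch from Embodiment 1
criterion = nn.BCELoss()                     # (binary) cross-entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             weight_decay=1e-4)       # regularization coefficient 0.0001
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10000, gamma=0.5)

for step, (spec, label) in enumerate(train_loader):   # batches of 32 assumed
    optimizer.zero_grad()
    loss = criterion(model(spec).squeeze(1), label.float())
    loss.backward()
    optimizer.step()
    scheduler.step()                         # halves the learning rate every 10000 steps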
Table 2 lists some important hyper-parameter settings used during training; under these settings the network converges quickly during training and finally reaches quite high accuracy;
Table 2: Hyper-parameters (β1 and β2 are the Adam optimizer parameters)
Further, in the above technical solution, this embodiment comprises 4 test experiments, carried out with different recording devices and at different recording distances; the results of each experiment are shown in Table 3:
Table 3: Experimental results
The test accuracy reaches 99.8% or higher in all cases, confirming that the experimental model generalizes very well.
Finally: the above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included within its scope of protection.

Claims (7)

1. A convolutional-neural-network-based re-recorded speech detection algorithm robust to different scenes, characterized by specifically comprising the following steps:
A. The original speech is captured with a recording device and undergoes D/A and A/D conversion, yielding the re-recorded speech;
B. The original speech is distorted during this conversion, and the distortion data of the original speech is computed with a distortion model whose expression is: y(t) = λ·x(αt) + η
where y(t) is the re-recorded speech, x(t) is the original speech, λ is the amplitude transformation factor, α is the time-axis linear stretching factor, and η is superimposed noise;
The corresponding frequency-domain expression is: Y(jω) = (λ/α)·X(jω/α) + N(jω)
where Y(jω), X(jω) and N(jω) are the frequency-domain representations of y(t), x(t) and η respectively; for a fixed recording device these characteristics are highly stable, i.e. λ and α are constants;
C. The re-recorded speech is converted into a speech time-frequency map by the short-time Fourier transform;
D. The speech time-frequency map is fed into the algorithm model, which comprises seven layers, each containing one convolutional layer and one pooling layer; the output of each convolutional layer passes through a rectified linear unit, residual connections are added between layers, the final features are extracted by global pooling, and a sigmoid predicts the detection result.
2. The convolutional-neural-network-based re-recorded speech detection algorithm robust to different scenes according to claim 1, characterized in that: when the re-recorded speech is transformed, the short-time Fourier transform uses a 126-point Hanning (hanning) window with a hop of 50, and the time-frequency map has size 64x62.
3. The convolutional-neural-network-based re-recorded speech detection algorithm robust to different scenes according to claim 1, characterized in that: the algorithm model convolves along the frequency dimension and pools along the time dimension, specifically using a 3x1 convolution kernel and 1x2 pooling, which agrees with the feature-distribution characteristics of the time-frequency map, whose features are independent between adjacent speech frames yet consistent within a specific frequency band.
4. The convolutional-neural-network-based re-recorded speech detection algorithm robust to different scenes according to claim 3, characterized in that: the algorithm model uses deep learning as its data-driven technique.
5. The convolutional-neural-network-based re-recorded speech detection algorithm robust to different scenes according to claim 4, characterized in that: the re-recording device introduces variations in the frequency domain of the original sound signal, and the deep learning model derives the network's input data from the original audio signal.
6. The convolutional-neural-network-based re-recorded speech detection algorithm robust to different scenes according to claim 3, characterized in that: when the algorithm model convolves along the frequency dimension it does not consider correlation along the time dimension, and while convolving along the frequency dimension it simultaneously pools along the time dimension.
7. The convolutional-neural-network-based re-recorded speech detection algorithm robust to different scenes according to claim 3, characterized in that: the convolution kernels share parameters, so the characteristic information of identically distributed devices along the time dimension repeatedly trains the kernel parameters; the pooling layer uses 1x2 pooling of the time dimension, with no pooling of the frequency dimension.
CN201910085725.8A 2019-01-29 2019-01-29 Re-recorded voice detection algorithm for different scene robustness based on convolutional neural network Active CN109872720B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910085725.8A CN109872720B (en) 2019-01-29 2019-01-29 Re-recorded voice detection algorithm for different scene robustness based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910085725.8A CN109872720B (en) 2019-01-29 2019-01-29 Re-recorded voice detection algorithm for different scene robustness based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN109872720A true CN109872720A (en) 2019-06-11
CN109872720B CN109872720B (en) 2022-11-22

Family

ID=66918246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910085725.8A Active CN109872720B (en) 2019-01-29 2019-01-29 Re-recorded voice detection algorithm for different scene robustness based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN109872720B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170092297A1 (en) * 2015-09-24 2017-03-30 Google Inc. Voice Activity Detection
US20180068675A1 (en) * 2016-09-07 2018-03-08 Google Inc. Enhanced multi-channel acoustic models
CN108198561A (en) * 2017-12-13 2018-06-22 Re-recorded speech detection method based on convolutional neural networks
CN109065030A (en) * 2018-08-01 2018-12-21 Ambient sound recognition method and system based on convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIANG SHIJUN: "Research on Robust Audio Watermarking", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110211604A (en) * 2019-06-17 2019-09-06 Guangdong Polytechnic Normal University A deep residual network structure for speech deformation detection
CN112614483A (en) * 2019-09-18 2021-04-06 珠海格力电器股份有限公司 Modeling method based on residual convolutional network, voice recognition method and electronic equipment
CN110797031A (en) * 2019-09-19 2020-02-14 厦门快商通科技股份有限公司 Voice change detection method, system, mobile terminal and storage medium
CN110689902A (en) * 2019-12-11 2020-01-14 北京影谱科技股份有限公司 Audio signal time sequence processing method, device and system based on neural network and computer readable storage medium
CN111370028A (en) * 2020-02-17 2020-07-03 厦门快商通科技股份有限公司 Voice distortion detection method and system
CN111916067A (en) * 2020-07-27 2020-11-10 腾讯科技(深圳)有限公司 Training method and device of voice recognition model, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN109872720B (en) 2022-11-22

Similar Documents

Publication Publication Date Title
CN109872720A (en) It is a kind of that speech detection algorithms being rerecorded to different scenes robust based on convolutional neural networks
CN108711436B (en) Speaker verification system replay attack detection method based on high frequency and bottleneck characteristics
CN108922518A (en) voice data amplification method and system
CN108231067A (en) Sound scenery recognition methods based on convolutional neural networks and random forest classification
CN104835498A (en) Voiceprint identification method based on multi-type combination characteristic parameters
CN110111797A (en) Method for distinguishing speek person based on Gauss super vector and deep neural network
CN108198561A (en) A kind of pirate recordings speech detection method based on convolutional neural networks
Monge-Alvarez et al. Audio-cough event detection based on moment theory
CN108831443A (en) A kind of mobile sound pick-up outfit source discrimination based on stacking autoencoder network
CN108091326A (en) A kind of method for recognizing sound-groove and system based on linear regression
CN106531174A (en) Animal sound recognition method based on wavelet packet decomposition and spectrogram features
CN111402922B (en) Audio signal classification method, device, equipment and storage medium based on small samples
Sukhwal et al. Comparative study of different classifiers based speaker recognition system using modified MFCC for noisy environment
Cao et al. Underwater target classification at greater depths using deep neural network with joint multiple‐domain feature
CN110136746B (en) Method for identifying mobile phone source in additive noise environment based on fusion features
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
CN110390937A (en) A kind of across channel method for recognizing sound-groove based on ArcFace loss algorithm
Reimao Synthetic speech detection using deep neural networks
Chen et al. Deep learning in automatic detection of dysphonia: Comparing acoustic features and developing a generalizable framework
CN114863937A (en) Hybrid birdsong identification method based on deep migration learning and XGboost
El‐Dahshan et al. Intelligent methodologies for cardiac sound signals analysis and characterization in cepstrum and time‐scale domains
Zhou et al. Robust sound event classification by using denoising autoencoder
CN111785262A (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN116631406B (en) Identity feature extraction method, equipment and storage medium based on acoustic feature generation
Boujnah et al. Smartphone-captured ear and voice database in degraded conditions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 510665 293 Zhongshan Avenue, Tianhe District, Guangzhou, Guangdong.

Applicant after: Guangdong Polytechnic Normal University

Address before: 510665 293 Zhongshan Avenue, Tianhe District, Guangzhou, Guangdong.

Applicant before: Guangdong Polytechnic Normal University

GR01 Patent grant