CN111755014A - Domain-adaptive replay attack detection method and system

Info

Publication number: CN111755014A
Application number: CN202010630019.XA
Authority: CN
Inventors: 伍强
Original assignee: Sichuan Changhong Electric Co Ltd
Current assignee: Sichuan Changhong Electric Co Ltd
Priority date: 2020-07-02
Filing date: 2020-07-02
Publication date: 2020-10-09
Anticipated expiration: 2040-07-02
Also published as: CN111755014B

Abstract

The invention discloses a domain-adaptive method for detecting recording replay attacks, which comprises the following steps: extracting acoustic features from at least one recording region of the recording; extracting a shared voiceprint feature vector from the acoustic features; and detecting, by a domain-adaptive method, whether the recording is a replayed recording from the shared voiceprint feature vector. The invention keeps the recording replay attack detection system robust despite the domain diversity of replay devices, replay environments, and speakers.

Description

Domain-adaptive replay attack detection method and system

Technical Field

The invention relates to the technical field of voice signal processing, and in particular to a domain-adaptive recording replay attack detection method and system.

Background

In recent years, with the rapid development of artificial intelligence technology, more and more products using artificial intelligence have appeared in people's daily lives, and smart speakers in particular have risen rapidly to prominence. Voiceprint recognition is standard on almost all smart speakers, and a user can log in to an account, pay for purchases, and so on using his or her own voice. Recording replay attack detection is an extremely important part of a voiceprint recognition system: it judges whether a voice comes from a live person or from a recording. Because replay devices, replay environments, and speakers are diverse, this domain diversity degrades the performance of replay attack detection systems.

Disclosure of Invention

The invention provides a domain-adaptive recording replay attack detection method and system, aiming to solve the problem of domain diversity in recording replay attacks. A shared voiceprint feature extraction module is designed; the acoustic features of the voice are input into the shared module to extract shared voiceprint features, which are then input into four sub-classification modules: a replay attack detection module, a replay device detection module, a replay environment detection module, and a replay speaker detection module. The error gradients of the replay attack detection module are fed back directly to the shared voiceprint feature extraction module and to the replay attack detection module. The error gradients of the replay device detection module, the replay environment detection module, and the replay speaker detection module are fed back to their respective modules and, in addition, are inverted and then fed back to the shared voiceprint feature extraction module. In this way the domain adaptivity of the system is enhanced and its replay attack detection capability is improved.

The invention realizes the purpose through the following technical scheme:

A domain-adaptive recording replay attack detection method comprises the following steps:

calculating and extracting acoustic features from at least one recording region of the recording, wherein the acoustic features comprise Mel-frequency cepstral coefficients (MFCC) or power-normalized cepstral coefficients (PNCC);

extracting a shared voiceprint feature vector from the acoustic features;

and detecting, by a domain-adaptive method, whether the recording is a replayed recording from the shared voiceprint feature vector.
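The first step can be illustrated with a minimal MFCC sketch in NumPy. The function names (`mfcc`, `mel_filterbank`), the 16 kHz sample rate, the 25 ms frame and 10 ms hop, the 26 filters, and the 13 coefficients are illustrative assumptions; the patent does not specify any of them, and a production system would more likely use an established audio library.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13):
    """Return an (n_frames, n_ceps) matrix of MFCCs for a 1-D signal."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)           # windowed frames
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)
    # DCT-II basis, applied to the log mel energies of each frame
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_ceps)[:, None] * (2 * n[None, :] + 1) / (2 * n_filters))
    return log_energies @ basis.T

# One second of a 440 Hz tone stands in for a recording region.
signal = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = mfcc(signal)
```

Each row of `feats` is the feature vector of one frame; the matrix as a whole is what the shared voiceprint feature extraction step would consume.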

Further, in a detection phase, the shared voiceprint feature vector is used to detect the corresponding targets of at least one domain-adaptive adversarial task associated with the replay attack detection, the domain-adaptive adversarial tasks comprising: a replay device detection task, a replay environment detection task, and a replay speaker detection task.

Furthermore, the shared voiceprint feature vector is extracted by a shared voiceprint feature module, whether the recording is replayed is detected by a replay attack detection module, the replay device detection task is performed by a replay device detection module, the replay environment detection task is performed by a replay environment detection module, and the replay speaker detection task is performed by a replay speaker detection module.

Further, the shared voiceprint feature module, the replay attack detection module, the replay device detection module, the replay environment detection module, and the replay speaker detection module are each formed of a deep neural network comprising one or a combination of a convolutional neural network (CNN), a recurrent neural network (RNN, LSTM, GRU), and a time-delay neural network (TDNN).

Further, the method also comprises a training method for each module, wherein the weight of the shared voiceprint feature module is $W_f$, the weight of the replay attack detection module is $W_a$, the weight of the replay device detection module is $W_d$, the weight of the replay speaker detection module is $W_s$, and the weight of the replay environment detection module is $W_e$. The training steps of each module are as follows:

S0: inputting the acoustic features of the recording into the shared voiceprint feature module, and extracting the shared voiceprint feature vector;

S1: inputting the shared voiceprint feature vector from S0 into the replay attack detection module, and outputting a classification error $L_a$;

S2: inputting the shared voiceprint feature vector from S0 into the replay device detection module, and outputting a classification error $L_d$;

S3: inputting the shared voiceprint feature vector from S0 into the replay speaker detection module, and outputting a classification error $L_s$;

S4: inputting the shared voiceprint feature vector from S0 into the replay environment detection module, and outputting a classification error $L_e$;

S5: updating each weight by back-propagation at a set learning rate, wherein $\alpha_1$, $\alpha_2$, $\alpha_3$ are the weights of the replay device detection module, the replay speaker detection module, and the replay environment detection module, respectively;

S6: repeating steps S0 to S5 until the modules converge.
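The update formulas of step S5 are not reproduced in this text. One reading consistent with the description (ordinary gradient steps for the four detection modules, and inverted, weighted gradients of the three domain tasks fed back to the shared module) is sketched below. The learning rate symbol $\mu$ is an assumption, as is the exact placement of the weights $\alpha_1$, $\alpha_2$, $\alpha_3$ inside the shared-module update.

```latex
\begin{aligned}
W_a &\leftarrow W_a - \mu \frac{\partial L_a}{\partial W_a}, \\
W_d &\leftarrow W_d - \mu \frac{\partial L_d}{\partial W_d}, \qquad
W_s \leftarrow W_s - \mu \frac{\partial L_s}{\partial W_s}, \qquad
W_e \leftarrow W_e - \mu \frac{\partial L_e}{\partial W_e}, \\
W_f &\leftarrow W_f - \mu \left(
      \frac{\partial L_a}{\partial W_f}
      - \alpha_1 \frac{\partial L_d}{\partial W_f}
      - \alpha_2 \frac{\partial L_s}{\partial W_f}
      - \alpha_3 \frac{\partial L_e}{\partial W_f}
    \right).
\end{aligned}
```

Under this reading the shared module descends on the replay attack error while ascending on the device, speaker, and environment errors, which pushes the shared voiceprint feature vector toward being uninformative about those three domains.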

An embodiment of the invention further provides a domain-adaptive recording replay attack detection system, which comprises the following modules:

an acoustic feature extraction module for extracting acoustic features from at least one recording region of the recording;

a shared voiceprint feature extraction module for extracting a shared voiceprint feature vector from the acoustic features;

a detection module for detecting, from the shared voiceprint feature vector, whether the recording is a replay attack.

Further, the detection module is also used to detect at least one domain-adaptive adversarial task associated with the replay attack.

Further, the shared voiceprint feature extraction module and the detection module each further comprise a deep neural network module.

Further, the system also comprises a training module for training the deep neural network modules in the shared voiceprint feature extraction module and the detection module.

The invention has the beneficial effects that:

The invention solves the problem of performance degradation in recording replay attack detection systems caused by the domain diversity of replay devices, replay environments, and speakers, so that the replay attack detection system remains robust despite that diversity.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1: a schematic diagram of a domain adaptive replay attack detection method;

FIG. 2: a schematic diagram of the training method in a domain-adaptive replay attack detection method;

FIG. 3: a schematic structural diagram of a domain-adaptive replay attack detection system;

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without any inventive step, are within the scope of the present invention.

Example one

A domain-adaptive replay attack detection method proposed by the present invention is described with reference to fig. 1 and 2, where fig. 1 shows a flowchart of the replay attack detection method and fig. 2 shows a training flowchart of the domain-adaptive replay attack detection method.

In step S101, acoustic features are extracted from at least one recording region of the recording, the acoustic features including Mel-frequency cepstral coefficients (MFCC) or power-normalized cepstral coefficients (PNCC).

In step S102, a shared voiceprint feature vector is extracted from the acoustic features extracted in step S101.

In step S103, whether the recording is a replay attack is detected from the shared voiceprint feature vector extracted in step S102. In the detection phase, the shared voiceprint feature vector is also used to detect the corresponding targets of at least one domain-adaptive adversarial task associated with the replay attack detection, the domain-adaptive adversarial tasks including, but not limited to, a replay device detection task, a replay environment detection task, and a replay speaker detection task, and the detection results of all domain-adaptive adversarial tasks are obtained. The shared voiceprint feature vector is extracted by a shared voiceprint feature module, whether the recording is replayed is detected by a replay attack detection module, the replay device detection task is performed by a replay device detection module, the replay environment detection task is performed by a replay environment detection module, and the replay speaker detection task is performed by a replay speaker detection module. The shared voiceprint feature module, the replay attack detection module, the replay device detection module, the replay environment detection module, and the replay speaker detection module are each formed of a deep neural network comprising one or a combination of a convolutional neural network (CNN), a recurrent neural network (RNN, LSTM, GRU), and a time-delay neural network (TDNN).
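The structure described in steps S101 to S103 can be sketched as one shared extractor feeding four classification heads. The sketch below is only an illustration under stated assumptions: it uses single linear layers with mean pooling over frames, random untrained weights, and arbitrary class counts (10 devices, 5 environments, 20 speakers), whereas the patent allows CNN, RNN/LSTM/GRU, or TDNN modules and fixes none of these sizes. The names `shared_voiceprint` and `detect` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

FEAT_DIM, EMB_DIM = 13, 32                  # assumed MFCC and embedding sizes
HEAD_CLASSES = {"attack": 2, "device": 10, "environment": 5, "speaker": 20}

# W_f is the shared module; each head plays the role of W_a, W_d, W_e or W_s.
W_f = rng.normal(0.0, 0.1, (FEAT_DIM, EMB_DIM))
heads = {name: rng.normal(0.0, 0.1, (EMB_DIM, n)) for name, n in HEAD_CLASSES.items()}

def shared_voiceprint(acoustic_features):
    """Map (frames, FEAT_DIM) acoustic features to one utterance-level vector."""
    hidden = np.tanh(acoustic_features @ W_f)
    return hidden.mean(axis=0)              # mean pooling over frames

def detect(acoustic_features):
    """Run every head on the same shared voiceprint feature vector."""
    v = shared_voiceprint(acoustic_features)
    return {name: softmax(v @ W) for name, W in heads.items()}

features = rng.normal(size=(98, FEAT_DIM))  # stand-in for extracted acoustic features
outputs = detect(features)
is_replay = int(np.argmax(outputs["attack"])) == 1   # class 1 = replay (assumed)
```

All four heads read the same vector, which is what lets the domain tasks influence the shared representation during training.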

In addition, the method of the present invention further comprises a training method of a shared voiceprint feature module, a replay attack detection module, a replay device detection module, a replay environment detection module and a replay speaker detection module, as shown in fig. 2.

In step S201, a recording sample in the training set, its ground-truth replay label, and its ground-truth labels for the domain-adaptive adversarial tasks are acquired.

In step S202, with the weight of the shared voiceprint feature module denoted $W_f$, the weight of the replay attack detection module $W_a$, the weight of the replay device detection module $W_d$, the weight of the replay speaker detection module $W_s$, and the weight of the replay environment detection module $W_e$, the acoustic features of the recording are input into the shared voiceprint feature module to extract the shared voiceprint feature vector, which is then input into the replay attack detection module, the replay device detection module, the replay speaker detection module, and the replay environment detection module to obtain the replay detection result and the detection results of the domain-adaptive adversarial tasks.

In step S203, the replay detection result is compared with the ground-truth replay label to obtain the detection error $L_a$.

In step S204, the parameters $W_a$ of the replay attack detection module are updated by back-propagation of $L_a$ at the set learning rate.

In step S205, the detection results of the domain-adaptive adversarial tasks are compared with their respective ground-truth labels to obtain the detection errors $L_d$, $L_s$, $L_e$.

In step S206, the parameters $W_d$, $W_s$, $W_e$ of the domain-adaptive adversarial task detection modules are updated by back-propagation of $L_d$, $L_s$, $L_e$, respectively, at the set learning rate.

In step S207, the parameters $W_f$ of the shared voiceprint feature module are updated by back-propagation, at the set learning rate, of the detection error of the replayed recording together with the inverted detection errors of the domain-adaptive adversarial tasks, where $\alpha_1$, $\alpha_2$, $\alpha_3$ are the weights of the replay device detection module, the replay speaker detection module, and the replay environment detection module, respectively.

In step S208, it is determined whether the modules have converged, whether the number of training iterations has reached the set maximum number of iterations, or whether the module error has reached the set minimum error; if any one of these conditions is satisfied, training is terminated; otherwise, steps S201 to S208 are repeated.
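Steps S201 to S208 can be sketched as a small NumPy training loop with manually computed gradients. This is a sketch under strong assumptions, not the patent's implementation: every module is a single linear layer, the inputs are already utterance-level vectors, the update has the usual gradient-reversal form, and the names `train`, `shared_gradient`, `head_loss_and_grads`, as well as the values of `lr`, `max_iter`, `min_error`, and `tol`, are all hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def head_loss_and_grads(v, W, labels):
    """Cross-entropy of a linear head; returns loss, dL/dW and dL/dv."""
    n = v.shape[0]
    p = softmax(v @ W)
    loss = -np.log(p[np.arange(n), labels] + 1e-12).mean()
    d_logits = p.copy()
    d_logits[np.arange(n), labels] -= 1.0
    d_logits /= n
    return loss, v.T @ d_logits, d_logits @ W.T

def shared_gradient(g_attack, g_domain, alphas):
    """S207: attack gradient plus inverted, weighted domain-task gradients."""
    return g_attack - sum(alphas[k] * g_domain[k] for k in g_domain)

def train(x, labels, n_classes, alphas, lr=0.1, max_iter=200,
          min_error=1e-3, tol=1e-6, emb=8, seed=0):
    rng = np.random.default_rng(seed)
    W_f = rng.normal(0.0, 0.1, (x.shape[1], emb))
    W = {k: rng.normal(0.0, 0.1, (emb, n_classes[k])) for k in labels}
    prev, history = np.inf, []
    for _ in range(max_iter):                       # S208: maximum iterations
        v = x @ W_f                                 # S202: shared feature vectors
        losses, g_f = {}, {}
        for k in labels:                            # S203 / S205: per-task errors
            losses[k], g_W, g_v = head_loss_and_grads(v, W[k], labels[k])
            g_f[k] = x.T @ g_v                      # gradient reaching W_f
            W[k] -= lr * g_W                        # S204 / S206: head updates
        g_attack = g_f.pop("attack")
        W_f -= lr * shared_gradient(g_attack, g_f, alphas)   # S207
        history.append(losses["attack"])
        # S208: stop on minimum error or on convergence of the attack loss
        if losses["attack"] < min_error or abs(prev - losses["attack"]) < tol:
            break
        prev = losses["attack"]
    return W_f, W, history

# Synthetic data: each label depends on a different input dimension.
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 4))
labels = {
    "attack": (x[:, 0] > 0).astype(int),
    "device": (x[:, 1] > 0).astype(int),
    "speaker": (x[:, 2] > 0).astype(int),
    "environment": (x[:, 3] > 0).astype(int),
}
n_classes = {k: 2 for k in labels}
alphas = {"device": 0.1, "speaker": 0.1, "environment": 0.1}
W_f, W, history = train(x, labels, n_classes, alphas)
```

Each head descends on its own error, while `shared_gradient` flips the sign of the three domain-task gradients before they reach the shared weights, so the shared features are trained to help replay detection and to hinder device, speaker, and environment classification.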

The domain-adaptive recording replay attack detection method provided by this embodiment keeps the recording replay attack detection system robust despite the domain diversity of replay devices, replay environments, and speakers.

Example two

A domain-adaptive replay attack detection system proposed by the present invention is described with reference to fig. 3, of which fig. 3 shows the constituent modules. Referring to fig. 3, the system includes an acoustic feature extraction module 301, a shared voiceprint feature extraction module 302, a detection module 303, and a training module 304.

The acoustic feature extraction module 301 extracts acoustic features from at least one recording region, or from the whole recording, in the recording data.

The shared voiceprint feature extraction module 302 extracts a shared voiceprint feature vector from the acoustic features produced by the acoustic feature extraction module 301.

The detection module 303 detects whether the recording is a replayed recording from the shared voiceprint feature vector produced by the shared voiceprint feature extraction module 302. The detection module 303 may also detect, from the shared voiceprint feature vector, at least one domain-adaptive adversarial task associated with the replay attack, the adversarial tasks including a replay device detection task, a replay environment detection task, and a replay speaker detection task, and obtain the detection results of all domain-adaptive adversarial tasks.

The training module 304 is used to train the deep neural network modules in the shared voiceprint feature extraction module 302 and the detection module 303. Replay attack detection and the at least one associated domain-adaptive adversarial task are trained simultaneously, with the training steps as described in the first embodiment above.

The domain-adaptive replay attack detection system provided by this second embodiment keeps the replay attack detection system robust despite the domain diversity of replay devices, replay environments, and speakers.

It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.

The above description covers only specific embodiments of the present invention, but the scope of the present invention is not limited thereto; any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed herein shall be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims. It should be noted that the technical features described in the above embodiments can be combined in any suitable manner where no contradiction arises; to avoid unnecessary repetition, the possible combinations are not described separately. In addition, the various embodiments of the present invention may also be combined in any way, and such combinations should likewise be regarded as part of the disclosure of the present invention as long as they do not depart from its spirit.

Claims

1. A domain-adaptive replay attack detection method is characterized by comprising the following steps:

extracting acoustic features from at least one recording region of the recording;

extracting a shared voiceprint feature vector from the acoustic features; and

detecting, by a domain-adaptive method, whether the recording is a replayed recording from the shared voiceprint feature vector.

2. The domain-adaptive replay attack detection method of claim 1, wherein the acoustic features include Mel-frequency cepstral coefficients or power-normalized cepstral coefficients.

3. The method of claim 1, wherein the shared voiceprint feature vector is used to detect the corresponding targets of at least one domain-adaptive adversarial task associated with replay attack detection, the domain-adaptive adversarial tasks comprising: a replay device detection task, a replay environment detection task, and a replay speaker detection task.

4. The domain-adaptive replay attack detection method of claim 3, wherein the shared voiceprint feature vector is extracted by a shared voiceprint feature extraction module, whether the recording is replayed is detected by a replay attack detection module, the replay device detection task is realized by a replay device detection module, the replay environment detection task is realized by a replay environment detection module, and the replay speaker detection task is realized by a replay speaker detection module.

5. The domain-adaptive replay attack detection method of claim 4, wherein the shared voiceprint feature extraction module, the replay attack detection module, the replay device detection module, the replay environment detection module, and the replay speaker detection module are each formed of a deep neural network, and the deep neural network comprises one or more networks selected from the group consisting of a convolutional neural network, a recurrent neural network, and a time-delay neural network.

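6. The domain-adaptive replay attack detection method of any one of claims 4 to 5, wherein the weight of the shared voiceprint feature module is $W_f$, the weight of the replay attack detection module is $W_a$, the weight of the replay device detection module is $W_d$, the weight of the replay speaker detection module is $W_s$, and the weight of the replay environment detection module is $W_e$, and the training steps of each module are as follows:

S0: inputting the acoustic features of the recording into the shared voiceprint feature module, and extracting the shared voiceprint feature vector;

S1: inputting the shared voiceprint feature vector from S0 into the replay attack detection module, and outputting a classification error $L_a$;

S2: inputting the shared voiceprint feature vector from S0 into the replay device detection module, and outputting a classification error $L_d$;

S3: inputting the shared voiceprint feature vector from S0 into the replay speaker detection module, and outputting a classification error $L_s$;

S4: inputting the shared voiceprint feature vector from S0 into the replay environment detection module, and outputting a classification error $L_e$;

S5: updating each weight by back-propagation at a set learning rate, wherein $\alpha_1$, $\alpha_2$, $\alpha_3$ are the weights of the replay device detection module, the replay speaker detection module, and the replay environment detection module, respectively;

S6: repeating steps S0 to S5 until the modules converge.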

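7. A detection system for the domain-adaptive replay attack detection method of any one of claims 1 to 8, comprising:

an acoustic feature extraction module for extracting acoustic features from at least one recording region of the recording;

a shared voiceprint feature extraction module for extracting a shared voiceprint feature vector from the acoustic features; and

a detection module for detecting, from the shared voiceprint feature vector, whether the recording is a replay attack.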

8. The domain-adaptive replay attack detection system of claim 7, wherein the detection module is further configured to detect at least one domain-adaptive adversarial task associated with a replay attack.

9. The domain-adaptive replay attack detection system of claim 7, wherein the shared voiceprint feature extraction module and the detection module each further comprise a deep neural network module.

10. The domain-adaptive replay attack detection system of any one of claims 7 to 9, further comprising a training module for training the deep neural network modules in the shared voiceprint feature extraction module and the detection module.