CN114023331A - Method, device, equipment and storage medium for detecting performance of voiceprint recognition system - Google Patents


Info

Publication number
CN114023331A
Authority
CN
China
Prior art keywords
voiceprint
attack
character
audios
recognition system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111222370.6A
Other languages
Chinese (zh)
Inventor
汤旭东
吕博良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202111222370.6A priority Critical patent/CN114023331A/en
Publication of CN114023331A publication Critical patent/CN114023331A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application relates to the technical field of biometric identification and information security, and in particular to a method, an apparatus, a device and a storage medium for detecting the performance of a voiceprint recognition system. The method comprises the following steps: acquiring a plurality of character audios from an audio database and splicing them to obtain a first attack voiceprint, wherein each character audio is an audio segment corresponding to a single character; acquiring the voiceprint features of a target user and generating, based on those features, a second attack voiceprint that simulates the target user's voice; sending the first attack voiceprint and the second attack voiceprint to a voiceprint recognition system to obtain the recognition responses of the voiceprint recognition system; and obtaining a performance detection result for the voiceprint recognition system according to those responses. The method can be used to detect the anti-counterfeiting performance of a voiceprint recognition system.

Description

Method, device, equipment and storage medium for detecting performance of voiceprint recognition system
Technical Field
The present application relates to the field of biometric identification and information security technologies, and in particular, to a method, an apparatus, a device, and a storage medium for detecting a performance of a voiceprint recognition system.
Background
At present, voiceprint recognition systems are widely applied in many business scenarios such as internet-finance login and payment, where they verify a user's identity through voiceprint-based authentication to secure transactions. Meanwhile, malicious voiceprint attacks against voiceprint recognition systems are gradually increasing: an attacker impersonates a legitimate user by imitating, recording or synthesizing that user's voiceprint, which seriously compromises the security of the voiceprint recognition system. It is therefore necessary to detect the anti-counterfeiting performance of a voiceprint recognition system and thereby provide a reference for its security.
However, there is currently no method for detecting the anti-counterfeiting performance of a voiceprint recognition system, so such detection is a problem that urgently needs to be solved.
Disclosure of Invention
In view of the above, it is necessary to provide a method, an apparatus, a device and a storage medium for detecting the performance of a voiceprint recognition system, which can detect the anti-counterfeiting performance of the voiceprint recognition system.
In a first aspect, a performance detection method for a voiceprint recognition system is provided, where the method includes:
acquiring a plurality of character audios from an audio database, and splicing the character audios to obtain a first attack voiceprint, wherein the character audios are audio segments corresponding to single characters; acquiring the voiceprint characteristics of a target user, and generating a second attack voiceprint for simulating the voice of the target user based on the voiceprint characteristics; sending the first attack voiceprint and the second attack voiceprint to a voiceprint recognition system to obtain recognition response of the voiceprint recognition system; and acquiring a performance detection result of the voiceprint recognition system according to the recognition response.
In one embodiment, the first attack voiceprint includes a replay voiceprint, and the obtaining a plurality of character audios from an audio database and splicing them to obtain the first attack voiceprint includes: randomly acquiring the character audios from the audio database, and splicing the randomly acquired character audios to obtain the replay voiceprint.
In one embodiment, the first attack voiceprint includes a constructed voiceprint, and the obtaining a plurality of character audios from an audio database and splicing them to obtain the first attack voiceprint includes: acquiring attack text content, wherein the attack text content comprises a plurality of text characters; acquiring from the audio database the character audios respectively corresponding to each text character; and splicing the acquired character audios according to the order of the text characters in the attack text content to obtain the constructed voiceprint.
In one embodiment, the method further comprises: collecting original voiceprints; cleaning the original voiceprint to remove noise in the original voiceprint to obtain a candidate voiceprint; carrying out segmentation processing on the candidate voiceprints to obtain a plurality of character audios; the audio database is constructed based on the plurality of character audios.
In one embodiment, the obtaining the voiceprint features of the target user and generating a second attack voiceprint for simulating the voice of the target user based on the voiceprint features includes: inputting the voiceprint of the target user into a feature-extraction neural network to obtain a voiceprint feature vector of the target user; fusing the voiceprint feature vector with the voiceprint text to obtain a Mel spectrum; and converting the Mel spectrum to obtain the second attack voiceprint.
In one embodiment, inputting the voiceprint of the target user into a feature extraction neural network to obtain a voiceprint feature vector of the target user includes: and segmenting the voiceprint of the target user to obtain a plurality of voiceprint fragments, respectively inputting the voiceprint fragments into a feature extraction neural network to obtain the voiceprint feature vectors corresponding to the voiceprint fragments, and averaging the voiceprint feature vectors corresponding to the voiceprint fragments to obtain the voiceprint feature vector of the target user.
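The averaging step above can be sketched in plain Python; the fixed-length segment embeddings below are hypothetical placeholders for the real outputs of the feature-extraction neural network:

```python
def average_embeddings(segment_embeddings):
    """Average per-segment voiceprint feature vectors into one
    speaker-level feature vector (element-wise mean)."""
    n = len(segment_embeddings)
    dim = len(segment_embeddings[0])
    return [sum(vec[i] for vec in segment_embeddings) / n for i in range(dim)]

# Three hypothetical 4-dimensional segment embeddings
segments = [[1.0, 2.0, 0.0, 4.0],
            [3.0, 2.0, 2.0, 0.0],
            [2.0, 2.0, 4.0, 2.0]]
print(average_embeddings(segments))  # [2.0, 2.0, 2.0, 2.0]
```

Averaging over several segments smooths out per-utterance variation, so the resulting vector characterizes the speaker rather than any single clip.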
In one embodiment, before inputting the voiceprint of the target user into the feature extraction neural network, the method further comprises: acquiring a training sample set, wherein the training sample set comprises a sample voiceprint and a voiceprint label corresponding to the sample voiceprint, and the voiceprint label is used for indicating that the sample voiceprint is a normal voiceprint or a malicious voiceprint; training a classification neural network based on the training sample set, the classification neural network comprising a feature extraction layer; and taking a feature extraction layer included by the classification neural network as the feature extraction neural network.
In one embodiment, the obtaining the performance detection result of the voiceprint recognition system according to the recognition response includes: if the recognition response corresponding to the replay voiceprint is successful, determining that the performance of the voiceprint recognition system is at a first level; if the recognition response corresponding to the replay voiceprint fails and the recognition response corresponding to the constructed voiceprint succeeds, determining that the performance of the voiceprint recognition system is at a second level; if the recognition responses corresponding to the replay voiceprint and the constructed voiceprint both fail and the recognition response corresponding to the second attack voiceprint succeeds, determining that the performance of the voiceprint recognition system is at a third level; and if the recognition responses corresponding to the replay voiceprint, the constructed voiceprint and the second attack voiceprint all fail, determining that the performance of the voiceprint recognition system is at a fourth level. The anti-counterfeiting performance characterized by the first, second, third and fourth levels increases in that order.
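Assuming each recognition response is reduced to a boolean (True meaning the system accepted, i.e. was fooled by, that attack voiceprint), the grading logic can be expressed as a small function:

```python
def anti_spoofing_level(replay_ok, constructed_ok, simulated_ok):
    """Map the recognition responses for the three attack voiceprints to
    an anti-counterfeiting level (1 = weakest, 4 = strongest).
    A True argument means the system accepted that forged voiceprint."""
    if replay_ok:
        return 1          # even a randomly spliced replay voiceprint passes
    if constructed_ok:
        return 2          # replay blocked, but targeted constructed text passes
    if simulated_ok:
        return 3          # only the feature-based simulated voice passes
    return 4              # all three forged voiceprints are rejected
```

The ordering reflects forgery difficulty: a replay voiceprint is the crudest attack, so accepting it indicates the weakest protection, while rejecting all three attacks indicates the strongest.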
In a second aspect, a performance detection apparatus of a voiceprint recognition system is provided, the apparatus comprising:
the first acquisition module is used for acquiring a plurality of character audios from an audio database and splicing the character audios to obtain a first attack voiceprint, wherein the character audios are audio segments corresponding to a single character; the second acquisition module is used for acquiring the voiceprint characteristics of the target user and generating a second attack voiceprint for simulating the voice of the target user based on the voiceprint characteristics; the sending module is used for sending the first attack voiceprint and the second attack voiceprint to a voiceprint recognition system to obtain recognition response of the voiceprint recognition system; and the third acquisition module is used for acquiring the performance detection result of the voiceprint recognition system according to the recognition response.
In one embodiment, the first attack voiceprint includes a replay voiceprint, and the first obtaining module is specifically configured to: and randomly acquiring the character audios from the audio database, and splicing the randomly acquired character audios to obtain the replay voiceprint.
In one embodiment, the first attack voiceprint includes a constructed voiceprint, and the first obtaining module is specifically configured to: acquire attack text content, wherein the attack text content comprises a plurality of text characters; acquire from the audio database the character audios respectively corresponding to the text characters; and splice the acquired character audios according to the order of the text characters in the attack text content to obtain the constructed voiceprint.
In one embodiment, the apparatus further comprises:
the acquisition module is used for acquiring original voiceprints; the cleaning module is used for cleaning the original voiceprint to remove noise in the original voiceprint to obtain a candidate voiceprint; the segmentation module is used for carrying out segmentation processing on the candidate voiceprints to obtain a plurality of character audios; and the building module is used for building the audio database based on the plurality of character audios.
In one embodiment, the second obtaining module is specifically configured to: input the voiceprint of the target user into a feature-extraction neural network to obtain a voiceprint feature vector of the target user; fuse the voiceprint feature vector with the voiceprint text to obtain a Mel spectrum; and convert the Mel spectrum to obtain the second attack voiceprint.
In one embodiment, the second obtaining module is specifically configured to: and segmenting the voiceprint of the target user to obtain a plurality of voiceprint fragments, respectively inputting the voiceprint fragments into a feature extraction neural network to obtain voiceprint feature vectors corresponding to the voiceprint fragments, and averaging the voiceprint feature vectors corresponding to the voiceprint fragments to obtain the voiceprint feature vector of the target user.
In one embodiment, the apparatus further comprises:
a fourth obtaining module, configured to obtain a training sample set, where the training sample set includes a sample voiceprint and a voiceprint label corresponding to the sample voiceprint, and the voiceprint label is used to indicate whether the sample voiceprint is a normal voiceprint or a malicious voiceprint; and a training module, configured to train a classification neural network based on the training sample set, the classification neural network including a feature extraction layer, and to take the feature extraction layer of the classification neural network as the feature-extraction neural network.
In one embodiment, the third obtaining module is specifically configured to: if the recognition response corresponding to the replay voiceprint is successful, determine that the performance of the voiceprint recognition system is at a first level; if the recognition response corresponding to the replay voiceprint fails and the recognition response corresponding to the constructed voiceprint succeeds, determine that the performance of the voiceprint recognition system is at a second level; if the recognition responses corresponding to the replay voiceprint and the constructed voiceprint both fail and the recognition response corresponding to the second attack voiceprint succeeds, determine that the performance of the voiceprint recognition system is at a third level; and if the recognition responses corresponding to the replay voiceprint, the constructed voiceprint and the second attack voiceprint all fail, determine that the performance of the voiceprint recognition system is at a fourth level. The anti-counterfeiting performance characterized by the first, second, third and fourth levels increases in that order.
In a third aspect, there is provided a computer device comprising a memory storing a computer program and a processor implementing the steps of the method of any of the first aspect when the processor executes the computer program.
In a fourth aspect, a computer-readable storage medium is provided, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method according to any one of the above-mentioned first aspects.
According to the performance detection method, apparatus, device and storage medium for a voiceprint recognition system, a plurality of character audios are obtained from an audio database and spliced to obtain a first attack voiceprint, i.e., one forged voiceprint is constructed; the voiceprint features of a target user are acquired and used to generate a second attack voiceprint that simulates the target user's voice, i.e., another forged voiceprint is constructed. Because the character audios used to build the first attack voiceprint are audio segments corresponding to single characters, while the second attack voiceprint is built from the target user's voiceprint features, the two attack voiceprints have different degrees of counterfeiting complexity. Sending these two attack voiceprints to the voiceprint recognition system and obtaining its recognition responses therefore makes it possible to derive a performance detection result from those responses, and thereby to detect the anti-counterfeiting performance and anti-counterfeiting grade of the voiceprint recognition system.
Drawings
Fig. 1 is an application environment diagram of a performance detection method of a voiceprint recognition system according to an embodiment of the present application;
fig. 2 is a flowchart of a performance detection method of a voiceprint recognition system according to an embodiment of the present application;
fig. 3 is a flowchart of constructing an audio database according to an embodiment of the present application;
FIG. 4 is a flowchart for constructing a replay voiceprint according to an embodiment of the present application;
FIG. 5 is a flowchart for constructing a constructed voiceprint according to an embodiment of the present application;
fig. 6 is a schematic diagram of obtaining the feature-extraction neural network according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of constructing a second attack voiceprint according to an embodiment of the present application;
fig. 8 is a schematic diagram of obtaining a voiceprint feature vector according to an embodiment of the present application;
fig. 9 is a schematic diagram of a method for detecting anti-counterfeiting performance of a voiceprint recognition system according to an embodiment of the present application;
fig. 10 is a block diagram of a performance detection apparatus of a voiceprint recognition system according to an embodiment of the present application;
fig. 11 is a block diagram of a performance detection apparatus of a second voiceprint recognition system according to an embodiment of the present application;
fig. 12 is a block diagram of a performance detection apparatus of a third voiceprint recognition system provided in the embodiment of the present application;
fig. 13 is a block diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
With the rapid development of the internet and intelligent devices, voiceprint recognition systems based on machine learning and deep learning are widely applied to a plurality of service scenes such as login and payment of internet financial services, and the voiceprint recognition systems verify the identity of a user by using an identity verification technology based on voiceprint recognition, so that the usability and the safety of the services are greatly improved.
Meanwhile, malicious voiceprint attacks against these application scenarios are gradually increasing: attackers bypass the voiceprint recognition systems guarding key business transactions by imitating, collecting, splicing and synthesizing the voice and voiceprint characteristics of targeted users, which seriously affects the security of voiceprint recognition systems and the health of the voiceprint recognition ecosystem. It is therefore necessary to detect the anti-counterfeiting performance of a voiceprint recognition system and thereby provide a reference for its security.
However, although websites and platforms are gradually bringing the security testing of voiceprint recognition systems into their security management work, the protection methods they adopt differ, and so do their evaluation and protection levels.
The performance detection method for a voiceprint recognition system provided in the embodiments of the present application can be applied in the environment shown in fig. 1. The attack voiceprint construction system 101 is communicatively connected to the voiceprint recognition system 102. The attack voiceprint construction system sends the constructed first attack voiceprint and second attack voiceprint to the voiceprint recognition system; the voiceprint recognition system receives and recognizes them to obtain the corresponding recognition responses; and in subsequent steps, the performance detection result of the voiceprint recognition system is obtained from its recognition responses to the first and second attack voiceprints. The attack voiceprint construction system 101 may be, but is not limited to, a server, a personal computer, a notebook computer, or the like, and the voiceprint recognition system 102 may be, but is not limited to, a server or server cluster, various computer devices, a notebook computer, a smart phone, a tablet computer, or the like.
In the embodiment of the present application, as shown in fig. 2, a flowchart of a performance detection method of a voiceprint recognition system provided in the embodiment of the present application is shown, and the method is applied to the attack voiceprint construction system 101 in fig. 1 as an example for description, and includes the following steps:
step 201, obtaining a plurality of character audios from an audio database, and performing splicing processing on the plurality of character audios to obtain a first attack voiceprint, where the character audios are audio segments corresponding to a single character.
Any continuous audio that can be heard is composed of individual character audios; the content of each character audio is, for example, a text character or a digit. When a character is output in audio form and heard or recognized by a device, each valid character corresponds to one audio segment, i.e., one character audio, and the audio database is composed of many such character audios. A voiceprint is a continuous, playable audio composed of multiple character audios. The voiceprint recognition system recognizes each user's voiceprint, and only after recognition passes can the user proceed to further operations. In a scenario where the voiceprint recognition system recognizes a certain user's voiceprint, that user is the target user, and the voiceprint the user naturally produces is the target user's voiceprint. To test whether the voiceprint recognition system can accurately recognize the target user's voiceprint, i.e., to test the system's anti-counterfeiting capability, a forged voiceprint can be constructed and input to the system for recognition, and the anti-counterfeiting capability judged from the system's recognition result for the forged voiceprint. The first attack voiceprint is such a forged voiceprint: several independent character audios are obtained from the audio database and spliced into one continuous, playable audio that serves as the first attack voiceprint, which is not a voiceprint naturally produced by the target user but one constructed through splicing.
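As a minimal sketch of the splicing step (with character audios represented as plain lists of PCM samples rather than real recordings, and the database contents purely hypothetical):

```python
import random

def splice_attack_voiceprint(audio_db, num_chars, gap_samples=2):
    """Randomly draw character audios from the database and concatenate
    them, separated by short silences, into one playable waveform."""
    chars = random.sample(list(audio_db), num_chars)
    silence = [0] * gap_samples
    waveform = []
    for i, ch in enumerate(chars):
        if i:
            waveform.extend(silence)   # brief pause between characters
        waveform.extend(audio_db[ch])
    return chars, waveform

# Hypothetical database: each digit maps to a tiny sample list
db = {"0": [1, 2], "1": [3, 4], "2": [5, 6]}
chars, wav = splice_attack_voiceprint(db, 2)
```

Random sampling corresponds to the replay-voiceprint variant; drawing characters in the order of a chosen attack text instead would yield the constructed-voiceprint variant.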
Step 202, obtaining a voiceprint feature of the target user, and generating a second attack voiceprint for simulating the voice of the target user based on the voiceprint feature.
The voiceprint of the target user is a continuous audio composed of a plurality of character audios, and the voiceprint features of the target user are feature information extracted from that voiceprint, for example a feature vector. The target user's voiceprint is simulated on the basis of these voiceprint features to obtain a simulated voiceprint that mimics the target user's voice, and this simulated voiceprint is taken as the second attack voiceprint; the second attack voiceprint is composed of character audios similar to those contained in the target user's voiceprint.
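The patent does not specify a concrete synthesis model, so the pipeline it describes (embed the target voice, fuse the embedding with text to form a Mel spectrum, then convert that spectrum to audio) can only be outlined with hypothetical stub functions standing in for the real networks:

```python
def embed_speaker(voiceprint_samples):
    # Stand-in for the feature-extraction network: reduce the raw
    # samples to a tiny fixed-length "embedding" (mean and energy).
    n = len(voiceprint_samples)
    mean = sum(voiceprint_samples) / n
    energy = sum(s * s for s in voiceprint_samples) / n
    return [mean, energy]

def fuse_to_mel(embedding, text):
    # Stand-in for the fusion step: one pseudo Mel frame per character,
    # biased by the speaker embedding.
    return [[ord(c) * 0.01 + embedding[0], embedding[1]] for c in text]

def mel_to_audio(mel_frames):
    # Stand-in for the vocoder: flatten the frames into a sample stream.
    return [v for frame in mel_frames for v in frame]

target = [0.1, -0.2, 0.3, 0.0]
second_attack = mel_to_audio(fuse_to_mel(embed_speaker(target), "pay"))
```

In a real attack-construction system each stub would be a trained network (feature extractor, spectrum generator, vocoder); the sketch only fixes the data flow between the three stages.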
And 203, sending the first attack voiceprint and the second attack voiceprint to a voiceprint recognition system to obtain a recognition response of the voiceprint recognition system.
Clearly, the first attack voiceprint is obtained by splicing while the second attack voiceprint is formed by simulation, so the degree of forgery of the second attack voiceprint is higher than that of the first. The first and second attack voiceprints are each sent to the voiceprint recognition system, which recognizes them and produces a recognition result for each; these recognition results are the recognition responses of the voiceprint recognition system.
And 204, acquiring a performance detection result of the voiceprint recognition system according to the recognition response.
The recognition response indicates whether the voiceprint recognition system accepted the first and second attack voiceprints. If an attack voiceprint is accepted, the system failed to recognize it as a forgery, and its anti-counterfeiting performance needs improvement. The anti-counterfeiting performance grade of the voiceprint recognition system can thus be judged from its recognition responses to the first and second attack voiceprints, which have different degrees of forgery.
In this performance detection method, a plurality of character audios are obtained from an audio database and spliced to obtain a first attack voiceprint, i.e., one forged voiceprint is constructed; the voiceprint features of a target user are acquired and used to generate a second attack voiceprint that simulates the target user's voice, i.e., another forged voiceprint is constructed. Because the character audios used to build the first attack voiceprint are audio segments corresponding to single characters, while the second attack voiceprint is built from the target user's voiceprint features, the two attack voiceprints have different degrees of counterfeiting complexity. Sending these two attack voiceprints to the voiceprint recognition system and obtaining its recognition responses therefore makes it possible to derive a performance detection result from those responses, and thereby to detect the anti-counterfeiting performance and anti-counterfeiting grade of the voiceprint recognition system.
In the embodiment of the present application, as shown in fig. 3, a flowchart for constructing an audio database provided in the embodiment of the present application is shown, where the method further includes the following steps:
step 301, collecting an original voiceprint.
When building the audio database, voiceprint data from different scenes must first be collected; all of the acquired voiceprint data are the original voiceprints. Original voiceprints can be obtained through social engineering, recording of on-site conversations, telephone recordings, recordings of online and offline meetings, recordings of public speeches, downloads from media platforms, phishing-page inducement, and the like; the acquisition equipment can be a mobile-phone microphone, professional recording equipment, a sound-card setup, or other audio-acquisition devices.
Step 302, cleaning the original voiceprint to remove noise in the original voiceprint, so as to obtain a candidate voiceprint.
The scenes in which original voiceprints are acquired are mostly noisy, so each acquired original voiceprint contains a good deal of invalid noise, such as background sound. To obtain clean audio, each acquired original voiceprint must therefore be cleaned. Cleaning removes the invalid noise in the original voiceprint, for example by inverse filtering and denoising, yielding the pure, valid audio corresponding to each original voiceprint; each such pure, valid audio is taken as a candidate voiceprint.
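A production system would use inverse filtering or spectral denoising as described above; purely as an illustrative stand-in, a simple amplitude gate that suppresses samples below an assumed noise floor:

```python
def noise_gate(samples, noise_floor):
    """Crude cleaning step: zero out any sample whose magnitude does not
    exceed the estimated noise floor (placeholder for real denoising)."""
    return [s if abs(s) > noise_floor else 0 for s in samples]

raw = [0.02, 0.5, -0.01, -0.7, 0.03]
clean = noise_gate(raw, noise_floor=0.05)  # [0, 0.5, 0, -0.7, 0]
```

The gate illustrates the goal of the cleaning stage (keep the valid speech, drop the background residue), not the method the patent actually relies on.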
Step 303, performing segmentation processing on the candidate voiceprint to obtain a plurality of character audios.
After obtaining each candidate voiceprint, segmentation processing is performed on it by an automated tool, splitting each candidate voiceprint into single-character audios so as to obtain a plurality of character audios. The automated segmentation may proceed as follows: to segment a candidate voiceprint containing the digits 0-9, denote the waveform function of the candidate voiceprint as f(t); obtain the invalid noise filtered out when the corresponding original voiceprint was cleaned, denote its waveform function as S(t), and denote the peak value of S(t) as S1; then select a time window T containing a target digit, where the target digit can be any of 0-9. When f(t) first exceeds the peak S1 and remains above it for a certain time, the position t0s is taken as the start of the audio of digit 0; when f(t) first falls below S1 and remains below it for a certain time, the position t0e is taken as the end of digit 0. The interval t0s~t0e thus yields the audio clip of digit 0, and by analogy the audio clips of the digits 0-9 can be segmented out.
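The threshold-based segmentation just described can be sketched as follows; the function name and the hold-time parameter are illustrative assumptions, with the noise peak S1 serving as the amplitude threshold:

```python
def segment_characters(f, s_peak, hold=3):
    """Split waveform samples f into per-character clips: a clip starts when
    |f(t)| has stayed above the noise peak s_peak for `hold` samples, and
    ends when it has stayed below s_peak for `hold` samples."""
    segments, start, run = [], None, 0
    for i, v in enumerate(f):
        if start is None:
            run = run + 1 if abs(v) > s_peak else 0
            if run >= hold:                      # held above S1: onset found
                start, run = i - hold + 1, 0
        else:
            run = run + 1 if abs(v) <= s_peak else 0
            if run >= hold:                      # held below S1: offset found
                segments.append(f[start:i - hold + 1])
                start, run = None, 0
    if start is not None:                        # trailing open segment
        segments.append(f[start:])
    return segments
```

Applied to a waveform with two bursts separated by silence, this yields two clips, one per spoken character.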
The audio database is constructed based on the plurality of character audios, step 304.
Each candidate voiceprint is segmented to obtain a plurality of character audios; all character audios are stored in the attack voiceprint construction system as the audio database, from which character audios can be called directly when constructing a voiceprint.
Collecting original voiceprints from a variety of different scenes ensures both the randomness and a sufficient volume of the original voiceprint data; cleaning and segmenting the original voiceprints not only yields a plurality of character audios but also ensures that each character audio is clean, so that subsequent use is not disturbed by invalid noise.
In this embodiment of the present application, as shown in fig. 4, a flowchart for constructing a replay voiceprint provided by an embodiment of the present application is shown, where the first attack voiceprint includes a replay voiceprint, and the obtaining of multiple character audios from an audio database and a splicing process performed on the multiple character audios obtain a first attack voiceprint includes:
step 401, randomly obtaining the multiple character audios from the audio database.
And 402, splicing a plurality of randomly acquired character audios to obtain the replay voiceprint.
The first attack voiceprint can comprise a replay voiceprint, in the process of constructing the replay voiceprint, a plurality of character audios with a preset number are randomly selected from an audio database at first, the selected character audios are directly spliced to form a continuous audio, and the continuous audio is used as the replay voiceprint; wherein, the preset number can be set to different values according to actual conditions.
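The random splicing described above can be sketched as follows; the function name, the list-of-clips database representation, and the seed parameter are illustrative assumptions:

```python
import random

def build_replay_voiceprint(audio_db, count, seed=None):
    """Randomly pick `count` character clips from the database and splice
    their samples into one continuous audio, used as the replay voiceprint."""
    rng = random.Random(seed)
    samples = []
    for _ in range(count):
        samples.extend(rng.choice(audio_db))
    return samples
```

Here `count` plays the role of the preset number, which can be set to different values according to actual conditions.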
Because the replay voiceprint is formed from a plurality of randomly acquired character audios, in a scene where a target user performs voiceprint recognition in the voiceprint recognition system, the character audios it contains are meaningless, so the similarity between the replay voiceprint and the target user's voiceprint is low. The replay voiceprint is sent to the voiceprint recognition system; if the system rejects the replay voiceprint, i.e., recognizes that it differs from the target voiceprint and is a forged voiceprint, the anti-counterfeiting performance of the system is shown to meet at least the low level. The replay voiceprint can therefore be used to detect whether the voiceprint recognition system has the low-level anti-counterfeiting capability.
In this embodiment of the present application, as shown in fig. 5, a flowchart for constructing a structural voiceprint provided in an embodiment of the present application is shown, where the first attack voiceprint includes a structural voiceprint, and the obtaining of a plurality of character audios from an audio database and a splicing process performed on the plurality of character audios obtains the first attack voiceprint, where the method includes:
step 501, obtaining attack literal content, wherein the attack literal content comprises a plurality of literal characters.
In a scene where a target user performs voiceprint recognition in the voiceprint recognition system, the target user utters a voiceprint containing the text content given by the system; this text content comprises a plurality of text characters, which may be words, digits, and the like. This text content is taken as the attack text content, which accordingly comprises a plurality of text characters.
Step 502, the character audios corresponding to the respective character characters are obtained from the audio database.
And selecting character audio frequencies respectively corresponding to the character characters from an audio database according to the character characters in the acquired attack character content to obtain a plurality of character audio frequencies.
Step 503, splicing the acquired multiple character audios according to the arrangement order of the multiple character characters in the attack character content to obtain the structural voiceprint.
The order of the character audios contained in the target user's voiceprint is determined by the text content given by the voiceprint recognition system, so the arrangement order of the text characters in the attack text content is the same as the order of the character audios in the target user's voiceprint. The character audios corresponding to the respective text characters are therefore spliced in that arrangement order to form one continuous audio, and this continuous audio is taken as the constructed voiceprint.
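The order-preserving splicing of steps 501-503 can be sketched as follows; the function name and the character-to-clip mapping are illustrative assumptions:

```python
def build_constructed_voiceprint(audio_db, attack_text):
    """Splice per-character clips in the order the prompt text dictates.
    audio_db maps each text character to its cleaned audio clip (a list
    of samples); attack_text is the text content given by the system."""
    samples = []
    for ch in attack_text:
        samples.extend(audio_db[ch])
    return samples
```

Unlike the replay voiceprint, the clip order here follows the prompted text, so the result is a meaningful utterance.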
Because the order of the character audios in the constructed voiceprint is determined by the text content given by the voiceprint recognition system, in a scene where a target user performs voiceprint recognition, the constructed voiceprint is meaningful compared with the replay voiceprint and thus has a certain similarity to the target user's voiceprint. The constructed voiceprint is sent to the voiceprint recognition system; if the system rejects it, i.e., recognizes that it differs from the target voiceprint and is a forged voiceprint, the anti-counterfeiting performance of the system is shown to meet at least the common level. The constructed voiceprint can therefore be used to detect whether the voiceprint recognition system has the common-level anti-counterfeiting capability.
In the embodiment of the present application, as described above, in the process of obtaining the second attack voiceprint, the voiceprint feature of the target user needs to be obtained first, and optionally, the voiceprint feature of the target user may be obtained by using a feature extraction neural network; referring to fig. 6, which shows a flowchart of acquiring a feature extraction neural network according to an embodiment of the present application, before inputting a voiceprint of a target user into the feature extraction neural network, the method further includes:
step 601, obtaining a training sample set, where the training sample set includes a sample voiceprint and a voiceprint label corresponding to the sample voiceprint, and the voiceprint label is used to indicate that the sample voiceprint is a normal voiceprint or a malicious voiceprint.
Each acquired original voiceprint is taken as a sample voiceprint, and a simple classifier is used to classify each sample voiceprint as a normal voiceprint or a malicious voiceprint; a normal voiceprint is one directly uttered by a user in the scene where the original voiceprint was acquired, while a malicious voiceprint is one that had already been forged before being used in that scene. After classification, each normal voiceprint is bound with a normal voiceprint label indicating that the bound sample voiceprint is normal, and each malicious voiceprint is bound with a malicious voiceprint label indicating that the bound sample voiceprint is malicious. The normal voiceprints with their labels and the malicious voiceprints with their labels together form the training sample set.
Step 602, training a classification neural network based on the training sample set, the classification neural network including a feature extraction layer.
The obtained sample training set is adopted to train a classification neural network, and the trained classification neural network can identify an input target voiceprint so as to judge whether the target voiceprint is a normal voiceprint or a malicious voiceprint; in the classification neural network, a feature extraction layer is provided for extracting feature vectors of input target voiceprints in the identification process.
Step 603, using the feature extraction layer included in the classification neural network as the feature extraction neural network.
And taking a feature extraction layer included in the classification neural network as a feature extraction neural network, wherein the feature extraction neural network is used for outputting a feature vector corresponding to the input target voiceprint.
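As a toy illustration of steps 601-603, the structure below separates a feature layer from a classification head and then reuses only the feature layer as the extractor; the class, its fixed weights, and the label strings are assumptions for illustration, since a real system would learn the weights from the labelled normal/malicious samples:

```python
class ClassificationNet:
    """Minimal classifier: one feature extraction layer followed by a
    normal-vs-malicious head, standing in for the classification network."""
    def __init__(self, feat_weights, head_weights):
        self.feat_weights = feat_weights    # feature extraction layer
        self.head_weights = head_weights    # classification head

    def extract_features(self, x):
        # one linear feature layer with a ReLU non-linearity
        return [max(0.0, sum(w * xi for w, xi in zip(row, x)))
                for row in self.feat_weights]

    def classify(self, x):
        feats = self.extract_features(x)
        score = sum(w * f for w, f in zip(self.head_weights, feats))
        return "malicious" if score > 0.0 else "normal"

def feature_extraction_network(net):
    """Step 603: reuse only the trained feature layer as the extractor."""
    return net.extract_features
```

The same split applies to a deep network: the layers up to the embedding are kept, and the classification head is discarded.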
In the embodiment of the present application, as shown in fig. 7, which shows a flowchart for constructing a second attack voiceprint provided in the embodiment of the present application, acquiring a voiceprint feature of a target user, and generating a second attack voiceprint for simulating a voice of the target user based on the voiceprint feature, where the method includes:
step 701, inputting the voiceprint of the target user into a feature extraction neural network to obtain a voiceprint feature vector of the target user.
And 702, fusing the vocal print characteristic vector and the vocal print characters to obtain a Mel spectrum.
The voiceprint that the target user uses for recognition in the voiceprint recognition system is acquired and input into the feature extraction neural network, so that the voiceprint feature vector corresponding to the target user's voiceprint is obtained; the text corresponding to each character audio contained in the target user's voiceprint, i.e., the target user's voiceprint text, is then acquired, and feature fusion is performed on the voiceprint feature vector and the voiceprint text, so that the corresponding mel spectrum can be obtained.
Step 703, performing a conversion process on the mel spectrum to obtain the second attack voiceprint.
The mel spectrum obtained after the above feature fusion is frequency-domain voiceprint information, so it needs to be converted into time-domain voiceprint information; the time-domain voiceprint information comprises a plurality of character audios and can be played normally, and it is taken as the second attack voiceprint.
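The frequency-to-time conversion of step 703 can be illustrated in stripped-down form with a plain inverse discrete Fourier transform; this is an assumption for illustration only, since a real pipeline would reconstruct audio from the mel spectrum with a method such as Griffin-Lim or a neural vocoder:

```python
import cmath

def inverse_dft(spectrum):
    """Convert frequency-domain bins back into time-domain samples,
    illustrating the final frequency-to-time step of the text."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]
```

A flat spectrum of equal bins, for example, converts back to a single time-domain impulse.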
Because the second attack voiceprint is obtained by fusing the feature vector of the target user's voiceprint with the target user's voiceprint text, in a scene where the target user performs voiceprint recognition in the voiceprint recognition system, it has a higher similarity to the target user's voiceprint. The second attack voiceprint is sent to the voiceprint recognition system; if the system rejects it, i.e., recognizes that it differs from the target voiceprint and is a forged voiceprint, the anti-counterfeiting performance of the system is shown to meet the higher level. The second attack voiceprint can therefore be used to detect whether the voiceprint recognition system has the higher-level anti-counterfeiting capability.
In this embodiment of the present application, as shown in fig. 8, which shows a flowchart for obtaining a voiceprint feature vector provided in this embodiment of the present application, inputting a voiceprint of the target user into a feature extraction neural network to obtain the voiceprint feature vector of the target user, includes:
step 801, performing segmentation processing on the voiceprint of the target user to obtain a plurality of voiceprint fragments.
Step 802, inputting the plurality of voiceprint fragments to a feature extraction neural network respectively to obtain a voiceprint feature vector corresponding to each voiceprint fragment.
The target user's voiceprint can be segmented by the second to obtain a plurality of voiceprint fragments, each comprising one section of audio; the voiceprint fragments are input to the feature extraction neural network in segmentation order, and each fragment yields a corresponding voiceprint feature vector, so that the voiceprint feature vector corresponding to each voiceprint fragment is obtained.
Step 803, the voiceprint feature vectors corresponding to the voiceprint fragments are averaged to obtain the voiceprint feature vector of the target user.
The obtained voiceprint feature vectors corresponding to the voiceprint fragments are averaged to obtain one feature vector, and this feature vector is used as the target user's voiceprint feature vector for subsequent processing.
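Steps 801-803 can be sketched as follows; the function name and parameters are illustrative assumptions, with the extractor passed in as a callable so any feature network can be plugged in:

```python
def average_voiceprint_vector(voiceprint, seconds, sample_rate, extract):
    """Split the voiceprint into fixed-duration fragments, run each through
    the feature extractor, and average the vectors element-wise."""
    step = seconds * sample_rate
    chunks = [voiceprint[i:i + step] for i in range(0, len(voiceprint), step)]
    vectors = [extract(c) for c in chunks]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
```

Averaging over per-second fragments makes the resulting vector less sensitive to any single noisy stretch of the recording.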
In the embodiment of the present application, obtaining the performance detection result of the voiceprint recognition system according to the recognition response includes: if the recognition response corresponding to the replay voiceprint is successful, determining that the performance of the voiceprint recognition system is at a first level; if the recognition response corresponding to the replay voiceprint fails and the recognition response corresponding to the constructed voiceprint succeeds, determining that the performance is at a second level; if the recognition responses corresponding to the replay voiceprint and the constructed voiceprint both fail and the recognition response corresponding to the second attack voiceprint succeeds, determining that the performance is at a third level; if the recognition responses corresponding to the replay voiceprint, the constructed voiceprint, and the second attack voiceprint all fail, determining that the performance is at a fourth level. A successful recognition response here means that the forged voiceprint was accepted, i.e., not identified as a forgery; the performance of the voiceprint recognition system characterized by the first, second, third, and fourth levels increases in sequence.
In a scene where a target user performs voiceprint recognition in the voiceprint recognition system, the attack voiceprint construction system sends the constructed replay voiceprint, constructed voiceprint, and second attack voiceprint to the voiceprint recognition system in batches. First, the replay voiceprint is sent to the recognition server in the voiceprint recognition system; the recognition server recognizes the replay voiceprint and outputs the corresponding recognition response to the attack voiceprint construction system. Upon receiving the recognition response corresponding to the replay voiceprint, the attack voiceprint construction system sends the constructed voiceprint to the recognition server, which recognizes it and outputs the corresponding recognition response. Upon receiving the recognition response corresponding to the constructed voiceprint, the attack voiceprint construction system sends the second attack voiceprint to the recognition server, which recognizes it and outputs the corresponding recognition response. Whenever a recognition response output by the recognition server is successful, the corresponding voiceprint is a forged voiceprint that the recognition server failed to identify.
According to the recognition responses obtained by the attack voiceprint construction system: if the recognition response corresponding to the replay voiceprint is successful, the performance of the voiceprint recognition system is determined to be at the first level; if the recognition response corresponding to the replay voiceprint fails and the recognition response corresponding to the constructed voiceprint succeeds, the performance is at the second level; if the recognition responses corresponding to the replay voiceprint and the constructed voiceprint both fail and the recognition response corresponding to the second attack voiceprint succeeds, the performance is at the third level; if the recognition responses corresponding to all three attack voiceprints fail, the performance is at the fourth level; the performance characterized by the first through fourth levels increases in sequence.
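The grading rule above can be sketched as follows; the function and flag names are illustrative assumptions, with each flag set to True when the system correctly rejected that attack (i.e., that attack's recognition response failed):

```python
def anti_spoof_level(rejected_replay, rejected_constructed, rejected_synthetic):
    """Map the three attack outcomes to the four performance levels."""
    if not rejected_replay:
        return 1          # even a replayed voiceprint passes: lowest level
    if not rejected_constructed:
        return 2          # rejects replay but accepts the constructed voiceprint
    if not rejected_synthetic:
        return 3          # rejects both first attacks, accepts the synthesized one
    return 4              # rejects all three forgeries: highest level
```

Because the attacks are ordered by forgery complexity, the first accepted forgery directly determines the level.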
The forged voiceprints sent to the recognition server have different degrees of similarity to the target user's voiceprint, so the obtained recognition responses reflect the server's ability to identify forgeries of different similarities; the recognition capability of the recognition server, i.e., the anti-counterfeiting performance of the voiceprint recognition system, can therefore be judged from the responses.
In the embodiment of the present application, as shown in fig. 9, a flowchart of an anti-counterfeiting performance detection method for a voiceprint recognition system provided in the embodiment of the present application is shown, and the method includes:
step 901, collecting original voiceprints.
In the attack voiceprint construction system, various audio data are obtained through social engineering, on-site conversation recording, telephone recording, online and offline meeting recording, live speech recording, downloading from media platforms, phishing-page induction, and the like, and these audio data are taken as the original voiceprints; the acquisition equipment can be mobile phone microphones, professional recording equipment, sound cards, and other audio acquisition devices.
And 902, cleaning the original voiceprint to remove noise in the original voiceprint to obtain a candidate voiceprint.
To obtain clean audio, the acquired original voiceprints need to be cleaned; the cleaning removes the invalid noise, such as background noise, from the original voiceprints. The cleaning may remove the invalid noise from each original voiceprint using common filtering methods such as inverse filtering and denoising, yielding a pure effective audio for each original voiceprint, and each pure effective audio is taken as a candidate voiceprint.
And 903, performing segmentation processing on the candidate voiceprints to obtain a plurality of character audios, and constructing an audio database based on the plurality of character audios.
After obtaining each candidate voiceprint, carrying out segmentation processing on each candidate voiceprint through an automatic tool, and segmenting each candidate voiceprint into single character audios so as to obtain a plurality of character audios; and taking a character audio set formed by the plurality of character audios as an audio database.
Step 904, a training sample set is obtained, where the training sample set includes a sample voiceprint and a voiceprint label corresponding to the sample voiceprint, and the voiceprint label is used to indicate that the sample voiceprint is a normal voiceprint or a malicious voiceprint.
Each acquired original voiceprint is taken as a sample voiceprint, and a simple classifier is used to classify each sample voiceprint as a normal voiceprint or a malicious voiceprint; a normal voiceprint is one directly uttered by a user in the scene where the original voiceprint was acquired, while a malicious voiceprint is one that had already been forged before being used in that scene. After classification, each normal voiceprint is bound with a normal voiceprint label indicating that the bound sample voiceprint is normal, and each malicious voiceprint is bound with a malicious voiceprint label indicating that the bound sample voiceprint is malicious. The normal voiceprints with their labels and the malicious voiceprints with their labels together form the training sample set.
Step 905, training a classification neural network based on the training sample set, wherein the classification neural network comprises a feature extraction layer, and the feature extraction layer included in the classification neural network is used as a feature extraction neural network.
Training a classification neural network by adopting the obtained sample training set, wherein the trained classification neural network can identify an input target voiceprint so as to judge whether the target voiceprint is a normal voiceprint or a malicious voiceprint; the classification neural network comprises a feature extraction layer and a feature extraction layer, wherein the feature extraction layer is used for extracting feature vectors of input target voiceprints in the identification process, and the feature extraction layer included in the classification neural network is used as a feature extraction neural network.
Step 906, randomly acquiring a plurality of character audios from the audio database, and splicing the randomly acquired character audios to obtain a replay voiceprint.
Randomly selecting a plurality of character audios with a preset number from an audio database, directly splicing the selected character audios to form a continuous audio, and taking the continuous audio as a replay voiceprint; wherein, the preset number can be set to different values according to actual conditions.
Step 907, obtaining attack text content, wherein the attack text content comprises a plurality of text characters, obtaining a plurality of character audios corresponding to the text characters from an audio database, and splicing the obtained character audios according to the arrangement sequence of the text characters in the attack text content to obtain a structural voiceprint.
In a scene where a target user performs voiceprint recognition in the voiceprint recognition system, the target user utters a voiceprint containing the text given by the system, where the text comprises a plurality of text characters in a corresponding arrangement order, and the characters may be words, digits, and the like; this text is taken as the attack text content. Character audios respectively corresponding to the text characters in the acquired attack text content are selected from the audio database to obtain a plurality of character audios; the selected character audios are then spliced in the arrangement order corresponding to the text given by the voiceprint recognition system to form one continuous audio, and this continuous audio is taken as the constructed voiceprint.
Step 908, segmenting the voiceprint of the target user to obtain a plurality of voiceprint segments, inputting the voiceprint segments to the feature extraction neural network respectively to obtain the voiceprint feature vectors corresponding to the voiceprint segments, and averaging the voiceprint feature vectors corresponding to the voiceprint segments to obtain the voiceprint feature vector of the target user.
The target user's voiceprint can be segmented by the second to obtain a plurality of voiceprint fragments, each comprising one section of audio; the voiceprint fragments are input to the feature extraction neural network in segmentation order, each fragment yielding a corresponding voiceprint feature vector; the voiceprint feature vectors corresponding to the voiceprint fragments are then averaged to obtain the voiceprint feature vector of the target user.
And 909, performing fusion processing on the voiceprint feature vector and the voiceprint text to obtain a mel spectrum, and performing conversion processing on the mel spectrum to obtain the second attack voiceprint.
Acquiring characters corresponding to each character audio frequency contained in the voiceprint of the target user, namely the voiceprint characters of the target user, and performing feature fusion on the voiceprint feature vector corresponding to the voiceprint of the target user and the voiceprint characters of the target user to obtain a corresponding Mel spectrum; and the Mel spectrum obtained after the characteristic fusion is the voiceprint information of the frequency domain, so that the Mel spectrum is converted into the voiceprint information of the time domain, and the voiceprint information of the time domain is used as a second attack voiceprint.
And step 910, sending the replay voiceprint, the constructed voiceprint and the second attack voiceprint to a voiceprint recognition system to obtain a recognition response of the voiceprint recognition system.
And in a scene that a target user performs voiceprint recognition in the voiceprint recognition system, respectively sending the replay voiceprint, the constructed voiceprint and the second attack voiceprint which are constructed in the attack voiceprint construction system to the voiceprint recognition system for recognition.
First, the replay voiceprint is sent to the recognition server in the voiceprint recognition system; the recognition server recognizes the replay voiceprint and outputs the corresponding recognition response to the attack voiceprint construction system. Upon receiving the recognition response corresponding to the replay voiceprint, the attack voiceprint construction system sends the constructed voiceprint to the recognition server, which recognizes it and outputs the corresponding recognition response. Upon receiving the recognition response corresponding to the constructed voiceprint, the attack voiceprint construction system sends the second attack voiceprint to the recognition server, which recognizes it and outputs the corresponding recognition response. Whenever a recognition response output by the recognition server is successful, the corresponding voiceprint is a forged voiceprint that the recognition server failed to identify.
And 911, acquiring an anti-counterfeiting performance detection result of the voiceprint recognition system according to the recognition response.
According to the recognition responses obtained by the attack voiceprint construction system: if the recognition response corresponding to the replay voiceprint is successful, the performance of the voiceprint recognition system is determined to be at the first level; if the recognition response corresponding to the replay voiceprint fails and the recognition response corresponding to the constructed voiceprint succeeds, the performance is at the second level; if the recognition responses corresponding to the replay voiceprint and the constructed voiceprint both fail and the recognition response corresponding to the second attack voiceprint succeeds, the performance is at the third level; if the recognition responses corresponding to all three attack voiceprints fail, the performance is at the fourth level. The performance represented by the first through fourth levels increases in sequence, so the anti-counterfeiting performance grade of the voiceprint recognition system can be determined accordingly.
It should be understood that although the various steps in the flowcharts of figs. 2-9 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in figs. 2-9 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some sub-steps or stages of other steps.
In the embodiment of the present application, as shown in fig. 10, a block diagram of a performance detection apparatus of a voiceprint recognition system provided in the embodiment of the present application is shown, where the performance detection apparatus 1000 of the voiceprint recognition system includes: a first obtaining module 1001, a second obtaining module 1002, a sending module 1003 and a third obtaining module 1004, wherein:
a first obtaining module 1001, configured to obtain multiple character audios from an audio database, and perform splicing processing on the multiple character audios to obtain a first attack voiceprint, where the character audios are audio segments corresponding to a single character;
a second obtaining module 1002, configured to obtain a voiceprint feature of a target user, and generate a second attack voiceprint for simulating a voice of the target user based on the voiceprint feature;
a sending module 1003, configured to send the first attack voiceprint and the second attack voiceprint to a voiceprint recognition system, so as to obtain a recognition response of the voiceprint recognition system;
a third obtaining module 1004, configured to obtain a performance detection result of the voiceprint recognition system according to the recognition response.
In an embodiment of the present application, the first attack voiceprint includes a replay voiceprint, and the first obtaining module is specifically configured to: randomly acquire the plurality of character audios from the audio database, and splice the randomly acquired character audios to obtain the replay voiceprint.
In an embodiment of the present application, the first attack voiceprint includes a constructed voiceprint, and the first obtaining module is specifically configured to: acquire attack word content, where the attack word content includes a plurality of word characters; acquire the character audios respectively corresponding to the word characters from the audio database; and splice the acquired character audios according to the arrangement order of the word characters in the attack word content to obtain the constructed voiceprint.
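Both splicing variants above (random selection for the replay voiceprint, text-ordered selection for the constructed voiceprint) reduce to concatenating per-character clips from the audio database. The sketch below models each clip as a plain list of samples, which is an assumption; the patent does not fix a representation:

```python
import random

def splice_replay_voiceprint(audio_db, n_chars, rng=None):
    """Replay voiceprint: concatenate n_chars randomly chosen character audios."""
    rng = rng or random.Random()
    chars = [rng.choice(list(audio_db)) for _ in range(n_chars)]
    return [sample for ch in chars for sample in audio_db[ch]]

def splice_constructed_voiceprint(audio_db, attack_text):
    """Constructed voiceprint: concatenate clips in the order of the attack text."""
    return [sample for ch in attack_text for sample in audio_db[ch]]
```

In practice the clips would be waveform arrays and the concatenation might add cross-fades at the joins; the logic of the two attacks is unchanged.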
In an embodiment of the application, the second obtaining module is specifically configured to: input the voiceprint of the target user into a feature extraction neural network to obtain a voiceprint feature vector of the target user; fuse the voiceprint feature vector with the text characters to obtain a Mel spectrum; and convert the Mel spectrum to obtain the second attack voiceprint.
In an embodiment of the application, the second obtaining module is specifically configured to: segment the voiceprint of the target user to obtain a plurality of voiceprint fragments, input the voiceprint fragments respectively into a feature extraction neural network to obtain a voiceprint feature vector corresponding to each voiceprint fragment, and average the voiceprint feature vectors corresponding to the voiceprint fragments to obtain the voiceprint feature vector of the target user.
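The averaging step is an element-wise mean over the fragment-level vectors. A minimal pure-Python sketch, with feature vectors modeled as equal-length lists of floats:

```python
def average_feature_vectors(vectors):
    """Element-wise mean of equal-length feature vectors, one per voiceprint fragment."""
    if not vectors:
        raise ValueError("need at least one fragment-level feature vector")
    n = len(vectors)
    dim = len(vectors[0])
    # Average each dimension across all fragments.
    return [sum(v[i] for v in vectors) / n for i in range(dim)]
```

Averaging over several fragments smooths out fragment-specific variation, giving a more stable speaker representation than any single fragment.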
In this embodiment of the present application, as shown in fig. 11, a block diagram of a performance detection apparatus of a second voiceprint recognition system provided in this embodiment of the present application is shown, where the performance detection apparatus 1100 of the voiceprint recognition system further includes: an acquisition module 1005, a cleaning module 1006, a cutting module 1007, and a construction module 1008, wherein:
an acquisition module 1005 for acquiring an original voiceprint;
a cleaning module 1006, configured to clean the original voiceprint to remove noise in the original voiceprint, so as to obtain a candidate voiceprint;
a segmentation module 1007, configured to perform segmentation processing on the candidate voiceprint to obtain multiple character audios;
a building module 1008 for building the audio database based on the plurality of character audios.
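The patent does not say how the candidate voiceprint is cut into single-character clips. One common heuristic, assumed here purely for illustration, is to split at silent gaps; the waveform is modeled as a list of samples:

```python
def segment_at_silence(samples, threshold=0.01):
    """Split a waveform into non-silent runs, one candidate character audio each.

    A sample is treated as silence when its magnitude is at or below
    `threshold`; consecutive non-silent samples form one segment. A real
    system would also smooth over very short gaps and enforce minimum
    segment durations.
    """
    segments, current = [], []
    for s in samples:
        if abs(s) > threshold:
            current.append(s)
        elif current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments
```

The resulting segments, keyed by their transcribed characters, could then populate the audio database used by the splicing modules.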
In this embodiment of the present application, as shown in fig. 12, a block diagram of a performance detection apparatus of a third voiceprint recognition system provided in this embodiment of the present application is shown, where the performance detection apparatus 1200 of the voiceprint recognition system further includes: a fourth acquisition module 1009 and a training module 1010, wherein:
a fourth obtaining module 1009, configured to obtain a training sample set, where the training sample set includes a sample voiceprint and a voiceprint label corresponding to the sample voiceprint, and the voiceprint label is used to indicate whether the sample voiceprint is a normal voiceprint or a malicious voiceprint;
a training module 1010, configured to train a classification neural network based on the training sample set, the classification neural network including a feature extraction layer, and to use the feature extraction layer of the classification neural network as the feature extraction neural network.
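The idea of reusing the trained classifier's feature layer can be sketched structurally. The classes and layer callables below are placeholders, not the patent's actual network:

```python
class ClassificationNet:
    """Toy stand-in: a feature extraction layer followed by a classifier head."""

    def __init__(self, feature_layer, classifier_head):
        self.feature_layer = feature_layer      # trained jointly with the head
        self.classifier_head = classifier_head  # normal-vs-malicious decision

    def classify(self, voiceprint):
        # The full network is used only during training/classification.
        return self.classifier_head(self.feature_layer(voiceprint))

def extract_feature_network(net):
    """After training, keep only the feature layer as the feature extractor."""
    return net.feature_layer
```

The design choice mirrors common transfer-learning practice: supervision from the normal/malicious labels shapes the feature layer, which is then detached and reused on its own to embed target-user voiceprints.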
In an embodiment of the application, the third obtaining module is specifically configured to: if the recognition response corresponding to the replay voiceprint is successful, determine that the performance of the voiceprint recognition system is at a first level; if the recognition response corresponding to the replay voiceprint fails and the recognition response corresponding to the constructed voiceprint succeeds, determine that the performance is at a second level; if the recognition responses corresponding to the replay voiceprint and the constructed voiceprint both fail and the recognition response corresponding to the second attack voiceprint succeeds, determine that the performance is at a third level; and if the recognition responses corresponding to the replay voiceprint, the constructed voiceprint, and the second attack voiceprint all fail, determine that the performance is at a fourth level. The performance of the voiceprint recognition system characterized by the first, second, third, and fourth levels increases in sequence.
For the specific definition of the performance detection apparatus of the voiceprint recognition system, reference may be made to the above definition of the performance detection method of the voiceprint recognition system, and details are not repeated here. The modules in the performance detection apparatus of the voiceprint recognition system can be implemented wholly or partially by software, hardware, or a combination thereof. The modules can be embedded in hardware form in, or be independent of, a processor in the computer device, or be stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to the modules.
In the embodiment of the present application, a computer device is provided, where the computer device may be a server, and its internal structure diagram may be as shown in fig. 13. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing performance detection data of the voiceprint recognition system. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of performance detection for a voiceprint recognition system.
Those skilled in the art will appreciate that the architecture shown in fig. 13 is merely a block diagram of part of the structure related to the present solution and does not limit the computer device to which the present solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment of the present application, there is provided a computer device, which may be a server, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the following steps when executing the computer program:
acquiring a plurality of character audios from an audio database, and splicing the character audios to obtain a first attack voiceprint, wherein the character audios are audio segments corresponding to single characters; acquiring the voiceprint characteristics of a target user, and generating a second attack voiceprint for simulating the voice of the target user based on the voiceprint characteristics; sending the first attack voiceprint and the second attack voiceprint to a voiceprint recognition system to obtain recognition response of the voiceprint recognition system; and acquiring a performance detection result of the voiceprint recognition system according to the recognition response.
In one embodiment of the application, the processor when executing the computer program further performs the steps of:
randomly acquiring the plurality of character audios from the audio database, and splicing the randomly acquired character audios to obtain the replay voiceprint.
In one embodiment of the application, the processor when executing the computer program further performs the steps of:
acquiring attack word content, wherein the attack word content comprises a plurality of word characters; acquiring the character audios respectively corresponding to the word characters from the audio database; and splicing the acquired character audios according to the arrangement order of the word characters in the attack word content to obtain the constructed voiceprint.
In one embodiment of the application, the processor when executing the computer program further performs the steps of:
collecting an original voiceprint; cleaning the original voiceprint to remove noise in the original voiceprint, so as to obtain a candidate voiceprint; segmenting the candidate voiceprint to obtain a plurality of character audios; and constructing the audio database based on the plurality of character audios.
In one embodiment of the application, the processor when executing the computer program further performs the steps of:
inputting the voiceprint of the target user into a feature extraction neural network to obtain a voiceprint feature vector of the target user; fusing the voiceprint feature vector with the text characters to obtain a Mel spectrum; and converting the Mel spectrum to obtain the second attack voiceprint.
In one embodiment of the application, the processor when executing the computer program further performs the steps of:
segmenting the voiceprint of the target user to obtain a plurality of voiceprint fragments, inputting the voiceprint fragments respectively into a feature extraction neural network to obtain a voiceprint feature vector corresponding to each voiceprint fragment, and averaging the voiceprint feature vectors corresponding to the voiceprint fragments to obtain the voiceprint feature vector of the target user.
In one embodiment of the application, the processor when executing the computer program further performs the steps of:
acquiring a training sample set, wherein the training sample set comprises a sample voiceprint and a voiceprint label corresponding to the sample voiceprint, and the voiceprint label is used for indicating whether the sample voiceprint is a normal voiceprint or a malicious voiceprint; training a classification neural network based on the training sample set, the classification neural network comprising a feature extraction layer; and using the feature extraction layer of the classification neural network as the feature extraction neural network.
In one embodiment of the application, the processor when executing the computer program further performs the steps of:
if the recognition response corresponding to the replay voiceprint is successful, determining that the performance of the voiceprint recognition system is at a first level; if the recognition response corresponding to the replay voiceprint fails and the recognition response corresponding to the constructed voiceprint succeeds, determining that the performance is at a second level; if the recognition responses corresponding to the replay voiceprint and the constructed voiceprint both fail and the recognition response corresponding to the second attack voiceprint succeeds, determining that the performance is at a third level; and if the recognition responses corresponding to the replay voiceprint, the constructed voiceprint, and the second attack voiceprint all fail, determining that the performance is at a fourth level; the performance of the voiceprint recognition system characterized by the first, second, third, and fourth levels increases in sequence.
The implementation principle and technical effect of the computer device provided by the embodiment of the present application are similar to those of the method embodiment described above, and are not described herein again.
In an embodiment of the application, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of:
acquiring a plurality of character audios from an audio database, and splicing the character audios to obtain a first attack voiceprint, wherein the character audios are audio segments corresponding to single characters; acquiring the voiceprint characteristics of a target user, and generating a second attack voiceprint for simulating the voice of the target user based on the voiceprint characteristics; sending the first attack voiceprint and the second attack voiceprint to a voiceprint recognition system to obtain recognition response of the voiceprint recognition system; and acquiring a performance detection result of the voiceprint recognition system according to the recognition response.
In one embodiment of the application, the computer program when executed by the processor performs the steps of:
randomly acquiring the plurality of character audios from the audio database, and splicing the randomly acquired character audios to obtain the replay voiceprint.
In one embodiment of the application, the computer program when executed by the processor performs the steps of:
acquiring attack word content, wherein the attack word content comprises a plurality of word characters; acquiring the character audios respectively corresponding to the word characters from the audio database; and splicing the acquired character audios according to the arrangement order of the word characters in the attack word content to obtain the constructed voiceprint.
In one embodiment of the application, the computer program when executed by the processor performs the steps of:
collecting an original voiceprint; cleaning the original voiceprint to remove noise in the original voiceprint, so as to obtain a candidate voiceprint; segmenting the candidate voiceprint to obtain a plurality of character audios; and constructing the audio database based on the plurality of character audios.
In one embodiment of the application, the computer program when executed by the processor performs the steps of:
inputting the voiceprint of the target user into a feature extraction neural network to obtain a voiceprint feature vector of the target user; fusing the voiceprint feature vector with the text characters to obtain a Mel spectrum; and converting the Mel spectrum to obtain the second attack voiceprint.
In one embodiment of the application, the computer program when executed by the processor performs the steps of:
segmenting the voiceprint of the target user to obtain a plurality of voiceprint fragments, inputting the voiceprint fragments respectively into a feature extraction neural network to obtain a voiceprint feature vector corresponding to each voiceprint fragment, and averaging the voiceprint feature vectors corresponding to the voiceprint fragments to obtain the voiceprint feature vector of the target user.
In one embodiment of the application, the computer program when executed by the processor performs the steps of:
acquiring a training sample set, wherein the training sample set comprises a sample voiceprint and a voiceprint label corresponding to the sample voiceprint, and the voiceprint label is used for indicating whether the sample voiceprint is a normal voiceprint or a malicious voiceprint; training a classification neural network based on the training sample set, the classification neural network comprising a feature extraction layer; and using the feature extraction layer of the classification neural network as the feature extraction neural network.
In one embodiment of the application, the computer program when executed by the processor performs the steps of:
if the recognition response corresponding to the replay voiceprint is successful, determining that the performance of the voiceprint recognition system is at a first level; if the recognition response corresponding to the replay voiceprint fails and the recognition response corresponding to the constructed voiceprint succeeds, determining that the performance is at a second level; if the recognition responses corresponding to the replay voiceprint and the constructed voiceprint both fail and the recognition response corresponding to the second attack voiceprint succeeds, determining that the performance is at a third level; and if the recognition responses corresponding to the replay voiceprint, the constructed voiceprint, and the second attack voiceprint all fail, determining that the performance is at a fourth level; the performance of the voiceprint recognition system characterized by the first, second, third, and fourth levels increases in sequence.
The implementation principle and technical effect of the computer-readable storage medium provided by this embodiment are similar to those of the above-described method embodiment, and are not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination contains no contradiction, it should be considered within the scope of this specification.
The above-mentioned embodiments express only several implementations of the present application, and their description is specific and detailed, but should not therefore be construed as limiting the scope of the invention patent. It should be noted that several variations and improvements can be made by those of ordinary skill in the art without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (11)

1. A method for detecting performance of a voiceprint recognition system, the method comprising:
acquiring a plurality of character audios from an audio database, and splicing the plurality of character audios to obtain a first attack voiceprint, wherein each of the character audios is an audio segment corresponding to a single character;
acquiring voiceprint characteristics of a target user, and generating a second attack voiceprint for simulating the voice of the target user based on the voiceprint characteristics;
sending the first attack voiceprint and the second attack voiceprint to a voiceprint recognition system to obtain recognition response of the voiceprint recognition system;
and acquiring a performance detection result of the voiceprint recognition system according to the recognition response.
2. The method of claim 1, wherein the first attack voiceprint comprises a replay voiceprint, and wherein obtaining a plurality of character audios from an audio database and splicing the character audios to obtain the first attack voiceprint comprises:
randomly acquiring the plurality of character audios from the audio database, and splicing the randomly acquired character audios to obtain the replay voiceprint.
3. The method of claim 2, wherein the first attack voiceprint comprises a constructed voiceprint, and the obtaining a plurality of character audios from an audio database and splicing the plurality of character audios to obtain the first attack voiceprint comprises:
acquiring attack word content, wherein the attack word content comprises a plurality of word characters;
acquiring the character audios respectively corresponding to the word characters from the audio database;
and splicing the acquired character audios according to the arrangement order of the word characters in the attack word content to obtain the constructed voiceprint.
4. The method of any of claims 1 to 3, further comprising:
collecting original voiceprints;
cleaning the original voiceprint to remove noise in the original voiceprint to obtain a candidate voiceprint;
carrying out segmentation processing on the candidate voiceprints to obtain a plurality of character audios;
the audio database is built based on the plurality of character audios.
5. The method of claim 1, wherein the obtaining of the voiceprint feature of the target user and the generating of the second attack voiceprint for simulating the target user voice based on the voiceprint feature comprises:
inputting the voiceprint of the target user into a feature extraction neural network to obtain a voiceprint feature vector of the target user;
fusing the voiceprint feature vector of the target user with the text characters to obtain a Mel spectrum;
and converting the Mel spectrum to obtain the second attack voiceprint.
6. The method according to claim 5, wherein the inputting the voiceprint of the target user into a feature extraction neural network to obtain the voiceprint feature vector of the target user comprises:
segmenting the voiceprint of the target user to obtain a plurality of voiceprint fragments;
respectively inputting the voiceprint fragments into the feature extraction neural network to obtain a voiceprint feature vector corresponding to each voiceprint fragment;
and averaging the voiceprint characteristic vectors corresponding to the voiceprint fragments to obtain the voiceprint characteristic vector of the target user.
7. The method of claim 5 or 6, wherein prior to the inputting the target user's voiceprint into a feature extraction neural network, the method further comprises:
acquiring a training sample set, wherein the training sample set comprises a sample voiceprint and a voiceprint label corresponding to the sample voiceprint, and the voiceprint label is used for indicating that the sample voiceprint is a normal voiceprint or a malicious voiceprint;
training a classification neural network based on the training sample set, the classification neural network comprising a feature extraction layer;
and taking a feature extraction layer included by the classification neural network as the feature extraction neural network.
8. The method according to claim 3, wherein the obtaining a performance detection result of the voiceprint recognition system according to the recognition response comprises:
if the recognition response corresponding to the replay voiceprint is successful, determining that the performance of the voiceprint recognition system is at a first level;
if the recognition response corresponding to the replay voiceprint fails and the recognition response corresponding to the constructed voiceprint succeeds, determining that the performance of the voiceprint recognition system is at a second level;
if the recognition responses corresponding to the replay voiceprint and the constructed voiceprint both fail and the recognition response corresponding to the second attack voiceprint succeeds, determining that the performance of the voiceprint recognition system is at a third level;
if the recognition response corresponding to the replay voiceprint, the recognition response corresponding to the constructed voiceprint and the recognition response corresponding to the second attack voiceprint all fail, determining that the performance of the voiceprint recognition system is at a fourth level;
the performance of the voiceprint recognition system characterized by the first level, the second level, the third level and the fourth level is sequentially increased.
9. A performance detection apparatus for a voiceprint recognition system, the apparatus comprising:
the first acquisition module is used for acquiring a plurality of character audios from an audio database and splicing the character audios to obtain a first attack voiceprint, wherein the character audios are audio segments corresponding to a single character;
the second acquisition module is used for acquiring the voiceprint characteristics of the target user and generating a second attack voiceprint for simulating the voice of the target user based on the voiceprint characteristics;
the sending module is used for sending the first attack voiceprint and the second attack voiceprint to a voiceprint recognition system to obtain recognition response of the voiceprint recognition system;
and the third acquisition module is used for acquiring the performance detection result of the voiceprint recognition system according to the recognition response.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 8.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.
CN202111222370.6A 2021-10-20 2021-10-20 Method, device, equipment and storage medium for detecting performance of voiceprint recognition system Pending CN114023331A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111222370.6A CN114023331A (en) 2021-10-20 2021-10-20 Method, device, equipment and storage medium for detecting performance of voiceprint recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111222370.6A CN114023331A (en) 2021-10-20 2021-10-20 Method, device, equipment and storage medium for detecting performance of voiceprint recognition system

Publications (1)

Publication Number Publication Date
CN114023331A true CN114023331A (en) 2022-02-08

Family

ID=80056800

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111222370.6A Pending CN114023331A (en) 2021-10-20 2021-10-20 Method, device, equipment and storage medium for detecting performance of voiceprint recognition system

Country Status (1)

Country Link
CN (1) CN114023331A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116013323A (en) * 2022-12-27 2023-04-25 浙江大学 Active evidence obtaining method oriented to voice conversion

Similar Documents

Publication Publication Date Title
US11663307B2 (en) RtCaptcha: a real-time captcha based liveness detection system
CN109729383B (en) Double-recording video quality detection method and device, computer equipment and storage medium
CN108429619A (en) Identity identifying method and system
CN105991593B (en) A kind of method and device identifying consumer's risk
CN110032924A (en) Recognition of face biopsy method, terminal device, storage medium and electronic equipment
CN110955874A (en) Identity authentication method, identity authentication device, computer equipment and storage medium
CN111275448A (en) Face data processing method and device and computer equipment
JP7412496B2 (en) Living body (liveness) detection verification method, living body detection verification system, recording medium, and training method for living body detection verification system
CN110769425B (en) Method and device for judging abnormal call object, computer equipment and storage medium
CN109831677B (en) Video desensitization method, device, computer equipment and storage medium
CN110796054A (en) Certificate authenticity verifying method and device
CN114677634B (en) Surface label identification method and device, electronic equipment and storage medium
CN110675252A (en) Risk assessment method and device, electronic equipment and storage medium
CN113191787A (en) Telecommunication data processing method, device electronic equipment and storage medium
CN114023331A (en) Method, device, equipment and storage medium for detecting performance of voiceprint recognition system
CN112351047B (en) Double-engine based voiceprint identity authentication method, device, equipment and storage medium
CN110766074A (en) Method and device for testing identification qualification of abnormal grains in biological identification method
CN111932270B (en) Bank customer identity verification method and device
CN111063359B (en) Telephone return visit validity judging method, device, computer equipment and medium
CN117252429A (en) Risk user identification method and device, storage medium and electronic equipment
CN115314268B (en) Malicious encryption traffic detection method and system based on traffic fingerprint and behavior
CN115688107A (en) Fraud-related APP detection system and method
CN111339829B (en) User identity authentication method, device, computer equipment and storage medium
JP3322491B2 (en) Voice recognition device
CN113178196B (en) Audio data extraction method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination