WO2024002501A1 - Detecting a spoofing in a user voice audio - Google Patents

Publication number
WO2024002501A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
voiceprint
training
user
subspace
Application number
PCT/EP2022/072121
Other languages
French (fr)
Inventor
Santiago PRIETO CALERO
Guillermo BARBADILLO VILLANUEVA
Miguel Ángel SÁNCHEZ YOLDI
Miguel Santos LUPARELLI MATHIEU
Eduardo Azanza Ladrón
Original Assignee
Veridas Digital Authentication Solutions, S.L.
Application filed by Veridas Digital Authentication Solutions, S.L.

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/26 Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 User authentication
    • G06F21/32 User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/06 Decision making techniques; Pattern matching strategies

Definitions

  • The present disclosure relates to detector methods of detecting spoofing in a user voice audio, and to detector systems and computer programs suitable for performing said detector methods. It further relates to trainer methods of training said detector systems, and to trainer systems and computer programs suitable for performing said trainer methods.
  • Voice biometric systems are increasingly used and demanded both in friendly (non-invasive) identification environments, such as applications for commercial smartphones, and in applications handling sensitive data, such as banking.
  • This demand calls for a substantial technological effort around voice biometric systems.
  • Developments in this line are aimed at improving these systems, e.g. their safety and their speed of operation.
  • Biometric voice recognition, despite being one of the most secure identification methods today, is still partly vulnerable to attempts at identity fraud, also known as spoofing.
  • One of the challenges voice biometric systems currently face is detecting fraudulent attacks in a generalized manner, since these systems are exposed to spoofing attacks that may compromise security. These attacks may be physical attacks, which may include recording a legitimate voice and replaying it (recording and replay), or logical attacks based on synthesizing the voice of a person with access to the system (voice synthesis or conversion).
  • An object of the disclosure is to provide new methods, systems and computer programs aimed at improving prior art manners of detecting spoofing in a user voice audio.
  • methods of detecting spoofing in a user voice audio comprise obtaining a user voiceprint outputted by a voiceprint generator, the user voiceprint mathematically representing the user voice audio in a voiceprint representation space including a genuine-voice subspace and a spoofed-voice subspace.
  • Detector methods further comprise classifying the user voiceprint as corresponding to the genuine-voice subspace or to the spoofed-voice subspace and detecting the spoofing in the user voice audio depending on whether the user voiceprint has been classified as corresponding to the genuine-voice subspace or to the spoofed-voice subspace.
  • The term “voiceprint” is used herein to refer to a digitally recorded representation or biometric mathematical representation of a person’s voice that may be used for authentication purposes because it is as individual as a fingerprint.
  • The term “voiceprint representation space” is used herein to denote a mathematical representation space in which any voiceprint generable from any voice audio may fall.
  • The term “voiceprint generator” is used herein to refer to any human voice feature extractor that generates a mathematical representation of the voice based on the extracted features, said mathematical representation being biometrically representative of, and identifying, the human to whom the voice belongs, in the same or a similar manner as a fingerprint thereof.
  • The genuine-voice subspace and the spoofed-voice subspace may be disjoint voiceprint subspaces or, in other words, subspaces with no overlap between them. This way, an improved classification of the user voiceprint may be performed in the sense that it may only correspond to either the genuine-voice subspace or the spoofed-voice subspace. The inventors found experimentally that logistic regression produced the best results in terms of balance between accuracy and simplicity, and accordingly concluded that the intended classification may be performed very successfully based on a logistic regression model.
  • Detector methods may further include providing the user voice audio to the voiceprint generator for it to output the user voiceprint mathematically representing the user voice audio.
  • Detector methods may further comprise predefining (or estimating or anticipating or determining) the genuine-voice subspace by clustering, in the voiceprint representation space, a plurality of voiceprint samples that are known to mathematically represent genuine or authentic voice audios, and/or predefining (or estimating or anticipating or determining) the spoofed-voice subspace by clustering, in the voiceprint representation space, a plurality of voiceprint samples that are known to mathematically represent spoofed voice audios.
  • the spoofed voice audios used to predefine the spoofed-voice subspace may include voice audios resulting from voice recording and subsequent replay of the recorded voice and/or from voice synthesis and/or from voice conversion.
  • detector methods may thus be capable of detecting replay attacks and/or voice synthesis attacks and/or voice conversion attacks.
  • the voice recording and subsequent replay of the recorded voice may be performed by using a smartphone.
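The clustering-based predefinition of the two subspaces described above can be illustrated with a minimal sketch. It assumes, beyond what the disclosure states, that each subspace is summarized by the centroid of its labelled samples and that voiceprints are fixed-length lists of floats; the function names `centroid` and `nearest_subspace` are hypothetical.

```python
from math import dist

def centroid(samples):
    """Mean vector of a non-empty list of equal-length voiceprints."""
    n = len(samples)
    return [sum(component) / n for component in zip(*samples)]

def nearest_subspace(voiceprint, genuine_centroid, spoofed_centroid):
    """Return 'genuine' or 'spoofed', whichever centroid is closer (Euclidean)."""
    d_genuine = dist(voiceprint, genuine_centroid)
    d_spoofed = dist(voiceprint, spoofed_centroid)
    return "genuine" if d_genuine <= d_spoofed else "spoofed"

# Toy 2-D voiceprints (assumption: real voiceprints have many more dimensions).
genuine_samples = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2]]
spoofed_samples = [[-1.0, -1.0], [-1.2, -0.8], [-0.8, -1.2]]

genuine_centroid = centroid(genuine_samples)
spoofed_centroid = centroid(spoofed_samples)
```

A centroid is only one possible summary of a cluster; the disclosure leaves the clustering technique open.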
  • the classifying of the user voiceprint may include outputting (by, e.g., logistic regression model) a probability that the user voiceprint corresponds to or falls within the genuine-voice subspace or to the spoofed-voice subspace. In this case, once such a probability has been obtained, it may be determined whether said probability satisfies or dissatisfies a predefined spoofing threshold.
  • authenticator methods of authenticating a person may be provided, said authenticator methods including a detector method such as the ones described in other parts of the disclosure.
  • authenticator methods may include obtaining, from the person, the user voice audio, and inputting the user voice audio into the voiceprint generator for it to output the user voiceprint.
  • These authenticator methods may further include performing the detector method so as to detect a spoofing in the user voiceprint and authenticating the person depending on whether no spoofing has been detected and, in said case, on a comparison between the user voiceprint and a reference voiceprint of the person.
  • the reference voiceprint may be a voiceprint obtained from a person’s voice (at, e.g., a registration stage) which may then be stored and retrieved each time authentication of the person is required or requested.
  • Detector systems are also provided for detecting spoofing in a user voice audio.
  • Detector systems comprise an obtainer module, a classifier module and a detector module.
  • The obtainer module is configured to obtain a user voiceprint outputted by a voiceprint generator, the user voiceprint mathematically representing the user voice audio in a voiceprint representation space including a genuine-voice subspace and a spoofed-voice subspace.
  • the classifier module is configured to classify the user voiceprint as corresponding to the genuine-voice subspace or to the spoofed-voice subspace.
  • the detector module is configured to detect the spoofing in the user voice audio depending on whether the user voiceprint has been classified as corresponding to the genuine-voice subspace or to the spoofed-voice subspace.
  • the classifier module may be a machine learning module which is, therefore, trainable using machine learning.
  • Computer programs are also provided, comprising program instructions for causing a system or computing system to perform methods of detecting a spoofing in a user voice audio, such as those described in other parts of the disclosure.
  • These computer programs may be embodied on a storage medium and/or carried on a carrier signal.
  • Computing systems are also provided for detecting a spoofing in a user voice audio, said computing systems comprising a memory and a processor and embodying instructions stored in the memory and executable by the processor, the instructions comprising functionality or functionalities to execute detector methods of detecting a spoofing in a user voice audio, such as those described in other parts of the disclosure.
  • trainer methods are provided for training a detector system with the classifier being a machine learning module, such as those described in other parts of the disclosure.
  • Trainer methods comprise obtaining a training set of training voiceprints outputted by the voiceprint generator, the training set including a first training subset and a second training subset, the first training subset including training voiceprints that are known to correspond to the genuine-voice subspace, and the second training subset including training voiceprints that are known to correspond to the spoofed-voice subspace.
  • Trainer methods further comprise performing, for each training voiceprint in whole or part of the training set, a training loop including training the classifier module to classify the training voiceprint taking into account the known correspondence of the training voiceprint to the first training subset or to the second training subset.
  • The training of the classifier module may include minimizing binary cross entropy with training voiceprints in the first training subset and training voiceprints in the second training subset, and/or applying an L2 regularization method, and/or applying a large-scale bounded constraint based on an L-BFGS-B optimization algorithm.
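One common way to write such a training objective is given below as a sketch. It assumes a logistic regression classifier with weights \(\mathbf{w}\) and bias \(b\), training voiceprints \(\mathbf{x}_i\) with labels \(y_i \in \{0, 1\}\) (1 for the spoofed subset, 0 for the genuine subset), and a regularization strength \(\lambda\). These symbols, and the box bounds \(\boldsymbol{\ell} \le \mathbf{w} \le \mathbf{u}\) that a bounded optimizer such as L-BFGS-B would enforce, are illustrative and do not appear in the disclosure.

```latex
\begin{aligned}
p_i &= \sigma\bigl(\mathbf{w}^\top \mathbf{x}_i + b\bigr), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},\\
J(\mathbf{w}, b) &= -\frac{1}{N}\sum_{i=1}^{N}\Bigl[\,y_i \log p_i + (1 - y_i)\log(1 - p_i)\Bigr] + \frac{\lambda}{2}\,\lVert \mathbf{w} \rVert_2^2,\\
(\mathbf{w}^\ast, b^\ast) &= \arg\min_{\boldsymbol{\ell} \,\le\, \mathbf{w} \,\le\, \mathbf{u},\; b} \; J(\mathbf{w}, b).
\end{aligned}
```

The first term is the binary cross entropy over both training subsets and the second term is the L2 penalty.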
  • trainer systems are provided for training a detector system with the classifier being a machine learning module, such as those described in other parts of the disclosure.
  • Trainer systems comprise a training set unit and a training loop unit.
  • the training set unit is configured to obtain a training set of training voiceprints outputted by the voiceprint generator.
  • the training set includes a first training subset and a second training subset.
  • the first training subset includes training voiceprints that are known to correspond to the genuine-voice subspace.
  • the second training subset includes training voiceprints that are known to correspond to the spoofed-voice subspace.
  • the training loop unit is configured to perform, for each training voiceprint in whole or part of the training set, a training loop implemented by a classifier trainer unit.
  • the classifier trainer unit is configured to train the classifier module to classify the training voiceprint taking into account the known correspondence of the training voiceprint to the first training subset or to the second training subset.
  • Computer programs are also provided, comprising program instructions for causing a system or computing system to perform trainer methods of training a detector system, such as those described in other parts of the disclosure.
  • These computer programs may be embodied on a storage medium and/or carried on a carrier signal.
  • Computing systems are also provided for training a detector system such as those described in other parts of the disclosure, said computing systems comprising a memory and a processor and embodying instructions stored in the memory and executable by the processor, the instructions comprising functionality or functionalities to execute trainer methods of training a detector system, such as those described in other parts of the disclosure.
  • Figure 1 is a block diagram schematically illustrating detector systems for detecting spoofing in a user voice audio according to examples.
  • Figure 2 is a flow chart schematically illustrating detector methods of detecting spoofing in a user voice audio according to examples.
  • Figure 3 is a block diagram schematically illustrating authenticator systems for authenticating a person according to examples.
  • Figure 4 is a flow chart schematically illustrating authenticator methods of authenticating a person according to examples.
  • Figure 5 is a block diagram schematically illustrating trainer systems for training a detector system such as the ones of Figure 1, according to examples.
  • Figure 6 is a flow chart schematically illustrating trainer methods of training a detector system such as the ones of Figure 1, according to examples.
  • FIG. 1 is a block diagram schematically illustrating detector systems for detecting spoofing in a user voice audio according to examples.
  • Detector systems 100 may include an obtainer module 101, a classifier module 102 and a detector module 103.
  • The obtainer module 101 may be configured to obtain a user voiceprint 104 from (or outputted by) a voiceprint generator.
  • The user voiceprint 104 may mathematically represent the user voice audio in a voiceprint representation space.
  • The voiceprint representation space may include a genuine-voice subspace and a spoofed-voice subspace.
  • The classifier module 102 may be configured to classify the user voiceprint 104 as corresponding to the genuine-voice subspace or to the spoofed-voice subspace, thereby producing a classification 105 of the user voiceprint 104.
  • The detector module 103 may be configured to detect the spoofing in the user voice audio depending on whether the user voiceprint 104 has been classified as corresponding to the genuine-voice subspace or to the spoofed-voice subspace, thereby outputting data 106 indicating detected or non-detected spoofing.
  • the genuine-voice subspace and the spoofed-voice subspace may have been defined or predefined as disjoint subspaces or, in other words, with no overlapping between them. This way, classification of the user voiceprint 104 may be better in the sense that the probability that the user voiceprint 104 belongs to either the genuine-voice subspace or the spoofed-voice subspace may be clearer or more conclusive.
  • logistic regression was experimentally identified by inventors as very suitable to detect voice spoofing attacks, since it permits defining a low complexity model while providing very effective accuracy for this purpose. It was surprisingly concluded and empirically proved that logistic regression models have a high spoofing detection capacity, presumably due to a loss of fidelity in low frequencies in said attacks to be detected, along with its relative simplicity as commented above.
  • Detector methods may further include providing the user voice audio to the voiceprint generator in such a way that the voiceprint generator outputs the user voiceprint 104 mathematically representing the user voice audio.
  • The genuine-voice subspace may be grouped (or estimated or anticipated or determined or predefined) by clustering, in the voiceprint representation space, voiceprint samples that are known to mathematically represent genuine or authentic voice audios.
  • The spoofed-voice subspace may be grouped (or estimated or anticipated or determined or predefined) by clustering, in the voiceprint representation space, voiceprint samples that are known to mathematically represent spoofed voice audios. In both cases, for genuine and spoofed voiceprint samples alike, the larger and more representative the set of samples, the better.
  • The spoofed voice audios used to define the spoofed-voice subspace may include voice audios resulting from voice recording and subsequent replay of the recorded voice which, in some examples, may come from a smartphone.
  • The spoofed voice audios may also include voice audios resulting from voice synthesis and/or voice audios resulting from voice conversion.
  • the classifying of the user voiceprint may be performed by computing or determining a probability that the user voiceprint corresponds to (or falls within) either the genuine-voice subspace or the spoofed-voice subspace. In this probability-based approach, it may be verified whether such a probability satisfies or dissatisfies a predefined spoofing threshold. For example, if the probability is above (or outside) the predefined spoofing threshold (or range), it may be determined that the user voice audio is a spoof and, otherwise, is genuine or authentic.
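The probability-based approach above can be sketched as follows. This is a minimal illustration, assuming a logistic regression model whose output is interpreted as the probability of the spoofed class; the weights, bias and the 0.5 threshold are hand-picked placeholders and the function names are hypothetical.

```python
from math import exp

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)

def spoof_probability(voiceprint, weights, bias):
    """Logistic regression: probability that the voiceprint is spoofed."""
    z = sum(w * x for w, x in zip(weights, voiceprint)) + bias
    return sigmoid(z)

def is_spoof(probability, threshold=0.5):
    """Flag a spoof when the probability is above the predefined threshold."""
    return probability > threshold

# Hypothetical parameters for a 3-dimensional voiceprint (illustration only).
weights = [2.0, -1.0, 0.5]
bias = -0.25
```

For instance, `is_spoof(spoof_probability([1.0, 0.0, 0.0], weights, bias))` evaluates the logit 1.75 and flags the sample as a spoof, whereas a voiceprint of `[0.0, 2.0, 0.0]` gives a negative logit and is treated as genuine.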
  • Figure 2 is a flow chart schematically illustrating methods of detecting spoofing in a user voice audio according to examples.
  • Detector methods may be initiated (e.g. at block 200) upon detection of a starting condition such as, e.g., a request for starting the method or an invocation of the method from a user interface or the like. Since detector methods according to Figure 2 are performable by detector systems according to Figure 1, number references from Figure 1 may be reused in the following description of Figure 2.
  • Detector methods may further include (e.g. at block 201) obtaining a user voiceprint 104 outputted by a voiceprint generator, the user voiceprint 104 mathematically representing the user voice audio in a voiceprint representation space including a genuine-voice subspace and a spoofed-voice subspace.
  • This functionality implemented or implementable at block 201 may be performed by e.g. obtainer module 101 previously described with reference to Figure 1. Functional details and considerations explained about said obtainer module 101 may thus be similarly attributed or attributable to method block 201.
  • Detector methods may further include (e.g. at block 202) classifying the user voiceprint 104 as corresponding to the genuine-voice subspace or to the spoofed-voice subspace, thereby outputting a classification 105 of the user voiceprint 104.
  • This functionality implemented or implementable at block 202 may be performed by e.g. classifier module 102 previously described with reference to Figure 1. Functional details and considerations explained about said classifier module 102 may thus be similarly attributed or attributable to method block 202.
  • Detector methods may still further include (e.g. at block 203) detecting the spoofing in the user voice audio depending on whether the classification 105 of the user voiceprint 104 indicates correspondence to the genuine-voice subspace or to the spoofed-voice subspace, thereby outputting an indicator 106 of detected or non-detected spoofing.
  • This functionality implemented or implementable at block 203 may be performed by e.g. detector module 103 previously described with reference to Figure 1. Functional details and considerations explained about said detector module 103 may thus be similarly attributed or attributable to method block 203.
  • Detector methods may terminate (e.g. at block 204) when an ending condition is detected such as e.g. once detection has been (either successfully or unsuccessfully) completed, under reception of a user termination request, under shutdown or deactivation of the detector system 100 performing the method, etc.
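Blocks 201 to 203 can be pictured as a small pipeline. The sketch below is only an illustration of the data flow between the three modules; the class `DetectorSystem`, the stand-in voiceprint generator and the toy classifier are all hypothetical and do not represent a real feature extractor or a trained model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Voiceprint = Sequence[float]

@dataclass
class DetectorSystem:
    obtain: Callable[[bytes], Voiceprint]    # obtainer module 101 / block 201
    classify: Callable[[Voiceprint], str]    # classifier module 102 / block 202

    def detect(self, user_voice_audio: bytes) -> bool:
        """Detector module 103 / block 203: True means spoofing detected."""
        voiceprint = self.obtain(user_voice_audio)    # user voiceprint 104
        classification = self.classify(voiceprint)    # classification 105
        return classification == "spoofed"            # indicator 106

def fake_voiceprint_generator(audio: bytes) -> Voiceprint:
    """Stand-in only: maps raw bytes to a 2-D vector, not a real extractor."""
    return [len(audio) / 10.0, sum(audio) / (255.0 * max(len(audio), 1))]

def toy_classifier(voiceprint: Voiceprint) -> str:
    """Stand-in only: thresholds the second component of the vector."""
    return "spoofed" if voiceprint[1] > 0.5 else "genuine"

detector = DetectorSystem(obtain=fake_voiceprint_generator, classify=toy_classifier)
```

Keeping the obtainer and classifier as injected callables mirrors the statement that modules may be combined, separated or replaced.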
  • FIG. 3 is a block diagram schematically illustrating authenticator systems for authenticating a person according to examples.
  • Authenticator systems 300 may include a voice obtainer module 301, an inputter module 302, a voiceprint generator 303, a detector system 100 (such as the ones described in other parts of the disclosure, e.g., regarding Figure 1), and an authenticator module 304.
  • the voice obtainer module 301 may be configured to obtain the user voice audio 305 from the person to be authenticated. Suitable voice capturing component or means may be provided to the person for this purpose.
  • the inputter module 302 may be configured to input the user voice audio 305 into the voiceprint generator 303 for it to produce corresponding user voiceprint 306.
  • the inputter module 302 may simply act as a bridge receiving the user voice audio 305 and providing it to the voiceprint generator 303 or, in other examples, may additionally pre-process the user voice audio 305 as required by the voiceprint generator 303.
  • Such pre-processing of the user voice audio 305 (which may be performed by the inputter module 302) may include, e.g., transforming the user voice audio 305 from time domain to frequency domain. This transformation may be performed by applying, e.g., Fast Fourier Transformation (FFT), Mel-frequency-based methods, etc.
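As an illustration of the time-to-frequency transformation mentioned above, the following sketch implements a textbook recursive radix-2 FFT in pure Python and extracts a magnitude spectrum from one frame. The frame length, the test signal and the function names are assumptions made for the example; a production system would more likely rely on an optimized library.

```python
import cmath
import math

def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 0:
        raise ValueError("empty input")
    if n == 1:
        return [complex(samples[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def magnitude_spectrum(frame):
    """Magnitudes of the non-negative frequency bins of one audio frame."""
    spectrum = fft(frame)
    return [abs(c) for c in spectrum[: len(frame) // 2 + 1]]

# One 8-sample frame containing a sinusoid with one cycle per frame (bin 1).
frame = [math.sin(2 * math.pi * n / 8) for n in range(8)]
magnitudes = magnitude_spectrum(frame)
```

For this frame the energy concentrates in bin 1, with a magnitude of approximately 4 (half the frame length), and the other bins are approximately zero.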
  • the detector system may receive as input and process the user voiceprint 306 so as to detect whether it is spoofed or non-spoofed in, e.g., any of the manners explained in relation to Figure 1.
  • An indicator of spoofed or non-spoofed user voiceprint 306 outputted by detector system 100 may be received by authenticator module 304 to authenticate the person. If the person’s voiceprint 306 has been determined as spoofed, the authenticator module 304 may directly determine unsuccessful authentication.
  • the authenticator module 304 may authenticate the person based on a comparison between the user or person voiceprint 306 and a reference voiceprint 307 of the person.
  • the reference voiceprint 307 may be a voiceprint generated for/by the person at, e.g., an initial registration phase, and it may be stored somewhere such as, e.g., in a DB 308 in such a way that the authenticator system 300 may retrieve it for authenticating the user or person every time it is requested or required. If the person’s voiceprint 306 and the reference voiceprint 307 are determined as corresponding to each other or, in other words, as coming from the same person, the authenticator module 304 may determine successful authentication and, otherwise, unsuccessful authentication.
  • Figure 4 is a flow chart schematically illustrating methods of authenticating a person (or authenticator methods) according to examples, said authenticator methods including a detector method such as, e.g. those of Figure 2.
  • Authenticator methods may be initiated (e.g. at block 400) upon detection of a starting condition such as, e.g., a request for starting the method or an invocation of the method from a user interface or the like. Since authenticator methods according to Figure 4 are performable by authenticator systems according to Figure 3, and since said authenticator systems are intended to use detector systems according to Figure 1, number references from Figures 1 and 3 may be reused in the following description of Figure 4.
  • Authenticator methods may further include (e.g. at block 401) obtaining, from the person, the user voice audio 305.
  • This functionality implemented or implementable at block 401 may be performed by e.g. voice obtainer module 301 previously described with reference to Figure 3. Functional details and considerations explained about said voice obtainer module 301 may thus be similarly attributed or attributable to method block 401.
  • authenticator methods may further include (e.g. at block 402) inputting the user voice audio 305 into the voiceprint generator 303 in such a way that the voiceprint generator 303 outputs the user voiceprint 306 (104 in Figure 1).
  • This functionality implemented or implementable at block 402 may be performed by e.g. inputter module 302 previously described with reference to Figure 3. Functional details and considerations explained about said inputter module 302 may thus be similarly attributed or attributable to method block 402.
  • Authenticator methods may yet further include (e.g. at block 403) performing the detector method to detect a spoofing in the user voiceprint 306 (104 in Figure 1).
  • This functionality implemented or implementable at block 403 may be performed by e.g. detector methods previously described with reference to Figure 2. Functional details and considerations explained about said detector methods may thus be similarly attributed or attributable to method block 403.
  • Authenticator methods may furthermore include (e.g. at block 404) verifying whether spoofing has been detected (at, e.g., block 403) and, in said case, Y, transition to end block 407 may be performed to indicate or output that the person has been unsuccessfully authenticated. Otherwise, N, transition to block 405 may be performed to verify whether the user voiceprint 306 (sufficiently) corresponds to a reference voiceprint 307 of the person, in which case, Y, transition to end block 406 may be performed to indicate or output that the person has been successfully authenticated. Otherwise, N, transition to end block 407 may be performed to indicate or output that the person has been unsuccessfully authenticated.
  • This authenticating functionality implemented or implementable at blocks 404 and 405 may be performed by e.g. authenticator module 304 previously described with reference to Figure 3. Functional details and considerations explained about said authenticator module 304 may thus be similarly attributed or attributable to method blocks 404 and 405.
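The decision flow of blocks 404 to 407 can be sketched as below. The disclosure only requires "a comparison" between the user voiceprint and the reference voiceprint; cosine similarity and the 0.8 match threshold are assumptions chosen for illustration, and the function names are hypothetical.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def authenticate(user_voiceprint, reference_voiceprint, spoof_detected,
                 match_threshold=0.8):
    """Return True for successful authentication (block 406), else False (block 407)."""
    if spoof_detected:                        # block 404, branch Y -> block 407
        return False
    score = cosine_similarity(user_voiceprint, reference_voiceprint)  # block 405
    return score >= match_threshold           # Y -> block 406, N -> block 407
```

The spoofing check runs first, so a spoofed voiceprint is rejected without ever being compared against the reference voiceprint 307.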
  • FIG. 5 is a block diagram schematically illustrating trainer systems for training a detector system 100 according to examples.
  • trainer systems 500 may include a training set unit 501 and a training loop unit 502.
  • a training loop unit 502 may include a classifier trainer unit 503 and may be configured to iteratively perform a classification function implemented by said classifier trainer unit 503.
  • Detector system 100 is fully shown in this figure, but the obtainer module 101 may be disabled or absent because the (training) data 506 to be processed by the classifier module 102 may come only from the classifier trainer unit 503.
  • the training set unit 501 may be configured to obtain (or receive or capture) a training set of training voiceprints outputted by the voiceprint generator for training the detector system 100 and, in particular, its classifier module 102.
  • the training loop unit 502 may be configured to perform, for each training voiceprint in (whole or part of) the training set, a training loop implemented or implementable by, e.g., the classifier trainer unit 503.
  • The training set may include training voiceprints that are known to correspond to the genuine-voice subspace (in, e.g., first training subset) 504 and training voiceprints that are known to correspond to the spoofed-voice subspace (in, e.g., second training subset) 505.
  • Training of the classifier module 102 may include providing, by the classifier trainer unit 503, current (i.e., first or next) training voiceprint to the classifier module 102 for its training to classify the training voiceprint as spoofed or genuine. Classifier module 102 may thus output a probability of whether the training voiceprint is spoofed or genuine.
  • Training of the classifier module 102 may further include providing, by the classifier trainer unit 503, a classification of the outputted probability in terms of having more or less loss (or divergence) with respect to what the classifier module 102 is expected to output.
  • This classification of the probability outputted by the classifier module 102 may be performed by, e.g., a probability classifier provided by the classifier trainer unit 503, which may have necessary knowledge about the training voiceprints to be processed for that aim.
  • a probability classifier may be trained along with the classifier module 102 in cooperative manner according to known machine learning techniques. If such a loss inferred by the probability classifier is determined unacceptable, same current training voiceprint may be repeatedly processed until inferred loss becomes acceptable. This repetitive approach may cause both probability classifier and classifier module 102 to converge towards targeted knowledge.
  • Global or cumulative loss may be calculated periodically, e.g. at completion of each iteration or several iterations of the training loop, depending on loss attributed to each or several of the processed training voiceprints. This global or cumulative loss may accurately indicate whether whole detector system 100 is converging or not to purposed knowledge, at what pace, how accurately, etc. over training cycles.
  • Trainer systems according to Figure 5 may permit training the classifier module 102 in a closely interrelated manner with large numbers of training voiceprints that are very representative of the pursued classification knowledge, such that the classifier module 102 may turn out to be more effective and efficient than with prior art training approaches having the same or similar purposes, as experimentally confirmed by the inventors.
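A minimal training sketch along these lines is shown below. It assumes a logistic regression classifier trained on binary cross entropy with an L2 penalty, and it deliberately swaps in plain per-sample gradient descent for the L-BFGS-B optimizer mentioned in the disclosure so that the example stays dependency-free. The learning rate, epoch count, penalty strength and toy data are arbitrary, and the "repeat until the loss is acceptable" behaviour is approximated by running several passes over the training set.

```python
from math import exp, log

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)

def bce(p, y, eps=1e-12):
    """Binary cross entropy of one prediction p against label y (0 or 1)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * log(p) + (1 - y) * log(1.0 - p))

def train(genuine, spoofed, epochs=200, lr=0.5, l2=1e-3):
    """Per-sample gradient descent; returns (weights, bias, per-epoch mean loss)."""
    data = [(x, 0) for x in genuine] + [(x, 1) for x in spoofed]  # 1 = spoofed
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, y in data:                       # one training-loop iteration
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            total += bce(p, y)
            error = p - y                       # gradient of BCE w.r.t. the logit
            weights = [w - lr * (error * xi + l2 * w) for w, xi in zip(weights, x)]
            bias -= lr * error
        history.append(total / len(data))       # cumulative (mean) loss per pass
    return weights, bias, history

# Toy first (genuine) and second (spoofed) training subsets; illustration only.
genuine = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2]]
spoofed = [[-1.0, -1.0], [-1.2, -0.8], [-0.8, -1.2]]
weights, bias, history = train(genuine, spoofed)
```

The `history` list plays the role of the global or cumulative loss: a decreasing sequence indicates that the classifier is converging on the toy data.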
  • Figure 6 is a flow chart schematically illustrating methods of training a detector system (or trainer methods) according to examples.
  • Trainer methods may be initiated (e.g. at block 600) upon detection of a starting condition such as, e.g., a request for starting the method or an invocation of the method from a user interface or the like. Since trainer methods according to Figure 6 are performable by trainer systems according to Figure 5, and since said trainer systems are intended to train detector systems according to Figure 1, number references from said Figures 1 and 5 may be reused in the following description of Figure 6.
  • Trainer methods may further include (e.g. at block 601) obtaining or receiving a training set of training voiceprints outputted by the voiceprint generator 303, the training set comprising a first training subset 504 including training voiceprints that are known to correspond to the genuine-voice subspace, and a second training subset 505 including training voiceprints that are known to correspond to the spoofed-voice subspace.
  • This functionality implemented or implementable at block 601 may be performed by e.g. training set unit 501 previously described with reference to Figure 5. Functional details and considerations explained about said training set unit 501 may thus be similarly attributed or attributable to method block 601.
  • trainer methods may further include performing, for each training voiceprint in whole or part of the training set, a training loop that may include, e.g., block 602 explained below.
  • This training loop may be performed by e.g. training loop unit 502 previously described with reference to Figure 5. Functional details and considerations explained about said training loop unit 502 may thus be similarly attributed or attributable to such a training loop.
  • Trainer methods may still further include (e.g. at block 602) training the classifier module to classify the first or next training voiceprint taking into account the known correspondence of the training voiceprint to the first training subset 504 or to the second training subset 505.
  • At the first iteration, the first training voiceprint in the (whole or part of the) training set 504, 505 may be processed and, otherwise (at any non-first iteration), the next training voiceprint may be processed.
  • Such a training may be performed by the classifier trainer unit 503 providing classifier module 102 with suitable data and/or instructions 506 including, e.g., the first or next training voiceprint.
  • This functionality implemented or implementable at block 602 may be performed by e.g. classifier trainer unit 503 previously described with reference to Figure 5. Functional details and considerations explained about said classifier trainer unit 503 may thus be similarly attributed or attributable to method block 602.
  • Trainer methods may yet furthermore include (e.g. at decision block 603) verifying whether an ending condition is satisfied, in which case, Y, the method may be terminated by, e.g., transitioning to ending block 604 and, otherwise, N, a new iteration of the training loop may be initiated by, e.g., looping back to block 602 so as to process next training voiceprint.
  • Ending condition may include, e.g., completion of the whole or part of the training set (i.e. all training voiceprints to be used for the training have been processed), reception of a user termination request, shutdown or deactivation of the trainer system 500 that is performing the trainer method, etc.
  • The terms “module” and “unit” may be understood to refer to software, firmware, hardware and/or various combinations thereof. It is noted that the modules are exemplary. The modules may be combined, integrated, separated, and/or duplicated to support various applications. Also, a function described herein as being performed by a particular module may be performed by one or more other modules and/or by one or more other devices instead of or in addition to the function performed by the described particular module.
  • The modules may be implemented across multiple devices, associated or linked to corresponding detector methods and/or authenticator methods and/or trainer methods proposed herein, and/or to other components that may be local or remote to one another. Additionally, the modules may be moved from one device and added to another device, and/or may be included in both devices, associated to corresponding detector methods and/or authenticator methods and/or trainer methods proposed herein. Any software implementations may be tangibly embodied in one or more storage media, such as e.g. a memory device, a floppy disk, a compact disk (CD), a digital versatile disk (DVD), or other devices that may store computer code.
  • Detector methods and/or authenticator methods and/or trainer methods may be implemented by computing means, electronic means or a combination thereof.
  • The computing means may be a set of instructions (e.g. a computer program), in which case the systems performing the detector methods and/or authenticator methods and/or trainer methods may comprise a memory and a processor, embodying said set of instructions stored in the memory and executable by the processor.
  • These instructions may comprise functionality or functionalities to execute corresponding detector methods and/or authenticator methods and/or trainer methods such as e.g. the ones described with reference to other figures.
  • A controller of the system may be, for example, a CPLD (Complex Programmable Logic Device), an FPGA (Field Programmable Gate Array) or an ASIC (Application-Specific Integrated Circuit).
  • The computing means may be a set of instructions (e.g. a computer program) and the electronic means may be any electronic circuit capable of implementing corresponding steps of the detector methods and/or authenticator methods and/or trainer methods proposed herein, such as those described with reference to other figures.
  • The computer program(s) may be embodied on a storage medium (for example, a CD-ROM, a DVD, a USB drive, a computer memory or a read-only memory) or carried on a carrier signal (for example, on an electrical or optical carrier signal).
  • The computer program(s) may be in the form of source code, object code, a code intermediate between source and object code such as in partially compiled form, or in any other form suitable for use in implementing the detector methods and/or authenticator methods and/or trainer methods according to the present disclosure.
  • The carrier may be any entity or device capable of carrying the computer program(s).
  • The carrier may comprise a storage medium, such as a ROM, for example a CD ROM or a semiconductor ROM, or a magnetic recording medium, for example a hard disk.
  • The carrier may be a transmissible carrier such as an electrical or optical signal, which may be conveyed via electrical or optical cable or by radio or other means.
  • The carrier may be constituted by such cable or other device or means.
  • The carrier may be an integrated circuit in which the computer program(s) is/are embedded, the integrated circuit being adapted for performing, or for use in the performance of, the detector methods and/or authenticator methods and/or trainer methods proposed herein.
  • Detector methods and systems according to the present disclosure are based on processing a voiceprint produced by a voiceprint generator (or model) and, therefore, they provide the advantage of reusing something (the voiceprint) that is conventionally aimed exclusively at authentication.
  • Voiceprint generators are used to generate voiceprints as unique identifiers of a person for his/her authentication based on voiceprint comparison. Detector methods and systems proposed herein may thus permit reusing a voiceprint for an innovative second aim (spoof detection) apart from the main or exclusive one (authentication).
  • Authenticator methods and systems may use a single biometric model (e.g., a single AI model) as voiceprint generator for both authentication and spoof detection.
  • Detector methods and systems suggested herein avoid the need for one specific voiceprint generator for authentication and another specific voiceprint generator (or whatever computing means) for spoof detection.
  • The use of a single two-functional voiceprint generator/model may thus provide important benefits in terms of saving computational resources versus using two different single-functional voiceprint generators. Less memory and fewer processing resources may be needed in such a two-functional voiceprint approach than in a single-functional voiceprint approach. These benefits may be especially advantageous in devices configured to perform authentication locally based on, e.g., a single biometric model embedded therein.
  • Methods and systems supporting the two-functional voiceprint approach may perform authentication and spoof detection (much) more efficiently (i.e. faster) in comparison to the single-functional voiceprint approach.
  • Detector/authenticator methods and systems allow having only one voiceprint generator stored on-device as well as only one data storage (e.g., a database) containing only one reference voiceprint to authenticate, verify an identity and validate its authenticity in the field and on-device.
  • the anonymity of the data stored on-device i.e. irreversible voiceprint
  • humanitarian workers may perform in catastrophic situations easily and securely in a friendly on-device environment with the need of less (maybe half) memory and processing resources in comparison to having two different biometric models, one for authentication and another for spoof detection.
  • If each of said biometric models/modules weighs 20 MB, having two modules implies the need of storing 40 MB and double execution, whereas having a single two-functional biometric module requires storing 20 MB and a single execution.

Abstract

Detector methods are provided of detecting a spoofing in a user voice audio, said methods comprising: obtaining a user voiceprint outputted by a voiceprint generator, the user voiceprint mathematically representing the user voice audio in a voiceprint representation space including a genuine-voice subspace and a spoofed-voice subspace; classifying the user voiceprint as corresponding to the genuine-voice subspace or to the spoofed-voice subspace; and detecting the spoofing in the user voice audio depending on whether the user voiceprint has been classified as corresponding to the genuine-voice subspace or to the spoofed-voice subspace. Detector systems and computer programs suitable for performing such detector methods are also provided. Trainer methods of training such detector systems are also provided, along with trainer systems and computer programs suitable for performing said trainer methods.

Description

DETECTING A SPOOFING IN A USER VOICE AUDIO
This application claims the benefit of European Patent Application EP22382631.4 filed 01 July 2022.
The present disclosure relates to detector methods of detecting a spoofing in a user voice audio and to detector systems and computer programs suitable for performing said detector methods and, furthermore, to trainer methods of training said detector systems and to trainer systems and computer programs suitable for performing said trainer methods.
BACKGROUND
Nowadays, voice biometric systems are being increasingly used and demanded both in friendly (non-invasive) identification environments, such as applications for commercial smartphones, and in applications with sensitive data such as banking. This demand generates the need to make a great technological effort around voice biometric systems. The developments addressed in this line are aimed at improving these systems, improving their safety, their speed of operation, etc.
Biometric voice recognition, despite being one of the most secure identification methods today, is still partly vulnerable to attempts of identity fraud, also known as spoofing. One of the challenges voice biometric systems are currently facing is the generalization of detecting fraudulent attacks, since voice biometric systems are exposed to spoofing attacks that may compromise security. These attacks may be, on the one hand, physical attacks, which may include the recording of a legitimate voice and its replaying (recording and replay) and, on the other hand, logical attacks based on synthesizing the voice of a person with access to the system (voice synthesis or conversion).
An object of the disclosure is to provide new methods, systems and computer programs aimed at improving prior art manners of detecting spoofing in a user voice audio.
SUMMARY
In an aspect, methods of detecting spoofing in a user voice audio are provided. Such methods (also denominated detector methods herein) comprise obtaining a user voiceprint outputted by a voiceprint generator, the user voiceprint mathematically representing the user voice audio in a voiceprint representation space including a genuine-voice subspace and a spoofed-voice subspace. Detector methods further comprise classifying the user voiceprint as corresponding to the genuine-voice subspace or to the spoofed-voice subspace and detecting the spoofing in the user voice audio depending on whether the user voiceprint has been classified as corresponding to the genuine-voice subspace or to the spoofed-voice subspace.
The term “voiceprint” is herein used to refer to a digitally recorded representation or biometric mathematical representation of a person’s voice that may be used for authentication purposes because it is as individual as a fingerprint. The term “voiceprint representation space” is used herein to denote a mathematical representation space in which any generable voiceprint from any voice audio may fall. Experiments carried out by the inventors have surprisingly revealed that genuine or non-spoofed voiceprints tend to cluster in what is herein denominated a genuine-voice subspace within the voiceprint representation space, and that non-genuine or spoofed voiceprints tend to cluster in what is herein denominated a spoofed-voice subspace within the voiceprint representation space. Therefore, the inventors concluded that any voiceprint from any voice audio may be classified into one of the genuine-voice subspace and the spoofed-voice subspace through a classification algorithm.
The term “voiceprint generator” is used herein to refer to any human voice feature extractor that generates a mathematical representation of the voice based on the extracted features, said mathematical representation being biometrically representative and identifier of the human to whom the voice belongs, in same or similar manner as a fingerprint thereof.
In some examples, the genuine-voice subspace and the spoofed-voice subspace may be disjoint voiceprint subspaces or, in other words, with no overlapping between them. This way, an improved classification of the user voiceprint may be performed in the sense that it may only correspond to either the genuine-voice subspace or the spoofed-voice subspace. The inventors experimentally proved that the use of logistic regression produced the best results in terms of balance between accuracy and simplicity, and accordingly concluded that the intended classification may be performed very successfully based on a logistic regression model.
Detector methods according to examples may further include providing the user voice audio to the voiceprint generator for it to output the user voiceprint mathematically representing the user voice audio. Detector methods may further comprise predefining (or estimating or anticipating or determining) the genuine-voice subspace by clustering, in the voiceprint representation space, a plurality of voiceprint samples that are known to mathematically represent genuine or authentic voice audios, and/or predefining (or estimating or anticipating or determining) the spoofed-voice subspace by clustering, in the voiceprint representation space, a plurality of voiceprint samples that are known to mathematically represent spoofed voice audios.
In some implementations, the spoofed voice audios used to predefine the spoofed-voice subspace may include voice audios resulting from voice recording and subsequent replay of the recorded voice and/or from voice synthesis and/or from voice conversion. In this manner, detector methods may be capable of detecting replay attacks and/or voice synthesis attacks and/or voice conversion attacks. In the case of replay attacks, the voice recording and subsequent replay of the recorded voice may be performed by using a smartphone.
In some configurations, the classifying of the user voiceprint may include outputting (by, e.g., a logistic regression model) a probability that the user voiceprint corresponds to (or falls within) the genuine-voice subspace or the spoofed-voice subspace. In this case, once such a probability has been obtained, it may be determined whether said probability satisfies or dissatisfies a predefined spoofing threshold.
In examples, authenticator methods of authenticating a person may be provided, said authenticator methods including a detector method such as the ones described in other parts of the disclosure. Specifically, such authenticator methods may include obtaining, from the person, the user voice audio, and inputting the user voice audio into the voiceprint generator for it to output the user voiceprint. These authenticator methods may further include performing the detector method so as to detect a spoofing in the user voiceprint and authenticating the person depending on whether no spoofing has been detected and, in said case, on a comparison between the user voiceprint and a reference voiceprint of the person. The reference voiceprint may be a voiceprint obtained from a person’s voice (at, e.g., a registration stage) which may then be stored and retrieved each time authentication of the person is required or requested.
In a further aspect, detector systems are provided for detecting spoofing in a user voice audio. Detector systems comprise an obtainer module, a classifier module and a detector module. The obtainer module is configured to obtain a user voiceprint outputted by a voiceprint generator, the user voiceprint mathematically representing the user voice audio in a voiceprint representation space including a genuine-voice subspace and a spoofed-voice subspace. The classifier module is configured to classify the user voiceprint as corresponding to the genuine-voice subspace or to the spoofed-voice subspace. The detector module is configured to detect the spoofing in the user voice audio depending on whether the user voiceprint has been classified as corresponding to the genuine-voice subspace or to the spoofed-voice subspace. In some examples, the classifier module may be a machine learning module which is, therefore, trainable using machine learning.
In a still further aspect, computer programs are provided comprising program instructions for causing a system or computing system to perform methods of detecting a spoofing in a user voice audio, such as those described in other parts of the disclosure. These computer programs may be embodied on a storage medium and/or carried on a carrier signal.
In a yet further aspect, computing systems are provided for detecting a spoofing in a user voice audio, said computing systems comprising a memory and a processor, embodying instructions stored in the memory and executable by the processor, and the instructions comprising functionality or functionalities to execute detector methods of detecting a spoofing in a user voice audio, such as those described in other parts of the disclosure.
In a furthermore aspect, trainer methods are provided for training a detector system with the classifier being a machine learning module, such as those described in other parts of the disclosure. Trainer methods comprise obtaining a training set of training voiceprints outputted by the voiceprint generator, the training set including a first training subset and a second training subset, the first training subset including training voiceprints that are known to correspond to the genuine-voice subspace, and the second training subset including training voiceprints that are known to correspond to the spoofed-voice subspace. Trainer methods further comprise performing, for each training voiceprint in whole or part of the training set, a training loop including training the classifier module to classify the training voiceprint taking into account the known correspondence of the training voiceprint to the first training subset or to the second training subset.
In trainer methods, the training of the classifier module may include minimizing binary cross entropy with training voiceprints in the first training subset and training voiceprints in the second training subset, and/or applying an L2 regularization method, and/or applying a large-scale bounded constraint based on an L-BFGS-B optimization algorithm.
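Purely by way of illustration, the training criterion mentioned above (binary cross entropy with an L2 penalty) may be sketched as follows. This is a minimal sketch under stated assumptions, not the inventors' implementation: plain full-batch gradient descent is swapped in for the L-BFGS-B optimizer so that the example needs only the Python standard library, and the two-dimensional "voiceprints", the function names, the learning rate and the regularization strength are all hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_l2_loss(weights, bias, voiceprints, labels, l2):
    """Mean binary cross entropy plus an L2 penalty on the weights."""
    eps = 1e-12
    total = 0.0
    for x, y in zip(voiceprints, labels):
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(voiceprints) + 0.5 * l2 * sum(w * w for w in weights)


def train_classifier(voiceprints, labels, l2=0.01, lr=0.5, epochs=200):
    """Fit logistic regression by gradient descent (stand-in for L-BFGS-B)."""
    dim = len(voiceprints[0])
    n = len(voiceprints)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(voiceprints, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        for j in range(dim):
            weights[j] -= lr * (grad_w[j] / n + l2 * weights[j])
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical toy data: label 0 = first (genuine) subset, label 1 = second (spoofed) subset.
rng = random.Random(0)
genuine = [[rng.gauss(-1.0, 0.3), rng.gauss(-1.0, 0.3)] for _ in range(50)]
spoofed = [[rng.gauss(1.0, 0.3), rng.gauss(1.0, 0.3)] for _ in range(50)]
train_x = genuine + spoofed
train_y = [0] * 50 + [1] * 50
weights, bias = train_classifier(train_x, train_y)
```

The regularized loss is convex, so any reasonable optimizer reaches the same minimum; a quasi-Newton method such as L-BFGS-B would typically be expected to need fewer iterations than the gradient descent used in this sketch.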
In a still furthermore aspect, trainer systems are provided for training a detector system with the classifier being a machine learning module, such as those described in other parts of the disclosure. Trainer systems comprise a training set unit and a training loop unit. The training set unit is configured to obtain a training set of training voiceprints outputted by the voiceprint generator. The training set includes a first training subset and a second training subset. The first training subset includes training voiceprints that are known to correspond to the genuine-voice subspace. The second training subset includes training voiceprints that are known to correspond to the spoofed-voice subspace. The training loop unit is configured to perform, for each training voiceprint in whole or part of the training set, a training loop implemented by a classifier trainer unit. The classifier trainer unit is configured to train the classifier module to classify the training voiceprint taking into account the known correspondence of the training voiceprint to the first training subset or to the second training subset.
In a yet furthermore aspect, computer programs are provided comprising program instructions for causing a system or computing system to perform trainer methods of training a detector system, such as those described in other parts of the disclosure. These computer programs may be embodied on a storage medium and/or carried on a carrier signal.
In an additional aspect, computing systems are provided for training a detector system such as those described in other parts of the disclosure, said computing systems comprising a memory and a processor, embodying instructions stored in the memory and executable by the processor, and the instructions comprising functionality or functionalities to execute trainer methods of training a detector system, such as those described in other parts of the disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
Non-limiting examples of the disclosure will be described in the following, with reference to the appended drawings, in which:
Figure 1 is a block diagram schematically illustrating detector systems for detecting spoofing in a user voice audio according to examples.
Figure 2 is a flow chart schematically illustrating detector methods of detecting spoofing in a user voice audio according to examples.
Figure 3 is a block diagram schematically illustrating authenticator systems for authenticating a person according to examples.
Figure 4 is a flow chart schematically illustrating authenticator methods of authenticating a person according to examples.
Figure 5 is a block diagram schematically illustrating trainer systems for training a detector system such as the ones of Figure 1, according to examples.
Figure 6 is a flow chart schematically illustrating trainer methods of training a detector system such as the ones of Figure 1, according to examples.
DETAILED DESCRIPTION OF EXAMPLES
Figure 1 is a block diagram schematically illustrating detector systems for detecting spoofing in a user voice audio according to examples. As generally shown in the figure, such detector systems 100 may include an obtainer module 101, a classifier module 102 and a detector module 103. The obtainer module 101 may be configured to obtain a user voiceprint 104 from (or outputted by) a voiceprint generator. The user voiceprint 104 may mathematically represent the user voice audio in a voiceprint representation space. The voiceprint representation space may include a genuine-voice subspace and a spoofed-voice subspace. The classifier module 102 may be configured to classify the user voiceprint 104 as corresponding to the genuine-voice subspace or to the spoofed-voice subspace, thereby producing a classification 105 of the user voiceprint 104. The detector module 103 may be configured to detect the spoofing in the user voice audio depending on whether the user voiceprint 104 has been classified as corresponding to the genuine-voice subspace or to the spoofed-voice subspace, thereby outputting some kind of data 106 indicating detected or non-detected spoofing.
The genuine-voice subspace and the spoofed-voice subspace may have been defined or predefined as disjoint subspaces or, in other words, with no overlapping between them. This way, classification of the user voiceprint 104 may be better in the sense that the probability that the user voiceprint 104 belongs to either the genuine-voice subspace or the spoofed-voice subspace may be clearer or more conclusive. As commented in other parts of the disclosure, logistic regression was experimentally identified by inventors as very suitable to detect voice spoofing attacks, since it permits defining a low complexity model while providing very effective accuracy for this purpose. It was surprisingly concluded and empirically proved that logistic regression models have a high spoofing detection capacity, presumably due to a loss of fidelity in low frequencies in said attacks to be detected, along with its relative simplicity as commented above.
Detector methods according to the present disclosure may further include providing the user voice audio to the voiceprint generator in such a way that the voiceprint generator outputs the user voiceprint 104 mathematically representing the user voice audio. The genuine-voice subspace may be grouped (or estimated or anticipated or determined or predefined) by clustering, in the voiceprint representation space, voiceprint samples that are known to mathematically represent genuine or authentic voice audios. Similarly, the spoofed-voice subspace may be grouped (or estimated or anticipated or determined or predefined) by clustering, in the voiceprint representation space, voiceprint samples that are known to mathematically represent spoofed voice audios. In both cases, genuine and spoofed voiceprint samples, the larger and more representative the set of samples, the better.
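As a non-limiting illustration of predefining the two subspaces from labelled samples, the following sketch summarises each labelled cluster by a centroid and a radius and checks that the two resulting regions do not overlap. The disclosure does not specify a clustering technique; the centroid-plus-radius summary, the function names and the two-dimensional sample values are assumptions made only for this example.

```python
import math


def centroid(samples):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def define_subspace(samples):
    """Summarise a labelled cluster as (centroid, distance to its farthest sample)."""
    center = centroid(samples)
    radius = max(math.dist(center, s) for s in samples)
    return center, radius


def are_disjoint(subspace_a, subspace_b):
    """True if the two spherical regions do not overlap."""
    (center_a, radius_a), (center_b, radius_b) = subspace_a, subspace_b
    return math.dist(center_a, center_b) > radius_a + radius_b


def nearest_subspace(voiceprint, subspaces):
    """Name of the subspace whose centroid is closest to the voiceprint."""
    return min(subspaces, key=lambda name: math.dist(voiceprint, subspaces[name][0]))


# Hypothetical voiceprint samples with known labels.
genuine_samples = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]
spoofed_samples = [[2.0, 2.1], [1.9, 1.8], [2.2, 2.0]]
subspaces = {
    "genuine": define_subspace(genuine_samples),
    "spoofed": define_subspace(spoofed_samples),
}
```

With real voiceprints the dimensionality would be much higher and the regions need not be spherical; the sketch only conveys the idea of estimating each subspace from samples whose label is known.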
In order to detect replay attacks, the spoofed voice audios used to define the spoofed-voice subspace may include voice audios resulting from voice recording and subsequent replay of the recorded voice which, in some examples, may come from a smartphone. Similarly, in order to detect voice synthesis attacks and voice conversion attacks, the spoofed voice audios may include voice audios resulting from voice synthesis and voice audios resulting from voice conversion, respectively.
The classifying of the user voiceprint may be performed by computing or determining a probability that the user voiceprint corresponds to (or falls within) either the genuine-voice subspace or the spoofed-voice subspace. In this probability-based approach, it may be verified whether such a probability satisfies or dissatisfies a predefined spoofing threshold. For example, if the probability is above (or outside) the predefined spoofing threshold (or range), it may be determined that the user voice audio is a spoof and, otherwise, is genuine or authentic.
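The probability-based approach just described may be sketched as follows, assuming (as an illustration only) that the probability is the output of an already trained logistic regression model and that it expresses correspondence to the spoofed-voice subspace. The weights, the bias and the default threshold of 0.5 are hypothetical values, not figures from the disclosure.

```python
import math


def spoof_probability(voiceprint, weights, bias):
    """Logistic-regression probability that the voiceprint falls in the spoofed-voice subspace."""
    z = sum(w * x for w, x in zip(weights, voiceprint)) + bias
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def is_spoof(voiceprint, weights, bias, spoofing_threshold=0.5):
    """Spoof if the probability is above the predefined spoofing threshold, genuine otherwise."""
    return spoof_probability(voiceprint, weights, bias) > spoofing_threshold


# Hypothetical model parameters and voiceprints.
example_weights = [1.0, -2.0]
example_bias = 0.0
spoofed_like = [2.0, 0.0]   # linear score  2.0, probability about 0.88
genuine_like = [0.0, 1.0]   # linear score -2.0, probability about 0.12
```

Raising the threshold makes the detector more reluctant to flag a spoof, which trades missed attacks against false rejections of genuine users.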
Figure 2 is a flow chart schematically illustrating methods of detecting spoofing in a user voice audio according to examples. As generally shown in the figure, detector methods may be initiated (e.g. at block 200) upon detection of a starting condition such as e.g. a request for starting the method or an invocation of the method from a user interface or the like. Since detector methods according to Figure 2 are performable by detector systems according to Figure 1, number references from Figure 1 may be reused in the following description of Figure 2.
Detector methods may further include (e.g. at block 201) obtaining a user voiceprint 104 outputted by a voiceprint generator, the user voiceprint 104 mathematically representing the user voice audio in a voiceprint representation space including a genuine-voice subspace and a spoofed-voice subspace. This functionality implemented or implementable at block 201 may be performed by e.g. obtainer module 101 previously described with reference to Figure 1. Functional details and considerations explained about said obtainer module 101 may thus be similarly attributed or attributable to method block 201.
Detector methods may further include (e.g. at block 202) classifying the user voiceprint 104 as corresponding to the genuine-voice subspace or to the spoofed-voice subspace, thereby outputting a classification 105 of the user voiceprint 104. This functionality implemented or implementable at block 202 may be performed by e.g. classifier module 102 previously described with reference to Figure 1. Functional details and considerations explained about said classifier module 102 may thus be similarly attributed or attributable to method block 202.
Detector methods may still further include (e.g. at block 203) detecting the spoofing in the user voice audio depending on whether the classification 105 of the user voiceprint 104 indicates correspondence to the genuine-voice subspace or to the spoofed-voice subspace, thereby outputting an indicator 106 of detected or non-detected spoofing. This functionality implemented or implementable at block 203 may be performed by e.g. detector module 103 previously described with reference to Figure 1. Functional details and considerations explained about said detector module 103 may thus be similarly attributed or attributable to method block 203.
Detector methods may terminate (e.g. at block 204) when an ending condition is detected such as e.g. once detection has been (either successfully or unsuccessfully) completed, under reception of a user termination request, under shutdown or deactivation of the detector system 100 performing the method, etc.
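The sequence of blocks 201 to 203 may be pictured with the following sketch, in which the obtaining and classifying steps are passed in as callables. The names `run_detector`, `DetectionResult`, `stub_obtain` and `stub_classify`, as well as the string labels, are hypothetical and serve only to show how classification 105 and indicator 106 relate to each other.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Voiceprint = Sequence[float]


@dataclass
class DetectionResult:
    classification: str       # plays the role of classification 105
    spoofing_detected: bool   # plays the role of indicator 106


def run_detector(
    obtain: Callable[[], Voiceprint],
    classify: Callable[[Voiceprint], str],
) -> DetectionResult:
    """Run the three detector steps in order."""
    voiceprint = obtain()             # block 201: obtain the user voiceprint
    label = classify(voiceprint)      # block 202: "genuine" or "spoofed"
    if label not in ("genuine", "spoofed"):
        raise ValueError(f"unexpected classification: {label!r}")
    return DetectionResult(label, label == "spoofed")   # block 203


def stub_obtain() -> Voiceprint:
    """Stand-in for a voiceprint delivered by a voiceprint generator."""
    return [0.9, 0.1]


def stub_classify(voiceprint: Voiceprint) -> str:
    """Stand-in classifier using an arbitrary rule on the first component."""
    return "spoofed" if voiceprint[0] > 0.5 else "genuine"


result = run_detector(stub_obtain, stub_classify)
```

Keeping the classifier behind a callable mirrors the modular description of Figure 1, in which the classifier module may be replaced or retrained without touching the obtaining and detecting steps.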
Figure 3 is a block diagram schematically illustrating authenticator systems for authenticating a person according to examples. As generally shown in the figure, such authenticator systems 300 may include a voice obtainer module 301, an inputter module 302, a voiceprint generator 303, a detector system 100 (such as the ones described in other parts of the disclosure, e.g., regarding Figure 1), and an authenticator module 304.
The voice obtainer module 301 may be configured to obtain the user voice audio 305 from the person to be authenticated. Suitable voice capturing component or means may be provided to the person for this purpose. The inputter module 302 may be configured to input the user voice audio 305 into the voiceprint generator 303 for it to produce the corresponding user voiceprint 306. The inputter module 302 may simply act as a bridge receiving the user voice audio 305 and providing it to the voiceprint generator 303 or, in other examples, may additionally pre-process the user voice audio 305 as required by the voiceprint generator 303. Such pre-processing of the user voice audio 305 (which may be performed by the inputter module 302) may include, e.g., transforming the user voice audio 305 from the time domain to the frequency domain. This transformation may be performed by applying, e.g., a Fast Fourier Transform (FFT), Mel-frequency-based methods, etc.
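The time-to-frequency transformation mentioned as a possible pre-processing step may be illustrated with a textbook radix-2 FFT applied to a single audio frame. This is only a sketch: the frame length of 64 samples, the synthetic test tone and the function names are assumptions, and Mel-frequency-based processing is not covered.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; the length must be a power of two."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative frequency bins of one audio frame."""
    spectrum = fft(frame)
    return [abs(value) for value in spectrum[: len(frame) // 2 + 1]]


# Hypothetical frame: a pure tone completing 5 cycles within 64 samples.
frame = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
spectrum = magnitude_spectrum(frame)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

For this synthetic tone the energy concentrates in frequency bin 5, which is the kind of frequency-domain representation a voiceprint generator may expect as input.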
Once the user voiceprint 306 has been generated, the detector system (such as the ones of Figure 1) may receive as input and process the user voiceprint 306 so as to detect whether it is spoofed or non-spoofed in, e.g., any of the manners explained in relation to Figure 1. An indicator of spoofed or non-spoofed user voiceprint 306 outputted by detector system 100 may be received by authenticator module 304 to authenticate the person. If the person’s voiceprint 306 has been determined as spoofed, the authenticator module 304 may directly determine unsuccessful authentication. If the person’s voiceprint 306 has been determined as non-spoofed or genuine, the authenticator module 304 may authenticate the person based on a comparison between the user or person voiceprint 306 and a reference voiceprint 307 of the person. The reference voiceprint 307 may be a voiceprint generated for/by the person at, e.g., an initial registration phase, and it may be stored somewhere such as, e.g., in a DB 308 in such a way that the authenticator system 300 may retrieve it for authenticating the user or person every time it is requested or required. If the person’s voiceprint 306 and the reference voiceprint 307 are determined as corresponding to each other or, in other words, as coming from the same person, the authenticator module 304 may determine successful authentication and, otherwise, unsuccessful authentication.
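The decision logic of the authenticator module 304 may be sketched as below. The disclosure only requires a comparison between the user voiceprint and the reference voiceprint; the use of cosine similarity, the matching threshold of 0.8 and the function names are assumptions introduced for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def authenticate(user_voiceprint, reference_voiceprint, spoofing_detected, match_threshold=0.8):
    """Reject directly on detected spoofing; otherwise compare with the reference voiceprint."""
    if spoofing_detected:
        return False
    return cosine_similarity(user_voiceprint, reference_voiceprint) >= match_threshold


# Hypothetical voiceprints.
reference = [1.0, 0.0]
same_person = [1.0, 0.0]
other_person = [0.0, 1.0]
```

Checking the spoofing indicator first means that the comparison with the reference voiceprint is skipped whenever the detector system has already flagged the voiceprint as spoofed.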
Figure 4 is a flow chart schematically illustrating methods of authenticating a person (or authenticator methods) according to examples, said authenticator methods including a detector method such as, e.g., those of Figure 2. As generally shown in the figure, authenticator methods may be initiated (e.g. at block 400) upon detection of a starting condition such as, e.g., a request for starting the method or an invocation of the method from a user interface or the like. Since authenticator methods according to Figure 4 are performable by authenticator systems according to Figure 3, and since said authenticator systems are intended to use detector systems according to Figure 1, number references from Figures 1 and 3 may be reused in the following description of Figure 4.
Authenticator methods may further include (e.g. at block 401) obtaining, from the person, the user voice audio 305. This functionality implemented or implementable at block 401 may be performed by, e.g., the voice obtainer module 301 previously described with reference to Figure 3. Functional details and considerations explained about said voice obtainer module 301 may thus be similarly attributed or attributable to method block 401. Once the user voice audio 305 has been obtained or received from the person, authenticator methods may further include (e.g. at block 402) inputting the user voice audio 305 into the voiceprint generator 303 in such a way that the voiceprint generator 303 outputs the user voiceprint 306 (104 in Figure 1). This functionality implemented or implementable at block 402 may be performed by, e.g., the inputter module 302 previously described with reference to Figure 3. Functional details and considerations explained about said inputter module 302 may thus be similarly attributed or attributable to method block 402.
Authenticator methods may yet further include (e.g. at block 403) performing the detector method to detect a spoofing in the user voiceprint 306 (104 in Figure 1). This functionality implemented or implementable at block 403 may be performed by e.g. detector methods previously described with reference to Figure 2. Functional details and considerations explained about said detector methods may thus be similarly attributed or attributable to method block 403.
Authenticator methods may furthermore include (e.g. at block 404) verifying whether spoofing has been detected (at, e.g., block 403) and, in said case, Y, transition to end block 407 may be performed to indicate or output that the person has been unsuccessfully authenticated. Otherwise, N, transition to block 405 may be performed to verify whether the user voiceprint 306 (sufficiently) corresponds to a reference voiceprint 307 of the person, in which case, Y, transition to end block 406 may be performed to indicate or output that the person has been successfully authenticated. Otherwise, N, transition to end block 407 may be performed to indicate or output that the person has been unsuccessfully authenticated. This authenticating functionality implemented or implementable at blocks 404 and 405 may be performed by e.g. authenticator module 304 previously described with reference to Figure 3. Functional details and considerations explained about said authenticator module 304 may thus be similarly attributed or attributable to method blocks 404 and 405.
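The decision flow of blocks 404 to 407 may be sketched, purely as an illustration, as shown below. The disclosure does not specify how the user voiceprint 306 is compared with the reference voiceprint 307, so the cosine similarity, both threshold values and the names `cosine_similarity` and `authenticate` are assumptions made for this example only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two voiceprint vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def authenticate(spoof_probability, user_voiceprint, reference_voiceprint,
                 spoof_threshold=0.5, match_threshold=0.8):
    """Return True for successful authentication (block 406), False otherwise (block 407)."""
    # Block 404: spoofing detected -> unsuccessful authentication, no comparison needed.
    if spoof_probability >= spoof_threshold:
        return False
    # Block 405: the voiceprint must (sufficiently) correspond to the reference voiceprint.
    similarity = cosine_similarity(user_voiceprint, reference_voiceprint)
    return similarity >= match_threshold


# Hypothetical two-dimensional voiceprints, used only to exercise the three paths.
spoofed_attempt = authenticate(0.9, [1.0, 0.0], [1.0, 0.0])
genuine_match = authenticate(0.1, [1.0, 0.2], [1.0, 0.25])
genuine_mismatch = authenticate(0.1, [1.0, 0.0], [0.0, 1.0])
```

Checking the spoofing indicator first means that a voiceprint classified as spoofed never reaches the comparison step, which mirrors the direct transition from block 404 to end block 407.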
Figure 5 is a block diagram schematically illustrating trainer systems for training a detector system 100 according to examples. As generally shown in the figure, such trainer systems 500 may include a training set unit 501 and a training loop unit 502. Such a training loop unit 502 may include a classifier trainer unit 503 and may be configured to iteratively perform a classification function implemented by said classifier trainer unit 503. Detector system 100 is fully shown in this figure, but the obtainer module 101 may be disabled or inexistent because (training) data 506 to be processed by the classifier module 102 may come only from the classifier trainer unit 503.
The training set unit 501 may be configured to obtain (or receive or capture) a training set of training voiceprints outputted by the voiceprint generator for training the detector system 100 and, in particular, its classifier module 102. The training loop unit 502 may be configured to perform, for each training voiceprint in (whole or part of) the training set, a training loop implemented or implementable by, e.g., the classifier trainer unit 503. The training set may include training voiceprints that are known to correspond to the genuine-voice subspace (in, e.g., a first training subset 504) and training voiceprints that are known to correspond to the spoofed-voice subspace (in, e.g., a second training subset 505).
Training of the classifier module 102 may include providing, by the classifier trainer unit 503, current (i.e., first or next) training voiceprint to the classifier module 102 for its training to classify the training voiceprint as spoofed or genuine. Classifier module 102 may thus output a probability of whether the training voiceprint is spoofed or genuine.
Training of the classifier module 102 may further include providing, by the classifier trainer unit 503, a classification of the outputted probability in terms of having more or less loss (or divergence) with respect to what the classifier module 102 is expected to output. This classification of the probability outputted by the classifier module 102 may be performed by, e.g., a probability classifier provided by the classifier trainer unit 503, which may have the necessary knowledge about the training voiceprints to be processed for that aim. Such a probability classifier may be trained along with the classifier module 102 in a cooperative manner according to known machine learning techniques. If the loss inferred by the probability classifier is determined to be unacceptable, the same current training voiceprint may be repeatedly processed until the inferred loss becomes acceptable. This repetitive approach may cause both the probability classifier and the classifier module 102 to converge towards the targeted knowledge.
Global or cumulative loss may be calculated periodically, e.g. at completion of each iteration or several iterations of the training loop, depending on the loss attributed to each or several of the processed training voiceprints. This global or cumulative loss may accurately indicate whether the whole detector system 100 is converging or not towards the intended knowledge, at what pace, how accurately, etc. over training cycles. Trainer systems according to Figure 5 may permit training the classifier module 102 in a very interrelated manner with large numbers of training voiceprints that are very representative of the pursued classification knowledge, such that the classifier module 102 may prove much more effective and efficient than in prior art training approaches with the same or similar purposes, as experimentally confirmed by the inventors.
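The per-voiceprint loss and the cumulative loss described above may be illustrated with the following minimal sketch. It assumes a logistic regression classifier trained with binary cross entropy and L2 regularization (options named in the claims), but uses plain stochastic gradient descent rather than L-BFGS-B and does not model the cooperative probability classifier or the repeated processing of a single voiceprint. The two-dimensional toy voiceprints, the hyperparameters and all function names are assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce(p, y, eps=1e-12):
    """Binary cross entropy of probability p against label y (1 = spoofed)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def spoof_probability(w, b, x):
    """Probability that voiceprint x falls within the spoofed-voice subspace."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train(genuine, spoofed, epochs=50, lr=0.5, l2=1e-3, seed=0):
    """Train on both subsets; return weights, bias and mean loss per epoch."""
    rng = random.Random(seed)
    data = [(v, 0) for v in genuine] + [(v, 1) for v in spoofed]
    w, b = [0.0] * len(data[0][0]), 0.0
    history = []
    for _ in range(epochs):
        rng.shuffle(data)
        total = 0.0
        for x, y in data:
            p = spoof_probability(w, b, x)
            total += bce(p, y)                 # loss attributed to this voiceprint
            g = p - y                          # gradient of BCE w.r.t. the logit
            w = [wi - lr * (g * xi + l2 * wi) for wi, xi in zip(w, x)]
            b -= lr * g
        history.append(total / len(data))      # cumulative (mean) loss of the epoch
    return w, b, history


# Hypothetical toy voiceprints: genuine near (1, 1), spoofed near (-1, -1).
gen = random.Random(1)
genuine = [[1 + gen.gauss(0, 0.3), 1 + gen.gauss(0, 0.3)] for _ in range(20)]
spoofed = [[-1 + gen.gauss(0, 0.3), -1 + gen.gauss(0, 0.3)] for _ in range(20)]
w, b, history = train(genuine, spoofed)
```

Comparing successive entries of `history` gives one simple way of observing whether, and how fast, training is converging over training cycles.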
Figure 6 is a flow chart schematically illustrating methods of training a detector system (or trainer methods) according to examples. As generally shown in the figure, trainer methods may be initiated (e.g. at block 600) upon detection of a starting condition such as, e.g., a request for starting the method or an invocation of the method from a user interface or the like. Since trainer methods according to Figure 6 are performable by trainer systems according to Figure 5, and since said trainer systems are intended to train detector systems according to Figure 1, number references from said Figures 1 and 5 may be reused in the following description of Figure 6.
Trainer methods may further include (e.g. at block 601) obtaining or receiving a training set of training voiceprints outputted by the voiceprint generator 303, the training set comprising a first training subset 504 including training voiceprints that are known to correspond to the genuine-voice subspace, and a second training subset 505 including training voiceprints that are known to correspond to the spoofed-voice subspace. This functionality implemented or implementable at block 601 may be performed by, e.g., the training set unit 501 previously described with reference to Figure 5. Functional details and considerations explained about said training set unit 501 may thus be similarly attributed or attributable to method block 601.
Once the training set 504, 505 has been obtained or received, trainer methods may further include performing, for each training voiceprint in whole or part of the training set, a training loop that may include, e.g., block 602 explained below. This training loop may be performed by e.g. training loop unit 502 previously described with reference to Figure 5. Functional details and considerations explained about said training loop unit 502 may thus be similarly attributed or attributable to such a training loop.
Trainer methods may still further include (e.g. at block 602) training the classifier module to classify the first or next training voiceprint taking into account the known correspondence of the training voiceprint to the first training subset 504 or to the second training subset 505. At the first iteration of the training loop, the first training voiceprint in the (whole or part of the) training set 504, 505 may be processed and, otherwise (at any non-first iteration), the next training voiceprint may be processed. Such training may be performed by the classifier trainer unit 503 providing the classifier module 102 with suitable data and/or instructions 506 including, e.g., the first or next training voiceprint. This functionality implemented or implementable at block 602 may be performed by, e.g., the classifier trainer unit 503 previously described with reference to Figure 5. Functional details and considerations explained about said classifier trainer unit 503 may thus be similarly attributed or attributable to method block 602.
Trainer methods may yet furthermore include (e.g. at decision block 603) verifying whether an ending condition is satisfied, in which case, Y, the method may be terminated by, e.g., transitioning to ending block 604 and, otherwise, N, a new iteration of the training loop may be initiated by, e.g., looping back to block 602 so as to process next training voiceprint. Ending condition may include, e.g., completion of the whole or part of the training set (i.e. all training voiceprints to be used for the training have been processed), reception of a user termination request, shutdown or deactivation of the trainer system 500 that is performing the trainer method, etc.
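The control flow of blocks 602 to 604 may be sketched as follows, again only as an illustration. The function `run_trainer`, its callback parameters and the string labels are hypothetical names introduced for this example; the actual training performed at block 602 is abstracted behind the `train_step` callback.

```python
def run_trainer(training_set, train_step, stop_requested=lambda: False):
    """Iterate a training step over the training set until an ending condition holds.

    Returns the number of processed voiceprints and the reason for ending.
    """
    processed = 0
    for voiceprint, label in training_set:   # first, then next training voiceprint
        train_step(voiceprint, label)        # block 602: train the classifier module
        processed += 1
        # Block 603: ending condition such as a user termination request.
        if stop_requested():
            return processed, "stopped"      # block 604
    # Block 603: all training voiceprints to be used have been processed.
    return processed, "completed"            # block 604


# Hypothetical training set mixing both subsets (labels are illustrative only).
training_set = [
    ([0.9, 1.1], "genuine"),
    ([-1.0, -0.8], "spoofed"),
    ([1.2, 0.7], "genuine"),
]

seen = []
count, reason = run_trainer(training_set, lambda vp, label: seen.append(label))

# Same set, but with a termination request raised after the first iteration.
early_count, early_reason = run_trainer(
    training_set, lambda vp, label: None, stop_requested=lambda: True
)
```

Passing the ending condition as a callable keeps the loop independent of how a termination request or a shutdown of the trainer system 500 is actually signalled.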
As used herein, the term “module” or “unit” may be understood to refer to software, firmware, hardware and/or various combinations thereof. It is noted that the modules are exemplary. The modules may be combined, integrated, separated, and/or duplicated to support various applications. Also, a function described herein as being performed by a particular module may be performed by one or more other modules and/or by one or more other devices instead of or in addition to the function performed by the described particular module.
The modules may be implemented across multiple devices, associated or linked to corresponding detector methods and/or authenticator methods and/or trainer methods proposed herein, and/or to other components that may be local or remote to one another. Additionally, the modules may be moved from one device and added to another device, and/or may be included in both devices, associated to corresponding detector methods and/or authenticator methods and/or trainer methods proposed herein. Any software implementations may be tangibly embodied in one or more storage media, such as e.g. a memory device, a floppy disk, a compact disk (CD), a digital versatile disk (DVD), or other devices that may store computer code.
The detector methods and/or authenticator methods and/or trainer methods according to present disclosure may be implemented by computing means, electronic means or a combination thereof. The computing means may be a set of instructions (e.g. a computer program) and then detector methods and/or authenticator methods and/or trainer methods may comprise a memory and a processor, embodying said set of instructions stored in the memory and executable by the processor. These instructions may comprise functionality or functionalities to execute corresponding detector methods and/or authenticator methods and/or trainer methods such as e.g. the ones described with reference to other figures.
In case the detector methods and/or authenticator methods and/or trainer methods are implemented only by electronic means, a controller of the system may be, for example, a CPLD (Complex Programmable Logic Device), an FPGA (Field Programmable Gate Array) or an ASIC (Application-Specific Integrated Circuit).
In case the detector methods and/or authenticator methods and/or trainer methods are a combination of electronic and computing means, the computing means may be a set of instructions (e.g. a computer program) and the electronic means may be any electronic circuit capable of implementing corresponding steps of the detector methods and/or authenticator methods and/or trainer methods proposed herein, such as those described with reference to other figures.
The computer program(s) may be embodied on a storage medium (for example, a CD-ROM, a DVD, a USB drive, a computer memory or a read-only memory) or carried on a carrier signal (for example, on an electrical or optical carrier signal).
The computer program(s) may be in the form of source code, object code, a code intermediate source and object code such as in partially compiled form, or in any other form suitable for use in implementing the detector methods and/or authenticator methods and/or trainer methods according to present disclosure. The carrier may be any entity or device capable of carrying the computer program(s).
For example, the carrier may comprise a storage medium, such as a ROM, for example a CD ROM or a semiconductor ROM, or a magnetic recording medium, for example a hard disk. Further, the carrier may be a transmissible carrier such as an electrical or optical signal, which may be conveyed via electrical or optical cable or by radio or other means.
When the computer program(s) is/are embodied in a signal that may be conveyed directly by a cable or other device or means, the carrier may be constituted by such cable or other device or means. Alternatively, the carrier may be an integrated circuit in which the computer program(s) is/are embedded, the integrated circuit being adapted for performing, or for use in the performance of, the detector methods and/or authenticator methods and/or trainer methods proposed herein.
Detector methods and systems according to present disclosure are based on processing a voiceprint produced by a voiceprint generator (or model) and, therefore, they provide advantages of reusability of something (the voiceprint) that is exclusively aimed at authentication. Voiceprint generators are used to generate voiceprints as unique identifiers of a person for his/her authentication based on voiceprint comparison. Detector methods and systems proposed herein may thus permit reusing a voiceprint for an innovative second aim (spoof detection) apart from the main or exclusive one (authentication).
Authenticator methods and systems, such as the ones described herein, may use a single biometric model (e.g., a single AI model) as voiceprint generator for both authentication and spoof detection. Indeed, detector methods and systems suggested herein avoid the need for one specific voiceprint generator for authentication and another specific voiceprint generator (or whatever computing means) for spoof detection. The use of a single two-functional voiceprint generator/model may thus provide important benefits in terms of saving computational resources versus using two different single-functional voiceprint generators. Less memory and fewer processing resources may be needed in such a two-functional voiceprint approach versus a single-functional voiceprint approach. These benefits may be especially advantageous in devices configured to perform authentication locally based on, e.g., a single biometric model embedded therein. Furthermore, methods and systems supporting the two-functional voiceprint approach may perform authentication and spoof detection (much) more efficiently (i.e. faster) in comparison to the single-functional voiceprint approach.
There exist situations in which there may be limitations in terms of, e.g., access to communications and/or to verification systems in the cloud. In situations like Humanitarian Crisis, Environmental Catastrophes or situations where communications may be temporarily offline, it may be helpful to have on-device applications that allow, for instance, an authentication or verification process to distribute humanitarian help such as, e.g., food, goods, health assistance, etc. In that sense, on-device solutions may have some limitations in terms of computing capacity as well as storage capacity. Any product that may help to attenuate these constraints may be extremely helpful, such as detector/authenticator methods and systems according to present disclosure.
As discussed in detail above, detector/authenticator methods and systems according to the present disclosure allow having only one voiceprint generator stored on-device as well as only one data storage (e.g., a database) containing only one reference voiceprint to authenticate, verify an identity and validate its authenticity in the field and on-device. The anonymity of the data stored on-device (i.e. an irreversible voiceprint) may also be well secured. In this scenario, humanitarian workers may operate easily and securely in catastrophic situations, in a friendly on-device environment that needs less (perhaps half) memory and processing resources in comparison to having two different biometric models, one for authentication and another for spoof detection. Assuming that each of said biometric models/modules weighs 20 MB, having two modules implies storing 40 MB and executing twice, whereas a single two-functional biometric module requires storing 20 MB and executing once.
Although only a number of examples have been disclosed herein, other alternatives, modifications, uses and/or equivalents thereof are possible. Furthermore, all possible combinations of the described examples are also covered. Thus, the scope of the disclosure should not be limited by particular examples, but it should be determined only by a fair reading of the claims that follow.

Claims

1. Detector method of detecting a spoofing in a user voice audio, the detector method comprising: obtaining a user voiceprint outputted by a voiceprint generator, the user voiceprint mathematically representing the user voice audio in a voiceprint representation space including a genuine-voice subspace and a spoofed-voice subspace; classifying the user voiceprint as corresponding to the genuine-voice subspace or to the spoofed-voice subspace; and detecting the spoofing in the user voice audio depending on whether the user voiceprint has been classified as corresponding to the genuine-voice subspace or to the spoofed-voice subspace.
2. Detector method according to claim 1, the genuine-voice subspace and the spoofed-voice subspace are disjoint voiceprint subspaces and, therefore, with no overlapping between them.
3. Detector method according to any of claims 1 or 2, the classifying of the user voiceprint is performed based on a logistic regression model.
4. Detector method according to any of claims 1 to 3, further comprising: providing the user voice audio to the voiceprint generator in such a way that the voiceprint generator outputs the user voiceprint mathematically representing the user voice audio.
5. Detector method according to any of claims 1 to 4, further comprising: predefining the genuine-voice subspace by clustering, in the voiceprint representation space, a plurality of voiceprint samples that are known to mathematically represent genuine or authentic voice audios.
6. Detector method according to any of claims 1 to 5, further comprising: predefining the spoofed-voice subspace by clustering, in the voiceprint representation space, a plurality of voiceprint samples that are known to mathematically represent spoofed voice audios.
7. Detector method according to claim 6, wherein the spoofed voice audios include voice audios resulting from voice recording and subsequent replay of the recorded voice, so as to detect replay attacks.
8. Detector method according to claim 7, wherein the voice recording and subsequent replay of the recorded voice correspond to voice recording and subsequent replay of the recorded voice using a smartphone.
9. Detector method according to any of claims 6 to 8, wherein the spoofed voice audios include voice audios resulting from voice synthesis, so as to detect voice synthesis attacks.
10. Detector method according to any of claims 6 to 9, wherein the spoofed voice audios include voice audios resulting from voice conversion, so as to detect voice conversion attacks.
11. Detector method according to any of claims 1 to 10, wherein the classifying of the user voiceprint includes outputting a probability that the user voiceprint corresponds to or falls within the genuine-voice subspace or to the spoofed-voice subspace.
12. Detector method according to claim 11, wherein the detecting of the spoofing in the user voice audio includes verifying whether the probability outputted by the classifying of the user voiceprint satisfies or dissatisfies a predefined spoofing threshold.
13. Detector method according to any of claims 1 to 12, wherein the spoofing to be detected corresponds to a spoofing by replay attack or voice synthesis attack or voice conversion attack.
14. Authenticator method of authenticating a person, including a detector method according to any of claims 1 to 13 in such a way that the authenticating of the person is performed by obtaining, from the person, the user voice audio; inputting the user voice audio into the voiceprint generator in such a way that the voiceprint generator outputs the user voiceprint; performing the detector method to detect a spoofing in the user voiceprint; and authenticating the person depending on whether no spoofing has been detected in the user voiceprint and, in said case, on a comparison between the user voiceprint and a reference voiceprint of the person.
15. Detector system for detecting a spoofing in a user voice audio, the detector system comprising: an obtainer module configured to obtain a user voiceprint outputted by a voiceprint generator, the user voiceprint mathematically representing the user voice audio in a voiceprint representation space including a genuine-voice subspace and a spoofed-voice subspace; a classifier module configured to classify the user voiceprint as corresponding to the genuine-voice subspace or to the spoofed-voice subspace; and a detector module configured to detect the spoofing in the user voice audio depending on whether the user voiceprint has been classified as corresponding to the genuine-voice subspace or to the spoofed-voice subspace.
16. Detector system according to claim 15, wherein the classifier module is a machine learning module which is, therefore, trainable using machine learning.
17. Trainer method of training a detector system according to claim 16, the trainer method comprising: obtaining a training set of training voiceprints outputted by the voiceprint generator, the training set including a first training subset and a second training subset, the first training subset including training voiceprints that are known to correspond to the genuine-voice subspace, and the second training subset including training voiceprints that are known to correspond to the spoofed-voice subspace; and performing, for each training voiceprint in whole or part of the training set, a training loop including training the classifier module to classify the training voiceprint taking into account the known correspondence of the training voiceprint to the first training subset or to the second training subset.
18. Trainer method according to claim 17, wherein the training of the classifier module includes minimizing binary cross entropy with training voiceprints in the first training subset and training voiceprints in the second training subset.
19. Trainer method according to any of claims 17 or 18, wherein the training of the classifier module includes applying an L2 regularization method.
20. Trainer method according to any of claims 17 to 19, wherein the training of the classifier module includes applying a large-scale bounded constraint based on an L-BFGS-B optimization algorithm.
21. Trainer system for training a detector system according to claim 16, the trainer system comprising: a training set unit configured to obtain a training set of training voiceprints outputted by the voiceprint generator, the training set including a first training subset and a second training subset, the first training subset including training voiceprints that are known to correspond to the genuine-voice subspace, and the second training subset including training voiceprints that are known to correspond to the spoofed-voice subspace; and a training loop unit configured to perform, for each training voiceprint in whole or part of the training set, a training loop implemented by a classifier trainer unit configured to train the classifier module to classify the training voiceprint taking into account the known correspondence of the training voiceprint to the first training subset or to the second training subset.
22. Computer program comprising program instructions for causing a computer or system to perform a detector method according to any of claims 1 to 13.
23. Computer program according to claim 22, embodied on a storage medium or carried on a carrier signal.
24. Computer program comprising program instructions for causing a computer or system to perform a trainer method according to any of claims 17 to 20.
25. Computer program according to claim 24, embodied on a storage medium or carried on a carrier signal.
PCT/EP2022/072121 2022-07-01 2022-08-05 Detecting a spoofing in a user voice audio WO2024002501A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP22382631 2022-07-01
EP22382631.4 2022-07-01

Publications (1)

Publication Number Publication Date
WO2024002501A1 (en) 2024-01-04

Family

ID=82404110

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2022/072121 WO2024002501A1 (en) 2022-07-01 2022-08-05 Detecting a spoofing in a user voice audio

Country Status (1)

Country Link
WO (1) WO2024002501A1 (en)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22762024

Country of ref document: EP

Kind code of ref document: A1