CN116524936A - Voiceprint authentication method and system based on StarGAN-VC - Google Patents


Info

Publication number
CN116524936A
Authority
CN
China
Prior art keywords
voice signal
authentication
original
generator network
registration
Legal status
Pending
Application number
CN202310413563.2A
Other languages
Chinese (zh)
Inventor
宣晓彬
苏兆品
黄伊婷
王彤
陈智慧
张国富
岳峰
Current Assignee
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date
Application filed by Hefei University of Technology
Priority to CN202310413563.2A
Publication of CN116524936A


Classifications

    • G10L 17/04 Speaker identification or verification techniques; training, enrolment or model building
    • G10L 25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • H04L 63/0428 Network architectures or network communication protocols for network security wherein the data content is protected, e.g. by encrypting or encapsulating the payload
    • H04L 63/0861 Network architectures or network communication protocols for authentication of entities using biometrical features, e.g. fingerprint, retina-scan


Abstract

The invention provides a voiceprint authentication method and system based on StarGAN-VC, relating to the field of voiceprint authentication. The method comprises the following steps: using the original registration audio as training samples, a generator network in the StarGAN-VC network architecture is trained iteratively, and a first generator network and a second generator network are obtained at different numbers of training iterations. The original registration voice signal is converted into a first registration voice signal by the second generator network, so that the first registration voice signal differs measurably from the original registration voice signal, protecting the voiceprint privacy of the original registration voice signal. Because the specific transformation applied by the generator in the StarGAN-VC network architecture cannot be determined from the outside, the confidentiality of the voice-signal conversion is improved, providing a reliable guarantee for the privacy of the user's original voiceprint feature information.

Description

Voiceprint authentication method and system based on StarGAN-VC
Technical Field
The invention relates to the technical field of voiceprint authentication, in particular to a voiceprint authentication method and system based on StarGAN-VC.
Background
With the development of information technology, higher requirements are placed on trusted identity authentication. In February 2020, the People's Bank of China issued the "Personal Financial Information Protection Technical Specification", which for the first time separated the dynamic voiceprint password from other personal biometric information and placed it alongside dynamic passwords. The industry accepts it as a type of personal information with lower privacy sensitivity, which shows the unique advantages and broad prospects of voiceprints in personal privacy protection.
Existing voiceprint authentication methods emphasize the accuracy of the recognition result and the efficiency of the authentication process. However, as voiceprint authentication becomes widely used, users' voice data is increasingly coveted by malicious actors. Specifically, the voice data of a user is often stored in the cloud either unprocessed or only simply encrypted. If an attacker controls a sound-acquisition device such as a mobile phone or self-service terminal, then after obtaining the user's voice data, or after breaking the simple encryption, the attacker can directly replay the user's voiceprint feature information.
In summary, existing voiceprint authentication methods either do not encrypt the user's original voice or use only simple encryption, so the privacy of the user's original voiceprint feature information is poorly protected. A voiceprint authentication method is therefore needed to solve the above problems.
Disclosure of Invention
(I) Technical problems to be solved
Aiming at the deficiencies of the prior art, the invention provides a voiceprint authentication method and system based on StarGAN-VC, which solve the problem that the privacy of the user's original voiceprint feature information is poorly protected because existing voiceprint authentication methods either do not encrypt the user's original voice or use only simple encryption.
(II) Technical scheme
In order to achieve the above purpose, the invention is realized by the following technical scheme:
in a first aspect of the present invention, there is provided a StarGAN-VC-based voiceprint authentication method, the method comprising:
acquiring a first authentication voice signal; the first authentication voice signal is obtained by preprocessing an original authentication voice signal;
processing the first authentication voice signal based on a first generator network acquired in advance to obtain a second authentication voice signal; the acquiring process of the first generator network comprises: taking an original registration voice signal as a training sample, carrying out iterative training on a generator network in a StarGAN-VC network architecture based on the training sample, and, when the number of training iterations reaches a first preset value, taking the corresponding generator network as the first generator network;
calculating the similarity between the second authentication voice signal and the first registration voice signal, and if the similarity is larger than a first preset threshold, determining that the original authentication voice signal is successfully authenticated; wherein the first registration voice signal is obtained based on the original registration voice signal and a second generator network acquired in advance; the acquiring process of the second generator network comprises: taking the original registration voice signal as a training sample, carrying out iterative training on a generator network in the StarGAN-VC network architecture based on the training sample, and, when the number of training iterations reaches a second preset value, taking the corresponding generator network as the second generator network; the second preset value is smaller than the first preset value.
Optionally, the method further comprises:
acquiring an original authentication voice signal;
and acquiring the mel-frequency cepstrum coefficients of the original authentication voice signal as the first authentication voice signal.
Optionally, the processing the first authentication voice signal based on the pre-acquired first generator network to obtain a second authentication voice signal includes:
the first authentication voice signal and the target voice tag are jointly input into the pre-acquired first generator network, and the signal output by the first generator network is obtained as a third authentication voice signal; wherein the target voice tag is the voice tag of a preselected target voice signal;
the third authentication voice signal and the original authentication voice tag are jointly input into the first generator network, and the signal output by the first generator network is obtained as a fourth authentication voice signal; wherein the original authentication voice tag is the voice tag of the original authentication voice signal;
acquiring a first spectrum gain function based on the third authentication voice signal and the fourth authentication voice signal;
multiplying the spectrum envelope of the original authentication voice by the first spectrum gain function, and processing the product with a vocoder to obtain the second authentication voice signal.
Optionally, the process of performing iterative training on the generator network in the StarGAN-VC network architecture by using the original registered voice signal as a training sample includes:
acquiring the mel-frequency cepstrum coefficients of the original registration voice signal as the voice signal to be processed;
jointly inputting the voice signal to be processed and the target audio tag into the generator network to be trained in the StarGAN-VC network architecture to obtain a first synthesized voice signal;
inputting the first synthesized voice signal into a pre-trained discriminator network to obtain the probability that the first synthesized voice signal is the target voice signal, and jointly inputting the first synthesized voice signal and the original audio tag into the generator network to be trained in the StarGAN-VC network architecture to obtain a reconstructed voice signal;
calculating a loss value of the generator network based on the probability that the first synthesized voice signal is the target voice signal;
adjusting the network parameters of the generator network based on the loss value of the generator network; after the network parameters of the generator network are adjusted, this is recorded as the completion of one training iteration;
and replacing the voice signal to be processed with the reconstructed voice signal, jointly inputting the replaced voice signal and the target audio tag into the generator network with the adjusted parameters to obtain a new first synthesized voice signal, and continuing the iterative training.
Optionally, before calculating the similarity between the second authentication voice signal and the first registration voice signal, the method further includes:
acquiring a second registration voice signal; wherein the second registration voice signal is obtained based on encrypting the first registration voice signal;
and decrypting the second registration voice signal to obtain the first registration voice signal.
Optionally, before the acquiring the second registration voice signal, the method further includes:
acquiring the original registration voice signal;
acquiring a mel frequency cepstrum coefficient of the original registration voice signal as a registration voice signal to be processed;
inputting the registration voice signal to be processed and the target voice tag into a pre-trained second generator network to obtain the signal output by the second generator network as a third registration voice signal;
inputting the third registration voice signal and the original registration voice tag into the second generator network to obtain a signal output by the second generator network as a fourth registration voice signal; wherein the original registered voice tag is a voice tag of the original registered voice signal;
acquiring a second spectrum gain function based on the third registered voice signal and the fourth registered voice signal;
multiplying the spectrum envelope of the original registration voice by the second spectrum gain function, and processing the product with a vocoder to obtain the first registration voice signal;
and encrypting the first registration voice signal to obtain a second registration voice signal.
Optionally, encrypting the first registration voice signal to obtain a second registration voice signal includes:
deleting the original registration voice signal;
encrypting the first registration voice signal based on the AES encryption library to obtain an encrypted first registration voice signal serving as a second registration voice signal.
Optionally, after encrypting the first registration voice signal based on the AES encryption library to obtain an encrypted first registration voice signal as the second registration voice signal, the method further includes:
storing the second registration voice signal in a database, and deleting the user password; wherein the user password is the same as the key used for the AES encryption.
Optionally, before deleting the user password, the method further includes:
hashing the user password to obtain a corresponding hash value as a user hash value;
before acquiring the original authentication speech signal, the method further comprises:
acquiring an authentication password;
hashing the authentication password to obtain a corresponding hash value as an authentication hash value;
judging whether the authentication hash value is the same as the user hash value; if yes, indicating that the authentication password passes;
the step of decrypting the second registration voice signal to obtain the first registration voice signal includes:
acquiring the authentication password;
and decrypting the second registration voice signal based on the authentication password to obtain the first registration voice signal.
In a second aspect of the present invention, there is provided a StarGAN-VC-based voiceprint authentication system, the system comprising:
the signal acquisition module is used for acquiring a first authentication voice signal; the first authentication voice signal is obtained by preprocessing an original authentication voice signal;
the signal processing module is used for processing the first authentication voice signal based on a first generator network acquired in advance to obtain a second authentication voice signal; the acquiring process of the first generator network is as follows: taking an original registration voice signal as a training sample, carrying out iterative training on a generator network in a StarGAN-VC network architecture based on the training sample, and, when the number of training iterations reaches a first preset value, taking the corresponding generator network as the first generator network;
the similarity calculation module is used for calculating the similarity between the second authentication voice signal and the first registration voice signal, and if the similarity is larger than a first preset threshold, the original authentication voice signal is successfully authenticated; wherein the first registration voice signal is obtained based on the original registration voice signal and a second generator network acquired in advance; the acquiring process of the second generator network is as follows: taking the original registration voice signal as a training sample, carrying out iterative training on a generator network in the StarGAN-VC network architecture based on the training sample, and, when the number of training iterations reaches a second preset value, taking the corresponding generator network as the second generator network; the second preset value is smaller than the first preset value.
(III) Beneficial effects
The invention provides a voiceprint authentication method and system based on StarGAN-VC. Compared with the prior art, the method has the following beneficial effects:
the invention provides a voiceprint authentication method based on StarGAN-VC, which comprises the following steps: acquiring a first authentication voice signal, wherein the first authentication voice signal is obtained by preprocessing the original authentication voice signal; processing the first authentication voice signal based on a first generator network acquired in advance to obtain a second authentication voice signal, wherein the first generator network is obtained by taking the original registration voice signal as a training sample, iteratively training a generator network in the StarGAN-VC network architecture on that sample, and taking the corresponding generator network as the first generator network when the number of training iterations reaches a first preset value; and calculating the similarity between the second authentication voice signal and the first registration voice signal, where a similarity larger than a first preset threshold indicates that the original authentication voice signal is successfully authenticated. The first registration voice signal is obtained based on the original registration voice signal and a second generator network acquired in advance; the second generator network is obtained in the same way as the first, except that training stops when the number of training iterations reaches a second preset value, the second preset value being smaller than the first preset value.
Based on the above processing, the invention uses the original registration audio as training samples to iteratively train the generator network in the StarGAN-VC network architecture, and obtains the first generator network and the second generator network at different numbers of training iterations. The original registration voice signal is converted into the first registration voice signal by the second generator network, so that the first registration voice signal differs measurably from the original registration voice signal, protecting the voiceprint privacy of the original registration voice signal. The original authentication voice signal is likewise converted into the second authentication voice signal by the first generator network, effectively protecting the voiceprint privacy of the original authentication voice signal. Because the specific transformation applied by the generator in the StarGAN-VC network architecture during conversion cannot be determined by existing techniques, the confidentiality of the voice-signal conversion is greatly improved; even if the first registration voice signal is obtained, it cannot be converted back into the original registration voice signal by existing techniques, providing a reliable guarantee for the privacy of the user's original voiceprint feature information.
Moreover, because the first generator network and the second generator network are trained on the same training samples and the same StarGAN-VC network architecture and differ only in the number of training iterations, the conversion applied to the original registration voice signal is essentially the same as that applied to the original authentication voice signal. Voiceprint authentication of the original authentication voice signal can therefore be completed by calculating the similarity between the second authentication voice signal and the first registration voice signal, which preserves the accuracy of voiceprint authentication while protecting the voiceprint privacy of the voice signals.
In addition, because the number of training iterations corresponding to the second generator network is smaller than that corresponding to the first generator network, the difference between the original registration voice signal and the first registration voice signal is smaller than the difference between the original authentication voice signal and the second authentication voice signal. The first registration voice signal therefore retains more voice characteristics of the original registration voice signal while still protecting its voiceprint privacy, which further improves the accuracy of the subsequent voiceprint authentication.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a voiceprint authentication method based on StarGAN-VC provided by an embodiment of the present invention;
fig. 2 is a schematic diagram of an extraction flow of mel-frequency cepstrum coefficients according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of iterative training of a generator network according to an embodiment of the present invention;
fig. 4 is a flowchart of another voiceprint authentication method based on StarGAN-VC according to an embodiment of the present invention;
FIG. 5 is a graph of test results of similarity between synthesized audio and original audio according to an embodiment of the present invention;
fig. 6 is a block diagram of a voiceprint authentication system based on StarGAN-VC according to an embodiment of the present invention;
fig. 7 is a diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions in the embodiments of the present invention are clearly and completely described, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The voiceprint authentication method and system based on StarGAN-VC according to the embodiments of the invention solve the problem that existing voiceprint authentication methods poorly protect the privacy of the user's original voiceprint feature information. The original registration audio is used as training samples to iteratively train the generator network in the StarGAN-VC network architecture, and the first generator network and the second generator network are obtained at different numbers of training iterations. The original registration voice signal is converted into the first registration voice signal by the second generator network, so that the first registration voice signal differs measurably from the original registration voice signal, protecting the voiceprint privacy of the original registration voice signal. The original authentication voice signal is likewise converted into the second authentication voice signal by the first generator network, effectively protecting the voiceprint privacy of the original authentication voice signal.
To solve the above technical problems, the overall idea of the technical scheme in the embodiments of the application is as follows:
according to the invention, the original registration audio is used as a training sample, iterative training is carried out on the generator network in the StarGAN-VC network architecture, and the first generator network and the second generator network are respectively obtained based on different iterative training times. And the original registration voice signals are converted into first registration audio signals through the second generator network, so that the first registration audio signals and the original registration voice signals have certain differences, and the security of voiceprint privacy of the original registration voice signals is ensured. And the original authentication voice signal is converted into the second authentication voice signal through the first generator network, so that the voiceprint privacy security of the original authentication voice signal is also effectively ensured. In the process of converting the voice signal by the generator in the StarGAN-VC network architecture, the specific conversion means in the conversion process cannot be known in the prior art, so that the confidentiality of voice signal conversion is greatly improved, and even if the first registration voice signal is acquired, the first registration voice signal cannot be reversely converted into the original registration voice signal in the prior art, and reliable guarantee is provided for the voiceprint privacy security of the original voiceprint characteristic information of the user.
In order to better understand the above technical solutions, the following detailed description will refer to the accompanying drawings and specific embodiments.
First, basic knowledge related to the technical scheme of the invention is introduced.
A voiceprint can be represented as the speech features contained in speech that characterize and identify the speaker, or as a speech model built from those speech features (parameters). In theory, a voiceprint is like a fingerprint: rarely do two people share the same voiceprint characteristics, so the authenticity of a user's identity can be judged from the voiceprint. Correspondingly, voiceprint recognition is the process of identifying, from the voiceprint features of a piece of speech to be recognized, the speaker to whom that speech belongs.
Depending on the application scenario, voiceprint recognition can be divided into the following categories: voiceprint confirmation, voiceprint identification, voiceprint detection, and voiceprint tracking. Voiceprint confirmation can be understood as: given a speaker's voiceprint model and a segment of speech containing only one speaker, determining whether that segment was spoken by the speaker. Notably, the voiceprint authentication method of the present technical scheme is applied to the voiceprint confirmation scenario.
Because existing voiceprint authentication methods either do not encrypt the user's original voice or use only simple encryption, the privacy of the user's original voiceprint feature information is poorly protected; the invention therefore provides a voiceprint authentication method based on StarGAN-VC to solve this problem. Referring to fig. 1, fig. 1 is a flowchart of a voiceprint authentication method based on StarGAN-VC according to an embodiment of the present invention. As shown in fig. 1, the method includes the following steps:
s1, acquiring a first authentication voice signal.
The first authentication voice signal is obtained by preprocessing the original authentication voice signal.
S2, processing the first authentication voice signal based on a first generator network acquired in advance to obtain a second authentication voice signal.
The acquiring process of the first generator network comprises: taking the original registration voice signal as a training sample, carrying out iterative training on a generator network in the StarGAN-VC network architecture based on the training sample, and, when the number of training iterations reaches a first preset value, taking the corresponding generator network as the first generator network.
S3, calculating the similarity between the second authentication voice signal and the first registration voice signal, and if the similarity is larger than a first preset threshold value, successfully authenticating the original authentication voice signal.
Wherein the first registration voice signal is obtained based on the original registration voice signal and a second generator network acquired in advance; the acquiring process of the second generator network comprises: taking the original registration voice signal as a training sample, carrying out iterative training on a generator network in the StarGAN-VC network architecture based on the training sample, and, when the number of training iterations reaches a second preset value, taking the corresponding generator network as the second generator network; the second preset value is smaller than the first preset value.
Based on the above processing, the invention uses the original registration audio as training samples to iteratively train the generator network in the StarGAN-VC network architecture, and obtains the first generator network and the second generator network at different numbers of training iterations. The original registration voice signal is converted into the first registration voice signal by the second generator network, so that the first registration voice signal differs measurably from the original registration voice signal, protecting the voiceprint privacy of the original registration voice signal. The original authentication voice signal is likewise converted into the second authentication voice signal by the first generator network, effectively protecting the voiceprint privacy of the original authentication voice signal. Because the specific transformation applied by the generator in the StarGAN-VC network architecture during conversion cannot be determined by existing techniques, the confidentiality of the voice-signal conversion is greatly improved; even if the first registration voice signal is obtained, it cannot be converted back into the original registration voice signal by existing techniques, providing a reliable guarantee for the privacy of the user's original voiceprint feature information.
Moreover, because the first generator network and the second generator network are trained on the same training samples and the same StarGAN-VC network architecture and differ only in the number of training iterations, the conversion applied to the original registration voice signal is essentially the same as that applied to the original authentication voice signal. Voiceprint authentication of the original authentication voice signal can therefore be completed by calculating the similarity between the second authentication voice signal and the first registration voice signal, which preserves the accuracy of voiceprint authentication while protecting the voiceprint privacy of the voice signals.
In addition, because the number of training iterations corresponding to the second generator network is smaller than that corresponding to the first generator network, the difference between the original registration voice signal and the first registration voice signal is smaller than the difference between the original authentication voice signal and the second authentication voice signal. The first registration voice signal therefore retains more voice characteristics of the original registration voice signal while still protecting its voiceprint privacy, which further improves the accuracy of the subsequent voiceprint authentication.
The voiceprint authentication method can be applied to voiceprint confirmation scenarios such as access control and user account login, and is executed by an electronic device such as a computer, a mobile phone, or a voice processing device. In the following description, the electronic device is represented by a computer. The original registration voice signal is the audio signal input to the computer by the user during registration, before voiceprint authentication; the original authentication voice signal is the audio signal to be authenticated that the user inputs to the computer during voiceprint authentication.
For step S1, the first authentication voice signal is the voice signal obtained by preprocessing the original authentication voice signal.
Because the original voice signals recorded by the user (i.e., the original registration voice signal and the original authentication voice signal of the invention) are time-series signals, i.e., one-dimensional arrays, while the generator in the StarGAN-VC framework processes two-dimensional data (i.e., matrix data, which can be understood as a spectrogram), the original voice signals must be preprocessed before being input into the generator. The preprocessing converts the original voice signal from a one-dimensional time-series array into a two-dimensional matrix.
Because the speech characteristics of a voice signal are contained in its MFCCs (Mel-scale Frequency Cepstral Coefficients), which are stored in matrix form, the preferred preprocessing method of the invention is to extract the mel-frequency cepstrum coefficients of the original voice signal.
In some embodiments, prior to step S1, the method may include the steps of:
s4, acquiring an original authentication voice signal.
S5, acquiring the mel-frequency cepstrum coefficients of the original authentication voice signal as the first authentication voice signal.
In one implementation, the process of obtaining the mel-frequency cepstrum coefficients of the original authentication voice signal can be divided into the following steps:
Step one, pre-emphasis is applied to the original authentication voice signal, that is, the high-frequency part of the speech is boosted, flattening the signal's spectrum and filtering noise.
Step two, the pre-emphasized original authentication voice signal is framed to discretize the continuous voice signal. Specifically, the signal is sampled at 22.5 kHz (22,500 samples per second), a fixed number of consecutive samples is taken as one frame, and each frame is multiplied by a Hamming window.
Step three, an FFT (Fast Fourier Transform) is performed on each frame of the framed original authentication voice signal to convert the time-domain signal into a frequency-domain signal; concatenating the frequency-domain frames in time yields the spectrogram of the voice signal.
Step four, the FFT output is passed through a mel-scale filter bank to obtain the mel spectrum; the spectrogram obtained in step three is large, and the filter bank compresses it into sound features of an appropriate size.
Step five, cepstrum analysis is performed on the mel spectrum to obtain the mel-frequency cepstrum coefficients, where the cepstrum analysis consists of taking the logarithm followed by a DCT (Discrete Cosine Transform).
Referring to fig. 2, fig. 2 is a schematic diagram of an extraction flow of mel-frequency cepstrum coefficients according to an embodiment of the present invention.
It will be appreciated that the process of obtaining the mel-frequency cepstrum coefficients of the original registration voice signal is the same as the process described above for the original authentication voice signal.
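For illustration only, the following is a minimal sketch of the preprocessing described above. The librosa toolkit and the coefficient count are assumptions (the embodiment fixes only the 22.5 kHz sampling rate and the five steps):

```python
# Hedged sketch of the MFCC preprocessing; librosa and n_mfcc=36 are assumptions.
import librosa
import numpy as np

def extract_mfcc(wav_path: str, sr: int = 22050, n_mfcc: int = 36) -> np.ndarray:
    """Convert a 1-D recorded waveform into the 2-D MFCC matrix fed to the generator."""
    y, _ = librosa.load(wav_path, sr=sr)   # load and resample to ~22.5 kHz
    y = librosa.effects.preemphasis(y)     # step one: boost the high-frequency part
    # librosa.feature.mfcc applies framing and windowing, the FFT, the mel
    # filter bank, the logarithm and the DCT internally (steps two to five).
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
```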
In some embodiments, step S2 may include the steps of:
s201, inputting a first authentication voice signal and a target voice tag into a first generator network which is acquired in advance together to obtain a signal output by the first generator network as a third authentication voice signal; wherein the target voice tag is a voice tag of a preselected target voice signal.
S202, the third authentication voice signal and the original authentication voice tag are jointly input into the first generator network to obtain the signal output by the first generator network as a fourth authentication voice signal. The original authentication voice tag is the voice tag of the original authentication voice signal.
S203, acquiring a first spectrum gain function based on the third authentication voice signal and the fourth authentication voice signal.
S204, multiplying the spectrum envelope of the original authentication voice by the first spectrum gain function, and processing the product with a vocoder to obtain the second authentication voice signal.
It can be understood that the first authentication voice signal is the two-dimensional data obtained by preprocessing the original authentication voice signal, while the second authentication voice signal is a time-domain signal obtained from the first generator network's outputs via the first spectrum gain function, the vocoder, and related processing.
Based on the above processing, the first authentication voice signal is converted into the second authentication voice signal, effectively protecting the voiceprint privacy of the original authentication voice signal.
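For illustration only, the following is a hedged sketch of steps S201-S204. The generator call signature, the ratio form of the spectrum gain function, and the vocoder interface are all assumptions; the embodiment fixes only the two generator passes and the envelope multiplication:

```python
# Hedged sketch of S201-S204; the gain formula and interfaces are assumptions.
import numpy as np

def protect_authentication_voice(generator, mfcc, target_tag, source_tag,
                                 spectral_envelope, f0, aperiodicity, vocoder):
    third = generator(mfcc, target_tag)    # S201: convert toward the target voice
    fourth = generator(third, source_tag)  # S202: convert back toward the source voice
    eps = 1e-8                             # avoid division by zero
    # S203: assumed gain = element-wise ratio of the two converted spectra
    # (shapes are assumed to match the spectral envelope's resolution).
    gain = (np.abs(third) + eps) / (np.abs(fourth) + eps)
    protected = spectral_envelope * gain   # S204: apply the gain to the envelope
    return vocoder(f0, protected, aperiodicity)  # resynthesize a time-domain signal
```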
In some embodiments, the acquiring process of the first generator network comprises: taking the original registration voice signal as a training sample, carrying out iterative training on a generator network in the StarGAN-VC network architecture based on the training sample, and, when the number of training iterations reaches a first preset value, taking the corresponding generator network as the first generator network.
In one implementation, the process of iteratively training the generator network in the StarGAN-VC network architecture using the original registration voice signal as a training sample includes the following steps:
S301, acquiring the mel-frequency cepstrum coefficients of the original registration voice signal as the voice signal to be processed.
S302, the voice signal to be processed and the target audio tag are jointly input into the generator network to be trained in the StarGAN-VC network architecture to obtain a first synthesized voice signal.
S303, inputting the first synthesized voice signal into a pre-trained discriminator network to obtain the probability that the first synthesized voice signal is the target voice signal; and jointly inputting the first synthesized voice signal and the original audio tag into the generator network to be trained in the StarGAN-VC network architecture to obtain a reconstructed voice signal.
S304, calculating a loss value of the generator network based on the probability that the first synthesized voice signal is the target voice signal.
S305, adjusting the network parameters of the generator network based on the loss value of the generator network; after the network parameters of the generator network are adjusted, this is recorded as the completion of one training iteration.
S306, replacing the voice signal to be processed with the reconstructed voice signal, jointly inputting the replaced voice signal and the target audio tag into the generator network with the adjusted parameters to obtain a new first synthesized voice signal, and continuing the iterative training.
Here the audio tag represents the speech features of a voice signal. The preselected target voice signal is a segment of voice chosen in advance by the user, and its audio tag serves as the target audio tag.
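For illustration only, the following is a minimal PyTorch sketch of one training iteration (steps S302-S306). The adversarial loss form and the optimizer interface are assumptions not fixed by the embodiment, and a complete StarGAN-VC setup would add further loss terms (e.g., cycle-consistency) beyond the steps listed here:

```python
# Hedged sketch of one generator training iteration; loss form is an assumption.
import torch
import torch.nn.functional as F

def generator_train_step(G, D, optimizer_G, x, target_tag, source_tag):
    fake = G(x, target_tag)          # S302: first synthesized voice signal
    p_target = D(fake)               # S303: probability that it is the target voice
    rec = G(fake, source_tag)        # S303: reconstructed voice signal
    # S304: generator loss computed from the discriminator's probability
    loss_G = F.binary_cross_entropy(p_target, torch.ones_like(p_target))
    optimizer_G.zero_grad()
    loss_G.backward()
    optimizer_G.step()               # S305: one parameter update = one iteration
    return rec.detach()              # S306: becomes the next signal to process
```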
Based on the above, when the number of training iterations performed on the generator network in the StarGAN-VC network architecture reaches the first preset value, the generator network can be used as the first generator network; when the number of training iterations reaches the second preset value, the generator network can be used as the second generator network. The first preset value is in the range of 75-85, preferably 80; the second preset value is in the range of 15-25, preferably 20.
In the prior art, the original voice signal is converted by the generator network in the StarGAN-VC network architecture so that the converted signal is as close as possible to the preselected target voice signal.
In the technical scheme of the invention, when the original signal (i.e., the original authentication voice signal or the original registration voice signal) is converted by the generator network in the StarGAN-VC network architecture, the aim is instead that the converted voice signal retains the voiceprint characteristics of the original signal as much as possible while remaining measurably distinct from it.
It will be appreciated that, in order for the first registration voice signal to retain more of the voiceprint features of the original registration voice signal and thereby improve the accuracy of voiceprint recognition, the second preset value is set in the range of 15-25, preferably 20.
In the technical scheme of the invention, a user who has completed account registration is regarded as a legitimate user; all other users are regarded as illegitimate users. The voiceprint authentication method judges whether the original authentication voice signal belongs to a legitimate user, that is, whether the user to be authenticated is a legitimate user.
If the first preset value is very large (for example, more than 100), the second authentication voice signal of either a legitimate or an illegitimate user contains mostly voiceprint features of the preselected target voice signal, so its similarity to the first registration voice signal is always very high, and voiceprint authentication cannot be decided from the similarity; that is, voiceprint authentication becomes impossible. The first preset value therefore should not be set too large.
If the first preset value is small (for example, between 20 and 50), the second authentication voice signal converted by the first generator network carries fewer voiceprint features, so even for a legitimate user the similarity to the first registration voice signal is low. To keep authentication succeeding in that situation, the first preset threshold would have to be lowered, which directly reduces the accuracy of the voiceprint authentication. The first preset value therefore should not be set too small, either. It should be noted that the similarity between the second authentication voice signal and the first registration voice signal can be loosely understood as comparing, in each signal, the superposition of the legitimate user's voiceprint features with those of the preselected target voice signal.
In addition, the first preset value and the second preset value can be adjusted according to the number of voice features contained in the original registration voice signal: if the number of voice features is very large, both preset values can be increased accordingly.
To verify the above, the inventors performed simulation verification for different settings of the first and second preset values. SF1 is a legitimate user; SF2, SF3, SF4 and TM2 are all illegitimate users; the test texts are five different registration voice inputs; and the entries of Tables 1 and 2 are similarities. Table 1 shows the test results of the present technical scheme, where the first preset value is 80 and the second preset value is 20.
Table 1. Test results of the present technical scheme
Table 2 shows the test results of the comparative technical scheme, wherein the corresponding first preset value is 50, the second preset value is 20, and the test results are as follows.
Table 2. Test results of the comparative technical scheme
The first preset threshold is set to 0.7, and the voiceprint authentication result is judged from the similarities in Tables 1 and 2. Table 1 contains only two voiceprint authentication errors, while Table 2 contains nine, so the preferred first preset value of 80 and second preset value of 20 give better results.
Referring to fig. 3, fig. 3 is a schematic structural diagram of iterative training of a generator network according to an embodiment of the present invention. As shown in fig. 3, the to-be-processed voice signal and the target audio tag are input together into a generator network G (i.e., the to-be-trained generator network of the present invention), so as to obtain a first synthesized voice signal. Then, the first synthesized voice signal is input into a pre-trained discriminator D, and the probability that the first synthesized voice signal is the target voice signal is obtained. Simultaneously, the first synthesized voice signal and the original audio tag are input into a generator network G (namely, the generator network to be trained) together to obtain a reconstructed voice signal.
A loss value of the generator network is then obtained based on the probability that the first synthesized voice signal is the preselected target voice signal, and the network parameters of the generator network are adjusted based on that loss value. The reconstructed voice signal, serving as the new voice signal to be processed, and the target audio tag are then re-input into the generator network with the adjusted parameters, and the iterative training continues.
In some embodiments, the computer processes the original registration audio signal during registration as follows:
a. Acquiring the mel-frequency cepstrum coefficients of the original registration audio signal as the voice signal to be processed.
b. Carrying out iterative training on the generator network in the StarGAN-VC network architecture based on the voice signal to be processed to obtain the first generator network and the second generator network.
c. Inputting the registration voice signal to be processed and the target voice tag into the pre-trained second generator network, and obtaining the signal output by the second generator network as a third registration voice signal.
d. Inputting the third registration voice signal and the original registration voice tag into the second generator network to obtain the signal output by the second generator network as a fourth registration voice signal, wherein the original registration voice tag is the voice tag of the original registration voice signal.
e. Acquiring a second spectrum gain function based on the third registration voice signal and the fourth registration voice signal.
f. Multiplying the spectrum envelope of the original registration voice by the second spectrum gain function, and processing the product with a vocoder to obtain the first registration voice signal.
g. Encrypting the first registration voice signal to obtain the second registration voice signal.
In some embodiments, encrypting the first registration voice signal in step g to obtain the second registration voice signal may include the following steps:
G01, deleting the original registration voice signal.
G02, encrypting the first registration voice signal based on the AES encryption library to obtain the encrypted first registration voice signal as the second registration voice signal.
In one implementation, after step G02, the method further includes the steps of:
and G03, storing the second registration voice signal in a database, and deleting the user password. The user password represents an account password set when the user performs account registration. The account code is the same as the key used for encryption.
In some embodiments, after the account registration of the user is completed, the user password is hashed, and a corresponding hash value is obtained as the user hash value.
After the user hash value is obtained, the plaintext password in the user account is deleted.
In one implementation, when voiceprint authentication is performed, the user to be authenticated is first required to input a password to the voiceprint recognition system. The password received by the voiceprint recognition system is used as the authentication password, which is then hashed to obtain the corresponding hash value as the authentication hash value.
The system then judges whether the authentication hash value is the same as the user hash value. If so, the authentication password matches the user password set at registration, and the password check passes. If the hash values differ, the password input by the user to be authenticated differs from the user password, and the voiceprint authentication procedure is stopped.
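For illustration only, the following is a minimal sketch of this password gate; SHA-256 is an assumption, as the embodiment does not name the hash function:

```python
# Hedged sketch of the password gate; SHA-256 is an assumption.
import hashlib

def hash_password(password: str) -> str:
    return hashlib.sha256(password.encode()).hexdigest()

def password_gate(authentication_password: str, stored_user_hash: str) -> bool:
    # Only the hash is stored at registration; the plaintext password is deleted.
    return hash_password(authentication_password) == stored_user_hash
```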
Because the account password is the same as the encryption key, after the authentication password input by the user passes the check, the second registration voice signal is decrypted with the authentication password to obtain the first registration voice signal.
Because the original password cannot be recovered from its hash value, after the plaintext account password is deleted, even if the stored hash value leaks from the device, an attacker cannot obtain the account password. In addition, because the encryption key is the same as the account password, this processing effectively prevents an attacker who steals the database from using the encrypted audio directly for voiceprint authentication and leaking user information.
For step S3, the similarity between the second authentication voice signal and the first registration voice signal is calculated; if the similarity is larger than the first preset threshold, the original authentication voice signal is successfully authenticated. The first preset threshold is set by the user and ranges from 0 to 1.
In one implementation, the second authentication voice signal and the first registration voice signal are input to a preset discriminator, and the discriminator calculates the similarity between the second authentication voice signal and the first registration voice signal and judges whether the similarity is greater than 0.7 (i.e. the first preset threshold in the present invention). If yes, the voiceprint authentication of the original authentication voice signal is passed. If not, the voiceprint authentication of the original authentication voice signal is not passed.
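For illustration only, this decision step can be sketched as follows, assuming the preset discriminator returns a similarity score in [0, 1]:

```python
# Hedged sketch of the final decision in step S3.
def voiceprint_authenticate(discriminator, second_auth_signal, first_reg_signal,
                            threshold: float = 0.7) -> bool:
    similarity = discriminator(second_auth_signal, first_reg_signal)
    return similarity > threshold  # pass only above the first preset threshold
```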
Referring to fig. 4, fig. 4 is a flowchart of another voiceprint authentication method based on StarGAN-VC according to an embodiment of the present invention. As shown in fig. 4, the method includes:
when registering a user, the computer first acquires the user's account password, hashes it into the user hash value, and deletes the plaintext password (i.e., the account password) from the account. The computer then obtains the first registration voice signal via the preprocessing module and the second generator network in the voiceprint privacy module. Finally, the first registration voice signal is AES-encrypted and stored in the user database.
When voiceprint authentication is performed, the computer first acquires the password input by the user to be authenticated and hashes it to obtain the authentication hash value. Whether the voiceprint authentication system continues with voiceprint authentication is decided by checking whether this hash value equals the user hash value.
Next, the computer acquires the original authentication voice signal of the user to be authenticated and passes it through the preprocessing module and the first generator network to obtain the protected authentication voice (i.e., the second authentication voice signal of the invention). Meanwhile, the computer retrieves the encrypted first registration voice signal from the user database and AES-decrypts it with the authentication password to obtain the first registration voice signal.
Then, the protected authentication voice signal and the first registration voice signal are jointly input into the preset discriminator to obtain the comparison result, that is, their similarity is calculated, from which it is determined whether the voiceprint authentication passes.
The inventors conducted performance tests of the proposed StarGAN-VC-based voiceprint authentication method.
In the performance test, the voices of 4 source speakers and 4 target speakers (female and male) from the vcc2016 fixed corpus are selected as training data. Each speaker's voice data set contains 150 different utterances, and the utterance content corresponds one-to-one across speakers. The data are divided into two subsets, the vcc2016 training set and the vcc2016 evaluation set; the evaluation set serves as the test set.
In this test, the voice data of the four speakers SF1, SF2, TM1 and TM2 in the vcc2016 training set are extracted as the training set, and the corresponding data in the evaluation set are used as the test set. The text content spoken by each speaker is divided into text 1 and text 2.
After the legitimate user TM1 registers and enters text 1 (or text 2), the original voice (i.e., the original registration voice signal of the present invention) is passed through the preprocessing module and the second generator network to obtain the synthesized audio SF1-TM1+1 (or SF1-TM1+2); after the legitimate user TM2 registers and enters text 1 (or text 2), the original voice is likewise passed through the preprocessing module and the second generator network to obtain the synthesized audio SF2-TM2+1 (or SF2-TM2+2).
First, the privacy-preserving performance of the second generator network is tested. It can be understood that if the synthesized audio differs substantially from the original audio in its voiceprint map, then even if the encrypted synthesized audio in the database is stolen and decoded, an attacker cannot recover the user's original voiceprint information, thereby achieving the goal of protecting the user's original voiceprint characteristics.
Therefore, a similarity comparison test is performed between the synthesized voice and the original audio; the lower the similarity, the stronger the privacy protection of the present technical scheme. Three comparison metrics are selected in the present invention: SSIM (Structural Similarity Index Measure), MSE (Mean Square Error), and histogram similarity.
Histogram similarity measures image similarity via simple vector similarity and normalizes well. SSIM is an index that measures the similarity of two images, mainly considering three key features of a picture: luminance, contrast, and structure; SSIM returns a floating-point value between 0 and 1 (the closer to 1, the higher the agreement). MSE is an index for computing the similarity of two pictures; a smaller MSE indicates that the two pictures are more similar.
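Assuming the voiceprint maps are available as 2-D float arrays, the three metrics can be computed as in the following sketch, using `structural_similarity` from scikit-image and plain NumPy for MSE and a normalized-histogram correlation; the bin count and the correlation-based histogram comparison are illustrative choices, not specified by the patent.

```python
import numpy as np
from skimage.metrics import structural_similarity

def compare_voiceprints(img_a: np.ndarray, img_b: np.ndarray, bins: int = 64):
    """Return (SSIM, MSE, histogram similarity) for two voiceprint maps."""
    ssim = structural_similarity(img_a, img_b,
                                 data_range=float(img_a.max() - img_a.min()))
    mse = float(np.mean((img_a - img_b) ** 2))  # smaller -> more similar
    lo, hi = min(img_a.min(), img_b.min()), max(img_a.max(), img_b.max())
    ha, _ = np.histogram(img_a, bins=bins, range=(lo, hi), density=True)
    hb, _ = np.histogram(img_b, bins=bins, range=(lo, hi), density=True)
    hist_sim = float(np.corrcoef(ha, hb)[0, 1])  # correlation of normalized histograms
    return ssim, mse, hist_sim
```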
Referring to fig. 5, fig. 5 is a graph of the similarity test results between the synthesized audio and the original audio according to an embodiment of the present invention. As shown in fig. 5, in the test results the SSIM index is generally below 65%, the histogram similarity is below 80%, and the MSE error is generally high, so it can be judged that the synthesized audio and the original audio differ substantially in both auditory analysis and spectrogram analysis, and the test meets the requirements. This demonstrates that the voiceprint privacy protection achieved by the technical scheme of the present invention is good and satisfies practical privacy protection requirements.
This test quantitatively demonstrates the large difference between the synthesized audio and the original audio, thereby achieving the goal of protecting the user's voiceprint privacy. Correspondingly, the voiceprint authentication performance of the technical scheme is also tested, with the first preset threshold set to 0.7. The test results are shown in Tables 3 and 4.
TABLE 3 Comparison results between the synthesized audio of different users and SF1
TABLE 4 Comparison results between the synthesized audio of different users and SF2
According to the test results of Tables 3 and 4, the StarGAN-VC-based voiceprint authentication method provided by the present invention can ensure the accuracy of voiceprint authentication while protecting the voiceprint privacy security of the original voice signal.
Referring to fig. 6, fig. 6 is a block diagram of a voiceprint authentication system based on StarGAN-VC according to the present invention. As shown in fig. 6, the system includes:
the signal acquisition module 601 is configured to acquire a first authentication voice signal. The first authentication voice signal is obtained by preprocessing the original authentication voice signal.
The signal processing module 602 is configured to process the first authentication voice signal based on a pre-acquired first generator network to obtain a second authentication voice signal. The first generator network is acquired as follows: taking the original registered voice signal as a training sample, iteratively training a generator network in the StarGAN-VC network architecture based on the training sample, and taking the corresponding generator network as the first generator network when the number of training iterations reaches a first preset value.
The similarity calculation module 603 is configured to calculate the similarity between the second authentication voice signal and the first registration voice signal; if the similarity is greater than a first preset threshold, the original authentication voice signal is authenticated successfully. The first registration voice signal is derived from the original registration voice signal and a pre-acquired second generator network, which is generated as follows: taking the original registered voice signal as a training sample, iteratively training a generator network in the StarGAN-VC network architecture based on the training sample, and taking the corresponding generator network as the second generator network when the number of training iterations reaches a second preset value. The second preset value is smaller than the first preset value.
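A structural sketch of the three modules of fig. 6 follows; the class and method names are illustrative, and the generator networks and the discriminator are injected as callables since their internals are defined elsewhere in the scheme.

```python
class SignalAcquisitionModule:
    def __init__(self, preprocess):
        self.preprocess = preprocess  # e.g. MFCC extraction

    def acquire(self, original_auth_signal):
        return self.preprocess(original_auth_signal)  # first authentication voice signal

class SignalProcessingModule:
    def __init__(self, first_generator):
        self.first_generator = first_generator  # trained to the first preset iteration count

    def process(self, first_auth_signal):
        return self.first_generator(first_auth_signal)  # second authentication voice signal

class SimilarityModule:
    def __init__(self, discriminator, threshold=0.7):
        self.discriminator, self.threshold = discriminator, threshold

    def verify(self, second_auth_signal, first_reg_signal) -> bool:
        return self.discriminator(second_auth_signal, first_reg_signal) > self.threshold
```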
It can be understood that the StarGAN-VC-based voiceprint authentication system provided by the embodiment of the present invention corresponds to the StarGAN-VC-based voiceprint authentication method; for explanations, examples, beneficial effects and the like of the relevant content, reference may be made to the corresponding content of the method, which is not repeated here.
An embodiment of the present invention further provides an electronic device, as shown in fig. 7, including a processor 701, a communication interface 702, a memory 703 and a communication bus 704, where the processor 701, the communication interface 702 and the memory 703 communicate with each other through the communication bus 704.
The memory 703 is configured to store a computer program.
The processor 701 is configured to implement any of the above-described StarGAN-VC-based voiceprint authentication methods when executing the program stored in the memory 703.
The communication bus of the above electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, the bus is drawn as a single bold line in the figure, but this does not mean there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The Memory may include random access Memory (Random Access Memory, RAM) or may include Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
In summary, compared with the prior art, the technical scheme provided by the invention has the following beneficial effects:
according to the invention, the original registration audio is used as a training sample to iteratively train the generator network in the StarGAN-VC network architecture, and the first generator network and the second generator network are obtained at different numbers of training iterations. The original registration voice signal is converted into the first registration voice signal by the second generator network, so that the first registration voice signal differs measurably from the original registration voice signal, which ensures the voiceprint privacy security of the original registration voice signal. Likewise, the original authentication voice signal is converted into the second authentication voice signal by the first generator network, which effectively ensures the voiceprint privacy security of the original authentication voice signal. Because the specific transformation applied by the generator in the StarGAN-VC network architecture cannot be determined by existing techniques, the confidentiality of the voice signal conversion is greatly improved; even if the first registration voice signal is obtained, it cannot be inverted back to the original registration voice signal with existing techniques, providing a reliable guarantee for the voiceprint privacy security of the user's original voiceprint characteristic information.
Moreover, because the first generator network and the second generator network are trained on the same training samples with the same StarGAN-VC network architecture and differ only in the number of training iterations, the conversion applied to the original registration voice signal is essentially the same as that applied to the original authentication voice signal. Voiceprint authentication of the original authentication voice signal can therefore be completed by calculating the similarity between the second authentication voice signal and the first registration voice signal, which ensures voiceprint authentication accuracy while protecting the voiceprint privacy of the voice signal.
In addition, during the iterative training of the generator in the StarGAN-VC network architecture, the number of training iterations for the second generator network is smaller than that for the first generator network. The difference between the original registration voice signal and the first registration voice signal is therefore smaller than the difference between the original authentication voice signal and the second authentication voice signal, so the first registration voice signal retains more voice characteristics of the original registration voice signal while still guaranteeing its voiceprint privacy security, which further improves the accuracy of the subsequent voiceprint authentication.
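The asymmetric checkpointing described above can be sketched as a single training run snapshotted at two iteration counts; `train_step`, `sample_stream` and the preset values 1000/5000 are placeholders, since the patent leaves the concrete StarGAN-VC training details and preset values open.

```python
import copy

def train_two_generators(generator, train_step, sample_stream,
                         second_preset=1000, first_preset=5000):
    """Snapshot one generator training run at two iteration counts."""
    second_generator = None
    for iteration in range(1, first_preset + 1):
        train_step(generator, next(sample_stream))  # one generator update in the StarGAN-VC loop
        if iteration == second_preset:
            # fewer iterations: output stays closer to the original voice (registration side)
            second_generator = copy.deepcopy(generator)
    # more iterations: stronger conversion (authentication side)
    first_generator = copy.deepcopy(generator)
    return first_generator, second_generator
```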
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A voiceprint authentication method based on StarGAN-VC, the method comprising:
acquiring a first authentication voice signal; the first authentication voice signal is obtained by preprocessing an original authentication voice signal;
processing the first authentication voice signal based on a first generator network acquired in advance to obtain a second authentication voice signal; the acquiring process of the first generator network comprises the following steps: taking an original registered voice signal as a training sample, iteratively training a generator network in a StarGAN-VC network architecture based on the training sample, and taking the corresponding generator network as the first generator network when the number of training iterations reaches a first preset value;
calculating the similarity between the second authentication voice signal and the first registration voice signal, and if the similarity is greater than a first preset threshold, determining that the original authentication voice signal is authenticated successfully; wherein the first registration voice signal is obtained based on the original registration voice signal and a second generator network acquired in advance; the acquiring process of the second generator network comprises the following steps: taking the original registered voice signal as a training sample, iteratively training a generator network in the StarGAN-VC network architecture based on the training sample, and taking the corresponding generator network as the second generator network when the number of training iterations reaches a second preset value; the second preset value is smaller than the first preset value.
2. The method of claim 1, wherein prior to acquiring the first authentication voice signal, the method further comprises:
acquiring an original authentication voice signal;
and acquiring a Mel frequency cepstrum coefficient of the original authentication voice signal as the first authentication voice signal.
3. The method of claim 1, wherein processing the first authentication voice signal based on the pre-acquired first generator network to obtain a second authentication voice signal comprises:
inputting the first authentication voice signal and the target voice tag together into a first generator network acquired in advance, and obtaining a signal output by the first generator network as a third authentication voice signal; wherein the target voice tag is a voice tag of a preselected target voice signal;
inputting the third authentication voice signal and the original authentication voice tag together into the first generator network, and obtaining a signal output by the first generator network as a fourth authentication voice signal; wherein the original authentication voice tag is a voice tag of the original authentication voice signal;
acquiring a first spectrum gain function based on the third authentication voice signal and the fourth authentication voice signal;
multiplying the spectrum envelope of the original authentication voice by the first spectrum gain function, and processing the multiplication result based on a vocoder to obtain the second authentication voice signal.
4. The method of claim 1, wherein the process of iteratively training a generator network in a StarGAN-VC network architecture using the original registered speech signal as training samples comprises:
acquiring a mel frequency cepstrum coefficient of the original registered voice signal as a voice signal to be processed;
inputting the voice signal to be processed and the target audio tag together into a generator network to be trained in the StarGAN-VC network architecture to obtain a first synthesized voice signal;
inputting the first synthesized voice signal into a pre-trained discriminator network to obtain a probability that the first synthesized voice signal is the target voice signal; inputting the first synthesized voice signal and the original audio tag together into the generator network to be trained to obtain a reconstructed voice signal;
calculating a loss value of a generator network based on a probability that the first synthesized speech signal is the target speech signal;
adjusting network parameters of the generator network based on the loss value of the generator network; after the network parameters of the generator network are adjusted, recording that one training iteration is completed;
and replacing the voice signal to be processed with the reconstructed voice signal, inputting the replaced voice signal to be processed and the target audio tag into the generator network with the adjusted network parameters to obtain a new first synthesized voice signal, and continuing the iterative training.
5. The method of claim 1, wherein prior to calculating the similarity of the second authentication voice signal to the first registration voice signal, the method further comprises:
acquiring a second registration voice signal; wherein the second registration voice signal is obtained based on encrypting the first registration voice signal;
and decrypting the second registration voice signal to obtain the first registration voice signal.
6. The method of claim 5, wherein prior to said acquiring the second registered voice signal, the method further comprises:
acquiring the original registration voice signal;
acquiring a mel frequency cepstrum coefficient of the original registration voice signal as a registration voice signal to be processed;
inputting the registration voice signal to be processed and the target voice tag into a pre-trained second generator network to obtain a signal output by the second generator network as a third registration voice signal;
inputting the third registration voice signal and the original registration voice tag into the second generator network to obtain a signal output by the second generator network as a fourth registration voice signal; the original registration voice tag is a voice tag of the original registration voice signal;
acquiring a second spectrum gain function based on the third registered voice signal and the fourth registered voice signal;
multiplying the spectrum envelope of the original registered voice by the second spectrum gain function, and processing the multiplication result based on a vocoder to obtain the first registered voice signal;
and encrypting the first registration voice signal to obtain a second registration voice signal.
7. The method of claim 6, wherein encrypting the first registered voice signal to obtain a second registered voice signal comprises:
deleting the original registered voice signals;
encrypting the first registration voice signal based on the AES encryption library to obtain an encrypted first registration voice signal serving as a second registration voice signal.
8. The method of claim 7, wherein after encrypting the first registration voice signal based on the AES encryption library to obtain an encrypted first registration voice signal as the second registration voice signal, the method further comprises:
storing the second registered voice signal in a database, and deleting a user password; wherein the user password is the same as a key for encryption based on an AES encryption library.
9. The method of claim 8, wherein prior to deleting the user password, the method further comprises:
hashing the user password to obtain a corresponding hash value as a user hash value;
before acquiring the original authentication speech signal, the method further comprises:
acquiring an authentication password;
hashing the authentication password to obtain a corresponding hash value as an authentication hash value;
judging whether the authentication hash value is the same as the user hash value; if yes, indicating that the authentication password passes;
the step of decrypting the second registration voice signal to obtain the first registration voice signal includes:
acquiring the authentication password;
and decrypting the second registration voice signal based on the authentication password to obtain the first registration voice signal.
10. A StarGAN-VC based voiceprint authentication system, the system comprising:
the signal acquisition module is used for acquiring a first authentication voice signal; the first authentication voice signal is obtained by preprocessing an original authentication voice signal;
the signal processing module is used for processing the first authentication voice signal based on a first generator network acquired in advance to obtain a second authentication voice signal; the acquiring process of the first generator network is as follows: taking an original registered voice signal as a training sample, iteratively training a generator network in a StarGAN-VC network architecture based on the training sample, and taking the corresponding generator network as the first generator network when the number of training iterations reaches a first preset value;
the similarity calculation module is used for calculating the similarity between the second authentication voice signal and the first registration voice signal, and if the similarity is greater than a first preset threshold, the original authentication voice signal is authenticated successfully; wherein the first registration voice signal is obtained based on the original registration voice signal and a second generator network acquired in advance; the generating process of the second generator network is as follows: taking the original registered voice signal as a training sample, iteratively training a generator network in the StarGAN-VC network architecture based on the training sample, and taking the corresponding generator network as the second generator network when the number of training iterations reaches a second preset value; the second preset value is smaller than the first preset value.
CN202310413563.2A 2023-04-12 2023-04-12 Voiceprint authentication method and system based on StarGAN-VC Pending CN116524936A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination