CN109326294B - Text-related voiceprint key generation method - Google Patents

Text-related voiceprint key generation method

Info

Publication number
CN109326294B
CN109326294B (application CN201811139547.4A)
Authority
CN
China
Prior art keywords
voiceprint
key
spectrogram
matrix
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811139547.4A
Other languages
Chinese (zh)
Other versions
CN109326294A (en)
Inventor
吴震东 (Wu Zhendong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN201811139547.4A priority Critical patent/CN109326294B/en
Publication of CN109326294A publication Critical patent/CN109326294A/en
Application granted granted Critical
Publication of CN109326294B publication Critical patent/CN109326294B/en
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/08 Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
    • H04L9/0861 Generation of secret information including derivation or calculation of cryptographic keys or passwords
    • H04L9/0866 Generation of secret information including derivation or calculation of cryptographic keys or passwords involving user or device identifiers, e.g. serial number, physical or biometrical information, DNA, hand-signature or measurable physical characteristics

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Complex Calculations (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention relates to a text-related voiceprint key generation method. The method comprises two stages: voiceprint key training and voiceprint key extraction. Voiceprint key training derives a voiceprint key extraction matrix from voiceprint samples collected in advance. Voiceprint key extraction preprocesses the voiceprint sample to be processed and multiplies it by the key extraction matrix obtained in training to produce the voiceprint key. The invention uses spectrograms of text-dependent speech, which express the speaker's voice characteristics more fully while keeping successive samples of the same text highly similar. On this basis, a machine learning method trains a stable voiceprint feature extraction matrix from multiple spectrograms; applying this matrix to subsequent samples yields a more stable voiceprint key. The method offers good stability and is simple and convenient to use.

Description

Text-related voiceprint key generation method
Technical Field
The invention belongs to the technical field of network space security, and relates to a text-related voiceprint key generation method.
Background
Voiceprint recognition is a mature biometric identification technology. With the rapid development of artificial intelligence in recent years, its accuracy has improved considerably: in low-noise environments voiceprint recognition accuracy can exceed 96%, and the technology is widely applied in identity authentication scenarios.
As voiceprint technology has matured, the field has attempted to extract a stable digital sequence directly from a human voiceprint for use as a biometric key. In other words, various keys can be generated directly from the voiceprint and integrated seamlessly with existing password and public/private-key cryptography. This avoids the inconvenience and potential security problems of collecting and storing voiceprints, and further enriches the means and methods of network authentication.
Voiceprint biometric key technology has already received some study. For example, Chinese patent ZL201110003202.8, a voiceprint-based document encryption and decryption method, provides a scheme for extracting a stable key sequence from voiceprint information; however, it stabilizes the voiceprint feature values only with a checkerboard method, so the stabilization effect is limited and the key length is insufficient. Chinese patent ZL201410074511.8, a method for generating a human voiceprint biometric key, provides a technical route that extracts a voiceprint Gaussian model and projects the model's feature parameters into a high-dimensional space to obtain a stable voiceprint key. The stability of the voiceprint key obtained by that scheme is clearly better than the earlier patent's, but for key authentication environments with high stability requirements it still needs further improvement.
Disclosure of Invention
The invention aims to provide a text-related voiceprint key generation method.
The method comprises two stages: voiceprint key training and voiceprint key extraction. Voiceprint key training derives a voiceprint key extraction matrix from voiceprint samples collected in advance. Voiceprint key extraction preprocesses the voiceprint sample to be processed and multiplies it by the key extraction matrix obtained in training to produce the voiceprint key. The specific steps are as follows:
step one, voiceprint key training, which comprises the following specific steps:
First, the user records his or her own voice reading the same text, generally 1-3 consecutive words, and repeats it more than 20 times; the number of repetitions can be adjusted by the user according to the training situation.
Second, record more than 10 different users reading the same text, each repeated more than 20 times; also record more than 10 different users reading different texts of similar duration, each likewise repeated more than 20 times.
Third, preprocess the voices recorded in the first and second steps and extract voiceprint spectrograms, specifically as follows:
1) Pre-emphasis:
Let S1(n) denote the speech time-domain signal, n = 0, 1, 2, …, N-1. The pre-emphasis formula is S(n) = S1(n) - a*S1(n-1), with 0.9 < a < 1.0; a is the pre-emphasis coefficient, which controls how strongly the high-frequency amplitude is boosted.
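A minimal numpy sketch of this pre-emphasis step (the concrete value a = 0.95 is an illustrative choice inside the stated 0.9-1.0 range):

    import numpy as np

    def pre_emphasis(s1: np.ndarray, a: float = 0.95) -> np.ndarray:
        """Apply S(n) = S1(n) - a*S1(n-1) for n >= 1."""
        s = np.empty_like(s1)
        s[0] = s1[0]                    # first sample has no predecessor
        s[1:] = s1[1:] - a * s1[:-1]
        return s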
2) Framing: divide the speech signal into frames.
3) Hamming window (Hamming Window) processing:
Let S(n), n = 0, 1, 2, …, N-1, be one framed segment of the speech time-domain signal, the signal having been divided into successive segments. The speech time-domain signal after multiplication by the Hamming window is S'(n), see formula (1):
S'(n) = S(n)*W(n)    ⑴
where
W(n) = (1-a) - a*cos(2*pi*n/(N-1)),  0 <= n <= N-1
with a = 0.46; the value of a may range from 0.3 to 0.7, the specific value being determined from experimental and empirical data. W(n) is the Hamming window function; it has a smoother low-pass characteristic and better reflects the frequency characteristics of the short-time speech signal.
4) Fast Fourier Transform (FFT):
Perform a radix-2 FFT on the windowed speech time-domain signal S'(n) to obtain the linear spectrum X(n,k); the radix-2 FFT is a standard algorithm in the field. X(n,k) is the spectral energy density function of the n-th speech frame, k indexes the spectrum segment, and each speech frame corresponds to a time slice on the time axis.
5) Generating the text-related voiceprint spectrogram:
With time n as the time-axis coordinate and k as the frequency-axis coordinate, compute |X(n,k)|^2 and display its value as a gray level at the corresponding coordinate point; this forms the voiceprint spectrogram. Applying the transform 10*log10(|X(n,k)|^2) gives the dB representation of the spectrogram.
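Continuing the sketch, the spectrogram computation per frame (numpy's FFT stands in for the radix-2 implementation; the epsilon guarding log(0) is an implementation assumption):

    import numpy as np

    def spectrogram_db(frames: np.ndarray) -> np.ndarray:
        """Rows index frames (time n), columns index spectrum segments k."""
        X = np.fft.rfft(frames, axis=1)          # linear spectrum X(n, k)
        power = np.abs(X) ** 2                   # |X(n,k)|^2
        return 10.0 * np.log10(power + 1e-12)    # dB representation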
Fourth, preprocess the voiceprint spectrogram by filtering, normalization and the like. Candidate filters include the methods in general use in signal processing, such as Gaussian filtering, wavelet filtering and binarization; which one, or which combination, to adopt is chosen by the user according to the actual test conditions. Normalization means unifying the spectrograms to a fixed length and width and unifying the value of every pixel to the range 0-255; any general method in the field may be used, for example the imresize function in the MATLAB function library for image resizing.
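A sketch of the normalization, assuming a 128 x 128 target size (the fixed size is not specified in the text; nearest-neighbour resampling stands in for MATLAB's imresize):

    import numpy as np

    def normalize_spectrogram(img: np.ndarray, out_h: int = 128,
                              out_w: int = 128) -> np.ndarray:
        """Resize to fixed height/width and rescale pixel values into 0-255."""
        rows = (np.arange(out_h) * img.shape[0] / out_h).astype(int)
        cols = (np.arange(out_w) * img.shape[1] / out_w).astype(int)
        resized = img[np.ix_(rows, cols)]        # nearest-neighbour resize
        lo, hi = resized.min(), resized.max()
        return np.round((resized - lo) / (hi - lo + 1e-12) * 255)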
Fifth, perform machine learning on the voiceprint spectrograms to obtain the stable voiceprint feature learning matrix, i.e., the voiceprint key extraction matrix.
Divide the voiceprint spectrograms obtained in the fourth step into two classes: one class contains the user's own text-related voiceprint spectrograms, the other is a comparison class mixing other users' readings of the related text and readings of unrelated text. Together these are called the positive and negative sample sets.
Let M = [M_1, M_2] denote the positive and negative sample sets participating in training, where M_i = [x_i1, x_i2, …, x_iL], i ∈ {1, 2}, denotes the i-th sample set; i = 1 is the positive set and i = 2 the negative set. Each x_ir ∈ R^d, 1 <= i <= 2, 1 <= r <= L, is obtained by taking the two-dimensional matrix of pixel values of one voiceprint spectrogram, splicing its rows in order into a one-dimensional row vector, and transposing to get a one-dimensional column vector x_ir of length d. R^d denotes the d-dimensional real vector space, and L is the number of voiceprint spectrograms, i.e. column vectors, in each sample set.
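The row splicing and transposition amount to flattening the spectrogram in row-major order; a one-line sketch:

    import numpy as np

    def to_column_vector(spec: np.ndarray) -> np.ndarray:
        """Splice the rows of a 2-D spectrogram and transpose: d x 1 column."""
        return spec.flatten(order="C").reshape(-1, 1)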
Now, from the characteristics of the two sample classes, train the voiceprint key extraction matrix W_1, W_1 ∈ R^{d×d_z}, such that the cost

J(W_1) = Σ_{r=1}^{L} ||W_1^T x_{1r} - W_1^T m_2||^2 - Σ_{r=1}^{L} ||W_1^T x_{1r} - W_1^T m_1||^2    ⑵

is maximized (the cost appears only as a formula image in the source; the form above is reconstructed from the surrounding description), where

m_1 = (1/L) Σ_{r=1}^{L} x_{1r}

is the positive sample mean of the training samples and

m_2 = (1/L) Σ_{r=1}^{L} x_{2r}

is the negative sample mean. J is a cost function reflecting, after projection of the training samples by the voiceprint key extraction matrix W_1, the difference between the distances to the positive and negative sample-set means, computed as Euclidean distances.
Let:

H_1 = Σ_{r=1}^{L} (x_{1r} - m_2)(x_{1r} - m_2)^T,  H_2 = Σ_{r=1}^{L} (x_{1r} - m_1)(x_{1r} - m_1)^T

(again reconstructed from the surrounding description). Solve for the eigenvalues and eigenvectors of the matrix (H_1 - H_2) to obtain the voiceprint key extraction matrix W_1, namely: (H_1 - H_2)w = λw, where w is an eigenvector of the matrix (H_1 - H_2) and λ the corresponding eigenvalue.
{w_1, w_2, …, w_{d_z}} are the eigenvectors corresponding to the eigenvalues {λ_1, λ_2, …, λ_{d_z}}, where λ_1 ≥ λ_2 ≥ … ≥ λ_{d_z} ≥ 0; eigenvectors whose eigenvalues are less than 0 are not included in the construction of W_1.
At this point the voiceprint key extraction matrix W_1 = [w_1, w_2, …, w_{d_z}] has been trained.
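A sketch of the whole training step under the reconstruction above (H_1 and H_2 are rendered as formula images in the source, so their exact definitions here are an assumption consistent with the surrounding text):

    import numpy as np

    def train_extraction_matrix(pos: np.ndarray, neg: np.ndarray) -> np.ndarray:
        """pos, neg: d x L matrices whose columns are flattened spectrograms.
        Returns W1 (d x dz) built from eigenvectors with positive eigenvalues."""
        m1 = pos.mean(axis=1, keepdims=True)     # positive sample mean
        m2 = neg.mean(axis=1, keepdims=True)     # negative sample mean
        d1 = pos - m2                            # positive samples about m2
        d2 = pos - m1                            # positive samples about m1
        H1, H2 = d1 @ d1.T, d2 @ d2.T
        lam, w = np.linalg.eigh(H1 - H2)         # symmetric eigendecomposition
        keep = np.where(lam > 0)[0][::-1]        # eigh sorts ascending; reverse
        return w[:, keep]                        # W1, eigenvalues descending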
Step two, voiceprint key extraction, which comprises the following specific steps:
Step 1, the user records about 3 seconds of his or her own text-related speech.
Step 2, extract the voiceprint spectrogram; see the third step of step one for details.
Step 3, preprocess the voiceprint spectrogram by filtering, normalization and the like, convert it into matrix form, and splice the rows in order to obtain the voiceprint vector x_t.
Step 4, multiply the voiceprint vector x_t obtained in step 3 on the left by the transpose of the voiceprint stable-feature learning matrix W_1 trained in step one, i.e. W_1^T · x_t, to obtain the d_z-dimensional voiceprint feature vector x_tz; x_tz is the stabilized voiceprint feature vector.
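The projection itself is a single matrix product; with W1 from the training sketch and x_t from step 3:

    x_tz = (W1.T @ x_t).ravel()    # dz-dimensional stabilized feature vector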
Step 5, apply the checkerboard-method operation to each dimensional component of x_tz, further stabilizing the voiceprint feature vector into the quantized vector x'_tz.
The checkerboard method is as follows:
Denote each dimensional component of x_tz by x_tzi.
The quantization formula is:

Λ(x) = round(x_tzi / D)    ⑶

(the source shows this formula only as an image; the nearest-grid-point form above follows its description), where D is the grid size of the checkerboard method, a positive number whose specific value the user may choose from experience; the value of Λ(x) generally lies between 0 and 63; x_tzi is a component of x_tz, and Λ(x) is an integer.
Λ(x) is the quantized value of x_tzi: the coordinate, measured in grid units from the origin, of the checkerboard grid point closest to x_tzi.
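A sketch of this quantization (the clip to the stated 0-63 range is an assumption; the text only says the values generally fall there):

    import numpy as np

    def checkerboard_quantize(x_tz: np.ndarray, D: float) -> np.ndarray:
        """Snap each component to the nearest checkerboard grid point and
        return its integer grid coordinate."""
        q = np.rint(x_tz / D).astype(int)    # nearest multiple of D, in grid units
        return np.clip(q, 0, 63)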
Step 6, take the quantized vector x'_tz computed in step 5 and splice its first 32 or 64 components end to end. Each component takes a value in 0-63 and contributes 4 bits of key material, so a 128-bit or 256-bit voiceprint key is formed; this completes the voiceprint key extraction.
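One plausible reading of this splicing, taking the low 4 bits of each quantized component (which 4 of the possible bits enter the key is not specified, so this selection is an assumption):

    def assemble_key(q, n_components: int = 64) -> bytes:
        """4 bits per component: 32 components -> 128-bit key, 64 -> 256-bit."""
        bits = 0
        for v in list(q)[:n_components]:
            bits = (bits << 4) | (int(v) & 0xF)    # keep the low 4 bits
        return bits.to_bytes(n_components // 2, "big")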
The invention uses spectrograms of text-dependent speech, which express the speaker's voice characteristics more fully while keeping successive samples of the same text highly similar. On this basis, a machine learning method trains a stable voiceprint feature extraction matrix from multiple spectrograms; applying this matrix to subsequent samples yields a more stable voiceprint key. The method offers good stability and is simple and convenient to use.
Drawings
FIG. 1 is a flowchart of voiceprint key training in accordance with the present invention;
FIG. 2 is a flowchart of voiceprint spectrogram generation in accordance with the present invention;
FIG. 3 is a spectrogram of the voiceprint of the present invention;
FIG. 4 is a flowchart of voiceprint key extraction in accordance with the present invention;
FIG. 5 is a schematic diagram of machine learning of voiceprint features in accordance with the present invention.
Detailed Description
A text-related voiceprint key generation method comprises two stages: voiceprint key training and voiceprint key extraction. Voiceprint key training derives a voiceprint key extraction matrix from voiceprint samples collected in advance. Voiceprint key extraction preprocesses the voiceprint sample to be processed and multiplies it by the key extraction matrix obtained in training to produce the voiceprint key. The specific steps are as follows:
step one, voiceprint key training, as shown in fig. 1, the specific steps are:
In the first step, the user records his or her own voice reading the same text, generally 1-3 consecutive words, and repeats it more than 20 times (the number of repetitions can be adjusted by the user according to the training situation).
In the second step, record more than 10 different users reading the same text, each repeated more than 20 times; also record more than 10 different users reading different texts of similar duration, each likewise repeated more than 20 times.
In the third step, preprocess the voices recorded in the first and second steps, as shown in fig. 2 and 3, and extract voiceprint spectrograms, specifically as follows:
1) Pre-emphasis:
Let S1(n) denote the speech time-domain signal, n = 0, 1, 2, …, N-1. The pre-emphasis formula is S(n) = S1(n) - a*S1(n-1), with 0.9 < a < 1.0; a is the pre-emphasis coefficient, which controls how strongly the high-frequency amplitude is boosted.
2) Framing: divide the speech signal into frames.
3) Hamming window (Hamming Window) processing:
Let S(n), n = 0, 1, 2, …, N-1, be one framed segment of the speech time-domain signal, the signal having been divided into successive segments. The speech time-domain signal after multiplication by the Hamming window is S'(n), see formula (1):
S'(n) = S(n)*W(n)    ⑴
where
W(n) = (1-a) - a*cos(2*pi*n/(N-1)),  0 <= n <= N-1
with a = 0.46; the value of a may range from 0.3 to 0.7, the specific value being determined from experimental and empirical data. W(n) is the Hamming window function; it has a smoother low-pass characteristic and better reflects the frequency characteristics of the short-time speech signal.
4) Fast Fourier Transform (FFT):
Perform a radix-2 FFT on the windowed speech time-domain signal S'(n) to obtain the linear spectrum X(n,k); the radix-2 FFT is a standard algorithm in the field. X(n,k) is the spectral energy density function of the n-th speech frame, k indexes the spectrum segment, and each speech frame corresponds to a time slice on the time axis.
5) Generating the text-related voiceprint spectrogram:
With time n as the time-axis coordinate and k as the frequency-axis coordinate, compute |X(n,k)|^2 and display its value as a gray level at the corresponding coordinate point; this forms the voiceprint spectrogram. Applying the transform 10*log10(|X(n,k)|^2) gives the dB representation of the spectrogram.
Fourth, preprocess the voiceprint spectrogram by filtering, normalization and the like. Candidate filters include the methods in general use in signal processing, such as Gaussian filtering, wavelet filtering and binarization; which one, or which combination, to adopt is chosen by the user according to the actual test conditions. Normalization means unifying the spectrograms to a fixed length and width and unifying the value of every pixel to the range 0-255; any general method in the field may be used, for example the imresize function in the MATLAB function library for image resizing.
Fifth, perform machine learning on the voiceprint spectrograms to obtain the stable voiceprint feature learning matrix, i.e., the voiceprint key extraction matrix.
Divide the voiceprint spectrograms obtained in the fourth step into two classes: one class contains the user's own text-related voiceprint spectrograms, the other is a comparison class mixing other users' readings of the related text and readings of unrelated text. Together these are called the positive and negative sample sets.
Let M = [M_1, M_2] denote the positive and negative sample sets participating in training, where M_i = [x_i1, x_i2, …, x_iL], i ∈ {1, 2}, denotes the i-th sample set; i = 1 is the positive set and i = 2 the negative set. Each x_ir ∈ R^d, 1 <= i <= 2, 1 <= r <= L, is obtained by taking the two-dimensional matrix of pixel values of one voiceprint spectrogram, splicing its rows in order into a one-dimensional row vector, and transposing to get a one-dimensional column vector x_ir of length d. R^d denotes the d-dimensional real vector space, and L is the number of voiceprint spectrograms, i.e. column vectors, in each sample set.
Now, from the characteristics of the two sample classes, train the voiceprint key extraction matrix W_1, W_1 ∈ R^{d×d_z}, such that the cost

J(W_1) = Σ_{r=1}^{L} ||W_1^T x_{1r} - W_1^T m_2||^2 - Σ_{r=1}^{L} ||W_1^T x_{1r} - W_1^T m_1||^2    ⑵

is maximized, where

m_1 = (1/L) Σ_{r=1}^{L} x_{1r}

is the positive sample mean of the training samples and

m_2 = (1/L) Σ_{r=1}^{L} x_{2r}

is the negative sample mean. J is a cost function reflecting, after projection of the training samples by the voiceprint key extraction matrix W_1, the difference between the distances to the positive and negative sample-set means, computed as Euclidean distances.
Let:

H_1 = Σ_{r=1}^{L} (x_{1r} - m_2)(x_{1r} - m_2)^T,  H_2 = Σ_{r=1}^{L} (x_{1r} - m_1)(x_{1r} - m_1)^T

Solve for the eigenvalues and eigenvectors of the matrix (H_1 - H_2) to obtain the voiceprint key extraction matrix W_1, namely: (H_1 - H_2)w = λw, where w is an eigenvector of the matrix (H_1 - H_2) and λ the corresponding eigenvalue.
{w_1, w_2, …, w_{d_z}} are the eigenvectors corresponding to the eigenvalues {λ_1, λ_2, …, λ_{d_z}}, where λ_1 ≥ λ_2 ≥ … ≥ λ_{d_z} ≥ 0; eigenvectors whose eigenvalues are less than 0 are not included in the construction of W_1.
At this point the voiceprint key extraction matrix W_1 = [w_1, w_2, …, w_{d_z}] has been trained.
Step two, voiceprint key extraction, as shown in fig. 4, the specific steps are as follows:
Step 1, the user records about 3 seconds of his or her own text-related speech.
Step 2, extract the voiceprint spectrogram; see the third step of step one for details.
Step 3, preprocess the voiceprint spectrogram by filtering, normalization and the like, convert it into matrix form, and splice the rows in order to obtain the voiceprint vector x_t.
Step 4, multiply the voiceprint vector x_t obtained in step 3 on the left by the transpose of the voiceprint stable-feature learning matrix W_1 trained in step one, i.e. W_1^T · x_t, to obtain the d_z-dimensional voiceprint feature vector x_tz; x_tz is the stabilized voiceprint feature vector.
Step 5, apply the checkerboard-method operation to each dimensional component of x_tz, further stabilizing the voiceprint feature vector into the quantized vector x'_tz.
The checkerboard method is as follows:
Denote each dimensional component of x_tz by x_tzi.
The quantization formula is:

Λ(x) = round(x_tzi / D)    ⑶

where D is the grid size of the checkerboard method, a positive number whose specific value the user may choose from experience; the value of Λ(x) generally lies between 0 and 63; x_tzi is a component of x_tz, and Λ(x) is an integer.
Λ(x) is the quantized value of x_tzi: the coordinate, measured in grid units from the origin, of the checkerboard grid point closest to x_tzi.
Step 6, take the quantized vector x'_tz computed in step 5 and splice its first 32 or 64 components end to end. Each component takes a value in 0-63 and contributes 4 bits of key material, so a 128-bit or 256-bit voiceprint key is formed; this completes the voiceprint key extraction.
The invention exploits the fact that the voiceprint spectra of the same speaker's text-related speech are highly similar: voiceprint spectrograms extracted from the same speaker reading the same text several times remain highly similar to one another, while voiceprint spectrograms extracted from different speakers reading the same text differ markedly. After the voiceprint spectrograms are extracted, common feature information is extracted from multiple spectrograms by the machine learning method shown in fig. 5, and the text-related voiceprint key is obtained after segmented quantization. The voiceprint key requires no server-side biometric template, so it offers higher security, and it can be fused with general network encryption and decryption algorithms such as AES and RSA, which is convenient for users. The method yields a stable voiceprint key: the extraction accuracy exceeds 95%, and the key length can reach 256 bits.
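As an illustration of that fusion claim, a sketch that feeds the 256-bit voiceprint key into AES-GCM through the widely used Python cryptography package (the package choice and the nonce handling are assumptions, not part of the method):

    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    def encrypt_with_voiceprint_key(key: bytes, plaintext: bytes) -> bytes:
        """key: the 32-byte (256-bit) voiceprint key from the extraction step."""
        nonce = os.urandom(12)               # fresh 96-bit nonce per message
        return nonce + AESGCM(key).encrypt(nonce, plaintext, None)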

Claims (2)

1. A text-related voiceprint key generation method, characterized in that: the method comprises voiceprint key training and voiceprint key extraction; voiceprint key training derives a voiceprint key extraction matrix from voiceprint samples collected in advance; voiceprint key extraction preprocesses the voiceprint sample to be processed and multiplies it by the key extraction matrix obtained in training to produce the voiceprint key; the specific steps are as follows:
step one, voiceprint key training, which comprises the following specific steps:
first, the user records his or her own voice reading the same text, generally 1-3 consecutive words, repeated more than 20 times, the number of repetitions being adjusted by the user according to the training situation;
second, record more than 10 different users reading the same text, each repeated more than 20 times; also record more than 10 different users reading different texts of similar duration, each likewise repeated more than 20 times;
third, preprocess the voices recorded in the first and second steps and extract voiceprint spectrograms, specifically as follows:
1) pre-emphasis:
let S1(n) denote the speech time-domain signal, n = 0, 1, 2, …, N-1; the pre-emphasis formula is S(n) = S1(n) - a*S1(n-1), 0.9 < a < 1.0; a is the pre-emphasis coefficient, which controls how strongly the high-frequency amplitude is boosted;
2) framing: divide the speech signal into frames;
3) Hamming window processing:
let S(n), n = 0, 1, 2, …, N-1, be one framed segment of the speech time-domain signal, the signal having been divided into successive segments; the speech time-domain signal after multiplication by the Hamming window is S'(n), see formula (1):
S'(n) = S(n)*W(n)    ⑴
where
W(n) = (1-a) - a*cos(2*pi*n/(N-1)),  0 <= n <= N-1
with a = 0.46; the value of a may range from 0.3 to 0.7, the specific value being determined from experimental and empirical data; W(n) is the Hamming window function, has a smoother low-pass characteristic and better reflects the frequency characteristics of the short-time speech signal;
4) fast Fourier transform (FFT):
perform a radix-2 FFT on the windowed speech time-domain signal S'(n) to obtain the linear spectrum X(n,k); the radix-2 FFT is a standard algorithm in the field; X(n,k) is the spectral energy density function of the n-th speech frame, k indexes the spectrum segment, and each speech frame corresponds to a time slice on the time axis;
5) generating the text-related voiceprint spectrogram:
with time n as the time-axis coordinate and k as the frequency-axis coordinate, compute |X(n,k)|^2 and display its value as a gray level at the corresponding coordinate point, forming the voiceprint spectrogram; applying the transform 10*log10(|X(n,k)|^2) gives the dB representation of the spectrogram;
fourth, filter and normalize the voiceprint spectrogram, the available filtering methods comprising Gaussian filtering, wavelet filtering and binarization, the user selecting one or several of them according to the actual test conditions;
fifth, perform machine learning on the voiceprint spectrograms to obtain the stable voiceprint feature learning matrix, i.e., the voiceprint key extraction matrix;
divide the voiceprint spectrograms obtained in the fourth step into two classes, one class being the user's own text-related voiceprint spectrograms and the other a comparison class mixing other users' readings of the related text and readings of unrelated text, together called the positive and negative sample sets;
let M = [M_1, M_2] denote the positive and negative sample sets participating in training, where M_i = [x_i1, x_i2, …, x_iL], i ∈ {1, 2}, denotes the i-th sample set, i = 1 being the positive set and i = 2 the negative set; each x_ir ∈ R^d, 1 <= i <= 2, 1 <= r <= L, is obtained by taking the two-dimensional matrix of pixel values of one voiceprint spectrogram, splicing its rows in order into a one-dimensional row vector, and transposing to get a one-dimensional column vector x_ir of length d; R^d denotes the d-dimensional real vector space, and L is the number of voiceprint spectrograms, i.e. column vectors, in each sample set;
now, from the characteristics of the two sample classes, train the voiceprint key extraction matrix W_1, W_1 ∈ R^{d×d_z}, such that the cost

J(W_1) = Σ_{r=1}^{L} ||W_1^T x_{1r} - W_1^T m_2||^2 - Σ_{r=1}^{L} ||W_1^T x_{1r} - W_1^T m_1||^2    ⑵

is maximized, where m_1 = (1/L) Σ_{r=1}^{L} x_{1r} is the positive sample mean of the training samples and m_2 = (1/L) Σ_{r=1}^{L} x_{2r} is the negative sample mean; J is a cost function reflecting, after projection of the training samples by the voiceprint key extraction matrix W_1, the difference between the distances to the positive and negative sample-set means, computed as Euclidean distances;
let:

H_1 = Σ_{r=1}^{L} (x_{1r} - m_2)(x_{1r} - m_2)^T,  H_2 = Σ_{r=1}^{L} (x_{1r} - m_1)(x_{1r} - m_1)^T;

solve for the eigenvalues and eigenvectors of the matrix (H_1 - H_2) to obtain the voiceprint key extraction matrix W_1, namely: (H_1 - H_2)w = λw, where w is an eigenvector of the matrix (H_1 - H_2) and λ the corresponding eigenvalue;
{w_1, w_2, …, w_{d_z}} are the eigenvectors corresponding to the eigenvalues {λ_1, λ_2, …, λ_{d_z}}, where λ_1 ≥ λ_2 ≥ … ≥ λ_{d_z} ≥ 0, and eigenvectors whose eigenvalues are less than 0 are not included in the construction of W_1;
at this point the voiceprint key extraction matrix W_1 = [w_1, w_2, …, w_{d_z}] has been trained;
Step two, voiceprint key extraction, which comprises the following specific steps:
step 1, the user records about 3 seconds of his or her own text-related speech;
step 2, extract the voiceprint spectrogram, see the third step of step one for details;
step 3, filter and normalize the voiceprint spectrogram, convert it into matrix form, and splice the rows in order to obtain the voiceprint vector x_t;
step 4, multiply the voiceprint vector x_t obtained in step 3 on the left by the transpose of the voiceprint key extraction matrix W_1 trained in step one, i.e. W_1^T · x_t, to obtain the d_z-dimensional voiceprint feature vector x_tz, x_tz being the stabilized voiceprint feature vector;
step 5, apply the checkerboard-method operation to each dimensional component of x_tz, further stabilizing the voiceprint feature vector into the quantized vector x'_tz;
the checkerboard method is as follows:
denote each dimensional component of x_tz by x_tzi;
the quantization formula is:

Λ(x) = round(x_tzi / D)    ⑶

where D is the grid size of the checkerboard method, a positive number whose specific value the user may choose from experience; the value of Λ(x) generally lies between 0 and 63; x_tzi is a component of x_tz, and Λ(x) is an integer;
Λ(x) is the quantized value of x_tzi: the coordinate, measured in grid units from the origin, of the checkerboard grid point closest to x_tzi;
step 6, take the quantized vector x'_tz computed in step 5 and splice its first 32 or 64 components end to end; each component takes a value in 0-63 and contributes 4 bits of key material, so a 128-bit or 256-bit voiceprint key is formed; this completes the voiceprint key extraction.
2. The text-related voiceprint key generation method according to claim 1, characterized in that: the normalization in the fourth step means unifying the spectrograms to a fixed length and width and unifying the value of every pixel to the range 0-255, which can be realized with the imresize function in the MATLAB function library.
CN201811139547.4A 2018-09-28 2018-09-28 Text-related voiceprint key generation method Active CN109326294B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811139547.4A CN109326294B (en) 2018-09-28 2018-09-28 Text-related voiceprint key generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811139547.4A CN109326294B (en) 2018-09-28 2018-09-28 Text-related voiceprint key generation method

Publications (2)

Publication Number Publication Date
CN109326294A CN109326294A (en) 2019-02-12
CN109326294B true CN109326294B (en) 2022-09-20

Family

ID=65266096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811139547.4A Active CN109326294B (en) 2018-09-28 2018-09-28 Text-related voiceprint key generation method

Country Status (1)

Country Link
CN (1) CN109326294B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110322887B (en) * 2019-04-28 2021-10-15 武汉大晟极科技有限公司 Multi-type audio signal energy feature extraction method
CN110223699B (en) * 2019-05-15 2021-04-13 桂林电子科技大学 Speaker identity confirmation method, device and storage medium
CN111161705B (en) * 2019-12-19 2022-11-18 寒武纪(西安)集成电路有限公司 Voice conversion method and device
CN112908303A (en) * 2021-01-28 2021-06-04 广东优碧胜科技有限公司 Audio signal processing method and device and electronic equipment
CN113179157B (en) * 2021-03-31 2022-05-17 杭州电子科技大学 Text-related voiceprint biological key generation method based on deep learning
CN113129897B (en) * 2021-04-08 2024-02-20 杭州电子科技大学 Voiceprint recognition method based on attention mechanism cyclic neural network

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001092974A (en) * 1999-08-06 2001-04-06 Internatl Business Mach Corp <Ibm> Speaker recognizing method, device for executing the same, method and device for confirming audio generation
CN103873254A (en) * 2014-03-03 2014-06-18 杭州电子科技大学 Method for generating human vocal print biometric key
CN103971690A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN106128465A (en) * 2016-06-23 2016-11-16 成都启英泰伦科技有限公司 A kind of Voiceprint Recognition System and method
CN107274890A (en) * 2017-07-04 2017-10-20 清华大学 Vocal print composes extracting method and device
CN108198561A (en) * 2017-12-13 2018-06-22 宁波大学 A kind of pirate recordings speech detection method based on convolutional neural networks
CN112786059A (en) * 2021-03-11 2021-05-11 合肥市清大创新研究院有限公司 Voiceprint feature extraction method and device based on artificial intelligence

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001092974A (en) * 1999-08-06 2001-04-06 Internatl Business Mach Corp <Ibm> Speaker recognizing method, device for executing the same, method and device for confirming audio generation
CN103971690A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN103873254A (en) * 2014-03-03 2014-06-18 杭州电子科技大学 Method for generating human vocal print biometric key
CN106128465A (en) * 2016-06-23 2016-11-16 成都启英泰伦科技有限公司 A kind of Voiceprint Recognition System and method
CN107274890A (en) * 2017-07-04 2017-10-20 清华大学 Vocal print composes extracting method and device
CN108198561A (en) * 2017-12-13 2018-06-22 宁波大学 A kind of pirate recordings speech detection method based on convolutional neural networks
CN112786059A (en) * 2021-03-11 2021-05-11 合肥市清大创新研究院有限公司 Voiceprint feature extraction method and device based on artificial intelligence

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Research on small-sample voiceprint recognition under the TL-CNN-GAP model; Ding Dongbing; Computer Knowledge and Technology; 2018-08-25 (No. 24); full text *
Application of PCNN-based spectrogram feature extraction in speaker recognition; Ma Yide et al.; Computer Engineering and Applications; 2006-08-01 (No. 20); full text *
Identity authentication vector recognition method based on spectrogram features; Feng Huizong et al.; Journal of Chongqing University; 2017-05-15 (No. 05); full text *

Also Published As

Publication number Publication date
CN109326294A (en) 2019-02-12

Similar Documents

Publication Publication Date Title
CN109326294B (en) Text-related voiceprint key generation method
CN110659468B (en) File encryption and decryption system based on C/S architecture and speaker identification technology
US8447614B2 (en) Method and system to authenticate a user and/or generate cryptographic data
CN103236260A (en) Voice recognition system
CN109584893B (en) VAE and i-vector based many-to-many voice conversion system under non-parallel text condition
Srivastava et al. Privacy and utility of x-vector based speaker anonymization
Firooz et al. Improvement of automatic speech recognition systems via nonlinear dynamical features evaluated from the recurrence plot of speech signals
CN112735435A (en) Voiceprint open set identification method with unknown class internal division capability
Hsu et al. Local wavelet acoustic pattern: A novel time–frequency descriptor for birdsong recognition
Esmaeilpour et al. Multidiscriminator sobolev defense-GAN against adversarial attacks for end-to-end speech systems
CN115101077A (en) Voiceprint detection model training method and voiceprint recognition method
Do et al. Speech Separation in the Frequency Domain with Autoencoder.
CN113436646B (en) Camouflage voice detection method adopting combined features and random forest
Ziabary et al. A countermeasure based on cqt spectrogram for deepfake speech detection
CN110600046A (en) Many-to-many speaker conversion method based on improved STARGAN and x vectors
Abbas et al. Heart‐ID: human identity recognition using heart sounds based on modifying mel‐frequency cepstral features
Sanderson et al. Features for robust face-based identity verification
Marras et al. Dictionary attacks on speaker verification
CN115761048A (en) Face age editing method based on video time sequence
Huang et al. Audio-replay Attacks Spoofing Detection for Automatic Speaker Verification System
CN108510995B (en) Identity information hiding method facing voice communication
Thebaud et al. Spoofing speaker verification with voice style transfer and reconstruction loss
Alam On the use of fisher vector encoding for voice spoofing detection
Wilson et al. Voice Aging with Audio-Visual Style Transfer
Nainan et al. A comparison of performance evaluation of ASR for noisy and enhanced signal using GMM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant