WO2020181824A1 - Method, apparatus and device for voiceprint recognition, and computer-readable storage medium - Google Patents

Method, apparatus and device for voiceprint recognition, and computer-readable storage medium

Info

Publication number
WO2020181824A1
WO2020181824A1 (PCT/CN2019/118656)
Authority
WO
WIPO (PCT)
Prior art keywords
voiceprint
feature
voice
fusion
voiceprint feature
Prior art date
Application number
PCT/CN2019/118656
Other languages
English (en)
Chinese (zh)
Inventor
徐凌智
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020181824A1

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L25/45: Speech or voice analysis techniques characterised by the type of analysis window

Definitions

  • This application relates to the technical field of voiceprint recognition, and in particular to a voiceprint recognition method, apparatus, and device, and a computer-readable storage medium.
  • A voiceprint recognition system is a system that automatically recognizes a speaker's identity based on the characteristics of the human voice.
  • Voiceprint recognition technology is a type of biometric verification technology in which the speaker's identity is verified through voice. The technology offers good convenience, stability, measurability, and security, and is commonly used in banking, social security, public security, smart home, mobile payment, and other fields.
  • Current voiceprint recognition systems are generally based on the Gaussian Mixture Model-Universal Background Model (GMM-UBM) proposed in the 1990s, which is simple, flexible, and robust.
  • In recent years, neural-network-based voiceprint verification systems have been applied in practice, and on some data sets the performance of neural-network-based models exceeds that of a single GMM-UBM.
  • The main purpose of this application is to provide a voiceprint recognition method, apparatus, device, and computer-readable storage medium, aiming to solve the technical problem of low voiceprint recognition accuracy in the prior art.
  • To achieve the above purpose, this application provides a voiceprint recognition method, which includes: obtaining the verification voice to be recognized; using a GMM-UBM model to extract the first voiceprint feature of the verification voice, and using a neural network model to extract the second voiceprint feature of the verification voice; performing feature fusion of the first voiceprint feature and the second voiceprint feature of the verification voice to obtain a fusion voiceprint feature vector of the verification voice; calculating the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in a preset registered voiceprint database; and determining the voiceprint recognition result of the verification voice based on the similarity.
  • To achieve the above purpose, this application also provides a voiceprint recognition apparatus, including:
  • a data acquisition module configured to acquire the verification voice to be recognized;
  • a data processing module configured to extract the first voiceprint feature of the verification voice using a GMM-UBM model, and to extract the second voiceprint feature of the verification voice using a neural network model;
  • a data fusion module configured to perform feature fusion of the first voiceprint feature and the second voiceprint feature of the verification voice to obtain a fusion voiceprint feature vector of the verification voice;
  • a data comparison module configured to calculate the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in the preset registered voiceprint database;
  • a data judgment module configured to judge the voiceprint recognition result of the verification voice based on the similarity.
  • The present application also provides a voiceprint recognition device, including a processor, a memory, and a voiceprint recognition program stored on the memory and executable by the processor, where the steps of the voiceprint recognition method described above are implemented when the voiceprint recognition program is executed by the processor.
  • The present application also provides a computer-readable storage medium on which a voiceprint recognition program is stored, where the steps of the above voiceprint recognition method are implemented when the voiceprint recognition program is executed by a processor.
  • This application uses the GMM-UBM model to extract the first voiceprint feature of the verification voice, and uses the neural network model to extract the second voiceprint feature of the verification voice; the first voiceprint feature and the second voiceprint feature of the verification voice are fused to obtain the fusion voiceprint feature vector of the verification voice; the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in the preset registered voiceprint database is calculated; and based on the similarity, the voiceprint recognition result of the verification voice is determined.
  • In this way, the GMM-UBM model and the neural network model are combined: both models extract features from the verification voice, and the features extracted by both models are used for voice verification. Compared with extracting features and performing verification with a single model, the information contained in the features extracted by the two models is more comprehensive, so the verification voice and the registered voice can be compared more fully and the accuracy of voiceprint recognition is improved.
  • FIG. 1 is a schematic diagram of the hardware structure of a voiceprint recognition device involved in the solutions of the embodiments of the present application;
  • FIG. 2 is a schematic flowchart of an embodiment of the voiceprint recognition method of the present application;
  • FIG. 3 is a schematic flowchart of another embodiment of the voiceprint recognition method of the present application;
  • FIG. 4 is a detailed flowchart of an embodiment of step S20 in FIG. 2;
  • FIG. 5 is a detailed flowchart of another embodiment of step S20 in FIG. 2;
  • FIG. 6 is a detailed flowchart of an embodiment of step S30 in FIG. 2;
  • FIG. 7 is a schematic diagram of the functional modules of an embodiment of the voiceprint recognition device of the present application.
  • FIG. 1 is a schematic diagram of the hardware structure of the voiceprint recognition device involved in the solution of the embodiment of the present invention.
  • the voiceprint recognition device may include a processor 1001 (for example, a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005.
  • the communication bus 1002 is used to realize the connection and communication between these components;
  • the user interface 1003 may include a display (Display) and an input unit such as a keyboard (Keyboard);
  • the network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface);
  • the memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as a magnetic disk memory;
  • the memory 1005 may optionally be a storage device independent of the aforementioned processor 1001.
  • The structure shown in FIG. 1 does not constitute a limitation on the voiceprint recognition device, which may include more or fewer components than shown in the figure, combine certain components, or adopt a different component arrangement.
  • the memory 1005 as a computer-readable storage medium in FIG. 1 may include an operating system, a network communication module, and a voiceprint recognition program.
  • the network communication module is mainly used to connect to the server and perform data communication with the server; and the processor 1001 can call the voiceprint recognition program stored in the memory 1005 and execute the voiceprint recognition method provided by the embodiment of the present invention.
  • FIG. 2 is a schematic flowchart of an embodiment of a voiceprint recognition method according to the present application.
  • the voiceprint recognition method includes the following steps:
  • Step S10 Obtain the verification voice to be recognized
  • The verification voice is the voice uttered by a user who has already completed voice registration; if the user has not performed voice registration, the voice uttered by the user is invalid.
  • For example, the verification voice of a registered user is captured through a microphone, and the microphone sends the captured voice to the processing terminal of the voiceprint recognition device; or the verification voice is captured through a smart terminal (mobile phone, tablet, etc.), and the smart terminal sends the obtained verification voice to the processing terminal of the voiceprint recognition device. Of course, the verification voice can also be obtained by other devices, which will not be listed here.
  • The verification voice to be recognized can also be screened to eliminate verification voices of poor quality.
  • The duration and the volume of the verification voice to be recognized can also be detected. If the duration of the verification voice is greater than or equal to the preset voice duration, a prompt indicates that obtaining the verification voice to be recognized succeeded; if the duration is less than the preset voice duration, a prompt indicates that obtaining the verification voice to be recognized failed.
  • This screening ensures the quality of the obtained verification voice, and also ensures that the features extracted from it are relatively distinct and clear, which helps improve the accuracy of voiceprint recognition.
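  • As an illustration of this screening step, a minimal sketch in Python; the duration and volume thresholds here are illustrative assumptions, not values fixed by this application:

```python
import numpy as np

# Illustrative thresholds -- the application does not specify concrete values.
MIN_DURATION_S = 2.0   # assumed preset voice duration
MIN_RMS = 0.01         # assumed minimum volume (RMS amplitude)

def screen_verification_voice(samples: np.ndarray, sample_rate: int) -> bool:
    """Return True if the verification voice is long and loud enough to keep."""
    duration = len(samples) / sample_rate
    rms = float(np.sqrt(np.mean(samples.astype(np.float64) ** 2)))
    if duration < MIN_DURATION_S:
        print("Obtaining the verification voice to be recognized failed: too short.")
        return False
    if rms < MIN_RMS:
        print("Obtaining the verification voice to be recognized failed: too quiet.")
        return False
    print("Obtaining the verification voice to be recognized succeeded.")
    return True
```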
  • Step S20 Use a GMM-UBM model to extract the first voiceprint feature of the verification voice, and use a neural network model to extract the second voiceprint feature of the verification voice;
  • In this step, the GMM-UBM model (Gaussian Mixture Model-Universal Background Model) and the neural network model both extract features from the verification voice. Since the GMM-UBM model and the neural network model are two different models, when they extract voiceprint features from the verification voice they may extract the same voiceprint features, different voiceprint features, or partially overlapping voiceprint features; this is not specifically limited here.
  • In an embodiment, the GMM-UBM model and the neural network model extract different voiceprint features from the verification voice.
  • The first voiceprint feature extracted from the verification voice by the GMM-UBM model includes multiple sub-features such as timbre, frequency, amplitude, and volume.
  • The second voiceprint feature extracted from the verification voice by the neural network model includes multiple sub-features such as fundamental frequency, Mel frequency cepstral coefficients, formants, pitch, and reflection coefficients.
  • The GMM-UBM model and the neural network model may extract voiceprint features from the same sound segment of the verification voice.
  • They may also extract voiceprint features from different sound segments of the verification voice.
  • They may also extract voiceprint features from partially overlapping sound segments of the verification voice; this is not specifically limited here.
  • Step S30 Perform feature fusion of the first voiceprint feature and the second voiceprint feature of the verification voice to obtain a fusion voiceprint feature vector of the verification voice;
  • In this step, the fusion voiceprint feature vector of the verification voice is obtained by fusing the first voiceprint feature and the second voiceprint feature of the verification voice.
  • There are many ways to fuse the first voiceprint feature and the second voiceprint feature: for example, they can be superimposed in full to form the fusion voiceprint feature vector of the verification voice, or only some of their sub-features can be superimposed to form the fusion voiceprint feature vector of the verification voice.
  • The first voiceprint feature and the second voiceprint feature of the verification voice can also be fused in other ways, which will not be listed here.
  • Step S40 Calculate the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in the preset registered voiceprint database;
  • The voiceprint feature vector of a registered user is established by the voiceprint recognition device during that user's voice registration.
  • Each user corresponds to one registered voiceprint feature vector, and the voiceprint feature vectors of all registered users are stored in the data storage module of the voiceprint recognition device; together, the voiceprint feature vectors of the registered users form the preset registered voiceprint database.
  • The similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of a registered user can be calculated using the Pearson correlation coefficient, the Euclidean distance, the cosine similarity, the Manhattan distance, and so on; these are not listed one by one here.
  • Because the fusion voiceprint feature vector of the verification voice needs to be compared with the voiceprint feature vector of every registered user in the registered voiceprint database, the voiceprint recognition device has to perform a large number of calculations.
  • To reduce this load, the voiceprint feature vectors of the registered users in the preset registered voiceprint database can be associated with one another: specifically, the similarity between the voiceprint feature vectors of any two registered users can be pre-computed, thereby associating the voiceprint feature vectors of the registered users.
  • Then, after the fusion voiceprint feature vector of the verification voice has been compared with the voiceprint feature vector of one registered user, the pre-computed similarities can be used to screen out the voiceprint feature vectors of other registered users whose similarity to that registered user is low, reducing the amount of calculation performed by the voiceprint recognition device; a sketch of the basic comparison follows.
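  • As a concrete sketch of the comparison in step S40 (and the threshold decision used in step S50 below), assuming cosine similarity as the metric and an illustrative threshold; neither choice is fixed by this application:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two voiceprint feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_against_database(fusion_vector: np.ndarray, registered_db: dict) -> dict:
    """Similarity of the verification voice against every registered user.

    registered_db maps a user id to that user's registered voiceprint
    feature vector (a np.ndarray of the same dimension as fusion_vector).
    """
    return {user: cosine_similarity(fusion_vector, vec)
            for user, vec in registered_db.items()}

def decide(scores: dict, threshold: float = 0.8):
    """Return the best-matching user id, or None if recognition fails."""
    best_user = max(scores, key=scores.get)
    return best_user if scores[best_user] >= threshold else None
```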
  • Step S50 Determine the voiceprint recognition result of the verification voice based on the similarity.
  • In this step, the voiceprint recognition result of the verification voice is determined based on the relationship between the similarity and a preset threshold. When the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of some registered user is greater than or equal to the preset threshold, voiceprint recognition is determined to be successful, and the registered user whose voiceprint feature vector has the highest similarity to the fusion voiceprint feature vector of the verification voice is taken as the matching user; when the similarity with every registered user's voiceprint feature vector is less than the preset threshold, voiceprint recognition is determined to have failed.
  • This application uses the GMM-UBM model to extract the first voiceprint feature of the verification voice, and uses the neural network model to extract the second voiceprint feature of the verification voice; the first voiceprint feature and the second voiceprint feature of the verification voice are fused to obtain the fusion voiceprint feature vector of the verification voice; the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in the preset registered voiceprint database is calculated; and based on the similarity, the voiceprint recognition result of the verification voice is determined.
  • In this way, the GMM-UBM model and the neural network model are combined: both models extract features from the verification voice, and the features extracted by both models are used for voice verification. Compared with extracting features and performing verification with a single model, the information contained in the features extracted by the two models is more comprehensive, so the verification voice and the registered voice can be compared more fully and the accuracy of voiceprint recognition is improved.
  • In an embodiment, the following steps are further included before step S10:
  • Step S100 Obtain the registered voice of the registered user
  • The registered voice is the voice uttered by a user who needs to register; the method of obtaining the registered voice is the same as the method of obtaining the verification voice in step S10.
  • The voiceprint recognition system uses the voice registered by a user as that user's verification standard, so the quality of the registered voice directly affects the accuracy of voiceprint recognition.
  • Step S110 Use a GMM-UBM model to extract the third voiceprint feature of the registered voice, and use a neural network model to extract the fourth voiceprint feature of the registered voice;
  • The sub-features contained in the third voiceprint feature are the same as those contained in the first voiceprint feature, and the sub-features contained in the fourth voiceprint feature are the same as those contained in the second voiceprint feature.
  • Step S120 Perform feature fusion of the third voiceprint feature and the fourth voiceprint feature of the registered voice to obtain a fused voiceprint feature vector of the registered voice;
  • Step S130 Save the fused voiceprint feature vector of the registered voice in the registered voiceprint database as the voiceprint feature vector of the registered user.
  • The data storage module of the voiceprint recognition device is provided with a registered voiceprint database, and the fused voiceprint feature vector of each registered voice is stored in it; that is, the registered voiceprint database stores the fused voiceprint feature vectors of the registered voices.
  • the fusion voiceprint feature vectors of registered voices can be classified and stored, for example, classified and stored according to similarity, that is, the fusion voiceprint feature vectors of multiple registered voices with higher similarity are stored in a subset. Multiple subsets form the registered voiceprint database.
  • Alternatively, storage can be classified by gender: the fused voiceprint feature vectors of male registered users' voices and those of female registered users' voices are stored separately.
  • the fusion feature vector of the registered voice can also be stored in other ways, which will not be listed here.
  • In an embodiment, step S20 includes:
  • Step S210 Perform pre-emphasis, framing, and windowing preprocessing on the verification voice;
  • Because the voice signal is only short-term stationary, it needs to be framed and windowed after pre-emphasis is completed, which makes it convenient to process the voice signal with short-term analysis techniques. Under normal circumstances, the number of frames per second is about 33-100.
  • The framing method can be either contiguous segmentation or overlapping segmentation, but the latter makes the transition between frames smooth and maintains continuity.
  • The offset between the previous frame and the next frame is called the frame shift, and the ratio of frame shift to frame length generally lies in the range (0, 1/2].
  • The voice signal is intercepted with a movable window of finite length, that is, it is framed.
  • The commonly used window functions are the rectangular window, the Hamming window, and the Hanning window; a short sketch of this preprocessing follows.
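  • A minimal sketch of the pre-emphasis, framing, and windowing described above, in Python; the 25 ms frame length and 10 ms frame shift are illustrative choices (100 frames per second, frame shift 40% of the frame length), not values fixed by this application:

```python
import numpy as np

def preemphasis(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Pre-emphasis filter y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_and_window(x: np.ndarray, sample_rate: int,
                     frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    """Split a signal into overlapping frames and apply a Hamming window.

    Assumes len(x) is at least one frame long. Returns an array of shape
    (n_frames, frame_len), each row one windowed frame.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + (len(x) - frame_len) // shift
    window = np.hamming(frame_len)
    return np.stack([x[i * shift: i * shift + frame_len] * window
                     for i in range(n_frames)])
```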
  • After preprocessing, the characteristic parameters are extracted.
  • The selection of characteristic parameters should satisfy several principles: first, the parameters should be easy to extract from the speech signal; second, they should be difficult to imitate; third, they should not change over time or place, that is, they should be relatively stable; fourth, they should effectively discriminate between different speakers.
  • the current speaker verification system mainly relies on the low-level acoustic features of speech for recognition. These features can be divided into time domain features and transform domain features.
  • Step S220 Extract feature parameters including the Mel frequency cepstral coefficients, the first-order difference of the linear prediction cepstral coefficients, the energy, the first-order difference of the energy, and the Gammatone filter cepstral coefficients from the preprocessed verification voice to obtain the first voiceprint feature of the verification voice;
  • The energy spectrum is obtained by squaring the spectrum X(k); it is then smoothed, and its harmonics eliminated, by the Mel frequency filter bank to obtain the corresponding Mel spectrum.
  • The Mel frequency filter bank is based on the masking effect of sound: a number of triangular band-pass filters H_m(k) (0 ≤ m < M, where M is the number of filters) are set in the frequency spectrum of the speech.
  • The center frequency of the m-th filter is f(m), and the interval between adjacent f(m) values widens as m increases.
  • the transfer function of the triangular band-pass filter bank can be expressed by the following formula:
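  • A standard form of this transfer function, consistent with the definitions above (the original expression is an image not reproduced in this text), is:

    $$H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1) \end{cases}$$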
  • In the cepstrum formula, L is the order of the MFCC parameters; in the energy formula, L is the number of frames of the speech segment.
  • E_max = max_l E_l is the largest logarithmic energy in the speech segment.
  • A(z) is the inverse filter; LPC analysis amounts to solving for the linear prediction coefficients a_k, and this application adopts the autocorrelation-based recursive solution (i.e., the Durbin algorithm).
  • The specific steps for extracting the dynamic characteristic parameters, namely the first-order differences of the Mel frequency cepstral coefficients, the linear prediction cepstral coefficients, and the energy, are as follows:
  • The Mel frequency cepstral coefficients, linear prediction cepstral coefficients, and energy parameters introduced above only represent the instantaneous information of the speech spectrum; they are static parameters.
  • The dynamic information of the speech spectrum also contains speaker-related information, which can be used to improve the recognition rate of a speaker recognition system.
  • The dynamic information of the speech cepstrum represents how the speech characteristic parameters change over time.
  • The change of the speech cepstrum over time can be expressed by the first-order orthogonal polynomial coefficient:

    $$\Delta c_m(n) = \frac{\sum_{k=-K}^{K} k \, c_m(n+k)}{\sum_{k=-K}^{K} k^2}$$

  • Here c_m denotes the m-th order cepstral coefficient, n indexes the frames on the time axis, and k ranges over the neighboring frames. In practical applications the window function is mostly a rectangular window, and K is usually taken as 2, so the dynamic parameter is a linear combination of the parameters of the two frames preceding and the two frames following the current frame. The first-order dynamic parameters of the Mel frequency cepstral coefficients, the linear prediction cepstral coefficients, and the energy can all be obtained from this formula.
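  • A minimal Python sketch of this first-order difference computation, applicable to MFCC, LPCC, or energy trajectories; K = 2 as stated above, with boundary frames repeated at the edges (an implementation choice the application leaves open):

```python
import numpy as np

def delta(features: np.ndarray, K: int = 2) -> np.ndarray:
    """First-order difference (delta) of per-frame feature parameters.

    features: array of shape (n_frames, n_params), e.g. MFCCs, LPCCs, or
    log-energy. Implements delta_c(n) = sum_k k*c(n+k) / sum_k k^2 over
    k = -K..K, i.e. a linear combination of the two preceding and two
    following frames when K = 2.
    """
    n = len(features)
    padded = np.concatenate([features[:1].repeat(K, axis=0),
                             features,
                             features[-1:].repeat(K, axis=0)])
    denom = 2 * sum(k * k for k in range(1, K + 1))
    return np.stack([
        sum(k * (padded[i + K + k] - padded[i + K - k]) for k in range(1, K + 1)) / denom
        for i in range(n)
    ])
```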
  • The Gammatone filter is a standard cochlear auditory filter.
  • The time-domain impulse response of the filter is (in its standard form, reconstructed from the definitions below):

    $$g_i(t) = A \, t^{\,n-1} \, e^{-2\pi b_i t} \cos(2\pi f_i t + \phi_i) \, U(t)$$

  • Here A is the filter gain, f_i is the center frequency of the filter, U(t) is the unit step function, and φ_i is the phase; for simplicity, φ_i is taken as 0. n is the order of the filter, and b_i is the decay factor that determines the bandwidth of the i-th filter.
  • N is the number of filters.
  • The center frequencies of the filter bank are equally spaced in the ERB (equivalent rectangular bandwidth) domain, and the frequency coverage of the entire filter bank is 80 Hz to 8000 Hz.
  • Each center frequency is calculated by spacing the filters evenly on the ERB scale, where v_i, the filter overlap factor, specifies the percentage of overlap between adjacent filters; after the center frequency of each filter is determined, the corresponding bandwidth can be obtained from the ERB formula.
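  • Under the commonly used Glasberg-Moore ERB convention (an assumption here, since the application's own formula is not reproduced in this text), the bandwidth and ERB-number scales are

    $$\mathrm{ERB}(f) = 24.7\left(4.37\,\tfrac{f}{1000} + 1\right), \qquad E(f) = 21.4\,\log_{10}\!\left(4.37\,\tfrac{f}{1000} + 1\right)$$

and the N center frequencies f_i are chosen so that their ERB numbers E(f_i) are equally spaced between E(80 Hz) and E(8000 Hz); the bandwidth of the i-th filter is then proportional to ERB(f_i), scaled by the overlap factor v_i.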
  • Step (3) Gammatone filter bank filtering. The spectrum X(k) obtained in step (1) is squared to obtain the power spectrum, which is then filtered with the Gammatone filter bank G_m(k).
  • Then the logarithm is taken to obtain the logarithmic spectrum S(m), which compresses the dynamic range of the speech spectrum and converts the multiplicative noise components in the frequency domain into additive components.
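  • A minimal sketch of step (3) and the subsequent log compression in Python, assuming the filter bank's magnitude responses G_m(k) have already been sampled on the FFT bins:

```python
import numpy as np

def log_gammatone_spectrum(frame: np.ndarray, filterbank: np.ndarray) -> np.ndarray:
    """Gammatone filter bank filtering and log compression of one frame.

    frame: one windowed time-domain frame; filterbank: array of shape
    (M, n_fft // 2 + 1) holding the magnitude responses G_m(k), where
    n_fft == len(frame). Squaring the spectrum X(k) gives the power
    spectrum, each filter weights and sums it, and the logarithm
    compresses the dynamic range, turning multiplicative noise components
    in the frequency domain into additive ones.
    """
    power = np.abs(np.fft.rfft(frame)) ** 2      # |X(k)|^2
    energies = filterbank @ power                # per-filter energies
    return np.log(energies + 1e-10)              # S(m), floored for stability
```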
  • In another embodiment, step S20 includes:
  • Step S210' Arrange the verification voice into a spectrogram of a predetermined number of dimensions;
  • In a specific implementation, a feature vector of a predetermined dimension may be extracted from the verification voice at every predetermined time interval, so as to arrange the verification voice into a spectrogram of the predetermined dimensions.
  • The predetermined number of dimensions, the predetermined dimension, and the predetermined time interval can be set according to requirements and/or system performance during specific implementation; their sizes are not limited here.
  • Step S220' Recognize the spectrogram of the predetermined number of dimensions through a neural network to obtain the second voiceprint feature of the verification voice.
  • the second voiceprint feature can better characterize the acoustic features in the speech and improve the accuracy of speech recognition.
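  • A minimal sketch of steps S210'-S220', assuming PyTorch and an illustrative convolutional architecture; the application does not fix the network structure, so every layer choice below is an assumption:

```python
import torch
import torch.nn as nn

class VoiceprintCNN(nn.Module):
    """Maps a spectrogram of predetermined dimensions to a second voiceprint feature."""
    def __init__(self, embedding_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),     # collapse the time/frequency axes
        )
        self.fc = nn.Linear(32, embedding_dim)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, n_bins, n_frames) -- feature vectors of a
        # predetermined dimension extracted at every predetermined time interval.
        h = self.conv(spectrogram).flatten(1)
        return self.fc(h)                # the second voiceprint feature vector
```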
  • In an embodiment, step S30 specifically includes:
  • using a Markov chain Monte Carlo (MCMC) random model to fuse the dimensions of the first voiceprint feature and the second voiceprint feature to obtain the fusion voiceprint feature vector of the verification voice.
  • Specifically, the MCMC random model randomly obtains multiple features from the first voiceprint feature and multiple features from the second voiceprint feature, and then fuses the features obtained from the first voiceprint feature with those obtained from the second voiceprint feature to obtain the fusion voiceprint feature vector of the verification voice.
  • For example, if the MCMC random model randomly extracts 10 of the 15 features in the first voiceprint feature and 15 of the 20 features in the second voiceprint feature, a fusion voiceprint feature vector containing 25 voiceprint features is obtained for the verification voice after fusion.
  • FIG. 6 is a detailed flowchart of an embodiment of step S30 in FIG. 2.
  • the first voiceprint feature includes a plurality of first voiceprint sub-features
  • the second voiceprint feature includes a plurality of second voiceprint sub-features
  • the foregoing step S30 includes:
  • Step S310 Set the total number of features of the fusion voiceprint feature vector of the verification voice to K;
  • Step S320 Determine the fusion ratio of the first voiceprint sub-features and the second voiceprint sub-features according to the total feature number K of the fusion voiceprint feature vector of the verification voice;
  • Step S330 According to the fusion ratio of the first voiceprint sub-features and the second voiceprint sub-features, use the Gibbs sampling of MCMC to simulate the sampling process of a joint normal distribution, and determine the first voiceprint sub-features selected from the first voiceprint feature and the second voiceprint sub-features selected from the second voiceprint feature, which together constitute the fusion voiceprint feature vector of the verification voice.
  • Step S320 specifically includes:
  • Step A Generate a random number in [0,1] as the parameter p, which represents the proportion of the first voiceprint sub-features among the fusion voiceprint features of the verification voice;
  • Step C Generate a random number q in [0,1] and compare it with the parameter p: if q ≤ p, select one of the second voiceprint sub-features and increase the count of second voiceprint sub-features by 1; if q > p, select one of the first voiceprint sub-features and increase the count of first voiceprint sub-features by 1;
  • Step D Increase the value of k by 1 and judge whether k ≥ K. If so, count the numbers of first and second voiceprint sub-features to be selected into the fusion voiceprint feature vector of the verification voice, record them as A and B respectively, and end the sampling process; otherwise, return to step C (see the sketch after this list).
  • For example, if the number of first voiceprint sub-features A is 3 and the number of second voiceprint sub-features B is 5, then three first voiceprint sub-features and five second voiceprint sub-features should be selected in the subsequent specific feature selection process.
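  • A minimal Python sketch of steps A-D, following the q ≤ p / q > p mapping exactly as stated above:

```python
import random

def sample_counts(K: int, seed=None):
    """Direct sampling of the fusion ratio (steps A-D).

    Draws p once, then draws q a total of K times: q <= p increments the
    second-feature count B, q > p increments the first-feature count A.
    Returns (A, B) with A + B == K.
    """
    rng = random.Random(seed)
    p = rng.random()              # step A: proportion parameter
    A = B = 0
    for _ in range(K):            # steps C-D: K comparisons of q with p
        q = rng.random()
        if q <= p:
            B += 1                # select a second voiceprint sub-feature
        else:
            A += 1                # select a first voiceprint sub-feature
    return A, B
```

  • For instance, sample_counts(8) might return (3, 5), matching the example above.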
  • Step S330 specifically includes:
  • Step F Count the number of candidate features for the fusion voiceprint feature vector of the verification voice from the first voiceprint feature, record it as M, and generate M random numbers in [0,1] as the initial state x(0) = [x_1(0), x_2(0), ..., x_M(0)];
  • Step G Each time the number of transitions t increases by 1, for each variable x_i(t), i ∈ {1, 2, ..., M}, perform the calculation according to the conditional probability distribution formula obtained from the joint probability distribution, where the mean of the joint probability distribution is X; judge whether t < T, and if so return to step G, otherwise obtain P(T) = [P(x_1(T)), P(x_2(T)), ..., P(x_M(T))];
  • Step H According to the number A of first voiceprint sub-features to be selected into the fusion voiceprint feature vector of the verification voice, calculated in step D, select the A first voiceprint sub-features with the largest probabilities P(x_i(T)) as the first voiceprint sub-features of the fusion voiceprint feature vector of the verification voice;
  • Step J Count the number of candidate features for the fusion voiceprint feature vector of the verification voice from the second voiceprint feature, record it as N, and generate N random numbers in [0,1] as the initial state y(0) = [y_1(0), y_2(0), ..., y_N(0)];
  • Step K Each time the number of transitions t increases by 1, for each variable y_j(t), j ∈ {1, 2, ..., N}, perform the calculation according to the conditional probability distribution formula obtained from the joint probability distribution, where the mean of the joint probability distribution is Y;
  • Step L According to the number B of second voiceprint sub-features to be selected into the fusion voiceprint feature vector of the verification voice, calculated in step D, select the B second voiceprint sub-features with the largest probabilities P(y_j(T)) as the second voiceprint sub-features of the fusion voiceprint feature vector of the verification voice.
  • For example, if T = 50 and the computed probabilities are P(x_i(50)) = [0.6, 0.2, 0.5, 0.8, 0.9] with A = 2, then the two features corresponding to the largest probabilities, 0.9 and 0.8, are selected into the fusion voiceprint feature vector of the verification voice (a simplified sketch of this selection follows).
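  • A much-simplified Python sketch of the Gibbs-sampling selection in steps F-H (steps J-L are identical with N, B, and Y in place of M, A, and X). The conditional update below, a normal centred at a fraction of the mean of the other variables, is a stand-in assumption, since the application leaves the joint normal distribution's parameters to the implementation:

```python
import numpy as np

def gibbs_select(num_candidates: int, count: int, T: int = 50,
                 rho: float = 0.3, seed: int = 0) -> np.ndarray:
    """Select `count` of `num_candidates` sub-features by a Gibbs-style chain.

    Runs T transitions; in each, every variable x_i is resampled from a
    normal whose mean depends on the other variables (the stand-in
    conditional noted above). The final states are scored with the standard
    normal density, and the indices of the `count` largest probabilities
    P(x_i(T)) are returned, as in steps H and L.
    """
    rng = np.random.default_rng(seed)
    x = rng.random(num_candidates)            # step F: random initial state x(0)
    for _ in range(T):                        # step G: T Gibbs transitions
        for i in range(num_candidates):
            others_mean = (x.sum() - x[i]) / (num_candidates - 1)
            x[i] = rng.normal(rho * others_mean, np.sqrt(1.0 - rho ** 2))
    prob = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)   # P(x_i(T))
    return np.argsort(prob)[::-1][:count]     # steps H/L: top-`count` indices
```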
  • this application also provides a voiceprint recognition device.
  • FIG. 7 is a functional block diagram of an embodiment of a voiceprint recognition device of the present application.
  • the voiceprint recognition device includes:
  • the data acquisition module 10 is configured to acquire the verification voice to be recognized
  • the data processing module 20 is configured to use the GMM-UBM model to extract the first voiceprint feature of the verified voice, and use the neural network model to extract the second voiceprint feature of the verified voice;
  • the data fusion module 30 is configured to perform feature fusion of the first voiceprint feature and the second voiceprint feature of the verification voice to obtain a fusion voiceprint feature vector of the verification voice;
  • the data comparison module 40 is configured to calculate the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in the preset registered voiceprint database;
  • the data judgment module 50 is configured to judge the voiceprint recognition result of the verification voice based on the similarity.
  • In an embodiment, the voiceprint recognition device further includes a module for acquiring the voiceprint feature vector of a registered user, which includes:
  • a voiceprint feature extraction unit configured to use the GMM-UBM model to extract the third voiceprint feature of the registered voice, and to use the neural network model to extract the fourth voiceprint feature of the registered voice;
  • a fusion unit configured to perform feature fusion of the third voiceprint feature and the fourth voiceprint feature of the registered voice to obtain the fused voiceprint feature vector of the registered voice;
  • a saving unit configured to save the fused voiceprint feature vector of the registered voice in the registered voiceprint database as the voiceprint feature vector of the registered user.
  • the data processing module 20 further includes:
  • the first pre-processing unit 201 is configured to perform pre-emphasis, framing, and windowing pre-processing on the verification voice;
  • the first extraction unit 202 is configured to extract feature parameters including the pitch period, the linear prediction cepstral coefficients, the first-order difference of the linear prediction cepstral coefficients, the energy, the first-order difference of the energy, and the Gammatone filter cepstral coefficients from the preprocessed verification voice, to obtain the first voiceprint feature of the verification voice;
  • the second preprocessing unit 203 is configured to arrange the verification voice into a spectrogram of a predetermined number of dimensions;
  • the second extraction unit 202 is configured to recognize the spectrogram of the predetermined number of dimensions through a neural network to obtain the second voiceprint feature of the verification voice.
  • the data fusion module 30 includes:
  • the data fusion unit 301 is configured to use a Markov chain Monte Carlo random model to perform the fusion of the first voiceprint feature dimension and the second voiceprint feature dimension to obtain the fused voiceprint feature vector of the verification voice.
  • the data fusion unit 301 includes:
  • the setting subunit 3011 is configured to set the total number of features of the fusion voiceprint feature vector of the verification voice to K;
  • the determining subunit 3012 is configured to determine the fusion ratio of the first voiceprint sub-features and the second voiceprint sub-features by the direct sampling method according to the total feature number K of the fusion voiceprint feature vector of the verification voice;
  • the fusion subunit 3013 is configured to use the Gibbs sampling of MCMC to simulate the sampling process of a joint normal distribution according to the fusion ratio of the first voiceprint sub-features and the second voiceprint sub-features, and to determine the first voiceprint sub-features selected from the first voiceprint feature and the second voiceprint sub-features selected from the second voiceprint feature, which together constitute the fusion voiceprint feature vector of the verification voice.
  • The determining subunit 3012 is specifically configured to perform:
  • Step A Generate a random number in [0,1] as the parameter p, which represents the proportion of the first voiceprint sub-features among the fusion voiceprint features of the verification voice;
  • Step C Generate a random number q in [0,1] and compare it with the parameter p: if q ≤ p, select one of the second voiceprint sub-features and increase the count of second voiceprint sub-features by 1; if q > p, select one of the first voiceprint sub-features and increase the count of first voiceprint sub-features by 1;
  • Step D Increase the value of k by 1 and judge whether k ≥ K. If so, count the numbers of first and second voiceprint sub-features to be selected into the fusion voiceprint feature vector of the verification voice, record them as A and B respectively, and end the sampling process; otherwise, return to step C.
  • The fusion subunit 3013 is specifically configured to perform:
  • Step F Count the number of candidate features for the fusion voiceprint feature vector of the verification voice from the first voiceprint feature, record it as M, and generate M random numbers in [0,1] as the initial state x(0) = [x_1(0), x_2(0), ..., x_M(0)];
  • Step G Each time the number of transitions t increases by 1, for each variable x_i(t), i ∈ {1, 2, ..., M}, perform the calculation according to the conditional probability distribution formula obtained from the joint probability distribution, where the mean of the joint probability distribution is X; judge whether t < T, and if so return to step G, otherwise obtain P(T) = [P(x_1(T)), P(x_2(T)), ..., P(x_M(T))];
  • Step H According to the number A of first voiceprint sub-features to be selected into the fusion voiceprint feature vector of the verification voice, calculated in step D, select the A first voiceprint sub-features with the largest probabilities P(x_i(T)) as the first voiceprint sub-features of the fusion voiceprint feature vector of the verification voice;
  • Step J Count the number of candidate features for the fusion voiceprint feature vector of the verification voice from the second voiceprint feature, record it as N, and generate N random numbers in [0,1] as the initial state y(0) = [y_1(0), y_2(0), ..., y_N(0)];
  • Step K Each time the number of transitions t increases by 1, for each variable y_j(t), j ∈ {1, 2, ..., N}, perform the calculation according to the conditional probability distribution formula obtained from the joint probability distribution, where the mean of the joint probability distribution is Y;
  • Step L According to the number B of second voiceprint sub-features to be selected into the fusion voiceprint feature vector of the verification voice, calculated in step D, select the B second voiceprint sub-features with the largest probabilities P(y_j(T)) as the second voiceprint sub-features of the fusion voiceprint feature vector of the verification voice.
  • In addition, an embodiment of the present application also provides a voiceprint recognition device, including a processor, a memory, and a voiceprint recognition program stored in the memory and executable by the processor; when the voiceprint recognition program is executed by the processor, the steps of the voiceprint recognition method of the above embodiments are implemented.
  • The embodiments of the present application also provide a computer-readable storage medium on which a voiceprint recognition program is stored; when the voiceprint recognition program is executed by a processor, the steps of the voiceprint recognition method of the foregoing embodiments are implemented.
  • the storage medium may be a volatile storage medium, and the storage medium may also be a non-volatile storage medium.
  • The methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform; of course, they can also be implemented by hardware, but in many cases the former is the better implementation.
  • The technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM) and includes several instructions for causing a terminal (which may be a mobile phone, a computer, a server, or a network device, etc.) to execute the method described in each embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voiceprint recognition method, apparatus and device, and a computer-readable storage medium are provided, the voiceprint recognition method including: obtaining the verification voice to be recognized (S10); using a GMM-UBM model to extract a first voiceprint feature of the verification voice, and using a neural network model to extract a second voiceprint feature of the verification voice (S20); fusing the first voiceprint feature and the second voiceprint feature of the verification voice to obtain a fusion voiceprint feature vector of the verification voice (S30); calculating the similarity between the fusion voiceprint feature vector and the voiceprint feature vector of each registered user in a preset registered voiceprint database (S40); and, based on the similarity, determining a voiceprint recognition result of the verification voice (S50). The two models each extract features from the verification voice, and both are used to perform voice verification. Compared with extracting features from the verification voice and performing voice verification with a single model, the information contained in the features extracted by the two models is more comprehensive, which improves the accuracy of voiceprint recognition.
PCT/CN2019/118656 2019-03-12 2019-11-15 Method, apparatus and device for voiceprint recognition, and computer-readable storage medium WO2020181824A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910182453.3A CN110047490A (zh) 2019-03-12 2019-03-12 声纹识别方法、装置、设备以及计算机可读存储介质
CN201910182453.3 2019-03-12

Publications (1)

Publication Number Publication Date
WO2020181824A1 (fr)

Family

ID=67274752

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118656 WO2020181824A1 (fr) 2019-03-12 2019-11-15 Method, apparatus and device for voiceprint recognition, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN110047490A (fr)
WO (1) WO2020181824A1 (fr)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110047490A (zh) * 2019-03-12 2019-07-23 平安科技(深圳)有限公司 声纹识别方法、装置、设备以及计算机可读存储介质
CN110517698B (zh) * 2019-09-05 2022-02-01 科大讯飞股份有限公司 一种声纹模型的确定方法、装置、设备及存储介质
CN110556126B (zh) * 2019-09-16 2024-01-05 平安科技(深圳)有限公司 语音识别方法、装置以及计算机设备
CN112687274A (zh) * 2019-10-17 2021-04-20 北京猎户星空科技有限公司 一种语音信息的处理方法、装置、设备及介质
CN110880321B (zh) * 2019-10-18 2024-05-10 平安科技(深圳)有限公司 基于语音的智能刹车方法、装置、设备及存储介质
CN110838294B (zh) * 2019-11-11 2022-03-04 效生软件科技(上海)有限公司 一种语音验证方法、装置、计算机设备及存储介质
CN111370003B (zh) * 2020-02-27 2023-05-30 杭州雄迈集成电路技术股份有限公司 一种基于孪生神经网络的声纹比对方法
CN112185344A (zh) * 2020-09-27 2021-01-05 北京捷通华声科技股份有限公司 语音交互方法、装置、计算机可读存储介质和处理器
CN112614493B (zh) * 2020-12-04 2022-11-11 珠海格力电器股份有限公司 声纹识别方法、系统、存储介质及电子设备
CN112382300A (zh) * 2020-12-14 2021-02-19 北京远鉴信息技术有限公司 声纹鉴定方法、模型训练方法、装置、设备及存储介质
CN115310066A (zh) * 2021-05-07 2022-11-08 华为技术有限公司 一种升级方法、装置及电子设备
CN115022087B (zh) * 2022-07-20 2024-02-27 中国工商银行股份有限公司 一种语音识别验证处理方法及装置
CN115019804B (zh) * 2022-08-03 2022-11-01 北京惠朗时代科技有限公司 一种多员工密集签到的多重校验式声纹识别方法及系统
CN115831152B (zh) * 2022-11-28 2023-07-04 国网山东省电力公司应急管理中心 一种用于实时监测应急装备发电机运行状态的声音监测装置及方法
CN116386647B (zh) * 2023-05-26 2023-08-22 北京瑞莱智慧科技有限公司 音频验证方法、相关装置、存储介质及程序产品

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080010065A1 (en) * 2006-06-05 2008-01-10 Harry Bratt Method and apparatus for speaker recognition
CN103745002A (zh) * 2014-01-24 2014-04-23 中国科学院信息工程研究所 一种基于行为特征与内容特征融合的水军识别方法及系统
CN104900235A (zh) * 2015-05-25 2015-09-09 重庆大学 基于基音周期混合特征参数的声纹识别方法
CN105575394A (zh) * 2016-01-04 2016-05-11 北京时代瑞朗科技有限公司 基于全局变化空间及深度学习混合建模的声纹识别方法
CN106887225A (zh) * 2017-03-21 2017-06-23 百度在线网络技术(北京)有限公司 基于卷积神经网络的声学特征提取方法、装置和终端设备
US10008209B1 (en) * 2015-09-25 2018-06-26 Educational Testing Service Computer-implemented systems and methods for speaker recognition using a neural network
CN108417217A (zh) * 2018-01-11 2018-08-17 苏州思必驰信息科技有限公司 说话人识别网络模型训练方法、说话人识别方法及系统
CN109102812A (zh) * 2017-06-21 2018-12-28 北京搜狗科技发展有限公司 一种声纹识别方法、系统及电子设备
CN109147797A (zh) * 2018-10-18 2019-01-04 平安科技(深圳)有限公司 基于声纹识别的客服方法、装置、计算机设备及存储介质
CN110047490A (zh) * 2019-03-12 2019-07-23 平安科技(深圳)有限公司 声纹识别方法、装置、设备以及计算机可读存储介质

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1308911C (zh) * 2003-07-10 2007-04-04 上海优浪信息科技有限公司 一种说话者身份识别方法和系统
CN103440873B (zh) * 2013-08-27 2015-10-28 大连理工大学 一种基于相似性的音乐推荐方法
CN104835498B (zh) * 2015-05-25 2018-12-18 重庆大学 基于多类型组合特征参数的声纹识别方法
CN106710589B (zh) * 2016-12-28 2019-07-30 百度在线网络技术(北京)有限公司 基于人工智能的语音特征提取方法及装置
CN106847309A (zh) * 2017-01-09 2017-06-13 华南理工大学 一种语音情感识别方法
CN109767790A (zh) * 2019-02-28 2019-05-17 中国传媒大学 一种语音情感识别方法及系统

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080010065A1 (en) * 2006-06-05 2008-01-10 Harry Bratt Method and apparatus for speaker recognition
CN103745002A (zh) * 2014-01-24 2014-04-23 中国科学院信息工程研究所 一种基于行为特征与内容特征融合的水军识别方法及系统
CN104900235A (zh) * 2015-05-25 2015-09-09 重庆大学 基于基音周期混合特征参数的声纹识别方法
US10008209B1 (en) * 2015-09-25 2018-06-26 Educational Testing Service Computer-implemented systems and methods for speaker recognition using a neural network
CN105575394A (zh) * 2016-01-04 2016-05-11 北京时代瑞朗科技有限公司 基于全局变化空间及深度学习混合建模的声纹识别方法
CN106887225A (zh) * 2017-03-21 2017-06-23 百度在线网络技术(北京)有限公司 基于卷积神经网络的声学特征提取方法、装置和终端设备
CN109102812A (zh) * 2017-06-21 2018-12-28 北京搜狗科技发展有限公司 一种声纹识别方法、系统及电子设备
CN108417217A (zh) * 2018-01-11 2018-08-17 苏州思必驰信息科技有限公司 说话人识别网络模型训练方法、说话人识别方法及系统
CN109147797A (zh) * 2018-10-18 2019-01-04 平安科技(深圳)有限公司 基于声纹识别的客服方法、装置、计算机设备及存储介质
CN110047490A (zh) * 2019-03-12 2019-07-23 平安科技(深圳)有限公司 声纹识别方法、装置、设备以及计算机可读存储介质

Also Published As

Publication number Publication date
CN110047490A (zh) 2019-07-23

Similar Documents

Publication Publication Date Title
WO2020181824A1 (fr) Method, apparatus and device for voiceprint recognition, and computer-readable storage medium
CN106486131B (zh) 一种语音去噪的方法及装置
WO2021139425A1 (fr) Procédé, appareil et dispositif de détection d'activité vocale, et support d'enregistrement
US9940935B2 (en) Method and device for voiceprint recognition
US8160877B1 (en) Hierarchical real-time speaker recognition for biometric VoIP verification and targeting
EP2763134B1 (fr) Procédé et dispositif de reconnaissance de la parole
WO2018166187A1 (fr) Serveur, procédé et système de vérification d'identité, et support d'informations lisible par ordinateur
WO2019019256A1 (fr) Appareil électronique, procédé et système de vérification d'identité et support de stockage lisible par ordinateur
US20120143608A1 (en) Audio signal source verification system
WO2018223727A1 (fr) Procédé, appareil et dispositif de reconnaissance d'empreinte vocale, et support
WO2014114049A1 (fr) Procédé et dispositif de reconnaissance vocale
US6990446B1 (en) Method and apparatus using spectral addition for speaker recognition
WO2021051608A1 (fr) Procédé et dispositif de reconnaissance d'empreinte vocale utilisant un apprentissage profond et appareil
WO2014114116A1 (fr) Procédé et système de reconnaissance d'empreinte vocale
EP3989217B1 (fr) Procédé pour détecter une attaque audio adverse par rapport à une entrée vocale traitée par un système de reconnaissance vocale automatique, dispositif correspondant, produit programme informatique et support lisible par ordinateur
WO2021042537A1 (fr) Procédé et système d'authentification de reconnaissance vocale
CN113823293B (zh) 一种基于语音增强的说话人识别方法及系统
CN109256138A (zh) 身份验证方法、终端设备及计算机可读存储介质
WO2019232826A1 (fr) Procédé d'extraction de vecteur i, procédé et appareil d'identification de locuteur, dispositif, et support
CN108154371A (zh) 电子装置、身份验证的方法及存储介质
CN110570871A (zh) 一种基于TristouNet的声纹识别方法、装置及设备
CN113241059B (zh) 语音唤醒方法、装置、设备及存储介质
Nagakrishnan et al. Generic speech based person authentication system with genuine and spoofed utterances: different feature sets and models
CN111199742A (zh) 一种身份验证方法、装置及计算设备
CN114512133A (zh) 发声对象识别方法、装置、服务器及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19919458

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19919458

Country of ref document: EP

Kind code of ref document: A1