WO2021051608A1 - Voiceprint recognition method, device and equipment based on deep learning (一种基于深度学习的声纹识别方法、装置及设备) - Google Patents

Info

Publication number: WO2021051608A1
Application number: PCT/CN2019/118402 (CN2019118402W)
Authority: WO (WIPO PCT)
Prior art keywords: voice, training, speech, features, DNN
Other languages: English (en), French (fr)
Inventors: 王健宗, 赵峰
Original Assignee: 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Application filed by 平安科技(深圳)有限公司
Publication of WO2021051608A1


Classifications

    • G - PHYSICS
      • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L17/00 - Speaker identification or verification techniques
            • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
            • G10L17/04 - Training, enrolment or model building
            • G10L17/18 - Artificial neural networks; Connectionist approaches
          • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
            • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
              • G10L25/24 - Speech or voice analysis techniques where the extracted parameters are the cepstrum
      • G06 - COMPUTING; CALCULATING OR COUNTING
        • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N3/00 - Computing arrangements based on biological models
            • G06N3/02 - Neural networks
              • G06N3/04 - Architecture, e.g. interconnection topology
                • G06N3/045 - Combinations of networks
              • G06N3/08 - Learning methods

Definitions

  • This application relates to the field of biometric recognition technology, and in particular to a method, device and equipment for voiceprint recognition based on deep learning.
  • Voiceprint recognition verifies a speaker's identity based on voice signals and registered speaker recordings. Generally, low-dimensional features rich in speaker information are extracted from the enrollment and test voices, and algorithmic operations map them to verification scores. Variants include text-dependent voiceprint recognition, where the voice content is fixed to a certain phrase, and text-independent voiceprint recognition, where the voice content is arbitrary.
  • At present, the main voiceprint recognition systems in the industry use the Gaussian mixture model and the i-vector model. These models abstract the digital information of the voiceprint into models designed by humans and then compare those models, which imposes certain limitations: the model must be constructed according to human expectations, but in many cases the expected model does not process the data well.
  • To solve the technical problem that current voiceprint recognition models have unsatisfactory recognition performance, the present application provides a method, device and equipment for voiceprint recognition based on deep learning.
  • In a first aspect, a method for voiceprint recognition based on deep learning includes: acquiring a target person's authentication voice, and using MFCC to perform feature extraction on the authentication voice to obtain authentication voice features;
  • inputting the authentication voice features into a neural network model for authentication processing, where the neural network model is obtained by training a DNN architecture on multi-person speech to derive a function that can authenticate speech and saving that function into the last layer of the DNN architecture;
  • adjusting the parameters of the function inside the neural network model according to the authentication processing result to obtain a target neural network model capable of recognizing the target person's voice; using MFCC to perform feature extraction on an acquired voice to be recognized to obtain the voice feature to be recognized;
  • inputting the voice feature to be recognized into the target neural network model for voice recognition processing, and determining whether the voice to be recognized belongs to the target person.
  • In a second aspect, a voiceprint recognition device based on deep learning is provided.
  • The device includes: an acquisition module for acquiring the target person's authentication voice and using MFCC to perform feature extraction on it to obtain the authentication voice features; an authentication module for inputting the authentication voice features into the neural network model for authentication processing, where the neural network model is obtained by training the DNN architecture on multi-person voice to derive a function that can authenticate speech and saving that function into the last layer of the DNN architecture; an adjustment module for adjusting the parameters of the function inside the neural network model according to the authentication processing result to obtain a target neural network model capable of recognizing the target person's voice; an extraction module for using MFCC to perform feature extraction on the acquired voice to be recognized to obtain the voice feature to be recognized; and a processing module for inputting the voice feature to be recognized into the target neural network model for voice recognition processing to determine whether the voice to be recognized belongs to the target person.
  • In a third aspect, a computer device is provided, including a memory and a processor, where the memory stores a computer program and the processor, when executing the computer program, implements the steps of the deep learning-based voiceprint recognition method of the first aspect.
  • In a fourth aspect, a computer storage medium is provided, having a computer program stored thereon; when the computer program is executed by a processor, it implements the steps of the deep learning-based voiceprint recognition method described in the first aspect.
  • The present application provides a deep learning-based voiceprint recognition method, device and equipment that uses the voices of multiple people as a training set to train the DNN architecture and obtain a neural network model capable of voiceprint recognition. The neural network model is used to authenticate the target person's voice, forming a function corresponding to the target person's voice inside the model; the authenticated target neural network model is then used to recognize a voice and determine whether the voice is the target person himself.
  • the speech recognition process of the target neural network model formed according to the characteristics of each person's voiceprint is relatively fast and accurate, so that the recognition efficiency is effectively improved.
  • FIG. 1 is a flowchart of an embodiment of a voiceprint recognition method based on deep learning of this application
  • Figure 2 is a diagram of the DNN network composition of the application
  • Figure 3 is a composition diagram of the DNN architecture of the application.
  • FIG. 4 is a structural block diagram of an embodiment of a voiceprint recognition device based on deep learning of this application
  • FIG. 5 is a schematic diagram of the structure of the computer equipment of this application.
  • The embodiment of the application provides a method for voiceprint recognition based on deep learning, which uses the voices of multiple people as a training set to train the DNN architecture and obtain a neural network model capable of voiceprint recognition. The neural network model authenticates the target person's voice, forming a function corresponding to the target person's voice inside the model; the authenticated neural network model is then used to recognize a voice and determine whether the voice is the target person himself.
  • an embodiment of the present application provides a deep learning-based voiceprint recognition method, which includes the following steps:
  • Step 101 Obtain the authentication voice of the target person, and use the MFCC to perform feature extraction on the authentication voice to obtain the authentication voice feature.
  • The authentication voice can be obtained in real time through a microphone, or a recording in the memory can be retrieved, or a part of such a recording can be intercepted, as the authentication voice.
  • MFCC refers to Mel Frequency Cepstral Coefficients.
  • Step 102 Input the authentication voice features into the neural network model for authentication processing, where the DNN architecture is trained on multi-person speech to obtain a function that can authenticate the voice, and the function is then saved to the last layer of the DNN architecture to obtain the neural network model.
  • The DNN (Deep Neural Network) architecture can learn from multiple voices, which helps improve the intelligence of the entire voiceprint recognition process.
  • Multi-person speech means multiple utterances by multiple people, each marked with a label identifying its speaker.
  • the output result of the DNN architecture is compared with the label to determine whether the output result is correct.
  • Step 103 Adjust parameters of functions inside the neural network model according to the authentication processing result to obtain a target neural network model capable of recognizing the voice of the target person.
  • After adjustment, the function embedded in the neural network model corresponds one-to-one to the target person's voice, so that during speech recognition the function helps the neural network model determine whether a voice belongs to the target person, increasing the model's recognition efficiency and recognition accuracy.
  • Step 104 Perform feature extraction on the acquired voice to be recognized by using the MFCC to obtain the voice feature to be recognized.
  • Step 105 Input the voice features to be recognized into the target neural network model for voice recognition processing, and determine whether the voice to be recognized belongs to the target person.
  • In this step, multiple voice features to be recognized are obtained after MFCC processing. The multiple voice features are arranged to form a feature vector matrix, which is fed into the input port of the target neural network model; the model processes the feature vector matrix and delivers the result at the output port.
  • For example, in an encryption application, the target neural network model formed by the above steps 101-103 can be embedded into the encrypted file. When the user wants to decrypt with voice, steps 104 and 105 are performed again; when the output of the target neural network model is "the target person himself", the decryption is determined to be successful and the corresponding function is activated.
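The decryption flow just described can be sketched as a simple gate. This is a hypothetical illustration: `model`, `unlock`, and the "the target person himself" return value are placeholder names, not identifiers from this application.

```python
def try_voice_decrypt(model, voice_features, unlock):
    """Run the embedded target neural network model on the MFCC features
    of a decryption attempt (steps 104-105) and activate the protected
    function only when the model judges the voice to be the target person."""
    if model(voice_features) == "the target person himself":
        return unlock()  # decryption successful: activate the function
    return None          # any other speaker: keep the file locked
```

The model itself stays inside the encrypted file; only its verdict gates the unlock call.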
  • In the above solution, the voices of multiple people form a training set used to train the DNN architecture, obtaining a neural network model capable of voiceprint recognition, and the neural network model authenticates the target person's voice. A function corresponding to the target person's voice is formed inside the neural network model, and a voice is then recognized by the authenticated target neural network model to determine whether it is the target person himself.
  • the speech recognition process of the target neural network model formed according to the characteristics of each person's voiceprint is relatively fast and accurate, so that the recognition efficiency is effectively improved.
  • Before step 102, the method further includes:
  • Step A Collect training voices of multiple speakers, and use MFCC to perform feature extraction on the training voices to obtain training voice features, where each segment of the training voice contains a label corresponding to the speaker.
  • The training voices are uttered by multiple people, to ensure that the trained neural network model can adapt to the timbres of various people and to guarantee its recognition performance.
  • each training speech needs to be processed by MFCC to ensure that each speech can be input into the DNN architecture.
  • Step B Use the training speech features to train the DNN architecture.
  • the training speech features can be input into the DNN architecture for training randomly or according to the first letter of the pronunciation.
  • During training, the output result is compared with the corresponding label. If the comparison succeeds, the output is proved correct; if it fails, the output is proved wrong, and the DNN architecture is adjusted according to the output results to ensure the correctness of its output.
  • Step C Perform statistics on the output data of the DNN architecture during the training process, and determine a function capable of recognizing speech according to the statistical results.
  • The data output by the DNN architecture is aggregated, statistics such as the output accuracy are computed, and a function capable of recognizing speech is derived from these data.
  • Step D Save the function in the last layer of the DNN architecture to obtain a neural network model capable of recognizing speech.
  • In the above solution, the DNN architecture is trained using multiple speeches from multiple people to obtain the corresponding neural network recognition model. This ensures the model's diversity, letting it recognize the timbres of different people such as men, women, old and young, while the function further confirms the speech to ensure recognition accuracy.
  • Step A specifically includes:
  • Step A1 Obtain N segments of speech of multiple people, divide each segment of speech into two parts to obtain 2N segments of training speech, and add tags corresponding to the speaker of the speech to each part.
  • Step A2 Use MFCC to perform feature extraction on 2N training speech to obtain 2N training speech features.
  • Step A3 Randomly select two training voice features from the 2N training voice features and combine them to obtain N voice feature groups.
  • Each segment is divided into two parts and then recombined into N speech feature groups, so that the two training speech features in each group may come from the same person or from different people. This allows the DNN architecture to be trained to recognize the voice characteristics of both the same person and different people, ensuring diverse training and thereby improving the training effect.
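Steps A1-A3 can be sketched as follows; the feature representation (any sequence stands in for an MFCC frame sequence) and the pairing strategy (shuffle, then pair consecutive segments) are illustrative assumptions:

```python
import random

def make_training_pairs(utterances, seed=0):
    """Split each (speaker, features) utterance into two halves, then
    randomly pair the resulting 2N segments into N labelled pairs."""
    rng = random.Random(seed)
    segments = []
    for speaker, feats in utterances:
        mid = len(feats) // 2
        # Each half keeps the speaker label of the original utterance.
        segments.append((speaker, feats[:mid]))
        segments.append((speaker, feats[mid:]))
    rng.shuffle(segments)
    # Pair consecutive segments: N utterances -> 2N segments -> N pairs.
    pairs = []
    for i in range(0, len(segments), 2):
        (spk_x, x), (spk_y, y) = segments[i], segments[i + 1]
        same = spk_x == spk_y  # ground-truth label for training
        pairs.append((x, y, same))
    return pairs
```

Because the pairing is random, a pair may hold two halves of one speaker's voice or halves from two different speakers, which is exactly the diversity the text calls for.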
  • Step B specifically includes:
  • Step B1 Construct two DNN networks and combine them into a DNN architecture.
  • Step B2 Input the two training voice features of each voice feature group into two DNN networks in the DNN architecture for processing.
  • Step B3 Integrate the output results of the two DNN networks and output the integration result, where the integration result includes whether the two training speech features belong to the same speaker.
  • Step B4 Calculate the loss function according to the difference between the integration result and the labels corresponding to the two input training speech features, and adjust the parameters of the DNN architecture according to the loss function.
  • The DNN architecture includes two DNN networks that process the two training speech features of a speech feature group. The outputs of the two DNN networks are compared to determine whether the two training speech features come from the same person; whether this judgment is correct is then determined from the labels of the two features. The loss function is calculated from the difference between the output result and the labels, the parameters of the DNN architecture are adjusted according to the loss function, and the next speech feature group is used for training; this process repeats until all speech feature groups have been trained.
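A minimal sketch of this training loop, under stated assumptions: the two branches share a single linear embedding `W` (the text only says two DNN networks are combined, so weight tying is an assumption), the integration step is a dot-product score with bias `b`, and parameters are adjusted by gradient descent on a cross-entropy loss.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_pairs(pairs, dim, lr=0.1, seed=0):
    """Toy stand-in for the two-branch training loop: embed both features
    with a shared linear map W, score them by dot product plus bias b,
    squash to Pr(same speaker), and adjust W and b by gradient descent."""
    rng = random.Random(seed)
    W = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
    b = 0.0
    for x, y, same in pairs:
        ex = [sum(W[i][j] * x[j] for j in range(dim)) for i in range(dim)]
        ey = [sum(W[i][j] * y[j] for j in range(dim)) for i in range(dim)]
        logit = sum(a * c for a, c in zip(ex, ey)) + b
        pr = sigmoid(logit)
        target = 1.0 if same else 0.0
        grad_logit = pr - target  # d(cross-entropy)/d(logit)
        b -= lr * grad_logit
        # d(logit)/dW[i][j] = ex[i]*y[j] + ey[i]*x[j]  (product rule)
        for i in range(dim):
            for j in range(dim):
                W[i][j] -= lr * grad_logit * (ex[i] * y[j] + ey[i] * x[j])
    return W, b
```

Each pass over a pair corresponds to one compare-and-adjust cycle in steps B2-B4; repeating over all pairs corresponds to training on every speech feature group.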
  • Step C specifically includes:
  • Step C1 Calculate the offset distance L(x, y) of the two training voice features of each voice feature group in the N voice feature groups, where x and y respectively represent the two training voice features.
  • The voiceprint features of the two training voice features differ by a certain offset distance, which can be expressed by the following formula:
  • S represents the vector matrix output from the DNN network after the training speech features are converted into feature vectors
  • b represents the set constant value, which can be adjusted according to the actual situation.
  • Step C2 According to the offset distance, calculate the probability value Pr(x, y) that the two training speech features of each of the N speech feature groups belong to the same speaker.
  • Step C3 Count the speech feature groups for which the DNN architecture correctly output "same speaker" during training to form a set P_same.
  • Step C4 Count the speech feature groups for which the DNN architecture correctly output "different speakers" during training to form a set P_diff.
  • Step C5 Calculate the function E capable of recognizing speech:
  • K is the set weight value.
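The concrete formulas for L(x, y), Pr(x, y) and E are not reproduced in the text above. As an illustration only, comparable end-to-end speaker-verification objectives use a quadratic score built from the matrix S and constant b described above, a sigmoid for Pr(x, y), and a cross-entropy sum over P_same and P_diff weighted by K; the sketch below follows that convention and should not be read as this application's exact formulas.

```python
import math

def score(x, y, S, b):
    """Hypothetical offset distance L(x, y), following the common
    end-to-end form L = x.y - x'Sx - y'Sy + b (an assumption here)."""
    dot = lambda u, v: sum(a * c for a, c in zip(u, v))
    quad = lambda u: sum(S[i][j] * u[i] * u[j]
                         for i in range(len(u)) for j in range(len(u)))
    return dot(x, y) - quad(x) - quad(y) + b

def pr_same(x, y, S, b):
    """Pr(x, y): probability that the two features share a speaker."""
    return 1.0 / (1.0 + math.exp(-score(x, y, S, b)))

def objective(p_same, p_diff, S, b, K=1.0):
    """Objective E over the correctly-labelled pair sets, with K
    weighting the different-speaker term (the set weight value above)."""
    e = -sum(math.log(pr_same(x, y, S, b)) for x, y in p_same)
    e -= K * sum(math.log(1.0 - pr_same(x, y, S, b)) for x, y in p_diff)
    return e
```

Minimizing E pushes Pr toward 1 on P_same pairs and toward 0 on P_diff pairs, which is what makes the resulting function able to recognize speech.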
  • Step B1 specifically includes:
  • Step B11 Set M hidden layers for the DNN network for processing the input training voice features.
  • Step B12 Set a pooling layer after each of the first M-1 hidden layers to aggregate the processing results output by the hidden layer, calculate the mean deviation and standard deviation, and send the integrated outputs of all pooling layers to the last hidden layer.
  • For example, the average of the processing results is computed; the arithmetic mean of the absolute deviations of the processing results from that average is taken as the mean deviation, and the square root of the mean of the squared deviations is taken as the standard deviation. These statistics are integrated and sent to the last hidden layer, whose neurons determine which person's voice the corresponding voice feature belongs to and output the representative mark corresponding to that person.
  • Step B13 Set a linear output layer in front of the output port of the DNN network, and the last hidden layer sends the integration result to the linear output layer and outputs it from the output port.
  • The linear output layer processes the representative mark output by the last hidden layer, converts it into a corresponding representative symbol (i.e., a label), and then outputs that symbol.
  • Step B14 Combine the set linear output layers of the two DNN networks to obtain the DNN architecture.
  • The output results of the linear output layers of the two DNN networks are compared: if they are the same, the voices belong to the same person; if different, to different people. The architecture then outputs whether the two training speeches belong to the same person along with the representative symbols of their speakers, and each representative symbol is compared with the corresponding tag; if they match, the recognition is proved correct, otherwise it is proved wrong.
  • the constructed DNN network can perform speech recognition more accurately after training, and the recognition efficiency and accuracy can be effectively improved.
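One branch of such a network can be sketched as below. This simplification pools only once, after the last nonlinear layer, whereas steps B11-B12 describe a pooling layer after each of the first M-1 hidden layers; the tanh nonlinearity and the absence of bias terms are likewise assumptions.

```python
import numpy as np

def stats_pooling(frames):
    """Temporal pooling: collapse frame-level activations of shape (T, d)
    into one utterance-level vector of per-dimension mean and standard
    deviation (reading the text's statistics as mean/std)."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def forward_branch(features, hidden_ws, out_w):
    """One DNN branch: hidden layers with tanh, statistics pooling,
    then a linear output layer."""
    h = features               # shape (T, d): T frames, d feature dims
    for w in hidden_ws:
        h = np.tanh(h @ w)     # hidden layer
    pooled = stats_pooling(h)  # shape (2 * d_hidden,)
    return pooled @ out_w      # linear output layer
```

Running two such branches and comparing their outputs mirrors how the combined architecture decides whether two utterances share a speaker.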
  • Step 101 specifically includes:
  • Step 1011 Perform pre-emphasis processing on the authentication voice using a high-pass filter.
  • Step 1012 Perform framing processing on the authentication voice after pre-emphasis processing.
  • Step 1013 Multiply each frame of the authentication voice by the Hamming window to perform windowing processing to obtain a windowed authentication voice frame.
  • Step 1014 Perform fast Fourier transform on the windowed authentication speech frame to obtain the corresponding energy spectrum.
  • Step 1015 Pass the energy spectrum through a triangular bandpass filter to smooth the energy spectrum and eliminate the effect of harmonics of the energy spectrum.
  • Step 1016 Calculate the logarithmic energy of the output result of the triangular bandpass filter, and perform the discrete cosine transform to obtain the MFCC feature.
  • Step 1017 Perform normalization processing on the MFCC features, filter out non-speech frames using a voice activity detection tool, and obtain authenticated voice features.
  • the MFCC is used to preprocess the speech to obtain the speech features that can be input to the neural network model.
  • A set of band-pass filters is arranged in the frequency band from low to high according to the critical bandwidth, dense at low frequencies and sparse at high frequencies, and the input signal is filtered through them. The signal energy output by each band-pass filter is taken as a basic feature of the signal, and after further processing this feature can serve as the input feature of the voice. Because this feature does not depend on the nature of the signal, it makes no assumptions or restrictions on the input signal, and it draws on the research results of auditory models. The parameter therefore has better robustness, conforms better to the auditory characteristics of the human ear, and retains good recognition performance when the signal-to-noise ratio drops.
  • The MFCC feature extraction process must be performed in accordance with the above steps 1011-1017 for each voice to be input.
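Steps 1011-1016 can be sketched with NumPy as below; all parameter values (16 kHz sample rate, 400-sample frames with 160-sample hop, 26 mel filters, 13 coefficients) are conventional defaults, not values taken from this application, and step 1017 is omitted since it relies on an external voice activity detection tool.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, frame_len=400, hop=160,
         n_mels=26, n_ceps=13, preemph=0.97):
    """Sketch of MFCC extraction following steps 1011-1016."""
    # 1011: pre-emphasis with a first-order high-pass filter.
    sig = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    # 1012: framing with overlap.
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx]
    # 1013: multiply each frame by a Hamming window.
    frames = frames * np.hamming(frame_len)
    # 1014: fast Fourier transform -> energy (power) spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 1015: triangular band-pass (mel) filters, dense at low frequencies
    # and sparse at high frequencies.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    # 1016: logarithmic energy, then discrete cosine transform (DCT-II).
    log_energy = np.log(power @ fbank.T + 1e-10)
    n = log_energy.shape[1]
    basis = np.cos(np.pi * np.arange(n_ceps)[:, None]
                   * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    return log_energy @ basis.T  # shape: (n_frames, n_ceps)
```

A per-feature mean and variance normalization, followed by dropping frames that a VAD marks as non-speech, would complete step 1017.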
  • In the above solution, the DNN architecture is trained on a training set composed of multiple people's voices to obtain a neural network model capable of voiceprint recognition, and the neural network model authenticates the target person's voice. A function corresponding to the target person's voice is formed in the neural network model, and the authenticated target neural network model then recognizes a voice to determine whether it is the target person himself.
  • the speech recognition process of the target neural network model formed according to the characteristics of each person's voiceprint is relatively fast and accurate, so that the recognition efficiency is effectively improved.
  • a deep learning-based voiceprint recognition method includes the following steps:
  • MFCC: Mel Frequency Cepstral Coefficients.
  • The training set is divided into frames: each word in the training set has L sampling points, and the L sampling points are collected into an observation unit called a frame. Adjacent frames overlap; the overlapping area contains H sampling points, where the value of H is usually about 1/2 or 1/3 of L.
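The framing rule can be sketched directly: frames of L points, with adjacent frames sharing an overlap of H points, so each frame start advances by L - H samples.

```python
def frame_signal(samples, L, H):
    """Split a sample sequence into frames of L points where adjacent
    frames overlap by H points (hop = L - H). With H near L/2 or L/3,
    as the text suggests, every boundary is covered by two frames."""
    step = L - H
    frames = []
    for start in range(0, len(samples) - L + 1, step):
        frames.append(samples[start:start + L])
    return frames
```

For example, 10 samples with L = 4 and H = 2 yield frames starting at 0, 2, 4 and 6.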
  • The DNN network architecture shown in Figure 2 has hidden layers (NIN Layer), a pooling layer (Temporal Pooling) and a linear output layer (Linear Layer); two such DNN network architectures are combined together as shown in Figure 3.
  • each speaker has multiple speech segments, and each speech segment corresponds to one MFCC feature.
  • Two MFCC features of the same speaker form a feature pair.
  • N feature pairs covering different speakers, that is, N segments of feature pairs, are selected to form the training features.
  • Two arbitrary features X and Y (X and Y can belong to the same speaker or to different speakers) are input into the hidden layers of the two DNN network architectures in FIG. 3 for processing.
  • The processing results are output to the pooling layer, which aggregates the hidden layer's outputs and calculates their mean and standard deviation; these data are integrated and sent to the final hidden layer.
  • The final hidden layer feeds its output into the linear output layer for linear output; the loss function is determined according to the output result, and the neural network is adjusted according to the loss function to complete one round of training. The above process is repeated until all the data in the training set has been trained on, yielding a DNN architecture that can classify speech.
  • Pr(x, y) is the probability of the same speaker, where x and y refer to the feature vectors of the speech of the two speakers.
  • the voice features to be processed are input into the target neural network model for recognition processing, and it is determined whether the voice belongs to the user himself. If it is, start the corresponding function accordingly.
  • As shown in FIG. 4, an embodiment of the present application provides a voiceprint recognition device based on deep learning.
  • The device includes, connected in sequence: an acquisition module 41, an authentication module 42, an adjustment module 43, an extraction module 44, and a processing module 45.
  • the acquiring module 41 is used to acquire the certified voice of the target person, and use MFCC to perform feature extraction on the certified voice to obtain the certified voice feature;
  • the authentication module 42 is used to input the authentication speech features into the neural network model for authentication processing, where the DNN architecture is trained by multi-person speech to obtain a function that can authenticate the speech, and then the function is saved to the last layer of the DNN architecture to obtain Neural network model;
  • the adjustment module 43 is configured to adjust the parameters of the functions inside the neural network model according to the authentication processing result to obtain a target neural network model capable of recognizing the voice of the target person;
  • the extraction module 44 is configured to use the MFCC to perform feature extraction on the acquired voice to be recognized to obtain the voice feature to be recognized;
  • the processing module 45 is configured to input the voice features to be recognized into the target neural network model for voice recognition processing, and determine whether the voice to be recognized belongs to the target person.
  • The device further includes: a collection module for collecting training voices of multiple speakers and using MFCC to perform feature extraction on them to obtain training voice features, where each training voice includes a label corresponding to its speaker; a training module for training the DNN architecture using the training voice features; a calculation module for performing statistics on the output data of the DNN architecture during training and determining, from the statistical results, a function that can recognize the voice; and a saving module for saving the function to the last layer of the DNN architecture to obtain a neural network model capable of recognizing speech.
  • The collection module specifically includes: a dividing unit for acquiring N voices of multiple people, dividing each voice into two parts to obtain 2N training voices, and adding to each part a label corresponding to the speaker of the voice; an extraction unit for using MFCC to perform feature extraction on the 2N training voices to obtain 2N training voice features; and a combination unit for arbitrarily selecting two of the 2N training voice features and combining them to obtain N voice feature groups.
  • The training module specifically includes: a construction unit for constructing two DNN networks and combining them into a DNN architecture; an input unit for inputting the two training speech features of each speech feature group into the two DNN networks of the DNN architecture for processing; an integration unit for integrating the output results of the two DNN networks and outputting the integration result, where the integration result includes whether the two training speech features belong to the same speaker; and an adjustment training unit for calculating the loss function from the difference between the integration result and the labels of the two input training speech features, and adjusting the parameters of the DNN architecture according to the loss function.
  • The calculation module specifically includes: an offset distance calculation unit for calculating the offset distance L(x, y) of the two training voice features of each of the N voice feature groups, where x and y respectively represent the two training voice features; a probability value calculation unit for calculating, from the offset distance, the probability value Pr(x, y) that the two training speech features of each of the N speech feature groups belong to the same speaker; a statistical unit for counting the speech feature groups for which the DNN architecture correctly output "same speaker" during training to form a set P_same, and those for which it correctly output "different speakers" to form a set P_diff; and a calculation unit for calculating the function E capable of recognizing speech, where K is the set weight value.
  • The construction unit specifically includes a setting unit for: setting M hidden layers for the DNN network to process the input training voice features; setting a pooling layer after each of the first M-1 hidden layers to aggregate the processing results output by the hidden layer, calculate the mean deviation and standard deviation, and send the integrated outputs of all pooling layers to the last hidden layer; setting a linear output layer in front of the output port of the DNN network, where the last hidden layer sends the integration result to the linear output layer, which outputs it from the output port; and combining the linear output layers of the two configured DNN networks to obtain the DNN architecture.
  • the acquisition module 41 specifically includes: an emphasis unit, configured to pre-emphasize the authentication speech with a high-pass filter; and a framing unit, configured to divide the pre-emphasized speech into frames;
  • a windowing unit for multiplying each frame of the authentication speech by a Hamming window to obtain windowed authentication speech frames;
  • a transform unit for applying a fast Fourier transform to the windowed authentication speech frames to obtain the corresponding energy spectrum;
  • a filtering unit for passing the energy spectrum through triangular band-pass filters to smooth the energy spectrum and eliminate the effect of its harmonics;
  • a log-conversion unit for computing the log energy of the triangular band-pass filter outputs and applying a discrete cosine transform to obtain the MFCC features;
  • a normalization unit for normalizing the MFCC features and filtering out non-speech frames with a voice activity detection tool to obtain the authentication speech features.
  • an embodiment of the present application also provides a computer device, as shown in FIG. 5, including a memory 52 and a processor 51, both arranged on a bus 53, the memory 52 storing a computer program;
  • when the processor 51 executes the computer program, the deep-learning-based voiceprint recognition method shown in FIG. 1 is implemented.
  • the technical solution of this application can be embodied as a software product, which can be stored in a non-volatile memory (a CD-ROM, USB flash drive, removable hard disk, etc.) and includes several instructions that enable a computer device (a personal computer, a server, a network device, etc.) to execute the methods described in each implementation scenario of this application.
  • the device can also be connected to a user interface, a network interface, a camera, a radio frequency (RF) circuit, a sensor, an audio circuit, a WI-FI module, and so on.
  • the user interface may include a display screen (Display), an input unit such as a keyboard (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, and the like.
  • the network interface can optionally include a standard wired interface, a wireless interface (such as a Bluetooth interface, a WI-FI interface), and the like.
  • the structure of the computer device does not constitute a limitation on the physical device, which may include more or fewer components, combine certain components, or arrange components differently.
  • an embodiment of the present application also provides a storage medium on which a computer program is stored;
  • when the program is executed by a processor, the deep-learning-based voiceprint recognition method shown in FIG. 1 above is implemented.
  • the storage medium may also include an operating system and a network communication module.
  • the operating system is a program that manages the hardware and software resources of computer equipment, and supports the operation of information processing programs and other software and/or programs.
  • the network communication module is used to realize the communication between the various components in the storage medium and the communication with other hardware and software in the computer equipment.
  • a training set formed from the speech of multiple people is used to train the DNN architecture, yielding a neural network model capable of voiceprint recognition, and the neural network model is used to authenticate the target person's speech;
  • the neural network model forms a function corresponding to the target person's voice, and the authenticated target neural network model is then used to recognize speech and determine whether it was produced by the target person;
  • the speech recognition process of a target neural network model formed from each person's voiceprint characteristics is fast and accurate, so recognition efficiency is effectively improved.


Abstract

A deep-learning-based voiceprint recognition method, apparatus and device, the method comprising: acquiring an authentication speech of a target person and performing MFCC feature extraction on the authentication speech to obtain authentication speech features (101); inputting the authentication speech features into a neural network model for authentication processing (102); adjusting the parameters of the function inside the neural network model according to the authentication result to obtain a target neural network model capable of recognizing the target person's speech (103); performing MFCC feature extraction on acquired speech to be recognized to obtain speech features to be recognized (104); and inputting the speech features to be recognized into the target neural network model for speech recognition processing to determine whether the speech to be recognized belongs to the target person (105). The trained neural network model recognizes speech and determines whether it was produced by the target person; the recognition process is fast and accurate, so recognition efficiency is effectively improved.

Description

A deep-learning-based voiceprint recognition method, apparatus and device

Technical Field
The present application relates to the field of biometric identification, and in particular to a deep-learning-based voiceprint recognition method, apparatus and device.
Background
Voiceprint recognition verifies a speaker's identity from speech signals and enrolled speaker recordings. Typically, low-dimensional features rich in speaker information are extracted from the enrollment and test speech and mapped to a verification score by some algorithmic operation. Variants include text-dependent voiceprint recognition, in which the speech content is fixed to a certain phrase, and text-independent voiceprint recognition, in which the speech content is arbitrary.
The main voiceprint recognition systems currently used in industry rely on Gaussian mixture models and the i-vector model. These approaches abstract the digital voiceprint information into a preconceived model and then compare models, which is inherently limited: the model must be constructed according to human assumptions, and in many cases the preconceived model does not perform well.
Summary of the Invention
In view of this, the present application provides a deep-learning-based voiceprint recognition method, apparatus and device, the main purpose of which is to solve the technical problem that current voiceprint recognition models give unsatisfactory recognition results.
According to a first aspect of the present application, a deep-learning-based voiceprint recognition method is provided, the method comprising: acquiring an authentication speech of a target person and performing MFCC feature extraction on the authentication speech to obtain authentication speech features; inputting the authentication speech features into a neural network model for authentication processing, wherein a DNN architecture is trained on multi-person speech to obtain a function capable of authenticating speech, and the function is saved into the last layer of the DNN architecture to obtain the neural network model; adjusting the parameters of the function inside the neural network model according to the authentication result to obtain a target neural network model capable of recognizing the target person's speech; performing MFCC feature extraction on acquired speech to be recognized to obtain speech features to be recognized; and inputting the speech features to be recognized into the target neural network model for speech recognition processing to determine whether the speech to be recognized belongs to the target person.
According to a second aspect of the present application, a deep-learning-based voiceprint recognition apparatus is provided, the apparatus comprising: an acquisition module for acquiring an authentication speech of a target person and performing MFCC feature extraction on it to obtain authentication speech features; an authentication module for inputting the authentication speech features into a neural network model for authentication processing, wherein a DNN architecture is trained on multi-person speech to obtain a function capable of authenticating speech, and the function is saved into the last layer of the DNN architecture to obtain the neural network model; an adjustment module for adjusting the parameters of the function inside the neural network model according to the authentication result to obtain a target neural network model capable of recognizing the target person's speech; an extraction module for performing MFCC feature extraction on acquired speech to be recognized to obtain speech features to be recognized; and a processing module for inputting the speech features to be recognized into the target neural network model for speech recognition processing to determine whether the speech to be recognized belongs to the target person.
According to a third aspect of the present application, a computer device is provided, including a memory and a processor, the memory storing a computer program, the processor implementing the steps of the deep-learning-based voiceprint recognition method of the first aspect when executing the computer program.
According to a fourth aspect of the present application, a computer storage medium is provided, on which a computer program is stored, the computer program implementing the steps of the deep-learning-based voiceprint recognition of the first aspect when executed by a processor.
By means of the above technical solutions, in the deep-learning-based voiceprint recognition method, apparatus and device provided by the present application, the speech of multiple people forms a training set used to train the DNN architecture, yielding a neural network model capable of voiceprint recognition; the neural network model authenticates the target person's speech, forming inside the model a function corresponding to that speech, and the authenticated target neural network model then recognizes speech and determines whether it was produced by the target person. The recognition process of a target neural network model formed from each person's voiceprint characteristics is fast and accurate, so recognition efficiency is effectively improved.
The above description is only an overview of the technical solution of the present application. In order to understand the technical means of the present application more clearly so that it may be implemented according to the contents of the specification, and to make the above and other objects, features and advantages of the present application more apparent, specific embodiments of the present application are set forth below.
Brief Description of the Drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the present application. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
FIG. 1 is a flowchart of an embodiment of the deep-learning-based voiceprint recognition method of the present application;
FIG. 2 is a composition diagram of the DNN network of the present application;
FIG. 3 is a composition diagram of the DNN architecture of the present application;
FIG. 4 is a structural block diagram of an embodiment of the deep-learning-based voiceprint recognition apparatus of the present application;
FIG. 5 is a schematic structural diagram of the computer device of the present application.
Detailed Description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here; rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope conveyed completely to those skilled in the art.
An embodiment of the present application provides a deep-learning-based voiceprint recognition method: the speech of multiple people forms a training set used to train a DNN architecture, yielding a neural network model capable of voiceprint recognition; the neural network model authenticates the target person's speech, forming inside the model a function corresponding to that speech; the authenticated neural network model then recognizes speech and determines whether it was produced by the target person.
As shown in FIG. 1, an embodiment of the present application provides a deep-learning-based voiceprint recognition method, including the following steps:
Step 101: acquire an authentication speech of the target person and perform MFCC feature extraction on it to obtain authentication speech features.
In this step, the authentication speech may be captured in real time through a microphone, or a recording (or an excerpt of one) may be retrieved from storage and used as the authentication speech. MFCC (Mel Frequency Cepstral Coefficients) are cepstral parameters extracted in the Mel-scale frequency domain and are used for speech feature extraction.
Step 102: input the authentication speech features into a neural network model for authentication processing, wherein the DNN architecture is trained on multi-person speech to obtain a function capable of authenticating speech, and the function is saved into the last layer of the DNN architecture to obtain the neural network model.
In this step, the DNN architecture is built from DNNs (Deep Neural Networks). A DNN can learn from many speech samples, which helps make the whole voiceprint recognition process more intelligent. The multi-person speech consists of multiple utterances from multiple people, each labeled with its speaker; during training, the output of the DNN architecture is compared with this label to determine whether the output is correct.
Step 103: adjust the parameters of the function inside the neural network model according to the authentication result to obtain a target neural network model capable of recognizing the target person's speech.
In this step, the adjusted function embedded in the neural network model corresponds one-to-one with the target person's speech, so during recognition it helps the model determine whether a speech sample belongs to the target person, increasing the model's recognition efficiency and accuracy.
Step 104: perform MFCC feature extraction on the acquired speech to be recognized to obtain speech features to be recognized.
In this step, since the speech to be recognized cannot be fed directly into the input of the target neural network model, MFCC feature extraction is applied to obtain speech features that can be input to the target neural network.
Step 105: input the speech features to be recognized into the target neural network model for speech recognition processing and determine whether the speech belongs to the target person.
In this step, MFCC processing yields multiple speech features to be recognized; these are arranged into a feature vector matrix, which is fed into the input of the target neural network model; the model processes the matrix and emits its result at the output.
This solution can be applied to speech recognition and to speech-based encryption and decryption. For encryption, the target neural network formed by steps 101-103 is embedded in an encrypted file; when the user wants to decrypt by voice, steps 104 and 105 are applied, and when the model outputs "this is the target person", decryption is deemed successful and the corresponding function is activated.
Through the above technical solution, the speech of multiple people forms a training set used to train the DNN architecture, yielding a neural network model capable of voiceprint recognition; the model authenticates the target person's speech, forming inside itself a function corresponding to that speech, and the authenticated target neural network model then recognizes speech and determines whether it was produced by the target person. The recognition process of a target neural network model formed from each person's voiceprint characteristics is fast and accurate, so recognition efficiency is effectively improved.
In a specific embodiment, before step 102 the method further includes:
Step A: collect training speech from multiple speakers and perform MFCC feature extraction on it to obtain training speech features, where each training utterance carries a label for its speaker.
In this step, the training speech comes from many different people, so that the trained neural network model can handle a variety of voice timbres and maintain its recognition quality.
Every training utterance must undergo MFCC processing so that it can be fed into the DNN architecture.
Step B: train the DNN architecture with the training speech features.
In this step, the training speech features may be fed into the DNN architecture randomly or sorted by speaker initial; during training, the output is compared with the corresponding label: a match means the output is correct, a mismatch means it is wrong, and the DNN architecture is adjusted according to the outputs to maintain its accuracy.
Step C: collect statistics on the output data of the DNN architecture during training and, from the statistics, determine a function capable of recognizing speech.
In this step, the outputs of the DNN architecture are aggregated, quantities such as the output accuracy are computed, and from these data the function capable of recognizing speech is calculated.
Step D: save the function into the last layer of the DNN architecture to obtain a neural network model capable of recognizing speech.
In this step, the obtained function is saved into the last layer of the DNN architecture; after the front part of the architecture finishes processing the speech, the result is passed to the last layer, where the function makes a further determination, guaranteeing recognition accuracy.
Through the above technical solution, the DNN architecture is trained on many utterances from many people, producing a correspondingly diverse recognition model able to recognize the timbres of men, women, old and young alike, while the function provides additional confirmation, guaranteeing recognition accuracy.
In a specific embodiment, step A specifically includes:
Step A1: acquire N utterances from multiple people, split each utterance into two parts to obtain 2N training utterances, and attach to each part a label corresponding to the speaker of the utterance.
Step A2: perform MFCC feature extraction on the 2N training utterances to obtain 2N training speech features.
Step A3: arbitrarily select two of the 2N training speech features at a time to combine into N speech feature groups.
In the above scheme, each utterance is split into two parts and the halves are then recombined into N speech feature groups, so the two training speech features in a group may come from the same person or from different people. This trains the DNN architecture to recognize the speech characteristics of both the same and different speakers, diversifying the training and improving its effect.
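The split-and-pair procedure of steps A1-A3 can be sketched in plain Python. The utterance representation (a list of samples tagged with a speaker ID), the random pairing policy and the function name are illustrative assumptions, not the patent's exact procedure:

```python
import random

def build_pairs(utterances, seed=0):
    """Split each of N utterances in half, then randomly pair the
    resulting 2N halves into N labeled speech feature groups.

    utterances: list of (speaker_id, samples) tuples, samples a list.
    Returns N triples: ((half_a, spk_a), (half_b, spk_b), same_speaker).
    """
    halves = []
    for spk, samples in utterances:
        mid = len(samples) // 2
        halves.append((samples[:mid], spk))   # each half keeps its speaker label
        halves.append((samples[mid:], spk))
    rng = random.Random(seed)
    rng.shuffle(halves)                       # arbitrary pairing of the 2N halves
    pairs = []
    for i in range(0, len(halves), 2):
        (xa, sa), (xb, sb) = halves[i], halves[i + 1]
        pairs.append(((xa, sa), (xb, sb), sa == sb))  # same-speaker flag from labels
    return pairs
```

Because the 2N halves are shuffled before pairing, a group may contain two halves from the same speaker or from different speakers, which supplies both positive and negative examples for training.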
In a specific embodiment, step B specifically includes:
Step B1: construct two DNN networks and combine them into a DNN architecture.
Step B2: input the two training speech features of each speech feature group into the two DNN networks of the architecture, respectively, for processing.
Step B3: integrate the outputs of the two DNN networks and emit an integrated result, which indicates whether the two training speech features belong to the same speaker.
Step B4: compute a loss function from the difference between the integrated result and the labels of the two input training speech features, and adjust the parameters of the DNN architecture according to the loss function.
In the above technical solution, the DNN architecture contains two DNN networks that separately process the two training speech features of each group. Their outputs are compared to decide whether the two features come from the same person; the decision is then checked against the features' labels, the loss function is computed from the difference between output and labels, the architecture's parameters are adjusted according to the loss, and training continues with the next speech feature group, repeating until every speech feature group has been trained.
In addition, after the DNN architecture has been trained, further utterances from multiple people can be collected as a test set, processed as in steps A1-A3, fed into the trained architecture for recognition, and the recognition accuracy tallied. If the accuracy is at least a set threshold, the training is deemed successful; if it is below the threshold, the training has failed, and N new utterances from multiple people are selected to retrain the architecture until the tallied accuracy reaches the threshold.
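The post-training accuracy check described above can be sketched as follows. The `model` callable, the triple format and the default threshold are assumptions for illustration; the patent does not fix a specific threshold value:

```python
def check_accuracy(model, test_pairs, threshold=0.9):
    """Run labeled pairs through a trained model and report whether the
    fraction of correct same/different verdicts reaches the threshold.

    model(x, y) -> bool (True means "same speaker"); test_pairs is a
    list of ((x, spk_x), (y, spk_y), same) triples as built in pairing.
    """
    correct = sum(1 for (x, _), (y, _), same in test_pairs
                  if model(x, y) == same)
    accuracy = correct / len(test_pairs)
    return accuracy >= threshold, accuracy
```

If the returned flag is False, the scheme in the text calls for selecting fresh utterances and retraining until the tallied accuracy reaches the threshold.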
In a specific embodiment, step C specifically includes:
Step C1: compute the offset distance L(x, y) of the two training speech features of each of the N speech feature groups, where x and y denote the two training speech features.
In this step, the voiceprint characteristics of the two training speech features differ, with a certain offset distance between them, which can be expressed as:
L(x, y) = x^T y - x^T S x - y^T S y + b
where S denotes the vector matrix output by the DNN network after the training speech features are converted into feature vectors and fed in, and b denotes a set constant value that can be adjusted according to the actual situation.
Step C2: from the offset distance, compute the probability Pr(x, y) that the two training speech features of each of the N speech feature groups belong to the same speaker:
Pr(x, y) = 1 / (1 + e^(-L(x, y)))
Step C3: tally the speech feature groups for which the DNN architecture's same-speaker result during training was correct, forming the set P_same.
Step C4: tally the speech feature groups for which the DNN architecture's different-speaker result during training was correct, forming the set P_diff.
Step C5: compute the function E capable of recognizing speech:
E = -Σ_{(x,y)∈P_same} ln Pr(x, y) - K Σ_{(x,y)∈P_diff} ln(1 - Pr(x, y))
where K is a set weight value.
Through the above scheme, the function E capable of recognizing speech is obtained and then embedded in the trained DNN architecture, forming the final neural network model.
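Steps C1-C5 can be sketched numerically in pure Python. The formulas follow the text above; the list-of-floats vector representation and the helper names are implementation conveniences, not part of the patent:

```python
import math

def offset_distance(x, y, S, b):
    """L(x, y) = x^T y - x^T S x - y^T S y + b for plain-list vectors."""
    dot = lambda u, v: sum(a * c for a, c in zip(u, v))
    matvec = lambda M, v: [dot(row, v) for row in M]   # S @ v
    return dot(x, y) - dot(x, matvec(S, x)) - dot(y, matvec(S, y)) + b

def same_speaker_prob(x, y, S, b):
    """Pr(x, y) = 1 / (1 + exp(-L(x, y)))."""
    return 1.0 / (1.0 + math.exp(-offset_distance(x, y, S, b)))

def recognition_loss(p_same, p_diff, S, b, K=1.0):
    """E = -sum ln Pr over correct same-speaker pairs
           - K * sum ln(1 - Pr) over correct different-speaker pairs."""
    e = -sum(math.log(same_speaker_prob(x, y, S, b)) for x, y in p_same)
    e -= K * sum(math.log(1.0 - same_speaker_prob(x, y, S, b)) for x, y in p_diff)
    return e
```

With S set to zero and b = 0, L(x, y) reduces to the plain inner product x^T y, which makes the sigmoid mapping from distance to same-speaker probability easy to verify by hand.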
In a specific embodiment, step B1 specifically includes:
Step B11: give the DNN network M hidden layers for processing the input training speech features.
Step B12: place a pooling layer after each of the first M-1 hidden layers to aggregate that hidden layer's outputs, compute the mean deviation and standard deviation, and send the integrated outputs of all pooling layers to the last hidden layer.
In this step, the mean of the processing results is computed; the arithmetic mean of the absolute deviations of the results from the mean is taken as the mean deviation, and the square root of the arithmetic mean of the squared deviations as the standard deviation. These computed results are integrated and sent to the last hidden layer, whose neurons process them, determine whose voice the speech features belong to, and output the representative mark corresponding to that person.
Step B13: place a linear output layer before the output port of the DNN network; the last hidden layer sends the integrated result to the linear output layer, which emits it from the output port.
In this step, the linear output layer converts the representative mark output by the last hidden layer into the corresponding representative symbol (i.e., label) and then outputs that symbol.
Step B14: combine the linear output layers of the two configured DNN networks to obtain the DNN architecture.
In this step, the results emitted by the linear output layers of the two DNN networks are compared: if identical, the voices belong to the same person; if different, to different people. Whether the voices belong to the same person, together with the representative symbols of the speakers of the two training utterances, is then output; the symbols are compared with the corresponding labels, a match proving the recognition correct and a mismatch proving it wrong.
Through the above scheme, the constructed DNN network, once trained, performs speech recognition more accurately, and both recognition efficiency and precision are effectively improved.
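The pooling step of B12 (aggregating per-frame hidden-layer outputs into fixed-size statistics) can be sketched as below. The text mentions mean deviation and standard deviation here, while the second embodiment pools the mean and standard deviation; this sketch uses the mean-plus-standard-deviation combination, and the list-of-frames representation is an assumption:

```python
import math

def stats_pool(frames):
    """Aggregate variable-length per-frame hidden-layer outputs into a
    fixed-size vector of per-dimension means and standard deviations."""
    dims = len(frames[0])
    n = len(frames)
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
            for d in range(dims)]
    return means + stds   # concatenation is what the last hidden layer receives
```

Pooling like this is what lets the architecture accept utterances of different lengths: however many frames come in, the last hidden layer always sees a vector of fixed size 2 x dims.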
In a specific embodiment, step 101 specifically includes:
Step 1011: pre-emphasize the authentication speech with a high-pass filter.
Step 1012: divide the pre-emphasized speech into frames.
Step 1013: multiply each frame of the authentication speech by a Hamming window to obtain windowed authentication speech frames.
Step 1014: apply a fast Fourier transform to the windowed authentication speech frames to obtain the corresponding energy spectrum.
Step 1015: pass the energy spectrum through triangular band-pass filters to smooth the energy spectrum and eliminate the effect of its harmonics.
Step 1016: compute the log energy of the triangular band-pass filter outputs and apply a discrete cosine transform to obtain the MFCC features.
Step 1017: normalize the MFCC features and filter out non-speech frames with a voice activity detection tool to obtain the authentication speech features.
Through the above scheme, MFCC preprocessing of the speech yields features suitable for input to the neural network model: a bank of band-pass filters is arranged from low to high frequency, dense to sparse according to critical bandwidth, and filters the input signal. The signal energy output by each band-pass filter serves as a basic feature of the signal which, after further processing, becomes the speech input feature. Because these features do not depend on the nature of the signal, make no assumptions or restrictions about the input signal, and exploit the findings of auditory modeling research, the parameters are more robust, better match the auditory characteristics of the human ear, and retain good recognition performance even when the signal-to-noise ratio drops.
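Steps 1011-1013 (pre-emphasis, framing, Hamming windowing) can be sketched in pure Python. The pre-emphasis coefficient 0.97 and the frame/hop sizes in the usage below are common defaults, not values fixed by the patent:

```python
import math

def pre_emphasis(signal, alpha=0.97):
    """High-pass pre-emphasis filter: y[n] = x[n] - alpha * x[n-1]."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_signal(signal, frame_len, hop):
    """Split into overlapping frames; hop is typically 1/2 or 1/3 of frame_len."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def hamming(frame):
    """Multiply one frame by a Hamming window to smooth the frame edges."""
    n = len(frame)
    return [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, s in enumerate(frame)]
```

The windowed frames would then go through an FFT, a mel-spaced triangular filterbank, log energies and a DCT to produce the MFCC features described in steps 1014-1016.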
In addition, during DNN training, authentication and recognition, every speech input undergoes the MFCC feature extraction process of steps 1011-1017.
Through the deep-learning-based voiceprint recognition method of the above embodiment, the speech of multiple people forms a training set used to train the DNN architecture, yielding a neural network model capable of voiceprint recognition; the model authenticates the target person's speech, forming inside itself a function corresponding to that speech, and the authenticated target neural network model then recognizes speech and determines whether it was produced by the target person. The recognition process of a target neural network model formed from each person's voiceprint characteristics is fast and accurate, so recognition efficiency is effectively improved.
Another embodiment of the deep-learning-based voiceprint recognition method of the present application includes the following steps:
I. Obtaining the training set
1. Collect speech from various speakers and annotate each utterance with its speaker's identity to form the training set.
II. Preprocessing the training set
Extract speaker features from the training set with MFCC (Mel Frequency Cepstral Coefficients), as follows:
1. Pre-emphasize the speech in the training set with a high-pass filter.
2. Divide the training set into frames. Each word of speech in the training set has L sample points, and L sample points are gathered into one observation unit called a frame. To avoid excessive change between adjacent frames, two adjacent frames overlap by a region containing H sample points, where H is usually about 1/2 or 1/3 of L.
3. Window the training set by multiplying each frame by a Hamming window, to increase continuity at the left and right ends of the frame.
4. Apply a fast Fourier transform to the windowed training set to obtain the corresponding energy spectrum.
5. Pass the energy spectrum through triangular band-pass filters to smooth the spectrum, eliminate the effect of harmonics and highlight the formants of the original speech. (Consequently the tone or pitch of an utterance does not appear in the MFCC parameters; in other words, a speech recognition system based on MFCC features is unaffected by differences in the pitch of the input speech.) This also reduces the amount of computation.
6. Compute the log energy of the triangular band-pass filter outputs, then apply a discrete cosine transform (DCT) to obtain 20-dimensional MFCC features with a frame length of 25 ms.
7. Apply mean normalization over a sliding window of at most 3 seconds. Splice 9 frames together to create a 180-dimensional input vector, and filter out non-speech frames with frame-level VAD (Voice Activity Detection) to obtain the filtered MFCC features.
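Step 7 above (mean normalization plus splicing 9 consecutive 20-dimensional frames into a 180-dimensional vector) can be sketched as follows. Normalizing over the whole buffer rather than a true 3-second sliding window is a simplification of this sketch:

```python
def mean_normalize(frames):
    """Subtract the per-dimension mean over the buffer (the text uses a
    sliding window of at most 3 s; a whole-buffer mean is used here)."""
    dims = len(frames[0])
    mu = [sum(f[d] for f in frames) / len(frames) for d in range(dims)]
    return [[f[d] - mu[d] for d in range(dims)] for f in frames]

def splice(frames, context=9):
    """Concatenate `context` consecutive MFCC frames into one input
    vector; 9 frames x 20 dims gives the 180-dim vector in the text."""
    return [sum(frames[i:i + context], [])   # flatten the window of frames
            for i in range(len(frames) - context + 1)]
```

With 20-dimensional MFCC frames and the default context of 9, each spliced vector has 180 dimensions, matching the input size stated above.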
III. Training with the training set to obtain the neural network model
1. Construct the DNN network architecture.
A feed-forward DNN (Deep Neural Network) recognition system is built with the nnet3 neural network library of the Kaldi speech recognition toolkit. As shown in FIG. 2, the DNN network architecture has hidden layers (NIN Layer), a pooling layer (Temporal Pooling) and a linear output layer (Linear Layer); two such DNN network architectures are combined together as shown in FIG. 3.
2. Determine the training features.
The training set contains multiple speakers, each with multiple speech segments, and each segment corresponds to one MFCC feature; two MFCC features of the same speaker form a feature pair. Feature pairs of N different speakers are selected, i.e., N feature pairs constitute the training features.
3. Training
Any two features X and Y among the 2N features (X and Y may belong to the same speaker or to different speakers) are input into the hidden layers of the two DNN network architectures of FIG. 3, respectively, for processing.
The processing results are then passed to the pooling layer, which aggregates the hidden-layer outputs and computes their mean and standard deviation. These data are integrated and sent to the final hidden layer, which feeds its result to the linear output layer for linear output; a loss function is determined from the output, and the neural network is adjusted according to the loss to complete its training. This process is repeated until all data in the training set have been trained, yielding a DNN architecture able to classify speech.
4. Compute the function able to identify the speaker
The probability that speech in the training set belongs to the same speaker is given by equation (1), where Pr(x, y) is the same-speaker probability and x, y are the feature vectors of the two speakers' speech.
The distance L(x, y) between x and y is computed as in equation (2), where the symmetric matrix S and the offset b are constant outputs of the DNN network architecture.
The sets P_same and P_diff of same-speaker and different-speaker pairs correctly classified by the neural network model during training on the training set are determined, giving the function of equation (3); this function indicates whether the corresponding speech belongs to the same speaker, where K is a set weight value.
Pr(x, y) = 1 / (1 + e^(-L(x, y)))                                              (1)
L(x, y) = x^T y - x^T S x - y^T S y + b                                        (2)
E = -Σ_{(x,y)∈P_same} ln Pr(x, y) - K Σ_{(x,y)∈P_diff} ln(1 - Pr(x, y))        (3)
5. Embed the function into the trained DNN network architecture, forming a neural network model capable of speech recognition.
IV. Speech recognition with the neural network model
1. Obtain the user's authentication speech, convert it by MFCC processing into authentication speech features and input them into the neural network model; after authentication by the neural network model, a target neural network model able to recognize the user is formed.
2. When the user wants to perform speech recognition, the speech to be processed is recorded and MFCC-preprocessed to obtain the speech features to be processed.
The speech features to be processed are input into the target neural network model for recognition processing to determine whether the speech belongs to the user; if so, the corresponding function is activated.
Further, as a specific implementation of the method of FIG. 1, an embodiment of the present application provides a deep-learning-based voiceprint recognition apparatus. As shown in FIG. 4, the apparatus includes, connected in sequence: an acquisition module 41, an authentication module 42, an adjustment module 43, an extraction module 44 and a processing module 45.
The acquisition module 41 acquires an authentication speech of the target person and performs MFCC feature extraction on it to obtain authentication speech features.
The authentication module 42 inputs the authentication speech features into a neural network model for authentication processing, wherein the DNN architecture is trained on multi-person speech to obtain a function capable of authenticating speech, and the function is saved into the last layer of the DNN architecture to obtain the neural network model.
The adjustment module 43 adjusts the parameters of the function inside the neural network model according to the authentication result to obtain a target neural network model capable of recognizing the target person's speech.
The extraction module 44 performs MFCC feature extraction on the acquired speech to be recognized to obtain speech features to be recognized.
The processing module 45 inputs the speech features to be recognized into the target neural network model for speech recognition processing and determines whether the speech belongs to the target person.
In a specific embodiment, the apparatus further includes: a collection module for collecting training speech of multiple speakers and performing MFCC feature extraction to obtain training speech features, each training utterance carrying the label of its speaker; a training module for training the DNN architecture with the training speech features; a calculation module for collecting statistics on the DNN architecture's output data during training and determining from the statistics a function capable of recognizing speech; and a saving module for saving the function into the last layer of the DNN architecture to obtain a neural network model capable of recognizing speech.
In a specific embodiment, the collection module specifically includes: a division unit for acquiring N utterances of multiple people, splitting each utterance into two parts to obtain 2N training utterances, and attaching to each part the label of the utterance's speaker; an extraction unit for performing MFCC feature extraction on the 2N training utterances to obtain 2N training speech features; and a combination unit for arbitrarily selecting two of the 2N training speech features at a time to combine into N speech feature groups.
In a specific embodiment, the training module specifically includes: a construction unit for constructing two DNN networks and combining them into a DNN architecture; an input unit for inputting the two training speech features of each speech feature group into the two DNN networks, respectively, for processing; an integration unit for integrating the two networks' outputs and emitting an integrated result, which indicates whether the two training speech features belong to the same speaker; and an adjustment-training unit for computing a loss function from the difference between the integrated result and the labels of the two input training speech features and adjusting the DNN architecture's parameters according to the loss function.
In a specific embodiment, the calculation module specifically includes: an offset-distance calculation unit for computing the offset distance L(x, y) of the two training speech features of each of the N speech feature groups, where x and y denote the two training speech features; a probability calculation unit for computing, from the offset distance, the probability Pr(x, y) that the two training speech features of each group belong to the same speaker,
Pr(x, y) = 1 / (1 + e^(-L(x, y)))
a statistics unit for tallying the speech feature groups whose same-speaker results output by the DNN architecture during training were correct into the set P_same, and the groups whose different-speaker results were correct into the set P_diff; and a calculation unit for computing the function E capable of recognizing speech:
E = -Σ_{(x,y)∈P_same} ln Pr(x, y) - K Σ_{(x,y)∈P_diff} ln(1 - Pr(x, y))
where K is a set weight value.
In a specific embodiment, the construction unit specifically includes a setting unit for: giving the DNN network M hidden layers to process the input training speech features; placing a pooling layer after each of the first M-1 hidden layers to aggregate that layer's outputs, compute the mean deviation and standard deviation, and send the integrated pooling outputs to the last hidden layer; placing a linear output layer before the network's output port, the last hidden layer sending its integrated result to the linear output layer for emission from the output port; and combining the linear output layers of the two configured DNN networks into the DNN architecture.
In a specific embodiment, the acquisition module 41 specifically includes: an emphasis unit for pre-emphasizing the authentication speech with a high-pass filter; a framing unit for dividing the pre-emphasized speech into frames; a windowing unit for multiplying each frame of the authentication speech by a Hamming window to obtain windowed authentication speech frames; a transform unit for applying a fast Fourier transform to the windowed frames to obtain the corresponding energy spectrum; a filtering unit for passing the energy spectrum through triangular band-pass filters to smooth it and eliminate the effect of its harmonics; a log-conversion unit for computing the log energy of the triangular band-pass filter outputs and applying a discrete cosine transform to obtain the MFCC features; and a normalization unit for normalizing the MFCC features and filtering out non-speech frames with a voice activity detection tool to obtain the authentication speech features.
Based on the embodiments of the method of FIG. 1 and the apparatus of FIG. 4, in order to achieve the above objects, an embodiment of the present application also provides a computer device. As shown in FIG. 5, it includes a memory 52 and a processor 51, both arranged on a bus 53; the memory 52 stores a computer program, and when the processor 51 executes the program, the deep-learning-based voiceprint recognition method of FIG. 1 is implemented.
Based on this understanding, the technical solution of the present application can be embodied as a software product stored in a non-volatile memory (a CD-ROM, USB flash drive, removable hard disk, etc.) and containing several instructions that enable a computer device (a personal computer, server, network device, etc.) to execute the methods described in the implementation scenarios of this application.
Optionally, the device may also be connected to a user interface, a network interface, a camera, radio frequency (RF) circuitry, sensors, audio circuitry, a Wi-Fi module, and so on. The user interface may include a display (Display) and an input unit such as a keyboard (Keyboard); the optional user interface may also include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface or a wireless interface (such as a Bluetooth or Wi-Fi interface).
Those skilled in the art will understand that the structure of the computer device provided in this embodiment does not limit the physical device, which may include more or fewer components, combine certain components, or arrange components differently.
Based on the embodiments of the method of FIG. 1 and the apparatus of FIG. 4, an embodiment of the present application accordingly also provides a storage medium on which a computer program is stored; when executed by a processor, the program implements the deep-learning-based voiceprint recognition method of FIG. 1 above.
The storage medium may further include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the computer device and supports the running of the information processing program and of other software and/or programs. The network communication module implements communication among the components within the storage medium and with other hardware and software in the computer device.
From the description of the above embodiments, those skilled in the art will clearly understand that the present application can be implemented by means of software plus a necessary general hardware platform, or by hardware.
By applying the technical solution of the present application, the speech of multiple people forms a training set used to train the DNN architecture, yielding a neural network model capable of voiceprint recognition; the model authenticates the target person's speech, forming inside itself a function corresponding to that speech, and the authenticated target neural network model then recognizes speech and determines whether it was produced by the target person. The recognition process of a target neural network model formed from each person's voiceprint characteristics is fast and accurate, so recognition efficiency is effectively improved.
Those skilled in the art will understand that the drawings are only schematic diagrams of a preferred implementation scenario, and that the modules or processes in the drawings are not necessarily required to implement the present application. The modules of the apparatus in an implementation scenario may be distributed in that apparatus as described, or may, with corresponding changes, be located in one or more apparatuses different from this scenario; the modules of the above scenarios may be merged into one module or further split into multiple sub-modules.
The above serial numbers of the present application are for description only and do not represent the merits of the implementation scenarios. The above discloses only a few specific implementation scenarios of the present application; however, the present application is not limited thereto, and any variation conceivable by those skilled in the art shall fall within the scope of protection of the present application.

Claims (20)

  1. A deep-learning-based voiceprint recognition method, wherein the method comprises:
    acquiring an authentication speech of a target person and performing MFCC feature extraction on the authentication speech to obtain authentication speech features;
    inputting the authentication speech features into a neural network model for authentication processing, wherein a DNN architecture is trained on multi-person speech to obtain a function capable of authenticating speech, and the function is saved into the last layer of the DNN architecture to obtain the neural network model;
    adjusting parameters of the function inside the neural network model according to the authentication result to obtain a target neural network model capable of recognizing the target person's speech;
    performing MFCC feature extraction on acquired speech to be recognized to obtain speech features to be recognized; and
    inputting the speech features to be recognized into the target neural network model for speech recognition processing to determine whether the speech to be recognized belongs to the target person.
  2. The method according to claim 1, wherein before inputting the authentication speech features into the neural network model for authentication processing, the method further comprises:
    collecting training speech of multiple speakers and performing MFCC feature extraction on the training speech to obtain training speech features, wherein each training utterance carries a label for its speaker;
    training the DNN architecture with the training speech features;
    collecting statistics on the output data of the DNN architecture during training and determining, from the statistics, a function capable of recognizing speech; and
    saving the function into the last layer of the DNN architecture to obtain a neural network model capable of recognizing speech.
  3. The method according to claim 2, wherein collecting training speech of multiple speakers and performing MFCC feature extraction on the training speech to obtain training speech features specifically comprises:
    acquiring N utterances of multiple people, splitting each utterance into two parts to obtain 2N training utterances, and attaching to each part a label corresponding to the speaker of the utterance;
    performing MFCC feature extraction on the 2N training utterances to obtain 2N training speech features; and
    arbitrarily selecting two of the 2N training speech features at a time to combine into N speech feature groups.
  4. The method according to claim 3, wherein training the DNN architecture with the training speech features specifically comprises:
    constructing two DNN networks and combining the two DNN networks into a DNN architecture;
    inputting the two training speech features of each speech feature group into the two DNN networks of the DNN architecture, respectively, for processing;
    integrating the outputs of the two DNN networks and emitting an integrated result, which indicates whether the two training speech features belong to the same speaker; and
    computing a loss function from the difference between the integrated result and the labels of the two input training speech features, and adjusting parameters of the DNN architecture according to the loss function.
  5. The method according to claim 4, wherein collecting statistics on the output data of the DNN architecture during training and determining, from the statistics, a function capable of recognizing speech specifically comprises:
    computing the offset distance L(x, y) of the two training speech features of each of the N speech feature groups, where x and y denote the two training speech features;
    computing, from the offset distance, the probability Pr(x, y) that the two training speech features of each of the N speech feature groups belong to the same speaker,
    Pr(x, y) = 1 / (1 + e^(-L(x, y)));
    tallying the speech feature groups for which the DNN architecture's same-speaker result during training was correct into the set P_same;
    tallying the speech feature groups for which the DNN architecture's different-speaker result during training was correct into the set P_diff; and
    computing the function E capable of recognizing speech:
    E = -Σ_{(x,y)∈P_same} ln Pr(x, y) - K Σ_{(x,y)∈P_diff} ln(1 - Pr(x, y)),
    where K is a set weight value.
  6. The method according to claim 4, wherein constructing two DNN networks and combining the two DNN networks into a DNN architecture specifically comprises:
    giving the DNN network M hidden layers for processing the input training speech features;
    placing a pooling layer after each of the first M-1 hidden layers to aggregate the outputs of that hidden layer, compute the mean deviation and standard deviation, and send the integrated outputs of all pooling layers to the last hidden layer;
    placing a linear output layer before the output port of the DNN network, the last hidden layer sending the integrated result to the linear output layer for emission from the output port; and
    combining the linear output layers of the two configured DNN networks to obtain the DNN architecture.
  7. The method according to claim 1, wherein performing MFCC feature extraction on the authentication speech to obtain authentication speech features specifically comprises:
    pre-emphasizing the authentication speech with a high-pass filter;
    dividing the pre-emphasized speech into frames;
    multiplying each frame of the authentication speech by a Hamming window to obtain windowed authentication speech frames;
    applying a fast Fourier transform to the windowed authentication speech frames to obtain the corresponding energy spectrum;
    passing the energy spectrum through triangular band-pass filters to smooth the energy spectrum and eliminate the effect of its harmonics;
    computing the log energy of the triangular band-pass filter outputs and applying a discrete cosine transform to obtain MFCC features; and
    normalizing the MFCC features and filtering out non-speech frames with a voice activity detection tool to obtain the authentication speech features.
  8. A deep-learning-based voiceprint recognition apparatus, wherein the apparatus comprises:
    an acquisition module for acquiring an authentication speech of a target person and performing MFCC feature extraction on the authentication speech to obtain authentication speech features;
    an authentication module for inputting the authentication speech features into a neural network model for authentication processing, wherein a DNN architecture is trained on multi-person speech to obtain a function capable of authenticating speech, and the function is saved into the last layer of the DNN architecture to obtain the neural network model;
    an adjustment module for adjusting parameters of the function inside the neural network model according to the authentication result to obtain a target neural network model capable of recognizing the target person's speech;
    an extraction module for performing MFCC feature extraction on acquired speech to be recognized to obtain speech features to be recognized; and
    a processing module for inputting the speech features to be recognized into the target neural network model for speech recognition processing to determine whether the speech to be recognized belongs to the target person.
  9. The apparatus according to claim 8, wherein the apparatus further comprises:
    a collection module for collecting training speech of multiple speakers and performing MFCC feature extraction on the training speech to obtain training speech features, wherein each training utterance carries a label for its speaker;
    a training module for training the DNN architecture with the training speech features;
    a calculation module for collecting statistics on the output data of the DNN architecture during training and determining, from the statistics, a function capable of recognizing speech; and
    a saving module for saving the function into the last layer of the DNN architecture to obtain a neural network model capable of recognizing speech.
  10. The apparatus according to claim 9, wherein the collection module comprises:
    a division unit for acquiring N utterances of multiple people, splitting each utterance into two parts to obtain 2N training utterances, and attaching to each part a label corresponding to the speaker of the utterance;
    an extraction unit for performing MFCC feature extraction on the 2N training utterances to obtain 2N training speech features; and
    a combination unit for arbitrarily selecting two of the 2N training speech features at a time to combine into N speech feature groups.
  11. The apparatus according to claim 9, wherein the training module comprises:
    a construction unit for constructing two DNN networks and combining the two DNN networks into a DNN architecture;
    an input unit for inputting the two training speech features of each speech feature group into the two DNN networks of the DNN architecture, respectively, for processing;
    an integration unit for integrating the outputs of the two DNN networks and emitting an integrated result, which indicates whether the two training speech features belong to the same speaker; and
    an adjustment-training unit for computing a loss function from the difference between the integrated result and the labels of the two input training speech features, and adjusting parameters of the DNN architecture according to the loss function.
  12. The apparatus according to claim 9, wherein the calculation module specifically comprises:
    an offset-distance calculation unit for computing the offset distance L(x, y) of the two training speech features of each of the N speech feature groups, where x and y denote the two training speech features;
    a probability calculation unit for computing, from the offset distance, the probability Pr(x, y) that the two training speech features of each of the N speech feature groups belong to the same speaker,
    Pr(x, y) = 1 / (1 + e^(-L(x, y)));
    a statistics unit for tallying the speech feature groups for which the DNN architecture's same-speaker result during training was correct into the set P_same, and the groups for which its different-speaker result was correct into the set P_diff; and
    a calculation unit for computing the function E capable of recognizing speech:
    E = -Σ_{(x,y)∈P_same} ln Pr(x, y) - K Σ_{(x,y)∈P_diff} ln(1 - Pr(x, y)),
    where K is a set weight value.
  13. The apparatus according to claim 11, wherein the construction unit specifically comprises a setting unit for: giving the DNN network M hidden layers for processing the input training speech features; placing a pooling layer after each of the first M-1 hidden layers to aggregate the outputs of that hidden layer, compute the mean deviation and standard deviation, and send the integrated outputs of all pooling layers to the last hidden layer; placing a linear output layer before the output port of the DNN network, the last hidden layer sending the integrated result to the linear output layer for emission from the output port; and combining the linear output layers of the two configured DNN networks to obtain the DNN architecture.
  14. The apparatus according to claim 8, wherein the acquisition module comprises:
    an emphasis unit for pre-emphasizing the authentication speech with a high-pass filter;
    a framing unit for dividing the pre-emphasized speech into frames;
    a windowing unit for multiplying each frame of the authentication speech by a Hamming window to obtain windowed authentication speech frames;
    a transform unit for applying a fast Fourier transform to the windowed authentication speech frames to obtain the corresponding energy spectrum;
    a filtering unit for passing the energy spectrum through triangular band-pass filters to smooth the energy spectrum and eliminate the effect of its harmonics;
    a log-conversion unit for computing the log energy of the triangular band-pass filter outputs and applying a discrete cosine transform to obtain MFCC features; and
    a normalization unit for normalizing the MFCC features and filtering out non-speech frames with a voice activity detection tool to obtain the authentication speech features.
  15. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of a deep-learning-based voiceprint recognition method, comprising:
    acquiring an authentication speech of a target person and performing MFCC feature extraction on the authentication speech to obtain authentication speech features;
    inputting the authentication speech features into a neural network model for authentication processing, wherein a DNN architecture is trained on multi-person speech to obtain a function capable of authenticating speech, and the function is saved into the last layer of the DNN architecture to obtain the neural network model;
    adjusting parameters of the function inside the neural network model according to the authentication result to obtain a target neural network model capable of recognizing the target person's speech;
    performing MFCC feature extraction on acquired speech to be recognized to obtain speech features to be recognized; and
    inputting the speech features to be recognized into the target neural network model for speech recognition processing to determine whether the speech to be recognized belongs to the target person.
  16. The computer device according to claim 15, wherein before inputting the authentication speech features into the neural network model for authentication processing, the method further comprises:
    collecting training speech of multiple speakers and performing MFCC feature extraction on the training speech to obtain training speech features, wherein each training utterance carries a label for its speaker;
    training the DNN architecture with the training speech features;
    collecting statistics on the output data of the DNN architecture during training and determining, from the statistics, a function capable of recognizing speech; and
    saving the function into the last layer of the DNN architecture to obtain a neural network model capable of recognizing speech.
  17. The computer device according to claim 15, wherein collecting training speech of multiple speakers and performing MFCC feature extraction on the training speech to obtain training speech features specifically comprises:
    acquiring N utterances of multiple people, splitting each utterance into two parts to obtain 2N training utterances, and attaching to each part a label corresponding to the speaker of the utterance;
    performing MFCC feature extraction on the 2N training utterances to obtain 2N training speech features; and
    arbitrarily selecting two of the 2N training speech features at a time to combine into N speech feature groups.
  18. A computer storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of a deep-learning-based voiceprint recognition method, comprising:
    acquiring an authentication speech of a target person and performing MFCC feature extraction on the authentication speech to obtain authentication speech features;
    inputting the authentication speech features into a neural network model for authentication processing, wherein a DNN architecture is trained on multi-person speech to obtain a function capable of authenticating speech, and the function is saved into the last layer of the DNN architecture to obtain the neural network model;
    adjusting parameters of the function inside the neural network model according to the authentication result to obtain a target neural network model capable of recognizing the target person's speech;
    performing MFCC feature extraction on acquired speech to be recognized to obtain speech features to be recognized; and
    inputting the speech features to be recognized into the target neural network model for speech recognition processing to determine whether the speech to be recognized belongs to the target person.
  19. The computer storage medium according to claim 18, wherein before inputting the authentication speech features into the neural network model for authentication processing, the method further comprises:
    collecting training speech of multiple speakers and performing MFCC feature extraction on the training speech to obtain training speech features, wherein each training utterance carries a label for its speaker;
    training the DNN architecture with the training speech features;
    collecting statistics on the output data of the DNN architecture during training and determining, from the statistics, a function capable of recognizing speech; and
    saving the function into the last layer of the DNN architecture to obtain a neural network model capable of recognizing speech.
  20. The computer storage medium according to claim 18, wherein collecting training speech of multiple speakers and performing MFCC feature extraction on the training speech to obtain training speech features specifically comprises:
    acquiring N utterances of multiple people, splitting each utterance into two parts to obtain 2N training utterances, and attaching to each part a label corresponding to the speaker of the utterance;
    performing MFCC feature extraction on the 2N training utterances to obtain 2N training speech features; and
    arbitrarily selecting two of the 2N training speech features at a time to combine into N speech feature groups.
PCT/CN2019/118402 2019-09-20 2019-11-14 Deep-learning-based voiceprint recognition method, apparatus and device WO2021051608A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910894120.3 2019-09-20
CN201910894120.3A CN110767239A (zh) 2019-09-20 Deep-learning-based voiceprint recognition method, apparatus and device

Publications (1)

Publication Number Publication Date
WO2021051608A1 true WO2021051608A1 (zh) 2021-03-25

Family

ID=69330817

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118402 WO2021051608A1 (zh) 2019-09-20 2019-11-14 一种基于深度学习的声纹识别方法、装置及设备

Country Status (2)

Country Link
CN (1) CN110767239A (zh)
WO (1) WO2021051608A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113421575A (zh) * 2021-06-30 2021-09-21 平安科技(深圳)有限公司 声纹识别方法、装置、设备及存储介质
CN113707159A (zh) * 2021-08-02 2021-11-26 南昌大学 一种基于Mel语图与深度学习的电网涉鸟故障鸟种识别方法

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111524525B (zh) * 2020-04-28 2023-06-16 平安科技(深圳)有限公司 原始语音的声纹识别方法、装置、设备及存储介质
CN112017632A (zh) * 2020-09-02 2020-12-01 浪潮云信息技术股份公司 一种自动化会议记录生成方法
CN112637209A (zh) * 2020-12-23 2021-04-09 四川虹微技术有限公司 安全认证方法及装置、安全注册方法及装置、存储介质
CN113037781A (zh) * 2021-04-29 2021-06-25 广东工业大学 基于rnn的语音信息加密方法及装置
CN113488059A (zh) * 2021-08-13 2021-10-08 广州市迪声音响有限公司 一种声纹识别方法及系统

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018107810A1 (zh) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 声纹识别方法、装置、电子设备及介质
CN108564954A (zh) * 2018-03-19 2018-09-21 平安科技(深圳)有限公司 深度神经网络模型、电子装置、身份验证方法和存储介质
CN109074822A (zh) * 2017-10-24 2018-12-21 深圳和而泰智能控制股份有限公司 特定声音识别方法、设备和存储介质
CN110010133A (zh) * 2019-03-06 2019-07-12 平安科技(深圳)有限公司 基于短文本的声纹检测方法、装置、设备及存储介质

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9502038B2 (en) * 2013-01-28 2016-11-22 Tencent Technology (Shenzhen) Company Limited Method and device for voiceprint recognition
EP3433854B1 (en) * 2017-06-13 2020-05-20 Beijing Didi Infinity Technology and Development Co., Ltd. Method and system for speaker verification
CN107358626B (zh) * 2017-07-17 2020-05-15 清华大学深圳研究生院 一种利用条件生成对抗网络计算视差的方法
JP7143591B2 (ja) * 2018-01-17 2022-09-29 トヨタ自動車株式会社 発話者推定装置
CN108958810A (zh) * 2018-02-09 2018-12-07 北京猎户星空科技有限公司 一种基于声纹的用户识别方法、装置及设备
CN108647643B (zh) * 2018-05-11 2021-08-03 浙江工业大学 一种基于深度学习的填料塔液泛状态在线辨识方法
CN108898595B (zh) * 2018-06-27 2021-02-19 慧影医疗科技(北京)有限公司 一种胸部图像中病灶区域的定位模型的构建方法及应用
CN109472196A (zh) * 2018-09-28 2019-03-15 天津大学 一种基于视频图像的室内人员检测方法
CN109243467B (zh) * 2018-11-14 2019-11-05 龙马智声(珠海)科技有限公司 声纹模型构建方法、声纹识别方法及系统
CN109801636A (zh) * 2019-01-29 2019-05-24 北京猎户星空科技有限公司 声纹识别模型的训练方法、装置、电子设备及存储介质
CN110211594B (zh) * 2019-06-06 2021-05-04 杭州电子科技大学 一种基于孪生网络模型和knn算法的说话人识别方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018107810A1 (zh) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 声纹识别方法、装置、电子设备及介质
CN109074822A (zh) * 2017-10-24 2018-12-21 深圳和而泰智能控制股份有限公司 特定声音识别方法、设备和存储介质
CN108564954A (zh) * 2018-03-19 2018-09-21 平安科技(深圳)有限公司 深度神经网络模型、电子装置、身份验证方法和存储介质
CN110010133A (zh) * 2019-03-06 2019-07-12 平安科技(深圳)有限公司 基于短文本的声纹检测方法、装置、设备及存储介质

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113421575A (zh) * 2021-06-30 2021-09-21 平安科技(深圳)有限公司 声纹识别方法、装置、设备及存储介质
CN113421575B (zh) * 2021-06-30 2024-02-06 平安科技(深圳)有限公司 声纹识别方法、装置、设备及存储介质
CN113707159A (zh) * 2021-08-02 2021-11-26 南昌大学 一种基于Mel语图与深度学习的电网涉鸟故障鸟种识别方法
CN113707159B (zh) * 2021-08-02 2024-05-03 南昌大学 一种基于Mel语图与深度学习的电网涉鸟故障鸟种识别方法

Also Published As

Publication number Publication date
CN110767239A (zh) 2020-02-07

Similar Documents

Publication Publication Date Title
WO2021051608A1 (zh) 一种基于深度学习的声纹识别方法、装置及设备
CN107492382B (zh) 基于神经网络的声纹信息提取方法及装置
WO2020181824A1 (zh) 声纹识别方法、装置、设备以及计算机可读存储介质
JP6954680B2 (ja) 話者の確認方法及び話者の確認装置
WO2017215558A1 (zh) 一种声纹识别方法和装置
CN109036382B (zh) 一种基于kl散度的音频特征提取方法
CN106062871B (zh) 使用所选择的群组样本子集来训练分类器
US9530417B2 (en) Methods, systems, and circuits for text independent speaker recognition with automatic learning features
Baloul et al. Challenge-based speaker recognition for mobile authentication
WO2019134247A1 (zh) 基于声纹识别模型的声纹注册方法、终端装置及存储介质
CN103794207A (zh) 一种双模语音身份识别方法
JP2001092974A (ja) 話者認識方法及びその実行装置並びに音声発生確認方法及び装置
CN110299142A (zh) 一种基于网络融合的声纹识别方法及装置
CN113223536B (zh) 声纹识别方法、装置及终端设备
CN113823293B (zh) 一种基于语音增强的说话人识别方法及系统
CN110570870A (zh) 一种文本无关的声纹识别方法、装置及设备
WO2018095167A1 (zh) 声纹识别方法和声纹识别系统
CN108877812B (zh) 一种声纹识别方法、装置及存储介质
CN111816185A (zh) 一种对混合语音中说话人的识别方法及装置
CN110570871A (zh) 一种基于TristouNet的声纹识别方法、装置及设备
WO2020140609A1 (zh) 一种语音识别方法、设备及计算机可读存储介质
WO2021072893A1 (zh) 一种声纹聚类方法、装置、处理设备以及计算机存储介质
Wu et al. Improving Deep CNN Architectures with Variable-Length Training Samples for Text-Independent Speaker Verification.
Brunet et al. Speaker recognition for mobile user authentication: An android solution
Sukor et al. Speaker identification system using MFCC procedure and noise reduction method

Legal Events

Code  Description
121   Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19945682; Country of ref document: EP; Kind code of ref document: A1)
NENP  Non-entry into the national phase (Ref country code: DE)
122   Ep: pct application non-entry in european phase (Ref document number: 19945682; Country of ref document: EP; Kind code of ref document: A1)