CN107731233B - Voiceprint recognition method based on RNN - Google Patents

Voiceprint recognition method based on RNN

Info

Publication number
CN107731233B
CN107731233B (application CN201711070510.6A)
Authority
CN
China
Prior art keywords
voice
speaker
neural network
voice data
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711070510.6A
Other languages
Chinese (zh)
Other versions
CN107731233A (en)
Inventor
冯毅夫
王华锋
徐雷
杜俊逸
付明霞
马晨南
齐一凡
潘海侠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Fuji Robot Co.,Ltd.
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201711070510.6A priority Critical patent/CN107731233B/en
Publication of CN107731233A publication Critical patent/CN107731233A/en
Application granted granted Critical
Publication of CN107731233B publication Critical patent/CN107731233B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering

Abstract

The invention provides an RNN-based voiceprint recognition method. After the MFCC features of denoised voice data and their first-order differences are obtained, a recurrent neural network extracts high-level speaker features from the MFCC features, the extracted features are classified with a softmax classifier, and the speaker is finally identified with a naive Bayes method. Unlike the silence removal of traditional methods, this method retains the silent segments of the voice data; based on the recurrent neural network it can extract context-related features as well as high-level characteristics of the speaker's voice, such as speaking style and rhythm, so the feature information is more complete and more representative of the speaker. Compared with existing Gaussian-based voiceprint recognition methods, the method places relatively low demands on the voice data and achieves higher accuracy; even with large amounts of data the accuracy remains high and the running speed does not degrade.

Description

Voiceprint recognition method based on RNN
Technical Field
The invention provides an RNN-based voiceprint recognition method and relates to the fields of deep learning, pattern recognition and speech signal processing.
Background
With the rapid development of information technology, how to accurately authenticate a person's identity, protect personal privacy and guarantee information security has become a problem that urgently needs to be solved. Compared with traditional identity authentication, biometric identity authentication cannot be lost, stolen or forgotten in use; it is not only fast and convenient but also accurate and reliable. Voiceprint recognition is one of the most popular biometric technologies and has unique advantages in applications such as remote authentication, so it receives more and more attention: WeChat has launched a voice-lock login mode; the Lenovo A586, the first phone in the world to use voiceprint recognition for unlocking, opened the way for applying the technology; and private-banking clients of Barclays Wealth, a unit of Barclays Bank, complete identity verification with their own voice. Compared with face and fingerprint recognition, voiceprint recognition has always kept a low profile, and public awareness of it is not high. In fact, thanks to its high usability, high user acceptance and low acquisition cost, voiceprint recognition has been developing rapidly, if quietly, in recent years, and its range of applications keeps expanding. Annual investment in speech and voiceprint recognition by large global companies, including Apple, Google, Microsoft, Baidu and iFlytek, has been rising, and public data show that by 2020 the global market for speech-related pattern recognition will grow from 61.9 million US dollars in 2015 to 200 million US dollars, so the future market potential can be said to be enormous.
Common voiceprint recognition methods mainly include: the method comprises a voiceprint recognition method based on signal processing, a voiceprint recognition method based on acoustic characteristics and pattern matching, a voiceprint recognition method based on a Gaussian mixture model and a voiceprint recognition method based on deep learning.
Method based on signal processing: this is the earliest method applied in the development of voiceprint recognition. It computes signal-level parameters of the speech data using techniques from signal processing and then performs template matching, statistical variance analysis and the like. The method is extremely sensitive to the voice data, has low accuracy, and its recognition results are not ideal.
Recognition method based on acoustic features and pattern matching: from the late 1970s to the late 1980s, speaker recognition research focused on acoustic feature parameters and new pattern matching methods. Researchers successively proposed speaker recognition feature parameters such as LPC spectral coefficients, LSP spectral coefficients, perceptual linear prediction coefficients and Mel-frequency cepstral coefficients. During this period, techniques such as dynamic time warping, vector quantization, support vector machines and artificial neural networks became widely used in speech recognition and also formed the core techniques of speaker recognition. These speaker recognition models all impose certain limits on speech length, text and speech channel, whereas in practice short speech and cross-channel problems are common, and the cross-channel problem has the greatest impact on the performance of a voiceprint recognition system.
Recognition method based on the Gaussian mixture model: from the 1990s onward, the Gaussian mixture model (GMM), with its simplicity, flexibility, effectiveness and good robustness, quickly became the mainstream technology of text-independent speaker recognition and brought speaker recognition research into a new stage. A GMM is a probabilistic model that models the probability distribution of the features, unlike approaches that model the speech features directly; the decision mode also changes, with the similarity between models judged by likelihood scores. However, it requires a very large amount of voice data, is very sensitive to channel and environmental noise, and cannot meet the requirements of real scenarios.
Voiceprint recognition method based on deep learning: this kind of method uses a large number of training samples to learn voiceprint features automatically and can extract discriminative voiceprint features. However, existing deep-learning-based methods do not consider the context-dependent nature of the speech signal, the extracted features do not represent the speaker well, and the advantages of deep learning are not fully exploited.
To solve these problems, the invention provides an RNN-based voiceprint recognition method that can extract high-level voice features and complete the voiceprint recognition task accurately and efficiently.
Disclosure of Invention
The technical problem solved by the invention is as follows: existing voiceprint recognition methods do not consider the context correlation of voice data, the extracted features cannot represent the speaker well, and the strong feature-extraction capability of deep learning is not exploited. A voiceprint recognition method based on a recurrent neural network (RNN) is therefore provided.
The technical scheme adopted by the invention is as follows: the method comprises the following four steps:
Step (1): denoise the input voice data with spectral subtraction, eliminating the channel noise, i.e. the noise introduced by the recording equipment; the clean voice data obtained after removing the channel noise is used as the training input.
Step (2): frame the clean voice data obtained in step (1) with a frame length of 25 ms and a frame shift of 10 ms, so that each piece of voice data is divided into hundreds to thousands of frames; compute the MFCC feature parameters of each frame, keep the first 13 MFCC dimensions, then compute their first-order and second-order differences, take the first 13 dimensions of each, and splice them into a 39-dimensional feature vector that serves as the feature parameters of that frame; combine the 39-dimensional features of every 64 frames into one 64×39 two-dimensional block, discard any remainder of fewer than 64 frames, and label each block with the speaker's identity as the input of the neural network.
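For concreteness, the following is a minimal sketch of the feature pipeline of step (2), assuming 16 kHz audio and the librosa library for MFCC and difference computation (the library, the sampling rate and all function and variable names are illustrative assumptions, not part of the patent):

```python
import numpy as np
import librosa

def speech_to_feature_blocks(clean_wav, sr=16000, speaker_label=0):
    """Frame clean speech (25 ms window, 10 ms shift), extract 13 MFCCs plus
    their first- and second-order differences, and stack every 64 frames
    into one 64x39 block, as described in step (2)."""
    n_fft = int(0.025 * sr)   # 25 ms frame length
    hop = int(0.010 * sr)     # 10 ms frame shift
    mfcc = librosa.feature.mfcc(y=clean_wav, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)   # shape (13, n_frames)
    d1 = librosa.feature.delta(mfcc, order=1)                  # first-order difference
    d2 = librosa.feature.delta(mfcc, order=2)                  # second-order difference
    feats = np.vstack([mfcc, d1, d2]).T                        # shape (n_frames, 39)

    # Group every 64 consecutive frames into one 64x39 block; drop the remainder.
    n_blocks = feats.shape[0] // 64
    blocks = feats[:n_blocks * 64].reshape(n_blocks, 64, 39)
    labels = np.full(n_blocks, speaker_label)   # every block keeps the speaker's label
    return blocks, labels
```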
Step (3): feed the two-dimensional data obtained in step (2) into the training of the recurrent neural network. The recurrent neural network has 64 LSTM units; each LSTM unit has 256 hidden neurons, and the network is unrolled over 64 time steps, each step sharing the same network model. A unidirectional recurrent neural network is used, so the last LSTM unit contains the information of all the preceding units, and the output of the last LSTM unit is taken as the final voice feature and passed to the recognition stage.
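A hypothetical PyTorch sketch of the network in step (3); the framework, the class name and everything beyond the stated 64 steps, 39-dimensional input, 256 hidden units and unidirectional structure are assumptions for illustration:

```python
import torch
import torch.nn as nn

class VoiceprintRNN(nn.Module):
    """Unidirectional LSTM unrolled over 64 time steps of 39-dim frame features.
    Only the output of the last step is used as the utterance-segment feature."""
    def __init__(self, n_speakers, input_dim=39, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim,
                            num_layers=1, batch_first=True)   # unidirectional
        self.classifier = nn.Linear(hidden_dim, n_speakers)   # softmax layer

    def forward(self, x):                # x: (batch, 64, 39)
        out, _ = self.lstm(x)            # out: (batch, 64, 256)
        last = out[:, -1, :]             # output of the last LSTM step only
        return self.classifier(last)     # logits; softmax applied in the loss / at test time
```

During training a cross-entropy loss (which applies the softmax implicitly) would be used on these logits; at test time a softmax over them gives the per-speaker probabilities used in step (4).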
Step (4): recognize the voice features obtained in step (3) and determine the speaker to whom the voice data belongs.
Further, the spectral-subtraction denoising of step (1) has the advantage that only the channel noise is removed while the silent segments are retained. The junctions between silent and voiced segments represent high-level characteristics of the speaker, such as speaking style and rhythm, very well, and these high-level features are subsequently extracted with the recurrent neural network.
Further, the MFCC features of step (2) take human auditory characteristics into account: the linear spectrum is first mapped onto the Mel nonlinear spectrum based on auditory perception and then converted to the cepstral domain, which makes them very prominent among hand-crafted speech features. The standard cepstral parameters (MFCCs) reflect only the static characteristics of the speech; the dynamic characteristics of speech can be described by difference spectra of these static features (e.g. the second-order difference reflects the dynamics of speech). Combining dynamic and static characteristics improves the recognition performance of the system.
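One common form of the difference computation used for such dynamic features (the regression window size $N$ is an assumption; $N = 2$ is typical) is

$$\Delta c_t \;=\; \frac{\sum_{n=1}^{N} n\,\bigl(c_{t+n} - c_{t-n}\bigr)}{2\sum_{n=1}^{N} n^{2}},$$

where $c_t$ is the 13-dimensional MFCC vector of frame $t$; applying the same formula to $\Delta c_t$ yields the second-order difference $\Delta^{2} c_t$, and $[\,c_t,\ \Delta c_t,\ \Delta^{2} c_t\,]$ forms the 39-dimensional vector of step (2).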
Furthermore, step (3) introduces a new technical means: applying a recurrent neural network (RNN) to speaker recognition. Voice data is continuous and highly context-dependent, and RNNs are outstanding at extracting context-related features, which is why they are widely used in natural language processing and speech recognition. Here the RNN provides high-level features containing context information on top of the traditional voice features, making the features more complete and more representative.
Further, recognizing the speech features in step (4) includes classifying the speech segments formed by splicing 64 frames of features, with softmax as the classifier. The speaker of the whole utterance is then confirmed with a naive Bayes method: since one utterance yields several 64-frame segments and therefore several classification results, the speaker that receives the most softmax classification results among the features of an utterance is confirmed as the speaker of that utterance.
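A minimal sketch of this whole-utterance decision (the function and variable names are illustrative):

```python
from collections import Counter

def identify_speaker(block_predictions):
    """Step (4) decision rule: every 64-frame block has already been assigned a
    speaker label by the softmax classifier; the label that occurs most often
    across the blocks of the utterance is taken as the final speaker."""
    return Counter(block_predictions).most_common(1)[0][0]

# For example, block-level predictions from one utterance:
# identify_speaker([7, 7, 3, 7, 12, 7])  ->  7
```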
The principle of the invention is as follows:
the invention provides an RNN-based voiceprint recognition method, overcomes the defects that the conventional deep learning-based method does not consider the context-related essence of a voice signal, the extracted features cannot represent a speaker well, the advantages of deep learning are not fully exerted, and the like, and has the characteristics of strong adaptability, good performance and high result accuracy. The method comprises four steps: firstly, denoising input voice data by adopting a spectral subtraction method, and taking the obtained pure voice data with channel noise eliminated as the input of training data. Pure voice data is divided into frames according to the frame length of 25ms and the frame shift of 10ms, MFCC characteristic parameters of each frame are respectively calculated, front 13-dimensional MFCC characteristic parameters are selected and are continuously calculated to calculate first-order and second-order differences, front 13-dimensional characteristics are respectively extracted and spliced into a 39-dimensional characteristic vector to serve as the characteristic parameters of the voice signal of the frame, each voice can be divided into hundreds of frames, 39-dimensional characteristics of every 64 frames of voice signals are combined into a 64 x 39 two-dimensional voice acoustic characteristic parameter, voice signals of less than 64 frames are discarded, and labels of all two-dimensional voice acoustic characteristic parameters generated by voice data spoken by the same speaker are represented by the identity of the same speaker and serve as the input of a neural network. The recurrent neural network has 64 LSTM units (equal to the number of rows of input data), with 256 neurons in each LSTM unit. The method adopts a one-way circulation neural network, so that the last LSTM unit contains the information of all the units, and the output of the last LSTM unit is used as the final voice characteristic to enter a recognition stage. And classifying the obtained speech features by using softmax and obtaining the result. Because a speech segment can obtain a plurality of speech features, that is, a speech segment may obtain a plurality of results, according to the naive bayes method, among the plurality of speech features obtained by a speech segment, the speaker with the largest classification result obtained by softmax is identified as the speaker to which the speech segment belongs.
The invention mainly comprises the following four aspects:
and preprocessing voice data. In an actual scene, due to the difference of recording equipment and environments, collected voice data can generate more channel noise, and great difficulty is brought to an identification task. Therefore, there is a need for a limited method for preprocessing speech data to improve the accuracy of the algorithmic prediction. The method adopts a spectral subtraction method to carry out denoising on the voice data, removes channel noise and completely saves all information related to a speaker.
Extraction and splicing of acoustic features of the voice signal. The acoustic features commonly used for voiceprint or speech recognition are MFCCs; however, speech is a continuous signal with strong context correlation, and conventional MFCC features alone cannot represent the data well because they reflect only the static characteristics of the speech parameters. The dynamic characteristics of speech can be described by difference spectra of these static features (e.g. the second-order difference reflects the dynamics of speech), and combining dynamic and static characteristics improves the recognition performance of the system. After the MFCC features and their second-order difference features are spliced, and considering that the neural network expects two-dimensional input, the 39-dimensional features of every 64 frames are combined into one 64×39 two-dimensional block, remainders of fewer than 64 frames are discarded, and each block is labeled with the speaker's identity as the input of the neural network.
Extraction of high-level features with an RNN. Speech is continuous data and highly context-dependent. RNNs have proven outstanding at extracting context-related features and have been very successful in speech recognition, natural language processing and related fields. In this method the recurrent network model has 256 hidden neurons and is unrolled over 64 time steps, each step sharing the same network model. Because a unidirectional recurrent network is used, the last LSTM unit contains the information of all preceding units, and its output is taken as the final voice feature for the recognition stage.
Speaker recognition with softmax combined with naive Bayes. A whole utterance is divided into many frames, each frame produces an MFCC feature (including its second-order difference), and every 64 frames of features entering the recurrent network yield one result, so an utterance yields multiple results. The method classifies each 64-frame feature input with softmax; since an utterance may yield multiple results, according to the naive Bayes method the speaker with the most softmax classification results among the features of that utterance is confirmed as its speaker.
Compared with the prior art, the invention has the advantages that:
1. The invention provides extraction and splicing of acoustic features of the voice signal. The acoustic features commonly used for voiceprint or speech recognition are MFCCs; however, speech is a continuous signal with strong context correlation, and conventional MFCC features alone cannot represent the data well because they reflect only static characteristics. The dynamic characteristics of speech can be described by difference spectra of these static features (e.g. the second-order difference), and combining dynamic and static characteristics improves recognition performance. After the MFCC features and their second-order difference features are spliced, and since the neural network expects two-dimensional input, the 39-dimensional features of every 64 frames are combined into one 64×39 block, remainders of fewer than 64 frames are discarded, and each block is labeled with the speaker's identity as the input of the neural network.
2. The invention provides extraction of high-level features with an RNN. Speech is continuous data and highly context-dependent. RNNs have proven outstanding at extracting context-related features and have been very successful in speech recognition, natural language processing and related fields. The recurrent network in this method has 64 LSTM units (equal to the number of rows of the input data), each with 256 neurons. Because a unidirectional recurrent network is used, the last LSTM unit contains the information of all preceding units, and its output is taken as the final voice feature for the recognition stage.
3. The invention provides speaker confirmation with softmax combined with naive Bayes. A whole utterance is divided into many frames, each frame produces an MFCC feature (including its second-order difference), and every 64 frames of features entering the recurrent network yield one result, so an utterance yields multiple results. The method classifies each 64-frame feature input with softmax; since an utterance may yield multiple results, according to the naive Bayes method the speaker with the most softmax classification results among the features of that utterance is confirmed as its speaker.
Drawings
FIG. 1 is a flowchart of the RNN-based voiceprint recognition method of the present invention;
FIG. 2 is a schematic diagram of speech denoising;
FIG. 3 is a schematic diagram of extraction and concatenation of acoustic features of a speech signal;
FIG. 4 is a schematic diagram of RNN feature extraction;
FIG. 5 is a schematic diagram of Softmax in conjunction with naive Bayes recognition.
Detailed Description
Figure 1 shows the overall processing flow of the invention. The RNN-based voiceprint recognition method mainly comprises the following steps: first, the input voice data is denoised with spectral subtraction; the resulting clean voice data is framed, and the MFCC feature parameters and their second-order differences are extracted; the MFCC parameters and second-order differences of several consecutive frames are then spliced into a two-dimensional feature parameter matrix that is used as the input of the recurrent neural network. The invention trains with the LSTM variant of the recurrent neural network; through its gate mechanisms, the forget gate, input gate and output gate (in FIG. 1, σ denotes the sigmoid activation function and tanh the hyperbolic tangent activation function), the LSTM retains long-term information of the sequence. Training yields an LSTM model used to recognize each short speech segment; finally, the recognition results of all short segments of a long utterance are counted, and the final speaker is confirmed with the naive Bayes method.
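The gate mechanism referred to above follows the standard LSTM formulation (the textbook form, not a formula stated in the patent). With $x_t$ the 39-dimensional feature of frame $t$ and $h_{t-1}$ the previous 256-dimensional hidden state,

$$\begin{aligned}
f_t &= \sigma\!\bigl(W_f\,[h_{t-1}, x_t] + b_f\bigr), \\
i_t &= \sigma\!\bigl(W_i\,[h_{t-1}, x_t] + b_i\bigr), \qquad \tilde{c}_t = \tanh\!\bigl(W_c\,[h_{t-1}, x_t] + b_c\bigr), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
o_t &= \sigma\!\bigl(W_o\,[h_{t-1}, x_t] + b_o\bigr), \qquad h_t = o_t \odot \tanh(c_t),
\end{aligned}$$

so the cell state $c_t$ carries long-term sequence information forward, and the hidden state $h_{64}$ of the last step is the final voice feature.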
The invention is further described below with reference to other figures and embodiments.
1. Voice preprocessing module
In real scenarios, differences in recording equipment and environment introduce considerable channel noise into the collected voice data, which makes the recognition task much harder. The speech data therefore needs to be preprocessed to improve the accuracy of the algorithm's predictions. The method denoises the voice data with spectral subtraction, removing channel noise while fully preserving all speaker-related information. No separate noise recording is required: the first 5 frames of the voice data, about 0.1 s during which there is no speaker voice but only the noise of the recording itself, are taken as the channel-noise template, and the noise is removed with spectral subtraction to obtain clean voice data. As shown in FIG. 2, the noisy speech is first transformed with the FFT while its phase information is retained; the noise power spectrum is then subtracted from the power spectrum to obtain the power spectrum of the clean speech, which is combined with the previously retained phase and transformed back with the inverse FFT (IFFT) to recover the clean speech.
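A minimal numpy sketch of this denoising step, assuming 16 kHz audio so that 25 ms / 10 ms frames are 400 / 160 samples (the windowing, the simple overlap-add and all names are illustrative assumptions):

```python
import numpy as np

def spectral_subtract(noisy, frame_len=400, hop=160, n_noise_frames=5):
    """Spectral subtraction as described above: the first ~0.1 s (here 5 frames)
    is assumed to contain only channel noise; its average power spectrum is
    subtracted from every frame and the original phase is kept for the IFFT."""
    frames = np.stack([noisy[i:i + frame_len]
                       for i in range(0, len(noisy) - frame_len + 1, hop)])
    spec = np.fft.rfft(frames * np.hanning(frame_len), axis=1)
    phase = np.angle(spec)                                # keep the noisy phase
    power = np.abs(spec) ** 2
    noise_power = power[:n_noise_frames].mean(axis=0)     # channel-noise template
    clean_power = np.maximum(power - noise_power, 0.0)    # subtract, floor at zero
    clean_spec = np.sqrt(clean_power) * np.exp(1j * phase)
    clean_frames = np.fft.irfft(clean_spec, n=frame_len, axis=1)

    # Overlap-add the denoised frames back into a waveform.
    out = np.zeros(len(noisy))
    for k, start in enumerate(range(0, len(noisy) - frame_len + 1, hop)):
        out[start:start + frame_len] += clean_frames[k]
    return out
```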
2. Extraction and splicing of acoustic features of voice signals
The acoustic features commonly used for voiceprint or speech recognition are MFCCs; however, speech is a continuous signal with strong context correlation, and conventional MFCC features alone cannot represent the data well because they reflect only static characteristics. The dynamic characteristics of speech can be described by difference spectra of these static features (e.g. the second-order difference), and combining dynamic and static characteristics improves recognition performance. The invention frames the clean voice data obtained in the previous step, extracts the MFCC feature parameters and their second-order differences, and splices the MFCC parameters and second-order differences of several consecutive frames into a two-dimensional feature parameter matrix used as the input of the recurrent neural network. As shown in FIG. 3, the speech is framed with a frame length of 25 ms and a frame shift of 10 ms, the MFCC features and their second-order difference features are computed and spliced, the 39-dimensional features of every 64 frames are combined into one 64×39 two-dimensional acoustic feature block, and voice signals of fewer than 64 frames are discarded; all blocks produced from voice data spoken by the same speaker are labeled with that speaker's identity and used as the input of the neural network. For example, a 15 s utterance can be divided into 1498 frames and therefore produces 23 feature inputs of size 64×39, all labeled with the same speaker.
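The numbers in this example can be checked directly (a small illustrative computation, not code from the patent):

```python
# A 15 s utterance framed with a 25 ms window and a 10 ms shift:
duration_ms, win_ms, hop_ms = 15_000, 25, 10
n_frames = 1 + (duration_ms - win_ms) // hop_ms   # 1 + 1497 = 1498 frames
n_blocks = n_frames // 64                          # 23 complete 64x39 blocks
print(n_frames, n_blocks)                          # -> 1498 23 (26 leftover frames are dropped)
```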
3. Extraction of high-level features using an RNN
Speech is continuous data and highly context-dependent. RNNs have proven outstanding at extracting context-related features and have been very successful in speech recognition, natural language processing and related fields. The recurrent network in this method has 64 LSTM units (equal to the number of rows of the input data), each with 256 neurons. Because a unidirectional recurrent network is used, the last LSTM unit contains the information of all preceding units, and its output is taken as the final voice feature for the recognition stage. As shown in FIG. 4, the 64×39-dimensional features are fed into the neural network, i.e. the 39-dimensional features of each frame are supplied in sequence; since the training set used in this method contains 251 speakers, the final feature output is a 251-dimensional vector that is processed by softmax during recognition.
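Continuing the hypothetical PyTorch sketch from step (3), the shapes in the recognition stage would look as follows (the class name and the 251-speaker setting follow the description above; everything else is assumed):

```python
import torch

model = VoiceprintRNN(n_speakers=251)   # sketch class defined after step (3)
blocks = torch.randn(8, 64, 39)         # 8 feature blocks: 64 frames x 39 dims each
logits = model(blocks)                  # shape (8, 251): one score per enrolled speaker
probs = torch.softmax(logits, dim=-1)   # per-block speaker probabilities
block_labels = probs.argmax(dim=-1)     # block-level decisions passed to the vote in FIG. 5
```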
4. Recognition of speaker by softmax in cooperation with naive Bayes
A whole utterance is divided into many frames, each frame produces an MFCC feature (including its second-order difference), and every 64 frames of features entering the recurrent network yield one result, so an utterance yields multiple results. The method classifies each 64-frame feature input with softmax; since an utterance may yield multiple results, according to the naive Bayes method the speaker with the most softmax classification results among the features of that utterance is confirmed as its speaker. As shown in FIG. 5, the test speech is denoised in the same way as the training speech, its feature parameters are extracted and spliced to obtain several feature inputs (for example, 23 feature inputs from a 15 s utterance), and these are fed into the LSTM model to obtain 23 classification results; the speaker corresponding to the most frequent label is the speaker of the whole utterance.
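Tying the hypothetical sketches above together, test-time recognition of one utterance might look as follows (spectral_subtract, speech_to_feature_blocks, VoiceprintRNN and identify_speaker are the illustrative helpers defined earlier, not patent-specified APIs):

```python
import torch

def recognize_speaker(noisy_wav, model, sr=16000):
    """Denoise, cut into 64x39 feature blocks, classify each block with the
    LSTM + softmax model, and return the most frequent speaker label."""
    clean = spectral_subtract(noisy_wav)
    blocks, _ = speech_to_feature_blocks(clean, sr=sr)
    with torch.no_grad():
        logits = model(torch.tensor(blocks, dtype=torch.float32))
    block_labels = logits.argmax(dim=-1).tolist()
    return identify_speaker(block_labels)
```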
Technical contents not described in detail in the present invention belong to the well-known techniques of those skilled in the art.
Although illustrative embodiments of the invention have been described above to help those skilled in the art understand it, the invention is not limited to the scope of those embodiments. Various changes apparent to those skilled in the art are permitted as long as they remain within the spirit and scope of the invention as defined by the appended claims, and everything that makes use of the inventive concepts falls under protection.

Claims (4)

1. An RNN-based voiceprint recognition method, characterized by: the method comprises the following steps:
step (1): denoising the input voice data with spectral subtraction, wherein channel noise is eliminated, the channel noise being the noise caused by the recording equipment; the clean voice data obtained after eliminating the channel noise is used as the training-data input;
step (2): framing the clean voice data obtained in step (1) with a frame length of 25 ms and a frame shift of 10 ms, each piece of voice data being divided into hundreds to thousands of frames; computing the MFCC feature parameters of each frame, keeping the first 13 MFCC dimensions, computing their first-order and second-order differences, extracting the first 13 dimensions of each, and splicing them into a 39-dimensional feature vector serving as the feature parameters of that frame; combining the 39-dimensional features of every 64 frames into one 64×39 two-dimensional acoustic feature parameter block, discarding voice signals of fewer than 64 frames, and labeling all blocks produced from voice data spoken by the same speaker with that speaker's identity as the input of the neural network;
step (3): feeding the two-dimensional data obtained in step (2) into the training of the recurrent neural network; the recurrent neural network has 64 LSTM units; each LSTM unit has 256 hidden neurons, the network is unrolled over 64 time steps, and each step shares the same network model; the recurrent neural network is unidirectional, so the last LSTM unit contains the information of all preceding LSTM units, and the output of the last LSTM unit is taken as the final voice feature and passed to the recognition stage;
step (4): recognizing the voice features obtained in step (3) and determining, with a naive Bayes method, the speaker to whom the voice data belongs; specifically: classifying each 64-frame feature input using softmax as the classifier, and, according to the naive Bayes method, confirming the speaker with the most softmax classification results among the voice features obtained from a speech segment as the speaker to whom that segment belongs.
2. The voiceprint recognition method according to claim 1, characterized in that: in step (1), the input voice data is denoised with spectral subtraction, only the channel noise is eliminated and the silent segments are retained.
3. The voiceprint recognition method according to claim 1, characterized in that: the MFCC characteristics in the step (2) take the auditory characteristics of human beings into account; the linear spectrum is first mapped into the Mel nonlinear spectrum based on auditory perception and then converted onto the cepstrum.
4. The voiceprint recognition method according to claim 1, characterized in that: in step (4), recognizing the voice features includes classifying the speech segments formed by splicing 64 frames of voice features.
CN201711070510.6A 2017-11-03 2017-11-03 Voiceprint recognition method based on RNN Active CN107731233B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711070510.6A CN107731233B (en) 2017-11-03 2017-11-03 Voiceprint recognition method based on RNN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711070510.6A CN107731233B (en) 2017-11-03 2017-11-03 Voiceprint recognition method based on RNN

Publications (2)

Publication Number Publication Date
CN107731233A CN107731233A (en) 2018-02-23
CN107731233B true CN107731233B (en) 2021-02-09

Family

ID=61222539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711070510.6A Active CN107731233B (en) 2017-11-03 2017-11-03 Voiceprint recognition method based on RNN

Country Status (1)

Country Link
CN (1) CN107731233B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288974B (en) * 2018-03-19 2024-04-05 北京京东尚科信息技术有限公司 Emotion recognition method and device based on voice
CN108831450A (en) * 2018-03-30 2018-11-16 杭州鸟瞰智能科技股份有限公司 A kind of virtual robot man-machine interaction method based on user emotion identification
CN108922513B (en) * 2018-06-04 2023-03-17 平安科技(深圳)有限公司 Voice distinguishing method and device, computer equipment and storage medium
CN109256139A (en) * 2018-07-26 2019-01-22 广东工业大学 A kind of method for distinguishing speek person based on Triplet-Loss
CN108877812B (en) * 2018-08-16 2021-04-02 桂林电子科技大学 Voiceprint recognition method and device and storage medium
CN112863547B (en) * 2018-10-23 2022-11-29 腾讯科技(深圳)有限公司 Virtual resource transfer processing method, device, storage medium and computer equipment
US11114103B2 (en) * 2018-12-28 2021-09-07 Alibaba Group Holding Limited Systems, methods, and computer-readable storage media for audio signal processing
CN109712628B (en) * 2019-03-15 2020-06-19 哈尔滨理工大学 Speech noise reduction method and speech recognition method of DRNN noise reduction model established based on RNN
CN109903774A (en) * 2019-04-12 2019-06-18 南京大学 A kind of method for recognizing sound-groove based on angle separation loss function
CN110444223B (en) * 2019-06-26 2023-05-23 平安科技(深圳)有限公司 Speaker separation method and device based on cyclic neural network and acoustic characteristics
CN111951791A (en) * 2020-08-26 2020-11-17 上海依图网络科技有限公司 Voiceprint recognition model training method, recognition method, electronic device and storage medium
CN112420056A (en) * 2020-11-04 2021-02-26 乐易欢 Speaker identity authentication method and system based on variational self-encoder and unmanned aerial vehicle
CN113823290A (en) * 2021-08-31 2021-12-21 杭州电子科技大学 Multi-feature fusion voiceprint recognition method
CN114040052B (en) * 2021-11-01 2024-01-19 江苏号百信息服务有限公司 Method for identifying audio collection and effective audio screening of telephone voiceprint

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160372119A1 (en) * 2015-06-19 2016-12-22 Google Inc. Speech recognition with acoustic models
CN106682089A (en) * 2016-11-26 2017-05-17 山东大学 RNNs-based method for automatic safety checking of short message
WO2017112466A1 (en) * 2015-12-21 2017-06-29 Microsoft Technology Licensing, Llc Multi-speaker speech separation

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101923855A (en) * 2009-06-17 2010-12-22 复旦大学 Test-irrelevant voice print identifying system
CN103971690A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN104157290B (en) * 2014-08-19 2017-10-24 大连理工大学 A kind of method for distinguishing speek person based on deep learning
US9824684B2 (en) * 2014-11-13 2017-11-21 Microsoft Technology Licensing, Llc Prediction-based sequence recognition
CN104408483B (en) * 2014-12-08 2017-08-25 西安电子科技大学 SAR texture image classification methods based on deep neural network
CN104952448A (en) * 2015-05-04 2015-09-30 张爱英 Method and system for enhancing features by aid of bidirectional long-term and short-term memory recurrent neural networks
KR102313028B1 (en) * 2015-10-29 2021-10-13 삼성에스디에스 주식회사 System and method for voice recognition
CN106128465A (en) * 2016-06-23 2016-11-16 成都启英泰伦科技有限公司 A kind of Voiceprint Recognition System and method
CN106919662B (en) * 2017-02-14 2021-08-31 复旦大学 Music identification method and system
CN107220588A (en) * 2017-04-20 2017-09-29 苏州神罗信息科技有限公司 A kind of real-time gesture method for tracing based on cascade deep neutral net
CN106952649A (en) * 2017-05-14 2017-07-14 北京工业大学 Method for distinguishing speek person based on convolutional neural networks and spectrogram

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160372119A1 (en) * 2015-06-19 2016-12-22 Google Inc. Speech recognition with acoustic models
WO2017112466A1 (en) * 2015-12-21 2017-06-29 Microsoft Technology Licensing, Llc Multi-speaker speech separation
CN106682089A (en) * 2016-11-26 2017-05-17 山东大学 RNNs-based method for automatic safety checking of short message

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"deep neural network based text-dependent speaker verification:preliminary results";Bhattacharya G;《Odyssey》;20161231;全文 *
"RNN-BLSTM声学模型的说话人自适应方法研究";黄智颖;《万方平台》;20170828;全文 *
"speaker recognition using artificial neural network";Fazal Mueen;《IEEE students conference 》;20021231;全文 *

Also Published As

Publication number Publication date
CN107731233A (en) 2018-02-23

Similar Documents

Publication Publication Date Title
CN107731233B (en) Voiceprint recognition method based on RNN
Kabir et al. A survey of speaker recognition: Fundamental theories, recognition methods and opportunities
Tirumala et al. Speaker identification features extraction methods: A systematic review
CN108305616B (en) Audio scene recognition method and device based on long-time and short-time feature extraction
CN111276131B (en) Multi-class acoustic feature integration method and system based on deep neural network
Zhang et al. Deep belief networks based voice activity detection
EP3156978A1 (en) A system and a method for secure speaker verification
Devi et al. Automatic speaker recognition from speech signals using self organizing feature map and hybrid neural network
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
KR20010102549A (en) Speaker recognition
CN113053410B (en) Voice recognition method, voice recognition device, computer equipment and storage medium
Khdier et al. Deep learning algorithms based voiceprint recognition system in noisy environment
CN110136726A (en) A kind of estimation method, device, system and the storage medium of voice gender
Sinha et al. Acoustic-phonetic feature based dialect identification in Hindi Speech
El-Moneim et al. Text-dependent and text-independent speaker recognition of reverberant speech based on CNN
Sun et al. A novel convolutional neural network voiceprint recognition method based on improved pooling method and dropout idea
Karthikeyan Adaptive boosted random forest-support vector machine based classification scheme for speaker identification
Birla A robust unsupervised pattern discovery and clustering of speech signals
Mohammed et al. Advantages and disadvantages of automatic speaker recognition systems
Dennis et al. Generalized Hough transform for speech pattern classification
Imam et al. Speaker recognition using automated systems
Nguyen et al. Vietnamese speaker authentication using deep models
GS et al. Synthetic speech classification using bidirectional LSTM Networks
Nainan et al. Performance evaluation of text independent automatic speaker recognition using VQ and GMM
Golik et al. Mobile music modeling, analysis and recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231206

Address after: Room 903-60, Building 59, Xiangshan Huijing Business Center, No. 2, Houtang Road, the Taihu Lake National Tourism Resort, Suzhou, Jiangsu

Patentee after: Suzhou Fuji Robot Co.,Ltd.

Address before: 100191 1010 Xueyuan international building, No.1 Zhichun Road, Haidian District, Beijing

Patentee before: Wang Huafeng

TR01 Transfer of patent right