WO2020073518A1 - Method, apparatus, computer device and storage medium for voiceprint verification - Google Patents

Method, apparatus, computer device and storage medium for voiceprint verification

Info

Publication number
WO2020073518A1
Authority
WO
WIPO (PCT)
Prior art keywords
voiceprint
frame
voice
voiceprint feature
speech
Prior art date
Application number
PCT/CN2018/124401
Other languages
English (en)
French (fr)
Inventor
杨翘楚
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020073518A1 publication Critical patent/WO2020073518A1/zh

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering

Definitions

  • The present application relates to the field of voiceprint verification, and in particular to a method, device, computer equipment and storage medium for voiceprint verification.
  • the main purpose of the present application is to provide a voiceprint verification method, which aims to solve the technical problem that the noise in the existing voice data adversely affects the voiceprint verification effect.
  • This application proposes a method for voiceprint verification, including: inputting a voice signal to be voiceprint-verified into a VAD model, and distinguishing the speech frames and noise frames in the voice signal; removing the noise frames to obtain purified voice data composed of the speech frames; extracting a first voiceprint feature corresponding to the purified voice data; judging whether the similarity between the first voiceprint feature and a pre-stored voiceprint feature satisfies a preset condition; and if so, determining that the first voiceprint feature is the same as the pre-stored voiceprint feature, and otherwise that it is not.
  • This application also provides a voiceprint verification device, including:
  • a distinguishing module, used to input the voice signal to be voiceprint-verified into the VAD model and distinguish the speech frames and noise frames in the voice signal;
  • a removing module, configured to remove the noise frames and obtain purified voice data composed of the speech frames;
  • an extraction module, used to extract the first voiceprint feature corresponding to the purified voice data;
  • a judgment module, used to judge whether the similarity between the first voiceprint feature and the pre-stored voiceprint feature satisfies a preset condition;
  • a determining module, configured to determine, if the preset condition is satisfied, that the first voiceprint feature is the same as the pre-stored voiceprint feature, and otherwise that it is not.
  • the present application also provides a computer device, including a memory and a processor, where the memory stores a computer program, and when the processor executes the computer program, the steps of the foregoing method are implemented.
  • the present application also provides a computer non-volatile readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the steps of the above method are implemented.
  • This application obtains purified voice data by identifying and removing the noise data in the voice signal, and then performs voiceprint recognition on the purified voice data, improving the accuracy of voiceprint verification.
  • This application uses the GMM-VAD model, combining local decisions and global decisions, to accurately distinguish noise data from voice data, improving the degree to which the voice signal is purified and further improving the accuracy of voiceprint verification.
  • This application uses a GMM-UBM to map each voiceprint feature vector to a low-dimensional voiceprint discrimination vector (i-vector), which reduces the computational cost of voiceprint feature extraction and the cost of voiceprint verification.
  • During voiceprint verification, this application compares the first voiceprint feature against the pre-stored data of multiple people, reducing the equal error rate of voiceprint verification and the loss of verification accuracy caused by model errors.
  • FIG. 1 is a schematic flowchart of a method for voiceprint verification according to an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of a device for voiceprint verification according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an internal structure of a computer device according to an embodiment of the present application.
  • a method of voiceprint verification includes:
  • S1 Input the voice signal to be voiceprint-verified into the VAD model, and distinguish the speech frames and noise frames in the voice signal.
  • The VAD model of this embodiment, also known as a voice endpoint detector, is used to detect whether human speech is present in a noisy environment.
  • The VAD model scores each input frame of the speech signal, i.e., computes the probability that the frame is a speech frame or a noise frame; when the speech-frame probability is greater than a preset decision threshold, the frame is judged to be a speech frame, and otherwise a noise frame.
  • The VAD model distinguishes speech frames from noise frames according to this decision, so that the noise frames can be removed from the speech signal.
  • The decision threshold in this embodiment adopts the default decision threshold in the WebRTC source code, which was obtained by analyzing a large amount of data during the development of the WebRTC technology; this improves the discrimination effect and accuracy while reducing the model-training workload of the VAD model.
  • According to the above distinction, the data marked as noise frames are cut out, and the remaining speech frames are arranged consecutively in their original time order to form the purified voice data.
  • Alternatively, the data marked as speech frames can be selected and saved based on the above discrimination results, and the extracted speech frames arranged consecutively in their original time order to form the purified voice data.
  • The purified voice data removes the background noise of the environment in which identity registration or identity verification takes place, i.e., noise not originating from the speaker, reducing the effect of noise data in the voice signal on voiceprint verification and improving the verification success rate.
  • Only the first voiceprint feature corresponding to the purified voice data is analyzed, which reduces the amount of calculation in voiceprint verification while improving its effectiveness, pertinence and timeliness.
  • S4 Determine whether the similarity between the first voiceprint feature and the pre-stored voiceprint feature satisfies a preset condition.
  • the preset conditions in this embodiment include a specified preset threshold range, or a specified sorting, etc., which can be customized according to specific application scenarios to meet personalized usage requirements more widely.
  • If it is determined that the first voiceprint feature is the same as the pre-stored voiceprint feature, a verification-passed result is fed back to the client; otherwise a verification-failed result is fed back, so that the client can act on the result. For example, after verification passes, a smart door is controlled to open.
  • As another example, after verification fails a specified number of times, the security system is controlled to lock the screen to prevent criminals from further damaging the electronic banking system.
  • The VAD model of this embodiment includes a Fourier transform and the Gaussian mixture distributions GMM-NOISE and GMM-SPEECH, and step S1 includes:
  • S100 Input the voice signal into a Fourier transform in a VAD model, and convert the voice signal from a time domain signal form to a frequency domain signal form.
  • The Fourier transform in the VAD model converts the time-domain signal one-to-one into frequency-domain form, and the properties of each frame of the speech signal are analyzed, making it easier to distinguish speech frames from noise frames.
  • S101 Input each frame data of the voice signal in the form of a frequency domain signal into GMM-NOISE and GMM-SPEECH respectively to perform VAD judgment to distinguish between the voice frame and the noise frame in the voice signal.
  • This embodiment preferably uses a VAD model based on Gaussian mixtures (GMM): for each input frame of the speech signal in frequency-domain form, it extracts the energy in 6 frequency bands as the feature vector of that frame, and models noise and speech in each of the 6 bands with Gaussian mixture distributions.
  • Each band has a noise model GMM-NOISE with two Gaussian components and a speech model GMM-SPEECH with two Gaussian components.
  • The above 6 frequency bands are set according to the WebRTC technology based on the spectral differences between noise and speech, in order to improve analysis accuracy and compatibility with WebRTC.
  • The number of analysis bands in other embodiments of this application need not be 6 and can be set according to actual needs.
  • Since China's AC mains standard is 220 V at 50 Hz, 50 Hz power-line interference can leak into the microphone that collects the voice signal, and both the collected interference signal and physical vibration have an impact; this embodiment therefore preferably collects the voice signal above 80 Hz to reduce AC interference. Since the highest frequency that speech reaches is 4 kHz, this embodiment preferably places the band boundaries at spectral valleys within the range of 80 Hz to 4 kHz.
  • the VAD decision in this embodiment includes local decision (Local Decision) and global decision (Global Decision).
  • step S101 of this embodiment includes:
  • S1010 Input each frame of the speech signal in frequency-domain form into GMM-NOISE and GMM-SPEECH respectively, to obtain the noise-frame probability and the speech-frame probability of each frame.
  • Each frame of the speech signal to be classified as a speech frame or a noise frame is input into GMM-NOISE and GMM-SPEECH respectively, yielding a noise-frame probability value and a speech-frame probability value; comparing the two determines whether the frame is a noise frame or a speech frame.
  • S1011 Compute the local log-likelihood ratio of each band according to the log likelihood, LLR_local,i = log(p_speech,i / p_noise,i).
  • This embodiment preferably uses the GMM-based VAD model, which extracts the energy of each input frame in 6 frequency bands as that frame's feature vector, so in this embodiment the value of n is 6.
  • S1012 Determine whether the local log-likelihood ratio is higher than the local threshold.
  • The local decision is used to distinguish speech frames from noise frames; in this embodiment it is made once per frequency band, 6 times in total.
  • The likelihood ratio is an indicator of authenticity, a composite indicator reflecting both sensitivity and specificity, which improves the accuracy of the probability estimate.
  • The parameters of the GMMs in this embodiment are adaptively updated: after each frame of the speech signal is judged to be a speech frame or a noise frame, the parameters of the corresponding model are updated according to that frame's feature values. For example, if the frame is judged to be a speech frame, the mean, standard deviation and Gaussian-component weights of GMM-SPEECH are updated once according to the frame's feature values. As more and more speech frames are input into GMM-SPEECH, it adapts increasingly well to the voiceprint characteristics of the speaker of this voice signal, and the analysis conclusions it gives become more accurate.
  • In another embodiment of the present application, after step S1012 the method includes the following.
  • The local decision is performed first, and then the global decision.
  • The global decision computes a weighted sum over the frequency bands based on the local decision results, to improve the accuracy of distinguishing speech frames from noise frames.
  • S1015 Determine whether the global log-likelihood ratio is higher than a global threshold.
  • The global log-likelihood ratio is compared with the global threshold to further improve the accuracy of screening speech frames.
  • If the local decision already finds speech, the global decision may be skipped, improving the efficiency of voiceprint verification and recognizing as many speech frames as possible to avoid speech distortion.
  • Alternatively, when the local decision finds speech, a global decision may still be made to further verify and confirm the presence of speech, improving the accuracy of distinguishing speech frames from noise frames.
  • step S3 of this embodiment includes:
  • The process of extracting MFCC (Mel Frequency Cepstrum Coefficient) voiceprint features is as follows. First, sampling and quantization: the continuous analog speech signal of the purified voice data is sampled with a certain sampling period into a discrete signal, and the discrete signal is quantized into a digital signal according to certain coding rules. Then pre-emphasis: owing to the physiological characteristics of the human body, the high-frequency components of the speech signal are often suppressed, and pre-emphasis compensates for them. Then framing: owing to the short-time stationarity of the speech signal, the signal is divided into frames (generally 10 to 30 milliseconds per frame) for spectrum analysis, and features are extracted frame by frame. Then windowing, which reduces the discontinuity of the signal at the beginning and end of each frame; a Hamming window is used. Then a DFT converts each frame from the time domain to the frequency domain, the signal is mapped from the linear spectral domain to the Mel spectral domain, the log energy output by each filter of a Mel triangular filter bank is computed, and a discrete cosine transform (DCT) of the log-energy sequence yields the MFCC voiceprint features of the frame.
  • S31 Construct a voiceprint feature vector corresponding to each voice frame according to the voiceprint features of each MFCC type.
  • the MFCC type voiceprint features have non-linear characteristics, which makes the analysis results in each frequency band closer to the characteristics of the real voice emitted by the human body, makes the voiceprint feature extraction more accurate, and improves the effect of voiceprint verification.
  • Each voiceprint feature vector is mapped to a low-dimensional voiceprint discrimination vector (i-vector), which reduces the computational cost of voiceprint feature extraction and the cost of voiceprint verification.
  • The training process of the GMM-UBM in this embodiment is as follows. B1: Acquire a preset number (for example, 100,000) of voice data samples, each corresponding to one voiceprint discrimination vector; the samples may be collected from the speech of different people in different environments, and are used to train a universal background model (GMM-UBM) that characterizes general speech properties. B2: Process each voice data sample to extract its preset-type voiceprint features, and construct the corresponding voiceprint feature vector from them. B3: Divide all constructed voiceprint feature vectors into a training set of a first percentage and a verification set of a second percentage, where the first and second percentages are less than or equal to 100%. B4: Train the second model with the voiceprint feature vectors in the training set, and after training verify the accuracy of the trained second model on the verification set. B5: If the accuracy exceeds a preset accuracy (e.g., 98.5%), training ends; otherwise, increase the number of voice data samples and re-execute steps B2, B3, B4 and B5 on the enlarged sample set.
  • The voiceprint discrimination vector of this embodiment is expressed as an i-vector. Compared with the dimensionality of the Gaussian space, the i-vector's dimensionality is lower, which reduces computational cost; the low-dimensional i-vector is extracted by mapping the low-dimensional vector w into the higher-dimensional Gaussian space through multiplication with a transformation matrix T. Extraction of the i-vector includes the following steps: the preset-type voiceprint feature vectors (for example, MFCC) extracted from a target speaker's processed training speech are input into the GMM-VAD model to obtain a Gaussian supervector characterizing the probability distribution of that speech over the Gaussian components; the lower-dimensional i-vector is then obtained from m_r = μ + T·ω_r, where m_r is the Gaussian supervector representing the speech, μ is the mean supervector of the second model, T is the transformation matrix mapping the low-dimensional i-vector ω_r into the high-dimensional Gaussian space, and T is trained with the EM algorithm.
  • step S4 of this embodiment includes:
  • S40 Acquire respective corresponding pre-stored voiceprint features from pre-stored voiceprint feature data of multiple persons, wherein the voiceprint feature data of multiple persons includes pre-stored voiceprint features of the target person.
  • the pre-stored voiceprint feature data of multiple persons including the target person is used to determine whether the voiceprint feature of the currently collected voice signal is the same as the target person's voiceprint feature, so as to improve the judgment accuracy.
  • S41 Calculate the similarity value between each of the pre-stored voiceprint features and the first voiceprint feature separately.
  • the similarity value in this embodiment represents the similarity between the pre-stored voiceprint feature and the first voiceprint feature.
  • the method for obtaining the similarity value in this embodiment includes obtaining the feature distance value between the pre-stored voiceprint feature and the first voiceprint feature.
  • the above feature distance value includes a cosine distance value, an Euclidean distance value, and the like.
  • The similarity values between each pre-stored voiceprint feature and the first voiceprint feature are sorted from largest to smallest, so as to analyze more accurately the similarity distribution between the first voiceprint feature and the pre-stored voiceprint features, and thereby verify the first voiceprint feature more accurately.
  • S43 Determine whether the top preset number of similarity values includes the similarity value corresponding to the target person's pre-stored voiceprint feature.
  • The preset number may be 1, 2, 3, etc., and can be set according to usage requirements.
  • Other embodiments of the present application implement effective voiceprint verification by setting a distance threshold between the first voiceprint feature and the pre-stored voiceprint feature of the target user.
  • For example, with a preset threshold of 0.6: if the computed cosine distance between the first voiceprint feature and the target user's pre-stored voiceprint feature is less than or equal to the threshold, the two are determined to be the same and verification passes; if it is greater than the threshold, they are determined to be different and verification fails.
  • step S41 of this embodiment includes:
  • S410 Use the cosine distance formula cos(x, y) = (x · y) / (‖x‖ ‖y‖) to calculate the cosine distance between each pre-stored voiceprint feature and the first voiceprint feature, where x denotes each pre-stored voiceprint discrimination vector and y denotes the voiceprint discrimination vector of the first voiceprint feature.
  • This embodiment uses the cosine distance to represent the similarity between each pre-stored voiceprint feature and the first voiceprint feature; the smaller the distance value, the closer or more alike the two voiceprint features are.
  • S411 Convert the cosine distance value into the similarity value, where the smallest cosine distance value corresponds to the largest similarity value.
  • The cosine distance value can be converted into a similarity value according to an inverse-proportion formula carrying a specified inverse-proportion coefficient.
  • In this embodiment, purified voice data is obtained by identifying and removing the noise data in the voice signal, and voiceprint recognition is then performed on the purified voice data, improving the accuracy of voiceprint verification.
  • The GMM-VAD model combines local decisions and global decisions to accurately distinguish noise data from voice data, improving the degree to which the voice signal is purified and further improving the accuracy of voiceprint verification.
  • Each voiceprint feature vector is mapped to a low-dimensional voiceprint discrimination vector (i-vector), reducing the computational cost of voiceprint feature extraction and the cost of voiceprint verification.
  • Comparative analysis against the pre-stored data of multiple people reduces the equal error rate of voiceprint verification and the loss of verification accuracy caused by model errors.
  • a voiceprint verification device includes:
  • the distinguishing module 1 is used for inputting the voice signal to be voiceprint verified into the VAD model, and distinguishing the voice frame and the noise frame in the voice signal.
  • The VAD model of this embodiment, also known as a voice endpoint detector, is used to detect whether human speech is present in a noisy environment.
  • The VAD model scores each input frame of the speech signal, i.e., computes the probability that the frame is a speech frame or a noise frame; when the speech-frame probability is greater than a preset decision threshold, the frame is judged to be a speech frame, and otherwise a noise frame.
  • The VAD model distinguishes speech frames from noise frames according to this decision, so that the noise frames can be removed from the speech signal.
  • The decision threshold in this embodiment adopts the default decision threshold in the WebRTC source code, which was obtained by analyzing a large amount of data during the development of the WebRTC technology; this improves the discrimination effect and accuracy while reducing the model-training workload of the VAD model.
  • the removing module 2 is used to remove noise frames and obtain purified voice data composed of each of the voice frames.
  • According to the above distinction, the data marked as noise frames are cut out, and the remaining speech frames are arranged consecutively in their original time order to form the purified voice data.
  • Alternatively, the data marked as speech frames can be selected and saved based on the above discrimination results, and the extracted speech frames arranged consecutively in their original time order to form the purified voice data.
  • The purified voice data removes the background noise of the environment in which identity registration or identity verification takes place, i.e., noise not originating from the speaker, reducing the effect of noise data in the voice signal on voiceprint verification and improving the verification success rate.
  • the extraction module 3 is used to extract the first voiceprint feature corresponding to the purified voice data.
  • Only the first voiceprint feature corresponding to the purified voice data is analyzed, which reduces the amount of calculation in voiceprint verification while improving its effectiveness, pertinence and timeliness.
  • the judgment module 4 is used to judge whether the similarity between the first voiceprint feature and the pre-stored voiceprint feature satisfies the preset condition.
  • the preset conditions in this embodiment include a specified preset threshold range, or a specified sorting, etc., which can be customized according to specific application scenarios to meet personalized usage requirements more widely.
  • the determining module 5 is configured to determine that the first voiceprint feature is the same as the pre-stored voiceprint feature if the preset condition is satisfied, otherwise it is not the same.
  • If it is determined that the first voiceprint feature is the same as the pre-stored voiceprint feature, a verification-passed result is fed back to the client; otherwise a verification-failed result is fed back, so that the client can act on the result. For example, after verification passes, a smart door is controlled to open.
  • As another example, after verification fails a specified number of times, the security system is controlled to lock the screen to prevent criminals from further damaging the electronic banking system.
  • The VAD model of this embodiment includes a Fourier transform and the Gaussian mixture distributions GMM-NOISE and GMM-SPEECH, and the above-mentioned distinguishing module 1 includes:
  • a transformation unit is used to input the speech signal into the Fourier transform in the VAD model, and transform the speech signal from the time domain signal form to the frequency domain signal form.
  • The Fourier transform in the VAD model converts the time-domain signal one-to-one into frequency-domain form, and the properties of each frame of the speech signal are analyzed, making it easier to distinguish speech frames from noise frames.
  • the discriminating unit is used to input each frame data of the voice signal in the form of a frequency domain signal into GMM-NOISE and GMM-SPEECH respectively to make VAD judgment, so as to distinguish the voice frame and the noise frame in the voice signal.
  • This embodiment preferably uses a VAD model based on Gaussian mixtures (GMM): for each input frame of the speech signal in frequency-domain form, it extracts the energy in 6 frequency bands as the feature vector of that frame, and models noise and speech in each of the 6 bands with Gaussian mixture distributions.
  • Each band has a noise model GMM-NOISE with two Gaussian components and a speech model GMM-SPEECH with two Gaussian components.
  • The above 6 frequency bands are set according to the WebRTC technology based on the spectral differences between noise and speech, in order to improve analysis accuracy and compatibility with WebRTC.
  • The number of analysis bands in other embodiments of this application need not be 6 and can be set according to actual needs.
  • Since China's AC mains standard is 220 V at 50 Hz, 50 Hz power-line interference can leak into the microphone that collects the voice signal, and both the collected interference signal and physical vibration have an impact; this embodiment therefore preferably collects the voice signal above 80 Hz to reduce AC interference. Since the highest frequency that speech reaches is 4 kHz, this embodiment preferably places the band boundaries at spectral valleys within the range of 80 Hz to 4 kHz.
  • the VAD decision in this embodiment includes local decision (Local Decision) and global decision (Global Decision).
  • the distinguishing unit in this embodiment includes:
  • an input subunit, used to input each frame of the speech signal in frequency-domain form into GMM-NOISE and GMM-SPEECH respectively, to obtain the noise-frame probability and the speech-frame probability of each frame.
  • Each frame of the speech signal to be classified as a speech frame or a noise frame is input into GMM-NOISE and GMM-SPEECH respectively, yielding a noise-frame probability value and a speech-frame probability value; comparing the two determines whether the frame is a noise frame or a speech frame.
  • a first calculation subunit, used to calculate the local log-likelihood ratio of each band according to the log likelihood, LLR_local,i = log(p_speech,i / p_noise,i).
  • This embodiment preferably uses the GMM-based VAD model, which extracts the energy of each input frame in 6 frequency bands as that frame's feature vector, so in this embodiment the value of n is 6.
  • a first judgment subunit, used to judge whether the local log-likelihood ratio is higher than the local threshold.
  • The local decision is used to distinguish speech frames from noise frames; in this embodiment it is made once per frequency band, 6 times in total.
  • The likelihood ratio is an indicator of authenticity, a composite indicator reflecting both sensitivity and specificity, which improves the accuracy of the probability estimate.
  • With the speech-frame probability already greater than the noise-frame probability, this embodiment further checks whether the local log-likelihood ratio is higher than the local threshold, to ensure the accuracy of judging the frame to be a speech frame.
  • a first determining subunit, used to determine, if the local log-likelihood ratio is higher than the local threshold, that the frames whose local log-likelihood ratio is higher than the local threshold are speech frames.
  • The parameters of the GMMs in this embodiment are adaptively updated: after each frame of the speech signal is judged to be a speech frame or a noise frame, the parameters of the corresponding model are updated according to that frame's feature values. For example, if the frame is judged to be a speech frame, the mean, standard deviation and Gaussian-component weights of GMM-SPEECH are updated once according to the frame's feature values; as more and more speech frames are input into GMM-SPEECH, it adapts increasingly well to the voiceprint characteristics of the speaker of this voice signal, and the analysis conclusions it gives become more accurate.
  • the distinguishing unit in another embodiment of the present application includes:
  • a second calculation subunit, used to calculate, if the local log-likelihood ratio is not higher than the local threshold, the global log-likelihood ratio as the weighted sum of the per-band local log-likelihood ratios, LLR_global = Σ_i w_i · LLR_local,i.
  • The local decision is made first, followed by the global decision.
  • The global decision computes a weighted sum over the frequency bands based on the local decision results, to improve the accuracy of distinguishing speech frames from noise frames.
  • the second judgment subunit is used to judge whether the global log-likelihood ratio is higher than a global threshold.
  • the global log-likelihood ratio is compared with the global threshold to further improve the accuracy of screening speech frames.
  • the second determination subunit is used to determine that the frame data whose global log likelihood ratio is higher than the global threshold value is a voice frame if the global log likelihood ratio is higher than the global threshold value.
  • If the local decision already finds speech, the global decision may be skipped, improving the efficiency of voiceprint verification and recognizing as many speech frames as possible to avoid speech distortion.
  • Alternatively, when the local decision finds speech, a global decision may still be made to further verify and confirm the presence of speech, improving the accuracy of distinguishing speech frames from noise frames.
  • the extraction module 3 of this embodiment includes:
  • the extracting unit is used for extracting MFCC type voiceprint features corresponding to the speech frames in the purified speech data.
  • The process of extracting MFCC (Mel Frequency Cepstrum Coefficient) voiceprint features is as follows. First, sampling and quantization: the continuous analog speech signal of the purified voice data is sampled with a certain sampling period into a discrete signal, and the discrete signal is quantized into a digital signal according to certain coding rules. Then pre-emphasis: owing to the physiological characteristics of the human body, the high-frequency components of the speech signal are often suppressed, and pre-emphasis compensates for them. Then framing: owing to the short-time stationarity of the speech signal, the signal is divided into frames (generally 10 to 30 milliseconds per frame) for spectrum analysis, and features are extracted frame by frame. Then windowing, which reduces the discontinuity of the signal at the beginning and end of each frame; a Hamming window is used. Then a DFT converts each frame from the time domain to the frequency domain, the signal is mapped from the linear spectral domain to the Mel spectral domain, the log energy output by each filter of a Mel triangular filter bank is computed, and a discrete cosine transform (DCT) of the log-energy sequence yields the MFCC voiceprint features of the frame.
  • the construction unit is configured to construct a voiceprint feature vector corresponding to each voice frame according to each MFCC type voiceprint feature.
  • the MFCC type voiceprint features have non-linear characteristics, which makes the analysis results in each frequency band closer to the characteristics of the real voice emitted by the human body, makes the voiceprint feature extraction more accurate, and improves the effect of voiceprint verification.
  • the mapping unit is configured to map each voiceprint feature vector to a low-dimensional voiceprint discrimination vector I-vector to obtain the first voiceprint feature corresponding to each of the voice frames in the purified voice data.
  • Each voiceprint feature vector is mapped to a low-dimensional voiceprint discrimination vector (i-vector), which reduces the computational cost of voiceprint feature extraction and the cost of voiceprint verification.
  • The training process of the GMM-UBM in this embodiment is as follows. B1: Acquire a preset number (for example, 100,000) of voice data samples, each corresponding to one voiceprint discrimination vector; the samples may be collected from the speech of different people in different environments, and are used to train a universal background model (GMM-UBM) that characterizes general speech properties. B2: Process each voice data sample to extract its preset-type voiceprint features, and construct the corresponding voiceprint feature vector from them. B3: Divide all constructed voiceprint feature vectors into a training set of a first percentage and a verification set of a second percentage, where the first and second percentages are less than or equal to 100%. B4: Train the second model with the voiceprint feature vectors in the training set, and after training verify the accuracy of the trained second model on the verification set. B5: If the accuracy exceeds a preset accuracy (e.g., 98.5%), training ends; otherwise, increase the number of voice data samples and re-execute steps B2, B3, B4 and B5 on the enlarged sample set.
  • The voiceprint discrimination vector of this embodiment is expressed as an i-vector. Compared with the dimensionality of the Gaussian space, the i-vector's dimensionality is lower, which reduces computational cost; the low-dimensional i-vector is extracted by mapping the low-dimensional vector w into the higher-dimensional Gaussian space through multiplication with a transformation matrix T. Extraction of the i-vector includes the following steps: the preset-type voiceprint feature vectors (for example, MFCC) extracted from a target speaker's processed training speech are input into the GMM-VAD model to obtain a Gaussian supervector characterizing the probability distribution of that speech over the Gaussian components; the lower-dimensional i-vector is then obtained from m_r = μ + T·ω_r, where m_r is the Gaussian supervector representing the speech, μ is the mean supervector of the second model, T is the transformation matrix mapping the low-dimensional i-vector ω_r into the high-dimensional Gaussian space, and T is trained with the EM algorithm.
  • judgment module 4 of this embodiment includes:
  • the obtaining unit is configured to respectively obtain pre-stored voiceprint features corresponding to pre-stored voiceprint feature data of a plurality of persons, wherein the voiceprint feature data of the plurality of persons includes pre-stored voiceprint features of the target person.
  • the pre-stored voiceprint feature data of multiple persons including the target person is used to determine whether the voiceprint feature of the currently collected voice signal is the same as the target person's voiceprint feature, so as to improve the accuracy of judgment.
  • the calculation unit is used to calculate the similarity value between each pre-stored voiceprint feature and the first voiceprint feature, respectively.
  • the similarity value in this embodiment represents the similarity between the pre-stored voiceprint feature and the first voiceprint feature.
  • the method for obtaining the similarity value in this embodiment includes obtaining the feature distance value between the pre-stored voiceprint feature and the first voiceprint feature.
  • the above feature distance value includes a cosine distance value, an Euclidean distance value, and the like.
  • a sorting unit is used to sort the similarity values in descending order.
  • The similarity values between each pre-stored voiceprint feature and the first voiceprint feature are sorted from largest to smallest, so as to analyze more accurately the similarity distribution between the first voiceprint feature and the pre-stored voiceprint features, and thereby verify the first voiceprint feature more accurately.
  • a judging unit, configured to judge whether the top preset number of similarity values includes the similarity value corresponding to the target person's pre-stored voiceprint feature.
  • If the top preset number of similarity values includes the similarity value corresponding to the target person's pre-stored voiceprint feature, the first voiceprint feature is determined to be the same as the target person's pre-stored voiceprint feature, reducing the recognition equal error rate caused by model errors; the equal error rate is the point at which the frequency of verification failing when it should pass equals the frequency of verification passing when it should fail.
  • The preset number of similarity values in this embodiment may be 1, 2, 3, etc., and can be set according to usage requirements.
  • the determining unit is configured to determine that the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets the preset condition if the similarity value corresponding to the pre-stored voiceprint feature of the target person is included, otherwise the preset condition is not satisfied.
  • Other embodiments of the present application implement effective voiceprint verification by setting a distance threshold between the first voiceprint feature and the pre-stored voiceprint feature of the target user.
  • For example, with a preset threshold of 0.6: if the computed cosine distance between the first voiceprint feature and the target user's pre-stored voiceprint feature is less than or equal to the threshold, the two are determined to be the same and verification passes; if it is greater than the threshold, they are determined to be different and verification fails.
  • calculation unit of this embodiment includes:
  • a third calculation subunit, used to calculate, through the cosine distance formula cos(x, y) = (x · y) / (‖x‖ ‖y‖), the cosine distance between each pre-stored voiceprint feature and the first voiceprint feature, where x denotes each pre-stored voiceprint discrimination vector and y denotes the voiceprint discrimination vector of the first voiceprint feature.
  • This embodiment uses the cosine distance to represent the similarity between each pre-stored voiceprint feature and the first voiceprint feature; the smaller the distance value, the closer or more alike the two voiceprint features are.
  • a conversion subunit, configured to convert the cosine distance value into the similarity value, where the smallest cosine distance value corresponds to the largest similarity value.
  • The cosine distance value can be converted into a similarity value according to an inverse-proportion formula carrying a specified inverse-proportion coefficient.
  • an embodiment of the present application further provides a computer device.
  • The computer device may be a server, and its internal structure may be as shown in FIG. 3.
  • The computer device includes a processor, a memory, a network interface and a database connected by a system bus, where the processor of the computer device provides computing and control capabilities.
  • The memory of the computer device includes a non-volatile storage medium and an internal memory.
  • The non-volatile storage medium stores an operating system, computer-readable instructions, and a database.
  • The internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium.
  • The database of the computer device is used to store data such as voiceprint verification data.
  • The network interface of the computer device is used to communicate with external terminals through a network connection.
  • FIG. 3 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied.
  • An embodiment of the present application further provides a computer non-volatile readable storage medium on which computer-readable instructions are stored; when the computer-readable instructions are executed, the processes of the foregoing method embodiments are performed.
  • The above are only preferred embodiments of the present application and do not limit its patent scope; any equivalent structural or process transformation made using the contents of the specification and drawings of this application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

This application discloses a method of voiceprint verification, including: inputting a voice signal to be voiceprint-verified into a VAD model, and distinguishing the speech frames and noise frames in the voice signal; removing the noise frames to obtain purified voice data composed of the speech frames; extracting a first voiceprint feature corresponding to the purified voice data; judging whether the similarity between the first voiceprint feature and a pre-stored voiceprint feature satisfies a preset condition; and if so, determining that the first voiceprint feature is the same as the pre-stored voiceprint feature, and otherwise that it is not.

Description

Method, apparatus, computer device and storage medium for voiceprint verification
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on October 11, 2018, with application number 2018111846939 and invention title "Method, apparatus, computer device and storage medium for voiceprint verification", the entire content of which is incorporated into this application by reference.
Technical Field
This application relates to the field of voiceprint verification, and in particular to a method, apparatus, computer device and storage medium for voiceprint verification.
Background
At present, the business scope of many large financial companies covers insurance, banking, investment and other categories. Each category usually requires communicating with customers and performing anti-fraud screening, so customer identity verification and anti-fraud identification have become an important part of ensuring business security. In the customer identity verification step, voiceprint verification is adopted by many companies because of its real-time capability and convenience. The inventors realized that in practical applications, owing to the environment in which the speaker performs identity registration or identity verification, the collected voice data often carries background noise that does not come from the speaker, which has become one of the main factors affecting the success rate of voiceprint verification.
Technical Problem
The main purpose of this application is to provide a method of voiceprint verification, aiming to solve the technical problem that noise in existing voice data adversely affects the voiceprint verification result.
Technical Solution
This application proposes a method of voiceprint verification, including:
inputting a voice signal to be voiceprint-verified into a VAD model, and distinguishing the speech frames and noise frames in the voice signal;
removing the noise frames to obtain purified voice data composed of the speech frames;
extracting a first voiceprint feature corresponding to the purified voice data;
judging whether the similarity between the first voiceprint feature and a pre-stored voiceprint feature satisfies a preset condition;
and if so, determining that the first voiceprint feature is the same as the pre-stored voiceprint feature, and otherwise that it is not.
This application also provides an apparatus for voiceprint verification, including:
a distinguishing module, used to input the voice signal to be voiceprint-verified into the VAD model and distinguish the speech frames and noise frames in the voice signal;
a removing module, used to remove the noise frames and obtain purified voice data composed of the speech frames;
an extraction module, used to extract the first voiceprint feature corresponding to the purified voice data;
a judgment module, used to judge whether the similarity between the first voiceprint feature and the pre-stored voiceprint feature satisfies a preset condition;
a determining module, used to determine, if the preset condition is satisfied, that the first voiceprint feature is the same as the pre-stored voiceprint feature, and otherwise that it is not.
This application also provides a computer device, including a memory and a processor, where the memory stores a computer program and the processor implements the steps of the above method when executing the computer program.
This application also provides a computer non-volatile readable storage medium on which a computer program is stored, where the computer program implements the steps of the above method when executed by a processor.
Beneficial Effects
This application obtains purified voice data by identifying and removing the noise data in the voice signal, and then performs voiceprint recognition on the purified voice data, improving the accuracy of voiceprint verification. Through the GMM-VAD model, combining local decisions and global decisions, it accurately distinguishes noise data from voice data, improving the degree to which the voice signal is purified and further improving verification accuracy. Based on the GMM-UBM, it maps each voiceprint feature vector to a low-dimensional voiceprint discrimination vector (i-vector), reducing the computational cost of voiceprint feature extraction and the cost of voiceprint verification. During verification it performs comparative analysis against the pre-stored data of multiple people, reducing the equal error rate of voiceprint verification and the loss of verification accuracy caused by model errors.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a method of voiceprint verification according to an embodiment of this application;
FIG. 2 is a schematic structural diagram of an apparatus for voiceprint verification according to an embodiment of this application;
FIG. 3 is a schematic diagram of the internal structure of a computer device according to an embodiment of this application.
Best Mode for Carrying Out the Invention
Referring to FIG. 1, a method of voiceprint verification according to an embodiment of this application includes:
S1: Input the voice signal to be voiceprint-verified into a VAD model, and distinguish the speech frames and noise frames in the voice signal.
The VAD model of this embodiment, also known as a voice endpoint detector, is used to detect whether human speech is present in a noisy environment. The VAD model scores each input frame of the voice signal, i.e., computes the probability that the frame is a speech frame or a noise frame; when the speech-frame probability is greater than a preset decision threshold, the frame is judged to be a speech frame, and otherwise a noise frame. The VAD model distinguishes speech frames from noise frames according to this decision, so that the noise frames can be removed from the voice signal. The decision threshold of this embodiment adopts the default decision threshold in the WebRTC source code, which was obtained by analyzing a large amount of data during the development of the WebRTC technology; this improves the discrimination effect and accuracy while reducing the model-training workload of the VAD model.
S2: Remove the noise frames to obtain purified voice data composed of the speech frames.
According to the above distinction, this embodiment cuts out the data marked as noise frames and arranges the remaining speech frames consecutively in their original time order to form the purified voice data composed of the speech frames. Other embodiments of this application may instead use the discrimination results to select and save the data marked as speech frames, and arrange the extracted speech frames consecutively in their original time order to form the purified voice data. By removing the background noise of the environment in which identity registration or identity verification takes place, i.e., noise not originating from the speaker, this embodiment reduces the effect of noise data in the voice signal on voiceprint verification and improves the verification success rate.
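As an illustration of this purification step, the following minimal sketch (in Python; the function and variable names are illustrative, not from the patent) keeps only the frames the VAD labeled as speech and preserves their original time order:

```python
import numpy as np

def purify(frames: np.ndarray, is_speech: np.ndarray) -> np.ndarray:
    """Keep only the frames labeled as speech, in their original time order.

    frames    : (num_frames, frame_len) array of audio samples
    is_speech : (num_frames,) boolean array produced by the VAD
    """
    # Boolean indexing keeps rows in their original order, so the remaining
    # speech frames stay consecutive in time when concatenated back together.
    speech_frames = frames[is_speech]
    return speech_frames.reshape(-1)  # the purified voice data as one signal
```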
S3: Extract the first voiceprint feature corresponding to the purified voice data.
By analyzing only the first voiceprint feature corresponding to the purified voice data, this embodiment reduces the amount of calculation in voiceprint verification while improving its effectiveness, pertinence and timeliness.
S4: Judge whether the similarity between the first voiceprint feature and the pre-stored voiceprint feature satisfies a preset condition.
The preset condition of this embodiment includes a specified threshold range, a specified ranking, or the like, and can be customized for the specific application scenario to meet personalized usage requirements more broadly.
S5: If the condition is satisfied, determine that the first voiceprint feature is the same as the pre-stored voiceprint feature, and otherwise that it is not.
If this embodiment determines that the first voiceprint feature is the same as the pre-stored voiceprint feature, a verification-passed result is fed back to the client; otherwise a verification-failed result is fed back, so that the client can act on the result. For example, after verification passes, a smart door is controlled to open; as another example, after verification fails a specified number of times, the security system is controlled to lock the screen to prevent criminals from further damaging the electronic banking system.
Further, the VAD model of this embodiment includes a Fourier transform and the Gaussian mixture distributions GMM-NOISE and GMM-SPEECH, and step S1 includes:
S100: Input the voice signal into the Fourier transform in the VAD model, and convert the voice signal from time-domain form to frequency-domain form.
This embodiment uses the Fourier transform in the VAD model to convert the time-domain signal one-to-one into frequency-domain form, and analyzes the properties of each frame of the speech signal, making it easier to distinguish speech frames from noise frames.
S101: Input each frame of the speech signal in frequency-domain form into GMM-NOISE and GMM-SPEECH respectively for VAD decision, to distinguish the speech frames and noise frames in the speech signal.
This embodiment preferably uses a VAD model based on Gaussian mixtures (GMM): for each input frame of the speech signal in frequency-domain form, it extracts the energy in 6 frequency bands as the feature vector of that frame, and models noise and speech in each of the 6 bands with Gaussian mixture distributions; each band has a noise model GMM-NOISE with two Gaussian components and a speech model GMM-SPEECH with two Gaussian components. The 6 frequency bands are set according to the WebRTC technology based on the spectral differences between noise and speech, to improve analysis accuracy and compatibility with WebRTC; the number of analysis bands in other embodiments of this application need not be 6 and can be set according to actual needs. Moreover, since China's AC mains standard is 220 V at 50 Hz, 50 Hz power-line interference can leak into the microphone that collects the voice signal, and both the collected interference and physical vibration have an impact; this embodiment therefore preferably collects the voice signal above 80 Hz to reduce AC interference, and since the highest frequency that speech reaches is 4 kHz, it preferably places the band boundaries at spectral valleys within the range of 80 Hz to 4 kHz. The VAD decision of this embodiment includes a local decision (Local Decision) and a global decision (Global Decision).
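The per-band feature extraction described above can be sketched as follows; the six band edges used here are illustrative placeholders, since the patent only states that the boundaries lie at spectral valleys between 80 Hz and 4 kHz:

```python
import numpy as np

# Illustrative band edges (Hz); the patent places the real boundaries at
# spectral valleys within the 80 Hz to 4 kHz range.
BAND_EDGES = [80, 250, 500, 1000, 2000, 3000, 4000]  # 6 analysis bands

def band_features(frame: np.ndarray, sample_rate: int = 8000) -> np.ndarray:
    """Return the log energy of one frame in each of the 6 analysis bands."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2                # power spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)  # bin frequencies
    feats = []
    for lo, hi in zip(BAND_EDGES[:-1], BAND_EDGES[1:]):
        band = spectrum[(freqs >= lo) & (freqs < hi)]
        feats.append(np.log(band.sum() + 1e-10))              # log band energy
    return np.array(feats)                                    # shape (6,)
```

Each of the six feature values is then evaluated under that band's two-component GMM-NOISE and two-component GMM-SPEECH, as described next.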
Further, step S101 of this embodiment includes:
S1010: Input each frame of the speech signal in frequency-domain form into GMM-NOISE and GMM-SPEECH respectively, to obtain the noise-frame probability p_noise and the speech-frame probability p_speech of each frame.
By inputting each frame of the speech signal to be classified as a speech frame or a noise frame into GMM-NOISE and GMM-SPEECH respectively, this embodiment obtains the noise-frame probability value and speech-frame probability value of each frame as analyzed by GMM-NOISE and GMM-SPEECH, so that comparing the two values determines whether the frame is a noise frame or a speech frame.
S1011: Compute the local log-likelihood ratio of each band according to the log likelihood, LLR_local,i = log(p_speech,i / p_noise,i).
This embodiment preferably uses the GMM-based VAD model, which extracts the energy of each input frame in 6 frequency bands as that frame's feature vector, so in this embodiment the value of n is 6. When judging each frame, 6 local decisions are made, one per band; as long as any one of them regards the frame as a speech frame, the frame is retained.
S1012: Judge whether the local log-likelihood ratio is higher than the local threshold.
Through the local decision, this embodiment distinguishes speech frames from noise frames; the local decision is made once per frequency band, 6 times in total. The likelihood ratio is an indicator of authenticity, a composite indicator reflecting both sensitivity and specificity, which improves the accuracy of the probability estimate. With the speech-frame probability already ensured to be greater than the noise-frame probability, this embodiment further checks whether the local log-likelihood ratio is higher than the local threshold, to ensure the accuracy of judging the speech signal to be a speech frame.
S1013: If so, determine that the frames whose local log-likelihood ratio is higher than the local threshold are speech frames.
The parameters of the GMMs in this embodiment are adaptively updated: after each frame of the speech signal is judged to be a speech frame or a noise frame, the parameters of the corresponding model are updated according to that frame's feature values. For example, if the frame is judged to be a speech frame, the mean, standard deviation and Gaussian-component weights of GMM-SPEECH are updated once according to the frame's feature values; as more and more speech frames are input into GMM-SPEECH, it adapts increasingly well to the voiceprint characteristics of the speaker of this voice signal, and the analysis conclusions it gives become more accurate.
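A sketch of the per-band local decision follows, assuming each band i has a two-component speech GMM and a two-component noise GMM; the GMM parameters and threshold are placeholders, and WebRTC's actual decision logic and adaptation rules are more involved:

```python
import numpy as np

def gmm_pdf(x: float, weights, means, stds) -> float:
    """Likelihood of a scalar feature x under a 1-D Gaussian mixture."""
    w, m, s = map(np.asarray, (weights, means, stds))
    return float(np.sum(w * np.exp(-0.5 * ((x - m) / s) ** 2)
                        / (s * np.sqrt(2.0 * np.pi))))

def local_decision(feats, speech_gmms, noise_gmms, local_threshold):
    """Per-band local VAD decision for one frame.

    feats: (6,) band features; speech_gmms / noise_gmms: 6 tuples of
    (weights, means, stds). The frame is kept as speech if the local
    log-likelihood ratio exceeds the local threshold in ANY of the 6 bands.
    """
    llrs = []
    for x, sg, ng in zip(feats, speech_gmms, noise_gmms):
        p_speech = gmm_pdf(x, *sg)  # speech-frame likelihood in this band
        p_noise = gmm_pdf(x, *ng)   # noise-frame likelihood in this band
        llrs.append(np.log(p_speech + 1e-12) - np.log(p_noise + 1e-12))
    llrs = np.array(llrs)
    return bool(np.any(llrs > local_threshold)), llrs
```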
Further, in another embodiment of this application, after step S1012 the method includes:
S1014: If the local log-likelihood ratio is not higher than the local threshold, compute the global log-likelihood ratio as the weighted sum of the per-band local ratios, LLR_global = Σ_{i=1..n} w_i · LLR_local,i.
This embodiment performs the local decision first and then the global decision; the global decision computes a weighted sum over the frequency bands based on the local decision results, improving the accuracy of distinguishing speech frames from noise frames.
S1015: Judge whether the global log-likelihood ratio is higher than a global threshold.
In the global decision of this embodiment, the global log-likelihood ratio is compared with the global threshold to further improve the accuracy of screening speech frames.
S1016: If the global log-likelihood ratio is higher than the global threshold, determine that the frames whose global log-likelihood ratio is higher than the global threshold are speech frames.
In this embodiment, if the local decision already finds speech, the global decision may be skipped, improving the efficiency of voiceprint verification and recognizing as many speech frames as possible to avoid speech distortion. Other embodiments of this application may still make the global decision when the local decision finds speech, to further verify and confirm the presence of speech and improve the accuracy of distinguishing speech frames from noise frames.
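The global decision can then be sketched as a weighted sum of the per-band local log-likelihood ratios compared against a global threshold; equal band weights are used here purely for illustration (WebRTC uses fixed per-band weights):

```python
import numpy as np

def global_decision(local_llrs: np.ndarray,
                    band_weights: np.ndarray,
                    global_threshold: float) -> bool:
    """Weighted sum of the local log-likelihood ratios vs. a global threshold."""
    global_llr = float(np.dot(band_weights, local_llrs))
    return global_llr > global_threshold

# Example with illustrative equal weights over the 6 bands:
# is_speech = global_decision(llrs, np.full(6, 1.0 / 6.0), global_threshold=0.5)
```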
Further, step S3 of this embodiment includes:
S30: Extract the MFCC-type voiceprint features corresponding to each speech frame in the purified voice data.
The process of extracting MFCC (Mel Frequency Cepstrum Coefficient) voiceprint features in this embodiment is as follows. First, sampling and quantization: the continuous analog speech signal of the purified voice data is sampled with a certain sampling period into a discrete signal, and the discrete signal is quantized into a digital signal according to certain coding rules. Then pre-emphasis: owing to the physiological characteristics of the human body, the high-frequency components of the speech signal are often suppressed, and pre-emphasis compensates for them. Then framing: owing to the short-time stationarity of the speech signal, the signal is divided into frames (generally 10 to 30 milliseconds per frame) for spectrum analysis, and features are extracted frame by frame. Then windowing, which reduces the discontinuity of the signal at the beginning and end of each frame; a Hamming window is used. Then a DFT is applied to each frame to convert the signal from the time domain to the frequency domain, after which the signal is mapped from the linear spectral domain to the Mel spectral domain using the formula Mel(f) = 2595 · log10(1 + f/700). The converted frame signal is fed into a bank of Mel triangular filters, the log energy of each filter's output is computed to obtain a log-energy sequence, and a discrete cosine transform (DCT, Discrete Cosine Transform) of that sequence yields the MFCC-type voiceprint features of the frame.
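The pipeline just described can be condensed into the following sketch (pre-emphasis, framing, Hamming window, DFT, Mel filter bank, log energies, DCT); the filter-bank construction is simplified and the parameter values are common defaults rather than values specified in the patent:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=8000, frame_ms=25, n_filters=26, n_coeffs=13):
    # Pre-emphasis compensates the suppressed high-frequency components.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing (10 to 30 ms per frame) exploits short-time stationarity.
    flen = int(sr * frame_ms / 1000)
    n_frames = len(emphasized) // flen
    frames = emphasized[:n_frames * flen].reshape(n_frames, flen)
    frames = frames * np.hamming(flen)                # windowing
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # DFT -> power spectrum
    # Mel filter bank built from Mel(f) = 2595 * log10(1 + f / 700).
    mel_max = 2595.0 * np.log10(1.0 + (sr / 2.0) / 700.0)
    hz_pts = 700.0 * (10.0 ** (np.linspace(0, mel_max, n_filters + 2) / 2595.0) - 1.0)
    bins = np.floor((flen + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_filters, power.shape[1]))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)      # log filter-bank energies
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_coeffs]  # DCT
```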
S31: Construct the voiceprint feature vector corresponding to each speech frame from its MFCC-type voiceprint features.
MFCC-type voiceprint features are non-linear, which makes the analysis results in each frequency band closer to the characteristics of the real speech produced by the human body, makes voiceprint feature extraction more accurate, and improves the effect of voiceprint verification.
S32: Map each voiceprint feature vector to a low-dimensional voiceprint discrimination vector (i-vector), to obtain the first voiceprint feature corresponding to each speech frame in the purified voice data.
This embodiment uses a GMM-UBM (Gaussian Mixture Model-Universal Background Model) to map each voiceprint feature vector to a low-dimensional voiceprint discrimination vector (i-vector), reducing the computational cost of voiceprint feature extraction and the cost of voiceprint verification. The training process of the GMM-UBM of this embodiment is as follows. B1: Acquire a preset number (for example, 100,000) of voice data samples, each corresponding to one voiceprint discrimination vector; the samples may be collected from the speech of different people in different environments, and are used to train a universal background model (GMM-UBM) that characterizes general speech properties. B2: Process each voice data sample to extract its preset-type voiceprint features, and construct the corresponding voiceprint feature vector from them. B3: Divide all constructed preset-type voiceprint feature vectors into a training set of a first percentage and a verification set of a second percentage, where the first and second percentages are less than or equal to 100%. B4: Train the second model with the voiceprint feature vectors in the training set, and after training verify the accuracy of the trained second model on the verification set. B5: If the accuracy exceeds a preset accuracy (for example, 98.5%), training ends; otherwise, increase the number of voice data samples and re-execute steps B2, B3, B4 and B5 on the enlarged sample set.
The voiceprint discrimination vector of this embodiment is expressed as an i-vector. Compared with the dimensionality of the Gaussian space, the i-vector's dimensionality is lower, which reduces computational cost; the low-dimensional i-vector is extracted by mapping the low-dimensional vector w into the higher-dimensional Gaussian space through multiplication with a transformation matrix T, according to the formula below. Extraction of the i-vector includes the following steps: the preset-type voiceprint feature vectors (for example, MFCC) extracted from a target speaker's processed training speech are input into the GMM-VAD model to obtain a Gaussian supervector characterizing the probability distribution of that speech over the Gaussian components; the lower-dimensional voiceprint discrimination vector (i-vector) of the speech is then computed from m_r = μ + T·ω_r, where m_r is the Gaussian supervector representing the speech, μ is the mean supervector of the second model, T is the transformation matrix mapping the low-dimensional i-vector ω_r into the high-dimensional Gaussian space, and T is trained with the EM algorithm.
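As an illustration of the mapping m_r = μ + T·ω_r, the sketch below solves for the low-dimensional ω_r in the least-squares sense; this is a simplification for clarity, since production i-vector extractors instead compute the posterior mean of ω_r from Baum-Welch statistics:

```python
import numpy as np

def extract_ivector(supervector: np.ndarray,
                    ubm_mean: np.ndarray,
                    T: np.ndarray) -> np.ndarray:
    """Illustrative i-vector extraction from m_r = mu + T * w.

    supervector : (D,) Gaussian supervector m_r of the utterance
    ubm_mean    : (D,) mean supervector mu of the GMM-UBM
    T           : (D, d) transformation matrix (d << D), trained with EM
    """
    residual = supervector - ubm_mean
    # Least-squares solution of T @ w = m_r - mu for the low-dimensional w.
    w, *_ = np.linalg.lstsq(T, residual, rcond=None)
    return w  # the i-vector
```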
Further, step S4 of this embodiment includes:
S40: Obtain the corresponding pre-stored voiceprint features from the pre-stored voiceprint feature data of multiple people, where the voiceprint feature data of the multiple people includes the pre-stored voiceprint feature of the target person.
This embodiment uses the pre-stored voiceprint feature data of multiple people, including the target person, to judge whether the voiceprint feature of the currently collected voice signal is the same as the target person's, improving judgment accuracy.
S41: Calculate the similarity value between each pre-stored voiceprint feature and the first voiceprint feature.
The similarity value of this embodiment characterizes the similarity between a pre-stored voiceprint feature and the first voiceprint feature; the larger the value, the more similar the two. The similarity value may be obtained by comparing feature distance values between the pre-stored voiceprint feature and the first voiceprint feature, such as the cosine distance value or the Euclidean distance value.
S42: Sort the similarity values from largest to smallest.
By sorting the similarity values between the pre-stored voiceprint features and the first voiceprint feature from largest to smallest, this embodiment analyzes more accurately the similarity distribution between the first voiceprint feature and the pre-stored voiceprint features, so as to verify the first voiceprint feature more accurately.
S43: Judge whether the top preset number of similarity values includes the similarity value corresponding to the target person's pre-stored voiceprint feature.
If the top preset number of similarity values includes the similarity value corresponding to the target person's pre-stored voiceprint feature, this embodiment determines that the first voiceprint feature is the same as the target person's pre-stored voiceprint feature, reducing the recognition equal error rate caused by model errors; the equal error rate is the point at which "the frequency of verification failing when it should pass equals the frequency of verification passing when it should fail". The preset number of similarity values may be 1, 2, 3, etc., and can be set according to usage requirements.
S44: If so, determine that the similarity between the first voiceprint feature and the pre-stored voiceprint feature satisfies the preset condition; otherwise the preset condition is not satisfied.
Other embodiments of this application implement effective voiceprint verification by setting a distance threshold between the first voiceprint feature and the target user's pre-stored voiceprint feature. For example, with a preset threshold of 0.6: if the computed cosine distance between the first voiceprint feature and the target user's pre-stored voiceprint feature is less than or equal to the threshold, the two are determined to be the same and verification passes; if the distance is greater than the threshold, they are determined to be different and verification fails.
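A minimal sketch of this ranking-based check (steps S42 to S44); the dictionary layout and names are illustrative assumptions, not structures defined in the patent:

```python
def verify(similarities: dict, target_id: str, top_n: int = 1) -> bool:
    """similarities maps each enrolled person to their similarity value
    against the first voiceprint feature. Sort in descending order and
    check whether the target person appears among the top_n entries."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return target_id in ranked[:top_n]

# Example: passes because 'target' has the highest similarity of the three.
# verify({'target': 0.91, 'other1': 0.88, 'other2': 0.35}, 'target', top_n=1)
```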
Further, step S41 of this embodiment includes:
S410: Use the cosine distance formula cos(x, y) = (x · y) / (‖x‖ ‖y‖) to calculate the cosine distance value between each pre-stored voiceprint feature and the first voiceprint feature, where x denotes each pre-stored voiceprint discrimination vector and y denotes the voiceprint discrimination vector of the first voiceprint feature.
This embodiment uses the cosine distance formula to represent the similarity between each pre-stored voiceprint feature and the first voiceprint feature; the smaller the cosine distance value, the closer or more alike the two voiceprint features are.
S411: Convert the cosine distance value into the similarity value, where the smallest cosine distance value corresponds to the largest similarity value.
In this embodiment, the cosine distance value can be converted into a similarity value according to an inverse-proportion formula carrying a specified inverse-proportion coefficient.
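A sketch of steps S410 and S411; note that treating the cosine distance as 1 minus the cosine value is one common convention consistent with the patent's "smaller distance means closer" reading, and the inverse-proportion coefficient k is an illustrative placeholder:

```python
import numpy as np

def cosine_distance(x: np.ndarray, y: np.ndarray) -> float:
    """1 - cos(x, y); the smaller the value, the closer the two i-vectors."""
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1.0 - cos

def to_similarity(distance: float, k: float = 1.0) -> float:
    """Inverse-proportion conversion: the smallest cosine distance maps to
    the largest similarity value."""
    return k / (distance + 1e-9)
```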
This embodiment obtains purified voice data by identifying and removing the noise data in the voice signal, and then performs voiceprint recognition on the purified voice data, improving the accuracy of voiceprint verification. Through the GMM-VAD model, combining local decisions and global decisions, it accurately distinguishes noise data from voice data, improving the degree to which the voice signal is purified and further improving verification accuracy. Based on the GMM-UBM, it maps each voiceprint feature vector to a low-dimensional voiceprint discrimination vector (i-vector), reducing the computational cost of voiceprint feature extraction and the cost of voiceprint verification. During verification it performs comparative analysis against the pre-stored data of multiple people, reducing the equal error rate of voiceprint verification and the loss of verification accuracy caused by model errors.
Referring to FIG. 2, an apparatus for voiceprint verification according to an embodiment of this application includes:
a distinguishing module 1, used to input the voice signal to be voiceprint-verified into the VAD model and distinguish the speech frames and noise frames in the voice signal.
The VAD model of this embodiment, also known as a voice endpoint detector, is used to detect whether human speech is present in a noisy environment. The VAD model scores each input frame of the voice signal, i.e., computes the probability that the frame is a speech frame or a noise frame; when the speech-frame probability is greater than a preset decision threshold, the frame is judged to be a speech frame, and otherwise a noise frame. The VAD model distinguishes speech frames from noise frames according to this decision, so that the noise frames can be removed from the voice signal. The decision threshold of this embodiment adopts the default decision threshold in the WebRTC source code, which was obtained by analyzing a large amount of data during the development of the WebRTC technology; this improves the discrimination effect and accuracy while reducing the model-training workload of the VAD model.
a removing module 2, used to remove the noise frames and obtain purified voice data composed of the speech frames.
According to the above distinction, this embodiment cuts out the data marked as noise frames and arranges the remaining speech frames consecutively in their original time order to form the purified voice data composed of the speech frames. Other embodiments of this application may instead use the discrimination results to select and save the data marked as speech frames, and arrange the extracted speech frames consecutively in their original time order to form the purified voice data. By removing the background noise of the environment in which identity registration or identity verification takes place, i.e., noise not originating from the speaker, this embodiment reduces the effect of noise data in the voice signal on voiceprint verification and improves the verification success rate.
an extraction module 3, used to extract the first voiceprint feature corresponding to the purified voice data.
By analyzing only the first voiceprint feature corresponding to the purified voice data, this embodiment reduces the amount of calculation in voiceprint verification while improving its effectiveness, pertinence and timeliness.
a judgment module 4, used to judge whether the similarity between the first voiceprint feature and the pre-stored voiceprint feature satisfies a preset condition.
The preset condition of this embodiment includes a specified threshold range, a specified ranking, or the like, and can be customized for the specific application scenario to meet personalized usage requirements more broadly.
a determining module 5, used to determine, if the preset condition is satisfied, that the first voiceprint feature is the same as the pre-stored voiceprint feature, and otherwise that it is not.
If this embodiment determines that the first voiceprint feature is the same as the pre-stored voiceprint feature, a verification-passed result is fed back to the client; otherwise a verification-failed result is fed back, so that the client can act on the result. For example, after verification passes, a smart door is controlled to open; as another example, after verification fails a specified number of times, the security system is controlled to lock the screen to prevent criminals from further damaging the electronic banking system.
Further, the VAD model of this embodiment includes a Fourier transform and the Gaussian mixture distributions GMM-NOISE and GMM-SPEECH, and the distinguishing module 1 includes:
a conversion unit, used to input the voice signal into the Fourier transform in the VAD model and convert the voice signal from time-domain form to frequency-domain form.
This embodiment uses the Fourier transform in the VAD model to convert the time-domain signal one-to-one into frequency-domain form, and analyzes the properties of each frame of the speech signal, making it easier to distinguish speech frames from noise frames.
a distinguishing unit, used to input each frame of the speech signal in frequency-domain form into GMM-NOISE and GMM-SPEECH respectively for VAD decision, to distinguish the speech frames and noise frames in the speech signal.
This embodiment preferably uses a VAD model based on Gaussian mixtures (GMM): for each input frame of the speech signal in frequency-domain form, it extracts the energy in 6 frequency bands as the feature vector of that frame, and models noise and speech in each of the 6 bands with Gaussian mixture distributions; each band has a noise model GMM-NOISE with two Gaussian components and a speech model GMM-SPEECH with two Gaussian components. The 6 frequency bands are set according to the WebRTC technology based on the spectral differences between noise and speech, to improve analysis accuracy and compatibility with WebRTC; the number of analysis bands in other embodiments of this application need not be 6 and can be set according to actual needs. Moreover, since China's AC mains standard is 220 V at 50 Hz, 50 Hz power-line interference can leak into the microphone that collects the voice signal, and both the collected interference and physical vibration have an impact; this embodiment therefore preferably collects the voice signal above 80 Hz to reduce AC interference, and since the highest frequency that speech reaches is 4 kHz, it preferably places the band boundaries at spectral valleys within the range of 80 Hz to 4 kHz. The VAD decision of this embodiment includes a local decision (Local Decision) and a global decision (Global Decision).
Further, the distinguishing unit of this embodiment includes:
an input subunit, used to input each frame of the speech signal in frequency-domain form into GMM-NOISE and GMM-SPEECH respectively, to obtain the noise-frame probability p_noise and the speech-frame probability p_speech of each frame.
By inputting each frame of the speech signal to be classified as a speech frame or a noise frame into GMM-NOISE and GMM-SPEECH respectively, this embodiment obtains the noise-frame probability value and speech-frame probability value of each frame as analyzed by GMM-NOISE and GMM-SPEECH, so that comparing the two values determines whether the frame is a noise frame or a speech frame.
a first calculation subunit, used to compute the local log-likelihood ratio of each band according to the log likelihood, LLR_local,i = log(p_speech,i / p_noise,i).
This embodiment preferably uses the GMM-based VAD model, which extracts the energy of each input frame in 6 frequency bands as that frame's feature vector, so in this embodiment the value of n is 6; when judging each frame, 6 local decisions are made, one per band, and as long as any one of them regards the frame as a speech frame, the frame is retained.
a first judgment subunit, used to judge whether the local log-likelihood ratio is higher than the local threshold.
Through the local decision, this embodiment distinguishes speech frames from noise frames; the local decision is made once per frequency band, 6 times in total. The likelihood ratio is an indicator of authenticity, a composite indicator reflecting both sensitivity and specificity, which improves the accuracy of the probability estimate; with the speech-frame probability already ensured to be greater than the noise-frame probability, this embodiment further checks whether the local log-likelihood ratio is higher than the local threshold, to ensure the accuracy of judging the speech signal to be a speech frame.
a first determining subunit, used to determine, if the local log-likelihood ratio is higher than the local threshold, that the frames whose local log-likelihood ratio is higher than the local threshold are speech frames.
The parameters of the GMMs in this embodiment are adaptively updated: after each frame of the speech signal is judged to be a speech frame or a noise frame, the parameters of the corresponding model are updated according to that frame's feature values. For example, if the frame is judged to be a speech frame, the mean, standard deviation and Gaussian-component weights of GMM-SPEECH are updated once according to the frame's feature values; as more and more speech frames are input into GMM-SPEECH, it adapts increasingly well to the voiceprint characteristics of the speaker of this voice signal, and the analysis conclusions it gives become more accurate.
Further, the distinguishing unit of another embodiment of this application includes:
a second calculation subunit, used to compute, if the local log-likelihood ratio is not higher than the local threshold, the global log-likelihood ratio as the weighted sum of the per-band local ratios, LLR_global = Σ_{i=1..n} w_i · LLR_local,i.
This embodiment performs the local decision first and then the global decision; the global decision computes a weighted sum over the frequency bands based on the local decision results, to improve the accuracy of distinguishing speech frames from noise frames.
a second judgment subunit, used to judge whether the global log-likelihood ratio is higher than a global threshold.
In the global decision of this embodiment, the global log-likelihood ratio is compared with the global threshold to further improve the accuracy of screening speech frames.
a second determining subunit, used to determine, if the global log-likelihood ratio is higher than the global threshold, that the frames whose global log-likelihood ratio is higher than the global threshold are speech frames.
In this embodiment, if the local decision already finds speech, the global decision may be skipped, improving the efficiency of voiceprint verification and recognizing as many speech frames as possible to avoid speech distortion. Other embodiments of this application may still make the global decision when the local decision finds speech, to further verify and confirm the presence of speech and improve the accuracy of distinguishing speech frames from noise frames.
进一步地,本实施例的提取模块3,包括:
提取单元,用于提取所述净化的语音数据中各所述语音帧分别对应的MFCC类型声纹特征。
本实施例提取MFCC(Mel Frequency Cepstrum Coefficient,梅尔频率倒谱系数)类型声纹特征的过程如下:先采样和量化,将净化的语音数据的连续模拟语音信号以一定的采样周期采样,转化为离散信号,并根据一定的编码规则将离散信号量化为数字信号;然后预加重,由于人体的生理特性,语音信号的高频成分往往被压抑,预加重的作用是补偿高频成分;接着分帧处理,由于语音信号的“瞬时平稳性”,在进行频谱分析时对一段话音信号进行分帧处理(一般为10至30毫秒一帧),然后以帧为单位进行特征提取;接着加窗处理,作用是减少帧起始和帧结束对应信号的不连续性问题,采用汉明窗进行加窗处理;接着对帧信号进行DFT,将信号从时域转换到频域,然后再利用如下公式将信号从线性频谱域映射到梅尔频谱域:
Figure PCTCN2018124401-appb-000012
将转化后的帧信号输入到一组梅尔三角滤波器组,计算每个频段的滤波器输出的信号对数能量,得到一个对数能量序列;对上一步得到的对数能量序列做离散余弦变换(DCT,Discrete Cosine Transform)即可得到该帧语音信号的MFCC类型声纹特征。
a construction unit, configured to construct, from the MFCC-type voiceprint features, the voiceprint feature vectors corresponding to each voice frame.
MFCC-type voiceprint features are nonlinear, which makes the analysis results in each frequency band closer to the characteristics of real human speech, makes voiceprint feature extraction more accurate, and improves the effect of voiceprint verification.
a mapping unit, configured to map each voiceprint feature vector to a low-dimensional voiceprint identification vector I-vector, to obtain the first voiceprint features corresponding to each voice frame in the purified voice data.
This embodiment maps each voiceprint feature vector to a low-dimensional voiceprint identification vector I-vector based on a GMM-UBM (Gaussian Mixture Model-Universal Background Model), which lowers the computation cost of voiceprint feature extraction and hence the cost of using voiceprint verification. The training process of the GMM-UBM (referred to below as the second model) of this embodiment is as follows. B1: obtain a preset number (for example, 100,000) of voice data samples, each corresponding to one voiceprint identification vector; the samples may be collected from the speech of different people in different environments, and are used to train a universal background model (GMM-UBM) capable of characterizing general speech properties. B2: process each voice data sample to extract its preset-type voiceprint features, and construct from them the voiceprint feature vector corresponding to each sample. B3: divide all the constructed preset-type voiceprint feature vectors into a training set of a first percentage and a validation set of a second percentage, the sum of the first percentage and the second percentage being less than or equal to 100%. B4: train the second model with the voiceprint feature vectors in the training set, and after training is completed, validate the accuracy of the trained second model with the validation set. B5: if the accuracy is greater than a preset accuracy (for example, 98.5%), model training ends; otherwise increase the number of voice data samples and re-execute steps B2 through B5 on the enlarged sample set.
The voiceprint identification vector of this embodiment is expressed as an I-vector. An i-vector has a much lower dimension than the Gaussian supervector space, which helps reduce computation cost; the low-dimensional i-vector is extracted through the formula below, in which the low-dimensional vector ω is mapped into the higher-dimensional Gaussian space by multiplication with a transformation matrix T. I-vector extraction includes the following steps: the preset-type voiceprint feature vectors (for example, MFCC) extracted from the processed training voice data of a target speaker are input into the GMM-UBM, yielding a Gaussian supervector that characterizes the probability distribution of that voice data over the Gaussian components; the lower-dimensional voiceprint identification vector I-vector corresponding to that voice data is then computed from $m_r = \mu + T\,\omega_r$, where $m_r$ is the Gaussian supervector representing the voice data, $\mu$ is the mean supervector of the second model, and $T$ is the transformation matrix mapping the low-dimensional I-vector $\omega_r$ into the high-dimensional Gaussian space; $T$ is trained with the EM algorithm.
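To make the mapping $m_r = \mu + T\,\omega_r$ concrete, here is a deliberately simplified sketch that recovers $\omega_r$ by least squares. Production i-vector extractors instead estimate $\omega$ from Baum-Welch statistics with a posterior covariance term, so this is illustration only, with toy shapes assumed:

```python
import numpy as np

def extract_ivector(supervector, ubm_mean, T):
    """Recover the low-dimensional w in m = mu + T w by plain least squares."""
    w, *_ = np.linalg.lstsq(T, supervector - ubm_mean, rcond=None)
    return w  # the voiceprint identification vector (i-vector)

# Toy shapes: a 2048-dim Gaussian supervector mapped to a 400-dim i-vector.
rng = np.random.default_rng(0)
T = rng.standard_normal((2048, 400))     # transformation matrix (trained via EM)
mu = np.zeros(2048)                      # UBM mean supervector
m = T @ rng.standard_normal(400) + mu    # a consistent synthetic supervector
ivec = extract_ivector(m, mu, T)         # shape (400,)
```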
Further, the judgment module 4 of this embodiment includes:
an obtaining unit, configured to obtain the respective pre-stored voiceprint features from the pre-stored voiceprint feature data of multiple people, wherein the voiceprint feature data of the multiple people includes the pre-stored voiceprint feature of the target person.
In this embodiment, the pre-stored voiceprint feature data of multiple people including the target person is used jointly to judge whether the voiceprint feature of the currently collected voice signal is the same as that of the target person, which improves the accuracy of the judgment.
a calculation unit, configured to calculate the similarity values between each pre-stored voiceprint feature and the first voiceprint feature respectively.
The similarity value of this embodiment characterizes the similarity between a pre-stored voiceprint feature and the first voiceprint feature: the larger the similarity value, the more similar the two are. In this embodiment the similarity value is obtained by comparing feature distance values between the pre-stored voiceprint feature and the first voiceprint feature, such as cosine distance values or Euclidean distance values.
a sorting unit, configured to sort the similarity values in descending order.
In this embodiment, sorting the similarity values between the pre-stored voiceprint features and the first voiceprint feature in descending order allows the distribution of similarities between the first voiceprint feature and the pre-stored voiceprint features to be analyzed more accurately, so that the verification of the first voiceprint feature is obtained more accurately.
a judgment unit, configured to judge whether the preset number of top-ranked similarity values includes the similarity value corresponding to the pre-stored voiceprint feature of the target person.
In this embodiment, if the preset number of top-ranked similarity values includes the similarity value corresponding to the pre-stored voiceprint feature of the target person, the first voiceprint feature is determined to be the same as the pre-stored voiceprint feature of the target person; this reduces the equal error rate caused by model error, the equal error rate being the operating point where the rate of failed verifications that should have passed equals the rate of passed verifications that should have failed. The preset number of similarity values in this embodiment may be 1, 2 or 3, etc., and can be set according to usage needs.
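A sketch of the descending sort and top-N membership check; the speaker ids and scores below are hypothetical:

```python
def target_in_top_n(similarities, target_id, n=2):
    """similarities: dict of enrolled speaker id -> similarity to the first
    voiceprint feature. True if the target's score ranks in the top n
    (n = 1, 2 or 3 per the description)."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return target_id in ranked[:n]

scores = {'target': 0.91, 'spk_a': 0.87, 'spk_b': 0.95}  # hypothetical scores
verified = target_in_top_n(scores, 'target', n=2)         # True: ranked 2nd of 3
```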
a determination unit, configured to, if the similarity value corresponding to the pre-stored voiceprint feature of the target person is included, determine that the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets the preset condition, and otherwise that it does not.
Other embodiments of the present application achieve effective voiceprint verification by setting a distance threshold between the first voiceprint feature and the pre-stored voiceprint feature of the target user. For example, with a preset threshold of 0.6: if the computed cosine distance between the first voiceprint feature and the pre-stored voiceprint feature of the target user is less than or equal to the preset threshold, the two are determined to be the same and verification passes; if the cosine distance is greater than the preset threshold, the two are determined not to be the same and verification fails.
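The threshold variant reads, as a sketch; the 0.6 value comes from the example above, while treating the cosine distance as one minus the cosine similarity is our assumption, consistent with smaller distances meaning closer features:

```python
import numpy as np

def verify_by_threshold(x, y, threshold=0.6):
    """Pass if the cosine distance between the stored i-vector x and the
    probe i-vector y is at most the preset threshold."""
    cos_sim = float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
    distance = 1.0 - cos_sim          # smaller distance = more similar vectors
    return distance <= threshold
```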
Further, the calculation unit of this embodiment includes:
a third calculation subunit, configured to calculate the cosine distance values between each pre-stored voiceprint feature and the first voiceprint feature through the cosine distance formula
$d(x, y) = 1 - \dfrac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$
where x represents each pre-stored voiceprint identification vector and y represents the voiceprint identification vector of the first voiceprint feature.
In this embodiment the cosine distance formula above expresses the similarity between each pre-stored voiceprint feature and the first voiceprint feature: the smaller the cosine distance value, the closer to identical the two voiceprint features are.
a conversion subunit, configured to convert the cosine distance values into the similarity values, wherein the smallest cosine distance value corresponds to the largest similarity value.
In this embodiment the cosine distance values can be converted into similarity values by an inverse-proportion formula carrying a specified inverse-proportion coefficient.
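A one-line sketch of such an inverse-proportion conversion; the coefficient k and the epsilon guard are assumptions, since the text names the form but not the constant:

```python
def distance_to_similarity(d, k=1.0, eps=1e-10):
    """Inverse-proportion mapping: the smallest distance yields the largest
    similarity, as the conversion subunit requires."""
    return k / (d + eps)
```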
Referring to FIG. 3, an embodiment of the present application further provides a computer device, which may be a server whose internal structure may be as shown in FIG. 3. The computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The database of the computer device is used to store data such as voiceprint verification data. The network interface of the computer device is used to communicate with external terminals through a network connection. When executed, the computer-readable instructions perform the flows of the method embodiments described above. Those skilled in the art will understand that the structure shown in FIG. 3 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution of the present application is applied.
An embodiment of the present application further provides a computer non-volatile readable storage medium on which computer-readable instructions are stored; when executed, the computer-readable instructions perform the flows of the method embodiments described above. The foregoing are merely preferred embodiments of the present application and do not thereby limit its patent scope; any equivalent structural or flow transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of the present application.

Claims (20)

  1. A method of voiceprint verification, comprising:
    inputting a voice signal to be voiceprint verified into a VAD model, and distinguishing the voice frames and the noise frames in the voice signal;
    removing the noise frames to obtain purified voice data composed of the voice frames;
    extracting a first voiceprint feature corresponding to the purified voice data;
    judging whether the similarity between the first voiceprint feature and a pre-stored voiceprint feature meets a preset condition;
    if so, determining that the first voiceprint feature is the same as the pre-stored voiceprint feature, and otherwise that it is not the same.
  2. The method of voiceprint verification according to claim 1, wherein the VAD model includes a Fourier transform and the Gaussian-mixture-distribution models GMM-NOISE and GMM-SPEECH, and the step of inputting the voice signal into the VAD model and distinguishing the voice frames and the noise frames in the voice signal comprises:
    inputting the voice signal into the Fourier transform of the VAD model, and converting the voice signal from a time-domain signal form into a frequency-domain signal form;
    inputting each frame of data of the voice signal in frequency-domain signal form into the GMM-NOISE and GMM-SPEECH respectively for VAD decision, to distinguish the voice frames and the noise frames in the voice signal.
  3. The method of voiceprint verification according to claim 2, wherein the step of inputting each frame of data of the voice signal in frequency-domain signal form into the GMM-NOISE and GMM-SPEECH respectively for VAD decision, to distinguish the voice frames and the noise frames in the voice signal, comprises:
    inputting each frame of data of the voice signal in frequency-domain signal form into GMM-NOISE and GMM-SPEECH respectively, to obtain for each frame the noise frame probability $p_{noise,i} = \sum_{k=1}^{2} w_{k,i}^{noise}\, N(x_i;\, \mu_{k,i}^{noise},\, \sigma_{k,i}^{noise})$ and the voice frame probability $p_{speech,i} = \sum_{k=1}^{2} w_{k,i}^{speech}\, N(x_i;\, \mu_{k,i}^{speech},\, \sigma_{k,i}^{speech})$;
    calculating the local log likelihood ratio according to $\Lambda_{local,i} = \log\left(p_{speech,i} / p_{noise,i}\right)$;
    judging whether the local log likelihood ratio is higher than a local threshold;
    if so, determining the frames whose local log likelihood ratio is higher than the local threshold to be voice frames.
  4. The method of voiceprint verification according to claim 3, wherein after the step of judging whether the local log likelihood ratio is higher than the local threshold, the method comprises:
    if the local log likelihood ratio is not higher than the local threshold, calculating the global log likelihood ratio according to $\Lambda_{global} = \sum_{i=1}^{n} w_i\, \Lambda_{local,i}$;
    judging whether the global log likelihood ratio is higher than a global threshold;
    if the global log likelihood ratio is higher than the global threshold, determining the frames whose global log likelihood ratio is higher than the global threshold to be voice frames.
  5. The method of voiceprint verification according to claim 1, wherein the step of extracting the first voiceprint feature corresponding to the purified voice data comprises:
    extracting the MFCC-type voiceprint features corresponding to each voice frame in the purified voice data;
    constructing, from the MFCC-type voiceprint features, the voiceprint feature vectors corresponding to each voice frame;
    mapping each voiceprint feature vector to a low-dimensional voiceprint identification vector I-vector, to obtain the first voiceprint features corresponding to each voice frame in the purified voice data.
  6. The method of voiceprint verification according to claim 5, wherein the step of judging whether the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets the preset condition comprises:
    obtaining the respective pre-stored voiceprint features from the pre-stored voiceprint feature data of multiple people, wherein the voiceprint feature data of the multiple people includes the pre-stored voiceprint feature of a target person;
    calculating the similarity values between each pre-stored voiceprint feature and the first voiceprint feature respectively;
    sorting the similarity values in descending order;
    judging whether the preset number of top-ranked similarity values includes the similarity value corresponding to the pre-stored voiceprint feature of the target person;
    if so, determining that the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets the preset condition, and otherwise that it does not.
  7. The method of voiceprint verification according to claim 6, wherein the step of calculating the similarity values between each pre-stored voiceprint feature and the first voiceprint feature respectively comprises:
    calculating the cosine distance values between each pre-stored voiceprint feature and the first voiceprint feature through the cosine distance formula $d(x, y) = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\sqrt{\sum_{i=1}^{n} y_i^2}}$, wherein x represents each pre-stored voiceprint identification vector and y represents the voiceprint identification vector of the first voiceprint feature;
    converting the cosine distance values into the similarity values, wherein the smallest cosine distance value corresponds to the largest similarity value.
  8. A device for voiceprint verification, comprising:
    a distinguishing module, configured to input a voice signal to be voiceprint verified into a VAD model, and distinguish the voice frames and the noise frames in the voice signal;
    a removing module, configured to remove the noise frames to obtain purified voice data composed of the voice frames;
    an extraction module, configured to extract a first voiceprint feature corresponding to the purified voice data;
    a judgment module, configured to judge whether the similarity between the first voiceprint feature and a pre-stored voiceprint feature meets a preset condition;
    a determination module, configured to, if the preset condition is met, determine that the first voiceprint feature is the same as the pre-stored voiceprint feature, and otherwise that it is not the same.
  9. The device for voiceprint verification according to claim 8, wherein the VAD model includes a Fourier transform and the Gaussian-mixture-distribution models GMM-NOISE and GMM-SPEECH, and the distinguishing module comprises:
    a conversion unit, configured to input the voice signal into the Fourier transform of the VAD model, to convert the voice signal from a time-domain signal form into a frequency-domain signal form;
    a distinguishing unit, configured to input each frame of data of the voice signal in frequency-domain signal form into the GMM-NOISE and GMM-SPEECH respectively for VAD decision, to distinguish the voice frames and the noise frames in the voice signal.
  10. The device for voiceprint verification according to claim 9, wherein the distinguishing unit comprises:
    an input subunit, configured to input each frame of data of the voice signal in frequency-domain signal form into GMM-NOISE and GMM-SPEECH respectively, to obtain for each frame the noise frame probability $p_{noise,i} = \sum_{k=1}^{2} w_{k,i}^{noise}\, N(x_i;\, \mu_{k,i}^{noise},\, \sigma_{k,i}^{noise})$ and the voice frame probability $p_{speech,i} = \sum_{k=1}^{2} w_{k,i}^{speech}\, N(x_i;\, \mu_{k,i}^{speech},\, \sigma_{k,i}^{speech})$;
    a first calculation subunit, configured to calculate the local log likelihood ratio according to $\Lambda_{local,i} = \log\left(p_{speech,i} / p_{noise,i}\right)$;
    a first judgment subunit, configured to judge whether the local log likelihood ratio is higher than a local threshold;
    a first determination subunit, configured to, if the local log likelihood ratio is higher than the local threshold, determine the frames whose local log likelihood ratio is higher than the local threshold to be voice frames.
  11. The device for voiceprint verification according to claim 10, wherein the distinguishing unit comprises:
    a second calculation subunit, configured to, if the local log likelihood ratio is not higher than the local threshold, calculate the global log likelihood ratio according to $\Lambda_{global} = \sum_{i=1}^{n} w_i\, \Lambda_{local,i}$;
    a second judgment subunit, configured to judge whether the global log likelihood ratio is higher than a global threshold;
    a second determination subunit, configured to, if the global log likelihood ratio is higher than the global threshold, determine the frames whose global log likelihood ratio is higher than the global threshold to be voice frames.
  12. The device for voiceprint verification according to claim 8, wherein the extraction module comprises:
    an extraction unit, configured to extract the MFCC-type voiceprint features corresponding to each voice frame in the purified voice data;
    a construction unit, configured to construct, from the MFCC-type voiceprint features, the voiceprint feature vectors corresponding to each voice frame;
    a mapping unit, configured to map each voiceprint feature vector to a low-dimensional voiceprint identification vector I-vector, to obtain the first voiceprint features corresponding to each voice frame in the purified voice data.
  13. The device for voiceprint verification according to claim 12, wherein the judgment module comprises:
    an obtaining unit, configured to obtain the respective pre-stored voiceprint features from the pre-stored voiceprint feature data of multiple people, wherein the voiceprint feature data of the multiple people includes the pre-stored voiceprint feature of a target person;
    a calculation unit, configured to calculate the similarity values between each pre-stored voiceprint feature and the first voiceprint feature respectively;
    a sorting unit, configured to sort the similarity values in descending order;
    a judgment unit, configured to judge whether the preset number of top-ranked similarity values includes the similarity value corresponding to the pre-stored voiceprint feature of the target person;
    a determination unit, configured to, if the similarity value corresponding to the pre-stored voiceprint feature of the target person is included, determine that the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets the preset condition, and otherwise that it does not.
  14. The device for voiceprint verification according to claim 13, wherein the calculation unit comprises:
    a third calculation subunit, configured to calculate the cosine distance values between each pre-stored voiceprint feature and the first voiceprint feature through the cosine distance formula $d(x, y) = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\sqrt{\sum_{i=1}^{n} y_i^2}}$, wherein x represents each pre-stored voiceprint identification vector and y represents the voiceprint identification vector of the first voiceprint feature;
    a conversion subunit, configured to convert the cosine distance values into the similarity values, wherein the smallest cosine distance value corresponds to the largest similarity value.
  15. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein when executing the computer program the processor implements a method of voiceprint verification, the method comprising:
    inputting a voice signal to be voiceprint verified into a VAD model, and distinguishing the voice frames and the noise frames in the voice signal;
    removing the noise frames to obtain purified voice data composed of the voice frames;
    extracting a first voiceprint feature corresponding to the purified voice data;
    judging whether the similarity between the first voiceprint feature and a pre-stored voiceprint feature meets a preset condition;
    if so, determining that the first voiceprint feature is the same as the pre-stored voiceprint feature, and otherwise that it is not the same.
  16. The computer device according to claim 15, wherein the VAD model includes a Fourier transform and the Gaussian-mixture-distribution models GMM-NOISE and GMM-SPEECH, and the step of inputting the voice signal into the VAD model and distinguishing the voice frames and the noise frames in the voice signal comprises:
    inputting the voice signal into the Fourier transform of the VAD model, and converting the voice signal from a time-domain signal form into a frequency-domain signal form;
    inputting each frame of data of the voice signal in frequency-domain signal form into the GMM-NOISE and GMM-SPEECH respectively for VAD decision, to distinguish the voice frames and the noise frames in the voice signal.
  17. The computer device according to claim 16, wherein the step of inputting each frame of data of the voice signal in frequency-domain signal form into the GMM-NOISE and GMM-SPEECH respectively for VAD decision, to distinguish the voice frames and the noise frames in the voice signal, comprises:
    inputting each frame of data of the voice signal in frequency-domain signal form into GMM-NOISE and GMM-SPEECH respectively, to obtain for each frame the noise frame probability $p_{noise,i} = \sum_{k=1}^{2} w_{k,i}^{noise}\, N(x_i;\, \mu_{k,i}^{noise},\, \sigma_{k,i}^{noise})$ and the voice frame probability $p_{speech,i} = \sum_{k=1}^{2} w_{k,i}^{speech}\, N(x_i;\, \mu_{k,i}^{speech},\, \sigma_{k,i}^{speech})$;
    calculating the local log likelihood ratio according to $\Lambda_{local,i} = \log\left(p_{speech,i} / p_{noise,i}\right)$;
    judging whether the local log likelihood ratio is higher than a local threshold;
    if so, determining the frames whose local log likelihood ratio is higher than the local threshold to be voice frames.
  18. A computer non-volatile readable storage medium on which a computer program is stored, wherein when executed by a processor the computer program implements a method of voiceprint verification, the method comprising:
    inputting a voice signal to be voiceprint verified into a VAD model, and distinguishing the voice frames and the noise frames in the voice signal;
    removing the noise frames to obtain purified voice data composed of the voice frames;
    extracting a first voiceprint feature corresponding to the purified voice data;
    judging whether the similarity between the first voiceprint feature and a pre-stored voiceprint feature meets a preset condition;
    if so, determining that the first voiceprint feature is the same as the pre-stored voiceprint feature, and otherwise that it is not the same.
  19. The computer non-volatile readable storage medium according to claim 18, wherein the VAD model includes a Fourier transform and the Gaussian-mixture-distribution models GMM-NOISE and GMM-SPEECH, and the step of inputting the voice signal into the VAD model and distinguishing the voice frames and the noise frames in the voice signal comprises:
    inputting the voice signal into the Fourier transform of the VAD model, and converting the voice signal from a time-domain signal form into a frequency-domain signal form;
    inputting each frame of data of the voice signal in frequency-domain signal form into the GMM-NOISE and GMM-SPEECH respectively for VAD decision, to distinguish the voice frames and the noise frames in the voice signal.
  20. The computer non-volatile readable storage medium according to claim 19, wherein the step of inputting each frame of data of the voice signal in frequency-domain signal form into the GMM-NOISE and GMM-SPEECH respectively for VAD decision, to distinguish the voice frames and the noise frames in the voice signal, comprises:
    inputting each frame of data of the voice signal in frequency-domain signal form into GMM-NOISE and GMM-SPEECH respectively, to obtain for each frame the noise frame probability $p_{noise,i} = \sum_{k=1}^{2} w_{k,i}^{noise}\, N(x_i;\, \mu_{k,i}^{noise},\, \sigma_{k,i}^{noise})$ and the voice frame probability $p_{speech,i} = \sum_{k=1}^{2} w_{k,i}^{speech}\, N(x_i;\, \mu_{k,i}^{speech},\, \sigma_{k,i}^{speech})$;
    calculating the local log likelihood ratio according to $\Lambda_{local,i} = \log\left(p_{speech,i} / p_{noise,i}\right)$;
    judging whether the local log likelihood ratio is higher than a local threshold;
    if so, determining the frames whose local log likelihood ratio is higher than the local threshold to be voice frames.
PCT/CN2018/124401 2018-10-11 2018-12-27 Method, apparatus, computer device and storage medium for voiceprint verification WO2020073518A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811184693.9A CN109378002B (zh) 2018-10-11 2018-10-11 Method, apparatus, computer device and storage medium for voiceprint verification
CN201811184693.9 2018-10-11

Publications (1)

Publication Number Publication Date
WO2020073518A1 true WO2020073518A1 (zh) 2020-04-16

Family

ID=65403684

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/124401 WO2020073518A1 (zh) 2018-10-11 2018-12-27 声纹验证的方法、装置、计算机设备和存储介质

Country Status (2)

Country Link
CN (1) CN109378002B (zh)
WO (1) WO2020073518A1 (zh)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110265012A (zh) * 2019-06-19 2019-09-20 泉州师范学院 Interactive intelligent voice home control device and control method based on open-source hardware
CN110675878A (zh) * 2019-09-23 2020-01-10 金瓜子科技发展(北京)有限公司 Vehicle dealer identification method and apparatus, storage medium and electronic device
CN110838296B (zh) * 2019-11-18 2022-04-29 锐迪科微电子科技(上海)有限公司 Recording process control method and system, electronic device and storage medium
CN111108554A (zh) * 2019-12-24 2020-05-05 广州国音智能科技有限公司 Voiceprint recognition method based on voice noise reduction and related apparatus
CN111274434A (zh) * 2020-01-16 2020-06-12 上海携程国际旅行社有限公司 Method, system, medium and electronic device for automatic labeling of audio corpora
CN111524524B (zh) * 2020-04-28 2021-10-22 平安科技(深圳)有限公司 Voiceprint recognition method, apparatus, device and storage medium
CN114333767A (zh) * 2020-09-29 2022-04-12 华为技术有限公司 Speaker voice extraction method and apparatus, storage medium and electronic device
CN112331217B (zh) * 2020-11-02 2023-09-12 泰康保险集团股份有限公司 Voiceprint recognition method and apparatus, storage medium, electronic device
CN112735433A (zh) * 2020-12-29 2021-04-30 平安普惠企业管理有限公司 Identity verification method, apparatus, device and storage medium
CN113488059A (zh) * 2021-08-13 2021-10-08 广州市迪声音响有限公司 Voiceprint recognition method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001011606A1 (en) * 1999-08-04 2001-02-15 Ericsson, Inc. Voice activity detection in noisy speech signal
CN102479511A (zh) * 2010-11-23 2012-05-30 盛乐信息技术(上海)有限公司 Large-scale voiceprint authentication method and system
CN103236260A (zh) * 2013-03-29 2013-08-07 京东方科技集团股份有限公司 Speech recognition system
CN108109612A (zh) * 2017-12-07 2018-06-01 苏州大学 Speech recognition classification method based on adaptive dimensionality reduction
CN108428456A (zh) * 2018-03-29 2018-08-21 浙江凯池电子科技有限公司 Voice noise reduction algorithm

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105575406A (zh) * 2016-01-07 2016-05-11 深圳市音加密科技有限公司 Noise-robustness detection method based on likelihood ratio testing
CN107068154A (zh) * 2017-03-13 2017-08-18 平安科技(深圳)有限公司 Identity verification method and system based on voiceprint recognition
CN108172230A (zh) * 2018-01-03 2018-06-15 平安科技(深圳)有限公司 Voiceprint registration method based on voiceprint recognition model, terminal apparatus and storage medium
CN108154371A (zh) * 2018-01-12 2018-06-12 平安科技(深圳)有限公司 Electronic apparatus, identity verification method and storage medium


Also Published As

Publication number Publication date
CN109378002A (zh) 2019-02-22
CN109378002B (zh) 2024-05-07


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18936595; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 18936595; Country of ref document: EP; Kind code of ref document: A1)