CN106297805A

CN106297805A - A kind of method for distinguishing speek person based on respiratory characteristic

Info

Publication number: CN106297805A
Application number: CN201610626034.0A
Authority: CN
Inventors: 鲁力; 刘玲霜
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2016-08-02
Filing date: 2016-08-02
Publication date: 2017-01-04
Anticipated expiration: 2036-08-02
Also published as: CN106297805B

Abstract

The invention discloses a kind of method for distinguishing speek person based on respiratory characteristic, the method specifically includes that unknown input sound bite, by the breathing template set up by mel-frequency cepstrum coefficient MFCC, zero-crossing rate ZCR and short-time energy E extracts the respiratory murmur in unknown sound bite, then the border detection algorithm eliminating false low ebb is utilized to reject the false positive part in respiratory murmur, obtain the respiratory murmur after clean cut separation, whether the speaker of unknown sound bite is from sample speaker and judge whether the speaker of the unknown sound bite is legal speaker finally to utilize the respiratory murmur after clean cut separation to distinguish.The present invention achieves the uniqueness of human body respiration first and is paid close attention to and study, and it is effectively applied in Speaker Recognition System, overcome " extraction of breath signal " that exploitation based on the speaker Recognition Technology breathed face and " breath signal process " two is challenged greatly.Thus the Speaker Recognition System that the present invention provides is simply efficiently, and recognition result is accurately and reliably.

Description

A kind of method for distinguishing speek person based on respiratory characteristic

Technical field

A kind of method that the present invention relates to contactless biometric acquisition of signal, especially relates to a kind of based on breathing spy The method for distinguishing speek person levied.

Background technology

Speaker Identification (Speaker Recognition) is a class underlying issue, is subdivided into two classes: speaker identification asks Topic (Speaker Identification) and speaker verification's problem (Speaker Verification).The former distinguishes unknown Whether speaker is a member in speaker's sample database known to certain；The latter confirms whether the speaker's identity of statement closes Method.Identifying that speaker is divided into training and two stages of test, the training stage is for the foundation of speaker characteristic template, test phase Then calculate the similarity of test data and feature templates, and draw judged result.According to the degree of dependence difference to speech text, Speaker Identification is divided into again text relationship type (the most effective to certain special text), text independent type (any text is effective), literary composition This prompting-type (is subordinate to special text collection effective).Although phonetic feature can weaken because of mike, the reason of channel, can be by strong Health, the impact of emotion, the most imitated, but in recent years, speech processes correlation technique quickly grows, and has occurred many real Time application, make speech processes relevant issues obtain more concern and research.

The Speaker Identification scheme deposited now or based on Source-Filter (source-wave filter), or based on Source- System (source-system) model, or be simultaneously based on both and extract characteristic vector.Excitaton source information can pass through glottal signal base Residue sample linear prediction in shape represents.Channel information can be captured by cepstrum signal.Prosodic information can be held by statistics Continuous time, tone, the time dynamic of energy obtain.It is the energy source that sound produces based on aerodynamic respiratory One of, can be extracted and be processed as one section of complete voice.Existing research is devoted to breath signal in voice Detection and rejecting, in order to improve sound quality, improve speech-to-text converting algorithm, training typist and identify psychology shape Condition etc..

Source-Filter (source-wave filter) theory thinks that voice is the response of sound channel system, and gives non-linear , the good approximation of time dependent voice." source (source) " refers to 4 kinds of source speech signals: suction source, sources of friction, Glottis (sounding) source and transient state source.Sound channel act like a wave filter, its input is produced by above-mentioned 4 kinds of source speech signals, Output then forms vowel, consonant or arbitrarily voice.Sound channel also controls to manage tone and produces, voice quality, and harmonic wave, resonance is special Property, rdaiation response etc..

In source/system (source/system) model, voice is built according to linear slowly varying discrete-time system Mould.By this system of pulse excitation paracycle in the random noise in unvoiced speech source or speech sound source.Source comprises tone The phonetic feature easily made mistakes.Therefore, source model is rarely used in Speaker Identification, is the most seldom strengthened by other features. Relatively, system (system) model is corresponding with smooth power spectral envelope, and envelope passes through linear prediction or Mel wave filter Analysis obtains.Therefore, this model is widely used in the Speaker Recognition System about cepstrum coefficient.

Both models all using breathing the part as speech source, are converted into the voice in speech sound source or noiseless language Noise in source of sound.Shift to new management mechanisms it practice, respiratory is a kind of energy that energy is converted into sound.Additionally, at voice In breathing be limited, usually, expiratory duration is longer than inspiratory duration, and the breathing in non-voice in living, its exhale and Inspiration time is of substantially equal.

Respiratory system comprises lung, diaphram, Intercostal muscle and by bronchus, trachea, larynx, sound channel, the breathing letter that oral cavity is constituted Road.We regard and breathe as the physiology fingerprint of whole respiratory system, it by intra-pulmonary pressure, air flow and muscular movement managed with Control.During air-breathing, respiratory muscle shrinks, and intra-pulmonary pressure reduces, and air is in external inflow lung.Similarly, due in lung during expiration Pressure increases, and space compression in lung, air breathes out external in lung.According to anatomy principle, before and after breathing, certainly exist one Individual reticent interval.Breathing and affected by age, sex factor, normal continuous 100 400 milliseconds, reticent gap continues 20 milliseconds Above.Reticent gap is by breathing the key separated of demarcating.

The generation breathed is lung, intra-pulmonary pressure, diaphragm, sound channel, trachea, the coefficient result of respiratory muscle, is to breathe system Physiology fingerprint in system meaning.The flowing of air is not to complete moment, all has one before the generation therefore breathed and after occurring Individual reticent gap (>=20 milliseconds).Comparing with the voice signal (not comprising breathing) of ordinary meaning, the energy of breath signal is weak, time Between short (100 400 milliseconds), occurrence frequency low (12-18 beat/min), and produce overlapping at low frequency and non-respiratory voice signal (100Hz–1kHz).Additionally, respiratory murmur is the most similar to phoneme and consonant friction sound, as in " church "/ù/, " vision " In<Z>.Therefore, the exploitation breathed in speaker Recognition Technology face " extraction of breath signal " and " breath signal Process " two challenge greatly, thus cause breathing and be not exploited in speaker Recognition Technology, and often as breathing noise quilt Reject.

Summary of the invention

It is an object of the invention to: cannot be used effectively in speaker Recognition Technology for above-mentioned prior art is breathed In, and face " extraction of breath signal " and " breath signal process " based on the exploitation of speaker Recognition Technology breathed Two challenge greatly, and the present invention provides a kind of method for distinguishing speek person based on respiratory characteristic.

The technical solution used in the present invention is as follows:

A kind of method for distinguishing speek person based on respiratory characteristic, it is characterised in that comprise the following steps:

Step 1: input breath sample collection, carries out sub-frame processing to breath sample collection, obtains breathing frame, passes through mel-frequency Breathing frame is established as breathing template by cepstrum coefficient MFCC, and calculates breathing frame and the breathing template that each breath sample collection obtains Similarity, obtains its minima Bm；

Step 2: unknown input sound bite, carries out sub-frame processing to unknown sound bite, obtains unknown speech frame, calculates Each unknown speech frame and the similarity breathing template；Calculate the zero-crossing rate ZCR of unknown sound bite and unknown sound bite Short-time energy E；According to unknown sound bite with breathe the similarity of template, Bm, the zero-crossing rate ZCR of unknown sound bite and The short-time energy E of unknown sound bite filters out the respiratory murmur in unknown sound bite, and the respiratory murmur filtered out composition is preliminary Respiratory murmur after separation；

Step 3: utilize the reticent gap of the respiratory murmur after the border detection algorithm detection initial gross separation eliminating false low ebb, Reject the false positive part in the respiratory murmur after initial gross separation according to reticent gap, obtain the respiratory murmur after clean cut separation；

Step 4: choose one group of sample speaker, gathers the breathing fragment of each sample speaker, sets up one group of speaker Sample database, if need to judge, the speaker of unknown sound bite, whether from sample speaker, carries out step 5；If needing to judge Whether the speaker of unknown sound bite is legal speaker, carries out step 6；

Step 5: calculate the respiratory murmur after the clean cut separation of described unknown sound bite every with speaker's sample database The similarity of individual speaker's breath sample, takes sample corresponding to maximum similarity and speaks the speaker of artificial unknown sound bite, Terminate；

Step 6: to each sample speaker's collecting test sample, choose a test sample；

Step 7: calculate test sample speaker breath sample each to speaker's sample database similar chosen Degree, takes described test sample and the maximum in the similarity of each speaker's breath sample in speaker's sample database, To a maximum similarity；

Step 8: choose another test sample, repeats step 7, until the maximum obtaining all test samples corresponding is similar Degree, obtains maximum similarity group；

Step 9: gather the sound bite of legal speaker, utilizes breath sample collection to extract the breathing sheet of legal speaker Section, calculates the similarity of the respiratory murmur after the clean cut separation of unknown sound bite and the breathing fragment of legal speaker；

Step 10: if the respiratory murmur after the clean cut separation of unknown sound bite is similar to the breathing fragment of legal speaker Degree is more than the minima of maximum similarity group, then the speaker of unknown sound bite is accredited as legal speaker, otherwise is non- Method speaker.

In such scheme, described step 1 comprises the following steps:

Step 1.1: input breath sample collection, is divided into the breathing frame of a length of 100 milliseconds by described breath sample collection, will be every Individual breathing frame is divided into again continuous and overlapped breathing subframe, and each breather frame length is 10ms, and adjacent breather A length of 5ms overlapped between frame；

Step 1.2 uses first-order difference wave filter that each breathing subframe is carried out preemphasis, obtains the breather after preemphasis Frame；Wherein, first-order difference wave filter H:

H (z)=1-α z^-1

Wherein, α is pre-emphasis parameters α ≈ 0.095, and z is signal sampling point data；

Step 1.3: the breathing subframe after each preemphasis of each breathing frame is calculated MFCC, obtains each breathing frame Cepstrum matrix in short-term, removes DC component to the every string of matrix of cepstrum in short-term of each breathing frame, obtains each breathing frame MFCC cepstrum matrix；

Step 1.4: the Mean Matrix T of calculating breath sample collection:

T = \frac{1}{N} Σ_{i = 1}^{N} M (X_{i})

Wherein, N represents breath sample and concentrates the number breathing frame, and M (Xi) represents that i-th breathes the MFCC cepstrum square of frame Battle array, i ∈ [1,2 ..., N]；

The variance matrix V of calculating breath sample collection:

V = {\frac{1}{N} Σ_{i = 1}^{N} {[M (X_{i}) - T]}^{2}}

Step 1.5: being connected by the MFCC cepstrum matrix of all breathing frames is a big matrix M_b: M_b=[M (X₁),…,M (X_i),M(X_i+1),…,M(X_N)]

Described big matrix is carried out singular value decomposition:

M_b=U Σ V^*

Wherein, U is m × m rank unitary matrice；Σ is positive semidefinite m × n rank diagonal matrix, and V* represents the conjugate transpose of V, be n × N rank unitary matrice, the element on Σ diagonal is { λ₁,λ₂,λ₃... }, it is the singular value of M, obtains singular value vector { λ₁,λ₂, λ₃,…}；

Use maximum singular value λ_mDescribed singular value vector is normalized, obtains the singular value vector after final normalizationWherein, λ_m=max{ λ₁,λ₂,λ₃,…}；

Step 1.6: obtain one group and breathe template, described breathing template includes singular value vector S after normalization, breathes sample The variance matrix V of this collection and the Mean Matrix T of breath sample collection.

In such scheme, described step 2 comprises the following steps:

Step 2.1: unknown input sound bite, carries out sub-frame processing to unknown sound bite, obtain unknown speech frame with Unknown speech subframe, calculates each unknown speech frame and similarity B (X, T, V, S) breathing template；Calculate breath sample collection Each breathing frame and the similarity breathing template, taking minimum similarity degree is Bm；

Calculate each unknown speech frame short-time energy E:

E = \frac{1}{N} Σ_{n = N_{0}}^{N_{0} + N + 1} x^{2} [n]

Wherein, n represents that signal the n-th sampled point, x [n] represent the n-th speech sample signal, and N represents that the window of sample is long Degree, N₀The window start of expression sample is N₀Individual sampled point；

Calculate the meansigma methods of all unknown speech frames

The zero-crossing rate ZCR of the unknown sound bite of calculating:

Z C R = \frac{1}{N} Σ_{n = N_{0} + 1}^{N_{0} + N + 1} 0.5 | s g n (x [n]) - s g n (x [n - 1]) |

Step 2.2: choose a unknown speech frame；

Step 2.3: if the unknown speech frame being selected and similarity B (X, T, V, S) breathing template are more than threshold value Bm/2, And the zero-crossing rate ZCR of unknown sound bite is less than 0.25, and the short-time energy E of the unknown speech frame being selected is less than all The meansigma methods of unknown speech frameThen judging that the unknown speech frame being selected is respiratory murmur, if being unsatisfactory for described condition, then judging The unknown speech frame being selected is non-respiratory sound.

Step 2.4: choose other unknown speech frames, repeat step 2.3, until judging all the unknowns in unknown sound bite Whether speech frame is respiratory murmur；

Step 2.5: retain respiratory murmur, rejects non-respiratory sound, obtains initial gross separation respiratory murmur；

In such scheme, described step 2.1 calculates the side breathing frame or unknown speech frame with the similarity breathing template Method comprises the following steps:

Step 2.1.1: input breath sample collection or unknown sound bite, is divided into breath sample collection or unknown sound bite The breathing frame of a length of 100 milliseconds or unknown speech frame, be divided into each breathing frame or unknown speech frame again continuously and phase mutual respect Folded breathing subframe or unknown speech subframe, each subframe or unknown a length of 10ms of speech subframe of breathing, and adjacent the unknown A length of 5ms overlapped between speech subframe；

Step 2.1.2: use first-order difference wave filter that each unknown speech subframe is carried out preemphasis, after obtaining preemphasis Breathe frame or unknown speech frame；Wherein, first-order difference wave filter H:

H (z)=1-α z^-1

Wherein, α is pre-emphasis parameters α ≈ 0.095；Z is signal sampling point data；

Step 2.1.3: to the breathing subframe after each preemphasis of each breathing frame or unknown speech frame or unknown voice Subframe calculates MFCC, obtains each breathing frame or the cepstrum matrix in short-term of unknown speech frame, to each breathing frame or unknown voice The every string of matrix of cepstrum in short-term of frame removes DC component, obtains each breathing frame or the MFCC cepstrum matrix M of unknown speech frame (X)

Step 2.1.4: choose one and breathe frame or unknown speech frame X；

Step 2.1.5: calculate be selected breathe frame or the normalization difference matrix D of unknown speech frame:

D = \frac{M (X) - T}{V}

Wherein, T represents the Mean Matrix of breath sample collection, and V represents breath sample collection variance matrix, and M (X) is selected Take breathes frame or the MFCC cepstrum matrix of unknown speech frame；

Step 2.1.6: every string and half Hamming window of D are multiplied, make the cepstrum coefficient of low frequency be strengthened:

D (:, j)=D (:, j) hamming, j ∈ [1, N_C]

Wherein, Nc represents the MFCC number of parameters in each breathing subframe or unknown speech subframe, the i.e. columns of D； Hamming represents Hamming window.

Step 2.1.7: calculate similarity B (X, T, V, S) breathing frame or unknown speech frame X and breathing template being selected Component Cp:

C_{p} = \frac{1}{Σ_{k = 1}^{n} Σ_{j = 1}^{N_{c}} | D_{k j} |^{2}}

Wherein, n represents that be selected breathes breathing subframe or the quantity of unknown speech subframe, k in frame or unknown speech frame X ∈ [1, n],Represent that the kth of frame to be breathed or unknown speech frame X breathes jth MFCC in subframe or unknown speech subframe Parameter；

Calculate another component breathing frame or unknown speech frame X and similarity B (X, T, V, S) breathing template being selected Cn:

C_{n} = Σ_{j = 1}^{N_{C}} D (:, j) \cdot S;

Step 2.1.8: calculate and breathe frame or unknown speech frame X and similarity B (X, T, V, S) breathing template:

B (X, T, V, S)=Cp*Cn；

Step 2.1.9: choose another and breathe frame or the MFCC cepstrum matrix of unknown speech frame, repeat step 2.1.5- 2.1.8；

Step 2.1.10: repeat step 2.1.9, until obtaining all breathing frames or unknown speech frame and the phase breathing template Like degree；

In such scheme, the value of the border detection algorithm utilization eliminating false low ebb in described step 3 includes that breathing continues Time threshold, energy threshold, zero-crossing rate ZCR bound threshold value and spectrum slope accurately find breathing border, described step 3 profit Breathe in the position of current speech segment with Binary Zero-1 accurately instruction.

In such scheme, method for distinguishing speek person based on respiratory characteristic according to claim 1, it is characterised in that Described step 5 calculates the respiratory murmur after the clean cut separation of this unknown sound bite and the speaker in speaker's sample database The similarity of breath sample comprises the following steps:

Step 5.1: set the MFCC characteristic vector of sample in breath sample data base as (a₁,a₂,...,a_n), calculate and say The Mean Matrix M of the MFCC characteristic vector of the speaker's breath sample in words people's sample database:

M = \frac{1}{n} Σ_{i = 1}^{n} a_{i}

Wherein, a_iI-th MFCC for the MFCC characteristic vector of the speaker's breath sample in speaker's sample database Cepstrum matrix, n represents the MFCC cepstrum square in the MFCC characteristic vector of the speaker's breath sample in speaker's sample database The number of battle array, i ∈ [1,2 ..., n]；

The variance matrix V of the MFCC characteristic vector of the speaker's breath sample in calculating speaker's sample database:

V = {\frac{1}{n} Σ_{i = 1}^{n} {[a_{i} - M]}^{2}}

Step 5.2: calculate the MFCC characteristic vector of all respiratory murmurs after the clean cut separation of unknown sound bite, be designated as (b₁,b₂,...,b_n), b_iMFCC cepstrum matrix for the respiratory murmur after the i-th clean cut separation of unknown sound bite；

Step 5.3: the characteristic vector (a to the speaker's breath sample in speaker's sample database₁,a₂,...,a_n) enter Row normalization:

S_{a_{k}} = 1 / Σ_{i = 1}^{r} Σ_{j = 1}^{c} | {(D_{a_{k}})}_{i j} |^{2}

D_{a_{k}} = \frac{a_{k} - M}{V}

Wherein, r and c represents respectivelyRow and column,For a_kNormalization difference matrix, k ∈ [1,2 ..., n], i ∈[1,2,…,r],j∈[1,2,…,c]；

Step 5.4: by (Sa₁,Sa₂,...,Sa_n) carry out ascending order arrangement, obtain (S₁,S₂,...,S_n)；

Step 5.5: to the MFCC characteristic vector of all respiratory murmurs after the clean cut separation of unknown sound bite (b1, B2 ..., bn) it is normalized:

S_{b_{k}} = 1 / Σ_{i = 1}^{r} Σ_{j = 1}^{c} | {(D_{b_{k}})}_{i j} |^{2}

D_{b_{k}} = \frac{b_{k} - M}{V}

Wherein, r and c represents respectivelyRow and column,For b_kNormalization difference matrix, k ∈ [1,2 ..., n], i ∈ [1,2,…,r],j∈[1,2,…,c]；

Step 5.6: calculate the similarity degree Pk of bk and reference template: by Sb_kWith ordered vector (S1, S2 ..., Sn) Element compares one by one, Pk be in ordered vector less than Sbk number of elements divided by element sum, calculate Pk meansigma methods, Obtain the similarity of the respiratory murmur after the clean cut separation of unknown sound bite and the sample in breath sample data base.

In such scheme, described step 9 calculates the respiratory murmur after the clean cut separation of unknown sound bite and speaks with legal The similarity of the breathing fragment of people comprises the following steps:

Step 9.1: set the MFCC characteristic vector of breathing fragment of legal speaker as (a₁,a₂,...,a_n), it is legal to calculate The Mean Matrix M of the MFCC characteristic vector of the breathing fragment of speaker:

M = \frac{1}{n} Σ_{i = 1}^{n} a_{i}

Wherein, a_iFor the i-th MFCC cepstrum matrix of MFCC characteristic vector of the breathing fragment of legal speaker, n represents The number of the MFCC cepstrum matrix in the MFCC characteristic vector of the breathing fragment of legal speaker, i ∈ [1,2 ..., n]；

Calculate the variance matrix V of the MFCC characteristic vector of the breathing fragment of legal speaker:

V = {\frac{1}{n} Σ_{i = 1}^{n} {[a_{i} - M]}^{2}}

Step 9.2: calculate the MFCC characteristic vector of all respiratory murmurs after the clean cut separation of unknown sound bite, be designated as (b₁,b₂,...,b_n), b_iMFCC cepstrum matrix for the respiratory murmur after the i-th clean cut separation of unknown sound bite；

Step 9.3: legal speaker is breathed the characteristic vector (a of fragment₁,a₂,...,a_n) it is normalized:

S_{a_{k}} = 1 / Σ_{i = 1}^{r} Σ_{j = 1}^{c} | {(D_{a_{k}})}_{i j} |^{2}

D_{a_{k}} = \frac{a_{k} - M}{V}

Step 9.4: by (Sa₁,Sa₂,...,Sa_n) carry out ascending order arrangement, obtain (S₁,S₂,...,S_n)；

Step 9.5: to the MFCC characteristic vector of all respiratory murmurs after the clean cut separation of unknown sound bite (b1, B2 ..., bn) it is normalized:

S_{b_{k}} = 1 / Σ_{i = 1}^{r} Σ_{j = 1}^{c} | {(D_{b_{k}})}_{i j} |^{2}

D_{b_{k}} = \frac{b_{k} - M}{V}

Step 9.6: calculate the similarity degree Pk of bk and reference template: by Sb_kWith ordered vector (S1, S2 ..., Sn) Element compares one by one, Pk be in ordered vector less than Sbk number of elements divided by element sum, calculate Pk meansigma methods, Obtain the similarity of the respiratory murmur after the clean cut separation of unknown sound bite and the breathing fragment of legal speaker.

In such scheme, described step 7 calculates the test sample and each speaker in speaker's sample database chosen The method of the similarity of breath sample and step 5 calculate the respiratory murmur after the clean cut separation of this unknown sound bite and speaker In sample database, the method for the similarity of each speaker's breath sample is identical.

In such scheme, the method calculating MFCC in described step 1.3 and step 5.2 includes: calculate MFCC's by needing Signal carries out fast Fourier transform, then calculates complicated sine curve coefficient, finally by bank of filters based on melscale Exported.

In sum, owing to have employed technique scheme, the invention has the beneficial effects as follows:

1) present invention is as a set of based on the Verification System breathed, and the uniqueness achieving human body respiration first is paid close attention to And research, and be effectively applied in Speaker Recognition System, overcome exploitation based on the speaker Recognition Technology breathed " extraction of breath signal " and " the breath signal process " two that face are challenged greatly.

2) present invention knowledge based on mathematical statistics, devises a light similarity algorithm for decision-making: this calculation Method is a series of simple vector operations utilizing MFCC Mean Matrix and variance matrix.Compared with traditional classification algorithm, this Similarity algorithm in bright has more excellent classification performance.

3) present invention can operate with speaker identification's experiment and speaker verification tests；If simultaneously because a people's exhales Haustorium official be interfered, then his breathing is signed it is possible to be modified, therefore this invention can be used for judging human body respiration organ Whether it is interfered.

4) present invention can realize needing the identification under quiet occasion.

5) can realize cannot the identification of tester of sounding for the present invention.

6) sorting technique that the present invention uses relatively with traditional based on multiparameter, the complex model classification side assumed more Method, has relatively low time complexity and space complexity.Additionally, the present invention uses algorithm process data based on MFCC more Hurry up, required training sample is less, and ensures recognition accuracy, thus the Speaker Recognition System that the present invention provides is simply efficient, And recognition result is accurately and reliably.

Accompanying drawing explanation

Fig. 1 is to judge the system framework figure that unknown speaker identity is the most legal in the present invention；

Fig. 2 is the frame diagram breathing Preliminary detection in the present invention in step 2；

Fig. 3 is the frame diagram breathing final detection in the present invention in step 3；

Fig. 4 is the experimental result signal table of step 6-8 in the present invention；

Fig. 5 represents the contrast after Mel bank of filters acts on breath signal and non-respiratory voice signal in the present invention；

Fig. 6 represents the feature of ZCR in the present invention, spectrum slope and STE；

Fig. 7 represents the formant of the voice signal of breath signal and non-respiratory in the present invention；

Fig. 8 represents the breathing under normal condition and the breath signal under abnormal condition in the present invention；

Detailed description of the invention

All features disclosed in this specification, in addition to mutually exclusive feature and/or step, all can be with any Mode combines.

Elaborate below in conjunction with-8 couples of present invention of Fig. 1.

The present invention proposes a kind of method for distinguishing speek person based on respiratory characteristic, and this model is applied to Speaker Identification and takes Obtain effect well.The schematic diagram that realizes of whole algorithm is similar to Fig. 1, including step:

Step 1: such as Fig. 1, input breath sample collection, breath sample collection is carried out sub-frame processing, obtains breathing frame, passes through prunus mume (sieb.) sieb.et zucc. Breathing frame is established as breathing template by you frequency cepstral coefficient MFCC；Step 1 specifically includes following steps:

H (z)=1-α z^-1

Step 1.4: the Mean Matrix T of calculating breath sample collection:

T = \frac{1}{N} Σ_{i = 1}^{N} M (X_{i})

The variance matrix V of calculating breath sample collection:

V = {\frac{1}{N} Σ_{i = 1}^{N} {[M (X_{i}) - T]}^{2}}

Described big matrix is carried out singular value decomposition:

M_b=U Σ V^*

Step 2: such as Fig. 2, unknown input sound bite, unknown sound bite is carried out sub-frame processing, obtains unknown voice Frame, the similarity calculating each unknown speech frame with breathing template, calculate the zero-crossing rate ZCR of unknown sound bite and unknown language The short-time energy E of tablet section；According to unknown sound bite and the breathing similarity of template, Bm, the zero-crossing rate of unknown sound bite The short-time energy E of ZCR and unknown sound bite filters out the respiratory murmur in unknown sound bite, the respiratory murmur group filtered out Become the respiratory murmur after initial gross separation；

Described step 2 comprises the following steps:

Step 2.1: unknown input sound bite, carries out sub-frame processing to unknown sound bite, obtain unknown speech frame with Unknown speech subframe, calculates each unknown speech frame and similarity B (X, T, V, S) breathing template；

Calculating each breathing frame of breath sample collection and breathe the similarity of template, taking minimum similarity degree is Bm；

Calculate each unknown speech frame short-time energy E:

E = \frac{1}{N} Σ_{n = N_{0}}^{N_{0} + N + 1} x^{2} [n]

Calculate the meansigma methods of all unknown speech frames

The zero-crossing rate ZCR of the unknown sound bite of calculating:

Z C R = \frac{1}{N} Σ_{n = N_{0} + 1}^{N_{0} + N + 1} 0.5 | s g n (x [n]) - s g n (x [n - 1]) |

The method calculating the similarity breathed frame or unknown speech frame and breathe template in described step 2.1 includes following step Rapid:

H (z)=1-α z^-1

Step 2.1.4: choose one and breathe frame or unknown speech frame X；

D = \frac{M (X) - T}{V}

D (:, j)=D (:, j) hamming, j ∈ [1, N_C]

C_{p} = \frac{1}{Σ_{k = 1}^{n} Σ_{j = 1}^{N_{c}} | D_{k j} |^{2}}

C_{n} = Σ_{j = 1}^{N_{C}} D (:, j) \cdot S;

B (X, T, V, S)=Cp*Cn；

Step 2.2: choose a unknown speech frame；

Step 2.3: if the unknown speech frame being selected and similarity B (X, T, V, S) breathing template are more than threshold value Bm/2, And the zero-crossing rate ZCR of unknown sound bite is less than 0.25 (now sample rate is 44kHz), and the unknown speech frame being selected Short-time energy E less than the meansigma methods of all unknown speech framesThen judge that the unknown speech frame being selected is respiratory murmur, if not Meet described condition, then judge that the unknown speech frame being selected is non-respiratory sound.

Step 3: such as Fig. 3, utilize the heavy of the respiratory murmur after the border detection algorithm detection initial gross separation eliminating false low ebb Silent gap, rejects the false positive part in the respiratory murmur after initial gross separation according to reticent gap, obtains the breathing after clean cut separation Sound；The IEEE of the 3rd phase of volume 15 in described such as in the March, 2007 of border detection algorithm specific implementation eliminating false low ebb TRANSACTIONS ON AUDIO, " An effective algorithm in SPEECH, AND LANGUAGE PROCESSING for automatic detection and exact demarcation of breath sounds in speech and Song " literary composition；

Step 5: calculate in the respiratory murmur after the clean cut separation of described unknown sound bite and speaker's sample database The similarity of each speaker's breath sample, takes sample corresponding to maximum similarity and speaks the speaker of artificial unknown sound bite, Terminate；

Described step 5 calculates in the respiratory murmur after the clean cut separation of this unknown sound bite and speaker's sample database The similarity of speaker's breath sample comprise the following steps:

Step 5.1: set the MFCC characteristic vector of speaker's breath sample in speaker's sample database as (a₁, a₂,...,a_n), the Mean Matrix M of the MFCC characteristic vector of the speaker's breath sample in calculating speaker's sample database:

M = \frac{1}{n} Σ_{i = 1}^{n} a_{i}

V = {\frac{1}{n} Σ_{i = 1}^{n} {[a_{i} - M]}^{2}}

S_{a_{k}} = 1 / Σ_{i = 1}^{r} Σ_{j = 1}^{c} | {(D_{a_{k}})}_{i j} |^{2}

D_{a_{k}} = \frac{a_{k} - M}{V}

S_{b_{k}} = 1 / Σ_{i = 1}^{r} Σ_{j = 1}^{c} | {(D_{b_{k}})}_{i j} |^{2}

D_{b_{k}} = \frac{b_{k} - M}{V}

Wherein, r and c represents respectivelyRow and column,For b_kNormalization difference matrix, k ∈ [1,2 ..., n], i ∈[1,2,…,r],j∈[1,2,…,c]；

Step 7: such as Fig. 4, calculate the test sample and each speaker's breath sample in speaker's sample database chosen Similarity, take the maximum in the similarity of each speaker's breath sample in described test sample and speaker's sample database Value, obtains a maximum similarity；

Described step 7 calculates the test sample and each speaker's breath sample in speaker's sample database chosen The method of similarity and step 5 calculate the respiratory murmur after the clean cut separation of this unknown sound bite and speaker's sample database In the method for similarity of each speaker's breath sample identical.

Step 8: such as Fig. 4, chooses another test sample, repeats step 7, until it is corresponding to obtain all test samples Big similarity, obtains maximum similarity group；

Step 9: gather the breathing fragment of legal speaker, calculate the respiratory murmur after the clean cut separation of unknown sound bite with The similarity of the breathing fragment of legal speaker；

Described step 9 calculates the breathing fragment of the respiratory murmur after the clean cut separation of unknown sound bite and legal speaker Similarity comprise the following steps:

M = \frac{1}{n} Σ_{i = 1}^{n} a_{i}

V = {\frac{1}{n} Σ_{i = 1}^{n} {[a_{i} - M]}^{2}}

S_{a_{k}} = 1 / Σ_{i = 1}^{r} Σ_{j = 1}^{c} | {(D_{a_{k}})}_{i j} |^{2}

D_{a_{k}} = \frac{a_{k} - M}{V}

S_{b_{k}} = 1 / Σ_{i = 1}^{r} Σ_{j = 1}^{c} | {(D_{b_{k}})}_{i j} |^{2}

D_{b_{k}} = \frac{b_{k} - M}{V}

The method calculating MFCC in described step 1.3 and step 5.2 includes: carry out needing the signal calculating MFCC quickly Fourier transformation, then calculates complicated sine curve coefficient, is finally exported by bank of filters based on melscale.

The present invention is illustrated by above-described embodiment, but it is to be understood that, above-described embodiment is only intended to Citing and descriptive purpose, and be not intended to limit the invention in described scope of embodiments.In addition people in the art Member, it is understood that the invention is not limited in above-described embodiment, can also make more kinds of according to the teachings of the present invention Variants and modifications, within these variants and modifications all fall within scope of the present invention.Protection scope of the present invention by The appended claims and equivalent scope thereof are defined.

Claims

1. a method for distinguishing speek person based on respiratory characteristic, it is characterised in that comprise the following steps:

Step 1: input breath sample collection, carries out sub-frame processing to breath sample collection, obtains breathing frame, by mel-frequency cepstrum Breathing frame is established as breathing template by coefficient MFCC, and calculate that each breath sample collection obtains to breathe frame similar with breathing template Degree, obtains its minima Bm；

Step 2: unknown input sound bite, carries out sub-frame processing to unknown sound bite, obtains unknown speech frame, calculates each Unknown speech frame and the similarity breathing template；Calculate the zero-crossing rate ZCR of unknown sound bite and the short of unknown sound bite Shi NengliangE；According to unknown sound bite and the breathing similarity of template, Bm, the zero-crossing rate ZCR of unknown sound bite and the unknown The short-time energy E of sound bite filters out the respiratory murmur in unknown sound bite, the respiratory murmur filtered out composition initial gross separation After respiratory murmur；

Step 3: utilize the reticent gap of the respiratory murmur after the border detection algorithm detection initial gross separation eliminating false low ebb, according to The false positive part in the respiratory murmur after initial gross separation is rejected in reticent gap, obtains the respiratory murmur after clean cut separation；

Step 4: choose one group of sample speaker, gathers the breathing fragment of each sample speaker, sets up one group of speaker's sample Data base, if need to judge, the speaker of unknown sound bite, whether from sample speaker, carries out step 5；

If need to judge, the speaker of unknown sound bite, whether as legal speaker, carries out step 6；

Step 5: calculate the respiratory murmur after the clean cut separation of described unknown sound bite and each theory in speaker's sample database The similarity of words people's breath sample, takes sample corresponding to maximum similarity and speaks the speaker of artificial unknown sound bite, terminate；

Step 7: calculate the test sample and the similarity of each speaker's breath sample in speaker's sample database chosen, take Described test sample and the maximum in the similarity of each speaker's breath sample in speaker's sample database, obtain one Maximum similarity；

Step 8: choose another test sample, repeats step 7, until obtaining the maximum similarity that all test samples are corresponding, Obtain maximum similarity group；

Step 9: gather the sound bite of legal speaker, utilizes breath sample collection to extract the breathing fragment of legal speaker, meter Calculate the similarity of the respiratory murmur after the clean cut separation of unknown sound bite and the breathing fragment of legal speaker；

Step 10: if the respiratory murmur after the clean cut separation of unknown sound bite is big with the similarity of the breathing fragment of legal speaker In the minima of maximum similarity group, then the speaker of unknown sound bite is accredited as legal speaker, otherwise is illegally to say Words people.

Method for distinguishing speek person based on respiratory characteristic the most according to claim 1, it is characterised in that described step 1 includes Following steps:

Step 1.1: input breath sample collection, is divided into the breathing frame of a length of 100 milliseconds, by each by described breath sample collection Breathing frame is divided into again continuous and overlapped breathing subframe, each breather frame length to be 10ms, and adjacent breathing subframe Between overlapped a length of 5ms；

Step 1.2 uses first-order difference wave filter that each breathing subframe is carried out preemphasis, obtains the breathing subframe after preemphasis；

Wherein, first-order difference wave filter H:

H (z)=1-α z^-1

Step 1.3: the breathing subframe after each preemphasis of each breathing frame is calculated MFCC, obtains each breathing frame in short-term Cepstrum matrix, removes DC component to the every string of matrix of cepstrum in short-term of each breathing frame, and the MFCC obtaining each breathing frame falls Spectrum matrix；

Step 1.4: the Mean Matrix T of calculating breath sample collection:

Wherein, N represents breath sample and concentrates the number breathing frame, and M (Xi) represents that i-th breathes the MFCC cepstrum matrix of frame, i ∈ [1,2,…,N]；

The variance matrix V of calculating breath sample collection:

Step 1.5: being connected by the MFCC cepstrum matrix of all breathing frames is a big matrix M_b:

M_b=[M (X₁),…,M(X_i),M(X_i+1),…,M(X_N)]

Described big matrix is carried out singular value decomposition:

M_b=U Σ V^*

Wherein, U is m × m rank unitary matrice；Σ is positive semidefinite m × n rank diagonal matrix, and V* represents the conjugate transpose of V, is n × n rank Unitary matrice, the element on Σ diagonal is { λ₁,λ₂,λ₃... }, it is the singular value of M, obtains singular value vector { λ₁,λ₂, λ₃,…}；

Step 1.6: obtaining one group and breathe template, described breathing template includes singular value vector S after normalization, breath sample collection Variance matrix V and the Mean Matrix T of breath sample collection.

Method for distinguishing speek person based on respiratory characteristic the most according to claim 1, it is characterised in that described step 2 includes Following steps:

Step 2.1: unknown input sound bite, carries out sub-frame processing to unknown sound bite, obtains unknown speech frame with unknown Speech subframe, calculates each unknown speech frame and similarity B (X, T, V, S) breathing template；Calculate each of breath sample collection The similarity breathing frame and breathe template, taking minimum similarity degree is Bm；

Calculate each unknown speech frame short-time energy E:

Wherein, n represents that signal the n-th sampled point, x [n] represent the n-th speech sample signal, and N represents the length of window of sample, N₀ The window start of expression sample is N₀Individual sampled point；

Calculate the meansigma methods of all unknown speech frames

The zero-crossing rate ZCR of the unknown sound bite of calculating:

Step 2.2: choose a unknown speech frame；

Step 2.3: if the unknown speech frame being selected and similarity B (X, T, V, S) breathing template are more than threshold value Bm/2, and The zero-crossing rate ZCR of unknown sound bite is less than 0.25, and the short-time energy E of the unknown speech frame being selected is less than all the unknowns The meansigma methods of speech frameThen judging that the unknown speech frame being selected is respiratory murmur, if being unsatisfactory for described condition, then judging selected The unknown speech frame taken is non-respiratory sound.

Step 2.4: choose other unknown speech frames, repeat step 2.3, until judging all unknown voices in unknown sound bite Whether frame is respiratory murmur；

Step 2.5: retain respiratory murmur, rejects non-respiratory sound, obtains initial gross separation respiratory murmur.

Method for distinguishing speek person based on respiratory characteristic the most according to claim 3, it is characterised in that in described step 2.1 Calculate and breathe the method for frame or unknown speech frame and the similarity of breathing template and comprise the following steps:

Step 2.1.1: input breath sample collection or unknown sound bite, is divided into length by breath sample collection or unknown sound bite It is the breathing frame of 100 milliseconds or unknown speech frame, each breathing frame or unknown speech frame are divided into again continuous and overlapped Breathe subframe or unknown speech subframe, each subframe or unknown a length of 10ms of speech subframe of breathing, and adjacent unknown voice A length of 5ms overlapped between subframe；

Step 2.1.2: use first-order difference wave filter that each unknown speech subframe is carried out preemphasis, obtain exhaling after preemphasis Inhale frame or unknown speech frame；Wherein, first-order difference wave filter H:

H (z)=1-α z^-1

α is pre-emphasis parameters α ≈ 0.095；Z is signal sampling point data；

Step 2.1.3: to the breathing subframe after each preemphasis of each breathing frame or unknown speech frame or unknown speech subframe Calculate MFCC, obtain each breathing frame or the cepstrum matrix in short-term of unknown speech frame, to each breathing frame or unknown speech frame The every string of cepstrum matrix removes DC component in short-term, obtains each breathing frame or MFCC cepstrum matrix M (X) of unknown speech frame

Step 2.1.4: choose one and breathe frame or unknown speech frame X；

Wherein, T represents the Mean Matrix of breath sample collection, and V represents breath sample collection variance matrix, and M (X) is be selected Breathe frame or the MFCC cepstrum matrix of unknown speech frame；

D (:, j)=D (:, j) hamming, j ∈ [1, N_C]

Wherein, Nc represents the MFCC number of parameters in each breathing subframe or unknown speech subframe, the i.e. columns of D；Hamming table Show Hamming window.

Step 2.1.7: what calculating was selected breathes dividing of frame or unknown speech frame X similarity B (X, T, V, S) with breathing template Amount Cp:

Wherein, n represents that be selected breathes breathing subframe or the quantity of unknown speech subframe, k ∈ in frame or unknown speech frame X [1, n], D_kjRepresent that the kth of frame to be breathed or unknown speech frame X breathes jth MFCC ginseng in subframe or unknown speech subframe Number；

Calculate another component Cn breathing frame or unknown speech frame X and similarity B (X, T, V, S) breathing template being selected:

B (X, T, V, S)=Cp*Cn；

Step 2.1.9: choose another and breathe frame or the MFCC cepstrum matrix of unknown speech frame, repeat step 2.1.5-2.1.8；

Step 2.1.10: repeat step 2.1.9, until it is similar with breathing template to obtain all breathing frames or unknown speech frame Degree.

5. according to method for distinguishing speek person based on respiratory characteristic described in claim 1-4, it is characterised in that in described step 3 The value of the border detection algorithm utilization eliminating false low ebb includes breathing duration threshold, and energy threshold, zero-crossing rate ZCR is upper and lower Limit threshold value and spectrum slope accurately find breathing border, and described step 3 utilizes Binary Zero-1 accurately instruction to breathe at current language The position of tablet section.

6. according to method for distinguishing speek person based on respiratory characteristic described in claim 1-5, it is characterised in that in described step 5 Calculate the respiratory murmur after the clean cut separation of described unknown sound bite and breathe sample with each speaker in speaker's sample database This similarity comprises the following steps:

Step 5.1: set the MFCC characteristic vector of speaker's breath sample in speaker's sample database as (a₁,a₂,..., a_n), the Mean Matrix M of the MFCC characteristic vector of the speaker's breath sample in calculating speaker's sample database:

Wherein, a_iI-th MFCC cepstrum square for the MFCC characteristic vector of the speaker's breath sample in speaker's sample database Battle array, n represents the individual of the MFCC cepstrum matrix in the MFCC characteristic vector of the speaker's breath sample in speaker's sample database Number, i ∈ [1,2 ..., n]；

Step 5.2: calculate the MFCC characteristic vector of all respiratory murmurs after the clean cut separation of unknown sound bite, be designated as (b₁, b₂,...,b_n), b_iMFCC cepstrum matrix for the respiratory murmur after the i-th clean cut separation of unknown sound bite；

Step 5.3: the characteristic vector (a to the speaker's breath sample in speaker's sample database₁,a₂,...,a_n) return One changes:

Wherein, r and c represents respectivelyRow and column,For a_kNormalization difference matrix, k ∈ [1,2 ..., n], i ∈ [1, 2,…,r],j∈[1,2,…,c]；

Wherein, r and c represents respectivelyRow and column,For b_kNormalization difference matrix, k ∈ [1,2 ..., n], i ∈ [1, 2,…,r],j∈[1,2,…,c]；

Step 5.6: calculate the similarity degree Pk of bk and reference template: by Sb_kWith ordered vector (S1, S2 ..., Sn) element enter Row compare one by one, Pk be in ordered vector less than Sbk number of elements divided by element sum, calculate Pk meansigma methods, obtain not Know the similarity of the respiratory murmur after the clean cut separation of sound bite and the sample in breath sample data base.

7. according to method for distinguishing speek person based on respiratory characteristic described in claim 1-6, it is characterised in that in described step 9 The similarity of the breathing fragment calculating the respiratory murmur after the clean cut separation of unknown sound bite and legal speaker includes following step Rapid:

Step 9.1: set the MFCC characteristic vector of breathing fragment of legal speaker as (a₁,a₂,...,a_n), calculate legal speaking The Mean Matrix M of the MFCC characteristic vector of the breathing fragment of people:

Wherein, a_iFor the i-th MFCC cepstrum matrix of MFCC characteristic vector of the breathing fragment of legal speaker, n represents legal theory The number of the MFCC cepstrum matrix in the MFCC characteristic vector of the breathing fragment of words people, i ∈ [1,2 ..., n]；

Step 9.2: calculate the MFCC characteristic vector of all respiratory murmurs after the clean cut separation of unknown sound bite, be designated as (b₁, b₂,...,b_n), b_iMFCC cepstrum matrix for the respiratory murmur after the i-th clean cut separation of unknown sound bite；

Step 9.6: calculate the similarity degree Pk of bk and reference template: by Sb_kWith ordered vector (S1, S2 ..., Sn) element enter Row compare one by one, Pk be in ordered vector less than Sbk number of elements divided by element sum, calculate Pk meansigma methods, obtain not Know the similarity of the respiratory murmur after the clean cut separation of sound bite and the breathing fragment of legal speaker.

8. according to method for distinguishing speek person based on respiratory characteristic described in claim 1-7, it is characterised in that in described step 7 Calculate in method and the step 5 of the similarity of each speaker's breath sample in the test sample and speaker's sample database chosen Calculate the respiratory murmur after the clean cut separation of this unknown sound bite and each speaker's breath sample in speaker's sample database The method of similarity is identical.

9. according to method for distinguishing speek person based on respiratory characteristic described in claim 1-8, it is characterised in that described step 1.3 Include with the method calculating MFCC in step 5.2: by needing the signal calculating MFCC to carry out fast Fourier transform, then calculate Complicated sine curve coefficient, is finally exported by bank of filters based on melscale.